CRM-17748 Expose options['result_buffering'] to CRM_Core_DAO

    Details

    • Versioning Impact:
      Major (incompatible API change)
    • Documentation Required?:
      None
    • Funding Source:
      Core Team Contract
    • Payment Status:
      Paid

      Description

      We have a custom contact search form task. The search parameters could easily result in more than 200k contacts being returned to this task, which often consumes PHP's memory_limit. Examination of the code shows that DB_mysql->simpleQuery() has the ability to use mysql_unbuffered_query() instead of mysql_query(), with the actual behavior determined by the setting of DB_mysql->options['result_buffering']. DB_common (and thus DB_mysql) has the setOption() method to allow for setting this option, but it is not utilized by CRM_Core_DAO. Given that CRM_Core_DAO is the fundamental object used for database interaction within Civi, I'd like to suggest allowing for manipulating DB_mysql->options from within the DAO object.

      1) create a setOptions() method for CRM_Core_DAO to allow for setting options within the DB_mysql object. The proper object to be referenced is found in the global $_DB_DATAOBJECT['CONNECTIONS'], indexed by CRM_Core_DAO->_database_dsn_md5.

      2) Modify CRM_Core_DAO->executeQuery() to recognize a new parameter. The new parameter should be an array("option_name"=>"option_value"), corresponding to the options available in DB_mysql->options. If the new parameter is populated, executeQuery() should call $dao->setOptions() prior to calling $dao->query().

      We like to get core team's thoughts on this change before implementing it in our own code.

        Attachments

        1. DAO.php
          61 kB
          Steve Binkowski

          Activity

          [CRM-17748] Expose options['result_buffering'] to CRM_Core_DAO
          Eileen McNaughton added a comment -

          A couple of notes - although @totten should look at this

          1) I'm not clear on how unbuffered helps (although I'm sure I can do some googling on that & edjumicate myself)
          2) I did some digging into memory leaks because I saw some odd things & found there is some dao level caching that can act as a memory leak I believe - here are the tests I did - you can see that after 1000 iterations quite different amounts of memory were in use.

          https://github.com/eileenmcnaughton/civicrm-core/commit/901799627e6e710a4345332d839036351ea15f4b

          I believe the logic is that the DAO caches the results of simple get $dao->find(); queries in memory and only frees them when you do a free or run an update or insert query (which has the same effect) so when iterating through large result sets it may be necessary to free the DAO ($dao->free) between rows.

          Tim Otten added a comment - - edited

          Wow, that's a pretty good catch – I wasn't aware that PHP/MySQL/DB_mysql buffered the full query results by default. A few thoughts:

          1. There seems to be a design discrepancy about the grain of the option. In the upstream php-mysql API, they treat it as a per-query decision (e.g. "choose mysql_query or mysql_unbuffered_query based on whether you expect jumbo-results"). But in DB_mysql, they treat it as a per-connection decision ("enable or disable buffering for all queries through the given connection"). Imagine a page-request that involves 10 queries. In the php-mysql design, you might run 8 buffered queries and 2 unbuffered queries. But in DB_mysql's design, you'd probably run 10 buffered XOR 10 unbuffered queries (or else you do a bit of juggling/swapping/cleanup to use mix).

          The proposed changes to CRM_Core_DAO sound like they aim for per-query buffering. (You either instantiate a new DAO for each query and call setOptions, or you use executeQuery and pass in $options.) Intuitively, I agree that per-query buffering seems like the better design.

          You may need to take some care to ensure that per-query options remain strictly within that query – and don't leak out to other queries. For example, https://gist.github.com/totten/31fbb6ac48db026c3493

          2. I don't know about your custom task, but skimming the base class – CRM_Contact_Form_Task – the functions preProcessCommon() and getContactIds() look like they will try to load the full list of contact IDs into memory. This doesn't negate the value of setting options['result_buffering']==FALSE, but if the goal is to avoid O( n ) memory usage, then it merits a look.

          3. Eileen McNaughton, I think you're referring to a similar-but-different issue. Both issues have the effect of using large amounts of memory, but they involve slightly different edge-cases and data-structures. The DAO-cache can grow large over the series of many independent, modestly-sized queries, and it can be mitigated by explicitly calling a free() function. The mysql buffering grows large with a single query. A free() function wouldn't make any difference - I think you'd see a symptom where the original call to mysql_query() or the first mysql_fetch() fails due to memory exhaustion.

          Steve Binkowski added a comment -

          1. I see your point regarding the bleed over to other queries, and can adjust my strategy appropriately. Your gist example can serve that purpose admirably.

          2. For our custom task, it actually isn't the initial getContactIds() which presents the problem, but a query later on (in the custom postProcess() function). It pulls full rows from a temp table, and uses a DAO object instantiated for that specific purpose. This proposed change should present an easy resolution to the issue we are encountering.

          3. I agree with your assessment on Eileen's findings. The issue we are encountering is from a single query returning a large number of rows. Just before the query is run, memory usage is at around 80mb. This one query blows through the rest of PHP's limit before the DAO cache has a chance to be part of the problem.

          I'll be returning to this issue on Monday, and will update this ticket with my proposed changes when I have something working locally.

          Steve Binkowski added a comment - - edited

          I've made the changes to CRM_Core_DAO, but ran into Tim's predicted getContactIds() issue while trying to test. I was under the impression that getContactIds() would be querying for only the contact ID field, when it is actually using the BAO layer to generate a default "all fields on deck" query.

          REF:
          https://github.com/civicrm/civicrm-core/blob/master/CRM/Contact/Form/Task.php#l309
          https://github.com/civicrm/civicrm-core/blob/master/CRM/Contact/Selector.php#l1222
          https://github.com/civicrm/civicrm-core/blob/master/CRM/Contact/BAO/Query.php#l437

          CRM_Contact_Selector instantiates CRM_Contact_BAO_Query, but it is passing NULL for the $fields parameter. This allows CRM_Contact_BAO_Query to generate a very complete list of fields to return via four different sources. The absolute waste here is that the task is only consuming the contact_id field, but has no way to inform the selector or BAO_Query of the simpler need.

          CRM_Contact_Selector->contactIDQuery(), at least in this context, is expected to minimally return a dataset with the contact_id field. Is it necessary for this method to query all fields? What options do we have for limiting this return?

          Edit: a grep for contactIDQuery() shows it is only used in this context. I think we might be safe altering it to request only the contact_id field. This sounds to me to be in line with the original intent.

          Steve Binkowski added a comment -

          Here's the modified CRM_Core_DAO. I'd like your feedback on these changes, as well.

          Tim Otten added a comment -

          Thank you! Merged.

          While testing, bumped into an unrelated issue with tasks (CRM-17799).

          Eileen McNaughton added a comment -

          Just an FYI - I think I referred to 'having seen a similar results issue' in contribution - I just came across the issue again https://issues.civicrm.org/jira/browse/CRM-17691

          Eileen McNaughton added a comment -

          Also, a question ... did you do any tests on what performance improvement you got out of this? (If it is significant I might consider backporting it for clients)

          Steve Binkowski added a comment -

          I have come across an issue with the previous patch. When an unbuffered query is run, the result object shows 0 rows returned, which is correct. That count is stored in DB_DataObject->N. However, DB_DataObject->fetch() short-circuits and returns if DB_DataObject->N==0, preventing any records from being consumed.

          This new patch creates a wrapper around the process of executing an unbuffered query, and fixes the issue by forcing N to TRUE. The justification for the hack-ish fix is that DB_DataObject does the same thing if the underlying db layer does not have a num_rows() function.

          https://github.com/civicrm/civicrm-core/pull/7694

          Steve Binkowski added a comment -

          @Eileen, I did not complete any execution benchmarks. Our goal was mostly concerned with memory usage. The unbuffered fix completes eliminates any concerns about memory overhead for the query we were targeting. I'd also be interested in seeing performance benchmarks, if you happen to come across some.

          Eileen McNaughton added a comment -

          I tried this out CRM-18128 Slow performance on export - it didn't give me any speed improvement but it allowed me to change the mysql limit without worrying about blowing out memory.

          Seemed to work fine!

          Eileen McNaughton added a comment -

          erm,.. better than fine

          Jitendra Purohit added a comment -

          Hi All,

          Changes done here seems to affect the search result (one of them filed at CRM-18284). All the task in the dropdown generates a DB error if the search results are being sorted by any of the field not present in 'civicrm_contact' table (eg - city, state, email, postal code, etc).

          As the "$this->_returnProperties" is modified to an empty array in the PR, the ORDER clause still has the sorted field in it, resulting in DB error of no such field.

          Thanks,
          Jitendra

          Monish Deb added a comment -

          Brian Shaughnessy I have submitted the PR https://github.com/civicrm/civicrm-core/pull/10585 for the second requirement - introduce a new parameter to executeQuery() to set internal DB options. In addition, I have included some other possible options here https://github.com/civicrm/civicrm-core/pull/10585/files#diff-6329da386c1213c4300469cf93a3d00cR1328. Didn't found any use-case to add unit tests for those options.

          Brian Shaughnessy added a comment -

          this looks good. this accomplishes what we need for now.

            People

            • Assignee:
              Brian Shaughnessy
              Reporter:
              Steve Binkowski

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 1 day
                1d
                Remaining:
                Remaining Estimate - 1 day
                1d
                Logged:
                Time Spent - Not Specified
                Not Specified