CRM-10700 TRUNCATE is slow on some versions of Mysql InnoDB

    Details

    • Type: Patch
    • Status: Done/Fixed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 4.1.1, 4.2.0
    • Fix Version/s: Unscheduled
    • Component/s: Core CiviCRM
    • Labels:
    • Documentation Required?:
      None
    • Funding Source:
      Needs Funding

      Description

      There has been a long-standing bug in InnoDB which causes TRUNCATE to be much slower than DELETE FROM, or even DROP / CREATE. I noticed the problem while running the "devel" module query profiling. Truncates can take 20-200ms, on an empty table, while a delete is < 1ms.

      This bug was reported in Mysql 4.1, http://bugs.mysql.com/bug.php?id=7150, and still affects Mysql 5.5.24
      Correction: Current behavior is explained here: http://dev.mysql.com/doc/refman/5.6/en/innodb-other-changes-truncate.html

      The attached patch replaces instances of "TRUNCATE" in core code.

      Incidentally, the worst offender here was CRM_ACL_BAO_Cache, which insists on truncating an empty table at every opportunity.

        Attachments

        1. CRM-10700_v2.patch
          3 kB
          Adam Wight
        2. delete_from.patch
          4 kB
          Adam Wight

          Activity

          [CRM-10700] TRUNCATE is slow on some versions of Mysql InnoDB
          Deepak Srivastava added a comment -

          Its possible that due to frequent resets & refill cache may hit the limit of int(10), as deletes make id column auto-increment from last position / value (and not from 1).

          It may be useful to truncate once in a while while keeping the deletes default. May be do truncate when menu is manually rebuilt from UI OR during upgrades, and continue to use deletes for other operations ?

          Tim Otten added a comment -

          I'm no longer able to run unit-tests in branches/v4.2 – after testing several recent commits, it appears that this one breaks things:

          https://fisheye2.atlassian.com/changelog/CiviCRM?cs=42142

          For example, when running "./scripts/phpunit CRM_Core_RegionTest", this message appears:

          ---------------
          Installing civicrm_tests_dev database
          Sorry. A non-recoverable error has occurred. The error trace below might help to resolve the issue<p><p><pre>Array
          (
          [callback] => Array
          (
          [0] => CRM_Core_Error
          [1] => handle
          )

          [code] => -3
          [message] => DB Error: constraint violation
          [mode] => 16
          [debug_info] => INSERT INTO civicrm_setting (group_name , name , value , domain_id , is_domain , created_date ) VALUES ('Directory Preferences' , 'customFileUploadDir' , 's:7:\"custom/\";' , 1 , 1 , 20120831070207 ) [nativecode=1452 ** Cannot add or update a child row: a foreign key constraint fails (`civicrm_tests_dev`.`civicrm_setting`, CONSTRAINT `FK_civicrm_setting_domain_id` FOREIGN KEY (`domain_id`) REFERENCES `civicrm_domain` (`id`) ON DELETE CASCADE)]
          [type] => DB_Error
          [user_info] => INSERT INTO civicrm_setting (group_name , name , value , domain_id , is_domain , created_date ) VALUES ('Directory Preferences' , 'customFileUploadDir' , 's:7:\"custom/\";' , 1 , 1 , 20120831070207 ) [nativecode=1452 ** Cannot add or update a child row: a foreign key constraint fails (`civicrm_tests_dev`.`civicrm_setting`, CONSTRAINT `FK_civicrm_setting_domain_id` FOREIGN KEY (`domain_id`) REFERENCES `civicrm_domain` (`id`) ON DELETE CASCADE)]
          [to_string] => [db_error: message="DB Error: constraint violation" code=-3 mode=callback callback=CRM_Core_Error::handle prefix="" info="INSERT INTO civicrm_setting (group_name , name , value , domain_id , is_domain , created_date ) VALUES ('Directory Preferences' , 'customFileUploadDir' , 's:7:\"custom/\";' , 1 , 1 , 20120831070207 ) [nativecode=1452 ** Cannot add or update a child row: a foreign key constraint fails (`civicrm_tests_dev`.`civicrm_setting`, CONSTRAINT `FK_civicrm_setting_domain_id` FOREIGN KEY (`domain_id`) REFERENCES `civicrm_domain` (`id`) ON DELETE CASCADE)]"]
          )
          ---------------

          Environment: Debian Squeeze, PHP 5.3.3, MySQL 5.1.61, CiviCRM (branches/v4.2)

          Eileen McNaughton added a comment -

          Hi,

          My experience is that TRUNCATE is often faster than DELETE (when there is data in the tables - perhaps not when they are empty) - if this is a version specific SQL bug then is it fixed in 5.5.27?

          Should this really be added to a point release?

          Eileen McNaughton added a comment -

          OK - I can confirm that the tests start passing again when the changes on CiviUnitTestCase.php are reverted.

          Not sure about the rest but my feeling is the changes should be 4.3 not 4.2.1

          Tim Otten added a comment -

          Eileen also found that this patch caused many regressions in 4.2, so I reverted it in branches/v4.2 (r42208). Assuming someone is going to fix this, could the work be done in a different branch (maybe trunk or branches/trunk.deletefrom)?

          Deepak Srivastava added a comment -

          Looking at the comments and implications pushing to v.4.3 for now.

          Adam could you reply the comments.

          Adam Wight added a comment -

          Interesting to hear about the breakage... As Deepak mentioned, the only difference in behavior would have been that autoincrement columns are not reset. The code with "SET FOREIGN_KEY_CHECKS=0;" around it is almost certainly to blame-- we should take more care to truncate tables in order, and verify that this works with foreign key constraints enabled. That said, it makes sense to disable the constraints in production because it probably saves quite a bit of db power.

          The second and less likely explanation for the breakage might be that something is referencing civicrm_domain using a hardcoded id of "1".

          Adam Wight added a comment -

          I think the regressions were all due to unit tests relying on the primary key being reset.

          This version of the patch doesn't change the TRUNCATE statements in CiviUnitTestCase, and seems to pass after a few spot-tests.

          However, it would be great if Eileen could confirm that TRUNCATE is actually faster on some versions of mysql, so we can decide how to triage.

          Adam Wight added a comment -

          Also, we are only talking about mysql/innodb tables here.

          Adam Wight added a comment -

          Here are two scary quotes from the mysql bug cited in the description:

          > By the way, if the table you are truncating is being referenced by foreign key constraints, also the new TRUNCATE TABLE code will resort to DELETE row by row.

          and

          > apparently [DELETE FROM] degrades markedly over time

          Eileen McNaughton added a comment - - edited

          My take on this is

          1) TRUNCATE is faster than delete where there are no foreign keys
          2) Truncate has added benefits when using file-per-table. It releases disk space by recreating the IBD file
          3) Where there are foreign keys it falls back to using Delete. It is perceived as slower that just doing delete here - but that could be because it locks the table causing the slowness to be experienced by others
          4) I think that the first question we should look at is whether we should have foreign keys on _cache tables. Taking the acl_contact_cache table first. The foreign keys cause contacts to be removed from the acl_cache table if they are deleted from the database. However, the cache table is generally used as part of a larger query which involves the contact table & just provides some restriction on that query - limits the available IDS. Having a contact in the acl_contact_cache table who no longer exists is only a problem if there is a left join onto the contact_cache table and no WHERE clause relevant to contacts. This would really only happen where the search is specifically in the case where someone is specifically searching for contacts that may or may not be deleted (since normally is_deleted =1).

          My suggestion is to remove the constraints in 4.7 in a way that would not crash had people previously deleted them and to add specific unit tests (via the api & possibly a saved search test) to test for the 'might-be-deleted' scenario above.

          Other tables in it:
          civicrm_cache - surely removing the fk to component is a no-brainer here
          civicrm_group_contact_cache - known performance killing table with recurring issues around truncating it. We should treat this similar to acl_contact_cache
          civicrm_menu - has foreign keys to domain_id & component_id - less sure if this makes a difference performance-wise as relatively small table
          civicrm_acl_cache keys to acl_id & contact_id - can add tests & remove keys I think
          civicrm_prev_next_cache - no foreign keys - should be OK?
          Full-text search - not too sure what table is affected here yet

          Tim Otten added a comment -

          1. I generally like the idea of removing FK's on cache tables.

          2. An edge case to be conscientious of – Suppose:

          • We execute testA() which creates a contact (CID 1) and a cache record related to CID 1.
          • We truncate the contact table. This deletes CID 1 and resets the starting ID.
          • The cache record for CID 1 is left behind (because we don't have FK).
          • We execute testB() which creates another contact (CID 1 again).
          • The cache record still now references the new contact.
            => The cache data leaks between tests.

          3. The above could be resolved by explicitly clearing out the cache tables. (Possible we already do that anyway?) Not sure if that's a net boost to performance, but it might be.

          Adam Wight added a comment -

          Really great to see light cast on this issue! All I can add at the moment is that foreign keys can be added without the DELETE CASCADE option, this might change the truncate behavior. Code using these tables would have to be robust against missing records or nulls, though.

          Eileen McNaughton added a comment -

          Regarding the DELETE CASCADE - that is the only value the FKs offer currently - so the goal is to be sure we wouldn't miss it.

          Tim - your edge case sounds like a test-only edge case? So as long as we truncate the cache tables in the classes that test them we should be OK?

          Eileen McNaughton added a comment - - edited

          NB - on my local DELETE FROM is SLOWER than truncate

          mysql> DELETE FROM civicrm_group_contact_cache;
          Query OK, 336021 rows affected (2.70 sec)

          mysql> truncate civicrm_group_contact_cache;
          Query OK, 0 rows affected (0.08 sec)

          I suspect various mysql things play into it. Possibly it's the clearing out of the mysql query cache that slows things down.

          However,

          1) I don't think ANYONE would have a performance downside from removing the FKs on cache tables.
          2) We might treat this like caching & have a function to clear tables which can be configured (setting?) to DELETE FROM or TRUNCATE rather than one size fits all.

          Whether it's worth doing #2 will depend on some performance testing on whether a DB that experiences a slow TRUNCATE with FKs experiences a substantial improvement without FKs on the same tables.

          Adam Wight added a comment -

          Just a small correction to the last comment--looks like truncate is faster?

          Eileen McNaughton added a comment -

          Yes - on my local truncate is faster, even with FKs! Also on the server I tried it on I got the same result.

          I'd like to check how truncate with no FKs preforms for you, vs truncate with FKs vs DELETE FROM

          Eileen McNaughton added a comment -

          (edited the above)

          Eileen McNaughton added a comment -

          More discussion on DELETE vs TRUNCATE

          (2:54:54 PM) eileen: so, the customer who just contacted me is seeing mails go out 2-3 times per person
          (2:55:15 PM) eileen: & we are trying the server-wide lock since it is a multisite
          (2:56:34 PM) totten: any particular reason to suspect locking as the issue?
          (2:57:20 PM) eileen: well - the recipients count is correct
          (2:57:26 PM) eileen: it's the delivered that isn't
          (2:57:48 PM) eileen: they do run crons that could overlap on different domains
          (2:58:13 PM) eileen: & the symptoms are the same as they were when the cache clearing was stealing locks (leading to hackyHandleBrokenCode
          (2:58:31 PM) totten: ah, ok
          (2:59:11 PM) totten: don't have any brilliant advice... but it might be interesting to chek if the customers config has any analog in api_v3_JobProcessMailingTest
          (2:59:41 PM) eileen: oh they are also seeing lots of mailings called 'placeholder' in the ui...
          (3:00:05 PM) totten: eileen: hmm, wasn't that fixed in a point release?
          (3:00:22 PM) eileen: they are 4.6.8 - you don't have a link for that do you?
          (3:00:24 PM) totten: incidentally – that would cause the symptom of users getting multiple messages.
          (3:00:53 PM) eileen: for the same mail?
          (3:00:55 PM) totten: it was an issue with CREATE TEMP TABLE / DROP TABLE breaking transactions
          (3:01:14 PM) eileen: tell me more....
          (3:01:40 PM) totten: the problem we tracked down involved previewing mailings to smart groups
          (3:02:26 PM) eileen: hmm
          (3:02:28 PM) totten: when you preview the mailing, it does a trial run to see what would happen if you scheduled the mailing. (ie #recips)
          (3:02:43 PM) totten: and then rolls back.
          (3:03:13 PM) totten: but if you were composing a message to a smart group, then it needed to setup a temp table to flesh out hte smart group
          (3:03:40 PM) totten: interestingly, temp tables have a quirk when you create them as part of the transaction
          (3:04:02 PM) totten: if you do CREATE TEMPORARY TABLE ... DROP TEMPORARY TABLE, then the transaction works exaclty as you'd expect
          (3:04:23 PM) totten: but if you do CREATE TEMPORARY TABLE ... DROP TABLE, then it force-commits the transaction when you drop
          (3:04:30 PM) totten: so rollbacks won't work
          (3:04:36 PM) eileen: totten: OK - so in light of that here is a comment the customer made when asked about recent changes "i made some alterations to groupContactCache.php, but only to replace DELETE statements with TRUNCATE"
          (3:04:49 PM) totten: lol
          (3:04:56 PM) totten: yeah, that's the problem
          (3:05:12 PM) totten: TRUNCATE will force-commit the transaction
          (3:05:31 PM) eileen: OK - so I have an immediate fix for the customer - but
          (3:06:26 PM) eileen: I had been going down the path of 'the way to stop these performance probs is to get rid of foreign keys from cache tables & use truncate'
          (3:07:06 PM) eileen: (which is a next week problem ... not a today but it might be a hard one to figure out)
          (3:07:24 PM) totten: do you think there's perf benefit of just getting rid of the fkeys – and still using delete?
          (3:08:35 PM) totten: incidentally... when working tx mgmt, i kept thinking, "we should have some code which throws an error if you issue a TRUNCATE / ALTER / DROP TABLE /etc that would break transactionality"
          (3:08:50 PM) eileen: possibly - I need more tests - I have one customer who says ONLY TRUNCATE wo't kill my DB & another who says TRUNCATE destroys my DB
          (3:09:34 PM) eileen: my tests show TRUNCATE is about 7 secs faster but doesn't lock the table
          (3:10:05 PM) totten: "kill/destroy" --> the two customers have disagreeing results on whether truncate-vs-delete is better for performance?
          (3:10:27 PM) eileen: yes
          (3:11:12 PM) eileen: but I think the reason is that TRUNCATE is as slow as DELETE from for the customer with binary logging AND locks the table
          (3:11:32 PM) eileen: whereas for the other TRUNCATE is almost instant
          (3:11:56 PM) totten: wait, which one has binary logging? the one who likes DELETE or who likes TRUNCATE?
          (3:12:02 PM) eileen: (not sure if binary logging or FKs or ? causes is to be slow for that customer - still undecided)
          (3:12:08 PM) eileen: binary logging likes DELETE
          (3:12:18 PM) totten: ok
          (3:12:30 PM) eileen: binary logging DB is about 10 * bigger too
          (3:13:25 PM) eileen: so 2 questions 1) the answer to the performance Q (I'm still digging on that) & 2) if TRUNCATE turns out to be the way to get good performance - where does that leave the CiviMail UI
          (3:14:09 PM) totten: well, if you sprinkle truncations at random times, it will break anything transactional
          (3:14:20 PM) totten: civimail just happens to have the worst failure mode
          (3:14:54 PM) eileen: yeah - the problem is do we have a choice between transactions vs a responsive DB
          (3:15:37 PM) totten: don't suppose these customers would be inclined to rework smart-groups to make the whole question go away?
          (3:15:58 PM) eileen: well - maybe in the longer term
          (3:16:26 PM) totten: i.e. smart groups should be updated incrementally
          (3:16:52 PM) totten: (if you have a large db ... current pattern probably acceptable on small db)
          (3:17:18 PM) eileen: yeah - I don't actually know why it is how it is- but I know that deleting a single smart group or a handful of smartgroups can lock the table & cause deadlocks
          (3:17:30 PM) eileen: so, I assume truncate came in because of complaints about that
          (3:20:23 PM) totten: seems like a reasonable theory
          (3:22:24 PM) totten: if someone just wanted to remove the connection between civmail ui and txn, then that's possible... but i think it's a scary project to remove transactions generally.
          (3:23:07 PM) totten: feels like more value comes from reengineering smart-groups so to make large-scale delete/truncate unnecessary
          (3:28:37 PM) eileen: totten: yeah - I guess so - something to think about
          (3:31:59 PM) eileen: totten: AFAIK this is the only place where transaction roolback occur when nothing has gone wrong?
          (3:32:38 PM) totten: at runtime, yes. it's unusual.
          (3:32:59 PM) totten: tests also do rollbacks in (their) normal circumstances
          (3:33:09 PM) eileen: right - yes
          (3:33:34 PM) eileen: & tests have the luxury of having small datasets too
          (3:33:58 PM) totten: that's true
          (3:34:23 PM) monish left the room.
          (3:34:44 PM) eileen: The problem with the group-contact-cache is that you aren't just adding ... you are removing too
          (3:35:33 PM) totten: so... basically, anytime something changes in a contact's graph, you'd want to schedule a check to see which smart-groups to put that person in
          (3:35:54 PM) eileen: yes, which includes a new calendar day
          (3:36:08 PM) eileen: since smart groups include things like made a donation last month
          (3:37:28 PM) totten: hmm, if you're using relative dates as a smart-group criterion, then yes - then those smart groups effectively need a full refresh once a day
          (3:38:10 PM) eileen: I'm trying to think about other odd ones - I think you can also have custom searches with related contact data
          (3:38:14 PM) totten: (or else need some sophisticated date-math ... prob on par with the date-math in scheduled-reminders)
          (3:40:44 PM) totten: at the end of the day, don't smart-group updates boil-down to a query like:
          (3:41:08 PM) totten: INSERT INTO smartgroupcache SELECT contact_id {$expr}
          (3:41:13 PM) totten: which could be rewritten as
          (3:41:35 PM) totten: SELECT contact_id {$expr} WHERE contact_id IN ($recentlyChangedContacts)
          (3:42:42 PM) eileen: well - you also need to delete from smart group cache
          (3:42:48 PM) totten: simple-search vs advanced-search custom-search => different ways of computing {$expr}
          (3:42:50 PM) eileen: which I suppose you could do the dame way
          (3:42:55 PM) eileen: same
          (3:43:00 PM) sushant [~sushant@123.201.113.22] entered the room.
          (3:43:26 PM) eileen: the question is how many smart-group possibilities rely on things beyond the immediate contact
          (3:43:30 PM) totten: right. the SELECT basically returns a true (cid matched) or false (cid omitted) which can decide whether to insert / preserve /delete
          (3:44:17 PM) eileen: The delete issue is probably harder on hierarchies
          (3:44:19 PM) totten: oh, i see. so, e.g., a smart-group based on a financial_type
          (3:44:25 PM) totten: and the financial_type of a record changes
          (3:44:33 PM) eileen: yeah
          (3:45:13 PM) eileen: or you can do advanced search (& hence smart group) on whether they have a related contact in group x
          (3:45:16 PM) totten: well, you could add a bit to BAO_Query where any code code that influences {$expr} can also define a watch-list
          (3:46:02 PM) eileen: for hierarchicals - you can't decide to delete an entry from a parent group by calculating any one group
          (3:46:05 PM) totten: basically, the watch-list says, "if there's a change like X, then refresh the smartgroup"
          (3:46:14 PM) eileen: ie groups 1-25 are members of group 26
          (3:46:42 PM) eileen: you can REPLACE INTO to get who should be in the group - but you have to calculate all 25 to know who shouldn't
          (3:47:45 PM) totten: you'd calculate all 25 anyway, right? and you'd want hte 25 to calculate quickly
          (3:47:48 PM) eileen: and some parent groups are smart groups TOO!
          (3:48:17 PM) eileen: yeah so you either do what they do now & delete the parent group from group_contact_cache & rebuild it all
          (3:48:23 PM) eileen: or you calculate who to remove
          (3:48:32 PM) totten: then the 26th is then computed with its variant of "SELECT cid {$expr} WHERE cid in(recent)"
          (3:48:52 PM) totten: the {$expr} in the 26th case happens to check memberships for the other 25
          (3:49:20 PM) eileen: so you would do DELETE FROM civicrm_group_contact_cache where group_id = 26 and contact_id NOT IN recent
          (3:49:45 PM) totten: ah, yeah
          (3:49:51 PM) eileen: because you assume the act of updating the 25 groups would force that flag
          (3:50:05 PM) eileen: which means ... you can't use modified_date on the contact
          (3:50:12 PM) eileen: because you would update it too often
          (3:50:23 PM) eileen: so you'd need an extra field on the contact
          (3:50:32 PM) totten: do we change contact.modified_date for smart-group changes?
          (3:50:44 PM) eileen: no
          (3:50:57 PM) eileen: & actually this isn't just a change
          (3:51:10 PM) eileen: the question is not did the person get added to one of the 25 groups
          (3:51:20 PM) eileen: it is are they still in one of the 25 groups
          (3:51:26 PM) totten: oh, i see. you're saying the 'modified_date' might be good enough for detecting $recentlyChangedcontacts – i'd agree with that
          (3:52:07 PM) eileen: so, one question - is the mechanism of one cache for calculating smart groups & hierarchical groups correct?
          (3:53:38 PM) totten: that sounds reasonable. isn't a heirarchal group basically a "a smart group composed of the union of other groups" ?
          (3:53:38 PM) eileen: so civicrm_group table has a cache_date & a refresh_date - I guess we could add some combo of those to civicrm_contact
          (3:54:03 PM) eileen: yeah - I guess the hierarchical group is & maybe it would be better stored that way
          (3:56:54 PM) eileen: (I'm still struggling a bit with believing that transactions actually do more good than bad in the code base - since mostly they just remove data that might help you recover from something going wrong in my experience & I can't figure out how people cope if they don't have the log tables)
          (3:57:07 PM) totten: hmm, my brain is failing me. i feel like there'd be related materials to this under the heading of 'chaining' in a compsci text
          (3:57:37 PM) eileen:

          Seamus Lee added a comment -

          Recently i put a PR in which got merged in which made the smart group cache code match what the UI says about the smart group cache timeout variable which is that setting it to 0 = no cache, and that 0 is an acceptable value.

          The PR is here https://github.com/civicrm/civicrm-core/pull/6694

          One of the important implications is that it brings in a truncate statement here when the value is set to 0 https://github.com/civicrm/civicrm-core/blob/master/CRM/Contact/BAO/GroupContactCache.php#L323.

          Given the negative impacts truncate has,

          There are 2 options

          1. We re-write that excluding the truncate statement
          2. Revert my PR and change the UI text to say you can't set it to 0 and maybe even add a form rule to that effect

          David Greenberg added a comment -

          Seamus Lee are you able to test out alternative(s) to truncate on your large dataset? (DELETE FROM ...)

          Adam Wight added a comment -

          JCrespo (WMF) just pointed out another important detail. We need to avoid unlimited DELETE FROM statements as well, because they cause replication lag. The best practice is to delete a limited batch (e.g. 10,000 rows) at a time, then wait for the database slaves to catch up before deleting the next batch.

          See https://github.com/wikimedia/mediawiki-extensions-ORES/blob/master/includes/Cache.php#L64-L91

          Eileen McNaughton added a comment -

          will re-open if this seems a priority to us to fix

            People

            • Assignee:
              Pratik Joshi
              Reporter:
              Adam Wight

              Dates

              • Created:
                Updated:
                Resolved: