CRM-17454 New speed bump (since last LTS) on deduping

    Details

    • Type: Bug
    • Status: Done/Fixed
    • Priority: Minor
    • Resolution: Fixed/Completed
    • Affects Version/s: 4.6.9
    • Fix Version/s: 4.7
    • Component/s: None
    • Documentation Required?:
      None
    • Funding Source:
      Contributed Code

      Description

      On upgrading to 4.6 I discovered a new query that runs slowly in some circumstances.

      The query is

      SELECT
        table_name,
        column_name
      FROM information_schema.key_column_usage
      WHERE
        referenced_table_schema = database() AND
        referenced_table_name = 'civicrm_contact' AND
        referenced_column_name = 'id';

      and it was introduced in https://issues.civicrm.org/jira/browse/CRM-14500 in order to make the selection of which tables to merge dynamic.

      The fastest this query ran for me was locally at 150 ms (202 contacts, file_per_table = OFF) and the slowest was on a dev server with many databases & file_per_table = ON at 1 min 17 sec. Those extremes were from running the query directly; within the context of CiviCRM I experienced times of up to 17 seconds.

      It seems that running this query opens the db file for every database table the mysql user has permission to access. The slowest results were thus obtained running as the user with access to the most databases on the dev server. However, that was not the full picture, as it was still slower against some databases than others within that server, and this didn't seem to correlate directly with database size. Perhaps file fragmentation is a factor?

      Some information about why it opens each table file is available at http://dev.mysql.com/doc/refman/5.5/en/information-schema-optimization.html.
      Limiting the query to the database in use did not alter the speed - which is consistent with that link.

      I played around with setting innodb_stats_on_metadata = 0 and it seemed to shave some time off - locally from 250 ms to 150 ms, and at the upper extreme from 1 min 17 sec to 1 min 5 sec. It was 'fairly' consistent.
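      For reference, this is the kind of server-side tweak meant above. innodb_stats_on_metadata is a dynamic server variable, so it can be changed at runtime (changing the global value requires the SUPER privilege); this is a sketch, not a recommendation for every site:

```sql
-- Stop InnoDB from recalculating index statistics every time
-- information_schema metadata tables are queried:
SET GLOBAL innodb_stats_on_metadata = OFF;

-- Confirm the current value:
SHOW VARIABLES LIKE 'innodb_stats_on_metadata';

-- To persist across restarts, add to my.cnf under [mysqld]:
--   innodb_stats_on_metadata = 0
```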

      https://www.percona.com/blog/2011/12/23/solving-information_schema-slowness/

      My local was the only setup I tested with file_per_table OFF, as setting it to ON is standard for us, so I don't know whether that setting affects the speed.

      The data derived from the query only needs to be updated as often as a system flush is done - i.e. on upgrades, extension installs, module installs & custom group creation - so some sort of caching makes sense.
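      A rough sketch of the caching idea, assuming the CRM_Utils_Cache layer (which is flushed on a system flush); getContactReferenceColumns() is a hypothetical wrapper around the slow information_schema query, not a real core function:

```php
<?php
// Sketch only: cache the key_column_usage result until the next
// system flush, instead of hitting information_schema on every dedupe.
function getCachedContactReferences() {
  $cache = CRM_Utils_Cache::singleton();
  $refs = $cache->get('contact_reference_columns');
  if ($refs === NULL) {
    // getContactReferenceColumns() is hypothetical - it stands in for
    // the slow query against information_schema.key_column_usage.
    $refs = getContactReferenceColumns();
    // Cache entries are cleared on system flush, so this invalidates
    // on upgrades, extension installs and custom group creation.
    $cache->set('contact_reference_columns', $refs);
  }
  return $refs;
}
```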

        Attachments

          Activity

          Eileen McNaughton added a comment -

          Interestingly (alarmingly) I found an article about how to know if accessing your information schema could crash your DB & the query 'ticks the boxes' (in a bad way).

          http://www.pythian.com/blog/how-to-tell-when-using-information_schema-might-crash-your-database/

          "querying the INFORMATION_SCHEMA database on MySQL can completely lock a busy server for a long time. It can even crash it. It is very dangerous."

          This query met the criteria - EXPLAIN shows: Using where; Open_full_table; Scanned all databases
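          The check referred to above can be reproduced by running EXPLAIN on the query; the Extra column shown in the comment below is the output reported in this ticket and may vary by server:

```sql
-- EXPLAIN on the dedupe query reveals the danger markers from the
-- linked article in the Extra column:
EXPLAIN SELECT table_name, column_name
FROM information_schema.key_column_usage
WHERE referenced_table_schema = database()
  AND referenced_table_name = 'civicrm_contact'
  AND referenced_column_name = 'id';
-- Extra: Using where; Open_full_table; Scanned all databases
```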

          It is not possible to restructure this query to be 'safe' according to those criteria.

          I did witness some 'lost connection' messages while testing (which might be the symptom per https://bugs.mysql.com/bug.php?id=20110)

          We do see very occasional corruption of archive tables (log tables). I can't find any direct hits on searching for a connection between the 'unsafe' query & the archive table issue but it's pause for thought.

          Eileen McNaughton added a comment -

          This link suggests turning metadata off affects more than just query performance - also disk use.

          https://www.percona.com/blog/2013/12/03/innodb_stats_on_metadata-slow-queries-information_schema/

          Tim Otten added a comment -

          Agree caching that data would be good.

          Possible alternative: if the goal is to get a list of table references, maybe use CRM_Core_DAO::getReferencesToTable() which produces a similar list. The only difference is that it reads Civi's metadata rather than SQL, so it would miss out on third-party SQL tables.
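          A sketch of how the suggested alternative might be used in place of the information_schema query; the accessor names on the reference objects are assumed from the CRM_Core_Reference_Interface implementations mentioned below:

```php
<?php
// Sketch only: read the list of tables referencing civicrm_contact
// from Civi's own DAO metadata rather than from information_schema.
$refs = CRM_Core_DAO::getReferencesToTable('civicrm_contact');
foreach ($refs as $ref) {
  // Each entry implements CRM_Core_Reference_Interface
  // (e.g. CRM_Core_Reference_Basic, CRM_Core_Reference_Dynamic).
  echo $ref->getReferenceTable() . '.' . $ref->getReferenceKey() . "\n";
}
```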

          Eileen McNaughton added a comment -

          I'm sold on using that - since there appear to be avoidable AND unavoidable risks in what we are seeing.

          Assuming the function you suggest already looks at custom data, then the only thing missing (IMHO) is a hook to allow other tables to be added in.

          NB the inclusion of non-core tables was NOT a driver for the change that introduced this query - but I think it makes sense that it should be possible to include them

          Coleman Watts added a comment -

          +1 I would see non-civicrm tables as a "nice to have" but really not a priority right now.

          Tim Otten added a comment -

          A. IMHO, there's nothing wrong with using a cache with the current query. For a point-release, it's certainly safer/easier.

          B. +1, non-civicrm tables as "nice to have" but not important.

          C. Drilling into getReferencesToTable():

          1. Good call on custom data - I think it's missing. If you add it, there should be some way to differentiate custom-data refs. (Ex: CRM_Core_Reference_Interface has a few implementations – Basic, Dynamic, OptionValue. Maybe add CRM_Core_Reference_CustomData?)
          2. It also reports dynamic FKs (e.g. "civicrm_entity_tag").
          3. If third-party code implements hook_civicrm_entityTypes and provides a DAO (as in CiviVolunteer and SEPA), then it should already detect those tables. But it might also be good to have a more targeted hook.

          D. A seed to plant in the back of the brain... conceptually we might really want CRM_Core_DAO::findReferences() rather than CRM_Core_DAO::getReferencesToTable(). (At a high level, "merge" == "findReferences + pick + replace".) But in terms of replacing that one query, getReferencesToTable() is a closer match.

          Eileen McNaughton added a comment -

          Funnily enough I had concluded that anything but querying the info schema was safer for a point release - but that's because the web articles about DB crashes (from my first comment) put the wind up me.

          Eileen McNaughton added a comment -

          I've pushed up PRs & tests to address this.

          I couldn't find any references to the entity_types hook so have added one - http://wiki.civicrm.org/confluence/display/CRMDOC/hook_civicrm_entityTypes (feel free to fix the formatting).

          Are we thinking we would extend to

          $entityTypes[] = array(
            'name' => 'VolunteerNeed',
            'class' => 'CRM_Volunteer_DAO_Need',
            'table' => 'civicrm_volunteer_need',
            'references' => array('civicrm_contact' => array('contact_id')),
          );

          or some more detailed variant of the above?

          Eileen McNaughton added a comment -

          I'm closing this against 4.7 - but we are also running it against 4.6 on our sites. There is a noticeable improvement for large sites AND for sites where there are many DBs on one server, as the information schema is the bottleneck.

          Eileen McNaughton added a comment -

          Note this is the 4.6 patch https://github.com/eileenmcnaughton/civicrm-core/commit/fe1be0f66a30cd81cecacafce8dea8db1760e0ea

          I'm going to delete the github branch - but the commit doesn't go anywhere, so I'm storing the link in case we want to add it to 4.6.

            People

            • Assignee:
              Eileen McNaughton
            • Reporter:
              Eileen McNaughton
