Uploaded image for project: 'CiviCRM'
  1. CiviCRM
  2. CRM-7809

Dedupe should use UNION queries rather than WHERE (foo OR bar) for a huge performance gain.

    Details

    • Type: Bug
    • Status: Done/Fixed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.5, 3.4.alpha, 3.4.beta, 3.4.0, 4.0.0
    • Fix Version/s: 4.4.0
    • Component/s: Core CiviCRM
    • Labels:
      None

      Description

      As mentioned in my dedupe proof-of-concept
      http://forum.civicrm.org/index.php?topic=12563.0
      one of the best ways to improve the process is to use UNION queries rather than WHERE(foo OR bar).

      Currently for the following rule
      first name: weight-5
      last name: weight-5
      threshold: 9

      Dedupe is running the following query:
      <code>INSERT INTO dedupe (id1, id2, weight) SELECT t1.id id1, t2.id id2, 5 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (last_name) WHERE t1.id < t2.id AND t1.last_name IS NOT NULL AND (t1.id IN (732622,732623,732626) OR t2.id IN (732622,732623,732626)) GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);</code>

      The SELECT portion is:
      <code>
      SELECT t1.id id1, t2.id id2, 5 weight
      FROM civicrm_contact t1
      JOIN civicrm_contact t2
      USING (last_name)
      WHERE t1.id < t2.id
      AND t1.last_name IS NOT NULL
      AND (
      t1.id IN (732622,732623,732626)
      OR t2.id IN (732622,732623,732626)
      )
      GROUP BY id1, id2
      </code>

      Here's the EXPLAIN
      <code>
      -----------------------------------------------------------------------------------------------------------------------------------------------------+

      id select_type table type possible_keys key key_len ref rows Extra

      -----------------------------------------------------------------------------------------------------------------------------------------------------+

      1 SIMPLE t1 ALL index_last_name NULL NULL NULL 707049 Using where; Using temporary; Using filesort
      1 SIMPLE t2 ref PRIMARY,index_last_name index_last_name 33 democrats_civicrm.t1.last_name 2 Using where

      -----------------------------------------------------------------------------------------------------------------------------------------------------+
      </code>

      Ouch, that's 1.4M records scanned to dedupe 3 contacts.

      If we remove the "OR t2.id IN (732622,732623,732626)" then here is the EXPLAIN:

      <code>
      ----------------------------------------------------------------------------------------------------------------------------------------------------+

      id select_type table type possible_keys key key_len ref rows Extra

      ----------------------------------------------------------------------------------------------------------------------------------------------------+

      1 SIMPLE t1 range PRIMARY,index_last_name PRIMARY 4 NULL 3 Using where; Using temporary; Using filesort
      1 SIMPLE t2 ref PRIMARY,index_last_name index_last_name 33 democrats_civicrm.t1.last_name 2 Using where

      ----------------------------------------------------------------------------------------------------------------------------------------------------+
      </code>

      Yay, only 6 records scanned!

      To get the other half of the results back we can setup a UNION query:
      <code>
      SELECT t1.id id1, t2.id id2, 5 weight
      FROM civicrm_contact t1
      JOIN civicrm_contact t2
      USING (last_name)
      WHERE t1.id < t2.id
      AND t1.last_name IS NOT NULL
      AND (
      t1.id IN (732622,732623,732626)
      )
      GROUP BY id1, id2

      UNION ALL SELECT t1.id id1, t2.id id2, 5 weight
      FROM civicrm_contact t1
      JOIN civicrm_contact t2
      USING (last_name)
      WHERE t1.id < t2.id
      AND t1.last_name IS NOT NULL
      AND (
      t2.id IN (732622,732623,732626)
      )
      GROUP BY id1, id2
      </code>

      The EXPLAIN
      <code>
      ----------------------------------------------------------------------------------------------------------------------------------------------------------+

      id select_type table type possible_keys key key_len ref rows Extra

      ----------------------------------------------------------------------------------------------------------------------------------------------------------+

      1 PRIMARY t1 range PRIMARY,index_last_name PRIMARY 4 NULL 3 Using where; Using temporary; Using filesort
      1 PRIMARY t2 ref PRIMARY,index_last_name index_last_name 33 democrats_civicrm.t1.last_name 2 Using where
      2 UNION t2 range PRIMARY,index_last_name PRIMARY 4 NULL 3 Using where; Using temporary; Using filesort
      2 UNION t1 ref PRIMARY,index_last_name index_last_name 33 democrats_civicrm.t2.last_name 2 Using where
      NULL UNION RESULT <union1,2> ALL NULL NULL NULL NULL NULL  

      ----------------------------------------------------------------------------------------------------------------------------------------------------------+
      </code>

      Awesome, we went from 1.4M records scanned, to 12. Query runs in .2 seconds as opposed to 12mins!

      I took a cursory browse through the code and it should be a fairly simple modification. I'll ask my client if they want to sponsor the work.

        Attachments

          Activity

            People

            • Assignee:
              deepak Deepak Srivastava
              Reporter:
              dalin dave hansen-lange
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: