Details
-
Type: Bug
-
Status: Done/Fixed
-
Priority: Major
-
Resolution: Won't Fix
-
Affects Version/s: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.5, 3.4.alpha, 3.4.beta, 3.4.0, 4.0.0
-
Fix Version/s: 4.4.0
-
Component/s: Core CiviCRM
-
Labels:None
Description
As mentioned in my dedupe proof-of-concept
http://forum.civicrm.org/index.php?topic=12563.0
one of the best ways to improve the process is to use UNION queries rather than WHERE(foo OR bar).
Currently for the following rule
first name: weight-5
last name: weight-5
threshold: 9
Dedupe is running the following query:
<code>INSERT INTO dedupe (id1, id2, weight) SELECT t1.id id1, t2.id id2, 5 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (last_name) WHERE t1.id < t2.id AND t1.last_name IS NOT NULL AND (t1.id IN (732622,732623,732626) OR t2.id IN (732622,732623,732626)) GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);</code>
The SELECT portion is:
<code>
SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (last_name)
WHERE t1.id < t2.id
AND t1.last_name IS NOT NULL
AND (
t1.id IN (732622,732623,732626)
OR t2.id IN (732622,732623,732626)
)
GROUP BY id1, id2
</code>
Here's the EXPLAIN
<code>
-----------------------------------------------------------------------------------------------------------------------------------------------------+
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
-----------------------------------------------------------------------------------------------------------------------------------------------------+
1 | SIMPLE | t1 | ALL | index_last_name | NULL | NULL | NULL | 707049 | Using where; Using temporary; Using filesort |
1 | SIMPLE | t2 | ref | PRIMARY,index_last_name | index_last_name | 33 | democrats_civicrm.t1.last_name | 2 | Using where |
-----------------------------------------------------------------------------------------------------------------------------------------------------+
</code>
Ouch, that's 1.4M records scanned to dedupe 3 contacts.
If we remove the "OR t2.id IN (732622,732623,732626)" then here is the EXPLAIN:
<code>
----------------------------------------------------------------------------------------------------------------------------------------------------+
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
----------------------------------------------------------------------------------------------------------------------------------------------------+
1 | SIMPLE | t1 | range | PRIMARY,index_last_name | PRIMARY | 4 | NULL | 3 | Using where; Using temporary; Using filesort |
1 | SIMPLE | t2 | ref | PRIMARY,index_last_name | index_last_name | 33 | democrats_civicrm.t1.last_name | 2 | Using where |
----------------------------------------------------------------------------------------------------------------------------------------------------+
</code>
Yay, only 6 records scanned!
To get the other half of the results back we can setup a UNION query:
<code>
SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (last_name)
WHERE t1.id < t2.id
AND t1.last_name IS NOT NULL
AND (
t1.id IN (732622,732623,732626)
)
GROUP BY id1, id2
UNION ALL SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (last_name)
WHERE t1.id < t2.id
AND t1.last_name IS NOT NULL
AND (
t2.id IN (732622,732623,732626)
)
GROUP BY id1, id2
</code>
The EXPLAIN
<code>
----------------------------------------------------------------------------------------------------------------------------------------------------------+
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
----------------------------------------------------------------------------------------------------------------------------------------------------------+
1 | PRIMARY | t1 | range | PRIMARY,index_last_name | PRIMARY | 4 | NULL | 3 | Using where; Using temporary; Using filesort |
1 | PRIMARY | t2 | ref | PRIMARY,index_last_name | index_last_name | 33 | democrats_civicrm.t1.last_name | 2 | Using where |
2 | UNION | t2 | range | PRIMARY,index_last_name | PRIMARY | 4 | NULL | 3 | Using where; Using temporary; Using filesort |
2 | UNION | t1 | ref | PRIMARY,index_last_name | index_last_name | 33 | democrats_civicrm.t2.last_name | 2 | Using where |
NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL |
----------------------------------------------------------------------------------------------------------------------------------------------------------+
</code>
Awesome, we went from 1.4M records scanned, to 12. Query runs in .2 seconds as opposed to 12mins!
I took a cursory browse through the code and it should be a fairly simple modification. I'll ask my client if they want to sponsor the work.