[CRM-7809] Dedupe should use UNION queries rather than WHERE (foo OR bar) for a huge performance gain. - CiviCRM Issue Tracker

Details

Type: Bug
Status: Done/Fixed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 3.3.0, 3.3.1, 3.3.2, 3.3.3, 3.3.5, 3.4.alpha, 3.4.beta, 3.4.0, 4.0.0
Fix Version/s: 4.4.0
Component/s: Core CiviCRM
Labels:
None

Description

As mentioned in my dedupe proof-of-concept
http://forum.civicrm.org/index.php?topic=12563.0
one of the best ways to improve the process is to use UNION queries rather than WHERE(foo OR bar).

Currently for the following rule
first name: weight-5
last name: weight-5
threshold: 9

Dedupe is running the following query:
<code>INSERT INTO dedupe (id1, id2, weight) SELECT t1.id id1, t2.id id2, 5 weight FROM civicrm_contact t1 JOIN civicrm_contact t2 USING (last_name) WHERE t1.id < t2.id AND t1.last_name IS NOT NULL AND (t1.id IN (732622,732623,732626) OR t2.id IN (732622,732623,732626)) GROUP BY id1, id2 ON DUPLICATE KEY UPDATE weight = weight + VALUES(weight);</code>

The SELECT portion is:
<code>
SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (last_name)
WHERE t1.id < t2.id
AND t1.last_name IS NOT NULL
AND (
t1.id IN (732622,732623,732626)
OR t2.id IN (732622,732623,732626)
)
GROUP BY id1, id2
</code>

Here's the EXPLAIN
<code>
-----------------------------------------------------------------------------------------------------------------------------------------------------+

id

select_type

table

type

possible_keys

key

key_len

ref

rows

Extra

-----------------------------------------------------------------------------------------------------------------------------------------------------+

1	SIMPLE	t1	ALL	index_last_name	NULL	NULL	NULL	707049	Using where; Using temporary; Using filesort
1	SIMPLE	t2	ref	PRIMARY,index_last_name	index_last_name	33	democrats_civicrm.t1.last_name	2	Using where

-----------------------------------------------------------------------------------------------------------------------------------------------------+
</code>

Ouch, that's 1.4M records scanned to dedupe 3 contacts.

If we remove the "OR t2.id IN (732622,732623,732626)" then here is the EXPLAIN:

<code>
----------------------------------------------------------------------------------------------------------------------------------------------------+

id

select_type

table

type

possible_keys

key

key_len

ref

rows

Extra

----------------------------------------------------------------------------------------------------------------------------------------------------+

1	SIMPLE	t1	range	PRIMARY,index_last_name	PRIMARY	4	NULL	3	Using where; Using temporary; Using filesort
1	SIMPLE	t2	ref	PRIMARY,index_last_name	index_last_name	33	democrats_civicrm.t1.last_name	2	Using where

----------------------------------------------------------------------------------------------------------------------------------------------------+
</code>

Yay, only 6 records scanned!

To get the other half of the results back we can setup a UNION query:
<code>
SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (last_name)
WHERE t1.id < t2.id
AND t1.last_name IS NOT NULL
AND (
t1.id IN (732622,732623,732626)
)
GROUP BY id1, id2

UNION ALL SELECT t1.id id1, t2.id id2, 5 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
USING (last_name)
WHERE t1.id < t2.id
AND t1.last_name IS NOT NULL
AND (
t2.id IN (732622,732623,732626)
)
GROUP BY id1, id2
</code>

The EXPLAIN
<code>
----------------------------------------------------------------------------------------------------------------------------------------------------------+

id

select_type

table

type

possible_keys

key

key_len

ref

rows

Extra

----------------------------------------------------------------------------------------------------------------------------------------------------------+

1	PRIMARY	t1	range	PRIMARY,index_last_name	PRIMARY	4	NULL	3	Using where; Using temporary; Using filesort
1	PRIMARY	t2	ref	PRIMARY,index_last_name	index_last_name	33	democrats_civicrm.t1.last_name	2	Using where
2	UNION	t2	range	PRIMARY,index_last_name	PRIMARY	4	NULL	3	Using where; Using temporary; Using filesort
2	UNION	t1	ref	PRIMARY,index_last_name	index_last_name	33	democrats_civicrm.t2.last_name	2	Using where
NULL	UNION RESULT	<union1,2>	ALL	NULL	NULL	NULL	NULL	NULL

----------------------------------------------------------------------------------------------------------------------------------------------------------+
</code>

Awesome, we went from 1.4M records scanned, to 12. Query runs in .2 seconds as opposed to 12mins!

I took a cursory browse through the code and it should be a fairly simple modification. I'll ask my client if they want to sponsor the work.

Dedupe should use UNION queries rather than WHERE (foo OR bar) for a huge performance gain.

Details

Description

Attachments

Attachments

Activity

People

Dates