Details
-
Type: Improvement
-
Status: Done/Fixed
-
Priority: Minor
-
Resolution: Fixed/Completed
-
Affects Version/s: 4.6.11
-
Fix Version/s: 4.7.2
-
Component/s: CiviReport
-
Labels:
-
Documentation Required?:None
-
Funding Source:Contributed Code
Description
I am submitting a bunch of improvements to the Lybunt report - the most significant of which is improving speed on a large dataset (discussion below). However there are some others
WARNING
The changes here will affect pre-existing reports to the point where some filters & fields will be de-selected and it will be necessary to edit and resave them. I think this can go in the upgrade message / release notes. I don't think we need to handle it.
Changes
1) All reports that use the grand total - move the totalled fields to the right hand side
2) All reports that use the basic contact fields (via the function) - add communication preferences :communication preferences (do not email, do not phone, do not trade, do not sms do not mail & is deceased)
Lybunt specificy
3) Custom fields re-introduced to the lybunt work and made to work (including on order by)
4) Order by amount contributed last year as an option (default)
5) First report data oriented unit tests added & lybunt & sybunt enabled for generic tests
6) remove non-functional address group by section
7) make order by work
8) fix miscalculation on number of rows (present in master but not 4.6)
9) Alter the column name for last year total (e.g from 2010 to Total 2010 or Total 2010-2011 for financial year)
10) Add a total for last year to the bottom of the report
11) enable developer tab for this report
Tests
1) Lybunt, sybunt, walklist reports enabled for the generic test
2) Specific lybunt reports enabled for 2-3 data scenarios that should be used in it (this is the first data-specific unit test added for a report so some tweaks had to be done)
Proposed
1) Remove barchart & pie chart options - they really don't seem to be meaningful to me - they show the last year total & lifetime total of the subset of donors - which seems really misleading to represent in either of those formats.
Performance
I did this work on a very large dataset that was being filtered by a smart group. Overall improvement was 4.5 minutes to around 30 seconds with 10 seconds of the remaining time being smart group rendering (separate issue).
The discussion below is my workings process. However, my final changes were
1) create temp table of contact ids in the relevant smart group first (if a smart group filter is used)
2) create temp table of contact ids that meet the lybunt filters and are in the above temp table.
3) replace subquery with 2 separate joins on the contribution table.
4) use index hint to use received_date index where that was the primary source of prefiltering (ie. no smart group in play). This made a big difference on retrieving up to around half a million rows.
Discussion of query:
The above tests enabled me to reduce the queries from 4 minutes to a few seconds on a large data set. On a small dataset (7000 rows returned) the improvement was only from 1.5 seconds to .7 seconds but each step improved the speed in a similar way
Currently the query run to get the contact IDs against the large dataset (4 min 59.32 sec)
SELECT SQL_CALC_FOUND_ROWS contact_civireport.id as cid
FROM civicrm_contribution contribution_civireport
INNER JOIN civicrm_contact contact_civireport
ON contact_civireport.id = contribution_civireport.contact_id
WHERE contribution_civireport.is_test = 0 AND contribution_civireport.contact_id NOT IN
(SELECT distinct contri.contact_id
FROM civicrm_contribution contri
WHERE contri.is_test = 0 AND YEAR( contri.receive_date - INTERVAL 6 MONTH ) = 2011) AND contribution_civireport.contact_id IN (SELECT distinct contri.contact_id
FROM civicrm_contribution contri
WHERE YEAR( contri.receive_date - INTERVAL 6 MONTH ) = 2010 AND contri.is_test = 0) AND YEAR( contribution_civireport.receive_date - INTERVAL 6 MONTH ) = 2010 AND ( contribution_civireport.contribution_status_id IN (1) )
GROUP BY contact_civireport.id;
The most significant speed improvement came by removing the subquery and replacing it with 2 joins against the contribution table. This change meant we have 2 indexed joins rather than an unindexed join:
Amended query
(32.53 sec)
SELECT DISTINCT contact_civireport.id as cid
FROM civicrm_contact contact_civireport
INNER JOIN civicrm_contribution contribution_civireport
ON contact_civireport.id = contribution_civireport.contact_id AND is_test = 0
LEFT JOIN civicrm_contribution cont_not ON contact_civireport.id = cont_not.contact_id AND YEAR( cont_not.receive_date - INTERVAL 6 MONTH ) = 2011 AND cont_not.is_test = 0
WHERE cont_not.id IS NULL
AND YEAR( contribution_civireport.receive_date - INTERVAL 6 MONTH ) = 2010
AND ( contribution_civireport.contribution_status_id IN (1) );
Previous digging had revealed that 'Between' was quicker than 'DATE' so I tried that
Amended query with BETWEEN
(26.34 sec)
SELECT DISTINCT contact_civireport.id as cid
FROM civicrm_contact contact_civireport
INNER JOIN civicrm_contribution contribution_civireport
ON contact_civireport.id = contribution_civireport.contact_id AND is_test = 0
LEFT JOIN civicrm_contribution cont_not ON contact_civireport.id = cont_not.contact_id AND cont_not.receive_date BETWEEN '2011-07-01' AND '2012-06-30 23:59:59' AND cont_not.is_test = 0
WHERE cont_not.id IS NULL
AND contribution_civireport.receive_date BETWEEN '2010-07-01' AND '2011-06-30 23:59:59'
AND ( contribution_civireport.contribution_status_id IN (1) );
Moving criteria between the where clause & the join clause had no significant effect (with the margin of error)
(26.22 sec)
SELECT DISTINCT contact_civireport.id as cid
FROM civicrm_contact contact_civireport
INNER JOIN civicrm_contribution contribution_civireport
ON contact_civireport.id = contribution_civireport.contact_id AND is_test = 0
AND ( contribution_civireport.contribution_status_id IN (1) ) AND contribution_civireport.receive_date BETWEEN '2010-07-01' AND '2011-06-30 23:59:59'
LEFT JOIN civicrm_contribution cont_not ON contact_civireport.id = cont_not.contact_id AND cont_not.receive_date BETWEEN '2011-07-01' AND '2012-06-30 23:59:59' AND cont_not.is_test = 0
WHERE cont_not.id IS NULL ;
However, I found that it was choosing a suboptimal index to do a WHERE on the first contribution table. It was using contribution_status_id rather than receive date. This was presumably because it is being passed a constant for contribution status ID and it is a shorter index. However, using an index hint forced it to use the index on received_date with a noticeable improvement:
Use index hint to choose receive_date index
(6.60 sec)
SELECT DISTINCT contact_civireport.id as cid
FROM civicrm_contact contact_civireport
INNER JOIN civicrm_contribution contribution_civireport USE index (received_date)
ON contact_civireport.id = contribution_civireport.contact_id AND is_test = 0 AND
contribution_civireport.receive_date BETWEEN '2010-07-01' AND '2011-06-30 23:59:59' AND contribution_civireport.contribution_status_id = 1
LEFT JOIN civicrm_contribution cont_not ON contact_civireport.id = cont_not.contact_id AND cont_not.receive_date BETWEEN '2011-07-01' AND '2012-06-30 23:59:59' AND cont_not.is_test = 0
WHERE cont_not.id IS NULL
I tried playing around with what indexes exist on the smaller database but couldn't find a way to improve it using combined indexes. Changing the field type for receive_date from datetime to timestamp slowed the query down (a little) on the small database. I wasn't able to test that on the larger database but I thought it was an interesting result. I tried a combined key on contact_id and receive_date with no improvement and adding contribution_status_id to the index with the receive_date also did not show an improvement. It would have been interesting to try playing with indexes on the larger DB but that is a very slow operation.
Attachments
Issue Links
- is supplemented by
-
CRM-17680 Lybunt report presents non-functional options (custom fields)
- Done/Fixed
- links to