CiviCRM

Indices on civicrm_email could be better

Details

  • Type: Bug Bug
  • Status: Open Open
  • Priority: Minor Minor
  • Resolution: Unresolved
  • Affects Version/s: 2.2.9
  • Fix Version/s: 4.3
  • Component/s: Core CiviCRM
  • Labels:
    None

Description

Not sure if this is still the case in current versions, but the indexing of civicrm_email is fairly strange:

1) There's a single field index on is_primary. Not very useful, because there will never be a query for all primary email addresses. This index will be pseudo-combined with other indexes, but this is not ideal.

2) same goes for location_type

3) The index on contact_id is a single field index. Should be multi-field because most queries want to restrict searches on other fields.

3) There is an auto-incrementing PK on this table. Not sure why? Is this table on the 'one' side of any one-to-many relationships?

I'd recommend dropping the indexes for is_primary, contact_id and location_type, and adding two new ones:
contact_id,is_primary
contact_id, location_type, is_primary
These should work for most searches. We could add additional indices on any of the remaining fields if necessary which InnoDB will pseudo-combine when necessary. But I think queries that restrict on the remaining fields are likely only to happen during CiviMail batch operations so I'm not sure the higher cost of inserting/updating records is worth it.

We could then drop the auto-incrementing field and make either of the new indices the PK. We should choose whichever one is used more, probably contact_id, is_primary. As some background as to why you want to do this, InnoDB stores all data in a tree where the trunk is the PK. Thus queries are fastest when using the PK. Most queries to this table have WHERE clauses on either contact_id with is_primary or location_type; or with email. Thus you want to make these queries as fast as possible.

Activity

Hide
dave hansen-lange added a comment -
Of course this would probably have to be benchmarked.
Show
dave hansen-lange added a comment - Of course this would probably have to be benchmarked.
Hide
dave hansen-lange added a comment -
At first I thought that having an index for email, is_primary would be a good idea too. But it would be silly to do a query WHERE email="foo" and is_primary=1
Show
dave hansen-lange added a comment - At first I thought that having an index for email, is_primary would be a good idea too. But it would be silly to do a query WHERE email="foo" and is_primary=1
Hide
dave hansen-lange added a comment -
I've done a bit more testing with this table and I'm refining some of my original thoughts:

1) There's a single field index on is_primary. Not very useful, because there will never be a query for all primary email addresses. This index will be pseudo-combined with other indexes, but this is not ideal.

This is still the case. This index should disappear. However this should happen on all CiviCRM tables, so I'll open a new issue.

2) same goes for location_type

This is still mostly the case. It is a foreign key, so it should be indexed. However, it's only going to be useful for "where address type is foo". Which is going to be very rare. More frequently we'll see "where contact_id = X and location_type = Y" (I think the query builder resolves location types to IDs rather than doing a JOIN to the civicrm_location_types) in which case a multi-field index on contact_id,location_type is going to be much more useful. However most use cases will still benefit more from fewer overall indexes (meaning faster inserts). This is definitely the case for the site that I've been working on - inserts to civicrm_email were showing up in the slow query log.

3) The index on contact_id is a single field index. Should be multi-field because most queries want to restrict searches on other fields.

This is still the case. just an index on contact_id is not as useful as it could be. contact_id, is_primary would be better. As mentioned in #2 contact_id, location_type, is_primary should be considered, but in many cases will be more harm than good.

4) There is an auto-incrementing PK on this table. Not sure why? Is this table on the 'one' side of any one-to-many relationships?

I'm reneging on this one. We want our PKs to be as short as possible. The only alternative for a PK would be a multi-field index on contact_id, email which is going to be long and not useful for any queries.

5) This one is new. Indexes on varchar fields should be limited in length. For example the email field. emails could be anywhere from ~10 characters to 64 characters and beyond (I'm not quite sure what the spec says the absolute limit is). So if you have one email in the table that is 64 characters then that means the index is ~ 192 bytes wide (UTF characters X 3 = # of bytes (approx)). This means that searching by email address is going to be a lot slower than it could be. The better way to go is to have your indexes have a cardinality approximately 95% of the number of rows in the table.

SELECT COUNT(*) FROM civicrm_email WHERE LENGTH(email) > 26

gives you approx. 5% of the rows in the table. So I think we should limit the length of the index on the email field to 26 characters.

ALTER TABLE civicrm_email
DROP INDEX UI_email,
ADD INDEX UI_email (email (26))ADD INDEX UI_email (email (26))

This applies to several other tables as well so I'll open a new issue.
Show
dave hansen-lange added a comment - I've done a bit more testing with this table and I'm refining some of my original thoughts: 1) There's a single field index on is_primary. Not very useful, because there will never be a query for all primary email addresses. This index will be pseudo-combined with other indexes, but this is not ideal. This is still the case. This index should disappear. However this should happen on all CiviCRM tables, so I'll open a new issue. 2) same goes for location_type This is still mostly the case. It is a foreign key, so it should be indexed. However, it's only going to be useful for "where address type is foo". Which is going to be very rare. More frequently we'll see "where contact_id = X and location_type = Y" (I think the query builder resolves location types to IDs rather than doing a JOIN to the civicrm_location_types) in which case a multi-field index on contact_id,location_type is going to be much more useful. However most use cases will still benefit more from fewer overall indexes (meaning faster inserts). This is definitely the case for the site that I've been working on - inserts to civicrm_email were showing up in the slow query log. 3) The index on contact_id is a single field index. Should be multi-field because most queries want to restrict searches on other fields. This is still the case. just an index on contact_id is not as useful as it could be. contact_id, is_primary would be better. As mentioned in #2 contact_id, location_type, is_primary should be considered, but in many cases will be more harm than good. 4) There is an auto-incrementing PK on this table. Not sure why? Is this table on the 'one' side of any one-to-many relationships? I'm reneging on this one. We want our PKs to be as short as possible. The only alternative for a PK would be a multi-field index on contact_id, email which is going to be long and not useful for any queries. 5) This one is new. Indexes on varchar fields should be limited in length. For example the email field. emails could be anywhere from ~10 characters to 64 characters and beyond (I'm not quite sure what the spec says the absolute limit is). So if you have one email in the table that is 64 characters then that means the index is ~ 192 bytes wide (UTF characters X 3 = # of bytes (approx)). This means that searching by email address is going to be a lot slower than it could be. The better way to go is to have your indexes have a cardinality approximately 95% of the number of rows in the table. SELECT COUNT(*) FROM civicrm_email WHERE LENGTH(email) > 26 gives you approx. 5% of the rows in the table. So I think we should limit the length of the index on the email field to 26 characters. ALTER TABLE civicrm_email DROP INDEX UI_email, ADD INDEX UI_email (email (26))ADD INDEX UI_email (email (26)) This applies to several other tables as well so I'll open a new issue.
Hide
Donald A. Lobo added a comment -

dalin:

any chance you can submit an xml patch and an upgrade script patch for this one?
Show
Donald A. Lobo added a comment - dalin: any chance you can submit an xml patch and an upgrade script patch for this one?

People

Vote (0)
Watch (1)

Dates

  • Created:
    Updated: