  CiviCRM / CRM-21332

Enhance api retrieval of db duplicates

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.7.27
    • Fix Version/s: None
    • Component/s: CiviCRM API, Dedupe
    • Labels:
      None
    • Versioning Impact:
      Patch (backwards-compatible bug fixes)
    • Documentation Required?:
      None
    • Funding Source:
      Needs Funding
    • Verified?:
      No

      Description

      For the purposes of this issue I'll talk about the batch merge API, but as we clean up the dedupe UI layer and decouple the processing-layer code from the UI, it will affect both.

       

      Within deduping there are 2 types of 'limit':

      1) a limit as to how many contacts to find duplicates for

      2) a limit as to how many rows to return / process.

       

      So if we want proper control over batch merges, we want to be able to say 'load the results of the search for duplicates for the first 3000 contacts in the database into the prevnext table' and 'return me the second chunk of 200 contacts from that table'. Note that searching for duplicates for 3000 contacts could return 0 rows or 30,000 rows (actually, mathematically you could have almost 9 million rows) depending on the data quality, so it's unknown how many rows will end up in there.
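
One reading of the 'almost 9 million' figure is the number of ordered pairs among 3000 contacts. The check below is a small Python sketch of that arithmetic. It assumes, which the ticket does not state, that matches are only counted between contacts within the same set of 3000; if each pair were stored once rather than in both directions, the bound would be half as large. The function name `max_pair_rows` is made up for illustration.

```python
# Upper bound on duplicate-pair rows for a set of n contacts.
# Assumption: matches are only counted between contacts inside the same set.
def max_pair_rows(n, ordered=True):
    pairs = n * (n - 1)          # every contact paired with every other contact
    return pairs if ordered else pairs // 2

print(max_pair_rows(3000))                 # 8997000
print(max_pair_rows(3000, ordered=False))  # 4498500
```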

       

      This suggests there are 2 API actions: load (or create) and get. The former does the expensive calculation and the latter pages through the stored results.
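
The split between the two actions can be modelled with a toy in-memory example. This is a sketch in Python rather than CiviCRM's PHP, purely to illustrate the semantics: the `_cache` dict stands in for the prevnext table, the "same email" rule stands in for a real dedupe rule, and the names `load` and `get` and their parameters are hypothetical. It also assumes contacts are ordered by id.

```python
# Hypothetical stand-in for the prevnext table: cache key -> stored duplicate pairs.
_cache = {}

def load(contacts, contact_limit, cache_key):
    """Expensive step: find duplicates for the first `contact_limit` contacts
    (compared against all contacts) and store every pair found."""
    rows = []
    for a in contacts[:contact_limit]:
        for b in contacts:
            # Toy rule: same email, case-insensitive; a < b keeps each pair once.
            if a["id"] < b["id"] and a["email"].lower() == b["email"].lower():
                rows.append((a["id"], b["id"]))
    _cache[cache_key] = rows
    return len(rows)  # only known once the search has run

def get(cache_key, offset=0, limit=200):
    """Cheap step: return one chunk of the stored rows without recomputing."""
    return _cache[cache_key][offset:offset + limit]

contacts = [
    {"id": 1, "email": "a@example.org"},
    {"id": 2, "email": "A@example.org"},
    {"id": 3, "email": "b@example.org"},
    {"id": 4, "email": "a@example.org"},
    {"id": 5, "email": "b@example.org"},
]
total = load(contacts, contact_limit=3, cache_key="demo")
first_chunk = get("demo", offset=0, limit=2)
second_chunk = get("demo", offset=2, limit=2)
```

Here the first limit (`contact_limit`) bounds how many contacts are searched, while the second (`limit` on `get`) bounds how many stored rows come back per call.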

       

      A while back I spoke with @totten about this, and he suggested that instead of adding a load action (possibly as a generic action) we add load as a generic API option (e.g. on create as well as get).

      I think this might look something like:

       

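      ```
      Contact.get(array(
        'options' => array(
          // Proposed option: where and how to store the full result set.
          'load' => array(
            'type' => 'temporary_table',
            'storage_name' => 'civicrm_temp_blah',
            'column_mapping' => array('display_name' => 'Display Name'),
          ),
          // Applies to the rows returned, not the rows stored.
          'limit' => 25,
        ),
      ));
      ```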

      In the above call the limit would apply to what is returned, not what is stored (making the rest available to page through).

       

      It's fairly easy to see how a CSV would fit there.

       

      For duplicates perhaps it would look like:

       

      ```
      Contact.getduplicates(array(
        'options' => array(
          'load' => array(
            'type' => 'core_cache',
          ),
        ),
      ));
      ```

       

      The above would store 'the special core way' (prevnext_cache table), but the return would have some extra fields:

      'retrieval_parameters' => array('cachekey' => 'xyz'),
      'total_rows' => 589,

      where retrieval_parameters is whatever needs to be passed back in to get the next set. In this case it seems to make sense to pass them back to the same API. I'm not sure what that would look like if we got really generic on this.

       

            People

            • Assignee: Eileen McNaughton (eileen)
            • Reporter: Eileen McNaughton (eileen)