CRM-20155 The dedupe code passes has serious maintainability issues, locked in by hooks

    Details

    • Type: Epic
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 4.7.16
    • Fix Version/s: None
    • Component/s: Dedupe
    • Epic Name:
      Resolve dedupe code issues
    • Versioning Impact:
      Patch (backwards-compatible bug fixes)
    • Documentation Required?:
      Developer Doc
    • Funding Source:
      Needs Funding
    • Verified?:
      No

      Description

      The dedupe code loads a quickform oriented array of data & uses that for post processing. The result is the code is convoluted and data is hard to alter by hook. Conversely it is the hooks that have locked in the difficult code. My expectation at this stage is we need to break-hook-contract to get past this - which I believe means putting the code in an extension & incrementing the version.

      I think the path will look something like

      • extract to org.civicrm.dedupe (new civicrm github repo) & resolve ensuring it is enabled on install / upgrade
      • ensure unit tests are running - will they run on the extension PRs too? if not the extension could hurt core tests ...
      • consider if there any issues on managing DAO / schema when 'something changes'??
      • increment the version of the extension, resolving the hook issues
      • figure out if we can have unit tests run against both versions!!
      • figure out how to have new version installed in install but an optional upgrade

      I had a quick go and moving to an extension is fairly seamless. There is a question as to whether we
      a) move to an extension and enable for all and THEN increment the version & make changes in the version, leaving people to update the version when they are ready (& put the new one on for new installs)
      b) add an extension, leave existing sites on the core version and enable the extension by default for new sites.

      The latter approach seems easier from the point of view of making the tests work - ie. if we ship a version with core we only need to ship one version and the tests would run on that, and separately on the core code without the tests enabled. If also seems to offer a slightly softer transition.

      One challenge is that once we move the code out to an extension we want to lock the classes in core that are now deprecated to the extension. Perhaps we could do that using a github hook? From that point of view the extension versioning approach seems cleaner. It also saves us from having the question later on about how to decide when we can remove the deprecated code.

      Notes on issues with current hook pattern....

      The code currently hits hooks as follows:

      1) CRM_Utils_Hook::merge('flip', $dupes, $dupes['dstID'], $dupes['srcID']);

      $dupes is the only alterable variable here
      Array
      (
      [dstID] => 4
      [dstName] => Mr. Anthony Anderson II
      [srcID] => 5
      [srcName] => Mr. Anthony Anderson II
      [weight] => 10
      [canMerge] => 1
      )

      The intent of the hook appears to be to allow the reversal of the 2 contacts. Not sure if it works.

      2) CRM_Utils_Hook::merge('batch', $migrationData, $mainId, $otherId);

      described as
      // A hook to implement other algorithms for choosing which contact to bias to when
      // there's a conflict (to handle "gotchas"). fields_in_conflict could be modified here
      // merge happens with new values filled in here. For a particular field / row not to be merged
      // field should be unset from fields_in_conflict.

      $migrationData can be altered - to resolve a conflict and choose a field it is necessary to alter the following 3 places (is that it?)

      unset($migrationData[''fields_in_conflict'][$fieldname]
      $migrationData['migration_info']['move_$fieldname]
      $migrationData['migration_info'][$rows]['move_$fieldname]

      CRM_Utils_Hook::alterLocationMergeData($blocksDAO, $mainId, $otherId, $migrationInfo);

      $ blocksDAO passes out all the $dao actions to be done with a chance to change them. This was a recent addition reflecting the other hooks not meeting this need & no other clear place to add it

      Other hooks in the flow:

      CRM_Utils_Hook::merge('cidRefs', $contactReferences);
      CRM_Utils_Hook::merge('eidRefs', $eidRefs);
      In this case the goal is to get a list of tables that refer to the contact table. (the original list is derived from metadata & includes tables in the extension with correct metadata - e.g CiviDiscount does not implement this hook but does get merges happening

      Other completely different versions of the hook:
      CRM_Utils_Hook::merge('relTables', $relTables);
      CRM_Utils_Hook::merge('locTables', $locTables);
      CRM_Utils_Hook::merge('form', $migrationData, $this->_cid, $this->_oid);
      CRM_Utils_Hook::merge('sqls', $sqls, $tagAId, $tagBId, $tables);

      There is also a hook
      CRM_Utils_Hook::post_case_merge($mainContactId, $mainCaseId, $otherContactId, $otherCaseId, $changeClient);
      but that one passes clear IDs

        Attachments

          Activity

          [CRM-20155] The dedupe code passes has serious maintainability issues, locked in by hooks
          Nicolas Ganivet added a comment -

          Very found of moving pieces of core into extensions - in fact was discussing this with Tim/Coleman just last week and it seems like almost all pieces allowing such changes are already in core.

          Food for thoughts:

          • are you going to include the Find and merge duplicates UI in the extension? Or just the physical merge operation of contact records?
          • are you going to include screens related to managing duplicate rules in the extension?
          • will this move to extension facilitate having custom duplicate rules? I find the existing framework very limiting, specifically when we have to deal with cultures that include the names of the ancestors as part of the first name - ie, middle-eastern cultures, but also Mr. Bill Smith Jr. in the US!
          John K. added a comment -

          Hi Nicolas. It would be great to have a more powerful 'custom dedupe rule' builder in the UI. But I think this would have to be a phase 2 thing. It's going to be a significant piece of work to move this into an extension and ensure that it all behaves the way we expect!

          (Incidentally, it is possible to write custom dedupe rules as SQL which can be selected in the interface. This also allows for 'anti-rules' - like: if the suffixes (Jr) are different then do not consider the pair a duplicate. We do this for a custom external identifier field we have).

          Eileen, I'd be happy to help with this as and where I can.

          Nicolas Ganivet added a comment -

          Thanks John for the tip on dedupe rules! CiviCRM never fails to amaze me ...

          The purpose of my comment certainly was not to scope creep, but just to point this out to Eileen as she is working on the specs since this might influence the design of her extension.

          Cheers.

          Eileen McNaughton added a comment -

          Regarding the extension portion. My first focus will be simply to update it to have a much more limited and logical hook interface. I am expecting to pull across the merge screens.

          WRT scope creep - I think changes and new features will be added in the same way as per core - this should be thought of as 'like core' for most purposes. The specific blocker I have now is that there are too many hooks being called to fix up code issues at the moment, which blocks other things.

          My specific follow on will be to deal with address conflicts better. Currently there are a lot of false conflicts - e.g an address that has only 'United States' is treated as being a conflict with an address that says 'California, United States', its also really hard to figure out how to allow an extension to resolve '50 West Street' vs '50 West St' in the current structure.

          I would note re rules that probably won't be a focus for me. After some work I decided that it is better to do a very simple match (e.g. email) and then evaluate for conflicts, rather than to build complex rules.

          Guy Iaccarino added a comment -

          "its also really hard to figure out how to allow an extension to resolve '50 West Street' vs '50 West St' in the current structure." Usually when I have to include the address field in the dedupe rule I make Length equal between 6 and 8. It's not bulletproof, but it catches a lot more than if you couldn't specify Length, so I would hope we don't lose the ability to specify Length in the extension. If at some point the extension was smart enough to equate things like "Street" and "St." in the last characters of the address field, that would be great. Thanks Eileen for taking this on.

          Eileen McNaughton added a comment -

          Guy - I'm not looking to eliminate length from rules - although I do strongly advise people not to use that functionality for medium and large datasets as it creates a slow query with unindexed joins.

          Guy Iaccarino added a comment -

          Good to know, thanks! Most of our clients are on the small to small-medium size in terms of database size, so that hasn't been an issue.

          Jaap Jansma added a comment -

          Sounds good Eileen McNaughton!
          Regarding the unit tests. I would assume that core has unit tests which makes sure the hook functionality works as expected and that the extension org.civicrm.dedupe has unit tests to test whether the functionality of deduping in that extension work as expected.

          Eileen McNaughton added a comment -

          Yep - there are unit tests on core for some of the hooks & those tests need to still run. The extension would still have hooks in it - so there would be some hooks tests there too - but altered to reflect more logical hooks

          KarinG added a comment -

          I just want to say thank you - thank you - thank you - for taking the lead on a first move out of core and into ship-with-core extension project.

          Eileen McNaughton added a comment -

          After discussion with Tim we agreed that we would follow the idea of one branch of any extension running in CI. This means step one will be 'extract to an extension that ships with core, enables on install & upgrade & has unit tests' These tests will be added to CI. Step 2 will be to fork & and increment the master version to 2 with tests being added to the nightly matrix. Once 2.0 is stable / preferred it will go in the CI builds & v1 will only be overnight. v2 will enable on install, sites will be encouraged to upgrade. All fixes & improvements will go into v2

          The requirement for CI is that it will run by cloning (with a branch number ) & running phpunit4 from root - eg

          mkdir "/srv/buildkit/build/$BLDNAME/sites/all/modules/civicrm/ext"
          pushd "/srv/buildkit/build/$BLDNAME/sites/all/modules/civicrm/ext"
          git clone https://github.com/eileenmcnaughton/nz.co.fuzion.extendedreport.git
          pushd nz.co.fuzion.extendedreport
          civibuild restore "$BLDNAME"
          phpunit4 --tap --log-junit="$WORKSPACE/junit/ExtendedReport.xml"
          EXITCODE=$(($? || $EXITCODE))
          popd
          popd

          The concern re running 2 versions of the same extension during ci is:

          "when you aggregate all the different XML reports for display in Jenkins UI, you'd have different XML files giving different results for the same CRM_Foo_BarTest::testFoo.
          it's easy to write bash code which puts the results for each branch in different XML files.
          the doubt is what happens when jenkins aggregates those XML files"

            People

            • Assignee:
              Eileen McNaughton
              Reporter:
              Eileen McNaughton

              Dates

              • Created:
                Updated: