CRM-2467 CRM_Utils_String::IsUtf8() regular expression causes segfault on most PHP installations with some input strings

    Details

    • Type: Bug
    • Status: Done/Fixed
    • Priority: Major
    • Resolution: Fixed/Completed
    • Affects Version/s: 1.8, 1.9
    • Fix Version/s: 2.0
    • Component/s: Core CiviCRM
    • Labels:
      None

      Description

      For some input strings, the regular expression intended to detect whether a string is valid UTF-8 will cause a segfault.

      This has been reproduced on several PHP versions, including the current release (5.2.5), on OS X and Linux. The segfault does not occur on PHP/win32 5.2.5.

      To reproduce, configure CiviCRM as follows:

      • Enable country menu
      • Disable state/province menu
      • Enable Google as mapping provider

      Then save the following contact details as an address:

      • Street Address - PO Box 30000
      • Supplemental 1 - Adderly Tce
      • City - Christchurch
      • Postcode - 8147

      Google Maps returns a "multiple addresses found" XML response, and the regex in CRM_Utils_String::IsUtf8() crashes on it. This appears to be a PHP bug, but there are alternative ways to detect a valid UTF-8 string that do not trigger it, so I think we should use one of them until PHP is fixed.
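
      As an illustration of the kind of alternative meant here, the following is a minimal sketch and not the fix that was actually committed. The function name is_utf8_sketch is hypothetical. It assumes the mbstring extension may or may not be present, and falls back to PCRE's own UTF-8 validation of the subject string (the /u modifier with an empty pattern), which avoids running a large hand-written byte-range pattern over the input.

```php
<?php
// Hypothetical helper for illustration only; not CiviCRM's actual code.
function is_utf8_sketch($str)
{
    // Preferred path: let mbstring validate the byte sequence directly.
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($str, 'UTF-8');
    }

    // Fallback: with the /u modifier PCRE validates the subject as UTF-8
    // before matching. The empty pattern matches any valid subject, so
    // preg_match() returns 1 for valid UTF-8 and false for invalid input.
    return preg_match('//u', $str) === 1;
}

// Usage: a short ASCII string, a long string, and an invalid byte sequence.
var_dump(is_utf8_sketch('Adderly Tce'));
var_dump(is_utf8_sketch(str_repeat('a', 5000)));
var_dump(is_utf8_sketch("\xC3\x28"));
```

      Neither path depends on backtracking through a long alternation, which is the reason for preferring them in this sketch. Whether they behave well on the PHP versions listed in this report would need to be tested.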

      The practical result is that for some addresses, it is impossible for a user to change the address without disabling the mapping provider. They will see a blank screen on every attempt to save.

      More notes and test results, including a sample file for comparing several methods of checking UTF-8 validity, are at http://forum.civicrm.org/index.php/topic,1112.0.html

      Tested on 1.9.11960 and 1.8.stable.11165.

      Regex segfault verified on Linux PHP 4.4.0, 4.4.3, 4.4.7, 5.2.0, 5.2.3, 5.2.5 and OSX PHP 5.2.3 and 5.2.4.

      Regex does not segfault on Win32 with PHP 5.2.5.

          Activity

          Donald A. Lobo added a comment -

          Check this link out: http://www.codesimple.net/2006/08/google-maps-utf-8-problem_9034.html. We might be able to use this to get utf8 from google.

          Donald A. Lobo added a comment -

          We send the oe (output encoding) parameter to Google to ensure that we get UTF-8 back.

          Chris Burgess added a comment -

          Further testing reveals that the segfault only happens if the string passed to CRM_Utils_String::IsUtf8() exceeds a certain length.

          On my main test platform the threshold seemed to be between 4000 and 5000 characters.

          Manish Zope added a comment -

          verify for 2.0

          Kiran Jagtap added a comment -

          Verified and confirmed for 2.0 (r12922)

          Chris Burgess added a comment -

          http://bushi.net.nz/tmp/civicrm-utf8test.tgz contains some samples you can use to test for the segfault. I used it to test several different platforms while debugging this.

            People

            • Assignee:
              Junia Biswas
              Reporter:
              Chris Burgess
