Bugzilla – Bug 1803
segfault in pcre jit when running twig test suite (PHP7)
Last modified: 2016-03-09 18:31:43 GMT
I am working on updating Ubuntu 16.04 to PHP7.0 and we are seeing PCRE related test-suite failures with twig. Specifically, in a 16.04 VM/chroot/etc, with PHP7, the testsuite is segfaulting with: #0 __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:273 #1 0x00005555556798d8 in memcpy (__len=18446744073709551614, __src=0x7fffed43e1fc, __dest=0x7fffed49e390) at /usr/include/x86_64-linux-gnu/bits/string3.h:53 #2 zend_string_init (persistent=0, len=18446744073709551614, str=0x7fffed43e1fc "\303\237\343\201\224a") at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_string.h:159 #3 php_pcre_split_impl (pce=pce@entry=0x555555d4aea0, subject=0x7fffed43e1f8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=return_value@entry=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-Y7XHJx/php7.0-7.0.3/ext/pcre/php_pcre.c:1808 #4 0x000055555567a1eb in zif_preg_split (execute_data=<optimized out>, return_value=0x7ffff381b240) at /build/php7.0-Y7XHJx/php7.0-7.0.3/ext/pcre/php_pcre.c:1721 #5 0x000055555579b58a in dtrace_execute_internal ( execute_data=<optimized out>, return_value=<optimized out>) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:107 #6 0x000055555582f5f0 in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:844 #7 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff381b070) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #8 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff381b070) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #9 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #10 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3819ff0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #11 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3819ff0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #12 
0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #13 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3819e80) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #14 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3819e80) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #15 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #16 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3819db0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #17 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3819db0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #18 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #19 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3819ca0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #20 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3819ca0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #21 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #22 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff38192e0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #23 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff38192e0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #24 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #25 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3819210) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #26 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3819210) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #27 0x000055555579d03c in zend_call_function (fci=fci@entry=0x7fffffff9ae0, 
fci_cache=fci_cache@entry=0x7fffffff9ab0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_execute_API.c:860 #28 0x000055555569e042 in zim_reflection_method_invokeArgs ( execute_data=<optimized out>, return_value=0x7ffff3818e60) at /build/php7.0-Y7XHJx/php7.0-7.0.3/ext/reflection/php_reflection.c:3348 #29 0x000055555579b58a in dtrace_execute_internal ( execute_data=<optimized out>, return_value=<optimized out>) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:107 #30 0x000055555582f5f0 in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:844 #31 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3818c60) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #32 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3818c60) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #33 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #34 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3818470) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #35 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3818470) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #36 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #37 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3817880) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #38 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3817880) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #39 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #40 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3816e20) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #41 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3816e20) at 
/build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #42 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #43 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3816840) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #44 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3816840) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #45 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #46 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3816260) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #47 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3816260) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #48 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #49 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3815c80) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #50 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3815c80) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #51 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #52 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3814640) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #53 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3814640) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #54 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #55 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3814220) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #56 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3814220) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #57 0x000055555582f72d in 
ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #58 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3814130) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #59 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3814130) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #60 0x000055555582f72d in ZEND_DO_FCALL_SPEC_HANDLER () at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:800 #61 0x00005555557eaedb in execute_ex (ex=ex@entry=0x7ffff3814030) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:414 #62 0x000055555579b421 in dtrace_execute_ex (execute_data=0x7ffff3814030) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_dtrace.c:83 #63 0x000055555583e2b7 in zend_execute ( op_array=op_array@entry=0x7ffff3883000, return_value=return_value@entry=0x0) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend_vm_execute.h:458 #64 0x00005555557ab6b3 in zend_execute_scripts (type=type@entry=8, retval=retval@entry=0x0, file_count=file_count@entry=3) at /build/php7.0-Y7XHJx/php7.0-7.0.3/Zend/zend.c:1427 #65 0x000055555574c0c0 in php_execute_script (primary_file=0x7fffffffcb10) at /build/php7.0-Y7XHJx/php7.0-7.0.3/main/main.c:2484 #66 0x000055555583ff84 in do_cli (argc=4, argv=0x555555bab130) at /build/php7.0-Y7XHJx/php7.0-7.0.3/sapi/cli/php_cli.c:974 #67 0x00005555556364e4 in main (argc=4, argv=0x555555bab130) at /build/php7.0-Y7XHJx/php7.0-7.0.3/sapi/cli/php_cli.c:1345 While this fault is not directly in the PCRE code, it was noticed that passing pcre.jit=0 (a PHP ini value), resulted in no fault. You can see in the trace above the len value is bogus. pcre.jit=0 (upon code inspection) simply causes php to not call pcre_study() from the PHP7 code. I set up an environment with the same runtime and built pcre from svn. Modifying LD_LIBRARY_PATH to load the svn version (r1640) of pcre did not fix the issue. 
The failing twig test case is split_utf8.test:
--TEST--
"split" filter
--CONDITION--
function_exists('mb_get_info')
--TEMPLATE--
{{ "é"|split('', 10)|join('-') }}
{{ foo|split(',')|join('-') }}
{{ foo|split(',', 1)|join('-') }}
{{ foo|split(',', 2)|join('-') }}
{{ foo|split(',', 3)|join('-') }}
{{ baz|split('')|join('-') }}
{{ baz|split('', 1)|join('-') }}
{{ baz|split('', 2)|join('-') }}
--DATA--
return array('foo' => 'Ä,é,Äほ', 'baz' => 'éÄßごa',)
--EXPECT--
é
Ä-é-Äほ
Ä,é,Äほ
Ä-é,Äほ
Ä-é-Äほ
é-Ä-ß-ご-a
é-Ä-ß-ご-a
éÄ-ßご-a
which, as I understand it, splits these PHP variables as specified (and then joins them back together). If I remove the "baz" invocations from the TEMPLATE, the test passes. If I add only the first "baz" invocation back in, a segmentation fault occurs. As far as I can tell, valgrind doesn't indicate any issues beyond those that happen once the length is invalid.
Confusingly, if I recompile pcre to not support jit at all (./configure --enable-utf --enable-unicode-properties --enable-jit=no), the segmentation fault persists. So perhaps the bug is somewhere else, rather than in the jit code itself.
I apologize if this bug report is too vague; I am happy to provide more details and test fixes, as necessary. This bug does seem similar to
From the backtrace this is strange:
zend_string_init (persistent=0, len=18446744073709551614, str=0x7fffed43e1fc "\303\237\343\201\224a")
len=18446744073709551614 seems too big (in hex it is 0x1999999999999999, which is a strange value), especially because the subject len is 10.
I think it would be good to put a breakpoint where pcre returns with the offsets and check the start and end. It would also be good to check how that big len is computed.
Likely \303\251\303\204\303\237\343\201\224a represents éÄßごa and \303\237\343\201\224a represents a valid suffix. So only the len seems wrong.
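The byte-level reading above can be checked mechanically. The following is a minimal sketch, assuming well-formed UTF-8; the helper names are made up for illustration and are not part of PHP or PCRE. It counts characters by their lead bytes and confirms that the string seen in the backtrace is a suffix of the subject that starts on a character boundary.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 when byte c starts a UTF-8 character, 0 for a continuation byte (10xxxxxx). */
int utf8_is_lead_byte(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

/* Counts the characters in a NUL-terminated, well-formed UTF-8 string. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (utf8_is_lead_byte((unsigned char)*s)) {
            n++;
        }
    }
    return n;
}

/* The subject from the backtrace, written with the same octal escapes gdb printed. */
const char subject[] = "\303\251\303\204\303\237\343\201\224a";
```

With these definitions the subject is 10 bytes long and holds 5 characters, and `subject + 4` is the suffix "\303\237\343\201\224a" shown in frame #2.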
(In reply to Zoltan Herczeg from comment #1) > From the backtrace this is strange: > > zend_string_init (persistent=0, len=18446744073709551614, > str=0x7fffed43e1fc "\303\237\343\201\224a") > > len=18446744073709551614 seems too big (in hex it is 0x1999999999999999 > which is a strange value). Especially because the subject len is 10. > > I think it would be good to put a breakpoint where the pcre returns with the > offsets and check start and end. It would be also good to check how that big > len is computed. Apologies, I had this in my bug report for PHP (https://bugs.php.net/bug.php?id=71659): (gdb) print subject $3 = 0x7fffed43e1f8 "\303\251\303\204\303\237\343\201\224a" (gdb) print offsets $4 = (int *) 0x7fffffff9150 (gdb) print offsets[0] $5 = 2 (gdb) print last_match $6 = 0x7fffed43e1fc "\303\237\343\201\224a" (gdb) print &subject[offsets[0]]-last_match $7 = -2 I'll put in a breakpoint as you suggested and see what I can figure out. I am new to pcre, so I apologize in advance if I ask dumb questions :)
(In reply to Zoltan Herczeg from comment #2) > Likely \303\251\303\204\303\237\343\201\224a represents éÄßごa and > \303\237\343\201\224a represents a valid suffix. So only the len seems wrong. Yep, verified this by manual look up in UTF tables :)
Two updates from my testing last night. 1) The twig testsuite is run using phpunit. phpunit has a parameter --process-isolation. When the tests are run with that parameter, the tests pass (even with pcre.jit left on). This, I think, points to a PHP bug, but I'm not sure. 2) When I tried to just run the failing twig tests on their own (split_utf8.test and length_utf8.test are the two input files), I could not recreate the segmentation fault. This again, it feels like, points to a fault somewhere in the php logic due to some corrupt state, perhaps? I will keep both bugs updated as I progress.
> (gdb) print offsets[0]
> $5 = 2
Please print offsets[1], which is the end.
The value 0x1999999999999999 is so strange that it feels like an uninitialized variable (I mean a stack protector filled the stack frame with this value to trigger such issues).
(In reply to Zoltan Herczeg from comment #6) > > (gdb) print offsets[0] > > $5 = 2 > > please print offsets[1], which is the end. > > The value 0x1999999999999999 is so strange that if feels like an > uninitialized variable (I mean a stack protector filled the stack frame with > this value to trigger such issues). #3 php_pcre_split_impl (pce=pce@entry=0x555555e3a9e0, subject=0x7fffe199df90 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=return_value@entry=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1808 1808 /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c: No such file or directory. (gdb) print offsets $1 = (int *) 0x7fffffff9150 (gdb) print offsets[0] $2 = 2 (gdb) print offsets[1] $3 = 4 I'm still at gdb so I can grab any other relevant information.
> (gdb) print offsets[0]
> $2 = 2
> (gdb) print offsets[1]
> $3 = 4
This is a perfectly valid offset pair representing the single-character substring \303\204. So the PCRE result seems correct. The question is how the 0x19...9 length is computed from these offset values. Please single-step until you find out how that value is computed.
(In reply to Zoltan Herczeg from comment #8) > > (gdb) print offsets[0] > > $2 = 2 > > (gdb) print offsets[1] > > $3 = 4 > > This is a perfectly valid offset pair representing the \303\204 single > character substring. So PCRE result seems correct. The question is how the > 0x19...9 length is computed from this offset values. Please do single > stepping until you find out how that value is computed. Right, the issue is the value of last_match relative to these offsets: (gdb) print last_match $6 = 0x7fffed43e1fc "\303\237\343\201\224a" (gdb) print &subject[offsets[0]]-last_match $7 = -2 And PHP is using this last value to determine in the failing line: ZVAL_STRINGL(&tmp, last_match, &subject[offsets[0]]-last_match); After saving off each matching substring, PHP updates last_match to: last_match = &subject[offsets[1]]; I *believe* I saw two calls to pcre_exec return the same offsets (which might be how we get to this state, since the last_match value will be incorrect relative to the offsets (as they should have advanced)). I will try and reproduce today and post the results.
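For reference, the reported len is exactly what a pointer difference of -2 looks like once it is converted to an unsigned length: 18446744073709551614 is 2^64 - 2, that is 0xFFFFFFFFFFFFFFFE on a 64-bit build (not 0x1999999999999999). The sketch below shows only that conversion; `piece_length` is a hypothetical helper written for this illustration, not PHP's code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from PHP: evaluates the same pointer
 * expression as the ZVAL_STRINGL call quoted above and converts the
 * result to size_t, as happens when it is passed as a length.
 */
size_t piece_length(const char *subject, int match_start, const char *last_match)
{
    /* Negative when the match starts before last_match. */
    ptrdiff_t diff = &subject[match_start] - last_match;

    /* A negative difference wraps around; -2 becomes SIZE_MAX - 1. */
    return (size_t)diff;
}
```

Called with a match starting at offset 2 and last_match pointing at offset 4, this yields SIZE_MAX - 1, which matches the len in frames #1 and #2 of the backtrace.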
I've reproduced this condition on CentOS 7.2 (pcre 8.32-15.el7 + php70-php 7.0.4) and Fedora 23 (pcre 8.38-6.fc23 + php70-php 7.0.4). I'm attaching the full text output to this issue for each system.
Created attachment 869 [details] GDB output on CentOS 7.2
Created attachment 870 [details] GDB output on Fedora 23
> (gdb) print last_match
> $6 = 0x7fffed43e1fc "\303\237\343\201\224a"
> (gdb) print &subject[offsets[0]]-last_match
> $7 = -2
That is likely incorrect. I think we will soon find this bug. If I understand correctly, there is a loop in php_pcre_split_impl which constructs a list from the non-matching parts of the string: https://github.com/php/php-src/blob/master/ext/pcre/php_pcre.c#L1730
However, there are lots of conditions in the loop, and certain variables are updated conditionally. Could you check how last_match, count, offsets[0], and offsets[1] are updated during each iteration of this loop?
(Btw that /./us pattern for stepping a character ahead must be a joke. That is the most inefficient way I could imagine.)
(gdb) break ext/pcre/php_pcre.c:1794 if strcmp(subject, "\303\251\303\204\303\237\343\201\224a") == 0
(gdb) c
...
(gdb) print offsets[0]
$5 = 2
(gdb) print last_match
$6 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a"
(gdb) print offsets[0]
$7 = 2
(gdb) print offsets[1]
$8 = 2
(gdb) c
...
(gdb) print last_match
$9 = 0x7fffed42e24a "\303\204\303\237\343\201\224a"
(gdb) print offsets[0]
$10 = -1
(gdb) print offsets[1]
$11 = -1
...
(gdb) print last_match
$12 = 0x7fffed42e24a "\303\204\303\237\343\201\224a"
(gdb) print offsets[0]
$13 = 2
(gdb) print offsets[1]
$14 = 4
(gdb) c
...
(gdb) print last_match
$15 = 0x7fffed42e24c "\303\237\343\201\224a"
(gdb) print offsets[0]
$16 = 2
(gdb) print offsets[1]
$17 = 4
(gdb) c
...
SIGSEGV
count's value was optimized out, and I'd need to recompile PHP to get that value, I think.
(In reply to Nish Aravamudan from comment #14)
> (gdb) break ext/pcre/php_pcre.c:1794 if strcmp(subject,
> "\303\251\303\204\303\237\343\201\224a") == 0
strcmp? Do you mean this line:
count = pcre_exec(pce->re, extra, subject, subject_len, start_offset, exoptions|g_notempty, offsets, size_offsets);
Actually the line 1794 is empty here, so I suspect there is an offset difference between your source code and the master: https://github.com/php/php-src/blob/master/ext/pcre/php_pcre.c#L1794
> (gdb) c
> ...
> (gdb) print offsets[0]
> $5 = 2
> (gdb) print last_match
> $6 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a"
> (gdb) print offsets[0]
> $7 = 2
> (gdb) print offsets[1]
> $8 = 2
> (gdb) c
> ...
So the first match is an empty match at offset 2.
> (gdb) print last_match
> $9 = 0x7fffed42e24a "\303\204\303\237\343\201\224a"
> (gdb) print offsets[0]
> $10 = -1
> (gdb) print offsets[1]
> $11 = -1
> ...
Is this a rerun because of:
g_notempty = (offsets[1] == offsets[0])? PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED : 0;
> (gdb) print last_match
> $12 = 0x7fffed42e24a "\303\204\303\237\343\201\224a"
> (gdb) print offsets[0]
> $13 = 2
> (gdb) print offsets[1]
> $14 = 4
> (gdb) c
> ...
It seems the second character is matched, and the offsets are updated.
> (gdb) print last_match
> $15 = 0x7fffed42e24c "\303\237\343\201\224a"
> (gdb) print offsets[0]
> $16 = 2
> (gdb) print offsets[1]
> $17 = 4
> (gdb) c
> ...
Hm, that is strange, since all offsets are relative to subject, and these offsets are before last_match.
> SIGSEGV
At this point I suspect something is wrong with start_offset, but it needs proof. The last_match seems to have been updated to offset 4 (substring "\303\237\343\201\224a"), but start_offset is below 4, and pcre returns the same 2-4 match again. A string from offsets 4-2 cannot be constructed, since the end is smaller than the start.
Could you also print start_offset and subject?
(gdb) print substring
(gdb) print last_match
(gdb) print start_offset
(gdb) print offsets[0]
(gdb) print offsets[1]
For all iterations?
I am sorry for so many debugging requests, but I am not a php developer and am just guessing here. If start_offset is 4, this is likely some PCRE bug, and I need the pattern to check it here.
(In reply to Zoltan Herczeg from comment #15)
> (In reply to Nish Aravamudan from comment #14)
> > (gdb) break ext/pcre/php_pcre.c:1794 if strcmp(subject,
> > "\303\251\303\204\303\237\343\201\224a") == 0
>
> strcmp?
As mentioned earlier, it only reproduces if I run the entire test suite (`phpunit` invocation). So I do that as the argument to php in gdb, but want to break for this particular string as the known SEGV-inducing subject.
> Do you mean this line:
>
> count = pcre_exec(pce->re, extra, subject,
> subject_len, start_offset,
> exoptions|g_notempty, offsets, size_offsets);
>
> Actually the line 1794 is empty here, so I suspect there is an offset
> difference between your source code and the master:
>
> https://github.com/php/php-src/blob/master/ext/pcre/php_pcre.c#L1794
You're right. I have put two breakpoints in, one at the above pcre_exec line and one at the count==0 check that follows; the first to get the values of subject and start_offset passed to pcre_exec, the second to get the values of offsets returned.
> > (gdb) c
> > ...
> > (gdb) print offsets[0]
> > $5 = 2
> > (gdb) print last_match
> > $6 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a"
> > (gdb) print offsets[0]
> > $7 = 2
> > (gdb) print offsets[1]
> > $8 = 2
> > (gdb) c
> > ...
>
> So the first match is an empty match at offset 2.
>
> > (gdb) print last_match
> > $9 = 0x7fffed42e24a "\303\204\303\237\343\201\224a"
> > (gdb) print offsets[0]
> > $10 = -1
> > (gdb) print offsets[1]
> > $11 = -1
> > ...
>
> Is this a rerun because of:
>
> g_notempty = (offsets[1] == offsets[0])? PCRE_NOTEMPTY_ATSTART |
> PCRE_ANCHORED : 0;
Checking...
(gdb) print g_notempty
$70 = 268435472
which is 0x10000010
#define PCRE_NOTEMPTY_ATSTART 0x10000000
#define PCRE_ANCHORED 0x00000010
So seems likely?
<snip>
> > SIGSEGV
>
> At this point I suspect something is wrong with start_offset, but it needs a
> proof.
The last_match seemed to updated to offset 4 (substring > "\303\237\343\201\224a"), but start_offset is below 4, and pcre returns a > the same 2-4 match again. A string from offsets 4-2 cannot be constructed, > since the end is smaller than the start. > > Could you also print start_offset and subject as well? > > (gdb) print substring I assume you mean subject here? > (gdb) print last_match > (gdb) print start_offset > (gdb) print offsets[0] > (gdb) print offsets[1] > > For all iterations? Here you go (excluding some typos on my part): (gdb) run The program being debugged has been started already. Start it from the beginning? (y or n) y Starting program: /usr/bin/php /usr/bin/phpunit --bootstrap lib/Twig/autoload.php [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". PHPUnit 5.1.3 by Sebastian Bergmann and contributors. ....FF....................................................... 61 / 1172 ( 5%) ............................................................. 122 / 1172 ( 10%) ............................................................. 183 / 1172 ( 15%) ............................................................. 244 / 1172 ( 20%) ............................................................. 305 / 1172 ( 26%) ........................................... Breakpoint 9, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786 1786 count = pcre_exec(pce->re, extra, subject, (gdb) print subject $44 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a" (gdb) print last_match $45 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a" (gdb) print start_offset $46 = 0 (gdb) c Continuing. 
Breakpoint 8, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794 1794 if (count == 0) { (gdb) print offsets[0] $47 = 2 (gdb) print offsets[1] $48 = 2 (gdb) c Continuing. Breakpoint 9, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786 1786 count = pcre_exec(pce->re, extra, subject, (gdb) print subject $49 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a" (gdb) print last_match $50 = 0x7fffed42e24a "\303\204\303\237\343\201\224a" (gdb) print start_offset $51 = 2 (gdb) c Continuing. Breakpoint 8, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794 1794 if (count == 0) { (gdb) print offsets[0] $52 = -1 (gdb) print offsets[1] $53 = -1 (gdb) c Continuing. Breakpoint 9, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786 1786 count = pcre_exec(pce->re, extra, subject, (gdb) print subject $54 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a" (gdb) print last_match $55 = 0x7fffed42e24a "\303\204\303\237\343\201\224a" (gdb) print start_offset $57 = 4 (gdb) c Continuing. 
Breakpoint 8, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794 1794 if (count == 0) { (gdb) print offsets[0] $58 = 2 (gdb) print offsets[1] $59 = 4 (gdb) c Continuing. Breakpoint 9, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786 1786 count = pcre_exec(pce->re, extra, subject, (gdb) print subject $60 = 0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a" (gdb) print last_match $61 = 0x7fffed42e24c "\303\237\343\201\224a" (gdb) print start_offset $62 = 4 (gdb) c Continuing. Breakpoint 8, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>) at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794 1794 if (count == 0) { (gdb) print offsets[0] $66 = 2 (gdb) print offsets[1] $67 = 4 (gdb) c Continuing. Program received signal SIGSEGV, Segmentation fault. __memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:273 273 ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory. > I am sorry for so many debugging requests, but I am not a php developer and > just doing guesses here. Neither am I :) I appreciate your help! -Nish
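The session above can be replayed as a small table to see which iteration produces the bad length. This is a sketch only: the struct and function names are invented for the illustration, the numbers are copied from the gdb output ($45 to $67), and last_match is stored as a byte offset from subject rather than as a pointer.

```c
#include <stddef.h>

/* One loop iteration as recorded in the gdb session (byte offsets into subject). */
struct split_step {
    int last_match;   /* last_match - subject at the pcre_exec breakpoint */
    int match_start;  /* offsets[0] at the count == 0 breakpoint */
    int match_end;    /* offsets[1] at the count == 0 breakpoint */
};

/* Length of the piece between last_match and the start of the match. */
int recorded_piece_length(const struct split_step *step)
{
    return step->match_start - step->last_match;
}

/* Index of the first step whose piece length is negative, or -1 if there is none. */
int first_bad_step(const struct split_step *steps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Steps where the offsets were -1, -1 describe no match and are skipped. */
        if (steps[i].match_start >= 0 && recorded_piece_length(&steps[i]) < 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Values taken from the session above. */
const struct split_step recorded[] = {
    { 0,  2,  2 },  /* empty match at offset 2, piece length 2 */
    { 2, -1, -1 },  /* offsets were -1, -1 */
    { 2,  2,  4 },  /* match 2-4, piece length 0 */
    { 4,  2,  4 },  /* the same 2-4 match again, piece length -2 */
};
```

The fourth recorded step is the first one with a negative piece length (-2), which is the step followed by the SIGSEGV.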
> Neither am I :) I appreciate your help!
Me too. And you are good at gdb, and that is rare :)
It seems we really get an offset pair before start_offset:
> (gdb) print start_offset
> $62 = 4
> (gdb) print offsets[0]
> $66 = 2
> (gdb) print offsets[1]
> $67 = 4
It is not impossible; a pattern like this can do that:
/(?<=\K.)/
But such patterns are rare.
Do you know what the pattern is here:
{{ baz|split('')|join('-') }}
An empty string? But that cannot match from 2-4.
And there is one more thing: the interpreter fills offsets[0] and offsets[1] with -1 in case of a failed match, but JIT does not do it. It can be a problem if the application expects the former behavior, but that code does not seem to rely on this.
However, this somehow contradicts the following:
> (gdb) print offsets[0]
> $52 = -1
> (gdb) print offsets[1]
> $53 = -1
Anyway, I think we need to figure out which pattern causes the problem. The pce->re member is set somewhere; you could capture that with a write watchpoint:
p &pce->re
-> prints the absolute address
watch *(long*)address
Rerun the application with r.
You might capture some unwanted breakpoint hits (sometimes thousands :) ), but just set a big ignore count on the watchpoint:
ignore 1 100000
When the crash happens, type "info breakpoints" and check the hit count. Set the ignore count to just one (or two) below that number and rerun the application. This time gdb will stop where the pattern is compiled (since that is the last write to this address), and you can just check the pattern string. Please send it to me.
(In reply to Zoltan Herczeg from comment #17) > > Neither am I :) I appreciate your help! > > Me too. And you are good at gdb, and that is rare :) > > It seems we really get an offset pair before start_offset: > > > (gdb) print start_offset > > $62 = 4 > > > (gdb) print offsets[0] > > $66 = 2 > > (gdb) print offsets[1] > > $67 = 4 > > It is not impossible, a pattern like this can do that: > > /(?<=\K.)/ > > But such patterns are rare. > > Do you know what is the pattern here: > > {{ baz|split('')|join('-') }} > > An empty string? But that cannot match from 2-4. I would expect that would be the pattern, based upon my understanding of twig. > And there is one more thing, the interpreters fills the offset[0] and [1] > with -1 in case of a failed match, but JIT does not do it. It can be a > problem if the application expects the former behavior, but that code does > not seem to rely on this. > > However, this part somehow contradicts to this: > > > (gdb) print offsets[0] > > $52 = -1 > > (gdb) print offsets[1] > > $53 = -1 > > Anyway, I think we need to figure out which pattern causes the problem. The > pce->re member is set somewhere, you could capture that with a write > watchpoint: > > p &pce->re > -> prints the absolute address > watch *(long*)address > > rerun the application again with r. > > You might capture some unwanted breakpoint hits (sometimes thousands :) ), > but just set a big ignore count to the watchpoint: > > ignore 1 100000 > > When the crash happens type "info breakpoints" and check the hit count. Set > the ignore count just one (or two) below to that number and rerun the > application again. This time gdb will stop where the pattern is compiled > (since that is the last write to this address), and just check the pattern > string. Please send it to me. So I attempted to do this a few times, but the failing &pce->re value kept changing between runs. Is that expected? That made the write watchpoint fail to trip. Any advice? 
Agreed we need to figure out what the pattern actually is.
> > An empty string? But that cannot match from 2-4.
>
> I would expect that would be the pattern, based upon my understanding of
> twig.
Perhaps the runtime environment may match other patterns as well. Btw, this pattern can also overwrite the ovector: https://github.com/php/php-src/blob/master/ext/pcre/php_pcre.c#L1847
Since JIT is not enabled for that pattern, it could explain how -1 appeared in the ovector.
> So I attempted to do this a few times, but the failing &pce->re value kept
> changing between runs. Is that expected? That made the write watchpoint fail
> to trip. Any advice? Agreed we need to figure out what the pattern actually
> is.
This is bad news. Normally allocators consistently return the same addresses unless address (or mmap) randomization is enabled for security reasons. I think it would take too much time to figure this out, so let's try something simpler first. Perhaps we should assume that the pattern is the empty string, and do a control flow analysis.
Even if count is optimized out, after pcre_exec returns, the return value is in $rax. You can print it with "p $rax". Count is likely $rax here:
if (count == 0)
We can confirm this by:
disassemble $pc,$pc+32
The code should start with "cmp $rax, 0", unless GCC has moved the value in $rax to another register.
Basically we need to check which conditional blocks are executed. Could you do the following analysis: please check the executed code path for each run of the ((limit_val == -1 || limit_val > 1)) loop. E.g.:
1st run of the loop: count ($rax) was 2, so the (count > 0 && (offsets[1] - offsets[0] >= 0)) condition is fulfilled, and the first conditional block is executed. offsets contains 2,2. Later g_notempty was initialized with 0.
2nd run of the loop: this time count ($rax) was -1 (0xffffffffffffffff), and the count == PCRE_ERROR_NOMATCH part was executed. The offsets are overwritten by pcre_exec with 2,4. Later g_notempty was initialized with 0.
The following code path may not be entirely correct:

  count = pcre_exec(re_bump, extra_bump, subject,
                    subject_len, start_offset,
                    exoptions, offsets, size_offsets);

can overwrite offsets, and later

  g_notempty = (offsets[1] == offsets[0])? PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED : 0;

uses these overwritten values. I don't think this was the intention of the author. But the segfault may be caused by a different issue.
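To make that dependency concrete, the flag expression can be isolated and fed the ovector states that show up in this report. This is only a sketch: compute_notempty is a hypothetical helper, and the two constant values are assumptions chosen so that their OR gives the 0x10000010 that gdb prints for g_notempty further down in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bit values; OR-ed together they give 0x10000010. */
#define SKETCH_ANCHORED         0x00000010
#define SKETCH_NOTEMPTY_ATSTART 0x10000000

/* Hypothetical helper mirroring the g_notempty expression quoted above.
 * It only looks at whatever is currently stored in the ovector. */
int compute_notempty(const int *offsets)
{
    assert(offsets != NULL);
    return (offsets[1] == offsets[0])
        ? (SKETCH_NOTEMPTY_ATSTART | SKETCH_ANCHORED)
        : 0;
}
```

With 2,2 or -1,-1 in the vector the flags are set; with the 2,4 pair written by the second pcre_exec they are cleared. So whichever call wrote the shared vector last decides how the next search is run.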
One more thing, since count is int (32 bit value), the register can be $eax, not $rax. That is ok as well.
Another idea just came to my mind.

It seems that all patterns are compiled by pcre_compile here:

https://github.com/php/php-src/blob/master/ext/pcre/php_pcre.c#L433

Would it be possible to dump all regex compilation to some file after this call?

E.g.

  re = pcre_compile(pattern,
                    coptions,
                    &error,
                    &erroffset,
                    tables);

  FILE *f = fopen("dump_file", "a"); // appending at the end
  fprintf(f, "/%s/ 0x%x -> %p\n", pattern, coptions, re);
  fclose(f);

It would be easy to find the offending pattern from this list. Just find the latest entry which has the same address as pce->re.
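A slightly more defensive version of that dump could look like the sketch below. The function names and the fixed buffer size are my own assumptions; the output format is the one from the snippet above. The only real change is that a failed fopen no longer crashes the process being debugged.

```c
#include <assert.h>
#include <stdio.h>

/* Formats one log entry in the "/pattern/ 0xoptions -> address" form.
 * Returns the snprintf result (number of characters needed). */
int format_pattern_entry(char *buf, size_t size, const char *pattern,
                         unsigned int coptions, const void *re)
{
    assert(buf != NULL && pattern != NULL);
    return snprintf(buf, size, "/%s/ 0x%x -> %p\n",
                    pattern, coptions, (void *)re);
}

/* Appends one entry to the file at path.
 * Returns 0 on success and -1 on any failure. */
int dump_compiled_pattern(const char *path, const char *pattern,
                          unsigned int coptions, const void *re)
{
    char line[1024];
    FILE *f;
    int n = format_pattern_entry(line, sizeof line, pattern, coptions, re);

    if (n < 0 || (size_t)n >= sizeof line)
        return -1;                 /* entry did not fit in the buffer */
    f = fopen(path, "a");          /* append, as in the original idea */
    if (f == NULL)
        return -1;
    if (fputs(line, f) == EOF) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

It would be called right after pcre_compile, for example as dump_compiled_pattern("dump_file", pattern, coptions, re).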
(In reply to Zoltan Herczeg from comment #21)
> Another idea just came to my mind.
>
> It seems that all patterns are compiled by pcre_compile here:
>
> https://github.com/php/php-src/blob/master/ext/pcre/php_pcre.c#L433
>
> Would it be possible to dump all regex compilation to some file after this
> call?
>
> E.g.
>
> re = pcre_compile(pattern,
> coptions,
> &error,
> &erroffset,
> tables);
>
> FILE *f = fopen("dump_file", "a"); // appending at the end
> fprintf(f, "/%s/ 0x%x -> %p\n", pattern, coptions, re);
> fclose(f);

Recompiling PHP7.0 is quite slow, so I tried doing this with gdb...

> It would be easy to find the offending pattern from this list. Just find the
> latest entry which has the same address as pce->re.

I set a breakpoint at ext/pcre/php_pcre.c:1720 or so, which is the pce->refcount++ in (zif_)preg_split. I ignored the first 232 hits of it, and on the last one:

Breakpoint 4, zif_preg_split (execute_data=<optimized out>, return_value=0x7ffff381b240)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1720
1720            pce->refcount++;
(gdb) print pce
$48 = (pcre_cache_entry *) 0x555555d333f0
(gdb) print subject->val@10
$47 = {"\303", "\251", "\303", "\204", "\303", "\237", "\343", "\201", "\224", "a"}
(gdb) print regex->val@14
$58 = {"/", "(", "?", "<", "!", "^", ")", "(", "?", "!", "$", ")", "/", "u"}
(gdb) print &pce->re
$60 = (pcre **) 0x555555d333f0
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
__memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:271
271     ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
(gdb) up
#1  0x00005555556798d8 in memcpy (__len=18446744073709551614, __src=0x7fffed40b1ac, __dest=0x7fffed7a6348)
    at /usr/include/x86_64-linux-gnu/bits/string3.h:53
53        return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
(gdb) up
#2  zend_string_init (persistent=0, len=18446744073709551614, str=0x7fffed40b1ac "\303\237\343\201\224a")
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/Zend/zend_string.h:159
159             memcpy(ZSTR_VAL(ret), str, len);
(gdb) up
#3  php_pcre_split_impl (pce=0x555555d333f0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1808
1808            ZVAL_STRINGL(&tmp, last_match, &subject[offsets[0]]-last_match);
(gdb) print pce
$61 = (pcre_cache_entry *) 0x555555d333f0
(gdb) print &pce->re
$62 = (pcre **) 0x555555d333f0

So the regex in question, I think, is:

  /(?<!^)(?!$)/u

which does correspond to the output I got from the above printf in gdb.

Does that help narrow down where the bug might be? Do you still want me to do the control flow analysis?
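For anyone wondering where the huge len comes from: in frame #2 str is 0x7fffed40b1ac while subject is 0x7fffed40b1a8, so last_match is already at subject + 4, and a match starting at offset 2 gives a pointer difference of -2. A minimal sketch of that arithmetic follows; piece_length is a hypothetical helper that just mirrors the third argument of the ZVAL_STRINGL call, not PHP code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of the piece between last_match and the
 * start of the current match, converted to size_t the way the length
 * ends up in zend_string_init. */
size_t piece_length(const char *subject, int match_start,
                    const char *last_match)
{
    ptrdiff_t diff;

    assert(subject != NULL && last_match != NULL);
    diff = &subject[match_start] - last_match;
    return (size_t)diff;   /* a negative difference wraps around */
}
```

On a 64-bit build, piece_length(subject, 2, subject + 4) is (size_t)-2, which is the 18446744073709551614 that memcpy receives.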
Thank you!

> /(?<!^)(?!$)/u

This is a tricky pattern, since it matches to an empty string. But other than that nothing special with it.

I tried matching it from offset 4 in UTF mode, and the result was 4,4 here. And that is the expected.

This is still the most confusing part for me:

Breakpoint 8, php_pcre_split_impl (pce=0x555555d33520, subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794
1794            if (count == 0) {
(gdb) print offsets[0]
$52 = -1
(gdb) print offsets[1]
$53 = -1
(gdb) c
Continuing.

JIT cannot return with -1 in offsets[0], except if the original value was -1, and there is no match.

I really would like to see the value of count before the crash, and I think it is in $eax or $rax (disassemble can confirm it).

Please print offsets[0] and [1] before and after pcre_exec is called. Please also print g_notempty as well.
I think I figured out where the -1 comes from: when a pattern is rerun with PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED, the JIT rejects it since it was compiled without PCRE_ANCHORED flag (this flag is compiled into the machine generated code). And the interpreter sets the -1.
(In reply to Zoltan Herczeg from comment #23)
> Thank you!
>
> > /(?<!^)(?!$)/u
>
> This is a tricky pattern, since it matches to an empty string. But other
> than that nothing special with it.
>
> I tried matching it from offset 4 in UTF mode, and the result was 4,4 here.
> And that is the expected.

I should reiterate that, here too -- when I run this particular testcase from twig on its own (just like `phpunit --process-isolation` does, which does work), I don't see any problem. So I'm not 100% sure it's this pattern in this execution that is bad, but some state somewhere (could be php, could be libpcre) is getting corrupted.

> This is still the most confusing part for me:
>
> Breakpoint 8, php_pcre_split_impl (pce=0x555555d33520,
> subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a",
> subject_len=10, return_value=0x7ffff381b240, limit_val=-1,
> flags=<optimized out>)
> at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794
> 1794 if (count == 0) {
> (gdb) print offsets[0]
> $52 = -1
> (gdb) print offsets[1]
> $53 = -1
> (gdb) c
> Continuing.
>
> JIT cannot return with -1 in offsets[0], except if the original value was
> -1, and there is no match.
>
> I really would like to see the value of count before the crash, and I think
> it is in $eax or $rax (disassemble can confirm it).
>
> Please print offsets[0] and [1] before and after pcre_exec is called. Please
> also print g_notempty as well.

Will do!
(In reply to Zoltan Herczeg from comment #24)
> I think I figured out where the -1 comes from: when a pattern is rerun with
> PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED, the JIT rejects it since it was
> compiled without PCRE_ANCHORED flag (this flag is compiled into the machine
> generated code). And the interpreter sets the -1.

I could probably insert a breakpoint and remove PCRE_ANCHORED from the rerun to confirm this.
(In reply to Nish Aravamudan from comment #25)
> (In reply to Zoltan Herczeg from comment #23)
> > Thank you!
> >
> > > /(?<!^)(?!$)/u
> >
> > This is a tricky pattern, since it matches to an empty string. But other
> > than that nothing special with it.
> >
> > I tried matching it from offset 4 in UTF mode, and the result was 4,4 here.
> > And that is the expected.
>
> I should reiterate that, here too -- when I run this particular testcase
> from twig on its own (just like `phpunit --process-isolation` does, which
> does work), I don't see any problem. So I'm not 100% sure it's this pattern
> in this execution that is bad, but some state somewhere (could be php, could
> be libpcre) is getting corrupted.
>
> > This is still the most confusing part for me:
> >
> > Breakpoint 8, php_pcre_split_impl (pce=0x555555d33520,
> > subject=0x7fffed42e248 "\303\251\303\204\303\237\343\201\224a",
> > subject_len=10, return_value=0x7ffff381b240, limit_val=-1,
> > flags=<optimized out>)
> > at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794
> > 1794 if (count == 0) {
> > (gdb) print offsets[0]
> > $52 = -1
> > (gdb) print offsets[1]
> > $53 = -1
> > (gdb) c
> > Continuing.
> >
> > JIT cannot return with -1 in offsets[0], except if the original value was
> > -1, and there is no match.
> >
> > I really would like to see the value of count before the crash, and I think
> > it is in $eax or $rax (disassemble can confirm it).
> >
> > Please print offsets[0] and [1] before and after pcre_exec is called. Please
> > also print g_notempty as well.
>
> Will do!

I *think* this is what you want?
Breakpoint 9, php_pcre_split_impl (pce=0x555555d33810, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print offsets[0]
$120 = -304455800
(gdb) print offsets[1]
$121 = 32767
(gdb) printf "0x%x\n", g_notempty
0x0
(gdb) step
1794            if (count == 0) {
(gdb) print $eax
$122 = 1
(gdb) print offsets[0]
$123 = 2
(gdb) print offsets[1]
$124 = 2
(gdb) c
Continuing.

Breakpoint 9, php_pcre_split_impl (pce=0x555555d33810, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print offsets[0]
$125 = 2
(gdb) print offsets[1]
$126 = 2
(gdb) printf "0x%x\n", g_notempty
0x10000010
(gdb) step
1794            if (count == 0) {
(gdb) print $eax
$129 = -1
(gdb) print offsets[0]
$127 = -1
(gdb) print offsets[1]
$128 = -1
(gdb) c
Continuing.

Breakpoint 9, php_pcre_split_impl (pce=0x555555d33810, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print offsets[0]
$130 = 2
(gdb) print offsets[1]
$131 = 4
(gdb) printf "0x%x\n", g_notempty
0x0
(gdb) step
1794            if (count == 0) {
(gdb) print $eax
$132 = 0
(gdb) print offsets[0]
$133 = 2
(gdb) print offsets[1]
$134 = 4
(gdb) c
Continuing.

Breakpoint 9, php_pcre_split_impl (pce=0x555555d33810, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print offsets[0]
$135 = 2
(gdb) print offsets[1]
$136 = 4
(gdb) printf "0x%x\n", g_notempty
0x0
(gdb) step
1794            if (count == 0) {
(gdb) print $eax
$139 = 0
(gdb) print offsets[0]
$137 = 2
(gdb) print offsets[1]
$138 = 4
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
__memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:271
271     ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.
I also noticed that for UTF8 specifically, there is another pcre_exec that occurs (l#1853).

Breakpoint 10, php_pcre_split_impl (pce=0x555555d333a0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1851
1851            count = pcre_exec(re_bump, extra_bump, subject,
(gdb) print start_offset
$151 = 2
(gdb) print offsets[0]
$152 = -1
(gdb) print offsets[1]
$153 = -1
(gdb) print extra_bump
$154 = (pcre_extra *) 0x555555d33580
(gdb) print subject
$155 = 0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a"
(gdb) step
1854            if (count < 1) {
(gdb) print offsets[0]
$156 = 2
(gdb) print offsets[1]
$157 = 4
(gdb) print count
$159 = 1
> I don't see any problem. So I'm not 100% sure it's this pattern in this
> execution that is bad, but some state somewhere (could be php, could be
> libpcre) is getting corrupted.

I agree. This is my opinion as well.

> I also noticed that for UTF8 specifically, there is another pcre_exec that
> occurs (l#1853).

I also mentioned this before. I had the feeling that the 2,4 offset pair is coming from this pcre_exec (unfortunately both pcre_execs share the same ovector). This is confirmed now. And count really is stored in $eax.

This is the result of the last pcre_exec:

> (gdb) print offsets[0]
> $130 = 2
> (gdb) print offsets[1]
> $131 = 4
> (gdb) printf "0x%x\n", g_notempty
> 0x0
> (gdb) step
> 1794 if (count == 0) {
> (gdb) print $eax
> $132 = 0
> (gdb) print offsets[0]
> $133 = 2
> (gdb) print offsets[1]
> $134 = 4

This confirms the "ovector is not changed" theory. The zero return value means there is not enough space in the ovector, and offsets are not updated. More precisely it seems as if 0 is passed for the length of offsets vector (size_offsets is changed to zero somehow?).

Because count is zero:

  if (count == 0) {
      php_error_docref(NULL,E_NOTICE, "Matched, but too many substrings");
      count = size_offsets/3;
  }

count got a new value. However, if size_offsets would be changed to 0, count would be zero again (0/3=0).

And the following condition (where the segfault happens) would not be fulfilled:

  if (count > 0 && (offsets[1] - offsets[0] >= 0)) {

To make things even more confusing, pcre_exec runs correctly during the first run:

> 1794 if (count == 0) {
> (gdb) print $eax
> $122 = 1
> (gdb) print offsets[0]
> $123 = 2
> (gdb) print offsets[1]
> $124 = 2

Here, there was enough space so the return value was 1, and offsets vector was updated (2,2 is the correct result for the first run).

I think we should focus on the size_offsets variable, and what is passed to pcre_exec. It seems to me that although size_offsets did not change its value (regardless please print it), somehow 0 is passed to pcre_exec. Or the passed value is overwritten to zero later (buffer overflow?). Could you enter to pcre_exec and print the arguments? Especially the size of the ovector (offsetcount). A watchpoint could help to find where the value is changed if the argument is correct.
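The two conditions discussed in that comment can be checked in isolation. The sketch below uses hypothetical helper names and only mirrors the two if statements quoted there; it is not the PHP source.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the count == 0 fallback: a zero return from pcre_exec is
 * replaced by the number of pairs the vector can hold. */
int adjust_count(int count, int size_offsets)
{
    assert(size_offsets >= 0);
    if (count == 0)
        count = size_offsets / 3;
    return count;
}

/* Nonzero when the branch containing the crashing line is entered. */
int takes_match_branch(int count, const int *offsets)
{
    assert(offsets != NULL);
    return count > 0 && (offsets[1] - offsets[0] >= 0);
}
```

With the stale 2,4 pair in the vector, a zero return and size_offsets == 3 enters the match branch with old offsets, while size_offsets == 0 leaves count at zero and skips it. That is why the value of size_offsets at that point matters.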
(In reply to Zoltan Herczeg from comment #29)
> > I don't see any problem. So I'm not 100% sure it's this pattern in this
> > execution that is bad, but some state somewhere (could be php, could be
> > libpcre) is getting corrupted.
>
> I agree. This is my opinion as well.
>
> > I also noticed that for UTF8 specifically, there is another pcre_exec that
> > occurs (l#1853).
>
> I also mentioned this before. I had the feeling that the 2,4 offset pair is
> coming from this pcre_exec (unfortunately both pcre_execs share the same
> ovector). This is confirmed now. And count really is stored in $eax.
>
> This is the result of the last pcre_exec:
>
> > (gdb) print offsets[0]
> > $130 = 2
> > (gdb) print offsets[1]
> > $131 = 4
> > (gdb) printf "0x%x\n", g_notempty
> > 0x0
> > (gdb) step
> > 1794 if (count == 0) {
> > (gdb) print $eax
> > $132 = 0
> > (gdb) print offsets[0]
> > $133 = 2
> > (gdb) print offsets[1]
> > $134 = 4
>
> This confirms the "ovector is not changed" theory. The zero return value
> means there is not enough space in the ovector, and offsets are not updated.
> More precisely it seems as if 0 is passed for the length of offsets vector
> (size_offsets is changed to zero somehow?).
>
> Because count is zero:
>
> if (count == 0) {
> php_error_docref(NULL,E_NOTICE, "Matched, but too many substrings");
> count = size_offsets/3;
> }
>
> count got a new value. However, if size_offsets would be changed to 0, count
> would be zero again (0/3=0).
>
> And the following condition (where the segfault happens) would not be
> fulfilled:
>
> if (count > 0 && (offsets[1] - offsets[0] >= 0)) {
>
> To make things even more confusing, pcre_exec runs correctly during the
> first run:
>
> > 1794 if (count == 0) {
> > (gdb) print $eax
> > $122 = 1
> > (gdb) print offsets[0]
> > $123 = 2
> > (gdb) print offsets[1]
> > $124 = 2
>
> Here, there was enough space so the return value was 1, and offsets vector
> was updated (2,2 is the correct result for the first run).
>
> I think we should focus on the size_offsets variable, and what is passed to
> pcre_exec. It seems to me that although size_offsets did not change its
> value (regardless please print it), somehow 0 is passed to pcre_exec. Or the
> passed value is overwritten to zero later (buffer overflow?). Could you
> enter to pcre_exec and print the arguments? Especially the size of the
> ovector (offsetcount). A watchpoint could help to find where the value is
> changed if the argument is correct.

I think you're right:

Breakpoint 9, php_pcre_split_impl (pce=0x555555d333e0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print size_offsets
$165 = 3
(gdb) c
Continuing.

Breakpoint 9, php_pcre_split_impl (pce=0x555555d333e0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print size_offsets
$166 = 3
(gdb) c
Continuing.

Breakpoint 10, php_pcre_split_impl (pce=0x555555d333e0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1851
1851            count = pcre_exec(re_bump, extra_bump, subject,
(gdb) print size_offsets
$167 = 3
(gdb) c
Continuing.

Breakpoint 9, php_pcre_split_impl (pce=0x555555d333e0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print size_offsets
$168 = 0
(gdb) c
Continuing.

Breakpoint 9, php_pcre_split_impl (pce=0x555555d333e0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1786
1786            count = pcre_exec(pce->re, extra, subject,
(gdb) print size_offsets
$169 = 0
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
__memcpy_avx_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:271
271     ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or directory.

I will do some more digging into what is changing that value to 0.
Breaking at pcre_exec:

pcre_exec (argument_re=0x555555d33430, extra_data=extra_data@entry=0x555555d33490, subject=subject@entry=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", length=length@entry=10, start_offset=start_offset@entry=2, options=268443664, offsets=0x7fffffff9210, offsetcount=3)
    at pcre_exec.c:6361

pcre_exec (argument_re=argument_re@entry=0x555555d33580, extra_data=0x555555d335d0, subject=subject@entry=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", length=length@entry=10, start_offset=start_offset@entry=2, options=options@entry=8192, offsets=0x7fffffff9210, offsetcount=3)
    at pcre_exec.c:6361

pcre_exec (argument_re=0x555555d33430, extra_data=extra_data@entry=0x555555d33490, subject=subject@entry=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", length=length@entry=10, start_offset=start_offset@entry=4, options=8192, offsets=0x7fffffff9210, offsetcount=0)

pcre_exec (argument_re=0x555555d33430, extra_data=extra_data@entry=0x555555d33490, subject=subject@entry=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", length=length@entry=10, start_offset=start_offset@entry=4, options=8192, offsets=0x7fffffff9210, offsetcount=0)

... SEGV.
Created attachment 873
GDB log showing second pcre_exec possibly corrupting size_offsets value

I grabbed a lot of gdb output just now, trying to narrow down when size_offsets location gets trashed to 0. I noticed that offsetcount does, inside one of the jit functions, get set to 2, but it's back to 3 in the caller, until it returns to the PHP code. At which point size_offsets has been set to 0... I'm going to put a watchpoint on the address of size_offsets to see if I can see what actually is writing to it.
> I grabbed a lot of gdb output just now, trying to narrow down when
> size_offsets location gets trashed to 0. I noticed that offsetcount does,
> inside one of the jit functions, get set to 2, but it's back to 3 in the
> caller, until it returns to the PHP code. At which point size_offsets has
> been set to 0... I'm going to put a watchpoint on the address of
> size_offsets to see if I can see what actually is writing to it.

Thank you, I check it.

What I still don't get, if size_offsets is zero, and count is zero

  if (count == 0) {
      php_error_docref(NULL,E_NOTICE, "Matched, but too many substrings");
      count = size_offsets/3;
  }

then count should be still zero after this point, how could this be true:

  if (count > 0 && (offsets[1] - offsets[0] >= 0))

Perhaps GCC is (too) clever here, and realized that size_offsets must be >= 3 since it is computed in the following way:

  size_offsets = (pce->capture_count + 1) * 3;

and optimized out the count > 0 part. I saw such things before...
(gdb) print &size_offsets
$49 = (int *) 0x7fffffff9288
(gdb) watch *0x7fffffff9288

Hardware watchpoint 4: *0x7fffffff9288

Old value = 3
New value = 0
_pcre_jit_exec (extra_data=extra_data@entry=0x555555d4de70, subject=subject@entry=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", length=length@entry=10, start_offset=start_offset@entry=2, options=options@entry=8192, offsets=offsets@entry=0x7fffffff9210, offset_count=<optimized out>)
    at pcre_jit_compile.c:10481
10481   pcre_jit_compile.c: No such file or directory.

Which seems to be the return from jit_exec? That roughly would correspond to what we see in the previous attachment.

(gdb) disassemble $pc-32,$pc+32
Dump of assembler code from 0x7ffff799b6ea to 0x7ffff799b72a:
   0x00007ffff799b6ea <_pcre_jit_exec+314>: rorl %cl,0x39440014(%rbp)
   0x00007ffff799b6f0 <_pcre_jit_exec+320>: loop 0x7ffff799b6ac <_pcre_jit_exec+252>
   0x00007ffff799b6f2 <_pcre_jit_exec+322>: add %al,(%rax)
   0x00007ffff799b6f4 <_pcre_jit_exec+324>: add %al,(%rax)
   0x00007ffff799b6f6 <_pcre_jit_exec+326>: cmovg %edx,%eax
   0x00007ffff799b6f9 <_pcre_jit_exec+329>: testb $0x20,(%rbx)
   0x00007ffff799b6fc <_pcre_jit_exec+332>: je 0x7ffff799b70a <_pcre_jit_exec+346>
   0x00007ffff799b6fe <_pcre_jit_exec+334>: mov 0x30(%rbx),%rdx
   0x00007ffff799b702 <_pcre_jit_exec+338>: mov 0x30(%rsp),%rcx
   0x00007ffff799b707 <_pcre_jit_exec+343>: mov %rcx,(%rdx)
=> 0x00007ffff799b70a <_pcre_jit_exec+346>: mov 0x58(%rsp),%rbx
   0x00007ffff799b70f <_pcre_jit_exec+351>: xor %fs:0x28,%rbx
   0x00007ffff799b718 <_pcre_jit_exec+360>: jne 0x7ffff799b761 <_pcre_jit_exec+433>
   0x00007ffff799b71a <_pcre_jit_exec+362>: add $0x60,%rsp
   0x00007ffff799b71e <_pcre_jit_exec+366>: pop %rbx
   0x00007ffff799b71f <_pcre_jit_exec+367>: pop %rbp
   0x00007ffff799b720 <_pcre_jit_exec+368>: pop %r12
   0x00007ffff799b722 <_pcre_jit_exec+370>: retq
   0x00007ffff799b723 <_pcre_jit_exec+371>: nopl 0x0(%rax,%rax,1)
The offset_count is a local variable, and not a reference, so functions are free to modify it. This does not affect the value in the caller. The JIT performs a normalization but that should not cause any problem.
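A small sketch of why the 3 -> 2 change seen inside the JIT entry point is harmless. The formula below is my assumption about what the normalization does (round down to a multiple of 3, keep two thirds); the thread itself only says that the value becomes 2. The point is just that the callee works on its own copy.

```c
#include <assert.h>

/* Assumed normalization; the parameter is passed by value, so only the
 * callee's copy changes. */
int normalized_offset_count(int offset_count)
{
    assert(offset_count >= 0);
    if (offset_count != 2)
        offset_count = ((offset_count - (offset_count % 3)) * 2) / 3;
    return offset_count;
}
```

Calling it with size_offsets == 3 returns 2 and leaves the caller's size_offsets at 3, so the later zero has to come from a write through a pointer, not from this argument.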
(In reply to Zoltan Herczeg from comment #33)
> > I grabbed a lot of gdb output just now, trying to narrow down when
> > size_offsets location gets trashed to 0. I noticed that offsetcount does,
> > inside one of the jit functions, get set to 2, but it's back to 3 in the
> > caller, until it returns to the PHP code. At which point size_offsets has
> > been set to 0... I'm going to put a watchpoint on the address of
> > size_offsets to see if I can see what actually is writing to it.
>
> Thank you, I check it.
>
> What I still don't get, if size_offsets is zero, and count is zero
>
> if (count == 0) {
> php_error_docref(NULL,E_NOTICE, "Matched, but too many substrings");
> count = size_offsets/3;
> }
>
> then count should be still zero after this point, how could this be true:
>
> if (count > 0 && (offsets[1] - offsets[0] >= 0))
>
> Perhaps GCC is (too) clever here, and realized that size_offsets must be >=
> 3 since it is computed in the following way:
>
> size_offsets = (pce->capture_count + 1) * 3;
>
> and optimized out the count > 0 part. I saw such things before...

For reference, here is the disassembly around the count > 0 if-statement:

   0x0000555555679831 <php_pcre_split_impl+497>: test %eax,%eax
   0x0000555555679833 <php_pcre_split_impl+499>: mov %eax,%r9d
   0x0000555555679836 <php_pcre_split_impl+502>: pop %r12
   0x0000555555679838 <php_pcre_split_impl+504>: pop %r14
   0x000055555567983a <php_pcre_split_impl+506>: je 0x55555567a07d <php_pcre_split_impl+2621>
=> 0x0000555555679840 <php_pcre_split_impl+512>: test %r9d,%r9d
   0x0000555555679843 <php_pcre_split_impl+515>: jle 0x555555679b70 <php_pcre_split_impl+1328>
   0x0000555555679849 <php_pcre_split_impl+521>: movslq (%r15),%r12
   0x000055555567984c <php_pcre_split_impl+524>: mov 0x4(%r15),%eax
   0x0000555555679850 <php_pcre_split_impl+528>: cmp %r12d,%eax
   0x0000555555679853 <php_pcre_split_impl+531>: js 0x555555679f28 <php_pcre_split_impl+2280>

with count in %eax initially, I believe.

(In reply to Zoltan Herczeg from comment #33)
> > I grabbed a lot of gdb output just now, trying to narrow down when
> > size_offsets location gets trashed to 0. I noticed that offsetcount does,
> > inside one of the jit functions, get set to 2, but it's back to 3 in the
> > caller, until it returns to the PHP code. At which point size_offsets has
> > been set to 0... I'm going to put a watchpoint on the address of
> > size_offsets to see if I can see what actually is writing to it.
>
> Thank you, I check it.
>
> What I still don't get, if size_offsets is zero, and count is zero
>
> if (count == 0) {
> php_error_docref(NULL,E_NOTICE, "Matched, but too many substrings");
> count = size_offsets/3;
> }
>
> then count should be still zero after this point, how could this be true:
>
> if (count > 0 && (offsets[1] - offsets[0] >= 0))
>
> Perhaps GCC is (too) clever here, and realized that size_offsets must be >=
> 3 since it is computed in the following way:
>
> size_offsets = (pce->capture_count + 1) * 3;
>
> and optimized out the count > 0 part. I saw such things before...

Maybe I am missing something, but that second conditional doesn't get entered if count == 0.

Breakpoint 6, php_pcre_split_impl (pce=pce@entry=0x555555d333f0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=return_value@entry=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1800
1800            if (count > 0 && (offsets[1] - offsets[0] >= 0)) {
3: size_offsets = 3
(gdb) print $eax
$59 = -1
(gdb) step
1835            } else if (count == PCRE_ERROR_NOMATCH) {
3: size_offsets = 3
(gdb) step
1840            if (g_notempty != 0 && start_offset < subject_len) {
3: size_offsets = 3
(gdb) step
1841            if (pce->compile_options & PCRE_UTF8) {
3: size_offsets = 3
(gdb) c
Continuing.

Breakpoint 1, php_pcre_split_impl (pce=pce@entry=0x555555d333f0, subject=0x7fffed40b1a8 "\303\251\303\204\303\237\343\201\224a", subject_len=10, return_value=return_value@entry=0x7ffff381b240, limit_val=-1, flags=<optimized out>)
    at /build/php7.0-WHFaJZ/php7.0-7.0.3/ext/pcre/php_pcre.c:1794
1794            if (count == 0) {
3: size_offsets = 0
(gdb) print eax
No symbol "eax" in current context.
(gdb) print $eax
$60 = 0
(gdb) step
1786            count = pcre_exec(pce->re, extra, subject,
3: size_offsets = 0

which is when we've popped back up to the top of the loop?
> 0x00007ffff799b6f9 <_pcre_jit_exec+329>: testb $0x20,(%rbx)
> 0x00007ffff799b6fc <_pcre_jit_exec+332>: je 0x7ffff799b70a <_pcre_jit_exec+346>
> 0x00007ffff799b6fe <_pcre_jit_exec+334>: mov 0x30(%rbx),%rdx
> 0x00007ffff799b702 <_pcre_jit_exec+338>: mov 0x30(%rsp),%rcx
> 0x00007ffff799b707 <_pcre_jit_exec+343>: mov %rcx,(%rdx)
> => 0x00007ffff799b70a <_pcre_jit_exec+346>: mov 0x58(%rsp),%rbx
> 0x00007ffff799b70f <_pcre_jit_exec+351>: xor %fs:0x28,%rbx
> 0x00007ffff799b718 <_pcre_jit_exec+360>: jne 0x7ffff799b761

gdb usually stops after the write, so this is likely the offending instruction:

  mov %rcx,(%rdx)

It is hard to tell the corresponding source code from the assembly but I think it is the following source code:

  if ((extra_data->flags & PCRE_EXTRA_MARK) != 0)
    *(extra_data->mark) = arguments.mark_ptr;

Could you check that PCRE_EXTRA_MARK is set in extra_data->flags? And please also check where the extra_data->mark points.

PCRE_EXTRA_MARK is 0x20

And there is the comparison with 0x20 just before the overwrite: testb $0x20,(%rbx).
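To show how that one store could zero an unrelated int, here is a self-contained simulation. Everything in it is a stand-in: sketch_extra is a cut-down imitation of pcre_extra with only the two relevant fields, and the union models a pointer-sized stack slot whose first int plays the role of size_offsets. It also assumes a null pointer is stored as all-zero bits, which holds on the platforms in question.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EXTRA_MARK 0x20UL   /* value stated in the comment above */

/* Cut-down stand-in for pcre_extra. */
struct sketch_extra {
    unsigned long flags;
    unsigned char **mark;
};

/* Pointer-sized slot; as_ints[0] plays the role of size_offsets. */
union sketch_slot {
    unsigned char *as_pointer;
    int as_ints[sizeof(unsigned char *) / sizeof(int)];
};

/* Mirrors the suspected store at the end of the JIT exec path. */
void store_mark(struct sketch_extra *extra, unsigned char *mark_ptr)
{
    assert(extra != NULL);
    if ((extra->flags & SKETCH_EXTRA_MARK) != 0)
        *(extra->mark) = mark_ptr;
}

/* Returns what is left in the "size_offsets" slot after one exec of a
 * pattern that sets no mark, so NULL is stored. */
int size_offsets_after_exec(unsigned long flags)
{
    union sketch_slot slot;
    struct sketch_extra extra;

    slot.as_pointer = NULL;
    slot.as_ints[0] = 3;             /* size_offsets = 3 */
    extra.flags = flags;
    extra.mark = &slot.as_pointer;   /* stale pointer aimed at the slot */
    store_mark(&extra, NULL);
    return slot.as_ints[0];
}
```

With the 0x20 bit set the function returns 0, matching the "Old value = 3, New value = 0" watchpoint hit; with the bit cleared the slot keeps its 3.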
I mean just print *extra_bump here: count = pcre_exec(re_bump, extra_bump, subject, so we can see all of its fields.
(In reply to Zoltan Herczeg from comment #38)
> I mean just print *extra_bump here:
>
> count = pcre_exec(re_bump, extra_bump, subject,
>
> so we can see all of its fields.

I have to go now. Anyway, if the 0x20 flag is set in extra_bump->flags and extra_data->mark points to the address of size_offsets then we nailed the problem. I suspect extra_data->mark is not initialized, or setting the flag is not intentional. Btw do you know the meaning of 's' in /./us ? I suspect 'u' is UTF8.
(In reply to Zoltan Herczeg from comment #38)
> I mean just print *extra_bump here:
>
> count = pcre_exec(re_bump, extra_bump, subject,
>
> so we can see all of its fields.

(gdb) print *extra_bump
$62 = {flags = 115, study_data = 0x555555d33610, match_limit = 1000000, callout_data = 0x0,
  tables = 0x21e <error: Cannot access memory at address 0x21e>,
  match_limit_recursion = 100000, mark = 0x7fffffff9288, executable_jit = 0x555555d33650}
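For reading the flags value in that dump, a tiny decoder helps. Only the 0x20 value for the mark bit is stated in this thread; the other bit values below are from my memory of the PCRE 8.x header and should be treated as assumptions to verify against pcre.h. The SKETCH_ names are made up for this example.

```c
#include <assert.h>

/* Assumed bit layout of pcre_extra.flags (only 0x20 is confirmed here). */
#define SKETCH_EXTRA_STUDY_DATA            0x0001UL
#define SKETCH_EXTRA_MATCH_LIMIT           0x0002UL
#define SKETCH_EXTRA_CALLOUT_DATA          0x0004UL
#define SKETCH_EXTRA_TABLES                0x0008UL
#define SKETCH_EXTRA_MATCH_LIMIT_RECURSION 0x0010UL
#define SKETCH_EXTRA_MARK                  0x0020UL
#define SKETCH_EXTRA_EXECUTABLE_JIT        0x0040UL

/* Nonzero when the given bit is set in flags. */
int flag_is_set(unsigned long flags, unsigned long bit)
{
    assert(bit != 0);
    return (flags & bit) != 0;
}
```

Under that layout, 115 (0x73) has the mark bit set and the tables bit clear, which would fit the unreadable tables value printed by gdb.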
Out of curiosity, do valgrind or even better an instrumented build (with ASAN/UBSAN) report anything wrong?
(In reply to Giuseppe D'Angelo from comment #41)
> Out of curiosity, do valgrind or even better an instrumented build (with
> ASAN/UBSAN) report anything wrong?

IIRC, no, but I think I may have found it (possibly, going to need to build a new PHP7.0 to test).

The old split implementation did not use JIT, but the newer one (as of https://github.com/php/php-src/commit/92655be7cf10f7551ee1a1ae7ea0f1bdcfa2ca6b) does.

There was an older commit (https://github.com/php/php-src/commit/376ab3b7873ca04142185d8c08dbb4c4be152474) that indicates "Nested PCRE calls may clobber extra->mark and it has to be reinitailized" which is quite symptomatic here ... Will see if adding the reinit fixes the problem.
(In reply to Zoltan Herczeg from comment #39)
> (In reply to Zoltan Herczeg from comment #38)
> > I mean just print *extra_bump here:
> >
> > count = pcre_exec(re_bump, extra_bump, subject,
> >
> > so we can see all of its fields.
>
> I have to go now. Anyway, if the 0x20 flag is set in extra_bump->flags and
> extra_data->mark points to the address of size_offsets then we nailed the
> problem. I suspect extra_data->mark is not initialized, or setting the flag
> is not intentional. Btw do you know the meaning of 's' in /./us ? I suspect
> 'u' is UTF8.

I *think* you have hit it on the head, and I'm hoping it's just a PHP bug introduced when they moved to the JIT implementation for split().

From: http://perldoc.perl.org/perlre.html

  s  Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

  Used together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.

And you're right, /u is for unicode.
> $62 = {flags = 115, study_data = 0x555555d33610, match_limit = 1000000,
> callout_data = 0x0,
> tables = 0x21e <error: Cannot access memory at address 0x21e>,
> match_limit_recursion = 100000, mark = 0x7fffffff9288,
> executable_jit = 0x555555d33650}

The 115 is 0x73, so 0x20 is indeed set in flags. The 0x7fffffff9288 for mark is a typical stack address. I thought PCRE_EXTRA_MARK is connected to 's' but it is probably not. Perhaps PCRE_EXTRA_MARK is always set.

You are probably right about the issue you found. In the past study data was never passed, so setting PCRE_EXTRA_MARK did not cause any problem. But now it is, a random memory address is overwritten with NULL.

I think they really need to change the character advancing to something more efficient like:

  if (g_notempty != 0 && start_offset < subject_len) {
      offsets[0] = start_offset;
      offsets[1] = start_offset + 1;
      if (pce->compile_options & PCRE_UTF8) {
          while (offsets[1] < subject_len
                 && (subject[offsets[1]] & 0xc0) == 0x80) {
              offsets[1]++;
          }
      }
  }

Because using pcre to advance just one UTF character ahead is the most inefficient way I can imagine.

Regardless, the PCRE_EXTRA_MARK flag may cause other side effects (memory overwrites if the mark pointer is invalid) so I would check thoroughly in the engine.
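The suggested advance can be tried outside PHP with the subject string from this report. The sketch below is a standalone version under my own naming (next_char_offset is hypothetical); it compares the offset against subject_len and casts to unsigned char before masking so the test does not depend on the signedness of char.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset of the next character start after start_offset.
 * In UTF-8 mode, continuation bytes (10xxxxxx) are skipped. */
int next_char_offset(const char *subject, int subject_len,
                     int start_offset, int utf8)
{
    int next;

    assert(subject != NULL);
    assert(start_offset >= 0 && start_offset < subject_len);
    next = start_offset + 1;
    if (utf8) {
        while (next < subject_len
               && ((unsigned char)subject[next] & 0xc0) == 0x80) {
            next++;
        }
    }
    return next;
}
```

For the 10-byte subject "\303\251\303\204\303\237\343\201\224a" this steps 0 -> 2 -> 4 -> 6 -> 9 -> 10 in UTF-8 mode, i.e. one character at a time without calling into pcre at all.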
(In reply to Zoltan Herczeg from comment #44)
> > $62 = {flags = 115, study_data = 0x555555d33610, match_limit = 1000000,
> > callout_data = 0x0,
> > tables = 0x21e <error: Cannot access memory at address 0x21e>,
> > match_limit_recursion = 100000, mark = 0x7fffffff9288,
> > executable_jit = 0x555555d33650}
>
> The 115 is 0x73, so 0x20 is indeed set in flags. The 0x7fffffff9288 for mark
> is a typical stack address. I thought PCRE_EXTRA_MARK is connected to 's'
> but it is probably not. Perhaps PCRE_EXTRA_MARK is always set.
>
> You are probably right about the issue you found. In the past study data was
> never passed, so setting PCRE_EXTRA_MARK did not cause any problem. But now
> it is, a random memory address is overwritten with NULL.

Yep, I just tested unsetting both the mark value and unsetting PCRE_EXTRA_MARK in the flags and it passed now!

> I think they really need to change the character advancing to something more
> efficient like:
>
> if (g_notempty != 0 && start_offset < subject_len) {
> offsets[0] = start_offset;
> offsets[1] = start_offset + 1;
> if (pce->compile_options & PCRE_UTF8) {
> while (offsets[1] < subject_len
> && (subject[offsets[1]] & 0xc0) == 0x80) {
> offsets[1]++;
> }
> }
> }
>
> Because using pcre to advance just one UTF character ahead is the most
> inefficient way I can imagine.

Agreed -- but that is a decision I assume they made for some reason. I will propose your suggestion to the upstream developers, though :)

> Regardless, the PCRE_EXTRA_MARK flag may cause other side effects (memory
> overwrites if the mark pointer is invalid) so I would check thoroughly in
> the engine.

Thank you for all your help! I'll close this bug now.
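For the record, the workaround tested above (clearing the mark pointer and the PCRE_EXTRA_MARK bit before the nested call) could look roughly like the sketch below. It uses a cut-down stand-in struct rather than the real pcre_extra, and disable_mark is a hypothetical name; it is not the patch that went into PHP.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EXTRA_MARK 0x20UL   /* value given earlier in this report */

/* Cut-down stand-in for pcre_extra: only the two relevant fields. */
struct sketch_extra {
    unsigned long flags;
    unsigned char **mark;
};

/* Makes sure a later exec cannot store through a stale mark pointer. */
void disable_mark(struct sketch_extra *extra)
{
    assert(extra != NULL);
    extra->flags &= ~SKETCH_EXTRA_MARK;
    extra->mark = NULL;
}
```

Applied to the dump from this report, flags goes from 0x73 to 0x53 and mark becomes NULL, so the store guarded by the 0x20 test is skipped.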