-
Notifications
You must be signed in to change notification settings - Fork 577
regex: memory bug when recursively matching a Regexp reference though an embeded code block with a subroutine #21725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I forgot to mention that the bug occurs specifically in the presense of |
The backtraces in GDB using a debug build. case of:
case of:
|
I though I'd already posted this, for the tcache case:
which looks like the regexp has been accessed after being freed. For the Segmentation fault case:
Debugging:
which again looks like the REGEXP SV has been released and re-used for an array. |
@tonycoz The infamous use after free, in this case because of a free occuring earlier than it should be. I've made a bit of progress on the same case:
next until reaching:
then do:
careful with the
continue until reaching Perl_pp_match for the 2nd time
this function contains the macro
The SIGABRT then occurs after pp_leavesub, when the 2nd call to
I understand only a tiny bit of what is going on but maybe the issue it is related to this warning in
|
Can someone put the type-regex label please? I can't do it myself. I did additional research back in December but didn't finished writing the post, nor the research. The previous post is wrong.
The details are fuzzy now but commenting out TEST1
As expected:
TEST2regular perl:
perl debug build:
minperl: it doesn't crash but
This suggests that the function/macro
called in I think that maybe |
The behavior of this has changed. I'm unsure that it is fixed, but I'm not getting segfaults. Please try it again and update this ticket. |
On Wed, Apr 23, 2025 at 09:05:32AM -0700, Karl Williamson wrote:
The behavior of this has changed. I'm unsure that it is fixed, but I'm
not getting segfaults. Please try it again and update this ticket.
This issue sounds very similar to the one I fixed a month ago.
commit 39b4841
Author: David Mitchell ***@***.***>
AuthorDate: Sat Feb 22 16:48:52 2025 +0000
Commit: David Mitchell ***@***.***>
CommitDate: Mon Mar 10 08:34:34 2025 +0000
Fix crash on recursive /(?{...})/ call
In something like
my $pat = /... (?{ foo() if ...; }) .../;
sub foo { $string =~ $pat; }
foo();
perl would SEGV or produce the wrong values for $1 et al. This commit
fixes that.
Background:
For a compile-time pattern like $foo =~ /..../, the pattern is compiled
at compile time, and the resulting REGEXP SV is stored as a pointer in
the OP_MATCH op (or on threaded builds, stored in PL_regex_pad[],
indexed by the op_pmoffset field in the match op).
For a runtime pattern like
$foo =~ /abc$bar/;
or
$pat = qr/..../;
$foo =~ $pat;
the pattern is compiled by a regcomp op (or for a qr// object,
duplicated), and the resulting REGEX SV is stored by the regcomp op into
its related match (or subst or split etc) op.
This regex will likely have a refcount of 1: i.e. it is only being kept
alive by the link from the match op. Normally this is fine: the regex
lives for as long as the op (and hence the sub it lives in) exist. In
particular, the regex continues to live on after the match is complete,
so that $1 etc will work.
$1 etc work by perl setting PL_curpm to point to the match op which most
recently did a successful match. This is dynamically scoped: on scope
exit, the old value of PL_curpm is restored.
When $1 is accessed, its get-magic is called, which looks up PL_curpm,
gets the regex pointed to by that match op, and that regex contains the
char ranges and match string associated with the most recent match.
The Problem:
That all works well until the`
sub foo { $string =~ $pat; }
from the example above is called recursively from the /(?{...}/). When
foo() is first called, $pat is compiled and the resulting REGEXP is
stored in the OP_MATCH with a ref count of 1. OP_MATCH is then executed,
which calls the regex engine with that regex and string. Part of the
match is a (?{...}) which recursively calls foo(). foo() does an
OP_REGCOMP again, which overwrites the current regex in the OP_MATCH
with a new regex, freeing the old regex (the one we are in the middle of
executing). Cue SEGVs etc.
There is a further complication: PL_curpm points to the current
successful *match op* rather than the current *regex*. When the regex
engine was made accessible via an API, it was possible for the engine to
be running with no active OP_MATCH present. But the design of (?{...})
is such that any partial matches are accessible *during* the execution,
not just after the end of a successful match. So for example
"AB" =~ /^(.) (?{ say "$1$2" } (.)$/x;
will print out "A" and undef. The regex engine handles this by having a
fake global match PMOP structure, PL_reg_curpm, and every time code
within (?{...}) is about to be called, the current regex is pointed to
from PL_reg_curpm, and PL_curpm is set to point to PL_reg_curpm. Since
this is global, it suffers from the same problem as for the recursive
match op, in that the inner call to (?{...}) will overwrite the regex
pointer in PL_reg_curpm, potentially prematurely freeing the regex, and
even if not freed, meaning that on return to the outer pattern, $1 et al
will refer to the inner match, not the current match.
The Solution:
This commit makes use of an existing save/restore mechanism for patterns
involving (?{...}). At the start of the match, S_setup_eval_state() is
called, which saves some state in reginfo->info_aux_eval. On exit from
the match (either normally or via croak), S_cleanup_regmatch_info_aux()
is called to restore stuff.
This commit saves three new things in info_aux_eval.
1) The current REGEXP SV (ref counted). Formerly it stored ReANY(regex);
now it stores regex directly. This ensures the regex isn't freed
during matching (including calls out to code in (?{...}) blocks), but
it doesn't guarantee that it will live on after the end of the match,
to be accessible to $1 etc al.
2) It saves (ref counted) the current value of the regex pointer in
PL_reg_curpm and restores it on return. Thus on return from doing the
inner match, $1 et al will give the current value for any remaining
code within the code block, e.g. /(?{ foo(); print $1 })/
3) If PL_op happens to be a pattern match op (it might not if for
example the engine has been called via the API from XS) then its
regex is saved and restored similar to (2).
The combination of those three extra saves makes it likely that the
regex will not be prematurely freed, and $1 etc will have the right
values at all times.
Note that this commit doesn't fix the general problem of recursively
calling a match op; only the ones involving calls from within a
(?{...}). For example this still prints "BB" rather than "AB":
sub foo {
$_[0] =~ /(.)/;
foo('B') if $_[0] eq 'A';
print $1;
}
foo('A');
Note that the PM_GETRE() and PM_SETRE() macros, which I wanted to use to
save and restore the regex pointer in PL_reg_curpm, do some funny
business: PM_GETRE() returns NULL if the SV isn't a REGEX (e.g. if its
&PL_sv_undef),and PM_SETRE asserts that the regex isn't null. I got
round those side-effects by adding PM_GETRE_raw()/PM_SETRE_raw(), which
do nothing but get/set the regex from the PMOP.
Affected files ...
M op.h
M regexec.c
M regexp.h
M t/re/pat_re_eval.t
…--
31 Dec 1661: "I have newly taken a solemne oath about abstaining from plays".
1 Jan 1662: "And after ... we went by coach to the play".
-- The Diary of Samuel Pepys
|
@iabyn This is a great analysis! @khwilliamson I confirm that the segmentation fault disappeared. The malloc error doesn't appear with the newly compiled perl v5.41.13. It should be noted that with perl v5.40.2:
The 2 "ok" tests that segfault on v5.40.2 don't segfault on v5.41.13. I'm closing this isssue. |
The bug occurs when recursively matching a Regexp reference though an embeded code block using a subroutine call both for the initial call and for the recursive call inside the embeded code block.
The people dealing with regex engine bugs will certainly laugh or curse at the insanity of the bug, this is not lost on me.
The malloc() bug and Segmentation fault resulting from running the script:
The data file:
The golfed test script containing various test cases in order to better understand the context in which the bg occurs. Only 2 of those tests produce a Segmentation fault or a malloc() bug
The text was updated successfully, but these errors were encountered: