Håkon Hægland - 6 months ago
Perl Question

Perl match variables and performance. How does it work?

According to perlvar:


Variables related to regular expressions

These variables are read-only
and dynamically-scoped, unless we note otherwise. The dynamic nature
of the regular expression variables means that their value is limited
to the block that they are in.


and further down:


Traditionally in Perl, any use of any of the three variables $`, $& or $'
anywhere in the code, caused all subsequent successful pattern matches to
make a copy of the matched string, in case the code might subsequently
access one of those variables.


After reading the rest of the section from the documentation, I still missed some information like:


  • Why is a copy made in the first place?

    I think I know the answer to this one: it is fairly clear from the last statement, "in case the code might subsequently access one of those variables". So in my understanding:

    my $s = "Hello world";
    $s =~ s/Hello //;
    say $';


    this would still print "world", since a copy was made before $s was modified.

  • Why is a copy of the whole string done?

    In the previous example, it would suffice to copy only the trailing part of the string, since only $' was used (we did not use $` or $&). So why copy the whole string?

  • Finally: since it says "all subsequent" and not "all subsequent matches in that block", I would like to have the following confirmed:

    my $s = "no\n yesHello world";
    {
    $s =~ /yes/ and say $'; # Note the use of $'
    }
    $s = '12' x 1_000_000;
    my $n = () = $s =~ /2/g;
    say "Found $n matches";


    In this case (since $' is used only in the inner scope), would there be no copying related to the one million successful matches in $s =~ /2/g? (Assuming I did not mention any of $`, $& and $' in the outer scope.)






Note:


  • The question assumes a perl version less than 5.18.
    According to perlvar:


    In Perl 5.18.0 onwards, perl started noting the presence of each of
    the three variables separately, and only copied that part of the
    string required

    In Perl 5.20.0 a new copy-on-write system was enabled by default,
    which finally fixes all performance issues with these three variables,
    and makes them safe to use anywhere.

  • I am only asking this question out of curiosity. I am not intending to actually use any of these variables in my code, since I would like my program to also work efficiently on earlier perl versions (< 5.20).

    Further, as noted in the manual, they can be easily emulated (without the implicit copying and performance hit) using $+[0] and $-[0] (which were introduced in perl version 5.6; I am using perl version 5.22 myself). See perlvar for more information.
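As a rough illustration of the emulation mentioned above, the three variables can be rebuilt with substr from the match offsets in @- and @+. The helper name match_parts is hypothetical and only serves this sketch; it is not taken from perlvar.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (prematch, match, postmatch) for the last
# successful match against $str, without mentioning $`, $& or $'.
# It has to be called before another successful match resets @- and @+.
sub match_parts {
    my ($str) = @_;
    return (
        substr($str, 0, $-[0]),                # same text as $`
        substr($str, $-[0], $+[0] - $-[0]),    # same text as $&
        substr($str, $+[0]),                   # same text as $'
    );
}

my $s = "Hello world";
if ($s =~ /o w/) {
    my ($pre, $match, $post) = match_parts($s);
    print "[$pre] [$match] [$post]\n";    # prints "[Hell] [o w] [orld]"
}
```

Unlike the real match variables, this reads from the current value of the string, so it only gives the right answer as long as the string has not been modified since the match.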


Answer
  • It has to copy the string because the original variable may have been modified between the regex match and the time the match variables are used:

    my $var = "foobar";
    $var =~ /.../ or die;
    $var = "hello";
    print "$& $'";  # outputs "foo bar"
    
  • It has to effectively make a copy of the whole string because the three variables together cover all of it: $` . $& . $' is the original string. As you quoted yourself, in perl 5.18 someone realized that you could track each variable separately and thus only copy the part before the match (or the match itself, or the part after the match) if only that variable appears in the code:

    In Perl 5.18.0 onwards, perl started noting the presence of each of the three variables separately, and only copied that part of the string required

  • The match variables are global. You could at any time call a function (that calls a function that calls a function ...) that accesses $` or $& or $', so perl has to do a full copy after each successful regex match if those variables have been seen anywhere in the code. (Basically, perl can't statically determine that a particular piece of code is never going to access those match variables and thus avoid the copy.)
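To illustrate the last point, here is a small sketch in which the match variables are read inside a function rather than next to the match itself. The sub name report_match is made up for this example. Looking only at the code that performs the match, nothing indicates that $& will be needed later.

```perl
use strict;
use warnings;

# Hypothetical helper: reads the match variables left behind by whatever
# successful match the caller performed last.
sub report_match {
    return sprintf "pre=[%s] match=[%s] post=[%s]", $`, $&, $';
}

my $var = "foobar";
$var =~ /oba/ or die "no match";
$var = "hello";                 # the original string is gone by now
print report_match(), "\n";     # prints "pre=[fo] match=[oba] post=[r]"
```

The output still describes the match against "foobar", which is only possible because perl kept its own copy of the matched string.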