Is there an efficient way to开发者_运维问答 substitute a bunch a strings using values from a Perl hash?
For example,
$regex{foo} = "bar";
$regex{hello} = "world";
$regex{python} = "perl";
open(F, "myfile.txt");
while (<F>) {
foreach $key (keys %regex) {
s/$key/$regex{$key}/g;
}
}
close(F);
Is there a way to accomplish the above in Perl?
First question: are you sure that what you have is inefficient?
Second, the most obvious next step would be to pull everything into a single regex:
my $check = join '|', keys %regex;
And then you can do the substitution as:
s/($check)/$regex{$1}/g;
This can still be "slow" with sufficient overlap of the keys where the regex engine has to recheck the same letters constantly. You can possibly use something like Regexp::Optimizer to eliminate the overlap. But the cost of optimising may be more than the cost of just doing everything, depending on how many changes (key/values in your hash) and how many lines you're modifying. Premature optimisation-- !
Note that, of course, your example code isn't doing anything with the text after the substitution. It won't modify the file in-place, so I'm assuming you're handling that separately.
Define a regexp that matches any of the keys.
$regex = join("|", map {quotemeta} keys %regex);
Replace any match of $regex
by $regex{$1}
.
s/($regex)/$regex{$1}/go;
Omit the o
modifier if $regex
changes during the execution of the program.
Note that if there are keys that are a prefix of another key (e.g. f
and foo
), whichever comes first in the joined regexp will be seen as a match (e.g. f|foo
matches f
but foo|f
matches foo
in foobar
). If that can happen, you may need to sort keys %regex
according to which match you want to win. (Thanks to ysth for pointing this out.)
To prove the point of eval
and also out of curiosity, I ran some tests with the OP's code vs. the $regex{$1}
approach vs. the eval
approach.
First off, there seems to be little value in cramming every possible token in a (token|token|...)
match expression. Perl needs to check against all tokens at once -- it is debatable how much more efficient this is than simply checking every token at a time and doing the replacement with a hardcoded value.
Secondly, doing $regex{$1}
means the hashmap key is extracted on every match.
Anyway, here are some numbers (ran this on strawberry 5.12, with a 4MB file of 100K lines):
- The
$regex{$1}
approach takes 6 seconds (5 seconds with /go instead of /g) - The
tie
approach takes 10 seconds - The OP approach takes a bit under 1 second (with /go instead of /g)
- The
eval
approach takes less than 1 second (faster than the OP code)
This is the eval
approach:
$regex{foo} = "bar";
$regex{hello} = "world";
$regex{python} = "perl";
$regex{bartender} = "barista";
$s = <<HEADER;
\$start = time;
open(F, "myfile.txt");
while (<F>) {
HEADER
foreach $key (keys %regex) {
$s .= "s/$key/$regex{$key}\/go;\n"
}
$s .= <<FOOTER;
print \$_;
}
close(F);
print STDERR "Elapsed time (eval.pl): " . (time - \$start) . "\r\n";
FOOTER
eval $s;
perl -e ' \
my %replace = (foo=>bar, hello=>world, python=>perl); \
my $find = join "|", sort keys %replace; \
my $str = "foo,hello,python"; \
$str =~ s/($find)/$replace{$1}/g; \
print "$str\n\n"; \
'
Something you may want to consider is not going line-by-line of the file, but instead processing the whole file at once and use the /s
modifier on your regex for single-line mode.
The begin:
#!/usr/bin/perl
use strict;
use Tie::File;
my %tr=( 'foo' => 'bar',
#(...)
);
my $r =join("|", map {quotemeta} keys %tr);
$r=qr|$r|;
with big files use:
tie my @array,"Tie::File",$ARGV[0] || die;
for (@array) {
s/($r)/$tr{$1}/g;
}
untie @array;
with small files use:
open my $fh,'<',$ARGV[0] || die;
local $/ = undef;
my $t=<$fh>;
close $fh;
$t=~s/($r)/$tr{$1}/g;
open $fh,'>',$ARGV[0] || die;
print $fh $t;
close $fh;
What you have works as is, so it's not clear what your request is.
One catch: The code you posted may have problems with double substitutions depending on the contents of %regex
and/or $_
. For example,
my %regex = (
foo => 'bar',
bar => 'foo',
);
The solution is to move the foreach into the pattern, so to speak.
my $pat =
join '|',
map quotemeta, # Convert text to regex patterns.
keys %regex;
my $re = qr/$pat/; # Precompile for efficiency.
my $qfn = 'myfile.txt'
open(my $fh, '<', $qfn) or die "open: $qfn: $!";
while (<$fh>) {
s/($re)/$regex{$1}/g;
... do something with $_ ...
}
This is an old question, so I'm surprised no one has yet suggested the obvious: pre-compile each of the regexps (i.e. the hash keys).
$regex{qr/foo/} = 'bar';
$regex{qr/hello/} = 'world';
$regex{qr/python/} = 'perl';
open(F, "myfile.txt");
while (<F>) {
foreach $key (keys %regex) {
s/$key/$regex{$key}/g;
}
}
close(F);
or for (IMO) greater readability:
%regex = (
qr/foo/ => 'bar',
qr/hello/ => 'world',
qr/python/ => 'perl',
);
If you know that there can only be one possible match per input line then skipping the remaining regexps with last
after a successful match will also help if there are a lot of keys. e.g. inside the for
loop:
s/$key/$regex{$key}/g && last;
精彩评论