Returning line numbers of a regex match across multiple lines

Developer https://www.devze.com 2023-02-06 20:53 Source: Internet
I'm trying to write a tool that will find empty XML tags which are spanned across multiple lines in a large text file. E.g. don't match:

<tag>
ABC
</tag>

And match:

<tag>
</tag>

I have no problem in writing the regex to match whitespace across multiple lines, but I need to find the line numbers where these matches occur (approximately at least).

I would split my text file into an array, but then it'll be pretty tricky to match across multiple array elements as there may be > 2 lines of tags/whitespace.

Any ideas? My implementation needs to be in Perl. Thanks!


if ($string =~ $regex) {
    print "Match starting line number: ", 1 + substr($string,0,$-[0]) =~ y/\n//, "\n";
}
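
To make the snippet above easier to try, here is a minimal self-contained sketch of the same idea. The sample data, the pattern in $regex, and the helper name line_at are assumptions made for illustration and are not part of the original answer. It relies on $-[0] (offset where the last match starts) and $+[0] (offset just past its end) to report both the first and the last line of every match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; in practice $string would hold the slurped file.
my $string = "<tag>\nABC\n</tag>\n\n<tag>\n\n</tag>\n";
my $regex  = qr{<tag>\s*</tag>};

# Return the 1-based line number of the character at offset $offset
# by counting the newlines that precede it.
sub line_at {
    my ($text, $offset) = @_;
    return 1 + (substr($text, 0, $offset) =~ tr/\n//);
}

my @matches;    # one [first_line, last_line] pair per match
while ($string =~ /$regex/g) {
    # Copy the offsets right away, before anything else can touch @- and @+.
    my ($start, $end) = ($-[0], $+[0]);
    push @matches, [ line_at($string, $start), line_at($string, $end - 1) ];
}

print "Match on lines $_->[0]-$_->[1]\n" for @matches;
```

Each call to line_at rescans the text from the beginning, which is simple but may be slow when a very large file contains many matches.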


For this kind of work, I'd rather use an XML parser and output the line number of the closing empty tag than do some cumbersome regex work.
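
As a rough illustration of the parser-based approach, the sketch below uses the CPAN module XML::Parser and its current_line method. It assumes that the module is installed and that the input is well-formed XML with a single root element; the sample document and the variable names are made up for the example. An element is treated as empty when it contains no child elements and no non-whitespace text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;    # CPAN module, assumed to be installed

# Hypothetical sample; a real run could use $parser->parsefile($path) instead.
my $xml = <<'ENDXML';
<root>
<tag>
ABC
</tag>
<tag>

</tag>
</root>
ENDXML

my @open;     # stack of elements that are currently open
my @empty;    # [name, start_line, end_line] for each empty element

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $name) = @_;
            $open[-1]{content} = 1 if @open;    # the parent now has a child
            push @open, { line => $expat->current_line, content => 0 };
        },
        Char => sub {
            my ($expat, $text) = @_;
            # Only non-whitespace text counts as content.
            $open[-1]{content} = 1 if @open && $text =~ /\S/;
        },
        End => sub {
            my ($expat, $name) = @_;
            my $el = pop @open;
            push @empty, [ $name, $el->{line}, $expat->current_line ]
                unless $el->{content};
        },
    },
);
$parser->parse($xml);

printf "Empty <%s> on lines %d-%d\n", @$_ for @empty;
```

Because the parser streams through the document, this should also cope with elements that span any number of lines, at the cost of requiring well-formed input.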


If there is only one <tag> per line, you can use the special variable $., which contains the current line number of the filehandle that was read last.

#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;

my ($begin, $tag) = (0, '');
while (my $line = <DATA>) {
  chomp $line;
  if ($line =~ m#<(tag).*?>#) {
    $tag = $1;
    $begin = $.;
    next;
  }
  if ($line =~ m#</($tag).*?>#) {
    if ($. - $begin < 2) {
      say "Empty tag '$tag' on lines $begin - $.";
    }
    $begin = 0;
    $tag = '';
  }
}

__DATA__
<tag>
ABC
</tag>

<tag>
</tag>

output:

Empty tag 'tag' on lines 5 - 6
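
The script above only reports a tag whose closing line directly follows its opening line. A possible variation, sketched below under the same one-tag-per-line assumption, also accepts any number of blank lines in between by cancelling the pending tag as soon as a line with real content shows up. The in-memory filehandle and the sample data are stand-ins for a real file and are assumptions of this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle.
my $data = "<tag>\nABC\n</tag>\n\n<tag>\n\n\n</tag>\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @empty;         # [first_line, last_line] for each empty tag
my $begin = 0;     # line of the pending opening tag, 0 if there is none

while (my $line = <$fh>) {
    if ($line =~ m{^\s*<tag>\s*$}) {
        $begin = $.;                          # remember where the tag opened
    }
    elsif ($line =~ m{^\s*</tag>\s*$}) {
        push @empty, [ $begin, $. ] if $begin;
        $begin = 0;
    }
    elsif ($line =~ /\S/) {
        $begin = 0;                           # real content: the tag is not empty
    }
}
close $fh;

print "Empty tag on lines $_->[0] - $_->[1]\n" for @empty;
```

Blank lines fall through all three branches, so they leave the pending opening tag untouched.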


If you need a robust solution, use a real XML parser rather than naive pattern matching.

If you are prepared to use a fragile approach that may not always give the right answers, then see below :-)

#!/usr/bin/perl
use warnings;
use strict;

my $xml =<<ENDXML;
<tag>
stuff
</tag>
<tag>


</tag>
<p>
paragraph
</p>
<tag> </tag>
<tag>
morestuff
</tag>
ENDXML

while ($xml =~ m#(<tag>\s*</tag>)#g) {
    my $tag = $1;

    # count the newlines up to the end of the match to get the line number of </tag>
    my $prev_lines = substr($xml, 0, pos($xml)) =~ tr/\n// + 1;

    # adjust for newlines contained in the matched element itself
    my $tag_lines = $tag =~ tr/\n//;

    my $line = $prev_lines - $tag_lines;
    print "lines $line-$prev_lines\n$tag\n";
}
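
Since the question mentions a large file, note that counting newlines from the start of the text for every match repeats a lot of work. The sketch below is one possible variation that keeps a running line count and only scans the text between the previous match and the current one. The sample data and the variable names are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; in practice $xml would hold the slurped file.
my $xml = "<tag>\nstuff\n</tag>\n<tag>\n\n\n</tag>\n<tag> </tag>\n";

my @found;                         # [first_line, last_line] per match
my ($line, $scanned) = (1, 0);     # $line is the line number at offset $scanned

while ($xml =~ m#<tag>\s*</tag>#g) {
    my ($start, $end) = ($-[0], $+[0]);

    # Advance the running count by the newlines between the last
    # scanned position and the start of this match.
    $line += (substr($xml, $scanned, $start - $scanned) =~ tr/\n//);

    # Newlines inside the match give the line of the closing tag.
    my $last = $line + (substr($xml, $start, $end - $start) =~ tr/\n//);

    push @found, [ $line, $last ];
    ($line, $scanned) = ($last, $end);
}

print "lines $_->[0]-$_->[1]\n" for @found;
```

Every character is counted at most once this way, so the cost grows with the size of the file rather than with the number of matches times the size of the file.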