I'm trying to extract two pieces of information from this text:
<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
</div>
<p class="small">
Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
I'd like to get the text after /blogs/ (e.g. "bad-business-writing-487") and also the "Added by" string, i.e. the student name and submit date (e.g. "Kemberley Ramirez on September 2, 2010 at 11:38pm").
I'm using UltraEdit with Perl expressions.
I don't know what exactly you are trying to match, but you are better off using a proper HTML parser:
#!/usr/bin/perl
use strict; use warnings;
use HTML::TokeParser::Simple;
my $parser = HTML::TokeParser::Simple->new(\*DATA);
my $blog_re = qr{^http://english317\.ning\.com/profiles/blogs/(.+)\z};
my $profile_re = qr{^/profile/(\w+)\z};
while ( my $tag = $parser->get_tag('a') ) {
    # Skip anchors that have no href attribute.
    my $href = $tag->get_attr('href');
    next unless defined $href;
    if ( $href =~ $blog_re or $href =~ $profile_re ) {
        print "[$1]\n";
    }
}
__DATA__
<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
</div>
<p class="small">
Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
Using PowerGrep in "dot matches newline" mode, I came up with this:
(?>profiles/blogs/(.*?)").*?added by(.*?)</a>(.*?2010.*?\d{2}[ap]m)
(and then an extra processing search)
<?a.*?>
The /s and /m modifiers control how multiple lines are handled: /s lets . match newlines, and /m lets ^ and $ match at line boundaries. See perlretut.
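To make the difference between the two modifiers concrete, here is a minimal sketch. The two-line string and the variable names are made up for illustration and are not taken from the question.

```perl
use strict; use warnings;

# Hypothetical two-line input, only for demonstrating the modifiers.
my $text = "first line\nsecond line\n";

# Without /s, "." stops at a newline, so this cannot reach the second line.
my $plain  = $text =~ /first.*second/  ? 1 : 0;
# With /s, "." also matches "\n".
my $dotall = $text =~ /first.*second/s ? 1 : 0;
# Without /m, "^" anchors only at the very start of the string.
my $no_m   = $text =~ /^second/        ? 1 : 0;
# With /m, "^" also matches right after each newline.
my $with_m = $text =~ /^second/m       ? 1 : 0;

print "plain=$plain dotall=$dotall no_m=$no_m with_m=$with_m\n";
```

For the text in the question, /s is the relevant modifier, because the two links sit on different lines.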
You probably want something like rrr's regexps with the /s modifier, or something like this (untested):
$foo =~ m|blogs/([^"]+).*Added by <[^>]+>([^<]+)</a>|s
Using m|| instead of // avoids having to escape the slashes.
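As a rough sketch of how this could be run against the sample from the question: the pattern below is a variation on the one above, not the original. It uses a non-greedy .*? and adds a third group for the date, on the assumption that the date always follows "</a> on " on the same line.

```perl
use strict; use warnings;

# Sample text copied from the question.
my $html = <<'END';
<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
</div>
<p class="small">
Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
END

# /s lets "." match newlines, so ".*?" can span the lines between the two links.
# The third group (the date) is an addition to the pattern above.
my ($slug, $name, $date) =
    $html =~ m|blogs/([^"]+).*?Added by <[^>]+>([^<]+)</a> on ([^\n<]+)|s;

print "$slug\n$name on $date\n" if defined $slug;
```

If the page contains several posts, the same pattern could be applied in a loop with the /g modifier, although an HTML parser is likely to be more robust for that.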
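The following should work across multiple lines (in Perl syntax the grouping parentheses must not be escaped, otherwise they match literal parentheses):
.*blogs\/(\S+)".*(?:\n.*)*<a.*>(.*)<\/a>(.*)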
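A small sketch of this line-by-line approach in plain Perl, assuming Perl regex syntax, where grouping uses unescaped parentheses (written here as a non-capturing (?:...) so the capture numbering stays the same). No /s modifier is needed because the newlines are matched explicitly.

```perl
use strict; use warnings;

# Sample text copied from the question.
my $text = <<'END';
<a href="http://english317.ning.com/profiles/blogs/bad-business-writing-487">Continue</a>
</div>
<p class="small">
Added by <a href="/profile/KemberleyRamirez">Kemberley Ramirez</a> on September 2, 2010 at 11:38pm
END

# (?:\n.*)* consumes whole lines until the rest of the pattern can match.
my @parts = $text =~ /.*blogs\/(\S+)".*(?:\n.*)*<a.*>(.*)<\/a>(.*)/;

# The third capture keeps its leading space (" on September ...").
print join("|", @parts), "\n";
```

The pattern relies heavily on backtracking, so it may be slow on large pages and will pick up the last matching link if there are several.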