Extract contents of paragraph tag using a Perl one liner

Developer https://www.devze.com 2023-02-11 21:08 Source: web

I would like to match the contents of a paragraph tag using a Perl regex one-liner. The paragraph is something like this:

<p style="font-family: Calibri,Helvetica,serif;">Text I want to extract</p>

So far I have been using something like this:

perl -nle 'm/<p>($.)<\/p>/ig; print $1' file.html

Any ideas appreciated

thanks


Mandatory link to what happens when you try to parse HTML with regular expressions.

David Dorward's comment suggesting HTML::TreeBuilder is a good one. Another good way to do this is with HTML::DOM:

perl -MHTML::DOM -e 'my $dom = HTML::DOM->new(); $dom->parse_file("file.html"); my @p = $dom->getElementsByTagName("p"); print $p[0]->innerText();'
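
For comparison, here is a rough sketch of the HTML::TreeBuilder route mentioned above. It assumes the HTML-Tree distribution from CPAN is installed (the script checks and says so if it is not). The sample file contents and the names tmp and tb_result are made up for illustration.

```shell
# Hypothetical sample file modeled on the paragraph in the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<html><body>
<p style="font-family: Calibri,Helvetica,serif;">Text I want to extract</p>
</body></html>
EOF

# HTML::TreeBuilder is not a core module, so check for it first.
if perl -MHTML::TreeBuilder -e 1 2>/dev/null; then
  tb_result=$(perl -MHTML::TreeBuilder -le '
    my $tree = HTML::TreeBuilder->new_from_file(shift);   # parse the file named in @ARGV
    print $_->as_text for $tree->look_down(_tag => "p");  # text of every <p> element
    $tree->delete;                                        # free the parse tree
  ' "$tmp")
else
  tb_result="HTML::TreeBuilder is not installed"
fi
printf '%s\n' "$tb_result"
rm -f "$tmp"
```

Because the parser builds a real tree, this should not care about attributes on the tag or about a paragraph spanning several lines, which is where the regex approaches get fragile.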


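In the matching part, $ means end of string, so it does not do what you want there. You also need to allow for the attributes inside the p tag and match the contents non-greedily, and print only when the line actually matched:

perl -nle 'print $1 if m/<p\b[^>]*>(.+?)<\/p>/i' test.html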
