I have text similar to this:
<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>
and I need to get this text using a regexp:
this is <b>the</b> text
The problem is that when I use a simple regexp like this (<html>.*</p>), I get the whole text up to the last occurrence of </p>.
Can anyone help me?
Thanks, lennyd
You need a non-greedy match:
<html>.*?</p>
Also, you might want to consider using an HTML parser instead of regular expressions for this task.
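For what it's worth, here is a minimal sketch of the parser route in Python, using only the standard library's html.parser. The class name FirstParagraph and the helper first_paragraph are made up for this illustration, and the sketch assumes the first <p> is properly closed and not nested.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Hypothetical helper: collects the inner HTML of the first <p> only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.inside = False  # True while we are inside the first <p>
        self.done = False    # True once the first </p> has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "p" and not self.inside:
            self.inside = True
        elif self.inside:
            # Keep inline tags such as <b> exactly as written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or not self.inside:
            return
        if tag == "p":
            self.inside = False
            self.done = True
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def first_paragraph(html):
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"
print(first_paragraph(sample))  # this is <b>the</b> text
```

This is more code than the one-line regexp, but it does not care about line breaks inside the paragraph or attributes on the tags.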
By default, regular expression quantifiers are greedy, i.e. they match as much as possible, so you get the match of maximum length. To get a non-greedy ("lazy") match, use .*? instead.
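To see the difference on the text from the question, here is a small sketch in Python (the question does not say which language is used, so that choice is an assumption; the quantifier behaviour is the same in most regex flavours). The last pattern, <p>(.*?)</p>, is not from the question; it is added to show how to get just the paragraph contents.

```python
import re

sample = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Greedy: .* runs to the end of the string and backtracks to the LAST </p>.
greedy = re.search(r"<html>.*</p>", sample).group(0)

# Non-greedy: .*? expands one character at a time and stops at the FIRST </p>.
lazy = re.search(r"<html>.*?</p>", sample).group(0)

# A capture group returns only what sits between <p> and </p>.
inner = re.search(r"<p>(.*?)</p>", sample).group(1)

print(greedy)  # <html><p>this is <b>the</b> text</p> and <p>this is another text</p>
print(lazy)    # <html><p>this is <b>the</b> text</p>
print(inner)   # this is <b>the</b> text
```

Note that in Python the dot does not match newlines unless you pass re.DOTALL, so a paragraph that spans several lines needs that flag.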
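To capture the data between paragraph tags you can also use a regexp with a positive look-ahead assertion, /<p>(.*)(?=<\/p>)/. Note that .* here is still greedy, so with several paragraphs it captures everything up to the last </p>; write (.*?) to stop at the first one. The look-ahead form may be slower than plain .*?, but it keeps the closing tag out of the match, which may be helpful for you. Also make sure that your HTML is valid, which means: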
- All paragraph tags are closed. Browsers close an open <p> implicitly when they encounter another block element, so unclosed tags render fine but break the pattern.
- Paragraph tags are not nested :) Otherwise you will have problems with any regex.
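A quick sketch of the look-ahead pattern above, run with Python's re module on the text from the question (Python is an assumption here; the /.../ delimiters are dropped because Python does not use them). It compares the greedy group with a non-greedy one.

```python
import re

sample = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Greedy group: the look-ahead is satisfied at the LAST </p>,
# so everything in between ends up in a single capture.
greedy_la = re.findall(r"<p>(.*)(?=</p>)", sample)

# Non-greedy group: the look-ahead is satisfied at the nearest </p>,
# so each paragraph is captured separately.
lazy_la = re.findall(r"<p>(.*?)(?=</p>)", sample)

print(greedy_la)  # ['this is <b>the</b> text</p> and <p>this is another text']
print(lazy_la)    # ['this is <b>the</b> text', 'this is another text']
```

Because the look-ahead does not consume </p>, the closing tag is never part of the match itself, only the condition for stopping.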
Silly question: still using pure regex, why not just strip any <..> tags inside the paragraphs first, and THEN grab the phrases using something like [^<]?
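Roughly, that two-step idea could look like this in Python. Both patterns are guesses at what was meant, not something given in the thread, and the sketch assumes the <p> tags carry no attributes. Be aware that it throws away the <b> markup, so the result is not exactly the text asked for in the question.

```python
import re

sample = "<html><p>this is <b>the</b> text</p> and <p>this is another text</p></html>"

# Step 1: remove every tag except a bare <p> or </p>.
stripped = re.sub(r"<(?!/?p>)[^>]*>", "", sample)

# Step 2: with no other tags left, [^<]* cannot run past the closing </p>.
phrases = re.findall(r"<p>([^<]*)</p>", stripped)

print(stripped)  # <p>this is the text</p> and <p>this is another text</p>
print(phrases)   # ['this is the text', 'this is another text']
```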