Regular Expression to capture the first <p> of HTML

Developer https://www.devze.com 2023-01-01 22:29 Source: Internet

I have the following regular expression:

(?:<(?<tag>\w*)>(?<text>.*)</\k<tag>>)

I want it to grab the text within the first HTML element.

e.g.

<p>This should capture</p>This shouldn't

Works, but ...

<p>This should capture</p><p>This shouldn't</p>

Doesn't work. As you'd expect, it returns:

This should capture</p><p>This shouldn't

I'm racking my brains here. How can I just have it select the FIRST inner text?

(I'm trying to be tag-agnostic, so <strong>This should match</strong> is equally appropriate, etc.)


You should use the HTML Agility Pack.

For example:

doc.DocumentNode.Descendants("p").First().InnerText


Stop. Just stop. If you are parsing HTML, use an HTML parser (or XML if you're dealing with valid XHTML). See this answer for more info.
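
As a rough illustration of the parser approach, here is a minimal sketch in Python, with the standard-library html.parser standing in for HTML Agility Pack. The class FirstElementText and the function first_inner_text are hypothetical names made up for this example, and the list of void tags is a deliberately incomplete assumption.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of tags that never have a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class FirstElementText(HTMLParser):
    """Hypothetical helper: collects the text inside the first element seen."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the first element
        self.done = False   # True once the first element has been closed
        self.parts = []     # text fragments found inside the first element

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        self.depth += 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if not self.done and self.depth > 0:
            self.parts.append(data)


def first_inner_text(html):
    """Return the text content of the first element in `html`, tag-agnostic."""
    parser = FirstElementText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(first_inner_text("<p>This should capture</p><p>This shouldn't</p>"))
```

Because the parser tracks nesting depth, this also copes with nested elements of the same tag, which a backreference-based regex does not.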


To make the * quantifier non-greedy (lazy), add a ? after it, so that it stops at the first closing tag instead of the last one:

(?:<(?<tag>\w*)>(?<text>.*?)</\k<tag>>)
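
To see the difference, here is a small sketch in Python. Note that Python's re module spells named groups (?P<name>...) and backreferences (?P=name), where the .NET pattern above uses (?<name>...) and \k<name>; the variable names are just for illustration.

```python
import re

# Same patterns as above, translated to Python's named-group syntax.
GREEDY = re.compile(r"<(?P<tag>\w*)>(?P<text>.*)</(?P=tag)>")
LAZY = re.compile(r"<(?P<tag>\w*)>(?P<text>.*?)</(?P=tag)>")

html = "<p>This should capture</p><p>This shouldn't</p>"

# Greedy .* runs to the end and backtracks to the LAST matching close tag.
greedy_text = GREEDY.search(html).group("text")

# Lazy .*? expands one character at a time and stops at the FIRST close tag.
lazy_text = LAZY.search(html).group("text")

print(greedy_text)  # This should capture</p><p>This shouldn't
print(lazy_text)    # This should capture
```

The lazy version still breaks on nested elements with the same tag name: for <div><div>a</div>b</div> it captures <div>a, which is one reason the other answers recommend a real parser.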