.Net Regular Expression to get parenthetical text at end of <p> tags

开发者 https://www.devze.com 2022-12-26 09:08 Source: Internet
I have a simple pattern I am trying to match: any characters between parentheses at the end of an HTML paragraph. I run into trouble whenever there are additional parentheticals in that paragraph.

For example:

If the input string is "..... (321)</p>", I want to get the value (321).

However, if the paragraph contains "... (123) (321)</p>", my regex returns "(123) (321)" (everything between the first opening "(" and the last closing ")").

I am using the regex pattern "\s\(.+\)</p>".

How can I grab the correct value (using VB.NET)?

This is what I'm doing so far:

    Dim reg As New Regex("\s\(.+\)</P>", RegexOptions.IgnoreCase)
    Dim matchC As MatchCollection = reg.Matches(su.Question)
    If matchC.Count > 0 Then
        Dim lastMatch As Match = matchC(matchC.Count - 1)
        Dim DesiredValue As String = lastMatch.Value
    End If


Just change the expression to non-greedy and reverse the match order:

Dim reg As New Regex("\s\(.+?\)</P>", RegexOptions.IgnoreCase Or RegexOptions.RightToLeft)

Or make it match only one closing parenthesis:

"\s\([^)]+\)</P>"

Or make it match only digits inside your parentheses:

"\s\(\d+\)</P>"

Edit: for the non-greedy sample to work, you need to pass the RegexOptions.RightToLeft option when constructing the Regex object.
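
As a rough sketch of the right-to-left variant in use: the sample input string and the extra capture group around the parenthetical are assumptions added for illustration, not part of the original answer. The capture group is there so the result excludes the leading whitespace and the closing tag.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RightToLeftDemo
    Sub Main()
        ' Hypothetical sample input with two parentheticals.
        Dim input As String = "<p>Some text (123) (321)</p>"

        ' RightToLeft makes the engine start matching at the end of the string,
        ' so the lazy .+? grows leftward only until it reaches the nearest "(".
        Dim reg As New Regex("\s(\(.+?\))</P>", RegexOptions.IgnoreCase Or RegexOptions.RightToLeft)

        Dim m As Match = reg.Match(input)
        If m.Success Then
            ' Group 1 holds only the parenthetical, without the leading
            ' whitespace or the closing tag.
            Console.WriteLine(m.Groups(1).Value)
        End If
    End Sub
End Module
```

With RightToLeft set, the first match returned is the one closest to the end of the input, so there should be no need to walk a MatchCollection to find the last item.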


Dim reg As New Regex("\s\(\d+\)</P>", RegexOptions.IgnoreCase)

Your stumbling block was the insufficient specificity of the . (it matches any character except a newline, including parentheses) and the greediness of the + (it matches as much as possible).

Just be more specific (\d+) or less greedy (.+?).
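
To make the difference concrete, here is a small sketch comparing the greedy pattern with a more specific one. The sample input is made up, and the negated class [^)]+ with a capture group is used here as one possible "more specific" choice; \d+ would work the same way for purely numeric content.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' Hypothetical sample input with two parentheticals.
        Dim input As String = "<p>Some text (123) (321)</p>"

        ' Greedy and unrestricted: .+ runs from the first "(" to the last ")".
        Dim greedy As Match = Regex.Match(input, "\s\(.+\)</P>", RegexOptions.IgnoreCase)
        Console.WriteLine(greedy.Value)              ' prints " (123) (321)</p>" (with the leading space)

        ' More specific: [^)]+ cannot cross a ")", so only the last group
        ' can be followed directly by the closing tag.
        Dim specific As Match = Regex.Match(input, "\s(\([^)]+\))</P>", RegexOptions.IgnoreCase)
        Console.WriteLine(specific.Groups(1).Value)  ' prints "(321)"
    End Sub
End Module
```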


You can use a lookahead (?= ) to anchor the pattern. It tells the regex engine what must follow the match without including that text in the result. Here is an example which gets the ( ) data that sits immediately before the closing p tag:

(?:\()([^)]+)(?:\))(?=</[pP]>)


(?:\()        - Match, but don't capture, a (
([^)]+)       - Capture everything up to the next ). [^ ] is a negated character set
(?:\))        - Match, but don't capture, the )
(?=</[pP]>)   - Lookahead: require a following </p> or </P> without consuming it
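
A minimal sketch of how this pattern might be used from VB.NET (the sample input is an assumption for illustration). Note that the whole match includes the parentheses, while capture group 1 holds only the text between them:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module LookaheadDemo
    Sub Main()
        ' Hypothetical sample input with two parentheticals.
        Dim input As String = "<p>Some text (123) (321)</p>"

        ' The lookahead requires the closing tag right after ")", but the tag
        ' itself is not part of the match.
        Dim m As Match = Regex.Match(input, "(?:\()([^)]+)(?:\))(?=</[pP]>)")
        If m.Success Then
            Console.WriteLine(m.Value)             ' whole match: (321)
            Console.WriteLine(m.Groups(1).Value)   ' captured text: 321
        End If
    End Sub
End Module
```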

HTH
