Capture a part of a string that does not match another group (C# Regex)

开发者 (Devze) https://www.devze.com 2022-12-13 12:04 Source: Internet
I am working on a project that requires the parsing of "formatting tags." By using a tag like this: <b>text</b>, it modifies the way the text will look (that tag makes the text bold). You can have up to 4 identifiers in one tag (b for bold, i for italics, u for underline, and s for strikeout).

For example:

<bi>some</b> text</i> here would produce "some" in bold italics, " text" in italics only, and " here" with no formatting.

To parse these tags, I'm attempting to use a RegEx to capture any text before the first opening tag, and then capture any tags and their enclosed text after that. Right now, I have this:

<(?<open>[bius]{1,4})>(?<text>.+?)</(?<close>[bius]{1,4})>

That matches a single tag, its enclosed text, and a single corresponding closing tag.
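As a minimal sketch of what that pattern captures, the snippet below runs it once against a sample string. The class name and the sample input are made up for illustration; the pattern itself is the one from the question, unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

class SingleTagDemo
{
    static void Main()
    {
        // The pattern from the question, unchanged.
        var tag = new Regex(@"<(?<open>[bius]{1,4})>(?<text>.+?)</(?<close>[bius]{1,4})>");

        // Hypothetical sample input with plain text before the first tag.
        string input = "plain <bi>some</b> text</i> here";

        Match m = tag.Match(input);
        if (m.Success)
        {
            Console.WriteLine(m.Index);                     // 6: where the opening '<' sits
            Console.WriteLine(input.Substring(0, m.Index)); // "plain ": text before the tag
            Console.WriteLine(m.Groups["open"].Value);      // "bi"
            Console.WriteLine(m.Groups["text"].Value);      // "some"
            Console.WriteLine(m.Groups["close"].Value);     // "b"
        }
    }
}
```

Regex.Match already scans forward from the start of the input, so Match.Index gives the position of the first tag without testing every substring by hand. Note that the trailing " text</i> here" is left unmatched, because the pattern expects exactly one opening and one closing tag.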

Right now, I iterate through every single character and attempt to match the position in the string I'm at to the end of the string, e.g. I attempt to match the whole string at i = 0, a substring from position 1 to the end at i = 1, etc.

However, this approach is incredibly inefficient. It seems like it would be better to match the entire string in one RegEx instead of manually iterating through the string.

My actual question is: is it possible to match a string that does not match a group, such as a tag? I've Googled this without success, but perhaps I haven't been using the right words.


I think trying to parse and validate the entire text in one regular expression is likely to give you problems. The text you are parsing is not a regular language, so regular expressions are not well designed for this purpose.

Instead I would recommend that you first tokenize the input to single tags and text between the tags. You can use a simple regular expression to find single tags - this is a much simpler problem that regular expressions can handle quite well. Once you have tokenized it, you can iterate over the tokens with an ordinary loop and apply formatting to the text as appropriate.
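A rough sketch of that tokenize-then-loop idea follows. The single-tag pattern, the TagTokenizer class, and the choice to represent active formatting as a sorted string of identifiers are all assumptions made for illustration, not something the answer specifies. The sketch also does no validation: a closing tag for formatting that was never opened is silently ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagTokenizer
{
    // Hypothetical single-tag pattern: an optional '/' followed by 1 to 4 identifiers.
    static readonly Regex Tag = new Regex(@"<(?<slash>/)?(?<ids>[bius]{1,4})>");

    // Returns runs of text paired with the identifiers active for that run.
    public static List<(string Text, string Active)> Parse(string input)
    {
        var runs = new List<(string Text, string Active)>();
        var active = new SortedSet<char>();
        int pos = 0;

        foreach (Match m in Tag.Matches(input))
        {
            // Text between the previous tag and this one keeps the current formatting.
            if (m.Index > pos)
                runs.Add((input.Substring(pos, m.Index - pos), string.Concat(active)));

            // An opening tag switches its identifiers on; a closing tag switches them off.
            bool closing = m.Groups["slash"].Success;
            foreach (char id in m.Groups["ids"].Value)
            {
                if (closing) active.Remove(id);
                else active.Add(id);
            }
            pos = m.Index + m.Length;
        }

        // Whatever follows the last tag.
        if (pos < input.Length)
            runs.Add((input.Substring(pos), string.Concat(active)));
        return runs;
    }

    static void Main()
    {
        foreach (var (text, active) in Parse("plain <bi>some</b> text</i> here"))
            Console.WriteLine($"[{active}] \"{text}\"");
        // Prints:
        // [] "plain "
        // [bi] "some"
        // [i] " text"
        // [] " here"
    }
}
```

Because each tag only toggles identifiers, overlapping tags such as <bi>...</b>...</i> need no special handling, and the input is scanned once rather than once per character.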


Try prefixing your regex with ^(.*?) (match any characters from the beginning of the string, non-greedy). Thus it will match anything at all that occurs at the start of the string, but it will match as little as it can while still having the rest of the regex match. Thus you'll grab all of the stuff that wasn't matched normally in that first capture group.
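Applied to the pattern from the question, that suggestion might look like the sketch below. The group name "before" and the sample input are made up for illustration; the answer itself uses an unnamed group.

```csharp
using System;
using System.Text.RegularExpressions;

class LeadingTextDemo
{
    static void Main()
    {
        // The question's pattern with a lazy capture in front, anchored at the start.
        var pattern = new Regex(
            @"^(?<before>.*?)<(?<open>[bius]{1,4})>(?<text>.+?)</(?<close>[bius]{1,4})>");

        Match m = pattern.Match("plain <bi>some</b> text</i> here");
        if (m.Success)
        {
            string before = m.Groups["before"].Value; // "plain "
            string open = m.Groups["open"].Value;     // "bi"
            string text = m.Groups["text"].Value;     // "some"
            string close = m.Groups["close"].Value;   // "b"
            Console.WriteLine($"before=\"{before}\" open={open} text=\"{text}\" close={close}");
        }
    }
}
```

In .NET, "." does not match a newline by default, so if the leading text can span several lines the pattern would also need RegexOptions.Singleline. This still captures only the text before the first tag; anything after the first closing tag is left over.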


Why don't you use an HTML parser for this?

You should be using an XML parser, not regexes. XML is not a regular language, hence not easily parseable by a regular expression. Don't do it.

Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard, and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input. Even if it does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
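For reference, a minimal sketch of the parser route using System.Xml.Linq is shown below. The wrapping <root> element and the sample strings are assumptions made for illustration. It only applies where the tags nest properly: the overlapping example from the question is not well-formed XML, so the parser rejects it.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

class XmlParserDemo
{
    static void Main()
    {
        // Well-nested markup parses once it is wrapped in a single root element.
        XElement root = XElement.Parse("<root>plain <b>bold <i>both</i></b> tail</root>");
        foreach (XNode node in root.DescendantNodes())
        {
            // Print each text node together with the name of its enclosing element.
            if (node is XText t)
                Console.WriteLine($"{t.Parent.Name}: \"{t.Value}\"");
        }

        // The question's overlapping tags make the parser throw.
        try
        {
            XElement.Parse("<root><bi>some</b> text</i> here</root>");
        }
        catch (XmlException e)
        {
            Console.WriteLine("Not well-formed: " + e.Message);
        }
    }
}
```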
