How to restore HTML brackets that have been replaced?

开发者 (Developer) https://www.devze.com 2023-03-14 00:21 Source: web

I'm working with a database that has content where the angled brackets have been replaced with the character ^.

e.g.

^b^some text^/b^

Can anyone please recommend a C# solution to convert the ^ character back to the appropriate bracket, so it can be displayed as HTML? I'm guessing some kind of regex will do the job...?

Thanks in advance

You can replace every nth ^ character with < where n is odd and > where n is even (counting from 1).

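using System.Text.RegularExpressions;

var html = "^b^some text^/b^";

// n counts the carets replaced so far: even counts (0, 2, ...) open a tag, odd counts close it.
var n = 0;
var result = Regex.Replace(html, "\\^", m => ((n++ % 2) == 0) ? "<" : ">");
// result == "<b>some text</b>"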

Note that this works only as long as the original HTML code contains a closing > character for every < character (<p<b>... is bad) and that there were no ^ characters in the original HTML code (<b>2^5</b> is bad).
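One way to catch both of those cases before they reach the page is to count the carets first. The sketch below is only an illustration: `CaretRestorer` and `TryRestore` are made-up names, and an even count is a necessary condition, not proof that the pairing is right (a stray ^ combined with a missing bracket would still slip through).

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class CaretRestorer
{
    // Hypothetical helper: converts only when the caret count is even.
    // On an odd count it returns false and hands the input back unchanged,
    // so the row can be set aside for manual review.
    public static bool TryRestore(string input, out string restored)
    {
        restored = input;
        if (input.Count(c => c == '^') % 2 != 0)
            return false;

        var n = 0;
        restored = Regex.Replace(input, "\\^", m => (n++ % 2 == 0) ? "<" : ">");
        return true;
    }
}
```

Both bad examples above (`^p^b^...` and `^b^2^5^/b^`) contain an odd number of carets, so they would be rejected rather than converted incorrectly.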

A more complicated, but possibly safer, solution would be to search for specific sets of characters, such as ^p, ^img, ^div, etc. and their counterparts, ^/p^, ^/div^, ^/img^, etc., and replace each of them specifically.
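As a rough sketch of that idea, a single regex with a whitelist of tag names could look like the following. The tag list, the class name `TagWhitelistRestorer` and the method `Restore` are assumptions for illustration; the real list has to come from whatever tags are actually in the data.

```csharp
using System.Text.RegularExpressions;

public static class TagWhitelistRestorer
{
    // Hypothetical whitelist; replace it with the tags found in your data.
    // Matches ^tag^, ^/tag^ and ^tag attributes^, where the attribute part
    // is any run of non-caret characters after a whitespace character.
    private static readonly Regex KnownTag = new Regex(
        @"\^(/?(?:p|b|i|div|span|img|a)(?:\s[^^]*)?)\^",
        RegexOptions.IgnoreCase);

    // Carets that are not part of a whitelisted tag are left untouched.
    public static string Restore(string input) =>
        KnownTag.Replace(input, "<$1>");
}
```

With this, `^b^2^5^/b^` comes out as `<b>2^5</b>`, because the caret inside the text is not followed by a known tag name. The whitelist has to be complete, though: a tag that is missing from it stays unconverted, and its carets can then pair up with neighbouring text and produce a false match.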

Whether this is feasible depends on which tags exist in the data and how much effort you are willing to put in to do this securely. Do you know if there is a finite set of tags that have been used? Was the HTML generated, or is there a chance that someone has edited it manually, which would make the pattern searching more complicated?

Maybe you could first do some analysis, for instance searching and listing the various instances where the character ^ occurs? How much data are we talking about, and is it static, or will it continue to grow (including the ^-problem)?
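A minimal sketch of such an analysis, assuming the rows have already been loaded into memory as strings (how you read them from the database is up to you). `CaretSurvey` and `CountTokens` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CaretSurvey
{
    // Counts every distinct ^...^ token across all rows, most frequent first.
    public static List<KeyValuePair<string, int>> CountTokens(IEnumerable<string> rows)
    {
        var counts = new Dictionary<string, int>();
        foreach (var row in rows)
        {
            // Each match runs from one caret to the next caret.
            foreach (Match m in Regex.Matches(row, @"\^[^^]*\^"))
            {
                counts.TryGetValue(m.Value, out var seen);
                counts[m.Value] = seen + 1;
            }
        }
        return counts.OrderByDescending(kv => kv.Value).ToList();
    }
}
```

The frequent entries should look like tags (`^b^`, `^/p^`, ...). Entries that look like ordinary text are a hint that the carets in that row are not paired the way you expect.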

Tricky, to the point of being impossible to do perfectly automatically, unless you can make some very convenient assumptions about the original HTML (that it is a small subset of all possible HTML, that it was known to conform to certain predictable patterns). I think in the end there's going to have to be some hand editing.

Having said that, and apologies for not including any actual C# code, here's how I'd consider approaching it.

Let's go after the problem incrementally, converting common patterns first. The goal after every step is to reduce the number of remaining ^ characters.

So first, regex-replace lots of very common literal patterns

^p^ -> <p>
^div^ -> <div>
^/div^ -> </div>

etc.

Next, replace patterns that contain optional text, like

^link[anything-except-^]^ -> <link[original-text]>

and on and on. My approach is to replace only expected patterns, and by doing that, avoid false matches. Then iterate with other patterns until there are no ^ chars left. This takes lots of inspection of data, and lots of patterns. It's brute force, not smart, but there you go.
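For what it's worth, here is roughly how that loop could be sketched in C#. The rule list is only the handful of patterns mentioned above, and `IncrementalRestorer` and `Apply` are names I made up; the real rule list would grow as you inspect the data.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class IncrementalRestorer
{
    // Ordered (pattern, replacement) rules; illustrative only.
    private static readonly (string Pattern, string Replacement)[] Rules =
    {
        (@"\^p\^", "<p>"),
        (@"\^/p\^", "</p>"),
        (@"\^div\^", "<div>"),
        (@"\^/div\^", "</div>"),
        // Tag with optional text: keep everything up to the next caret.
        (@"\^(link\b[^^]*)\^", "<$1>"),
    };

    // Applies every rule in order and reports how many carets are left,
    // so you can tell whether more patterns (or hand editing) are needed.
    public static string Apply(string input, out int remainingCarets)
    {
        var text = input;
        foreach (var (pattern, replacement) in Rules)
            text = Regex.Replace(text, pattern, replacement);
        remainingCarets = text.Count(c => c == '^');
        return text;
    }
}
```

Running it over all rows and summing `remainingCarets` gives a number you can watch drop as patterns are added. Rows where it never reaches zero are the candidates for hand editing.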