Possible Duplicate:
Replace all < and > that are NOT part of an HTML tag
- Using Python
- I know how much everyone here hates regex questions surrounding HTML tags, but I am just doing this as an exercise to help me learn regex.
Replace (1 can be any character):
<b>< </b>
<b> < </b>
<b> <</b>
<b><</b>
<b><111</b>
<b>11<11</b>
<b>111<</b>
<b>11<11</b>
<b>
<<<
</b>
With:
<b>& </b>
<b> & </b>
<b> &</b>
<b>&</b>
<b>&111</b>
<b>11&11</b>
<b>111&</b>
<b>11&11</b>
<b>
&
</b>
I have searched the interwebs and tried many of my own solutions. Please, is this possible? And if so, how?
My best guess was something like:
re.sub(r'(?<=>)(.*?)<(.*?)(?=</)', r'\1<\2', string)
But that falls apart with re.DOTALL and '<<<'+ etc.
I sincerely hope this is never used on actual HTML, but here is a solution that works for your example data. Note that it replaces with < like your sample code, not & like in your sample data.
re.sub(r'<+([^<>]*?)(?=</)', r'<\1', your_string)
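To make it easier to try either replacement, here is a minimal sketch that wraps the pattern above in a helper. The function name `replace_stray_lt` and its `replacement` parameter are hypothetical, added only for illustration; the regex itself is the one from this answer.

```python
import re

# Pattern from the answer above: a run of '<', then (lazily) any text
# without angle brackets, as long as a closing tag ("</") follows.
STRAY_LT = re.compile(r'<+([^<>]*?)(?=</)')

def replace_stray_lt(text, replacement="&"):
    # `replacement` is a hypothetical parameter for illustration.
    # A function is passed to sub() so the replacement text is inserted
    # literally, without backslash escapes being interpreted.
    return STRAY_LT.sub(lambda m: replacement + m.group(1), text)

print(replace_stray_lt("<b>11<11</b>"))           # '&' as in the sample data
print(replace_stray_lt("<b>11<11</b>", "&lt;"))   # the HTML entity instead
print(repr(replace_stray_lt("<b>\n<<<\n</b>")))   # multi-line sample
```

Because `[^<>]` also matches newlines, the multi-line sample is handled without re.DOTALL.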
You could use something like this:
re.sub(r'(?:<(?!/?b>))+', '&', string)
And if you want it to work with (some) other tags, you could use something like this:
re.sub(r'(?:<(?!/?\w+[^<>]*>))+', '&', string)
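A quick way to see the difference between the two patterns is to run them side by side. This is only a sketch: the names `B_ONLY` and `ANY_TAG` and the extra inputs using `<i>` and `<a>` are assumptions for illustration, not part of the question's data.

```python
import re

# The two patterns from the answer above.
B_ONLY = re.compile(r'(?:<(?!/?b>))+')             # only <b> and </b> count as tags
ANY_TAG = re.compile(r'(?:<(?!/?\w+[^<>]*>))+')    # anything shaped like <word ...>

# Both handle the question's samples, including the multi-line one.
for s in ["<b><</b>", "<b>11<11</b>", "<b>\n<<<\n</b>"]:
    print(repr(B_ONLY.sub('&', s)), repr(ANY_TAG.sub('&', s)))

# The first pattern treats every tag other than <b> as stray brackets.
print(B_ONLY.sub('&', "<i>x</i>"))

# The second one leaves other tags, with or without attributes, alone.
print(ANY_TAG.sub('&', '<a href="x">1<2</a>'))

# Limitation: text that merely looks like a tag is left untouched,
# because "<2 and 3>" has the shape <word ...>.
print(ANY_TAG.sub('&', "<i>1<2 and 3>2</i>"))
```

The last call is the reason for the "(some)" above: the lookahead only checks the shape of what follows the `<`, so it cannot tell a real tag from text that resembles one.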
If a is your string, this seems to work:
re.sub('<+([^b/])','&\\1',a)
and a second version, more generic...
re.sub('(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)','\\1\\2&\\3\\4',a)
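For comparison, here is a small sketch that wraps both versions in functions and tries them on a few inputs. The function names and the `<i>` and `1<2<3` inputs are assumptions added for illustration; only the `<b>` strings come from the question.

```python
import re

def simple_version(a):
    # First regex: a run of '<' followed by any character except 'b' or '/'.
    return re.sub('<+([^b/])', '&\\1', a)

def generic_version(a):
    # Second regex: tag, text, run of '<', text, tag.
    return re.sub('(<[^<>]+>)([^<>]*)<+([^<>]*)(<[^<>]+>)', '\\1\\2&\\3\\4', a)

for s in ["<b><</b>", "<b>11<11</b>", "<b>\n<<<\n</b>"]:
    print(repr(simple_version(s)), repr(generic_version(s)))

# The simple version only knows about <b>; other tags get mangled.
print(simple_version("<i>1<2</i>"))
print(generic_version("<i>1<2</i>"))

# The generic version expects a single run of '<' per element, so an
# element with two separate '<' characters is left unchanged.
print(generic_version("<b>1<2<3</b>"))
```

So the second version is more generic about tag names, but it seems to assume at most one run of `<` between an opening and a closing tag.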
This tested regex works for your given test data:
reobj = re.compile(r"""
# Match left angle brackets not part of HTML tag.
<+ # One or more < but only if
(?=[^<>]*</\w+) # inside HTML element contents.
""", re.VERBOSE)
result = reobj.sub("&", subject)
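If you want to check this against the test data yourself, something like the sketch below should do. The `cases` list is copied from the question (the duplicated `11<11` line is listed once), and the name `mismatches` is just an illustrative helper.

```python
import re

reobj = re.compile(r"""
    # Match left angle brackets not part of HTML tag.
    <+                 # One or more < but only if
    (?=[^<>]*</\w+)    # inside HTML element contents.
    """, re.VERBOSE)

# (input, expected) pairs from the question; "1" stands for any character.
cases = [
    ("<b>< </b>", "<b>& </b>"),
    ("<b> < </b>", "<b> & </b>"),
    ("<b> <</b>", "<b> &</b>"),
    ("<b><</b>", "<b>&</b>"),
    ("<b><111</b>", "<b>&111</b>"),
    ("<b>11<11</b>", "<b>11&11</b>"),
    ("<b>111<</b>", "<b>111&</b>"),
    ("<b>\n<<<\n</b>", "<b>\n&\n</b>"),
]

# Collect every case where the substitution differs from the expected text.
mismatches = [(subject, reobj.sub("&", subject))
              for subject, expected in cases
              if reobj.sub("&", subject) != expected]
print(mismatches)
```

An empty list means every sample from the question came out as expected. It says nothing about real-world HTML, which this regex is not meant for.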