I need to split a phrase into words, but ignore text within a defined tag. For example:
Input
<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>
Output
<i>111 111 111</i>
222
333
444
<i>555 666</i>
888
999
<i>000 111</i>
Try this:
/<i>[\d\s]*<\/i>|\d+/g
Explanation:
- For strings within <i> tags, both whitespace and numerals are included in the match.
- Strings not within the tags cannot include whitespace, so they are restricted to numeric strings.
- The | alternation tries its branches left to right and stops at the first one that matches, so <i>111 222 333</i> is treated as a single unit rather than split into 111, 222, and 333.
Tested on Regexr here, works correctly: http://regexr.com?2uf6j
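For anyone who wants to check this outside Regexr, here is a rough Python sketch of the same pattern. It assumes the input contains only digits, whitespace, and <i> tags, as in the question; the names `pattern` and `tokens` are just for illustration.

```python
import re

# Same alternation as the pattern above: a whole <i>...</i> run first,
# otherwise a bare run of digits.
pattern = re.compile(r"<i>[\d\s]*</i>|\d+")

text = "<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>"

# findall returns every non-overlapping match, scanning left to right.
tokens = pattern.findall(text)
print(tokens)
```

Because this matches tokens rather than splitting on separators, anything that is neither a tagged run nor a digit string is silently skipped.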
How about splitting on a space only if the next < that follows is not followed by a slash?
>>> import re
>>> test = "<i>111 111 111</i> 222 333 444 <i>555 666</i> 888 999 <i>000 111</i>"
>>> split = re.compile(" (?![^<]*</)")
>>> split.split(test)
['<i>111 111 111</i>', '222', '333', '444', '<i>555 666</i>', '888', '999', '<i>000 111</i>']
This will fail if tags can be nested, though (which is a reason why regex is not a good fit for this kind of problem).
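If nesting does matter, one possible workaround is to track tag depth and only split on whitespace at depth zero. The sketch below is a minimal illustration, not a real parser: `split_outside_tags` is a hypothetical helper, and it assumes well-formed tags without attributes.

```python
import re

def split_outside_tags(text):
    """Split on whitespace that is not inside any tag (hypothetical helper)."""
    tokens, current, depth = [], [], 0
    # Break the input into tags, whitespace runs, and other text runs.
    for part in re.findall(r"</?\w+>|\s+|[^<\s]+", text):
        if part.startswith("</"):
            depth -= 1          # closing tag: leave one nesting level
            current.append(part)
        elif part.startswith("<"):
            depth += 1          # opening tag: enter one nesting level
            current.append(part)
        elif part.isspace() and depth == 0:
            # Whitespace outside all tags ends the current token.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            # Text, or whitespace inside a tag, stays in the current token.
            current.append(part)
    if current:
        tokens.append("".join(current))
    return tokens

print(split_outside_tags("<i>1 <b>2 3</b> 4</i> 5 6"))
```

On that nested input the lookahead-based split breaks at the space before <b>, whereas the depth counter keeps the whole <i>...</i> span together.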