I have the following string:
blah blah yo<desc>some text with description - unwanted
text</desc>um hey now some words yah<desc>some other description text
stuff - more unwanted here</desc>random word and ; things. Now a hyphen
outside of desc tag - with other text<desc>yet another description - unwanted
</desc>and that's about it.
(Note: In reality there are no newline/carriage returns in the string. I only added them here for readability.)
I want to select only the text inside the desc tag from the hyphen forward, including the preceding space and the closing desc tag. That was simple, as I just did this:
\s-.*?<\/desc>
Now, the problem is that the hyphen outside the desc tag is getting selected too. So my selections are as follows:
- unwanted text</desc>
- more unwanted here</desc>
- with other text<desc>yet another description - unwanted</desc>
So the first two are perfect but see how that last line is messed up because of the - outside the desc tag?
Just FYI, if interested, in my code I am doing a replace like this:
$text = preg_replace('/\s-.*?<\/desc>/', '</desc>', $text);
I tried doing some Lookbehind stuff but could not get it to work.
Any ideas?
Thanks! Mark
You could try [^-<>]* instead of .*? in your pattern. This restricts what the regex can match and effectively treats angle brackets and the hyphen as delimiters.
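A quick sketch of that suggestion applied to the replace from the question. The sample string here is a shortened, made-up version of the one in the question, and it assumes the unwanted text itself never contains a hyphen or an angle bracket:

```php
<?php
// Shortened, hypothetical sample input (the real string is longer).
$text = 'yo<desc>some text - unwanted text</desc>words - with other text'
      . '<desc>another description - unwanted</desc>end';

// [^-<>]* cannot cross a tag boundary or another hyphen, so the hyphen
// outside the desc tag never reaches a closing </desc> and is left alone.
$result = preg_replace('/\s-[^-<>]*<\/desc>/', '</desc>', $text);

echo $result, "\n";
// yo<desc>some text</desc>words - with other text<desc>another description</desc>end
```

If the unwanted part can contain a second hyphen, this pattern will skip that tag, so treat it as a starting point rather than a complete fix.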
What about:
\s-[^-]*?<\/desc>
If desc is the only tag that can appear in this block, you could use a horrible hack like this:
$text = preg_replace('/\s-[^<]*?<\/desc>/', '</desc>', $text);
But if this needs to be bulletproof, you can't reliably do this with a regular expression. You might try using an XML parser and processing the resultant DOM.
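For the parser route, here is a minimal sketch using PHP's DOM extension. It assumes desc is the only markup, that the rest of the text is XML-safe (no stray & or <), and it wraps the string in a made-up root element so it parses as one document. The sample string is again a shortened, hypothetical one:

```php
<?php
// Shortened, hypothetical sample input.
$text = 'yo<desc>some text - unwanted text</desc>words - with other text'
      . '<desc>another description - unwanted</desc>end';

$doc = new DOMDocument();
// "root" is an arbitrary wrapper name, needed only to make the fragment well-formed.
$doc->loadXML('<root>' . $text . '</root>');

foreach ($doc->getElementsByTagName('desc') as $desc) {
    // Cut everything from the first whitespace + hyphen to the end of this element's text.
    $trimmed = preg_replace('/\s-.*$/s', '', $desc->textContent);

    // Replace the element's children with a single trimmed text node.
    while ($desc->firstChild) {
        $desc->removeChild($desc->firstChild);
    }
    $desc->appendChild($doc->createTextNode($trimmed));
}

// Serialize the children of the wrapper back into a plain string.
$result = '';
foreach ($doc->documentElement->childNodes as $child) {
    $result .= $doc->saveXML($child);
}

echo $result, "\n";
```

Because the regex only ever sees the text of one desc element at a time, a hyphen outside the tags can't be picked up, whatever the surrounding text looks like.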