I'm still learning regex (obviously) and I can't figure this out, and I want to do it the right way rather than the long way. How can I:
Find all <p> or </p> and replace them with \n, except the first <p> and the last </p>, which should simply be removed, and also replace <br>, <br /> and <br/> with \n?
With regex OR something else. I'm getting this from a jQuery $.get() return. So please don't flame me about it, I just don't know how to do it.
JavaScript has rather nice tools for dealing with an XML (or XHTML) DOM. Use those.
From a regex perspective, to make the first <p> an exception, you must identify a pattern that makes the match fail on the first <p>. For example, if the text before the first <p> is abcxyz, that is, abcxyz<p>, then you search for every <p> that is not preceded by abcxyz, so that the first <p> doesn't match. As a regex, that becomes: (?<!abcxyz)<p>
To make the last </p> an exception, you must identify a pattern that makes the match fail on the last </p>. For example, if the text after the last </p> is abcxyz, that is, </p>abcxyz, then you search for every </p> that is not followed by abcxyz, so that the last </p> doesn't match. As a regex, that becomes: </p>(?!abcxyz)
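As a rough illustration of the negative look-ahead described above, here is a minimal sketch. The sample string and the trailing text abcxyz are made up for the example; in real markup you would substitute whatever actually follows the last </p>.

```javascript
// Hypothetical sample: "abcxyz" stands for whatever text follows the last </p>.
const sampleHtml = "<p>one</p><p>two</p>abcxyz";

// Replace every </p> that is NOT immediately followed by "abcxyz".
const withoutInnerClosers = sampleHtml.replace(/<\/p>(?!abcxyz)/g, "\n");

console.log(JSON.stringify(withoutInnerClosers)); // "<p>one\n<p>two</p>abcxyz"
```

Only the first </p> is replaced; the one directly before abcxyz is left in place.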
Although JavaScript supports positive and negative look-ahead, JavaScript engines that predate ES2018 support neither positive nor negative look-behind. There are some dirty tricks to mimic look-behind in JavaScript, but not every look-behind construct can be mimicked.
Thus, if possible, try to identify a pattern that excludes the first <p> using negative look-ahead instead.
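One possible way to follow that advice, sketched under the assumption that the string begins exactly with the first <p> and ends exactly with the last </p> (no surrounding whitespace or other text): the look-ahead (?!^) fails at the start of the string, and (?!$) fails at the end. The function name is hypothetical.

```javascript
// Assumption: html starts with the first <p> and ends with the last </p>.
function replaceInnerParagraphTags(html) {
  return html
    // (?!^) fails at position 0, so the <p> at the very start is skipped.
    .replace(/(?!^)<p>/g, "\n")
    // (?!$) fails when </p> is the last thing in the string, so it is skipped.
    .replace(/<\/p>(?!$)/g, "\n");
}

console.log(JSON.stringify(replaceInnerParagraphTags("<p>one</p><p>two</p>")));
// "<p>one\n\ntwo</p>"
```

This relies only on look-ahead, so it does not depend on look-behind support.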
To replace the first <p> and the last </p> with nothing, you can invert the logic used above, and you have to do this in a separate step.
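A minimal sketch of that separate step, assuming the first <p> sits at the start of the string and the last </p> at the end (optionally surrounded by whitespace), and that the tags carry no attributes. Both function names are made up for the example.

```javascript
// Step 1: remove the outermost tags by anchoring to the start and end.
function stripOuterParagraphTags(html) {
  return html
    .replace(/^\s*<p>/, "")     // first <p>, plus any leading whitespace
    .replace(/<\/p>\s*$/, "");  // last </p>, plus any trailing whitespace
}

// Step 2: turn every remaining <p> or </p> into a newline.
function paragraphsToNewlines(html) {
  return stripOuterParagraphTags(html).replace(/<\/?p>/g, "\n");
}

console.log(JSON.stringify(paragraphsToNewlines("<p>one</p><p>two</p>")));
// "one\n\ntwo"
```

Because each </p><p> boundary contains two tags, it yields two newlines, which matches the replacement rule as stated in the question.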
To replace <br>, <br /> and <br/> with \n, search for <br\s*\/?> and replace with \n.
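For instance, the line-break pattern could be applied like this. The function name is hypothetical, and the i flag is an addition not mentioned in the answer; it is there so that uppercase <BR> is caught as well.

```javascript
// Replace <br>, <br /> and <br/> with a newline.
// The "i" flag (an assumption beyond the original pattern) also matches <BR>.
function breaksToNewlines(html) {
  return html.replace(/<br\s*\/?>/gi, "\n");
}

console.log(JSON.stringify(breaksToNewlines("a<br>b<br />c<br/>d")));
// "a\nb\nc\nd"
```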
One way to do this would be to let the browser do it for you. In IE and WebKit, you could assign your HTML as the innerHTML of a <div> and read its innerText. However, that won't work in Firefox or Opera. Here's a slightly bizarre use of the Selection object that will do it:
function getInnerText(html) {
    var text = "";
    // Render the HTML in a temporary element attached to the document
    var div = document.createElement("div");
    div.innerHTML = html;
    document.body.appendChild(div);
    if (typeof window.getSelection != "undefined") {
        // Standards-based browsers: select the contents and read the text
        var sel = window.getSelection();
        sel.removeAllRanges();
        var range = document.createRange();
        range.selectNodeContents(div);
        sel.addRange(range);
        text = sel.toString();
        sel.removeAllRanges();
    } else if (typeof document.body.createTextRange != "undefined") {
        // Older IE: use a TextRange instead
        var range = document.body.createTextRange();
        range.moveToElementText(div);
        text = range.text;
    }
    document.body.removeChild(div);
    // Normalize line endings to \n
    return text.replace(/\r\n/g, "\n").replace(/\r/g, "\n");
}