I've been beating my head against this regex for the longest time now and am hoping someone can help. Basically I have a WYSIWYG field where a user can type formatted text, but of course they will copy and paste from Word, the web, etc., so I have a JS function catching the input on paste. I've got a function that strips ALL of the formatting from the text, which is nice, but I'd like it to leave tags like p and br so the result isn't just one big mess.
Any regex ninjas out there? Here is what I have so far and it works. Just need to allow tags.
o.node.innerHTML=o.node.innerHTML.replace(/(<([^>]+)>)/ig,"");
The browser already has a perfectly good parsed HTML tree in o.node. Serialising the document content to HTML (using innerHTML), trying to hack it about with regex (which cannot parse HTML reliably), then re-parsing the results back into document content by setting innerHTML... is just a bit perverse really.
Instead, inspect the element and attribute nodes you already have inside o.node, removing the ones you don't want, e.g.:
filterNodes(o.node, {p: [], br: [], a: ['href']});
Defined as:
// Remove elements and attributes that do not meet a whitelist lookup of lowercase element
// name to list of lowercase attribute names.
//
function filterNodes(element, allow) {
    // Recurse into child elements (iterating over a copy, since the live list changes as we edit)
    //
    Array.fromList(element.childNodes).forEach(function(child) {
        if (child.nodeType===1) {
            filterNodes(child, allow);
            var tag= child.tagName.toLowerCase();
            // Own-property check: a plain `in` test would also match inherited
            // names such as `constructor`
            //
            if (Object.prototype.hasOwnProperty.call(allow, tag)) {
                // Remove unwanted attributes
                //
                Array.fromList(child.attributes).forEach(function(attr) {
                    if (allow[tag].indexOf(attr.name.toLowerCase())===-1)
                        child.removeAttributeNode(attr);
                });
            } else {
                // Replace unwanted elements with their contents
                //
                while (child.firstChild)
                    element.insertBefore(child.firstChild, child);
                element.removeChild(child);
            }
        }
    });
}
// ECMAScript Fifth Edition (and JavaScript 1.6) array methods used by `filterNodes`.
// Because not all browsers have these natively yet, bodge in support if missing.
//
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, ix /*opt*/) {
        for (var i= ix || 0, n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}
// Utility function used by filterNodes. This is really just `Array.prototype.slice()`
// except that the ECMAScript standard doesn't guarantee we're allowed to call that on
// a host object like a DOM NodeList, boo.
//
Array.fromList= function(list) {
    var array= new Array(list.length);
    for (var i= 0, n= list.length; i<n; i++)
        array[i]= list[i];
    return array;
};
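If it helps to see the walk in isolation, here is a rough sketch of the same whitelist logic applied to plain objects instead of DOM nodes, so it can be run outside a browser. The {tag, attrs, children} shape and the name filterTree are made up purely for illustration and are not part of any DOM API; unlike filterNodes, this version builds a new tree rather than editing in place.

```javascript
// Hypothetical stand-in for DOM nodes: strings are text nodes,
// objects of the form {tag, attrs, children} are elements.
function filterTree(children, allow) {
  var out = [];
  children.forEach(function (child) {
    if (typeof child === 'string') {
      out.push(child); // text is always kept
      return;
    }
    // Filter the descendants first, as filterNodes does.
    var kept = filterTree(child.children || [], allow);
    if (Object.prototype.hasOwnProperty.call(allow, child.tag)) {
      // Allowed element: copy across only the whitelisted attributes.
      var attrs = {};
      allow[child.tag].forEach(function (name) {
        if (child.attrs && name in child.attrs) attrs[name] = child.attrs[name];
      });
      out.push({ tag: child.tag, attrs: attrs, children: kept });
    } else {
      // Disallowed element: drop the wrapper, keep its filtered contents.
      out = out.concat(kept);
    }
  });
  return out;
}

var pasted = [
  { tag: 'div', attrs: { style: 'margin:0' }, children: [
    { tag: 'p', attrs: { 'class': 'MsoNormal' }, children: [
      'Hi ',
      { tag: 'a', attrs: { href: '/x', onclick: 'evil()' }, children: ['link'] }
    ] }
  ] }
];

var filtered = filterTree(pasted, { p: [], br: [], a: ['href'] });
console.log(JSON.stringify(filtered));
```

The div wrapper disappears while its contents survive, the p loses its class, and the a keeps only href.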
First, I'm not sure if regex is the right tool for this. A user might enter invalid HTML (forget a > or put a > inside attributes), and a regex would fail then. I don't know, though, if a parser would be much better/more bulletproof.
Second, you have a few unnecessary parentheses in your regex.
Third, you could use lookahead to exclude certain tags:
o.node.innerHTML=o.node.innerHTML.replace(/<(?!\s*\/?(br|p)\b)[^>]+>/ig,"");
Explanation:
<
matches the opening angle bracket.
(?!\s*\/?(br|p)\b)
asserts that it's not possible to match zero or more whitespace characters, zero or one /, either br or p, followed directly by a word boundary. The word boundary is important; otherwise the lookahead would also trigger on tags like <pre> or <param ...>, and they would be left in.
[^>]+
matches one or more characters that are not closing angle brackets.
>
matches the closing angle bracket.
Note that you might run into trouble if a closing angle bracket occurs somewhere inside a tag.
So this will match (and strip)
<pre> <a href="dot.com"> </a> </pre>
and leave
<p> < p > < /br > <br /> <br>
etc. alone.
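A quick way to check this behaviour is to run the regex over a few strings. The sketch below uses a hypothetical helper named clean and made-up sample inputs; it also shows two limitations: attributes on the allowed tags are left untouched, and a > inside an attribute value cuts the match short.

```javascript
// The lookahead regex from above: strip every tag except <p> and <br>.
var stripExceptPandBr = /<(?!\s*\/?(br|p)\b)[^>]+>/ig;

// Hypothetical helper, only for this demonstration.
function clean(html) {
  return html.replace(stripExceptPandBr, '');
}

// Allowed tags survive, everything else is stripped.
var a = clean('<p><span style="color:red">Hi</span><br></p>');
console.log(a); // <p>Hi<br></p>

// The word boundary stops <pre> and <param> from passing as <p>.
var b = clean('<pre>code</pre> and <param name="x">');
console.log(b); // "code and "

// Limitation: attributes on allowed tags are kept as they are.
var c = clean('<p class="MsoNormal" onclick="evil()">Hi</p>');
console.log(c); // <p class="MsoNormal" onclick="evil()">Hi</p>

// Limitation: a ">" inside an attribute ends the match early,
// so the rest of the tag leaks into the output as text.
var d = clean('<a title="a > b">link</a>');
console.log(d); // ' b">link'
```

If the attributes on p matter (pasted Word markup tends to carry a lot of them), the regex would need a further pass, or the node-based approach is the safer route.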