Parse a badly formatted XML document (like an HTML file)

After the parse, I would like to remove dangerous code and write it out again properly formatted.

The purpose is to prevent scripts from entering through an email but still allow the slew of bad HTML to work (at least not fail completely).

Is there a library for that? Is there a better way to keep scripts away from the browser?

The important thing is that the program must not throw a parse exception. It may make best guesses, and even if a guess is wrong, that is acceptable.

Edit: I would appreciate any comments on which parsers y'all think are better and why.

For flexible parsing you might want to look at JSoup. But white-listing is the way to go here. If you just disallow a bunch of "dangerous" elements, someone will likely find a way to sneak something by your parser. Instead you should only allow a small list of safe elements.
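To make the difference concrete, here is a minimal sketch using only the JDK. The class name `TagPolicy` and both tag lists are made up for illustration; they are not part of JSoup or any other library. The deny-list lets through every tag its author did not think of, while the allow-list rejects anything it does not know.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TagPolicy {

    // Hypothetical lists, chosen only to illustrate the two strategies.
    private static final Set<String> DENIED =
            new HashSet<>(Arrays.asList("script", "iframe", "object"));
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("p", "br", "b", "i", "a", "ul", "ol", "li"));

    private TagPolicy() {
    }

    // Deny-list: keep everything that is not explicitly forbidden.
    public static boolean passesDenyList(String tagName) {
        return !DENIED.contains(tagName.toLowerCase(Locale.ROOT));
    }

    // Allow-list: keep only what is explicitly permitted.
    public static boolean passesAllowList(String tagName) {
        return ALLOWED.contains(tagName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"p", "SCRIPT", "svg", "embed"}) {
            System.out.println(tag + ": deny-list keeps=" + passesDenyList(tag)
                    + ", allow-list keeps=" + passesAllowList(tag));
        }
    }
}
```

Here `svg` and `embed` slip past the deny-list because nobody listed them, but the allow-list drops them by default.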

Use one of the available tools that convert HTML to XHTML.

For example:

http://www.chilkatsoft.com/java-html.asp

http://java-source.net/open-source/html-parsers

http://htmlcleaner.sourceforge.net/

etc

Then use a regular XML parser.
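Once the markup is well-formed XHTML, the JDK's own XML APIs are enough for the second step. Below is a minimal sketch, assuming the converter's output is available as a string without a DOCTYPE. The class name `XhtmlScriptStripper` is made up, and removing only `script` elements is a simplification to show the mechanics; a real filter would rather keep an allow-list of elements and attributes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XhtmlScriptStripper {

    private XhtmlScriptStripper() {
    }

    // Parses well-formed XHTML, drops every <script> element and
    // serializes the remaining tree back to a string.
    public static String stripScripts(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // The NodeList is live, so walk it backwards while removing nodes.
            NodeList scripts = doc.getElementsByTagName("script");
            for (int i = scripts.getLength() - 1; i >= 0; i--) {
                Node script = scripts.item(i);
                script.getParentNode().removeChild(script);
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Wrapped so that callers do not have to handle checked parser exceptions.
            throw new IllegalStateException("could not process XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div><p>Hello</p><script>alert(1)</script></div>";
        System.out.println(stripScripts(xhtml));
    }
}
```

Note that this step still fails on input that is not well-formed, which is why the conversion to XHTML has to come first.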

I use the Jericho HTML parser for this purpose.

Here is a somewhat tweaked version of their sanitizer example:

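// Assumes Jericho 3.x (package net.htmlparser.jericho) and Guava's Sets.
import static net.htmlparser.jericho.CharacterReference.decode;
import static net.htmlparser.jericho.CharacterReference.encode;
import static net.htmlparser.jericho.HTMLElementName.*;

import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.List;
import java.util.Set;

import com.google.common.collect.Sets;

import net.htmlparser.jericho.*;

public class HtmlSanitizer {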

private HtmlSanitizer() {
}

private static final Set<String> VALID_ELEMENTS = Sets.newHashSet(DIV, BR,
        P, B, I, OL, UL, LI, A, STRONG, SPAN, EM, TT, IMG);


private static final Set<String> VALID_ATTRIBUTES = Sets.newHashSet("id",
        "class", "href", "target", "title", "src");

private static final Object VALID_MARKER = new Object();

public static void sanitize(Reader r, Writer w) {
    try {
        sanitize(new Source(r)).writeTo(w);
        w.flush();
        r.close();
    } catch (IOException ioe) {
        throw new RuntimeException("error during sanitize", ioe);
    }
}

public static OutputDocument sanitize(Source source) {
    source.fullSequentialParse();
    OutputDocument doc = new OutputDocument(source);
    List<Tag> tags = source.getAllTags();
    int pos = 0;
    for (Tag tag : tags) {
        if (processTag(tag, doc))
            tag.setUserData(VALID_MARKER);
        else
            doc.remove(tag);
        reencodeTextSegment(source, doc, pos, tag.getBegin());
        pos = tag.getEnd();
    }
    reencodeTextSegment(source, doc, pos, source.getEnd());
    return doc;
}

private static boolean processTag(Tag tag, OutputDocument doc) {
    String elementName = tag.getName();
    if (!VALID_ELEMENTS.contains(elementName))
        return false;
    if (tag.getTagType() == StartTagType.NORMAL) {
        Element element = tag.getElement();
        if (HTMLElements.getEndTagRequiredElementNames().contains(
                elementName)) {
            if (element.getEndTag() == null)
                return false;
        } else if (HTMLElements.getEndTagOptionalElementNames().contains(
                elementName)) {
            if (HTMLElementName.LI.equals(elementName) && !isValidLITag(tag))
                return false;
            if (element.getEndTag() == null)
                doc.insert(element.getEnd(), getEndTagHTML(elementName));

        }
        doc.replace(tag, getStartTagHTML(element.getStartTag()));
    } else if (tag.getTagType() == EndTagType.NORMAL) {
        if (tag.getElement() == null)
            return false;
        if (HTMLElementName.LI.equals(elementName) && !isValidLITag(tag))
            return false;
        doc.replace(tag, getEndTagHTML(elementName));
    } else {
        return false;
    }
    return true;
}

private static boolean isValidLITag(Tag tag) {
    Element parentElement = tag.getElement().getParentElement();
    if (parentElement == null
            || parentElement.getStartTag().getUserData() != VALID_MARKER)
        return false;
    return HTMLElementName.UL.equals(parentElement.getName())
            || HTMLElementName.OL.equals(parentElement.getName());
}

private static void reencodeTextSegment(Source source, OutputDocument doc,
        int begin, int end) {
    if (begin >= end)
        return;
    Segment textSegment = new Segment(source, begin, end);
    String encodedText = encode(decode(textSegment));
    doc.replace(textSegment, encodedText);
}

private static CharSequence getStartTagHTML(StartTag startTag) {
    StringBuilder sb = new StringBuilder();
    sb.append('<').append(startTag.getName());
    for (Attribute attribute : startTag.getAttributes()) {
        if (VALID_ATTRIBUTES.contains(attribute.getKey())) {
            sb.append(' ').append(attribute.getName());
            if (attribute.getValue() != null) {
                sb.append("=\"");
                sb.append(CharacterReference.encode(attribute.getValue()));
                sb.append('"');
            }
        }
    }
    if (startTag.getElement().getEndTag() == null
            && !HTMLElements.getEndTagOptionalElementNames().contains(
                    startTag.getName()))
        sb.append('/');
    sb.append('>');
    return sb;
}

private static String getEndTagHTML(String tagName) {
    return "</" + tagName + '>';
}

}
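One gap worth noting in the sanitizer above: `href` and `src` are on the attribute whitelist, but their values are only entity-encoded, so a `javascript:` URL would pass through. A possible guard is to whitelist URL schemes as well. The sketch below uses only the JDK; the class `UrlSchemeCheck`, the method `isSafeUrl` and the scheme list are assumptions made for illustration, not part of Jericho. It could be called from `getStartTagHTML` before an `href` or `src` value is appended.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class UrlSchemeCheck {

    // Hypothetical scheme whitelist; adjust it to what your mails need.
    private static final Set<String> SAFE_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https", "mailto"));

    private UrlSchemeCheck() {
    }

    // Returns true for relative URLs and for URLs with a whitelisted scheme.
    public static boolean isSafeUrl(String value) {
        if (value == null) {
            return false;
        }
        // Strip whitespace and control characters first, because browsers
        // tend to ignore them inside the scheme ("java\tscript:").
        String cleaned = value.replaceAll("[\\s\\p{Cntrl}]", "");
        int colon = cleaned.indexOf(':');
        if (colon < 0) {
            return true; // no scheme at all, so a relative URL
        }
        // A colon that appears after '/', '?' or '#' does not end a scheme.
        for (int i = 0; i < colon; i++) {
            char c = cleaned.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                return true;
            }
        }
        String scheme = cleaned.substring(0, colon).toLowerCase(Locale.ROOT);
        return SAFE_SCHEMES.contains(scheme);
    }

    public static void main(String[] args) {
        for (String url : new String[] {
                "https://example.com/", "images/a.png", "javascript:alert(1)"}) {
            System.out.println(url + " -> " + isSafeUrl(url));
        }
    }
}
```

The check expects an already decoded attribute value, so entity-encoded variants of the scheme have to be decoded before it is called.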

Have a look at NekoHTML (http://nekohtml.sourceforge.net/); it has built-in tag balancing. Also check out the section on custom filters in its documentation (http://nekohtml.sourceforge.net/filters.html#filters.removing). It is a very good HTML parser.