Are there any Java HTML parsers where the generated Nodes retain indexes to the original text?

devze.com (https://www.devze.com), 2023-04-02 01:26, source: the web
I'd like to query an HTML document as XML (e.g. with XPath), so I need to pass the HTML through some form of HTML cleaner.

But I'd also like to make modifications to the original source string based on the results of the queries.

Is there a Java HTML parser around that retains indexes to the original source string, so I can locate a node and modify the correct part of the original string?

Cheers.


It sounds like Jericho is almost exactly what you want. It is a robust HTML parser designed specifically for making unintrusive modifications to the source document.

While it doesn't come with DOM, SAX, or StAX interfaces, it has custom APIs that are similar enough to those standards that you should be able to adapt your approach to them fairly easily, or write an adapter between whatever you are using and Jericho. For instance, you can do XPath queries on Jericho documents using Jaxen -- see this blog entry for an example.

Jericho records begin and end positions for every element, and even for parts of an element such as the tag name or an attribute name, so you can edit the document yourself with that information. Where Jericho really shines, though, is the OutputDocument class: it lets you specify replacements directly by passing the Jericho elements that match your query to the appropriate methods, instead of having to call getBegin() and getEnd() on them yourself and pass the results to some replacement method.
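
To make the begin/end idea concrete, here is a minimal sketch in plain Java of applying offset-based replacements to an original string. It does not use Jericho: the OffsetEdits class and its Edit type are hypothetical helpers invented for illustration, and the offsets stand in for whatever positions a parser reports (for example the values of getBegin() and getEnd()). The edits are applied from the end of the string backwards so that the offsets of earlier edits stay valid.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Jericho): applies offset-based edits to a string. */
final class OffsetEdits {

    /** Replaces the half-open range [begin, end) of the ORIGINAL text. */
    static final class Edit {
        final int begin;
        final int end;
        final String replacement;

        Edit(int begin, int end, String replacement) {
            this.begin = begin;
            this.end = end;
            this.replacement = replacement;
        }
    }

    static String apply(String source, List<Edit> edits) {
        // Work on a copy, sorted by descending begin offset, so that applying
        // one edit never shifts the positions of the edits still to come.
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort((a, b) -> Integer.compare(b.begin, a.begin));

        StringBuilder out = new StringBuilder(source);
        int limit = source.length();
        for (Edit e : sorted) {
            if (e.begin < 0 || e.begin > e.end || e.end > limit) {
                throw new IllegalArgumentException(
                        "Out-of-range or overlapping edit at " + e.begin + ".." + e.end);
            }
            out.replace(e.begin, e.end, e.replacement);
            limit = e.begin; // the next edit must end at or before this one begins
        }
        return out.toString();
    }
}
```

With Jericho itself this bookkeeping should not be necessary, since OutputDocument is described above as handling the replacements for you; the sketch only shows what the offsets make possible.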


We use Jericho HTML Parser to do the parsing and HtmlCleaner to do the actual clean-up.

We had problems with Jericho's behavior within a server app (memory management, logging) that we fixed. (The original developer didn't think our issues were important enough to put in the main code branch.) Our fork is on GitHub. We also made fixes to HtmlCleaner.


I don't know about the "retain indexes to the original text" part but Jericho is a very good HTML parser library.

Here is an example of how to remove every span tag from an HTML string:

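import java.util.List;

import net.htmlparser.jericho.OutputDocument;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.Tag;

public static String removeSpans(String html) {
    Source source = new Source(html);
    source.fullSequentialParse();
    // The OutputDocument collects modifications against the unmodified source
    OutputDocument outputDocument = new OutputDocument(source);
    List<Tag> tags = source.getAllTags();
    for (Tag tag : tags) {
        String tagname = tag.getName().toLowerCase();
        if (tagname.equals("span")) {
            // remove the matching tag itself; the text between the tags is kept
            outputDocument.remove(tag);
        }
    }
    return outputDocument.toString();
}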


I guess you could use HTML Parser.

You can get indexes to original Page using getStartPosition() and getEndPosition() from class Node.


As others have suggested, you probably want to build the DOM. This basically just means constructing the node tree; it won't alter the document source unless you use an HTML cleaner like JTidy. Then you have easy access to the document and can modify it as required. I would suggest DOM4J: it has a good API and XPath support too.

Regarding your "indexing" requirement: during your traversal/querying of the document, you can cache in a list or map any elements or nodes whose text you wish to modify at a later point.
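
The caching idea can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the JDK's built-in W3C DOM and javax.xml.xpath rather than DOM4J so that it runs without extra dependencies, it assumes the input is already well-formed XHTML (for example after a cleaner has run), and the CachedNodes class and its method names are hypothetical. Note that it modifies the parsed tree, not the original source string.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: query first, remember the hits, modify them later. */
final class CachedNodes {

    /** Parses well-formed XHTML into a W3C DOM document. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Runs an XPath query and caches the matching nodes in a list. */
    static List<Node> collect(Document doc, String xpath) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<Node> cached = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                cached.add(hits.item(i));
            }
            return cached;
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath expression: " + xpath, e);
        }
    }

    /** Later pass: rewrites the text of every cached node in the tree. */
    static void upperCaseText(List<Node> cached) {
        for (Node node : cached) {
            node.setTextContent(node.getTextContent().toUpperCase(Locale.ROOT));
        }
    }
}
```

A map keyed by something meaningful to your application (an id attribute, say) would work the same way as the list used here.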


This works great:

http://jtidy.sourceforge.net/

EXAMPLE

import org.w3c.tidy.Tidy;

Tidy tidy = new Tidy();  // obtain a new Tidy instance
tidy.setXHTML(true);     // set desired config options using the Tidy setters
...                      // (equivalent to command-line options)

// inputStream supplies the HTML to clean; the tidied markup goes to System.out
tidy.parse(inputStream, System.out);

For crawling the DOM, I recommend using JDOM; it's way faster than simple XML.

http://www.jdom.org/

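import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Note: these calls are the JDK's JAXP / W3C DOM API, not JDOM's own classes
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("root");
Text text = doc.createTextNode("This is the root"); // createTextNode, not createText
root.appendChild(text);
doc.appendChild(root);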

As far as implementation is concerned, I would make a new document and add nodes to it from the source.
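
One possible reading of "make a new document and add nodes to it from the source", sketched with the JDK's W3C DOM API: nodes owned by one document cannot simply be appended to another, so each one is copied across with importNode first. The CopyNodes class, its method names, and the "root" wrapper element are hypothetical choices for illustration, and the input is assumed to be well-formed XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: copies selected elements into a fresh document. */
final class CopyNodes {

    /** Parses well-formed XHTML into a W3C DOM document. */
    static Document parseSource(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Builds a new document whose root holds deep copies of all matching elements. */
    static Document copyElements(Document source, String tagName) {
        Document target;
        try {
            target = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create an empty document", e);
        }
        Element root = target.createElement("root");
        target.appendChild(root);

        NodeList matches = source.getElementsByTagName(tagName);
        for (int i = 0; i < matches.getLength(); i++) {
            // importNode(node, true) makes a deep copy owned by the target
            // document; the source document itself is left unchanged.
            root.appendChild(target.importNode(matches.item(i), true));
        }
        return target;
    }
}
```

If matching elements can be nested inside one another, the deep copy will include the inner ones twice, so a real implementation would need to decide how to handle that case.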


You could try ANTLR with an HTML grammar.

You could take (at least) two approaches. One is to use it as an actual HTML parser and then get the indexes into the original string that you are interested in.

Alternatively, it has built-in support for doing in-place transformations on source text, where you define the transformations that you want to perform on the text as part of the grammar.
