How to get an HTML DOM path by text content?

Developer https://www.devze.com 2023-01-14 23:50 Source: web
An HTML file:

<html>
    <body>
        <div class="main">
            <p id="tID">content</p>
        </div>
    </body>
</html>

I have a String equal to "content",

and I want to use it to get the HTML DOM path of the element that contains it:

html body div.main p#tID

Chrome Developer Tools has this feature (Elements tab, bottom bar). I want to know how to do it in Java.

Thanks for your help :)


Have fun :)

JAVA CODE

import java.io.File;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;



public class Teste {

    public static void main(String[] args) {
        try {
            // read and clean document
            TagNode tagNode = new HtmlCleaner().clean(new File("test.xml"));
            Document document = new DomSerializer(new CleanerProperties()).createDOM(tagNode);

            // use XPath to find target node
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node node = (Node) xpath.evaluate("//*[text()='content']", document, XPathConstants.NODE);

            // assemble a jQuery/CSS-style selector by walking from the match up to the root
            String result = "";
            while (node != null && node.getParentNode() != null) {
                result = readPath(node) + " " + result;
                node = node.getParentNode();
            }
            // trim the trailing space left by the concatenation above
            System.out.println(result.trim());
            // prints: html body div#myDiv.foo.bar p#tID

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Gets id and class attributes of this node
    private static String readPath(Node node) {
        NamedNodeMap attributes = node.getAttributes();
        String id = readAttribute(attributes.getNamedItem("id"), "#");
        String clazz = readAttribute(attributes.getNamedItem("class"), ".");
        return node.getNodeName() + id + clazz;
    }

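    // Reads an attribute value and prefixes each whitespace-separated part with the token
    private static String readAttribute(Node node, String token) {
        String result = "";
        if(node != null) {
            // trim and collapse whitespace runs so "foo  bar" does not yield ".foo..bar"
            result = token + node.getTextContent().trim().replaceAll("\\s+", token);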
        }
        return result;
    }

}

XML EXAMPLE (test.xml)

<html>
    <body>
        <br>
        <div id="myDiv" class="foo bar">
            <p id="tID">content</p>
        </div>
    </body>
</html>

EXPLANATIONS

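  1. The document object holds the DOM built from the cleaned input.
  2. The XPath //*[text()='content'] matches every element that has a text node equal to 'content'; evaluating it as a NODE returns the first match.
  3. The while loop walks up from the matched node to the root, collecting the id and classes of each element along the way.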
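
The three steps above can also be tried without any third-party jar. This is only a minimal sketch, assuming the input is already well-formed (so the JDK's own XML parser accepts it); the class name DomPathSketch and the helpers cssPathByText and describe are hypothetical names chosen for illustration, not part of the answer's code. The search text is passed in through an XPath variable, so it does not have to be quoted by hand.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomPathSketch {

    // Hypothetical helper: CSS-like path of the first element that has a text
    // node equal to `text`, or an empty string when nothing matches.
    // Assumes `wellFormedHtml` is well-formed XML (no unclosed <br>, etc.).
    static String cssPathByText(String wellFormedHtml, String text) {
        try {
            // step 1: parse the markup into a DOM document
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedHtml)));

            // step 2: find the target node; $wanted is resolved to `text`
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setXPathVariableResolver(name -> text);
            Node node = (Node) xpath.evaluate(
                    "//*[text()=$wanted]", document, XPathConstants.NODE);

            // step 3: walk up to the root element, prepending each segment
            Deque<String> parts = new ArrayDeque<>();
            while (node instanceof Element) {
                parts.addFirst(describe((Element) node));
                node = node.getParentNode();
            }
            return String.join(" ", parts);
        } catch (Exception e) {
            throw new IllegalStateException("could not build path", e);
        }
    }

    // Formats one element as tag#id.class1.class2
    static String describe(Element element) {
        StringBuilder segment = new StringBuilder(element.getNodeName());
        String id = element.getAttribute("id").trim();
        if (!id.isEmpty()) {
            segment.append('#').append(id);
        }
        String classes = element.getAttribute("class").trim();
        if (!classes.isEmpty()) {
            for (String cssClass : classes.split("\\s+")) {
                segment.append('.').append(cssClass);
            }
        }
        return segment.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><div class=\"main\">"
                + "<p id=\"tID\">content</p></div></body></html>";
        System.out.println(cssPathByText(html, "content"));
        // prints: html body div.main p#tID
    }
}
```

Collecting the segments in a Deque and joining them at the end avoids the trailing space that repeated string concatenation leaves behind.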

MORE EXPLANATIONS

  1. In this new solution I'm using HtmlCleaner, so the input can contain, for example, an unclosed <br>, and the cleaner will replace it with <br/>.
  2. To use HtmlCleaner, just download the newest jar and add it to your classpath.
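
To see why the cleaning step matters, the sketch below feeds the same markup to the JDK's strict XML parser, once with an unclosed <br> and once with <br/>. It uses only standard JAXP classes; the class name StrictParseSketch and the helper parsesAsXml are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StrictParseSketch {

    // Hypothetical helper: true when the markup is well-formed XML
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // the parser rejected the markup as not well-formed
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String unclosed = "<html><body><br><p>content</p></body></html>";
        String closed = "<html><body><br/><p>content</p></body></html>";
        System.out.println(parsesAsXml(unclosed)); // prints: false
        System.out.println(parsesAsXml(closed));   // prints: true
    }
}
```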