How to convert HTML to text keeping linebreaks

Developer https://www.devze.com 2022-12-24 11:20 Source: Internet
How may I convert HTML to text keeping linebreaks (produced by elements like br, p, div, ...), possibly using NekoHTML or any decent enough HTML parser?

Example:

Hello<br/>World

to:

Hello\n  
World  


Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.

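import java.io.IOException;
import java.io.InputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class HtmlToText {

    public static String htmlToText(InputStream html) throws IOException {
        // A null charset lets Jsoup detect the encoding (falling back to UTF-8);
        // the empty base URI is fine because no relative links are resolved.
        Document document = Jsoup.parse(html, null, "");
        Element body = document.body();

        return buildStringFromNode(body).toString();
    }

    // Depth-first walk: emit this node's text, then its children, then a
    // newline if the node is a <p> or <br> element.
    private static StringBuilder buildStringFromNode(Node node) {
        StringBuilder buffer = new StringBuilder();

        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            // Note: trim() also removes the space between adjacent inline
            // nodes, so "Hello <b>World</b>" comes out as "HelloWorld".
            buffer.append(textNode.text().trim());
        }

        for (Node childNode : node.childNodes()) {
            buffer.append(buildStringFromNode(childNode));
        }

        if (node instanceof Element) {
            Element element = (Element) node;
            String tagName = element.tagName();
            if ("p".equals(tagName) || "br".equals(tagName)) {
                buffer.append("\n");
            }
        }

        return buffer;
    }
}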


w3m -dump -no-cookie input.html > output.txt


I did find a relatively clever solution in html2txt: THE ASCIINATOR, which does an admirable job of producing nroff-like output (e.g. like man ls run in a terminal). It produces output in the Markdown style that Stack Overflow uses as input.

For moderately complex pages like this page, the output is somewhat scattered as it tries mightily to turn non-linear layout into something linear. The output from less complicated markup is pretty readable.


If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.


HTML to Text: I am taking this statement to mean that all HTML formatting, except line-breaks, will be abandoned.

What I have done for such a venture is use a regexp to detect any tag enclosure. If the content within the tag is br or br/, a line break is inserted; otherwise the tag is discarded.

It works only for simple HTML pages. Tables will obviously be linearised.
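
A minimal sketch of the regexp approach described above, in Java. The class name RegexHtmlToText and the two patterns are illustrative assumptions, not the poster's actual code. It does not decode entities such as &amp;, and it leaves the text inside script or style elements in place.

```java
import java.util.regex.Pattern;

final class RegexHtmlToText {
    // Matches <br>, <br/> and <br /> in any letter case.
    private static final Pattern BR = Pattern.compile("(?i)<br\\s*/?>");
    // Matches any other tag-like span: '<', then everything up to the next '>'.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String convert(String html) {
        // First turn line-break tags into newlines, then discard every remaining tag.
        String withBreaks = BR.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(withBreaks).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(convert("Hello<br/>World"));
    }
}
```

With this sketch, Hello<br/>World comes out as Hello and World on separate lines. A br tag that carries attributes is not recognised as a line break and is simply dropped with the other tags.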

I had been thinking of detecting the title value between the title tags, so that the converter automatically places the title at the top of the page. That needs a little more algorithm. But my time is better spent with ...
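
The title idea could look roughly like the following Java sketch. The names TitleFirst, extractTitle and withTitleOnTop are hypothetical, and the body text is assumed to have been converted already by some other step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleFirst {
    // (?i) ignores case, (?s) lets '.' span line breaks; the group is reluctant
    // so it stops at the first closing title tag.
    private static final Pattern TITLE =
            Pattern.compile("(?is)<title[^>]*>(.*?)</title\\s*>");

    /** Returns the trimmed title text, or an empty string if no title element is found. */
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    /** Puts the title (if any) on its own line above the already-converted body text. */
    static String withTitleOnTop(String html, String bodyText) {
        String title = extractTitle(html);
        return title.isEmpty() ? bodyText : title + "\n\n" + bodyText;
    }
}
```

A page without a title element is returned unchanged, so the helper can be applied unconditionally.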

I am reading up on using the Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or why text, when I could do PDF? But you have to get a Google account if you don't already have one.

Google Docs data download/export

Google Docs Data API for Java


Does it matter what language you use? You could always use pattern matching. Basically, you can replace the HTML line break tags (br, p, div, ...) with "\n" and remove all the other tags. You could store the line break tags in an array so you can easily check against it as you go through the HTML file. Any other tags and all the end tags (/p, ...) can then be replaced with an empty string, giving you your result.
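
One way this could be written in Java (9 or later) is sketched below. The class name TagListHtmlToText, the contents of the tag list and the tag pattern are assumptions made for illustration. HTML comments, entities and script contents are not handled.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagListHtmlToText {
    // Tags whose start tag is treated as a line break; extend the list as needed.
    private static final Set<String> LINE_BREAK_TAGS = Set.of("br", "p", "div");

    // Group 1 is "/" for an end tag (otherwise empty), group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static String convert(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            boolean endTag = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            // Start tags from the list become "\n"; end tags and all other tags vanish.
            String replacement = (!endTag && LINE_BREAK_TAGS.contains(name)) ? "\n" : "";
            m.appendReplacement(out, replacement);
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because every listed start tag yields a newline, a document that opens with a p or div starts with an empty line; calling trim() on the result removes it if that is unwanted.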
