Jsoup fetching a partial page

Developer (https://www.devze.com), 2023-03-13 03:53, source: web

I am trying to scrape the contents of bidding websites, but I am unable to fetch the complete page. I first use crowbar on xulrunner to fetch the page (because AJAX loads certain elements lazily) and then scrape from the saved file. On the main page of the bidrivals website, however, this fails even when the local file is well-formed: jsoup simply seems to stop midway through the HTML, ending with '...' characters. If anyone has encountered this before, please help. The following code is called for [this link].

    File f = new File(projectLocation+logFile+"bidrivalsHome");
    try {
        f.createNewFile();
        log.warn("Trying to fetch mainpage through a console.");
        WinRedirect.redirect(projectLocation+"Curl.exe -s --data \"url="+website+"&delay="+timeDelay+"\" http://127.0.0.1:10000", projectLocation, logFile+"bidrivalsHome");
    } catch (Exception e) {
        e.printStackTrace();
        log.warn("Error in fetching the nameList", e);
    }
    Document doc = new Document("");
    try {
        doc = Jsoup.parse(f, "UTF-8", website);
    } catch (IOException e1) {
        System.out.println("Error while parsing the document.");
        e1.printStackTrace();
        log.warn("Error in parsing homepage", e1);
    }


Try using HtmlUnit to render the page (including JavaScript and CSS DOM manipulation), and then pass the rendered HTML to jsoup.

    // load page using HtmlUnit and fire scripts
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage(myURL);

    // convert page to generated HTML and convert to document
    Document doc = Jsoup.parse(myPage.asXml(), baseURI);

    // clean up resources
    webClient.close();


page.html - source code

<html>
<head>
    <script src="loadData.js"></script>
</head>
<body onLoad="loadData()">
    <div class="container">
        <table id="data" border="1">
            <tr>
                <th>col1</th>
                <th>col2</th>
            </tr>
        </table>
    </div>
</body>
</html>

loadData.js

    // append rows and cols to table#data in page.html
    function loadData() {
        // declare locals with var so they do not leak into the global scope
        var data = document.getElementById("data");
        for (var row = 0; row < 2; row++) {
            var tr = document.createElement("tr");
            for (var col = 0; col < 2; col++) {
                var td = document.createElement("td");
                td.appendChild(document.createTextNode(row + "." + col));
                tr.appendChild(td);
            }
            data.appendChild(tr);
        }
    }

page.html when loaded in a browser

| col1   | col2   |
| ------ | ------ |
| 0.0    | 0.1    |
| 1.0    | 1.1    |

Using jsoup to parse page.html for col data

    // load source from file
    Document doc = Jsoup.parse(new File("page.html"), "UTF-8");

    // iterate over row and col
    for (Element row : doc.select("table#data > tbody > tr")) {
        for (Element col : row.select("td")) {
            // print results
            System.out.println(col.ownText());
        }
    }

Output

(empty)

What happened?

jsoup parses the source code exactly as delivered by the server (or, in this case, as loaded from the file). It does not execute client-side code such as JavaScript, so any DOM changes that scripts would make never happen. In this example, the rows and cols are never appended to the data table.
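To make this concrete, here is a minimal sketch that parses the same markup from a string and counts the cells jsoup can see. The class name `StaticSourceCheck` and the helper `countCells` are hypothetical names introduced only for this illustration; the markup is the page.html source from above, inlined so the example does not depend on a file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticSourceCheck {

    // The page.html source from above, inlined as a string (assumption: same markup).
    static final String PAGE_SOURCE =
        "<html><head><script src=\"loadData.js\"></script></head>"
        + "<body onLoad=\"loadData()\"><div class=\"container\">"
        + "<table id=\"data\" border=\"1\"><tr><th>col1</th><th>col2</th></tr></table>"
        + "</div></body></html>";

    // Counts the cells with the given tag name inside table#data, as jsoup sees them.
    static int countCells(String html, String cellTag) {
        Document doc = Jsoup.parse(html);
        return doc.select("table#data " + cellTag).size();
    }

    public static void main(String[] args) {
        // The header cells are in the static markup, so jsoup finds them.
        System.out.println("th cells: " + countCells(PAGE_SOURCE, "th")); // prints "th cells: 2"
        // The td cells would only exist after loadData() runs, which jsoup never does.
        System.out.println("td cells: " + countCells(PAGE_SOURCE, "td")); // prints "td cells: 0"
    }
}
```

The script tag and the onLoad attribute are kept in the parsed document as plain markup, but nothing ever runs them, which is why the td count stays at zero.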

How to parse my page as rendered in the browser?

    // load page using HtmlUnit and fire scripts
    WebClient webClient = new WebClient();
    HtmlPage myPage = webClient.getPage(new File("page.html").toURI().toURL());

    // convert page to generated HTML and convert to document
    doc = Jsoup.parse(myPage.asXml());

    // iterate row and col
    for (Element row : doc.select("table#data > tbody > tr")) {
        for (Element col : row.select("td")) {
            // print results
            System.out.println(col.ownText());
        }
    }

    // clean up resources
    webClient.close();

Output

0.0
0.1
1.0
1.1
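
If the page fills in its content asynchronously (as the lazily loaded AJAX elements in the question do), the scripts may still be running when `getPage` returns. The following is a rough sketch of one way to handle that, not a definitive implementation: it asks HtmlUnit to wait for pending background JavaScript before handing the HTML to jsoup. The class name `RenderedPageScraper`, the helper `renderedCellTexts`, and the 5000 ms timeout are assumptions made for illustration. The imports assume HtmlUnit 2.x; in HtmlUnit 3.x the package is `org.htmlunit` instead of `com.gargoylesoftware.htmlunit`.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.List;

public class RenderedPageScraper {

    // Renders the page with HtmlUnit, waits up to waitMillis for background
    // scripts, and returns the text of every td inside table#data.
    static List<String> renderedCellTexts(URL pageUrl, long waitMillis) throws IOException {
        // try-with-resources closes the WebClient even if rendering fails
        try (WebClient webClient = new WebClient()) {
            // assumption: script errors on third-party pages should not abort the scrape
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(pageUrl);

            // give timers and asynchronous requests a chance to finish
            webClient.waitForBackgroundJavaScript(waitMillis);

            // hand the rendered DOM to jsoup, using the page URL as base URI
            Document doc = Jsoup.parse(page.asXml(), pageUrl.toString());
            return doc.select("table#data td").eachText();
        }
    }

    public static void main(String[] args) throws IOException {
        URL pageUrl = new File("page.html").toURI().toURL();
        for (String text : renderedCellTexts(pageUrl, 5000)) {
            System.out.println(text);
        }
    }
}
```

The timeout is only an upper bound: the wait ends as soon as no background jobs remain, so a generous value costs little on pages that finish quickly.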