Web Scraping with Jsoup only functioning half the time

Developer (开发者) https://www.devze.com 2023-03-27 16:30 Source: Internet
I've been playing around with the Java Jsoup library lately in an attempt to get a better understanding of web scraping (pulling data off a website). But it would seem that the code I managed to put together only functions part of the time. Is the issue with my code, or is it possible that certain sites have measures to stop web scraping?

Here is the class that does all the 'magic':

import java.io.IOException;
import org.jsoup.*;
import org.jsoup.nodes.Document;

public class HTMLParser {

    private Document d;
    private String url;
    private String content;

    public HTMLParser(String url){
        this.url = url;
        connect();
        parse();
        display();
    }

    private void connect(){
        try{
            d = Jsoup.connect(url).get();
        }catch(IOException e){}
    }

    private void parse(){
        content = d.body().text();
    }

    private void display(){
        System.out.println(content);
    }

}
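One thing worth noting about the class above: the empty catch block hides any connection failure, so d stays null and parse() then fails with a NullPointerException instead of the real error. To tell "my code" apart from "the site refused me", it can help to look at the HTTP status first. Below is a minimal sketch that uses only the JDK's HttpURLConnection rather than Jsoup. The class name FetchDiagnostics, the explain helper, the example URL and the User-Agent string are all made up for illustration, and the status-to-cause mapping is a rough guess, not an exhaustive rule.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchDiagnostics {

    /** Maps an HTTP status code to a rough, non-exhaustive guess at the cause. */
    public static String explain(int status) {
        if (status >= 200 && status < 300) {
            return "OK: if the text is still missing, the page may build its content with JavaScript";
        }
        if (status == 403 || status == 429) {
            return "Refused or rate-limited: the site may be blocking automated clients";
        }
        if (status >= 500) {
            return "Server error: possibly temporary, try again later";
        }
        return "Unexpected status " + status;
    }

    public static void main(String[] args) {
        // Hypothetical default URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setConnectTimeout(10_000);
            con.setReadTimeout(10_000);
            // Some servers reject the default Java User-Agent; this value is an arbitrary example.
            con.setRequestProperty("User-Agent", "Mozilla/5.0 (compatible; ExampleScraper/1.0)");
            int status = con.getResponseCode();
            System.out.println(status + " -> " + explain(status));
            con.disconnect();
        } catch (IOException e) {
            // Report the failure instead of swallowing it.
            System.err.println("Request failed: " + e);
        }
    }
}
```

If the status is consistently 2xx but the printed text is still incomplete, the problem is more likely dynamic content than blocking.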


You might also have a problem if the site loads its data dynamically, which is common in this age of AJAX. Does Jsoup ignore robots.txt, or can you make it do so?

Ideally you need to render the page, and THEN scrape it.

This software apparently renders web pages: http://lobobrowser.org/java-browser.jsp. It also has an API, which might let you look into the rendered page's structure.
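On the robots.txt question: as far as I know, Jsoup simply fetches and parses the URL it is given and does not read robots.txt on its own, so any check has to be done by the caller. The following is a deliberately simplified sketch of such a check, not a full implementation of the robots exclusion rules: it only looks at Disallow lines directly under "User-agent: *", uses plain prefix matching, and ignores Allow lines, wildcards and groups that list several user agents. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    /** Collects the Disallow prefixes listed under "User-agent: *". */
    public static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceAll("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or something that is not "key: value"
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (key.equalsIgnoreCase("User-agent")) {
                inStarGroup = value.equals("*");
            } else if (key.equalsIgnoreCase("Disallow") && inStarGroup && !value.isEmpty()) {
                rules.add(value);
            }
        }
        return rules;
    }

    /** True if the path starts with one of the collected Disallow prefixes. */
    public static boolean isDisallowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up robots.txt content for the demonstration.
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isDisallowed(robots, "/private/page.html"));
        System.out.println(isDisallowed(robots, "/public/page.html"));
    }
}
```

In practice the robots.txt text would be downloaded from the site's root before scraping any other page.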


You can use https://github.com/subes/invesdwin-webproxy with its HtmlUnit JavaScript headless browser support to wait for the page to render, load its data, execute JS and do its Ajax magic before actually doing the parsing.


You can web scrape without Jsoup.

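import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

public class Trick {
    // The IOException has to be either caught or declared; here it is declared.
    public static void main(String[] args) throws IOException {
        // Placeholders: replace every upper-case string below with your own value.
        String pageUrl = "ANY URL";
        String delimiter = "INPUT ANY DELIMITER";   // a regex; "\\A" would read the whole page as one token
        String startMarker = "NAME OF CLASS OR ID";
        String endMarker = "WHERE YOU WANT IT TO END OR STOP SCRAPING";
        int skip = 0;   // placeholder: how many characters to skip, counted from the start of startMarker

        URLConnection con = new URL(pageUrl).openConnection();
        String str;
        try (Scanner scanner = new Scanner(con.getInputStream())) {
            scanner.useDelimiter(delimiter);
            str = scanner.next();
        }

        // Cut everything before the start marker (plus the skipped characters).
        str = str.substring(str.indexOf(startMarker) + skip);
        // Keep the text up to the end marker. Note that indexOf returns -1 if a marker is missing.
        String wow = str.substring(0, str.indexOf(endMarker));
        System.out.println(wow);
        // Drop the part just printed, presumably so the same steps can be repeated for the next match.
        str = str.substring(str.indexOf(endMarker));
    }
}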
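The substring approach above can be tried without any network access. Here is a small sketch of the same idea as a reusable helper that runs against a hard-coded HTML string. The class name BetweenMarkers, the between method and the sample markup are invented for this example. Unlike the snippet above, it returns null when a marker is missing instead of throwing.

```java
public class BetweenMarkers {

    /**
     * Returns the text between the first occurrence of startMarker and the
     * next occurrence of endMarker after it, or null if either is missing.
     */
    public static String between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();   // begin right after the start marker
        int end = text.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a downloaded page.
        String html = "<html><body><span id=\"price\">42.50</span></body></html>";
        System.out.println(between(html, "<span id=\"price\">", "</span>")); // prints 42.50
    }
}
```

This kind of string slicing breaks as soon as the markup changes slightly, which is the main reason to prefer a real parser such as Jsoup when it is available.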