
How to extract data from a website using Java?

Developer https://www.devze.com 2022-12-16 14:52 Source: Internet
I am familiar with the Java programming language. I would like to extract data from a website and store it in a database running on my machine. Is that possible in Java? If so, which API should I use? For example, there are a number of schools listed on a website. How can I extract that data and store it in my database using Java?


What you're referring to is commonly called "screen scraping". There are a variety of ways to do this in Java, but I prefer HtmlUnit. Although it was designed as a way to test web functionality, you can use it to request a remote web page and parse it.

I would recommend using a good error-tolerant HTML parser such as TagSoup to extract exactly what you're looking for from the HTML.
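To give a rough idea of what extracting "exactly what you're looking for" can look like, here is a minimal sketch. Note that it does not use TagSoup: it swaps in the lenient HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) so that it runs without extra libraries. The class name `ListItemExtractor`, the method `extractListItems`, and the assumption that the school names sit in `<li>` elements are all hypothetical; adapt them to the real page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every <li> element in a page.
class ListItemExtractor {

    static List<String> extractListItems(String html) throws IOException {
        List<String> items = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Non-null only while the parser is inside an <li> element.
            private StringBuilder current;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (HTML.Tag.LI.equals(tag)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (HTML.Tag.LI.equals(tag) && current != null) {
                    items.add(current.toString().trim());
                    current = null;
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations
            // in the document, since we already hold decoded text.
            new ParserDelegator().parse(reader, callback, true);
        }
        return items;
    }
}
```

Called with the HTML of a page that lists schools as `<li>` entries, `extractListItems` should return one string per entry, which you could then insert into your database. A dedicated parser such as TagSoup or NekoHTML is likely to cope better with badly broken markup than this JDK parser.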


You definitely need a good parser like NekoHTML.

Here's an example of using NekoHTML, albeit using Groovy (a Java-based scripting language) rather than Java itself:

http://www.keplarllp.com/blog/2010/01/better-competitive-intelligence-through-scraping-with-groovy


You can use VietSpider XML from

http://sourceforge.net/projects/binhgiang/files/

Download VietSpider3_16_XML_Windows.zip or VietSpider3_16_XML_Linux.zip

VietSpider Web Data Extractor is software that crawls data from websites (a data scraper), formats it as standard XML (Text, CDATA), and then stores it in a relational database. The product supports various RDBMSs such as Oracle, MySQL, SQL Server, H2, HSQL, Apache Derby, and Postgres, among others. The VietSpider crawler supports sessions (login, query by form input), multi-downloading, JavaScript handling, and proxies (including multi-proxy by automatically scanning proxies from websites), among other features.


Depending on what you are really trying to do, you can use many different solutions.

If you just want to fetch the HTML code of a web page, then URL.getContent() may be your solution. Here is a little tutorial:

http://www.javacoffeebreak.com/books/extracts/javanotesv3/c10/s4.html
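For what it's worth, a minimal sketch of fetching a page's raw HTML with only the JDK might look like the following. It uses `URL.openStream()` rather than `URL.getContent()`, because `openStream()` always returns an `InputStream` while `getContent()` returns an `Object` whose type depends on the content. The class name `PageFetcher`, the UTF-8 encoding, and the `http://example.com/` default are assumptions for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: downloads the text behind a URL.
class PageFetcher {

    // Reads the whole stream as text. Assumption: the page is UTF-8 encoded.
    static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Opens the URL and returns its content as a single string.
    static String fetch(URL url) throws IOException {
        return readAll(url.openStream());
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address; pass the real page as the first argument.
        String address = args.length > 0 ? args[0] : "http://example.com/";
        System.out.println(fetch(URI.create(address).toURL()));
    }
}
```

This only gets you the HTML as a string; you still need one of the parsers mentioned in the other answers to pull the actual data out of it.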

EDIT: I didn't understand that he was looking for a way to parse the HTML code. Some tools have been suggested above. Sorry about that.
