
Webcrawler, feedback?

Hey folks, every once in a while I need to automate data-collection tasks from websites. Sometimes I need a bunch of URLs from a directory, sometimes I need an XML sitemap (yes, I know there is plenty of software and plenty of online services for that).

Anyway, as a follow-up to my previous question, I've written a little web crawler that can visit websites.

  • A basic crawler class for quickly and easily interacting with a single website.

  • Override "doAction(String URL, String content)" to process the content further (e.g. store it, parse it); see the sketch after this list.

  • The concept allows for multi-threaded crawlers: all class instances share the processed and queued lists of links.

  • Instead of keeping track of processed links and queued links within the object, a JDBC connection could be established to store links in a database.

  • It is currently limited to one website at a time; however, this could be expanded by adding an externalLinks stack and adding to it as appropriate.

  • JCrawler is intended to be used to quickly generate XML sitemaps or parse websites for your desired information. It's lightweight.
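
For reference, here is a minimal sketch of the shape that design takes. Everything below is my own illustrative assumption (only doAction and the shared processed/queued lists come from the description above); it is not the actual code from the pastebins:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.Queue;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Hypothetical skeleton mirroring the described design.
    public abstract class SimpleCrawler implements Runnable {

        // Shared across all instances so multiple threads never revisit a URL.
        private static final Set<String> processed = ConcurrentHashMap.newKeySet();
        private static final Queue<String> queued = new ConcurrentLinkedQueue<>();

        public static void enqueue(String url) {
            queued.add(url);
        }

        @Override
        public void run() {
            String url;
            while ((url = queued.poll()) != null) {
                if (!processed.add(url)) {
                    continue; // another thread already handled this URL
                }
                try {
                    // Link extraction (e.g. via your HTMLUtils) would call
                    // enqueue() here for every same-site link it discovers.
                    doAction(url, fetch(url));
                } catch (Exception e) {
                    System.err.println("Failed " + url + ": " + e.getMessage());
                }
            }
        }

        // Override to process the content further (store it, parse it, ...).
        protected abstract void doAction(String url, String content);

        private String fetch(String url) throws Exception {
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }
    }

With this shape, multi-threading amounts to starting several instances over the shared queue, e.g. new Thread(new MySitemapCrawler()).start() a few times.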

Is this a good/decent way to write a crawler, given the limitations above? Any input would help immensely :)

http://pastebin.com/VtgC4qVE - Main.java

http://pastebin.com/gF4sLHEW - JCrawler.java

http://pastebin.com/VJ1grArt - HTMLUtils.java


Your crawler does not respect robots.txt in any way, and it uses a fake User-Agent string to pass itself off as a web browser. This could lead to legal trouble down the road; keep it in mind.
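
For illustration, here is a naive sketch of one way to honor robots.txt. It only reads the Disallow lines of the "User-agent: *" group; a real crawler should use a tested parser and handle Allow rules, wildcards, Crawl-delay, and per-agent groups (all names below are made up):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    // Naive robots.txt check: fetch /robots.txt once and test a path against
    // the Disallow rules of the "User-agent: *" group. Sketch only.
    public class RobotsCheck {

        public static boolean isAllowed(String host, String path) throws Exception {
            List<String> disallowed = new ArrayList<>();
            boolean inStarGroup = false;
            URL robots = new URL("http://" + host + "/robots.txt");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(robots.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        inStarGroup = line.substring(11).trim().equals("*");
                    } else if (inStarGroup && line.toLowerCase().startsWith("disallow:")) {
                        String rule = line.substring(9).trim();
                        if (!rule.isEmpty()) {
                            disallowed.add(rule);
                        }
                    }
                }
            }
            for (String rule : disallowed) {
                if (path.startsWith(rule)) {
                    return false; // path falls under a Disallow rule
                }
            }
            return true;
        }
    }

A missing robots.txt surfaces here as an exception; by convention, crawlers treat that case as permission to crawl.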


I have written a custom web crawler at my company, following steps similar to the ones you mention, and I found them to work well. The only addition I would suggest is a polling frequency, so the crawler re-crawls after a certain period of time.

It should therefore follow the "Observer" design pattern: if a new update is found at a given URL after that period, the crawler updates or writes out a file.
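
As a hypothetical sketch (all names below are made up for illustration), that could combine a listener interface with java.util.concurrent scheduling: the crawler re-fetches a URL at a fixed interval and notifies observers only when the content has changed:

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Observer pattern plus polling: listeners register for updates, and a
    // scheduler re-crawls the URL periodically, notifying listeners whenever
    // the fetched content differs from the previous crawl.
    public abstract class PollingCrawler {

        public interface UpdateListener {
            void onUpdate(String url, String newContent);
        }

        private final List<UpdateListener> listeners = new CopyOnWriteArrayList<>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        private String lastContent = "";

        public void addListener(UpdateListener listener) {
            listeners.add(listener);
        }

        public void poll(String url, long periodMinutes) {
            scheduler.scheduleAtFixedRate(() -> {
                String content = fetch(url);
                if (!content.equals(lastContent)) {
                    lastContent = content;
                    for (UpdateListener l : listeners) {
                        l.onUpdate(url, content);
                    }
                }
            }, 0, periodMinutes, TimeUnit.MINUTES);
        }

        // Delegate to the crawler's existing page-fetching logic.
        protected abstract String fetch(String url);
    }

A listener that writes the changed page to a file then just implements onUpdate.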


I would recommend the open-source JSpider as the starting point for your crawler project. It covers all the major concerns of a web crawler, including robots.txt, and it has a plug-in scheme that you can use to apply your own tasks to each page it visits.

This is a brief and slightly dated review of JSpider. The pages around this one review several other Java spidering applications.

http://www.mksearch.mkdoc.org/research/spiders/j-spider/
