Java robots.txt parser with wildcard support

I'm looking for a robots.txt parser in Java, which supports the same pattern matching rules as the Googlebot.

I've found some libraries that parse robots.txt files, but none of them supports Googlebot-style pattern matching:

  • Heritrix (there is an open issue on this subject)
  • Crawler4j (looks like the same implementation as Heritrix)
  • jrobotx

Does anyone know of a Java library that can do this?
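To make the requirement concrete: Googlebot treats "*" in an Allow/Disallow path as "match any sequence of characters" and a trailing "$" as "end of URL", with everything else being a prefix match. Here is a minimal sketch of just that matching rule in plain Java (it ignores Allow/Disallow precedence and longest-match resolution, so it is an illustration of what I need, not a parser):

```java
import java.util.regex.Pattern;

public class GooglebotPatternSketch {

    // Translate a Googlebot-style path pattern into a regex:
    // '*' matches any sequence of characters, a trailing '$' anchors the end
    // of the URL, everything else is a literal prefix match.
    static Pattern toRegex(String rule) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < rule.length(); i++) {
            char c = rule.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '$' && i == rule.length() - 1) {
                regex.append("$");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String rule, String path) {
        // lookingAt() = match anchored at the start of the path (prefix semantics)
        return toRegex(rule).matcher(path).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(matches("/private*/", "/private1/data"));   // true
        System.out.println(matches("/private*/", "/private"));         // false
        System.out.println(matches("/*.gif$", "/images/cat.gif"));     // true
        System.out.println(matches("/*.gif$", "/images/cat.gift"));    // false
    }
}
```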


Nutch seems to be using a combination of crawler-commons with some custom code (see RobotsRulesParser.java). I'm not sure of the current state of affairs, though.
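If crawler-commons on its own covers your cases, the parsing side is fairly compact. A rough sketch of how it is typically used is below; note that the four-argument parseContent shown here comes from older releases (newer versions prefer a collection of agent names), and you would want to test its wildcard handling against your own Googlebot-style rules rather than take it on faith:

```java
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.nio.charset.StandardCharsets;

public class RobotsCheck {
    public static void main(String[] args) {
        String robotsTxt =
                "User-agent: *\n" +
                "Disallow: /*.gif$\n" +
                "Disallow: /private*/\n";

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        // parseContent(robotsUrl, content, contentType, robotNames)
        BaseRobotRules rules = parser.parseContent(
                "https://example.com/robots.txt",
                robotsTxt.getBytes(StandardCharsets.UTF_8),
                "text/plain",
                "mybot");

        // Expected: blocked by the /*.gif$ rule
        System.out.println(rules.isAllowed("https://example.com/images/cat.gif"));
        // Expected: allowed, no rule matches
        System.out.println(rules.isAllowed("https://example.com/public/index.html"));
    }
}
```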

In particular, the issue NUTCH-1455 looks to be quite related to your needs:

If the user-agent name(s) configured in http.robots.agents contain spaces, they are not matched even if exactly contained in the robots.txt: http.robots.agents = "Download Ninja,*"

Perhaps it's worth it to try/patch/submit a fix :)
