I'm looking for a robots.txt parser in Java that supports the same pattern-matching rules as Googlebot.
I've found some libraries that parse robots.txt files, but none of them supports Googlebot-style pattern matching:
- Heritrix (there is an open issue on this subject)
- Crawler4j (looks like the same implementation as Heritrix)
- jrobotx
Does anyone know of a Java library that can do this?
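For clarity, by "Googlebot-style pattern matching" I mean support for * as a wildcard matching any run of characters and a trailing $ as an end-of-URL anchor, rather than the plain prefix matching the libraries above implement. A minimal sketch of that matching logic (the class and helper names are just for illustration):

    import java.util.regex.Pattern;

    public class GooglebotPatternMatcher {

        // Convert a robots.txt path pattern into an equivalent regex:
        // '*' matches any sequence of characters, a trailing '$' anchors
        // the pattern to the end of the URL path.
        static Pattern toRegex(String robotsPattern) {
            boolean anchored = robotsPattern.endsWith("$");
            String body = anchored
                    ? robotsPattern.substring(0, robotsPattern.length() - 1)
                    : robotsPattern;
            StringBuilder regex = new StringBuilder("^");
            for (char c : body.toCharArray()) {
                if (c == '*') {
                    regex.append(".*");  // wildcard
                } else {
                    regex.append(Pattern.quote(String.valueOf(c)));
                }
            }
            if (!anchored) {
                regex.append(".*");      // unanchored patterns match prefixes
            }
            regex.append("$");
            return Pattern.compile(regex.toString());
        }

        public static void main(String[] args) {
            System.out.println(toRegex("/private*/").matcher("/private1/foo").matches()); // true
            System.out.println(toRegex("/*.php$").matcher("/index.php").matches());       // true
            System.out.println(toRegex("/*.php$").matcher("/index.php?x=1").matches());   // false
        }
    }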
Nutch seems to use a combination of crawler-commons and some custom code (see RobotsRulesParser.java). I'm not sure of the current state of affairs, though.
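If you want to try crawler-commons directly, a minimal sketch might look like the following. This assumes the SimpleRobotRulesParser API from the 0.x releases, where parseContent takes a comma-separated string of robot names; the signature may differ in your version, and whether wildcards are honoured depends on the release, so test against your own rules:

    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    import java.nio.charset.StandardCharsets;

    public class RobotsCheck {
        public static void main(String[] args) {
            String robotsTxt =
                    "User-agent: *\n" +
                    "Disallow: /*.php$\n";

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            // parseContent(url, content, contentType, robotNames)
            BaseRobotRules rules = parser.parseContent(
                    "http://example.com/robots.txt",
                    robotsTxt.getBytes(StandardCharsets.UTF_8),
                    "text/plain",
                    "MyCrawler");

            System.out.println(rules.isAllowed("http://example.com/index.php"));
            System.out.println(rules.isAllowed("http://example.com/page.html"));
        }
    }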
In particular, the issue NUTCH-1455 looks quite relevant to your needs:
If the user-agent name(s) configured in http.robots.agents contain spaces, they are not matched even if the name is contained exactly in the robots.txt: http.robots.agents = "Download Ninja,*"
Perhaps it's worth trying, patching, and submitting the fix :)