I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filter in the crawl-urlfilter.txt file so it looks like this now:
# skip urls with these characters
#-[]
#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
So I think I've effectively removed every filter, which should tell Nutch to accept all URLs it finds on my website.
Does anyone have any suggestions? Is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue I am having? Or am I doing something wrong?
See my previous question, "Adding URL parameter to Nutch/Solr index and search results".
The first 'Edit' should answer your question.
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
You have to comment it out or modify it to:
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
By default, crawlers shouldn't crawl links with query strings, in order to avoid spam and fake search engines.
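To see why that one line matters, here is a rough Python sketch of how a first-match-wins regex filter treats a URL with a query string. It is only a simulation, not Nutch's actual code: the `accepts` helper, the rule lists, and the catch-all `+.` accept rule are assumptions made for illustration (your file will normally have its own `+` rule for your domain).

```python
import re

# Hypothetical rule lists mirroring the two versions of the filter line.
# The trailing "+." accept-everything rule is an assumption for this sketch.
DEFAULT_RULES = [
    ("-", r"[?*!@=]"),  # default: reject URLs that look like queries
    ("+", r"."),        # accept anything else
]
MODIFIED_RULES = [
    ("-", r"[*!@]"),    # modified: '?' and '=' are no longer rejected
    ("+", r"."),
]


def accepts(rules, url):
    """Apply rules in order; the first pattern found in the URL decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # No rule matched at all, so the URL is ignored.
    return False


url = "http://example.com/page?id=42"
print(accepts(DEFAULT_RULES, url))   # rejected by the '?' and '=' characters
print(accepts(MODIFIED_RULES, url))  # now falls through to the accept rule
```

Note the last line of `accepts`: in this sketch a URL that matches no rule is dropped, so commenting out every line in the file, including the `+` rule, would not accept everything.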