
Nutch 1.2 - Why won't Nutch crawl URLs with query strings?

Developer https://www.devze.com 2023-03-28 07:06 Source: Internet

I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in the crawl-urlfilter.txt file, so it looks like this now:

# skip urls with these characters
#-[]

#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed every filter, which means I'm telling Nutch to accept all URLs it finds on my website.

Does anyone have any suggestions? Or is this a bug in Nutch 1.2? Should I upgrade to 1.3, and will that fix the issue I am having? Or am I doing something wrong?


See my previous question here: "Adding URL parameter to Nutch/Solr index and search results".

The first 'Edit' should answer your question.


# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

You have to comment it out or modify it as follows:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]
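
To see why dropping ? and = from the character class matters, here is a minimal Python sketch of how such a rule list could be evaluated. It is not Nutch's own code. It assumes that rules are tried in order, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The accepts function, the trailing +. catch-all rule, and the example.com URL are hypothetical and only there for illustration.

```python
import re


def accepts(rules, url):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumed semantics (a sketch, not Nutch's implementation): rules are
    checked in order, '-' excludes, '+' includes, and a URL that matches
    no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# Rule from the answer as originally shipped, plus a hypothetical catch-all.
default_rules = ["-[?*!@=]", "+."]
# Same rule with ? and = removed, so query strings are no longer excluded.
modified_rules = ["-[*!@]", "+."]

url = "http://www.example.com/page.php?id=42"  # hypothetical URL

print(accepts(default_rules, url))   # False: '?' and '=' hit the exclusion
print(accepts(modified_rules, url))  # True: falls through to the catch-all
```

If the sketch matches your setup, a URL with a query string is dropped by the first rule list and kept by the second, which is the behavior change the edit above is meant to produce.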


By default, crawlers shouldn't follow links with query strings, in order to avoid spam and fake search engines.
