Nutch 1.2 - Why won't nutch crawl url with query strings?

Developer https://www.devze.com 2023-03-28 07:06 Source: Internet

I'm new to Nutch and not really sure what is going on here. I run Nutch and it crawls my website, but it seems to ignore URLs that contain query strings. I've commented out the filters in the crawl-urlfilter.txt file, so it looks like this now:

# skip urls with these characters
#-[]

#skip urls with slash delimited segment that repeats 3+ times
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

So I think I've effectively removed every filter, which should tell Nutch to accept all URLs it finds on my website.

Does anyone have any suggestions? Or is this a bug in Nutch 1.2? Should I upgrade to 1.3, and would that fix the issue I am having? Or am I doing something wrong?

See my previous question, "Adding URL parameter to Nutch/Solr index and search results".

The first 'Edit' should answer your question.

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

You have to comment it out or modify it as follows:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]

By default, crawlers shouldn't crawl links with query strings, to avoid spam and fake search engines.
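
To see why that one rule decides the outcome, here is a minimal Python sketch of first-match-wins filtering, which is how the comments in Nutch's URL filter file describe its behavior: each rule is a regex prefixed with `+` or `-`, the first rule that matches decides, and a URL that matches no rule is ignored. This is not Nutch's own code. The function names, the catch-all `+.` rule, and the example.com URL are assumptions made for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; blank lines and '#' comments are skipped."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose regex is found in the URL decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# assumed catch-all accept rule (not shown in this thread)
+.
"""

MODIFIED_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# assumed catch-all accept rule (not shown in this thread)
+.
"""

url = "http://example.com/page.php?id=7"  # hypothetical URL with a query string
print(accepts(parse_rules(DEFAULT_RULES), url))   # False: '?' and '=' hit the skip rule
print(accepts(parse_rules(MODIFIED_RULES), url))  # True: falls through to '+.'
```

Note that under these semantics a file with every rule commented out accepts nothing, so removing all filters is not the same as accepting all URLs; some `+` rule still has to match.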
