Different pages to different Nutch cores (within the same domain)

Developer https://www.devze.com 2023-04-12 10:35 Source: Internet

How can I instruct Nutch to treat page#1 as belonging to a core and page#2 as belonging to a different core (both pages from the same domain)?

Practical situation: let's say Nutch is crawling and indexing www.businessweek.com; let's also say that I have one core called "Japan" and another core called "France".

I want the page http://www.businessweek.com/magazine/content/05_51/b3964049.htm to be indexed only for the France core, since it's relevant for France but irrelevant for Japan.

Conversely, I want the page http://www.businessweek.com/magazine/content/11_27/b4235016555525.htm to be indexed only for the Japan core, but not for France.

Assuming we already know how to identify that a certain page belongs to a specific tag... how can Nutch be instructed about that?


As far as I know, Nutch only works with a single index: a page is either crawled and indexed, or it isn't. You can use regex URL filters to prevent some pages from being crawled.
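As a rough illustration of the URL-filter idea, the sketch below writes a small regex-urlfilter.txt. The helper name write_url_filter and the target directory are hypothetical; the rule syntax (a leading - to exclude, + to include, first matching rule wins) is the usual Nutch convention, but check it against your Nutch version.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal regex-urlfilter.txt into the
# directory given as $1 (for example, a copy of Nutch's conf directory).
write_url_filter() {
  cat > "$1/regex-urlfilter.txt" <<'EOF'
# Skip one specific article (the first matching rule decides)
-^http://www\.businessweek\.com/magazine/content/05_51/b3964049\.htm$
# Accept everything else on the site
+^http://www\.businessweek\.com/
EOF
}
```

Note that this only keeps a page out of the crawl entirely; it does not route pages to different cores.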

Unfortunately, the pages you linked are nearly identical. The headers are the same except for the title tag, and you can't get any information from the URL either.

Assuming there is a typo in the headline of your question and you meant to add different pages to different Solr cores, you could do the following:

  • Add all pages to both Solr cores
  • Execute a delete query against the France core that removes everything not matching a certain criterion, and do the same for the Japan core:

    curl "$FRENCH_SERVER/update" -H "Content-Type: text/xml" --data-binary '<delete><query>NOT title:French</query></delete>' 2>&1
    curl "$JAPANESE_SERVER/update" -H "Content-Type: text/xml" --data-binary '<delete><query>NOT title:Japan</query></delete>' 2>&1

(These commands are not tested; use them at your own risk :)
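The delete-by-query step could be wrapped up roughly as follows. This is an untested sketch: the helper names build_delete_xml and prune_core are made up for illustration, the commit=true parameter is an assumption about how you want changes to become visible, and the query is inserted as-is without XML escaping.

```shell
#!/bin/sh
# Hypothetical helper: wrap a Solr query in a delete-by-query XML message.
# The query is not XML-escaped, so avoid characters such as & or < in it.
build_delete_xml() {
  printf '<delete><query>%s</query></delete>' "$1"
}

# Hypothetical helper: post the delete message to a core and commit.
#   $1 = base URL of the core, $2 = query selecting the documents to delete
prune_core() {
  curl -sS "$1/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary "$(build_delete_xml "$2")"
}
```

Usage would then look like prune_core "$FRENCH_SERVER" 'NOT title:French' followed by prune_core "$JAPANESE_SERVER" 'NOT title:Japan'. Try it on a throwaway core first, since a delete-by-query cannot be undone once committed.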
