Changing the URL domain in a Nutch index programmatically

Developer https://www.devze.com 2023-02-15 23:54 Source: web
I'm currently building a search engine for a website's content (only for searching within that website). However, I'm thinking of building the index on the staging server. It goes something like this:

1. I stage my code at www.staging_server.com
2. I index the pages at www.staging_server.com
3. I copy the code from www.staging_server.com to www.production_server.com
4. I copy the index to www.production_server.com index???

The problem with step 4 is that the URLs in the index created in step 2 are in the form www.staging_server.com/index, www.staging_server.com/whatever, www.staging_server.com/anything. What I need is www.production_server.com/index, www.production_server.com/whatever, www.production_server.com/anything.

I'm wondering whether the URLs in the index can be changed programmatically, and if so, how to do that.

Note: I'm a Nutch beginner, so please be merciful.


If you are only working with the index after the crawl, you can open the index with a Lucene IndexReader and add new records with an IndexModifier. You can page through each document, create a copy of the document with the new URL, and then add the copy back to the index. You will need to delete the original document if you do not wish it to persist in the index.

Lucene does not allow a document to be updated in place; instead, you delete the old record and insert a new one.
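
The per-document step above comes down to swapping the host part of each stored URL. Here is a minimal sketch of that step in plain Java, with the Lucene reading and writing left out. The class name UrlHostRewriter and the method rewriteHost are hypothetical names for illustration, not part of Nutch or Lucene. It assumes simple URLs with no user:password@ part. One caveat to keep in mind: a document read back through an IndexReader contains only its stored fields, so any field that was indexed but not stored will not survive the copy.

```java
// Hypothetical helper, not part of Nutch or Lucene: swaps the host of a URL string.
final class UrlHostRewriter {

    /**
     * Returns url with its host changed from oldHost to newHost.
     * URLs whose host is not oldHost are returned unchanged.
     * Assumption: no "user:password@" part in the URL.
     */
    static String rewriteHost(String url, String oldHost, String newHost) {
        // Skip a leading scheme such as "http://" if there is one;
        // the URLs quoted in the question are written without a scheme.
        int hostStart = 0;
        int schemeEnd = url.indexOf("://");
        if (schemeEnd > 0
                && url.substring(0, schemeEnd).matches("[A-Za-z][A-Za-z0-9+.-]*")) {
            hostStart = schemeEnd + 3;
        }

        // The host ends at the first '/', ':', '?' or '#', or at the end of the string.
        int hostEnd = url.length();
        for (int i = hostStart; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '/' || c == ':' || c == '?' || c == '#') {
                hostEnd = i;
                break;
            }
        }

        String host = url.substring(hostStart, hostEnd);
        if (!host.equalsIgnoreCase(oldHost)) {
            return url; // a different host: leave the URL untouched
        }
        return url.substring(0, hostStart) + newHost + url.substring(hostEnd);
    }

    public static void main(String[] args) {
        String oldHost = "www.staging_server.com";
        String newHost = "www.production_server.com";
        String[] urls = {
            "http://www.staging_server.com/index",
            "www.staging_server.com/whatever",
            "http://www.example.org/anything"
        };
        for (String url : urls) {
            System.out.println(rewriteHost(url, oldHost, newHost));
        }
    }
}
```

In the copy loop you would apply rewriteHost to the URL field of each document before adding the copy to the index. The comparison is done on the host alone rather than with a blind string replace, so a staging host name that happens to appear in a path or query string is left as it is.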
