
How to parse (only text) web sites while crawling

https://www.devze.com 2022-12-26 01:59 Source: web
I can successfully run the crawl command via Cygwin on Windows XP, and I can also run web searches using Tomcat.

But I also want to save the parsed pages during the crawl.

So when I start crawling like this:

bin/nutch crawl urls -dir crawled -depth 3

I also want to save the parsed HTML pages as text files. That is, during the crawl started with the command above, whenever Nutch fetches a page it should automatically save the parsed version of that page (text only) to a text file, and the file name could be the fetched URL.

I really need help with this. It will be used in my university language-detection project.

Thanks


The crawled pages are stored in the segments. You can access them by dumping the segment content:

nutch readseg -dump crawl/segments/20100104113507/ dump

You will have to do this for each segment.
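The per-segment dumping described above could be scripted, for example like this (a sketch: the `crawl` and `dump` paths are assumptions taken from the question, and the `-no*` flags are the SegmentReader options that suppress every section except the parsed text; the commands are echoed as a dry run so they can be inspected before executing):

```shell
# Sketch: dry-run the readseg dump for every segment in a crawl directory.
dump_all_segments() {
  # $1: crawl directory containing a segments/ subfolder
  for seg in "$1"/segments/*; do
    [ -d "$seg" ] || continue
    # echoed rather than executed, so the commands can be reviewed first;
    # the -no* flags keep only the parsed text in the dump
    echo "bin/nutch readseg -dump $seg dump/$(basename "$seg")" \
         "-nocontent -nofetch -nogenerate -noparse -noparsedata"
  done
}

dump_all_segments crawl
```

Removing the `echo` turns the dry run into the real per-segment dump.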

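To get one text file per fetched URL, as the question asks, the dump could be post-processed. This is a sketch under an assumption about the dump layout (records introduced by `Recno::` lines, with `URL::` and `ParseText::` markers); the function name, output directory, and URL-to-filename mangling are mine, not part of Nutch:

```shell
# Sketch: split a readseg dump into one text file per URL.
split_dump() {
  # $1: dump file produced by "nutch readseg -dump", $2: output directory
  mkdir -p "$2"
  awk -v outdir="$2" '
    /^URL::/       { url = $2
                     # make the URL filesystem-safe (assumed naming scheme)
                     gsub(/[^A-Za-z0-9._-]/, "_", url)
                     out = outdir "/" url ".txt"
                     intext = 0; next }
    /^ParseText::/ { intext = 1; next }
    /^Recno::/     { intext = 0; next }
    intext && out  { print > out }
  ' "$1"
}
```

For example, `split_dump dump/20100104113507 dump_txt` would write a `.txt` file per page under `dump_txt/`.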
