Regarding crawling of short URLs using Nutch

I am using the Nutch crawler for an application that needs to crawl a set of URLs I place in the urls directory and fetch only the contents of those URLs. I am not interested in the contents of internal or external links, so I have run the crawl command with a depth of 1:

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the URLs and gives me their contents.

I am reading the content using the readseg utility:

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata

With this, I am able to fetch the content of each webpage.

The problem I am facing is this: if I give direct URLs like

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

then I am able to get the contents of the webpages. But when I give a set of short URLs like

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZf

I am not able to fetch the contents.

When I read the segments, no content is shown. Please find below the content of the dump file read from the segments.

Recno:: 0
URL:: http://is.gd/0yKjO6
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407
Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
Recno:: 1
URL:: http://is.gd/1tpKaN
Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
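
Looking at the dump, the status is db_unfetched and the metadata already carries a Location header, so the short URL is answering with a redirect that Nutch records but does not follow. A quick way to double-check the redirect outside Nutch (a sketch; any of the is.gd URLs above would do) is:

curl -sI http://is.gd/0yKjO6
# expect a 301/302 status line and a Location header pointing at the
# target page, matching the Location value recorded in the dump above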

I have also tried setting the max.redirects property in nutch-default.xml to 4, but didn't see any progress. Kindly provide me a solution for this problem.

Thanks and regards, Arjun Kumar Reddy


Using Nutch 1.2, try editing the file conf/nutch-default.xml: find http.redirect.max and change its value to at least 1 instead of the default 0.

<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
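
A side note on convention: nutch-default.xml is usually left untouched, and local overrides go in conf/nutch-site.xml, which takes precedence. A minimal sketch of such an override (the value 2 is just an example, matching the snippet above):

<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides, kept separate from nutch-default.xml -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- follow up to 2 redirects immediately instead of deferring them -->
    <value>2</value>
  </property>
</configuration>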

Good luck


You will have to set a depth of 2 or more, because the first fetch returns a 301 (or 302) code. The redirection will be followed on the next iteration, so you have to allow more depth.
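
With that in mind, a re-run of the crawl from the question would simply raise the depth (same urls and crawl directories as above):

bin/nutch crawl urls -dir crawl -depth 2
# depth 2: iteration 1 fetches the is.gd URL and records the redirect,
# iteration 2 fetches the target page the redirect points to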

Also, make sure that your conf/regex-urlfilter.txt allows all the URLs that will be followed; a sketch follows.
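
A minimal sketch of the relevant conf/regex-urlfilter.txt entries, assuming you want to accept the is.gd links plus the hosts they redirect to (holykaw.alltop.com is taken from the dump in the question; extend the list for your other targets):

# accept the short URLs themselves
+^http://is\.gd/
# accept the redirect targets
+^http://holykaw\.alltop\.com/
# reject everything else
-.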
