
In what scenarios might a web crawler be CPU limited as opposed to IO limited?

Source: https://www.devze.com, 2023-03-07 04:41
It seems like typical crawlers, ones that just download a small number of pages or do very little processing to decide which pages to download, are IO limited.

I am curious what order-of-magnitude estimates (sizes of the relevant data structures, number of stored pages, indexing requirements, etc.) might actually make CPU the bottleneck.

For example, an application might want to calculate some probabilities based on the links found on a page in order to decide which page to crawl next. This function takes O(noOfLinks) and is evaluated N times (at each step), where N is the number of pages I want to download in one round of crawling. I have to sort and keep track of these probabilities, and I have to keep track of a list of O(N) pages that will eventually be dumped to disk and into the index of a search engine. Is it not possible (assuming one machine) that N grows large enough, and that storing the pages and manipulating the links gets expensive enough, to compete with the IO response?
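To make the cost model concrete, here is a minimal sketch of the frontier-selection step described above. The scoring function `link_score` is a hypothetical stand-in (keyword weights in the URL); the point is the shape of the work: every link gets scored, but a bounded min-heap keeps the tracked frontier at O(N) instead of sorting everything.

```python
import heapq

def link_score(link, weights):
    # Hypothetical scoring function: sum the weights of keywords
    # that appear in the URL. Any O(noOfLinks)-per-page scorer fits here.
    return sum(w for kw, w in weights.items() if kw in link)

def select_next_pages(pages, weights, n):
    """Score every outgoing link of every fetched page and keep only the
    n best candidates in a min-heap, so the frontier stays O(n) instead
    of growing with the whole crawl."""
    heap = []  # min-heap of (score, link); the worst kept candidate is on top
    for page_links in pages:
        for link in page_links:
            score = link_score(link, weights)
            if len(heap) < n:
                heapq.heappush(heap, (score, link))
            elif score > heap[0][0]:
                # Evict the current worst candidate in O(log n)
                heapq.heapreplace(heap, (score, link))
    # Return the highest-scoring links first
    return [link for score, link in sorted(heap, reverse=True)]
```

All of this work is pure CPU: as N and the average link count per page grow, this scoring-and-selection pass is exactly the kind of cost that could start competing with download time.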


Only when you are doing extensive processing on each page, e.g. if you are running some sort of AI to try to guess the semantics of the page.

Even if your crawler is running on a really fast connection, there is still overhead in creating connections, and you may also be limited by the bandwidth of the target machines.
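One practical way to tell which regime a crawler (or any workload) is in is to compare CPU time against wall-clock time for a unit of work. This is a generic measurement sketch, not part of any crawler framework: a ratio near 1.0 suggests CPU-bound, a ratio near 0.0 suggests the process spent most of its time waiting on IO.

```python
import time

def cpu_fraction(work):
    """Run `work` once and return the fraction of wall-clock time that
    was spent on the CPU. Near 1.0: CPU-bound; near 0.0: IO/wait-bound."""
    wall_start = time.perf_counter()    # wall-clock time
    cpu_start = time.process_time()     # CPU time of this process only
    work()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0
```

For example, `cpu_fraction(lambda: time.sleep(0.2))` comes out near zero (sleeping mimics waiting on the network), while a tight arithmetic loop comes out near one.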


If the page contains pictures and you are trying to do face recognition on them (i.e., to build a map of pages that have pictures of each person), that may be CPU bound because of the processing involved.


Not really. It takes I/O to download these additional links, and you're right back to I/O-limited again.


If you're using Tomcat, search for "Crawler Session Manager Valve".

