web-crawler
Stopping Google's crawl of my site
Google has started crawling my site, but from a temporary domain (beta.mydomain instead of just mydomain), and I also only want it to crawl some of my pages. Therefore, I want to stop their crawl… [more]
2023-03-24 04:33 · Category: Q&A
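The usual fix here is a robots.txt served from the root of the temporary domain that blocks everything, plus a more selective one on the main domain. A minimal sketch, assuming beta.mydomain and mydomain can each serve their own file (the /private/ path is a placeholder, and Allow rules are honored by Googlebot but not by every crawler):

```
# http://beta.mydomain/robots.txt — keep crawlers off the temporary domain entirely
User-agent: *
Disallow: /

# http://mydomain/robots.txt — block only the sections that should stay uncrawled
User-agent: *
Disallow: /private/
```

Note that robots.txt only stops crawling; pages Google has already discovered may need a noindex meta tag or a removal request in Search Console before they drop out of the index.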

Python Crawler - need help with my algorithm
** Added a summary of the problem at the end of the post ** I've written a crawler that fetches and parses URLs. [more]
2023-03-24 02:39 · Category: Q&A

Crawler url queue or hash list?
I'm re-writing the spidering/crawler portion of a Delphi 6 site mapper application that I previously wrote. The app spiders a single site. [more]
2023-03-23 18:35 · Category: Q&A
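The short answer to "queue or hash list?" is usually both: a FIFO queue holds the URLs still to visit, and a hash-based set records every URL already enqueued so duplicates are rejected in O(1). The question is about Delphi 6, but the structure is language-agnostic; a minimal Python sketch, where fetch_links is an assumed helper that returns the absolute URLs found on a page:

```python
from collections import deque

def crawl_site(start_url, fetch_links):
    """Visit every reachable page once; fetch_links(url) is an assumed helper."""
    frontier = deque([start_url])   # FIFO queue: URLs waiting to be crawled
    seen = {start_url}              # hash set: every URL ever enqueued, for O(1) duplicate checks

    while frontier:
        url = frontier.popleft()
        for link in fetch_links(url):
            if link not in seen:    # membership test against the set, not the queue (queue scans are O(n))
                seen.add(link)
                frontier.append(link)
    return seen
```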

Python urllib2 and [errno 10054] An existing connection was forcibly closed by the remote host and a few urllib2 problems
I've written a crawler that uses urllib2 to fetch URLs. Every few requests I get some weird behavior; I've tried analyzing it with Wireshark and couldn't understand the problem. [more]
2023-03-23 07:33 · Category: Q&A
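Errno 10054 is Windows' "connection reset by peer", and with urllib2 it commonly shows up when a server drops the connection mid-transfer or dislikes Python's default User-Agent. Without seeing the crawler it is hard to be definitive, but a browser-like User-Agent plus retries with backoff is the usual mitigation; a sketch using urllib.request, the Python 3 successor to urllib2:

```python
import time
import urllib.error
import urllib.request

def fetch_with_retries(url, attempts=3, delay=2.0):
    """Fetch a URL, retrying on connection resets and transient URL errors."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "Mozilla/5.0 (compatible; MyCrawler/0.1)"}
    )
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except (ConnectionResetError, urllib.error.URLError):
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear backoff before retrying
```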

Web spider/crawler in C# Windows.forms
I have created a web crawler in VC#. The crawler indexes certain information from .nl sites by brute-forcing all of the possible .nl addresses, starting with http://aa.nl to (theoretically) http://zzz… [more]
2023-03-22 05:52 · Category: Q&A
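The address space described (every two- and three-letter .nl host from aa to zzz) is only 26² + 26³ = 18,252 names, so it can be enumerated up front rather than generated ad hoc; a small Python sketch of that enumeration, as an illustration of the idea rather than of the original C# code:

```python
import itertools
import string

def candidate_hosts(tld="nl", max_len=3):
    """Yield http://aa.nl ... http://zzz.nl style URLs in lexicographic order."""
    for length in range(2, max_len + 1):
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            yield "http://{}.{}".format("".join(letters), tld)
```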

Interview question: Honeypots and web crawlers
I was recently reading a book as prep for an interview and came across the following question: What will you do when your crawler runs into a honeypot that generates an infinite subgraph for you to… [more]
2023-03-22 05:15 · Category: Q&A
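Answers to this interview question usually bound the crawl instead of trying to recognize the honeypot itself: cap the crawl depth, cap the number of pages fetched per host, and canonicalize URLs so trivially different links don't count as new pages. A sketch of those guards, where the limits and the fetch_links helper are illustrative:

```python
from collections import deque
from urllib.parse import urldefrag, urlparse

MAX_DEPTH = 10              # ignore links deeper than this from the seed
MAX_PAGES_PER_HOST = 1000   # stop expanding a host that keeps minting new URLs

def bounded_crawl(seed, fetch_links):
    seen = {seed}
    per_host = {}
    frontier = deque([(seed, 0)])
    while frontier:
        url, depth = frontier.popleft()
        host = urlparse(url).netloc
        per_host[host] = per_host.get(host, 0) + 1
        if depth >= MAX_DEPTH or per_host[host] > MAX_PAGES_PER_HOST:
            continue  # a honeypot can generate infinite URLs, but it cannot escape these caps
        for link in fetch_links(url):
            link, _ = urldefrag(link)   # drop #fragments so they don't masquerade as new pages
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen
```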

Use python to crawl a website
So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: How do I do this more dynamically than using nested while… [more]
2023-03-21 16:10 · Category: Q&A
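Nested while loops usually mean the crawl depth is baked into the code; driving the crawl from a queue handles any depth with a single loop. A sketch with BeautifulSoup and the standard library, assuming a static-HTML site that permits crawling:

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

def crawl(start_url, max_pages=100):
    """Breadth-first crawl of one site, collecting every in-site link it finds."""
    allowed_host = urlparse(start_url).netloc
    seen, frontier = {start_url}, deque([start_url])
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        soup = BeautifulSoup(urlopen(url).read(), "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])          # resolve relative hrefs
            if urlparse(link).netloc == allowed_host and link not in seen:
                seen.add(link)
                frontier.append(link)
    return seen
```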

multi language site and search engines
I'm developing a site for a company that has clients from all over the world, and the site will be served in two languages: Italian (local) and English. [more]
2023-03-21 09:35 · Category: Q&A

Excluding testing subdomain from being crawled by search engines (w/ SVN Repository)
I have: domain.com and testing.domain.com. I want domain.com to be crawled and indexed by search engines, but not testing.domain.com. [more]
2023-03-21 08:39 · Category: Q&A
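Because both hostnames typically point at the same SVN working copy, a static robots.txt in the repository would block or allow both at once. One approach, sketched here with Flask purely as an illustration, is to generate robots.txt from the Host header so that only testing.domain.com gets the blocking version:

```python
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # Block everything on the testing subdomain, allow everything on the live domain.
    if request.host.startswith("testing."):
        body = "User-agent: *\nDisallow: /\n"
    else:
        body = "User-agent: *\nDisallow:\n"
    return Response(body, mimetype="text/plain")
```

HTTP authentication on the testing subdomain is the more robust option, since robots.txt is only advisory.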

Can one specify a file content-type to download using Wget?
I want to use wget to download files linked from the main page of a website, but I only want to download text/html files. Is it possible to limit wget to text/html files based on the mime content type… [more]
2023-03-21 06:57 · Category: Q&A
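As far as I know, stock GNU wget filters by file-name suffix (for example -r -A html,htm), not by the server-reported MIME type. A workaround is to check the Content-Type yourself with a HEAD request before downloading; a small Python sketch:

```python
from urllib.request import Request, urlopen

def is_html(url):
    """HEAD the URL and report whether the server says it is text/html."""
    response = urlopen(Request(url, method="HEAD"), timeout=30)
    return response.headers.get_content_type() == "text/html"

def download_if_html(url, filename):
    """Download the URL only when the HEAD check reports text/html."""
    if is_html(url):
        with open(filename, "wb") as out:
            out.write(urlopen(url).read())
```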