web-crawler
Exclude Some URL from getting crawled
I am writing a crawler, and I do not want it to crawl some pages (that is, exclude some links so that they are not crawled). So I wrote exclusions for that page. Anything wrong with this cod[details]
2023-03-20 22:31 Category: Q&A
python/scrapy question: How to avoid endless loops
I am using the web-scraping framework Scrapy to data mine some sites. I am trying to use the CrawlSpider, and the pages have a 'back' and a 'next' button. The URLs are in the format[details]
2023-03-20 14:58 Category: Q&A
Getting data from a website that needs you to log in (Java)
I don't even know if what I'm asking is possible, and I don't know what to search for on Google. Basically, there are multiple projects that would require me to fetch some data from websites. The e[details]
2023-03-20 06:34 Category: Q&A
Click a Button in Scrapy
I'm using Scrapy to crawl a webpage. Some of the information I need only pops up when you click on a certain button (and of course it also appears in the HTML code after clicking).[details]
2023-03-20 06:11 Category: Q&A
Search engines ignoring meta description content and showing footer
I have a very simple site that has mostly images, a login form, and a link to sign up. No actual text exists in the body except for the footer, which shows the link to usage terms and copyright[details]
2023-03-20 03:18 Category: Q&A
Invalid Cookie Header and then it asks for Authorization
I am trying to crawl a page that requires SiteMinder authentication, so I am trying to pass my username and password in the code itself to access that page and keep on crawling all the links that are[details]
2023-03-20 00:10 Category: Q&A
Tricking browser into calling javascript events?
So I'm trying to create a web spider. I've run into a website that has some JavaScript, and I want to trick the browser into thinking that an event has been fired and that it must call the correspo[details]
2023-03-19 21:54 Category: Q&A
url requesting with different proxies in python
I am trying to retrieve some pages that are Google search results and cached. I actually have two problems for now. I can normally download the first ten results, but can't get it to wor[details]
2023-03-19 11:24 Category: Q&A
Crawl only HTML page while checking the response header
I am trying to get all the URLs that have the header Content-Type: text/html, so I am checking the response header of each URL, and if it has Content-Type: text/html, then I want to print that url th[details]
2023-03-19 04:08 Category: Q&A
A java library for identifying data types in text
I have some structured text. I want to identify whether a value is a number, date, spatial coordinate, plain text, URI, etc. I can of course write my own solution.[details]
2023-03-19 03:19 Category: Q&A