I have around 10-odd sites that I wish to scrape. A couple of them are WordPress blogs and follow the same HTML structure, albeit with different classes. The others are either forums or blogs of other formats.
The information I'd like to scrape is common to all of them: the post content, the timestamp, the author, the title and the comments.
My question is: do I have to create one separate spider for each domain? If not, how can I create a generic spider that lets me scrape by loading options from a configuration file or something similar?
I figured I could load the XPath expressions from a file whose location can be passed via the command line, but there seems to be a difficulty: scraping some domains requires that I apply a regex, e.g. select(expression_here).re(regex), while others do not.
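For illustration only, here is a minimal sketch of how such a configuration could carry an optional regex per field so both cases are handled uniformly; the file layout, field names and helper functions below are hypothetical, not anything Scrapy provides:

import json

def load_field_config(path):
    # Hypothetical JSON layout, one entry per domain:
    # {"someblog.com": {"title":  {"xpath": "//h1/text()"},
    #                   "author": {"xpath": "//span[@class='by']/text()",
    #                              "regex": "by\\s+(.*)"}}}
    with open(path) as f:
        return json.load(f)

def extract_fields(response, fields):
    """Apply each field's XPath, and its regex only when one is configured."""
    item = {}
    for name, spec in fields.items():
        selected = response.xpath(spec['xpath'])
        if 'regex' in spec:
            item[name] = selected.re_first(spec['regex'])
        else:
            item[name] = selected.get()
    return item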
In your Scrapy spider, set allowed_domains to a list of domains, for example:
class YourSpider(CrawlSpider):
    allowed_domains = ['domain1.com', 'domain2.com']
hope it helps
Well, I faced the same issue, so I create the spider class dynamically using type():
from scrapy.contrib.spiders import CrawlSpider
import urlparse


class GenericSpider(CrawlSpider):
    """A generic spider; uses type() to make a new spider class for each domain."""
    name = 'generic'
    allowed_domains = []
    start_urls = []

    @classmethod
    def create(cls, link):
        domain = urlparse.urlparse(link).netloc.lower()
        # Generate a class name such that domain www.google.com results in
        # class name GoogleComGenericSpider.
        class_name = (domain if not domain.startswith('www.') else domain[4:]).title().replace('.', '') + cls.__name__
        return type(class_name, (cls,), {
            'allowed_domains': [domain],
            'start_urls': [link],
            'name': domain
        })
So, say, to create a spider for 'http://www.google.com', I'll just do:
In [3]: google_spider = GenericSpider.create('http://www.google.com')
In [4]: google_spider
Out[4]: __main__.GoogleComGenericSpider
In [5]: google_spider.name
Out[5]: 'www.google.com'
Hope this helps
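As a follow-up usage note, here is a minimal sketch of how the generated spider classes could be run, assuming Scrapy's CrawlerProcess; the list of links is just an example:

from scrapy.crawler import CrawlerProcess

links = ['http://www.google.com', 'http://example.com']  # example links only

process = CrawlerProcess()
for link in links:
    # create() builds a fresh spider class scoped to this link's domain
    process.crawl(GenericSpider.create(link))
process.start()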
I do sort of the same thing, using the following XPath expressions:
'/html/head/title/text()' for the title, and
'//p[string-length(text()) > 150]/text()' for the post content.
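For context, a minimal sketch of how those two expressions could sit inside a Scrapy callback; the callback name and item keys are just illustrative:

def parse_item(self, response):
    # Title from the document head, post content from sufficiently long paragraphs.
    yield {
        'title': response.xpath('/html/head/title/text()').get(),
        'content': response.xpath('//p[string-length(text()) > 150]/text()').getall(),
    }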
You can use an empty allowed_domains attribute to instruct Scrapy not to filter any offsite request. But in that case you must be careful and only return relevant requests from your spider.
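A minimal sketch of that pattern, assuming a plain scrapy.Spider; the spider name, start URL and the keywords used for filtering are hypothetical:

import scrapy

class UnfilteredSpider(scrapy.Spider):
    name = 'unfiltered'
    allowed_domains = []  # empty: the offsite middleware drops nothing
    start_urls = ['http://example.com']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            # Nothing is filtered for us, so only follow links we actually care about.
            if 'blog' in url or 'forum' in url:
                yield scrapy.Request(url, callback=self.parse)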
You should use BeautifulSoup, especially since you're using Python. It enables you to find elements in the page and extract text using regular expressions.
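For example, a short BeautifulSoup sketch; the class-name patterns are just examples for WordPress-style markup:

import re
from bs4 import BeautifulSoup

html = """<html><body>
<h1 class="entry-title">A post</h1>
<div class="post-content"><p>Some text...</p></div>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')

# class_ accepts a compiled regex, which helps when different themes
# use slightly different class names for the same element.
title = soup.find(class_=re.compile(r'entry-title|post-title'))
content = soup.find('div', class_=re.compile(r'post-content|entry-content'))

print(title.get_text(strip=True))
print(content.get_text(strip=True))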
You can use the start_requests method! You can also prioritize each URL, and on top of that you can pass some metadata. Here's some sample code that works:
"""
For allowed_domains:
Let’s say your target url is https://www.example.com/1.html,
then add 'example.com' to the list.
"""
class crawler(CrawlSpider):
name = "crawler_name"
allowed_domains, urls_to_scrape = parse_urls()
rules = [
Rule(LinkExtractor(
allow=['.*']),
callback='parse_item',
follow=True)
]
def start_requests(self):
for i,url in enumerate(self.urls_to_scrape):
yield scrapy.Request(url=url.strip(),callback=self.parse_item, priority=i+1, meta={"pass_anydata_hare":1})
def parse_item(self, response):
response = response.css('logic')
yield {'link':str(response.url),'extracted data':[],"meta_data":'data you passed' }
I recommend you read this page in the Scrapy docs for more info:
https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests
Hope this helps :)