How to remove expired items from database with Scrapy

Developer https://www.devze.com 2022-12-16 08:08 Source: Internet
I am spidering a video site that expires content frequently. I am considering using Scrapy to do my spidering, but am not sure how to delete expired items.

Strategies to detect if an item is expired are:

  1. Spider the site's "delete.rss".
  2. Every few days, try reloading the contents page and making sure it still works.
  3. Spider every page of the site's content indexes, and remove the video if it's not found.

Please let me know how to remove expired items in scrapy. I will be storing my scrapy items in a mysql DB via django.

2010-01-18 Update

I have found a solution that works, but it may not be optimal. I am maintaining a "found_in_last_scan" flag on every video that I sync. When the spider starts, it sets all the flags to False. When it finishes, it deletes the videos that still have the flag set to False. I did this by attaching handlers to signals.spider_opened and signals.spider_closed. Please confirm this is a valid strategy and that there are no problems with it.
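
The flag approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard-library sqlite3 module instead of Django and MySQL so that it runs on its own, and the table name video, the helper names and the connection handling are all hypothetical. In a real project the first and last functions would be the handlers connected to signals.spider_opened and signals.spider_closed.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table: one row per synced video.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video ("
        "url TEXT PRIMARY KEY, found_in_last_scan INTEGER NOT NULL)"
    )
    conn.commit()


def spider_opened(conn):
    """Start of a crawl: assume no video has been seen yet."""
    conn.execute("UPDATE video SET found_in_last_scan = 0")
    conn.commit()


def video_seen(conn, url):
    """Called for every scraped video: mark it as found, inserting it if new."""
    cur = conn.execute(
        "UPDATE video SET found_in_last_scan = 1 WHERE url = ?", (url,)
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO video (url, found_in_last_scan) VALUES (?, 1)", (url,)
        )
    conn.commit()


def spider_closed(conn, reason):
    """End of a crawl: delete videos that were never seen; return the count."""
    # Only sweep after a complete crawl. Scrapy passes reason='finished'
    # when the spider ran to completion.
    if reason != "finished":
        return 0
    cur = conn.execute("DELETE FROM video WHERE found_in_last_scan = 0")
    conn.commit()
    return cur.rowcount
```

One thing to watch with this strategy: if a crawl is interrupted or the site is temporarily down, most flags are still False at close time, which is why the sketch skips the delete unless the crawl finished normally.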


I haven't tested this!
I have to confess that I haven't tried using the Django models in Scrapy, but here goes:

The simplest way I imagine would be to create a new spider for the deleted.rss file by extending the XMLFeedSpider (Copied from the scrapy documentation, then modified). I suggest you do create a new spider because very little of the following logic is related to the logic used for scraping the site:

from scrapy.spiders import XMLFeedSpider
from myproject.items import DeletedUrlItem

class MySpider(XMLFeedSpider):
    name = 'example.com' # spider name (very old Scrapy releases used domain_name)
    start_urls = ['http://www.example.com/deleted.rss']
    iterator = 'iternodes' # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        # Called once for every <item> node in the feed
        item = DeletedUrlItem()
        # 'path/to/url' is a placeholder: use the XPath of the URL in your feed
        item['url'] = node.xpath('path/to/url').get()

        return item # return an Item

This is not a working spider for you to use, but IIRC RSS files are pure XML. I'm not sure what the deleted.rss looks like, but I'm sure you can figure out how to extract the URLs from the XML. Now, this example imports myproject.items.DeletedUrlItem, which is only a name in this example, so you need to create the DeletedUrlItem using something like the code below:

from scrapy.item import Item, Field

class DeletedUrlItem(Item):
    url = Field()

Instead of saving, you delete the items using Django's Model API in a Scrapy item pipeline. I assume you're using a DjangoItem:

# we raise a DropItem exception so Scrapy
# doesn't try to process the item any further
from scrapy.exceptions import DropItem

# import your model (hypothetical module path, adjust it to your Django app)
from myapp.models import yourModel

class DeleteUrlPipeline:

    def process_item(self, item, spider):
        if item['url']:
            # get() raises yourModel.DoesNotExist if the URL is not in the DB
            delete_item = yourModel.objects.get(url=item['url'])
            delete_item.delete() # actually delete the item!
            raise DropItem("Deleted: %s" % item)
        return item # nothing to delete, pass the item on

Notice the delete_item.delete().
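
The pipeline only runs once it is enabled in the project settings. A minimal sketch, assuming the class lives in a hypothetical myproject/pipelines.py (adjust the dotted path to wherever you put it):

```python
# settings.py
# Assumption: DeleteUrlPipeline is defined in myproject/pipelines.py.
ITEM_PIPELINES = {
    # The number sets the order; pipelines with lower values run first.
    "myproject.pipelines.DeleteUrlPipeline": 300,
}
```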


I'm aware that this answer may contain errors, since it's written from memory :-) but I will definitely update it if you've got comments or cannot figure this out.


If you have an HTTP URL which you suspect might not be valid any more (because you found it in a "deleted" feed, or just because you haven't checked it in a while), the simplest, fastest way to check is to send an HTTP HEAD request for that URL. In Python, that's best done with the httplib module of the standard library (renamed http.client in Python 3): make a connection object c to the host of interest with HTTPConnection (with HTTP 1.1 it may be reusable to check multiple URLs, with better performance and lower system load), then make one call (or more, if HTTP 1.1 is in use) to c's request method, with first argument 'HEAD' and second argument the URL you're checking (without the host part of course;-).

After each request you call c.getresponse() to get an HTTPResponse object, whose status attribute will tell you if the URL is still valid.

Yes, it's a bit low-level, but exactly for this reason it lets you optimize your task a lot better, with just a little knowledge of HTTP;-).
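
The HEAD check described above might look roughly like this in Python 3, where httplib is called http.client. It is a sketch rather than a finished tool: the function names are made up for illustration, and treating 404 and 410 as "expired" is an assumption you should check against how the video site actually responds.

```python
import http.client


def head_statuses(host, paths, port=80, timeout=10):
    """Return {path: HTTP status code}, sending HEAD requests over one connection."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    statuses = {}
    try:
        for path in paths:
            # path is the URL without the host part, e.g. "/videos/123"
            conn.request("HEAD", path)
            response = conn.getresponse()
            response.read()  # a HEAD response has no body; this finishes it
            statuses[path] = response.status
    finally:
        conn.close()
    return statuses


def is_expired(status):
    # Assumption: 404 (Not Found) and 410 (Gone) mean the content was removed.
    return status in (404, 410)
```

For example, head_statuses("www.example.com", ["/videos/1", "/videos/2"]) returns a dict of status codes that can be filtered with is_expired. If the server closes the connection after each response, HTTPConnection reconnects on the next request, so the loop still works, just without the keep-alive benefit.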
