Scraping a non-RSS page to generate a feed

Developer https://www.devze.com 2022-12-19 20:06 Source: the web
I want to scrape a page that regularly updates (adding new articles with exactly the same structure as previous ones) in order to generate an RSS feed.

I can write the code to analyse the page easily, but how do I emulate a ping? That is, when the page updates, how can my PHP script know? Does it have to be a cron job?

(Probably a duplicate question, I know, but I searched for a direct answer with no luck. The closest I got was "Scrape and generate RSS feed", which has a scraping script but no info on how to get it to respond to changes on the page automatically.)


Depending on the system it may or may not be easy to tell when the page was updated last.

To check for changes, you can look at the Last-Modified header in the page's HTTP response. Not all systems update that header properly, so it may not be useful. It's also possible that an unmodified page will return a status of 304 (Not Modified), particularly if you provide an If-Modified-Since header in your request.
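As a rough illustration of the conditional request described above, here is a minimal Python sketch using only the standard library (the same HTTP exchange can be made from PHP). The function names and the idea of storing the previous Last-Modified value between runs are assumptions for the example, not something the target site is known to support.

```python
import urllib.error
import urllib.request


def conditional_headers(last_modified=None):
    """Build request headers asking the server to skip an unchanged body."""
    headers = {}
    if last_modified:
        # Value saved from the Last-Modified header of the previous fetch.
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, last_modified=None, timeout=10):
    """Return (changed, body, last_modified); body is None on a 304 reply."""
    request = urllib.request.Request(url, headers=conditional_headers(last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep the stored value
            return False, None, last_modified
        raise
```

A server that ignores If-Modified-Since simply answers 200 with the full body every time, so the scraper still works; it just loses the shortcut.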

I would definitely run something like this as a cron job. While it might be possible to do it on demand just from the headers, whenever the page has changed your user will be waiting a long time (in relative terms) for your server to go out, get the page, do the processing, and send the response. I would be surprised if you didn't run into timeouts from time to time with a non-cron-based approach.


You could have a cron job that checks whether the site has been updated (either by checking the Last-Modified header, if available, or by checking the content you are interested in).
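For the second option, checking the content itself, one possible approach is to hash only the scraped items and compare the hash with the one saved by the previous run. The Python sketch below assumes the scraper already yields a list of article titles; the function names and the JSON state file are hypothetical choices for the example.

```python
import hashlib
import json
import pathlib


def content_fingerprint(article_titles):
    """Hash only the scraped items, so unrelated page changes are ignored."""
    joined = "\n".join(article_titles)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def has_changed(article_titles, state_file):
    """Compare with the fingerprint from the previous run, then save the new one."""
    path = pathlib.Path(state_file)
    current = content_fingerprint(article_titles)
    previous = None
    if path.exists():
        previous = json.loads(path.read_text())["fingerprint"]
    path.write_text(json.dumps({"fingerprint": current}))
    return current != previous
```

A crontab entry such as `*/15 * * * * python3 /path/to/check_feed.py` (path and interval are placeholders) would run the check every fifteen minutes. The first run always reports a change because no fingerprint has been stored yet.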

If the cron job detects a change in content when it checks the site, it could append a message to a queue (something like Zend_Queue, http://framework.zend.com/manual/en/zend.queue.example.html, for example). You could then have a worker that works through the messages either until a time or data limit has been reached, or until the queue is empty.
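The queue-and-worker pattern can be sketched as follows. This is not Zend_Queue: it uses Python's in-process `queue.Queue` as a stand-in, and the function names and message layout are invented for the example. Because the cron check and the worker would normally be separate processes, a real setup would need a persistent queue (a database table or a message broker) instead.

```python
import queue
import time


def enqueue_change(jobs, url):
    """Called by the cron check when it detects new content."""
    jobs.put({"url": url, "detected_at": time.time()})


def run_worker(jobs, handle, max_seconds=30.0):
    """Process messages until the queue is empty or the time budget runs out."""
    deadline = time.monotonic() + max_seconds
    processed = 0
    while time.monotonic() < deadline:
        try:
            message = jobs.get_nowait()
        except queue.Empty:
            break  # nothing left to do
        handle(message)  # e.g. re-scrape the page and rebuild the feed
        processed += 1
    return processed
```

The time limit is only checked between messages, so a single slow `handle` call can still overrun the budget.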


You could also send a HEAD request and, if the response has no Last-Modified line, check for the presence and value of the ETag and Content-Length lines. If either of these differs from the prior value (which you've stored), then the content has likely changed. You could add to those any other response header lines that would indicate change.
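A minimal Python sketch of that header comparison, using only the standard library. The names are made up for the example, and treating "no comparable headers" as a change (so the caller falls back to a full fetch) is an assumption, not something the answer specifies.

```python
import urllib.request

WATCHED_HEADERS = ("Last-Modified", "ETag", "Content-Length")


def head_validators(url, timeout=10):
    """Send a HEAD request and collect the headers that hint at a change."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name: response.headers.get(name) for name in WATCHED_HEADERS}


def likely_changed(previous, current):
    """True when any header present in both snapshots has a different value."""
    if previous is None:
        return True  # first run: nothing stored yet
    comparable = [
        name for name in WATCHED_HEADERS
        if previous.get(name) is not None and current.get(name) is not None
    ]
    if not comparable:
        return True  # assumption: with no usable headers, fetch the page anyway
    return any(previous[name] != current[name] for name in comparable)
```

The script would store the dictionary returned by `head_validators` after each run and pass it back as `previous` on the next one.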
