I've got a system that fetches a few hundred RSS feeds. Currently they're on a 10-minute refresh cycle, but I'd like to make that faster. What is a good strategy for fetching the RSS sources at near-realtime/push intervals?
Some solutions I've come across:
- do a fetch at 1 minute; if no changes, fetch again at 2, then 4, then 8, etc.
- find the average time-between-updates interval/variance of the RSS feed, and put them in a bucket (this one updates every 3 mins, so do a check every 1 minute; this one updates every week, so do a check every day, etc.)
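To make the second idea concrete, here is a rough sketch of deriving a polling interval from a feed's update history. The function name, the `checks_per_update` ratio, the clamping bounds and the 600-second fallback (the current 10-minute cycle) are all assumptions for illustration, not something a particular library provides.

```python
from statistics import mean
from typing import Sequence


def poll_interval_from_history(
    update_times: Sequence[float],
    checks_per_update: float = 3.0,   # assumed ratio: ~3 checks per expected update
    minimum: float = 60.0,            # assumed floor: never poll more than once a minute
    maximum: float = 86400.0,         # assumed ceiling: poll at least once a day
    default: float = 600.0,           # fallback when there is too little history
) -> float:
    """Return a polling interval in seconds from past update timestamps (epoch seconds)."""
    if len(update_times) < 2:
        return default
    ordered = sorted(update_times)
    # Gaps between consecutive updates
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    interval = mean(gaps) / checks_per_update
    # Keep the result inside the allowed range
    return max(minimum, min(maximum, interval))
```

A feed that updates every 3 minutes ends up polled every minute, while a weekly feed is capped at one check per day by `maximum`.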
There is no way to make "pulling" quick and efficient. You will either poll more often (and be less efficient) or be more efficient by polling less often.
The only way to achieve a near-realtime experience is to poll at the right time :)
Luckily, some publishers (more and more!) use PubSubHubbub to update their feeds and let subscribers know. Other services like Superfeedr (I work for Superfeedr) use different techniques to learn when the best time to fetch a feed is (based on historic updates, updates in related feeds, etc.).
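Feeds that support PubSubHubbub advertise their hub with an Atom `link` element whose `rel` is `hub`, so a first step can be checking which of your feeds declare one. The sketch below only does that discovery step; the function name and the sample feed are made up for illustration, and actually subscribing to the hub is not shown.

```python
import xml.etree.ElementTree as ET
from typing import List

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_hub_urls(feed_xml: str) -> List[str]:
    """Return the href of every <atom:link rel="hub"> found in the feed."""
    root = ET.fromstring(feed_xml)
    hubs = []
    for link in root.iter(ATOM_NS + "link"):
        href = link.get("href")
        if link.get("rel") == "hub" and href:
            hubs.append(href)
    return hubs


# Hypothetical feed used only to demonstrate the function
SAMPLE_FEED = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example</title>
    <atom:link rel="hub" href="https://hub.example.com/"/>
    <atom:link rel="self" href="https://example.com/feed.xml"/>
  </channel>
</rss>"""

print(find_hub_urls(SAMPLE_FEED))
```

Feeds that return an empty list here still need polling; the rest can be moved to push.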
I've used something like your first option. Start with a default waiting period before retrieving a feed. If new items are found, reduce the waiting period by 10%; otherwise, increase it by 10%. Apply this adaptation on every fetch and the system adjusts itself.
You could use different percentages, e.g. shrink the waiting period faster than you grow it, to respond more quickly to changes in update frequency.
Enforce a minimum and maximum so the waiting period stays within a predefined range.
It's not perfect but was good enough for me.
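For reference, the self-adjusting scheme described above fits in a few lines. This is only a sketch: the function name and the default bounds (one minute to one day) are assumptions, and the 10% steps are just the values mentioned in the answer.

```python
def next_interval(
    current: float,
    found_new: bool,
    shrink: float = 0.10,      # reduce the wait by 10% when new items were found
    grow: float = 0.10,        # increase the wait by 10% when nothing changed
    minimum: float = 60.0,     # assumed lower bound, in seconds
    maximum: float = 86400.0,  # assumed upper bound, in seconds
) -> float:
    """Return the waiting period (seconds) to use before the next fetch."""
    factor = (1.0 - shrink) if found_new else (1.0 + grow)
    # Clamp so the waiting period stays within the predefined range
    return max(minimum, min(maximum, current * factor))
```

Passing a larger `shrink` than `grow` (say 0.5 and 0.1) makes the interval drop quickly when a quiet feed becomes active again.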
Though it's only part of the solution, you can also (if the feed is served over HTTP) check the Cache-Control and Expires headers of the RSS feed for hints on how often you should fetch the feed.
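A minimal sketch of reading those hints, assuming you already have the response headers as a plain mapping with the canonical key spellings (the function name is hypothetical). It prefers `max-age` from Cache-Control and falls back to Expires, which matches how HTTP caches prioritise the two.

```python
import re
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def suggested_wait_seconds(
    headers: Mapping[str, str], now: Optional[datetime] = None
) -> Optional[float]:
    """Return a wait hint in seconds, or None if the headers give no usable hint."""
    now = now or datetime.now(timezone.utc)

    # Cache-Control: max-age=N takes precedence over Expires
    match = re.search(r"max-age\s*=\s*(\d+)", headers.get("Cache-Control", ""), re.IGNORECASE)
    if match:
        return float(match.group(1))

    expires = headers.get("Expires")
    if expires:
        try:
            expires_at = parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            return None  # malformed dates such as "0" carry no usable hint
        if expires_at.tzinfo is None:
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        # An Expires date in the past means "already stale"
        return max(0.0, (expires_at - now).total_seconds())

    return None
```

Many feeds send no caching headers or send values unrelated to how often they really publish, so treat the result as one input among others rather than the schedule itself.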