Extracting new items from an RSS feed

开发者 https://www.devze.com 2023-01-31 15:21 Source: Internet
I'm writing an application which takes data input from a series of arbitrary RSS feeds. The feeds are polled asynchronously in the background and a method is called every time a new item is added to the feed.

My problem is identifying the new items in the feed. What's the best way to do it? I have come up with a few ideas, but they're all flawed.

Suggestion: Every time you poll, keep all items newer than the pubDate of the last item in the last poll.
Problem: pubDate is not a required field.

Suggestion: Keep a hash of the content for every item you return, and do not return content with the same hash.
Problem: Rapidly grows out of control in terms of memory usage.


How about both?

Use pubDate for the feeds that provide it, and keep a hash for the others. If most of the feeds provide a pubDate, and the number of feeds does not run into the millions, you should be OK in terms of both performance and memory.
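A minimal sketch of this combined approach, in Python. It assumes each item has already been parsed into a dict; the keys "pub_date", "title", "link" and "description", and the names FeedState and extract_new, are hypothetical and not taken from any particular RSS library. Note that an item whose pubDate equals the newest one already seen is treated as old, which may or may not suit your feeds.

```python
import hashlib
from datetime import datetime
from typing import Optional


class FeedState:
    """Per-feed memory of what has already been reported."""

    def __init__(self):
        self.last_pub_date: Optional[datetime] = None  # newest pubDate seen so far
        self.seen_hashes = set()  # only used for items that have no pubDate


def item_hash(item):
    # Hash the fields that identify an item (field names are assumptions).
    text = "\x1f".join(item.get(key) or "" for key in ("title", "link", "description"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def extract_new(state, items):
    """Return the items from this poll that were not reported before."""
    fresh = []
    newest = state.last_pub_date
    for item in items:
        pub = item.get("pub_date")
        if pub is not None:
            # Dated item: new only if it is later than the previous poll's newest date.
            if state.last_pub_date is None or pub > state.last_pub_date:
                fresh.append(item)
                if newest is None or pub > newest:
                    newest = pub
        else:
            # Undated item: fall back to remembering a content hash.
            digest = item_hash(item)
            if digest not in state.seen_hashes:
                state.seen_hashes.add(digest)
                fresh.append(item)
    state.last_pub_date = newest
    return fresh
```

Keeping one FeedState per feed means the hash set only grows for feeds that omit pubDate.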


You can use pubDate for those RSS feeds where it is provided. When pubDate is not provided and duplicate items are exactly equal, i.e. you cannot find any single field to distinguish them, calculate the MD5 checksum of each item and store that for comparison; see http://sharpertutorials.com/calculate-md5-checksum-file/. This way you avoid storing and comparing the entire content. In practice you can purge the checksum data regularly, based on the frequency of new content, to avoid the memory problem. If possible, maintain separate hash sets for the different sources. If you post the actual numbers, we may be able to suggest a more realistic solution.
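One way the checksum store with purging might look, sketched in Python. The names SeenChecksums, is_new_item and the max_entries limit are hypothetical. The purge policy here is "drop the oldest checksums once a fixed size is exceeded", which is an assumption standing in for a purge based on how often the feed publishes.

```python
import hashlib
from collections import OrderedDict


class SeenChecksums:
    """Bounded set of MD5 checksums for one feed; the oldest entries are purged first."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self._seen = OrderedDict()

    def is_new(self, content):
        # Store only the 16-byte digest, never the content itself.
        digest = hashlib.md5(content.encode("utf-8")).digest()
        if digest in self._seen:
            # Item is still present in the feed, so keep its checksum fresh.
            self._seen.move_to_end(digest)
            return False
        self._seen[digest] = None
        # Purge the least recently seen checksums once the limit is exceeded.
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)
        return True


stores = {}  # one checksum store per feed URL


def is_new_item(feed_url, content):
    store = stores.setdefault(feed_url, SeenChecksums())
    return store.is_new(content)
```

The limit should be comfortably larger than the number of items a feed returns per poll. Otherwise the checksum of an item that is still in the feed can be purged, and the item would be reported as new again.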
