How To Aggregate API Data?

I have a system that connects to 2 popular APIs. I need to aggregate the data from each into a unified result that can then be paginated. The scope of the project means that the system could end up supporting tens of APIs.

Each API imposes a max limit of 50 results per request.

What is the best way of aggregating this data so that it is reliable, i.e. ordered, with no duplicates, etc.?

I am using the CakePHP framework in a LAMP environment; however, I think this question applies to all programming languages.

My approach so far is to query the search API of each provider and then populate a MySQL table. From this, the results are ordered, paginated, etc. However, my concern is performance: API communication, parsing, inserting, and then reading all happen in one execution.
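
For illustration, here is a rough sketch of what I mean. The provider endpoints, response fields, and table layout are placeholders (not any real provider's API), and it assumes a `results` table with a unique key on `(provider, external_id)` so `INSERT IGNORE` drops duplicates:

```php
<?php
// Sketch of the "fetch, normalize, store, paginate" approach described above.
// Provider endpoints, response fields, and the table layout are placeholders.

$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

// Hypothetical search endpoints, each capped at 50 results per request.
$providers = [
    'provider_a' => 'https://api.provider-a.example/search?q=%s&limit=50',
    'provider_b' => 'https://api.provider-b.example/v2/items?query=%s&per_page=50',
];

// Fetch a URL and decode it, assuming each endpoint returns a JSON array of items.
function fetchJson(string $url): array
{
    $raw = file_get_contents($url); // swap in cURL with timeouts for production use
    return $raw === false ? [] : (json_decode($raw, true) ?: []);
}

// Unique key on (provider, external_id) lets INSERT IGNORE skip duplicates.
$insert = $pdo->prepare(
    'INSERT IGNORE INTO results (provider, external_id, title, url, created)
     VALUES (:provider, :external_id, :title, :url, NOW())'
);

foreach ($providers as $name => $template) {
    foreach (fetchJson(sprintf($template, urlencode('search term'))) as $item) {
        // Map each provider's fields onto one unified row shape.
        $insert->execute([
            'provider'    => $name,
            'external_id' => (string) ($item['id'] ?? md5($item['url'] ?? serialize($item))),
            'title'       => $item['title'] ?? '',
            'url'         => $item['url'] ?? '',
        ]);
    }
}

// Ordering and pagination then run against MySQL, not against the APIs.
$perPage = 20;
$page    = 1;
$rows = $pdo->query(sprintf(
    'SELECT * FROM results ORDER BY created DESC, id DESC LIMIT %d OFFSET %d',
    $perPage,
    $perPage * ($page - 1)
))->fetchAll(PDO::FETCH_ASSOC);
```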

Am I missing something? Does anyone have any other ideas? I'm sure this is a common problem with many alternative solutions.

Any help would be greatly appreciated.


Yes, this is a common problem.

Search SO for questions like https://stackoverflow.com/search?q=%5Bphp%5D+background+processing

Everyone who tries this realizes that calling other sites for data is slow. The first one or two seem quick, but then some sites break (and your app breaks) and some sites are slow (and your app is slow).

You have to disconnect the front-end from the back-end.

Choice 1 - pre-query the data with a background process that simply fetches the results and loads the database.

Choice 2 - start a long-running background process and poll from a JavaScript function to see if it's done yet (a sketch of the status check follows these choices).

Choice 3 - the user's initial request spawns the background process -- you then email them a link so they can return when the job is done.
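
For choice 2, a minimal status-check endpoint might look like the sketch below. It assumes a hypothetical `jobs` table that the background worker updates as it finishes; the front-end JavaScript just polls this URL until `done` is true:

```php
<?php
// status.php - a hypothetical endpoint the front-end polls (e.g. every few
// seconds from JavaScript) to see whether the background job has finished.
// The jobs table and its columns are assumptions for this sketch.

$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

$jobId = (int) ($_GET['job_id'] ?? 0);

$stmt = $pdo->prepare('SELECT status, result_count FROM jobs WHERE id = :id');
$stmt->execute(['id' => $jobId]);
$job = $stmt->fetch(PDO::FETCH_ASSOC);

header('Content-Type: application/json');
echo json_encode([
    'done'  => $job !== false && $job['status'] === 'finished',
    'count' => $job ? (int) $job['result_count'] : 0,
]);
```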


I have a site doing just that with over 100 RSS/Atom feeds. This is what I do:

  1. I have a list of feeds and a cron job that iterates over them, about 5 feeds a minute, which means I cycle through all the feeds every 20 minutes or so.
  2. I fetch the feed and try to insert each entry into the database, using the URL as a unique field; if the URL already exists, I do not insert. The entry date is my current system clock and is set by my application, because date fields in RSS cannot be trusted and, in some cases, can't even be parsed. (A sketch of this step follows the list.)
  3. For some feeds, and only experience can tell you which, I also check for duplicate titles, since some websites change their URLs for their own reasons.
  4. The items are now all placed in the same database table, ready to be queried.
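
Here is a minimal sketch of steps 1-2, with the optional title check from step 3. The `feeds`/`entries` tables, their columns, and the RSS-only parsing are illustrative assumptions, not my exact code:

```php
<?php
// Cron-run ingestion sketch: take a handful of feeds per run, insert each entry
// keyed on its URL, and stamp it with the system clock instead of the feed's
// own (unreliable) date. Table and column names here are assumptions.

$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

// Step 1: pick the ~5 feeds that have waited longest since their last check.
$feeds = $pdo->query(
    'SELECT id, url FROM feeds ORDER BY last_checked ASC LIMIT 5'
)->fetchAll(PDO::FETCH_ASSOC);

// Step 2: a unique index on entries.url makes INSERT IGNORE skip known URLs.
$insert = $pdo->prepare(
    'INSERT IGNORE INTO entries (feed_id, url, title, created)
     VALUES (:feed_id, :url, :title, NOW())'
);
// Step 3 (optional, per feed): skip entries whose title was already seen.
$titleDupe = $pdo->prepare('SELECT COUNT(*) FROM entries WHERE title = :title');

foreach ($feeds as $feed) {
    $xml = @simplexml_load_file($feed['url']);
    if ($xml === false || !isset($xml->channel->item)) {
        continue; // broken feed, or not RSS 2.0 (Atom handling omitted here)
    }

    foreach ($xml->channel->item as $item) {
        $title = (string) $item->title;

        $titleDupe->execute(['title' => $title]);
        if ((int) $titleDupe->fetchColumn() > 0) {
            continue;
        }

        $insert->execute([
            'feed_id' => $feed['id'],
            'url'     => (string) $item->link,
            'title'   => $title,
        ]);
    }

    $pdo->prepare('UPDATE feeds SET last_checked = NOW() WHERE id = :id')
        ->execute(['id' => $feed['id']]);
}
```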

One last thought: if your application is likely to have new feeds added while in production, you really should also check whether a feed is "new" (i.e. has no previous entries in the DB). If it is, mark all of its currently available links as inactive; otherwise, when you add a feed, there will be a block of articles from that feed, all with the same date and time. (Simply put: the method I described is for future additions to the feed only; past articles will not be available.)
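
A minimal sketch of that "new feed" check, assuming an `active` flag on the same illustrative `entries` table used above:

```php
<?php
// Sketch of the "new feed" check: note whether the feed had no prior entries,
// run the normal ingestion, then flag that first batch as inactive so the
// backlog doesn't surface as a wall of same-dated articles.
// The "active" column and table names are assumptions.

$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');
$feedId = 42; // illustrative feed id

$count = $pdo->prepare('SELECT COUNT(*) FROM entries WHERE feed_id = :id');
$count->execute(['id' => $feedId]);
$wasNew = ((int) $count->fetchColumn() === 0);

// ... run the normal ingestion for this feed here (see the sketch above) ...

if ($wasNew) {
    $pdo->prepare('UPDATE entries SET active = 0 WHERE feed_id = :id')
        ->execute(['id' => $feedId]);
}
```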

Hope this helps.
