I keep running into the same programming task without a satisfying solution. I have similar collections of objects from different systems that I need to combine or merge into one, and possibly report on intersections between the two.
A good example of this might be a collection of users from Active Directory, and the same collection of users from SAP (with some richer attributes that do not exist in AD). I just want one collection of users containing properties from both collections.
Or maybe I have a collection of users in SharePoint, and a collection of newsletter subscribers in Constant Contact, and I want to get a collection of all currently active users that are also newsletter subscribers in Constant Contact.
Given that there will be a common identifier in both collections (email address, an ID of some sort) to join them, I find I have very few options to efficiently get the merged data:
- Get all objects from System A. Get all objects from System B. In a double loop, find matches and add them to a new collection.
- Get all objects from System A. For each object in System A, query System B to find a match and add it to the new collection.
Option 1 stinks because I have to fetch all data from System B, even though I may throw some of it away if there are no matches. Option 2 stinks because I'll have to do many individual queries to System B to get my matches.
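To make option 1 concrete (and at least avoid comparing every pair in the double loop), one collection can be indexed by the shared key first; a minimal sketch in Python, with hypothetical record shapes:

```python
# Minimal sketch of option 1: index System B by the shared key once,
# then stream System A past it. Record shapes here are hypothetical.
ad_users  = [{"email": "ann@x.com", "name": "Ann"},
             {"email": "bob@x.com", "name": "Bob"}]
sap_users = [{"email": "ann@x.com", "cost_center": "CC-17"}]

by_email = {u["email"]: u for u in sap_users}     # one pass over B

merged = [{**a, **by_email[a["email"]]}           # properties from both
          for a in ad_users
          if a["email"] in by_email]              # intersection only
# -> [{'email': 'ann@x.com', 'name': 'Ann', 'cost_center': 'CC-17'}]
```

This still fetches everything from both systems up front, which is exactly the objection above.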
I know I can set up some kind of cube that processes this stuff regularly, but it seems like I should just be able to take two collections, denote the common piece of data between them, and ask a framework to join them for me intelligently. Is there some other method I may be missing here?
Thanks, Adam
Mathematically, you have to do some variant of option 1 or option 2; there is no getting around it.
The typical optimization is to do it as close to either A or B as possible, e.g. copy all the data from A to B and then ask B's database which elements match (or don't), or something to that effect. The choice of which system to copy from can be based on technical considerations (closed or inaccessible systems like mainframes are often copied from), performance considerations (B may be a much faster or more scalable system than A), or data-size considerations (if A's data is an order of magnitude smaller than B's, it makes more sense to copy A into B than vice versa).
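As an illustration of the "copy A next to B" pattern, here is a rough sketch using Python, with sqlite3 standing in for System B's database; the table names and columns are hypothetical:

```python
import sqlite3

# sqlite3 stands in for System B's database; table/column names are
# hypothetical. System A's rows arrive as plain dicts from its API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b_users (email TEXT PRIMARY KEY, dept TEXT)")
conn.executemany("INSERT INTO b_users VALUES (?, ?)",
                 [("ann@x.com", "Sales"), ("carl@x.com", "IT")])

a_users = [{"email": "ann@x.com", "name": "Ann"},
           {"email": "bob@x.com", "name": "Bob"}]

# Copy A's rows next to B's data, then let the database find the matches.
conn.execute("CREATE TEMP TABLE a_users (email TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO a_users VALUES (:email, :name)", a_users)

rows = conn.execute(
    "SELECT a.email, a.name, b.dept "
    "FROM a_users a JOIN b_users b ON b.email = a.email").fetchall()
# -> [('ann@x.com', 'Ann', 'Sales')]
```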
If the data sources can produce ordered streams of data, then you can perform the comparison in a streaming fashion, rather than holding all of the data from both systems in memory. For instance:
```
A's Data    B's Data
A           A
B           C
C           D
D           F
E
F
```
If you know the data is sorted, you can simply iterate both lists looking for matches, rather than doing look-ups against one data source.
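A minimal sketch of that streaming comparison (a classic merge join), assuming both inputs are already sorted on the shared key and keys are unique within each stream:

```python
def merge_join(a_stream, b_stream, key=lambda r: r):
    """Yield matching pairs from two streams already sorted on `key`.

    Advance whichever side has the smaller key; emit a pair when the
    keys are equal. Assumes keys are unique within each stream.
    """
    a_iter, b_iter = iter(a_stream), iter(b_stream)
    a = next(a_iter, None)
    b = next(b_iter, None)
    while a is not None and b is not None:
        if key(a) < key(b):
            a = next(a_iter, None)       # A is behind; advance A
        elif key(a) > key(b):
            b = next(b_iter, None)       # B is behind; advance B
        else:
            yield a, b                   # keys match
            a = next(a_iter, None)
            b = next(b_iter, None)

# With the lists from the example above:
print(list(merge_join("ABCDEF", "ACDF")))
# [('A', 'A'), ('C', 'C'), ('D', 'D'), ('F', 'F')]
```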
I would consider how long the data retrieval takes, and also how often you intend to interrogate the 'intersection' data. If the former takes several seconds and the latter is also (potentially) measured in seconds, then I would strongly consider caching the retrieved data in a simple database.
You can then do the join once and save the result into a third table, or even do the JOIN in the SELECT statement on each request. Doing it on each request should be trivial with a couple of indexes (a sketch follows below).
Shouldn't be any need for a cube.
Finally, depending on the available attributes, you may be able to use a LastModifiedDate/CreationDate or similar to be intelligent about which rows/records you refresh.
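Putting those pieces together, here is a rough sketch of the caching approach in Python with SQLite; the table layout and the `fetch_since` hook into each source system are hypothetical:

```python
import sqlite3

cache = sqlite3.connect("merge_cache.db")
cache.executescript("""
    CREATE TABLE IF NOT EXISTS sys_a (email TEXT PRIMARY KEY,
                                      name  TEXT, last_modified TEXT);
    CREATE TABLE IF NOT EXISTS sys_b (email TEXT PRIMARY KEY,
                                      plan  TEXT, last_modified TEXT);
    -- email is already indexed via PRIMARY KEY; index the watermark too
    CREATE INDEX IF NOT EXISTS idx_a_mod ON sys_a (last_modified);
    CREATE INDEX IF NOT EXISTS idx_b_mod ON sys_b (last_modified);
""")

def refresh(table, fetch_since):
    """Pull only rows changed since the newest row we already hold.

    fetch_since(watermark) is a hypothetical call into the source
    system returning (email, attr, last_modified) tuples.
    """
    (watermark,) = cache.execute(
        f"SELECT COALESCE(MAX(last_modified), '') FROM {table}").fetchone()
    cache.executemany(f"INSERT OR REPLACE INTO {table} VALUES (?, ?, ?)",
                      fetch_since(watermark))
    cache.commit()

# The 'intersection' is then just an indexed JOIN on each request:
rows = cache.execute("""
    SELECT a.email, a.name, b.plan
    FROM sys_a a JOIN sys_b b ON b.email = a.email
""").fetchall()
```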