开发者

Joining many (> 20) sources of data with varying priority

开发者 https://www.devze.com 2023-04-01 20:30 出处:网络
I\'ve been working on this problem for a long while now and can\'t seem to find an acceptable solution. The problem is that I have many different sources of data (at least 20) that I\'ve boiled down t

I've been working on this problem for a long while now and can't seem to find an acceptable solution. The problem is that I have many different sources of data (at least 20) that I've boiled down to have the same six or seven columns of data that I want. My goal is to create a "universe" that includes all the items listed on each source of data. Most of the data is listed on multiple sources, and some only on one or two sources. I need to keep track of what sources have what items as well. Ultimately my goal is a table with every item in the "universe" in one column, followed by a series of columns marking each source that it was identified in (i.e. the column for Source 12 gets an "x" if Source 12 contains the current item). In addition, there are columns that have some information about the items (their official name). All of the items have a unique number but they have varying names, so in the final table I produce I need to list the name from the source with highest priority. So, if the item is on Source 1, use the name listed in Source 1; if not, but it is on Source 2, use the name listed for the item in Source 2; and so on.

I'm working on this in Microsoft Access and until recently have been able to get away with joining all of my many source tables on the unique number, and doing a big nested Iif statement to get the name in order of the sources' priority, e.g.,

Iif(Not Null(Source1.ItemName), Source1.ItemName, 
  Iif(NotNull(Source2.ItemName), Source2.ItemName, 
    Iif(...... ))))

However I've now reached a problem as I've been adding more sources. Access limits how nested the Iif statements can be. So, I tried several different solutions. I tried using VBA to run a select query on each source table in order of priority until it found a record that matched the current unique number and returned the item's name listed in the source of highest priority in which it was found. The problem is that this function needs to run several times for each unique number, as there are a few properties in addition to the item name. This 开发者_高级运维results in a very slow query.

Another alternative I tried that worked well (but failed later for another reason I'll explain) was to do a self join. I had one query (query1) do a union of the unique number, item name, and other properties from all the sources as well as add a column for the source rank. Then another query (query2) had a subquery which grouped by the unique number and chose the minimum source rank, which was inner joined onto query1 to get the rest of the record listed in query 1 for the highest ranked source. The problem though is that in my final query I need to take that information, and add in the flags that note what sources the item is found in, and this requires about 20 left joins. In addition, some sources have data that others do not, so I can't just add those columns to the query1 union. So, while I solved the multiple nested Iif problem, I still have a problem with requiring too many joins.

What can I do to get a final table with all of the item properties as well as source columns?


In order to condense columns together you can use the SQL standard function COALESCE.

Access doesn't support COALESCE, but if does have a NZ function. Sadly the latter only takes 1 parameter, not an unlimited number like COALESCE.

Still if you want everything in one column, you can do a large UNION like so.

 SELECT a AS OneColumnToRuleThemAll FROM (
  SELECT col1 as a FROM a
UNION
  SELECT col2 as a FROM b
UNION 
  SELECT ....) S

This will put everything in one column.


How about 2 steps. In step one, use union to combine all the records into one table having all the common columns, including the unique ID, the priority, then name, and the others. This doesn't need to be an insert query, so you don't need an actual table.

This won't give you each source as a yes/no column, but you would have one row per source per unique identifier, which might be just as good.

If you want the top priority name to show for each row, you could do the select with a subquery for the Name column.


I actually solved it last night with a bit of re-jiggering of the queries. In case someone else out there has a similar problem, here's what I did:

1) My one union query (query1 is what I called it above) still had the unique numbers, item names, common properties, source name, and source flag (we use it as an indicator of whether it was new to the universe or not).

2) A new crosstab query (I'll call it query1_crosstab) was made, with the row heading as the unique number, the column heading as the source name, and the value (a crosstab property) as the source flag. An alternative is to set the value to be equal to count(item_name), which would put 1's in the places where a source had the relevant Item, or to do Iif(count(item_name)=1,"x",null) to put x's instead (what we prefer to do)

3) A new grouping query was made which did the same as query2 above, namely grouping by unique number and choosing minimum source number and rejoining onto the union query1 to get the item names from the highest ranked sources.

4) One last new query joined 2) and 3) based on the unique number, so that all of the common properties as well as information about each item's presence in each of the sources was listed in the query.

5) Finally I could join this query to any of the source tables (only 2) that had extra properties not common to all of the tables. Alternatively in the first step (the union), columns could've been added to include all of the properties, and ones not present in a source would just have null values.

I hope this helps someone!


I would do this:

SELECT UniqueNumber, Name, 1 as TablePriority FROM Table1
  UNION ALL
SELECT UniqueNumber, Name, 2 as TablePriority FROM Table2
  UNION ALL
SELECT UniqueNumber, Name, 3 as TablePriority FROM Table3
  UNION ALL
SELECT UniqueNumber, Name, 4 as TablePriority FROM Table4
  UNION ALL
SELECT UniqueNumber, Name, 5 as TablePriority FROM Table5
  etc
ORDER BY TablePriority DESC
0

精彩评论

暂无评论...
验证码 换一张
取 消