Need a better option - outer joining 32 times to same table

I have a nasty SQL query problem and I'd love help with an elegant solution. I'm trying to avoid 32 left outer joins to the same table.

The database is Teradata.

I have a table with 14 million records and 33 columns: the primary key (let's call it Trans_Id) and 32 encoded fields (let's call them encoded_1 ... encoded_32). Something like this:

CREATE SET TABLE BigTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_1 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_2 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ENCODED_3 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      ...
      ENCODED_32 VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC )
PRIMARY INDEX ( TRANS_ID );

I also have a single table with the coded / decoded values. Let's say there are 100 records in this table.

CREATE SET TABLE LookupTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      UNIQ_PK { just random numbers }
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC,
      DECODED_DESC VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( UNIQ_PK );

I want to avoid a nasty join like this (I used ellipses instead of showing all 32 outer joins):

SELECT 
TRANS_ID
, a.ENCODED_1
, b1.DECODED_DESC DECODED_DESC_1
, a.ENCODED_2
, b2.DECODED_DESC DECODED_DESC_2
...
, a.ENCODED_31
, b31.DECODED_DESC DECODED_DESC_31
, a.ENCODED_32
, b32.DECODED_DESC DECODED_DESC_32
FROM BigTable a
LEFT OUTER JOIN LookupTable b1 ON a.ENCODED_1 = b1.ENCODED
LEFT OUTER JOIN LookupTable b2 ON a.ENCODED_2 = b2.ENCODED
...
LEFT OUTER JOIN LookupTable b31 ON a.ENCODED_31 = b31.ENCODED
LEFT OUTER JOIN LookupTable b32 ON a.ENCODED_32 = b32.ENCODED

Any help would be appreciated. I have a feeling outer joining 14M records 32 times is not the efficient way to do this!


You could create a function that takes the VARCHAR(10) encoded_var as a parameter and returns the VARCHAR(50) decoded_desc; then your select would be something like this:

SELECT TRANS_ID,
     ENCODED_1, somefunc(ENCODED_1) AS DECODED_DESC_1,
     ENCODED_2, somefunc(ENCODED_2) AS DECODED_DESC_2,
     etc.

Depending on the number of rows you are planning on returning at a time, this would be doable.
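
If writing a UDF is more trouble than it's worth, the same per-column decode can also be sketched with correlated scalar subqueries. This is only an illustration (it assumes your Teradata release allows scalar subqueries in the select list, and it uses the ENCODED_VAR / DECODED_DESC names from the lookup DDL); the MIN() is there purely to guarantee a single value per lookup:

SELECT a.TRANS_ID,
     a.ENCODED_1,
     (SELECT MIN(l.DECODED_DESC)
      FROM LookupTable l
      WHERE l.ENCODED_VAR = a.ENCODED_1) AS DECODED_DESC_1,
     a.ENCODED_2,
     (SELECT MIN(l.DECODED_DESC)
      FROM LookupTable l
      WHERE l.ENCODED_VAR = a.ENCODED_2) AS DECODED_DESC_2,
     ...
FROM BigTable a;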


If encoded_1, encoded_2, etc. are all being used as lookup keys to the same table, it sounds like they are all the "same idea". My first thought is that a better design in this case would be:

big_table (trans_id, var_id, encoded_var)
lookup_table (encoded_var, decoded_desc)

Then the query just becomes:

select trans_id, var_id, encoded_var, decoded_desc
from big_table
join lookup_table on lookup_table.encoded_var=big_table.encoded_var
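
If restructuring big_table isn't an option, the same shape can be produced on the fly with a derived table that unpivots the 32 columns. This is only a sketch (all 32 UNION ALL branches would have to be written out, and BigTable is scanned once per branch, so treat it as an illustration of the shape rather than a performance fix):

select u.trans_id, u.var_id, u.encoded_var, l.decoded_desc
from (
    select trans_id, 1 as var_id, encoded_1 as encoded_var from BigTable
    union all
    select trans_id, 2 as var_id, encoded_2 as encoded_var from BigTable
    ...
    union all
    select trans_id, 32 as var_id, encoded_32 as encoded_var from BigTable
) u
left join LookupTable l on l.encoded_var = u.encoded_var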

I don't know if these are the real field names or if you're just trying to leave out irrelevant details, but you may be leaving out relevant details here. What's the difference between encoded_1 and encoded_2, etc.? If they are interchangeable, there is no reason to have separate fields for them; indeed, it causes a lot of problems. Even if there is a semantic difference, if they all use the same lookup table, they must all be coming from the same domain.

For example, a few years ago I worked on a system to manage technical manuals that our organization produced and used. Each manual had 3 managers. (An administrative manager who handled budgets and schedules, a stock manager who kept track of who needed copies and made sure they got them, and a content manager responsible for the actual text.) But they were all drawn from the same list of people, as often the same person would have more than one of these roles or could have different roles for different manuals. So we made a table of "people" with an id, name, email address, etc, and then in the basic manual record I created 3 columns, one for each manager type.

This was a huge mistake. What I should have done was create a separate table with manual id, manager type id, and person id, and then have 3 RECORDS for the 3 manager types rather than 3 fields within one record.

Why? With three columns, I ran into the same problem you are describing, though on a smaller scale: I had to join from the manual table to the person table three times. A query like "what books is Bob Smith responsible for?" required a surprisingly complex query, something like

select ... whatever ...
from manual
join person p1 on p1.person_id=manual.admin_id
join person p2 on p2.person_id=manual.stockmanager_id
join person p3 on p3.person_id=manual.contentmanager_id
where p1.name='Bob Smith'
 or p2.name='Bob Smith'
 or p3.name='Bob Smith'

With a single column it would have been simply

select ... whatever ...
from manual
join manual_manager on manual_manager.manual_id=manual.manual_id
join person on person.person_id=manual_manager.person_id
where person.name='Bob Smith'"

With all the repetition, it was not surprising that there were a couple of times that a programmer accidentally only checked 2 of the fields instead of all 3. With 1 field this error wouldn't be possible. With 3 fields, if we added a 4th type of manager, we would have had to add another column, and then change every query that looked at these fields. With 1 field, we probably wouldn't. Etc.

With 3 fields we needed 3 indexes, and there are other performance implications.

I suspect the same sort of thinking applies to you.

If your 32 fields are all completely interchangeable, then the table would only need a sequence number to make a unique pk. If there is some difference between them, then you could create a code to distinguish them.
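
For example, the reshaped table might look something like this (a sketch only, reusing the Teradata conventions from your original DDL; the table name and the BYTEINT VAR_ID column are just placeholders for whatever code you choose to distinguish the 32 positions):

CREATE SET TABLE BigTableNarrow ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID    VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
      VAR_ID      BYTEINT NOT NULL,
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC )
UNIQUE PRIMARY INDEX ( TRANS_ID, VAR_ID );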


I would change the PI on the LookUp table to the Encoded_Var for starters. You already have to redistribute the big table on each of the Encoded_Var columns in order to join to the LookUp table. Why bother having to redistribute the LookUp table each time as well?
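
Since Teradata generally won't let you change the primary index of a populated table, the practical route is to rebuild the lookup with the new PI, roughly along these lines (LookupTable_New is just a placeholder name; you would drop or rename afterwards):

CREATE TABLE LookupTable_New AS
     ( SELECT * FROM LookupTable )
WITH DATA
PRIMARY INDEX ( ENCODED_VAR );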

Is there a reason why your table design isn't something closer to

CREATE SET TABLE BigTable ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
     (
      TRANS_ID    VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
      ENCODED_VAR VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL )
UNIQUE PRIMARY INDEX ( TRANS_ID, ENCODED_VAR );

This would build a more appropriate 1:M relationship between a trans_id and an encoded_var, unless there are pertinent details that have been left out that would explain why this would not work. In fact, if necessary, you could build this table as a relationship table and have another table that looks like:

    CREATE SET TABLE BigTable2 ,NO FALLBACK , NO BEFORE JOURNAL, NO AFTER JOURNAL, CHECKSUM = DEFAULT
         (
          TRANS_ID    VARCHAR(20) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
          OtherData1  VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL,
          OtherData2  SMALLINT NOT NULL,
          ....,
          OtherDataN  VARCHAR(10) CHARACTER SET LATIN NOT CASESPECIFIC NOT NULL )
   UNIQUE PRIMARY INDEX ( TRANS_ID );

Hope that helps.


If you don't want to repeatedly write the same query, I'd suggest putting it in a view.
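
For example, something along these lines, wrapping whatever final form of the query you settle on (sketch only, again with the joins elided):

REPLACE VIEW BigTableDecoded AS
SELECT a.TRANS_ID
, a.ENCODED_1
, b1.DECODED_DESC AS DECODED_DESC_1
...
, a.ENCODED_32
, b32.DECODED_DESC AS DECODED_DESC_32
FROM BigTable a
LEFT OUTER JOIN LookupTable b1 ON a.ENCODED_1 = b1.ENCODED_VAR
...
LEFT OUTER JOIN LookupTable b32 ON a.ENCODED_32 = b32.ENCODED_VAR;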

For performance, I'd suggest the following:

  • As Rob Paller recommended, change primary index on LookupTable to ENCODED_VAR.
  • In LookupTable, add a new record with DECODED_DESC = null and ENCODED_VAR = some value you will never use. Update BigTable to replace all null ENCODED_* with this value. You can then change your query to use all inner joins and get the same result.
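
A rough sketch of that second point (the sentinel value 'NONE' and the key value -1 are only placeholders, and the UPDATE would be repeated for each of the 32 ENCODED_* columns):

INSERT INTO LookupTable (UNIQ_PK, ENCODED_VAR, DECODED_DESC)
VALUES (-1, 'NONE', NULL);

UPDATE BigTable SET ENCODED_1 = 'NONE' WHERE ENCODED_1 IS NULL;
UPDATE BigTable SET ENCODED_2 = 'NONE' WHERE ENCODED_2 IS NULL;
...
UPDATE BigTable SET ENCODED_32 = 'NONE' WHERE ENCODED_32 IS NULL;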


Couldn't you have it like this:

SELECT 
TRANS_ID
, a.ENCODED_1
, CASE a.ENCODED_1 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_1
, a.ENCODED_2
, CASE a.ENCODED_2 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_2
...
, a.ENCODED_31
, CASE a.ENCODED_31 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_31
, a.ENCODED_32
, CASE a.ENCODED_32 WHEN b.ENCODED THEN b.DECODED_DESC END DECODED_DESC_32
FROM BigTable a
 LEFT JOIN LookupTable b ON (
   a.ENCODED_1 = b.ENCODED OR
   a.ENCODED_2 = b.ENCODED OR
   ...
   a.ENCODED_31 = b.ENCODED OR
   a.ENCODED_32 = b.ENCODED
 )

?

I might also be tempted to rewrite the joining condition as follows:

...ON b.ENCODED IN (a.ENCODED_1, a.ENCODED_2, ... a.ENCODED_31, a.ENCODED_32)

But I'm not really sure whether it might be slower than the former version. Actually, my guess is that it would turn out slower, but I would still verify that to be sure.


I have encountered the same problem, also on Teradata. A colleague showed me an elegant solution that uses a single LEFT OUTER JOIN and a number of CASE expressions.

However, your example is a little confusing, because you are joining on a column that doesn't exist (the "ENCODED" column in "LookupTable", which I assume should be "ENCODED_VAR"?).

SELECT     TRANS_ID    
           , a.ENCODED_1    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_1
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_1    
           , a.ENCODED_2    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_2
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_2    
           ...    
           , a.ENCODED_31    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_31
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_31    
           , a.ENCODED_32    
           , MAX(CASE WHEN b.ENCODED_VAR = a.ENCODED_32
                 THEN b.DECODED_DESC
                 ELSE NULL
             END) DECODED_DESC_32    
FROM BigTable a
LEFT OUTER JOIN LookupTable b 
    ON a.ENCODED_1 = b.ENCODED_VAR
    OR a.ENCODED_2 = b.ENCODED_VAR
    ...
    OR a.ENCODED_31 = b.ENCODED_VAR
    OR a.ENCODED_32 = b.ENCODED_VAR
GROUP BY a.TRANS_ID, a.ENCODED_1, a.ENCODED_2, ... , a.ENCODED_31, a.ENCODED_32

This does rely on there being a 1:1 relationship between ENCODED_n in BigTable and ENCODED_VAR in LookupTable.

Also, as an aside, you shouldn't use a random number as the PRIMARY INDEX in a Teradata table. While this will give you great table distribution, it will be completely useless when doing a table lookup. If you use a commonly joined field for the PI, the database can go directly to the AMP on which the data is stored. However, without this, the DBMS must do a full-table scan every time. You could likely use ENCODED_VAR as your PRIMARY INDEX and see much improved performance, as long as the distribution was still reasonable.

I haven't verified this exact code, but it's very similar to my own solution, so I think it will work in your case.
