开发者

What's the fastest way to obtain the maximum value of a column in a table?

开发者 https://www.devze.com 2023-03-12 23:51 出处:网络
INFORMIX-SQL 4.10.DC1 (SE Engine), on DOS 6.22, under Microsoft Virtual PC 2007, in Windows 7! EDIT: Looking for Jonathan Leffler\'s wisdom on this one!

INFORMIX-SQL 4.10.DC1 (SE Engine), on DOS 6.22, under Microsoft Virtual PC 2007, in Windows 7!

EDIT: Looking for Jonathan Leffler's wisdom on this one!

I have a child table called transaction and a parent table cal开发者_如何学Goled customer.

These tables are joined by customer.pk_id SERIAL = transaction.fk_id INTEGER.

The transaction table has a clustered index on fk_id so that each customers transactions are physically grouped together with the data matching the index. The reason I chose the cluster index on fk_id is because most queries are done on customer name and this quickly returns all of the transactions belonging to the queried customer.

The transaction table also has a column called transaction_number SERIAL with a unique index.

When a new transaction is added for any customer, a new row is added at EOF with the highest transaction_number.. Note: Informix does not maintain the physical clustering in the datafile after the index is created.

So when I want to print a customer receipt of the last transaction entered I do the following query:

SELECT * 
  FROM customer c, 
       transaction t
 WHERE c.pk_id = t.fk_id
   AND t.trx_num = (SELECT MAX(trx_num)
                      FROM transaction)

EDIT: The following INDEXES exist:

UNIQUE INDEX ON customer(pk_id) {SERIAL}
UNIQUE CLUSTER INDEX ON customer(last_name,sur_name,first_name,middle_name) {CHAR(30's)}
UNIQUE INDEX ON customer(ident_type,ident_num,ident_state,ident_country) {CHARS(n)}

UNIQUE CLUSTER INDEX ON transaction(fk_id) {INT}
UNIQUE INDEX ON transaction(trx_num) {SERIAL}

EDIT: EXPLAIN results for the above query:

QUERY:
------
SELECT *
  FROM customer c,
       transaction t,
 WHERE c.pk_id   = t.fk_id           
   AND t.trx_num = (SELECT MAX(trx_num) 
                      FROM transaction)

Estimated Cost: 14
Estimated # of Rows Returned: 2

1) f.transaction: INDEX PATH

    (1) Index Keys: trx_num 
        Lower Index Filter: f.transaction.trx_num = <subquery> 

2) f.customer: INDEX PATH

    (1) Index Keys: pk_id 
        Lower Index Filter: f.customer.pk_id = f.transaction.fk_id 

    Subquery:
    ---------
    Estimated Cost: 3
    Estimated # of Rows Returned: 1

    1) f.trx_num: INDEX PATH

        (1) Index Keys: trx_num 

It seems like the query optimizer is doing a full table scan on transaction in order to locate the highest transaction number, whose new row is always placed at EOF , but since transactions are grouped by fk_id, the rest of the transaction numbers are scattered throughout the table.

Is there a better way for obtaining a quicker response time on the query?

Would CREATE UNIQUE INDEX trxnumidx ON transaction(transaction_number) DESCENDING help in quickly locating the MAX transaction_number?

EDIT: I do have a similar query which quickly retrieves the desired transaction and customer information so that the user can print the last entered transaction, or any previously entered transaction, but it requires the user to input the transaction number:

SELECT * 
      FROM customer c, 
           transaction t
     WHERE c.pk_id = t.fk_id
       AND t.trx_num = $trxnum  {ace input variable, type INT}

What baffles me about this query, where the user manually inputs the transaction number, is that retrieval is instantaneous (total cost = 2), where as when automatic with MAX (total cost = 14).. The reason why I chose to make this query automatic is that in the past, when users manually input the transaction number, sometimes they inadvertently entered the wrong transaction number which happened to be a valid number and without realizing it, they signed and gave the customer a receipt with the wrong information on it!

EDIT: Would DBINFO('sqlca.sqlerrd1') be a more efficient way of locating the most recently inserted row?


Since you are wanting the maximum transaction number for a fk_id, I think I would expect to create an (optionally UNIQUE) index:

CREATE {UNIQUE} INDEX ix_lookup1 ON Transaction(FK_ID, Transaction_Number);

The index can be unique since transaction number alone is unique, but the point of this index is to speed up lookups for a given fk_id value.

I would then expect to use a query like the one that oers suggested:

SELECT * 
  FROM customer AS c, transaction AS t
 WHERE c.pk_id = t.fk_id
   AND c.pk_id = 12345678         -- crucial to limit the data to one return row
   AND t.transaction_number =
          (SELECT MAX(transaction_number)
             FROM transaction AS t2
            WHERE t2.fk_id = c.pk_id)

This is still a correlated sub-query, but given that the customer number is used, there will only be one row of data to return and it is blatant to the optimizer that this is the case. Depending on the programming language I was using, I might well arrange for the sub-query to use:

SELECT * 
  FROM customer AS c, transaction AS t
 WHERE c.pk_id = t.fk_id
   AND c.pk_id = 1234567
   AND t.transaction_number =
          (SELECT MAX(transaction_number)
             FROM transaction AS t2
            WHERE t2.fk_id = 1234567)

Or even:

SELECT * 
  FROM customer AS c, transaction AS t
 WHERE c.pk_id = t.fk_id
   AND c.pk_id = 1234567
   AND t.transaction_number =
          (SELECT MAX(transaction_number)
             FROM transaction AS t2
            WHERE t2.fk_id = 1234567)

You could use placeholders or create the SQL from a string as suits.

Even better, I'd avoid the problem by capturing the inserted serial value immediately after the insert so that (a) if the program ever went multi-user, it would still work and (b) the whole MAX issue would be avoided, and the query could become:

SELECT * 
  FROM customer AS c, transaction AS t
 WHERE c.pk_id = t.fk_id
   AND c.pk_id = 1234567
   AND t.transaction_number = 23456789;

Note that with the c.pk_id = 1234567 for the customer specified as a parameter to the query, you might not even need the index I suggested. And wherever I used a number, it would be fine with me to use a placeholder for the value to be supplied when the query is executed.

Remember, you are using software that is old enough to drink, let alone vote; the optimizer is not as good as the one you'd find in a more modern DBMS.


Can you try using JOIN instead of the (implicit JOIN with) WHERE:

SELECT * 
FROM customer
  JOIN transaction
    ON customer.pk_id = transaction.fk_id
WHERE transaction.transaction_number = ( SELECT MAX(transaction_number)
                                         FROM transaction)

or this:

SELECT * 
FROM customer
  JOIN transaction
    ON customer.pk_id = transaction.fk_id
  JOIN ( SELECT MAX(transaction_number) AS max_tn
         FROM transaction
       ) AS tr
    ON transaction.transaction_number = tr.max_tn

Without using JOINs, I can think of this:

SELECT * 
  FROM customer
     , transaction
 WHERE customer.pk_id = transaction.fk_id
   AND transaction.transaction_number IN 
       ( SELECT MAX(transaction_number)
         FROM transaction
       )

Perhaps the IN (SELECT ...) is better optimized than = (SELECT ...)


You can also try putting the tables (and conditions) in reverse order:

SELECT * 
  FROM transaction, customer 
 WHERE transaction.transaction_number
       = ( SELECT MAX(transaction_number)
           FROM transaction )
  AND customer.pk_id = transaction.fk_id

No idea if the following will work or if it will raise error but it's worth trying:

SELECT * 
FROM 
      ( SELECT MAX(transaction_number) AS max_tn
        FROM transaction
      )
  AS tr
   , transaction
   , customer
WHERE transaction.transaction_number = tr.max_tn
  AND customer.pk_id = transaction.fk_id
0

精彩评论

暂无评论...
验证码 换一张
取 消