开发者

mysql query slow trying to find "missing" records in one table based on rows in a second table

开发者 https://www.devze.com 2023-03-06 11:05 出处:网络
Background: An ecommerce web application, but with ‘members’ who order from each other. When a member performs various activities (such as searches), records are inserted into an \'Activity\'开发者_

Background:

An ecommerce web application, but with ‘members’ who order from each other. When a member performs various activities (such as searches), records are inserted into an 'Activity'开发者_Go百科 table. If/when the member submits an order, a record is inserted into an 'Order' table. The goal is to find cases where a member performed an activity but did NOT place an order within some window of time (an hour, say) after the time of the activity.

NOTE: The code that creates order records cannot be changed. If it could be, I could simply 'remember' the activities, and include this information in the order records. Then, to find the cases where a member performed an activity but did not order would be simple: just look for a NULL value (or some other default value) in this column of the order table. Again, alas, this is not possible in my situation...

Tables:

  • Order (id, ts /* timestamp */, sending_member_id, receiving_member_id, …)
  • Member (id, name, …)
  • Activity_Type (id, name, …)
  • Activity_Log (id, ts, member_id, type_id, extra_info)

Indexes:

All appropriate indexes are in place. Specifically, an index on order.ts does exist.

I’ve tried these three queries:

APPROACH 1

SELECT …
  FROM activity_log, 
       Member
 WHERE activity_log.member_id = member.id
   AND activity_log.type_id = 1 /* Search */
   AND activity_log.ts > [start time]
   AND activity_log.ts < [end time]
   AND NOT EXISTS (SELECT ‘x’
                     FROM order
                    WHERE order.ts >= activity_log.ts
                      AND order.ts <= activity_log.ts + 3600
                      AND order.sending_member_id = activity_log.member_id)
ORDER BY activity_log.member_id, activity_log.ts desc

APPROACH 2

SELECT …
  FROM activity_log, member
 WHERE activity_log.member_id = member.id
   AND activity_log.type_id = 1 /* Search */
   AND activity_log.ts > [start time]
   AND activity_log.ts < [end time]
   AND activity_log.member_id NOT IN (SELECT order.sending_member_id
                                        FROM order
                                       WHERE order.ts >= activity_log.ts
                                         AND order.ts <= activity_log.ts + 3600)
ORDER BY activity_log.member_id, activity_log.ts desc

APPROACH 3

   SELECT …
     FROM activity_log
     JOIN member ON activity_log.member_id = member.id
LEFT JOIN order ON order.ts >= activity_log.ts 
               AND order.ts <= activity_log.ts + 3600 
               AND activity_log.member_id = order.sending_member_id
    WHERE activity_log.type_id = 1 /* Search */
      AND activity_log.ts > [start time]
      AND activity_log.ts < [end time]
      AND order.sending_member_id IS NULL
 ORDER BY activity_log.member_id, activity_log.ts desc

Even with approach 3, the query runs for 20-30 seconds and doesn’t use the index on order.ts.


For MySQL, the choice depends on the column(s) being compared:

  • If nullable, NOT EXISTS and NOT IN (with additional checks for NULLS) are the most efficient methods to implement an anti-join in MySQL.
  • If not nullable, use either a LEFT JOIN / IS NULL or NOT IN rather than NOT EXISTS.

Indexes will be useless if you alter the data to make the comparison:

AND order.ts <= activity_log.ts + 3600 

In this example, an index on ACTIVITY_LOG.ts won't matter. Consider composite indexes (single index, more than one column).


Then, to find the cases where a member performed an activity but did not order would be simple: just look for a NULL value (or some other default value) in this column of the order table. Again, alas, this is not possible in my situation...

If you really can't, then I'm afraid to suggest that there is no magic bullet.

Short of revisiting your schema (which, as you've correctly diagnosed, is the correct way to proceed), your best option will be an anti-join (i.e. the left join ... where is null).

Without any constraints (say, on a subset of recent order/activity dates), you're basically joining two tables together (in your case, three, but I suspect you can drop the one on members).

The best you can do is restrict yourself to a subset of timestamps on both orders and activity logs. This should allow you to reduce the join size. To make this happen you might need to introduce multi-colum indexes with the constraint to the left and the join clause to the right, e.g. activity_log(ts, member_id) and orders(ts, member_id), or with the columns the other way around depending on your data.


FYI, I changed the third query above as shown below, and it greatly speeds up my query when the [start time] - [end time] window is small relative to the overall timespan represented by the rows in the activity_log and order tables. Basically, MySQL decides whether or not to use indexes (on ts) on each table based on how helpful they would be... If I ask for "all time", MySQL quite rightly does not use the indexes. But if I ask for a small duration, MySQL does finally use the indexes.

SELECT …      
FROM activity_log JOIN member 
  ON activity_log.member_id = member.id 
LEFT JOIN (select * from order where order.ts > [start time] and order.ts < [end time + 3600]) orders
  ON order.ts >= activity_log.ts
    AND order.ts <= activity_log.ts + 3600
    AND activity_log.member_id = order.sending_member_id
WHERE activity_log.type_id = 1 /* Search */
AND activity_log.ts > [start time]
AND activity_log.ts < [end time]
AND order.sending_member_id IS NULL
ORDER BY activity_log.member_id, activity_log.ts desc 

I did not need the new indexed endts column.

Much thanks to the folks who responded, especially Denis, whose comment led me to this solution. -M

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号