SQL query - choosing 'last updated' record in a group, better db design?

Let's say I have a MySQL database with 3 tables:

table 1: Persons, with 1 column ID (int)

table 2: Newsletters, with 1 column ID (int)

table 3: Subscriptions, with columns Person_ID (int), Newsletter_ID (int), Subscribed (bool), Updated (Datetime)

Subscriptions.Person_ID points to a Person, and Subscriptions.Newsletter_ID points to a Newsletter. Thus, each person may have 0 or more subscriptions to 0 or more newsletters at once. The table Subscriptions will also store the entire history of each person's subscriptions to each newsletter. If a particular Person_ID-Newsletter_ID pair doesn't have a row in the Subscriptions table, then it's equivalent to that pair having a subscription status of 'false'.

Here is a sample dataset

Persons
ID
1
2
3

Newsletters
ID
1
2
3

Subscriptions
Person_ID  Newsletter_ID  Subscribed  Updated
2                1           true     2010-05-01
3                1           true     2010-05-01
3                2           true     2010-05-10
3                1           false    2010-05-15

Thus, as of 2010-05-16, Person 1 has no subscription, Person 2 has a subscription to Newsletter 1, and Person 3 has a subscription to Newsletter 2. Person 3 had a subscription to Newsletter 1 for a while, but not anymore.

I'm trying to do 2 kinds of query.

1. A query that shows everyone's active subscriptions as of query time. We can assume that Updated will never be in the future, so this means returning the record with the latest Updated value for each Person_ID-Newsletter_ID pair, as long as Subscribed is true. If the latest record for a pair has a Subscribed status of false, I don't want that record returned.
2. A query that returns all active subscriptions for a specific newsletter, with the same qualification as in 1. regarding records with 'false' in the Subscribed column.

I don't use SQL/databases often enough to tell if this design is good, or if the SQL queries needed would be slow on a database with, say, 1M records in the Subscriptions table.

I was using the Visual query builder tool in Visual Studio 2010 but I can't even get the query to return the latest updated record for each Person_ID-Newsletter_ID pair.

Is it possible to come up with SQL queries that don't involve subqueries (I presume they would become too slow with a larger data set)? If not, would it be a better design to have a separate Subscriptions_History table, so that every time a subscription status for a Person_ID-Newsletter_ID pair is added to Subscriptions, any existing record for that pair is moved to Subscriptions_History? That way the Subscriptions table would only ever contain the latest status update for any Person_ID-Newsletter_ID pair.

I'm using .NET on Windows, so would it be easier (or the same, or harder) to do these kinds of queries using LINQ or Entity Framework?

Edit: Here's what happens if I use this query:

SELECT     Person_ID, Newsletter_ID, Subscribed, Updated, MAX(Updated) AS Expr1
FROM         subscriptions
GROUP BY Person_ID, Newsletter_ID

I get rows 2 and 4 from the Subscriptions table mashed together (in row 2 of the result set below):

Person_ID Newsletter_ID Subscribed Updated     Expr1 
2         1             true       2010-05-01  2010-05-01 
3         1             true       2010-05-01  2010-05-15 
3         2             true       2010-05-10  2010-05-10

Thanks!

I've recently run into a somewhat similar problem.

I'm not an SQL expert, so I can't really give much advice on what's the best design for this. But until the pros chip in, maybe this helps:

SELECT s.Person_ID, s.Newsletter_ID  
FROM (
 SELECT MAX(ID) AS mid
 FROM Subscriptions
 GROUP BY 
  Person_ID,Newsletter_ID
) q
JOIN Subscriptions s
ON q.mid = s.ID
WHERE s.Subscribed = 1

Notice I've added an ID column to your Subscriptions table (I'll explain why in a sec).

Now, let's break down how this works (or how I think it works, anyway; I'd be glad to be corrected if I'm wrong).

First, you group the records by person / newsletter. This is what the subquery does (yes, I know you said you'd rather not have subqueries, but I'm not sure you can do it without one). I'm grouping by Person_ID and Newsletter_ID. Each group can contain more than one row, which is why I'm selecting MAX(ID). If you use an auto-increment ID and it's safe to assume the row with the highest ID is the newest one for the group (i.e. if you don't insert IDs manually), this subquery will get you the ID of the last row for each person / newsletter.

So, you can join this with the Subscriptions table: the join condition is that the ID of the Subscriptions row has to match the MAX(ID) you retrieved from the subquery. Here you are only considering the most recent record for each newsletter / person. Then, you filter out inactive subscriptions by using a WHERE condition.

If you want to restrict the result to a given newsletter (or a given person), add that condition to the WHERE clause.

Indexes on the grouped columns should help this query run faster.
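To try the MAX(ID) approach on the sample data from the question, here is a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory SQLite database purely so it is self-contained; that is my assumption, since the question is about MySQL, where the SQL should carry over with AUTO_INCREMENT in place of AUTOINCREMENT. The index name idx_sub_pair is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Subscriptions (
        ID            INTEGER PRIMARY KEY AUTOINCREMENT,  -- the added surrogate key
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,                   -- 1 = true, 0 = false
        Updated       TEXT    NOT NULL
    );
    -- Hypothetical index name; it covers the columns used by the GROUP BY.
    CREATE INDEX idx_sub_pair ON Subscriptions (Person_ID, Newsletter_ID);
""")

# Sample rows from the question, inserted in chronological order.
rows = [
    (2, 1, 1, "2010-05-01"),
    (3, 1, 1, "2010-05-01"),
    (3, 2, 1, "2010-05-10"),
    (3, 1, 0, "2010-05-15"),
]
conn.executemany(
    "INSERT INTO Subscriptions (Person_ID, Newsletter_ID, Subscribed, Updated) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# Latest row per pair (highest ID), then keep only the active ones.
active = conn.execute("""
    SELECT s.Person_ID, s.Newsletter_ID
    FROM (
        SELECT MAX(ID) AS mid
        FROM Subscriptions
        GROUP BY Person_ID, Newsletter_ID
    ) q
    JOIN Subscriptions s ON q.mid = s.ID
    WHERE s.Subscribed = 1
    ORDER BY s.Person_ID, s.Newsletter_ID
""").fetchall()

print(active)
```

On this data, the pair (3, 1) drops out because its newest row has Subscribed = 0, which matches what the question expects.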

Hope this helps.

Added

If for some reason you can't guarantee that MAX(Subscriptions.ID) will correspond to the last inserted row, you can probably do something like this (which follows the same logic, I think, but is a bit more verbose and probably less efficient):

SELECT Person_ID, Newsletter_ID  
FROM (
 SELECT MAX(Updated) AS upd, Newsletter_ID AS nid, Person_ID AS pid 
 FROM Subscriptions
 GROUP BY 
  Person_ID,Newsletter_ID
) q
JOIN Subscriptions s
ON q.pid = s.Person_ID AND q.nid = s.Newsletter_ID and q.upd = s.Updated
WHERE Subscribed = 1

New edit

On second thought, I was worried that the alternative I've added (the one with MAX(Updated)) could pair the MAX(Updated) value with the wrong Newsletter_ID and Person_ID. Since both columns are in the GROUP BY, though, they are the same for every row in a group, so the join condition is sound. The real caveat is ties: if two rows for the same Person_ID-Newsletter_ID pair share the same Updated value, the join returns both of them.
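A quick way to probe the MAX(Updated) variant is to run it against the sample data from the question. The sketch below again uses Python's sqlite3 with an in-memory database only to be self-contained (my assumption; the question targets MySQL). The extra row with a duplicate timestamp is hypothetical and is there only to show what happens on a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Subscriptions ("
    "Person_ID INTEGER, Newsletter_ID INTEGER, Subscribed INTEGER, Updated TEXT)"
)

# Latest row per pair, matched on the pair plus its MAX(Updated) value.
LATEST_BY_UPDATED = """
    SELECT s.Person_ID, s.Newsletter_ID, s.Subscribed
    FROM (
        SELECT Person_ID AS pid, Newsletter_ID AS nid, MAX(Updated) AS upd
        FROM Subscriptions
        GROUP BY Person_ID, Newsletter_ID
    ) q
    JOIN Subscriptions s
      ON q.pid = s.Person_ID AND q.nid = s.Newsletter_ID AND q.upd = s.Updated
    ORDER BY s.Person_ID, s.Newsletter_ID, s.Subscribed
"""

sample = [
    (2, 1, 1, "2010-05-01"),
    (3, 1, 1, "2010-05-01"),
    (3, 2, 1, "2010-05-10"),
    (3, 1, 0, "2010-05-15"),
]
conn.executemany("INSERT INTO Subscriptions VALUES (?, ?, ?, ?)", sample)

# With distinct timestamps per pair, each pair yields exactly one row.
latest = conn.execute(LATEST_BY_UPDATED).fetchall()
print(latest)

# Hypothetical tie: a second row for pair (3, 1) with the same timestamp.
conn.execute("INSERT INTO Subscriptions VALUES (3, 1, 1, '2010-05-15')")
with_tie = conn.execute(LATEST_BY_UPDATED).fetchall()
print(with_tie)
```

If Updated is only a date, or if two changes can land in the same second, a tie-breaker such as an auto-increment ID is probably the safer choice.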

Separate your Subscriptions into 2 tables:

The first stores the actual subscriptions list (the subscriptions that are active right now): Person_Id | Newsletter_Id
The second stores the subscriptions log (every update or status change)
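A rough sketch of how the two-table split could look, using Python's sqlite3 with an in-memory database so it runs on its own (an assumption on my part; the question uses MySQL). The table name Subscriptions_Log and the helper set_subscription are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Current state: one row per active Person/Newsletter pair.
    CREATE TABLE Subscriptions (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        PRIMARY KEY (Person_ID, Newsletter_ID)
    );
    -- Append-only log of every status change (hypothetical table name).
    CREATE TABLE Subscriptions_Log (
        Person_ID     INTEGER NOT NULL,
        Newsletter_ID INTEGER NOT NULL,
        Subscribed    INTEGER NOT NULL,
        Updated       TEXT    NOT NULL
    );
""")


def set_subscription(person_id, newsletter_id, subscribed, updated):
    """Record a status change in the log and keep the current table in sync."""
    with conn:  # one transaction, so both tables change together
        conn.execute(
            "INSERT INTO Subscriptions_Log VALUES (?, ?, ?, ?)",
            (person_id, newsletter_id, int(subscribed), updated),
        )
        if subscribed:
            # INSERT OR IGNORE is SQLite syntax; MySQL spells it INSERT IGNORE.
            conn.execute(
                "INSERT OR IGNORE INTO Subscriptions VALUES (?, ?)",
                (person_id, newsletter_id),
            )
        else:
            conn.execute(
                "DELETE FROM Subscriptions WHERE Person_ID = ? AND Newsletter_ID = ?",
                (person_id, newsletter_id),
            )


# Replay the sample history from the question.
set_subscription(2, 1, True, "2010-05-01")
set_subscription(3, 1, True, "2010-05-01")
set_subscription(3, 2, True, "2010-05-10")
set_subscription(3, 1, False, "2010-05-15")

active = conn.execute(
    "SELECT Person_ID, Newsletter_ID FROM Subscriptions "
    "ORDER BY Person_ID, Newsletter_ID"
).fetchall()
log_count = conn.execute("SELECT COUNT(*) FROM Subscriptions_Log").fetchone()[0]
print(active, log_count)
```

With this layout both queries from the question become plain SELECTs on the small table, at the cost of having to keep the two tables consistent on every write.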

"Ordered analytical functions" (window functions) are a standard method for this type of problem; note that MySQL only supports them from version 8.0 onward. 1M records is no problem, depending on the power of your machine, of course.

MAX( Updated) OVER( PARTITION BY list of fields over which you want "max" )

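SELECT
  x.*
FROM
  (
    SELECT
       Person_ID
       , Newsletter_ID
       , Subscribed
       , Updated
       -- partition by the pair only, so the latest row wins whatever its status
       , MAX(Updated) OVER( PARTITION BY Person_ID, Newsletter_ID ) AS myUpdated
    FROM Subscriptions
  ) x
WHERE x.Updated = x.myUpdated
  AND x.Subscribed   -- keep only pairs whose latest status is 'subscribed'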
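Which columns go into PARTITION BY matters for this kind of query, so here is a small sketch that compares two choices on the question's sample data. It uses Python's sqlite3 with an in-memory database (my assumption, to keep it self-contained) and needs an SQLite build with window function support, which means 3.25 or later. The helper latest_active is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Subscriptions ("
    "Person_ID INTEGER, Newsletter_ID INTEGER, Subscribed INTEGER, Updated TEXT)"
)
conn.executemany(
    "INSERT INTO Subscriptions VALUES (?, ?, ?, ?)",
    [
        (2, 1, 1, "2010-05-01"),
        (3, 1, 1, "2010-05-01"),
        (3, 2, 1, "2010-05-10"),
        (3, 1, 0, "2010-05-15"),
    ],
)


def latest_active(partition_cols):
    """Active rows whose Updated equals the maximum within their partition.

    partition_cols is a fixed column list written in this script,
    never user input, so plain string formatting is acceptable here.
    """
    sql = f"""
        SELECT Person_ID, Newsletter_ID
        FROM (
            SELECT Person_ID, Newsletter_ID, Subscribed, Updated,
                   MAX(Updated) OVER (PARTITION BY {partition_cols}) AS myUpdated
            FROM Subscriptions
        ) x
        WHERE Updated = myUpdated AND Subscribed = 1
        ORDER BY Person_ID, Newsletter_ID
    """
    return conn.execute(sql).fetchall()


# Partitioning by the pair: the newest row decides, whatever its status.
per_pair = latest_active("Person_ID, Newsletter_ID")

# Adding Subscribed to the partition keeps the newest row per status,
# so the stale 'true' row for pair (3, 1) is reported as active.
per_pair_and_status = latest_active("Person_ID, Newsletter_ID, Subscribed")

print(per_pair)
print(per_pair_and_status)
```

The second result includes (3, 1) even though that subscription was cancelled on 2010-05-15, which is why the partition should contain only Person_ID and Newsletter_ID.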

I think your design is pretty good. Subqueries are not inherently slow; use them if they are the best way to express your query.

Here's the query that gets you all of the latest (i.e. not overridden) directives:

SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated)
FROM Subscriptions GROUP BY Person_ID, Newsletter_ID

Then you can use this query as a subquery of another query to get what you want. For your query #1:

SELECT x.Person_ID, x.Newsletter_ID FROM
  (SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated) 
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
WHERE x.Subscribed;

For query #2:

SELECT x.Person_ID FROM
  (SELECT Person_ID, Newsletter_ID, Subscribed, MAX(Updated)
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
WHERE x.Subscribed AND x.Newsletter_ID = ?

You'll definitely want an index on Newsletter_ID in the Subscriptions table, as this query will likely be very selective.

Edit: Whoops, the Subscribed column in the subquery can come from an arbitrary row, not the one that generates the MAX(Updated). You have to rejoin with the original table:

SELECT x.Person_ID, x.Newsletter_ID, y.Subscribed FROM
  (SELECT Person_ID, Newsletter_ID, MAX(Updated) AS MaxUpdated
   FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
  -- rejoin on the pair plus its latest timestamp to pick up that row's status
  JOIN Subscriptions y ON x.Person_ID = y.Person_ID AND
                          x.Newsletter_ID = y.Newsletter_ID AND
                          x.MaxUpdated = y.Updated
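Putting the rejoin to work for query #1 and query #2 could look roughly like this. The sketch uses Python's sqlite3 with an in-memory database so it is self-contained (an assumption; the question is about MySQL), and the index name idx_sub_newsletter and the helper active_subscribers are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Subscriptions (
        Person_ID INTEGER, Newsletter_ID INTEGER, Subscribed INTEGER, Updated TEXT
    );
    -- Hypothetical index name, following the advice to index Newsletter_ID.
    CREATE INDEX idx_sub_newsletter ON Subscriptions (Newsletter_ID);
""")
conn.executemany(
    "INSERT INTO Subscriptions VALUES (?, ?, ?, ?)",
    [
        (2, 1, 1, "2010-05-01"),
        (3, 1, 1, "2010-05-01"),
        (3, 2, 1, "2010-05-10"),
        (3, 1, 0, "2010-05-15"),
    ],
)

# Latest status per pair: group first, then rejoin to read that row's status.
LATEST = """
    SELECT y.Person_ID, y.Newsletter_ID, y.Subscribed
    FROM (SELECT Person_ID, Newsletter_ID, MAX(Updated) AS MaxUpdated
          FROM Subscriptions GROUP BY Person_ID, Newsletter_ID) x
    JOIN Subscriptions y
      ON x.Person_ID = y.Person_ID
     AND x.Newsletter_ID = y.Newsletter_ID
     AND x.MaxUpdated = y.Updated
"""

# Query #1: every active subscription as of now.
query1 = conn.execute(
    f"SELECT Person_ID, Newsletter_ID FROM ({LATEST}) AS latest "
    "WHERE Subscribed = 1 ORDER BY Person_ID, Newsletter_ID"
).fetchall()


def active_subscribers(newsletter_id):
    """Query #2: people whose latest status for one newsletter is 'subscribed'."""
    cursor = conn.execute(
        f"SELECT Person_ID FROM ({LATEST}) AS latest "
        "WHERE Subscribed = 1 AND Newsletter_ID = ? ORDER BY Person_ID",
        (newsletter_id,),
    )
    return [row[0] for row in cursor]


print(query1)
print(active_subscribers(1), active_subscribers(2), active_subscribers(3))
```

The newsletter ID is passed as a bound parameter, which corresponds to the ? placeholder in the query #2 text above.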