Database: Best practice - old data?

I have a database of automobile classified listings.

After 90 days, the classified listing is no longer valid to be displayed (the listing expires); however, I want to retain the listing for archive purposes.

Question: From a database design best-practice perspective, as well as for query performance, is it better to A) keep the old listing in the same table as the current listings, or B) move the expired listing over to an expired table and delete it from the current listing table?

In other words,

Option A):

table_classified_listing:
car_id
expired = true | false
...

Option B):

// only current listing in this table (expired = false)
table_classified_listing:
car_id
...

// only expired listing in this table (expired = true)
expired_table_classified_listing:
car_id
...

UPDATE:

My concern with Option A is that when I run EXPLAIN in my MySQL database, it says that it's using the expired column as the primary key to index on. However, what matters more for my search performance is that it uses the price field, since I'm searching on price > X. That is why I'm considering Option B.

Option A). That way you have all your data in one place and can more easily create queries for reporting, listing a user's historical entries, etc. Any speed issues should be mitigated by the database's index on that column. Option B) is premature optimisation.

General advice (you have to fill the gaps ;-) ):

Performance will only be significant in some cases (over a million records, huge row size...).
Will you query the two tables with unions, or with identical queries? If you will not query the tables using the same queries, then I suggest different tables (a possible performance gain as the number of records grows, but mostly a gain in clarity of meaning).

A problem with duplication is that it might increase the work (writing queries, testing them...). But all technologies (especially modern ones) allow you to reduce or eliminate the duplication.

For example, with an ORM, you can have an abstract entity that maps to the common fields but no table, and two subclasses that map to your tables. No duplication of column information. And the ORM can create your database scripts too, so you don't even have these (although you should review them by hand for production database, of course).

UPDATE after the update in the question:

You can create the indexes you want, don't worry. If the performance you're after is for queries on non-expired data with a price over X, create an index on (expired, price) and you're just fine :-)
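A minimal sketch of the (expired, price) index, using Python's built-in sqlite3 module as a stand-in for MySQL. That substitution is an assumption made only to keep the example self-contained; MySQL's planner and its EXPLAIN output differ. The table name classified_listing, the index name idx_expired_price and the sample rows are hypothetical.

```python
import sqlite3


def build_listing_db():
    """Create an in-memory listings table with a composite (expired, price) index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE classified_listing ("
        " car_id INTEGER PRIMARY KEY,"
        " expired INTEGER NOT NULL,"  # 0 = current, 1 = expired
        " price INTEGER NOT NULL)"
    )
    # Equality column first, range column second.
    conn.execute(
        "CREATE INDEX idx_expired_price ON classified_listing (expired, price)"
    )
    # Hypothetical sample data: every fourth listing is still current.
    rows = [(i, 1 if i % 4 else 0, 1000 * i) for i in range(1, 41)]
    conn.executemany("INSERT INTO classified_listing VALUES (?, ?, ?)", rows)
    return conn


def current_listings_over(conn, min_price):
    """Return (car_id, price) for non-expired listings priced above min_price."""
    cur = conn.execute(
        "SELECT car_id, price FROM classified_listing"
        " WHERE expired = 0 AND price > ? ORDER BY price",
        (min_price,),
    )
    return cur.fetchall()


def query_plan(conn, min_price):
    """Return the planner's description of how the search would be executed."""
    cur = conn.execute(
        "EXPLAIN QUERY PLAN SELECT car_id, price FROM classified_listing"
        " WHERE expired = 0 AND price > ?",
        (min_price,),
    )
    # The last column of each plan row holds the human-readable detail.
    return [row[-1] for row in cur.fetchall()]
```

Calling query_plan(conn, 30000) lets you check whether the planner picked the composite index, which is the same check EXPLAIN gives you in MySQL.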

Don't use B, it is basically splitting off the attributes.

The way I would do it is to use two date columns instead: ValidFromDate and ValidToDate.
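A rough sketch of the two-date-column idea, again with Python's sqlite3 module standing in for MySQL (an assumption for the sake of a runnable example). ValidFromDate and ValidToDate come from the answer; the table name, the price column, the helper names and the 90-day default are illustrative.

```python
import sqlite3


def make_db():
    """Create a listings table that records a validity window instead of a flag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE classified_listing ("
        " car_id INTEGER PRIMARY KEY,"
        " price INTEGER NOT NULL,"
        " ValidFromDate TEXT NOT NULL,"  # ISO-8601 text sorts in date order
        " ValidToDate TEXT NOT NULL)"
    )
    return conn


def add_listing(conn, car_id, price, valid_from, days_valid=90):
    """Insert a listing whose ValidToDate is valid_from plus days_valid days."""
    conn.execute(
        "INSERT INTO classified_listing VALUES (?, ?, ?, date(?, ?))",
        (car_id, price, valid_from, valid_from, f"+{days_valid} days"),
    )


def listings_valid_on(conn, day):
    """Return the car_ids whose validity window contains the given day."""
    cur = conn.execute(
        "SELECT car_id FROM classified_listing"
        " WHERE ValidFromDate <= ? AND ValidToDate > ? ORDER BY car_id",
        (day, day),
    )
    return [row[0] for row in cur.fetchall()]
```

With this layout nothing has to flip a flag when a listing expires: the same query answers "what is current today" and "what was current on some past date".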

At the rate that anyone accumulates listings such as you describe, it would be a long time before performance would degrade. And hardware and software performance increase faster.

Don't make something complicated until you are sure you need it and simple won't work. Keep it in one table. See the question about pessimizations - this is one.

Personally I would say to move all the expired ones into a separate table. As the database grows you are going to want better performance from your 'live' records, because those are likely to be hit the most often.

All the old records will cause the table size to continually grow, and that means slower queries, even with the query optimization and such that takes place.

EDIT: As someone else mentioned, one big con to this approach is if you plan on combining the live data and the archived data frequently. If you will always reference them separately then great, but if not you will have lots of joins and unions required to pull the data together, which is not ideal.
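To make the trade-off concrete, here is a sketch of Option B with Python's sqlite3 module standing in for MySQL (an assumption for runnability). The two table names follow the question; the listed_on column and the function names are hypothetical. The move is a copy plus a delete inside one transaction, and any query that needs both live and archived rows has to use a union.

```python
import sqlite3


def make_db():
    """Create the live table and the archive table with identical columns."""
    conn = sqlite3.connect(":memory:")
    for table in ("table_classified_listing", "expired_table_classified_listing"):
        conn.execute(
            f"CREATE TABLE {table} ("
            " car_id INTEGER PRIMARY KEY,"
            " price INTEGER NOT NULL,"
            " listed_on TEXT NOT NULL)"
        )
    return conn


def archive_expired(conn, cutoff_date):
    """Move listings listed before cutoff_date into the archive table."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO expired_table_classified_listing"
            " SELECT car_id, price, listed_on FROM table_classified_listing"
            " WHERE listed_on < ?",
            (cutoff_date,),
        )
        cur = conn.execute(
            "DELETE FROM table_classified_listing WHERE listed_on < ?",
            (cutoff_date,),
        )
    return cur.rowcount  # number of rows moved


def all_listings_over(conn, min_price):
    """Query live and archived listings together; this is where the union appears."""
    cur = conn.execute(
        "SELECT car_id, price, 0 AS expired FROM table_classified_listing"
        " WHERE price > ?"
        " UNION ALL"
        " SELECT car_id, price, 1 AS expired FROM expired_table_classified_listing"
        " WHERE price > ?"
        " ORDER BY car_id",
        (min_price, min_price),
    )
    return cur.fetchall()
```

If most of your queries look like all_listings_over rather than hitting only the live table, that is a sign the split is costing more than it saves.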

There are at least two additional options as well for the general problem of keeping old data:

Partition the data by date and either roll-off the date or detach the partition. Alternatively, implement each partition as a separate table and then join them with a union-all view. In this latter case you're usually better off with coarse granularity partitioning (month instead of day). MySQL should be able to support both of these solutions, and partitioning has the added advantage of improving query performance associated with queries of large portions of your table's data.
Export all the data you want to keep, truncate your table and then reload it. Seriously - reloading can be far faster than deleting when you're deleting a lot of data. Many databases don't have enough data to need to do this - at least they don't for a few years and then their administrators discover that they either need a hardware upgrade or to purge an entire year's worth of data. At that point this strategy is often the best.
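The "separate table per partition plus a union-all view" variant can be sketched as follows. Python's sqlite3 module is used as a stand-in for MySQL (an assumption; SQLite has no native partitioning, so only the view-based alternative is shown). The listing_YYYY_MM naming scheme, the all_listings view and the helper functions are hypothetical.

```python
import re
import sqlite3


def month_table(month):
    """Map a month such as '2024_01' to its table name, rejecting anything else."""
    if not re.fullmatch(r"\d{4}_\d{2}", month):
        raise ValueError(f"unexpected month format: {month!r}")
    return f"listing_{month}"


def rebuild_view(conn, months):
    """Recreate the union-all view over the monthly tables that currently exist."""
    conn.execute("DROP VIEW IF EXISTS all_listings")
    if not months:
        return  # nothing left to expose
    selects = " UNION ALL ".join(
        f"SELECT car_id, price, '{m}' AS listing_month FROM {month_table(m)}"
        for m in months
    )
    conn.execute(f"CREATE VIEW all_listings AS {selects}")


def add_month(conn, months, month):
    """Create a new monthly table and include it in the view."""
    conn.execute(
        f"CREATE TABLE {month_table(month)} ("
        " car_id INTEGER PRIMARY KEY, price INTEGER NOT NULL)"
    )
    months.append(month)
    rebuild_view(conn, months)


def drop_month(conn, months, month):
    """Roll off a whole month by dropping its table instead of deleting rows."""
    months.remove(month)
    rebuild_view(conn, months)  # remove the table from the view first
    conn.execute(f"DROP TABLE {month_table(month)}")
```

Rolling off a month this way is a metadata operation on one table, which is the appeal over deleting a large number of rows from a single big table.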

Back to your two offered solutions:

Keep the data in the same table. With your data volumes this is probably the best approach. But - at some point you may still want to drop it (7 years?) and at that point you can either have a little asynch job that does trickle-deletes, can drop partitions or can export/reload.
Keep archival data in a different table. This becomes most useful if you can exploit different (lesser) hardware for the less-frequently accessed archive data, like a separate server, a smaller number of CPUs, a different set of cheaper/slow disks, a smaller memory buffer, etc. MySQL doesn't have enough configurability to do some of this. Another reason is if your queries are often doing table-scans and if by moving most data out you can improve performance significantly. And this could be the case. You're using MySQL - which has a notoriously immature optimizer/planner and you aren't using partitioning. So, whenever an index can't be used you're going to table-scan. If you need lightning-fast queries, have a small server or a lot of rows - then I'd keep the old data on a separate table. But here's a possibly better way to do it:
keep data in two tables, but the first table has 100% of your data (new & old) and the second table has just the newest data. The reason for this approach is that there may be a variety of subsets or aggregates that you might want to generate - and now the table with the newest data is just one of many. These subset/aggregates aren't perfectly necessary - you can always just query your main table. However, analytical queries tend to pound a database pretty hard - and these tables can make them very fast. And frankly, any process worth spending time on is worth analyzing.
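The last suggestion (a main table holding everything plus a derived table holding only the newest data) might look like the sketch below. Python's sqlite3 module stands in for MySQL, which is an assumption for runnability; the table names all_listing and current_listing, the listed_on column and the refresh function are hypothetical.

```python
import sqlite3


def make_db():
    """Create the main table (all rows) and the derived table (newest rows only)."""
    conn = sqlite3.connect(":memory:")
    for table in ("all_listing", "current_listing"):
        conn.execute(
            f"CREATE TABLE {table} ("
            " car_id INTEGER PRIMARY KEY,"
            " price INTEGER NOT NULL,"
            " listed_on TEXT NOT NULL)"
        )
    return conn


def refresh_current(conn, today, days_valid=90):
    """Rebuild current_listing from all_listing; the main table is never modified."""
    with conn:  # one transaction, so readers never see a half-built subset
        conn.execute("DELETE FROM current_listing")
        conn.execute(
            "INSERT INTO current_listing"
            " SELECT car_id, price, listed_on FROM all_listing"
            " WHERE listed_on > date(?, ?)",
            (today, f"-{days_valid} days"),
        )
```

Because the subset is derived, it can be dropped and rebuilt at any time, and further subsets or aggregates can be added the same way without touching the main table.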

Here is what I understand:

Since these are classified listings, the data is "short-lived" in nature, and expires.
As a result, the volume of expired data can exceed the "current" or un-expired data.

If I understood the above correctly, the next question is: how often is your expired data used, and what is it used for? As @ghills pointed out, SQL unions might slow you down.

If expired data does not need to be online, it might make sense to archive it away to a separate table. Especially if # of expired-rows can exceed the # of active-rows.

If you keep them in the same table, "where expired=false" could end up being your constant companion, and since selectivity will be low (i.e. lots of expired-rows), indexing on "expired" column will not get you a lot of bang for the buck. (Oracle has bit-map indices - but that may not apply here at all).

I would keep them in one table. Otherwise, (a) You have two tables with identical columns. Then any time you make changes to the data, you have to remember to change both tables identically. Sooner or later someone will forget -- or get the bright idea that data in one table is not needed in the other -- and now your design gets more complicated. Pretty soon you're writing exactly the same logic twice: once to retrieve from the "current" table and again to retrieve from the "archive" table. But then someone makes a change to one piece of code and forgets to make the same change to the other piece of code. Then the next person to come along can't be sure if they're different because there's a good reason why they should be different or if someone just forgot. Etc. (b) It seems likely that you will have queries that will want to hit both tables, like "tell me all the ads with asking price over $20,000 in the past 12 months", where some of the ads could be current and others archived. These queries are now unions or complicated joins instead of simply not including the "expired is true" or "expired is false" flag.

As to the performance issue, that's easy: Create a multi-field key that includes whatever you need to include. expired + price or expired + modelname seem likely keys. You probably want to put expired first because most of your queries are probably going to want non-expired records, but I'm just guessing. Choosing what's worth indexing on is a complicated decision, but when there are obvious common queries on multiple fields, just do it.

There is no such thing as a universal best practice. However, if the table tends to become huge and your searches are taking too much time, then you may need to archive the items in a separate table or similar. Otherwise, you can implement proper indexing to make things faster. It really depends on the amount and type of data you are considering.