MySQL. Is it better (performance) to have one table of 1M records, or 10 tables 100K records each?

This may be asked before, but here's the situation anyway.

I have one big table (on MySQL using InnoDB), that is basically a huge log, no relational fancy stuff.

Three fields: Customer_ID, TimeStamp, and Log_Data (a TINYTEXT such as 'Visited Front Webpage' or 'Logged In').

Since I'm logging the activity of clients on a webpage that receives around 10,000 users a day, that table grows pretty fast.

At any given moment, I want to know how many clients have actually done anything on the site.

So I'm running the query 'SELECT DISTINCT Customer_ID FROM table;', and I've noticed that as the table grows the query takes longer, which is perfectly fine and totally expected. At one point the query started taking more than 5 minutes to complete.

I wanted to find a faster way, so I tried this. Let's say I'm working with a table of 1 million rows. I started by splitting that table into 10 tables of 100K records each. Then I ran 'SELECT DISTINCT Customer_ID FROM table;' on each one, piped all the results through 'sort | uniq | wc' on the command line, and arrived at the same result.

Surprisingly, that method took less than half the time of the other one.

I've pretty much answered the question myself: 10*100K tables are faster than 1*1M table. BUT maybe I'm doing something wrong, or maybe it's more a matter of performance tuning, because tables should be designed to perform well no matter their size.

Let me know what you think.

Thanks for reading.

UPDATE: Here's how I create my table:

CREATE TABLE `mydb`.`mytable` (
 `Customer_ID` BIGINT( 20 ) UNSIGNED NOT NULL,
 `unix_time` INT( 10 ) UNSIGNED NOT NULL,
 `data` TINYTEXT NOT NULL,
KEY `fb_uid` ( `fb_uid` )
) ENGINE = INNODB DEFAULT CHARSET = utf8;


While your 100K*10 solution does make the query faster, it sounds hard to maintain and is probably not the best approach.

"tables should be designed to perform well no matter their size"

You must realize this can't be true when the tables get too large for the DB engine you are using.

So what can you do? The solution probably concerns the types of queries you run on this data.

  • Is the query above the only one using this data?
  • If not, what other queries are running on that table?

One rule of thumb here is: don't store data you are not going to need. Another is: store the data in a way that makes it easy to query. Even if you do need the 1M rows of raw data, you can still keep some aggregated data (or metadata) in another table, e.g. a table of unique Customer_IDs per day, calculated at the end of each day (see the sketch below).
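
Here is a minimal sketch of that idea. The summary table name daily_unique_customers is made up for illustration; the log table name mytable and its columns come from the CREATE TABLE in the question's update:

CREATE TABLE daily_unique_customers (
  log_date    DATE NOT NULL,
  Customer_ID BIGINT( 20 ) UNSIGNED NOT NULL,
  PRIMARY KEY ( log_date, Customer_ID )
) ENGINE = INNODB;

-- Run once at the end of each day (e.g. from a cron job)
-- to pull yesterday's distinct customers out of the raw log.
INSERT IGNORE INTO daily_unique_customers (log_date, Customer_ID)
SELECT DATE(FROM_UNIXTIME(unix_time)), Customer_ID
FROM mytable
WHERE unix_time >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
  AND unix_time <  UNIX_TIMESTAMP(CURDATE());

-- "How many customers did anything?" then scans only the small table:
SELECT COUNT(DISTINCT Customer_ID) FROM daily_unique_customers;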


You need an index whose leading column is Customer_ID for your query to be fast. If you only have an index that merely contains Customer_ID after other columns, it won't be used as effectively. Here's how you can create it:

CREATE INDEX idx_cid ON table (Customer_ID)

As well you can get your count straight from the database with:

SELECT COUNT(DISTINCT(Customer_ID)) FROM table

If you ever want to narrow it to a range of time then you'd need a composite index:

CREATE INDEX idx_ts_cid ON table (TimeStamp, Customer_ID)

Then the query would be something like this for last month:

SELECT COUNT(DISTINCT(Customer_ID)) FROM table
WHERE TimeStamp BETWEEN "2011-03-01" AND "2011-04-01"
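
Note that the query above assumes a DATETIME/TIMESTAMP column. The schema in the question's update actually stores the time as an unsigned INT (unix_time), so the composite index and the range query would look more like this (same idea, just adapted to the posted column names):

CREATE INDEX idx_time_cid ON mytable (unix_time, Customer_ID);

SELECT COUNT(DISTINCT Customer_ID) FROM mytable
WHERE unix_time >= UNIX_TIMESTAMP('2011-03-01')
  AND unix_time <  UNIX_TIMESTAMP('2011-04-01');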


To add to the others, since you said that you aren't doing any "fancy relational stuff" you might also want to consider using a database solution geared towards massive datasets (and simple tables). MongoDB is one example.

I should add that this would only make sense if the rest of your database schema is also very large and non-relational :)


It seems you either don't have an index on the user ID field, or one user accounts for a large share of the rows, say 40,000 out of a million.
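
If you want to verify either guess, MySQL can show you which indexes exist and whether the optimizer actually uses them, and a quick aggregation reveals skewed users (table name taken from the question's update):

SHOW INDEX FROM mytable;

EXPLAIN SELECT DISTINCT Customer_ID FROM mytable;

-- Check whether a few customers account for most of the rows:
SELECT Customer_ID, COUNT(*) AS rows_logged
FROM mytable
GROUP BY Customer_ID
ORDER BY rows_logged DESC
LIMIT 10;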
