I have a SQL query that can easily return 70,000 rows from a relational database table (with a few joins involved). The total size of this result set is about 20 MB. The table itself has about 10 million rows.
I lack perspective here, so I am wondering whether a query of this size can practically be served quickly, even at a few hundred requests per second, on a web page. Also, this is NOT a read-only table: there are a fairly significant number of updates/deletes (somewhere between a 3:1 and 10:1 read/write ratio, depending on the time of year).
I know I need indexes, etc. What I am wondering is whether a single database server (with, say, 4 GB of RAM and a modern quad-core CPU) could even theoretically serve this without thrashing the hell out of the CPU or disk I/O and getting terrible performance.
You haven't provided much background as to what your queries actually look like, but I will walk you through how to get a ballpark estimate of whether your expectations are realistic, using PostgreSQL as an example.
Preparing a dummy table with 10M rows and 80 bytes of filler data per row:
create table foo as select
generate_series(1,10000000) as foo_id,
repeat('a', 80) as filler;
create unique index foo_foo_id on foo (foo_id);
vacuum analyze foo;
This table is 1400 MB total, including the index, so it fits entirely into my OS cache, but not PostgreSQL's shared buffers.
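If you want to check this on your own machine, the total size including indexes can be queried with:
select pg_size_pretty(pg_total_relation_size('foo'));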
Creating a custom pgbench script to fetch 70,000 rows ordered by an index:
\setrandom key 1 9000000
SELECT * FROM foo WHERE foo_id > :key ORDER BY foo_id LIMIT 70000;
Here are the results from running the benchmark on my 4-core desktop computer (AMD Phenom II X4 955) for 1 minute:
% pgbench -j 4 -c 4 -T 60 -n -f script.pgb
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 60 s
number of transactions actually processed: 3922
tps = 65.309954 (including connections establishing)
tps = 65.316916 (excluding connections establishing)
Note that here the client (pgbench) and server are on the same physical machine. In reality they would be on different machines, so things like network overhead and throughput come into play.
This naive configuration can handle ~65 such queries per second, which is much lower than "a few hundred requests per second", so you would need a considerably more powerful server to handle this kind of workload. Replication with multiple slaves is an option.
To get a more realistic result, you should tweak the pgbench script and the test data to match your workload more closely.
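For example, to account for the write traffic you mention, a second pgbench script with an update against the same dummy table could look like this (a rough sketch; the filler value and key range are arbitrary):
\setrandom key 1 10000000
UPDATE foo SET filler = repeat('b', 80) WHERE foo_id = :key;
Running pgbench with both script files (multiple -f options) mixes reads and writes; how you weight them toward your 3:1 to 10:1 ratio depends on your pgbench version.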
Well, no. But if you can limit your result set (to display it paginated), cache results, and possibly preprocess/transform your data (in effect, creating your own optimized index), it could be possible.
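For example, instead of fetching all 70,000 rows at once, a paginated query (a minimal sketch, assuming a hypothetical table with an indexed id column and a page size of 100) could look like this:
-- :last_seen_id is a bind parameter: the last id the client has already received
SELECT * FROM results WHERE id > :last_seen_id ORDER BY id LIMIT 100;
Each page is then a cheap indexed range scan instead of one 20 MB result.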
EDIT: What I meant by preprocessing is to run, for example, a periodic cronjob that massages your data into a form that can be queried very easily by your consumer, e.g. a temporary or intermediate table (no joins involved). That way you perform the join-heavy queries only every few seconds/minutes. If you depend on accurate real-time queries, this cronjob optimization might not be possible.
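A minimal sketch of what such a cronjob could run (report_data and the joined tables are placeholders for your own schema):
-- rebuild the pre-joined intermediate table; consumers then query it without joins
BEGIN;
TRUNCATE report_data;
INSERT INTO report_data
SELECT o.id, o.created_at, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;
COMMIT;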
To be able to answer all queries without overloading the DB layer, you could possibly cache the reusable results of previous searches in a memory cache (e.g. memcached).
It depends a lot on how selective the indexes are and what you're doing with the data. I'd say 70K rows and 20 MB isn't a show-stopper if you're piping the result set to a file for automatic processing. But it might be a show-stopper if you're trying to load it into a web page.
Anyway, I'd encourage you to give some thought to the real reason someone needs to see 70,000 rows and 20 megabytes on a web page. What are they trying to accomplish with that much data at one time?
With the hardware you describe, you leave out the most important part: storage. Typical databases are bottlenecked by disk first and memory second. Modern CPUs are so fast that they're usually not the problem. If you get serious RAID or SSDs, you can make it do some serious work. And a 10M-row table will be entirely in memory anyway for most of the tools you describe.
The problem you describe, however, will probably get hung up on locking. There are many users reading and writing little facts to a table, and then you read a large portion of that table. There are different ways of handling that, called isolation levels. With the loads you describe, you probably want to stay clear of that contention altogether.
This is a classical problem in a sport called data warehousing, where you want to run large analytical queries against an online system. You want to create a second copy of that table, using log shipping for instance. Most of the databases you tagged can do this. Log shipping creates a buffer between the fast-changing table and the analytical table. When you lock the analytical table, the updates bunch up until you're done, and there are few people reading from this table, so you have it all to yourself. Typically this will cost just a couple of percent of your database's maximum throughput. If you're near that already, you have scaling issues. If you really need to see the latest data, look into real-time BI.
Also, having a second copy of that data frees you to structure it differently, in a way that is very easy to query. The central idea there is the Star Schema.
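A minimal sketch of what that could look like (all table and column names are made up):
-- narrow fact table, one row per sale, keyed to small dimension tables
CREATE TABLE fact_sales (
    date_id     integer       NOT NULL,  -- references dim_date
    product_id  integer       NOT NULL,  -- references dim_product
    quantity    integer       NOT NULL,
    amount      numeric(12,2) NOT NULL
);

-- example dimension table; analytical queries join the fact table to a few of these
CREATE TABLE dim_date (
    date_id  integer PRIMARY KEY,
    day      date    NOT NULL,
    month    integer NOT NULL,
    year     integer NOT NULL
);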
Regards GJ