Processing large amounts of data quickly

I'm working on a web application where the user provides parameters, and these are used to produce a list of the top 1000 items from a database of up to 20 million rows. I need all top 1000 items at once, and I need this ranking to happen more or less instantaneously from the perspective of the user.

Currently, I'm using MySQL with a user-defined function to score and rank the data; PHP takes it from there. Tested on a database of 1M rows, this takes about 8 seconds, but I need it to run in about 2 seconds, even for a database of up to 20M rows. Preferably, this number should be lower still, so that decent throughput is guaranteed for up to 50 simultaneous users.

I am open to any process, with any software, that can handle this data as efficiently as possible, whether it is MySQL or not. Here are the features and constraints of the process:

  • The data relevant to the scoring process is about 50 bytes per row.
  • Inserts and updates to the DB are negligible.
  • Each score is independent of the others, so scores can be computed in parallel.
  • Due to the large number of parameters and parameter values, the scores cannot be pre-computed.
  • The method should scale well for multiple simultaneous users.
  • The fewer computing resources this requires, in terms of number of servers, the better.

Thanks


A feasible approach seems to be to load (and later update) all the data into RAM, about 1 GB at 20M rows × 50 bytes each, and perform the scoring and ranking outside MySQL in a language like C++. That should be faster than MySQL.

The scoring must be relatively simple for this approach, because your requirements leave only about a tenth of a microsecond per row (2 s ÷ 20M rows = 100 ns) for scoring and ranking, absent parallelization or optimization.
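As a minimal sketch of that approach, assuming a flat in-memory array of 50-byte records: scan all rows once, compute each score, and keep the top 1000 in a bounded min-heap. The Record and Params layouts and the score() function here are placeholders for whatever the application actually uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

struct Record { std::uint8_t fields[50]; };  // ~50 bytes of scoring-relevant data per row
struct Params {};                            // user-provided scoring parameters

// Hypothetical scoring function; it has to stay cheap to fit the ~100 ns/row budget.
double score(const Record& r, const Params& p);

// Return the indices of the top k records by score, best first.
std::vector<std::size_t> topK(const std::vector<Record>& data,
                              const Params& p, std::size_t k = 1000) {
    using Entry = std::pair<double, std::size_t>;  // (score, row index)
    // Min-heap of size k: its top is the worst of the current top k, so most
    // rows cost one score() call plus a single comparison.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const double s = score(data[i], p);
        if (heap.size() < k) {
            heap.emplace(s, i);
        } else if (s > heap.top().first) {
            heap.pop();
            heap.emplace(s, i);
        }
    }
    std::vector<std::size_t> result;
    result.reserve(heap.size());
    while (!heap.empty()) {
        result.push_back(heap.top().second);
        heap.pop();
    }
    std::reverse(result.begin(), result.end());  // highest score first
    return result;
}
```

Because the scores are independent, this scan also splits naturally across threads: give each thread a slice of the array and its own heap of size k, then merge the per-thread heaps at the end. That is what relaxes the per-row budget.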


If you could post the query you are having issues with, that would help.

In the meantime, here are a few general things: make sure you have indexes created on the database, use optimized queries, and prefer joins over nested subqueries.


Based on your criteria, the scope for improving performance depends on whether you can use the input parameters to pre-filter the rows for which you need to calculate scores. That is, if one of the user-provided parameters automatically disqualifies a large fraction of the rows, then applying that filter first will improve performance. If none of the parameters have that characteristic, you may need either much more hardware or a higher-performance database.
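For illustration, a cheap parameter-derived disqualification test could run ahead of the full scoring call in an in-memory scan like the one sketched above; disqualified() here is hypothetical and stands in for whatever filter your parameters allow.

```cpp
#include <cstdint>
#include <vector>

struct Record { std::uint8_t fields[50]; };  // placeholder row layout, as above
struct Params {};                            // placeholder user parameters

double score(const Record& r, const Params& p);       // expensive full score
bool disqualified(const Record& r, const Params& p);  // cheap test (hypothetical)

// Only rows that survive the cheap test pay for the full score() call
// (and for the top-1000 bookkeeping).
void scanWithPrefilter(const std::vector<Record>& data, const Params& p) {
    for (const Record& r : data) {
        if (disqualified(r, p)) continue;  // most rows should exit here, cheaply
        const double s = score(r, p);
        // ... feed s into the top-1000 heap as before ...
        (void)s;  // placeholder so the sketch compiles
    }
}
```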


I'd say for this sort of problem, if you've done all the obvious software optimizations (and we can't know that, since you haven't mentioned anything about your software approach), you should try some serious hardware optimization. Max out the memory on your SQL servers and try to fit your tables into memory where possible. Use SSDs for your table and index storage for faster reads. If you're clustered, crank the networking up to the highest feasible speeds.
