I'm soon to be working on a project that poses a problem for me.
It's going to require, at regular intervals throughout the day, processing tens of thousands of records, potentially over a million. Processing is going to involve several (potentially complicated) formulas and the generation of several random factors, writing some new data to a separate table, and updating the original records with some results. This needs to occur for all records, ideally, every three hours. Each new user to the site will be adding between 50 and 500 records that need to be processed in such a fashion, so the number will not be steady.
The code hasn't been written, yet, as I'm still in the design process, mostly because of this issue. I know I'm going to need to use cron jobs, but I'm concerned that processing records of this size may cause the site to freeze up, perform slowly, or just piss off my hosting company every three hours.
I'd like to know if anyone has any experience or tips on similar subjects. I've never worked at this magnitude before, and for all I know, this will be trivial for the server and not pose much of an issue. As long as ALL records are processed before the next three-hour period starts, I don't care if they aren't processed simultaneously (though, ideally, all records belonging to a specific user should be processed in the same batch), so I've been wondering whether I should process in batches every 5 minutes, 15 minutes, hour, or whatever works, and how best to approach this (and make it scalable in a way that is fair to all users).
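For concreteness, here is roughly the kind of cron-driven batch pass I'm picturing. Nothing is built yet, so the `records` table with its `user_id` and `processed_at` columns is just a placeholder schema from my design notes:

<?php
// Rough design sketch only -- schema, table, and column names are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=myapp', 'user', 'pass');

// Find users whose records haven't been processed in the last three hours.
$users = $pdo->query(
    "SELECT DISTINCT user_id FROM records
     WHERE processed_at IS NULL
        OR processed_at < NOW() - INTERVAL 3 HOUR
     LIMIT 200"
)->fetchAll(PDO::FETCH_COLUMN);

foreach ($users as $userId) {
    // Keep one user's records together in a single batch.
    $stmt = $pdo->prepare('SELECT id FROM records WHERE user_id = ?');
    $stmt->execute([$userId]);

    foreach ($stmt as $row) {
        // ...apply the formulas and random factors, write results to the other table...
    }

    $pdo->prepare('UPDATE records SET processed_at = NOW() WHERE user_id = ?')
        ->execute([$userId]);
}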
Below I'll describe how I would approach this problem (though it will cost you money and may not be the solution you want):
- You should use a VPS (a quick listing of some cheap VPS). But I guess you should do some more research to find the best VPS for your needs, if you want to achieve your task without pissing off your hosting company (I am sure you will).
- You should not use cron jobs but instead use a message queue, for example beanstalkd, to queue up your messages (tasks) and do the processing offline. When using a message queue you could also throttle your processing if needed.
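For example, if you stay with PHP for the workers, a minimal sketch of that idea using the Pheanstalk client might look like the following. The tube name and payload are placeholders, and the exact API varies by version (newer releases use Pheanstalk::create(), older ones new Pheanstalk(...)):

<?php
use Pheanstalk\Pheanstalk;

// Producer (in the web app): enqueue one job per user instead of touching everything at once.
$queue = Pheanstalk::create('127.0.0.1');        // assumes the Pheanstalk client
$queue->useTube('process-user');
$queue->put(json_encode(['user_id' => 42]));     // payload format is up to you

// Worker (separate long-running script, e.g. under supervisord): throttles itself.
$worker = Pheanstalk::create('127.0.0.1');
$worker->watch('process-user');

while (true) {
    $job  = $worker->reserve();                  // blocks until a job is available
    $data = json_decode($job->getData(), true);

    // ...run the formulas/updates for this user's records...

    $worker->delete($job);                       // remove the job once it is done
    usleep(100000);                              // crude throttle: 0.1 s between jobs
}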
The rest isn't strictly necessary, but it's how I would tackle it:
- If performance were really a key issue I would have (at least) two VPS instances: one to handle the HTTP requests from the users visiting your site and one to do the offline processing you desire. This way your users/visitors will not notice any of the heavy offline processing you are doing.
- I also would probably not use PHP for the offline processing because of its blocking nature. I would use something like node.js for this kind of processing, because nothing blocks in node.js, which can make it a lot faster for I/O-heavy work.
- I also would probably not store the data in a relational database but instead use the lightning-fast Redis as a datastore. node_redis is a blazingly fast client for node.js.
The problem with running many updates against MySQL tables that are used by a website is that updating data invalidates your query cache. This means it will slow down your site significantly, even after your update is complete.
A solution we have used before is to have two MySQL databases (on different servers too, in our case). Only one of them is actively used by the web server. The other is just a fallback and is used for these kinds of updates. The two servers replicate their data to one another.
The solution:
- Replication is stopped.
- The website is told to use Database1.
- The large updates you mention are run on Database2.
- Many commonly used queries are executed once on Database2 to warm up the query cache.
- The website is told to use Database2.
- Replication is started again. Database2 is now used mainly for reads (by both the website and the replication), so there isn't much delay on the website.
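As a rough sketch only (assuming standard MySQL replication, a small helper that flips which DSN the site uses, and placeholder names throughout), the switchover might be scripted like this:

<?php
// Hypothetical switchover script for the Database1/Database2 swap above.
// Database1 is serving the site; Database2 receives the heavy updates.
$db1 = new PDO('mysql:host=db1.example.com;dbname=app', 'admin', 'secret');
$db2 = new PDO('mysql:host=db2.example.com;dbname=app', 'admin', 'secret');

$db1->exec('STOP SLAVE');            // 1. stop replication so the heavy updates don't replay on Database1
setActiveDatabase('db1');            // 2. site keeps using Database1 (hypothetical config helper)

runLargeUpdates($db2);               // 3. the big batch updates hit Database2 only

foreach (commonQueries() as $sql) {  // 4. warm up Database2's query cache (hypothetical helper)
    $db2->query($sql)->fetchAll();
}

setActiveDatabase('db2');            // 5. flip the site over to Database2
$db1->exec('START SLAVE');           // 6. resume replication; Database1 catches up in the background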
It could be done using many servers, where each server handles X records/hour. The more records you will have in the future, the more servers you will need; otherwise you might end up with a million records waiting to be processed while the previous 2-3 (or even 4) runs still haven't finished...
You might want to consider what kind of database to use. Maybe a relational database isn't the best for this?
The only way to find out is to actually run some benchmarks simulating what you're going to do, though.
In this situation I would consider using Gearman (which also has a PHP extension but can be used with many languages).
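A minimal sketch with the PECL gearman extension could look like this (the job name process_records and the payload are placeholders):

<?php
// Client side (e.g. a small cron script): queue one background job per user.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('process_records', json_encode(['user_id' => 42]));

// Worker side (long-running process): picks up jobs as they arrive.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('process_records', function (GearmanJob $job) {
    $data = json_decode($job->workload(), true);
    // ...run the formulas and updates for this user's records...
});

while ($worker->work());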
Do it all server-side using a stored procedure that selects subsets of data and then processes them internally.
Here's an example that uses a cursor to select ranges of data:
drop procedure if exists batch_update;
delimiter #
create procedure batch_update
(
in p_from_id int unsigned, -- range of data to select for each batch
in p_to_id int unsigned
)
begin
  declare v_id int unsigned;
  declare v_val double(10,4);
  declare v_done tinyint default 0;
  declare v_cur cursor for select id, val from foo where id between p_from_id and p_to_id;
  declare continue handler for not found set v_done = 1;
  start transaction;
  open v_cur;
  repeat
    fetch v_cur into v_id, v_val;
    if not v_done then -- guard against re-processing the last row once the cursor is exhausted
      -- do work...
      if v_val < 0 then
        update foo set...
      else
        insert into foo...
      end if;
    end if;
  until v_done end repeat;
  close v_cur;
  commit;
end #
delimiter ;
call batch_update(1,10000);
call batch_update(10001, 20000);
call batch_update(20001, 30000);
If you can avoid using cursors at all, great, but the main point of my suggestion is about moving the logic from your application tier back into the data tier. I suggest you create a prototype stored procedure in your database and then perform some benchmarks. If the procedure executes in a few seconds then I don't see you having many issues, especially if you're using InnoDB tables with transactions.
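If you do drive it from a cron script, a small sketch of paging through the id range and calling the procedure might look like this (assuming the batch_update procedure above; the chunk size and connection details are placeholders):

<?php
// Hypothetical cron driver: walk the id range in chunks and let the
// batch_update procedure above do the real work inside the database.
$pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$maxId = (int) $pdo->query('SELECT MAX(id) FROM foo')->fetchColumn();
$chunk = 10000;

for ($from = 1; $from <= $maxId; $from += $chunk) {
    $stmt = $pdo->prepare('CALL batch_update(?, ?)');
    $stmt->execute([$from, $from + $chunk - 1]);
    $stmt->closeCursor();  // needed before the next CALL with some MySQL drivers
}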
Here's another example which may prove of interest, although it works on a much larger dataset (50+ million rows):
Optimal MySQL settings for queries that deliver large amounts of data?
Hope this helps :)