Beginner question about Python multiprocessing?

I have a number of records in the database that I want to process. Basically, I want to run several regex substitutions over tokens of the text-string rows and, at the end, write the results back to the database.

I would like to know whether multiprocessing speeds up such tasks. I ran

multiprocessing.cpu_count()

and it returned 8. I have tried something like this:

from multiprocessing import Process

# resultsSize (the total number of rows) and sub_table are defined elsewhere;
# each of the 4 workers gets an equal share, and the last picks up the remainder
division = resultsSize // 4
offset = 0

processes = []
for i in range(4):
    if i == 3:
        limit = resultsSize - (3 * division)
    else:
        limit = division

    # limit and offset select the subset of records the worker fetches from the db
    p = Process(target=sub_table.processR, args=(limit, offset, i))
    p.start()
    processes.append(p)
    offset += division  # advance by one full share (not division + 1, which skips rows)

for p in processes:
    p.join()

but apparently the time taken is higher than the time required to run it in a single process. Why is this so? Can someone please tell me whether this is a suitable case for multiprocessing, or what am I doing wrong here?


Why is this so?

Can someone please explain in which cases multiprocessing gives better performance?

Here's one trick.

Multiprocessing only helps when your bottleneck is a resource that's not shared.

A shared resource (like a database) will be pulled in 8 different directions, which has little real benefit.

To exploit a non-shared resource, you must have independent objects, like a list that's already in memory.

If you want to work from a database, you need to start 8 workers that then do no further database work. So a central query that distributes work to separate processes can sometimes be beneficial, as in the sketch below.
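
For instance, here is a minimal sketch of that pattern, assuming a sqlite3 database with a hypothetical records table (columns id and text) and a placeholder regex; the parent process does all the database work once, and the workers only run the CPU-bound substitutions.

import re
import sqlite3
from multiprocessing import Pool

PATTERN = re.compile(r"\s+")  # placeholder substitution: collapse whitespace

def process_row(row):
    # pure CPU work: no database access inside the worker
    row_id, text = row
    return (PATTERN.sub(" ", text), row_id)

if __name__ == "__main__":
    conn = sqlite3.connect("data.db")
    # one central query: all reads happen here, in the parent process
    rows = conn.execute("SELECT id, text FROM records").fetchall()

    with Pool(processes=8) as pool:
        results = pool.map(process_row, rows, chunksize=100)

    # all writes also happen in the parent, in one batch
    conn.executemany("UPDATE records SET text = ? WHERE id = ?", results)
    conn.commit()
    conn.close()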

Or 8 different files. Note that the file system -- as a whole -- is a shared resource, and some kinds of file access involve sharing something like a disk drive or a directory.

Or a pipeline of 8 smaller steps. The standard Unix pipeline trick

query | process1 | process2 | process3 >file

works better than almost anything else, because each stage in the pipeline is completely independent.
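
The same pipeline idea can be sketched in Python with multiprocessing queues; the three stage functions below are placeholders for the real tokenising and substitution steps. Each stage runs in its own process and touches nothing but its input and output queues.

from multiprocessing import Process, Queue

SENTINEL = None  # marks the end of the stream

def stage(func, inbox, outbox):
    # apply func to every item until the sentinel arrives, then pass it on
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            break
        outbox.put(func(item))

def tokenize(line):       # placeholder stage 1
    return line.split()

def substitute(tokens):   # placeholder stage 2
    return [t.replace("foo", "bar") for t in tokens]

def rejoin(tokens):       # placeholder stage 3
    return " ".join(tokens)

if __name__ == "__main__":
    q1, q2, q3, q4 = Queue(), Queue(), Queue(), Queue()
    stages = [
        Process(target=stage, args=(tokenize, q1, q2)),
        Process(target=stage, args=(substitute, q2, q3)),
        Process(target=stage, args=(rejoin, q3, q4)),
    ]
    for s in stages:
        s.start()

    for line in ["foo one", "foo two"]:  # stands in for the "query" stage
        q1.put(line)
    q1.put(SENTINEL)

    while True:
        result = q4.get()
        if result is SENTINEL:
            break
        print(result)

    for s in stages:
        s.join()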

Here's the other trick.

Your computer system (OS, devices, database, network, etc.) is so complex that simplistic theories won't explain performance at all. You need to (a) take several measurements and (b) try several different algorithms until you understand all the degrees of freedom.
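
As a starting point, a minimal timing harness might look like this; run_serial and run_parallel are placeholders for your two code paths.

import statistics
import time

def measure(label, func, repeats=5):
    # run func several times and report the median wall-clock time
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    print(f"{label}: median {statistics.median(timings):.3f}s over {repeats} runs")

# measure("single process", run_serial)        # placeholder: your current code
# measure("4 worker processes", run_parallel)  # placeholder: the Process version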

A question like "Can someone please explain in which cases multiprocessing gives better performance?" doesn't have a simple answer.

In order to have a simple answer, you'd need a much, much simpler operating system. Fewer devices. No database and no network, for example. Since your OS is complex, there's no simple answer to your question.


Here are a couple of questions:

  1. In your processR function, do you slurp a large number of records from the database at one time, or do you fetch one row at a time? (Each row fetch is very costly, performance-wise; see the first sketch below.)

  2. It may not work for your specific application, but since you are processing "everything", using a database will likely be slower than a flat file. Databases are optimised for logical queries, not sequential processing. In your case, can you export the whole table column to a CSV file, process it, and then re-import the results? (See the second sketch below.)
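
On point 1, here is a minimal sketch of batched fetching, assuming a sqlite3 database with a hypothetical records table: each fetchmany call pulls thousands of rows at once, so the per-row overhead is amortised.

import sqlite3

conn = sqlite3.connect("data.db")
cursor = conn.execute("SELECT id, text FROM records")

while True:
    batch = cursor.fetchmany(10_000)  # thousands of rows per call, not one
    if not batch:
        break
    for row_id, text in batch:
        pass  # run the regex substitutions on text here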
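
On point 2, here is a sketch of the flat-file round trip; the two-column CSV layout and the regex are assumptions.

import csv
import re

PATTERN = re.compile(r"\s+")  # placeholder substitution

with open("export.csv", newline="") as src, \
     open("processed.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row_id, text in reader:  # assumes two columns: id, text
        writer.writerow([row_id, PATTERN.sub(" ", text)])

# re-import processed.csv with your database's bulk loader
# (for example sqlite3's .import or PostgreSQL's COPY)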

Hope this helps.


In general, multi-CPU or multi-core processing helps most when your problem is CPU-bound (i.e., it spends most of its time with the CPU running as fast as it can).

From your description, you have an IO-bound problem: it takes forever to get data from disk to the CPU (which sits idle), and then the CPU operation is very fast (because it is so simple).

Thus, accelerating the CPU operation does not make a very big difference overall.
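
One way to confirm this is to time the fetch and the regex work separately, assuming a sqlite3 database with a placeholder records table:

import re
import sqlite3
import time

PATTERN = re.compile(r"\s+")  # placeholder substitution

conn = sqlite3.connect("data.db")

start = time.perf_counter()
rows = conn.execute("SELECT text FROM records").fetchall()
fetch_time = time.perf_counter() - start

start = time.perf_counter()
processed = [PATTERN.sub(" ", text) for (text,) in rows]
cpu_time = time.perf_counter() - start

print(f"fetch: {fetch_time:.3f}s  regex: {cpu_time:.3f}s")
# if fetch_time dominates, the job is IO-bound and extra processes won't help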
