开发者

Automatic parallelization

开发者 https://www.devze.com 2023-01-08 13:05 出处:网络
What is your opinion regarding a project that will try to take a code and split it to threads automatically(maybe compile time, probably in runtime).

What is your opinion regarding a project that will try to take a code and split it to threads automatically(maybe compile time, probably in runtime).

Take a look at the code below:

for(int i=0;i<100;i++)
   sum1 += rand(100)
for(int j=0;j<100;j++)
   sum2 += rand(100)/2

This 开发者_开发技巧kind of code can automatically get split to 2 different threads that run in parallel. Do you think it's even possible? I have a feeling that theoretically it's impossible (it reminds me the halting problem) but I can't justify this thought.

Do you think it's a useful project? is there anything like it?


This is called automatic parallelization. If you're looking for some program you can use that does this for you, it doesn't exist yet. But it may eventually. This is a hard problem and is an area of active research. If you're still curious...

It's possible to automatically split your example into multiple threads, but not in the way you're thinking. Some current techniques try to run each iteration of a for-loop in its own thread. One thread would get the even indicies (i=0, i=2, ...), the other would get the odd indices (i=1, i=3, ...). Once that for-loop is done, the next one could be started. Other techniques might get crazier, executing the i++ increment in one thread and the rand() on a separate thread.

As others have pointed out, there is a true dependency between iterations because rand() has internal state. That doesn't stand in the way of parallelization by itself. The compiler can recognize the memory dependency, and the modified state of rand() can be forwarded from one thread to the other. But it probably does limit you to only a few parallel threads. Without dependencies, you could run this on as many cores as you had available.

If you're truly interested in this topic and don't mind sifting through research papers:

  1. Automatic thread extraction with decoupled software pipelining (2005) by G. Ottoni.
  2. Speculative parallelization using software multi-threaded transactions (2010) by A. Raman.


This is practically not possible.

The problem is that you need to know, in advance, a lot more information than is readily available to the compiler, or even the runtime, in order to parallelize effectively.

While it would be possible to parallelize very simple loops, even then, there's a risk involved. For example, your above code could only be parallelized if rand() is thread-safe - and many random number generation routines are not. (Java's Math.random() is synchronized for you - however.)

Trying to do this type of automatic parallelization is, at least at this point, not practical for any "real" application.


It's certainly possible, but it is an incredibly hard task. This has been the central thrust of compiler research for several decades. The basic issue is that we cannot make a tool that can find the best partition into threads for java code (this is equivalent to the halting problem).

Instead we need to relax our goal from the best partition into some partition of the code. This is still very hard in general. So then we need to find ways to simplify the problem, one is to forget about general code and start looking at specific types of program. If you have simple control-flow (constant bounded for-loops, limited branching....) then you can make much more head-way.

Another simplification is reducing the number of parallel units that you are trying to keep busy. If you put both of these simplifications together then you get the state of the art in automatic vectorisation (a specific type of parallelisation that is used to generate MMX / SSE style code). Getting to that stage has taken decades but if you look at compilers like Intel's then performance is starting to get pretty good.

If you move from vector instructions inside a single thread to multiple threads within a process then you have a huge increase in latency moving data between the different points in the code. This means that your parallelisation has to be a lot better in order to win against the communication overhead. Currently this is a very hot topic in research, but there are no automatic user-targetted tools available. If you can write one that works it would be very interesting to many people.

For your specific example, if you assume that rand() is a parallel version so you can call it independently from different threads then it's quite easy to see that the code can be split into two. A compiler would convert just need dependency analysis to see that neither loop uses data from or affects the other. So the order between them in the user-level code is a false dependency that could split (i.e by putting each in a separate thread).

But this isn't really how you would want to parallelise the code. It looks as if each loop iteration is dependent on the previous as sum1 += rand(100) is the same as sum1 = sum1 + rand(100) where the sum1 on the right-hand-side is the value from the previous iteration. However the only operation involved is addition, which is associative so we rewrite the sum many different ways.

sum1 = (((rand_0 + rand_1) + rand_2) + rand_3) ....
sum1 = (rand_0 + rand_1) + (rand_2 + rand_3) ...

The advantage of the second is that each single addition in brackets can be computed in parallel to all of the others. Once you have 50 results then they can be combined into a further 25 additions and so on... You do more work this way 50+25+13+7+4+2+1 = 102 additions versus 100 in the original but there are only 7 sequential steps so apart from the parallel forking/joining and communication overhead it runs 14 times quicker. This tree of additions is called a gather operation in parallel architectures and it tends to be the expensive part of a computation.

On a very parallel architecture such as a GPU the above description would be the best way to parallelise the code. If you're using threads within a process it would get killed by the overhead.

In summary: it is impossible to do perfectly, it is very hard to do well, there is lots of active research in finding out how much we can do.


Whether it's possible in the general case to know whether a piece of code can be parallelized does not really matter, because even if your algorithm cannot detect all cases that can be parallelized, maybe it can detect some of them.

That does not mean it would be useful. Consider the following:

  1. First of all, to do it at compile-time, you have to inspect all code paths you can potentially reach inside the construct you want to parallelize. This may be tricky for anything but simply computations.
  2. Second, you have to somehow decide what is parallelizable and what is not. You cannot trivially break up a loop that modifies the same state into several threads, for example. This is probably a very difficult task and in many cases you will end up with not being sure - two variables might in fact reference the same object.
  3. Even if you could achieve this, it would end up confusing for the user. It would be very difficult to explain why his code was not parallelizable and how it should be changed.

I think that if you want to achieve this in Java, you need to write it more as a library, and let the user decide what to parallelize (library functions together with annotations? just thinking aloud). Functional languages are much more suited for this.

As a piece of trivia: during a parallel programming course, we had to inspect code and decide whether it was parallelizable or not. I cannot remember the specifics (something about the "at-most-once" property? Someone fill me in?), but the moral of the story is that it was extremely difficult even for what appeared to be trivial cases.


There are some projects that try to simplify parallelization - such as Cilk. It doesn't always work that well, however.


I've learnt that as of JDK 1.8(Java 8), you can utilize/leverage multiple cores of your CPU in case of streams usage by using parallelStream().

However, it has been studied that before finalizing to go to production with parallelStream() it is always better to compare sequential() with parallel, by benchmarking the performance, and then decide which would be ideal.

Why?/Reason is: there could be scenarios where the parallel stream will perform dramatically worse than sequential, when the operation needs to do auto un/boxing. For those scenarios its advisable to use the Java 8 Primitive Streams such as IntStream, LongStream, DoubleStream.

Reference: Modern Java in Action: Manning Publications 2019


The Programming language is Java and Java is a virtual machine. So shouldn't one be able to execute the code at runtime on different Threads owned by the VM. Since all the Memory etc. is handled like that It whould not cause any corruption . You could see the Code as a Stack of instructions estimating execution Time and then distribute it on an Array of Threads which are each have an execution stack of roughtly the same time. It might be dangerous though some graphics like OpenGL immediate mode needs to maintain order and mostly should not be threaded at all.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号