I have to solve a huge linear system for multiple right-hand sides (say 20 to 200). The matrix is stored in a sparse format and distributed over multiple MPI nodes (say 16 to 64). I run a CG solver on the rank 0 node. It's not possible to solve the system directly, because the system matrix would be dense (Sys = A^T * S * A).
The basic matrix-vector multiplication is implemented as:
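For reference, CG never needs Sys itself, only its action on a vector, so the product can be evaluated right to left using sparse operations only. If one additionally assumes (this is not stated in the question) that A is distributed by rows as blocks A_p over P ranks and that S is block diagonal with matching blocks S_p, the same product splits into one term per rank:

```latex
\mathrm{Sys}\,x = A^{T}\bigl(S\,(A\,x)\bigr),
\qquad
\mathrm{Sys}\,x = \sum_{p=1}^{P} A_p^{T} S_p A_p\,x
\quad\text{(under the row-block assumption).}
```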
broadcast x
y = A_part * x
reduce y
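The local step `y = A_part * x` for a whole block of right-hand sides could look roughly like the sketch below. It assumes a CSR layout for the local part and row-major storage of the right-hand-side block, so the k values belonging to one row are contiguous. The type `CsrPart` and the function `csr_times_block` are hypothetical names for illustration, not taken from the question, and the surrounding broadcast and reduce are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the locally stored part of the matrix (CSR). */
typedef struct {
    size_t nrows;           /* number of local rows                    */
    const size_t *row_ptr;  /* nrows + 1 offsets into col_idx and val  */
    const size_t *col_idx;  /* column index of each stored entry       */
    const double *val;      /* value of each stored entry              */
} CsrPart;

/* Y = A_part * X for k right-hand sides at once.
 * X is (ncols x k) and Y is (nrows x k), both row-major, so the k values
 * of one row sit next to each other and the inner loop is unit-stride. */
void csr_times_block(const CsrPart *a, const double *x, double *y, size_t k)
{
    assert(a != NULL && x != NULL && y != NULL);

    for (size_t i = 0; i < a->nrows; ++i) {
        double *yi = y + i * k;

        /* Clear the output row before accumulating into it. */
        for (size_t r = 0; r < k; ++r) {
            yi[r] = 0.0;
        }

        /* Each stored entry A(i, j) is loaded once and reused for all k columns. */
        for (size_t p = a->row_ptr[i]; p < a->row_ptr[i + 1]; ++p) {
            const double v = a->val[p];
            const double *xj = x + a->col_idx[p] * k;
            for (size_t r = 0; r < k; ++r) {
                yi[r] += v * xj[r];
            }
        }
    }
}
```

Processing all k columns inside the loop over stored entries means each matrix value is read from memory once per block rather than once per right-hand side, which is the usual motivation for batching.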
While the collective operations are reasonably fast (OpenMPI seems to use a binary-tree-like communication pattern, and we run over InfiniBand), they still account for quite a large part of the runtime. For performance reasons we already process 8 right-hand sides per iteration (basically a sparse matrix times a dense matrix, just to be complete).
I'm trying to come up with a good scheme to hide the communication latency, but I haven't had a good idea yet. I also try to avoid 1:n communication, although I have not yet measured whether scaling would be a problem.
Any suggestions are welcome!
If your matrix is already distributed, would it be possible to use a distributed sparse linear solver instead of running the solver only on rank 0 and then broadcasting the result (if I'm reading your description correctly)? There are plenty of libraries for that, e.g. SuperLU_DIST, MUMPS, PARDISO, Aztec(OO), etc.
The "multiple rhs" optimization is supported by at least SuperLU and MUMPS (haven't checked the others, but I'd be VERY surprised if they didn't support it!), since they solve AX=B where X and B are matrices with potentially > 1 column. That is, each "rhs" is stored as a column vector in B.
If you don't need the results for one right-hand side before starting on the next, you could try non-blocking communication (MPI_Isend, MPI_Irecv) and communicate the result while already calculating the next right-hand side.
But make sure you call MPI_Wait before reading the content of the communicated array, in order to be sure you're not reading "old" data.
If the matrices are big enough (i.e. the matrix product takes long enough to calculate), the communication delay can be hidden completely with this approach.
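Applied to the broadcast/multiply/reduce pattern from the question, one overlapped step might be scheduled as in the pseudocode below. This is only a sketch under two assumptions: the right-hand sides are split into two independent batches (called A and B here, hypothetical names), and the MPI library provides the MPI-3 non-blocking collectives, so that `MPI_Ireduce` plus `MPI_Wait` can stand in for the blocking reduce.

```
start non-blocking reduce of y_A      (MPI_Ireduce)
y_B = A_part * x_B                    (local work, overlaps the reduce of y_A)
wait for the reduce of y_A            (MPI_Wait)
rank 0: use y_A in the CG update for batch A
```

Because each CG iteration depends on the previous product for the same right-hand side, the overlap has to come from a different batch, and y_A and y_B need separate buffers so the local product never writes into an array that is still being communicated.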