Using MPI_Type_Vector and MPI_Gather, in C

I am trying to multiply square matrices in parallel with MPI.

I use an MPI_Type_vector to send square submatrices (arrays of float) to the processes, so they can calculate subproducts. Then, for the next iterations, these submatrices are sent to neighbouring processes as MPI_Type_contiguous (the whole submatrix is sent). This part works as expected, and the local results are correct.

Then, I use MPI_Gather with the contiguous types to send all local results back to the root process. The problem is that the final matrix is built (obviously, with this method) line by line instead of submatrix by submatrix.

I wrote an ugly procedure to rearrange the final matrix, but I would like to know if there is a direct way of performing the "inverse" operation of sending MPI_Type_vectors (i.e., sending an array of values and having it arranged directly in subarray form in the receiving array).

An example, to try and clarify my long text:

A[16] and B[16] (these really are 2D arrays, A[4][4] and B[4][4]) are 4x4 matrices to be multiplied; C[4][4] will contain the result; 4 processes are used (Pi, with i from 0 to 3):

Pi gets two 2x2 submatrices: subAi[4] and subBi[4]; their product is stored locally in subCi[4].

For instance, P0 gets:

subA0[4] containing A[0], A[1], A[4] and A[5];

subB0[4] containing B[0], B[1], B[4] and B[5].

After everything is calculated, the root process gathers all subCi[4].

Then C[4][4] contains:

[

subC0[0], subC0[1], subC0[2], subC0[3],

subC1[0], subC1[1], subC1[2], subC1[3],

subC2[0], subC2[1], subC2[2], subC2[3],

subC3[0], subC3[1], subC3[2], subC3[3]]

and I would like it to be :

[

subC0[0], subC0[1], subC1[0], subC1[1],

subC0[2], subC0[3], subC1[2], subC1[3],

subC2[0], subC2[1], subC3[0], subC3[1],

subC2[2], subC2[3], subC3[2], subC3[3]]

without any further operation. Does anyone know a way?

Thanks for your advice.

Adding information in answer to 'High Performance Mark':

1. Well, my initial matrices are 2D arrays (in the shape of A[4][4]). I wanted to keep my question short; I see now that it was a bad idea...

I did define the MPI_Type_vector as follows, for the example:

MPI_Type_vector(2, 2, 4, MPI_FLOAT, &subMatrix);  /* 2 blocks of 2 floats, stride 4: picks elements 0, 1, 4 and 5 */

(By the way, I can't see any difference for a flattened array).

2. I am not an expert in MPI, far from it, so I might be doing strange things. Here is a bit of my code, applied to the example (only A is dealt with; B is quite similar):

Sending submatrices from root to slave processes:

Master {
    for (i = 0 ; i < 2 ; i++)
        for (j = 0 ; j < 2 ; j++)
            MPI_Send(&A[j * 2][(i + j) % 2 * 2], 1, subMatrix, i + j * 2, 42, MPI_COMM_WORLD);
}

Slaves receive:

MPI_Recv(subA, 4, MPI_FLOAT, 0, 42, MPI_COMM_WORLD, &status);

Then, exchanges between processes are done via MPI_Send and MPI_Recv of subMatrixLocal, which is:

MPI_Type_contiguous(4, MPI_FLOAT, &subMatrixLocal);

After all local operations are done, I gather all subC matrices into C:

MPI_Gather(subC, 1, subMatrixLocal, C, 1, subMatrixLocal, 0, MPI_COMM_WORLD);

and I obtain the previously stated result, which I have to reorder...

And about your proposed algorithm: the next step will be to do the matrix multiplications on GPUs, where square matrix products are efficient. MPI will just be used to transfer matrices between CPUs. Of course, overall efficiency will be tested then.

You said that "the same type definition should be applicable for the reverse operation". However, my MPI_Type_vector works fine on the "large" matrix, but using it directly on a sub-matrix isn't possible (applying MPI_Type_vector(2, 2, 4) to a 2x2 matrix gives wrong results, as it takes the last two values from "outside" the defined array...). Do you mean I should create another MPI_Type_vector and send/receive with it?


The answer to your question (is there a direct way of performing the 'inverse' operation of sending MPI_Type_vectors?) is yes. If you have already defined a type vector to send a submatrix from one process to another, the same type definition should be applicable for the reverse operation.
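
For the point-to-point case, that can be illustrated with a minimal sketch (my own, not code from the question, assuming the 2x2-block-in-a-4x4-matrix layout described above): the worker sends its result as four contiguous floats, and the root receives them with the same vector type it used for sending, so they land directly inside the 4x4 matrix.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* run with at least 2 ranks */

    /* a 2x2 block of a 4x4 float matrix: 2 blocks of 2 floats, stride 4 */
    MPI_Datatype subMatrix;
    MPI_Type_vector(2, 2, 4, MPI_FLOAT, &subMatrix);
    MPI_Type_commit(&subMatrix);

    if (rank == 1) {
        float subC[4] = {1, 2, 3, 4};        /* contiguous local result */
        MPI_Send(subC, 4, MPI_FLOAT, 0, 42, MPI_COMM_WORLD);
    } else if (rank == 0) {
        float C[4][4] = {{0}};
        /* receive as one sub-block anchored at C[0][0]: the four incoming
           floats are placed in C[0][0], C[0][1], C[1][0] and C[1][1] */
        MPI_Recv(&C[0][0], 1, subMatrix, 1, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++)
                printf("%3.0f ", C[i][j]);
            printf("\n");
        }
    }

    MPI_Type_free(&subMatrix);
    MPI_Finalize();
    return 0;
}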

However, I am somewhat confused by your explanation and have some further questions for you. If you answer them I may be able to provide better advice.

  1. You write your matrices as A[16], B[16] and say that they are 4x4. Have you already flattened them? I expected that they would be A[4][4] etc. If you have flattened the matrices, why have you done so? You can certainly define an mpi_type_vector to define a sub-matrix of a 2D matrix.
  2. It seems a little odd to me, not necessarily wrong but odd, to be matching sends with gathers. I would normally expect to see gathers matched by scatters and sends by receives. Perhaps you could post enough of your code to clarify what operations you are using.

Finally, I think that multiplying matrices by multiplying sub-matrices is probably not an efficient approach with MPI. If you are doing this as an exercise, then carry on. But a better algorithm, and probably one that is easier to implement, would be (a rough sketch follows the list):

  1. mpi_broadcast matrix B to all processes;
  2. director process sends rows of A to worker processes, in a loop;
  3. worker process computes a row of C and sends it back to director process;
  4. director process receives rows of C and puts them in the right location.
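
To make the list above concrete, here is a rough sketch of that row-based scheme (my own code, not part of the original answer; it assumes at least 2 processes, uses an identity B so the result is easy to check, assigns rows to workers statically, and uses the row index as the message tag):

#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv)
{
    float A[N][N], B[N][N], C[N][N];
    int rank, nproc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);    /* assumed >= 2 */

    if (rank == 0)                            /* director fills A and B */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = (float)(i * N + j);
                B[i][j] = (i == j) ? 1.0f : 0.0f;   /* identity, so C == A */
            }

    /* 1. broadcast B to all processes */
    MPI_Bcast(B, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        /* 2. send each row of A to a worker (row index used as tag);
           non-blocking sends so the director can post its receives */
        MPI_Request reqs[N];
        for (int i = 0; i < N; i++)
            MPI_Isend(A[i], N, MPI_FLOAT, 1 + i % (nproc - 1), i,
                      MPI_COMM_WORLD, &reqs[i]);

        /* 4. receive rows of C and place them according to their tag */
        for (int i = 0; i < N; i++) {
            float row[N];
            MPI_Status st;
            MPI_Recv(row, N, MPI_FLOAT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            for (int j = 0; j < N; j++)
                C[st.MPI_TAG][j] = row[j];
        }
        MPI_Waitall(N, reqs, MPI_STATUSES_IGNORE);

        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++)
                printf("%5.1f ", C[i][j]);
            printf("\n");
        }
    } else {
        /* 3. worker: receive a row of A, compute a row of C, send it back */
        for (int i = rank - 1; i < N; i += nproc - 1) {
            float rowA[N], rowC[N];
            MPI_Status st;
            MPI_Recv(rowA, N, MPI_FLOAT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            for (int j = 0; j < N; j++) {
                rowC[j] = 0.0f;
                for (int k = 0; k < N; k++)
                    rowC[j] += rowA[k] * B[k][j];
            }
            MPI_Send(rowC, N, MPI_FLOAT, 0, st.MPI_TAG, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}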


I know this question was asked a very long time ago, but I think it has not been optimally answered yet, and I happened to stumble upon the same problem recently.

There are two things that you need to do. First, use two MPI_Datatypes, one for sending and one for receiving. The type for sending (stype in my code example) has the same stride as the number of local elements in a row (nloc in my code), which means that you can construct it using MPI_Type_contiguous if you like. The receiving process in the gather, however, has to place it in an array with rows of length n=nloc*nproc, so you have to create it using MPI_Type_vector.

Second comes the crucial part, which took me some time to figure out (I eventually got the answer from Gilles Gouaillardet on the OpenMPI mailing list: https://www.mail-archive.com/users@lists.open-mpi.org//msg34678.html).

In order to place the incoming matrices at the right offset, you have to set the receiving data type's "extent" to nloc elements (nloc*sizeof(int) bytes in my code), because that is the offset of the first element of the next block. To do this you can use MPI_Type_create_resized to derive a new data type from rtype (rtype_resized in my code).

MWE:

#include <mpi.h>
#include <iostream>
#include <sstream>
#include <string>

void print(std::string label, int rank, int nloc, int m, int* array)
{
  std::ostringstream oss;
  oss << label << " on P"<<rank<<": "<< m << "x" << nloc << std::endl;

  for (int i=0; i<m; i++)
  {
    for (int j=0; j<nloc; j++)
    {
      oss << array[i*nloc+j] << " ";
    }
    oss << std::endl;
  }
  std::cout << oss.str()<<std::flush<<std::endl;
}

int main(int argc, char** argv)
{

   MPI_Init(&argc,&argv);
   int rank, nproc;
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
   MPI_Comm_size(MPI_COMM_WORLD,&nproc);

   int nloc=2;
   int n=nloc*nproc;
   int m=2;

   int *Cj = new int[nloc*m+1000];
   int *Cglob = new int[n*m+1000];

   for (int j=0; j<m; j++)
     for (int i=0; i<nloc; i++)
       Cj[j*nloc+i]=j*n + rank*nloc + i;

   for (int r=0; r<nproc; r++)
   {
      if (rank==r) print("Cj", rank, nloc, m, Cj);
      std::cout << std::flush;
      MPI_Barrier(MPI_COMM_WORLD);
   }

   MPI_Datatype stype, rtype, rtype_resized;

   // this data type represents the local nloc x m matrix,
   // which is column-major and has stride nloc.
   MPI_Type_vector(m,nloc,nloc,MPI_INT,&stype);
   MPI_Type_commit(&stype);

   // this represents a block of size nloc x m within a col-major
   // matrix of size n x m, hence the stride is n.
   MPI_Type_vector(m,nloc,n,MPI_INT,&rtype);
   MPI_Type_commit(&rtype);

   // we need to manually define the extent of the receive type in order to
   // get the displacements in the MPI_Gather right:
   MPI_Type_create_resized(rtype, 0, nloc*sizeof(int), &rtype_resized);
   MPI_Type_commit(&rtype_resized);

   // these two result in the same layout (the Allgather variant gives every
   // rank a copy of Cglob):
   //MPI_Allgather(Cj,nloc*m,MPI_INT,Cglob,1,rtype_resized,MPI_COMM_WORLD);
   MPI_Gather(Cj,1,stype,Cglob,1,rtype_resized,0,MPI_COMM_WORLD);

   if (rank==0)
     print("Cglob", rank, n, m, Cglob);

   MPI_Type_free(&stype);
   MPI_Type_free(&rtype);
   MPI_Type_free(&rtype_resized);

   delete [] Cj;
   delete [] Cglob;

   MPI_Finalize();
}

Output:


> mpicxx -o matrix_gather matrix_gather.cpp
> mpirun -np 2 ./matrix_gather

Cj on P0: 2x2
0 1 
4 5 

Cglob on P0: 2x4
0 1 2 3 
4 5 6 7 

Cj on P1: 2x2
2 3 
6 7 
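
Applied back to the 4x4 matrix with 2x2 blocks from the question, the same idea works, but the block offsets within C (elements 0, 2, 8 and 10) are not evenly spaced, so MPI_Gatherv with explicit displacements (counted in units of the resized extent) is needed instead of a plain MPI_Gather. Here is a minimal sketch in C, my own adaptation rather than code from the answer above, assuming exactly 4 processes and a row-major C:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    if (nproc != 4)                           /* run with exactly 4 ranks */
        MPI_Abort(MPI_COMM_WORLD, 1);

    float subC[4];       /* local 2x2 block, row-major */
    float C[16];         /* full 4x4 result, significant on rank 0 only */

    /* fill subC with the global index of each of its elements */
    int brow = rank / 2, bcol = rank % 2;     /* block coordinates */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            subC[i * 2 + j] = (float)((brow * 2 + i) * 4 + (bcol * 2 + j));

    /* receive type: a 2x2 block inside a 4x4 row-major matrix (stride 4),
       resized so that its extent is 2 floats */
    MPI_Datatype block, block_resized;
    MPI_Type_vector(2, 2, 4, MPI_FLOAT, &block);
    MPI_Type_create_resized(block, 0, 2 * sizeof(float), &block_resized);
    MPI_Type_commit(&block_resized);

    /* block starts in C, in elements: 0, 2, 8, 10
       -> in units of the resized extent (2 floats): 0, 1, 4, 5 */
    int counts[4] = {1, 1, 1, 1};
    int displs[4] = {0, 1, 4, 5};

    MPI_Gatherv(subC, 4, MPI_FLOAT,
                C, counts, displs, block_resized, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++)
                printf("%4.0f ", C[i * 4 + j]);
            printf("\n");
        }

    MPI_Type_free(&block);
    MPI_Type_free(&block_resized);
    MPI_Finalize();
    return 0;
}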
