I receive the following sequence of errors when I try to run a problem on four processors (the MPI command I use is mpirun -np 4).
I apologize for posting the error message as is (primarily due to a lack of knowledge in deciphering the information given). I would appreciate your input on the following:
What does the error message mean? At what point does one receive it? Is it caused by system memory (a hardware issue) or by a communication error (something related to MPI_Isend/MPI_Irecv, i.e. a software issue)?
Finally, how do I fix this?
Thanks!
The ERROR message received follows below. *PLEASE NOTE: This error is received only when the run time is large.* The code computes fine when the time required to compute the data is small (e.g., 300 time steps compared to 1000 time steps).
aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(143): MPI_Irecv(buf=0x8294a60, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd68c) failed
MPID_Irecv(64): Out of memory
aborting job:
Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(143): MPI_Irecv(buf=0x8295080, count=48, MPI_DOUBLE, src=3, tag=-1, MPI_COMM_WORLD, request=0xffffd690) failed
MPID_Irecv(64): Out of memory
aborting job: Fatal error in MPI_Isend: Internal MPI error!, error stack:
MPI_Isend(142): MPI_Isend(buf=0x8295208, count=48, MPI_DOUBLE, dest=3, tag=0, MPI_COMM_WORLD, request=0xffffd678) failed
(unknown)(): Internal MPI error!
aborting job: Fatal error in MPI_Irecv: Other MPI error, error stack:
MPI_Irecv(143): MPI_Irecv(buf=0x82959b0, count=48, MPI_DOUBLE, src=2, tag=-1, MPI_COMM_WORLD, request=0xffffd678) failed
MPID_Irecv(64): Out of memory
rank 3 in job 1 myocyte80_37021 caused collective abort of all ranks exit status of rank 3: return code 13
rank 1 in job 1 myocyte80_37021 caused collective abort of all ranks exit status of rank 1: return code 13
EDIT: (SOURCE CODE)
Header files
Variable declaration
TOTAL_TIME =
...
...
double *A = new double[Rows];
double *AA = new double[Rows];
double *B = new double[Rows];
double *BB = new double[Rows];
....
....
int Rmpi;
int my_rank;
int p;
int source;
int dest;
int tag = 0;
function declaration
int main (int argc, char *argv[])
{
MPI_Status status[8];
MPI_Request request[8];
MPI_Init (&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
//PROBLEM SPECIFIC PROPERTIES. VARY BASED ON NODE
if (Flag == 1)
{
if (my_rank == 0)
{
Defining boundary (start/stop) for special elements in tissue (Rows x Column)
}
if (my_rank == 2)
..
if (my_rank == 3)
..
if (my_rank == 4)
..
}
//INITIAL CONDITIONS ALSO VARY BASED ON NODE
for (Columns = 0; Columns < 48; Columns++) // Normal Direction
{
for (Rows = 0; Rows < 48; Rows++) //Transverse Direction
{
if (Flag == 1)
{
if (my_rank == 0)
{
Initial conditions for elements
}
if (my_rank == 1) //MPI
{
}
..
..
..
//SIMULATION START
while (t[0][0] < TOTAL_TIME)
{
for (Columns = 0; Columns < 48; Columns++) //Normal Direction
{
for (Rows = 0; Rows < 48; Rows++) //Transverse Direction
{
//SOME MORE PROPERTIES BASED ON NODE
if (my_rank == 0)
{
if (Flag == 1)
{
Condition 1
}
else
{
Condition 2
}
}
if (my_rank == 1)
....
....
...
//Evaluate functions (differential equations)
Function 1 ();
Function 2 ();
...
...
//Based on the output of the differential equations, different nodes estimate variable values.
//Since the problem is nearest-neighbor, corners and edges have different neighbors/boundary
//conditions.
if (my_rank == 0)
{
if (Row/Column at bottom_left)
{
Variables =
}
if (Row/Column at Bottom Right)
{
Variables =
}
}
...
...
//Keeping track of time for each element in Row and Column. Time is updated for a certain
//element.
t[Column][Row] = t[Column][Row]+dt;
}
}//END OF ROWS AND COLUMNS
// MPI IMPLEMENTATION. AT END OF EVERY TIME STEP, Nodes communicate with nearest neighbor
//First step is to populate arrays with values estimated above
for (Columns = 0; Columns < 48; Columns++)
{
for (Rows = 0; Rows < 48; Rows++)
{
if (my_rank == 0)
{
//Loading the edges of the (Row x Column) domain into variables. This one-dimensional array
//data is shared with its nearest neighbor for computation at the next time step.
if (Column == 47)
{
A[i] = V[Column][Row];
…
}
if (Row == 47)
{
B[i] = V[Column][Row];
}
}
...
...
//NON-BLOCKING MPI SEND/RECV TO SHARE DATA WITH NEAREST NEIGHBOR
if ((my_rank) == 0)
{
MPI_Isend(A, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[1]);
MPI_Irecv(AA, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[3]);
MPI_Wait(&request[3], &status[3]);
MPI_Isend(B, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[5]);
MPI_Irecv(BB, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[7]);
MPI_Wait(&request[7], &status[7]);
}
if ((my_rank) == 1)
{
MPI_Irecv(CC, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[1]);
MPI_Wait(&request[1], &status[1]);
MPI_Isend(Cmpi, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[3]);
MPI_Isend(D, Rows, MPI_DOUBLE, my_rank+2, 0, MPI_COMM_WORLD, &request[6]);
MPI_Irecv(DD, Rows, MPI_DOUBLE, my_rank+2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[8]);
MPI_Wait(&request[8], &status[8]);
}
if ((my_rank) == 2)
{
MPI_Isend(E, Rows, MPI_DOUBLE, my_rank+1, 0, MPI_COMM_WORLD, &request[2]);
MPI_Irecv(EE, Rows, MPI_DOUBLE, my_rank+1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[4]);
MPI_Wait(&request[4], &status[4]);
MPI_Irecv(FF, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[5]);
MPI_Wait(&request[5], &status[5]);
MPI_Isend(Fmpi, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[7]);
}
if ((my_rank) == 3)
{
MPI_Irecv(GG, Rows, MPI_DOUBLE, my_rank-1, MPI_ANY_TAG, MPI_COMM_WORLD, &request[2]);
MPI_Wait(&request[2], &status[2]);
MPI_Isend(G, Rows, MPI_DOUBLE, my_rank-1, 0, MPI_COMM_WORLD, &request[4]);
MPI_Irecv(HH, Rows, MPI_DOUBLE, my_rank-2, MPI_ANY_TAG, MPI_COMM_WORLD, &request[6]);
MPI_Wait(&request[6], &status[6]);
MPI_Isend(H, Rows, MPI_DOUBLE, my_rank-2, 0, MPI_COMM_WORLD, &request[8]);
}
//RELOADING Data (from MPI_IRecv array to array used to compute at next time step)
for (Columns = 0; Columns < 48; Columns++)
{
for (Rows = 0; Rows < 48; Rows++)
{
if (my_rank == 0)
{
if (Column == 47)
{
V[Column][Row]= A[i];
}
if (Row == 47)
{
V[Column][Row]=B[i];
}
}
….
//PRINT TO OUTPUT FILE AT CERTAIN POINT
printval = 100;
if ((printdata>=printval))
{
prttofile ();
printdata = 0;
}
printdata = printdata+1;
compute_dt ();
}//CLOSE ALL TIME STEPS
MPI_Finalize ();
}//CLOSE MAIN
Are you repeatedly calling MPI_Irecv? If so, you may not realize that each call allocates a request handle - these are freed only when the message is received and the request is completed with (e.g.) MPI_Test or MPI_Wait. It's possible you could exhaust memory with over-use of MPI_Irecv - or at least the memory an MPI implementation sets aside for this purpose.
Only seeing the code would confirm the problem.
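To illustrate the general rule, here is a minimal sketch (not taken from your code; the two-rank pairing, buffer names, and sizes are made up for the example): every MPI_Isend/MPI_Irecv returns an MPI_Request, and each request must eventually be completed, e.g. with MPI_Wait, MPI_Waitall, or a loop around MPI_Test. Otherwise the handles, and whatever internal memory backs them, accumulate on every call.

// Minimal sketch: two ranks exchange one 48-element buffer per time step
// and complete BOTH requests each iteration. Run with two ranks.
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double sendbuf[48] = {0}, recvbuf[48];
    int neighbor = (rank == 0) ? 1 : 0;   // illustrative pairing of ranks 0 and 1

    for (int step = 0; step < 1000; step++)
    {
        MPI_Request req[2];
        MPI_Status  stat[2];

        MPI_Isend(sendbuf, 48, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(recvbuf, 48, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req[1]);

        // Completing BOTH requests frees both handles; waiting on only one of
        // them leaks the other on every time step.
        MPI_Waitall(2, req, stat);
    }

    MPI_Finalize();
    return 0;
}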
Now that the code has been added to the question: this is indeed dirty code. You only wait for the request from the Irecv call. Yes, if the message is received you know that the send has completed, so you don't have to wait for it. But skipping the wait gives a memory leak: the Isend allocates a new request, which the Wait would deallocate. Since you never wait, you never deallocate, and so you have a memory leak.