Tuesday, 7 June 2011

Handling the MPI_Wait calls

Asynchronous communication


What is it?


Asynchronous communication is used in MPI programmes to perform non-blocking operations. These routines are usually used to avoid deadlocks and to ensure that the communications work properly. Each asynchronous MPI call returns an MPI_Request object that is later used to check that the communication has completed.

It is composed of 2 steps:

  • doing the asynchronous communication
  • waiting for the request

Taking the simple example of a message passed around a ring, the code could be:


#include <mpi.h>
#include <stdio.h>

#define TURNS 10

int main(int argc, char** argv)
{
  int i;
  int rank, size, left, right;
  int mess1, mess2;
  MPI_Request leftReq;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  left = (rank+1)%size;
  right = (rank-1+size)%size;
  mess1 = rank;

  for ( i = 0 ; i < TURNS ; i++ )
    {
      // non blocking send to left
      MPI_Issend(&mess1, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &leftReq);
      fprintf(stderr, "%d: sending %d to %d\n", rank, mess1, left);

      // blocking receive from right
      MPI_Recv(&mess2, 1, MPI_INT, right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      fprintf(stderr, "%d: receives %d from %d\n", rank, mess2, right);

      // wait for unfinished request
      MPI_Wait(&leftReq, MPI_STATUS_IGNORE);
    }

  MPI_Finalize();

  return 0;
}

On the profiler side, the problem is to know, at each MPI_Wait call, whether the current request belongs to one of the registered communicators (see previous note). From the MPI_Wait call alone there is no way of telling what the original call was, which processor it targeted, or what data was actually sent.


Finding what is waited for


An easy way to recover this information at MPI_Wait time is to save the details of each asynchronous call when it is issued, and to look them up when a wait is issued.
The MPI_Request is used as the identifier, since it has to be unique for the MPI implementation itself to know what is being waited for. The data is stored in a linked list that is part of the Register_Comm structure. The code therefore looks like this:


int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
{
  int ret;
  Register_Comm* commInfo = NULL;
 
  commInfo = Comm_register_search(comm);

  ret = PMPI_Issend(buf, count, datatype, dest, tag, comm, request);

  if ( commInfo )
    {
      /* send information to the Interface */

      addRequest(commInfo, request, dest, MESSAGE_Issend);
    }

  return ret;
}

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
  int ret;
  Request_Info* info = NULL;

  // search in all registered communicators...
  // (done before PMPI_Wait, which resets a completed request to MPI_REQUEST_NULL)
  info = General_searchRequest(request);
    
  ret = PMPI_Wait(request, status);

  if ( info != NULL )
    {
      /* send information to the Interface */

      freeRequest(info);
    }

  return ret;
}

Internal improvements


The data are stored as cells of a doubly linked list, which makes removal easy (requests may not be waited for in the order in which they were generated). The current implementation is not very fast, however: the time to find a request is proportional to the number of pending requests across all registered communicators, since each communicator is searched.
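The list handling described above can be sketched as follows. This is only an illustration under assumptions: Request_Key is a hypothetical stand-in for whatever value identifies an MPI_Request, the fields of Request_Info and Register_Comm are guessed, and request_add, request_find and request_remove are simplified counterparts of addRequest, General_searchRequest and freeRequest, not their actual code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the value that identifies an MPI_Request. */
typedef unsigned long Request_Key;

/* One cell per pending asynchronous call (field names are assumptions). */
typedef struct Request_Info {
  Request_Key key;
  int dest;
  struct Request_Info *prev;
  struct Request_Info *next;
} Request_Info;

/* Assumed minimal shape of the per-communicator structure. */
typedef struct Register_Comm {
  Request_Info *head;
} Register_Comm;

/* Push a new cell at the head of the list: constant time. */
static Request_Info *request_add(Register_Comm *comm, Request_Key key, int dest)
{
  Request_Info *cell = malloc(sizeof *cell);
  if (cell == NULL)
    return NULL;
  cell->key = key;
  cell->dest = dest;
  cell->prev = NULL;
  cell->next = comm->head;
  if (comm->head != NULL)
    comm->head->prev = cell;
  comm->head = cell;
  return cell;
}

/* Linear search: time proportional to the number of pending requests. */
static Request_Info *request_find(const Register_Comm *comm, Request_Key key)
{
  Request_Info *cell;
  for (cell = comm->head; cell != NULL; cell = cell->next)
    if (cell->key == key)
      return cell;
  return NULL;
}

/* Unlink and free a cell: constant time thanks to the prev pointer. */
static void request_remove(Register_Comm *comm, Request_Info *cell)
{
  assert(comm != NULL && cell != NULL);
  if (cell->prev != NULL)
    cell->prev->next = cell->next;
  else
    comm->head = cell->next;
  if (cell->next != NULL)
    cell->next->prev = cell->prev;
  free(cell);
}

/* Number of requests still pending on this communicator. */
static int request_count(const Register_Comm *comm)
{
  int n = 0;
  const Request_Info *cell;
  for (cell = comm->head; cell != NULL; cell = cell->next)
    n++;
  return n;
}
```

Removing a cell in the middle of the list does not require walking it again, which is the reason for the second pointer. The search itself remains linear.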

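An easy improvement would be to record which communicator each request was allocated to, reducing the search time when several communicators or many asynchronous calls are registered. Note that MPI attribute caching (MPI_Comm_set_attr and its equivalents) applies to communicators, datatypes and windows, not to MPI_Request objects, so this association would have to be kept by the profiler itself, for instance in a table indexed by the request.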
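One possible way of getting from a request straight to its communicator, without searching every list, is a small hash table kept by the profiler. The sketch below is an assumption, not part of the current implementation: Request_Key again stands in for the request identifier, comm_id is a hypothetical index of the registered communicator, and map_put and map_take are invented names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the value that identifies an MPI_Request. */
typedef unsigned long Request_Key;

#define BUCKETS 64

/* One entry per pending request, chained inside its bucket. */
typedef struct Map_Cell {
  Request_Key key;
  int comm_id;              /* hypothetical index of the registered communicator */
  struct Map_Cell *next;
} Map_Cell;

typedef struct Request_Map {
  Map_Cell *bucket[BUCKETS];
} Request_Map;

static unsigned bucket_of(Request_Key key)
{
  return (unsigned)(key % BUCKETS);
}

/* Record that a request belongs to a communicator; returns 0 on success. */
static int map_put(Request_Map *map, Request_Key key, int comm_id)
{
  Map_Cell *cell;
  unsigned b = bucket_of(key);

  assert(map != NULL);
  cell = malloc(sizeof *cell);
  if (cell == NULL)
    return -1;
  cell->key = key;
  cell->comm_id = comm_id;
  cell->next = map->bucket[b];
  map->bucket[b] = cell;
  return 0;
}

/* Look a request up and forget it; returns its comm_id, or -1 if untracked. */
static int map_take(Request_Map *map, Request_Key key)
{
  Map_Cell **link;

  assert(map != NULL);
  link = &map->bucket[bucket_of(key)];
  while (*link != NULL) {
    if ((*link)->key == key) {
      Map_Cell *cell = *link;
      int id = cell->comm_id;
      *link = cell->next;   /* unlink from the bucket chain */
      free(cell);
      return id;
    }
    link = &(*link)->next;
  }
  return -1;
}
```

The asynchronous wrappers would call map_put after the PMPI call, and the MPI_Wait wrapper would call map_take before PMPI_Wait. Only the entries sharing a bucket are compared, instead of every pending request of every communicator.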

So far no information is given to the Interface if a request is never waited for, but a registered communicator cannot be deleted while there are still pending requests. The only way to notice is to see that the number of asynchronous calls differs from the number of wait calls. This may change in the future: a message should be sent to the Interface when a registered communicator is destroyed (and hence all its pending requests as well) or when MPI_Finalize is called while there are still requests on the list.
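The counting check mentioned above could look roughly like this. It is a minimal sketch under assumptions: Async_Stats and its functions are hypothetical names, and the report would presumably be triggered from the MPI_Finalize wrapper or when a registered communicator is destroyed.

```c
#include <stdio.h>

/* Hypothetical counters, updated by the asynchronous and wait wrappers. */
typedef struct Async_Stats {
  unsigned long issued;   /* asynchronous calls seen */
  unsigned long waited;   /* matching waits seen */
} Async_Stats;

static void stats_on_issue(Async_Stats *s) { s->issued++; }
static void stats_on_wait(Async_Stats *s)  { s->waited++; }

/* Returns the number of requests never waited for and prints a warning
   when there are any. Assumes waited never exceeds issued. */
static unsigned long stats_report_pending(const Async_Stats *s, FILE *out)
{
  unsigned long pending = s->issued - s->waited;
  if (pending > 0)
    fprintf(out, "warning: %lu asynchronous request(s) never waited for\n",
            pending);
  return pending;
}
```

Counters only say that something is missing. Walking the request list at that point would additionally tell which calls were left pending.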

MPI_Waitall, MPI_Waitany and MPI_Waitsome are not supported yet. Tests have to be performed to see whether they call MPI_Wait individually, but it is more likely that they directly call some common internal function.
