Friday 24 June 2011

Meeting 7 , 8 and 9 [01/06, 13/06 and 20/06]

Meeting 7 [01/06]

During this meeting the project was registered on Sourceforge:

project page
And the svn: mpi-visu svn

The meeting was very short and the goal of the next coming week was to develop the user-synchronised communication: literally waiting for the user to click on a continue button to continue the MPI operation.

Meeting 8 [13/06]

During the 8th meeting the result of the synchronised communication with the user was presented. Some improvement were nonetheless needed. Each message between the interface and the profiler are done though formatted string, chosen for the simplicity of adding new elements in it. It also gives the advantage of variable length, as MPI_Probe could be use on the receiver side to determine the size of the message and avoid sending big messages when not required (messages for registering a communicator and giving information about a Ssend are not the same length for instance, as they do not have the same information to transmit).

Meeting 9 [20/06]

During the week the code was cleaned, reorganising the Interface mostly in order to cope with user-waiting communications. Registering a communicator was also completed, so the profiler can cope with adding and removing communicators easily. Nonetheless the attribute to find if the communicator was previously registered wasn't introduced yet.

Some improvements were proposed by David on the way of displaying the communications. Currently a ball is moving from the waiting processor to the waited one (A does a Ssend to B, the ball moves from A to B). But displaying a fix image seams to be a better idea, as a moving ball implies message transit, when there is none.

An error was also found when using the Compute Pi example, as the tag MPI_ANY_SOURCE was used and the Interface couldn't understand it.

The next step is to develop the basis of the array registration. With that done, each main requirement will be completed, forming a backbone to develop further one or the other by taking a specific example. In order to register memory nonetheless some development has to be made on both the profiler and interface, as currently there is no way of correctly finding if a peace of memory was registered and to display it on the interface.

Tuesday 7 June 2011

Handling the MPI_Wait calls

Asynchronous communication

What is it?

Asynchronous communication is used in MPI programmes to have non-blocking operations. Usually theses routines are used to avoid deadlocks, and insure the good working of the communications. Each asynchronous MPI action returns a MPI_Request object that will be used to insure that the communication completed.

It is composed of 2 steps:

doing the asynchronous communication
waiting for the request

Taking a simple message in a ring example, the code could be:

#include <mpi.h>
#include <stdio.h>

#define TURNS 10

int main(int argc, char** argv)
{
  int i;
  int rank, size, left, right;
  int mess1, mess2;
  MPI_Request leftReq;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  left = (rank+1)%size;
  right = (rank-1+size)%size;
  mess1 = rank;

  for ( i = 0 ; i < TURNS ; i++ )
    {
      // non blocking send to left
      MPI_Issend(&mess1, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &leftReq);
      fprintf(stderr, "%d: sending %d to %d\n", rank, mess1, left);

      // blocking receive from right
      MPI_Recv(&mess2, 1, MPI_INT, right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      fprintf(stderr, "%d: receives %d from %d\n", rank, mess2, right);

      // wait for unfinished request
      MPI_Wait(&leftReq, MPI_STATUS_IGNORE);
    }

  MPI_Finalize();

  return 0;
}

On the profiler side the problems comes from knowing on the MPI_Wait calls if the current request is part of the registered communicators (see previous note). From the normal MPI call there is no way of guessing what the original call was, to which processor and what data was actually sent.

Finding what is waited for

An easy way to find about any MPI_Wait information is to save the asynchronous information, and when a wait is issued to look in them in order to find the information about it.
Using the MPI_Request as an identifier, as it has to be unique for the MPI implementation to also find out about what is waited for, the data is stored in a linked list that is part of the Register_Comm structure. The code will therefore look like that:

int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
{
  int ret;
  Register_Comm* commInfo = NULL;
 
  commInfo = Comm_register_search(comm);

  ret = PMPI_Issend(buf, count, datatype, dest, tag, comm, request);

  if ( commInfo )
    {
      /* send information to the Interface */

      addRequest(commInfo, request, dest, MESSAGE_Issend);
    }

  return ret;
}

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
  int ret;
  Request_Info* info = NULL;

  // search in all registered communicators...
  info = General_searchRequest(request);
    
  ret = PMPI_Wait(request, status);

  if ( info != NULL )
    {
      /* send information to the Interface */

      freeRequest(info);
    }

  return ret;
}

Internals improvements

The data are stored as double linked list cells, in order to have easy removal of elements (requests may not be waited in the same order that they are generated). Therefore the current implementation is obviously not very fast, as the time to find a request is proportional to the number of requests per registered communicators (each communicator is searched).

An easy improvement for that could be to add an attribute the the MPI_Request object (using MPI_set_attribute) that will point out what communicator is this request allocated to, reducing the searching time when several communicators or asynchronous alls are registered.

So far no information is given to the interface if a request isn't waited for, but a registered communicator cannot be deleted when there is still pending requests. The only way to notice is to see that the number of asynchronous calls is different of the numbers of wait ones. This may change in the future. A message should be sent to the Interface when a registered communicator is destroyed (an hence all its pending requests as well) or when MPI_Finalize is called and there is still some requests on the list.

The MPI_Waitall, MPI_Waitany and MPI_Waitsome aren't supported yet, and tests have to be performed to see if they individually call MPI_Wait, but it is more likely that they directly call some common internal function.

Friday 3 June 2011

Register a communicator

Why registering a communicator?

Registering a communicator was, originally, an idea developed to allow the user to filter communication occurring on a specific MPI communicator rather than on all of the used ones. But quickly during the tests it appears that the MPI profiling interface was used for every call of the library, meaning that when MPI_Ssend is redefined for example, even for a simple program (like the message in ring, 40 sends, 40 receives, 40 waits) the count of messages was enormous (certainly internal messages).

It became therefore important to filter the messages to register in the Interface, and in order to avoid network congestion, to do so on the Profiler side. The first way to reduce this number of messages is very simple: check for MPI_COMM_WORLD, that is the basic communicator used. But it doesn't provide the user any choice on monitoring a communicator or not.

int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  int ret;

  ret = PMPI_Ssend(buf, count, datatype, dest, tag, comm);

  if ( comm == MPI_COMM_WORLD )
    {
      /* send information to the Interface */
    }

  return ret;
}

Registering Communicators

Registering from the user side

A mechanism has to be create to allow the user to register and unregister communicators at will. Few functionalities are visible from the user side:

is a communicator registered?
register a communicator
unregister a communicator

Resulting in the following functions:

int Comm_is_registered(MPI_Comm comm);
int Comm_register(MPI_Comm comm, char* name);
int Comm_unregister(MPI_Comm comm);

When a user registers a communicator, a name is given, and this name is displayed in the Interface as the communicator name. By default MPI_COMM_WORLD is registered when the application starts, but further options may add the possibilities to avoid doing so.

Each communicator will have a unique identifier (an unsigned integer) that will be sent to the Interface. For each communicator sent to the interface this identifier will be given, allowing the GUI to sort information by communicator when needed.

Registering: use in the profiling interface

The actual registration is done via a linked list. When the user registers a communicator, the list is searched for it, and if it is not present, added. When a MPI call is processed, the list is searched as well, and if the communicator isn't present, no information is sent to the Interface. The MPI profiling function now looks like following.

int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  int ret;
  Register_Comm* commInfo = NULL;

  commInfo =  Comm_register_search(comm);

  ret = PMPI_Ssend(buf, count, datatype, dest, tag, comm);

  if ( commInfo )
    {
        /* send information to the Interface */
    }

  return ret;
}

Registering a communicator: internals

Communicator cells struct diagram

Registering a communicator looks relatively easy. The class diagram is relatively simple, though it is a double linked list (linking previous and next cells).

The MPI_Comm datatype could be considered as a simple int or long int as it represents the address of a communicator internally. In OpenMPI it is a structure. In MPICH2 it is defined as an int. Therefore a comparison like follows works.

int compare(MPI_Comm a, MPI_Comm b)
{
   return a == b;
}

Using MPI_Comm_compare

But comparing the datatype itself isn't very portable. The goal of that tool is to work with the MPI standard, not really using tricks from implementations to implementations. A function exists to compare two communicators: MPI_Comm_compare. According to the standard it returns:

MPI_IDENT results if and only if comm1 and comm2 are handles for the same object (identical groups and same contexts).
MPI_CONGRUENT results if the underlying groups are identical in constituents and rank order; these communicators differ only by context.
MPI_SIMILAR results of the group members of both communicators are the same but the rank order differs.
MPI_UNEQUAL results otherwise.

while ( curr != NULL )
    {
      MPI_Comm_compare(curr->comm, comm, &res);

        switch(res)
        {
        case MPI_IDENT:
            fprintf(stderr, "got MPI_IDENT\n");
            break;
        case MPI_CONGRUENT:
            fprintf(stderr, "got MPI_CONGRUENT\n");
            break;
        case MPI_SIMILAR:
            fprintf(stderr, "got MPI_SIMILAR\n");
            break;
        case MPI_UNEQUAL:
            fprintf(stderr, "got MPI_UNEQUAL\n");
            break;
        }

   curr = curr->next;
}

Using Communicators attributes

In order to go further, the user should be able to register a communicator for some time, and then unregister it. But just maintaining a list of currently registered communicator, if an unregistered communicator is registered again, it will become a new communicator.

This behaviour could be avoided. Firstly a list "communicator registered in the past" could be created, and each time a communicator is created, a search is performed. This isn't a bad option as usually few communicators are used, but isn't very interesting in term of performance. The MPI standard defines attributes that can be attached to an object. In that case an attribute could be created when registering the communicator (storing its unique ID for instance). This attribute could be looked for just before the insertion in the list, and if it exists, retrieve the unique ID from it.

Conclusion and limitations

The simple mechanism (comparing as int) is used effectively to search through registered communicators, and be able to add/remove communicators. But a more portable way will be used in the future, in order to complain with the standard rather than adapting to implementors versions. The actual "retrieving" of information from a deleted communicator isn't implemented yet, and will certainly be useful for watching a particular moment of a code rather than the whole MPI program without creating a lot of communicators in the Interface.