Friday, 24 June 2011

Meetings 7, 8 and 9 [01/06, 13/06 and 20/06]

Meeting 7 [01/06]


During this meeting the project was registered on SourceForge.


The meeting was very short. The goal for the coming week was to develop user-synchronised communication: literally waiting for the user to click a continue button before the MPI operation continues.


Meeting 8 [13/06]


During the 8th meeting the result of the synchronised communication with the user was presented. Some improvements were nonetheless needed. Each message between the interface and the profiler is sent as a formatted string, chosen because new elements are simple to add to it. It also has the advantage of variable length: MPI_Probe can be used on the receiver side to determine the size of the message, which avoids sending big messages when not required (messages for registering a communicator and for describing a Ssend do not have the same length, for instance, as they do not carry the same information).


Meeting 9 [20/06]


During the week the code was cleaned up, mostly by reorganising the Interface to cope with user-waiting communications. Registering a communicator was also completed, so the profiler can cope with adding and removing communicators easily. Nonetheless the attribute used to find out whether a communicator was previously registered has not been introduced yet.

David proposed some improvements to the way communications are displayed. Currently a ball moves from the waiting processor to the waited one (A does a Ssend to B, the ball moves from A to B). But displaying a fixed image seems to be a better idea, as a moving ball implies a message in transit when there is none.

An error was also found when using the Compute Pi example: the wildcard MPI_ANY_SOURCE was used as the source and the Interface couldn't interpret it.

The next step is to develop the basis of the array registration. With that done, every main requirement will have a first implementation, forming a backbone from which one or the other can be developed further by taking a specific example. Registering memory nonetheless requires some development on both the profiler and the interface, as there is currently no way to correctly find out whether a piece of memory was registered, nor to display it on the interface.

Tuesday, 7 June 2011

Handling the MPI_Wait calls

Asynchronous communication


What is it?


Asynchronous communication is used in MPI programs to perform non-blocking operations. These routines are usually used to avoid deadlocks and to ensure that the communications work properly. Each asynchronous MPI call returns an MPI_Request object that is later used to ensure that the communication has completed.

It is composed of 2 steps:

  • doing the asynchronous communication
  • waiting for the request

Taking a simple message in a ring example, the code could be:


#include <mpi.h>
#include <stdio.h>

#define TURNS 10

int main(int argc, char** argv)
{
  int i;
  int rank, size, left, right;
  int mess1, mess2;
  MPI_Request leftReq;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  left = (rank+1)%size;
  right = (rank-1+size)%size;
  mess1 = rank;

  for ( i = 0 ; i < TURNS ; i++ )
    {
      // non blocking send to left
      MPI_Issend(&mess1, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &leftReq);
      fprintf(stderr, "%d: sending %d to %d\n", rank, mess1, left);

      // blocking receive from right
      MPI_Recv(&mess2, 1, MPI_INT, right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      fprintf(stderr, "%d: receives %d from %d\n", rank, mess2, right);

      // wait for unfinished request
      MPI_Wait(&leftReq, MPI_STATUS_IGNORE);
    }

  MPI_Finalize();

  return 0;
}

On the profiler side the problem is to know, in the MPI_Wait call, whether the current request belongs to one of the registered communicators (see previous note). From the normal MPI call there is no way of knowing what the original call was, which processor it targeted and what data was actually sent.


Finding what is waited for


An easy way to find any information needed in MPI_Wait is to save the details of each asynchronous call, and to look them up when a wait is issued.
The MPI_Request is used as the identifier, as it has to be unique for the MPI implementation itself to find out what is being waited for. The data is stored in a linked list that is part of the Register_Comm structure. The code therefore looks like this:


int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
{
  int ret;
  Register_Comm* commInfo = NULL;
 
  commInfo = Comm_register_search(comm);

  ret = PMPI_Issend(buf, count, datatype, dest, tag, comm, request);

  if ( commInfo )
    {
      /* send information to the Interface */

      addRequest(commInfo, request, dest, MESSAGE_Issend);
    }

  return ret;
}

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
  int ret;
  Request_Info* info = NULL;

  // search in all registered communicators...
  info = General_searchRequest(request);
    
  ret = PMPI_Wait(request, status);

  if ( info != NULL )
    {
      /* send information to the Interface */

      freeRequest(info);
    }

  return ret;
}

Internal improvements


The data are stored as doubly linked list cells, in order to make removing elements easy (requests may not be waited for in the order in which they were generated). The current implementation is not very fast, however, as the time to find a request is proportional to the number of pending requests across all registered communicators (each communicator is searched).
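
As a minimal sketch of the storage described above, the following C code keeps pending requests in a doubly linked list. The names Request_Cell, Request_List, request_add, request_find and request_remove are hypothetical (the profiler's own addRequest, General_searchRequest and freeRequest are not shown in this note), and a plain pointer stands in for the MPI_Request so that the example compiles without MPI.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cell describing one pending asynchronous call. */
typedef struct Request_Cell {
    const void *key;             /* stands in for the MPI_Request pointer */
    int peer;                    /* destination or source rank */
    int action;                  /* e.g. a value of enum Message */
    struct Request_Cell *prev;
    struct Request_Cell *next;
} Request_Cell;

/* Hypothetical list head, one per registered communicator. */
typedef struct Request_List {
    Request_Cell *head;
    int pending;                 /* number of requests not waited for yet */
} Request_List;

/* Insert a new cell at the front; returns NULL if allocation fails. */
Request_Cell *request_add(Request_List *list, const void *key, int peer, int action)
{
    Request_Cell *cell;

    assert(list != NULL);
    cell = malloc(sizeof *cell);
    if (cell == NULL)
        return NULL;

    cell->key = key;
    cell->peer = peer;
    cell->action = action;
    cell->prev = NULL;
    cell->next = list->head;
    if (list->head != NULL)
        list->head->prev = cell;
    list->head = cell;
    list->pending++;
    return cell;
}

/* Linear search: the cost grows with the number of pending requests. */
Request_Cell *request_find(const Request_List *list, const void *key)
{
    Request_Cell *cur;

    assert(list != NULL);
    for (cur = list->head; cur != NULL; cur = cur->next)
        if (cur->key == key)
            return cur;
    return NULL;
}

/* Unlink and free a cell in constant time thanks to the prev/next links. */
void request_remove(Request_List *list, Request_Cell *cell)
{
    assert(list != NULL && cell != NULL);
    if (cell->prev != NULL)
        cell->prev->next = cell->next;
    else
        list->head = cell->next;
    if (cell->next != NULL)
        cell->next->prev = cell->prev;
    free(cell);
    list->pending--;
}
```

In a wrapper, request_add would be called after PMPI_Issend, and request_find followed by request_remove around PMPI_Wait. A non-zero pending counter when MPI_Finalize is reached would point at requests that were never waited for.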

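An easy improvement would be to associate each MPI_Request directly with the communicator it belongs to, reducing the search time when several communicators or asynchronous calls are registered. The original idea was to attach an attribute to the request, but the MPI standard only provides attribute caching for communicators, datatypes and windows (MPI_Comm_set_attr and its counterparts), so a separate lookup structure keyed on the request would be needed.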

So far no information is given to the interface if a request is never waited for, but a registered communicator cannot be deleted while there are still pending requests. The only way to notice is to see that the number of asynchronous calls differs from the number of wait calls. This may change in the future: a message should be sent to the Interface when a registered communicator is destroyed (and hence all its pending requests as well) or when MPI_Finalize is called while there are still requests on the list.

The MPI_Waitall, MPI_Waitany and MPI_Waitsome aren't supported yet, and tests have to be performed to see if they individually call MPI_Wait, but it is more likely that they directly call some common internal function.

Friday, 3 June 2011

Register a communicator

Why register a communicator?


Registering a communicator was originally an idea developed to let the user filter the communications occurring on a specific MPI communicator rather than on all of them. But during the tests it quickly appeared that the MPI profiling interface was invoked for every call made inside the library: when MPI_Ssend is redefined, for example, even for a simple program (like the message in a ring: 40 sends, 40 receives, 40 waits) the count of messages was enormous (most likely internal messages).

It therefore became important to filter the messages reported to the Interface and, in order to avoid network congestion, to do so on the Profiler side. The first way to reduce the number of messages is very simple: check for MPI_COMM_WORLD, which is the basic communicator. But it doesn't give the user any choice about which communicators to monitor.


int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  int ret;

  ret = PMPI_Ssend(buf, count, datatype, dest, tag, comm);

  if ( comm == MPI_COMM_WORLD )
    {
      /* send information to the Interface */
    }

  return ret;
}

Registering Communicators


Registering from the user side


A mechanism has to be created to allow the user to register and unregister communicators at will. A few functionalities are visible from the user side:

  • is a communicator registered?
  • register a communicator
  • unregister a communicator

Resulting in the following functions:

int Comm_is_registered(MPI_Comm comm);
int Comm_register(MPI_Comm comm, char* name);
int Comm_unregister(MPI_Comm comm);

When a user registers a communicator, a name is given, and this name is displayed in the Interface as the communicator name. By default MPI_COMM_WORLD is registered when the application starts, but a future option may allow this to be disabled.

Each communicator will have a unique identifier (an unsigned integer) that will be sent to the Interface. Every piece of communication information sent to the Interface will carry this identifier, allowing the GUI to sort information by communicator when needed.


Registering: use in the profiling interface


The actual registration is done via a linked list. When the user registers a communicator, the list is searched for it and, if it is not present, it is added. When an MPI call is processed, the list is searched as well, and if the communicator isn't present, no information is sent to the Interface. The MPI profiling function now looks as follows.


int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  int ret;
  Register_Comm* commInfo = NULL;

  commInfo =  Comm_register_search(comm);

  ret = PMPI_Ssend(buf, count, datatype, dest, tag, comm);

  if ( commInfo )
    {
        /* send information to the Interface */
    }

  return ret;
}

Registering a communicator: internals



Communicator cells struct diagram

Registering a communicator looks relatively easy. The diagram is relatively simple, though it is a doubly linked list (linking previous and next cells).


The MPI_Comm datatype can be treated as a simple scalar handle that identifies a communicator internally. In OpenMPI it is a pointer to a structure; in MPICH2 it is defined as an int. In both cases a comparison like the following works.

int compare(MPI_Comm a, MPI_Comm b)
{
   return a == b;
}


Using MPI_Comm_compare


But comparing the handle itself isn't very portable. The goal of this tool is to work with the MPI standard, not to rely on tricks that vary from one implementation to another. A function exists to compare two communicators: MPI_Comm_compare. According to the standard it returns:

  • MPI_IDENT results if and only if comm1 and comm2 are handles for the same object (identical groups and same contexts).
  • MPI_CONGRUENT results if the underlying groups are identical in constituents and rank order; these communicators differ only by context.
  • MPI_SIMILAR results if the group members of both communicators are the same but the rank order differs.
  • MPI_UNEQUAL results otherwise.


while ( curr != NULL )
    {
      MPI_Comm_compare(curr->comm, comm, &res);

      switch(res)
        {
        case MPI_IDENT:
          fprintf(stderr, "got MPI_IDENT\n");
          break;
        case MPI_CONGRUENT:
          fprintf(stderr, "got MPI_CONGRUENT\n");
          break;
        case MPI_SIMILAR:
          fprintf(stderr, "got MPI_SIMILAR\n");
          break;
        case MPI_UNEQUAL:
          fprintf(stderr, "got MPI_UNEQUAL\n");
          break;
        }

      curr = curr->next;
    }


Using communicator attributes


To go further, the user should be able to register a communicator for some time and then unregister it. But if only a list of currently registered communicators is maintained, a communicator that is unregistered and then registered again will appear as a new communicator.



This behaviour can be avoided. A first option is to keep a list of "communicators registered in the past" and to search it each time a communicator is registered. This isn't a bad option, as usually few communicators are used, but it isn't very attractive in terms of performance. A second option uses the attributes that the MPI standard allows to be attached to a communicator: an attribute can be created when registering the communicator (storing its unique ID, for instance). The attribute is then looked for just before insertion in the list and, if it exists, the unique ID is retrieved from it.
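
The first option (remembering communicators registered in the past) could be sketched as follows. This is only an illustration under stated assumptions: Comm_Cell, Comm_Registry and the registry_* functions are hypothetical names, a long stands in for the MPI_Comm handle so the code compiles without MPI, and unregistering simply marks the cell as inactive so that its unique ID survives.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define COMM_NAME_LEN 32

/* Hypothetical registry cell; inactive cells are kept to remember their ID. */
typedef struct Comm_Cell {
    long handle;                 /* stands in for an MPI_Comm handle */
    unsigned int id;             /* unique identifier sent to the Interface */
    char name[COMM_NAME_LEN];
    int active;                  /* 1 while registered, 0 once unregistered */
    struct Comm_Cell *next;
} Comm_Cell;

typedef struct Comm_Registry {
    Comm_Cell *head;
    unsigned int next_id;        /* last identifier handed out (0 = none) */
} Comm_Registry;

/* Find a cell by handle, whether it is currently active or not. */
static Comm_Cell *registry_lookup(const Comm_Registry *reg, long handle)
{
    Comm_Cell *cur;

    for (cur = reg->head; cur != NULL; cur = cur->next)
        if (cur->handle == handle)
            return cur;
    return NULL;
}

/* Register a communicator; a known handle gets its previous ID back.
 * Returns the ID (starting at 1), or 0 if allocation fails. */
unsigned int registry_register(Comm_Registry *reg, long handle, const char *name)
{
    Comm_Cell *cell;

    assert(reg != NULL && name != NULL);
    cell = registry_lookup(reg, handle);
    if (cell == NULL) {
        cell = malloc(sizeof *cell);
        if (cell == NULL)
            return 0;
        cell->handle = handle;
        cell->id = ++reg->next_id;
        cell->next = reg->head;
        reg->head = cell;
    }
    snprintf(cell->name, sizeof cell->name, "%s", name);
    cell->active = 1;
    return cell->id;
}

/* Returns 1 if the communicator was registered and is now inactive, else 0. */
int registry_unregister(Comm_Registry *reg, long handle)
{
    Comm_Cell *cell;

    assert(reg != NULL);
    cell = registry_lookup(reg, handle);
    if (cell == NULL || !cell->active)
        return 0;
    cell->active = 0;
    return 1;
}

/* Returns 1 only for communicators that are currently registered. */
int registry_is_registered(const Comm_Registry *reg, long handle)
{
    const Comm_Cell *cell;

    assert(reg != NULL);
    cell = registry_lookup(reg, handle);
    return cell != NULL && cell->active;
}
```

One caveat of this handle-based variant: an MPI implementation may hand out the same handle value again after a communicator is freed, in which case a new communicator would inherit an old ID. An attribute stored on the communicator itself does not have that weakness, as it disappears with the communicator.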


Conclusion and limitations


The simple mechanism (comparing as int) is currently used to search through registered communicators and to add or remove them. A more portable way will be used in the future, in order to comply with the standard rather than adapt to each implementor's version. The actual "retrieving" of information for a previously unregistered communicator isn't implemented yet; it will certainly be useful for watching a particular part of a code rather than the whole MPI program, without creating a lot of communicators in the Interface.

Monday, 30 May 2011

Meeting 6 [25/05] and how to use the tool

The meeting


Even though a month went by, few new functionalities appeared in the project, mainly because it was the revision and examination period. During that period effort went into keeping the project moving through design and research. Some ideas on how to implement the synchronisation and other possible functionalities were discussed (like the communicator registration).

The next step is to generate documentation for the project as a proper webpage (also because some parts of the code aren't well commented, as it was evolving quickly). Effort has to be made to organise the project's folders. And some implementation work will be carried out:

  • communicator registration, to organise the MPI_Request saving
  • synchronous communication (wait for the user to click continue before actually performing the MPI function)


The basis for both implementations was already in place (MPI_Request objects are saved into a linked list and the backbone of the synchronisation is implemented, but not functional).

Using the tool


This tool aims to help people learn how MPI behaves. The sources have therefore been opened to the "public". The first attempt was on an internal machine - Ness - and didn't work correctly. Therefore the project will be registered on SourceForge, as planned, but earlier than intended.


This part explains how to use the current version of the code; it shouldn't change much in future releases.

The project is composed of a library - the profiler - and an executable - the interface. The project should be organised into folders, one per deliverable, and should include tests. A general Makefile should be available to compile each of the deliverables, and a configure script may be available to automate the variable generation (installation path, MPI flags to compile from the MPI compilers, Qt path, ...).

Compiling the profiler


Compiling the profiler requires:

  • a C MPI implementation
  • a C compiler

The profiler is available in both static and dynamic linking formats, as only the linking stage changes. It is important for the user to be able to choose one or the other, as it appeared that some MPI installations only accept one type of library when using the MPI profiling interface.


Either mode could be compiled and installed, but note that if both are installed, it appears that dynamic linking is used by default.


Running make static or make dynamic should compile and install the library, by default in a local install folder composed of the classical lib and includes folders.


Compiling the interface


Compiling the interface requires:

  • Qt 4.6 or later (note that Qt 4.7 was used, but none of the functionalities used were introduced in that release).
  • a C++ MPI implementation that supports multi-threading (see a previous note).
  • a C++ compiler
  • the headers from the profiler

The interface should be compilable from the main Makefile. A typical Qt project needs a project file, from which qmake generates the Makefile used to compile it. Normally this process should be automatic, as the main Makefile should take care of it. If a configure script is available it should handle the variable generation; otherwise some variables need to be set:

  • INSTALL_ROOT should contain the path to the installation folder (default: ../install as it is relative to the interface folder where it is built).
  • MPI_INCLUDE should contain the path to the MPI headers. It can be retrieved by using mpicc -showme and is generally like -I/usr/local/include. However the -I should be REMOVED from the project option as QMake will generate it automatically.
  • MPI_LINK should contain the linking options given by mpicc -showme and is generally like -pthread -L/usr/local/lib -lmpi_cxx -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl.
  • MPI_EXTRA_FLAGS should be set up to -DMPICH_IGNORE_CXX_SEEK when using MPICH2 to avoid conflict with standard C++ file handling.

When the project file is ready, and named mpidisplay.pro, running make display should take care of the 2 compilation steps and of the installation. Nonetheless the steps are:
  • The generation of the Makefile: qmake mpidisplay.pro. You can specify the previously stated variables on the command line or in the file itself (example: qmake mpidisplay.pro INSTALL_ROOT=../install).
  • Compiling the executable with the generated Makefile: make -f Makefile.qt
  • Installing the executable is done by calling make -f Makefile.qt install


Using a MPI program with the library


Compiling


In order to compile your program with the profiler, you need to know where the profiler library is installed. Let's assume ~/local/, meaning that the library is in ~/local/lib and the headers in ~/local/includes. The location of the mpidisplay interface isn't important yet, but it is most likely ~/local/bin.


Note that to compile - even as a dynamic library - you do not need the LD_LIBRARY_PATH to be updated, but you will need it to run the software later. You don't need to update the variable if you use static linking, as the library is completely added to your executable. To set the path simply execute export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/local/lib in your shell - or add it to your bashrc file.


Compiling your program is done exactly the same way as for any other MPI program with an additional library. You need to add the headers path to the compiler flags and the library location and name to the linking flags. In a generic Makefile this is done by adding CFLAGS+=-I$(HOME)/local/includes and LDFLAGS+=-L$(HOME)/local/lib -lmpi_wrap ($(HOME) is used because the shell does not expand ~ when it directly follows -I or -L).


The only source modification is then to add to your code files:

#include <mpi_wrap.h>

In theory you can even remove the flags if you don't want to use the library, but be aware that the compiler will then complain about not finding "mpi_wrap.h". You can therefore define a preprocessor macro WITH_MPIWRAP by adding CFLAGS+=-I$(HOME)/local/includes -DWITH_MPIWRAP and guard your include as

#ifdef WITH_MPIWRAP
#include <mpi_wrap.h>
#endif


Running


As stated before, your LD_LIBRARY_PATH should be updated if you are using dynamic linking. Otherwise simply run your MPI program as usual. Assuming your program executable is called ring, you would usually run mpiexec -n 4 ring to start 4 MPI processes. With the library it is the same!

The profiler library will write the port on the standard output by default. But you can add command line arguments to choose another destination:

  • Standard output with --port-in-stdout
  • Standard error output with --port-in-stderr
  • A text file with --port-in-file file


Start the interface



Starting message box of the interface (GNU/Linux Gnome 3)

In order to start the mpidisplay interface you need to add its location to the PATH, with the same technique as for LD_LIBRARY_PATH: export PATH=$PATH:~/local/bin. Then simply run mpidisplay to see the connection window. If you exported the ports on the standard output, move to the "manual" tab and write the ports in the fields. You can change the number of processors in the list, and the order of the ports does not matter. If you used a text file, click the button and select it; the port information will be loaded into the text edit area underneath in case you need to edit it (and the number of processors should be updated).

Then simply click OK to start the interface.


For the moment the interface shows 2 main pieces of information: the number of calls to a sample of MPI routines and the time spent in them.



Friday, 20 May 2011

Profiler-Interface communication

Reminder


As a reminder to the reader: the actual code works with client-server communication. The server part - the MPI profiler - sends information to the client - the interface - through an MPI connection set up with MPI_Open_port.


In order to do so a protocol has to be defined for the messages.


Client-Server organisation


Connection



The Profiler-Interface organisation (4 MPI processes)

The connection between the profiler and the interface is made via the MPI_Open_port function, which opens a port and returns its address. Each MPI process of the profiler publishes its own port, and the interface therefore has to connect to every single one of them. That means there are n servers and 1 client. This is unusual for a client-server model, where usually 1 server delivers information to several clients. Nonetheless the profiler processes are the servers, as they are the ones that publish an accessible address.


With OpenMPI


OpenMPI provides an address that looks like:


117112832.0;tcp://192.168.1.71:36441+117112833.0;tcp://192.168.1.71:42986:300


An early attempt to guess the port was unsuccessful. Some parts may change from one process to another with no other logic than resource availability (the port number, for example).


But a problem arose when connecting from the interface. As the previous diagram shows, several profilers connect to a single interface process. What is not shown is that the interface in fact has 1 thread per profiler to deal with the communication. The data is then centralised in a single GUI. This is a typical hybrid MPI programming approach. The interface therefore has to initialise the MPI environment with MPI_Init_thread (rather than MPI_Init) and ask for MPI_THREAD_MULTIPLE support. By default OpenMPI doesn't provide such support.


The solution is rather simple: recompile OpenMPI with the threading support:

./configure --enable-mpi-threads


With MPICH-2


The Ness machine provided an already installed MPICH-2 implementation. For some reason it didn't support dynamic linking, but static linking works fine. For this implementation the port string looks like:

tag=0 port=52970 description=ness.epcc.ed.ac.uk ifname=129.215.175.1

That is radically different from the OpenMPI one, showing once more that guessing the port isn't a good idea.

As MPICH-2 is installed on Ness with multi-threading already enabled, the corresponding configuration option isn't known yet.


Retrieving the port


The profiler opens and publishes a port. The user then has to read the ports and give them as input to the interface. In order to give the user as much freedom as possible, several ways of doing so are available:

  • Printing the port to the standard output stream
  • Printing the port to the standard error output stream
  • Writing the ports into a defined file

The choice is passed on when MPI_Init is called in the MPI code, simply by providing command line arguments when calling mpiexec. The available arguments can be listed with (ring is the executable name):

$> ./ring --help
Profiler of an MPI program\nUse a MPI visualisation GUI to see information

Possible options:
--port-in-stdout [default]
   write the port into the standard output
--port-in-stderr
   write the port into the standard error output
--port-in-file file
   write the port into the file using MPI-I/O

Note that only the last given option is used

--help
   display that help

To use the file writing functionality simply start your program like:
$> mpiexec -n 4 ring --port-in-file port.txt
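
The option handling described by the help text could be sketched as below. This is not the profiler's actual code: parse_port_option is a hypothetical helper, and only the PortType enumeration (from child_comm.h) and the three option names come from this note. It applies the "last given option wins" rule and leaves unknown arguments alone, as they belong to the user's program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum PortType { STDOUT, STDERR, INFILE } PortType;

/* Scan the command line for the port options.
 * On return *type holds the last option given (STDOUT if none) and *file
 * points to the file name for INFILE, or is NULL otherwise.
 * Returns 0 on success, -1 if --port-in-file has no file name after it. */
int parse_port_option(int argc, char **argv, PortType *type, const char **file)
{
    int i;

    assert(type != NULL && file != NULL);
    *type = STDOUT;              /* documented default */
    *file = NULL;

    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--port-in-stdout") == 0) {
            *type = STDOUT;
            *file = NULL;
        } else if (strcmp(argv[i], "--port-in-stderr") == 0) {
            *type = STDERR;
            *file = NULL;
        } else if (strcmp(argv[i], "--port-in-file") == 0) {
            if (i + 1 >= argc)
                return -1;       /* missing file name */
            *type = INFILE;
            *file = argv[++i];   /* consume the file name as well */
        }
        /* any other argument is left for the application itself */
    }
    return 0;
}
```

With the command line shown above (ring --port-in-file port.txt) the function would report INFILE and port.txt.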

Note: so far, adding the option manually as a 2D array of char doesn't work, and this has not been investigated further.


Writing each process' port in a single file


In order to write each process' port in a single file the MPI I/O functions are used. The standard defines several ways of doing so. In this case a simple subarray is defined with the size of the port as a base length. Each process appends a newline character to its port string before writing it, so every port ends up on its own line of the file. The interface can therefore read the file line by line to find every port and know the number of started processes.


Extract of child_comm.c
if ( port == INFILE )
    {
      MPI_Datatype subarray;
      MPI_File file_ptr;
      int smallarray, bigarray, stride;

      smallarray = (strlen(port_name)+1);
      bigarray = world_size*smallarray;
      stride = world_rank*smallarray;

      fprintf(stderr, "!profiler(%d)! will write his port in '%s'\n", world_rank, file);

      MPI_Type_create_subarray(1, &bigarray, &smallarray, &stride, MPI_ORDER_C, INTRA_MESSAGE_MPITYPE, &subarray);
      MPI_Type_commit(&subarray);

      if ( MPI_File_open(MPI_COMM_WORLD, file, MPI_MODE_WRONLY|MPI_MODE_CREATE, MPI_INFO_NULL, &file_ptr) != MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler(%d)! failed to open file '%s'. ABORTING\n", world_rank, file);
          MPI_Abort(MPI_COMM_WORLD, -1);
        }

      if ( MPI_File_set_view(file_ptr, 0, INTRA_MESSAGE_MPITYPE, subarray, "native", MPI_INFO_NULL) != MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler(%d)! failed to set the file view! ABORTING\n", world_rank);
          MPI_Abort(MPI_COMM_WORLD, -1);
        }

      if ( MPI_File_write_all(file_ptr, strcat(port_name, "\n"), smallarray, INTRA_MESSAGE_MPITYPE, MPI_STATUS_IGNORE) != MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler(%d)! failed to write '%s'. ABORTING\n", world_rank, file);
          MPI_Abort(MPI_COMM_WORLD, -1);
        }

      MPI_File_close(&file_ptr);
    }
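
The reading side (one port per line, the number of lines giving the number of processes) could look like the sketch below. The real interface is written in C++ with Qt, so this plain C version is only an illustration; read_ports and PORT_MAX_LEN are hypothetical names, and a real build would more likely size the buffers with MPI_MAX_PORT_NAME.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PORT_MAX_LEN 256         /* assumed upper bound for one port string */

/* Read one port per line from an already opened file.
 * At most max_ports ports are stored; empty lines are skipped.
 * Returns the number of ports read, or -1 if a line does not fit
 * in the PORT_MAX_LEN buffer. */
int read_ports(FILE *in, char ports[][PORT_MAX_LEN], int max_ports)
{
    char line[PORT_MAX_LEN];
    int count = 0;

    assert(in != NULL && ports != NULL);
    while (count < max_ports && fgets(line, sizeof line, in) != NULL) {
        size_t len = strcspn(line, "\r\n");   /* length without line ending */

        if (line[len] == '\0' && len == sizeof line - 1)
            return -1;                        /* buffer full, no newline seen */
        line[len] = '\0';
        if (len == 0)
            continue;                         /* ignore empty lines */
        memcpy(ports[count], line, len + 1);  /* copy including the NUL */
        count++;
    }
    return count;
}
```

The returned count is what allows the interface to know how many profiler processes were started without asking the user.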

Communication


The profiler side


As far as the profiler is concerned, the communication with the interface could be either synchronous or asynchronous. The current implementation uses MPI_Ssend as the simple choice, but a later version could use an asynchronous call and wait for it before the next one is issued, or even maintain a list of requests to wait for.

The profiler uses internal functions declared in child_comm.h to communicate with the interface.


child_comm.h
#ifndef CHILDCOMM
#define CHILDCOMM

#include "intra_comm.h"

extern int world_rank;
extern double global_time;

typedef enum PortType { STDOUT, STDERR, INFILE } PortType;

int start_child(int world_size, PortType port_type, char* file);
int alive_child();
int sendto_child(Intra_message* message);
int wait_child(double time_in);

#endif // CHILDCOMM

intra_comm.h
#ifndef INTRA_COMM
#define INTRA_COMM

#define INTRA_MESSAGE_SIZE 64
typedef char Intra_message;

#define INTERCOMM_TAG 0

#define PROFNAME "!profiler!"

#ifdef __cplusplus
#define INTRA_MESSAGE_MPITYPE MPI::CHAR
#else
#define INTRA_MESSAGE_MPITYPE MPI_CHAR
#endif

/*
 * ACTIONS
 */

typedef enum Message { MESSAGE_INIT,
               MESSAGE_Ssend,
               MESSAGE_Bsend,
               MESSAGE_Issend,
               MESSAGE_Recv,
               MESSAGE_Irecv,
               MESSAGE_Wait,
               MESSAGE_QUIT } Message;

#endif // INTRA_COMM

The function names are self-explanatory. The intra_comm.h header defines the actual protocol information and is therefore used by both profiler and interface. The actual sending is done with character strings, renamed as Intra_message. As the interface is coded in C++, INTRA_MESSAGE_MPITYPE is defined using both the C and C++ MPI definitions.


The message is composed of several fields, all separated by a space, and always includes the main fields:

  • action::enum Message the occurring action
  • time in::double the Unix time when entering the MPI function
  • time out::double the Unix time when returning from the MPI function

But each Message adds its own information as well, after the main fields. For example an MPI_Ssend also carries:

  • communicator::unsigned int the communicator unique number - not implemented yet
  • destination::int the destination process

More information can be added as needed. Each MPI function defines its own optional fields in its own call to sendto_child().

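The information is written using standard C I/O calls (snprintf bounds the write to the message buffer, and the string is terminated automatically, so no explicit \0 is needed):

snprintf(message, INTRA_MESSAGE_SIZE, "%d %f %f %d", MESSAGE_Ssend, time_in, time_out, dest);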
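
To illustrate the protocol end to end, here is a small sketch that writes a Ssend message and reads the main fields back in C. INTRA_MESSAGE_SIZE, Intra_message and the Message enumeration are copied from intra_comm.h; format_ssend and parse_main_fields are hypothetical helpers, not functions of the profiler (whose receiving side is written in C++).

```c
#include <assert.h>
#include <stdio.h>

#define INTRA_MESSAGE_SIZE 64
typedef char Intra_message;

typedef enum Message { MESSAGE_INIT,
               MESSAGE_Ssend,
               MESSAGE_Bsend,
               MESSAGE_Issend,
               MESSAGE_Recv,
               MESSAGE_Irecv,
               MESSAGE_Wait,
               MESSAGE_QUIT } Message;

/* Write "action time_in time_out dest" into message.
 * Returns 0 on success, -1 if the text does not fit in the buffer. */
int format_ssend(Intra_message *message, double time_in, double time_out, int dest)
{
    int n;

    assert(message != NULL);
    n = snprintf(message, INTRA_MESSAGE_SIZE, "%d %f %f %d",
                 (int)MESSAGE_Ssend, time_in, time_out, dest);
    return (n < 0 || n >= INTRA_MESSAGE_SIZE) ? -1 : 0;
}

/* Read the three main fields shared by every message.
 * Returns the number of characters consumed, so that the caller can parse
 * the action-specific fields from message + result, or -1 on error. */
int parse_main_fields(const Intra_message *message, int *action,
                      double *time_in, double *time_out)
{
    int used = 0;

    assert(message != NULL);
    if (sscanf(message, "%d %lf %lf%n", action, time_in, time_out, &used) != 3)
        return -1;
    return used;
}
```

After parse_main_fields succeeds, the optional fields of a Ssend can be read with sscanf(message + used, "%d", &dest). Note that %f keeps six decimal places, so the timestamps travel with microsecond resolution.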


The interface side



Starting message box of the interface (GNU/Linux Gnome 3)

On the interface side the profilers' ports can be defined either manually or by reading the file written as explained before. When this is done, one thread per process is started, whose duty is to communicate with the profiler (the object is called MPIWatch). The MPIWatch object is only responsible for receiving information from (and sending it to) the profiler; each of them is therefore attached to a Monitor object, which is responsible for analysing the messages. To communicate, the MPIWatch pushes arriving messages onto a stack and signals to the Monitor that new messages are available. The Monitor then analyses each message and displays the information in the appropriate places.


The MPIWatch - Monitor pairing was chosen for logical reasons:

  1. Only the MPIWatch is actually aware of the MPI functions needed to send and receive information to and from the profiler. If another system is used in the future, only this class has to be changed.
  2. Only the MPIWatch needs a separate thread; dealing with the message contents is done on the main thread.
  3. Only the Monitor knows what a message contains. New protocol functionalities do not affect the way data is transferred between profiler and interface.
  4. Only the Monitor knows about the GUI, whose "windows" are shared among the several monitors.


As the interface is implemented in C++, the standard stream library is used to decode the messages. The main fields are extracted for each message, followed by the additional information that depends on the message action.


Extract of monitor.cpp
QString m = watcher->pop_pool();
std::istringstream stream(m.toStdString());
int message;
double time_in, time_out;

stream >> message >> time_in >> time_out;

switch(message)
{
        /* ... */

    case MESSAGE_Bsend:
        // adds to call counts
        statWidget->addTo(proc, N_Bsend); 
        // add time info
        statWidget->addTo(proc, T_Bsend, time_out-time_in); 
        break;

        /* ... */
}

Conclusion


The Profiler-Interface communication is done on two levels. The first one is the actual communication, done through MPI. This requires a port opening and publishing mechanism, and the user has to give the published port as an input to the interface.

But the communication is also about what information is sent. This is generated by each overloaded MPI function, and is analysed on the interface side by a Monitor object.

Decoupling the communication on these two levels allows an abstraction of actually sending and analysing the information.

Saturday, 16 April 2011

Original Requirements

Requirements


The requirements are the basis for organising the development of the project. The main problems of beginners - according to David Henty - are:

  1. Bad communication resulting in a dead-lock.
  2. Sending wrong data.
  3. Not waiting for asynchronous requests.



Example of state view.

In order to solve the 1st one, a synchronised view of the communication is needed, providing the state of each process (for example: waiting for process 1: Recv) and the pattern of already executed messages. This is close to what Vampir provides, but displayed while the program is running rather than afterwards.



Example of a 2D array view.

Detecting wrong data requires the software to know which arrays are important to the user. A mechanism to register data is therefore needed - to simplify both development and readability. The user would be able to highlight operations on a given array, and see graphical information when part of it is used.



Deliverables


Two components will be delivered. The first is a library (mpi_wrapper) that acts as a profiler; it uses the MPI profiling interface and gathers information on the executing program. The second is a piece of software that displays the information gathered by the profiler. The profiler and the display therefore need to communicate, which will introduce some slowdown in the original MPI program.


Functional requirements


  1. Communication profiling
    1. point to point: communication from a processor to another.
    2. global: use of general communication routines.
    3. using different communicators: registering and selecting the communicators to profile.
    4. communication time: display the time when the communication occurred and the duration of the operation.
    5. step by step view: providing a blocking synchronized view of the communication, showing step by step what is going on for each processor.
    6. display communication with graphical “animation”: display the occurring communication with simple animation, using the step by step view.
    7. generate a log file: either using standard log formats (Vampir’s or Scalasca’s) or a dedicated one.
  2. Data view
    1. register an array: see information regarding an array when used in communications by registering it to the profiler.
    2. display graphical view of registered data: when used during communication, display which parts of the data are transferred.
    3. recognise derived data types: in order to display the graphical view using simple data types first (vectors, subarrays) or more complex ones later.

Non-functional requirements


The project will, as much as possible, create a tool that is transparent for the user (only adding a header and a few compiler options to use the profiler); it will therefore avoid adding extra function calls. Some of the functionalities nonetheless need explicit calls and have to add new functions, like the data view, which needs an explicit list of the arrays to look at. The aim is to have as few extra calls as possible, so that the user code works both with and without the library very easily.


The project is driven by a "teaching tool" goal. The development will therefore focus on a solid backbone library usable for potential future development, rather than on a Swiss Army knife of partially implemented functionalities. If the project leads to a well-developed tool that provides interesting features, it will be published on the Internet and the code will be released under an open source licence.


The project does not aim to be a high-performance tool. Analysing and displaying real-time information will obviously introduce a delay. The tool will nonetheless try to be efficient in memory usage: it is important that it is reliable and does not need an enormous amount of RAM to work.

Thursday, 14 April 2011

Meeting 5 [21/03] & Existing Approaches

The 5th meeting focussed only on the report and presentation, which were due at the end of March. This note will therefore mainly focus on the report itself.

Existing approaches


The report discusses the goals of the project and especially how they can be fulfilled. The first step was obviously to find existing software that is known to be in use and that addresses some of these goals.

Vampir


Vampir is a tool used to display information about communication patterns. It creates a log file to store the information and, with an external program run after the execution of the MPI code, shows several useful views of the latencies on the network and of possible communication problems (like late sender or late receiver patterns).

Vampir is activated on Ness by loading a module that uses another compiler. As mpicc does for MPI, it most likely adds extra include and library paths to the usual compiler.

Even though the file format is available, the actual software is not free to use.

Vampir Official website

Scalasca


Scalasca is another tool that provides analysis of an MPI code. It is problem based, meaning that it tries to spot possible slowdowns in the program and highlights them. In order to do so, it analyses the communication patterns as well as the actual data patterns used. It supports hybrid development (MPI & OpenMP for example).

Scalasca is free to use, but the actual software is copyrighted. It is by nature quite a complicated tool that gives a lot of detail on a running code.

Scalasca Official Website

XMPI


XMPI is a legacy tool that was used with the LAM/MPI implementation (now part of OpenMPI). It provides statistical information about a running MPI-1 program, as well as a real-time view (snapshot) of the processes (waiting state, current messages in queue, etc.). However, it is no longer supported (the last update was in March 2008) and only ever worked with the LAM/MPI implementation.

XMPI Official website

Motivation


Scalasca and Vampir are the two most widely used tools on HPC systems to analyse an MPI code. But both of them provide an after-execution analysis of the program. They are used to tune and improve the performance of a working code. XMPI is the only tool that might help to know the current state of a running program, with its snapshot view, but it is no longer supported.

The goal of this project is to develop a tool for beginners, helping them understand why a given code is working or not. The aim is therefore not a deep analysis of the code, and the performance of the code is not an issue. This project should provide a simple library and GUI to be used by beginners in MPI development. It will help to illustrate possible mistakes and provide a simple tool to display information about a running code (in real time).


To summarise, this project aims to generate a global view of the program, as it is executing, to help understand how it works - or does not work. The result sits between a parallel debugger (as a real-time view of the program's actions is displayed) and a profiling tool (with the information about the on-going communications).