Friday 24 June 2011

Meetings 7, 8 and 9 [01/06, 13/06 and 20/06]

Meeting 7 [01/06]


During this meeting the project was registered on SourceForge.


The meeting was very short, and the goal for the coming week was to develop the user-synchronised communication: literally waiting for the user to click a continue button before the MPI operation proceeds.


Meeting 8 [13/06]


During the 8th meeting the result of the user-synchronised communication was presented. Some improvements were nonetheless needed. Each message between the interface and the profiler is a formatted string, chosen because it makes adding new elements simple. It also gives the advantage of variable length, as MPI_Probe can be used on the receiver side to determine the size of the message and avoid sending large messages when not required (messages registering a communicator and messages describing a Ssend are not the same length, for instance, as they do not carry the same information).
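As an illustration only (the real interface code is in C++ and is not shown in this post), a variable-length receive of such a formatted string could look like the following sketch, reusing the INTRA_MESSAGE_MPITYPE and INTERCOMM_TAG definitions from intra_comm.h (see the 20 May post):

#include <mpi.h>
#include <stdlib.h>
#include "intra_comm.h"

char* receive_message(MPI_Comm intercomm, int source)
{
  MPI_Status status;
  int length;
  char* message;

  /* block until a message is pending, without actually receiving it */
  MPI_Probe(source, INTERCOMM_TAG, intercomm, &status);

  /* ask how many characters the pending message contains */
  MPI_Get_count(&status, INTRA_MESSAGE_MPITYPE, &length);

  message = malloc(length);
  MPI_Recv(message, length, INTRA_MESSAGE_MPITYPE, source, INTERCOMM_TAG,
           intercomm, MPI_STATUS_IGNORE);

  return message; /* the caller frees the string */
}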


Meeting 9 [20/06]


During the week the code was cleaned, mostly reorganising the Interface in order to cope with user-waiting communications. Registering a communicator was also completed, so the profiler can cope with adding and removing communicators easily. Nonetheless the attribute used to detect whether a communicator was previously registered hasn't been introduced yet.

Some improvements were proposed by David on the way the communications are displayed. Currently a ball moves from the waiting processor to the awaited one (A does a Ssend to B, the ball moves from A to B). But displaying a fixed image seems a better idea, as a moving ball implies a message in transit when there is none.

An error was also found when using the Compute Pi example, as the wildcard MPI_ANY_SOURCE was used as the source and the Interface couldn't interpret it.

The next step is to develop the basis of the array registration. With that done, each main requirement will have a working backbone that can then be developed further by taking a specific example. In order to register memory, nonetheless, some development has to be made on both the profiler and the interface, as currently there is no way of correctly finding out whether a piece of memory was registered, nor of displaying it on the interface.

Tuesday 7 June 2011

Handling the MPI_Wait calls

Asynchronous communication


What is it?


Asynchronous communication is used in MPI programmes for non-blocking operations. These routines are usually used to avoid deadlocks and to ensure the communications behave correctly. Each asynchronous MPI call returns an MPI_Request object that is later used to ensure that the communication completed.

It is composed of 2 steps:

  • doing the asynchronous communication
  • waiting for the request

Taking a simple message in a ring example, the code could be:


#include <mpi.h>
#include <stdio.h>

#define TURNS 10

int main(int argc, char** argv)
{
  int i;
  int rank, size, left, right;
  int mess1, mess2;
  MPI_Request leftReq;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  left = (rank+1)%size;
  right = (rank-1+size)%size;
  mess1 = rank;

  for ( i = 0 ; i < TURNS ; i++ )
    {
      // non blocking send to left
      MPI_Issend(&mess1, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &leftReq);
      fprintf(stderr, "%d: sending %d to %d\n", rank, mess1, left);

      // blocking receive from right
      MPI_Recv(&mess2, 1, MPI_INT, right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      fprintf(stderr, "%d: receives %d from %d\n", rank, mess2, right);

      // wait for unfinished request
      MPI_Wait(&leftReq, MPI_STATUS_IGNORE);
    }

  MPI_Finalize();

  return 0;
}

On the profiler side the problem comes from knowing, when MPI_Wait is called, whether the current request belongs to one of the registered communicators (see previous note). From the MPI_Wait call alone there is no way of telling what the original call was, to which process it was addressed, or what data was actually sent.


Finding what is waited for


An easy way to recover information at an MPI_Wait is to save the details of every asynchronous call and, when a wait is issued, look them up. Using the MPI_Request as an identifier (it has to be unique anyway for the MPI implementation itself to know what is being waited for), the data is stored in a linked list that is part of the Register_Comm structure. The code therefore looks like this:


int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
{
  int ret;
  Register_Comm* commInfo = NULL;
 
  commInfo = Comm_register_search(comm);

  ret = PMPI_Issend(buf, count, datatype, dest, tag, comm, request);

  if ( commInfo )
    {
      /* send information to the Interface */

      addRequest(commInfo, request, dest, MESSAGE_Issend);
    }

  return ret;
}

int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
  int ret;
  Request_Info* info = NULL;

  // search in all registered communicators...
  info = General_searchRequest(request);
    
  ret = PMPI_Wait(request, status);

  if ( info != NULL )
    {
      /* send information to the Interface */

      freeRequest(info);
    }

  return ret;
}

Internals improvements


The data are stored as doubly linked list cells, in order to have easy removal of elements (requests may not be waited for in the same order they were generated). The current implementation is obviously not very fast, as the time to find a request is proportional to the number of requests per registered communicator (every communicator is searched).
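The exact Request_Info and Register_Comm definitions aren't reproduced in this post; as a rough sketch (the field names below are assumptions), the cells and their removal look like:

#include <mpi.h>
#include <stdlib.h>

/* sketch only: field names are assumptions, not the actual definition */
typedef struct Request_Info {
  MPI_Request* request;        /* used as the unique identifier */
  int dest;                    /* destination given to the asynchronous call */
  int type;                    /* MESSAGE_Issend, MESSAGE_Irecv, ... */
  struct Request_Info* prev;
  struct Request_Info* next;
} Request_Info;

/* removal is O(1) once the cell has been found, whatever the waiting order */
static void removeRequest(Request_Info** head, Request_Info* cell)
{
  if ( cell->prev )
    cell->prev->next = cell->next;
  else
    *head = cell->next;

  if ( cell->next )
    cell->next->prev = cell->prev;

  free(cell);
}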

An easy improvement would be to attach to the request an indication of which communicator it belongs to (in the spirit of the attribute mechanism MPI provides for communicators), reducing the search time when several communicators or asynchronous calls are registered.

So far no information is given to the interface if a request isn't waited for, but a registered communicator cannot be deleted while there are still pending requests. The only way to notice is to see that the number of asynchronous calls differs from the number of waits. This may change in the future: a message should be sent to the Interface when a registered communicator is destroyed (and hence all its pending requests as well) or when MPI_Finalize is called while there are still requests in the list.

MPI_Waitall, MPI_Waitany and MPI_Waitsome aren't supported yet; tests have to be performed to see whether they individually call MPI_Wait, but it is more likely that they call some common internal function directly.

Friday 3 June 2011

Register a communicator

Why registering a communicator?


Registering a communicator was originally an idea developed to allow the user to filter communication occurring on a specific MPI communicator rather than on all of them. But during testing it quickly appeared that the MPI profiling interface is used for every call to the library, meaning that when MPI_Ssend is redefined, for example, even for a simple program (like the message in a ring: 40 sends, 40 receives, 40 waits) the count of messages was enormous (certainly internal messages).

It therefore became important to filter the messages to register in the Interface and, in order to avoid network congestion, to do so on the Profiler side. The first way to reduce this number of messages is very simple: check for MPI_COMM_WORLD, the basic communicator. But it doesn't give the user any choice about whether a communicator is monitored or not.


int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  int ret;

  ret = PMPI_Ssend(buf, count, datatype, dest, tag, comm);

  if ( comm == MPI_COMM_WORLD )
    {
      /* send information to the Interface */
    }

  return ret;
}

Registering Communicators


Registering from the user side


A mechanism has to be created to allow the user to register and unregister communicators at will. Only a few functionalities are visible from the user side:

  • is a communicator registered?
  • register a communicator
  • unregister a communicator

Resulting in the following functions:

int Comm_is_registered(MPI_Comm comm);
int Comm_register(MPI_Comm comm, char* name);
int Comm_unregister(MPI_Comm comm);

When a user registers a communicator, a name is given, and this name is displayed in the Interface as the communicator name. By default MPI_COMM_WORLD is registered when the application starts, but a future option may allow this to be disabled.

Each communicator will have a unique identifier (an unsigned integer) that will be sent to the Interface. This identifier will be included in every message sent to the Interface, allowing the GUI to sort information by communicator when needed.
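As a sketch of the intended internals (the list head, cell fields and counter below are assumptions), Comm_register could assign the identifier like this:

#include <mpi.h>
#include <stdlib.h>

static Register_Comm* registered = NULL;   /* head of the list (assumed) */
static unsigned int next_id = 0;

int Comm_register(MPI_Comm comm, char* name)
{
  Register_Comm* cell;

  if ( Comm_register_search(comm) != NULL )
    return SUCCESS;                        /* already registered */

  cell = malloc(sizeof(Register_Comm));
  cell->comm = comm;
  cell->name = name;
  cell->id   = next_id++;                  /* unique identifier sent to the Interface */
  cell->next = registered;
  registered = cell;

  /* a registration message (id + name) is then sent to the Interface */
  return SUCCESS;
}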


Registering: use in the profiling interface


The actual registration is done via a linked list. When the user registers a communicator, the list is searched for it and, if it is not present, the communicator is added. When an MPI call is processed, the list is searched as well, and if the communicator isn't present, no information is sent to the Interface. The MPI profiling function now looks as follows.


int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
{
  int ret;
  Register_Comm* commInfo = NULL;

  commInfo =  Comm_register_search(comm);

  ret = PMPI_Ssend(buf, count, datatype, dest, tag, comm);

  if ( commInfo )
    {
        /* send information to the Interface */
    }

  return ret;
}

Registering a communicator: internals



Communicator cells struct diagram

Registering a communicator looks relatively easy. The class diagram is simple, even though it is a doubly linked list (linking previous and next cells).


The MPI_Comm datatype can be treated as a simple int or long int, as it internally represents the address of a communicator. In OpenMPI it is a pointer to a structure; in MPICH2 it is defined as an int. Therefore a comparison like the following works.

int compare(MPI_Comm a, MPI_Comm b)
{
   return a == b;
}


Using MPI_Comm_compare


But comparing the datatype itself isn't very portable. The goal of this tool is to work with the MPI standard, not to rely on tricks that vary from one implementation to another. A function exists to compare two communicators: MPI_Comm_compare. According to the standard it returns:

  • MPI_IDENT results if and only if comm1 and comm2 are handles for the same object (identical groups and same contexts).
  • MPI_CONGRUENT results if the underlying groups are identical in constituents and rank order; these communicators differ only by context.
  • MPI_SIMILAR results if the group members of both communicators are the same but the rank order differs.
  • MPI_UNEQUAL results otherwise.


while ( curr != NULL )
  {
    MPI_Comm_compare(curr->comm, comm, &res);

    switch(res)
      {
      case MPI_IDENT:
        fprintf(stderr, "got MPI_IDENT\n");
        break;
      case MPI_CONGRUENT:
        fprintf(stderr, "got MPI_CONGRUENT\n");
        break;
      case MPI_SIMILAR:
        fprintf(stderr, "got MPI_SIMILAR\n");
        break;
      case MPI_UNEQUAL:
        fprintf(stderr, "got MPI_UNEQUAL\n");
        break;
      }

    curr = curr->next;
  }


Using Communicators attributes


To go further, the user should be able to register a communicator for some time and then unregister it. But with just a list of currently registered communicators, if a previously unregistered communicator is registered again it will be treated as a brand new one.


This behaviour can be avoided. Firstly, a list of "communicators registered in the past" could be created, and each time a communicator is registered a search is performed. This isn't a bad option, as usually few communicators are used, but it isn't very interesting in terms of performance. Alternatively, the MPI standard defines attributes that can be attached to an object. In that case an attribute could be created when registering the communicator (storing its unique ID, for instance). This attribute could be looked up just before the insertion in the list and, if it exists, the unique ID retrieved from it.
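As a sketch of that second option, under the assumption that the profiler keeps a keyval created at start-up, the standard attribute calls would be used like this:

#include <mpi.h>
#include <stdlib.h>

static int id_keyval = MPI_KEYVAL_INVALID;

/* done once, e.g. in the MPI_Init wrapper */
static void create_keyval(void)
{
  MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN,
                         &id_keyval, NULL);
}

/* when the communicator is registered: store its unique ID on it */
static void tag_communicator(MPI_Comm comm, unsigned int unique_id)
{
  unsigned int* stored = malloc(sizeof(*stored));
  *stored = unique_id;
  MPI_Comm_set_attr(comm, id_keyval, stored);
}

/* just before inserting in the list: was it registered before? */
static int previous_id(MPI_Comm comm, unsigned int* unique_id)
{
  int flag;
  unsigned int* stored;

  MPI_Comm_get_attr(comm, id_keyval, &stored, &flag);
  if ( flag )
    *unique_id = *stored;

  return flag;
}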


Conclusion and limitations


The simple mechanism (comparing as int) is effectively used to search through the registered communicators and to add/remove them. But a more portable method will be used in the future, in order to comply with the standard rather than adapting to each implementation. The actual "retrieval" of information from a deleted communicator isn't implemented yet; it will certainly be useful for watching a particular moment of a code rather than the whole MPI program, without creating a lot of communicators in the Interface.

Monday 30 May 2011

Meeting 6 [25/05] and how to use the tool

The meeting


Even though a month went by, few new functionalities were added to the project, mainly because it was the revision and examination period. During that period effort went into keeping the project going through design and research. Some ideas on how to implement the synchronisation and possible functionalities were discussed (like the communicator registration).

The next step is to generate documentation for the project, as a proper webpage (and because some parts of the code aren't much commented, as they were evolving quickly). Effort has to be made to keep the project's folders organised. And some implementation will be carried on:

  • communicator registration, to organise the MPI_Request saving
  • user-synchronised communication (wait for the user to click continue before actually performing the MPI function)


The basis for both implementations was already in place (MPI_Requests are saved into a linked list, and the backbone of the synchronisation is implemented but not yet functional).

Using the tool


This tool aims to help people learning MPI behaviour. The sources have therefore been opened to the "public". The first attempt was on an internal machine, Ness, which didn't work correctly. Therefore the project will be registered on SourceForge, as planned, but earlier than expected.


This part explains how to use the current version of the code, and shouldn't change much in future releases.

The project is composed of a library (the profiler) and an executable (the interface). The project should be organised into folders, one per deliverable, and should include tests. A general Makefile should be available to compile each deliverable, and a configure script may be available to automate the variable generation (installation path, MPI flags taken from the MPI compilers, Qt path, ...).

Compiling the profiler


Compiling the profiler requires:

  • a C MPI implementation
  • a C compiler
The profiler is available in both static and dynamic linking formats, as only the linking stage changes. It is important for the user to be able to choose one or the other, as it appears some MPI installations only accept one type of library with the MPI profiling interface.


Either mode can be compiled and installed, but note that if both are installed, dynamic linking appears to be used by default.


Running make static or make dynamic should compile and install the library, by default in a local install folder composed of the classical lib and includes folders.


Compiling the interface


Compiling the interface requires:

  • Qt 4.6 or later (note that Qt 4.7 was used, but none of the functionalities used were introduced in that release).
  • a C++ MPI implementation that supports multi-threading (see a previous note).
  • a C++ compiler
  • the headers from the profiler
The interface should be compilable from the main Makefile. A typical Qt project needs a project file, which is used to generate the Makefile that compiles it. Normally this process should be automatic, as the main Makefile should take care of it. If a configure script is available it should handle the variable generation; otherwise some variables need to be set up:

  • INSTALL_ROOT should contains the path to the installation folder (default: ../install as it is relative to the interface folder where it is built).
  • MPI_INCLUDE should contain the path to the MPI headers. It can be retrieved by using mpicc -showme and is generally like -I/usr/local/include. However the -I should be REMOVED from the project option as QMake will generate it automatically.
  • MPI_LINK should contain the linking options given by mpicc -showme and is generally like -pthread -L/usr/local/lib -lmpi_cxx -lmpi -lopen-rte -lopen-pal -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl.
  • MPI_EXTRA_FLAGS should be set up to -DMPICH_IGNORE_CXX_SEEK when using MPICH2 to avoid conflict with standard C++ file handling.

When the project file is done and named mpidisplay.pro, running make display should take care of the two compilation steps and of the installation. The individual steps are:
  • Generating the Makefile with qmake mpidisplay.pro. You can specify the previously stated variables on the command line or in the file itself (example: qmake mpidisplay.pro INSTALL_ROOT=../install).
  • Compiling the executable with the generated Makefile: make -f Makefile.qt
  • Installing the executable is done by calling make -f Makefile.qt install


Using a MPI program with the library


Compiling


In order to compile your MPI program with the profiler, you need to know where the profiler library is installed. Let's assume ~/local/, meaning that the library is in ~/local/lib and the headers in ~/local/includes. The location of the mpidisplay interface isn't important yet, but it is certainly in ~/local/bin.


Note that to compile, even against the dynamic library, you do not need LD_LIBRARY_PATH to be updated, but you will need it to run the software later. You don't need to update the variable if you use static linking, as the library is completely embedded in your executable. To set the path simply execute export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/local/lib in your shell, or add it to your .bashrc file.


Compiling your program is done exactly the same way as any other MPI program with an additional library. You need to add the headers path to the compiler flags and the library location and name to the linking flags. In a generic Makefile this is done by adding CFLAGS+=-I~/local/includes and LDFLAGS+=-L~/local/lib -lmpi_wrap.


The only source modification is then to add to your code files:

#include <mpi_wrap.h>
In theory you can even remove the flags if you don't want to use the library, but be aware that the compiler will then complain about not finding "mpi_wrap.h". Therefore you can define a preprocessor macro WITH_MPIWRAP by adding CFLAGS+=-I~/local/includes -DWITH_MPIWRAP and writing your include as
#ifdef WITH_MPIWRAP
#include <mpi_wrap.h>
#endif


Running


As stated before, your LD_LIBRARY_PATH should be updated if you are using dynamic linking. Otherwise simply run your MPI program as usual. Assuming your program executable is called ring, you would usually run mpiexec -n 4 ring to start 4 MPI processes. With the library it is exactly the same!

The profiler library will write the port on the standard output by default. But you can add command line arguments to define another way:

  • Standard output with --port-in-stdout
  • Standard error output with --port-in-stderr
  • A text file with --port-in-file file


Start the interface



Starting message box of the interface (GNU/Linux Gnome 3)

In order to start the mpidisplay interface you need to add its location to the PATH with the same technique as for LD_LIBRARY_PATH: export PATH=$PATH:~/local/bin. Then simply run mpidisplay to see the connection window. If you printed the ports on the standard output, move to the "manual" tab and write the ports in the fields. You can change the number of processors in the list, and the order of the ports does not matter. If you used a text file, click the button and select it; the port information will be loaded into the text edit area underneath in case you need to edit it (and the number of processors should be updated).

Then simply click OK to start the interface.


For the moment the interface shows two main pieces of information: the number of calls to a sample of MPI routines, and the time spent in them.



Friday 20 May 2011

Profiler-Interface communication

Remember


A reminder for the reader: the current code works as a client-server communication. The server part, the MPI profiler, sends information to the client, the interface, through an MPI interconnection established with MPI_Open_port.


In order to do so a protocol has to be defined for the messages.


Client-Server organisation


Connection



The Profiler-Interface organisation (4 MPI processes)

The connection between the profiler and the interface is done via the MPI_Open_port function, which opens a port and returns its address. Each MPI process of the profiler publishes its own port, and therefore the interface has to connect to every single one of them. That actually means there are n servers and 1 client. This is unusual for a client-server model, where normally a single server delivers information to several clients. Nonetheless the profiler processes are the servers, as they are the ones publishing an accessible address.


With OpenMPI


OpenMPI provides an address that looks like:


117112832.0;tcp://192.168.1.71:36441+117112833.0;tcp://192.168.1.71:42986:300


An early attempt to guess the port string was unsuccessful. Some parts change from one process to another with no logic other than resource availability (the TCP port, for example).


But a problem appeared when connecting on the interface side. As the previous diagram shows, several profilers connect to a single interface process. What is not shown is that the interface in fact runs one thread per profiler to deal with the communication; the data is then centralised in a single GUI. This is a typical hybrid MPI programming approach. Therefore the interface has to initialise the MPI environment with MPI_Init_thread (rather than MPI_Init) and request MPI_THREAD_MULTIPLE support. By default OpenMPI doesn't provide such support.
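For reference, the request looks like the following C sketch (the interface itself does the equivalent through the C++ bindings):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv)
{
  int provided;

  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  if ( provided < MPI_THREAD_MULTIPLE )
    {
      fprintf(stderr, "MPI_THREAD_MULTIPLE not supported (level %d given)\n", provided);
      MPI_Abort(MPI_COMM_WORLD, -1);
    }

  /* ... start one communication thread per profiler process ... */

  MPI_Finalize();
  return 0;
}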


The solution is rather simple: recompile OpenMPI with the threading support:

./configure --enable-mpi-threads


With MPICH-2


The Ness machine provides an already-installed MPICH-2 implementation. For some reason it didn't support dynamic linking, but static linking is fine. For this implementation the port string looks like:

tag=0 port=52970 description=ness.epcc.ed.ac.uk ifname=129.215.175.1

That is radically different from the OpenMPI one, showing once more that guessing the port isn't a workable idea.

As MPICH-2 is installed on Ness with multi-threading support already enabled, the corresponding configuration option isn't known yet.


Retrieving the port


The profiler opens and publishes a port. The user then has to read the ports and give them as input to the interface. In order to give as much freedom as possible to the user, several ways of doing this are available:

  • Printing the port to the standard output stream
  • Printing the port to the standard error output stream
  • Writing the ports into a defined file
This is configured when MPI_Init is called in the MPI code, simply by providing command line arguments to mpiexec. The available arguments can be listed with (ring is the executable name):

$> ./ring --help
Profiler of an MPI program
Use a MPI visualisation GUI to see information

Possible options:
--port-in-stdout [default]
   write the port into the standard output
--port-in-stderr
   write the port into the standard error output
--port-in-file file
   write the port into the file using MPI-I/O

Note that only the last given option is used

--help
   display that help

To use the file writing functionality, simply start your program like:
$> mpiexec -n 4 ring --port-in-file port.txt

Note: so far, adding the options manually as a 2D array of char doesn't work, and no further investigation has been made to make it work.
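A sketch of how the MPI_Init wrapper could scan these options (the parsing below is an assumption; only the option names, the PortType enum and the start_child() prototype come from child_comm.h shown further down):

#include <mpi.h>
#include <string.h>
#include "child_comm.h"

int MPI_Init(int* argc, char*** argv)
{
  int i, ret, world_size;
  PortType port_type = STDOUT;          /* --port-in-stdout is the default */
  char* file = NULL;

  for ( i = 1 ; i < *argc ; i++ )
    {
      if ( strcmp((*argv)[i], "--port-in-stderr") == 0 )
        port_type = STDERR;
      else if ( strcmp((*argv)[i], "--port-in-file") == 0 && i+1 < *argc )
        {
          port_type = INFILE;
          file = (*argv)[i+1];
        }
    }

  ret = PMPI_Init(argc, argv);
  PMPI_Comm_size(MPI_COMM_WORLD, &world_size);

  start_child(world_size, port_type, file);

  return ret;
}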


Writing each process' port in a single file


In order to write each process' port into a single file, the MPI I/O functions are used. The standard defines several ways of doing so; here a simple subarray is defined with the size of the port string as the base length. Since the data written are characters and each port string is terminated by a newline, one line is created per port. The interface can therefore read the file line by line to find every port and deduce the number of started processes.


Extract of child_comm.c
if ( port == INFILE )
    {
      MPI_Datatype subarray;
      MPI_File file_ptr;
      int smallarray, bigarray, stride;

      smallarray = (strlen(port_name)+1);
      bigarray = world_size*smallarray;
      stride = world_rank*smallarray;

      fprintf(stderr, "!profiler(%d)! will write his port in '%s'\n", world_rank, file);

      MPI_Type_create_subarray(1, &bigarray, &smallarray, &stride, MPI_ORDER_C, INTRA_MESSAGE_MPITYPE, &subarray);
      MPI_Type_commit(&subarray);

      if ( MPI_File_open(MPI_COMM_WORLD, file, MPI_MODE_WRONLY|MPI_MODE_CREATE, MPI_INFO_NULL, &file_ptr) != MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler(%d)! failed to open file '%s'. ABORTING\n", world_rank, file);
          MPI_Abort(MPI_COMM_WORLD, -1);
        }

      if ( MPI_File_set_view(file_ptr, 0, INTRA_MESSAGE_MPITYPE, subarray, "native", MPI_INFO_NULL) != MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler(%d)! failed to set the file view! ABORTING\n", world_rank);
          MPI_Abort(MPI_COMM_WORLD, -1);
        }

      if ( MPI_File_write_all(file_ptr, strcat(port_name, "\n"), smallarray, INTRA_MESSAGE_MPITYPE, MPI_STATUS_IGNORE) != MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler(%d)! failed to write '%s'. ABORTING\n", world_rank, file);
          MPI_Abort(MPI_COMM_WORLD, -1);
        }

      MPI_File_close(&file_ptr);
    }

Communication


The profiler side


As far as the profiler is concerned, the communication with the interface could be either synchronous or asynchronous. The current implementation uses MPI_Ssend as the simple choice, but a later version could use an asynchronous call and wait for it before the next one is issued, or even handle a whole list of requests to wait for.

The profiler uses internal functions defined in child_comm.h to communicate with the interface.


child_comm.h
#ifndef CHILDCOMM
#define CHILDCOMM

#include "intra_comm.h"

extern int world_rank;
extern double global_time;

typedef enum PortType { STDOUT, STDERR, INFILE } PortType;

int start_child(int world_size, PortType port_type, char* file);
int alive_child();
int sendto_child(Intra_message* message);
int wait_child(double time_in);

#endif // CHILDCOMM

intra_comm.h
#ifndef INTRA_COMM
#define INTRA_COMM

#define INTRA_MESSAGE_SIZE 64
typedef char Intra_message;

#define INTERCOMM_TAG 0

#define PROFNAME "!profiler!"

#ifdef __cplusplus
#define INTRA_MESSAGE_MPITYPE MPI::CHAR
#else
#define INTRA_MESSAGE_MPITYPE MPI_CHAR
#endif

/*
 * ACTIONS
 */

typedef enum Message { MESSAGE_INIT,
               MESSAGE_Ssend,
               MESSAGE_Bsend,
               MESSAGE_Issend,
               MESSAGE_Recv,
               MESSAGE_Irecv,
               MESSAGE_Wait,
               MESSAGE_QUIT } Message;

#endif // INTRA_COMM

The function names are self-explanatory, and the intra_comm.h header defines the actual protocol information: it is therefore used by both the profiler and the interface. The actual sending is done with character strings, typedef'd as Intra_message. As the interface is coded in C++, INTRA_MESSAGE_MPITYPE is defined using both the C and the C++ MPI standard definitions.


The message is composed of several fields, all separated by a space, and always includes the main fields:

  • action::enum Message the occurring action
  • time in::double the Unix time when entering the MPI function
  • time out::double the Unix time when returning the MPI function
But each Message has its own information to add as well, after the main fields. For example an MPI_Ssend also encapsulates:
  • communicator::unsigned int the communicator unique number - not implemented yet
  • destination::int the destination process
And more information can be added as needed. Each MPI function defines its own optional fields in its own call to sendto_child().

The information is written using standard C I/O calls:

sprintf(message, "%d %lf %lf %d\0", MESSAGE_Ssend, time_in, time_out, dest);


The interface side



Starting message box of the interface (GNU/Linux Gnome 3)

On the interface side the profilers' ports can be given either manually or by reading the file written as explained before. Once this is done, one thread per process is started whose duty is to communicate with its profiler (the corresponding object is therefore called MPIWatch). The MPIWatch object is only responsible for receiving (and sending) information from the profiler, so each of them is attached to a Monitor object, which is responsible for analysing the messages. To communicate, the MPIWatch pushes arriving messages onto a stack and signals to the Monitor that new messages are available. The Monitor then analyses the messages and displays the information in the appropriate places.


The MPIWatch - Monitor pairing was chosen for logical reasons:

  1. Only the MPIWatch is actually aware of the MPI functions needed to send and receive information to and from the profiler. If another system is used in the future, only this class has to be changed.
  2. Only the MPIWatch needs a separate thread; dealing with the message contents is done on the main thread.
  3. Only the Monitor knows what a message contains. New protocol functionalities do not affect the way data is transferred between profiler and interface.
  4. Only the Monitor knows about the GUI widgets, which are "windows" shared among the several monitors.


As the interface is implemented in C++, the standard stream library is used to decode the messages. The main fields are extracted from each message and then, according to the message action, each additional field.


Extract of monitor.cpp
QString m = watcher->pop_pool();
std::istringstream stream(m.toStdString());
int message;
double time_in, time_out;

stream >> message >> time_in >> time_out;

switch(message)
{
        /* ... */

    case MESSAGE_Bsend:
        // adds to call counts
        statWidget->addTo(proc, N_Bsend); 
        // add time info
        statWidget->addTo(proc, T_Bsend, time_out-time_in); 
        break;

        /* ... */
}

Conclusion


The Profiler-Interface communication is done on two levels. The first one is the actual communication, done through MPI. This requires a port opening and publishing mechanism, whose output the user has to pass on to the interface.

But the communication is also about what information is sent. This is generated by each overloaded MPI function and is analysed on the interface side by a Monitor object.

Decoupling the communication on these two levels allows an abstraction of actually sending and analysing the information.

Saturday 16 April 2011

Original Requirements

Requirements


The requirements are the basis for organising the development of the project. The main problems of beginners, according to David Henty, are:

  1. Bad communication resulting in a dead-lock.
  2. Sending wrong data.
  3. Not waiting for asynchronous requests.



Example of state view.

In order to solve the first one, a synchronised view of the communication is needed, providing the state of each process (for example: waiting for process 1: Recv) and the pattern of already executed messages. This is close to what Vampir provides, but displayed as the program is running rather than afterwards.



Example of a 2D array view.

Detecting that the wrong data is sent requires the software to know about the arrays that matter to the user. A mechanism to register data is therefore needed, both to simplify development and to improve readability. The user would be able to highlight operations on a given array and display graphical information when part of it is used.
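A purely hypothetical sketch of what such a registration call could look like (none of these names exist in the project yet):

#include <mpi.h>

/* register a 2D array so the profiler can report which parts are sent */
int Array_register(void* data, int rows, int cols,
                   MPI_Datatype type, const char* name);

/* stop watching it */
int Array_unregister(void* data);

/* possible use in the user code:
 *   double field[NX][NY];
 *   Array_register(field, NX, NY, MPI_DOUBLE, "field");
 */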



Deliverables


Two components will be delivered. One is a library (mpi_wrapper) that acts as a profiler; this part uses the MPI profiling interface and gathers information from the executing program. The other is an application that displays the information gathered by the profiler; communication is therefore needed between profiler and display, which will introduce some slowdown in the original MPI program.


Functional requirements


  1. Communication profiling
    1. point to point: communication from a processor to another.
    2. global: use of general communication routines.
    3. using different communicators: registering and selecting the communicators to profile.
    4. communication time: display the time when the communication occurred and the duration of the operation.
    5. step by step view: providing a blocking synchronized view of the communication, showing step by step what is going on for each processor.
    6. display communication with graphical “animation”: display the occurring communication with simple animation, using the step by step view.
    7. generate a log file: either using standard log formats (Vampir’s or Scalasca’s) or a dedicated one.
  2. Data view
    1. register an array: see information regarding an array when used in communications by registering it to the profiler.
    2. display graphical view of registered data: when used during communication display which part of data are transferred.
    3. recognise derived data types: in order to display the graphical view using simple data types first (vectors, subarrays) or more complex ones later.

Non-functional requirements


The project will, as much as possible, create a transparent tool for the user (adding only a header and a few compiler options to use the profiler); it will therefore avoid adding extra function calls. But some functionalities need explicit calls (like the data view, which needs an explicit list of the arrays to watch) and thus add new functions. The aim is to have as few extra calls as possible, so that the user code works easily both with and without the library.


The project is driven by a "teaching tool" goal. Therefore the development will be focussed on a solid backbone library usable for potential future development rather than providing a swiss-army knife of partially implemented functionalities. If the project leads to a well-developed tool that provides interesting features it will be published on the Internet and the code will be released with an open source licence.


The project is not aimed at being a high-performance tool. Analysing and displaying real-time information will obviously introduce a delay. The tool will nonetheless try to be efficient in memory usage: it is important that it is both reliable and does not need an enormous amount of RAM to work.

Thursday 14 April 2011

Meeting 5 [21/03] & Existing Approaches

The 5th meeting focussed only on the report and presentation, which were due at the end of March. Therefore this note will mainly focus on the report itself.

Existing approaches


The report discusses the goals of the project and especially how it is possible to fulfil them. The first step was obviously to find existing software known to be in use and to address some of these goals.

Vampir


Vampir is a tool used to display information about communication patterns. It creates a log file to store the information and shows, with an external program after the execution of the MPI code, several useful views of the network latencies and of possible communication problems (like late-sender or late-receiver patterns).

The way to activate Vampir on Ness is to load a module that provides another compiler wrapper. As MPI does with mpicc, it presumably adds extra include and linking paths to the classical compiler.

Even though the file format is openly available, the actual software isn't free to use.

Vampir Official website

Scalasca


Scalasca is another tool that provides analysis of an MPI code. It is problem based, meaning it tries to spot possible slowdowns in the program and highlight them. In order to do so, it analyses the communication patterns as well as the actual data patterns used. It supports hybrid development (MPI & OpenMP, for example).

Scalasca is free to use, but the actual software is copyrighted. By definition it is quite a complicated tool that gives a lot of details on a running code.

Scalasca Official Website

XMPI


XMPI is a legacy tool that was used with the LAM/MPI implementation (now part of OpenMPI). It provides statistical information about a running MPI 1 program, but also a real-time view (snapshot) of the processes (waiting state, current messages in the queue, etc.). It is not supported any more (the last update was in March 2008) and only worked with the LAM/MPI implementation.

XMPI Official website

Motivation


Scalasca and Vampir are the two most widely used tools on HPC systems to analyse an MPI code, but both of them provide a post-execution analysis of the program. They are used to tune and improve the performance of a working code. XMPI is the only tool that might help to know the state of a running program at a given moment, with its snapshot view, but it is not supported anymore.

The goal of this project is to develop a tool for beginners, helping them understand why a given code works or not. The aim is therefore not a deep analysis of the code, and the performance of the code is not an issue. This project should provide a simple library and GUI to be used by beginners in MPI development; it will help to illustrate possible mistakes and provide a simple way to display information about a running code (in real time).


To summarise, this project aims to generate a global view of the program as it is executing, to help understand how it works, or does not work. The result sits between a parallel debugger (as a real-time view of the program's actions is displayed) and a profiling tool (with information about the on-going communications).

Wednesday 30 March 2011

The tests programs

During the 1st semester, in the Message Passing Programming lectures, EPCC staff taught us how to use the basic features of MPI and how to avoid some mistakes. For this project some of those examples will be reused to ensure that both the profiler and the display work correctly.

The Message in a Ring



The message in a ring is a simple MPI program where each process sends data to the next one and therefore receives from the previous one. This program was developed to investigate the differences between the several send and receive possibilities proposed by the MPI standard:


  • Asynchronous send ; receive ; wait for the send
  • Asynchronous receive ; send ; wait for receive
  • Use of the special send and receive function

This code can be used as an example of real-time communication, and for checking that communications are properly waited for.

Calculating PI


Calculating PI was an exercise where each processor computes a part of PI and a reduction is then done to add the results together. The goal was to illustrate the possible rounding errors when the sum isn't done in the same order. To avoid them, an array was created on the master processor to store each result and do the sum at the end.

This code can be used to show a very simple case of data registration (a float).

The traffic model


The traffic model simulation is a simple domain decomposition model where each process holds several cells of road. Each cell can be either occupied or empty, and a car moves forward into an empty cell. Some communication therefore has to be done to send cars across neighbouring processes: checking whether the next cell is empty (a very simple 1D halo swap).

The case study and coursework


The idea was to introduce a very simple reverse edge-detection algorithm. The solution is quite compute-intensive, as a smoothing operation has to be applied many times in order to recover the original picture. Therefore a simple domain decomposition was done on one of the dimensions to share the work among processors.

This code involves typical halo swapping and the use of MPI datatypes.

The coursework introduces a 2D domain decomposition, making it more complex to share data with the neighbouring processes.

Saturday 26 March 2011

Meetings 3 and 4 [21/02 and 14/03]

During the month of March came the deadline for the Project Preparation report and presentation. The goal of this module was to justify the research done and to prove the feasibility of the project.


During the 3rd meeting the MPI socket organisation was presented to David, showing a new direction for the communication. The discussion was mainly about the report and what to write in it.

A set of tests from the earlier MPI course was also discussed, including 3 simple codes and 1 more complex one:

  • calculating PI
  • A message in a ring
  • The traffic model
There is also the wish to be able to use the MPI case study and its evolution, the coursework, which is about image processing. All the tests will be discussed later in this blog.

The main features of the software are the communication information (general statistics to find missing calls to MPI_Wait, for example) and the data display showing what data is sent and where it is stored. The last feature is a synchronised view of the communication, equivalent to stepping through a debugger, but at the MPI call level rather than the C or assembly one.


Few new developments were made for the 4th meeting, as the report was due a few days later. Nonetheless some goals were discussed. The project will provide a framework composed of a library (the profiler part) and an executable (the interface). The goal is to provide a real-time global view of the program, aimed at modest MPI programs with few processes. As it is only an informational tool no particular performance is required, but effort will be made to ensure at least sensible memory management.

Sunday 20 February 2011

Using MPI sockets

This article presents how to use MPI to create a remote socket and use it through MPI calls. Remember that we have two communicating parts: the profiler (the library that uses the MPI profiling interface to profile the program) and the display (which displays the information sent by the profiler).

First of all, research was done to try to find out how to create a socket with MPI on the profiler side and communicate with some other socket library on the display side. So far no examples were found using that approach, and as this is a technical test, no real implementation was done that way.

The approach used here is to bind the profiler and display communicators using a technique similar to MPI_Spawn, but one that doesn't require the two programs to be tied together. This is done using the MPI_Open_port functions.

The code wasn't modified much from the MPI Spawn approach, as you are going to see. The reference used to understand and develop this approach was the MPI standard itself: 5.4.6. Client/Server Examples.


The profiler side - server side


The general idea of this approach is for the profiler to open a port and wait for a display to connect to it. The idea can be pushed further, if needed, to allow several displays to connect to a single profiler (sharing the view of the program on several displays, for example).

What was modified from the Spawn example is the way the profiler and the display are connected together. Rather than calling MPI_Spawn, MPI_Open_port is used, and a few lines were added just before finalizing the execution.


Opening the port

int start_child(char* command, char* argv[])
{
  MPI_Open_port(MPI_INFO_NULL, port_name);

  /* child doesn't find it...
  sprintf(published, "%s-%d\0", PROFNAME, world_rank);

  MPI_Publish_name(published, MPI_INFO_NULL, port_name);*/

  fprintf(stderr, "!profiler!(%d) open port '%s'\n", world_rank, port_name);

  fprintf(stderr, "!profiler!(%d) waiting for a child...\n", world_rank);

  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

  fprintf(stderr, "!profiler!(%d) got a child!\n", world_rank);

  int r;
  MPI_Comm_rank(intercomm, &r);
  fprintf(stderr, "!profiler!(%d) is %d on parent!\n", world_rank, r);

  // wait for a message that "I'm dying"
  if ( PMPI_Irecv(&(quitmessage[0]), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, CHILD_RANK, INTERCOMM_TAG, intercomm, &dead_child) != MPI_SUCCESS )
    {
      fprintf(stderr, "!profiler!(%d) communication failed!\n", world_rank);
      intercomm = MPI_COMM_NULL;
      return FAILURE;
    }

  char mess[INTRA_MESSAGE_SIZE];
  sprintf(mess, "%d IsYourFather\0", world_rank);

  sendto_child(mess);

  PMPI_Barrier(MPI_COMM_WORLD);

  return SUCCESS;
}

Finalizing the communication

int wait_child(char* mess)
{
  // send my death
  if ( sendto_child(mess) == SUCCESS )
    {
      // wait his death
      if ( PMPI_Wait(&dead_child, MPI_STATUS_IGNORE) == MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler!(%d) received its child death!\n", world_rank);
          //MPI_Unpublish_name(published, MPI_INFO_NULL, port_name);
          MPI_Close_port(port_name);
          return SUCCESS;
        }
    }

  return FAILURE;
}

The display side - the client side


On the display side, the same kind of modification had to be done. Rather than using information from the parent's communicator, a connection to a port is performed.


The MPIWatch::getWatcher method

MPIWatch* MPIWatch::getWatcher(char port_name[])
{
  if ( instance == 0 )
    {
      MPI::Init();

      std::cout << "Try to connect to " << port_name << std::endl;

      parent = MPI::COMM_WORLD.Connect(port_name, MPI::INFO_NULL, 0);

      if ( parent == MPI::COMM_NULL )
        {
          std::cerr << "Cannot connect with the parent program! Aborting." << std::endl;
          MPI::Finalize();
          return 0;
        }

      std::cout << "Connection with parent completed!" << std::endl;

      instance = new MPIWatch();
    }

  return instance;
}

Running it!

The main difference here is that in the previous version the display started by itself. Now it has to be started separately, and actually one per MPI process. Some attempts were made to use the name publication described in the standard (see the reference further up), but for an unknown reason the display never found the name published by the profiler. So far, one port is opened per MPI process (or one name published), and each display connects to one of them through command line input.

Console 1: run MPI

$> mpiexec -n 2 mpi_ring
!profiler!(0) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'
!profiler!(1) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

Console 2-3: run the display

$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'
$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

The current implementation is a little more complicated to run than the spawn version, but it doesn't end with any error code. It also allows more flexibility in the future: more than one display on a single profiler, and any other idea that requires a more flexible approach than a spawned process (like being able to connect a display in the middle of a run and disconnect at will, to see if the program is deadlocked, etc.).

Limitations

The port strings are rather long, and it is not very user friendly to have to look up the profiler output and copy/paste the port information into the display. Further investigation has to be made on that part, either to track down the name publication problem or to find a more automatic way of looking for the port. The original name publication idea was to publish a name like 'profiler-mpirank' to look up, or any string given by the user instead of 'profiler'. This would allow the display to be started with a single command that only needs two pieces of information: the base name of the profiler and the number of MPI processes to connect to!
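For reference, the intended name publication relies on two standard calls; a sketch of both sides is shown below (the 'profiler-<rank>' naming is the idea described above, not working code from the project):

#include <mpi.h>
#include <stdio.h>

/* profiler side, right after MPI_Open_port */
void publish_port(int world_rank, char* port_name)
{
  char service[MPI_MAX_PORT_NAME];
  sprintf(service, "profiler-%d", world_rank);
  MPI_Publish_name(service, MPI_INFO_NULL, port_name);
}

/* display side: look the name up instead of copy/pasting the port */
void lookup_port(int rank, char* port_name)
{
  char service[MPI_MAX_PORT_NAME];
  sprintf(service, "profiler-%d", rank);
  MPI_Lookup_name(service, MPI_INFO_NULL, port_name);
}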

The other limitation is not a real one, but more of a bug in the current implementation. A barrier was added to wait for every MPI process to get a display; this isn't much of a problem, as no high performance is required for this project. The problem arises when a display is closed while the program is running: the current implementation doesn't catch it and deadlocks. Further investigation will obviously be done on that problem later on.

Source code

As for the previous version, the source code is available at http://www.megaupload.com/?d=ZXJGHBPQ. It is a test version, not very clean, and buggy (as explained above). A later post will explain how to use the library with MPI code in C.

Further work

The preliminary technical overview of the project is almost over. Now that the basic project techniques are set up, more detailed reasoning will be done on the project's functional requirements. As part of the Project Preparation course of the MSc, a risk analysis and a workplan for the overall project have to be produced as well and will also be published here.

Tuesday 15 February 2011

A bit of software engineering

This article only details some changes made to the code in order to have more adaptable test software. It also explains how to use the library with an MPI program in C.


The Project


The project is so far organised around two things: the profiler and the display. The profiler is produced as a library that intercepts some of the MPI calls. The display is an executable that only displays information coming from the profiler.

The current directory architecture reflects that organisation, where the display is actually in a subdirectory of the profiler (the interface one).

When built, 3 folders are created:

  • dynamic containing the library as a .so (or static with the .a)
  • includes which contains the headers to add to the MPI executable you want to use with the profiler (mpi_wrap.h is the one to include; intra_comm.h just defines part of the way the display and profiler communicate, and can be used later on to develop another display)
  • display which, obviously, is the folder where the display executable is stored.

The actual profiler is written in C and therefore uses mpicc (GCC on my machine; no build was really done on Ness, as Qt isn't installed on it for the moment).

The display is implemented in C++ using both Qt and MPI and uses the powerful .pro files to handle compilation.


The profiler


The profiler is organised so far around:

  • mpi_basic.c and mpi_communication.c which implement the MPI functions defined in mpi_wrap.h.
  • child_comm.c, child_comm.h and intra_comm.h which implement the profiler/display communication.

MPI overloading


Only the functions defined in mpi_wrap.h are overloaded, and this is so far the only file that has to be included from the original MPI program. Each of these functions calls into the child_comm module to communicate with the child, and the user doesn't have to bother with it.


The child_comm module


Actually very few types of communication are required with the display. The header is rather small:


child_comm.h

#ifndef CHILDCOMM
#define CHILDCOMM

int start_child(char* command, char* argv[]);
int alive_child();
int sendto_child(char* mess);
int wait_child(char* mess);

#endif // CHILDCOMM

  • start_child starts the child, and is therefore called in MPI_Init()
  • sendto_child sends information to the child; the message size is defined in intra_comm.h
  • wait_child waits for the child's death (i.e. makes sure it received all the information before closing the communication) and is thus called in MPI_Finalize()
  • alive_child() returns either SUCCESS or FAILURE (defined in intra_comm.h) to indicate whether the child is still running.

Such an approach allows different ways of communicating with the child without directly affecting the overloaded MPI functions, and vice versa.


The display


The display is developed using the Qt library and uses a classical directory organisation. Qt provides an excellent tool, qmake, to generate Makefiles from a project file (here mpidisplay.pro). From one platform to another only minor modifications have to be made to this file, such as the first two lines that define the MPI compiler flags. Note that Qt uses GCC as its compiler.


Extract of the mpidisplay.pro

# using 'mpicxx -showme:compile' and 'mpicxx -showme:link'
MPICXX_COMPILE = -I/usr/local/include -pthread
MPICXX_LINK = -pthread -L/usr/local/lib -lmpi_cxx -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

Qt also provides a good interface designer that is used to generate the GUI; the generated forms are stored in the forms folder. The src folder contains the sources.


The code organisation


The display code is organised around 2 classes so far:

  • MPIWatch, which is implemented as a singleton and is the only one that deals with MPI communication (i.e. communicates with the profiler). It therefore uses some information from intra_comm.h.

    It inherits from the QThread class, Qt's portable thread (most likely using pthreads on Unix), which allows the communication and the display refresh to be separated.

    The communication with the other class is done through Qt's internal signals, which are a kind of remote call. When a message is received from the profiler, it is stored on a message stack and the signal newMessage() is emitted.
  • CommStat, a classical QWidget displaying basic information on the number of sends and receives. It pops information from the MPIWatch object each time the latter signals a new message.

How to use the mpi_wrap library?


Using the library is very easy and standard.

  • Add the #include line to the code that uses MPI.
  • Compile the files with the path of the include files (usually -I)
  • Link the executable with the path of the library and the library name (usually -L and -lmpi_wrap).


Example in a Makefile

# path where the library is installed
MPI_WRAPPER = /home/workspace/project/current
# linking is either static or dynamic, will look in $MPI_WRAPPER/$linking
linking = dynamic

DEFINES+=
CC= mpicc
CFLAGS= -g $(DEFINES) -I${MPI_WRAPPER}/includes


LFLAGS= -lm -L${MPI_WRAPPER}/$(linking) -lmpi_wrap

EXE= ring

SRC= ring.c

OBJ= $(SRC:.c=.o)

.c.o:
	$(CC) $(CFLAGS) -c $<

all: $(EXE)

$(EXE): $(OBJ)
	$(CC) $(CFLAGS) -o $@ $(OBJ) $(LFLAGS)
	@echo "don't forget export LD_LIBRARY_PATH='$(MPI_WRAPPER)/$(linking)'"
	@echo "don't forget to add $(MPI_WRAPPER)/display to the PATH!"

clean:
	rm -f $(OBJ) $(EXE)

The sources


The sources are available on http://www.megaupload.com/?d=DDUQP5QH.

Saturday 12 February 2011

Using MPI_Spawn

This article presents how to use MPI_Spawn and the problems associated with it. It first shows the profiler code, then the display code, and finally discusses the problems.


Spawn the interface: profiler point of view


In order to spawn the interface, the PATH variable was exported so that it contains the path to the mpidisplay executable, the simple interface developed for this test. It basically counts the number of calls to some of the MPI communication functions.

The spawning actually occurs in the overloaded MPI_Init function:

int world_rank;
MPI_Comm intercomm = MPI_COMM_NULL;
int intercomm_child_rank = 0;

int MPI_Init(int* argc, char ***argv)
{
  int ret;

  ret = PMPI_Init(argc, argv);

  PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  fprintf(stderr, "!profiler(%d)! MPI_Init()\n", world_rank);

  // spawn the interface
  MPI_Comm_spawn("mpidisplay", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

  return ret;
}


This simply starts the display when the profiler is started through mpiexec and links them together. But as soon as MPI_Finalize is called, both of them are killed and the interface is closed. Thus a trick was used to make the profiler wait for the child to be closed before it stops running.

The idea is that the display sends a message to the profiler when it is closed, and that the profiler waits for this message with an asynchronous receive posted from the beginning. When MPI_Finalize is called in the profiler, an MPI_Wait on that message is performed, basically waiting for the display to be closed before resuming. The profiler also sends information about its imminent death, so that it can be shown on the display if needed.


#define CHILD "mpidisplay"
#define CHILD_ARGS MPI_ARGV_NULL

int world_rank;
MPI_Comm intercomm = MPI_COMM_NULL;
int intercomm_child_rank = 0;

static Intra_message quitmessage[INTRA_MESSAGE_SIZE];
MPI_Request dead_child = MPI_REQUEST_NULL;

int MPI_Init(int* argc, char ***argv)
{
  int ret;
  Intra_message message[INTRA_MESSAGE_SIZE];

  ret = PMPI_Init(argc, argv);

  PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  //PMPI_Comm_size(MPI_COMM_WORLD, &world_size);

  fprintf(stderr, "!profiler(%d)! MPI_Init()\n", world_rank);

  // spawn the interface
  MPI_Comm_spawn(CHILD, CHILD_ARGS, 1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

  sprintf(message, "%d Init\0", world_rank);

  PMPI_Ssend(message, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm);

  // wait for a message that "I'm dying"
  PMPI_Irecv(&(quitmessage[0]), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm, &dead_child); // check each time if the child is dead...

  return ret;
}

int MPI_Finalize(void)
{
  int ret;

  fprintf(stderr, "!profiler!(%d): MPI_Finalize()\n", world_rank);

  if ( dead_child != MPI_REQUEST_NULL )
    {
      Intra_message message[INTRA_MESSAGE_SIZE];

      sprintf(message, "%d Finalize\0", world_rank);

      // send my death to the display
      PMPI_Ssend(message, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm);

      fprintf(stderr, "!profiler!%d is waiting for its child...\n", world_rank);

      // wait for the display to quit
      PMPI_Wait(&dead_child, MPI_STATUS_IGNORE);
      fprintf(stderr, "!profiler!%d finished waiting...\n", world_rank);
    }

  ret = PMPI_Finalize();

  return ret;
}

Spawn the interface: display point of view


The display was implemented using Qt, and is therefore in C++. The MPI calls are the same, just organised in an object-oriented fashion.

When the child is spawned, it can retrieve its parent information, and does so in order to get the special intercommunicator. It then simply uses normal MPI communication over it.

The MPIWatch class was written to handle the MPI communication. It implements the singleton design pattern: the MPI initialisation code therefore lives in the global call that creates the object, and is normally performed only once (as the object is kept around until the end of the program).


MPIWatch* MPIWatch::getWatcher(void)
{
  if ( instance == 0 )
    {
      //MPI::Intercomm parent = MPI::COMM_NULL;
      int parentSize;

      MPI::Init();
      parent = MPI::Comm::Get_parent();

      if ( parent == MPI::COMM_NULL )
        {
          std::cerr << "Cannot connect with the parent program! Aborting." << std::endl;
          //parent.Abort(-1);
          MPI::Finalize();
          return 0;
        }

      parentSize = parent.Get_remote_size();

      if ( parentSize != 1 )
        {
          std::cerr << "Parent communicator size is " << parentSize << "! It should be 1. Aborting." << std::endl;
          parent.Abort(-1);
          return 0;
        }

      instance = new MPIWatch();
    }

  return instance;
}

How the instance catches up with messages will be discussed later. Basically MPIWatch does synchronised receives from its parent and pushes the results onto a stack, which is read by the interface.

When the window is closed, the MPIWatch object has to be destroyed, and the quit message is therefore sent to the parent.


bool MPIWatch::delWatcher()
{
  if ( ! instance )
    return false;

  if ( instance->isRunning() )
    return false;

  QString s(MESSAGE_QUIT);

  parent.Ssend(s.toStdString().c_str(), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, 0, 0);

  MPI::Finalize();
  parent = MPI::COMM_NULL;

  delete instance;
  instance = 0;

  return true;
}

Problems with spawned instances


The major issue with the spawned interface is the actual call to MPI_Finalize. When either the child or the parent calls it, the ORTE process - the daemon that handles MPI communication in OpenMPI; MPICH should have something similar - kills the other. Therefore, even with the trick of making the profiler wait for the display, the execution would not always terminate properly. It is actually rather bizarre that there is no proper way of doing so.

A bit more research will certainly be done on that problem, to see whether closing the intercommunicator can be effective. But there is not that much advantage compared to a typical client-server application, and the next development tests will be done on that approach.
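
To give an idea of what "closing the communicator" could look like, here is a minimal, untested sketch: both sides would call MPI_Comm_disconnect on the intercommunicator before MPI_Finalize instead of relying on the runtime to tear everything down. Whether this actually stops ORTE from killing the peer is only an assumption for now.

#include <mpi.h>

// Sketch only: sever the link with the display explicitly.
// `intercomm` stands for the communicator returned by MPI_Comm_spawn.
extern MPI_Comm intercomm;

int MPI_Finalize(void)
{
  if ( intercomm != MPI_COMM_NULL )
    {
      // waits for pending communication on the intercommunicator to
      // complete and sets the handle back to MPI_COMM_NULL
      PMPI_Comm_disconnect(&intercomm);
    }

  return PMPI_Finalize();
}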

Tackling C++ from C

Why use both C and C++ in a single program when MPI provides C++ wrappers? Well, first of all, most scientific programs are written in either C or Fortran, so providing a C++-only library is somewhat outside the scope of the project. Finding a solution that leaves the freedom of using C or Fortran for the MPI profiling interface (called the profiler) and any other language or library for the interface (called the display) is, from my point of view, a good approach.


At the current state of the project, the profiler is written in C - and it will certainly be written only in C for the whole project - and the interface has to be written using the Qt C++ library. The problem is therefore to call the corresponding C++ method when an MPI call is handled - hence calling C++ from C.


The first approach was to try to bind C++ into C, and it was a big failure. The code was a simple function call (not even a method on an object) and it did not link properly. Therefore a more modular solution had to be found.


Having separate programs for the profiler and the display is certainly the key to the problem. It does, however, require another means of communication than simple function calls. MPI provides functions to spawn another process; it also provides socket-like connection handling.
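
The "socket handling" here is presumably the MPI-2 port mechanism (MPI_Open_port, MPI_Comm_accept and MPI_Comm_connect). A minimal sketch, with hypothetical serve()/attach() helpers and no error handling:

#include <mpi.h>
#include <stdio.h>

// Server side (the display): open a port, make its name known somehow,
// and accept one connection.
void serve(void)
{
  char port[MPI_MAX_PORT_NAME];
  MPI_Comm client;

  MPI_Open_port(MPI_INFO_NULL, port);
  printf("port name: %s\n", port);  // the name has to reach the client somehow
  MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

  // ... exchange messages over `client` ...

  MPI_Comm_disconnect(&client);
  MPI_Close_port(port);
}

// Client side (the profiler): connect using the port name obtained
// from the server.
void attach(char *port)
{
  MPI_Comm server;

  MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);

  // ... exchange messages over `server` ...

  MPI_Comm_disconnect(&server);
}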

During the next week I will try both solutions and choose between them.


A client and a server


Typical socket communication can be achieved with a client-server model. The display will be a server, and the profiling interface will connect to it and send information.

In order to provide one display per profiler, several interfaces will be started, each on its own port. The obvious idea is to use a "base" port (say 4242) and add the MPI process rank to find which port to use for communication. Thus on a 4-process job, 4 displays will start, listening on 4242, 4243, 4244 and 4245. Each profiler will then connect to one of them according to its rank.
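
Just to illustrate the base-port-plus-rank idea (the real display would more likely use Qt's socket classes), here is a profiler-side sketch with plain POSIX sockets; connect_to_display() is a hypothetical helper.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define BASE_PORT 4242

// Connect rank `rank` to the display listening on localhost:BASE_PORT+rank
// and return the socket, or -1 on error.
int connect_to_display(int rank)
{
  struct sockaddr_in addr;
  int fd = socket(AF_INET, SOCK_STREAM, 0);

  if ( fd < 0 )
    return -1;

  memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(BASE_PORT + rank);  // 4242, 4243, ...
  inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

  if ( connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 )
    {
      close(fd);
      return -1;
    }

  return fd;
}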

Using sockets should be easy enough from both MPI and Qt, as the two libraries provide a fairly high-level interface.


The obvious advantage of such an approach is the total independence of the two programs. One can communicate with the other through a defined protocol without any trouble. It also allows the profiler to be written in any language, and the display to be rewritten at will - to show more specific information or to use another library/language.

The obvious disadvantage is the opening of several ports, which might be troublesome on some restricted networks. A communication protocol has to be written as well, but that is also required by the other approach.


Spawning the display


Spawning the display basically means starting another process from the profiler. The display will hence be a totally different program, but it will be possible to communicate through a special MPI intercommunicator obtained during the spawning process.

The difficulty of that approach lies in the spawning itself. As the two processes are tied together, if one of them dies (from an error, or simply because the display is closed) the ORTE daemon (which manages MPI communication in OpenMPI - MPICH2 must have something similar) will kill the other process. Therefore there is no really clean way of exiting both programs.

Moreover, the display program has to be either accessible from the PATH, or the profiler must have a way of finding where it is stored.
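
One possible way around the PATH requirement (not tested) would be to pass an MPI_Info object to MPI_Comm_spawn with the reserved "path" key; whether the key is honoured depends on the implementation. A sketch, with a placeholder directory:

#include <mpi.h>

// Sketch: give the spawn call a search path through the reserved "path"
// info key instead of exporting PATH beforehand. Support for the key is
// implementation dependent, and the directory below is a placeholder.
void spawn_display(MPI_Comm* intercomm)
{
  MPI_Info info;

  MPI_Info_create(&info);
  MPI_Info_set(info, "path", "/home/workspace/project/current/display");

  MPI_Comm_spawn("mpidisplay", MPI_ARGV_NULL, 1, info, 0, MPI_COMM_SELF,
                 intercomm, MPI_ERRCODES_IGNORE);

  MPI_Info_free(&info);
}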


Hence the advantages are on the communication side: both programs use MPI to communicate, which is rather simple and avoids the port problem.

Meeting 2 [07/02/11]

During the second meeting I presented my results to David. I also explained that using the C++ Standard Template Library could be a nice and effective way to store information on the profiler side.

One of the problems comes from the MPI interface, which has to be overloaded in C, while the interface/STL code has to be in C++. Using C from C++ is relatively easy (the extern "C" keyword, and most of the C standard libraries are available - like #include <cstdlib> for #include <stdlib.h>). The other way around is tricky enough to deserve proper thought.
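
For reference, the easy direction relies on the extern "C" guard, which disables C++ name mangling so the C objects link correctly. A minimal sketch, with a hypothetical profiler_hooks.h header shared between the C profiler and a C++ consumer:

// profiler_hooks.h -- hypothetical example of a header usable from both
// C and C++ translation units.
#ifndef PROFILER_HOOKS_H
#define PROFILER_HOOKS_H

#ifdef __cplusplus
extern "C" {
#endif

// callable from C, and from C++ without name mangling
void profiler_record_send(int dest, int count);

#ifdef __cplusplus
}
#endif

#endif // PROFILER_HOOKS_H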


So two main goals are to be considered for the next meeting:
  • try to do an interface with Qt - and thus tackle the C/C++ binding
  • start to think about the display:
    • what should be displayed
    • how should it be displayed

The idea of the project therefore remains focused on flagging up common errors, not on developing a Swiss-army knife for MPI.
The basic errors listed during the meeting were:
  • broadcast on a single node
  • synchronized send with no matching receive
  • data problems

The next meeting will be Monday 21st of February

Wednesday 2 February 2011

Using the MPI profiling interface

How does the MPI profiling interface work? The answer is almost too easy. Finding out how to use it is more complex.

The basic idea of the MPI profiling interface is simple: every single MPI function actually provides two entry points. One has the classical MPI_ prefix, the other has PMPI_. The whole idea is thus to overload the MPI_ ones and call the corresponding PMPI_ inside. This approach gives full access to both the parameters and the return code.
Moreover, the PMPI_ calls are part of the MPI standard definition (as far as I know...) and are therefore common to every implementation.
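
Any MPI routine can be wrapped in the same way, which is what gives access to the parameters and the return code. As a hedged sketch (not part of the test below), a wrapper for MPI_Ssend could look like this:

#include <mpi.h>
#include <stdio.h>

// Sketch of wrapping another routine: report the parameters, forward the
// call to PMPI_Ssend, then report the return code.
// (Newer MPI versions declare buf as const void*.)
int MPI_Ssend(void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm)
{
  int ret;

  fprintf(stderr, "Prof: MPI_Ssend(count=%d, dest=%d, tag=%d)\n",
          count, dest, tag);

  ret = PMPI_Ssend(buf, count, datatype, dest, tag, comm);

  fprintf(stderr, "Prof: MPI_Ssend returned %d\n", ret);

  return ret;
}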


In order to test the MPI profiling interface I wrote down the simplest MPI code possible. Two files were needed: one for the wrapper (plus its header), one for the program.


mpi_wrap.h

#ifndef MPI_WRAP
#define MPI_WRAP

int MPI_Init(int *argc, char ***argv);

#endif

mpi_wrap.c

#include "mpi_wrap.h"

#include <mpi.h>
#include <stdio.h>

int MPI_Init(int* argc, char ***argv)
{
  int ret;

  fprintf(stderr, "Prof: MPI_Init(...)\n");

  ret = PMPI_Init(argc, argv);

  return ret;
}

mpi_hello.c

#include <mpi.h>
#include <stdio.h>

#include "mpi_wrap.h"

int main()
{
  int rank=0, pop=0;

  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &pop);

  if ( rank == 0 )
    printf("%d: I'm the master of %d puppets.\n", rank, pop);

  MPI_Finalize();

  return 0;
}

On Ness


Ness is the EPCC cluster used by MSc students, running Scientific Linux.

MPI installed: mpich-2

MPI C Compiler: pgcc


The first problem came from the compilation of the library: for a yet unknown reason, ld did not want to link it as a shared object.

mpicc -c -fPIC mpi_wrap.c -o mpi_wrap.o
mpicc -shared -soname=libmpi_wrap.so -o libmpi_wrap.so mpi_wrap.o
The compilation ended with:

/usr/bin/ld: /opt/local/packages/mpich2/1.0.5p4-ch3_sock-pgi7.0-7/lib/libmpich.a(init.o): relocation R_X86_64_32 against `MPIR_Process' can not be used when making a shared object; recompile with -fPIC
/opt/local/packages/mpich2/1.0.5p4-ch3_sock-pgi7.0-7/lib/libmpich.a: could not read symbols: Bad value


Thus the static library approach was taken.

mpicc -c -fpic mpi_wrap.c -o mpi_wrap.o
ar rcs libmpi_wrap.a mpi_wrap.o
That compiled well.


Then comes the program compilation, which is straightforward.

mpicc -c -I. mpi_hello.c -o mpi_hello.o
mpicc mpi_hello.o -L. -lmpi_wrap -o mpi_hello
And the result worked fine:
$> mpiexec -n 2 mpi_hello
mpiexec: running on ness front-end; timings will not be reliable.
Prof: MPI_Init(...)
Prof: MPI_Init(...)
0: I'm the master of 2 puppets.

At home


My home desktop machine is using a Gentoo/Linux installation.

MPI installed: OpenMPI

MPI C Compiler: gcc



Compiling the dynamic library worked:

mpicc -c -fpic mpi_wrap.c -o mpi_wrap.o
mpicc -shared -Wl,-soname,libmpi_wrap.so mpi_wrap.o -o libmpi_wrap.so

And compiling the executable too:
mpicc -c -I. mpi_hello.c -o mpi_hello.o
mpicc -L. -lmpi_wrap mpi_hello.o -o mpi_hello

Of course, as the dynamic library approach is used, the LD_LIBRARY_PATH environment variable has to be set to the directory where the .so is:
export LD_LIBRARY_PATH=`pwd`

Finally running works as well:
$> mpiexec -n 2 mpi_hello
Prof: MPI_Init(...)
Prof: MPI_Init(...)
0: I'm the master of 2 puppets.


Discussion


It is rather strange that Ness does not want to build the wrapper as a shared library. The error message suggests that the MPICH library installed there was not compiled with -fPIC, so it cannot be folded into a shared object; further investigation will be done to confirm this.

Using a statically linked library offers the advantage of simplicity: there is no need to set up LD_LIBRARY_PATH. But it increases the size of the executable, especially once the tool includes the graphical interface.

The advantages of the dynamically linked library are the reverse: it saves executable size at the expense of a little configuration.


As far as possible I will try to use the dynamically linked library approach, as the graphical interface will certainly contain a lot of code that is not directly needed by the program. But the library has to be present in a common location if used on a cluster, and this is something I need to investigate further.


References


No real references here, just some websites that helped me remember how to create libraries - and of course how to use the MPI profiling interface.


Creating a shared and static library with the gnu compiler [gcc] - René Nyffenegger

Open MPI FAQ: Performance analysis tools

Meeting 1 [24/01/11]

During the first meeting David and I discussed the main goals of the project. From the project proposal and some thinking we started to agree on several points.

Few parallel debuggers exist at the moment, and most of them are expensive and not very useful. This tool should not be one of those.

When people start learning MPI, there are two things they mainly get wrong:
  • communications (the typical example is the broadcast call, that has to be performed by all the nodes, and that learners only use on one)
  • sending the wrong data, either from the wrong source, using a wrongly built datatype, or to the wrong node.
Therefore it can be interesting to create a tool that helps resolve these problems on a small program running on a small number of nodes. The problem size is important here, as the tool will provide a graphical interface to the user, which would become quite unreadable for a large number of nodes. Moreover, it is rather unusual for people who need very large problems to use that kind of tool.

The first draw of some requirements can be:
  • a graphical interface to visualise ongoing actions
  • being able to monitor the state of a MPI node
  • being able to block when a monitored action occurs, for the user to see it (communication waiting, sending, ...)
  • being able to register and track some simple data (1D arrays)
    • visually
    • derived datatypes
    • nD arrays
This tool will be provided as a library that can be linked with any MPI code in C, and if possible Fortran as well. The generic MPI profiling interface will be used to catch the information.
The MPI coursework from the 1st semester will provide a test case, and the first goal of the project is to demonstrate how this piece of code works using the tool.

Two main dangers have to be taken care of in the project specification:
  • being too ambitious will result in a project failure, struggling with implementation
  • being not ambitious enough will result in a useless tool
In order to cope with these risks, an iterative prototype development approach will certainly be used.

Next meeting: 07/02/11
Work to be done: try to use the profiling interface with MPICH2 and OpenMPI.

Original project proposal

Real-time visualisation of MPI programs
David Henty

One of the problems with MPI programming is that it is very difficult to debug incorrect programs. Tools like VAMPIR can display the communications patterns of MPI programs by producing a trace file during execution and enabling the user to view the file as a timeline afterwards. Unfortunately, this is only useful if the program runs to completion which is usually not the case when you have a bug! It would also be useful to track MPI communications at runtime for training and education purposes, allowing new users to see what their programs are doing, or to run standard examples and follow their execution so they can understand concepts such as synchronous/asynchronous modes and blocking/non-blocking operations.

The project is to develop a tool/library that, for each MPI processes, pops up a window that shows real-time information about its execution. For example, it could just say what routine was being called ("Currently in MPI_Send"), give more details ("Calling MPI_Send to send 14 real numbers to rank 4") or display the operations graphically (eg boxes showing all the pending sends and receives, animations showing messages matching up at runtime etc etc). This tool would then be run on a set of test programs from simple examples all the way to full applications to see how useful it is in practice. Possible extensions include halting execution until the user hits a button ("click here to continue") which could be very useful in illustrating concepts such as collective communications: the routine will not complete until the user has clicked "go" for all MPI processes. Another possibility would be to display where in the source code each process is at any one time.

It is quite simple to do this in practice as the MPI library has a separate "profiling interface" that enables all MPI calls easily to be intercepted by the user. Here, we would then display information about the call in some way (eg write text to a window) before calling the real MPI routine.

The tool could easily be developed and tested on a single workstation with all MPI processes displaying information on the same screen. However, it would be more interesting to run on a real cluster like the EPCC training room machines. Here, a window would appear on each screen where an MPI process was running and there would be interactions between different machines in the room. A user at one screen might have to call to a user at another screen for them to initiate a receive operation so that the first user's synchronous send can complete.

The tool should work with both C and Fortran, but will itself be developed in C. A good knowledge of C programming is therefore required. Previous experience in graphics programming would be useful but not essential.