Sunday 20 February 2011

Using MPI sockets

This article presents how to use MPI to create a remote socket and use it through MPI calls. Remember that the project has two parts that communicate: the profiler, the library part that uses the MPI profiling interface to profile the program, and the display, which shows the information sent by the profiler.

First of all, some research was done to find out how to create a socket with MPI on the profiler side and communicate with some other socket library on the display side. So far no example using that approach was found and, as this is only a technical test, no real implementation was done that way.

The approach used here is to bind the profiler and display communicators using a technique similar to MPI_Comm_spawn, but one that doesn't require the 2 programs to be tied together. This is done using MPI_Open_port and its companion functions.

The code wasn't modified much from the MPI_Comm_spawn approach, as you are going to see. The reference used to understand and develop this approach was the MPI standard website: 5.4.6. Client/Server Examples

The profiler side - server side

The overall idea of this approach is for the profiler to open a port and wait for a display to connect to it. The idea can be pushed further, if needed, to allow several displays to connect to a single profiler (for example, to share the view of the program across several displays).

What was actually modified from the spawn example is the way the profiler and the display are connected. Rather than calling MPI_Comm_spawn, the profiler calls MPI_Open_port followed by MPI_Comm_accept, and a few lines were added just before finalizing the execution.

Opening the port

int start_child(char* command, char* argv[])
{
  MPI_Open_port(MPI_INFO_NULL, port_name);

  /* child doesn't find it...
    sprintf(published, "%s-%d\0", PROFNAME, world_rank);

    MPI_Publish_name(published, MPI_INFO_NULL, port_name);*/

  fprintf(stderr, "!profiler!(%d) open port '%s'\n", world_rank, port_name);

  fprintf(stderr, "!profiler!(%d) waiting for a child...\n", world_rank);

  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

  fprintf(stderr, "!profiler!(%d) got a child!\n", world_rank);

  int r;
  MPI_Comm_rank(intercomm, &r);
  fprintf(stderr, "!profiler!(%d) is %d on parent!\n", world_rank, r);

  // wait for a message that "I'm dying"
  if ( PMPI_Irecv(&(quitmessage[0]), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, CHILD_RANK, INTERCOMM_TAG, intercomm, &dead_child) != MPI_SUCCESS )
    {
      fprintf(stderr, "!profiler!(%d) communication failed!\n", world_rank);
      intercomm = MPI_COMM_NULL;
      return FAILURE;
    }

  char mess[INTRA_MESSAGE_SIZE];
  /* snprintf always terminates the string, so no explicit "\0" is needed */
  snprintf(mess, sizeof(mess), "%d IsYourFather", world_rank);


  sendto_child(mess);

  PMPI_Barrier(MPI_COMM_WORLD);


  return SUCCESS;
}

Finalizing the communication

int wait_child(char* mess)
{
  // send my death
  if ( sendto_child(mess) == SUCCESS )
    {
      // wait his death
      if ( PMPI_Wait(&dead_child, MPI_STATUS_IGNORE) == MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler!(%d) received its child death!\n", world_rank);
          //MPI_Unpublish_name(published, MPI_INFO_NULL, port_name);
          MPI_Close_port(port_name);
          return SUCCESS;
        }
    }

  return FAILURE;
}

The display side - the client side

On the display side, the same kind of modification had to be made. Rather than using information from the parent's communicator, the display connects to a port.

The MPIWatch::getWatcher method

MPIWatch* MPIWatch::getWatcher(char port_name[])
{
    if ( instance == 0 )
    {
        MPI::Init();

        std::cout << "Try to connect to " << port_name << std::endl;

        parent = MPI::COMM_WORLD.Connect(port_name, MPI::INFO_NULL, 0);

        if ( parent == MPI::COMM_NULL )
        {
            std::cerr << "Cannot connect with the parent program! Aborting." << std::endl;
            MPI::Finalize();
            return 0;
        }

        std::cout << "Connection with parent completed!" << std::endl;

        instance = new MPIWatch();
    }

    return instance;
}

Running it!

The main difference is that in the previous version the display was started automatically by the profiler. Now it has to be started separately, once per MPI process. Some attempts were made to use the name publication described in the standard (see the reference further up), but for an unknown reason the display never found the name published by the profiler. So far, one port is opened per MPI process (or, in those attempts, one name was published), and each display connects to one of them using the port name given on the command line.

Console 1: run MPI

$> mpiexec -n 2 mpi_ring
!profiler!(0) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'
!profiler!(1) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

Console 2-3: run the display

$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'

$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

The current implementation is a little more complicated to run than the spawn version, but it no longer ends with an error code. It also leaves more flexibility for the future: allowing more than one display on a single profiler, or any other idea that requires a more flexible approach than a spawned process (like being able to connect a display in the middle of a run and disconnect at will, to see whether the program is deadlocked, etc.).

Limitations

The port names are rather long, and it is not user friendly to have to look up the profiler output and copy/paste the port name into the display. Further investigation has to be done on that part, in order either to solve the name publication problem or to find a more automatic way to look up the port. The actual name publication idea was to publish a name to look up, like 'profiler-mpirank' (or with any string given by the user instead of profiler). This would allow the display to be started with a single command that only needs 2 pieces of information: the base name of the profiler and the number of MPI processes to connect to!
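
As a minimal sketch of the naming scheme described above, the following C helper builds the 'profiler-mpirank' style name. The function name build_service_name and the buffer sizes are hypothetical; only the "<base>-<rank>" format comes from the idea above.

```c
#include <assert.h>
#include <stdio.h>

/* Build "<base>-<rank>" (e.g. "profiler-3") into `out`.
   Hypothetical helper: returns 0 on success, -1 if the rank is
   negative or the name does not fit in the buffer. */
int build_service_name(char *out, size_t out_size, const char *base, int rank)
{
  int written;

  assert(out != NULL && base != NULL);

  if (rank < 0 || out_size == 0)
    return -1;

  written = snprintf(out, out_size, "%s-%d", base, rank);
  if (written < 0 || (size_t)written >= out_size)
    return -1;   /* the name would have been truncated */

  return 0;
}
```

The profiler would pass such a name to MPI_Publish_name, and the display would build the same name for each rank from 0 to the number of processes minus 1 and pass it to MPI_Lookup_name to get the port back. Whether that lookup succeeds between separately started programs is exactly the open problem mentioned above.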

The other limitation is not a real one, but more of a bug in the current implementation. A barrier was added to wait for every MPI process to get a display; this isn't much of a problem, as high performance is not required for this project. The problem arises when one display is closed while the program is running: the current implementation doesn't catch it, and deadlocks. Further investigation will be done on that problem later on.

Source code

As for the previous version, the source code is available on http://www.megaupload.com/?d=ZXJGHBPQ. It is a test version, not very clean, and buggy (as explained above). A later post will explain how to use the library with an MPI code in C.

Further work

The preliminary technical overview of the project is almost over. Now that the technical basis of the project is set up, a more detailed reasoning will be done on the project's functional requirements. As part of the Project Preparation course of the MSc, a risk analysis and a workplan for the overall project have to be done too, and will be published here as well.

Tuesday 15 February 2011

A bit of software engineering

This article only details some changes made to the code in order to have a more adaptable test software. It also explains how to use the library with an MPI program in C.

The Project

The project is so far organised around 2 things: the profiler and the display. The profiler is produced as a library, that patches some of the MPI calls. The display is an executable that only displays information from the profiler.

The current directory architecture reflects that organisation, where the display is actually in a subdirectory of the profiler (the interface one).

When built, 3 folders are created:

dynamic, containing the library as a .so (or static, with the .a)
includes, which contains the headers to add to the MPI executable you want to use with the profiler (mpi_wrap.h is the one to include; intra_comm.h just defines how the display and profiler communicate, and can be used later on to develop another display)
display, which is the folder where the display executable is stored.

The profiler itself is written in C, and is therefore compiled with mpicc (GCC on my machine; no build was really done on Ness, as Qt isn't installed on it for the moment).

The display is implemented in C++ using both Qt and MPI and uses the powerful .pro files to handle compilation.

The profiler

The profiler is organised so far around:

mpi_basic.c and mpi_communication.c, which implement the MPI functions declared in mpi_wrap.h.
child_comm.c, child_comm.h and intra_comm.h, which implement the profiler/display communication.

MPI overloading

Only the functions declared in mpi_wrap.h are overloaded, and this is so far the only file that has to be included from the original MPI program. Each of these functions calls into the child_comm module to communicate with the child, and the user doesn't have to bother with it.

The child_comm module

Very few types of communication are actually required with the display, so the header is rather small:

child_comm.h

#ifndef CHILDCOMM
#define CHILDCOMM

int start_child(char* command, char* argv[]);
int alive_child();
int sendto_child(char* mess);
int wait_child(char* mess);

#endif // CHILDCOMM

start_child starts the child, and is therefore called in MPI_Init()
sendto_child sends information to the child; the message has a fixed size defined in intra_comm.h
wait_child waits for the child's death (i.e. makes sure it has received all the information before closing the communication) and is thus called in MPI_Finalize()
alive_child returns either SUCCESS or FAILURE (defined in intra_comm.h) to tell whether the child is still running or not.

Such an approach allows different ways of communicating with the child without directly affecting the overloaded MPI functions, and vice versa.
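
To make the fixed-size message idea more concrete, here is a minimal sketch of how a "<rank> <event>" message could be formatted before being handed to sendto_child. The SKETCH_ constants and the helper name are assumptions made for this example; the real size and return codes are the ones defined in intra_comm.h.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values for illustration only; the real ones live in intra_comm.h. */
#define SKETCH_MESSAGE_SIZE 64
#define SKETCH_SUCCESS 0
#define SKETCH_FAILURE 1

/* Write "<rank> <event>" into a fixed-size message buffer.
   Returns SKETCH_FAILURE instead of overflowing the buffer. */
int format_intra_message(char mess[SKETCH_MESSAGE_SIZE], int rank, const char *event)
{
  int written;

  assert(mess != NULL && event != NULL);

  written = snprintf(mess, SKETCH_MESSAGE_SIZE, "%d %s", rank, event);
  if (written < 0 || written >= SKETCH_MESSAGE_SIZE)
    return SKETCH_FAILURE;   /* message too long for the fixed size */

  return SKETCH_SUCCESS;
}
```

Keeping the formatting in one place like this means the overloaded MPI functions only have to name the event, and a message that is too long is reported rather than silently written past the end of the buffer.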

The display

The display is developed using the Qt library, and uses a classical directory organisation. Qt provides an excellent tool, qmake, to generate Makefiles from a project file (here mpidisplay.pro). From one platform to another only minor modifications have to be made to that file, such as the first 2 lines, which define the MPI compile and link flags. Note that here Qt uses GCC as its compiler.

Extract of the mpidisplay.pro

# using 'mpicxx -showme:compile' and 'mpicxx -showme:link'
MPICXX_COMPILE = -I/usr/local/include -pthread
MPICXX_LINK = -pthread -L/usr/local/lib -lmpi_cxx -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

Qt also provides a good interface designer, which will be used to generate the GUI; the generated forms are stored in the forms folder. The src folder contains the sources.

The code organisation

The display code is organised around 2 classes so far:

MPIWatch, which is implemented as a singleton and is the only class that deals with MPI communication (i.e. communicates with the profiler). It therefore uses some information from intra_comm.h.

It inherits from the QThread class, Qt's portable thread (presumably using pthreads on Unix), which allows communication and display updates to be separated.

The communication with the other class is done through Qt's signals, which are a kind of remote call. When a message is received from the profiler, it is stored on a message stack, and the signal newMessage() is emitted.

CommStat, which is a classical QWidget displaying basic information on the number of sends and receives. It pops information from the MPIWatch object each time the latter signals a new message.
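
The counting logic itself does not depend on Qt. As a rough sketch in plain C (the real display is C++), this is how a "<rank> <event>" message popped from the stack could be parsed and counted. The struct, the function name and the event names "Send" and "Recv" are hypothetical; the only events that appear in the profiler code so far are "Init", "Finalize" and "IsYourFather".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_EVENT_SIZE 32

/* Counters similar in spirit to what a CommStat-like widget could show. */
struct comm_counters {
  unsigned sends;
  unsigned receives;
};

/* Parse "<rank> <event>" and update the counters.
   Returns 0 on success, -1 if the message is malformed. */
int record_message(struct comm_counters *c, const char *mess, int *rank_out)
{
  char event[SKETCH_EVENT_SIZE];
  int rank;

  assert(c != NULL && mess != NULL && rank_out != NULL);

  /* %31s bounds the event name to SKETCH_EVENT_SIZE - 1 characters. */
  if (sscanf(mess, "%d %31s", &rank, event) != 2)
    return -1;

  if (strcmp(event, "Send") == 0)        /* hypothetical event name */
    c->sends++;
  else if (strcmp(event, "Recv") == 0)   /* hypothetical event name */
    c->receives++;
  /* Other events (e.g. "Init", "Finalize") are accepted but not counted. */

  *rank_out = rank;
  return 0;
}
```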

How to use the mpi_wrap library?

Using the library is very easy, and standard.

Add the #include "mpi_wrap.h" line to the code that uses MPI.
Compile the files with the path to the include files (usually -I).
Link the executable with the path to the library and the library name (usually -L and -lmpi_wrap).

Example in a Makefile

# path where the library is installed
MPI_WRAPPER = /home/workspace/project/current
# linking is either static or dynamic, will look in $MPI_WRAPPER/$linking
linking = dynamic

DEFINES+=
CC= mpicc
CFLAGS= -g $(DEFINES) -I${MPI_WRAPPER}/includes


LFLAGS= -lm -L${MPI_WRAPPER}/$(linking) -lmpi_wrap

EXE= ring

SRC= ring.c

OBJ= $(SRC:.c=.o)

.c.o:
	$(CC) $(CFLAGS) -c $<

all: $(EXE)

$(EXE): $(OBJ)
	$(CC) $(CFLAGS) -o $@ $(OBJ) $(LFLAGS)
	@echo "don't forget export LD_LIBRARY_PATH='$(MPI_WRAPPER)/$(linking)'"
	@echo "don't forget to add $(MPI_WRAPPER)/display to the PATH!"

clean:
	rm -f $(OBJ) $(EXE)

The sources

The sources are available on http://www.megaupload.com/?d=DDUQP5QH.

Saturday 12 February 2011

Using MPI_Comm_spawn

This article presents how to use MPI_Comm_spawn and the problems associated with it. It first shows the profiler code, then the display code, and finally discusses the problems.

Spawn the interface: profiler point of view

In order to spawn the interface, the PATH variable was exported so that it contains the path to the mpidisplay executable, the simple interface developed for this test. It basically counts the number of calls to some of the MPI communication functions.

The spawning actually occurs in the MPI_Init overloaded function :

int world_rank;
MPI_Comm intercomm = MPI_COMM_NULL;
int intercomm_child_rank = 0; 

int MPI_Init(int* argc, char ***argv)
{
  int ret;

  ret = PMPI_Init(argc, argv);

  PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  fprintf(stderr, "!profiler(%d)! MPI_Init()\n", world_rank);
  
  // spawn the interface
  MPI_Comm_spawn("mpidisplay", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

  return ret;
}

This simply starts the display when the profiler is started through mpiexec, and links them together. But as soon as MPI_Finalize is called, both of them are killed and the interface is closed. Thus a trick was used to make the profiler wait for the child to be closed before it stops running.

The idea is that the display sends a message to the profiler when it is closed, and that the profiler waits for this message with an asynchronous receive posted from the beginning. When MPI_Finalize is called on the profiler, an MPI_Wait on that message is performed, which basically waits for the display to be closed before resuming. The profiler also sends a message about its imminent death, so that the display can show that information if needed.

#define CHILD "mpidisplay"
#define CHILD_ARGS MPI_ARGV_NULL

int world_rank;
MPI_Comm intercomm = MPI_COMM_NULL;
int intercomm_child_rank = 0; 

static Intra_message quitmessage[INTRA_MESSAGE_SIZE]; 
MPI_Request dead_child = MPI_REQUEST_NULL;

int MPI_Init(int* argc, char ***argv)
{
  int ret;
  Intra_message message[INTRA_MESSAGE_SIZE];

  ret = PMPI_Init(argc, argv);

  PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  //PMPI_Comm_size(MPI_COMM_WORLD, &world_size);

  fprintf(stderr, "!profiler(%d)! MPI_Init()\n", world_rank);

  
  // spawn the interface
  MPI_Comm_spawn(CHILD, CHILD_ARGS, 1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

  sprintf(message, "%d Init\0", world_rank);

  PMPI_Ssend(message, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm);

  // wait for a message that "I'm dying"
  PMPI_Irecv(&(quitmessage[0]), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm, &dead_child);// check each time if the child is dead...

  return ret;
}

int MPI_Finalize(void)
{
  int ret;

  fprintf(stderr, "!profiler!(%d): MPI_Finalize()\n", world_rank);

  if ( dead_child != MPI_REQUEST_NULL )
    {
      Intra_message message[INTRA_MESSAGE_SIZE];

      sprintf(message, "%d Finalize\0", world_rank);

      // send my death to the display
      PMPI_Ssend(message, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm);

      fprintf(stderr, "!profiler!%d is waiting for its child...\n", world_rank);

      // wait for the display to quit
      PMPI_Wait(&dead_child, MPI_STATUS_IGNORE);
      fprintf(stderr, "!profiler!%d finished waiting...\n", world_rank);
    }

  ret = PMPI_Finalize();

  return ret;
}

Spawn the interface: display point of view

The display was implemented using Qt, and is therefore in C++. The MPI calls are the same, just organized in an object-oriented fashion.

When the child is spawned, it can retrieve its parent's information, and does so in order to get the special communicator. Then it simply uses normal MPI communication with it.

The MPIWatch class was written to handle the MPI communication. It implements the singleton design pattern. The MPI initialisation code is therefore placed in the global call that creates the object, and is normally performed only once (as the object is kept alive until the end of the program).

MPIWatch* MPIWatch::getWatcher(void)
{
    if ( instance == 0 )
    {
        //MPI::Intercomm parent = MPI::COMM_NULL;
        int parentSize;

        MPI::Init();
        parent = MPI::Comm::Get_parent();

        if ( parent == MPI::COMM_NULL )
        {
            std::cerr << "Cannot connect with the parent program! Aborting." << std::endl;
            //parent.Abort(-1);
            MPI::Finalize();
            return 0;
        }

        parentSize = parent.Get_remote_size();

        if ( parentSize != 1 )
        {
            std::cerr << "Parent communicator size is " << parentSize << "! It should be 1. Aborting." << std::endl;
            parent.Abort(-1);
            return 0;
        }

        instance = new MPIWatch();
    }

    return instance;
}

How the instance catches messages will be discussed later. Basically the MPIWatch does synchronous receives from its parent, and pushes the results onto a stack that is read by the interface.

When the window is closed, the MPIWatch object has to be destroyed, and the quit message is therefore sent to the parent.

bool MPIWatch::delWatcher()
{
    if ( ! instance )
        return false;

    if ( instance->isRunning() )
        return false;

     QString s(MESSAGE_QUIT);

     parent.Ssend(s.toStdString().c_str(), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, 0, 0);

     MPI::Finalize();
     parent = MPI::COMM_NULL;

    delete instance;
    instance = 0;

    return true;
}

Problems with spawned instances

The major issue with the spawn interface is the actual call to MPI_Finalize. When either the child or the parent calls it, ORTE (the daemon that handles MPI communication in OpenMPI; MPICH should have something similar) kills the other. Therefore, even with the trick of making the profiler wait for the display, the execution would not always terminate properly. It is actually rather bizarre that there is no proper way of doing so.

A bit more research will certainly be done on that problem, to see whether closing the communicator can be effective. But there is not much advantage compared to a typical client-server application, and the next development tests will be done on that.

Tackling C++ from C

Why use both C and C++ in a single program when MPI provides a C++ wrapper? Well, first of all, most scientific programs are written in either C or Fortran. Thus providing a library limited to C++ is somewhat out of the scope of the project. Moreover, finding a solution that gives the freedom to use C or Fortran for the MPI profiling interface (called the profiler) and any other language or library for the interface (called the display) is, from my point of view, a good approach.

In the current state of the project, the profiler is written in C (and it will certainly be written only in C for the whole project) and the interface has to be written using the Qt C++ library. The problem is therefore to call the corresponding C++ method when an MPI call is handled, hence calling C++ from C.

The first approach was to try to bind C++ in C, and it was a big failure. The code was a simple function call (not even a method of an object) and it didn't link properly. Therefore a more modular solution had to be found.

Having separate programs for the profiler and the display is certainly the key to the problem. It requires, however, another way of communicating than simple function calls. MPI provides functions to spawn another process. It also provides socket handling.

During the next week I will try both solutions and choose between them.

A client and a server

The typical communication with sockets can be achieved with a client-server architecture. The display will be a server, and the profiling interface will connect to it and send information.

In order to provide one display per profiler, several interfaces will be started, each of them on its own port. The obvious idea is to use a "base" port (say 4242) and to add the MPI process rank to find which port to use for communication. Thus on a 4-process job, 4 displays will start, listening on 4242, 4243, 4244 and 4245 respectively. Then each profiler will connect to one of them, according to its rank.
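
The port arithmetic is simple enough to sketch in a few lines of C. The names DISPLAY_BASE_PORT and display_port_for_rank are hypothetical, and the range check is an assumption added so that the result always stays a valid TCP port.

```c
#include <assert.h>

/* Hypothetical base port; 4242 is the example value used above. */
#define DISPLAY_BASE_PORT 4242
#define MAX_TCP_PORT 65535

/* Return the port the display for MPI rank `rank` listens on,
   or -1 if the rank is negative or the port would be out of range. */
int display_port_for_rank(int base_port, int rank)
{
  assert(base_port > 0 && base_port <= MAX_TCP_PORT);

  if (rank < 0 || rank > MAX_TCP_PORT - base_port)
    return -1;

  return base_port + rank;
}
```

Both sides would call the same function, the display with the rank it is meant to serve and the profiler with its own rank, so that they agree on the port without exchanging anything beforehand.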

Using sockets should be easy enough from MPI and Qt, as both libraries provide a "high" level interface.

The obvious advantage of such an approach is the total independence of the two programs. One can communicate with the other through a defined protocol without any trouble. It also allows the profiler to be in any language, and the display to be rewritten at will, to display more specific information or to use another library/language.

The obvious disadvantage is the opening of several ports, which might be troublesome on some restricted networks. A communication protocol has to be written as well, but that is also required by the other approach.

Spawning the display

Spawning the display basically means starting another process from the profiler. The display will hence be a totally different program, but it will be possible to communicate with it through a special MPI communicator given during the spawning process.

The difficulty of that approach lies in the spawning idea. As the 2 processes are tied together, if one of them dies (from an error, or simply because the display is closed), ORTE (the daemon that manages MPI communication in OpenMPI; MPICH2 must have something similar) will kill the other process. Therefore there is no really clean way of exiting both programs.

Moreover the display program should either be accessible from the PATH, or the profiler has to have a way to find where it is stored.

Hence the advantages are on the communication side. Both programs use MPI to communicate, which is rather simple and avoids the port problem.

Meeting 2 [07/02/11]

During the second meeting I presented my results to David. I also explained that using the C++ Standard Template Library could be nice and effective to store information on the profiler side.

One of the problems comes from the MPI interface, which has to be overloaded in C, while the interface/STL code has to be in C++. Using C from C++ is relatively easy (the extern "C" keyword, and most of the C standard libraries are available, like #include <cstdlib> for #include "stdlib.h"). The other way around is tricky enough to deserve some proper thought.
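
For the C-calling-C++ direction, one common technique (not something tried in this project so far) is to declare the functions the C code needs in a header guarded with extern "C", so that a C++ compiler exports them with C linkage and without name mangling. A minimal sketch follows; display_notify and the header layout are hypothetical, and the implementation is a plain C stand-in so that the example builds on its own.

```c
#include <assert.h>
#include <stddef.h>

/* --- what would go in a shared header (hypothetical name: display_api.h) --- */
#ifdef __cplusplus
extern "C" {   /* ask a C++ compiler to use C linkage (no name mangling) */
#endif

int display_notify(int rank, const char *event);

#ifdef __cplusplus
}
#endif

/* --- stand-in implementation ---
   In the real setting this would live in a C++ file that includes the
   header above and forwards to C++ code; here it is plain C. */
static int notify_count = 0;

int display_notify(int rank, const char *event)
{
  assert(event != NULL);

  if (rank < 0)
    return -1;

  ++notify_count;        /* number of accepted notifications so far */
  return notify_count;
}
```

The C profiler would then include the header and call display_notify like any other C function, while the final link step is done with the C++ compiler so that the C++ runtime gets pulled in.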

So two main goals are to be considered for next meeting :

try to do an interface with Qt - and thus tackle the C/C++ binding
start to think about the display:
- what should be displayed
- how should it be displayed

The idea of the project is therefore still focused on flagging up common errors, not on developing a Swiss-army knife for MPI.
The basic errors listed so far during the meeting were:

broadcast on a single node
synchronous send with no matching receive
data problems

The next meeting will be Monday 21st of February

Wednesday 2 February 2011

Using the MPI profiling interface

How does the MPI profiling interface work? The answer is almost too easy. Finding out how to use it is more complex.

The basic idea of the MPI profiling interface is simple: every single MPI function actually provides two entry points. One has the classical MPI_ prefix, the other has PMPI_. Thus the whole idea is to overload the MPI_ ones, and call the corresponding PMPI_ ones in the middle. This approach gives full access to both the parameters and the return code.
Moreover the PMPI_ calls are part of the MPI standard definition and are therefore common to every implementation.

In order to test the MPI profiling interface I wrote the simplest MPI code possible. Two pieces were needed: one for the wrapper (a header and its implementation), one for the program.

mpi_wrap.h

#ifndef MPI_WRAP
#define MPI_WRAP

int MPI_Init(int *argc, char ***argv);

#endif

mpi_wrap.c

#include "mpi_wrap.h"

#include <mpi.h>
#include <stdio.h>

int MPI_Init(int* argc, char ***argv)
{
  int ret;
  fprintf(stderr, "Prof: MPI_Init(...)\n");

  ret = PMPI_Init(argc, argv);

  return ret;
}

mpi_hello.c

#include <mpi.h>
#include <stdio.h>

#include "mpi_wrap.h"

int main()
{
  int rank=0, pop=0;

  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &pop);

  if ( rank == 0 )
    printf("%d: I'm the master of %d puppets.\n", rank, pop);

  MPI_Finalize();

  return 0;
}

On Ness

Ness is the EPCC cluster used by MSc students, on Scientific Linux.

MPI installed: mpich-2

MPI C Compiler: pgcc

The first problem came from the compilation of the library. For a yet unknown reason ld doesn't want to link.

mpicc -c -fPIC mpi_wrap.c -o mpi_wrap.o
mpicc -shared -soname=libmpi_wrap.so -o libmpi_wrap.so mpi_wrap.o

Compilation ended by

/usr/bin/ld: /opt/local/packages/mpich2/1.0.5p4-ch3_sock-pgi7.0-7/lib/libmpich.a(init.o): relocation R_X86_64_32 against `MPIR_Process' can not be used when making a shared object; recompile with -fPIC
/opt/local/packages/mpich2/1.0.5p4-ch3_sock-pgi7.0-7/lib/libmpich.a: could not read symbols: Bad value.

Thus the static library approach was taken.

mpicc -c -fpic mpi_wrap.c -o mpi_wrap.o
ar rcs libmpi_wrap.a mpi_wrap.o

That compiled well.

Then comes the program compilation, that is straightforward.

mpicc -c -I. mpi_hello.c -o mpi_hello.o
mpicc mpi_hello.o -L. -lmpi_wrap -o mpi_hello

And the result worked fine:

$> mpiexec -n 2 mpi_hello
mpiexec: running on ness front-end; timings will not be reliable.
Prof: MPI_Init(...)
Prof: MPI_Init(...)
0: I'm the master of 2 puppets.

At home

My home desktop machine is using a Gentoo/Linux installation.

MPI installed: OpenMPI

MPI C Compiler: gcc

Compiling the dynamic library worked:

mpicc -c -fpic mpi_wrap.c -o mpi_wrap.o
mpicc -shared -Wl,-soname,libmpi_wrap.so mpi_wrap.o -o libmpi_wrap.so

And compiling the executable too:

mpicc -c -I. mpi_hello.c -o mpi_hello.o
mpicc -L. -lmpi_wrap mpi_hello.o -o mpi_hello

Of course, as the dynamic library approach is used, the LD_LIBRARY_PATH environment variable has to be set to the directory where the .so is:

export LD_LIBRARY_PATH=`pwd`

Finally running works as well:

$> mpiexec -n 2 mpi_hello
Prof: MPI_Init(...)
Prof: MPI_Init(...)
0: I'm the master of 2 puppets.

Discussion

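It is rather strange that Ness doesn't want to link the wrapper as a dynamic library. The ld message suggests that the static libmpich.a installed there was not compiled with -fPIC and so cannot be pulled into a shared object. Further investigation will be done on that problem, in order to find an answer.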

Using a statically linked library offers the advantage of simplicity (no need to set up LD_LIBRARY_PATH) but increases the size of the executable, especially once the tool includes the graphical interface.

The advantages of the dynamically linked library are the reverse: a smaller executable, at the expense of a little configuration.

As far as possible I will try to use the dynamically linked library approach, as the graphical interface will certainly contain a lot of code that is not directly needed in the program. But the library has to be present in a common location if used on a cluster, and this is something I need to investigate further.

References

No real references here, but just some websites that helped me remember how to create libraries. And of course how to use the MPI profiling interface.

Creating a shared and static library with the gnu compiler [gcc] - René Nyffenegger

Open MPI FAQ: Performance analysis tools

Meeting 1 [24/01/11]

During the first meeting David and I discussed the main goals of the project. From the project proposal and some thinking we started to agree on several points.

Few parallel debuggers exist at the moment, and most of them are expensive and not very useful. This tool shouldn't be a debugger.

When people start learning MPI, there are 2 things they mainly get wrong:

communications (the typical example is the broadcast call, which has to be performed by all the nodes, and which learners only call on one)
sending the wrong data, either from the wrong source, using a wrongly built datatype, or to the wrong node.

Therefore it can be interesting to create a tool that helps resolve these problems on a small program that runs on a small number of nodes. The problem size is important here, as the tool would provide a graphical interface to the user that would become quite unreadable for a large number of nodes. Moreover it is rather unlikely that people who need very large problems will use that kind of tool.

A first draft of the requirements can be:

a graphical interface to visualise ongoing actions
being able to monitor the state of an MPI node
being able to block when a monitored action occurs, for the user to see it (communication waiting, sending, ...)
being able to register and track some simple data (1D arrays)

visually
derived datatypes
nD arrays

This tool will be provided as a library, that can be linked with any MPI code in C, and if possible Fortran as well. The generic MPI profiling interface will be used to catch the information.
The MPI coursework from the 1st semester will provide a test case, and the first goal of the project is to demonstrate how this piece of code works using the tool.

Two main dangers have to be taken care of during the project specification:

being too ambitious will result in a project failure, struggling with implementation
being not ambitious enough will result in a useless tool

In order to cope with these risks, an iterative prototype development approach will certainly be used.

Next meeting: 07/02/11
Work to be done: try to use the profiling interface with MPICH2 and OpenMPI.

Original project proposal

Real-time visualisation of MPI programs
David Henty

One of the problems with MPI programming is that it is very difficult to debug incorrect programs. Tools like VAMPIR can display the communications patterns of MPI programs by producing a trace file during execution and enabling the user to view the file as a timeline afterwards. Unfortunately, this is only useful if the program runs to completion which is usually not the case when you have a bug! It would also be useful to track MPI communications at runtime for training and education purposes, allowing new users to see what their programs are doing, or to run standard examples and follow their execution so they can understand concepts such as synchronous/asynchronous modes and blocking/non-blocking operations.

The project is to develop a tool/library that, for each MPI processes, pops up a window that shows real-time information about its execution. For example, it could just say what routine was being called ("Currently in MPI_Send"), give more details ("Calling MPI_Send to send 14 real numbers to rank 4") or display the operations graphically (eg boxes showing all the pending sends and receives, animations showing messages matching up at runtime etc etc). This tool would then be run on a set of test programs from simple examples all the way to full applications to see how useful it is in practice. Possible extensions include halting execution until the user hits a button ("click here to continue") which could be very useful in illustrating concepts such as collective communications: the routine will not complete until the user has clicked "go" for all MPI processes. Another possibility would be to display where in the source code each process is at any one time.

It is quite simple to do this in practice as the MPI library has a separate "profiling interface" that enables all MPI calls easily to be intercepted by the user. Here, we would then display information about the call in some way (eg write text to a window) before calling the real MPI routine.

The tool could easily be developed and tested on a single workstation with all MPI processes displaying information on the same screen. However, it would be more interesting to run on a real cluster like the EPCC training room machines. Here, a window would appear on each screen where an MPI process was running and there would be interactions between different machines in the room. A user at one screen might have to call to a user at another screen for them to initiate a receive operation so that the first user's synchronous send can complete.

The tool should work with both C and Fortran, but will itself be developed in C. A good knowledge of C programming is therefore required. Previous experience in graphics programming would be useful but not essential.