Wednesday, 30 March 2011

The test programs

During the 1st semester, in the Message Passing Programming lectures, EPCC staff taught us how to use the basic features of MPI and how to avoid some common mistakes. For this project some of those examples will be reused to ensure that both the profiler and the display work correctly.

The Message in a Ring

The message in a ring is a simple MPI program in which each process sends data to the next one and therefore receives data from the previous one. This program was developed to investigate the differences between the several send and receive possibilities offered by the MPI standard:

Asynchronous send ; receive ; wait for the send
Asynchronous receive ; send ; wait for the receive
Use of the special send and receive function

This code can be used as an example of real-time communication, and to check how waiting on communications is handled.

Calculating PI

Calculating PI was an exercise in which each processor computed a part of PI, and a reduction then added the partial results together. The goal was to illustrate the rounding errors that can appear when the sum isn't done in the same order. To avoid them, an array was created on the master processor to store each partial result, and the sum was done at the end.

This code can be used to show a very simple case of data recording (floats).
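The rounding effect described above can be reproduced without MPI. The following is only a minimal sketch, not the original exercise code: the function names are made up for illustration, and the array stands for the per-process partial results gathered on the master.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the partial results starting from index 0, as a master process
   would do after storing one value per rank in an array. */
float sum_in_rank_order(const float *partial, size_t n)
{
    float total = 0.0f;

    assert(partial != NULL);
    for (size_t i = 0; i < n; ++i)
        total += partial[i];
    return total;
}

/* Same values, summed starting from the last index. */
float sum_in_reverse_order(const float *partial, size_t n)
{
    float total = 0.0f;

    assert(partial != NULL);
    for (size_t i = n; i > 0; --i)
        total += partial[i - 1];
    return total;
}
```

With the three values 1.0e8f, -1.0e8f and 1.0f, and assuming IEEE 754 single precision arithmetic, the first function returns 1 while the second returns 0, because the 1 is lost when it is added to a value as large as 1.0e8. Always summing in one fixed order does not remove the rounding, but it makes the result reproducible from run to run.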

The traffic model

The traffic model simulation was a simple domain decomposition model, where each process owns several cells of road. Each cell is either occupied or empty. A car moves forward when the cell in front of it is empty. Therefore some communication is needed to move cars across neighbouring processes: each process has to check whether the next cell, owned by its neighbour, is empty (a very simple 1D halo swap).
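The update rule above can be sketched for the cells owned by one process. This is an illustrative sketch only: the function name and the array layout (n local cells surrounded by two halo cells that would be filled by the halo swap) are assumptions, not the course code.

```c
#include <assert.h>
#include <stddef.h>

/* One time step on the local stretch of road.
   Both arrays hold n + 2 cells: indices 1..n are the local cells,
   indices 0 and n + 1 are halo copies of the neighbours' boundary cells.
   A cell holds 1 if occupied and 0 if empty.
   Returns the number of local cars that moved. */
int update_road(const int *old_road, int *new_road, size_t n)
{
    int moved = 0;

    assert(old_road != NULL && new_road != NULL);
    for (size_t i = 1; i <= n; ++i) {
        if (old_road[i] == 1) {
            /* Occupied: the car stays only if the cell ahead is occupied. */
            new_road[i] = old_road[i + 1];
            if (old_road[i + 1] == 0)
                ++moved;
        } else {
            /* Empty: the cell is filled if the car behind moves in. */
            new_road[i] = old_road[i - 1];
        }
    }
    return moved;
}
```

In the parallel version, the two halo cells would be exchanged with the neighbouring processes before each call, which is the communication a profiler would then observe at every step.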

The case study and coursework

The idea was to introduce a very simple reverse edge detection algorithm. The solution is quite compute-intensive, as a smoothing operation has to be applied many times in order to recover the original picture. Therefore a simple domain decomposition was done along 1 of the dimensions to share the work among processors.

This code involves typical halo swapping and the use of MPI datatypes.

The coursework introduces a 2D domain decomposition, which makes sharing data with the neighbouring processes more complex.
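To give an idea of why the 2D case is more complex, here is a small sketch that computes the four neighbours of a process in a 2D process grid. It is not the coursework code: the struct, the function name, the row-major rank ordering and the non-periodic boundaries are all assumptions made for illustration. In a real MPI program, MPI_Cart_create and MPI_Cart_shift can provide the same information.

```c
#include <assert.h>

#define NO_NEIGHBOUR (-1) /* hypothetical marker, plays the role of MPI_PROC_NULL */

typedef struct {
    int up;
    int down;
    int left;
    int right;
} Neighbours;

/* Neighbours of `rank` in a rows x cols process grid,
   with row-major rank ordering and non-periodic boundaries. */
Neighbours grid_neighbours(int rank, int rows, int cols)
{
    Neighbours n;
    int row, col;

    assert(rows > 0 && cols > 0 && rank >= 0 && rank < rows * cols);

    row = rank / cols;
    col = rank % cols;

    n.up    = (row > 0)        ? rank - cols : NO_NEIGHBOUR;
    n.down  = (row < rows - 1) ? rank + cols : NO_NEIGHBOUR;
    n.left  = (col > 0)        ? rank - 1    : NO_NEIGHBOUR;
    n.right = (col < cols - 1) ? rank + 1    : NO_NEIGHBOUR;
    return n;
}
```

Each process now has up to four halo exchanges per iteration instead of two, and one pair of them involves data that is not contiguous in memory, which is where MPI datatypes become useful.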

Saturday, 26 March 2011

Meeting 3 and 4 [21/02 and 14/03]

The deadline for the Project Preparation report and presentation fell during the month of March. The goal of this module was to justify the research done and to prove the feasibility of the project.

During the 3rd meeting the MPI socket organisation was presented to David, showing a new direction for the communication. The discussion was mainly about the report, and what to write in it.

A set of tests from the earlier MPI course was also discussed, including 3 simple codes and 1 more complex one:

calculating PI
A message in a ring
The traffic model

The more complex one would be the MPI case study and its evolution, the coursework, which is about image processing. All tests will be discussed later in this blog.

The main features of the software are the communication information (general statistics, to find missing calls to MPI_Wait for example) and the data display, which shows what data are sent and where they are stored. The last feature is a synchronised view of the communication, equivalent to stepping through a program in a debugger, but at the level of MPI calls rather than C or assembly instructions.

Little progress was made for the 4th meeting, as the report was due a few days later. Nonetheless some goals were discussed. The project will provide a framework composed of a library (the profiler part) and an executable (the interface). The goal is to provide a real-time global view of the program, and it targets modest MPI programs with few processes. As it is only an information tool, no high performance is needed, but effort will be made to ensure at least correct memory management.

Sunday, 20 February 2011

Using MPI sockets

This article presents how to use MPI to create a remote socket and use it through MPI calls. Remember that two parts communicate: the profiler - the library part that uses the profiling interface of MPI to profile the program - and the display - that displays the information sent by the profiler.

First of all, some research was done to find out how to create a socket with MPI on the profiler and communicate with some other socket library on the display. So far no examples were found using that approach, and as this is a technical test, no real implementation was done that way.

The approach used here is to bind the profiler and display communicators using a technique similar to the spawn approach (MPI_Comm_spawn) but that doesn't require the 2 programs to be tied together. This is done using the MPI_Open_port function.

The code wasn't modified a lot from the MPI Spawn approach, as you are going to see. The reference used to understand and develop that approach was the MPI standard website: 5.4.6. Client/Server Examples

The profiler side - server side

The general idea of that approach is for the profiler to open a port and wait for a display to connect to it. The idea can be pushed further, if needed, to allow several displays to connect to a single profiler (sharing the view of the program on several displays, for example).

What was modified from the Spawn example is the way the profiler and the display are connected. Rather than calling MPI_Comm_spawn, MPI_Open_port and MPI_Comm_accept are used, and a few lines were added just before finalizing the execution.

Opening the port

int start_child(char* command, char* argv[])
{
  MPI_Open_port(MPI_INFO_NULL, port_name);

  /* child doesn't find it...
    sprintf(published, "%s-%d\0", PROFNAME, world_rank);

    MPI_Publish_name(published, MPI_INFO_NULL, port_name);*/

  fprintf(stderr, "!profiler!(%d) open port '%s'\n", world_rank, port_name);

  fprintf(stderr, "!profiler!(%d) waiting for a child...\n", world_rank);

  MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

  fprintf(stderr, "!profiler!(%d) got a child!\n", world_rank);

  int r;
  MPI_Comm_rank(intercomm, &r);
  fprintf(stderr, "!profiler!(%d) is %d on parent!\n", world_rank, r);

  // wait for a message that "I'm dying"
  if ( PMPI_Irecv(&(quitmessage[0]), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, CHILD_RANK, INTERCOMM_TAG, intercomm, &dead_child) != MPI_SUCCESS )
    {
      fprintf(stderr, "!profiler!(%d) communication failed!\n", world_rank);
      intercomm = MPI_COMM_NULL;
      return FAILURE;
    }

  char mess[INTRA_MESSAGE_SIZE];
  // snprintf bounds the write and terminates the string itself
  snprintf(mess, sizeof(mess), "%d IsYourFather", world_rank);


  sendto_child(mess);

  PMPI_Barrier(MPI_COMM_WORLD);


  return SUCCESS;
}

Finalizing the communication

int wait_child(char* mess)
{
  // send my death
  if ( sendto_child(mess) == SUCCESS )
    {
      // wait his death
      if ( PMPI_Wait(&dead_child, MPI_STATUS_IGNORE) == MPI_SUCCESS )
        {
          fprintf(stderr, "!profiler!(%d) received its child death!\n", world_rank);
          //MPI_Unpublish_name(published, MPI_INFO_NULL, port_name);
          MPI_Close_port(port_name);
          return SUCCESS;
        }
    }

  return FAILURE;
}

The display side - the client side

On the display side, the same kind of modification had to be made. Rather than using information from the parent's communicator, a connection to a port is performed.

The MPIWatch::getWatcher method

MPIWatch* MPIWatch::getWatcher(char port_name[])
{
    if ( instance == 0 )
    {
        MPI::Init();

        std::cout << "Try to connect to " << port_name << std::endl;

        parent = MPI::COMM_WORLD.Connect(port_name, MPI::INFO_NULL, 0);

        if ( parent == MPI::COMM_NULL )
        {
            std::cerr << "Cannot connect with the parent program! Aborting." << std::endl;
            MPI::Finalize();
            return 0;
        }

        std::cout << "Connection with parent completed!" << std::endl;

        instance = new MPIWatch();
    }

    return instance;
}

Running it!

The main difference is that in the previous version the display was started automatically by the profiler. Now it has to be started separately, one per MPI process. Some attempts were made to use the name publication described in the standard (see the reference further up), but for an unknown reason the display never found the name published by the profiler. So far, 1 port is opened per MPI process - or 1 name is published - and each display connects to 1 of them through command line input.

Console 1: run MPI

$> mpiexec -n 2 mpi_ring
!profiler!(0) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'
!profiler!(1) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

Console 2-3: run the display

$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'

$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

The current implementation is a little more complicated to run than the spawn version, but doesn't produce any error code when finishing. It also allows more flexibility in the future: more than one display on a single profiler, and any other idea that requires a more flexible approach than a spawned process (like being able to connect a display in the middle of a run and disconnect at will, to see if the program is deadlocked, etc.).

Limitations

The port information is rather long, and it is not user friendly to have to look up the profiler output and copy/paste the port information into the display. Further investigation has to be made on that part, in order either to solve the name publication problem or to find a more automatic way to look up the port. The name publication idea was to publish a name to look up, like 'profiler-mpirank' - or with any string given by the user instead of profiler. This would allow the displays to be started with a single command that only needs 2 pieces of information: the base name of the profiler and the number of MPI processes to connect to!
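The naming scheme could be shared by both sides through a small helper. The sketch below only builds the name; the function name is hypothetical, and the actual publication and lookup (MPI_Publish_name on the profiler, MPI_Lookup_name on the display) are left out because that part did not work in the tests described here.

```c
#include <assert.h>
#include <stdio.h>

/* Build "<base>-<rank>" into buf, e.g. "profiler-3".
   Returns 0 on success, -1 if the name does not fit in the buffer. */
int make_service_name(char *buf, size_t size, const char *base, int rank)
{
    int written;

    assert(buf != NULL && base != NULL);
    written = snprintf(buf, size, "%s-%d", base, rank);
    return (written >= 0 && (size_t)written < size) ? 0 : -1;
}
```

A display given the base name and the number of processes could then loop over the ranks, build each name with the same helper, and look up the matching port.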

The other limitation is not a real one, but more a bug in the current implementation. A barrier was added to wait for every MPI process to get a display; this isn't much of a problem, as high performance is not required for this project. The problem arises when one display is closed while the program is running. The current implementation doesn't catch it, and deadlocks. Further investigation will be done on that problem later on.

Source code

As for the previous version, the source code is available at http://www.megaupload.com/?d=ZXJGHBPQ. It is a test version, not very clean, and buggy (as explained above). Later on, a post will explain how to use the library with an MPI code in C.

Further work

The preliminary technical overview of the project is almost over. Now that the technical basis of the project is set up, more detailed reasoning will be done on the project's functional requirements. As part of the Project Preparation course of the MSc, a risk analysis and a workplan for the overall project have to be done as well, and will be published here.

Tuesday, 15 February 2011

A bit of software engineering

This article only details some changes made to the code in order to have a more adaptable test program. It also explains how to use the library with an MPI program in C.

The Project

The project is so far organised around 2 things: the profiler and the display. The profiler is produced as a library, that patches some of the MPI calls. The display is an executable that only displays information from the profiler.

The current directory architecture reflects that organisation, where the display is actually in a subdirectory of the profiler (the interface one).

When built, 3 folders are created:

dynamic containing the library as a .so - or static with the .a
includes which contains the headers to add to the MPI program you want to use with the profiler (mpi_wrap.h is the one to include; intra_comm.h just defines how the display and profiler communicate, and can be used later on to develop another display)
display that obviously is the folder where the display executable is stored.

The actual profiler is written in C, and therefore uses MPICC (on my machine GCC - no build was really done on Ness, as for the moment Qt isn't installed on it).

The display is implemented in C++ using both Qt and MPI and uses the powerful .pro files to handle compilation.

The profiler

The profiler is organised so far around:

mpi_basic.c and mpi_communication.c, which implement the MPI functions declared in mpi_wrap.h.
child_comm.c, child_comm.h and intra_comm.h, which implement the profiler/display communication.

MPI overloading

Only the functions declared in mpi_wrap.h are overloaded, and this is so far the only file that has to be included from the original MPI program. Each of these functions calls parts of the child_comm module to communicate with the child, and the user doesn't have to bother with them.

The child_comm module

Very few types of communication are required with the display, so the header is rather small:

child_comm.h

#ifndef CHILDCOMM
#define CHILDCOMM

int start_child(char* command, char* argv[]);
int alive_child();
int sendto_child(char* mess);
int wait_child(char* mess);

#endif // CHILDCOMM

start_child starts the child, and is therefore called in MPI_Init()
sendto_child sends information to the child; the message has a fixed size defined in intra_comm.h
wait_child waits for the child's death (i.e. makes sure it received all the information before closing the communication) and is thus called in MPI_Finalize()
alive_child returns either SUCCESS or FAILURE (defined in intra_comm.h) to tell whether the child is still running or not.

Such an approach allows different ways of communicating with the child without directly affecting the overloaded MPI functions, and vice versa.
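The messages built by the profiler in the other posts of this blog are plain strings of the form "rank event" (for example "0 Init" or "1 Finalize"). As an illustration of how small the protocol is, here is a sketch of a parser for that format. The function names and the EVENT_MAX limit are hypothetical and are not part of intra_comm.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EVENT_MAX 32 /* hypothetical limit on the event name length */

/* Split a "<rank> <event>" message into its two fields.
   Returns 0 on success, -1 if the message does not match the format. */
int parse_profiler_message(const char *message, int *rank, char event[EVENT_MAX])
{
    assert(message != NULL && rank != NULL && event != NULL);
    /* %31s stops at whitespace and leaves room for the terminator. */
    return (sscanf(message, "%d %31s", rank, event) == 2) ? 0 : -1;
}

/* Compare a parsed event name with an expected one. */
int is_event(const char *event, const char *name)
{
    return strcmp(event, name) == 0;
}
```

Keeping the format in one place like this is what would let a different display, or a different transport, be swapped in without touching the overloaded MPI functions.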

The display

The display is developed using the Qt library, and uses a classical directory organisation. Qt provides an excellent tool, qmake, which generates Makefiles from a project file (here mpidisplay.pro) and adapts them to the platform. From one platform to another, only minor modifications have to be made to the file, such as the first 2 lines that define the MPI compile and link flags. Note that Qt uses GCC as a compiler.

Extract of the mpidisplay.pro

# using 'mpicxx -showme:compile' and 'mpicxx -showme:link'
MPICXX_COMPILE = -I/usr/local/include -pthread
MPICXX_LINK = -pthread -L/usr/local/lib -lmpi_cxx -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl

Qt also provides a good interface designer, which will be used to generate the GUI; the generated forms are stored in the forms folder. The src folder contains the sources.

The code organisation

The display code is organised around 2 classes so far:

MPIWatch, which is implemented as a singleton and is the only class that deals with MPI communication (i.e. communicates with the profiler). It therefore uses some information from intra_comm.h.

It inherits from the QThread class, Qt's portable thread class (presumably using pthreads on Unix), which allows communication and display updates to be separated.

Communication with the other class is done through Qt's signals, which are a kind of remote call. When a message is received from the profiler, it is stored on a message stack, and the signal newMessage() is emitted.

CommStat, which is a classical QWidget displaying basic information on the number of sends and receives. It pops information from the MPIWatch object each time the latter signals a new message.

How to use the mpi_wrap library?

Using the library is very easy, and standard.

Add the #include "mpi_wrap.h" line to the code that uses MPI.
Compile the files with the path of the include files (usually -I)
Link the executable with the path of the library, and the library name (usually -L and -lmpi_wrap).

Example in a Makefile

# path where the library is installed
MPI_WRAPPER = /home/workspace/project/current
# linking is either static or dynamic, will look in $MPI_WRAPPER/$linking
linking = dynamic

DEFINES+=
CC= mpicc
CFLAGS= -g $(DEFINES) -I${MPI_WRAPPER}/includes


LFLAGS= -lm -L${MPI_WRAPPER}/$(linking) -lmpi_wrap

EXE= ring

SRC= ring.c

OBJ= $(SRC:.c=.o)

.c.o:
	$(CC) $(CFLAGS) -c $<

all: $(EXE)

$(EXE): $(OBJ)
	$(CC) $(CFLAGS) -o $@ $(OBJ) $(LFLAGS)
	@echo "don't forget export LD_LIBRARY_PATH='$(MPI_WRAPPER)/$(linking)'"
	@echo "don't forget to add $(MPI_WRAPPER)/display to the PATH!"

clean:
	rm -f $(OBJ) $(EXE)

The sources

The sources are available on http://www.megaupload.com/?d=DDUQP5QH.

Saturday, 12 February 2011

Using MPI_Spawn

This article presents how to use MPI_Comm_spawn and the problems associated with it. It first shows the profiler code, then the display code, and finally discusses the problems.

Spawn the interface: profiler point of view

In order to spawn the interface, the PATH variable was exported so that it contains the path to the mpidisplay executable, the simple interface developed for this test. It basically counts the number of calls to some of the MPI communication functions.

The spawning occurs in the overloaded MPI_Init function:

int world_rank;
MPI_Comm intercomm = MPI_COMM_NULL;
int intercomm_child_rank = 0; 

int MPI_Init(int* argc, char ***argv)
{
  int ret;

  ret = PMPI_Init(argc, argv);

  PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  fprintf(stderr, "!profiler(%d)! MPI_Init()\n", world_rank);
  
  // spawn the interface
  MPI_Comm_spawn("mpidisplay", MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

  return ret;
}

This simply starts the display when the profiler is started through mpiexec, and links them together. But as soon as MPI_Finalize is called, both of them are killed, and the interface is closed. Thus a trick was used to make the profiler wait for the child to be closed before it stops running.

The idea is that the display sends a message to the profiler when it is closed, and that the profiler waits for this message with an asynchronous receive posted from the beginning. When MPI_Finalize is called on the profiler, an MPI_Wait on that message is performed, basically waiting for the display to be closed before resuming. The profiler also sends a message about its imminent death, so that the display can show that information if needed.

#define CHILD "mpidisplay"
#define CHILD_ARGS MPI_ARGV_NULL

int world_rank;
MPI_Comm intercomm = MPI_COMM_NULL;
int intercomm_child_rank = 0; 

static Intra_message quitmessage[INTRA_MESSAGE_SIZE]; 
MPI_Request dead_child = MPI_REQUEST_NULL;

int MPI_Init(int* argc, char ***argv)
{
  int ret;
  Intra_message message[INTRA_MESSAGE_SIZE];

  ret = PMPI_Init(argc, argv);

  PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  //PMPI_Comm_size(MPI_COMM_WORLD, &world_size);

  fprintf(stderr, "!profiler(%d)! MPI_Init()\n", world_rank);

  
  // spawn the interface
  MPI_Comm_spawn(CHILD, CHILD_ARGS, 1, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

  snprintf(message, sizeof(message), "%d Init", world_rank);

  PMPI_Ssend(message, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm);

  // wait for a message that "I'm dying"
  PMPI_Irecv(&(quitmessage[0]), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm, &dead_child);// check each time if the child is dead...

  return ret;
}

int MPI_Finalize(void)
{
  int ret;

  fprintf(stderr, "!profiler!(%d): MPI_Finalize()\n", world_rank);

  if ( dead_child != MPI_REQUEST_NULL )
    {
      Intra_message message[INTRA_MESSAGE_SIZE];

      snprintf(message, sizeof(message), "%d Finalize", world_rank);

      // send my death to the display
      PMPI_Ssend(message, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, intercomm_child_rank, 0, intercomm);

      fprintf(stderr, "!profiler!%d is waiting for its child...\n", world_rank);

      // wait for the display to quit
      PMPI_Wait(&dead_child, MPI_STATUS_IGNORE);
      fprintf(stderr, "!profiler!%d finished waiting...\n", world_rank);
    }

  ret = PMPI_Finalize();

  return ret;
}

Spawn the interface: display point of view

The display was implemented using Qt, and is therefore in C++. The MPI calls are the same, just organised in an object-oriented fashion.

When the child is spawned, it can retrieve its parent's information, and does so in order to get the special communicator. Then it simply uses normal MPI communication with it.

The MPIWatch class was written to handle the MPI communication. It implements the singleton design pattern. The MPI initialisation code is therefore in the static call that creates the object, and is normally performed only once (as the object is kept until the end of the program).

MPIWatch* MPIWatch::getWatcher(void)
{
    if ( instance == 0 )
    {
        //MPI::Intercomm parent = MPI::COMM_NULL;
        int parentSize;

        MPI::Init();
        parent = MPI::Comm::Get_parent();

        if ( parent == MPI::COMM_NULL )
        {
            std::cerr << "Cannot connect with the parent program! Aborting." << std::endl;
            //parent.Abort(-1);
            MPI::Finalize();
            return 0;
        }

        parentSize = parent.Get_remote_size();

        if ( parentSize != 1 )
        {
            std::cerr << "Parent communicator size is " << parentSize << "! It should be 1. Aborting." << std::endl;
            parent.Abort(-1);
            return 0;
        }

        instance = new MPIWatch();
    }

    return instance;
}

How the instance picks up messages will be discussed later. Basically the MPIWatch object does synchronous receives from its parent and pushes the results on a stack that is read by the interface.

When the window is closed, the MPIWatch object has to be destroyed, and the quit message is therefore sent to the parent.

bool MPIWatch::delWatcher()
{
    if ( ! instance )
        return false;

    if ( instance->isRunning() )
        return false;

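    // Copy the quit message into a zero-filled buffer of the full message size,
    // so that Ssend never reads past the end of a shorter string.
    // Assumes MESSAGE_QUIT is a C string; needs <cstring>.
    char buffer[INTRA_MESSAGE_SIZE] = {0};
    std::strncpy(buffer, MESSAGE_QUIT, INTRA_MESSAGE_SIZE - 1);

    parent.Ssend(buffer, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, 0, 0);

    MPI::Finalize();
    parent = MPI::COMM_NULL;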

    delete instance;
    instance = 0;

    return true;
}

Problems with spawned instances

The major issue with the spawn interface is the call to MPI_Finalize. When either the child or the parent calls it, the ORTE process - the daemon that handles the MPI communication in OpenMPI; MPICH should have something similar - kills the other. Therefore, even with the trick of making the profiler wait for the display, the execution does not always terminate properly. It is rather bizarre that there is no proper way of doing so.

A bit more research will certainly be done on that problem, to see if closing the communicator can be effective. But there is not much advantage compared to a typical client-server application, and the next development tests will be done on that.

Tackling C++ from C

Why use both C and C++ in a single program when MPI provides a C++ wrapper? First of all, most scientific programs are written in either C or Fortran, so providing a library limited to C++ is not really in the scope of the project. Moreover, a solution that leaves the freedom to use C or Fortran for the MPI profiling interface (called the profiler) and any other language or library for the interface (called the display) is, from my point of view, a good approach.

In the current state of the project, the profiler is written in C - and it will certainly be written only in C for the whole project - and the interface has to be written using the Qt C++ library. The problem is therefore to call the corresponding C++ method when an MPI call is handled - hence calling C++ from C.

The first approach was to try to bind C++ in C, and was a big failure. The code was a simple function call (not even a method from an object) and it didn't link properly. Therefore a more modular solution had to be found.

Having separate programs for the profiler and the display is certainly the key to the problem. It requires, however, another way of communicating than simple function calls. MPI provides functions to spawn another process. It also provides socket handling.

During the next week I will try both solutions and choose between them.

A client and a server

The typical communication with sockets can be achieved with a client-server communication. The display will be a server, and the profiling interface will connect on it, and send information.

In order to provide a display per profiler, several interfaces will be started, each of them on its own port. The obvious idea is to use a "base" port (say 4242) and to add the MPI process rank to find which port to use for communication. Thus on a 4-process job, 4 displays will start, listening on 4242, 4243, 4244 and 4245 respectively. Each profiler will then connect to one of them, according to its rank.

Using sockets should be easy enough from MPI and Qt, as both libraries provide a "high" level interface.

The obvious advantage of such an approach is the total independence of the two programs. One can communicate with the other through a defined protocol without any trouble. It also allows the profiler to be in any language, and the display to be rewritten at will - to display more specific information or to use another library/language.

The obvious disadvantage is the opening of several ports, which might be troublesome on some restricted networks. A communication protocol has to be written as well, but that is also needed by the other approach.
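The port rule is simple enough to be written down as a function shared by the profiler and the display. This is only a sketch of the idea: the function name and the range check against the highest valid TCP port are additions for illustration.

```c
#include <assert.h>

#define MAX_PORT 65535 /* highest valid TCP port number */

/* Port on which the display attached to `rank` would listen.
   Returns -1 if base_port + rank is not a valid port number. */
int display_port(int base_port, int rank)
{
    assert(rank >= 0);
    if (base_port < 1 || base_port > MAX_PORT - rank)
        return -1;
    return base_port + rank;
}
```

With a base port of 4242, rank 0 maps to 4242 and rank 3 to 4245, matching the 4-process example above.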

Spawning the display

Spawning the display is basically starting another process from the profiler. The display will hence be a totally different program, but it will be possible to communicate through a special MPI communicator given during the spawning process.

The difficulty of that approach lies in the spawning itself. As the 2 processes are tied together, if one of them dies (from an error, or simply because the display is closed) ORTE (the daemon that manages MPI communication with OpenMPI - MPICH2 must have something similar) will kill the other process. Therefore there is no really clean way of exiting both programs.

Moreover the display program should be either accessible from the PATH or the profiler has to have a way to find where it is stored.

The advantages are therefore on the communication side. Both programs use MPI to communicate, which is rather simple and avoids the port problem.

Meeting 2 [07/02/11]

During the second meeting I presented my results to David. I also explained that using the C++ Standard Template Library could be nice and effective to store information on the profiler side.

One of the problems comes from the MPI interface, which has to be overloaded in C, while the interface/STL code has to be in C++. Using C from C++ is relatively easy (the extern "C" keyword, and most of the C standard libraries are available - like #include <cstdlib> for #include "stdlib.h"). The other way around is tricky enough to deserve proper thought.

So two main goals are to be considered for the next meeting:
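One common way to make the two languages meet is a header that both compilers can read, where the declarations are wrapped in extern "C" only when the file is compiled as C++. The sketch below shows the pattern with a hypothetical display_notify function; the body given here is a plain C stand-in so that the example compiles on its own, whereas in a mixed build it would live in a C++ source file and forward to the C++ code.

```c
#include <assert.h>
#include <stddef.h>

/* Declarations that would normally sit in a shared header. */
#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical entry point called by the C profiler for each event.
   Returns the number of events received so far. */
int display_notify(const char *event);

#ifdef __cplusplus
}
#endif

/* Stand-in definition so that this sketch compiles as plain C. */
static int notify_count = 0;

int display_notify(const char *event)
{
    assert(event != NULL);
    return ++notify_count;
}
```

The guard keeps the C++ compiler from mangling the function name, so the C code can link against it. The final program usually still has to be linked with the C++ compiler so that the C++ runtime is pulled in.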

try to do an interface with Qt - and thus tackle the C/C++ binding
start to think about the display:
- what should be displayed
- how should it be displayed

The idea of the project is therefore still to focus on flagging up common errors, not to develop a Swiss Army knife for MPI.
The basic errors listed so far during the meeting were:

broadcast on a single node
synchronized send with no matching receive
data problems

The next meeting will be Monday 21st of February