
Friday, 20 May 2011

Profiler-Interface communication

Remember


As a reminder to the reader: the current code relies on a client-server communication. The server part - the MPI profiler - sends information to the client - the interface - through an MPI interconnection established with MPI_Open_port.


In order to do so a protocol has to be defined for the messages.


Client-Server organisation


Connection



The Profiler-Interface organisation (4 MPI processes)

The connection between the profiler and the interface is done via the MPI_Open_port function, which opens a port and returns its address. Each MPI process of the profiler publishes its own port, and the interface therefore has to connect to every single one of them. That actually means there are n servers and 1 client, which is unusual for a client-server model, where a single server normally delivers information to several clients. Nonetheless the profiler processes are the servers, as they are the ones that publish an accessible address.
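
For the reader who prefers code, here is a minimal sketch of that handshake (the real code appears in the older post further down; the function names here are only illustrative):

#include <mpi.h>
#include <stdio.h>

/* Profiler side: every rank opens its own port and waits for the interface. */
static MPI_Comm accept_interface(int world_rank)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm intercomm;

    MPI_Open_port(MPI_INFO_NULL, port_name);    /* the implementation picks the address */
    printf("rank %d listening on '%s'\n", world_rank, port_name);

    /* MPI_COMM_SELF: each rank accepts on its own, hence the n servers */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    return intercomm;
}

/* Interface side: connect to one published port (called once per profiler rank). */
static MPI_Comm connect_to_profiler(const char* port_name)
{
    MPI_Comm profiler;
    MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &profiler);
    return profiler;
}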


With OpenMPI


OpenMPI provides an address that looks like:


117112832.0;tcp://192.168.1.71:36441+117112833.0;tcp://192.168.1.71:42986:300


An early attempt to guess the port was unsuccessful: some parts may change from one process to another with no logic other than resource availability (the port number, for example).


But a problem appeared when connecting on the interface side. As the previous diagram shows, several profiler processes connect to a single interface process. What is not shown is that the interface in fact runs 1 thread per profiler to deal with the communication, and the data is then centralised in a single GUI. This is a typical hybrid MPI programming approach. The interface therefore has to initialise the MPI environment with MPI_Init_thread (rather than MPI_Init) and request the MPI_THREAD_MULTIPLE level of support. By default OpenMPI doesn't provide such support.


The solution is rather simple: recompile OpenMPI with threading support:

./configure --enable-mpi-threads
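
Independently of the implementation, the interface can check at start-up that the requested threading level is actually available. A minimal sketch of that check (not taken from the project code):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    int provided;

    /* each watcher thread will make its own MPI calls, so full support is needed */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if ( provided < MPI_THREAD_MULTIPLE )
    {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not supported (got level %d)\n", provided);
        MPI_Finalize();
        return EXIT_FAILURE;
    }

    /* ... start one communication thread per profiler process ... */

    MPI_Finalize();
    return EXIT_SUCCESS;
}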


With MPICH-2


The Ness machine provided an already-installed MPICH-2 implementation. For some reason it didn't support dynamic linking, but static linking works fine. For this implementation the port string looks like:

tag=0 port=52970 description=ness.epcc.ed.ac.uk ifname=129.215.175.1

That is radically different from the OpenMPI one, showing once more that guessing the port isn't a workable idea.

As the MPICH-2 natively installed on Ness already supports multi-threading, the corresponding configuration option isn't known yet.


Retrieving the port


The profiler opens and publishes a port per process. The user then has to read these ports and give them as input to the interface. In order to give as much freedom as possible to the user, several ways of doing so are available:

  • Printing the port to the standard output stream
  • Printing the port to the standard error output stream
  • Writing the ports into a defined file
The choice is made by giving information to the profiler when MPI_Init is called in the MPI code; this can be done simply by providing command line arguments when calling mpiexec. The available arguments can be retrieved with (ring is the executable name):

$> ./ring --help
Profiler of an MPI program\nUse a MPI visualisation GUI to see information

Possible options:
--port-in-stdout [default]
   write the port into the standard output
--port-in-stderr
   write the port into the standard error output
--port-in-file file
   write the port into the file using MPI-I/O

Note that only the last given option is used

--help
   display that help

To use the file writing functionality, simply start your program like:
$> mpiexec -n 4 ring --port-in-file port.txt

Note: so far, adding the options manually as a 2D array of char doesn't work, and no further investigation has been made to make it work.
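
For illustration, the option scanning inside the wrapped MPI_Init might look like the following sketch (parse_port_option is a hypothetical helper; the PortType values come from child_comm.h, shown further down):

#include <string.h>

typedef enum PortType { STDOUT, STDERR, INFILE } PortType;   /* as in child_comm.h */

static PortType parse_port_option(int argc, char* argv[], char** file)
{
    PortType port = STDOUT;          /* --port-in-stdout is the default */
    int i;
    *file = NULL;

    for ( i = 1; i < argc; i++ )
    {
        if ( strcmp(argv[i], "--port-in-stdout") == 0 )
            port = STDOUT;
        else if ( strcmp(argv[i], "--port-in-stderr") == 0 )
            port = STDERR;
        else if ( strcmp(argv[i], "--port-in-file") == 0 && i+1 < argc )
        {
            port = INFILE;
            *file = argv[++i];       /* the file name follows the option */
        }
    }

    return port;                     /* only the last given option wins */
}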


Writing each process' port in a single file


In order to write each process' port into a single file, the MPI I/O functions are used. The standard defines several ways of doing so; in this case a simple subarray type is defined, with the size of the port string as the base length. Since the data written is characters and each port string ends with a newline, each port occupies its own line in the file. The interface can therefore read the file line by line to find every port and deduce the number of started processes.


Extract of child_comm.c
if ( port == INFILE )
{
    MPI_Datatype subarray;
    MPI_File file_ptr;
    int smallarray, bigarray, stride;

    /* one slice of (strlen+1) characters per process: the port string + '\n' */
    smallarray = strlen(port_name) + 1;
    bigarray   = world_size * smallarray;
    stride     = world_rank * smallarray;

    fprintf(stderr, "!profiler(%d)! will write his port in '%s'\n", world_rank, file);

    /* each process only sees its own slice of the file */
    MPI_Type_create_subarray(1, &bigarray, &smallarray, &stride, MPI_ORDER_C, INTRA_MESSAGE_MPITYPE, &subarray);
    MPI_Type_commit(&subarray);

    if ( MPI_File_open(MPI_COMM_WORLD, file, MPI_MODE_WRONLY|MPI_MODE_CREATE, MPI_INFO_NULL, &file_ptr) != MPI_SUCCESS )
    {
        fprintf(stderr, "!profiler(%d)! failed to open file '%s'. ABORTING\n", world_rank, file);
        MPI_Abort(MPI_COMM_WORLD, -1);
    }

    if ( MPI_File_set_view(file_ptr, 0, INTRA_MESSAGE_MPITYPE, subarray, "native", MPI_INFO_NULL) != MPI_SUCCESS )
    {
        fprintf(stderr, "!profiler(%d)! failed to set the file view! ABORTING\n", world_rank);
        MPI_Abort(MPI_COMM_WORLD, -1);
    }

    /* append '\n' so that each port ends up on its own line */
    if ( MPI_File_write_all(file_ptr, strcat(port_name, "\n"), smallarray, INTRA_MESSAGE_MPITYPE, MPI_STATUS_IGNORE) != MPI_SUCCESS )
    {
        fprintf(stderr, "!profiler(%d)! failed to write '%s'. ABORTING\n", world_rank, file);
        MPI_Abort(MPI_COMM_WORLD, -1);
    }

    MPI_File_close(&file_ptr);
}
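
On the reading side the real interface is written in C++/Qt, but the logic boils down to something like this sketch in C (names are illustrative): one fgets per port, and the number of lines gives the number of profiler processes.

#include <stdio.h>
#include <string.h>

#define MAX_PROCS 256
#define MAX_PORT  256                           /* larger than any MPI port string */

/* Fills ports[] with one entry per line and returns the number of ports found. */
static int read_ports(const char* filename, char ports[][MAX_PORT])
{
    FILE* f = fopen(filename, "r");
    int n = 0;

    if ( f == NULL )
        return -1;

    while ( n < MAX_PROCS && fgets(ports[n], MAX_PORT, f) != NULL )
    {
        ports[n][strcspn(ports[n], "\n")] = '\0';   /* strip the trailing newline */
        n++;
    }

    fclose(f);
    return n;
}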

Communication


The profiler side


As far as the profiler is concerned, the communication with the interface could be either synchronous or asynchronous. The current implementation uses MPI_Ssend as a simple choice, but a later version could use asynchronous calls and wait for one to complete before issuing the next, or even handle a list of pending requests to wait for.

The profiler uses internal functions, declared in child_comm.h, to communicate with the interface.


child_comm.h
#ifndef CHILDCOMM
#define CHILDCOMM

#include "intra_comm.h"

extern int world_rank;
extern double global_time;

typedef enum PortType { STDOUT, STDERR, INFILE } PortType;

int start_child(int world_size, PortType port_type, char* file);
int alive_child();
int sendto_child(Intra_message* message);
int wait_child(double time_in);

#endif // CHILDCOMM

intra_comm.h
#ifndef INTRA_COMM
#define INTRA_COMM

#define INTRA_MESSAGE_SIZE 64
typedef char Intra_message;

#define INTERCOMM_TAG 0

#define PROFNAME "!profiler!"

#ifdef __cplusplus
#define INTRA_MESSAGE_MPITYPE MPI::CHAR
#else
#define INTRA_MESSAGE_MPITYPE MPI_CHAR
#endif

/*
 * ACTIONS
 */

typedef enum Message { MESSAGE_INIT,
               MESSAGE_Ssend,
               MESSAGE_Bsend,
               MESSAGE_Issend,
               MESSAGE_Recv,
               MESSAGE_Irecv,
               MESSAGE_Wait,
               MESSAGE_QUIT } Message;

#endif // INTRA_COMM

The functions' names are explicit, and the intra_comm.h header defines the actual protocol information: it is therefore used by both the profiler and the interface. The actual sending is done with character strings, renamed Intra_message. As the interface is coded in C++, INTRA_MESSAGE_MPITYPE is defined using both the C and C++ MPI standard definitions.
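
As an illustration, a plausible minimal body for sendto_child() could look like this (a sketch only, not the actual project file; intercomm, CHILD_RANK, SUCCESS and FAILURE are assumed to be the same ones used in the code extracts of the older post below):

#include <mpi.h>
#include "child_comm.h"

/* intercomm, CHILD_RANK, SUCCESS, FAILURE: as in child_comm.c (assumed) */
int sendto_child(Intra_message* message)
{
    if ( intercomm == MPI_COMM_NULL )
        return FAILURE;

    /* PMPI_ entry point so that the profiler does not profile itself */
    if ( PMPI_Ssend(message, INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE,
                    CHILD_RANK, INTERCOMM_TAG, intercomm) != MPI_SUCCESS )
        return FAILURE;

    return SUCCESS;
}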


The message is composed of several fields, all separated by a space, and always starts with the main fields:

  • action::enum Message the occurring action
  • time in::double the Unix time when entering the MPI function
  • time out::double the Unix time when returning the MPI function
Each Message then adds its own information after the main fields. For example an MPI_Ssend also encapsulates:
  • communicator::unsigned int the communicator unique number - not implemented yet
  • destination::int the destination process
More information can be added as needed: each MPI function defines its own optional fields in its own call to sendto_child().

The information is written using standard C I/O calls:

sprintf(message, "%d %lf %lf %d", MESSAGE_Ssend, time_in, time_out, dest);
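
Putting the pieces together, a wrapped MPI_Ssend could build and ship such a message roughly as follows. This is a hedged sketch (MPI-2 era signature, hypothetical unix_time helper); the actual profiler may take its timestamps differently.

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>
#include "intra_comm.h"
#include "child_comm.h"

/* Unix time in seconds, as a double (hypothetical helper). */
static double unix_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int MPI_Ssend(void* buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm)
{
    Intra_message message[INTRA_MESSAGE_SIZE];
    double time_in, time_out;
    int err;

    time_in  = unix_time();                       /* entering the MPI function */
    err      = PMPI_Ssend(buf, count, datatype, dest, tag, comm);
    time_out = unix_time();                       /* returning from it */

    /* main fields first, then the field specific to MESSAGE_Ssend */
    sprintf(message, "%d %lf %lf %d", MESSAGE_Ssend, time_in, time_out, dest);
    sendto_child(message);

    return err;
}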


The interface side



Starting message box of the interface (GNU/Linux Gnome 3)

On the interface side the profilers' ports can be given either manually or by reading the file written as explained before. Once this is done, one thread per profiler process is started, whose duty is to communicate with that profiler (the corresponding object is therefore called MPIWatch). An MPIWatch object is only responsible for receiving information from (and sending information to) the profiler, so each of them is attached to a Monitor object that is responsible for analysing the messages. To communicate, the MPIWatch pushes arriving messages onto a stack and signals to the Monitor that new messages are available. The Monitor then analyses the messages and displays the information in the appropriate places.


The MPIWatch - Monitor pair was designed for logical separation:

  1. Only the MPIWatch is actually aware of the MPI functions needed to send and receive information to and from the profiler. If another system is used in the future, only this class has to be changed.
  2. Only the MPIWatch needs a separate thread; dealing with the message contents is done on the main thread.
  3. Only the Monitor knows what a message contains. New protocol functionalities do not affect the way data is transferred between profiler and interface.
  4. Only the Monitor knows about the GUI, whose "windows" are shared among the several monitors.


As the interface is implemented in C++, the standard stream library is used to decapsulate the messages. The main fields are extracted from each message, and then, according to the message action, each additional field is read.


Extract of monitor.cpp
QString m = watcher->pop_pool();
std::istringstream stream(m.toStdString());
int message;
double time_in, time_out;

stream >> message >> time_in >> time_out;

switch(message)
{
        /* ... */

    case MESSAGE_Bsend:
        // adds to call counts
        statWidget->addTo(proc, N_Bsend); 
        // add time info
        statWidget->addTo(proc, T_Bsend, time_out-time_in); 
        break;

        /* ... */
}

Conclusion


The Profiler-Interface communication works on two levels. The first one is the actual communication, done through MPI. This requires a port opening and publishing mechanism, and the user has to pass the resulting ports as input to the interface.

But the communication is also about what information is sent. This information is generated by each overloaded MPI function and analysed on the interface side by a Monitor object.

Decoupling the communication into these two levels abstracts the act of sending the information from the act of analysing it.

Sunday, 20 February 2011

Using MPI sockets

This article presents how to use MPI to create a remote socket and use it through MPI calls. Remember that two parts communicate: the profiler - the library part that uses the MPI profiling interface to profile the program - and the display - which displays the information sent by the profiler.

First of all, some research was done to find out how to create a socket with MPI on the profiler side and communicate with some other socket library on the display side. So far no example of that approach has been found, and as this is a technical test, no real implementation was attempted that way.

The approach used here is to bind the profiler and display communicators using a technique similar to MPI_Comm_spawn, but which doesn't require the 2 programs to be tied together. This is done using the MPI_Open_port function.

The code wasn't modified much from the MPI_Comm_spawn approach, as you are going to see. The reference used to understand and develop this approach was the MPI standard website: 5.4.6. Client/Server Examples.


The profiler side - server side


The general idea of this approach is for the profiler to open a port and wait for some display to connect to it. The idea can be pushed further, if needed, to allow several displays to connect to a single profiler (for example to share the view of the program across several displays).

What was actually modified from the spawn example is the way the profiler and the display are connected. Rather than calling MPI_Comm_spawn, MPI_Open_port is used, and a few lines were added just before finalizing the execution.


Opening the port

int start_child(char* command, char* argv[])
{
    MPI_Open_port(MPI_INFO_NULL, port_name);

    /* child doesn't find it...
    sprintf(published, "%s-%d", PROFNAME, world_rank);
    MPI_Publish_name(published, MPI_INFO_NULL, port_name); */

    fprintf(stderr, "!profiler!(%d) open port '%s'\n", world_rank, port_name);
    fprintf(stderr, "!profiler!(%d) waiting for a child...\n", world_rank);

    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);

    fprintf(stderr, "!profiler!(%d) got a child!\n", world_rank);

    int r;
    MPI_Comm_rank(intercomm, &r);
    fprintf(stderr, "!profiler!(%d) is %d on parent!\n", world_rank, r);

    // wait for a message that "I'm dying"
    if ( PMPI_Irecv(&(quitmessage[0]), INTRA_MESSAGE_SIZE, INTRA_MESSAGE_MPITYPE, CHILD_RANK, INTERCOMM_TAG, intercomm, &dead_child) != MPI_SUCCESS )
    {
        fprintf(stderr, "!profiler!(%d) communication failed!\n", world_rank);
        intercomm = MPI_COMM_NULL;
        return FAILURE;
    }

    char mess[INTRA_MESSAGE_SIZE];
    sprintf(mess, "%d IsYourFather", world_rank);

    sendto_child(mess);

    PMPI_Barrier(MPI_COMM_WORLD);

    return SUCCESS;
}

Finalizing the communication

int wait_child(char* mess)
{
    // send my own death notice
    if ( sendto_child(mess) == SUCCESS )
    {
        // wait for the child's death
        if ( PMPI_Wait(&dead_child, MPI_STATUS_IGNORE) == MPI_SUCCESS )
        {
            fprintf(stderr, "!profiler!(%d) received its child death!\n", world_rank);
            //MPI_Unpublish_name(published, MPI_INFO_NULL, port_name);
            MPI_Close_port(port_name);
            return SUCCESS;
        }
    }

    return FAILURE;
}

The display side - the client side


On the display side, the same kind of modification had to be done. Rather than using information from the parent's communicator, a connection to a port is performed.


The MPIWatch::getWatcher method

MPIWatch* MPIWatch::getWatcher(char port_name[])
{
    if ( instance == 0 )
    {
        MPI::Init();

        std::cout << "Try to connect to " << port_name << std::endl;

        parent = MPI::COMM_WORLD.Connect(port_name, MPI::INFO_NULL, 0);

        if ( parent == MPI::COMM_NULL )
        {
            std::cerr << "Cannot connect with the parent program! Aborting." << std::endl;
            MPI::Finalize();
            return 0;
        }

        std::cout << "Connection with parent completed!" << std::endl;

        instance = new MPIWatch();
    }

    return instance;
}

Running it!

The main difference is that in the previous version the display was started automatically. Now it has to be started separately, and in fact one per MPI process. Some attempts were made to use the name publication described in the standard (see the reference further up), but for an unknown reason the display part never found the name published by the profiler. So far, 1 port is opened per MPI process - or 1 name is published - and each display connects to one of them through command line input.

Console 1: run MPI

$> mpiexec -n 2 mpi_ring
!profiler!(0) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'
!profiler!(1) open port '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

Console 2-3: run the display

$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.0;tcp://192.168.0.2:36965:300'
$> mpidisplay '3449421824.0;tcp://192.168.0.2:48251+3449421825.1;tcp://192.168.0.2:52304:300'

The current implementation is a little more complicated to run than the spawn version, but it doesn't end with an error code when finishing. It also allows more flexibility in the future: more than one display on a single profiler, or any other idea that requires a more flexible approach than a spawned process (like being able to connect a display in the middle of a run and disconnect at will, to see whether the program is deadlocked, etc.).

Limitations

The port information is rather long, and it is not user friendly to have to look up the profiler output and copy/paste the port information into the display. Further investigation has to be made on that part, either to track down the name publication problem or to find a more automatic way of looking for the port. The original name publication idea was to publish a name like 'profiler-mpirank' to look up - or any string given by the user instead of profiler. This would allow the display to be started with a single command that only needs 2 pieces of information: the base name of the profiler and the number of MPI processes to connect to!
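
For reference, the intended name-publication scheme would look roughly like the sketch below. It is only a sketch, since this is precisely the part that does not work yet, and 'profiler' is the planned default base name.

#include <mpi.h>
#include <stdio.h>

/* Profiler side: publish "profiler-<rank>" so the display can look it up. */
static void publish_port(int world_rank, char port_name[])
{
    char service[64];

    sprintf(service, "profiler-%d", world_rank);
    MPI_Open_port(MPI_INFO_NULL, port_name);
    MPI_Publish_name(service, MPI_INFO_NULL, port_name);
}

/* Display side: only the base name and a rank are needed to recover the port. */
static void lookup_port(const char* base, int rank, char port_name[])
{
    char service[64];

    sprintf(service, "%s-%d", base, rank);
    MPI_Lookup_name(service, MPI_INFO_NULL, port_name);
}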

The other limitation is not a real one, but more of a bug in the current implementation. A barrier was added to wait for every MPI process to get a display, which isn't much of a problem, as high performance isn't required for this project. The problem arises when a display is closed while the program is running: the current implementation doesn't catch it and deadlocks. Further investigation will obviously be done on that problem later on.

Source code

As for the previous version, the source code is available at http://www.megaupload.com/?d=ZXJGHBPQ. It is a test version, not very clean, and buggy (as explained above). A later post will describe how to use the library with an MPI code in C.

Further work

The preliminary technical overview of the project is about to be over. Now that the basic techniques of the project are set up, more detailed reasoning will be done on the project's functional requirements. As part of the Project Preparation course of the MSc, a risk analysis and a workplan for the overall project also have to be done, and they will be published here as well.