Multi your Threads #2: The Scientific Method

So this is the second post in a series where I talk about various ways of parallelising code. If you haven’t checked out the first part you might want to do so, as I both explain the underlying a project a little bit and talk about parallelising using C++ STL

If you’ve done so or don’t really care, well, let’s get to it!

Also: All the code here is from ToyBrot. If you prefer to see the whole picture or browse on your own as you read, feel free to clone the repo and open it up on your favourite code editor (maybe on a second screen if you have one). I personally am a big fan of Atom

Before I start diving into bringing your code to the GPU, I want to go over two additional libraries/systems/technologies which are both, coincidentally, very common in scientific programming.

One of them, OpenMP, works as a way of managing concurrent code mainly for the CPU (though some GPU and FPGA implementations exist and it’s part of the specification in versions starting from 4.0, published in 2013). The other, MPI, is very much a “special case” here as it’s meant to distribute not among many cores, but among many computers.

Though both of these are at home and very present in the world of High Performance Computing (as in coding for actual super computers), they’re not at all confined to it and could be useful outside it, as we’ll see.

Lexical interlude: Standards, implementations, libraries...

Before we head on, it’s good to take the time to clarify some of the lexicon around this, a bit because it’ll become a bit confusing later on.

There are a few different types of “things” that come into play here. The C++ STL, for example, is the Standard Templates Library. It being a library, it involves a binary you need to link into your code. For this one, if you’re on Windows, it is the Microsoft Visual C++ Runtime, normally, the thing that gets installed when you install the Visual Studio redistributable packages. Whereas on Linux, it’ll likely either be libstdc++.so (“the gcc one”) or libc++.so (the one built with clang and llvm).

So… that’s actually 3 different libraries right there, what gives?

The thing is, C++ has a specification which I’ll generally use more or less interchangeably with “a standard“. Every time we talk about a standard, we mean there’s a company/committee/organisation/consortium that defines essentially a programming interface. They say “look, this thing will have such and such class that has such and such function that works like this”. But they don’t really, at least not necessarily, actually code all of that. In the case of C++, those 3 libraries are different implementations by different teams but they are all (well, they should be at least) all implementations of the same standard. Which ideally means you can get your C++ code and link it against any of those and the same code should work and do what you expect.

Those implementations can vary a lot as we’ll see particularly when we look at some of the GPU stuff

OpenMP: Take me away, preprocessor!

OpenMP is a standard, first published, according to wikipedia, in 1997. If you recall from the last post, before C++11, the STL didn’t really have anything on this front. Additionally, OpenMP is a standard not only for C++, but also for C and Fortran, so it’s pretty powerful and became quite popular for a lot of good reasons.

So, from our perspective, what does an OpenMP program look like? (full file here)

// tasks.cpp

bool FracGen::Generate(uint32_t* v, SDL_PixelFormat* format,
                        int width, int height, Region r)
{
    if(format == nullptr)
    {
        return false;
    }
    static std::atomic<int> h {0};
    bool finishedGeneration = false;
    auto genFrac = [](uint32_t* data, 
                      int width, 
                      int height, 
                      Region r, 
                      SDL_PixelFormat format, 
                      int h0, size_t idx)
                    {
                      return calculateIterations(data,width,
                          height,r,format,h0,idx);
                    };
    std::vector< std::future<bool>> tasks(numTasks);
    for(unsigned int i = 0; i < numTasks; i++)
    {
        tasks[i] = std::async(std::launch::async, 
                              genFrac,
                              v, width, height, r, 
                              *format, h.load(), i);
    }
    h+= 10;
    for(unsigned int i = 0; i < tasks.size(); i++)
    {
        //Block until all tasks are finished
        if(tasks[i].get())
        {
            //If one of them is done, they all must be
            h.store(0);
            finishedGeneration = true;
        }
    }
    return finishedGeneration;
}

// openmp.cpp

#include "FracGen.hpp"
#include <iostream>
#include <cfloat>
#include <atomic>
#include <omp.h>
...
bool calculateIterations(uint32_t* data,
                         int width, int height,
                         Region r,
                         SDL_PixelFormat format, int h0)
{
...
    for(int h = h0; h < h0+10; h++)
    {
        if (h >= height)
        {
            return true;
        }
        //For OpenMP, somewhat like for GPUs we're assigned 
        // a thread ID automagically
        for(int w = 0 + omp_get_thread_num(); 
            w < width; w+= omp_get_num_threads() )
        {
            ...
        }
    }
    return false;
}
bool FracGen::Generate(uint32_t* v, SDL_PixelFormat* format, 
                       int width, int height, Region r   )
{
  if(format == nullptr)
  {
    return false;
  }
  static std::atomic<int> h {0};
  bool finishedGeneration = false;
  std::vector<bool> results(omp_get_num_procs());
  #pragma omp parallel for
  for(unsigned int i = 0; i < omp_get_num_procs(); i++)
  {
    results[i]=calculateIterations(v, width, height, 
                                    r, *format, h);
  }
  h+= 10;
  for(unsigned int i = 0; i < results.size(); i++)
  {
    if(results[i])
    {
      h.store(0);
      finishedGeneration = true;
    }
  }
  return finishedGeneration;
}
...

From a coding perspective the first thing is we now have an additional include, omp.h. That header will have what you need to use all the functionality you need. OpenMP implementations are shipped with basically every major compiler these days, and it’s integrated enough that it’s usually enabled through a compiler flag, but you need to explicitly do so.

In my case, CMake handles it with built-in modules:

find_package(OpenMP)
target_link_libraries(${PROJECT_NAME} SDL2::SDL2 OpenMP::OpenMP_CXX)

Of note, it provides different targets for different languages too.

Back in the coding side, a second thing that is easy to spot and I point out in a comment is that each work unit knows its own “ID”. I call omp_get_thread_num() and omp_get_num_threads() and those numbers magically show up at runtime. It’s something pretty nifty and good convenience over the STL implementations because more housekeeping code gets abstracted away from you, our work function has two parameters less now. This is quite similar to how most GPU programming standards do things so it’s a step closer to code that will/can be ported to heterogeneous computing, so much so that, as mentioned, OpenMP actually contemplates being run on GPUs/FPGAs

Most of the magic, though, will happen in the Generate function which is much simpler.

std::vector<bool> results(omp_get_num_procs());

#pragma omp parallel for  
for(unsigned int i = 0; i < omp_get_num_procs(); i++)  
{  
    results[i]=calculateIterations(v, width, height, r, *format, h);

So… before, for our standard implementation, our results were a special type, std::future and we had to both manage our tasks manually, assigning them IDs and launching them asynchronously; Additionally, when we wanted the result, we had to have some kind of barrier that made sure those tasks were finished (get() a future or join() a thread). But now, we just straight up read the bools straight as if this was just the one thread. And all this works and the magic happens through that preprocessor pragma. It triggers some additional work on the very next command if it’s a loop. By default, OpenMP will spawn a thread for each of your hardware threads and toss work its way. Though the snippet is larger as I’m showing bits of both functions, OpenMP is actually the simplest and shortest implementation. Because it abstracts so much.

This is not without its drawbacks, of course, because things are abstracted it can be harder to understand the deeper workings of your code IF you happen to need it, but for a lot of tasks, you actually don’t care too much. It’s a bit as to when I talked about managing a thread pool yourself instead of relying on std::async, you CAN but you are unlikely to ACTUALLY need to do it and the idea that you do could be an indication you’re trying to over-engineer your problem a lot of the time (something no programmer anywhere has ever done).

Now, this is where toyBrot fails a bit in showing things because this is enough and there’s not really much more that makes sense for me to here in terms of “fancying this up”. But OpenMP DOES have more granular control over threads and work groups if you need or want to make use of and you can use it to implement some quite elaborate threading logic if you need to tackle a complex project.

The Verdict

Edit note here:
For a while, OpenMP looked like it had not super stellar performance. By the time I wrote chapter #7 I figured this was cause by the lingering and unnecessary atomic for h0. Once I got rid of it, it’s pretty much on par with std::threads and std::async

Can’t have one without seeing how it actually performs, eh? And…

The standard

The challenger

So… to my genuine surprise, the OpenMP implementation was about 50% slower than the std::threads implementation. I expected them to be more or less the same and even double checked the STL results but… nope, this seems to be it. It’s even slower than the STL implementations without the “overthreading”, which I DID try but then OpenMP just cried a lot and the average time was over double as you can see in the bottom run (I even tried to abort if halfway).

Both implementations seem to suffer the same increase in execution time over iterations that I’ve seen in the STL implementations. I still have no idea what this is.

So, for a project this simple, seems like the extra work of implementing with the STL is worth it (at the very least on a machine with as many hardware threads as I’ve got).

I’ve also heard stories of difference in performance between GCC and CLANG when it comes to OpenMP. I’ve built with both and also retested STL but both compilers produces pretty much identically performing code, for at least this so I’ve ruled this out.

The tradeoff for this performance loss is in ease of coding though, and if you DO happen to need “fancy threading logic” I expect OpenMP to pull further ahead in this area, as it IS a much more complete standard. For a project as simple as this, I’d say the question is whether this difference in performance is important. Imagining toyBrot is production software, 400 milliseconds on the initial execution is, in absolute terms, not too important, even if it IS a 30% increase in wait for the first fractal (which is the non-benchmark case), this is still one and a half seconds. The comparison would need validation on a target machine, if a threadripper is not representative, but just the cleaner code might be worth it.

Which is a long winded way of saying “performance is often crucial but not always (within reason)” and “don’t underestimate the importance of legible and neat code. Especially in production environments”. As much as the performance disappoints me, it’s not enough to get me to throw this baby out with the bath water.

A final strong point of OpenMP is that though it IS its own standard and library, it’s almost as ubiquitous as raw STL and this will be portable, so this is another implementation you can a lot of the time “code and forget” even if you deploy to several different platforms.

MPI: When one computer just isn't enough

MPI stands for Message Passing Interface and, more than just a standard, it is a communication protocol. Rather than a way of parallelising your code so it runs faster on your computer, it’s goal is being a way in which several computers can come together and work on the same issue. It is made for computing clusters and, as such, it’s most often seen in actual supercomputers.

MPI has two main concurrent implementations, OpenMPI and MPICH , though they’re not the only ones (Microsoft, for example, has their own implementation for Windows, as alien as Windows in High Performance Computing environment sounds). Curiously enough, MPI itself doesn’t really have a governing body over its standard, however somehow most implementations seem to more or less adhere to specifications.

MPI is worth mentioning (and since I had to study it for work at one point, it’s here and I’m not letting it pass unnoticed =P) but it is very different from the other technologies I have and will mention, as such, feel free to see this as a bit of a tangent.

Being made with clusters in mind, MPI has a much higher reliance on underlying infrastructure. This will usually include network setup and some sort of work manager, a popular one being slurm. The idea is that the work manager (among other things) will fire up the same job in several computers and, then, much like we do with individual threads, it lets them know “Hey, you’re job X of a total of Y”. And that also means that you may end up firing the same job multiple times per machine; Maybe you are relying just on MPI for your parallelisation and, then, you just set up your cluster so that each computer reports it can take a number of simultaneous jobs equal to the amount of threads it has, or maybe you have one job per available GPU…

Being made to distribute between computers, MPI is, more often than not, used in conjunction with other methods. Maybe you have an application that uses OpenMP for a bunch of stuff and offloads a bunch of work to a GPU with OpenCL, and then you throw MPI in the mix to have this run on a cluster of machines, which might not even have the same configuration….

The communications protocol part comes in there, MPI gets these computers talking over a network. Using MPI they shift messages and data around each other and sync over their work.

So let’s take a look at how does this look like and, before I do, a couple notes:

I won’t do a side-to-side with other implementations as this is the only one that is a different structure. The cluster I had access to was eventually torn down; I thought about setting one with VMs just to faff about with but never got around to doing it, so I’ve been without a good testing environment for this a while, which prevented me from doing meaningful work on this front
Though I’ve written some code for an interactive version of ToyBrot with MPI and SDL, I’ve never got around to test it in a cluster, so what I’ll be talking about will be the “fire and forget” one which, instead, uses libPNG to output some ridiculously sized image. rectangular sections of the image are calculated by different nodes and sent over to the root through the network

Those caveats in mind, let’s have a look:

//main.cpp

int main (int argc, char** argv) noexcept
{
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
    if(nprocs == 0 )
    {
        myrank = 0;
        nprocs = 1;
    }
    uint32_t height = 1920 * 6;
    uint32_t width  = 4096 * 6;
    pngData pngRows;
    pngData workRows;
    pngWriter writer(0,0);
    //root process behaviour
    if(myrank == 0)
    {
        writer.setHeight(height);
        writer.setWidth(width);
        writer.Init();
        writer.Alloc(pngRows);
    }
    
    workRows.resize(width*(height/nprocs) );
    region reg {-1.5l,1.5l,-2,1};
    MPI_Bcast(  reinterpret_cast<void*>(&reg),
                4,
                MPI_LONG_DOUBLE,
                0,
                MPI_COMM_WORLD
                );
    int res = 0;
    region myReg = defineRegion(reg, myrank, nprocs);
    spawnTasks(myReg, width, height/nprocs, 
                myrank, nprocs, workRows);
    MPI_Gather(reinterpret_cast<void*>(workRows.data()),
               static_cast<int>(workRows.size()*3),
               MPI_UINT16_T,
               reinterpret_cast<void*>(pngRows.data()),
               static_cast<int>(workRows.size()*3),
               MPI_UINT16_T,
               0,
               MPI_COMM_WORLD
               );
    if(myrank == 0)
    {
        writer.Write(pngRows);
    }
    
    MPI_Finalize();
    
    return res;
}

So… this code is a bit different, so let’s start unpacking it.

It starts with some boilerplate initialisation code. This starts the MPI system and, also asks it “how many processes”, “what is my Rank (your process ID)?”, with a default fallback that enables the code to run even in a single regular computer with a regular call. The way to run MPI application in the actual clusters is to call a specialised launcher (mpiexec for OpenMPI) and pass the number of tasks as argument.

Following that we start initialising some stuff and we get to the first interesting bit: in line 23 of the snippet, we initiallise the pngWriter but ONLY if we’re “task 0”). This starts to expose some of how this works… everyone is running the same code, but some nodes might be doing things differently from others, they read the same, but do different… to see this in action, we skip a couple of lines and:

MPI_Bcast( reinterpret_cast<void*>(&reg),
            4,
            MPI_LONG_DOUBLE,
            0,
            MPI_COMM_WORLD
           );

Our first “proper MPI call”

This is a call to “MPI Broadcast” which means “a node needs to copy some data over to other nodes”.

The arguments are, in order:

A pointer to the data which will be copied
How many values are being copied
What is the type of these values
Who’s actually sending the values
Which workgroup is doing this operation

MPI gives you the ability to have different workgroups… MPI_COMM_WORLD is the default “everyone” group. the data management is pretty self explanatory, but the argument that specifies the sending node goes back to that notion of “same call, different behaviour”. In this case, if you’re process 0, you’re the one sending, and everyone else will overwrite the contents of their own reg with the values you’ve got. Implementations of MPI can, and will, make sure this is an efficient copy too (nodes can propagate the new value among each others once they’ve received it) and this happens across your network.

After that preparation begins the work. Each process uses the knowledge of their own rank and number of processes to figure out the specific area they need to work and then spawn as many processes as they themselves need, forwarding their rank and total process count so we can determine what section to calculate.

After that, there is a different call, this time, it’s MPI_Gather, which is more interesting than Broadcast. Looking at it you can see that it’s pretty much the same structure as MPI_Bcast, except we now specify two different arrays:

    MPI_Gather(reinterpret_cast<void*>(workRows.data()),
               static_cast<int>(workRows.size()*3),
               MPI_UINT16_T,
               reinterpret_cast<void*>(pngRows.data()),
               static_cast<int>(workRows.size()*3),
               MPI_UINT16_T,
               0,
               MPI_COMM_WORLD
               );

Again this is a copy and again Rank 0 is special, but Gather means that we’re assembling an array in Rank 0 from pieces we’re gathering from everyone in the communications groups, INCLUDING Rank 0. Everyone sends their workRows array and Rank 0 writes all down in its pngRows. One after the other, first the data from 0, then from 1…

And at the end, everyone exits and Rank 0 has the final task of writing the full image to disk.

So what's it like?

Coding MPI itself is not really difficult, in the sense that the calls are easy to understand and the documentation for it is not bad. The difficulty here is mainly twofold:

First, it’s a pretty specialised thing so you’re unlikely to “bump into” it outside of an HPC environment. If you want to stretch it for real you need to sort out some sort of cluster and that’s some annoying overhead to play with a new technology. Even though you CAN just run multiple processes in just your own dev machine, I’ve had issues that I could not reproduce this way even in something as simple as this.

The second one is just wrapping your head around how the code looks and runs in the cluster. When you’re running and writing multi-threaded code, each thread gets it’s own execution pointer and goes off doing stuff. In here it’s the same but the threads are in different devices. The MPI calls are usually blocking and act as sync points. Everyone in the group waits until everyone gets to that call so they can do the talking that’s necessary. Additionally, your code becomes a bit of a chimera in which you have the same set of instructions but different ranks are doing different things but because you need to reach the same calls, they need to go through at least some of the same code.

This second to me is the greatest barrier… it’s just a bit weird to get yourself into this mindspace of “writing multiple program paths at once”.

All that out of the way, this is some pretty awesome stuff, if only because of why and how it’s used. It is the brunt behind many a super weird simulation, and drives a LOT of scientific research. It also opens an entire new door. Moving from just your regular CPU to working with your GPU as well is a massive leap and shift in scope and paradigm. Moving from one computer to a cluster is just a big a leap, if not even bigger.

Though it’s on the back burner, this is still something I’m quite keen on playing with, if only because of how cool I find the whole notion and I’d say that if you’re curious on getting your feet wet, it is quite interesting, especially as you get a glimpse of some of the fancier stuff.

What's next?

So far I’ve covered generally CPU bound stuff. Starting with part 3, I want to start talking about heterogeneous computing, which is smartass for “doing general computation on your GPU as well”.

There are quite a few ones that I’ve already looked into and at least one more I’ve got to cook myself, and I’m planning three more parts, each with two new ones. And at the end, a quick mash-up/recap.

in Part 3, I’ll talk about the “pioneers” of heterogeneous computing. I’m not going to go into “write your data to a buffer, tell your GPU that’s a texture and write a Pixel Shader that’s actually calculating some physics” as cool and hack as that is; Instead I’ll talk about the mainstream GPGPU solutions we’ve had in the past few years and that I assume are the first couple of things that come to mind when you think of this: OpenCL and CUDA.

Part 4 will be an AMD fanboy special where I talk about how Team Red sought to win back the ground that nVidia got in this space with their ROCm initiative.

And Part 5 will be about the emerging independent standards being put forward by Khronos aiming to replace OpenCL.

Some reference and related content

The MPI and OpenMP source files for ToyBrot
Some supercomputers right here in Wellington that make use of some of this stuff

The Great Refactoring