So, you wanted to multi your threading, huh?

Prelude - How did we get here?

It’s been a while since CPUs stopped getting a lot faster. They’re still improving, just in somewhat different ways: they do more with whatever speed they’ve got, but no one’s really holding their breath for a next generation that’ll clock 6GHz without extreme cooling. This has been true for quite a few years now and, instead, CPUs have been doing more at once, amassing processing cores. Not only that, people have increasingly caught on that there are a lot of problems which GPUs are REALLY good at dealing with; as long as the thing you need to do is “simple enough”, they can do batches of thousands at once.

This development did not pass unnoticed. Parallel computing, concurrent execution, multi-threading… whichever way you call it, all sorts of devs have been chasing it. Not only have they been trying to wrap their heads around how to manage a piece of software that’s doing all sorts of things at once, they’ve been trying to figure out how to make that not a nightmare. Hardware vendors have been chasing their own avenues too, with their own frameworks, libraries and specialised hardware. I myself am typing this on my AMD 1920X, a CPU which gives me 24 execution threads, while I daydream over the decadent excess that would be buying its bigger brother, with 64, on a desktop CPU; and mine isn’t even super expensive these days. From fiction, to servers, to now regular desktops, this is a LOT to have a CPU do at once.

On the GPU side, I remember watching pretty excitedly a few years back when AMD, ARM and Qualcomm formed their own “league of evil” and announced the HSA Foundation initiative. Heterogeneous computing, making software run across discrete and different processing units, was a pretty dang hype idea, and it keeps growing; nVidia has recently released their new Turing architecture cards, and much of that overhaul aims to entice not necessarily gamers, but scientists who need to crunch endless amounts of data for Machine Learning models. Not wanting to be left behind, Intel, who plans on releasing discrete GPUs starting next year (not that we haven’t heard this story before), is placing a lot of their chips on supporting open heterogeneous computing standards and winning the public over that way. Computing on the GPU is here to stay.

How did *I* get here?!?

Making software that leverages all of this is complicated, though. Writing code that deals with all these things happening at once, without having them trip over each other, is quite different in some ways. Once trivial problems now become intersections of concurrent tasks, all reaching for their own results. Not only do you need to get to grips with how to picture and organise this, there’s also the question of choosing which technology will suit your needs: which library, which language, which paradigm… Do you “taint” your code base with additional dependencies, do you rely on an existing subsystem, do you make your own from the basic building blocks? And if you then take the next step and move your software to a GPU, there’s a whole other level of things to consider. Now you have different processing units and different memory spaces which need to be synchronised, you get stricter restrictions on what you can do and, depending on which direction you take, you also get vendor locked, as each GPU vendor tries to win you over to their hardware… It can be a daunting thing to dip your toes into, and this is an opportunity for you to learn a bit about different alternatives in this space from my own experiences, hopefully so you can make a more educated decision on which direction you want to go.

Way, way back, I wrote a piece of software for school. The purpose of that was to draw something on the screen you could point to and say “look, I did a thing which draws a pretty!” and have your teacher agree with you. My choice was the Mandelbrot Set, a pretty good looking and, crucially, easy to code fractal. The algorithm I chose was essentially “babby’s first mandelbrot”, the very simplest one that does some brute force iterative maths. This time I didn’t leave a comment with the original algorithm source, but most of the implementations in Rosetta Code are basically the same algorithm (I’ve linked the Java one as it’s quite straightforward to read). This initial implementation had two useful characteristics:

 – It was very simple; it’s essentially one ~50 line function that does some pretty basic maths, plus some UI, boilerplate and housekeeping. My oldest version of this is just over 300 lines.
 – Because it doesn’t go intelligently or elegantly at pretty much anything, it’s also pretty costly for something you’re running on a CPU.

A few years later, when I was studying the then recent C++11 standard, I wanted some code I could play with to get some hands-on experience with the new parallelism functionality, in particular std::tasks. And I remembered this code existed. So came the repurposing of Toybrot.

Let's get cracking; What are we in for?

So, by now I’ve reimplemented Toybrot several times, for both CPU AND GPU. It’s become my go-to tool for examining new parallelism technologies. It’s simple enough that it’s almost a Hello World, but it also has a tiny bit more complexity, so it exposes a bit more of the “day to day” life with any of these things (though not enough to really stress any of them).

My goal here is to give an overview of my experience with these different technologies. I want to talk about how they compare to each other not only in performance (though given how simple the code is, they don’t vary too much) but also in ease of use, learning and integration, as well as how I feel/hope these technologies will evolve in the future.

Languages/Libraries I've implemented so far

CPU
GPU
Other/Special

Project Structure and dev/testing environment

After furious refactoring, all current projects follow the same structure, with the exception of the MPI ones.

I’ve linked it before, but the project is currently hosted on GitLab. Feel free to either clone it or browse the files there.

In case you feel like building it yourself: the project is built with CMake and I’m assuming you’re always building it with Clang. There shouldn’t, however, be much there, if anything, that GCC screams about.

If you’re on Windows, firstly I’m sorry for you, and secondly I’m sorry that until now it just never occurred to me to make sure this works on Windows. I don’t really use anything that’s Linux specific, I just don’t know the state of Windows support for things like ROCm and SYCL.

Every project, except really for the main MPI one, is an interactive demo, and they all use SDL2 to manage the window, UI and input. Other than that, you should just need the specifics for each project. I’ll give more details in each section, but the root level CMakeLists.txt should try and enable projects as it finds their specific dependencies.

Most of my dev and testing has been done on my personal rig at home. This computer has:

 – A 1920X threadripper (12c24t)
 – 32GB of 3466C14 RAM
 – A Vega 56 that’s been flashed with a 64 BIOS.

I’m quite the shameless AMD fanboy, but I have done SOME testing on a machine from work that has an Intel 6700K and an nVidia 1080Ti, specifically for the CUDA side of things. So if I ever mention any numbers, unless I specify otherwise, this is where they came from.

Source and Header files

I’ve set up all the projects to use a structure as similar to each other as possible. Since my goal is comparison, I try to make sure not only the files line up, but they’re exact copies if possible.

The files you can expect for each of these projects are:

Will have any specific shenanigans required for building and deploying.

This file contains the entry point for the application. The important bit here is a function called runProgram. This function sets off the two main loops of the application: one which captures input and another which requests the generation of the fractal and window redraws. Of note here is that the function that generates the fractal works differently for CPU and GPU based code. Given that GPUs are not only much more powerful (this is a trivial task for many modern GPUs) but also that one of the largest overheads you add is precisely the memory transfers, the Generate function for GPUs will always calculate the entire fractal for our window, whereas each call to the CPU generators will only update 10 lines of pixels. The function will report when the entire thing is done, as the application will ignore most input during that time.

Predictably, this class controls the display window and handles input through SDL2.
Nothing much to say about this one as it doesn’t really do anything of note, except trying, with limited success, to keep your selection rectangle for a deeper dive in a fixed aspect ratio.

This is where the meat of the implementation is.

What happens in this class is as follows:

  –   Constructor and destructor will do any set up and cleanup necessary.

  –   The Generate function is called by main. This function does the setup for work submission and breaks down the work units. For CPU methods it breaks the work down into a multiple of the reported number of available threads. For GPUs, a different work unit is spawned for every individual pixel on the window. Each work unit runs the calculateIterations function to get the actual value we want, followed by some very simple conversion to turn that into some hopefully not horrendous colour. These two functions are what change the most depending on the individual needs of each language/library/tool (a rough sketch of that per-pixel function follows below).
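
Since calculateIterations itself isn’t reproduced in this post, here is roughly the shape of the escape-time maths every work unit runs. This is a minimal sketch of the textbook brute force algorithm, not the actual toyBrot function; the name and iteration cap are made up for illustration:

#include <complex>
#include <cstdint>

//Sketch of an escape-time loop for a single pixel, mapped to the complex point (x0, y0)
uint32_t escapeTime(double x0, double y0, uint32_t maxIterations = 256)
{
    std::complex<double> z {0.0, 0.0};
    const std::complex<double> c {x0, y0};
    uint32_t i = 0;
    //Iterate z = z^2 + c until it escapes the radius-2 circle or we give up
    while( (std::norm(z) <= 4.0) && (i < maxIterations) )
    {
        z = (z * z) + c;
        i++;
    }
    //The iteration count is what later gets converted into a colour
    return i;
}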

It's threading time!

C++ Standard Templates Library: Tasks and Threads

With the advent of the C++11 standard, C++ finally has native, built-in concurrency facilities. Before this, you’d need to rely either on your Operating System (through the likes of Windows threads or pthreads) or on some abstraction built on top of them. Now these are part of the language, which makes it much easier to write portable code, though people used to dealing with more sophisticated libraries might find std::thread in particular a bit barebones. I’ve mentioned before that std::tasks, on the other hand, were really my gateway to this entire project. So, as we look into them, right off the bat: what’s the difference?


Threads are the more traditional “work unit” for parallel systems. Each thread is spawned with a job to run, which is often executed in a loop. The thread can be detached, which means “setting it free” from the main calling thread so it’s now in its own world, managed by the OS’s thread scheduler, or joined, which has the calling thread “bring it back” and wait for it to complete.


This is how input is handled for every single version of this project:


int runProgram() noexcept
{
    //some setup here
    //exit and redraw are flags shared with other threads, hence the indirection
    auto exit   = std::make_shared<std::atomic<bool>>(false);
    auto redraw = std::make_shared<std::atomic<bool>>(true);
    //expanding this lambda
    std::thread eventCapture([&mainWindow, exit]
                             ()
                             {
                                while( !*exit )
                                {
                                    mainWindow.captureEvents();
                                }
                             } );
    //Let the new thread run on its own
    eventCapture.detach();

    while(!*exit)
    {
        if(*redraw)
        {
            //fractal generation bits
            //event handling is happening concurrently in the detached thread
        }
        mainWindow.paint();
    }
    if(eventCapture.joinable())
    {
        //a detached thread is never joinable, so this is just a safety net
        eventCapture.join();
    }
    return 0;
}

Tasks, on the other hand, are an attempt at an easier to manage abstraction. Each task is an object that represents some amount of work which may be done in some background thread.

Note: when I say task here, I refer to the inner workings and the returned std::future from a std::async call, rather than to the actual std::packaged_task. For reference, you can think of a call to async as creating a packaged_task, moving it to a thread to run and giving you its returned future; you can check out the example in cplusplus.com/reference to see what that looks like.

These tasks rely on the concepts of futures and promises. A promise encapsulates the returned value, along with any exceptions you may want to catch from that thread. The future is the object the calling thread holds on to. Each future has a get() function which blocks until the promise is fulfilled (there are also wait_for() and wait_until() if you want a timeout), which in this context means the task has finished.
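
To make that relationship concrete, here is a minimal, self-contained sketch (not from toyBrot) of a promise being fulfilled on a worker thread while the main thread blocks on the matching future:

#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::promise<int> promise;
    //The future is the calling thread's handle on the eventual result
    std::future<int> result = promise.get_future();

    std::thread worker([&promise]()
                       {
                           //Fulfil the promise from the worker thread
                           promise.set_value(42);
                       });

    //get() blocks until the promise has been fulfilled
    std::cout << "The answer is " << result.get() << std::endl;
    worker.join();
    return 0;
}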

Side by side, these implementations look like this:

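//std::thread version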
bool FracGen::Generate(uint32_t* v, SDL_PixelFormat* format,
                        int width, int height, Region r)
{
    if(format == nullptr)
    {
        return false;
    }
    static std::atomic<int> h {0};
    static std::vector< std::thread> threadPool(numThreads);
    std::vector<bool> results(threadPool.size());
    bool finishedGeneration = false;
    for(unsigned int i = 0; i < numThreads; i++)
    {
      //Fire up all threads
      threadPool[i] = std::thread([v, width, height, r, 
                                  fmt = *format, h0 = h.load(),
                                  idx = i, &results]
                      ()
                      {calculateIterations(v, width, 
                                           height, r, 
                                           fmt, h0, 
                                           idx, results);});
    }
    h+= 10;
    for(auto& td : threadPool)
    {
        //wait until all threads complete
        td.join();
    }
    //Joins are done; check whether the threads reported completing the fractal
    for(bool b : results)
    {
        if(b)
        {
            //If one of them is done, they all must be
            h.store(0);
            finishedGeneration = true;
        }
    }
    return finishedGeneration;
}
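
//std::async version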
bool FracGen::Generate(uint32_t* v, SDL_PixelFormat* format,
                        int width, int height, Region r)
{
    if(format == nullptr)
    {
        return false;
    }
    static std::atomic<int> h {0};
    bool finishedGeneration = false;
    auto genFrac = [](uint32_t* data, 
                      int width, 
                      int height, 
                      Region r, 
                      SDL_PixelFormat format, 
                      int h0, size_t idx)
                    {
                      return calculateIterations(data,width,
                          height,r,format,h0,idx);
                    };
    std::vector< std::future<bool>> tasks(numTasks);
    for(unsigned int i = 0; i < numTasks; i++)
    {
        tasks[i] = std::async(std::launch::async, 
                              genFrac,
                              v, width, height, r, 
                              *format, h.load(), i);
    }
    h+= 10;
    for(unsigned int i = 0; i < tasks.size(); i++)
    {
        //Block until all tasks are finished
        if(tasks[i].get())
        {
            //If one of them is done, they all must be
            h.store(0);
            finishedGeneration = true;
        }
    }
    return finishedGeneration;
}

At first glance there is not much difference between the two implementations: Both rely on you, the coder, managing how many work units you are going to run and working out how to get their results and keep things from bumping into each other (something pretty simple when your code can be this neatly isolated).

The actual calculateIterations function is the same for both, and it’s coded in a way that each pixel in the texture pointer is only ever set by one thread/task.

The brunt of the magic for each call happens in a function call that takes a lambda. For the std::thread version, this is the constructor of the thread; for the task version, it’s the call to std::async.

Both of these functions will take functors as parameters and run them. The difference here is that the tasks give you control over the acquisition and treatment of the results of the job through the returned std::future.

Conveniently, this does NOT mean that you NEED your task’s functor to return anything: std::future<void> exists and is a pretty handy way of just having one thread wait for something to finish on another thread. Even if the type is void, the call to get() will block the calling thread until the promise is fulfilled.
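
As an illustration (hypothetical names, not toyBrot code), using a std::future<void> purely as a “done” signal looks something like this:

#include <future>
#include <iostream>

int main()
{
    //std::future<void> carries no value, only completion (and any stored exception)
    std::future<void> done = std::async(std::launch::async, []()
                             {
                                 //pretend this is some lengthy background job
                                 std::cout << "working in the background...\n";
                             });

    //do other useful things on the main thread here...

    //get() still blocks until the lambda has finished, even though there is nothing to return
    done.get();
    std::cout << "background job finished\n";
    return 0;
}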

How does it stack up?

Since this is all C++, let’s start with the usual main question: how do these two compare in performance?

And in my semi-controlled test, I can say that they are pretty much the same:
I didn’t really control for “other stuff” running, but this is consistent. And the takeaway here, IMO, is that if you’re scared that std::async might give you some overhead, this hello-world type test says not to worry. There’s a constant fear I keep hearing about, that threads could be mis-scheduled and you’d get some untraceable performance hit… that doesn’t seem to be the case.


Both of these implementations being pure STL is a pretty big deal, though. It means that for a project like this you only need a C++ compiler and some way of getting the output (here that’s SDL2 for display, but you could, say, use libpng and dump a picture).

When it comes to ease of implementation and code readability, my personal favourite is std::async. Whilst I think threads are, in some ways, a better representation of concurrent application loops (such as the event capture thread I fire up in main), for well defined work batches I feel async gives you two advantages:

  1. std::thread is a lower-level abstraction, so its concepts are a bit more specific. This can also lead to more trouble in structuring your code, such as understanding when threads are supposed to be detached and when you should keep them joinable… If you DO want to do some hands-on managing of your own thread pool with some fancy logic, then go for it; otherwise, std::future is a much more immediately relatable abstraction and makes your code more readable and easier to implement.
  2. Aside from being more relatable, using std::async hands a bunch of work off from you to the standard library and the underlying OS. All you really need to do is specify your launch policy and then get() your future when you want it. Futures can also carry exceptions, which can make error management in your code easier, as you can handle everything from the main thread (see the sketch just below).
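
Here’s a minimal sketch of that exception behaviour (again, an illustration rather than toyBrot code):

#include <future>
#include <iostream>
#include <stdexcept>

int main()
{
    std::future<int> result = std::async(std::launch::async, []() -> int
                              {
                                  //Anything thrown here gets stored in the future
                                  throw std::runtime_error("something went wrong in the task");
                              });
    try
    {
        //The stored exception is rethrown here, on the main thread
        std::cout << result.get() << std::endl;
    }
    catch(const std::exception& e)
    {
        std::cout << "Caught from the worker: " << e.what() << std::endl;
    }
    return 0;
}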


For those reasons, I consider the std::async implementation to be the measuring stick. It is my default, and I measure every other way of threading code against it. It is also my recommendation that, if you’re just getting into this, you get familiar with it, have a play, write a hello world or two and keep that as your baseline. It’s easy to implement, easy to manage and it’s pure STL; C++11 is pretty old news by now, even Microsoft’s compiler implements it. From here on, things get more specific and/or reliant on external libraries, hardware and infrastructure.


The full implementation for both of these can be found in my toyBrot repo on GitLab.

The projects are called STDTHREADS and STDTASKS

A couple of notes on toyBrot for the CPU

You may have noticed my benchmarks reported 96 jobs when I mentioned I only have 24 hardware threads. This is intentional.

Way back, when I was first playing with this on my old 8 core Bulldozer, I noticed that performance gains didn’t stop once I hit my expected thread cap. Instead, though to a much smaller degree, I kept gaining a bit of performance until somewhere around 3-4 times that amount of concurrent tasks. I’ve repeated those tests with Intel Skylake and then with AMD Zen, once I managed to spend an irresponsible amount of money on one, and the observation remained consistent. I also remember seeing it on both Windows and Linux, so it’s not necessarily some OS specific thread scheduler thing (unless my memory fails me; I’ve been happily away from Windows for quite a while now). Past that point, it seems the overhead of spawning and consuming the tasks takes over and you lose that extra performance.

I don’t really know WHY, but it’s apparently a thing, at the very least with this piece of software.
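
For reference, that choice boils down to something like this; a sketch, where the 4x multiplier is just the empirical sweet spot from those tests rather than any hard rule:

#include <thread>

//hardware_concurrency() reports the number of hardware threads (24 on the 1920X).
//Oversubscribing by roughly 4x is simply what has tested fastest for toyBrot so far.
static const unsigned int numTasks = std::thread::hardware_concurrency() * 4u;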


The second thing is a curiosity about memory. If you’ve actually opened up the source code to browse it, you might have seen that the actual calculateIterations function has two different implementations for how it breaks up work according to tasks.

Both of them work on vertical strips of pixels. As mentioned way back, for the CPU generators I do it in batches of 10 lines and update the image as we go, mostly just to make the visuals more responsive. But the columns are managed by the threads: each one gets a “you’re job X of a total of Y” and from there figures out which columns it needs to work on.

Implementation 1 is the very first thing you’d think of: you break the horizontal space into Y strips of width/Y pixels and each job works on one of those:

for(int w = idx*(width/numTasks); w <= (idx+1)*(width/numTasks);w++)

It turns out that is a pretty bad way of doing this, and the reason is that CPUs are smart and something called the prefetcher exists. Memory operations take time, so when you ask for a bit of data from your RAM, your CPU already tries to load up some adjacent memory if it can, because we work on arrays all the time. But even though all this data is ultimately contiguous, each task is working on a very different part of that memory.

I saw a talk about this once and decided to check, so I reimplemented this for loop slightly differently:

for(int w = 0 + static_cast<int>(idx); w < width; w+= numTasks )

If you’ve coded anything on the GPU, this is usually how you manage buffers of interleaved data: you have an offset (to the data you want) and a stride (the size of the whole package). It also means that, at any given point, the threads are much more likely to be working on adjacent data. Quickly swapping those two loops around, not touching anything else in the code and running the same number of iterations, I get:

Preparing to run with 96 parallel tasks
Iteration 0 took 1771 milliseconds
Iteration 1 took 1756 milliseconds
Iteration 2 took 1766 milliseconds
Iteration 3 took 1746 milliseconds
Iteration 4 took 1825 milliseconds
Iteration 5 took 1870 milliseconds
Iteration 6 took 1883 milliseconds
Iteration 7 took 1931 milliseconds
Iteration 8 took 1951 milliseconds
Iteration 9 took 2001 milliseconds
Iteration 10 took 1986 milliseconds
Iteration 11 took 2141 milliseconds
Iteration 12 took 2117 milliseconds
Iteration 13 took 2176 milliseconds
Iteration 14 took 2267 milliseconds
Iteration 15 took 2333 milliseconds
Iteration 16 took 2350 milliseconds
Iteration 17 took 2417 milliseconds
Iteration 18 took 2565 milliseconds
Iteration 19 took 2729 milliseconds


Average time of 2079 milliseconds (over 20 tests)

This is 21% additional time compared to the current version. Bonus points because I split things wrong and I actually lost the last column of the fractal (oopsie).

And this is on what is effectively “the same code”; that extra time is lost on memory access.


This is not exactly a threading-related issue, but it’s worth keeping in mind in general too. As shown here, the difference can be pretty considerable.

What's next?

This is already pretty long and there’s still a lot to go; it became obvious fairly early into writing this that I’d need to break it up into multiple posts. In the next one I’m going to talk a bit about the remaining CPU-based implementation, as well as the odd one out: that’d be OpenMP and MPI. The idea is to cover these before I start talking about the GPU implementations, as those are going to be plenty.

Some reference and related content