It’s been a while since CPUs stopped getting a lot faster. They’re still improving, just in different ways: they do more with whatever clock speed they’ve got, but no one’s really holding their breath for a next generation that’ll clock 6GHz without extreme cooling. This has been true for quite a few years now and, instead, CPUs have been doing more at once, amassing processing cores. Not only that, more and more people have caught on that there are a lot of problems which GPUs are REALLY good at dealing with; as long as the thing you need to do is “simple enough”, they can process batches of thousands at once.
This development did not pass unnoticed. Parallel computing, concurrent execution, multi-threading… whichever way you call it, all sorts of devs have been chasing it. Not only have they been trying to wrap their heads around how to manage a piece of software that’s doing all sorts of things at once, they’ve been trying to figure out how to make that not a nightmare. Hardware vendors have been chasing their own avenues too, with their own frameworks and libraries, and by developing specialised hardware. I myself am typing this on my AMD 1920X, a CPU which gives me 24 execution threads, while I daydream over the decadent excess that would be buying its bigger brother, with 64, on a desktop CPU; and mine isn’t even super expensive these days. From fiction to servers to now regular desktops, this is a LOT to have a CPU do at once.
On the GPU side, I remember watching pretty excitedly a few years back when AMD, ARM and Qualcomm formed their own “league of evil” and announced their HSA Foundation initiative. Heterogeneous computing, making software run across discrete and different processing units, was a pretty dang hype idea, and it keeps growing; nVidia has pretty recently released their new Turing architecture cards and much of the overhaul aimed to entice not necessarily gamers, but scientists who need to crunch endless amounts of data for Machine Learning models. Not wanting to be left behind, Intel, who plans on releasing discrete GPUs starting next year (not that we haven’t heard this story before), is placing a lot of their chips on supporting open heterogeneous computing standards and winning the public over that way. Computing on the GPU is here to stay.
So, by now I’ve reimplemented Toybrot several times for both CPU AND GPU. It’s become my go-to tool for examining new parallelism technologies. It’s simple enough that it’s almost a Hello World, but it also has a tiny bit more complexity, so it exposes a bit more of the “day to day” life with any of these technologies (though not enough to really stress them).
My goal here is to give an overview of my experience with these different technologies. I want to talk about how they compare to each other not only in performance (though given how simple the code is, they don’t vary too much) but also in ease of use, learning and integration, as well as how I feel/hope these technologies will evolve in the future.
After furious refactoring, all current projects follow the same structure, with the exception of the MPI ones.
I’ve linked it before, but the project is currently hosted on GitLab. Feel free to either clone or browse the files there.
In case you feel like building it yourself: the project is built with CMake and I’m assuming you’re always building it with Clang. There shouldn’t, however, be much there, if anything, that GCC screams about.
If you’re on Windows, firstly I’m sorry for you, and secondly I’m sorry that until now it just never occurred to me to ensure this works on Windows. I don’t really use anything that’s Linux-specific, I just don’t know the state of Windows support for things like ROCm and SYCL.
Every project except for the main MPI one is an interactive demo, and they all use SDL2 to manage the window, UI and input. Other than that, you should just need the specifics for each project. I’ll give more details in each section, but the root-level CMakeLists.txt should try to enable projects as it finds their specific dependencies.
Most of my dev and testing has been done on my personal rig at home. This computer has:
– A 1920X threadripper (12c24t)
– 32GB of 3466C14 RAM
– A Vega 56 that’s been flashed with a 64 BIOS.
I’m quite the shameless AMD fanboy, but I have done SOME testing on a machine from work that has an Intel 6700K and an nVidia 1080Ti, specifically for the CUDA side of things. So if I ever mention any numbers, unless I specify otherwise, this is where they came from.
I’ve set up all the projects to use a structure as similar to each other as possible. Since my goal is comparison, I try to make sure not only that the files line up, but that they’re exact copies where possible.
The files you can expect for each of these projects are:
Will have any specific shenanigans required for building and deploying
Predictably, this class controls the display window and handles input through SDL2
Nothing much to say about this one, as it doesn’t really do anything of note, except try, with limited success, to keep your selection rectangle for a deeper dive in a fixed aspect ratio
This is where the meat of the implementation is.
What happens in this class is as follows:
– Constructor and destructor will do any set up and cleanup necessary.
– The Generate function is called by main. This function does the setup for work submission and breaks down the work units. For CPU methods it breaks the work down into a multiple of the reported number of available threads. For GPUs, a different work unit is spawned for every individual pixel on the window. Each work unit runs the calculateIterations function to get the actual value we want, followed by some very simple conversion to turn that into some hopefully not horrendous colour. These two are what change the most depending on the individual needs of each language/library/tool
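To make the flow above concrete, here’s a minimal sketch of what a CPU-side Generate might look like. The names (`generate`, `calculateIterations` as a stand-in stub) and the exact splitting are illustrative, not the project’s actual code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <future>
#include <thread>
#include <vector>

// Stand-in for the real fractal escape-time loop; the real function
// iterates the fractal formula for the point mapped to pixel (x, y).
static uint32_t calculateIterations(int x, int y)
{
    return static_cast<uint32_t>((x * 31 + y * 17) % 256);
}

// Hypothetical sketch of the Generate flow: split the image into a
// multiple of the hardware thread count and give each task an id.
std::vector<uint32_t> generate(int width, int height)
{
    std::vector<uint32_t> texture(static_cast<std::size_t>(width) * height, 0);
    const unsigned numTasks = std::max(1u, std::thread::hardware_concurrency()) * 4;

    std::vector<std::future<void>> futures;
    for (unsigned idx = 0; idx < numTasks; ++idx)
    {
        futures.push_back(std::async(std::launch::async,
            [&texture, width, height, numTasks, idx]()
            {
                // Strided columns: task idx handles columns idx, idx+numTasks, ...
                for (int x = static_cast<int>(idx); x < width;
                     x += static_cast<int>(numTasks))
                {
                    for (int y = 0; y < height; ++y)
                    {
                        texture[static_cast<std::size_t>(y) * width + x] =
                            calculateIterations(x, y);
                    }
                }
            }));
    }
    for (auto& f : futures) { f.get(); }  // wait for every task to finish
    return texture;
}
```

Each pixel is written by exactly one task, which is what lets this stay lock-free.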
With the advent of the C++11 standard, C++ finally has native built-in concurrency facilities. Before this, you’d need to rely on either your Operating System (through the likes of Windows threads or pthreads) or use some abstraction built on top of them. Now, these are built into the language, which makes it much easier to write portable code, though people used to dealing with more sophisticated libraries might find std::threads in particular a bit barebones. I’ve mentioned before that std::tasks, on the other hand, were really my gateway to this entire project. So as we look into them, right off the bat, what’s the difference?
Threads are the more traditional “work unit” for parallel systems. Each thread is spawned with a task, which is often executed in a loop. The thread can be detached, which means “setting it free” from the main calling thread so it’s now in its own world, managed by the OS’s thread scheduler, or joined, which has the calling thread “bring it back” and wait for it to complete.
This is how input is handled for every single version of this project:
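As a sketch of that spawn/join pattern (the actual SDL event polling isn’t reproduced here; `eventLoop` and `runApp` are hypothetical names and the loop body is a stand-in):

```cpp
#include <atomic>
#include <thread>

// A long-lived std::thread runs the input/event loop while the main
// thread does the rendering, then the flag is flipped and the thread
// is joined on shutdown.
std::atomic<bool> running{true};

void eventLoop()
{
    while (running.load())
    {
        // In the real project this would poll SDL events; here it just yields.
        std::this_thread::yield();
    }
}

void runApp()
{
    std::thread input(eventLoop);  // spawn the event-capture thread
    // ... the main thread would generate and display the fractal here ...
    running.store(false);          // signal the loop to stop
    input.join();                  // "bring it back": wait for it to finish
}
```

Join (rather than detach) is the natural choice here, since we want a clean shutdown.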
Tasks, on the other hand, are an attempt at an easier-to-manage abstraction. Each task is an object that represents some amount of work that may be done in some background thread.
Note: When I say task here I refer to the inner workings and returned std::future from a std::async call, rather than to the actual std::packaged_task. For reference, you can think of a call to async as creating a packaged_task, moving it to a thread to run and giving you its returned future; you can check out the example in cplusplus.com/reference to see what that looks like
These tasks rely on the concept of futures and promises. A promise encapsulates the returned value, as well as any exceptions you may want to catch from that thread. And the future is the object which the calling thread holds. Each future has a get() function which waits until the promise is fulfilled (there are also wait_for() and wait_until() if you want a timeout), which in this context means the task has finished.
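A minimal illustration of both halves of that mechanism (the function names here are just for the example):

```cpp
#include <future>
#include <stdexcept>

// The value produced on the task's thread comes back to the caller
// through future::get(), which blocks until the task has run.
int askAnswer()
{
    std::future<int> fut = std::async(std::launch::async, []() { return 6 * 7; });
    return fut.get();
}

// Exceptions thrown inside the task travel through the promise too:
// they are rethrown on the calling thread when get() is invoked.
bool exceptionsTravel()
{
    std::future<int> fut = std::async(std::launch::async,
        []() -> int { throw std::runtime_error("oops"); });
    try { fut.get(); }
    catch (const std::runtime_error&) { return true; }
    return false;
}
```

That exception propagation is one of the nicer conveniences of tasks over raw threads, where an uncaught exception simply terminates the program.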
Side by side, these implementations look as such:
At first glance there is not much difference between the two implementations: both rely on you, the coder, managing how many work units you are going to run, working out how to get their results and keeping things from bumping into each other (something pretty simple when your code can be as neatly isolated as this).
The actual calculateIterations function is the same for both, and it’s coded in such a way that each pixel in the texture pointer is only set by one thread/task.
The brunt of the magic for each call happens in a function that takes a lambda. For the std::thread version, this is the constructor of the thread; for the task version, it’s the call to std::async.
Both of these functions will take functors as parameters and run them. The difference here is that the tasks give you control over the acquisition and treatment of the results of the job through the returned std::future.
Conveniently, this does NOT mean that you NEED your task’s functor to return anything, as std::future<void> exists and is a pretty handy way of just having one thread wait on something finishing on a different thread. Even if the type is void, the call to get() will block the calling thread until the promise is there.
Since this is all C++, let’s start with the usual main question: how do these two compare in performance?
And in my semi-controlled test, I can say that they are pretty much the same:
I didn’t really control for “other stuff” running, but this is consistent. The takeaway here, IMO, is that if you’re scared that std::async might give you some overhead, this hello-world type test says not to worry. There’s a constant fear I keep hearing about, that threads could be mis-scheduled and you’ll get some untraceable performance hit… that doesn’t seem to be the case.
Both of these implementations being pure STL is a pretty big deal, though. It means that for a project like this you only need a C++ compiler and some way of getting the output (here that’s SDL2 for display, but you could, say, use libpng and dump a picture).
When it comes to ease of implementation and code readability, my personal favourite is std::async. Whilst I think threads are, in some ways, a better representation of concurrent application loops (such as the event capture thread I fire up in main), for well-defined work batches I feel async gives you two advantages:
For those reasons, I consider the std::async implementation to be the measuring stick. It is my default, and every other way to thread code gets measured against it. It is also my recommendation that, if you’re just getting into this, you get familiar with it, have a play, write a hello world or two and have that as your baseline. It’s easy to implement, easy to manage and it’s pure STL. C++11 is pretty old news by now; even Microsoft’s compiler implements it. From here on, things get more specific and/or reliant on external libraries, hardware and infrastructure.
The full implementation for both of these can be found at: my toyBrot repo in GitLab
The projects are called STDTHREADS and STDTASKS
You may have noticed my benchmarks reported 96 jobs when I mentioned I only have 24 hardware threads. This is intentional.
Way back, when I was first playing with this on my old 8-core Bulldozer, I noticed that performance gains didn’t stop once I hit my expected thread cap. Instead, though by much less, I kept gaining a bit of performance until somewhere around 3-4 times that number of concurrent tasks. I’ve repeated those tests with Intel Skylake, and then with AMD Zen once I managed to spend an irresponsible amount of money on one, and the observation remained consistent. I also remember seeing it on both Windows and Linux, so it’s not necessarily some OS-specific thread scheduler thing (unless my memory fails; I’ve been happily away from Windows for quite a while now). Past that point, it seems the overhead of spawning and consuming the tasks takes over and you lose that extra performance.
I don’t really know WHY, but it’s apparently a thing, at the very least with this piece of software.
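If you want to poke at the observation yourself, a minimal timing harness might look like this. The workload is an artificial stand-in (a partial harmonic sum), not the fractal code, and `runWith` is a made-up name; the idea is just to keep the total work constant while varying the task count:

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// Times the same total amount of (artificial) work split across numTasks
// std::async tasks. Call with e.g. 1x, 2x and 4x hardware_concurrency
// (adjusting workPerTask to keep numTasks * workPerTask constant) and compare.
double runWith(unsigned numTasks, std::size_t workPerTask)
{
    auto start = std::chrono::steady_clock::now();

    std::vector<std::future<double>> futs;
    futs.reserve(numTasks);
    for (unsigned i = 0; i < numTasks; ++i)
    {
        futs.push_back(std::async(std::launch::async, [workPerTask]()
        {
            double acc = 0.0;  // busywork the compiler can't easily elide
            for (std::size_t n = 1; n <= workPerTask; ++n)
            {
                acc += 1.0 / static_cast<double>(n);
            }
            return acc;
        }));
    }
    for (auto& f : futs) { f.get(); }  // wait for all tasks

    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}
```

Whether you reproduce the 3-4x sweet spot will depend on your scheduler, your workload shape and how uneven the individual tasks are.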
The second thing is a curiosity about memory. If you’ve actually opened up the source code to browse it, you might have seen that the actual calculateIterations function has two different implementations for how the work is broken up between tasks.
Both of them work on vertical strips of pixels. As mentioned way back, for the CPU generators I do it in batches of 10 lines and update the image as we go, just to make the visuals more responsive, really. But the columns are managed by the threads: each gets a “you’re job X of a total of Y” and from there figures out which columns it needs to work on.
Implementation 1 is the very first thing you’d think of: you break the horizontal space into Y strips of width/Y pixels and work on one of those:
for(int w = idx*(width/numTasks); w <= (idx+1)*(width/numTasks);w++)
It turns out that this is a pretty bad way of doing it, and the reason is that CPUs are smart and something called the prefetcher exists. Memory operations take time, so when you ask for a bit of data from your RAM, your CPU tries to load some adjacent memory as well if it can, because we work on arrays all the time. But even though all this data is ultimately contiguous, each task is working on a very different part of that memory.
I saw a talk on this once and decided to check, so I reimplemented this for loop slightly differently:
for(int w = 0 + static_cast<int>(idx); w < width; w+= numTasks )
If you’ve coded anything on the GPU, this is usually how you manage buffers of different data: you have an offset (to the data you want) and a stride (the size of the whole package). This also means that threads are much more likely to be working on adjacent data at any point. Quickly swapping those two around, not touching anything else in the code and running the same number of iterations, I get:
Preparing to run with 96 parallel tasks
Iteration 0 took 1771 milliseconds
Iteration 1 took 1756 milliseconds
Iteration 2 took 1766 milliseconds
Iteration 3 took 1746 milliseconds
Iteration 4 took 1825 milliseconds
Iteration 5 took 1870 milliseconds
Iteration 6 took 1883 milliseconds
Iteration 7 took 1931 milliseconds
Iteration 8 took 1951 milliseconds
Iteration 9 took 2001 milliseconds
Iteration 10 took 1986 milliseconds
Iteration 11 took 2141 milliseconds
Iteration 12 took 2117 milliseconds
Iteration 13 took 2176 milliseconds
Iteration 14 took 2267 milliseconds
Iteration 15 took 2333 milliseconds
Iteration 16 took 2350 milliseconds
Iteration 17 took 2417 milliseconds
Iteration 18 took 2565 milliseconds
Iteration 19 took 2729 milliseconds
Average time of 2079 milliseconds (over 20 tests)
This is 21% additional time compared to the current version, on what is effectively “the same code”. That time is lost purely on memory access. Bonus points because I split things wrong and actually lost the last column of the fractal (oopsie).
This is not exactly a threading-related issue, but it’s worth keeping in mind in general. As shown here, the difference can be pretty considerable.
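For reference, here are the two partitioning schemes side by side as tiny standalone functions (the names are made up for the example, and the block version’s bound is tidied to `<` so strips don’t overlap; note that the integer division still drops the trailing `width % numTasks` columns, which is the “lost column” mentioned above):

```cpp
#include <vector>

// 'width' columns split among 'numTasks' tasks; owner[w] records which
// task id ends up handling column w (-1 means no task touched it).

// Scheme 1: contiguous blocks. Each task owns one strip of columns, so
// at any instant the tasks are touching widely separated memory.
void blockColumns(std::vector<int>& owner, int width, int numTasks)
{
    for (int idx = 0; idx < numTasks; ++idx)
    {
        for (int w = idx * (width / numTasks);
             w < (idx + 1) * (width / numTasks); ++w)
        {
            owner[w] = idx;  // width % numTasks trailing columns are missed
        }
    }
}

// Scheme 2: offset + stride, GPU style. Task idx takes columns idx,
// idx + numTasks, idx + 2*numTasks, ... so neighbouring tasks tend to
// work on adjacent columns at roughly the same time.
void stridedColumns(std::vector<int>& owner, int width, int numTasks)
{
    for (int idx = 0; idx < numTasks; ++idx)
    {
        for (int w = idx; w < width; w += numTasks)
        {
            owner[w] = idx;
        }
    }
}
```

Both assign each column to exactly one task; only the access pattern, and hence how friendly the scheme is to the prefetcher, differs.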
This is already pretty long and there’s still a lot to cover. It became quite obvious fairly soon into writing this that I’d need to break it up into multiple posts. In the next one I’m going to talk a bit about the remaining CPU-based implementation, as well as the odd one out: that would be OpenMP and MPI. The idea is to cover those before I start talking about the GPU implementations, as there are going to be plenty of them.