Multi Your Threading #7: Feeling a little blue

Ye olde disclaimer, straight off the bat, because this is going to be an intense one!

This is chapter 7 in a series I’ve been doing on multithreading. Each chapter (generally) is based on me reimplementing the same code using different languages/technologies/models and talking about them both in isolation as well as in comparison to one another. While it’s not necessary that you’ve been following, I do assume you’re familiar with parallelisation in general and I’ll make the occasional reference to other implementations. You can find them here if you’d like to check them out:

Note 1: I’ll talk about this later but the numbers I had previously on OpenMP are NOT representative of current state (it’s much better)

Note 2: The code behind all of this is open source and you can find it in my GitLab repository. Feel free to clone it and follow along using your favourite editor (I’m quite partial to Atom)

Okay, admin done! Let’s get going!

Team Blue! Da ba dee da ba daa!

So, at the risk of having my AMD fanboy certificate revoked and my shirt confiscated, it’s time to look at some of what Intel’s got to offer. I say some because they’ve been investing really heavily in this area and marrying it to their dedicated GPUs which, as of January 2020, are kind of happening for real actually.

There’s going to be more Intel to come as I actually feel that covering everything from them I want to look at in just the one post would be WAY too heavy. So I’m not going to be covering their OneAPI initiative (haven’t even had a chance to look into that yet). I’m also not going to be talking about their SYCL things and this is mostly because my access to Intel hardware is sub-optimal. Besides the NUC that hosts this blog, I have… a tablet. And though I use an Intel CPU at work, that one doesn’t have an iGPU so… yeah, sorry Team Blue. Give me a GPU when they’re out so I can drop it in between the R7 and the Titan, I have a universal waterblock waiting for it.

So, what I AM going to talk about today is a library and a compiler. Respectively TBB and ISPC. They’re both…. **checks around to see if Lisa Su is not breaking into my house to murder me for treason**… both really exciting and generally pretty great. Spoiler warning: I’m a great fan of both right now. So let’s get on with it!

TBB: I have a blue task with a blue functor

Intel’s Threading Building Blocks is “a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers”. It also has a somewhat deceptive name in a way because it’s task based, like std::async, instead of actually exposing threads directly.
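Just to give a quick flavour of what “task based” means here, this is a minimal sketch (not from toyBrot, which uses parallel_for as you’ll see below) using tbb::task_group: you hand TBB work units and a join point, and its scheduler worries about the threads.

// taskGroupSketch.cpp (illustrative only)

#include <tbb/task_group.h>
#include <cstdio>

int main()
{
    tbb::task_group group;

    // You give TBB tasks, not threads; the library's scheduler
    // decides how to map them onto its worker thread pool.
    group.run([]{ std::puts("task A"); });
    group.run([]{ std::puts("task B"); });

    group.wait();   // join point: returns once both tasks have finished
    return 0;
}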

 

TBB has been around for a while, the Wikipedia page on it has links to articles about it from 2006, but it’s definitely been keeping up. Current release is conveniently named 2020.0 and it is built very much on modern C++ features.

The code is open source on GitHub and most distros seem to have it pre-packaged, so let’s have not only a look at it, but our first snippets of the new code.

A quick glance at the new toyBrot

So, in the last post, I talked about how raymarching works and posted the distance estimator I use for the Mandelbox. That bit is not really going to change. I’m also not really going to talk about the colouring because the bit that does that is more or less self-explanatory and/or not super relevant.


So all that we’re left with when it comes to the meat of the new implementations is the bit that launches the tasks.

There are two important functions here. The first one is called “trace”. It’s the one that does the raymarching. I’ve adapted it from the syntopia blog; it takes a Screen struct, a Camera and X and Y pixel coordinates. The Screen really represents the Camera’s near plane segment on the view frustum. If you imagine the screen as a “window/portal” into the space where the mandelbox exists, the Screen object has the positions of that window in THAT space. Additionally, it knows how large a pixel is there.

It represents the rectangle marked here as "Near Clipping Plane"

Using that information and knowing where the camera is, the trace function calculates the ray’s direction: a simple straight line through two points (the pixel (X,Y) and the camera’s position). After it’s got that, it does the raymarching business of checking the distance (using the distance estimator function) and “walking” along that line away from the camera, through the pixel and beyond….

Whether it detects a collision or goes into infinity, it returns two pieces of information:

 

The last position of the ray in space and how many steps it took until it got there.

 

Vec4f trace(const Camera& cam,
            const Screen& s,
            size_t x, size_t y,
            const estimatorFunction& f)
{
    float totalDistance = 0.0f;
    unsigned int steps;
    Vec3f pixelPosition = s.topLeft 
                + Vec3f{s.pixelWidth*x, s.pixelHeight * y, 0.f};
    Vec3f rayDir = pixelPosition 
                - cam.pos;
    rayDir.normalise();
    Vec3f p;
    for (steps=0; steps < maxRaySteps; steps++)
    {
        p = cam.pos + (rayDir * totalDistance);
        float distance = f(p);
        totalDistance += distance;
        if (distance < collisionMinDist) break;
    }
    return Vec4f{p,static_cast<float>(steps)};
}

The position is really just for fancy colouring. One of the cool things in raymarching is that just the step count is enough for you to draw a lot. Like so:

The other function that’s important is the traceRegion function, which really just takes care of going through a section of the image and calling trace to calculate it as appropriate. Generally speaking, what’s going to be different is how THIS function works and how it’s called. In OpenCL and SYCL it just figures out what pixel it is from the global launch and calls trace for that pixel, and in the STD implementations it loops through the same strided access according to the number of tasks.

Everything farther down the rabbit hole than THIS function, we generally don’t care about.

// traceRegion.cpp

bool traceRegion(RGBA* data,
                 const Camera& cam,
                 const Screen& scr,
                 const estimatorFunction& f,
                 size_t h0, size_t heightStep,
                 size_t idx)
{
  for(size_t h = h0; h < h0+heightStep; h++)
  {
    if (h >= scr.height)
    {
      return true;
    }
    for(size_t w = 0 + idx; w < scr.width; w+= numTasks )
    {
      data[(h*scr.width)+w]= getColour(trace(cam, scr, w, h, f));
    }
  }
  return false;
}
// traceRegion.cl

kernel void traceRegion( __global float4* data,
                         struct Camera cam,
                         struct Screen scr )
{
    int row = get_global_id (1);
    int col = get_global_id (0);
    int index = ((row*scr.width)+col);
    cam.pos.z = cameraZ;
    scr.topLeft.z = cameraZ + cam.near;
    if (index >= scr.width*scr.height)
    {
        return;
    }
    data[index] = getColour(trace(cam, scr, col, row));
}

Other than this, it’s all fluff and supporting stuff.

FracGen::Generate is going to do the heavy lifting of setting up the parallelisation platform, and the FracGen constructor still does initialisation such as “asking CUDA how many GPUs there are and whether we can have one” and the like. There’s a Window class and a pngWriter to give you a visual display and the ability to save the images as PNG if the optional dependencies of SDL2 and libPNG are found…. But trace and traceRegion, and how they’re called, are where the meat of each implementation lies.

So what's TBB like then?

Again, feel free to open the actual source file in a separate window or a text editor, if you prefer that to following the snippets.

// TBB.cpp

void traceRegion(colourVec& data,
                 const Camera& cam, const Screen& scr,
                 const estimatorFunction& f,
                 const tbb::blocked_range2d<size_t,size_t>& range)
{
  for(size_t h = range.rows().begin();
      h < range.rows().end();
      h++)
  {
    for(size_t w = range.cols().begin();
        w < range.cols().end();
        w++)
    {
      data[(h*scr.width)+w] =
          getColour(trace(cam, scr, w, h, f));
    }
  }
}
bool FracGen::Generate(int width, int height)
{
    bool finishedGeneration = false;
    int heightStep = bench ? height : 10;
    estimatorFunction bulb(bulbDist);
    estimatorFunction box(boxDist);
    estimatorFunction sphere(sphereDist);
    /*
     * Omitting the screen calculation
     */
    Screen s = {...};
    auto tbbTrace = [this, s, f = box]
    (const tbb::blocked_range2d<size_t,size_t>& range)
    {traceRegion(*(this->outBuffer), *(this->cam), s, f, range); };
    tbb::blocked_range2d<size_t,size_t> range (
        lastHeight, 
        std::min( lastHeight+heightStep, 
                  static_cast<size_t>(height)),
        0, 
        width);
    tbb::parallel_for(range, tbbTrace);
    lastHeight+= heightStep;
    if (lastHeight >= static_cast<size_t>(height) )
    {
        lastHeight = 0;
        finishedGeneration = true;
    }
    return finishedGeneration;
}

So, starting from the generate function, TBB actually looks a lot like the std::tasks implementation because, it being task based itself, it ALSO begins by defining your work unit inside a functor. Same as that one, I use a lambda that will call TraceRegion with the parameters we want. All well and good. Except we have our first difference.

If you look at that lambda, you’ll see that it returns nothing and the only parameter it takes is a tbb::blocked_range2d object and suddenly this looks a LOT like HC C++ and SYCL. And just as from that point on, you’d call hc::parallel_for_each or cl::sycl::queue::submit, so too here the way we launch our jobs is by defining that blocked_range2d (for us width and height of the entire thing) and then calling tbb::parallel_for(range, tbbTrace). You give it a range, give it a functor, be free my child. This function returns when all the jobs end.

So with TBB you get something that really only those two offered before it: you get to code proper C++, making heavy use of templates and functional shenanigans and all that good stuff, but with the sort of mindset that’s usually reserved for GPUs, which is more straightforward.

 

When you’re coding for the CPU you’re normally thinking about your threadpool to some extent, and then you have to tell your tasks how to behave according to their index, how long they go for…. Look at that code again and you’ll see I don’t do ANY bounds checking in it, because that happens with the range definition.

Even after the change in algorithm, ToyBrot remains pretty much a “spherical cow in perfect vacuum” as multithreading situations go, but even in these conditions, it’s amazing how simple this is to work with. I also neglected to include them given how simple the code is, but TBB has exceptions and you CAN control your ranges in a much more granular way. So, yay all around. It’s also well documented, though I REALLY hate the claustrophobic design of that page and DID find it clunky to navigate. The info IS there though and it expands on the finer concepts if your project is NOT some spherical cow in perfect vacuum.
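To give an idea of what that more granular control looks like, here’s a minimal sketch (with made-up dimensions and grainsizes, nothing from toyBrot) of constructing a blocked_range2d with explicit grainsizes, which is the main knob you get for deciding how finely parallel_for is allowed to split the work:

// blockedRangeSketch.cpp (illustrative only)

#include <tbb/blocked_range2d.h>
#include <tbb/parallel_for.h>
#include <vector>

int main()
{
    const size_t width = 1024, height = 768;
    std::vector<float> image(width * height, 0.f);

    // The third and sixth arguments are the row and column grainsizes:
    // TBB stops subdividing once a chunk is at most 16x64 pixels.
    tbb::blocked_range2d<size_t, size_t> range(0, height, 16, 0, width, 64);

    tbb::parallel_for(range, [&](const tbb::blocked_range2d<size_t, size_t>& r)
    {
        for (size_t h = r.rows().begin(); h < r.rows().end(); ++h)
        {
            for (size_t w = r.cols().begin(); w < r.cols().end(); ++w)
            {
                image[h * width + w] = static_cast<float>(h + w); // stand-in for trace()
            }
        }
    });
    return 0;
}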

Normally I’d  put the verdict and comparison performance here but, as you’ll see, I can’t really talk about that, especially performance, without talking about ISPC first so on we go. Grab yourself your choice of a coffee or a beer because we’re digging deeper (other drinks not explicitly forbidden)

ISPC: Blue's the compiler for extension sets!

ISPC stands for Intel SPMD Program Compiler and, yes, it IS a compiler. In fact, it “is a compiler for writing SPMD (single program multiple data) programs to run on the CPU. The SPMD programming approach is widely known to graphics and GPGPU programmers; it is used for GPU shaders and CUDA* and OpenCL* kernels, for example. The main idea behind SPMD is that one writes programs as if they were operating on a single data element (a pixel for a pixel shader, for example), but then the underlying hardware and runtime system executes multiple invocations of the program in parallel with different inputs (the values for different pixels, for example).

The main goals behind ispc are to:

  • Build a variant of the C programming language that delivers good performance to performance-oriented programmers who want to run SPMD programs on CPUs.
  • Provide a thin abstraction layer between the programmer and the hardware–in particular, to follow the lesson from C for serial programs of having an execution and data model where the programmer can cleanly reason about the mapping of their source program to compiled assembly language and the underlying hardware.
  • Harness the computational power of the Single Program, Multiple Data (SIMD) vector units without the extremely low-programmer-productivity activity of directly writing intrinsics.
  • Explore opportunities from close-coupling between C/C++ application code and SPMD ispc code running on the same processor–lightweight function calls between the two languages, sharing data directly via pointers without copying or reformatting, etc.” 

 

and… huh… this is immediately interesting in the context of something like toyBrot.

It’s also already a mouthful and, if you’re not from the actual Computing Science side of things, this might already sound a bit scary, so let’s dial this back and I’ll give the fifth grader walkthrough of the general idea here. Let’s start from the acronym, SPMD. This is derived from something known as Flynn’s Taxonomy, which is a way of classifying computer architectures by how they work. It’s divided into two parts: how the instructions work and how they access data.
For example:


The most basic software is SISD, Single Instruction, Single Data. This means it does one thing at a time and each thing is done to one piece of data.

 

But say you want to do one operation on a bunch of values, that’s SIMD, Single Instruction, Multiple Data. Because you’re going to be doing the same thing to a bunch of different data pieces. You may think that this represents what happens in toyBrot but that is not quite the case because of the *I* bit. If you iterate through an array, it’s still a queue. The idea with SIMD is you give one command and everything happens at once, the CPU clicks a switch and does the entire batch.

 

Instead, toyBrot (and most multithreaded applications) map to the SPMD model, which is Single PROGRAM, Multiple Data. We have the same set of instructions and we divide it among computation units, cores and threads and tasks. And each of them is going to do its own thing independently and once everyone’s done you get the result.

What makes ISPC interesting is that third bullet point: “Harness the computational power of the Single Program, Multiple Data (SIMD) vector units without the extremely low-programmer-productivity activity of directly writing intrinsics”

 

THAT’s something we haven’t done yet. THIS is the new territory

Instruction sets and extensions

This is something I touched briefly when I was first introducing heterogeneous computing, back in chapter 3. Generally speaking, whatever it is you’re coding, your hardware doesn’t understand that. My 1920X doesn’t understand C++, my Titan doesn’t understand CUDA and my Radeon 7 doesn’t understand GLSL. This is why compilers exist, they translate all that code into machine instructions, also known as assembly code. Even when you’re running a graphics program and you send your shader “straight to the GPU”, yeah, that doesn’t happen, the driver compiles it on the fly. And that’s WHY GPUs have drivers but CPUs don’t really because things need to run immediately on them. So CPUs implement instruction sets.

 

These are more or less what they sound like. They’re an API for your CPU. The most common ones right now are x86_64, which runs on most servers and desktop machines and the instruction sets by ARM, which run on CPUs based on that architecture (such as the Nintendo Switch and pretty much every phone). Though they’re ubiquitous, occasionally you still find different sets, famously the Playstation 3 had a quite unusual processor

 

The thing that’s interesting for us is that instruction sets ALSO have extensions, which are optional. If you’re relatively ancient, you might remember when Pentium processors came out with MMX and later when AMD had 3DNow! on their CPUs. These were extension sets, the CPUs had additional functionality on top of x86. Today the ones that are most present and relevant are called SSE and AVX

 

Streaming SIMD Extensions and Advanced Vector Extensions are extensions to x86 with SIMD instructions. What this means is that a CPU that implements either of these can actually run instructions on multiple pieces of data at once. It’s like if for those bits, a CPU core can count, in a way, as multiple cores.
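As a rough illustration of what “directly writing intrinsics” means (a hand-written sketch, not anything from toyBrot, and it needs to be built with AVX enabled, e.g. -mavx), this is a single AVX instruction doing eight float additions at once:

// avxSketch.cpp (illustrative only)

#include <immintrin.h>

// Adds two arrays of 8 floats using one AVX instruction per step,
// instead of looping over the elements one at a time.
void add8(const float* a, const float* b, float* out)
{
    __m256 va  = _mm256_loadu_ps(a);    // load 8 floats from a
    __m256 vb  = _mm256_loadu_ps(b);    // load 8 floats from b
    __m256 sum = _mm256_add_ps(va, vb); // one instruction, 8 additions
    _mm256_storeu_ps(out, sum);         // store 8 results
}

This is exactly the sort of code ISPC (or an auto-vectorising compiler) generates for you so you don’t have to write it by hand.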


Most CPUs these days implement SSE4 and AVX2 so, in theory, they’re just waiting there to be used. And this is what ISPC is built to do.

 

 

 

Joining the gang

Though very useful, ISPC is much more limited than the likes of Clang and GCC. ISPC does not speak C++ at all, with the exception that it knows what references are. Other than that, all your code has to be “good old” C, which, if you’re like me, is annoying as heck, but definitely far from the worst you can think of in terms of obscure programming.


The way you USE ISPC is the following: you break your code apart and separate the part you want it to compile. This is the bit that’s going to run vectorised, and this is the bit that has to be C. From that piece of code, ISPC outputs an object file that has regular C linkage, so you just link it into your regular software and your linker should be none the wiser. It’ll even output an auto-generated header with declarations for the functions you export from the ISPC code so that your regular compiler can know what it’s supposed to tell the linker to find. The rest of your code, you do you, it doesn’t matter.
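As a rough sketch of that workflow (hypothetical file and function names, and do double-check the flags against your ISPC version’s docs), the host side ends up looking like plain C++ calling into the generated header:

// main.cpp (illustrative only)
//
// Assuming a kernel.ispc that exports something like:
//   export uniform float sumArray(uniform float vals[], uniform int count);
//
// Built along these lines:
//   ispc kernel.ispc -o kernel.o -h kernel_ispc.h --target=avx2-i32x8
//   clang++ main.cpp kernel.o -o demo

#include <vector>
#include "kernel_ispc.h" // auto-generated; exported functions land in namespace ispc

int main()
{
    std::vector<float> vals(1024, 1.0f);
    // Regular C linkage under the hood: the linker just sees another function.
    float total = ispc::sumArray(vals.data(), static_cast<int>(vals.size()));
    return total > 0.f ? 0 : 1;
}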

 

This also means that ISPC doesn’t really do the multithreading bit. As they often put it, the code you write is intrinsically parallel, but it’s running on just the one core.

So knowing that, how does our code look?

// fracGenISPC.cpp

bool FracGen::Generate(int width, int height)
{
  bool finishedGeneration = false;
  int heightStep = bench ? height : 20;
  Screen s{...};
  std::vector<std::future<bool> > results(numTasks);
  auto traceTask = 
    [this, &s, heightStep, numTasks = numTasks,h0 = lastHeight]
    (size_t idx)
    {
      return ispc::traceRegion(
            reinterpret_cast<float(*)[4]>(outBuffer->data()),
            reinterpret_cast<float*>(&(cam->pos)),
            s.width, s.height,
            s.pixelWidth, s.pixelHeight,
            reinterpret_cast<float*>(&(s.topLeft)),
            h0, heightStep, idx, numTasks);
    };
  for(size_t i = 0; i < numTasks; i++)
  {
    results[i] = std::async(std::launch::async, traceTask, i);
  }
  lastHeight+= heightStep;
  for(auto& task: results)
  {
    if(task.get())
    {
      lastHeight = 0;
      finishedGeneration = true;
    }
  }
  return finishedGeneration;
}

It looks a LOT like the std::tasks code, because we’re once again using std::async to split the program among multiple cores. The difference is that now the actual body of traceRegion comes from somewhere else entirely. ISPC, kind of like MPI, is designed to be used alongside other technologies. A lot of the scientific computing people are well acquainted with both of these and use them in conjunction with the likes of TBB and OpenMP. In that kind of scenario, MPI spreads you across a cluster, OpenMP spreads you across CPUs and ISPC gets you the most out of each core. Or, at the very least, such is the plan, and these days often alongside CUDA or OpenCL driving GPUs or similar.

So, where’s that definition then?

// fracGen.ispc

static inline float sqMod(float v[3])
{
    return v[0]*v[0] + v[1]*v[1] + v[2]*v[2];
}
static inline float length(float v[3])
{
    return sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
}
static inline void normalise(float v[3])
{
    float l = length(v);
    if(l == 0)
    {
        return;
    }
    v[0] /= l;
    v[1] /= l;
    v[2] /= l;
}
static inline float fmod(const float a, const float b)
{
    return a - (b*floor(a/b));
}
static void trace(  uniform float camPos[3],
                    uniform float topLeft[3],
                    uniform float pixWidth,
                    uniform float pixHeight,
                    int x, int y, float out[4])
{
    float totalDistance = 0.0f;
    unsigned int steps = 0;
    float pixelPosition[3] = { topLeft[0] + (pixWidth * (float)x),
                               topLeft[1] + (pixHeight * (float)y),
                               topLeft[2]  };
    float rayDir[3] = { pixelPosition[0] - camPos[0],
                        pixelPosition[1] - camPos[1],
                        pixelPosition[2] - camPos[2]};
    normalise(rayDir);
    varying float p[3] = {0,0,0};
    for (steps=0; steps < maxRaySteps; steps++)
    {
        p[0] = camPos[0] + (rayDir[0] * totalDistance);
        p[1] = camPos[1] + (rayDir[1] * totalDistance);
        p[2] = camPos[2] + (rayDir[2] * totalDistance);
        float distance = boxDist(p);
        totalDistance += distance;
        if (distance < collisionMinDist)
        {
            break;
        }
    }
    out[0] = p[0];
    out[1] = p[1];
    out[2] = p[2];
    out[3] = (float)(steps);
}
export uniform bool traceRegion(uniform float data[][4],
                                uniform float camPos[3],
                                uniform int width,
                                uniform int height,
                                uniform float pixWidth,
                                uniform float pixHeight,
                                uniform float topLeft[3],
                                uniform int h0,
                                uniform int heightStep,
                                uniform int idx,
                                uniform int numTasks)
{
    uniform bool done = false;
    uniform int taskWidth = ceil((float)width/(float)numTasks);
    uniform int wStart = taskWidth * idx;
    uniform int wEnd = min(wStart + taskWidth, width);
    uniform int hEnd = min(h0 + heightStep , height);
    foreach(h = h0 ... hEnd, w = wStart ... wEnd)
    {
        int pix = ((h*width)+w);
        varying float traceOut[4] = {0,0,0,0};
        varying float colour[4] = {0,0,0,0};
        trace(camPos, topLeft, pixWidth, pixHeight, w, h, traceOut);
        getColour(traceOut, colour);
        data[pix][0] = colour[0];
        data[pix][1] = colour[1];
        data[pix][2] = colour[2];
        data[pix][3] = colour[3];
    }
    return hEnd == height;
}

Well… it’s initially a bit more hostile on account of it being C. It doesn’t actually support the C standard library, though, and though its own math library has some pretty useful functions, it’s not a lot, as evidenced by me having to implement not only a few vector functions but actual fmod.

 

Moving down, C semantics aside, there’s not a lot that’s interesting apart from two things:

First, there’s a foreach() loop with a two-dimensional range that’s definitely not C.

Second, a lot of variables are tagged with either uniform or varying, which is also new.

This is where ISPC’s programming model starts.

 

That foreach loop will spawn multiple concurrent instances of the code there. But, remember, all of this is happening inside a single core. They’ll use the SIMD instructions in the CPU and they all run together, as in “at the same time doing everything”. Each of these groups of programs is called a gang. So when we first step into the loop for the first iteration, we spawn an entire gang of “internal processes” that are running together. They share the same execution pointer which means that each instruction happens literally at the same time. Each of these is called a program. For my 1920X, for example, the default is to use AVX2 and spawn gangs of 8 programs each. 

 

That can be a problem though. They’re going to run trace and each ray might collide at a different time, or not at all. The answer to this problem is twofold.

First, there’s an execution mask. This is essentially a bitmask that individual programs in the gang can set. Say one ray collided immediately. Its execution mask gets tweaked and when the loop runs again, that individual program gets ignored. But it still needs to wait until everyone is done, because the entire gang executes together.

 

If you read my chapter on MPI it’s somewhat of a similar mindset where your code looks like it’s describing one thing but underneath, when it actually runs, you have different instances doing different things at the same point.

The second thing is the variable qualifiers. Similar to shaders, a uniform variable is always the same value for everyone in the gang. Which means that if you’re, say, running a conditional against it, the entire gang could run the conditional in just the one check. But, like we said, not all data is the same all of the time. So data can be varying, which means each program in the gang has its own copy and they’re not necessarily equal, so we need to look at all of them. Uniform is the standard for literal variables, and everything the entry function gets as a parameter needs to be uniform (after all, from the outside, this is like just the one regular function call that will do one thing).

 

Everything else you declare inside the code is varying. And underneath they’re really a vector of [ProgramCount] variables, which makes sense when you twist your brain around it. When I say “here’s this distance variable that’s different for each program in the gang”, really each of those programs needs a different one. So when you think they’re reading “distance”, they’re actually reading “distance[ProgramNumber]”.

 

Because of this, the one thing that’s not always varying is “what pointers point to”. When you write float*, that means “a varying pointer to a uniform float”, because each of those programs in the gang has its own pointer (to different addresses), but each of those pointers points to just the one float, not to a float for every program in the gang, which would be a bit of a special case.

 

This means it’s usually redundant to explicitly mark things as varying. But I do it every once in a while, more as a reminder to myself than anything else.
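If it helps to picture it, here’s a purely conceptual C++ analogy (my own sketch of the mental model, not what ISPC actually generates) of what a gang of 8 programs with an execution mask, a varying value and a uniform value boils down to:

// gangSketch.cpp (conceptual analogy only)

constexpr int gangSize = 8;                       // e.g. an avx2-i32x8 target

struct VaryingFloat { float lane[gangSize]; };    // a varying float: one value per program

void gangStep(VaryingFloat& distance,             // varying: each program has its own
              float collisionMinDist,             // uniform: one value shared by the gang
              bool executionMask[gangSize])
{
    // The whole gang shares a single instruction pointer, so every lane "runs"
    // this code; lanes whose mask is off just have their results discarded.
    for (int program = 0; program < gangSize; ++program)
    {
        if (!executionMask[program]) { continue; }

        distance.lane[program] *= 0.5f;            // reading "distance" really means distance[program]

        if (distance.lane[program] < collisionMinDist)
        {
            executionMask[program] = false;        // this ray is done, but it still waits for the rest of the gang
        }
    }
}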

If you’re now having to stop and catch a breather as you try to visualise all of this, welcome to the club. Massive shoutout to ISPC nerd Jefferson Amstutz (@jeffamstutz) who found me running around in circles on Twitter and helped me understand this stuff. He works on OSPRay, which is pretty badass, and they have a heck of a lot of ISPC pulling some weight in that project. They also distribute along clusters using MPI and along processors using TBB so, there’s that =)

 

Now, this information IS available on ISPC’s user’s guide but if you are not already familiar with the overall concepts and ideas, it CAN feel a bit obscure at times. So do take it easy when going through it. Other than that, I actually like the documentation for ISPC better than TBB’s. Ironically, even though it’s just the one blob, that makes it very easy to navigate, which is probably the worst issue with TBB, where everything is hidden behind 3 layers of links that you can’t really easily jump through. 

It does suffer on some points, for example, when it comes to the math functions I really just wish it was something more akin to Qt’s or even CUDA’s where you get a list of function signatures and you can click them for detailed information… It’s just less immediately evident here, but generally pretty good stuff. Just be aware not all of it is up-to-date. They have a whole section on “Experimental support for PTX” and when I actually opened the repository view, I could see commits with “removed PTX support” so… there’s that

Is all this hassle even worth it?

You tell me

Those numbers are not a fluke and not some weird bug in the code. They are very much for real. So real in fact that they’ve changed how I build everything else.

 

 

Remember when I was talking about instruction set extensions and all that? Well, most compilers these days are very conservative and will not use them… but they can… they know them. Clang, for example, not only knows about AVX and SSE, it has presets for the architecture my CPU is based on. I can call it with -march=znver1 and it’ll do its best to optimise whatever it’s building for this line of processors, using whatever extensions and tricks it can, rather than just sticking to “generic x86-64”

 

 

So, is that any help?

The Verdict: Blue is the team and all results are too

That’s a significant improvement, but still, ISPC is WAY ahead of anyone else. This completely blew my mind. An improvement this massive is insane. And I’ve done essentially no work in ensuring that work is well distributed. With the STL implementations, for example, I know that the strided access is the way to go and I’ve also experimented with the oversubscription… I haven’t done any of that experimenting with ISPC… It’s actually based on the old way where I take a strip of rows and slice a bunch of columns to each instance. Then ISPC breaks those pixels down along the gang, but more or less in whatever automagical way it wants to. I also realised that altering the number of tasks can change the performance significantly. 20 tasks per thread, as I use for the STL implementations, is actually quite a bit slower, possibly because of that memory access. I more or less empirically found 10 to be a good amount and left it there for the moment, so there’s definitely a lot to be explored here, maybe using the default deferred|async, maybe this is when a more robust system like, say, TBB will show its true power.

 

Again, toyBrot is the spherical cow in perfect vacuum of parallelisation. Each work unit is pretty much 100% independent, there’s practically zero memory access… as long as you divide your work smartly, it pretty much scales perfectly, but this is the one time where I’m not sure if I’m dividing it well enough.

 

Also for consideration, if you read my previous entries you may recall I was somewhat disappointed in OpenMP’s performance and look at that, it’s actually right there with the pack now. Seems I was doing it dirty. I’ll explain why in an appendix further down.

 

So what do these numbers say?

ISPC (Host)              12.43s   [100%]
STD::THREADS (znver1)    17.15s   [+37%]
STD::TASKS (znver1)      17.16s   [+38%]
TBB (znver1)             17.43s   [+40%]
OpenMP (znver1)          17.55s   [+41%]
STD::THREADS             20.55s   [+65%]
TBB                      20.72s   [+66%]
STD::TASKS               21.10s   [+69%]
OpenMP                   21.11s   [+69%]
OpenCL (pocl)            22.34s   [+79%]

If you’re wondering why ISPC and OpenCL don’t get the znver1 treatment: for ISPC, all the bits that do the actual work are already optimised by ISPC, the compiler, itself. I mean, the software builds, but nothing that makes any difference is touched by Clang. The same is true with OpenCL. In that case, the compute program is compiled on the fly by the OpenCL implementation (here, pocl), which is what would be responsible for vectorising the code and making use of these extensions. But from our perspective (as in, I’m not building pocl itself), this is all on the fly and transparent, so rather than having pretty much the same number repeated, I just omitted those two.

Even giving the rest of the gang the best chance Clang can provide them, second place is over 6 seconds behind ISPC; it’s close to 40% slower. If you compare to the old unoptimised numbers, it gets simply ridiculous. But even if you don’t want to go through having to deal with ISPC, just enabling these optimisations is something worth considering if your code at all resembles a spherical cow. Looking at just the improvement the old implementations get from simply turning -march=znver1 on:

That’s about 20% improvement there just by setting a compile flag, which is nothing to scoff at. If you know what you’re targeting, it’s worth doing. And even if you’re working on a larger project for a wider range of targets, there’s a lot you can do here. It’s trivial, for example, to have CMake spit out several binaries and then, at runtime, you can either dynamically load the libraries or divert into the versions of the functions you need (remember, this is all CPU, so it’s trivial to wrap them individually in lambdas or std::function and pick the one you want on the fly), or even just have a launch script that picks the optimised version if possible. For an example of me doing some of that, you can check the CMakeLists.txt for toyBrot’s ISPC project, which DOES give you the option to build binaries for a bunch of different ISPC targets at once.
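As a rough sketch of that “pick the one you want on the fly” idea (my own illustration using GCC/Clang’s __builtin_cpu_supports, not something from toyBrot), you could build the same function in differently-optimised translation units or libraries and choose between them at runtime:

// dispatchSketch.cpp (illustrative only)

#include <cstdio>
#include <functional>

// Stand-ins: in a real setup each of these would live in its own translation
// unit or shared library, compiled with different flags (say, one generic and
// one with -march=znver1 or -mavx2).
void generateGeneric() { std::puts("generic path"); }
void generateAVX2()    { std::puts("AVX2 path"); }

int main()
{
    // GCC/Clang builtin that checks the running CPU's feature flags at runtime.
    std::function<void()> generate =
        __builtin_cpu_supports("avx2") ? generateAVX2 : generateGeneric;

    generate();
    return 0;
}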

 

Though I went straight to specifying my arch, you can specify extensions to enable instead. And then it’s a matter of checking if the host supports it.

Final thoughts and considerations: I'm quite excited for things from Team Blue

I hope by now it’s evident that the praise I heaped on these two is well deserved. Truth be told, I had a rough time with ISPC at first, but a lot of that was my own doing. I actually had a typo in my length function and it took me forever to catch it. When you’re entering an environment where there’s a lot that’s new and you’re out of your comfort zone (hi, C), it’s easy to get flustered and focus too much on those aspects, sometimes to the detriment of the basics. So definitely take it easy if you can, it could save you a lot of headache from really silly things.

 

 

ISPC can be some work, but (especially if you’re used to the C way of doing things) it’s actually quite straightforward to use. It doesn’t, itself, get any CMake integration, and for me Qt Creator recognising the source files as C code was kinda spotty. That said, it’s pretty easy to hook it up for the files you need. CMake knows that when you give it an object file it just needs linking, so the CMakeLists for ISPC in toyBrot is nothing too weird. Feel free to check it out if you’re ever in doubt regarding this workflow.

 

 

TBB is straight up a great option. It’s not the top performer, but in both cases where it’s behind, it’s very close by. In my opinion, not enough of a difference to discard it; at this point, I’d say none of the pure implementations are worth discarding (and OpenCL can be deployed to GPUs, so it’s way safe in that aspect). I’ve always heard great praise for TBB, especially from people who do things more complicated than this. And honestly, it’s a joy to code in. In my opinion, task-based parallelisation is the way to go. The one minor thing to look out for is CMake integration. At home, in Arch, I just ran find_package and I got it. But the package for OpenSuSE 15.0, at work, didn’t have the CMake config files, so maybe they’re new? Be prepared to provide your own find script if you’re targeting an older system such as Debian or CentOS, I guess.

 

 

Sometimes raw threads are great and they make sense, but even then, I see less and less appeal in going for those, especially with more robust task-based systems such as TBB around. In toyBrot, the program will launch a std::thread to capture events for the SDL window if it’s there. On Warp Drive, my toy game engine, it’s a std::async task that runs. Which is right, who’s to say? =P

 

 

Running these and getting these numbers, especially when ISPC comes into the picture got me REALLY curious about the state and the future of computing. I’ve got a colleague that often mentions he feels that with the diminishing cost and ramping efficiency of regular CPU cores, for a lot of tasks people might be better off just optimising for and buying high end CPUs instead of going the GPU route.

 

 

There’s some validity and nuance to this argument in that debugging on the CPU is MUCH easier, so the dev time is generally much more productive. This is true even with ISPC. If you look closely at the CMakeLists, you’ll see I pass it -g for debug builds and it generates regular debug symbols and you can step through your code as usual, just like any regular C code. Your varying variables will actually show up correctly in the debugger as the arrays of variables they are, the same size as the gang. It’s fantastic and helped me a lot! You also drop a lot of the worry with drivers and such because it’s just your CPU. And if your calculation is double precision you need the really expensive GPUs. The easiest way to make my Titan, which is supposed to be sort of a “prosumer” type of product, look like an overpriced paperweight, is to change my stuff to doubles. I think with ISPC now the 1920X might actually just crush CUDA. And, the TitanX is an old GPU, but the 1920X is slower AND has fewer cores than current regular gamer desktop CPUs. (Radeon VII, on account of it being an actual monster of a card, has pretty good double throughput). Something like a 3990X can very likely spit one of those images out in less than 5 seconds (I want to try one so bad now, you have no idea), even the 3970X has almost triple the cores AND more raw speed AND a better architecture overall compared to my overspecced 1920X. AND you can have 128 gigs of ECC RAM attached to it. And GDB through your thing, business as usual.

 

 

When does the data throughput get outpriced by the development throughput? It’s certainly an interesting question and, again, in a lot of systems, one need not negate the other one. There’s a LOT you can throw concurrently into the pot. Not a lot of reason you can’t have your CPU crunching numbers with TBB and ISPC while your GPU is crunching ANOTHER set of numbers with OpenCL or SYCL. With so many players doing so many things, I’m quite curious to see how the next few years shape up in this space.

Before the final final bit, there are a couple of appendices to this one, things I want to talk about but that don’t quite fit the structure (I know right, this is already heavy). Bear with me if you feel so inclined. No need to get a second beer or coffee but as good an excuse as anything to do so

Appendix 0x0: Turns out I was doing OpenMP dirty

ToyBrot is a now completely unrecognisable form of code that I’ve carried around for years. And though it’s much cleaner now, it still carries some of its legacy (for example, why it uses SDL2 for its display and not something like Qt (huge dependency go)). Among its old legacies there was a detail that actually slowed down most CPU implementations.

 

Since CPUs can take some time to generate the image, if you’re just playing around and not actually benchmarking, unlike GPU implementations, all CPU implementations fill their texture incrementally. They do loops of 10 lines of rendering, return, it gets presented by SDL2, rinse and repeat until they’re done.

 

The way this used to be controlled was through a static atomic&lt;int&gt; in the FracGen::Generate function. This was a problem for a couple of reasons. Turns out in some cases it wasn’t getting reset properly, but also, it was an atomic that some implementations had to read constantly. And the issue there is that, it being atomic, the threads would queue up to read it and everything stops, because atomic means we’re going to have order in this house.

 

This, as it turns out, is unnecessary because I ensure there’s never a race on that variable. Only the main program thread writes to it and it always waits until all current work tasks have finished to do so. It turns out that, for whatever reason, OpenMP was having a particularly bad time with that. Once the atomic was gone it went from “pretty surprisingly slow” to pretty much the same as the other implementations.
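In rough terms, the pattern that makes the atomic unnecessary looks like this (a simplified, self-contained sketch of the idea, not the actual toyBrot code):

// chunkProgressSketch.cpp (illustrative only)

#include <future>
#include <vector>
#include <cstdio>

int main()
{
    const int height = 100, heightStep = 10, numTasks = 4;
    int lastHeight = 0;                       // plain int, no std::atomic needed

    while (lastHeight < height)
    {
        std::vector<std::future<void>> tasks;
        for (int i = 0; i < numTasks; ++i)
        {
            // Each task captures the current chunk by value; nobody touches
            // the shared counter while work is in flight.
            tasks.push_back(std::async(std::launch::async,
                [h0 = lastHeight, heightStep, i]
                { std::printf("task %d renders rows %d..%d\n", i, h0, h0 + heightStep - 1); }));
        }
        for (auto& t : tasks) { t.get(); }    // wait for the whole chunk to finish

        lastHeight += heightStep;             // only written here, after the join
    }
    return 0;
}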

 

There is some more examination that could be done of OpenMP and I’m considering doing a revisit. It has more elaborate ways of dividing and managing work which could be interesting to examine. 

 

Additionally, I’ve always listed it as a “CPU thing” but that’s not necessarily true. OpenMP is a standard and it CAN be implemented by compilers that target GPUs, FPGAs and such. In fact, AMD’s new maybe hcc replacement, thing, whatever, something on ROCm is just that. It’s called AOMP and it allegedly CAN compile OpenMP to both AMD and Nvidia GPUs, so I definitely want to give that a go. Not sure when I’m going to do it but it’s certainly an idea that’s been on my mind.

Appendix 0x1: Know your target

Astute observers and people in the know may have noticed that as much as I talked about instruction set extensions and whatnot I never mentioned what exactly I told ISPC to do. The astute observer part of that group may have noticed that in the graph I have “ISPC (host)” as the identifier. 

 

That is the default if you don’t tell ISPC to do anything specific. If you DO, ISPC can target specific CPU lines (mostly Intel, with the notable exception of the PS4 CPU, heeey red team go), but you can also select the instruction set, and a LOT of modern CPUs support a lot of those. If you’re worried AVX2 might exclude some of your targets (we had that exact problem at work at one point with Amazon cloud), maybe you can fall back to SSE, which is older and more likely to be supported… But… how much better is a modern standard anyway?

Well, fear not, I spent my entire morning benchmarking so I could answer this question!

If you run ispc --help to know what targets you have available, you’re going to get a list like this:

   [--target=<t>] Select target ISA and width. 
<t> ={host, sse2-i32x4, sse2-i32x8, sse4-i32x4, sse4-i32x8, sse4-i16x8, sse4-i8x16, avx1-i32x4,
avx1-i32x8, avx1-i32x16, avx1-i64x4, avx2-i32x4, avx2-i32x8, avx2-i32x16, avx2-i64x4, avx512knl-i32x16,
avx512skx-i32x16, avx512skx-i32x8, generic-x1, generic-x4, generic-x8, generic-x16, generic-x32,
generic-x64, *-generic-x16, neon-i8x16, neon-i16x8, neon-i32x4, neon-i32x8}

“host” is the same as not passing, it’ll auto-detect.  The “generic” targets can actually output raw source files that you’d give to your regular C/C++ compiler. They’re not really human readable as they’re full of assembly, and I must confess I didn’t really test that (maybe it’s interesting but it looked like a hassle I had little reason to deal with). 

 

For the other implementations they are in the following format:
<EXTENSION>-i<WIDTH>x<SIZE>

The extension represents the extension set you want to target. The size is the gang size of your generated program. And the width is called a “mask size” and you want to use whatever’s available that best matches your most common data type. So, if you’re me and using floats, you want 32. Gang size is dependent on the implementation details so always check --help to see what’s on your version of ISPC.

 

As for gang size, you’d think the more the better but it’s not necessarily the case. Consider the following set of blue bars

See, “host” on a gen 1 Threadripper means avx2-i32x8. But that was actually not my best performer. That was avx1-i32x4, followed by avx2 on the same configuration. SSE4 has pretty good performance for being “the old thing” so it might be worth looking at that.

 

The moral of the story here really is that when you’re dealing with things that are this low level, the specifics of both your hardware and your code start to make a difference. Using a 64-bit mask size had an impact on performance as it’s inadequate for the calculation. If you’re going down this route, it might be a good idea to make sure you can benchmark your code before it goes into production. Before you push 20 gigs of data through it, take a hundred megs and give it to a few different specialisations, see if your assumption is correct.

Also be aware that when you use something like this, you ARE narrowing your list of deployable targets. If you build for AVX2, your target NEEDS to have AVX2 on their CPU.

Otherwise it WILL unceremoniously crash

Finally, not all options are born equal. Looking in the massive chart at avx1-i32x16 and avx2-i32x16, you can see that they build and they run, but the 1920X did not have a good time running those.

Appendix 0x2: A little lie and a chink in ISPC's armour

So the little lie is that ISPC is not ENTIRELY divorced from the usual multithreading. It actually has a system for that using its launch and sync statements. It also has a lot of fine print there. The trick here is that ISPC CAN launch and join tasks…. if you teach it how.

 

If you do want to have all your job managed through it, what you need to do is implement three hooks that it will call when doing the thing.

void *ISPCAlloc(void **handlePtr, int64_t size, int32_t alignment);
void ISPCLaunch(void **handlePtr, void *f, void *data, int count0, int count1, int count2);
void ISPCSync(void *handle);

I’m not even going to beat around the bush here. I looked at these signatures and got immediately pretty “yeesh” about it. Then I opened the example they supply for a glance.

 

It has 1200 lines of code

Yeeeaah, not too many were the situations wherein I noped out as quickly as I did from playing with that. That said, maaaaaybe I could copy their file over and play with it. But as of right now I don’t really understand why I would, say, write these hooks so that, in a generic way, ISPC could then create TBB tasks and fire them up, instead of… writing the TBB tasks and firing them up myself, which sounds so much easier.

So there was my little lie. ISPC DOES do multithreading, kind of, in a fashion, if you ignore the fact that it doesn’t really do it but instead provides you with a way to hook a threading system up to it.

The chink in the armour came from the benchmarks on different targets from ISPC.

 

I’ve mentioned that my code is all floats and that because of that, the best option for me is avx2-i32x4 and that avx2-i64x4 loses me performance as shown in the graph and substantiated by many numbers on an OpenOffice Calc spreadsheet

 

So the immediate question becomes “So if I use doubles, it’s the reverse, right?”. And the answer to that IS yes, but there was something else that got me confused and disappointed, and randomly crying out for ISPC nerds on Twitter and Mastodon didn’t shed any light on it, nor did the googling.

So, the assumption on the AVX mask size is correct.  Switching everything to doubles, i64 handily beats i32. Manual spoke the truth, confirmed. Good guy, Manuel, read him always.


But in complete isolation these numbers aren’t super meaningful, so I converted my base implementation to doubles as well, so we can see how ISPC compares aaand it’s actually worse. The best scenario, i64x4, was pretty much tied with standalone std::async with no -march optimisation.

 

I’m very happy I switched to floats way back because if I had gone through the trouble I had with ISPC and then it gave me NO performance increase I would be tremendously disappointed (even though, again, I brought most of the pain on myself). Just think of the AMD fanboy confirmation bias alone.

As I mentioned, I don’t know what this is about. Is this a Zen thing? Is this just ISPC? Is there some secret I’m missing? I DID try --math-lib=fast and --opt=fast-math to try and bring it back, but the difference was pretty minimal, which makes sense since I use the std math library VERY sparsely in the code.

So for now this is a bit of a mystery. I’ll try and check at work and see if this IS a Zen thing, that’d be a fanboy bummer.

Appendix 0x3: Blue is this Haswell, it's sitting inside!

That’s right, I had one more!

Right, so the day after I wrote this, I actually got the opportunity to run a bunch of ISPC tests on an i7-5820K. Being a Haswell-E, it’s quite an old CPU by now, but still, the numbers I got from it brought me a plethora of interesting information.

 

The 5820K is a 6 core HEDT entry from Intel. It has the same extension sets as the 1920X, both have SMT/Hyper-Threading and their base clocks are not super different, 3.6GHz for the 1920X and 3.3GHz for the 5820K. So, on paper, ignoring automatic overclocking/boosting and architecture differences, the 1920X should be 2.2 times “more processor” than the 5820K. With that in mind, let’s start looking at some numbers:

At a glance, the overall shape of the graph looks similar. There are a couple of outliers but most bars are the same size, except the two rightmost. The difference here is that the two rightmost bars on the 1920X graph were AVX 1 and 2 i32x16, but this time they’re the std::async numbers. See, with the 1920X we have a whole chart of other implementations, but for the 5820K, all information is new, so I put those two in for a baseline. And before we even have a chance of looking at the other bars:

Observation #1 -> What's in an architecture

See, looking at those numbers I just gave on the two CPUs, one could assume the 1920X to be about twice as good as the 5820K, but that’s not at all true here. The regular, generic, unoptimised runtime for std::async on the 1920X is about a third of the time for the 5820K, instead of “slightly better than half”. There are several reasons for this, two of which are that the 1920X boosts to a much higher clock and that, even at the same speed, 4GHz on a Zen is not the same thing as 4GHz on a Haswell, and neither is the same thing as, say, 4GHz on a Bulldozer. There’s a lot going on under the hood in these cases, and I just find it really interesting to see how things come along.

Observation #2 -> All or nothing

So, the reason I have TWO baseline bars is the same as with the 1920X. It’s only fair I give the 5820K the best chance on the regular compiler. So, this time, I gave it -march=haswell and fired it up, but unlike with my own regular CPU, the difference here was minimal. After I got such massive improvements on the 1920X, this IS a bit of a bummer. There’s also a caveat that the Clang used for the 5820K is an older version, I believe 7, though I failed to mark that, as compared to Clang 9, which is current on Arch Linux. But this was still surprising and disappointing. Reminder that your mileage may vary with compiler optimisation.

Observation #3 -> "AMD's AVX performance is much better now"

See, I feel like I hear this phrase a lot every time a new iteration of Zen comes up, and it really implies that it wasn’t too great, and maybe still isn’t super. And looking at the results I got from ISPC on the 1920X, it’s a bit like “I mean… yeah, this looks amazing”. But then we look at the best result I got for the 5820K, AVX2-i32x8, the regular host. And THIS time…. this time it IS practically a third of the runtime. In fact, it is SO much faster that the best case scenario for the 5820K matches the unoptimised baseline for the 1920X. This is absolutely bonkers. I look at this chart and no wonder people keep bringing this up, because that result right there is a bar set massively high.

 

Also worthy of note is that the x16 results here, while still not the best case scenario, are actually pretty good, as opposed to on the 1920X, where they’re just much, much worse than not using AVX at all.

Observation #4 -> Adding insult to injury: "What about doubles, then?"

Well, I took some numbers on that as well

So they start the same when you’re using the wrong mask. Worse than not using AVX at all. But once you actually DO use the correct mask, AVX DOES put in the work and you can get a LOT of benefit from it compared to the regular std::async code. This really is a twist of the knife for the last point about AMD still lagging behind on AVX performance because, if you still remember that graph, using ISPC was worse than the baseline in ALL these cases.

 

But it DOES twist the knife on Haswell’s belly too because this time, turning on the optimisation from clang actually made it run slightly worse. And, once more… I got nothing, this is just weird to me. This chart is playing both sides, must have been bought by Qualcomm or something.

 

Keep in mind that, as much as I do try to keep the situation more or less controlled, take the average of multiple runs (normally 10, 5 for all of the 5820K datasets) and all, this is far from being “scientific”. Again, different systems, different compiler versions (not for ISPC though), running on different distros, sample size of 1 CPU on each side, etc… etc… etc…

 

So take all this data with the grain of salt it warrants and use that grain of salt as context for all of these conclusions. To me the main takeaway from [all of the stuff] is that when it comes to optimising at this level, it’s very important that you examine things for your specific situation and confirm the assumptions you’re making before committing to them. It’s easy enough to, say, just assume “x16 means double the data, much faster” and never realise you’re missing out on 15% better throughput on your Haswell because you didn’t check before your project was a monster running stuff in the wild every day.

So, what's next?

Despite the fear that I’ll be assassinated for fanboy treason, I actually want to continue on Intel for the next one. Like with OpenMP, there’s some room to look better at how TBB divides the work, as well as marrying it to ISPC and seeing where I can go from there. I mentioned that my work division for ISPC was rather shoddy, so there’s space to experiment, and even with the regular implementation (which is not going anywhere because of how I manage and treat my dependencies for this project; ISPC+TBB would be extra), there is the possibility of looking more closely into trying to do it smart and understanding how the std::task sectioning is interfering with ISPC and vice-versa.

 

Additionally, there’s OneAPI to look at. I’m quite curious as to what they have available and all. Intel’s actually been on a roll on their open source initiatives. Considering how hostile they are towards anyone that wants to buy their CPUs, this is actually surprising and interesting. Really hoping that mindset keeps on their GPU side (so far it’s looking promising). But this post was already a massive one. So I guess we all need some time to recover and I need some time to check some new cool stuff out.

Reference and related content