Multi your Threads #3: Put a computer in your computer

Hello, there.

This is the third post in a series wherein I talk about various ways of parallelising your code. The previous two entries went on a bit about C++ STL as well as OpenMP and MPI. If you haven’t, you may want to check those out before hand.

If you have been following, welcome back, hope I’m not boring you to death. And if you’re not and don’t really care… I forgive you.

Madatory note: All the code here is from ToyBrot. If you prefer to see the whole picture or browse on your own as you read, feel free to clone the repo and open it up on your favourite code editor (maybe on a second screen if you have one). I personally am a big fan of Atom

Also: This turned out pretty long. I’m assuming some of the people who read this might know very little about this stuff and, as it’s “new territory” I try to give a lot of context and do my best not to lose anyone along the way. So grab a cup of your favourite hot beverage and don’t be shy about taking a break and taking this one a bite at a time. For everything moving forward, I’m assuming they’ve read this one so it should be much smoother

Put a computer in your computer. That'll do it!

We’re about to get to some meatier stuff, in some ways; as such, let’s start by pretending to ask ourselves “but what even is heterogeneous computing” as a shoddily-veiled pretence to explain it to a hypothetical reader who isn’t super familiar and, to boot, give our own smarty selves a rubber-ducky moment at their expense, in which we can make use of the opportunity to remind ourselves of all of this.

Computers are complicated stuff and though it’s “easy” to optimise a tool that has a well-defined job, the really exciting thing about computers is that you can use them to play games, make fancy maths, write random stuff and still watch 24/7 stream of cute animal videos whilst you do all of those. At the core of how these random jobs get juggled is a device (more like an agglomerate of a bunch of stuff, tbh) that we call the “Central Processing Unit”, The Rock That Thinks itself, brains of the operation!

However, as we’ve increased demands of what we were doing, a particular set of programmers (the people who ask the rock questions) started wanting a different kind of rock in the same computer… “can I just have a rock that can multiply thousands of vectors and matrices really fast? As in… it doesn’t need to do a lot else, but I REALLY want it to be a beast at this one thing!”. Those people wanted to draw pretty things in the screen. In particular a lot of 3D graphics revolves around what essentially is several of those multiplications for every single frame. The problem with the CPU is that it has a very strict pipeline of things to do, because it cares a lot about a bunch of stuff that is not really important to “multiplying a bunch of stuff really quick”, petty, silly things like “check if this exists because if it doesn’t it crashes the computer so instead we need to ask for the other thing”…. So engineers came up with a different type of rock, one we now call “Graphical Processing Unit”.

The reason why “GPUs are so much more powerful than CPUs” is that they’re specialised hardware. They’re really really good at a bunch of things but if that firststatement was true at face value, Intel would have filed for bankruptcy a long time ago. The thing is, they’re different. And because of how different they are, they also require a bunch of stuff built around them. If you build a computer, at the very least you need some sort of motherboard to drop your CPU in, some RAM and some power. Well, GPUs, being the conceited little rocks they are, have their own of all of those things: Their motherboard is a Video Graphics Adapter, they have their own VRAM and even take their own specialized power input these days.

It took a while until AMD and Intel started integrating GPUs in their processors (some of them, at least) and even when that’s the case, the iGPU is still essentially a separate rock.

The story of “Heterogeneous Computing” begins when people start thinking “you know what…. multiplying vectors and matrices is good for a bunch of stuff other than drawing dudes with swords on a screen” and started looking into how to “hijack” all that specialised computing. Nowadays, there are several frameworks, libraries, languages… GPGPU, General Purpose computing on Graphics Processing Units, is here to stay, so we might as well get acquainted with it.

So how is this any different than ye olde multi-threading?

Heterogeneous Computing has its own challenges compared to “regular multi-threading”.

The first thing to consider is basically all of these issues are in addition to all the “classic” multi-threading issues: You need to figure out how to divide the work, how to synchronise tasks, how to make sure threads aren’t fighting over data, debugging is a pain…. you start by inheriting “the lot”.

But the bulk of what makes GPGPU particularly intimidating is the reason behind the whole analogy that titles this chapter: The best way of thinking of your GPU is as a separate computer.

Your GPU has it’s own ins and outs… if you’ve coded graphics, you know this. It spits out video through it’s I/O and all the data you put in is always fiddly. Buffers and textures that need to be described and loaded very explicitly… If you haven’t and don’t believe me, have no fear, I’m here to horrify you with this snippet from my toy games engine where I’m trying to explain to the graphics card, via OpenGL, essentially what’s a vertex:


glBindVertexArray(id);

    if(buffer.Registered())
    {
        Err::log("WARNING::VERTEXARRAY:: Attempting to bind Vertex Buffer multiple times");
    }
    buffer.bindBuffer();
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebid);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, static_cast<GLsizeiptr>(sizeof(unsigned int)*indices.size()),
                    indices.data(), static_cast<GLenum>(type));



    //glVertexAttribPointer(index, size, type, normalized,
    //                      stride, first   );
    //Position
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE,
                            21 * sizeof(GLfloat), nullptr);
    glEnableVertexAttribArray(0);

    //Normal
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_TRUE,
                            21 * sizeof(GLfloat), reinterpret_cast<GLvoid*>(3 * sizeof(GLfloat)) );
    glEnableVertexAttribArray(1);

    //UV
    glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE,
                            21 * sizeof(GLfloat), reinterpret_cast<GLvoid*>(6 * sizeof(GLfloat)) );
    glEnableVertexAttribArray(2);

    //Ambient Colour
    glVertexAttribPointer(3, 4, GL_FLOAT, GL_FALSE,
                            21 * sizeof(GLfloat), reinterpret_cast<GLvoid*>(8 * sizeof(GLfloat)) );
    glEnableVertexAttribArray(3);

    //Diffuse Colour
    glVertexAttribPointer(4, 4, GL_FLOAT, GL_FALSE,
                            21 * sizeof(GLfloat), reinterpret_cast<GLvoid*>(12 * sizeof(GLfloat)) );
    glEnableVertexAttribArray(4);

    //Specular Colour
    glVertexAttribPointer(5, 4, GL_FLOAT, GL_FALSE,
                            21 * sizeof(GLfloat), reinterpret_cast<GLvoid*>(16 * sizeof(GLfloat)) );
    glEnableVertexAttribArray(5);

    //Shyniness
    glVertexAttribPointer(6, 1, GL_FLOAT, GL_FALSE,
                            21 * sizeof(GLfloat), reinterpret_cast<GLvoid*>(20 * sizeof(GLfloat)) );
    glEnableVertexAttribArray(6);


glBindVertexArray(0);

It’s essentially a lot of “look, this is item X in the list, it’s Y amount of such type, the next one is Z amount of bytes after this one…”. And you have to do that because there’s a second problem:

“Your video card doesn’t speak C++”

Which, again, if you have any passing knowledge of graphics programming is “well, duh”, but if you don’t and you ever hear anyone talking about “shaders” in the future, you are hereby granted knowledge: Shaders exist because of that fact, they are “programs that run on your GPU”.

In the context of graphics, communication with a GPU will happen through graphics pipeline. That pipeline is a specification that is implemented by your drivers, such as OpenGL, Vulkan and the… other… stuff… And these are specialised ways of “talking to your GPU”. Shaders are programs, yes, but they’re programs specialised for running over “vertices” or “pixels” and, for a while, this is what we had. There was once a time where you’d send “some texture” to a GPU and had it run a pixel shader on it. But instead of drawing it you were like “you know what, actually give me that texture you just processed” and it turns out your shader was all like: “multiply red on pixel 0 in row 0, by red in pixel 0 in row 1 and then add the value to red in pixel 0 in row 2” and it just didn’t know that to you those values were some velocity, acceleration and position values and that “red” actually meant “X axis”.

That is, however, obviously hacky and clunky and, as funny it is to imagine your code as this massive “cheating the system and tricking the rock to think of philosophy when it was told it was solving maths” con job, for actual every day work we can do with something better…

Finally, there’s a more complicated issue, vendors…

You see, your CPU actually follows standards. I said “GPUs don’t speak C++” but your CPU doesn’t really, either. Instead, the job of a compiler is to translate that C++ into “native code”. What is run is written in a class of languages known as Assembly code. And it IS “a class of languages” because different CPUs run different assembly. They know how to do different stuff. If you can remember way back, when you had “pentium with MMX”, that MMX is an assembly extension. These are known as Instruction Sets and if you’re a “regular user/dev” you come in contact with a few different ones. There’s x86, which was developed and solidified by Intel into what became the “de facto” instruction set for the majority of PCs for a while (Macs used to be based of a different architecture called PowerPC). Things got shaken again when AMD came up with AMD64, which is the instruction set most consumer desktop uses these days (yes, even your Intel, if it’s 64-bit, it’s running this). And somewhat parallel to this, ARM has been releasing their own standards which pretty much dominate low-end and mobile devices, a role that’s getting muddier as they get into consoles and are aiming on entering the server market. The big reason why AMD and Intel are essentially uncontested in their seats as desktop CPU kings is because they cross-license their instruction-sets to each other (which is why the overwhelming majority of software doesn’t care if you’re running on either), and they don’t want any other players.

Okay… so instruction sets are a thing… how do they apply to GPUs?
Eh? Ask nVidia?

Because GPUs live isolated in their own world and generally only talk through drivers, they’re each doing their own thing. So a lot of your work is suddenly very affected or reliant on who specifically makes your GPU, which can make things complicated. You wrote all your code in CUDA? Congratulations, you’re locked into buying exclusively nVidia forever now. So there’s now an additional layer of complication/consideration.

OpenCL: making "computing shaders" a thing

OpenCL was a standard released originally in 2009. This should put into perspective how new the entire thing of heterogeneous computing is. OpenCL is not THE oldest standard (minor spoiler: CUDA was already out) but it’s certainly the “traditional old guard” when it comes to GPGPU frameworks. In terms of how it fits for us, it has a few interesting things:

Though it was initially developed by Apple (I know, right? Apple and standards? What a world) it is now managed by the Khronos Group, which are “the OpenGL people”. As such, a lot of different vendors worked into refining and supporting it (wikipedia cites “AMD, IBM, Qualcomm, Intel, and Nvidia” for the initial draft)
OpenCL builds directly on the concepts from graphical programming
It exemplifies one of two main “philosophies” in GPGPU frameworks I want to talk about
Being vendor-neutral, managed by Khronos, it has far-reaching implementations. You could, for example, code and run some OpenCL on your phone.

GPGPU philosophy #1: Split source

So let’s talk about philosophies in GPGPU

OpenCL is the most widely known example of a “split-source” heterogeneous computing language/framework. This means that, as with shaders, your GPU code is separate from your CPU code. So much so that, when an OpenCL program is to be run, it is first compiled and then “the binary” is sent to your GPU, in explicit steps.

This has some advantages

The first of which is that your code reflects that dual nature of your system.

Your CPU and GPU are separate spaces and they use separate code. When you hop to a file it’s immediately clear that which one of those you’re in. Another advantage is that because it gets compiled on the fly by your driver, as long as you can guarantee it implements a certain OpenCL standard, you can run the same code on essentially any GPU you have access to.

The second advantage is that because this separation is “embraced”, if you’re used to coding graphics, the general structure looks a lot similar and you have a very clear picture of what is sent to the GPU, when and how. This is pretty relevant.

Way back in chapter 1, I briefly mentioned the CPU’s prefetcher and how it could “invisibly” impact performance depending on your data access. The reason for that is that compared to just working with data that’s on your CPU, fetching and writing stuff to the RAM is super slow. Your CPU needs to make a request, send it through the northbridge (these days integrated on the CPU), it goes through the motherboard bus, gets to the ram, which has it’s own operation cycles, and then the data makes the way back through that whole path….

With GPUs this is much worse. Because it doesn’t ONLY need to do all this. Instead of going back to your CPU, the data goes back through the motherboard and “up the PCIe slot”. There, it goes through the VGA’s memory controller, into some different kind of RAM… to be processed by the GPU… which then needs to notify you back… so you can request the copy to be made all the way back to your system RAM… so you can request it and get it into the CPU.

So you want to know when and what data you’re copying around because you want to do this as least as you can.

But "every price has its cost"

I start a few statements in the previous sections with “if you’re graphics programmer” or “if you know OpenGL”…. and that is all and fine but, what if you’re not? What if you’ve never banged your head against THAT wall?

Well, then this is some pretty convoluted stuff with some massive boilerplate and weird restrictions and finicky types everywhere.

If you’re just your regular run-of-the-mill general programmer, especially if you’ve managed to steer clear of things like C or the parts of C++ we don’t talk about in the table and don’t allow inside the house any more, then it can look pretty hostile at first.

Also, when I said there’s a lot of boilerplate, I meant it. There’s just a LOT of housekeeping, because you need to do housekeeping for everything. Want to pass an array? Make sure to convert it to the correct types. One of your objects is… wait, you’re passing an object? Go back to Java! Oh, I’m sorry, it’s just a struct… my bad… but make sure the OpenCL side has that struct declared in its own terms and that you can directly cast back and forth from the two safely….

It does look a lot like writing regular shaders but that also means a lot of the things that make writing shaders either weird or more difficult than writing “regular code” is there as well, which can be frustrating, especially for someone just dropping in.

Additionally, the “compile on the fly” thing has a pretty serious security drawback: If you need to compile your code on the fly, that means you need to have your code there to be compiled. And that’ll mean one of two things: Either you’re shipping an actual file which is read from on runtime, or you’ll need to “bake” this code into your source, but that still means the string is just there, sitting somewhere in the middle of your binary.

If you DON’T need to concern yourself with that kind of thing, though, it DOES mean that you can compile your program once and then just tweak your shader file between runs. But otherwise, that’s a dodgy stuff, especially if you’re thinking of using that for commercial software

So how's it like then?

With that 60% of a mountain of preamble and context finally out of the way, let’s finally compare our first Heterogeneous implementation to what we had. And the first thing to note is that this now can’t be “one” side-by-side comparison, because we’ve split our source code. We’ll start by the OpenCL side. Though it is “the new stuff” it’s also more straightforward:

//calculateIterations.cpp

bool calculateIterations(uint32_t* data,
                         int width, int height,
                         Region r,
                         SDL_PixelFormat format, 
                         int h0, size_t idx)
{
    unsigned int iteration_factor = 100;
    unsigned int max_iteration = 256 * iteration_factor;
    double incX = (r.Rmax - r.Rmin)/width;
    double incY = (r.Imax - r.Imin)/height;
    incX = incX < 0 ? -incX : incX;
    incY = incY < 0 ? -incY : incY;
    for(int h = h0; h < h0+10; h++)
    {
        if (h >= height)
        {
            return true;
        }
        for(int w = 0 + static_cast<int>(idx); 
            w < width; 
            w+= numTasks )
        {
            double x = r.Rmin+(w*incX);
            double y = r.Imax-(h*incY);
            double x0 = x;
            double y0 = y;
            unsigned int iteration = 0;
            while (  (x*x + y*y <= 4)  
                  && (iteration < max_iteration) )
            {
                double xtemp = x*x - y*y + x0;
                y = 2*x*y + y0;
                x = xtemp;
                iteration++;
            }
            data[(h*width)+w] = 
                MapSDLRGBA(getColour(iteration), 
                           format);
        }
    }
    return false;
}

//fracGen.cl

#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#else
#  error double precision is not supported
#endif
//These two would probably be better expressed by:
// {uint, uchar8} and {double4} respectively
// but I've left them like this so it's easier to read
struct __attribute__ ((packed)) _sdl_pf_cl
{
    uint Amask;
    uchar Rloss;
    uchar Gloss;
    uchar Bloss;
    uchar Aloss;
    uchar Rshift;
    uchar Gshift;
    uchar Bshift;
    uchar Ashift;
};
struct __attribute__ ((packed)) Region
{
  double Rmin;
  double Rmax;
  double Imin;
  double Imax;
};
uchar4 getColour(unsigned int it)
{
    uchar4 colour;
    colour[0] = it == 25600? 0 : min(it, 255u)*.95;
    colour[1] = it == 25600? 0 : min(it, 255u)*.6;
    colour[2] = it == 25600? 0 : min(it, 255u)*.25;
    colour[3] = 255u;
    return colour;
}
uint MapSDLRGBA(uchar4 colour,  struct _sdl_pf_cl format)
{
    return(   colour[0] >> format.Rloss) << format.Rshift
          | ( colour[1] >> format.Gloss) << format.Gshift
          | ( colour[2] >> format.Bloss) << format.Bshift
          | ((colour[3] >> format.Aloss) << format.Ashift 
            & format.Amask  );
}
kernel void calculateIterations( __global uint* data,
                                int width,
                                int height,
                                struct Region r,
                                struct _sdl_pf_cl format)
{
    int row = get_global_id (1);
    int col = get_global_id (0);
    int index = ((row*width)+col);
    uchar Red = 0;
    uchar Green = 0;
    uchar Blue = 0;
    uchar Alpha = 255;
    if (index > width*height)
    {
        return;
    }
    uint iteration_factor = 100;
    uint max_iteration = 256 * iteration_factor;
    double incX = (r.Rmax - r.Rmin)/width;
    double incY = (r.Imax - r.Imin)/height;
    incX = incX < 0 ? -incX : incX;
    incY = incY < 0 ? -incY : incY;
    double x = r.Rmin+(col*incX);
    double y = r.Imax-(row*incY);
    double x0 = x;
    double y0 = y;
    uint iteration = 0;
    while ( (x*x + y*y <= 4) 
           &&  (iteration < max_iteration) )
    {
        double xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        iteration++;
    }
    data[index] = MapSDLRGBA(getColour(iteration), 
                             format);
}

So this is the first time we actually have the fractal generating code. As alluded, it’s some pretty simple, pretty dumb maths sort of deal. And we’ve also got the OpenCL kernel code.

In the context of GPGPU, kernel is a word for a function that you run on the GPU, for OpenCL in particular, it’s a function you’ll call from the CPU.

Before we get to it though, we’ve got some of that boilerplate stuff. We’ve got a redefinition of the SDL_PixelFormat struct. OpenCL is not the only implementation that’ll need this but it needs it twice. Once on the CPU side and once on the GPU side.

If you’re curious, the actual SDL_PixelFormat struct is all that stuff plus a pointer in the end and some heterogenenous standards seem to take some great offense at that pointer even if they never read it (or at least, it’s what it seems to me. Either that, or they can’t translate the types properly themselves).

The reason we need that is because we need to rewrite SDL_MapRGB so that it exists in the proper context, as a function the GPU can call.

Types and auxiliary functions out of the way, we get to the actual calculateIterations function.

Similarly to the OpenMP implementation, it knows automagically what’s it’s “thread ID” so it can know what row and column we’re calculating… however..if you look a bit closer, you’ll realise that we’re missing a for loop.

This is because, we’re running this on GPUs now. As such we’re spawning a LOT of threads… We’re spawning a thread for every pixel in this window.

I do this for every GPU implementation of toyBrot, regardless of the underlying hardware and tech (and they are different). This is what makes heterogeneous computing exciting, and, remember, we want do as few conditionals, as few loops and as few data transfers as possible. So this all works towards that goal.

Other than that fact, the code itself looks pretty similar. OpenCL is based heavily on C, so once you get past their types (uint8, for example, is an array of 8 uints, not a byte) and the setup, the code itself should look pretty familiar if you’ve coded … most modern-ish imperative languages?

So… cool beans. And how does the C++ side look like?

The OpenCL side in particular needed a lot of “code width massaging” to fit the snippet. You may want to consider just opening the actual file on your favourite editor instead

// tasks.cpp

bool FracGen::Generate(uint32_t* v, SDL_PixelFormat* format,
                        int width, int height, Region r)
{
    if(format == nullptr)
    {
        return false;
    }
    static std::atomic<int> h {0};
    bool finishedGeneration = false;
    auto genFrac = [](uint32_t* data, 
                      int width, 
                      int height, 
                      Region r, 
                      SDL_PixelFormat format, 
                      int h0, size_t idx)
                    {
                      return calculateIterations(data,width,
                          height,r,format,h0,idx);
                    };
    std::vector< std::future<bool>> tasks(numTasks);
    for(unsigned int i = 0; i < numTasks; i++)
    {
        tasks[i] = std::async(std::launch::async, 
                              genFrac,
                              v, width, height, r, 
                              *format, h.load(), i);
    }
    h+= 10;
    for(unsigned int i = 0; i < tasks.size(); i++)
    {
        //Block until all tasks are finished
        if(tasks[i].get())
        {
            //If one of them is done, they all must be
            h.store(0);
            finishedGeneration = true;
        }
    }
    return finishedGeneration;
}

// fracGenOpenCL.cpp

typedef struct 
  __attribute__ ((packed)) 
  _sdl_pf_cl
{
    _sdl_pf_cl(SDL_PixelFormat format)
        : Amask{format.Amask}
        , Rloss{format.Rloss}
        , Gloss{format.Gloss}
        , Bloss{format.Bloss}
        , Aloss{format.Aloss}
        , Rshift{format.Rshift}
        , Gshift{format.Gshift}
        , Bshift{format.Bshift}
        , Ashift{format.Ashift}
    {}
    cl_uint Amask;
    cl_uchar Rloss;
    cl_uchar Gloss;
    cl_uchar Bloss;
    cl_uchar Aloss;
    cl_uchar Rshift;
    cl_uchar Gshift;
    cl_uchar Bshift;
    cl_uchar Ashift;
} sdl_pf_cl;
typedef struct 
  __attribute__ ((packed)) 
  _reg_cl
{
    _reg_cl(Region r)
        : Rmin{r.Rmin}
        , Rmax{r.Rmax}
        , Imin{r.Imin}
        , Imax{r.Imax}
    {}
    cl_double Rmin;
    cl_double Rmax;
    cl_double Imin;
    cl_double Imax;
} reg_cl;
FracGen::FracGen()
{
  try
  {
    // Get list of OpenCL platforms.
    std::vector<::cl::Platform> platform;
    cl::Platform::get(&platform);
    if (platform.empty())
    {
      std::cerr 
        << "OpenCL platforms not found." 
        << std::endl;
      exit(1);
    }
    for(auto p = platform.begin(); 
        devices.empty() && p != platform.end(); 
        p++)
    {
      std::vector<cl::Device> pldev;
      try
      {
        p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);
        for(auto d = pldev.begin(); 
            devices.empty() && d != pldev.end(); 
             d++)
        {
          if (!d->getInfo<CL_DEVICE_AVAILABLE>()) 
            continue;
          std::string ext = 
            d->getInfo<CL_DEVICE_EXTENSIONS>();
          if( ext.find("cl_khr_fp64") 
                         == std::string::npos
           && ext.find("cl_amd_fp64") 
                         == std::string::npos )
          {
            continue;
          }
          devices.push_back(*d);
          context = cl::Context(devices);
        }
      }
      catch(...)
      {
        devices.clear();
      }
    }
  }
  catch (const cl::Error &err)
  {
    std::cerr
      << "OpenCL error: "
      << err.what() << "(" << err.err() << ")"
      << std::endl;
      return;
  }
  if (devices.empty())
  {
    exit(1);
  }
    queue = cl::CommandQueue{context, 
                            devices[0]};
    compileOpenCLKernel();
}
void FracGen::compileOpenCLKernel()
{
  std::ifstream srcFile("FracGen.cl");
  std::vector<std::string> lines;
  std::string ln;
  lines.push_back(ln);
  if(srcFile.is_open())
  {
    while(std::getline(srcFile,ln))
    {
      ln.append("\n");
      lines.push_back(ln);
    }
  }
  else
  {
    std::cerr 
        << "Could not open OpenCL source file" 
        << std::endl;
    exit(1);
  }
  srcFile.close();
  clProgram = cl::Program(context,lines);
  try
  {
    clProgram.build(devices);
  }
  catch (cl::Error e)
  {
    std::cerr
      << "OpenCL compilation error" 
      << std::endl
    exit(1);
  }
  clGen = cl::Kernel(clProgram, 
                     "calculateIterations");
}
void FracGen::Generate(uint32_t* v, 
                      SDL_PixelFormat* format,
                      int width,
                      int height,
                      Region r)
{
  if(format == nullptr)
  {
    return;
  }
  size_t bufferSize = sizeof(uint32_t) 
                        * width * height;
  sdl_pf_cl fmt(*format);
  reg_cl    reg(r);
  try
  {
    const cl::NDRange dims(width, height);
    const cl::NDRange offset(0,0);
    cl::Buffer v_dev(
          context,CL_MEM_WRITE_ONLY, bufferSize
          );
    clGen.setArg(0, v_dev);
    clGen.setArg(1, width);
    clGen.setArg(2, height);
    clGen.setArg(3, reg);
    clGen.setArg(4, fmt);
    queue.enqueueNDRangeKernel(
          clGen, cl::NullRange, dims, cl::NullRange
          );
    queue.enqueueReadBuffer(
          v_dev,CL_TRUE,0,bufferSize,v
          );
  }
  catch (cl::Error err)
  {
    std::cerr
        << "OpenCL error: "
        << err.what() << "(" << err.err() << ")"
        << std::endl;
    return;
  }
}
FracGen::~FracGen()
{}

So… umm….. that’s a lot more code… so… let’s take it in parts…

From the top, the very first thing we get are the counterparts to our OpenCL structs we had to define. They are mostly the same, except that their types, here in “C++ Land” are prepended by cl_. They also have the conversion constructors we need. Nothing fancy.

Going next, we have the constructor.

Minor poetic liberty here, if you’re following the actual file instead of the snippet, the constructor is further down.

Whereas our old constructor didn’t do much, suddenly we have a LOT of housekeeping to do. All that code does two things:

We need to select an adequate device. Our device must be explicitly a GPU and it must 64 bit floating point capabilities
We need to prepare the Kernel to run in that device

Something interesting here is that we need to be explicit about wanting our code to run on a GPU. OpenCL is built in a way where you COULD just run the code on your CPU instead. I haven’t worked out a runtime switching mechanism (since other platforms aren’t quite as flexible) but it’s a one character tweak the way the code is. We can also query capabilities. Maybe we want a GPU if we can, but if it can’t handle those doubles, then we want to be able to fallback… this is some powerful flexibility that OpenCL offers you.

After figuring out our device, we need to compile the OpenCL code itself. Remember, at this point, we’re shipping the code raw. So we have an additional function for that.

The one for toybrot is pretty straightforward: It reads a file in it’s execution directory, gets all the lines, creates a cl::Program object from it and then calls it’s build function, which compiles it.

Throughout all of this process, we’re wrapping all of our code in try/catch blocks. This is some really good resilience. Not only it allows you to get better information on your error, you COULD, for example, have some issue where a GPU looks like it’s what you need but you’ve got a dodgy driver or something…. at which point you want to catch that fail and hook in your CPU fallback, even if it looked like you had one that you could use.

Just remember kids, try/catch blocks are not if/then or do/while. If you’re COUNTING on an exception, you’re doing it wrong

After the code has successfully built, we extract from it a cl::Kernel object, that represents the function on the device that we can call from the host. This is our doorway.

Finally, all of the housekeeping is out of the way! We can finally do some additional housekeeping. Way back I posted a snippet about how it is to send data to a GPU using OpenGL and when I started talking about OpenGL I mentioned it’s made as if, in some ways, it’s imagining a compute shader in OpenGL, taking from how that works… and… so far it’s been a lot like that… I mean, we get a cl::device instead of a glContext, and have a cl::program instead of a shader but it’s a bit of “different dress, same dance” to some extent, (although only some since openCL has a proper C++ interface, whereas OpenGL is a lot of C, a lot of macro shenanigans, a lot of yikes…)

But this all will indeed result in our own version of that whole deal which, albeit much less hostile, is still quite annoying.

Inside our try block, we first define a range. This is where we explain to OpenCL how many threads we want to run at once. We then make a cl::Buffer where OpenCL can dump the data when it’s done and start calling setArg on our Kernel. Rather than variadic parameter shenanigans like its modern C++ cousin, std::async, openCL relies on classic ways of doing things, so we need to tell it “your arg 0 is this” “and your arg 1 is this”…. Sadly for us, we can’t call them by name, so you need to have the order down correctly.

With that done, we add our computation to a queue, which we initiated way back in our constructor. For toyBrot we just get A device and then run it. But I totally could, say, have a queue for my CPU, a queue for my Vega and a queue for my 980 and send different things for each. You can associate each device you find with a different queue and just send jobs their way. Making multi-device coding potentially pretty straightforward (managing different queues properly and the magic of load balancing in potentially heterogeneous devices (CPU, GPU, iGPU?) is an entirely different challenge).

The final thing we need to do is do a copy command. Just as we copied all the data to the GPU before calling the kernel with those setArgs, we also need to explicitly copy the data back. We also need to tell OpenCL explicitly that, in our case, we, as in the calling thread, can’t move forward until it’s sure that it’s got that data, so we’re waiting for confirmation. This is our sync point, where we wait for OpenCL to give us the greenlight.

After this, we’re done. The data has been copied over back to our input buffer and the main loop will have SDL blit it on the screen.

Easy peasy right?

The Verdict: Double the code, double the win?

Our once simple and pure toybrot went from a best case scenario when it’s literally this one loop with a preprocessor pragma before it to an additional source file and some 100 lines of setup and housekeeping. What for?

Well… you know…. shaving 90% of your runtime maybe? This is a massive improvement. This is why we needed a rock that can think in a different way.

And, as you can see, we can run our code in the CPU and it still is doing a bang fine job.

Keep in mind that the OpenCL implementation uses the “GPU way” of just drawing the entire fractal at once so, the previous CPU implementations have additional overhead in favour of responsiveness so the numbers here aren’t directly comparable

So it looks like all that overhead DOES pay off in the end. OpenCL is a bit fiddly to get into, especially if you’ve had neither graphics nor GPGPU experience. But it’s not all quite as bad as it looks at first. The C++ interface to OpenCL is pretty comfortable to use and stays well clear of macro hell territory. Additionally, whilst the amount of boilerplate looks pretty intimidating, this is very standardised so chances are that projects 2 through 7 are going to start by copying the initialization stuff from project 1 and then making the necessary tweaks.

Combine that with the flexibility offered by having the same code being able to target different vendors, and you have some pretty strong upsides. In terms of setup and documentation, being the “old guard” of Heterogeneous Computing is a massive advantage to OpenCL. The standard is well documented by Khronos and there are several tutorials and help topics readily available. Setting up to run does require you to make sure you have an OpenCL Runtime available which might not come pre-bundled with your driver but really, basically every vendor has it. If I tweaked the SDL side, I could compile and run this on my phone.

I’ll talk more about the limitations and shortcomings of OpenCL after I’ve done the CUDA bit, because I feel like they will be easier to see in contrast with what is /has always been the main alternative to OpenCL.

CUDA: is the grass really greener on Team Green's side?

So we’ve had OpenCL for some 10 years now, but for even a bit longer, we’ve also had CUDA, nVidia’s own GPGPU thing launched in 2007. Being a shameless AMD fanboy myself I am legally obligated to speak derisively about nVidia every time I do need to speak about them so:

Screw you, Team Green. Everyone knows the red ones go faster.

There you go. Mandatory silliness out of the way, CUDA is interesting and powerful. nVidia knows and invests heavily on it and uses it to leverage their current dominance on the consumer side and try to win over more and more people on the computing side as well. Straight off the bat though, that implies the greatest disadvantage of CUDA, which to me is a pretty massive one: CUDA is 100% nVidia proprietary. If you write anything in CUDA, it ONLY runs on nVidia hardware. Not only that, nVidia dictates who gets to run what. Every version of CUDA has a list of supported devices. If you’re running a Turing card (nVidia 11 20-series), you basically need CUDA 10 to target it. But if you’re on CUDA 10, you can’t use Tesla and you can’t use Fermi. Which is the reason my Quadro 2000D that I bought second hand with the explicit goal of having A cuda-capable device on my machine is more or less a paperweight now. nVidia not only has no intention of “addressing” this as an issue, you can be sure they want to make it worse.

So with a disadvantage so inconvenient, why do people even use CUDA? And why am I talking about it and implied it’s important?

GPGPU Philosophy #2: Single-source

So, remember how OpenCL was made with graphics programmers in mind? And it works in a similar way to shaders? And has its own separate source that you write and compile separately for your GPU?

None of that is true for CUDA.

CUDA was made for C++ programmers. It tries its best to look like regular C++. And yourwrite CUDA code together with your C++ code. Same source file can have functions that run on the GPU and functions that run on the CPU.

After looking at all that OpenCL, it’s easy to see how this is attractive. Instead of writing your kernel as a random string or a separate file, it’s just there, all it takes is a tag and BAM! NVCC, CUDA’s official compiler will know that where that function needs to go. Something you also get “for free” is that it automatically understand basically all your types. You don’t need to redefine a struct in terms of what CUDA knows… “it can read C++”.

Not having that extraneous code also mitigates those safety concerns as now reverse-engineering your code actually takes reverse-engineering to pull off.

That sounds awesome! Is there even a downside?

This being somewhat related to what passes for a real world, yes.

Though your code is easier to write, once it becomes extensive, it can be more difficult to distinguish what goes where and how. Because unless you can structure your code in a way that makes sense both in “domain separation” and actual code architecture this starts looking a bit like spaghetti depending on how many functions are doing what and calling what where. We all want to think that’d never happen to us, but we pretty much all have opened up a project and just got angry and confused at our past selves. (Just Toybrot itself… geez, it was so disgusting)

Things being made easier for you also, almost always, mean that things are being done for you. Which sounds good until you need to understand what’s going on behind the curtain because it might not be what you expected at first. Can be a bit of a head scratcher.

Other issues would be specific to CUDA rather than single-source in general so we’ll get to them later (you know I want to)

So how's it like then?

It has a considerably smaller amount of boilerplate, for one. But I’ll still make it a long one because I still want to show pretty much the entire file.

// fracGenSTDTASK.cpp

#include "FracGen.hpp"
#include <iostream>
#include <cfloat>
#include <atomic>
#include <thread>
#include <future>
#include <vector>
static size_t numTasks=
  std::thread::hardware_concurrency()*4;
RGBA getColour(unsigned int it)
{
  RGBA colour;
  colour.r = it == 25600? 0 
                        : std::min(it, 255u));
  colour.g = it == 25600? 0 
                        : std::min(it, 255u));
  colour.b = it == 25600? 0 
                        : std::min(it, 255u));
  return colour;
}
RGBA::operator uint32_t() const
{
  uint32_t colour = 0;
  colour = colour | r;
  colour = colour << 8;
  colour = colour | g;
  colour = colour << 8;
  colour = colour | b;
  colour = colour << 8;
  colour = colour | a;
  return colour;
}
uint32_t MapSDLRGBA(RGBA colour,  
           SDL_PixelFormat format)
{
  return (colour.r >> format.Rloss) << format.Rshift
       | ( colour.g >> format.Gloss) << format.Gshift
       | ( colour.b >> format.Bloss) << format.Bshift
       | ((colour.a >> format.Aloss) << format.Ashift 
          & format.Amask  );
}
bool calculateIterations(uint32_t* data,
                         int width, int height,
                         Region r,
                         SDL_PixelFormat format, 
                         int h0, size_t idx)
{
  unsigned int factor = 100;
  unsigned int max_iteration = 256*factor;
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  for(int h = h0; h < h0+10; h++)
  {
    if (h >= height)
    {
      return true;
    }
    for(int w = 0 + idx; 
        w < width; 
        w+= numTasks )
    {
      double x = r.Rmin+(w*incX);
      double y = r.Imax-(h*incY);
      double x0 = x;
      double y0 = y;
      unsigned int iteration = 0;
      while (  (x*x + y*y <= 4)  
            && (iteration < max_iteration) )
      {
        double xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        iteration++;
      }
      data[(h*width)+w] = 
         MapSDLRGBA(getColour(iteration), format);
    }
  }
  return false;
}
bool FracGen::Generate(uint32_t* v, 
                       SDL_PixelFormat* format, 
                       int width, 
                       int height, 
                       Region r)
{
  if(format == nullptr)
  {
    return false;
  }
  static std::atomic<int> h {0};
  bool finishedGeneration = false;
  auto genFrac = [](uint32_t* data, 
                    int width, 
                    int height, 
                    Region r, 
                    SDL_PixelFormat format, 
                    int h0, size_t idx)
                    {
                    return calculateIterations(data,
                                               width,
                                               height,
                                               r,
                                               format,
                                               h0,
                                               idx);
                    };
  std::vector<std::future<bool>> tasks(numTasks);
  for(unsigned int i = 0; i < numTasks; i++)
  {
    tasks[i] = std::async(std::launch::async, 
                          genFrac, v, width, 
                          height, r, *format, 
                          h.load(), i);
  }
  h+= 10;
  for(unsigned int i = 0; i < tasks.size(); i++)
  {
    if(tasks[i].get())
    {
      h.store(0);
      finishedGeneration = true;
    }
  }
  return finishedGeneration;
}
FracGen::FracGen(bool benching)
    :bench{benching}
{}
FracGen::~FracGen()
{}

// fracGen.cu

#include "FracGen.hpp"
#include <cuda_runtime.h>
#include <iostream>
#include <cfloat>
__device__ RGBA getColour(unsigned int it)
{
  RGBA colour;
  colour.g = it == 25600? 0 : min(it, 255u);
  return colour;
}
RGBA::operator uint32_t() const
{
  uint32_t colour = 0;
  colour = colour | r;
  colour = colour << 8;
  colour = colour | g;
  colour = colour << 8;
  colour = colour | b;
  colour = colour << 8;
  colour = colour | a;
  return colour;
}
__device__ uint32_t MapSDLRGBA(RGBA colour,  
                    SDL_PixelFormat format)
{
  return  ( (colour.r>>format.Rloss)<<format.Rshift
          | (colour.g>>format.Gloss)<<format.Gshift
          | (colour.b>>format.Bloss)<<format.Bshift
          | ((colour.a>>format.Aloss)<<format.Ashift 
          & format.Amask  );
}
__global__ void calculateIterations(uint32_t* data, 
                                    int width, 
                                    int height, 
                                    Region r, 
                                    SDL_PixelFormat format)
{
  int row = threadIdx.x;
  int col = blockIdx.x;
  int index = ((row*width)+col);
  if (index > width*height)
  {
    return;
  }
  unsigned int max_iteration = 25600;
  
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;
  unsigned int iteration = 0;
  while ( (x*x + y*y <= 4)  
        &&  (iteration < max_iteration) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
    iteration++;
  }
  data[index] = MapSDLRGBA(getColour(iteration), 
                           format);
}
void FracGen::Generate(uint32_t* v, 
                       SDL_PixelFormat* format, 
                       int width, 
                       int height, 
                       Region r)
{
  if(format == nullptr)
  {
    return;
  }
  uint32_t* devVect;
  cudaMallocManaged(&devVect, 
                    width*height*sizeof(uint32_t));
  calculateIterations<<<width,height>>>
       (devVect, width, height, r, *format);
  cudaDeviceSynchronize();
  memcpy(v, devVect, width*height*sizeof(uint32_t));
  cudaFree(devVect);
}
FracGen::FracGen()
{}
FracGen::~FracGen()
{
  cudaDeviceReset();
}

Well, look at that… the code’s actually shorter than the STL version!

Our GetColour and MapSDLRGBA functions remain the same. Except they’re now tagged __device__. And calculateIterations itself is tagged __global__. So… __device__ functions are what they sound like, functions that run on the CUDA device. The equivalent for the CPU side is __host__. If you tag a function with both, CUDA will actually compile the function twice. Say, if we had another host side function that needed to call the mapRGBA function.

__global__ is the CUDA equivalent of OpenCL kernels. This is a function on the device that will be called from the host.

Inside the calculateIterations, things are very similar to how they were implemented in OpenCL, the main difference being that, whereas we had one global_id variable in OpenCL, here we have this threadIdx.x and blockIdx.x. The reason for this is that CUDA organizes threads in a very peculiar way. This somewhat being its tie-in to the fact that this runs primarily on GRAPHICS cards, their thread organization is a 3D grid. I won’t go very deeply into this because it pertains a lot to the actual architecture of nVidia GPUs but, the baby version of this is as follow:

Each Thread is part of a Warp, a group of always 32 threads on the actual GPU. These threads and warps are segregated into different Blocks. Threads and blocks are grouped into arrays of up to three dimensions on the programmer side and then CUDA does its best to chop it up in neat chunks for the GPU.

Sorting these out is an important part of optimising CUDA, but way beyond my scope right here (also I’m pretty sure real CUDA programmers will look at my call and instantly know that I barely understand enough to do what I’ve done so I’m definitely leaving that to the big boys). I’ll put a link in the footnotes if you want to get a better picture though

What matter here is that those calls refers to values your kernel gets automagically assigned when it’s told to run on the GPU and, as such, you always have them. The rest of the function is just business as usual.

Finally, the generate function is also tiny.
Like in OpenCL, we need to manage data transfer between the CPU and the GPU. The way I’ve done it here is by using CUDA Managed Memory, which is a convenience thing. When I ask CUDA to malloc for that pointer, it actually cheats, it’s allocating two pointers. But one is in the RAM and one is on the VRAM. If you’re using managed memory what you’re doing is forgoing actually knowing which is which and what is happening in the name of the convenience of not having to dealt with that yourself.

After those arrays are allocated, we call the kernel. And what’s with that <<<width,height>>> thing? ~~That’s the mark of the beast~~ That’s CUDA’s notation for a kernel launch. It’s weird, but it has two purposes: First it marks your code, “Okay, CUDA happens here!”. Second, those parameters are how many threadblocks you’re launching and how many threads you’re putting in each.

As with all my GPU code, I just spawn the easiest way to write “one thread per pixel and just go”, but this is where you could have your fancy 2D and 3D indices. Since all the parameters are passed by value, we don’t need to bother with all the setup we had on OpenCL. EXCEPT for the return array, which is the managed pointer.
After the kernel call, we call cudaDeviceSyncrhonize which for us is our barrier call. It also , under the hood, does the memory copy from the device’s devVect, to the host’s devVect. So that when we memcpy the array in the next call, we’re good to go.

Finally, before leaving we call cudaFree to get rid of both arrays and clear out.

Pretty easy right?

But it gets easier, because our constructor is default. At least for ToyBrot, we can just rely completely on CUDA’s own automatic initialization and defaults. “just do whatever you want, I’m fine as long as this runs”. And you CAN do a lot of fancy stuff if you want, probe devices with specific capabilities, assign devices to different queues (streams)…. go nuts. But if you don’t, it’ll make something happen.

Though in the destructor, we’re calling cudaDeviceReset() essentially to make sure we don’t have any lingering garbage on our GPU.

The Verdict: It's green, all right, but just how mean?

It’s pretty fast! But, actually to my surprise, not QUITE as fast as OpenCL.

I’m genuinely confused, did not at all expect this to be the case. A couple culprits here could be the cudaManagedMemory and the block/thread split.

Regardless, it’s pretty undeniable that this is much easier to code, and, at the very least at this scale, the possible disadvantages in terms of confusion are just not a thing at all. And either of those still crush my 1920X with ease.

I’m curious enough that I might revisit this and try a couple tweaks and I also want to get some numbers on the 1080Ti at work, which is a GPU that’s more comparable to my Vega.

Tying this up: A tale of two languages

So OpenCL and CUDA not only dominated the GPGPU landscape for quite a bit (they still do, really) but they did so by coming from two different approaches to writing heterogeneous code. There really isn’t a third way (source kind of split but not quite?) and experience with the pros and cons of these two still shapes how these technologies are understood and evolve. Also, being “old” and widely adopted means that these two are very well documented. It’s super easy to find a dozen tutorials and presentations and examples of most thing you’d want to usually do with these, which is a massive boon. A bit of a foreshadowing but so far, everything we’ve looked at is solid and well known and established. The next few things are not quite so.

Now, on the one hand, CUDA is proprietary and exclusive to nVidia. This is where we get one of the hardest problems of GPGPU, your hardware matters more than usual. OpenCL, on the other, has implementations for all sorts of vendors. But anyone that’s developed for Android or distro hopped its way through a few different Linux flavours knows this two is a double-edged sword:

See, your standard really is only as good as the implementations to some extent, and OpenCL is a complicated standard. It evolved fast and got fancy features that some vendors didn’t want to implement. Some vendors also had their own thing going so they were happy to just refuse to implement newer versions of a standard that could be used in GPUs that not their own.

Something I hadn’t mentioned yet is that I had to refactor and roll back my OpenCL code, otherwise it wouldn’t run on my GeForce. The reason is nVidia doesn’t provide an implementation of the OpenCL 2. You can actually see this on the info provided by each platform. And it’s not only them. This is a bit of a wrench in the wheels of the OpenCL dream. OpenCL’s standard has a fame for being bloated with a bunch of features no one really wants to implement, which takes quite some wind out of its sails if a lot of people are now kind of stuck in a “legacy” version. Not quite Python, but … oof…

Some of these implementations you also need to go out of your way to install separately, which is a minor hassle. I had the AMD Radeon one, but had to explicitly get nVidia’s and pocl on my machine (I’m running Arch).

Still, much better than just “oh you bought the wrong one, have a cup of nope”.

The struct redefinition thing with OpenCL is a massive pain, and a reason I’m much more keen on instead. The flexing of nVidia’s muscles is a problem though, CUDA is, after all, painfully popular and as an example that nVidia takes it seriously, just think of Turing. One of their main selling points (not the one that got all the attention of impressionable people, though) was that they revamped their architecture massively to facilitate Machine Learning. And how do you think you’re supposed to leverage all of that? Through CUDA. Because gamers aren’t really getting a lot from “bajillion Tensor Cores”

There is one final (?) thing where I actually like OpenCL over CUDA and that is error handling. I don’t even do any in toyBrot for CUDA, but a lot of the code in the OpenCL one is try/catch blocks scattered around. CUDA doesn’t have exceptions. Instead it uses what is really a kind of a C style of dealing with error. Most functions return exit codes and if they’re not 0, you can run cudaGetLastError to try and probe more information. It gets especially fun when you reach into some code and this is not only wrapped in a macro, but the guy who wrote the macro slapped a goto in the middle of it.

While I expect anyone reading to insta-yikes at the goto, the macro wrapping is the default way of doing it which is… not elegant…

A good set of exceptions will just always end being more well-behaved, really and I for one am super glad of this design choice.

What's next?

So yeah. this turned out massive. I didn’t expect to but as I wrote I felt that, this being the first post where I talk about Heterogeneous Computing, it was important to not only give context but also be clear on the new terms and challenges for people who are themselves just getting their feet wet.

The next ones shouldn’t be nearly as long and, for the very next one, I must pay for my sins.

You see, even though I did throw some mild shade at nVidia, I am still not off the hook and the only way to preserve my official AMD fanboy status is to compensate for writing half a post about nVidia, by making one entirely about AMD.

For the next one I’m going to talk about AMD’s ROCm initiative and the currently two heterogeneous computing standards that came from it: HC and HIP

See you soon!

Some reference and related content

The official page for OpenCL with the official material by Khronos themselves
nVidia’s CUDA zone where they try and convince you to fall for their trap
A Wikipedia article about CUDA Warps and Thread Blocks if you’re more curious about them
A 2018 Phoronix article about AMD nearing full OpenCL 2.0 support
Remember MPI? Here’s a blog post from nVidia about some pretty hardcore integration between it and CUDA streamlining memory transfers

The Great Refactoring

Multi your Threads #3: Put a computer in your computer

Put a computer in your computer. That'll do it!

So how is this any different than ye olde multi-threading?

OpenCL: making "computing shaders" a thing

GPGPU philosophy #1: Split source

This has some advantages

But "every price has its cost"

So how's it like then?

The Verdict: Double the code, double the win?

CUDA: is the grass really greener on Team Green's side?

GPGPU Philosophy #2: Single-source

That sounds awesome! Is there even a downside?

So how's it like then?

The Verdict: It's green, all right, but just how mean?

Tying this up: A tale of two languages

What's next?

Some reference and related content