Multi your Threading #5 - Khronos strikes back

Hello again!

It’s been quite a while since my last post, but I’m back with some more heterogeneous computing shenanigans.

So, the previous four posts were, respectively, about:

This time, we’re going to go back to Khronos, who manage the OpenCL standard, and have a look at a couple of their more modern initiatives

Standard disclaimers apply:

If this sort of thing in general interests you, feel free to check the previous posts if you’d like, but don’t worry if you’re not keen on doing it, as each of these are more or less independent. Of note, though: on the very first post I go over the structure of the underlying code (it’s nothing fancy) and if you’re not familiar with heterogeneous computing / GPGPU / “running arbitrary stuff on your graphics card”, I DO recommend to go over the OpenCL/CUDA post as I go over the general concepts

And finally: All the code here is from ToyBrot. If you prefer to see the whole picture or browse on your own as you read, feel free to clone the repo and open it up on your favourite code editor (maybe on a second screen if you have one). I personally am a big fan of Atom

Errata:

Thanks to Rod Burns for pointing out a couple slips up from me regarding SYCL implementations.
– triSYCL is not a “reference” implementation. it’s just AN implementation
– ComputeCPP works with SPIR-V and regular SPIR (they’re not quite the same)
Both of these details have been corrected in the text

So... what needs replacing again?

So, way back when I talked about OpenCL, I also mentioned Khronos, “the OpenGL people”. They are “a member-driven consortium developing royalty-free open standards and vibrant ecosystems, to harness silicon acceleration for demanding graphics rendering and computationally intensive applications such as 3D graphics, Virtual and Augmented Reality, Parallel Computing, Neural Networks, and Vision Processing.”

In, a higher level language, such as English, you can say Khronos is an organization that aims to manage free standards for as many things GPU-related as they can, and they have members from most of the big names that are somewhat related to this field (as in, not only are nVidia, AMD and Intel members, but so are Samsung, Qualcomm, Valve and even Nintendo…)

So far, from what we’ve mentioned, the big standards from them which are relevant have been OpenGL and OpenCL. OpenGL mainly pertains graphics but arbitrary compute is an actual part of it. And OpenCL, as we mentioned, was built from the ground up to be a cross-vendor heterogeneous computing standard.

OpenCL, having been released in 2009 is somewhat new. But OpenGL is, comparatively, ancient, with the version 1.0 of the standard having been released in 1992. Computers in general and, even specifically GPUs and their drivers have come a long way in… 27 years. And even in the last 10 GPGPU has got much more refined so… let’s look at what’s next…

With hammer and SYCL

The first standard we’re taking a look at is called SYCL (pronounced as the title pun implies)

According to Khronos, SYCL “is a royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL that enables code for heterogeneous processors to be written in a “single-source” style using completely standard C++”

So, SYCL aims to be, in a way, a replacement to OpenCL. And from the get go, it tackles one of OpenCL’s greatest hindrances from a development point of view by being single-source. SYCL is also very much built with modern C++ in mind and aims to be as seamless as possible as part of a C++ program.

SYCL first showed up on 2014 and got finally released (according to Wikipedia) in May 2015, and it’s original version is numbered 1.2 as it’s built to interop with OpenCL 1.2.

So how's it like, then?

Even though SYCL is meant to replace and/or work on top of OpenCL, since it’s single source, I’m comparing it CUDA instead

//FracGen.cu

#include "FracGen.hpp"
#include <cuda_runtime.h>
#include <iostream>
#include <cfloat>
__device__ RGBA getColour(unsigned int it)
{
  RGBA colour;
  colour.g = it == 25600? 0 : min(it, 255u);
  return colour;
}
RGBA::operator uint32_t() const
{
  uint32_t colour = 0;
  colour = colour | r;
  colour = colour << 8;
  colour = colour | g;
  colour = colour << 8;
  colour = colour | b;
  colour = colour << 8;
  colour = colour | a;
  return colour;
}
__device__ uint32_t MapSDLRGBA(RGBA colour,  
                    SDL_PixelFormat format)
{
  return  ( (colour.r>>format.Rloss)<<format.Rshift
          | (colour.g>>format.Gloss)<<format.Gshift
          | (colour.b>>format.Bloss)<<format.Bshift
          | ((colour.a>>format.Aloss)<<format.Ashift 
          & format.Amask  );
}
__global__ void calculateIterations(uint32_t* data, 
                                    int width, 
                                    int height, 
                                    Region r, 
                                    SDL_PixelFormat format)
{
  int row = threadIdx.x;
  int col = blockIdx.x;
  int index = ((row*width)+col);
  if (index > width*height)
  {
    return;
  }
  unsigned int max_iteration = 25600;
  
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;
  unsigned int iteration = 0;
  while ( (x*x + y*y <= 4)  
        &&  (iteration < max_iteration) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
    iteration++;
  }
  data[index] = MapSDLRGBA(getColour(iteration), 
                           format);
}
void FracGen::Generate(uint32_t* v, 
                       SDL_PixelFormat* format, 
                       int width, 
                       int height, 
                       Region r)
{
  if(format == nullptr)
  {
    return;
  }
  uint32_t* devVect;
  cudaMallocManaged(&devVect, 
                    width*height*sizeof(uint32_t));
  calculateIterations<<<width,height>>>
       (devVect, width, height, r, *format);
  cudaDeviceSynchronize();
  memcpy(v, devVect, width*height*sizeof(uint32_t));
  cudaFree(devVect);
}
FracGen::FracGen()
{}
FracGen::~FracGen()
{
  cudaDeviceReset();
}

//SYCL.cpp

#include "FracGen.hpp"
#include <iostream>
#include <cfloat>
#include <CL/sycl.hpp>
inline RGBA getColour(unsigned int it)
{
  RGBA colour;
  colour.r = it == 25600? 0 : std::min(it, 255u)*0.7;
  colour.g = it == 25600? 0 : std::min(it, 255u)*0;
  colour.b = it == 25600? 0 : std::min(it, 255u)*0.5;
  return colour;
}
inline RGBA::operator uint32_t() const
{
  uint32_t colour = 0;
  colour = colour | r;
  colour = colour << 8;
  colour = colour | g;
  colour = colour << 8;
  colour = colour | b;
  colour = colour << 8;
  colour = colour | a;
  return colour;
}
struct pxFmt
{
  pxFmt(SDL_PixelFormat format)
    : Amask{format.Amask}
    , Rloss{format.Rloss}
    , Gloss{format.Gloss}
    , Bloss{format.Bloss}
    , Aloss{format.Aloss}
    , Rshift{format.Rshift}
    , Gshift{format.Gshift}
    , Bshift{format.Bshift}
    , Ashift{format.Ashift}
    {}
  uint32_t Amask;
  uint8_t Rloss;
  uint8_t Gloss;
  uint8_t Bloss;
  uint8_t Aloss;
  uint8_t Rshift;
  uint8_t Gshift;
  uint8_t Bshift;
  uint8_t Ashift;
};
inline uint32_t MapSDLRGBA( RGBA colour
                          , pxFmt format)
{
  return ( colour.r >> format.Rloss) << format.Rshift
         | (colour.g >> format.Gloss) << format.Gshift
         | (colour.b >> format.Bloss) << format.Bshift
         | ((colour.a >> format.Aloss) << format.Ashift 
            & format.Amask  );
}
template <typename Acc>
void calculateIterations(Acc data,
                         int width,
                         int height,
                         Region r,
                         pxFmt format,
                         cl::sycl::id<1> tid)
{
  int row = tid.get(0)/width;
  int col = tid.get(0)%width;
  int index = ((row*width)+col);
  if (index > width*height)
  {
    return;
  }
  unsigned int iteration_factor = 100;
  unsigned int max_it = 25600;
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;
  unsigned int iteration = 0;
  while( (x*x + y*y <= 4) && (iteration < max_it) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
    iteration++;
  }
  data[tid.get(0)] = MapSDLRGBA( getColour(iteration)
                               , format);
}
void FracGen::Generate( uint32_t* v
                      , SDL_PixelFormat* format
                      , int width
                      , int height
                      , Region r)
{
  if(format == nullptr)
  {
    return;
  }
  try
  {
    cl::sycl::range<1> pixels(width*height);
    cl::sycl::buffer<uint32_t,1> buff (v, pixels);
    cl::sycl::buffer<SDL_PixelFormat,1> fmt (format, 1);
    q.submit( 
      [&]
      (cl::sycl::handler& cgh)
      {
        auto access_v = 
          buff.get_access<cl::sycl::access::mode::write>(cgh);
        cgh.parallel_for<class syclbrotkernel>
          (  pixels,
            [=, fmt = pxFmt(*format)]
            (cl::sycl::id<1> tid)
            { 
              calculateIterations( access_v
                                 , width
                                 , height
                                 , r
                                 , fmt
                                 , tid);
            }
          );
              });
    q.wait_and_throw();
  }
  catch(cl::sycl::exception const& e)
  {
    std::cout << "SYCL sync exception -> " 
              << e.what() << std::endl;
  }
  catch(...)
  {
    std::cout << " Exception caught! " << std::endl;
  }
}
FracGen::FracGen(bool benching)
{
  static bool once = false;
  cl::sycl::default_selector device_selector;
  cl::sycl::async_handler sycl_err_handler =  
    [] 
    (cl::sycl::exception_list exceptions)
    {
      for (std::exception_ptr const& e : exceptions)
      {
        try
        {
          std::rethrow_exception(e);
        }
        catch(cl::sycl::exception const& e)
        {
          std::cout << "SYCL async exception -> " 
                    << e.what() << std::endl;
        }
      }
    };
  q = cl::sycl::queue{ device_selector
                     , sycl_err_handler};
  if(!once || !benching )
  {
    std::cout << "Running on "
      << q.get_device().get_info<cl::sycl::info::device::name>()
      << std::endl ;
    once = true;
  }
}
FracGen::~FracGen()
{}

So, at a first glance, a couple of things stand out.
First, it looks quite longer than the CUDA version. And second, thaaaats a lot of brackets…

So, the first point is not quite as bad as it looks. The full source files for CUDA and SYCL are, respectively, 121 and 181 lines of code. So the SYCL code is 60 lines longer. 23 of these lines are, like in HC C++ and OpenCL, the redefinition SDL_PixelFormat. 13 of those lines are the definition of an error handler, plus some 12 lines of try/catching wrapping the SYCL calls. Personally, looking at it like this, I find those latter ones, in particular, well worth it and to me, something that both SYCL and OpenCL have as a great advantage over CUDA’s C style error checking where you need to probe for an error status (I don’t even do this in ToyBrot).

The second part is a double edged sword and I feel it’s a bit aggravated by the resources available in terms of easily found examples (a trend I must confess to not being helping right here). As part of being VERY C++ in its form, SYCL makes heavy use of templates and lambdas. These are both things that fine in principle and make a lot of things a LOT easier, but they both require discipline and patience in laying your code out, lest you turn your code into barbed wire spaghetti.

So let’s go through the whole thing, starting from the constructor.
Our main point of interest is q, which represents an execution queue. And to instantiate it, we use a device_selector and an error handler. The error handler is a functor (an object which implements operator() and, as such, can be called like a function), which we initiate with a lambda. In our case, all it does is move exceptions forward. This looks a bit strange, but there’s a catch, these exceptions are coming from the GPU side of your code. So our handler is forwarding them so we treat those in our CPU side. With those set up, for us, instantiating the queue is trivial. I don’t really do any sort of device triage here. As a curiosity, I left in the snippet to get the device name string, just to give you a glimpse of how heavily templatised a lot of SYCL is

The Generate function is quite different. First of all, like in OpenCL, we’re wrapped in a try/catch block. And for the actual call, it is made in two steps. First we submit a functor to the queue, then we wait_and_throw which does what it sounds like. We’re waiting on the asynchronous call and throwing any exceptions that might arise.

That functor, though, is what’s a bit ugly. Prior to this, we define two different buffers. One for our output and the other for the pixel format we need to pass in, as we convert it to our version (same as before, to not have that pointer at the end). These buffers are going to do the copying around of data for us.

From our output buffer we get an accessor, and specify we want our kernel to have write access to that data. Internally, the queue has a handler which is passed as a parameter to the functor we’re submitting, and we ask it to do a parallel_for (remember OpenMP?). What this call takes as a parameter is the number of jobs to run and another functor, this time representing the actual kernel to be run. It takes an arbitrary identifier as a template parameter and a cl::sycl::id as a call argument. This ID is the unique identifier in the pool of jobs. In here I’m using the same logic as I’ve used for my CPU code, so it’s a one-dimensional id.

calculateIterations takes that accessor and and that job ID as parameters. The code for the function itself is essentially the same, but it’s now a templatised function. This is due to the acessor. The actual type of that accessor is quite a mouthful so this is very much necessary practice.

The rest of the generate function is just exception reporting. So, really the thing that hurts a bit is, structurally, the nested lambdas which is really never good practice. The way to do it would be to define you kernel separately so you can move some of that mess away but that aside, I personally really like the structure of it and how it encapsulates the heterogeneous side

The Verdict: Did SYCL really come to reap?

So…. That performance looks really comparable to HIP, which is really good news. And it should look pretty similar to HIP because it IS HIP.

So, something you may or may not have noticed is that SYCL is a standard by Khronos and I’ve neglected to talk about implementation of it. This is currently one of the weak links, in a way, to SYCL. Despite nVidia trying to somewhat sabotage OpenCL, even their implementation is solid and well established. Being quite new, SYCL is not really in the same situation. Without going too much in depth, here, the main ones currently available are triSYCL, hipSYCL and ComputeCpp. The first two are Open Source, whereas the latter is a proprietary implementation built by Codeplay (links for all of these in the end).

triSYCL is a “an open source implementation to experiment with the specification and to give feedback to Khronos”, of note there’s a backend for Xilinx FPGAs.

hiPSYCL is an implementation built not only on top of AMD’s HIP, but that can also build with CUDA and also incorporates a CPU backend which uses OpenMP.

ComputeCpp works with SPIR and SPIR-V (I’ll talk a bit about SPIR on the Vulkan bit)

In addition to them, and part of what makes SYCL exciting, Intel has their own implementation which they’ve open sourced and are working to get merged in LLVM upstream, which would mean it becomes part of regular clang. The triSYCL project also wants to merge it with their own efforts, so this could become a strong implementation in the somewhat near future

To give a bit of perspective, I’m shamelessly copying a diagram from the hipSYCL readme (links to it)

I’ll return to this on the final conclusion. For now, suffice to say that the implementation I myself am using is hipSYCL. It’s working pretty well and recently got a massively revamped CMake support which made it fairly easy to work with. It IS, though a project that is moving very very fast which can be slightly spooky depending on the type of project you’re thinking of and there is a disclaimer that there are still unimplemented parts of the specification.

All that being said, as we saw, the code is fairly straightforward (once you understand a couple basic concepts) and the performance is very similar to the “raw” HIP implementation.

Being very new, it also doesn’t have quite as many examples as you’d find for the likes of CUDA, but this is getting better and, if you can register for Codeplay’s website they actually have a few nice resources to go by.

From Core to Mantle to Vulkan

Once upon a time, AMD had a GPU architecture they called “Graphics Core Next”. As a mean of flexing the power of that Core, AMD started developing a new Graphics API, one that would replace both DirectX and OpenGL. This was Mantle.

Mantle was a very interesting and super exciting idea at the time. You see, the reason why AMD wanted to supersede DX and OpenGL is that both of those APIs make a lot of things easier. This is a good thing at first but this comes at the cost of abstracting away much of the hardware and also streamlining a lot of the processes that guide a rendering pipeline in order to make it simpler to use. As I mentioned way back, GPUs are almost full-fledged computers on their own, but they’re also specialised so have their own way of doing things. Most of the time, when you’re talking to a GPU you really just want to call a couple of functions and go “can you draw this? Great, kthxbai.” But… I mean… if you’re reading this, you’re a programmer, right? You’ve been around other programmers, I assume… You know how we’re like… All the code I have here is C++, have you talked to C++ programmers? They’re all like “Yeah, but you’re using inheritance there, so your virtual table is inducing unnecessary indirections which will give you additional memory access, you could templatise that instead and move that cost to compile time, it might shave off like tens of milliseconds from your minute long operation”. And don’t deny it, I code C++ myself, I have to hear myself thinking those sorts of things. And a lot of graphics people are like this as well, because programs with graphics are usually interactive, and for those, it’s normally very important that they’re very responsive. Some of those programs are games, they draw LOTS of very complicated stuff and you don’t want none of that “cinematic” feel in your games, they need to feel slick and crisp.

This all meant that you had a lot of programmers who were very frustrated trying to squeeze more and more out of their graphics cards and feeling held back by the limitations imposed on them by the graphics pipelines. Kind of like when one of those C++ people have to code a bit of Java for a change.

THIS was the people Mantle wanted to target.

The idea behind Mantle was to hand over the reins as much as possible to the programmer, kind of like “well, if you’re so smart then why don’t YOU do it, then? Here you go, you go build your own pipeline”. This was super exciting, but, to my mind, AMD made a blunder where they wanted to tie this to their own GPUs and “yeah yeah, we’ll open this later”. So it never really took off. But it made a heck of a noise. Since then two things happened. Microsoft decided to make their own ~~Java~~ Mantle, and essentially called it DX12. But also, somewhat in recognition of their folly, perhaps, AMD donated Mantle to Khronos. Mantle got merged in / morphed into the “OpenGL Next” project and, this, Matle burst forth into the world as Vulkan.

So I never really touched OpenGL other than mentioning it exists. Way back in the days, you really needed to trick OpenGL into doing arbitrary calculations. You would copy over a texture, then write a pixel shader that did some weird image processing in it but instead of drawing it, you wanted to read it back because it was some random values you needed maths on, actually. Later on (like, way later, from version 4.3, in 2012), it actually got compute support but that got thrown in pretty late. Vulkan came after all this, so compute was contemplated from the start, which is an interesting prospect. So let’s take a look at how it fares as an option

So how's it like, then?

For every other implementation in this project, when I come to this part, I do a side by side comparison showing how it looks side by side with another, “base” implementation for either CPU or GPU. But that doesn’t make a lot of sense for Vulkan, so we’re doing things differently. How come it doesn’t make sense? Well… like OpenCL, this is a split source solution, you need to write your own compute shader in addition to your C++ Vulkan stuff. So, if we compare to openCL, the C++ file for that is our longest one, that’s 220 lines, plus the shader. So that’s a LOT of boilerplate and setting things up. Well… the shader for Vulkan is essentially the same, but once I was done with the C++ file, it was 830 lines of code.This is not an exaggeration.

The goal of Mantle and, subsequently, of Vulkan was to be a low level API that gave the programmer control over the minutia and details of the graphical pipeline. They’ve achieved that goal. And it shows. Vulkan also aims to be even more general than OpenGL, in the sense that, for more restricted contexts and devices, OpenGL has the OpenGLES subset. From Vulkan’s perspective, that makes no sense; The goal of the API is to give you the least amount of overhead when it comes to performance and control, it’s not about to suddenly start telling what you can and can not use, you go and figure that our yourself.

For an example, let’s copy over a buffer. With CUDA, the way you go about this is (using managed memory here to add insult to injury):

someType* vec;
cudaMallocManaged(vec, sizeof(someType)*count);
memcpy(vec, hostVec, sizeof(someType)*count);
cudaDeviceSynchronize();

And there you go, the contents of hostVec are now in your device. With Vulkan… things are more involved.

Let’s assume you’ve already gone through the part where you set up your device and are ready to get rolling. There’s actually some housework you need to do before you even get to that point.

– Step 1: you declare your buffer. Easy. And then you get the memory requirements for that buffer. Before you can allocate that, though, you have more questions. You see, you don’t even know what memory your device has! So you do that
– Step 2: ask your device about its memory. When that happens though, your device, it has more than one type of memory. There’s the general RAM, there might be some caching, your memory might be split…
– Step 3: find out WHICH memory in your device has the properties you need and enough space for you to use. And you iterate over the various memories to find the one you want. All right, we’re getting there

– Step 4: create your DeviceMemory identifier

– Step 5: ask your device to allocate the memory.

– Step 6: create a temporary pointer on your host to represent the memory on your device.

– Step 7: Map your device memory to your host

– Step 8: Finally, memcpy your data to the memory in your temporary pointer

– Step 9: Don’t forget to unmap your memory.

4 lines of code became 9 steps. And this is not even half the journey… Or maybe it’s half but it’s the super easy half. Because later on, like in OpenCL, you need to tell your shader to use that memory and what it represents, and that is…. not straightforward.

This is the price you pay for having fine access and control over your device. You need to actually control your device. This is half of how a cpp file for something as simple as Toybrot reaches 830 lines. The other half is related to Vulkan’s desire to be as broad as it can. You see, I love C++, but if your goal is to reach as many devices and interop with as many languages in as many environments as you can, you’re looking at C.

Vulkan’s reference API is C. Every time you call a function, you need to then check it’s return code against some macros. Every time you need to pass information around, you do so through pointers to structs. The situation you need to do that the most? Every time you’re creating any thing. You’re creating a buffer? Well, here’s how it happens:

VkBuffer outBuf;
VkBufferCreateInfo bufInfo{};
bufInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufInfo.size = texSize;
bufInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
bufInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

if(vkCreateBuffer(vkDev, &bufInfo, nullptr, &outBuf) != VK_SUCCESS)
{
    throw std::runtime_error("Failed to create device buffer for output");
}

And you REALLY better not forget to initialise the struct to 0s, and to not forget to correctly set it’s sType.

You need one of those structs for basically every time you want to call a function that starts with vkCreate. This like this is why I don’t really feel Vulkan is directly comparable to the other things I’ve used, at the very least not in the way that they are to each other.

Before we move on, there is one more thing that’s interesting to talk about, and that is the shader.

//FracGen.cl

#if defined(cl_khr_fp64)
#  pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
#  pragma OPENCL EXTENSION cl_amd_fp64: enable
#else
#  error double precision is not supported
#endif
//These two would probably be better expressed by:
// {uint, uchar8} and {double4} respectively
// but I've left them like this so it's easier to read
struct __attribute__ ((packed)) _sdl_pf_cl
{
    uint Amask;
    uchar Rloss;
    uchar Gloss;
    uchar Bloss;
    uchar Aloss;
    uchar Rshift;
    uchar Gshift;
    uchar Bshift;
    uchar Ashift;
};
struct __attribute__ ((packed)) Region
{
  double Rmin;
  double Rmax;
  double Imin;
  double Imax;
};
uchar4 getColour(unsigned int it)
{
    uchar4 colour;
    colour[0] = it == 25600? 0 : min(it, 255u)*.95;
    colour[1] = it == 25600? 0 : min(it, 255u)*.6;
    colour[2] = it == 25600? 0 : min(it, 255u)*.25;
    colour[3] = 255u;
    return colour;
}
uint MapSDLRGBA(uchar4 colour,  struct _sdl_pf_cl format)
{
    return(   colour[0] >> format.Rloss) << format.Rshift
          | ( colour[1] >> format.Gloss) << format.Gshift
          | ( colour[2] >> format.Bloss) << format.Bshift
          | ((colour[3] >> format.Aloss) << format.Ashift 
            & format.Amask  );
}
kernel void calculateIterations( __global uint* data,
                                int width,
                                int height,
                                struct Region r,
                                struct _sdl_pf_cl format)
{
    int row = get_global_id (1);
    int col = get_global_id (0);
    int index = ((row*width)+col);
    uchar Red = 0;
    uchar Green = 0;
    uchar Blue = 0;
    uchar Alpha = 255;
    if (index > width*height)
    {
        return;
    }
    uint iteration_factor = 100;
    uint max_iteration = 256 * iteration_factor;
    double incX = (r.Rmax - r.Rmin)/width;
    double incY = (r.Imax - r.Imin)/height;
    incX = incX < 0 ? -incX : incX;
    incY = incY < 0 ? -incY : incY;

    double x = r.Rmin+(col*incX);
    double y = r.Imax-(row*incY);
    double x0 = x;
    double y0 = y;
    uint iteration = 0;
    while ( (x*x + y*y <= 4) 
           &&  (iteration < max_iteration) )
    {
        double xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        iteration++;
    }
    data[index] = MapSDLRGBA(getColour(iteration), 
                             format);
}

// FracGen.glsl
#version 450
#extension GL_ARB_separate_shader_objects : enable

#define WIDTH 1280
#define HEIGHT 720
#define WORKGROUP_SIZE 32
layout (local_size_x = WORKGROUP_SIZE, 
  local_size_y = WORKGROUP_SIZE, local_size_z = 1 ) in;

struct PixelFormat
{
  uint Amask;
  // GLSL doesn't have a char or uint8_t type
  uint Rloss;
  uint Gloss;
  uint Bloss;
  uint Aloss;
  uint Rshift;
  uint Gshift;
  uint Bshift;
  uint Ashift;
};

//alternatively dvec4
struct Region
{
  double Imin;
  double Imax;
  double Rmin;
  double Rmax;
};

layout(binding = 0) buffer buf
{
  uint data[];
};

layout(binding = 1) buffer reg
{
  Region r;
};

layout(binding = 2) buffer fmt
{
  PixelFormat f;
};

uvec4 getColour( uint it)
{
  uvec4 colour;
  colour.r = it == 25600? 0 : min(it, 255);
  colour.g = it == 25600? 0 : min(it, 255);
  colour.b = it == 25600? 0 : 0;
  colour.a = 255;
  return colour;
}

uint MapSDLRGBA(uvec4 colour,  PixelFormat format)
{
  return  (colour.r >> format.Rloss) << format.Rshift
        | (colour.g >> format.Gloss) << format.Gshift
        | (colour.b >> format.Bloss) << format.Bshift
        | ((colour.a >> format.Aloss) << format.Ashift
        & format.Amask  );
}

void main()
{
  if(  gl_GlobalInvocationID.x >= WIDTH 
    || gl_GlobalInvocationID.y >= HEIGHT)
  {
    return;
  }

  uint col = gl_GlobalInvocationID.x;
  uint row = gl_GlobalInvocationID.y;
  uint index = ((row*WIDTH)+col);

  uint iteration_factor = 100;
  uint max_iteration = 256 * iteration_factor;

  double incX = (r.Rmax - r.Rmin) / double(WIDTH);
  double incY = (r.Imax - r.Imin) / double(HEIGHT);

  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;

  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;

  uint iteration = 0;

  while ( (x*x + y*y <= 4.0) 
       && (iteration < max_iteration) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2.0*x*y + y0;
    x = xtemp;
    iteration++;
  }
  
  data[index] = MapSDLRGBA(getColour(iteration), f);
}

So, the shader for Vulkan was written in GLSL which is not quite as fancy as OpenCL but, it’s nothing too terrible or anything of the sort. The only thing that really irks me is the lack of more types but other than that, it’s just fine, really. At least for something as simple as this. But like in OpenCL, you don’t really send this to the GPU, it gets compiled and then that other bit gets sent to your GPU (OpenGL shaders, same thing, incidentally). But because Vulkan wants to be really lightweight, GLSL is way too fancy for it. See, part of the complicated thing with OpenCL that kinda hindered it as time went on, is that OpenCL C++, the kernel language for OpenCL, is quite sophisticated and complex to implement. Heck, even SYCL is built on top of OpenCL 1.2, and not 2. So Vulkan does not really “read and compile” GLSL, instead it takes in SPIR. SPIR (and SPIR-V) means “Standard Portable Intermediate Representation”. If you’re unfamiliar with what is an “intermediary representation”, think “assembly but for compilers”. It’s a partially broken down code that your compiler can finish tweaking and optimising. A lot of the power of clang comes from that and how LLVM gives you control over it.

So really, when you supply a shader for Vulkan, you supply SPIR. In my project, in the spirit of being the next step up from a Hello World, what I do is, on the CMake side, I look for and call a tool that is provided with the Vulkan SDK, called glslangvalidator. This tool compiles that shader into SPIR, which the Vulkan application then loads and sends to the GPU. At that point it’s the work of the driver to take that SPIR code and finish compiling it into whatever goes on inside you graphics card. The trick is that SPIR is then much much easier to write a compiler for than something like OpenCL or GLSL.

The Verdict: Is there a silver lining here?

There’s some extra shenanigans happening in my Vulkan calls. That’s essentially to silence the stderr which otherwhise gets printed to console together with stdout. This is a byproduct of how I set my system and not related to the program itself.

I have both AMD and nVidia GPUs on my system but my motherboard (ASUS Zenith) reports them in an inverted order so the main GPU is has the last PCIe identifier. This makes it so that Xorg wants to run on the nVidia instead, but that’s an inferior card I really only want for work purposes because of CUDA. Xorg also got really touchy when I tried to specify the device through a conf file so I blacklisted the nvidia_drm module and now that card can’t run graphics at all. But that generates a complaint from Vulkan every time it creates an instance

So, comparing to OpenCL implementations, it actually looks slightly slower, huh… Not a lot, but a bit. HIP is about 160ms too so it looks a bit bad for Vulkan. There is a caveat, though. The very last bit of the FracGen function in the Vulkan implementation looks like this:

vkFreeCommandBuffers(vkDev, vkCmdPool, 1, &vkCmdBuffer);
vkDestroyCommandPool(vkDev, vkCmdPool, nullptr);
vkDestroyPipeline(vkDev,vkPpl, nullptr);
vkDestroyPipelineLayout(vkDev, vkPpLayout, nullptr);
vkDestroyShaderModule(vkDev, vkShMod, nullptr);
vkDestroyDescriptorPool(vkDev,vkDescPool,nullptr);
vkDestroyDescriptorSetLayout(vkDev,vkDescLayout,nullptr);
vkFreeMemory(vkDev, devVec, nullptr);
vkFreeMemory(vkDev, devReg, nullptr);
vkFreeMemory(vkDev, devFmt, nullptr);
vkDestroyBuffer(vkDev, outBuf, nullptr);
vkDestroyBuffer(vkDev, regBuf, nullptr);
vkDestroyBuffer(vkDev, fmtBuf, nullptr);

And the “execution duration” that’s timed doesn’t involve the setup and teardown which would, instead be on the constructor and destructor. What happened here is that in the interest of keeping things as simple as I could and as analogous to the other projects as possible, I’ve done some very minimal initialisation on the constructor and I’m pretty sure a lot of this could be moved there, so there’s definitely some “setup leakage” so to speak. That aside, for a project as simple as ToyBrot, outside of the context of this being a learning exercise, Vulkan is tremendously overkill so I don’t feel as if I have a lot of room where I’m looking for lower overhead (except on HIP’s first run which is always super slow).

I don’t, for example, make use of one of the most interesting things with Vulkan here. HIP, SYCL, HC C++, CUDA and OpenCL are all heterogeneous computing languages/frameworks/programming models. Vulkan in more than them because it is a Compute AND Graphics API. ToyBrot does the rendering using SDL2’s software rendering on the CPU. If I were to use Vulkan, I could have the same system managing both my display and my compute, which provides good opportunities (assuming a more complex application) in how you manage and intertwine those. On setup on the constructor, for example, I ask the device specifically for a “command queue that will take compute commands but not graphics commands”, so you have this finesse depending on how you structure your application.

There’s also some trickery going on here where, to this point, I’ve presented a “worst case scenario” for Vulkan. I said it didn’t make sense for us to compare Vulkan directly to other things we’ve explored, but we could compare Vulkan to a much better Vulkan instead, which is what I REALLY use.

You see, I mentioned a while back that Vulkan’s API is built on C. But Vulkan later got a real C++ API. Chances are that if you go looking for tutorials online, most of what you find is built on the C API, that’s what was the case for me. Once I learned it had a C++ interface, though, I decided to finish my “legacy” implementation and then convert it to the C++ interface. As a learning exercise it was really interesting, and dealing with C Vulkan, really throws you to the wolves. You need to abandon civility and bite and scratch your way through it. And the C++ interface is CONSIDERABLY better. For one, the file size went from 830 lines, down to 662. That’s just shy of 170 lines down. It’s still massive, but much easier. Speaking of easier, everything is easier on the eyes. Vulkan entities get wrapped in real objects with functions and state. A funny one, every time you need to ask Vulkan for a list of anything, and you’re talking C, this is how it goes:

uint32_t count = 0;
vkEnumerateWhatever(vkInst, &count, nullptr);
std::vector<vkWhatever> vec{count};
vkEnumerateWhatever(vkInst, &count, vec.data() );

You need to call once to know how many there are, then allocate the space and call again to fill that space with your list. It gets old before you finish writing this the first time. But, enumerateProperties? Like this. Devices? Same. Extensions?Layers….

And this code is skipping that error code checking we saw before.

Well, if we do it with the C++ interface instead, this becomes:

std::vector<vk::Whatever> vec{vkInst.enumerateWhatever()};

There you go. Done. Like civilised people. It’s magical. And this is also not necessarily skipping over your error checking because, unless you tell it to disable them, the Vulkan C++ API has exceptions. So once you wrap that in a try/catch, you’re good. That makes your code much more manageable and much more clean. I’ve kept the old C interface implementation and, due to the length of the files, I’m not going to make them into snippets and post them here but I DO recommend that you open them in your favourite text editor (or just a couple browser windows) and have a side-by-side look by yourself to check the differences out, it’s pretty impressive. If you clone the repo, they’re called Vulkan/FracGen.c and Vulkan/FracGen.cpp. And if you’re worried about the performance, well, I’ve checked that out

And, as you can see, they are the same, so if you’re one of those people who “but exceptions are evil” you can rest easy because they’re only that if you use them instead of loops and not as, you know, actual runtime exceptions and errors that shouldn’t occur. If you get interested in Vulkan, I definitely recommend the C++ interface instead.

Tying this up: Is the future according to Khronos an Utopia or a Distopia?

To me, both SYCL and Vulkan are extremely exciting intiatives and, curiously, they represent very different niches.

Starting with SYCL, as much as I like HIP and the idea behind ROCm in general, the reality is that any standard that is held by one of the hardware vendors (yes, even AMD) is not really going to be general. HIP aims to replace CUDA as the language you write in, and then if you want to run on nVidia, you can translate it and build it with CUDA. But if you want to run on your Intel iGPU (or, starting from next year, discrete GPU) you’re out of luck, at least right now. Additionally, you’re not running that on any sort of mobile device, a Raspberry Pi… Once you broaden your view beyond the GeForce/Quadro vs Radeon/FirePro microcosm, that whole “developer freedom” talk feels as if it falls a bit flat.

As such, and this applies to Vulkan as well, there is true and great significance for a standard that alleges to champion this ideal to be managed by Khronos. And this is a really exciting plus for SYCL.

For more practical considerations, though SYCL is not quite as intuitive at first as CUDA and HIP for some things (you have this memory accessor layer to understand, compared to CUDA just having their “malloc this, but with CUDA” idea…) I really like how it integrates better how C++, and in particular modern C++, is structured and takes advantage of that. It’s a similar feeling, from me, as I get from HC C++ which, in turn, is AMD only and is being discontinued since AMD wants to focus on HIP instead.

The nascent implementations, though, are an issue to be considered. As I mentioned, I’m using hipSYCL. hipSYCL itself is a really interesting project and, if you have any interest in this, I definitely recommend checking it out. hipSYCL has recently got some really improved CMake support and is quite nice to use. Initially, it relied on a python script which did some substitutions and called the correct underlying compiler (hcc for HIP, nvcc or clang for CUDA, and your regular c++ compiler for CPU). It’s moving now to use clang’s plugin system, which is the same change that HIP itself is moving towards. Again, I really recommend checking out the project on GitHub and if you feel like contributing, it’s a good project to do so.

With regards to Vulkan, I mentioned that it is tremendously overkill for the kind of project that Toybrot is, and that is true. To me, finally getting to grips and implementing some code using it, has confirmed something about Vulkan that I thought from the very first times I saw Vulkan code: In order for it to be widely used (as opposed to only by crazy people), it kind of needs some solid middleware that provides some automation and defaults, that the programmer can then go and tweak. A bit like how C++ STL gives you a lot of facilities and abstractions but leaves you with the opportunity to open them up and get down to the inner workings when you need to

Compute in Vulkan is also changing relatively fast, as the standard itself is farily new, but because the support is built into the GPU drivers, it doesn’t suffer from the same implementation pains as SYCL. That said, GLSL is kind of lacking when compared to the other GPGPU languages we examined, so something I would myself need to look into is, if there are other “base languages” we can use and then convert to SPIR. Speaking of SPIR, I mentioned that the ComputeCPP SYCL is based on SPIR-V and the reason I ended up not using it was because it claims (through the computecpp_info tool) the AMDGPU driver for Vega doesn’t support SPIR-V which to me is just tremendously baffling and confusing.

Despite all that I still find Vulkan quite exciting, though perhaps not as immediately attractive for pure compute projects when compared to other options. Keeping watch on how it evolves is still going to be interesting and though I haven’t tried any I know some of those Vulkan middleware solutions do exist, AMD, for example, has one called Anvil which might be worth checking out. Just take it easy when you’re first diving in because it’s a lot to wrap your head around.

All in all Vulkan and SYCL are initiatives I REALLY hope take off and to me, widespread adoption of them in place of their current alternatives is a great step in a better environment for graphics and compute

Some reference and related content

Khronos’ official website
sycl.tech a website created by Codeplay to be a hub for SYCL related information
ComputeCpp’s official website (good source for guides and examples if you register)
hipSYCL’s github repository
LunarG’s VulkanSDK Good source for materials and help
Vulkan compute example by Neil Henning. This was one of my main examples when learning Vulkan
Another compute example, by Eric Arnebäck. It is one file, very well commented. Also a Mandelbrot, he clearly stole my gimmick
Github repo for vulkan-hpp the project itself is in integrated in Vulkan, but good refrence

The Great Refactoring

Multi your Threading #5 - Khronos strikes back

Standard disclaimers apply:

Errata:

So... what needs replacing again?

With hammer and SYCL

So how's it like, then?

The Verdict: Did SYCL really come to reap?

From Core to Mantle to Vulkan

So how's it like, then?

The Verdict: Is there a silver lining here?

Tying this up: Is the future according to Khronos an Utopia or a Distopia?

Some reference and related content

Contents