Multi your Threads #4: ROCm Roll!

Welcome back

So far, in this series about multithreading I’ve talked about C++ STL, OpenMP and MPI and, in the last post, we’ve started dipping our toes in heterogenous computing with OpenCL and CUDA.

This time around, I’m paying my Team Red fanboy taxes and we’re going to talk specifically about AMD’s ROCm initiative and a couple of resulting products of it.

If you’re just dropping in, you might want to check some of those previous post. In particular, if you’re new to heterogeneous computing/GPGPU, I do go a bit out of my way in the previous post to make sure we’re all on the same page as we move forward. Otherwise, let’s dive in

The usual note: All the code here is from ToyBrot. If you prefer to see the whole picture or browse on your own as you read, feel free to clone the repo and open it up on your favourite code editor (maybe on a second screen if you have one). I personally am a big fan of Atom

Seeing red!

As you go back in the history of 3D graphical accelerator cards, there are three main brands that to me mark very important moments in the “coming about” of our current video adapter landscapes:

The first of which, was 3dfx Interactive, heck, my own very first 3D-capable computer had a voodoo card and a Trident 2D graphics adapter (later I inherited a Voodoo 5, such glory, one card to rule them all). 3dfx started pushing 3d graphics cards to general desktop users in 1996 (according to wikipedia). And they were the bar to clear for a while.

Eventually, however, they were overtaken and consumed by their evil sister company, nVidia. Both of them shared their origin as offshoots from former Silicon Graphics employees (nVidia actually being older), nVidia eventually became the giant it is today pushed by the high performance of their GeForce cards which are still coming out today.

And alongside them, a company called ATI Technologies was also trudging along. Originally with not quite as much fanfare, though still a popular option, ATI eventually started putting out some pretty beefy cards of their own. And though ATI is no more, their Radeon line of cards is still a thing. ATI itself, though, is no more. Instead, it was eaten by another monster company, Advanced Micro Devices. Wanting to broaden their net, AMD decided they wanted to fight on both the CPU and GPU side.

Though there have been times when AMD has struggled on either front, they’re still very much alive and very much fighting. Being in the position where nVidia is kept out of the CPU market and Intel doesn’t really have any GPUs they can offer (though that seems to be soon to change), AMD enjoyed a pretty unique position where they “own a larger stack” and from a few years back, AMD has been trying to push this advantage. For us, of particular interest in an initiative AMD started rolling in 2016

Are You Ready to ROCK?[sic]

On April 2016 AMD came out with their ROCm initiative. According to AMD, ROCm stands for “Radeon Open Compute” and it more or less puts an umbrella over AMDs efforts to essentially regain ground lost to CUDA.

You see, as I begrudgingly am forced to admit, CUDA is very popular. And though I gave a lot of attention to HPC and supercomputers when I talked about

OpenMP and MPI, CUDA is used in those too, and that means they have nVidia GPUs in them. A LOT of GPUs and a lot of very expensive ones too (hint, they’re not GeForce). So there’s some interesting things there. Additionally, for example, AMD’s recent Vega GPUs are (very depressingly) amazing at mining cryptocurrency, and those people were buying oh so many GPUs a while back….

So AMD wants to essentially wrangle the computing people people over to their side. But because nVidia did the whole CUDA thing now you have a lot of programmers who have already spent a lot of time learning that, and they don’t really want to get into the whole OpenCL thing (I mean, have you seen just how much boilerplate stuff you even need?). It’s much less effort just buying a team green GPU instead and a lot of people don’t really see what’s the big deal with being locked to nVidia (insert fanboy rant).

And ROCm is the tool with which they want to do it. Additionally, ROCm is under the even wider umbrella of their GPUOpen initiative and, as the name implies, everything here is open source software.

HC C++, she was too good for this world

So, as part of ROCm, AMD released a Compiler called HCC, the Heterogeneous Computer Compiler. It’s essentially a fork of CLANG (you’ll get used to this description going forward) that compiles code also to AMD’s GPUs. It aims to support a few different languages/standards. One of them is called C++ AMP. This is an API published by Microsoft (I know, I don’t believe it either) for the purposes of facilitating “Accelerated Massive Parallelism”. However, AMD also has HC C++ which is really “AMD’s own C++ AMP”.

The reason why you’d need your own version is that being Microsoft’s, the official implementation is on top of DirectX and Windows and probably built with MS’s Visual C++ (oof) soo… yeah… best keep distance.

But the syntax itself is pretty nice (surprisingly since what passes for C++ when it comes to VC is… very questionable). So AMD made their own thing and HCC can compile it and run stuff on their GPU.

So how's it like then?

It has some of the disadvantages of OpenCL when it comes to types, but, as with CUDA, we’re on single-source land again.

// fracGen.cu

#include "FracGen.hpp"
#include <cuda_runtime.h>
#include <iostream>
#include <cfloat>
__device__ RGBA getColour(unsigned int it)
{
  RGBA colour;
  colour.g = it == 25600? 0 : min(it, 255u);
  return colour;
}
RGBA::operator uint32_t() const
{
  uint32_t colour = 0;
  colour = colour | r;
  colour = colour << 8;
  colour = colour | g;
  colour = colour << 8;
  colour = colour | b;
  colour = colour << 8;
  colour = colour | a;
  return colour;
}
__device__ uint32_t MapSDLRGBA(RGBA colour,  
                    SDL_PixelFormat format)
{
  return  ( (colour.r>>format.Rloss)<<format.Rshift
          | (colour.g>>format.Gloss)<<format.Gshift
          | (colour.b>>format.Bloss)<<format.Bshift
          | ((colour.a>>format.Aloss)<<format.Ashift 
          & format.Amask  );
}
__global__ void calculateIterations(uint32_t* data, 
                                    int width, 
                                    int height, 
                                    Region r, 
                                    SDL_PixelFormat format)
{
  int row = threadIdx.x;
  int col = blockIdx.x;
  int index = ((row*width)+col);
  if (index > width*height)
  {
    return;
  }
  unsigned int max_iteration = 25600;
  
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;
  unsigned int iteration = 0;
  while ( (x*x + y*y <= 4)  
        &&  (iteration < max_iteration) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
    iteration++;
  }
  data[index] = MapSDLRGBA(getColour(iteration), 
                           format);
}
void FracGen::Generate(uint32_t* v, 
                       SDL_PixelFormat* format, 
                       int width, 
                       int height, 
                       Region r)
{
  if(format == nullptr)
  {
    return;
  }
  uint32_t* devVect;
  cudaMallocManaged(&devVect, 
                    width*height*sizeof(uint32_t));
  calculateIterations<<<width,height>>>
       (devVect, width, height, r, *format);
  cudaDeviceSynchronize();
  memcpy(v, devVect, width*height*sizeof(uint32_t));
  cudaFree(devVect);
}
FracGen::FracGen()
{}
FracGen::~FracGen()
{
  cudaDeviceReset();
}

// fracGenHC.cpp

#include "FracGen.hpp"
#include <hc.hpp>
#include <iostream>
#include <cfloat>
RGBA getColour(unsigned int it) [[hc]]
{
  RGBA colour;
  colour.r = it == 25600? 0 : std::min(it, 255u);
  return colour;
}
struct pxFmt
{
    pxFmt(SDL_PixelFormat format)
        : Amask{format.Amask}
        , Rloss{format.Rloss}
        , Gloss{format.Gloss}
        , Bloss{format.Bloss}
        , Aloss{format.Aloss}
        , Rshift{format.Rshift}
        , Gshift{format.Gshift}
        , Bshift{format.Bshift}
        , Ashift{format.Ashift}
    {}
    uint32_t Amask;
    uint8_t Rloss;
    uint8_t Gloss;
    uint8_t Bloss;
    uint8_t Aloss;
    uint8_t Rshift;
    uint8_t Gshift;
    uint8_t Bshift;
    uint8_t Ashift;
};
uint32_t MapSDLRGBA( RGBA colour
                   , pxFmt format) [[hc]]
{
  return (colour.r >> format.Rloss) << format.Rshift
       | (colour.g >> format.Gloss) << format.Gshift
       | (colour.b >> format.Bloss) << format.Bshift
       | ((colour.a >> format.Aloss) << format.Ashift 
           & format.Amask  );
}
void calculateIterations(hc::array_view<uint32_t,1> data,
                         int width,
                         int height,
                         Region r,
                         pxFmt format,
                         hc::index<1> idx) [[hc]]
{
  int row = idx[0]/width;
  int col = idx[0]%width;
  int index = ((row*width)+col);
  if (index > width*height)
  {
    return;
  }
  unsigned int max_iteration = 256 * 100;
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;
  unsigned int iteration = 0;
  while ( (x*x + y*y <= 4)  
       && (iteration < max_iteration) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
    iteration++;
  }
  data[idx] = 
      MapSDLRGBA(getColour(iteration), format);
}
void FracGen::Generate(uint32_t* v
                      , SDL_PixelFormat* format
                      , int width
                      , int height
                      , Region r)
{
  if(format == nullptr)
  {
    return;
  }
  hc::array_view<uint32_t, 1>av(width*height, v);
  hc::parallel_for_each
            (
                hc::extent<1>(width*height),
                [=, fmt = pxFmt(*format)]
                (hc::index<1> i) [[hc]]
                {
                    calculateIterations( av
                                        , width
                                        , height
                                        , r
                                        , fmt
                                        , i);
                } 
            );
}
FracGen::FracGen()
{
  auto devices = hc::accelerator::get_all();
  for (auto dev : devices)
  {
    if(dev.is_hsa_accelerator())
    {
      hc::accelerator::set_default
                 (dev.get_device_path());
    }
  }
}
FracGen::~FracGen()
{}

So, using CUDA as a comparison you can see that the code is even closer to C++. Both standards call themselves C++ and they mean it. Instead of the __device__ and __global__ tags, there is only one [[hc]] which indicates a function might get rolled up to the GPU. I DID run into the same problem as I did with OpenCL, though, where I had to redefine SDL_PixelFormat.

Other than that, going down we see that the calculateIterations function itself had a couple tweaks to its parameters. Instead of receiving an array, it receives a specialised type, an hc::array_view<uint32_t,1> and it also receives an index. Somewhat like CUDA’s own indices, this index can have multiple dimensions. Here we treat the array just linearly (I couldn’t get things to line up properly when I used two, not entirely sure why). subscripting the index gives us the position at a specific dimension and, from there on, it’s the same code for calculating and obtaining the colour value for SDL.

In the Generate function, though, things are a bit different. Rather than being like CUDA or OpenCL, this suddenly looks a lot more like our std::async code, if you remember it, and suddenly it seems like there’s more happening under the hood.

In CUDA and OpenCL we had to explicitly manage our memory copies between host and device but now, in HC C++, we’re “wrapping it” in this array_view object. This object manages our pointer while it lives and, for our case, it synchronises our data automatically.

The actual kernel call is done by the hc::parallel_for_each call, which gets an extent (a range to iterate over) and a functor, which we provide through a lambda. Besides being tagged [[hc]] itself and doing the pixelFormat conversion, there’s really nothing to it. Just mind that it must receive an index of the same number of dimensions that the extent has. Aand that call runs your stuff on the GPU. Done.

The constructor, similar to OpenCL, has a bit of initialization code, and you, again, can probe devices and choose what you want. But for us, this is enough.

Easy, right?

The Verdict: So it's like what if CUDA but actually C++? Sounds too good.

Well..i remember how with CUDA we had a disappointing performance but we got rid of all the boilerplate and extra steps we had with OpenCL. Imagine if we could have the performance AND code that is arguably even cleaner…

Welcome to the Red Side.

That said, that bad joke does imply a negative and it is real. Though it is open source, I don’t think there’s another implementation of HC and HCC only targets AMD GPUs, which means you’re locked in the other side. It is a better side, for sure, but still an undesirable situation.

HC also has a couple of disadvantages when compared to CUDA

First, there’s nowhere near the same amount of documentation and tutorials and examples for HC as there is to CUDA. This is mitigated a little bit by a fact I mentioned before. HC is basically a verbatim rip-off of C++ AMP, there are very few differences, which mean you can also consume the vast majority of those resources. But the headstart that CUDA has is very big here. I myself had a rought time figuring out the pointer issue that required me to rewrite SDL_PixelFormat.

Second, if you, like me, look at this performance and this code and find this all glorious and get excited to start playing with it, I’ve got some sad news for you. HC is somewhat of a stillborn thing. It’s barely here and it’s already marked for deprecation. AMD announced that both HC C++ and HCC are being abandoned moving forward. AMD wants to release one more version but from there on this is more or less it. Which is a massive bummer, I only found that out after I’d written all this code. And I truly like it. But with that in mind, it’s hard not to look at it as more or a curiosity thing at this point.

Other than that, HCC itself is a fork of CLANG8 and can be used as a drop-in replacement and they do provide some additional help; It ships with an application called hcc-config which you can call to get your compile flags and whatnot. Pretty handy.

But yeah… if this is already on the road to deprecation, what’s left then?

Getting HIP with the kids

Remember how HC C++ is AMD’s own version of C++ AMP? How it looks like almost the same but HCC builds it for AMD GPUs? What if that but CUDA instead?

Well, the answer is HIP.

HIP stands for Heterogeneous-compute Interface for Portability (bit of a mouthful) and what it IS is AMD’s attempt at hijacking the current CUDA user base. Before we move on, let’s elaborate on how much of a hijacking that is:

So, what's it like, then?

// fracGen.cu

#include "FracGen.hpp"
#include <cuda_runtime.h>
#include <iostream>
#include <cfloat>
__device__ RGBA getColour(unsigned int it)
{
  RGBA colour;
  colour.g = it == 25600? 0 : min(it, 255u);
  return colour;
}
RGBA::operator uint32_t() const
{
  uint32_t colour = 0;
  colour = colour | r;
  colour = colour << 8;
  colour = colour | g;
  colour = colour << 8;
  colour = colour | b;
  colour = colour << 8;
  colour = colour | a;
  return colour;
}
__device__ uint32_t MapSDLRGBA(RGBA colour,  
                    SDL_PixelFormat format)
{
  return  ( (colour.r>>format.Rloss)<<format.Rshift
          | (colour.g>>format.Gloss)<<format.Gshift
          | (colour.b>>format.Bloss)<<format.Bshift
          | ((colour.a>>format.Aloss)<<format.Ashift 
          & format.Amask  );
}
__global__ void calculateIterations(uint32_t* data, 
                                    int width, 
                                    int height, 
                                    Region r, 
                                    SDL_PixelFormat format)
{
  int row = threadIdx.x;
  int col = blockIdx.x;
  int index = ((row*width)+col);
  if (index > width*height)
  {
    return;
  }
  unsigned int max_iteration = 25600;
  
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;
  unsigned int iteration = 0;
  while ( (x*x + y*y <= 4)  
        &&  (iteration < max_iteration) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
    iteration++;
  }
  data[index] = MapSDLRGBA(getColour(iteration), 
                           format);
}
void FracGen::Generate(uint32_t* v, 
                       SDL_PixelFormat* format, 
                       int width, 
                       int height, 
                       Region r)
{
  if(format == nullptr)
  {
    return;
  }
  uint32_t* devVect;
  cudaMallocManaged(&devVect, 
                    width*height*sizeof(uint32_t));
  calculateIterations<<<width,height>>>
       (devVect, width, height, r, *format);
  cudaDeviceSynchronize();
  memcpy(v, devVect, width*height*sizeof(uint32_t));
  cudaFree(devVect);
}
FracGen::FracGen()
{}
FracGen::~FracGen()
{
  cudaDeviceReset();
}

// fracGenHIP.cpp

#include "FracGen.hpp"
#include <hip/hip_runtime.h>
#include <iostream>
#include <cfloat>
__device__ RGBA getColour(unsigned int it)
{
  RGBA colour;
  colour.r = it == 25600? 0 : min(it, 255u);
  return colour;
}
RGBA::operator uint32_t() const
{
  uint32_t colour = 0;
  colour = colour | r;
  colour = colour << 8;
  colour = colour | g;
  colour = colour << 8;
  colour = colour | b;
  colour = colour << 8;
  colour = colour | a;
  return colour;
}
__device__ uint32_t MapSDLRGBA(RGBA colour,  
                    SDL_PixelFormat format)
{
 return  ( colour.r >> format.Rloss) << format.Rshift
         | (colour.g >> format.Gloss) << format.Gshift
         | (colour.b >> format.Bloss) << format.Bshift
         | ((colour.a >> format.Aloss) << format.Ashift 
         & format.Amask  );
}
__global__ void calculateIterations(uint32_t* data,
                                    int width,
                                    int height,
                                    Region r,
                                    SDL_PixelFormat format)
{
  int row = hipThreadIdx_x;
  int col = hipBlockIdx_x;
  int index = ((row*width)+col);
  if (index > width*height)
  {
    return;
  }
  unsigned int max_iteration = 25600;
  double incX = (r.Rmax - r.Rmin)/width;
  double incY = (r.Imax - r.Imin)/height;
  incX = incX < 0 ? -incX : incX;
  incY = incY < 0 ? -incY : incY;
  double x = r.Rmin+(col*incX);
  double y = r.Imax-(row*incY);
  double x0 = x;
  double y0 = y;
  unsigned int iteration = 0;
  while ( (x*x + y*y <= 4)  &&  (iteration < max_iteration) )
  {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
    iteration++;
  }
  data[index] = MapSDLRGBA(getColour(iteration),
                           format);
}
void FracGen::Generate(uint32_t* v,
                       SDL_PixelFormat* format,
                       int width,
                       int height,
                       Region r)
{
    if(format == nullptr)
    {
        return;
    }
    uint32_t* devVect;
    hipMalloc(&devVect,
              width*height*sizeof(uint32_t));
    hipLaunchKernelGGL(calculateIterations,
                       dim3(width),
                       dim3(height),
                       0, 0,
                       devVect,
                       width,
                       height,
                       r,
                       *format);
                       
    hipDeviceSynchronize();
    hipMemcpyDtoH(v, 
                  devVect, 
                  width*height*sizeof(uint32_t));
    hipFree(devVect);
}
FracGen::FracGen()
{}
FracGen::~FracGen()
{
    hipDeviceReset();
}

So.. if your first instinct is that “I could do like 80% of that with like ONE sed call”. Yeah, yeah you could. In fact, AMD provide a toold called HIPify that aims to automate precisely converting from CUDA to HIP.

There really is very little to talk about in isolation that has not been said from the CUDA version of the code because it is pretty much that. It is as much that as it can possibly be. Other than a different include and a very slightly different syntax for thread and block ID, there really are only one and a half differences here.

Getting to the half one first, is that memory is managed differently. HIP doesn’t have (at least currently) CUDA managed memory so, you need to do the copies yourself. It really is pretty straightforward though; You hipMalloc the device pointer and then call hipMemcpy<src>to<dest> as you shuffle things about (CUDA only has the one Memcpy function and you pass the AtoB as a parameter).

And the other one is that instead of using CUDA’s triple bracket notation, you call the hipLaunchKernelGGL function and it takes the thread and block counts as parameters (along with the kernel, its parameters and a couple other details).

And there you go, your code is HIPified, assuming you sort the build side which, if you build with CMake, is pretty straightfoward as AMD provides a findHIP.cmake module that gets you sorted. It works in a similar way to the old findCUDA way, including a hip_add_executable CMake macro.

This would bring us to the same “locked again but on the other side of the fence” argument but AMD works around this. You see, with HIP being so similar to CUDA, AMD actually designed it to be able to use CUDA as a backend. So you could write your code with HIP and build it with CUDA for an nVidia GPU.

This really is AMD going hard on inserting itself on a previously green-dominated space.

The Verdict: Do the Red ones go, indeed faster?

Again, our benchmark is OpenCL and the performance is pretty similar.

I’ll be the first to admit that my benchmarks are hardly scientific (I have a bunch of other stuff running in my machine as I take them) but I do do a few different batches to avoid any extraneous results. For HIP, compared to OpenCL (and now HC), performance seems pretty similar. The first run seems consistently longer, which to me indicates that the GPU might actually be holding on to that kernel, despite the call to hipDeviceReset between runs (the generator object is recreated every time). This discrepancy could explain why CUDA is looking worse in these tests, a quick very non-scientific test does make for similar results; making the CUDA generator regenerate the fractal on a non-benchmark run (generator doesn’t get destroyed) lowers timing of subsequent runs to about 20ms off of the corresponding OpenCL code. So there might be something there.

I tried getting some HIP on nVidia numbers but it seems that my setup is not behaving so I’ll have to come back to this later. But those are some really impressive numbers. In fact, my Vega actually beats the 1080Ti I have at work (though not by much) so this is pretty hype for this Team Red fanboy,

There are some other differences between HIP and CUDA though, as well as between AMD and nVidia GPUs, for one, the Warp size on AMD is 64 instead of 32, but nothing that can’t be managed if you probe your platform during the build

Tying this up: Does ROCm melt my face?

It most certainly tickles my fanboy fancies. I really like HC and am really sad that it’s already being phased out. And being able to convert CUDA code easily is a HUGE blessing. Having the same code run on multiple GPUs actually allows you to make better informed decisions on which hardware to target.

Being built on CUDA, HIP also leverages much of the current knowledge built for CUDA, which is a massive boon.

I think the one place where it needs to gain a bit of ground on it is general integration. With AMD phasing out HCC, I’m not sure where HIP is going to go. I’m hoping they try and extend CLANG as unifying compilers would be great. Although the CMake integration is already satisfactory, it would be better if it ever became on par with CUDA, which is integrated as part of CMake as an actual language. No additional modules required, but that is a ways off I believe.

Right now also just acquiring ROCm is a bit tricky. The way that I found was through an repository on github called Experimental ROC, that centralises a build stack for HCC, ROCm and several libraries which have been ported to ROCm, there’s BLAS, FFT… this sort of deal. I’m actually trying to get ArchLinux support merged into upstream as an usupported community provided distro but right now it’s available for Ubuntu, CentOS and Fedora (off the top of my head). I’ll link it in below

What's next?

For the next chapter I’m going back to Khronos, to talk a bit about their more modern iterations on GPGPU. I only have code for one of the two things I want to talk about written right now, which means I need to add another executable to Toybrot.

After that I might do a quick recap and “mega-mash” where we see everyone together. Would be a good time to get some HIP on nVidia numbers, for example, and some people have actually asked that I take a look at a couple other APIs, this time from the Team Blue. Not sure which I’m going to do first.

Some reference and related content

AMD’s GPUOpen website
Microsoft’s overview of C++ AMP useful if you’re looking at HC
Github wiki for AMD’s HCC with links to API docs, among others
Github repo for Experimental ROC If you’re also on Arch, you can check my PR (#15) to build all this stuff on your system
HIP programming guide with some good generic advice on getting your feet wet with all of this

The Great Refactoring