Multi Your Threading #6 - The Age of Radeon has begun

The usual disclaimers

Hello again, everyone

As implied by the number in the title, this post is part of a series I’ve been writing on technologies for parallelisation and heterogeneous computing.

This chapter is actually a good point to jump in. It’s part recap, part benchmark shootout and part me explaining how I’ve changed every thing.

That said, I won’t go into much detail about the specifics of the implementations or the core concepts. If you’re interested in that, check out the previous posts on the series. In them I talk about:

In chapter #3 I also talk about Heterogeneous Computing / GPGPU / running arbitrary code in your video card in general. So if you’re new to the idea, it’s worth a check to grasp some of the key differences.

I won’t show much in this post but the code for all the different projects I use here are available on my repos in Gitlab. Feel free to grab a copy and follow along in your favourite editor (I personally quite like atom)

All that out of the way, time to nerd out with graphs and complaining about computing languages. Just the best of times

Well then... it's been A WHILE

A while back I mentioned that something awesome happened to me. I got a Radeon VII and dropped it in my main rig. And this card is an absolute monster, fanboy squeals all around!

Except… it was too good. As in excessively so. One of my goals with toyBrot is keeping the code as simple as I can, so that whatever is happening in it doesn’t get in the way of understanding the actual parallelisation part. And a simple mandelbrot generator is great for that. But it’s also very difficult to make it in any way costly for a GPU to compute. And with the Radeon VII well… OpenCL was averaging 35ms a run. This is small enough that really it’s hard to get any variation that’s not within your error margin. It didn’t even matter that I had everything as double precision because the Radeon VII actually can do doubles, unlike most consumer GPUs…

So I had to look for a new algorithm, perhaps a new dimension.

The answer I found was in still a fractal, but a different type of algorithm to the regular mandelbrot. As of now, the old Mandelbrot stuff is deprecated on the repository. All new projects will now use a technique called raymarching to generate a type of volumetric fractal known as a Mandelbox.

By the way, most of the pictures in this post can be clicked for full size if you want

So what's on the menu for today?

The main purpose of this chapter is to introduce the new algorithm. I won’t go into a lot of details about each implementation, but there’s some new particulars.

There’s also been a lot of upgrades. Both my GPUs changed (with a direct upgrade and a sort of sidestep) and projects like ROCm and hipSYCL have been moving quite fast. So this is a good opportunity for me to put numbers in perspective and do a sort of shootout between implementations.

For those who skipped the disclaimer I will NOT be going into a lot of detail about each implementation here, for that please go back to the relevant chapter where I introduced them. Otherwise this’d just drag on forever.

I’m also not doing ALL of the previous 10 implementations. I’ve decided to drop HC C++ and MPI. HC C++, as much as I really like working in it, is not only pretty much dead, it never had much of a chance, the way I see it. It’s C++ AMP but tied to AMD which is walling your garden around a copy of a thing I’m not even sure was that popular (compared to OpenCL and CUDA which are the real deal). AMD has discontinued it and it’s probably going to be entirely phased away once HCC gets dropped off of ROCm.

MPI, on the other hand is very much NOT dead. But it’s not only fiddly to code in (I know right, so lazy) but I’m also somewhat unable to demonstrate it in action without some sort of cluster. It’s made for spreading work over multiple computers and that’s not the setup I have. I could try and use an online cluster (apparently Intel has some services in that direction) or maybe emulate one with VMs… maybe even try to set one up between this rig and my HTPC (heck, it’s got an FX 8350 and a 1050Ti)… but I’m not sure it’s worth the effort. At least not right now. I don’t discard the possibility of coming back to MPI in the future because, as much as it’s difficult to get your head around it and code for it, it’s also super exciting stuff.

Can't stop the march of progress!

All right, time to get into some meat. Before we look at numbers, I’m going to give you the briefest explanation on what Raymarching is. Followed by a non-explanation of what a Mandelbox is, because I don’t really understand it myself, I just know it looks super cool.

So, Raymarching then.

If you use computers (something I assume to be true since you’re reading some nerd talk ab out code on the internet) you probably have painful memories of all the marketing hype around Raytracing back when nVidia was releasing their 11-series GPUs, which got renamed to 20 for no reason… and also 16 at the same time for even less reason….

So the thing with Turing, the architecture behind those cards, that related to that hype was that it has some specialisation for ray tracing which is a class of techniques for rendering images that are really approximation of physics simulation of how light behaves (you simulate photons bouncing around and all). It allows for a lot of fancy effects but it’s very expensive.

Raymarching is NOT that. But it’s somewhat related in a way. Raymarching involves essentially only two things: very simple trigonometry and a “distance function”. The distance function takes a position in space and tells you how far away that point is from “stuff in the scene”. So if your “scene” is,say, just a sphere, that’s the distance from the point to the sphere’s centre, minus the radius of the sphere.

All right, let’s stick with that for a bit

So… the basic trigonometry bit is this:

For each pixel in your screen you take a ray which flies from the origin of your viewpoint and crosses that. Then you march along that direction bit by bit, and wherever you stop, you go back to that distance function and if your distance to the scene (in our case, the sphere) is less than 0 (or really, less than your determined threshold) then you know you’ve hit something. And you can know how far away that thing is, because you’ve been counting how much you moved. Just by knowing these two things, you can already draw some interesting stuff.

Really the trick is in what’s inside that distance function. And this can be made very fast, but it CAN also be slow, which is what I needed.

This weird blob thing is a structure called a “Mandelbulb” which is somewhat of an extrapolation of what a mandelbrot fractal would be if it had volume. Raymarching also has some interesting properties. Because your distance functions are signed, you can compose them in different ways. Essentially painting and playing lego at the same time, but with equations. Some truly nerd stuff.

Just playing around for a bit with the mandelbox and the sphere I can:

So yeah, this is some super cool stuff. If you’re into this, I highly recommend checking out Inigo Quilez’ page on distance functions, where he talks a lot about this.

Well, and what about that box thing?

Well, the mandelbox is a weird fractal. It’s “made” by essentially folding space on top of itself, then reflecting it along spheres…. The idea sounds complicated but the code is quite simple. Like the Mandelbrot and the Mandelbulb, it’s an iterative fractal. You can decide how many times you do the operations and it changes how things look (and how expensive it is to generate it). Here is the entire distance estimator

float boxDist(const Vec3f& p)
{
    const Vec3f& offset = p;
    float dr = boxScale;
    Vec3f z{p};
    for (size_t n = 0; n < boxIterations; n++)
    {
        //a reflection
        z = z.clamp(-limit, limit)* 2.0f - z;
        //a sphere inversion
        float r2 = z.sqMod();
        if ( r2 < mR2)
        {
            // linear inner scaling
            float temp = (fR2/mR2);
            z *= temp;
            dz *= temp;
        }
        else if(r2<fR2)
        {
            // this is the actual sphere inversion
            float temp =(fR2/r2);
            z *= temp;
            dz*= temp;
        }
        // Scale & Translate
        z = z * boxScale + offset;  
        dr = dr * abs(boxScale) + 1.0f;
    }
    float r = z.mod();
    return r/std::abs(dr);
}

Other than a bunch of constants, there’s not really a lot happening in terms of code. The one thing to note is that this distance estimator is, sadly, not signed. It always returns a positive value, which is a bit of a bummer and does limit what we can do in terms of raymarching composition. It’s a fun thing to play around with the various parameters for the box. For every project, I’ve put all of them near the top of the main source file. The only thing you need to look elsewhere is the camera’s initial position. It’s a bit of a quick and dirty one but you can tweak the distance from origin in FracGenWindow.cpp, if you wish to look from further away

I based a lot of my initial work and adapted the distance estimator from a series of blog posts by a guy named Mikael Christensen. He goes much more in depth with the fractal stuff. Well worth checking out.

Also worth checking out for the mandelbox specifically is Tom Lowe’s site on it. Wikipedia says he ended up finding the fractal around 2010 and, regardless, there’s some good information there on it

FINALLY: On colouring the mandelbox and getting many pretty pictures

The code that colours the mandelboxes in toyBrot is pretty simple. Whereas the mandelbrot implementations depended only on the number of iterations before escape, for the mandelbox I use two parameters:

– The number of raymarching steps until a collision with the box

– The position of the point of collision

If the ray doesn’t touch the box I simply do a background colour. Otherwise, I determine the colour from Hue, Saturation and Value model. The hue is actually based off the Y coordinate of the point. For different colours I just tweak an offset and a multiplier (which makes the image have more or less different colours). Saturation is fixed to some arbitrary value which varies in different implementations and the Value is given by the number of steps taken. Because the ray has to step “more carefully” through crowded regions, this makes all the nooks and crannies darker, giving some automagic shadowing.

Parameters for colouring are also all defined near the top of files, so it’s easy to play around with them.

Cards on the table, numbers on the charts!

For the individual implementations I did the talking and background first and later talked about numbers. We’re going to do the opposite this time. And my times are split per hardware:

All the testing is done on a 1920X and I have both a Radeon VII and a Titan X (Maxwell) on my machine. Neither the CPU nor GPUs are overclocked for the baseline tests. I’ve ran every implementation on each hardware I could. Details further down.

The main event is the “runtime” which is how long the fractal took to be generated. The “setup” component is the time the constructor for the “FracGen” object takes. Normally this does inital setup such as loading code from disk (for OpenCL and Vulkan), picking out the right device, so and so forth….

On a properly designed/optmised program you’d do this only once, I do it every iteration. Speaking of which, all the benchmarked tests were averaged over 10 runs. The “error bars” you’ll see along them the standard deviation from the mean. Essentially how much variation you can expect from those values

And this is a reminder that you can click the images for full size if they’re a bit hard to read

Benchmarking charts

Edit note here:
For a while, OpenMP looked like it had not super stellar performance. By the time I wrote chapter #7 I figured this was cause by the lingering and unnecessary atomic<bool> for h0. Once I got rid of it, it’s pretty much on par with std::threads and std::async

Looking at them like this is interesting to see how they compare. When it comes to speed and their relative to the first place:

1920X shootout

std::threads + std::async -> 21.170s [+0%]
OpenCL (pocl) -> 24.894s [+17%]
OpenMP -> 26.225s [+ 23%]

Radeon VII shootout

HIP (aomp) -> 968ms [+0%]
Vulkan (amdvlk) -> 980ms [+1%]
hipSYCL -> 994ms [ +2%]
Vulkan (radv) -> 1034ms [+6%]
OpenCL -> 1153ms [+19%]

Titan X Shootout

OpenCL -> 1626ms [+0%]
Vulkan -> 1697ms [+4%]
HIP -> 1862ms [+14%]
CUDA (nvcc) -> 1965ms [+20%]
CUDA (clang) -> 2182ms [+34%]

Well, now… this paints an interesting picture in some situations. Let’s talk more about them, shall we? I’ll go through these in the other I originally published the relevant chapters

Conversion notes, considerations and curiosities

STL Threads and Tasks. Good old C++

Well, these two are pretty much undisputed champions in this round. 17% is a pretty significant gap to second place.

Implementing for STL was pretty straightforward. std::threads was quite simple with no big surprises and std::async was my very first implementation again, where I got to grasps with the raymarching algorithm and started experimenting.

To me it’s really good that STL is reigning supreme here, too. The more STL we have around, the better, though, std::threads IS rather barebones compared to other threading libraries

The one thing of note in the STL implementations is what I call the “oversubscription factor”. When I decide how many tasks/threads I’m going to spawn I do NOT just launch one per core. There’s actually gains to be had by oversubscribing. This could be due to how the main loop is organised. The way I do it is really the most immediately obvious way, with ONE optimisation.

For each line of pixels -> spawn a bunch of threads that calculate some of them

So maybe this is due to pixels taking different times, so some cpus could be idling, maybe…. Definitely not enough for kSysGuard to pick up any slack. And the optimisation is that instead of thread 0 picking pixels [0,width/numThreads], as in, each thread has a defined section of the line, the threads start together and then they walk +numThreads pixels on the line. So the whole batch moves together. The reason this is better is related to how CPUs preempt what data to fetch. I talk a bit more about this on the original STL chapter.

Okay, so something observable is that I get better performance by spwaning a much higher number of threads/tasks than I have cores. Mind you that while this code scales pretty much linearly with threads until you saturate your CPU, growth afterwards is nowhere as high and is somewhat logarithmic. While I don’t have a good explanation for this, it’s something I’ve noticed many years ago and it seems to still be true

The results in this graph are averaged from only three runs, instead of ten. I used to apply an arbitrary oversubscription factor of 4, but after making this graph I went back and changed it to 20 for the benchmark.

You may also notice that the times here are quite low, actually lower than the ones in the main chart.

My CPU still does the thing where it gets progressively worse as I run more and more times. It COULD be that it’s overheating or worrying about it, maybe it’s preventatively downclocking a bit because it keeps topped out for minutes. As mentined, I have it stock (could’ve played with OCing it over the holidays but was coding all of this stuff instead) and it’s watercooled with a quad rad for just itself and the motherboard’s mosfets so temperatures really rise up to around 55C and then my fans just say “lol, no more, thank you”.

Further investigation on this would be required, but it’s true of every CPU implementation. STL is the least affected and throughout the 10 runs I logged for the main chart, std::async’s runtime increased by 600ms and std::threads by 760ms.

OpenMP - Don't be too quick to look down on it

Once again if you’ve missed it, OpenMP’s performance was being excessively hurt by a lingering and unnecessary atomic that was a leftover from early code. Once that was removed, performance was on par with the std::async and std::threads. For more information, check here

So looking at the performance numbers, OpenMP doesn’t look too great.

The issues of the increasing runtime and oversubscription I mentioned in STL are also relevant here. OpenMP is the worse when it comes to increasing runtime. As implied by that massive error bar, the difference between runtime for the first and last run I logged for the chart is over 7 seconds.

Also when it comes to oversubscription, OpenMP behaves differently from the STL facilities. DO NOT oversubscribe OpenMP. I’ve actually removed the option at some point because it slows down to a crawl.

That said, OpenMP remains the super easiest way to parallelise some code. Really, if you’ve never done any of this and have no time to learn anything right now, just go OpenMP. One include, one pragma, two function calls in your regular for loop; you’re done, you’re going to get your massive speed gain from having more than one thread.

As mentioned before, OpenMP can target other things too. AMD’s current/new/future/something compiler for ROCm is called AOMP and it allegedly can deploy OpenMP for either AMD or nVidia GPUs. I say allegedly because I couldn’t get it really working, but it’s something I’d like to examine in the future

OpenCL - Portability is king

Me and OpenCL started off on somewhat of the wrong foot because I was annoyed at the setup overhead but the more I use it, the less I mind that. Once you’ve passed through that barrier, coding the openCL programs themselves is pretty easy. It’s got a good library, types that make sense… and, again, there is value in having your program code as a separate source that’s compiled on the fly. It’s my go to for fiddling with parameters. Plus, it runs on everything.

It was once the fastest on the red side but between me coding better the other things and some of the tech catching up, it’s lost some ground comparatively, but it stills rules nVidia, which is quite hilarious. And the CPU performance through pocl is not bad at all, actually beating OpenMP. So, yeah, I’m increasingly a fan. The caveats in the original post still apply but… dat portability…

It’s also notably got the longest setup times. if you add them up, it doesn’t look quite as good but it’s actually smarter than you’d expect, even with me telling it to load the file and compile the program on every run (they’re class variables, not statics). The first runtime is always quite higher and it lowers down later. If I kept it around, it would only apply to the initial run. Which is why though I time the setup, I don’t lump it in with the runtime for the graphs and all. I’m knowingly doing it wrong

CUDA - Don't embrace proprietary

So I’m a confessed AMD fanboy and I love dunking on nVidia (also Intel but they’re lately either very promising with their GPGPU stuff or kind of depressing with their CPUs so it’s a bit harder) which is why to me it’s somewhat hilarious that CUDA is the slowest implementation for nVidia GPUs.

Stay away from proprietary software, children.

Sadly my fanboy glee was not complete since CLANG ended up losing to NVCC, but I still liked playing with that. I also learned some interesting things about CUDA.

It’s still pretty cool that I came from OpenCL needing to have me translate types and whatnot to just have CUDA read my templates like a boss (after I decorated them and polluted code elsewhere with CUDA includes) which is super nice
CUDA cares more about workgroup structure than I first thought. Initially I had my thread blocks being one-dimensional for each line. I actually got a significant improvement in performance by “doing it right” with 16X16 blocks, which made its way into HIP as well.
Cuda actually has a nvstd::function type which is pretty much analogous to the STL equivalent. I found this really nice and I’m sure it can enable some awesome clever algorithms. It would be a shame if it was very slow making it really not recommended for usage where it’s not strictly necessary
When building CUDA with clang you can have your definitions for __host__ and __device__ as overloads, so you can separate them without resorting to preprocessor mess. It’s a real shame nvcc doesn’t do this. And that nvcc exists in general, tbh (ask your build friend. They’ll agree)

Something that’s less of a fanboy dunk and more of a “confused annoyance” is that CUDA is quite C… I mentioned this before but it REALLY irritated me this time around when I had to debug stuff and for that I had to write a function that calls and interprets cudaGetLastError() because… of course.

What makes it weirder to me is that my understanding is that a LOT of the work that went into the Vulkan C++ headers was contributed BY nVidia, and those are a blessing. But then CUDA, the thing they throw rivers of money in is all C….. eh… Another reason to avoid it when possible I guess =P

HIP - The other CUDA

So when it comes to build and code… HIP is still VERY MUCH “the other CUDA” so there’s not much there to talk about. Performance is getting really good though, but the cost is that ROCm is as messy as ever. Maybe even more so now.

The build side of ROCm is a nightmare. Both in building ROCm itself as well as building stuff with it. I decided to upgrade my install from ROCm 2.6 to ROCm 3.0. I eventually quit because the build was so broken that I have projects trying to clone dependencies with non-existing tags. I mean… if this is happening there can’t be any CI? And this was from me cloning the tag for the release. It’s all over the place…

My luck was that they have a new compiler now, AOMP, and the project for that is kind of a ROCm inside your ROCm, so it has HIP and hcc, and AOMP is a version of the old clang that was there (so the… fourth ROCm clang?). After some massaging I managed to get IT building, and this is what I’ve used. Once it’s up, I got a performance improvement from my previous install (no numbers, sry) but my environment is still all over the place.

Building HIP for nVidia was also a chore. CMake was just not calling the correct things so I had to copy the command lines and finish the build by hand to get some numbers. This is not a real world scenario. Bonus points for hipCC overriding my platform. “Can you please use nvcc? I want to deploy to nVidia” “Sure thin….wait a second, I see you have our clang. I’ma silently use that and ignore your request now”

And as a final note… it’s <current year argument> can we PLEASE stop using environment variables for everything. By all means have them there for defaults but it’s just completely backwards that you cannot tell your tool what to do. The command line is powerful and simple to use. HAVE YOUR TOOLS CONSUME ARGUMENTS.

SYCL - The future of heterogenous computing

The title for this section is really what I want to happen. As much as I gave a lot of praise to OpenCL this here is where I think everyone should be going towards. Both coders and vendors. I know Intel is making a massive push towards it and it actually excites me greatly for Xe. On the other hand, nVidia and AMD are more or less pretending this doesn’t exist.

The implementation of SYCL I use is hipSYCL. It’s a great project that’s been evolving fast. Worth checking out and worth contributing to. It implements SYCL on top of HIP for AMD and nVidia, and also integrates a CPU backend based on OpenMP but this one is more of a curiosity. Because of, I suspect, the OpenMP slowdown with oversubscription, it’s very slow. Which is why SYCL is not on the CPU chart. A different proprietary implementation, ComputeCpp, says it might be able to target my CPU through POCL, so I might give that a go at one point.

SYCL brings in the portability of OpenCL (well, it intends to by building on it and being managed by Khronos) but it’s the most C++ of all the remaining implementations (it’s really only matched by HC C++ which is now dead). There’s no doubt in my mind that in a future where SYCL is an option with comparable performance to the likes of CUDA or HIP, it is the best option.

But it still suffers from implementation woes. My CUDA version of the SYCL program was not running properly and I didn’t really have the time to figure out why. And it’s all growing pains of hipSYCL as a project that moves fast, chases its own goals together with the moving target that is ROCm all while being pretty small.

Vulkan - Flex your nerd muscles

Vulkan would be a third “code once, run everywhere” option (fourth if you count OpenMP). But I struggle to recommend it over… anything normally. Once you’ve got it set up, it’s not super hard to give maintenance. Most of the code in the C++ side now is the same as before. I really just had to tweak ins and outs. The place where I feel let down is that I code the compute shader in glsl and that’s pretty much remembering every line how much nicer it is to code for OpenCL.

The performance is great, though. And I’m curious as to how it’ll evolve. Really what Vulkan consumes is not the GLSL. It consumes an intermediary representation called SPIR (or SPIR-V if it’s supported). I use a tool that comes with the Vulkan toolkit to compiler my glsl code to SPIR, but maybe with better driver support, this could eventually be a richer standard with a more robust tool compiling some better code than glsl.

Not much to add on top of the original post

BONUS ROUND - extra charts

Before I go, I have a couple extras. First, I built toyBrot at work where my machine has a 1080Ti instead and the power of Pascal is scary.

CUDA is still slower than OpenCL or Vulkan though, even accounting for the setup. Luckily for my fanboy heart all the Turings were busy so my Radeon VII gets to not be crushed. But this is… threatening… time to remind myself that my Radeon VII has a waterblock on it

I was honestly surprised by how strongly it responded. Got me curious to do some rendering benchmarks though I know the difference will be nowhere near as significant.

it IS a pretty significant overclock though, enabled in no small part through the magic of waterblocks.

From a stock 1800MHz GPU clock and 1000MHz HBM clock, I take mine to 2125 MHz [+18%] and 1175MHz [+17%] respectively. And also push the power cap to 300W from the 250W stock

And it’s better than my alternative of sabotaging nVidia by switching erverything to double precision and watching my Titan struggle to keep up with my 1920X.

Parting words and relevant links

And there we go. It’s been a long time coming, but I’m super happy with the new code. Not only I’ve been having a lot of fun just playing around with it but it’s much more flexible it I need to tweak the runtime again. Still some improvements I could make, such as coding a more proper camera and moving the position to be with the other parameters.

Next on the list though, are a couple of entries by the Blue Team. I’ve been asked a long time ago to do tbb and ISPC from intel, so that’s what I’m going to look at. I’m dumping some links here related to raymarching and the mandelbox. For things related to the technology specifics, please refer to the actual post about it.

I’m going to take pretty much any opportunity I have to post links to hipSYCL
Part I of the blog by Mikael which I referred to heavily
Why haven’t you bookmarked Inigo Quilez’ page yet?
If you Google “shadertoy mandelbox” you get to see a bunch of different boxes by actual artists that are meant to be pretty and NOT slow
IQ + Shadertoy + Raymarching = “I kept saying maths was cool guise”
Final bonus: A GLSL signed distance function library

The Great Refactoring