If you want performance, maybe you should drop CUDA

12/Jan/2024

Way back, when I first got my Radeon VII, I made a post announcing that The Age of Radeon had begun. Those were good times, with that as my main GPU and an nVidia Titan X as a CUDA driver. In the beginning of last year I added a third card to the mix, an Intel Arc A770. Finally, I had RGB on my workstation!

Those times, sadly, came to an end last year. A power spike here killed two of my cards, the sole survivor being, of all things, the Titan. Had I been forced to choose one of those three cards to keep, it would have been my last pick but, alas, no choice was given. This is now the world I have to live in: down to a single old Maxwell

Despite nVidia’s drivers still seeming to do weird stuff in Wayland half of the time, at least there’s ONE working GPU. I won’t lie: for a while, I was afraid something less replaceable had been killed. If I had, say, lost my CPU, there’s just no way I could’ve gone and bought a new Threadripper (though I guess now that two of these cards are dead, I’m not using nearly as many lanes). But here we are; in this time of tribulations we must persevere

Pour one out for our lost friends

Funnily enough, thanks to the absolute godsend that is the Zenith’s PCIe slot switches, this is actually how things look at time of writing: the cards are still in there, because I’m in no hurry to redo that loop, even if it means the Titan is only running at x8. Those switches saved me; the thought of having to physically mess with these cards to try and figure out if any of them were alive ALMOST makes me regret watercooling (but then reason prevails)

So, with all of that done, it’s time to start 2024 with a big recalibration of expectations. Not just for the gaming that I do on this machine (though the vast majority of that happens on a very different computer in my living room) but also for compute workloads. So let’s collect our thoughts… and a bunch of numbers

toyBrot to the rescue once again!

People familiar with this blog, or with my blabbering outside of it, are probably familiar with toyBrot, my little Mandelbox Raymarching fractal generator project. It DOES have a lot of different implementations, including CUDA, and is a benchmark in and of itself that I can easily get to build in different environments, so it is the perfect tool for what I’m looking for here


That said, it IS important to keep in mind that toyBrot is a very peculiar workload. It doesn’t rely on any fancy libraries, the math it does IS pretty simple, and there are basically no memory constraints here. It is a “spherical cow in a perfect vacuum” as far as parallel workloads go

If you’re curious about toyBrot itself, you can check its homepage here or any number of Multi Your Threading posts, a series that is built on top of it and its multiple implementations.

All of the numbers here are taken running toyBrot with “default parameters”, which means its config will read something like the listing below. The colours will vary between different implementations but everything else will be the same. toyBrot doesn’t really have versions to stick to, but I did make sure to note down the commit at the time, so we can all know for sure what code actually ran: it was 83649c04694350ca66876ad396398ee7fb2d4865


All of the binaries were built in Release mode and CUDA-based implementations were built to specifically target sm_52
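In practice, “Release mode targeting sm_52” boils down to a CMake configure line along these lines. This is a sketch using the standard CMake options; toyBrot’s own CMakeLists may expose project-specific switches instead, so check there for the real ones:

```shell
# Illustrative configure/build invocation; -DCMAKE_CUDA_ARCHITECTURES=52
# is what ultimately hands nvcc its -arch=sm_52
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=52
cmake --build build
```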

// Camera information
float    cameraX          = 0.0
float    cameraY          = 0.0
float    cameraZ          = -3.8
float    targetX          = 0.0
float    targetY          = 0.0
float    targetZ          = 0.0
uint32_t width            = 1820
uint32_t height           = 980
float    near             = 0.1
float    fovY             = 45.0

// Colouring parameters 
float    hueFactor        = -60.0
int32_t  hueOffset        = 325
float    valueFactor      = 32.0
float    valueRange       = 1.0
float    valueClamp       = 0.9
float    satValue         = 0.8
float    bgRed            = 0.05
float    bgGreen          = 0.05
float    bgBlue           = 0.05
float    bgAlpha          = 1.0

// Raymarching parameters 
uint32_t maxRaySteps      = 7500
float    collisionMinDist = 0.00055

// Mandelbox parameters 
float    fixedRadiusSq    = 2.2
float    minRadiusSq      = 0.8
float    foldingLimit     = 1.45
float    boxScale         = -3.5
uint32_t boxIterations    = 30

toyBrot’s “default parameters” will generate a conf like this. Colouring will vary per implementation but the rest of the params will always be the same

Newer CUDA: Shiny Update or Planned Obsolescence?

So this was the first thing I wanted to know: was this newer CUDA intentionally crippling my card? Maxwell is rather old at this point, and nVidia is not at all shy about using CUDA to force older cards into obsolescence. Was this happening softly this time around? Let’s try and find out

All of the desktops in my house run Arch Linux (yes, even the gaming PC) so getting older CUDA running can be somewhat finicky. Luckily, containers exist, and nVidia provides containers specially crafted for building and deploying CUDA applications, so the first thing I did was check that out. I picked Rocky as my base distro of choice and found nVidia-provided containers for every minor version from 11.0 onwards

This was a good start, but I also wanted some even older results, so I complemented those using some AUR CUDA packages for Arch. Getting them working was a bit of a hassle due to also needing an older version of GCC, but it was well worth the effort. With a little scripting magic and a little patience, I DID end up gathering numbers for all of these CUDA versions, and what I saw was definitely interesting
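For reference, each container run boils down to something like the line below. The image tag and the build commands are illustrative (nVidia publishes `-devel` images per CUDA minor version on Docker Hub; pick the tag matching the version you want), and `--gpus all` requires the NVIDIA Container Toolkit to be set up on the host:

```shell
# Build toyBrot inside an nVidia-provided CUDA devel container (Rocky-based)
# Tag and paths are illustrative; swap in the CUDA minor version under test
docker run --rm --gpus all -v "$PWD:/src" -w /src \
    nvidia/cuda:11.2.2-devel-rockylinux8 \
    bash -c 'cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build'
```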

Additionally, while I’m looking at benchmarking CUDA, I might as well come back to an old question: NVCC vs CLANG.


If you’re not aware, it turns out that CLANG has been able to target CUDA for quite a while now so, instead of NVCC, you COULD just use that instead. The results have differed as both compilers evolved separately, so it’s a good opportunity to add that into the mix. Each version of CUDA comes with its own bundled version of NVCC, but I’m using the same version of CLANG for all of these; the difference will be in the CUDA libraries themselves
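In case you’ve never tried it, compiling CUDA with CLANG is a single invocation; the flags are from LLVM’s “Compiling CUDA with clang” documentation. The source file name here is made up, and the CUDA paths assume Arch’s /opt/cuda layout, so adjust both:

```shell
# clang++ compiles .cu files directly; point it at the CUDA toolkit to use
# (file name is hypothetical, paths assume Arch's /opt/cuda)
clang++ -O3 --cuda-gpu-arch=sm_52 --cuda-path=/opt/cuda \
    raymarcher.cu -o toybrot-cuda \
    -L/opt/cuda/lib64 -lcudart
```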

So the first thing of note here is that there DOES indeed seem to have been a pretty significant performance degradation with CUDA 11; CUDA 11.2 in particular saw a massive performance penalty with NVCC. So, at least for Maxwell, if you do need to use CUDA, you might be WAY better off sticking with CUDA 10 if you can, or at least 11.0/11.1. Once again, a reminder that the CUDA 10 versions listed were running natively whereas the others were built and run on nVidia Rocky 8 containers

It does make me wish I had some more GPU families to test on, but for now it’s just pretty clear that you can’t trust nVidia not to cripple your card’s performance once it’s no longer the current shiny thing. There’s no telling from this data whether that’s any sort of “intentional malice”, mind you; it could simply be that this card is already too old for nVidia to care

But it’s also REALLY surprising that even the best results from NVCC just cannot beat CLANG16. This is a change from my old numbers: back in the day, CLANG couldn’t quite keep up with NVCC and, to this day, it DOES present some additional hassle when it comes to building. Also keep in mind that the version of CUDA officially supported in CLANG lags behind a little, as that support is a community effort rather than something nVidia maintains, so there could be issues if you rely on more cutting-edge CUDA features. DO try out CLANG if you can, but also remember to check for those potential pitfalls if you do

When it comes to performance across versions, CLANG DOES suffer some influence; you CAN see more or less the same shape in both graphs IF you discount the major performance loss that NVCC suffered with CUDA 11.2. But even the best NVCC numbers, from the CUDA 10 era, can only really beat the absolute worst cases for CLANG16; head to head, it’s not better even once

Dedicated compute vs General use

So this is a good start but even those GOOD numbers are still SO FAR OFF the ~2 seconds this used to take me that, unless the driver itself has also taken a bat to Maxwell’s knees, there has to be something else. And not even with all my spite for nVidia would I expect them to cripple a card through drivers like that. So we need to dig deeper, and there IS something else

Previously, all of my graphical load used to be on the Radeon VII. That was the card X11/Wayland were using to draw all of the things, except when I explicitly told something to use a different device via Vulkan or whatnot. The Titan wasn’t even driving ANY of the displays (4 monitors and a Vive headset) and it now drives ALL of them

During all of this CUDA testing, the computer wasn’t DOING much but, you know… Plasma was there, Vivaldi was open, drawing my Mastodon feed. These were non-graphical runs of toyBrot, but the card WAS running some OpenGL in the background. But this is Linux; we don’t HAVE to do that if we don’t want to. So let’s check that out: let’s reboot this baby into runlevel 3 and see if we get different numbers for ALL of these things

If you’re not a Linux person, “runlevel 3” broadly means that your system startup does everything up UNTIL it would start getting all of your graphical systems properly initialised. You get just the barebones stuff for terminal control (I don’t even get my two main monitors up, it’s quite funny) and off you go
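On a systemd distro like Arch there isn’t literally a “runlevel” any more; the equivalent is the multi-user target. A couple of standard ways to get there (these are stock systemctl commands, nothing toyBrot-specific):

```shell
# One-off: drop the graphical session on a running system
sudo systemctl isolate multi-user.target

# Or make text-only the default and reboot (undo with graphical.target)
sudo systemctl set-default multi-user.target

# Or, for a single boot, append this to the kernel command line:
#   systemd.unit=multi-user.target
```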

So these charts have more or less the same shape at first glance, except the error bars are super tiny now, with individual runs having much less variance. But there are TWO interesting details

So this time around, before “the fall of NVCC” in CUDA 11.2, it WAS actually faster than CLANG16. These are averaged over 20 runs for each result, so it’s not really a one-off chance outlier. Comparing the performance of both compilers in each run, we can get a measurement of how much faster one or the other is

For this one, anything below the first horizontal line means NVCC is faster, and the higher up the line goes, the slower NVCC is in comparison to CLANG16. The blue line is for numbers in a graphical environment and the pink line for a text-only runlevel 3 environment. Starting with CUDA 11.2 we see that this becomes significant, with NVCC being between 17 and 35% slower than CLANG. That’s a lot of performance being left on the table by just using nVidia’s official super duper compiler, optimising specifically for the one video card being targeted

But ALSO, though both the runtime graphs have the same shape, if you look closely at the vertical axis on each, you may note that the numbers on the runlevel 3 one are about half of those in the previous graph. This was it, I FOUND THE THING. Turns out that driving graphics just REALLY tanks the performance of this card, even when it’s not doing much. While it IS understandable that there is an impact, I never expected it to be THIS much on a mostly idle desktop. Running nvidia-smi right now, as I edit this post with a YouTube video in the background and whatnot… I get about 10% GPU-Util. Certainly not something I’d expect to give me numbers THIS drastically different
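If you want to watch that utilisation figure yourself, nvidia-smi can poll it for you; these are its standard query flags:

```shell
# Print GPU utilisation and memory use once per second (Ctrl+C to stop)
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```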

And by looking at the relative performance graph, we see that though both lines have the same overall shape, they ARE offset. This means that using NVCC also incurs a HIGHER performance impact from running graphics on the same card. It always loses out to CLANG, but when there’s graphics involved, it loses harder. So what if we put all of this data in the one chart and see how much performance we’re actually losing? We’ve looked at bits of data in isolation, but what’s the big picture? Well, let’s do just that

With everything together it IS a bit of a tricky chart

All right, so let’s unpack all of this. Green data represents NVCC, yellow data represents CLANG, same as before. The vertical bars are the averaged runtime for that CUDA version, and are measured to the numbers on the left. Dark colours are from graphical environment runs and bright colours are taken from Runlevel 3 runs

The two lines near the top are the percentage increase in the results from having a graphical environment running, measured against the right vertical axis. So, for example, in CUDA 10.0, NVCC took around 3200ms on average, and something closer to 1800ms in a text-only environment. The actual numbers meant an increase of 84% in the average runtime; this is the “graphics penalty” compared to having the card run dedicated compute
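That penalty is just the ratio of the two averages, expressed as a percentage. Plugging in the rounded figures quoted above (placeholders, not the exact measurements, which is why this lands near 78% rather than the 84% computed from the real averages):

```shell
# "Graphics penalty": how much slower the graphical-environment run is,
# as a percentage over the runlevel-3 baseline (rounded example numbers)
gui_ms=3200
cli_ms=1800
awk -v gui="$gui_ms" -v cli="$cli_ms" \
    'BEGIN { printf "%.1f%%\n", (gui / cli - 1) * 100 }'
```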

And here we can see in more detail what the offset in the relative performance chart meant: all CUDA versions and compilers suffer a huge performance penalty from running in a graphical environment, but the loss is higher across all CUDA versions if you’re building with NVCC. This is true even in CUDA 10, where the actual results for both compilers are very, very similar. If your card is running any sort of graphics, even if it’s not doing much at all, CUDA just tanks

But maybe this is unfair to CUDA. Despite that disparity between compilers, this could very well be a result of just a driver quirk and, luckily for everyone, there IS more to nVidia cards than running CUDA. So let’s look at some more things!

Life beyond CUDA

For this next set of tests, let’s cut straight to the chase. We know to expect a performance impact from running a desktop environment, so let’s dive straight in and get all the numbers. Though we know it’s not the ideal situation for CUDA when it comes to comparisons, let’s level the playing field and take CUDA as it is TODAY, as of January 2020 NG+4. That means 12.3.103 right now, running natively on my Arch desktop, no more container shenanigans

Besides CUDA 12.3 built with both NVCC and CLANG16, I’m also bringing in both old and new faces. OpenGL compute shaders, Vulkan compute (with 3 different shaders) and OpenCL are all making their triumphant return. Additionally, I’m bringing in two SYCL implementations: Intel’s OneAPI, with the nVidia plugin from Codeplay, and AdaptiveCpp, the project briefly known as OpenSYCL, previously known as hipSYCL

Of note, AdaptiveCpp has come SO FAR since I last gave it any serious look that, honestly, it’s an entirely different beast. It deserves a close look all on its own. For today, though, what I have is this: two different versions of it building SYCL code (because yeah, it’s more than just a SYCL implementation these days). One of them uses their “generic backend” which, as I understand it, does on-the-fly compilation for backends it finds at runtime; the other uses its CUDA backend which, as implied, builds targeting the CUDA runtime specifically (relying, I believe, on CLANG’s CUDA support). Hopefully I didn’t get much wrong here =P

All right, caveats and details out of the way, let’s see some numbers!

I’ve ordered these categories by their “general use desktop” performance, which is tailed by none other than… NVCC. If you want performance out of your nVidia card, that is literally the worst option. CUDA with CLANG only manages to beat AdaptiveCpp’s “generic” flavour. Maxwell is actually not officially supported by Acpp; I was talking to Aksel while I was playing around and he wasn’t even sure it was going to work. Generic DOES complain a lot about this architecture, but… it DOES work, at least for a workload as simple as this, and even with all of these caveats it’s considerably faster than NVCC. Targeted CUDA Acpp beats both NVCC and CLANG here, which goes to show how much work has gone into the project

With graphics out of the way, CLANG takes the crown among these, followed by Acpp’s CUDA backend, NVCC and Acpp-generic. But they also enter Vulkan territory. Vulkan shows some difference between different shaders: glsl and hlsl are compiled with glslang, from the standard tools, while clspv is a tool that generates SPIR-V for Vulkan from OpenCL kernels. They all hover around the same general performance, with a consistent ordering between them… but also next to no difference when it comes to graphical vs CLI

If what causes the GFX impact on the other implementations is a driver quirk, Vulkan bypasses it entirely. I DID try to ensure the graphical workload was consistent, with all runs having the same things open while the computer looped the same YouTube video. So this is just a Vulkan thing, apparently. Performance-wise this makes Vulkan a curious choice, as in a graphical environment it only lost to OpenGL, which is somehow the king of desktop compute here? It’s the exact same glsl shader, too. But OpenGL has its own restrictions, including REQUIRING a graphical environment, even if it IS faster than even the CLI CUDA numbers here

The two remaining implementations, OneAPI and OpenCL, tell similar and impressive stories. Fantastic performance, the best ones in a pure dedicated compute environment, and still impressive results even with high graphical environment penalties. OpenCL in particular, though losing out to Vulkan and OpenGL inside graphical environments, remains the absolute king of performance when you unleash everyone. I still hope that something like SYCL takes over the world but, until such a time, if it ever comes, it’s REALLY hard not to recommend OpenCL as the compute framework of choice, considering it too has implementations for a variety of vendors and platforms

With all those raw numbers, we can look at the graphical impact for all these implementations

Once again, CUDA is king of the losers here, though followed closely by OpenCL, a result that looks really bad until you remember that even with that steep penalty, OpenCL’s base performance is so high it only loses to Vulkan and OpenGL in a graphical environment. Both AdaptiveCpp targets are the most consistent implementations outside of Vulkan which…

Many numbers, many lessons

This whole thing was a benchmarking adventure all on its own but, now that we have all of these numbers and all of these charts, what can we even take from them? Was the real performance the friends we made along the way? Kinda

I feel it’s important that I reiterate that toyBrot is a very specific workload. There is more to heterogeneous computing once you need to put actually complicated programs together, including supporting libraries and whatnot…. WITH THIS DISCLAIMER OUT OF THE WAY, THOUGH

Maybe it’s time that everyone who cares about performance on their nVidia cards stops listening to nVidia. CUDA is a backend used by many different projects but, on its own, just using it as presented, the numbers are simply awful, with NVCC in particular giving THE WORST NUMBERS you can get from an nVidia card. nVidia’s own fancy compiler is literally your worst option almost always

If you’re writing something on your own, doing so using CUDA is shooting yourself in the foot. You’re vendor-locking yourself AND getting worse performance than you would get using… anything else, except maybe HIP (AMD’s own CUDA), which I haven’t tested this time around. HIP DOES default to building CUDA with CLANG if I’m not mistaken, though, in which case I’d expect it to also trounce NVCC. If whatever you’re looking for has an OpenCL version that isn’t just poorly written, I’d DEFINITELY check that out; there’s a good chance it would perform WAY better on your nVidia card than an “nVidia approved” CUDA implementation

nVidia DOES put an unfortunate amount of effort into CUDA (as opposed to any open standard) so the ecosystem around it IS really good, but you’re also losing portability. Say, for example, that when writing your new program you decide to use SYCL instead of CUDA and build it with either OneAPI or AdaptiveCpp. Let’s also imagine you want to run this stuff on the cloud. If everyone decides to drink the hype and nVidia’s cards suddenly cost twice as much to run, you CAN just target something cheaper. You might not even have to recompile your program, depending on what you’re using. But if you were on CUDA? Yeah, tough luck, guess you too are paying double now

Also DEFINITELY worth noticing is Vulkan seemingly bypassing the whole “what if graphics” issue. The performance doesn’t look super impressive when you compare it to the other dedicated compute numbers, but it’s beating at least NVCC. Once the presence of a graphical environment comes into play, though, it suddenly becomes incredible. For situations where you want to add heterogeneous compute to graphical applications, it certainly is something to consider, though it comes with the caveat of being the worst stack to work with, at least if you’re doing everything manually. When I was working on this Vulkan implementation I was of the opinion that it kind of needs some good wrappers to alleviate all of the boilerplate and setup you need even for basic stuff, and I’d say that is still the case, though I’ve been away for quite a while and simply don’t know what, if anything, actually IS available

And let’s not forget OpenGL: if your application is NECESSARILY graphical and, perhaps, already running on some OpenGL backend, then the numbers are quite impressive, though I expect it to lack a lot of the facilities one would want when working on actually complex projects

What's next?

Well, first and foremost, happy 2020 NG+4. Opening up the year with some… strong opinions, it seems. A new month means there’s a new Godot Mini to work on. This time it’s almost certainly going to be about using GDExtension to integrate C++ into Godot projects, and almost certainly toyBrot is going to be the means here once again. But I also DO want to look more into UI, this time scaling and resolution sorts of shenanigans; there IS a chance I change my mind and swap these around, though I don’t expect to. The next post here, though, is probably going to be a follow-up to this one

I mentioned I have a gaming computer as well, separate from the one I work on. That PC has a Radeon 6900XT and I DID get curious to see if that card responds differently to the presence or absence of a graphical environment so I might have a look. Other than that, I’m finally back to a less exclusive schedule, which means I’m back, for example, to working on Cartomancer for that sorely needed 1.1/1,1. We’ll see how that goes

But now, getting into a better balanced schedule, I may also try to default to a post every other week instead. For a while I just had a LOT to suddenly post about and I was a bit more free-form with my time, but the reality is these posts take a LONG time to write and put together and whatnot. And that’s not even talking about getting the data and making the charts, etc… no. Just writing the post and choosing the pictures, structuring things… all of this takes an amount of effort that’s a bit hostile to a weekly schedule, especially since I also need to be doing things worth talking about in the first place =)


So expect the next post in a couple of weeks, probably that follow up

Or follow this blog : @tgr@vilelasagna.ddns.net