nVidia performance round 2: Ampere, Wayland vs X11, AMD Navi

19/Jan/2024

All right! So last week I decided to put my Titan X through the wringer and find out what’s up with compute performance on nVidia. If you haven’t read that post, I STRONGLY recommend you check it out, as it has all the context and conclusions and whatnot.

Overall, I found that CUDA was very disappointing compared to the other available implementations, that newer CUDA versions had WORSE performance than older ones, and that having your card also running graphics had a disproportionate impact on compute performance. But that was all on a Maxwell card, which is fairly old at this point. As I put all of those numbers and charts together, I kept thinking about how those would look on a more modern card, and whether that sort of graphics penalty was as heavy on the AMD side as well.

So I went and got EVEN MORE NUMBERS. This is going to be a quick-fire run through these additional things I’ve tested and how they compare to my initial set of numbers, so let’s get going!

Move over, grandpa Max, 3050 is bringing Ampere to the table!

So all of my testing for the previous post was on my workstation, the Threadripper machine with that one Maxwell Titan, but I DO have other computers in the house, and there’s this one ROG G15 laptop that has a 3050 Mobile in it. While it IS a laptop card and whatnot, it’s also much newer, and I was curious to find out how those CUDA version benchmarks would behave on it. I tweaked everything to build for sm_86 as opposed to sm_52 (that’s how far these cards are apart….) and let her rip.
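For reference, that tweak amounts to swapping the target architecture in the build flags. The file and output names below are placeholders, not toyBrot’s actual targets:

```shell
# Maxwell (Titan X) builds targeted sm_52; the 3050 Mobile wants sm_86.
# "kernel.cu" and "fracgen-cuda" are stand-in names, not toyBrot's real ones.
nvcc -arch=sm_86 kernel.cu -o fracgen-cuda

# Clang spells the same option differently:
clang++ --cuda-gpu-arch=sm_86 kernel.cu -o fracgen-cuda -lcudart
```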

But before I get to the numbers, there are a couple of details to keep in mind:

The first detail is that with Ampere being pretty new still, it hasn’t been explicitly supported in CUDA for very long. I wanted to keep things as internally consistent as I could, so I decided to only work with the versions that explicitly support sm_86. That means my charts start at CUDA 11.1.

The second detail is that with this being a laptop, the graphics environment actually runs on an AMD iGPU instead. This means that in these results, the nVidia card is ALWAYS dedicated compute; for the OpenGL one I actually have to launch it through prime-run. Honestly, going through the pain of disabling the iGPU to get runs where the card is shared was just too much of a hassle to go through. Especially since I’d need to put it back lest my wife suddenly find her laptop’s battery life just gone.
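If you haven’t come across it, prime-run is a tiny wrapper script shipped by Arch’s nvidia-prime package; the binary name here is a placeholder:

```shell
# Offload a single program to the nVidia card while the iGPU keeps
# driving the desktop ("toybrot-gl" is a stand-in binary name):
prime-run ./toybrot-gl

# prime-run is just a wrapper around NVIDIA's render-offload variables:
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia ./toybrot-gl
```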

For convenience, I’ve kept the runtime scale the same so it’s easier to compare to the Maxwell’s numbers, which I’m reposting here for a side-by-side. The thing I wanted to see is whether the graphs would have more or less the same shape. You can click each individual image to enlarge it.

3050 Mobile
Titan X (Maxwell)

So one interesting thing is CUDA 11.7 specifically. On the Maxwell it was abnormally bad, but for Ampere it was actually a bit better. After some early teething in the first couple of versions, and 11.7 aside, the trend looks more or less the same overall for NVCC: 11.2 marked a performance loss that NVCC never recovered from. CLANG, instead, has been consistently better and generally stable. There HAVE been some steady performance gains with newer CUDA on Ampere, actually, which is not really the case with Maxwell.

With the graphs side by side, you can get a glimpse of how the mobile version of a lower-end card is a fair amount more powerful than this almost-top-of-the-line card from a few generations ago (the 980Ti outperformed it slightly, though it had less VRAM).

Comparing the NVCC vs CLANG performance, we can also see that NVCC is less bad on Ampere, even if CLANG 16 is still the way to go.

3050 Mobile
Titan X (Maxwell)

I guess between all of this, it’s pretty safe to say that nVidia hasn’t been softly killing Maxwell through performance degradation. A low bar to clear, perhaps, but good reassurance nonetheless. We’ll just have to wait until they hard kill it by making CUDA refuse to run on it, something they actually do.

In addition to all of these CUDA versions, I also did an overall shootout with a bunch of different things.

So, keeping in mind that these bars are to be compared against the pink set of bars from the Titan, there are no real surprises here, except perhaps for clspv beating out the GLSL-compiled SPIR-V for Vulkan, but they’re all pretty close. The numbers are generally smaller than the Titan’s, OpenGL ran through PRIME, and the overall shape is the same, so the different implementations are bringing out more or less the same performance from both Maxwell and Ampere.

This boring result is what I’d expect and WANT from well-established compute stacks, so this is actually pretty good. What this means is that unless you NEED something specific, like the specialised components in newer cards for ray tracing or machine learning (what they’re actually there for), then you don’t really need to worry about which family your card is from, just whether it has the performance and RAM you need. It pains me to have to say this, but: good job here, nVidia. Two fairly different cards and no “weird results” from either.

nVidia loses HOW MUCH performance to running graphics? Is AMD just as bad?

So we’ve moved from my workstation, with the Threadripper 1920X and the Maxwell Titan, to the laptop, with a Ryzen 4800H and a GeForce 3050 Mobile, but I’ve alluded to yet another computer, and that one has an AMD GPU that was not killed by a power spike! So let’s do some compute on my gaming PC too. That one has a Ryzen 3600 and a Radeon 6900XT. And yes, the gaming computer also runs Arch just the same; it doesn’t even have Windows installed (neither does the laptop).

As expected from “the gaming machine”, the GPU in this computer is… considerably stronger than anything else in the house. A standard toyBrot run on HIP gives me numbers like this:

Iteration 0 took 334 milliseconds. Setup time: 388ms
Iteration 1 took 333 milliseconds. Setup time: 3ms
Iteration 2 took 327 milliseconds. Setup time: 0ms
Iteration 3 took 326 milliseconds. Setup time: 0ms
Iteration 4 took 330 milliseconds. Setup time: 0ms
Iteration 5 took 328 milliseconds. Setup time: 0ms
Iteration 6 took 326 milliseconds. Setup time: 0ms
Iteration 7 took 326 milliseconds. Setup time: 0ms
Iteration 8 took 329 milliseconds. Setup time: 0ms
Iteration 9 took 332 milliseconds. Setup time: 0ms
Iteration 10 took 327 milliseconds. Setup time: 0ms
Iteration 11 took 334 milliseconds. Setup time: 0ms
Iteration 12 took 332 milliseconds. Setup time: 0ms
Iteration 13 took 328 milliseconds. Setup time: 0ms
Iteration 14 took 326 milliseconds. Setup time: 0ms
Iteration 15 took 330 milliseconds. Setup time: 0ms
Iteration 16 took 327 milliseconds. Setup time: 0ms
Iteration 17 took 329 milliseconds. Setup time: 0ms
Iteration 18 took 332 milliseconds. Setup time: 0ms
Iteration 19 took 330 milliseconds. Setup time: 0ms

Average time of 329 milliseconds (over 20 tests)
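In case you want to sanity-check the log yourself, the reported average is just the mean of the 20 iteration times (setup time excluded):

```shell
# Mean of the 20 iteration times from the log above; setup times don't count
times="334 333 327 326 330 328 326 326 329 332 327 334 332 328 326 330 327 329 332 330"
avg=$(printf '%s\n' $times | awk '{ sum += $1 } END { printf "%d", sum / NR }')
echo "Average time of ${avg} milliseconds (over 20 tests)"
```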

But luckily, there are quite a few ways to manipulate toyBrot into being arbitrarily more costly, so in order to exaggerate any variance, for this card only, I ran every test with -d 5000 5000, which means they generated a much larger image and, as such, had to calculate a lot more pixels.

Raw, it’s about 17 MB of PNG, if you’re curious.

On this machine I couldn’t quite get either oneAPI or AdaptiveCpp to work. I think they were both struggling with ROCm, but I can’t know for sure. It’s installed from the Arch repositories, and that was a rabbit hole I elected to stay out of, in a surprising show of restraint. On the other hand, though, there WERE three different Vulkans to make up for having zero SYCLs:

RADV is the open source Vulkan implementation from Mesa

AMDVLK is the one that AMD themselves ship

AMDGPU-PRO is the one AMD ships as part of their professional graphics stack, which is built on top of the open source Mesa drivers, though you CAN use just the Vulkan driver itself (which is what I do here)

If you’re gaming, RADV seems to give the better performance most of the time these days, though sometimes AMDVLK can be slightly better. Sometimes, also, one or the other doesn’t quite work. AMDGPU-PRO, from memory, seems to have similar compatibility to AMDVLK, but its performance is very slightly behind. I mostly just use RADV unless something goes wrong, but let’s see if they give us any difference in compute.
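For the record, the way to switch between them is by pointing the Vulkan loader at a specific driver manifest. The manifest paths are the ones from my Arch install and may differ elsewhere; the binary name is a placeholder:

```shell
# Pick a specific Vulkan driver via the loader's ICD override.
# Paths as found on Arch; "toybrot-vulkan" is a stand-in binary name.
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json ./toybrot-vulkan  # RADV
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json         ./toybrot-vulkan  # AMDVLK
```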

Since the question here is graphics, I did a similar thing to what I did for the Titan: one set of runs in the graphical environment, which here is all Wayland, GDM and Plasma, and an additional set booting the PC into runlevel 3. For the graphical runs I DID have a different YouTube video running and rendering on screen, just to have some minor background load on the card.
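On a systemd distro like Arch, “runlevel 3” translates to the multi-user target, so the non-graphical runs are just a matter of:

```shell
# Boot into a plain multi-user (non-graphical) target for the no-GUI runs...
sudo systemctl set-default multi-user.target && sudo reboot

# ...and put the graphical target back once the numbers are in:
sudo systemctl set-default graphical.target && sudo reboot
```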

No OpenGL numbers

So we can see that this time, the graphical impact is MUCH smaller. Honestly, this is how I expected the Titan to behave, if the card had around 10% background load. But the real surprise here is that AMDGPU-PRO is actually the best of the Vulkans for compute, with RADV being the worst. I sincerely didn’t expect that, but I’d say that “the ‘pro graphics’ driver gives the best general compute performance” is a pretty solid win for AMD. This IS the area where that implementation should focus. OpenCL remains the champion, followed by HIP.

A graphics penalty of between 0 (okay, OpenCL) and 8% means that, unlike with nVidia, you don’t really need to worry as much about having a graphical environment running. While, yes, this is a tremendously more powerful card, I don’t think that alone “explains” my CUDA numbers on the Titan being twice as bad.
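To be clear about what “penalty” means here, it’s just the relative slowdown of the desktop runs versus the runlevel 3 runs. The two timings below are made-up round numbers for illustration, not values from the spreadsheet:

```shell
# Hypothetical timings: 324 ms on a bare TTY vs 350 ms with the desktop up
penalty=$(awk 'BEGIN { tty = 324; gui = 350; printf "%.1f", (gui - tty) / tty * 100 }')
echo "graphics penalty: ${penalty}%"
```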

AMD seems to still be in the same place it was back when I was doing shootouts on my Radeon VII. The card, this time a 6900XT, is a beast, the drivers are solid, and ROCm is a mess that keeps AMD from just straight up being the best option by far. In a hypothetical world where ROCm was as easy to set up and use as CUDA, I can’t really imagine why people would even bother with nVidia, especially on Linux.

Wait, so you specified Wayland for the Navi, what about the Maxwell?

As I was putting all of this together, this was the last remaining doubt I had: if the graphical environment has this much of an impact, could there be any difference between Wayland and X11? I used to run the same pure Wayland setup on the workstation, but ever since I was locked into having to use an nVidia card, some small weirdnesses have had me using X11 most of the time on this machine, and that was the case for all the previous numbers.

In order to be doubly certain, though, I’ve gone and taken a new set of numbers on the Titan. This time I didn’t bother with CUDA versions or containers or CLI… this is just a quick shootout, one final chart, this time with 4 sets of numbers.

I specify the display manager because I moved to GDM to avoid having my bootloader fire up an X server, whereas I used to use SDDM before (time to double-check how that Wayland support is coming along, I guess). So for these benchmarks I’ve loaded Plasma, with both X11 and Wayland, through either GDM or SDDM (using default X11), just to check if there were any discrepancies or trends. Numbers for the Maxwell only!

This is too crowded for numbers, if you want the specifics, check out the spreadsheet itself

So each set of numbers seems to be very close, which is good news. Where there IS any difference, it seems that, in general, Wayland has an advantage. Running Plasma with Wayland is pretty much always better, but the GDM vs SDDM picture is not as clear. All in all, it seems that even with nVidia, perhaps you should prefer Wayland, but don’t get too stressed if you’re on X11 for whatever reason.

The numbers in this chart are all new runs, and the only things being drawn on the screens were yakuake running the tests (always with no GUI, except for OpenGL) and Vivaldi, this time drawing Mastodon on one monitor. I made sure not to touch the machine while things were running.

Loads of new numbers; any new lessons?

Well, with this being mostly a follow-up, the spiciest stuff was mostly already out of the way, but it was still interesting to go through all of this.

The results on the 3050 were satisfyingly predictable for the most part. The card DOES get treated a little better by CUDA than the old Maxwell, but the difference is not particularly significant. In raw performance it beats the old card, but generally all the charts have more or less the same shape. Having the nVidia card not drawing your desktop and whatnot seems to be extremely helpful.

Checking an AMD card, the graphical environment impact was more in line with what I expected from the light added load of just drawing a desktop. It is a shame that ROCm still has a fair amount of rough edges when essentially everything else about the card is really good.

And coming back to the Titan to double-check whether there was a difference between X11 and Wayland did end up surprising me. It’s no secret that nVidia’s drivers still have some rough edges when it comes to Wayland, but it still seemed to generally get slightly better compute performance. Nothing significant enough to make this a major decision point, which is ultimately good news; this is not something you want to see having a significant impact on your compute performance, it’s not even the actual graphics drawing.

So what's next?

While there’s still some more information one could get here, personally I’ve got what I wanted for the time being. The whole thing was pretty interesting. Maybe the most interesting takeaway is the graphics impact on nVidia. If you DO need to do some heavy compute a lot of the time and you have an iGPU or a cheap old card you could throw in, it might be worth using that as your primary driver, leaving the nVidia card as a secondary heavy hitter, like in a laptop. Originally this Titan was exactly that, with the Radeon VII driving the main graphics, and the performance difference was astounding. The discrepancy between what I USED to get from the Titan then and what I get NOW that it’s the sole (working) video card in the computer is what prompted me to look into all of this in the first place (it’s still in the same physical x8 slot).

For the blog, I’ve been looking at GDExtension for my January Godot Mini Monthly and this is probably what’s coming so get ready for some C++ in Godot, I guess

Until then, hopefully all of this helped you not take the numbers your GPU is throwing your way at face value. There COULD be a LOT of performance just left on the table there. For me, this is tickling me right in the confirmation bias that CUDA needs to just stop =3

Or follow this blog: @tgr@vilelasagna.ddns.net