Multi Your Threading #9: Back to the classics with Khronos

Look at me copypasting the same thing again!

This is chapter 9 in a series I’ve been doing on multithreading and heterogeneous computing. Each chapter (generally) is based on me reimplementing the same code using different languages/technologies/models and talking about them both in isolation as well as in comparison to one another. While it’s not necessary that you’ve been following, I do assume you’re familiar with parallelisation in general and I’ll make the occasional references to other implementations. You can find them here if you’d like to check them out:

Also: All the code here is from ToyBrot. If you prefer to see the whole picture or browse on your own as you read, feel free to clone the repo and open it up in your favourite code editor (maybe on a second screen if you have one). I personally am a big fan of Atom.

Hello again!

Previously, on Multi Your Threading, I talked about my experience using emscripten to compile and deploy toyBrot to a web environment. That was all very interesting and useful. However, one thing that is very obvious from all the previous toyBrot work is that GPUs are pretty great for a lot of computing work. Especially with how fiddly CPU multithreading can be on the web, having access to a GPU would be really awesome. But we have a different problem there: how do we even access a GPU?

In a way, every application running in a browser is “sandboxed” by said browser, and you need the browser to implement the APIs you’d use to talk to stuff. So far, to talk with GPUs, we’ve used CUDA, HIP, HC C++, SYCL, Vulkan and OpenCL. Now none of those really work in a browser (unless I’m missing something). There was a WebCL initiative by Khronos, which I would be very very interested in, but apparently it didn’t really go anywhere and never managed to get traction. So we are going to need something else, and it looks like we’ll need to go back to the classics with good old OpenGL.

One daddy to rule them all: Open Graphics Library

According to the official website:

“OpenGL is the premier environment for developing portable, interactive 2D and 3D graphics applications. Since its introduction in 1992, OpenGL has become the industry’s most widely used and supported 2D and 3D graphics application programming interface (API), bringing thousands of applications to a wide variety of computer platforms. OpenGL fosters innovation and speeds application development by incorporating a broad set of rendering, texture mapping, special effects, and other powerful visualization functions. Developers can leverage the power of OpenGL across all popular desktop and workstation platforms, ensuring wide application deployment.”

From the very first time I talked about coding for your GPU there were a few concepts which I’ve made an effort to explain. First that your GPU is really almost like a separate computer in itself. Second that programming for your GPU is not quite the same as programming for your CPU, not only because they work in some different ways, but also because GPUs have their own architectures (i.e: you don’t normally just compile some assembly-type thing and send it to them like you do for your CPU) and you have to do all your talking mediated by drivers.

These issues don’t affect only heterogeneous computing. If you want your GPU to draw something, you need to ask its driver to tell it to draw what you want. This is going to be done through what we call an API, an “Application Programming Interface”, which is a “dictionary” of what you can tell something to do through code. Early on, each GPU manufacturer would make their own API but this is pretty terrible. The same way as if you code your stuff with CUDA you can’t run on non-nVidia GPUs, you could end up with the same situation with graphics. And if you wanted it to work with other vendors, you’d need to learn and implement their APIs. This sucks for everyone.

In order to get around this problem, Silicon Graphics, who was a massive system manufacturer in their time (and where the people who would found nVidia came from), developed an Open Graphics Library. The idea was that different hardware manufacturers could implement this interface in their drivers and then programs could be coded using this library. Those programs could, then, run on hardware by all sorts of manufacturers AND in all sorts of OSes. As long as the API was implemented, most of the code could be the same.

The first version of OpenGL was released on the 30th of June 1992. Almost 30 years ago. Though we think of computer graphics these days a lot in terms of gaming, SGI’s thing was really selling professional workstation machines, so their main focus was rendering and modelling software as well as CAD applications. Over all this time, not only has OpenGL evolved massively from what it once was, it has also gained some alternatives. Notably, Microsoft, in their attempt to ruin things for everyone, decided to make their own OpenGL, which they call Direct3D. It’s the same thing but better, see, because it doesn’t work if you’re not on Windows. Later on AMD put a lot of work into an API which was supposed to be a lower-level alternative to OpenGL, since OpenGL abstracts a bunch of GPU pain away from the user (and also a lot of control because of this). This was called Mantle and later on became the basis for Vulkan. Also Microsoft once again decided to make their own Mantle, which is (part of?) DirectX12, which is just like Vulkan, but better because it doesn’t work if you’re not on Windows. They are also trying now to make DirectX part of Linux but not getting rid of the “doesn’t work if you’re not on Windows” bit. I’m sure only great things can come from this. Most recently, feeling great disappointment at themselves for not having done this before Microsoft, Apple has decided to make their own graphics API, Metal. Metal is just like Vulkan (and also just like OpenCL, a bit, allegedly), but better, see, because it doesn’t work if you’re not on macOS/iOS. Also, it’s even more better than DirectX because they control their OSes more tightly and they’re killing off Vulkan and OpenGL on their platforms.

Despite all of those alternatives, including from Khronos themselves, OpenGL is very much still alive. I mentioned that OpenGL was created originally by SGI, but control of the OpenGL spec has rested with Khronos, the group that exists solely to help manage open programming interfaces related to graphics, since 2006. The current version of OpenGL is 4.6, which was released in July 2017. So if you want to code some graphics and run it on as many devices as possible, this is still your best option.

Open Computing/Graphics Language/Library

When I first introduced OpenCL, the tagline was making “computing shaders” a thing. And this is because when OpenCL came out, indeed you couldn’t normally run arbitrary compute on OpenGL. I say normally because you COULD if you could trick OpenGL into thinking it was telling your GPU to process graphics, like vertices and pixels, and then make it give those results back to you instead of drawing them on a screen.

As it turns out, though, CUDA and OpenCL showed that running arbitrary compute on your GPU was really powerful and useful. And, you know… everyone has a driver that runs OpenGL anyway… and they might be using OpenGL to draw stuff already… you have this whole system already sitting there… So since OpenGL 4.3, released in 2012, you can have a Compute Shader in OpenGL instead of your usual graphics shaders. And if you’ve followed this blog series, you’ve seen one already, kind of. Because to code shaders for OpenGL, you use GLSL, which is the same thing that you are going to use for Vulkan if you’re using the standard tools.

I’m not going to dwell too much on the shader side, mainly because I’ve talked about it way back when I was discussing Vulkan and my opinions on GLSL haven’t changed a lot. Instead I will focus on what I tweaked from the Vulkan version of the shader and why. If you want to open the full file, you can find it here

Also I am going to tell you not to repeat my mistake, which I won’t correct for “historical reasons”: DO NOT USE vec3 in your shaders. Stick to vec4 only if you want to avoid very real memory alignment pains.
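To make the alignment pain concrete: under the std140 layout a vec3 is aligned to 16 bytes, while a plain struct of three floats on the C++ side is only aligned to 4. Here is a minimal sketch of the mismatch. The struct names are made up for illustration and are not toyBrot's actual types.

```cpp
#include <cstddef>

// CPU-side stand-ins for the GLSL types (hypothetical names, not toyBrot's).
struct Vec3 { float x, y, z; };                  // 12 bytes, aligned to 4
struct alignas(16) Vec4 { float x, y, z, w; };   // 16 bytes, aligned to 16

// Two consecutive vec3 members, as a naive C++ mirror of a uniform block.
struct CameraTight
{
    Vec3  camPos;
    Vec3  camUp;
    float camNear;
};

// The same data stored as vec4, which lines up with std140 on its own.
struct CameraPadded
{
    Vec4  camPos;
    Vec4  camUp;
    float camNear;
};

// std140 gives a vec3 a 16-byte base alignment, so the shader expects
// camUp at byte 16. The tight C++ struct puts it at byte 12 instead.
constexpr std::size_t std140Vec3Alignment = 16;

static_assert(offsetof(CameraTight,  camUp) == 12,
              "tight struct packs the vec3s back to back");
static_assert(offsetof(CameraPadded, camUp) == std140Vec3Alignment,
              "vec4 lands where std140 expects it");
static_assert(sizeof(CameraPadded) % 16 == 0,
              "the padded block is a whole number of 16-byte slots");
```

If you copy a CameraTight straight into a uniform buffer, everything after the first vec3 is read from the wrong bytes. With vec4 members the two sides agree without any manual padding.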

#version 450

#ifdef TOYBROT_VULKAN
    #extension GL_ARB_separate_shader_objects : enable
    #extension GL_EXT_debug_printf : enable
#endif

#define WORKGROUP_SIZE 16
layout (local_size_x = WORKGROUP_SIZE, local_size_y = WORKGROUP_SIZE, local_size_z = 1 ) in;


layout(binding = 0, std140) buffer outBuf
{
   tbVecType4 data[];
};

layout(binding = 1, std140) uniform Camera
{
    tbVecType3 camPos;
    tbVecType3 camUp;
    tbVecType3 camRight;
    tbVecType3 camTarget;

    tbFPType padding;

    tbFPType camNear;
    tbFPType camFovY;

    uint screenWidth;
    uint screenHeight;

    tbVecType3 screenTopLeft;
    tbVecType3 screenUp;
    tbVecType3 screenRight;
};

layout(binding = 2, std140) uniform Parameters
{
    tbFPType hueFactor;
    int   hueOffset;
    tbFPType valueFactor;
    tbFPType valueRange;
    tbFPType valueClamp;
    tbFPType satValue;
    tbFPType bgRed;
    tbFPType bgGreen;
    tbFPType bgBlue;
    tbFPType bgAlpha;

    uint  maxRaySteps;
    tbFPType collisionMinDist;

    tbFPType fixedRadiusSq;
    tbFPType minRadiusSq;
    tbFPType foldingLimit;
    tbFPType boxScale;
    uint  boxIterations;
};



.
.
.

void main()
{

  /*
  In order to fit the work into workgroups, some unnecessary threads are launched.
  We terminate those threads here.
  */
    if(gl_GlobalInvocationID.x >= screenWidth|| gl_GlobalInvocationID.y >= screenHeight)
    {
        return;
    }


    uint col = gl_GlobalInvocationID.x;
    uint row = gl_GlobalInvocationID.y;
    uint index = ((row*uint(screenWidth))+col);

    #ifdef TOYBROT_DEBUG
        if(index == 0u)
        {
            #ifdef TOYBROT_VULKAN
                #ifdef TOYBROT_USE_DOUBLES

                    debugPrintfEXT("Vulkan Printf requires all types to be 32 bit long \n");
                    debugPrintfEXT("However it specifies doubles as 64 so it prints garbage \n");

                #else

                    debugPrintfEXT("Shader-side camera information:\n");
                    debugPrintfEXT("camPos    = %f, %f, %f\n", camPos.x, camPos.y, camPos.z);
                    debugPrintfEXT("camUp     = %f, %f, %f\n", camUp.x, camUp.y, camUp.z);
                    debugPrintfEXT("camRight  = %f, %f, %f\n", camRight.x, camRight.y, camRight.z);
                    debugPrintfEXT("camTarget = %f, %f, %f\n", camTarget.x, camTarget.y, camTarget.z);

                    debugPrintfEXT("camNear   = %f\n", camNear);
                    debugPrintfEXT("camFovY   = %f\n", camFovY);

                    debugPrintfEXT("screenWidth   = %u\n", screenWidth);
                    debugPrintfEXT("screenHeight  = %u\n", screenHeight);

                    debugPrintfEXT("screenTL    = %f, %f, %f\n", screenTopLeft.x
                                                               , screenTopLeft.y
                                                               , screenTopLeft.z);
                    debugPrintfEXT("screenUp    = %f, %f, %f\n", screenUp.x
                                                               , screenUp.y
                                                               , screenUp.z);
                    debugPrintfEXT("screenRight = %f, %f, %f\n\n", screenRight.x
                                                                 , screenRight.y
                                                                 , screenRight.z);

                    debugPrintfEXT("Shader-side Parameters information:\n");
                    debugPrintfEXT("hueFactor   = %f\n", hueFactor);
                    debugPrintfEXT("hueOffset   = %i\n", hueOffset);
                    debugPrintfEXT("valueFactor = %f\n", valueFactor);
                    debugPrintfEXT("valueRange  = %f\n", valueRange);
                    debugPrintfEXT("valueClamp  = %f\n", valueClamp);
                    debugPrintfEXT("satValue    = %f\n", satValue);
                    debugPrintfEXT("bgRed       = %f\n", bgRed);
                    debugPrintfEXT("bgGreen     = %f\n", bgGreen);
                    debugPrintfEXT("bgBlue      = %f\n", bgBlue);
                    debugPrintfEXT("bgAlpha     = %f\n\n", bgAlpha);


                    debugPrintfEXT("maxRaySteps      = %u\n", maxRaySteps);
                    debugPrintfEXT("collisionMinDist = %f\n\n", collisionMinDist);


                    debugPrintfEXT("fixedRadiusSq = %f\n", fixedRadiusSq);
                    debugPrintfEXT("minRadiusSq   = %f\n", minRadiusSq);
                    debugPrintfEXT("foldingLimit  = %f\n", foldingLimit);
                    debugPrintfEXT("boxScale      = %f\n", boxScale);
                    debugPrintfEXT("boxIterations = %u\n\n", boxIterations);

                #endif //TOYBROT_USE_DOUBLES
            #else //TOYBROT_VULKAN (We're on OpenGL)

                data[0].xyz = camPos;
                data[1].xyz = camUp;
                data[2].xyz = camRight;
                data[3].xyz = camTarget;

                data[4].x = camNear;
                data[4].y = camFovY;

                data[4].z = float(screenWidth);
                data[4].w = float(screenHeight);

                data[5].xyz = screenTopLeft;
                data[6].xyz = screenUp;
                data[7].xyz = screenRight;

                data[8].x = hueFactor;
                data[8].y = float(hueOffset);
                data[8].z = valueFactor;
                data[8].w = valueRange;
                data[9].x = valueClamp;
                data[9].y = satValue;
                data[9].z = bgRed;
                data[9].w = bgGreen;
                data[10].x = bgBlue;
                data[10].y = bgAlpha;

                data[10].z = float(maxRaySteps);
                data[10].w = collisionMinDist;

                data[11].x = fixedRadiusSq;
                data[11].y = minRadiusSq;
                data[11].z = foldingLimit;
                data[11].w = boxScale;
                data[12].x = float(boxIterations);
                data[12].y = 0.0;
                data[12].w = 0.0;
                data[12].z = 0.0;
            #endif //TOYBROT_VULKAN

        }
    #endif //TOYBROT_DEBUG

    #if defined(TOYBROT_DEBUG) && !defined(TOYBROT_VULKAN)
        if(index > 12u)
        {
            data[index] = getColour(trace(col, row));
        }
    #else
        data[index] = getColour(trace(col, row));
    #endif
}

So the changes are only in two places. Right at the beginning, I use a couple of Vulkan extensions which are not present in OpenGL, so I hid those inside a define. Most of the changes, though, are on the debug side. One of the Vulkan extensions enables using printf from within the shader. This is extremely helpful for debugging, but is not present in OpenGL. Instead, what I do here for debugging is the traditional way, where you write some debug values to specific pixels which you can later interpret on the C++ side. Ideally one could add some dedicated “debugging” pixels but I’m just eating into the generated image here. You may also notice that, since each pixel is an RGBA float value, I sometimes have several “data outs” per pixel.
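Since the scalar parameters are packed four to a pixel, reading one back is just index arithmetic. A minimal sketch follows, where Pixel and debugScalar are hypothetical names and the starting pixel of 8 matches where the shader above begins writing the Parameters values.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// One output pixel: four floats, indexed as r, g, b, a (a hypothetical
// stand-in for toyBrot's own colour type).
using Pixel = std::array<float, 4>;

// The shader packs scalars four to a pixel starting at a known pixel.
// Scalar number i therefore lives in pixel first + i / 4, channel i % 4.
inline float debugScalar( const std::vector<Pixel>& image
                        , std::size_t first
                        , std::size_t i)
{
    return image.at(first + i / 4).at(i % 4);
}
```

With first = 8, scalar 0 is hueFactor (pixel 8, channel x) and scalar 16 is boxIterations (pixel 12, channel x), matching the order the shader writes them in.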

All right, so how does the C++ side look? Thankfully, nowhere near as rough as the Vulkan side, though you can see some of the correlations. Starting from the constructor, SDL already does me a favour by initialising the actual OpenGL context and whatnot, so what I need to do here is create the OpenGL Shader Program and load the actual shader source. And, again, if you just want to look at the full actual file, you can find it here.

FracGen::FracGen(bool benching, CameraPtr c, ParamPtr p)
    : bench{benching}
    , outBuffLocation{0}
    , cameraLocation{1}
    , paramsLocation{2}
    , cam{c}
    , parameters{p}
{
    outBuffer = std::make_shared< colourVec >(cam->ScreenWidth()*cam->ScreenHeight());
    std::string shaderSrc{"FracGen.comp.glsl"};


#ifndef TOYBROT_ENABLE_GUI
    //Try and initialise the OpenGL stuff here. Otherwise, SDL has got our back
#endif


    static bool once = false;
    if(!once || !bench )
    {
        once = true;
        std::cout << glGetString(GL_VERSION) << std::endl;
    }

    glProgram = glCreateProgram();
    GLuint shaderID = glCreateShader(GL_COMPUTE_SHADER);

    /*
     * All right, so a few "needlessly weird" things about to happen here
     * I want a few things which are not super trivial to juggle together
     *
     * 1 - I want to have preprocessor switches in my shader for debugging and doubles
     * In Vulkan, you can just forward the defines to glslangValidator, you're not really
     * consuming the .glsl file directly, but in OpenGL, you need to manually edit the string
     *
     * 2 - I want to have the shader using openGL version 310 es. The reason for this is I want
     * to later port this to webGPU through emscripten, so I need to be on an ES profile
     *
     * 3 - BUT es profiles don't have double support (at least not dvec3/4 from validator's complaints)
     *
     * 4 - And I want to have just the one source for both openGL and Vulkan
     *
     * With all of that in mind, I need to do some massaging of the shader source here which makes
     * this part more involved than one'd expect (ifstream rdbuf, done)
     */

    std::string src;
    std::ifstream shaderFile;
    std::string ln;
    std::string additionalDefines{""};
    std::string alternativeVersion{"#version 310 es\n"};
#ifndef NDEBUG
        additionalDefines += "#define TOYBROT_DEBUG\n";
#endif
#ifdef TOYBROT_USE_DOUBLES
        additionalDefines += "#define TOYBROT_USE_DOUBLES\n";
        alternativeVersion = "";
#endif

    try
    {
        shaderFile.open(shaderSrc);
        if(!shaderFile.is_open())
        {
            throw std::ifstream::failure("Couldn't open file "+ shaderSrc);
        }
        while(std::getline(shaderFile,ln))
        {
            if(!ln.empty())
            {
                if(!ln.compare(0,8,"#version"))
                {

                    if(!alternativeVersion.empty())
                    {
                        src += alternativeVersion;
                    }
                    else
                    {
                        (src += ln) += '\n';
                    }
                    src+= additionalDefines;
                }
                else
                {
                    (src += ln) += '\n';
                }
            }
            else
            {
                /*
                 *  I could just remove the else and have this here
                 *  but it would make the defines if present bit fiddlier
                 *  Conversely, not having this here, while functional, makes
                 *  debugging the shader much harder
                 */

                src += '\n';
            }
        }
        shaderFile.close();
    }
    catch (const std::ifstream::failure& e)
    {
        std::cerr << "Error reading shader file: " << e.what() << std::endl;
        exit(12);
    }
    const GLchar* glsrc = src.c_str();
    GLint success = 0;
    GLint length[1] ={static_cast<GLint>(src.length())};
    GLchar info[512];
    glShaderSource(shaderID, 1, &glsrc, length);
    glCompileShader(shaderID);
    glGetShaderiv(shaderID, GL_COMPILE_STATUS, &success);
    if(success == 0)
    {
        glGetShaderInfoLog(shaderID, 512, NULL, info);
        std::cerr << "Error compiling shader: " << info << std::endl;
        exit(12);
    }

    glAttachShader(glProgram, shaderID);
    glLinkProgram(glProgram);
    checkGlError("glLinkProgram");

    //We're not going to manipulate this further, so we're good with what's loaded on the program
    glDeleteShader(shaderID);
    checkGlError("DeleteShader");
    debugprint("Constructor done");
}

So this looks a bit long but when you look closer, most of it is me having to do some on-the-fly file editing and a comment that explains why I’m doing that, with the TL;DR being that I want to use the same shader source for Vulkan and I want this to be OpenGL ES compatible. I’ll talk about OpenGL ES in Multi Your Threading #10, when it becomes relevant. For now, suffice to say it’s a more restricted API profile.
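Pulled out of the constructor, the file massaging boils down to something like the sketch below. This is a standalone illustration of the same idea rather than the code in the repo, and patchShaderSource is a hypothetical name.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: copies a shader line by line, optionally swapping the
// #version line for another one and injecting extra #define lines after it.
// Empty lines are kept so compiler messages stay close to the file's lines.
inline std::string patchShaderSource( std::istream& in
                                    , const std::string& versionOverride
                                    , const std::string& extraDefines)
{
    std::string out;
    std::string line;
    while(std::getline(in, line))
    {
        if(line.compare(0, 8, "#version") == 0)
        {
            // Keep the original version line unless an override was given
            out += versionOverride.empty() ? line + '\n' : versionOverride;
            out += extraDefines;
        }
        else
        {
            out += line + '\n';
        }
    }
    return out;
}
```

Passing "#version 310 es\n" and "#define TOYBROT_DEBUG\n" turns a shader that starts with "#version 450" into an ES one with the debug switch on, while two empty strings hand the source back untouched.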

Other than some pain I’ve brought upon myself by wanting this file to be shared between implementations, there’s not a lot of note. You may notice that OpenGL’s API is C-like. So there’s a lot of passing pointers around and the error checking is the classic CUDA style where you call a function to ask the system if there were any errors after each call you care about. I’ve gone on about this before but this to me is one of the worst parts of these older APIs like OpenGL, CUDA, classic OpenCL, HIP (because it wants to be CUDA)…

An actual C++ interface for OpenGL would be pretty great, if nothing else because we could then have it throw exceptions, which would make your code much cleaner and easier to sort out. But I guess for a lot of people it’s “the pain you know”. People are used to doing it like this and, even though it kind of sucks, OpenGL is not horrible with its errors. The reference documentation always lists all the errors each function can generate and what they mean, though a lot of them can have several meanings. This way of doing things also means errors pile up until you ask for them so, if you ran 6 statements since you last checked, it’s easy to forget that an error you get could be from any one of those.
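To give an idea of what the exception flavour could look like, here is a minimal sketch. It is not how toyBrot's checkGlError works (that function isn't shown here), and throwOnGlError is a hypothetical name. The error source is passed in as a callable so the sketch runs without an OpenGL context; in a real program you would hand it glGetError.

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Numeric values of the OpenGL error enums. With the real headers you would
// use GL_NO_ERROR, GL_INVALID_ENUM and friends directly.
constexpr unsigned int kNoError          = 0x0000;
constexpr unsigned int kInvalidEnum      = 0x0500;
constexpr unsigned int kInvalidValue     = 0x0501;
constexpr unsigned int kInvalidOperation = 0x0502;
constexpr unsigned int kOutOfMemory      = 0x0505;

inline std::string errorName(unsigned int code)
{
    switch(code)
    {
        case kInvalidEnum:      return "GL_INVALID_ENUM";
        case kInvalidValue:     return "GL_INVALID_VALUE";
        case kInvalidOperation: return "GL_INVALID_OPERATION";
        case kOutOfMemory:      return "GL_OUT_OF_MEMORY";
        default:                return "unknown GL error";
    }
}

// Hypothetical helper: drains every pending error through getError and
// throws a single exception naming all of them, tagged with the call site.
inline void throwOnGlError( const std::function<unsigned int()>& getError
                          , const std::string& where)
{
    std::string found;
    for(unsigned int err = getError(); err != kNoError; err = getError())
    {
        if(!found.empty())
        {
            found += ", ";
        }
        found += errorName(err);
    }
    if(!found.empty())
    {
        throw std::runtime_error(where + ": " + found);
    }
}
```

Looping until the error source reports no error matters: several errors can be pending at once, and reading only the first one leaves the rest to show up after some later, innocent call.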

As one would expect, the generation side of things is similar to the likes of Vulkan and OpenCL as they all end up building on concepts from OpenGL in how to do stuff.

void FracGen::Generate()
{
    /*
     * Time to set up the I/O
     */

    glUseProgram(glProgram);

    if(outBuffer->size() != cam->ScreenHeight() * cam->ScreenWidth())
    {
        outBuffer->assign(cam->ScreenHeight() * cam->ScreenWidth(), RGBA{0,0,0,0});
    }

    debugprint("Generating");

    GLuint vao = 0;
    GLuint outBuffVBO = 0;
    GLuint camVBO = 0;
    GLuint paramsVBO = 0;

    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);

    glGenBuffers(1, &outBuffVBO);
    checkGlError("genBuffers(outBuff)");
    glGenBuffers(1, &camVBO);
    checkGlError("genBuffers(cam)");
    glGenBuffers(1, &paramsVBO);
    checkGlError("genBuffers(params)");

    debugprint("Buffers generated");

    glBindBuffer(GL_SHADER_STORAGE_BUFFER, outBuffVBO);
    checkGlError("bindBuffer(outBuff)");
    debugprint("outBuff bound successfully");
    glBufferData(GL_SHADER_STORAGE_BUFFER, static_cast<GLsizeiptr>( outSize())
                                                                  , outBuffer->data()
                                                                  , GL_STATIC_COPY);
    checkGlError("bufferData(outBuff)");
    debugprint("outBuff data transferred successfully");

    glBindBuffer(GL_UNIFORM_BUFFER, camVBO);
    glBufferData(GL_UNIFORM_BUFFER, static_cast<GLsizeiptr>( sizeof(*cam))
                                                           , cam.get()
                                                           , GL_STATIC_READ);

    glBindBuffer(GL_UNIFORM_BUFFER, paramsVBO);
    glBufferData(GL_UNIFORM_BUFFER, static_cast<GLsizeiptr>( sizeof(*parameters))
                                                           , parameters.get()
                                                           , GL_STATIC_READ);
    glBindBuffer(GL_UNIFORM_BUFFER, 0);

    debugprint("Buffers data transferred");

    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, outBuffLocation, outBuffVBO);
    glBindBufferBase(GL_UNIFORM_BUFFER, cameraLocation,  camVBO);
    glBindBufferBase(GL_UNIFORM_BUFFER, paramsLocation,  paramsVBO);


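    // Round up with integer arithmetic so a partial workgroup at the edge still gets dispatched
    glDispatchCompute(static_cast<uint32_t>( (cam->ScreenWidth()  + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE)
                      , static_cast<uint32_t>( (cam->ScreenHeight() + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE)
                      , 1);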
    checkGlError("glDispatchCompute");
    
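    // The read-back below goes through a mapped buffer, which is what GL_BUFFER_UPDATE_BARRIER_BIT covers
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT | GL_BUFFER_UPDATE_BARRIER_BIT);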
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, outBuffVBO);
    auto ptr = glMapBufferRange( GL_SHADER_STORAGE_BUFFER
                               , 0
                               , static_cast<GLsizeiptr>(outSize())
                               , GL_MAP_READ_BIT);
    checkGlError("glMapBufferRange");
    if(ptr == nullptr)
    {
        std::cout << "Error mapping OpenGL buffer!" << std::endl;
        exit(12);
    }

    memcpy( outBuffer->data(), ptr, outSize());

    glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
    checkGlError("glUnmapBuffer");
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);

    checkGlError("glUnbindBuffer");

    #ifndef NDEBUG
            std::cout << "Values from the OpenGL shader" << std::endl;
            std::cout << std::endl << "Camera:" << std::endl << std::endl;
            std::cout << "camPos        -> " << (*outBuffer)[0] << std::endl;
            std::cout << "camUp         -> " << (*outBuffer)[1] << std::endl;
            std::cout << "camRight      -> " << (*outBuffer)[2] << std::endl;
            std::cout << "camTarget     -> " << (*outBuffer)[3] << std::endl;
            std::cout << std::endl;
            std::cout << "camNear       -> " << (*outBuffer)[4].X() << std::endl;
            std::cout << "camFovY       -> " << (*outBuffer)[4].Y() << std::endl;
            std::cout << std::endl;
            std::cout << "screenWidth   -> " << (*outBuffer)[4].Z() << std::endl;
            std::cout << "screenHeight  -> " << (*outBuffer)[4].W() << std::endl;
            std::cout << std::endl;
            std::cout << "screenTopLeft -> " << (*outBuffer)[5] << std::endl;
            std::cout << "screenUp      -> " << (*outBuffer)[6] << std::endl;
            std::cout << "screenRight   -> " << (*outBuffer)[7] << std::endl;
            std::cout << std::endl << std::endl << "Parameters:" << std::endl << std::endl;
            std::cout << "hueFactor        -> " << (*outBuffer)[8].X() << std::endl;
            std::cout << "hueOffset        -> " << (*outBuffer)[8].Y() << std::endl;
            std::cout << "valueFactor      -> " << (*outBuffer)[8].Z() << std::endl;
            std::cout << "valueRange       -> " << (*outBuffer)[8].W() << std::endl;
            std::cout << "valueClamp       -> " << (*outBuffer)[9].X() << std::endl;
            std::cout << "satValue         -> " << (*outBuffer)[9].Y() << std::endl;
            std::cout << "bgRed            -> " << (*outBuffer)[9].Z() << std::endl;
            std::cout << "bgGreen          -> " << (*outBuffer)[9].W() << std::endl;
            std::cout << "bgBlue           -> " << (*outBuffer)[10].X() << std::endl;
            std::cout << "bgAlpha          -> " << (*outBuffer)[10].Y() << std::endl;
            std::cout << "maxRaySteps      -> " << (*outBuffer)[10].Z() << std::endl;
            std::cout << "collisionMinDist -> " << (*outBuffer)[10].W() << std::endl;
            std::cout << "fixedRadiusSq    -> " << (*outBuffer)[11].X() << std::endl;
            std::cout << "minRadiusSq      -> " << (*outBuffer)[11].Y() << std::endl;
            std::cout << "foldingLimit     -> " << (*outBuffer)[11].Z() << std::endl;
            std::cout << "boxScale         -> " << (*outBuffer)[11].W() << std::endl;
            std::cout << "boxIterations    -> " << (*outBuffer)[12].X() << std::endl;

    #endif

}

We tell OpenGL to use our shader program which we compiled before, create the buffers on the GPU side, then tell OpenGL what those buffers correspond to in the shader side (through binding them to numbered “locations” we know beforehand from having written the shader itself) and then do a glDispatchCompute call which is not unlike any other “run the kernel plez” call we have for basically all GPU implementations (and TBB).

After that call we place a memory barrier to tell OpenGL “we need to wait until you’re done with the SHADER_STORAGE_BUFFER because we’ll read it” and we do so, mapping the buffer to the CPU side and memcpy’ing it out. Unbind all the stuff, insert debugging output here and we’re done. The data is in our hands.
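One thing to watch in a dispatch call like this: if the screen size and the workgroup size are both integers (an assumption, since their types aren't shown in the listing), dividing them rounds down before any ceil() gets a say, and the last partial row or column of workgroups never runs. A minimal sketch of an integer round-up, with groupCount being a hypothetical name:

```cpp
#include <cstdint>

// Number of workgroups needed to cover `pixels` with groups of `groupSize`,
// rounding up so a partial group at the edge is still launched.
constexpr std::uint32_t groupCount(std::uint32_t pixels, std::uint32_t groupSize)
{
    return (pixels + groupSize - 1u) / groupSize;
}

static_assert(groupCount(1920, 16) == 120, "exact multiple, nothing to round");
static_assert(groupCount(1080, 16) == 68,  "1080 / 16 is 67.5, rounded up");
static_assert(1080 / 16 == 67, "plain integer division rounds down instead");
```

The extra threads this launches past the edge of the image are exactly the ones the early return at the top of the shader's main() is there to catch.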

So that’s it! I have an OpenGL implementation that does what I want and I can now proceed to putting it on a browser, right? Let’s go next chapter! Well, not quite. Again, I’ll talk about the reasons for this more specifically in the next chapter but it turns out I can’t use a compute shader if I want to put this on a browser. Not right now at least. So we need to go deeper.

Back to the good old days: "It's just a texture, I promise! Don't worry about it!"

Not having compute shaders sucks. But they weren’t around for quite a while and people were already doing all sorts of arbitrary compute on their GPUs. The trick is really to make your GPU THINK it’s just drawing something, but then instead of putting it on the screen, you can retrieve what it “drew” and just treat it as an array of data.

The idea is simple enough but this does come with some limitations. Or rather, having your shader written with general compute in mind removes some restrictions and enables some easier, more flexible ways of doing things. When you’re computing on a graphical shader, your inputs and outputs need to be mapped to what a GPU expects. Additionally, our ShaderProgram needs to be a valid graphical shader program. For OpenGL this means we need at least two shaders. First, we need a Vertex Shader, which defines a set of operations/calculations we want to perform on each vertex we’re drawing on the scene. And secondly, we need a Fragment Shader, which basically tells OpenGL what colour to paint each bit it’s drawing. These “bits” are called fragments and for our purposes here you can think of them as each pixel in what’s going to be drawn, though this is not 100% accurate.

Additionally, we also need to tell OpenGL not to draw our stuff on the actual screen. We do this by creating a separate framebuffer and telling OpenGL to draw there. Then instead of blitting this buffer to the screen, we get what was “rendered” and interpret it however we want.
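The geometry involved is tiny: a quad covering the whole render target, with each fragment finding out which “pixel” it is from an interpolated coordinate. Here is a CPU-side sketch of that mapping, assuming a two-triangle quad and a UV in the 0 to 1 range. All names here are made up for illustration, and a real fragment shader could just as well read gl_FragCoord.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct Vec2 { float x, y; };

// Two triangles covering the whole of clip space. This layout is an
// assumption: any ordering that covers [-1, 1] x [-1, 1] would do.
constexpr std::array<Vec2, 6> fullScreenQuad
{{
    {-1.f, -1.f}, { 1.f, -1.f}, { 1.f,  1.f},
    {-1.f, -1.f}, { 1.f,  1.f}, {-1.f,  1.f}
}};

// The mapping a vertex shader would do: clip space [-1, 1] to UV [0, 1].
constexpr Vec2 uvFromClip(Vec2 pos)
{
    return Vec2{(pos.x + 1.f) / 2.f, (pos.y + 1.f) / 2.f};
}

// Turns a UV back into the flat index a compute shader would have derived
// from its invocation ID. Clamped so a UV of exactly 1.0 stays in range.
inline std::size_t pixelIndex(Vec2 uv, std::uint32_t width, std::uint32_t height)
{
    auto toPixel = [](float t, std::uint32_t extent) -> std::uint32_t
    {
        const auto p = static_cast<std::uint32_t>(t * static_cast<float>(extent));
        return p < extent ? p : extent - 1u;
    };
    return static_cast<std::size_t>(toPixel(uv.y, height)) * width
         + toPixel(uv.x, width);
}
```

Once every fragment knows its column and row, the body of the old compute shader can run more or less unchanged, with the result going out as the fragment's colour instead of being written into a buffer.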

This whole situation is real funny in the context of toyBrot itself. If you think about it, I use all sorts of compute tech to do something which is actually just generating a picture for me to draw. If I’m using the likes of OpenGL or Vulkan to do it, really this SHOULD be a fragment shader and I should just be drawing straight to the screen. In fact if you search for Mandelbox in shadertoy you can find all sorts of mandelboxes done with raymarching, which are graphical shaders and, bonus points, coded by a lot of people who actually code graphics well and want their code to generate fancy visuals, and not be deliberately slow. The reason I do this through the compute route is that what I AM interested in here is arbitrary compute, and this fancy image just happened to be something that’s interesting to generate, that I can make somewhat arbitrarily costly in order to make performance comparisons, and that is easy to verify as (more or less) correct. It’s also quite fun to play with generating different boxes and I use some as wallpapers as a hubris power move.

So in toyBrot we’ll have a Fragment Shader, which is really a Compute Shader in disguise, but which if you think about it, really should be an actual Fragment Shader. Nice and straightforward!

With that in mind, let’s look at some code and, again, let’s start with the constructor for FracGen. I’ll skip over the things which are the same, just so they don’t clutter too much but, as usual, you can check the full file here if you want to (or clone the repo and open it in the editor of your choice).

FracGen::FracGen(bool benching, CameraPtr c, ParamPtr p)
    : bench{benching}
    , outBuffLocation{0}
    , cameraLocation{1}
    , paramsLocation{2}
    , cam{c}
    , parameters{p}
    , camLocs
        {
            {"camPos",-1},
            {"camUp",-1},
            {"camRight",-1},
            {"camTarget",-1},
            {"camNear",-1},
            {"camFovY",-1},
            {"screenWidth",-1},
            {"screenHeight",-1},
            {"screenTopLeft",-1},
            {"screenUp",-1},
            {"screenRight",-1}
        }
    , paramLocs
        {
            {"hueFactor",-1},
            {"hueOffset",-1},
            {"valueFactor",-1},
            {"valueRange",-1},
            {"valueClamp",-1},
            {"satValue",-1},
            {"bgRed",-1},
            {"bgGreen",-1},
            {"bgBlue",-1},
            {"bgAlpha",-1},
            {"maxRaySteps",-1},
            {"collisionMinDist",-1},
            {"fixedRadiusSq",-1},
            {"minRadiusSq",-1},
            {"foldingLimit",-1},
            {"boxScale",-1},
            {"boxIterations",-1}
        }
{

    //Hands off the SDL internal stuff
    glUseProgram(0);

    glProgram = glCreateProgram();
    GLuint fragID = glCreateShader(GL_FRAGMENT_SHADER);

    /**
     * 
     * LOAD AND COMPILE FRAGMENT SHADER
     * (this is our tweaked compute shader)
     * 
     */

    /**
     * The vertex shader here is a big old nothingburger. Just assume A quad and
     * output some UVs for the fragments
     */
    
    std::stringstream vertSrc;
    vertSrc << "#version 300 es" << std::endl;
    vertSrc << "layout(location = 0) in vec3 vertPos;" << std::endl;
    vertSrc << "out vec2 UV;" << std::endl;
    vertSrc << "void main(){" << std::endl;
    vertSrc << "gl_Position =  vec4(vertPos,1);" << std::endl;
    vertSrc << "UV = (vertPos.xy+vec2(1,1))/2.0;" << std::endl;
    vertSrc << "}" << std::endl;
    vertSrc << "" << std::endl;


    GLuint vertID = glCreateShader(GL_VERTEX_SHADER);
    std::string vertStr = vertSrc.str();
    const GLchar* glVertsrc = vertStr.c_str();

    GLint VertLength[1] = {static_cast<GLint>(vertStr.length())};
    glShaderSource(vertID, 1, &glVertsrc, VertLength);
    glCompileShader(vertID);
    glGetShaderiv(vertID, GL_COMPILE_STATUS, &success);
    if(success == 0)
    {
        glGetShaderInfoLog(vertID, 512, NULL, info);
        std::cerr << "Error compiling vertex shader: " << info << std::endl;
        exit(12);
    }

    glAttachShader(glProgram, vertID);
    glAttachShader(glProgram, fragID);
    checkGlError("glAttachShader");


    glLinkProgram(glProgram);
    checkGlError("glLinkProgram");

    // We're not going to manipulate this further
    // so we're good with what's loaded on the program
    glDeleteShader(vertID);
    checkGlError("DeleteShader(Vert)");
    glDeleteShader(fragID);
    checkGlError("DeleteShader(Frag)");
    glUseProgram(0);

    debugprint("Constructor done");

}

So, there are really only two different things here. First is we have a couple of new members, camLocs and paramLocs. Those hold the “shader uniform locations” for all the data we need to pass to the shaders. We’ve been using uniforms before but, if you’re a bit confused now that they’re at the forefront: shaders are intrinsically parallel. You’re going to run them over ranges of data. Uniforms are like “static” C++ variables, in that they are accessible and uniform across all instances of the shader (the same for every vertex/fragment/ray/whatnot).

The second thing is that our old Compute Shader is now a fragment shader, but we also need a vertex shader to go with it. In our case, ours is so minimal I didn’t even put it in a separate file, it’s just inlined. Here’s how it looks on its own:

#version 300 es

layout(location = 0) in vec3 vertPos;

out vec2 UV;

void main()
{
    gl_Position =  vec4(vertPos,1);
    UV = (vertPos.xy + vec2(1,1)) /2.0;
}

It really just takes a vertex position, sets that position in the OpenGL space and then calculates the Texture Coordinates for that vertex. Texture coordinates are (usually) 2D coordinates which go from 0 to 1 and are used to map a texture to a 3D polygon. What’s being done here is just assuming we’ll have one rectangle (made of two triangles) that covers the entire screen. For OpenGL (until you start telling it to do stuff at least), the screen goes from {-1,-1} to {1,1}, left to right, bottom to top (this is a common trap, as most 2D graphics APIs set 0,0 on the TOP left of the screen/window). So we just adjust those numbers to the 0 to 1 range for the ~~compute~~ fragment shader. Easy peasy.
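
To make that remapping concrete, here is a minimal CPU-side sketch in plain C++ of the arithmetic the vertex shader performs. Vec2 and ndcToUV are hypothetical names made up for this illustration; they are not part of toyBrot or of OpenGL.

```cpp
#include <cassert>

// Hypothetical plain-C++ stand-in for GLSL's vec2
struct Vec2
{
    float x;
    float y;
};

// Mirrors the shader's `UV = (vertPos.xy + vec2(1,1)) / 2.0`:
// maps a position in OpenGL's [-1, 1] screen range to a UV in [0, 1]
Vec2 ndcToUV(Vec2 ndc)
{
    assert(ndc.x >= -1.0f && ndc.x <= 1.0f);
    assert(ndc.y >= -1.0f && ndc.y <= 1.0f);
    return Vec2{(ndc.x + 1.0f) / 2.0f, (ndc.y + 1.0f) / 2.0f};
}
```

With this, the bottom left corner {-1,-1} maps to UV {0,0}, the top right corner {1,1} maps to {1,1} and the middle of the screen lands on {0.5,0.5}.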

As for the fragment shader, there are, again, few changes from the compute shader version. Most of the actual “working bits” are exactly the same. We mostly just need to account for the new structure of the ShaderProgram itself. Also, once again, you can check the full file here.

#version 300 es

precision highp float;

#ifdef TOYBROT_USE_DOUBLES
    precision highp double;
    #define tbFPType   double
    #define tbVecType3 dvec3
    #define tbVecType4 dvec4
#else
    #define tbFPType   float
    #define tbVecType3 vec3
    #define tbVecType4 vec4
#endif


// implementation independent mod
//#define mod(x, y) ( x - y * trunc(x / y) )

/******************************************************************************
 *
 * Tweakable parameters
 *
 ******************************************************************************/


struct Camera
{
    tbVecType3 camPos;
    tbVecType3 camUp;
    tbVecType3 camRight;
    tbVecType3 camTarget;

    //tbFPType padding;

    tbFPType camNear;
    tbFPType camFovY;

    uint screenWidth;
    uint screenHeight;

    tbVecType3 screenTopLeft;
    tbVecType3 screenUp;
    tbVecType3 screenRight;
};

struct Parameters
{
    tbFPType hueFactor;
    int   hueOffset;
    tbFPType valueFactor;
    tbFPType valueRange;
    tbFPType valueClamp;
    tbFPType satValue;
    tbFPType bgRed;
    tbFPType bgGreen;
    tbFPType bgBlue;
    tbFPType bgAlpha;

    uint  maxRaySteps;
    tbFPType collisionMinDist;

    tbFPType fixedRadiusSq;
    tbFPType minRadiusSq;
    tbFPType foldingLimit;
    tbFPType boxScale;
    uint  boxIterations;
};

out tbVecType4 outColour;

in vec2 UV;

uniform Camera cam;
uniform Parameters params;

 .
 .
 .

 /**
 * 
 * All the tracing and raymarching functions are untouched
 * 
 */



void main()
{
    outColour = getColour( trace( uint( UV.x * float(cam.screenWidth))
                                , uint( UV.y * float(cam.screenHeight)) ));
}

Gone is the layout information because this is not arbitrary workgroups or whatnot, we’re drawing some polygons (allegedly). We also have an additional restriction because we’re conforming to OpenGL ES 3.0, so we need to explicitly define our float precision. Having to use ES 3.0 is actually the reason why I can’t use compute shaders, so consider this a preview of the next chapter. Additionally, when using compute shaders, be it in OpenGL or in Vulkan, I have all my uniforms bound to specific locations. With a regular fragment shader that doesn’t really work. Also, I need specific ins and outs. The ins are the data that comes from the vertex shader, which is run first, and the out is the colour we’re going to use. This is where you’d do whatever shenanigans you’d need to format your output but, for toyBrot, it just happens to actually be a coloured pixel.

Finally, when it comes to just running the code, you can see that we no longer “figure out” where we are in the whole, in a way. For each fragment, OpenGL will generate the correct UV value, which is interpolated from the vertices in the polygon. So even though we only DIRECTLY set the UVs for the vertices, if this fragment is halfway between verts A and B, its UV is the average of the UVs of those verts. Since that value is a float between 0 and 1, we multiply it by the screen dimensions (which we get as a uniform) to get our “pixel coordinates”.
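
As a rough CPU-side illustration of that last step, the sketch below emulates the interpolation along one axis and then applies the same conversion as the shader's main. All the names here (mix, uvAtPixelCentre, uvToPixel) are hypothetical, and it assumes the quad covers the render target exactly, so that each fragment is sampled at the centre of one pixel.

```cpp
#include <cassert>
#include <cstdint>

// Linear interpolation between two vertex values, named after GLSL's mix()
float mix(float a, float b, float t)
{
    return a + (b - a) * t;
}

// UV a fragment would get at the centre of pixel `pixel` along an axis that
// is `extent` pixels long, with UV 0 on one edge of the quad and 1 on the other
float uvAtPixelCentre(std::uint32_t pixel, std::uint32_t extent)
{
    assert(extent > 0 && pixel < extent);
    const float t = (static_cast<float>(pixel) + 0.5f) / static_cast<float>(extent);
    return mix(0.0f, 1.0f, t);
}

// Mirrors `uint(UV.x * float(cam.screenWidth))` from the fragment shader
std::uint32_t uvToPixel(float uv, std::uint32_t extent)
{
    return static_cast<std::uint32_t>(uv * static_cast<float>(extent));
}
```

Because the sample sits at the centre of the pixel, the multiplication gives "pixel plus a half" and the conversion to an unsigned integer truncates that back to the pixel index.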

This is where it’s good to remember that fragments are not pixels and really we’re forcing them into that condition by making sure that the polygon we’re drawing matches the screen. If we did the required shenanigans to be looking at this polygon from the side, for example, the assumptions we’re making here would no longer make sense.

Okay, so if things have been mostly the same so far, now is when we start having to work around a bunch of stuff. It’s time to generate our fractal.

void FracGen::Generate()
{
    /**
     * A lot of the stuff here was adapted and/or copypasted from
     *
     * http://www.opengl-tutorial.org/intermediate-tutorials/tutorial-14-render-to-texture/
     */

    /*
     * Time to set up the I/O
     */

    glUseProgram(glProgram);
    checkGlError("glUseProgram");

    if(outBuffer->size() != cam->ScreenHeight() * cam->ScreenWidth())
    {
        outBuffer->assign(cam->ScreenHeight() * cam->ScreenWidth(), RGBA{0,0,0,0});
    }

    debugprint("Generating");

    GLuint vao = 0;
    GLuint fracFB = 0;
    GLuint fracTex = 0;
    GLenum drawBuff[1] = {GL_COLOR_ATTACHMENT0};

    setUniforms();
    
    /**
     * Generate the framebuffer and texture
     */

    glGenFramebuffers(1, &fracFB);
    checkGlError("genFrameBuffers");

    glBindFramebuffer(GL_FRAMEBUFFER, fracFB);
    checkGlError("glBindFramebuffer");

    glGenTextures(1, &fracTex);
    checkGlError("glGenTextures");
    glBindTexture(GL_TEXTURE_2D, fracTex);
    checkGlError("glBindTexture(frag)");
    glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);

    glTexImage2D(GL_TEXTURE_2D, 0
                , GL_RGBA32F
                , cam->ScreenWidth(), cam->ScreenHeight()
                , 0, GL_RGBA, GL_FLOAT, 0);
    checkGlError("glTexImage2D");


    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    checkGlError("glTexParameteri");


    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, fracTex, 0);
    checkGlError("glFramebufferTexture2D");
    
    /**
     * Vertex buffer setup
     */

    glGenVertexArrays(1, &vao);
    checkGlError("glGenVertexArrays");
    glBindVertexArray(vao);
    checkGlError("glBindVertexArray");

    glBindFramebuffer(GL_FRAMEBUFFER, fracFB);


    static const GLfloat g_quad_vertex_buffer_data[] = {
        -1.0f, -1.0f, 0.0f,
        1.0f, -1.0f, 0.0f,
        -1.0f,  1.0f, 0.0f,
        -1.0f,  1.0f, 0.0f,
        1.0f, -1.0f, 0.0f,
        1.0f,  1.0f, 0.0f,
    };

    GLuint quad_vertexbuffer;
    glGenBuffers(1, &quad_vertexbuffer);
    glBindBuffer(GL_ARRAY_BUFFER, quad_vertexbuffer);
    glBufferData( GL_ARRAY_BUFFER
                , sizeof(g_quad_vertex_buffer_data)
                , g_quad_vertex_buffer_data
                , GL_STATIC_DRAW);

    glEnableVertexAttribArray(0);
    glBindBuffer(GL_ARRAY_BUFFER, quad_vertexbuffer);
    glVertexAttribPointer(
        0,                  // attribute 0. Must match the layout in the shader.
        3,                  // size
        GL_FLOAT,           // type
        GL_FALSE,           // normalized?
        0,                  // stride
        (void*)0            // array buffer offset
    );

    /**
     * Draw the triangles !
     */
    
    glDrawArrays(GL_TRIANGLES, 0, 6); // 2*3 indices starting at 0 -> 2 triangles

    glDisableVertexAttribArray(0);

    glDrawBuffers(1, drawBuff);

    checkGlError("glDrawBuffers");
    if(glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE)
    {
        std::cout << "Error drawing OpenGL FrameBuffer!" << std::endl;
        exit(12);
    }


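    //Pick which attachment we want to read back from
    glReadBuffer(GL_COLOR_ATTACHMENT0);
    checkGlError("glReadBuffer");
    //And now we copy the data back. Reading into our own memory means
    //this call only returns once the draw is done, so this is our "wait"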
    glReadPixels( 0, 0
                , cam->ScreenWidth(), cam->ScreenHeight()
                , GL_RGBA, GL_FLOAT, outBuffer->data());
    checkGlError("glReadPixels");


    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    glBindTexture(GL_TEXTURE_2D, 0);

    glUseProgram(0);

}

So we start by resizing outBuffer as normal. Then we do a bunch of OpenGL calls to generate our off-screen framebuffer and the texture we want to write to. After that we create the verts on each corner of the screen and tell OpenGL that we want to draw a couple of triangles from those. Most of the calls here are just telling OpenGL exactly how to interpret the data. Take that glVertexAttribPointer call at the end of the vertex buffer setup. That’s where we’re teaching OpenGL where, in our data, “data input 0” is which, if you look back to the vertex shader, is the vertex position. So it’s more verbose and weird than complex.
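
If it helps, here is a small CPU-side model of what those glVertexAttribPointer arguments describe for float data. It is only a sketch: fetchAttribute is a hypothetical function written for this illustration and it does not call OpenGL at all.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the `size` floats that make up the attribute of vertex number
// `vertex`, given the same size/stride/offset description that
// glVertexAttribPointer receives for GL_FLOAT data
std::vector<float> fetchAttribute(const std::vector<float>& buffer,
                                  std::size_t vertex,
                                  std::size_t size,
                                  std::size_t strideBytes,
                                  std::size_t offsetBytes)
{
    // A stride of 0 means "tightly packed": one attribute right after another
    const std::size_t effectiveStride =
        (strideBytes == 0) ? size * sizeof(float) : strideBytes;
    const std::size_t first =
        (offsetBytes + vertex * effectiveStride) / sizeof(float);
    assert(first + size <= buffer.size());

    const auto begin = buffer.begin() + static_cast<std::ptrdiff_t>(first);
    return std::vector<float>(begin, begin + static_cast<std::ptrdiff_t>(size));
}
```

With the values used in Generate (size 3, stride 0, offset 0), vertex 1 of the quad data is simply floats 3 to 5 of the array, which is {1, -1, 0}.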

Throughout the setup we tell OpenGL to draw to COLOR_ATTACHMENT0 so, after we’ve told it to draw, we tell it that this is what we want to read from. After that we call glReadPixels and read the texture data into our outBuffer; since that call copies into our own memory, it can only return once OpenGL is done writing, so this is effectively where we wait. Finally, we unbind and stop using all of our stuff and we’re done. This final cleanup is particularly important because I also have to do the drawing with OpenGL on the mainWindow and I need to make sure neither side is mangling the other’s OpenGL state. If you don’t need to talk to OpenGL directly, then you can just rely on SDL’s internal mechanisms, which may or may not be hardware accelerated. This is what other versions of toyBrot do and this is handled through a bunch of ifdefs. If you DO start messing with OpenGL stuff, though, OpenGL relying on a single global state means you can’t mix the two naively; you will mangle each other’s state. But SDL is prepared to create the window, get an OpenGL context for you and then let you take the wheel if that’s what you want, so no big issue here.
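
One possible way of making sure that cleanup always happens, including on early returns, is a scope guard. The class below is a hypothetical helper sketched for this post; it is not what toyBrot does, which simply makes the unbind calls at the end of Generate.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper, not part of toyBrot: runs the callback it was given
// when it goes out of scope, whichever way the scope is left
class ScopeExit
{
public:
    explicit ScopeExit(std::function<void()> onExit) : cleanup{std::move(onExit)}
    {
        assert(static_cast<bool>(cleanup));
    }

    ~ScopeExit()
    {
        cleanup();
    }

    // Copying would make the cleanup run twice, so it is not allowed
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    std::function<void()> cleanup;
};
```

In Generate, the idea would be to create one of these right after glUseProgram(glProgram), with a lambda that makes the same three calls the function currently ends with: glBindFramebuffer(GL_FRAMEBUFFER, 0), glBindTexture(GL_TEXTURE_2D, 0) and glUseProgram(0).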

You may also have noticed that I didn’t really set the camera and whatnot, but there IS a call to a setUniforms function. That function is not interesting except in that it shows one of the downsides of having to do things this way.

void FracGen::setUniforms()
{
    camLocs["camPos"] = glGetUniformLocation(glProgram,"cam.camPos");
    camLocs["camUp"] = glGetUniformLocation(glProgram,"cam.camUp");
    camLocs["camRight"] = glGetUniformLocation(glProgram,"cam.camRight");
    camLocs["camTarget"] = glGetUniformLocation(glProgram,"cam.camTarget");
    camLocs["camNear"] = glGetUniformLocation(glProgram,"cam.camNear");
    camLocs["camFovY"] = glGetUniformLocation(glProgram,"cam.camFovY");
    camLocs["screenWidth"] = glGetUniformLocation(glProgram,"cam.screenWidth");
    camLocs["screenHeight"] = glGetUniformLocation(glProgram,"cam.screenHeight");
    camLocs["screenTopLeft"] = glGetUniformLocation(glProgram,"cam.screenTopLeft");
    camLocs["screenUp"] = glGetUniformLocation(glProgram,"cam.screenUp");
    camLocs["screenRight"] = glGetUniformLocation(glProgram,"cam.screenRight");
//#ifndef NDEBUG
//    std::cout<< std::endl << "****************" << std::endl;

//    std::cout << "Uniform locations for:" <<std::endl;
//#endif
    for(auto& loc: camLocs)
    {
//#ifndef NDEBUG
//        std::cout << "cam." << loc.first<<": "<< loc.second <<std::endl;
//#endif
        if(loc.second == -1)
        {
            std::cerr << "Error acquiring uniform location for cam."<< loc.first << std::endl;
            exit(12);
        }
    }

    paramLocs["hueFactor"]          = glGetUniformLocation(glProgram,"params.hueFactor");
    paramLocs["hueOffset"]          = glGetUniformLocation(glProgram,"params.hueOffset");
    paramLocs["valueFactor"]        = glGetUniformLocation(glProgram,"params.valueFactor");
    paramLocs["valueRange"]         = glGetUniformLocation(glProgram,"params.valueRange");
    paramLocs["valueClamp"]         = glGetUniformLocation(glProgram,"params.valueClamp");
    paramLocs["satValue"]           = glGetUniformLocation(glProgram,"params.satValue");
    paramLocs["bgRed"]              = glGetUniformLocation(glProgram,"params.bgRed");
    paramLocs["bgGreen"]            = glGetUniformLocation(glProgram,"params.bgGreen");
    paramLocs["bgBlue"]             = glGetUniformLocation(glProgram,"params.bgBlue");
    paramLocs["bgAlpha"]            = glGetUniformLocation(glProgram,"params.bgAlpha");
    paramLocs["maxRaySteps"]        = glGetUniformLocation(glProgram,"params.maxRaySteps");
    paramLocs["collisionMinDist"]   = glGetUniformLocation(glProgram,"params.collisionMinDist");
    paramLocs["fixedRadiusSq"]      = glGetUniformLocation(glProgram,"params.fixedRadiusSq");
    paramLocs["minRadiusSq"]        = glGetUniformLocation(glProgram,"params.minRadiusSq");
    paramLocs["foldingLimit"]       = glGetUniformLocation(glProgram,"params.foldingLimit");
    paramLocs["boxScale"]           = glGetUniformLocation(glProgram,"params.boxScale");
    paramLocs["boxIterations"]      = glGetUniformLocation(glProgram,"params.boxIterations");


    for(auto& loc: paramLocs)
    {
//        #ifndef NDEBUG
//            std::cout << "params." << loc.first<<": "<< loc.second <<std::endl;
//        #endif
        if(loc.second == -1)
        {
            std::cerr << "Error acquiring uniform location for params."<< loc.first << std::endl;
            exit(12);
        }
    }
//    #ifndef NDEBUG
//        std::cout<< std::endl << "****************" << std::endl;
//    #endif


    glUniform3f(camLocs["camPos"],        cam->Pos().X()
                                 ,        cam->Pos().Y()
                                 ,        cam->Pos().Z());
    
    glUniform3f(camLocs["camUp"],         cam->Up().X()
                                ,         cam->Up().Y()
                                ,         cam->Up().Z());
    
    glUniform3f(camLocs["camRight"],      cam->Right().X()
                                   ,      cam->Right().Y()
                                   ,      cam->Right().Z());
    
    glUniform3f(camLocs["camTarget"],     cam->Target().X()
                                    ,     cam->Target().Y()
                                    ,     cam->Target().Z());

    glUniform1f(camLocs["camNear"],       cam->Near());
    glUniform1f(camLocs["camFovY"],       cam->FovY());

    glUniform1ui(camLocs["screenWidth"],  cam->ScreenWidth());
    glUniform1ui(camLocs["screenHeight"], cam->ScreenHeight());

    glUniform3f(camLocs["screenTopLeft"], cam->ScreenTopLeft().X()
                                        , cam->ScreenTopLeft().Y()
                                        , cam->ScreenTopLeft().Z());
    
    glUniform3f(camLocs["screenUp"],      cam->ScreenUp().X()
                                   ,      cam->ScreenUp().Y()
                                   ,      cam->ScreenUp().Z());
    
    glUniform3f(camLocs["screenRight"],   cam->ScreenRight().X()
                                      ,   cam->ScreenRight().Y()
                                      ,   cam->ScreenRight().Z());

    glUniform1f(paramLocs["hueFactor"],         parameters->HueFactor());
    glUniform1i(paramLocs["hueOffset"],         parameters->HueOffset());
    glUniform1f(paramLocs["valueFactor"],       parameters->ValueFactor());
    glUniform1f(paramLocs["valueRange"],        parameters->ValueRange());
    glUniform1f(paramLocs["valueClamp"],        parameters->ValueClamp());
    glUniform1f(paramLocs["satValue"],          parameters->SatValue());
    glUniform1f(paramLocs["bgRed"],             parameters->BgRed());
    glUniform1f(paramLocs["bgGreen"],           parameters->BgGreen());
    glUniform1f(paramLocs["bgBlue"],            parameters->BgBlue());
    glUniform1f(paramLocs["bgAlpha"],           parameters->BgAlpha());

    glUniform1ui(paramLocs["maxRaySteps"],      parameters->MaxRaySteps());
    glUniform1f(paramLocs["collisionMinDist"],  parameters->CollisionMinDist());

    glUniform1f(paramLocs["fixedRadiusSq"],     parameters->FixedRadiusSq());
    glUniform1f(paramLocs["minRadiusSq"],       parameters->MinRadiusSq());
    glUniform1f(paramLocs["foldingLimit"],      parameters->FoldingLimit());
    glUniform1f(paramLocs["boxScale"],          parameters->BoxScale());
    glUniform1ui(paramLocs["boxIterations"],    parameters->BoxIterations());

}

With compute shaders, you’re expected to be sending arbitrary data, but with regular graphics shaders you’re more restricted. So even though I can and do declare the structs on the shader, whereas in the compute version I can just tell OpenGL “here’s a pointer, copy this amount of memory from here, let the shader handle it”, in here I couldn’t find a way around actually setting every variable manually.
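
The location lookups, at least, could probably be collapsed into a loop, since the map keys already match the member names in the shader structs. Below is a minimal sketch of that idea. It assumes camLocs and paramLocs behave like a std::map from name to location, and it takes the lookup as a callback standing in for glGetUniformLocation(glProgram, name), so that the example builds without OpenGL. LocationMap, LocationLookup and resolveLocations are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Assumption: the location members behave like a map of name -> location
using LocationMap = std::map<std::string, int>;
// Stand-in for a call such as glGetUniformLocation(glProgram, name.c_str())
using LocationLookup = std::function<int(const std::string&)>;

// Fills in every entry of `locs` by asking for "<structName>.<key>"
void resolveLocations(LocationMap& locs,
                      const std::string& structName,
                      const LocationLookup& lookup)
{
    assert(static_cast<bool>(lookup));
    for(auto& loc : locs)
    {
        const std::string fullName = structName + "." + loc.first;
        loc.second = lookup(fullName);
        // -1 is what OpenGL hands back when it can't find the uniform
        if(loc.second == -1)
        {
            throw std::runtime_error("Error acquiring uniform location for " + fullName);
        }
    }
}
```

This only removes the repetition on the lookup side. The glUniform calls themselves still need to be written out one by one, because each one has to pick the right function for the type of the uniform.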

I also cut out the debug output on the Generate() function, which just lets me know I am not mangling uniforms in the copy. It’s basically the same as before, but this time doing a bunch of readUniform calls and then printing the results out.

Tying this up: An oldie sure, but is OpenGL a goldie?

All right, it’s been a LOT of work, so let’s get some more numbers and put this in perspective. All these numbers are fresh, with the current versions of all of these things on my machine (also I’ve since overclocked my CPU again so there could be some minor gains in memory transfers and whatnot, though unlikely to make a difference here, I’d expect). Of note, I had been using the ROCm OpenCL stack but that is no longer running for some reason, so I’m currently on AMDGPU-PRO. Finally, the setup timings may have changed with some of the recent global structure tweaks but, and this is what we care about here, they’re comparable between themselves.

In terms of performance, HIP is still the frontrunner, followed by OpenCL (if you disregard the setup time) and hipSYCL. OpenGL follows hipSYCL quite closely, actually, which is good news, and then comes Vulkan, from which I still don’t get super impressive numbers in comparison. But that is only true for the regular Compute implementation. Once we move to the classic method, OpenGL drops to last place.

As for the reasons for this, I have a couple of guesses. First is that I suspect doing all the separate OpenGL calls to look up all the uniforms and set them individually could add some delay, though I expect not much (otherwise, OpenGL would be quite unwieldy for regular real time graphics). Second guess is that we’re copying the data more than once and in ways that may be less efficient. When we call glReadPixels instead of just memcpying stuff out, we don’t necessarily know what’s happening underneath: whether there’s some on the fly conversion, how the access happens, etc. Plus, similar to Vulkan, maybe some of this time would be moved to setup on a better implementation, as Generate creates the framebuffer and the VAOs. Doing the work to validate these guesses is outside of what I consider my scope, at least for now/this project, but would be useful in a production scenario (as would better separation of setup and execution).
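
On that last guess, one way a tidier implementation might move the framebuffer and VAO creation out of the hot path is to remember which size the off-screen target was built for and only rebuild it on a resize. The class below is a hypothetical sketch of that bookkeeping, not something toyBrot has. The builder callback stands in for the glGenFramebuffers/glTexImage2D/glGenVertexArrays part of Generate, and I have not measured whether this would actually change the numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper, not part of toyBrot: remembers the size the off-screen
// target was last built for, so the expensive part only reruns on a resize
class RenderTargetCache
{
public:
    using Builder = std::function<void(std::uint32_t width, std::uint32_t height)>;

    explicit RenderTargetCache(Builder builder) : build{std::move(builder)}
    {
        assert(static_cast<bool>(build));
    }

    // Returns true if the target had to be (re)built for this size
    bool ensure(std::uint32_t width, std::uint32_t height)
    {
        assert(width > 0 && height > 0);
        if(built && width == lastWidth && height == lastHeight)
        {
            return false;
        }
        build(width, height);
        lastWidth = width;
        lastHeight = height;
        built = true;
        return true;
    }

private:
    Builder build;
    std::uint32_t lastWidth = 0;
    std::uint32_t lastHeight = 0;
    bool built = false;
};
```

Generate would then call ensure(cam->ScreenWidth(), cam->ScreenHeight()) before drawing, and repeated runs at the same resolution would skip straight to the draw and the readback.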

In the end though, especially if you have access to Compute Shaders (and these days you probably do), the performance there is satisfactory, really, so I consider this a win for “modern OpenGL”. I also took some overclocked GPU numbers for fun and it responded well to it (2125MHz GPU, 1175MHz HBM). Really only Vulkan is lagging a bit in this department which leads me to be even more suspicious of the quality of my own Vulkan code (also it’s definitely not 100% right now because there were some changes to vulkan.hpp which caused toyBrot to no longer build and after “fixing those” it’s crashing hard on exit or when SDL tries to initialise video? I dunno, consider me confused, frustrated and running away from having to deal with that)

Okay, what about things other than performance? WELL…

I’m going to put this straight up front, I’m not a great fan of OpenGL here and would not really use it unless “cornered into doing so”.

The first reason is that I’m not a fan of GLSL. Like, it does its job and is made for graphics. But when you compare it to things like CL C and the “basically just straight up C++” of CUDA/HIP/SYCL and the like, it becomes somewhat lacking when it comes to arbitrary compute. Restricted types and memory models make it harder to work with at times, so I’d avoid relying on GLSL if I could. As far as I’m aware, there’s not really a way of doing that if you’re in OpenGL.

OpenGL also has a fair amount of boilerplate and a somewhat hostile setup. It’s nowhere near as bad as Vulkan BUT it certainly feels older. As much as I’m frustrated with it right now, to me the C++ Vulkan interface helps a lot, and OpenCL, the other boilerplate-heavy option, is ahead of both of those. OpenGL also loses to both of those in another two aspects.

First, it’s the worst debugging experience out of the three, notwithstanding my current Vulkan issues which… **sigh**.

OpenCL and the Vulkan C++ interface have support for exceptions, which make your code much much cleaner and more manageable. With OpenGL you have to sprinkle your error checking wherever you need it and it’s a mess. Additionally, both OpenCL and Vulkan give you the option to print from your kernel/shader (I THINK this is only available for Vulkan on GLSL last I checked tho). And printing from your kernel is SO BLOODY USEFUL in debugging. It gives you an opportunity to double check that things on the GPU side are as expected, which can cut out a world of unknowns as to why what you get back is not quite what you expect. Vulkan even has the validation layers, which give you all sorts of additional info that is often very useful. This to me counts massive points for it and is part of why I think Vulkan actually has a chance at being a viable option even for bigger “compute-only” projects.
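
You can get part of the way to the exception experience by wrapping the error check yourself. The sketch below is one possible shape for that; it is NOT toyBrot's checkGlError, whose implementation isn't shown here. The nextError callback stands in for glGetError, and the numeric constants are the values of the matching GL_* defines from the OpenGL headers, repeated so the example builds without them. throwOnGlError and errorName are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>

// Same values as the matching GL_* defines in the OpenGL headers
constexpr unsigned int kNoError          = 0;      // GL_NO_ERROR
constexpr unsigned int kInvalidEnum      = 0x0500; // GL_INVALID_ENUM
constexpr unsigned int kInvalidValue     = 0x0501; // GL_INVALID_VALUE
constexpr unsigned int kInvalidOperation = 0x0502; // GL_INVALID_OPERATION
constexpr unsigned int kOutOfMemory      = 0x0505; // GL_OUT_OF_MEMORY

std::string errorName(unsigned int code)
{
    switch(code)
    {
        case kInvalidEnum:      return "GL_INVALID_ENUM";
        case kInvalidValue:     return "GL_INVALID_VALUE";
        case kInvalidOperation: return "GL_INVALID_OPERATION";
        case kOutOfMemory:      return "GL_OUT_OF_MEMORY";
        default:                return "unknown error " + std::to_string(code);
    }
}

// `nextError` stands in for glGetError. Each call reports a single error
// flag, so we keep asking until it comes back clean
void throwOnGlError(const std::string& where,
                    const std::function<unsigned int()>& nextError)
{
    assert(static_cast<bool>(nextError));
    std::string found;
    for(unsigned int code = nextError(); code != kNoError; code = nextError())
    {
        found += (found.empty() ? "" : ", ") + errorName(code);
    }
    if(!found.empty())
    {
        throw std::runtime_error(where + ": " + found);
    }
}
```

You still have to call it after every OpenGL call you care about, so the sprinkling doesn't go away, but at least the handling can live in a single try/catch instead of an exit() at every site.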

Which brings me to my second point. To the best of my knowledge, there is no trivial way of having OpenGL working in a headless environment. It’s also not MEANT to be initialised on its own. With any of these other options you have some manner of just “hey, let’s create a CUDA instance and get the available devices”. With OpenGL, even when you are going to “draw” offscreen or not draw at all, you need an OpenGL context, and the way to get one is to ask Windows or Xorg for one. In toyBrot, SDL handles all of that. Solutions for “headless OpenGL”, at least in Linux, involve creating a “fake screen” through the likes of xvfb and then telling the program to run there. But you still need a display server for this, and your application is still kind of tied to it. This also means (again, to the best of my knowledge) that you can’t select an arbitrary device/implementation when you’re running. On my machine, if I’m running this hypothetical very heavy duty program with OpenCL or Vulkan, I can choose whether I want it to run on my Radeon VII or on my Titan X. And at least with Vulkan, I think I could do both if I got things just right? Or I could run one instance on each. No such luck with OpenGL, because it’s all about “who’s responsible for your display output”.

ALL THAT SAID. There are a few lingering strong points in favour of OpenGL here.

The first point is that nothing is as prevalent as OpenGL. If your application runs on top of just OpenGL it’s going to have maximum reach. As long as people have up-to-date drivers they can probably run it (though, again, Apple is working on breaking this on their platforms because capitalism).

Second point is that, somewhat like Vulkan, maybe your application already uses OpenGL, what with its aforementioned prevalence. So if you have some application that uses SDL, or GLFW or even Qt to give you windows and UI and whatnot, OpenGL is already there. So you could just use that for compute as well. This helps you limit your dependencies, which is always helpful, and can be a great way to offload some heavy workloads on, say, simulations, image processing, physics for some game…

Finally, and maybe this also comes back to that first point, earlier in this section I mentioned that I wouldn’t normally resort to OpenGL unless I’m “cornered into doing so” and… sometimes you just are. If it’s the only tool you have available, then it’s also the best tool you have available. While this can be viewed in a negative light, and often validly so when you’re forcibly restricted from using a tool you’d prefer, sometimes there are legitimate reasons why this is the only thing available. And when you’re in THAT situation, then aren’t you glad this ONE tool can actually do it? This is exactly the situation that drove ME to implementing toyBrot using OpenGL: in my case, I want to talk to a GPU through a browser and there is no way to do it outside of OpenGL (and to this effect I’m considering WebGL and WebGPU to count as OpenGL).

OpenGL compute shaders are somewhat clunky and unsophisticated, but get the job done and performance isn’t bad (in this one driver, one gpu, one OS test; “sample size of one” trifecta of an anecdote). OpenGL compute through fragment shader shenanigans is quite painful and introduces a lot more mess in your code, doesn’t perform quite as well, is more restrictive… but sometimes you just need to use OpenGL for whatever reason. And sometimes you can’t even have a compute shader because the universe hates you. With all the caveats, daddy OpenGL is there for you when those days come, and while it HAS definitely been surpassed by those that came after it, it has done its best to keep up, and it achieved a good amount. It’s also very well documented, and having so many people who have used and still use it does mean that there is a lot of helpful content about it out there, which is always a blessing.

TL;DR:
Consider other stuff; use it if you have to, it’s not bad even if there’s better stuff out there. But if you DO have to go the compute-on-fragment route, good luck and have a commiseration beer in my name, because that way IS a bit rough.

What's next?

I’ve mentioned this before but I still have more OpenGL-related content coming. The next post in Multi Your Threading, and next post in the blog in general, is going to be about using OpenGL to access the GPU while deploying to a browser using emscripten. So toyBrot on a GPU on the browser. All the code and whatnot for this is already done. So I’m probably going to jump straight into writing that before I go down some coding rabbit-hole “by accident”, as happened when I was supposed to write THIS chapter a couple of weeks ago.

There’s also an additional post for which the code is already done and is also related to OpenGL. I picked up some development on my old study game engine again, Warp Drive, and ran into the issue where OpenGL is really hostile to multithreading. Now that it’s actually working, I quite like my solution to that issue and it involves a bunch of very C++ shenanigans. Equal chances of you being quite amused or completely horrified, depending on your feelings on template magic, SFINAE, lambdas, futures, move semantics and all that good stuff.