What do when OpenGL just hates multithreading?

A lot of the content I put out on this blog is based on work I do on a project called toyBrot, a mandelbox fractal generator whose main goals are giving me a platform to study different parallelism technologies and a way to compare them. Additionally, it can generate some nice wallpapers. Beyond toyBrot, though, I also have a different personal project that I’ve kept alive on and off over the years: a game engine called Warp Drive. Working with emscripten in toyBrot recently helped reignite my interest in Warp Drive, but as I started working on it I bumped into the problem mentioned in the title and needed to work out a solution. Though its inner workings can look a bit messy, I actually like where I arrived and thought it would be both interesting to talk about and potentially useful for people who run into the same problem in the future, or maybe even a situation with similar requirements.

This post is not connected to any previous ones and you do not need any previous knowledge of either Warp Drive (which is good, since I haven’t written about it) or game engine design in general, nor do you really need any knowledge of OpenGL itself for this to be understandable or useful. Boiled down to the essentials, the problem I’m solving is: 

I have a system that will only take calls from one specific thread and I need to make those calls from other threads sometimes

If you’re in a situation like that, for whatever reason, then this could be a suitable solution. With that out of the way, let’s have a look at what we’re dealing with in order to get some context and understand things a bit better.

I’ve structured this post in a very step-by-step manner. First I talk about Warp Drive a bit and OpenGL to flesh out the context and then I go over how to examine the situation, research and plan your solution….

If none of that interests you and you just want the juicy bits, you can skip straight to section 5.4, which is where the meat really is.

So what's a Warp Drive?

Way back I actually studied some games design and development. As part of the course, we would build our own games engine, implementing features and concepts as we studied them. In a way, I guess this sort of mindset feeds into the type of study I do to this day with programming. If I want to check out a new parallelisation technology, okay, implement it in toyBrot and let’s see how that works. Warp Drive to me, in many ways, plays a similar role.

In terms of actually making games, there are enough available games engines that I find it hard to justify coding your stuff more or less from scratch rather than using something already available that has an actual team behind it. Not only do you have very broad general engines such as Unity, Unreal and Godot, you also have progressively more specialised options such as Game Maker, Multimedia Fusion, RPG Maker and Ren’Py.

In terms of a mechanism to study things, though, a project like this is actually great. toyBrot does, at times, require me to do some fancier build-related shenanigans, but each project is fairly self-contained: it only has a couple of source files, one of which (main.cpp) only gets minor tweaks anyway… So the bulk of each new feature is very self-contained. Warp Drive, on the other hand, is a much bigger project with a lot of systems and moving parts at different levels of completion, age and code quality. So it’s a great opportunity to run into some real-world-type issues, complete with good old technical debt, where you implement something on one end and some supposedly unrelated thing breaks somewhere else, and now you have to understand why.

It’s also a project I have a lot of fun working on and one that opens different opportunities from the more focused toyBrot. So let’s get to know Warp Drive a bit, what’s the big idea?

Important to us here is the big main class of Warp Drive, called just “Game“. Game is a singleton and it manages the life cycle of the application. The event loop starts when you call Game::Run() and it ends when that function returns. There are other managers for stuff such as Display and whatnot, but Game holds two collections important to the application: a vector of pointers to GameStates and a (rather messy, needs refactoring) map of pointers to GameObjects.

GameStates are essentially different “screen types”, somewhat like an Android Activity, if you hail from those parts. Game has a currentState int that points to an index on the gameStates vector which is how Game knows what to call. If this is ever a negative value, then that’s Game’s cue to clean stuff up and return from Run().

During the main loop what Game does is it’ll essentially call this:

while(currentState >= 0)
{
    /*
     * Update base systems and stuff, run global timers...
     */
    currentState = gameStates[currentState]->Update();
    if(currentState >= 0)
    {
        gameStates[currentState]->Draw();
    }
}

Really basic stuff, nothing fancy. Each GameState will have its own logic regarding what to do in those functions, including potentially setting up what GameObjects are supposed to be drawn and/or are active (need to be updated). Game can then, at the GameState’s request, do an updateObjects() or a drawObjects() where it’ll go through its collection of GameObjects and will call their own Update() and Draw() functions as needed (though the GameState can alternatively just request iterators to the map with the objects they want and do their own thing).
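To make that pass concrete, here is a minimal sketch of what such an object walk could look like. The names and structure are my simplification for illustration, not Warp Drive’s actual classes:

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical, stripped-down stand-ins for the real classes
struct GameObject
{
    bool active  = true;
    int  updates = 0;
    void Update() { ++updates; }
    void Draw() const { /* issue draw calls */ }
};

struct Game
{
    std::map<std::string, std::shared_ptr<GameObject>> gameObjects;

    // Walk the collection and update whatever is flagged active
    void updateObjects()
    {
        for (auto& entry : gameObjects)
        {
            auto& obj = entry.second;
            if (obj && obj->active) { obj->Update(); }
        }
    }
};
```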

The flaw in this design (and one that actual production game engines may or may not try to overcome) is that if either your updates or your drawing gets bogged down by something, the other will too. So, if your graphics card is having a hard time, suddenly your physics needs to stall as well. Warp Drive does try to overcome this by splitting the update and draw into separate threads that run independently (though there is a build flag to not do this, mainly because WebAssembly thread support is still spotty). So this workflow ends up looking a bit like this in the “optimal” scenario.

This is the important bit, right here

What this means for us here is that whenever we are in an update() function of a GameState or GameObject, or we are doing input handling (since that is part of the GameState‘s base class update()), whatever gets called there will be called from this separate running thread. And this is all we really need to understand for this issue in particular.

All right, so what about OpenGL?

I’ve talked a bit more about OpenGL previously, going more in depth into its history and whatnot, so if you’re curious you can check that out. I’m not going to copy-paste that here, instead focusing on what is relevant about it for the issue at hand.

OpenGL, from the programming side, is an API that enables you to tell your graphics card to do stuff and manage rendering contexts and whatnot. Additionally, OpenGL can be seen/is akin to/behaves as somewhat of a State Machine. What it means for us is that you can think of OpenGL as this one big system/object that lives somewhere and when you send commands to it, you put it in a different state, in which it’ll remain until you tell it something else. You can read a bit more about this in this stack overflow question and the OpenGL wiki if you want, but don’t worry about this too deeply.

Important for us right here is that, perhaps as part of being very concerned about maintaining the integrity of this State, OpenGL marries a thread and it is very monogamous. When you create an OpenGL context, the thread that created it “owns” that context. If you make any calls to OpenGL from a different thread, OpenGL essentially just fails those calls. For us, it means that if, for some reason, we need to make an OpenGL call from the update thread, that call fails. I’d need to double check, but I believe you get a GL_INVALID_OPERATION and that’s it. As it turns out, the way I’ve coded things in Warp Drive, sometimes I MAY need to make an OpenGL call from the update thread…

The specific issue which caused the problem to be a problem

There is a concept in 3D Graphics/ Game dev called “picking”. Picking is when the user clicks/taps somewhere on their 2D screen and you now have to figure out if they have clicked or tapped on an object in your 3D environment. There are a few different ways of doing this and I was fiddling with a particular one:

Somewhat using the same kind of ideas I use in toyBrot: if I can figure out the place in space where that screen pixel is, I can spawn a ray going from the camera’s origin, passing through that pixel, and then check for collisions with that ray. It’s not a super great way of doing it, but it’s simple, so I thought I’d implement this for now and look for something better later. To this end I was checking whether a function called gluUnProject would work for giving me the “pixel coords in space” as I wanted, or if I’d have to write my own version of it. And I was getting unexpected results.

The easiest way to start debugging this is just drawing the ray and seeing if it IS where it should be, because then you can know if you’re checking for collision wrong or spawning your ray wrong. Easy enough… he said…

This is where I run into trouble because I spawn my ray as part of treating a “screen clicked” event, which means this is happening on the update thread. When I create the ray I create a couple of verts and whatnot for OpenGL to draw it. But because this is happening on that separate thread, that is no bueno and suddenly it’s all gone wrong.

So what are our options and where do we go?

Information on this topic seems to vary a lot in quality, with most of it being from StackOverflow questions and answers and the like but from my research, there are two options:

Option #1: Hot seat OpenGL

Our first alternative is to pass ownership of the context back and forth. This is, for example, what’s suggested in this OpenGL Wiki article. The article relies on Windows API stuff, so lower-level OpenGL than I use, but from a quick glance at their code, it seems to me like this is also what Godot offers as an option. They also do that by accessing the lower-level APIs from each platform. The linked file is the header for their Linux display stuff, and in the .cpp they’re making calls straight to X11 for a lot of things, but this specific one seems to not be implemented, so not a great start. This has two problems for me:

First, I’m really not interested in making calls to things like X11 directly. If it can’t be a shared API or go through an abstraction layer such as SDL, this is not a solution I’m interested in, because I don’t want to have to deal with backend minutiae for different platforms. From what I can tell from their docs, SDL doesn’t really have that kind of thing. It has an SDL_GL_MakeCurrent function, but reading the docs, that looks like something else.

The second problem is that this can introduce a LOT of mutual blocking between threads. Every time a thread makes an OpenGL call, it’d need to do something like:

void Ray::draw() const
{
    Matrix44 model;
    model.setTranslation(origin.X(),origin.Y(),origin.Z());
    Matrix44 transform;
    transform.setScaling(1000.f);

    //blocking call waiting on mutexes in the DisplayManager side
    DisplayManager::MakeGLThreadAndLock(std::this_thread::get_id());
    
    glUniformMatrix4fv( transformUniform, 1, GL_FALSE, transform.Elements().data() );
    glUniformMatrix4fv( modelUniform, 1, GL_FALSE, model.Elements().data() );
    glUniform4f( ambientUniform, colour.R(), colour.G(), colour.B(), colour.A() );

    VAO.draw("wire",true);
    
    DisplayManager::UnlockGLThread();
}

All this locking, unlocking and waiting between threads can get out of hand in a hurry, and I would hope it’s easy to see that two threads which need to wait on one another very quickly wreck any performance advantage you might have had by splitting your code into separate threads in the first place. So I don’t like it.

Option #2: Don't ACTUALLY call it from the other thread

You see, how we frame things always makes a difference. It’s not that we “need to make these calls from Thread B”, the situation is “Thread B is doing something that means these calls need to be made”. So we could still just ask Thread A to do it. In that case, whenever you have an OpenGL call, you can have a workflow that goes somewhat like:

if(Game::mainThreadId() == std::this_thread::get_id())
{
    glUniformMatrix4fv( transformUniform, 1, GL_FALSE, transform.Elements().data() );
}
else
{
    Game::RunThisPlez(glUniformMatrix4fv, transformUniform, 1, GL_FALSE, transform.Elements().data() );
}

And there are a few interesting things going on here. The first of which is that we’re somewhat mixing together two parallelism paradigms. I mentioned briefly in the very first Multi Your Threading that two ways of conceptualising your parallelism are thread-based parallelism and task-based parallelism. Here, we’re doing both.

When Warp Drive has the one main thread that runs the display and main systems and then a separate thread that continually updates things, we’re clearly thinking in terms of threads. The way you split your work is by thinking of these separate application threads and what they’ll be doing throughout their lifetime. Normally you spawn your thread and then send it on its way: be free, little thread. Whatever happens to it, or whatever it does, is no longer (directly) the concern of the mother thread. This is exactly what happens when Warp Drive spawns a separate thread to just call Update on stuff until it’s time to exit. That thread is now on its own, mostly.

Thread-based parallelism is the traditional way of thinking about parallelism in computing, heck, we just say multithreading most of the time. It’s also a way of thinking that maps better, most of the time, to the underlying processes at a lower level (your CPU has a concept of execution threads and hardware cores that they map to). But you can also think of parallelisation differently.

When you send a bunch of stuff for your GPU to do, you’re normally not thinking about threads on your GPU, nor is that normally how you treat those inside the shaders themselves. Instead, generally, you think about what the task is that needs to be done, and you have a certain amount of tasks which will be done in parallel. This changes how you think of and how you structure your code. To me, this maps better to how we as people approach a greater variety of problems, compared to the thread-based mindset. If you look at things like std::async and TBB, those are this type of paradigm in action. You say “okay, I have these things I need done, take it away from here, plez”. Not worrying about threads and what’s going to go where… just thinking about what it is that you need done.

So what we would need to do here, if we were to adopt this strategy, is, in a way, introduce this task-based parallelism in conjunction with our preexisting thread-based system. This is exactly what we’ll do; I like this solution much better.

If you’re checking out my actual code, you may be a bit confused when you see that I use std::async for launching the update thread rather than spawning an actual std::thread. This is really just because I like the syntax better. By specifying std::launch::async as a launch policy, I force the code to “spawn a new thread and run there right now”, and if you look, you can see I never bother get()ting the future that comes back from this call. I’m using task-based syntax, but this IS very much a thread-based mindset, and this is one of the situations which DO map better to that paradigm.
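As an illustration, here’s a minimal sketch (my names, not Warp Drive’s code) of using std::async with std::launch::async purely as a “give me a dedicated thread” mechanism. One caveat worth knowing: the returned future has to be kept alive, because a discarded std::async future blocks in its destructor until the task finishes:

```cpp
#include <atomic>
#include <chrono>
#include <future>
#include <thread>

std::atomic<bool> running{true};
std::atomic<int>  updates{0};

// Stand-in for an update loop that spins until told to stop
void updateLoop()
{
    while (running.load())
    {
        ++updates;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// std::launch::async forces a new thread right away; we keep the
// future around and only get() it when we want to "join"
int runForABit()
{
    auto fut = std::async(std::launch::async, updateLoop);
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    running = false;
    fut.get();            // effectively a join
    return updates.load();
}
```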

Implementing our solution

Now, I’m taking a bit of poetic liberty here and skipping over some of the iterative process where you keep coming back to reevaluate your work as you find more and more details, edge cases and whatnot. Instead, I’m somewhat streamlining things to give an idea of how I like to go about sorting this type of problem out and what, in my mind, is a useful way to approach it. With that out of the way, and with Step 0: Researching your problem done, let’s start with:

Step 1: Imagining a better future

A lot of the time, ESPECIALLY for problems that involve a lot of refactoring (such as, say, having to potentially change how you make every single OpenGL call in your engine) I like to start backwards, by asking myself “how do I want my final solution to work?”

How do I want this to work? How do I want the code to look like on the user side? What sort of functionality do I want available? What do I want to be calling? How do I want these calls to look like when I’m coding?…

Everyone that’s ever had to work with or on other people’s code (and here, Past You also counts as someone else) knows that a good API is key to the maintainability and usability of any piece of software, especially a library, like an engine. A dev who’s used to running TDD learns the habit of asking of any particular function or functionality “what would break this and should we care? What would be the edge cases to beware of?”. In the same way, I think someone who’s used to writing code for other coders should strive to learn the habit of asking themselves, of any functionality they’re thinking of implementing: “What would be a useful and pleasant way to use this in some code?”. It takes imagining yourself writing code that uses your solution, and working out what your solution would need for that to be you having a good time.

 

If you keep that in mind, it can help you design your APIs in useful ways. Even if “the other coders” are just Future You, this is a great mindset to have. Though I still encounter many a bit in my code where I’m just sad about Past Me, sometimes I’m like “oh… ooohhh, this is so cool, yes! Thank you, past me!”. It’s a great feeling.

So what DOES our solution look like, in the perfect world?

  • Well, I don’t want to be doing any of that thread-checking when I’m making a call. I might be calling a wrapper or even (oof) a macro to do this, but I just want my call to be straightforward
  • I also want my calls to be consistent. I don’t want to have different syntax for different calls. The One wrapper to abstract them all.
  • I need (as I found out as I coded) to be able to guarantee not only that my calls are executed in the order I request them; I ALSO might need them to be executed without other stuff going on in between. Because you don’t want to be surprised by someone else loading a different shader or binding another array while you’re sorting stuff out
  • As part of that thread-checking stuff, I don’t want to have to care, when coding, whether I am building for single or multithreaded. I want to code it once, compile however I need and be done

This to me is a good guideline for what I want, so that’s what we’re going for, to the best of our ability and trickery. But how do we shot web?

Step 2: Slamming face first into reality

What we want here sounds simple, but every programmer knows to be very, very skeptical of code that just happens to build first time after being written. Especially if it runs and appears to output what you wanted. As we look deeper, or are made to look deeper, we end up having to confront the more insidious issues of our problems. In this case, there are a few tricky ones and they mostly centre around the fact that “any OpenGL call” is actually quite a broad category.

The problem with OpenGL calls being broad is that they may take all sorts of arguments. This makes it harder for us to standardise things. At some point we need a system that expects one specific kind of thing to call. If we’d normally take all sorts of arguments, then we need to work around this. OpenGL calls return different things too: floats, ints, pointers… we need to work around this somehow. And, to make things even better, OpenGL calls may take NO arguments and return nothing! This is bad because, say, you can’t take a reference to void, and this could complicate the syntax on your call side if you don’t have any arguments to pass.

Those, in my opinion, are the trickiest bits. In addition to those, we need Game to have some sort of command queue for us to execute from the display loop. This queue will be accessed by multiple threads concurrently so we need to prevent data races here too. This bit though, is mostly straightforward standard multithreading shenanigans. So even if we do have to code around it, this should be the easy part.

Step 3: What's in the (tool)box?

All right, so now we have an idea of where we want to get to and what some of the pitfalls along the way might be. We may or may not have read through a couple of Stack Overflow questions about it, or had a look at some relevant documentation… so the picture is hopefully becoming clearer. A thing which I find useful at this stage, before you just jump into coding, is to take a step back and look at what tools you have available. Any libraries you could integrate? Anything you need to, or could, wrap? What are the base language features and standard library constructs related to your problem?

Really what you want to figure out at this point is exactly how much of the wheel you NEED to reinvent and what is already just waiting to be used instead. As someone who codes C++ most of the time, I tend to gravitate to looking at what’s in the standard library first and foremost. Since a lot of the changes from C++11 onward were there to cover gaps in C++ (the language had no threading at all, for example) as well as to enable code that maps better to different coding paradigms (such as all the functional-programming-like stuff in <algorithm>, which was massively expanded, and how it interacts with lambdas), I like looking at those as well if I can. Sometimes your problem might not be as unique as it looks.

For example, in our case, if you’ve read the bit where I’m exploring potential solutions, you may remember that the solution I’ve chosen looks at the problem as one that is very very similar to task-based parallelism. So, what’s that universe like? Well, I’ve used std::async before on toyBrot (really I use it all the time) but there IS a more “manual” version of this, it’s a std::packaged_task<return(args…)>.

I’m going to talk about functors a bit. If the word is new to you, a functor is a “function object”: an object that can be called as if it were a function. Essentially, it has operator() implemented for it. Examples are what actually gets generated when you write a lambda (caveat: it might be optimised away, inlined, etc.) as well as instances of the std::function class.
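As a quick illustration (the names here are mine, purely for demonstration), here’s a hand-rolled functor next to a std::function that can hold it, or an equivalent lambda, interchangeably:

```cpp
#include <functional>

// A hand-rolled functor: just a class with operator() implemented
struct Multiplier
{
    int factor;
    int operator()(int x) const { return factor * x; }
};

// A lambda compiles down to an equivalent, anonymous functor class;
// both can hide behind the same std::function type
int applyTwice(const std::function<int(int)>& f, int x)
{
    return f(f(x));
}
```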

So, a packaged_task takes a function/functor and wraps it up in itself, and it IS itself a functor. It is templated on a function signature, BUT packaged_task::operator() always returns void. Instead of just getting the return value normally, you call get_future() on it, which gives you back a future<return>. Nothing here is copyable, but we CAN move these around, so if we move the packaged_task to the main thread and the future to the “calling thread”, then that thread can get() the future and we’re good to go on that front (famous last words).
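A minimal sketch of that move dance, with illustrative names, might look like this:

```cpp
#include <future>
#include <thread>
#include <utility>

int multiply(int a, int b) { return a * b; }

// Package the call on one thread, run it on another, and sync
// through the future: a tiny version of the scheme described above
int packagedAcrossThreads()
{
    std::packaged_task<int()> task([]{ return multiply(6, 7); });
    std::future<int> result = task.get_future();

    // packaged_task is move-only; hand it over to the "GL thread"
    std::thread worker(std::move(task));
    worker.join();

    return result.get();
}
```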

If you’re not familiar with futures: a future<T> object has a get() function which will return a value of type T that is provided through a promise<T> object. In our case the promise is internal to the packaged_task. The useful thing here is that future::get() will block until the promise is “fulfilled” (or you can wait with a timeout via wait_for()), so we can use this to sync different threads, even if T is void. Additionally, if things go wrong in the actual function, you can put an exception instead of a value in the promise, and it will then get thrown in the code that does the get(), helping you manage where to handle errors.
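Here’s a small, self-contained sketch of that error-propagation behaviour (the names and the “GL went boom” error are made up for the example):

```cpp
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

// future::get() blocks until the promise is fulfilled; if an
// exception was stored instead of a value, get() rethrows it
// on the waiting thread
std::string waitForWorker(bool fail)
{
    std::promise<std::string> prom;
    std::future<std::string> fut = prom.get_future();

    std::thread worker([&prom, fail]
    {
        if (fail)
        {
            prom.set_exception(
                std::make_exception_ptr(std::runtime_error("GL went boom")));
        }
        else
        {
            prom.set_value("all good");
        }
    });

    try
    {
        std::string res = fut.get();   // blocks for set_value/set_exception
        worker.join();
        return res;
    }
    catch (const std::runtime_error& e)
    {
        worker.join();
        return std::string("caught: ") + e.what();
    }
}
```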

So packaged_tasks help us split the call and the return apart across different threads and sync them up. But we still need to deal with all of that argument discrepancy. In order to do so, our main tools are going to be variadic templates and std::bind

Variadic templates are a feature introduced in C++11. In short, they are templates that take a parameter in the form of typename... Ts. This tells the compiler that Ts can be 0 or arbitrarily many parameters of potentially different types. There are a few different mechanisms for dealing with them inside the templated code itself, but we just need the simplest one. Our Ts are just a bunch of parameters which we want to pass to the function, and we don’t really care a lot about what or how many they are beyond just handing them over. So it’s easy mode here.
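In that easy-mode spirit, a sketch of such a forwarding wrapper could look like this (the names are illustrative, not Warp Drive’s API):

```cpp
#include <string>
#include <utility>

// The simplest use of a parameter pack: take whatever arguments the
// caller has and hand them, unchanged, to the wrapped call
template <typename Func, typename... Ts>
auto callLater(Func&& f, Ts&&... args)
{
    return std::forward<Func>(f)(std::forward<Ts>(args)...);
}

int add(int a, int b) { return a + b; }
std::string shout(std::string s) { return s + "!"; }
```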

std::bind is also a new addition from C++11, of sorts. It expands and generalises previously existing functions that allow us to bake in certain arguments to a function call. In short, it enables you to do something like this:

int multiply(int a, int b) {return a*b;}

auto mult2by3 = std::bind(multiply, 2, 3);

auto multby3 = std::bind(multiply, std::placeholders::_1, 3);


//these are all the same call underneath
int a = multiply(2,3);
int b = mult2by3();
int c = multby3(2);

So if we bind the function we want to package, we can make sure that the packaged task takes the same number and type of arguments, 0 and nothing, regardless of what the original call looked like. As expected, what we get back is another functor that we can pass around.

There’s one variable left to deal with in our function, though: the return. The entire signature of the function is the template parameter for either std::packaged_task or std::function, so <void()> and <int()> are still different types and we couldn’t just make one collection that holds both. But really we, ourselves, don’t care about the return type here; once we’ve got our future out, that’s it. So we wrap our functor again, this time in a lambda that takes no parameters, returns nothing, and just calls the task. So, something similar to…

//this returns an int and takes two ints
int multiply(int a, int b) {return a*b;}

//this returns an int and takes nothing;
auto mult2by3 = std::bind(multiply, 2, 3);

//this takes nothing and returns nothing
//but res is captured by reference so it gets updated
int res = 0;
auto lambda =   [f{mult2by3}, &res]
                ()
                { res = f(); };

So this is like a function call inside a functor inside a functor inside a functor sort of deal. Wrap all of the things and put a bow on top. And these are the main tools we have on the packaging and abstracting side. The executing side is much simpler: really, we need some sort of queue of functors that we’ll call, and besides that, just make sure we deal with data races and such.
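Putting the whole onion together, a sketch of the full wrap might look like this. The names are mine, and note one wrinkle: std::function requires a copyable callable, while packaged_task is move-only, so one common workaround is stashing the task in a shared_ptr:

```cpp
#include <functional>
#include <future>
#include <memory>
#include <utility>

int multiply(int a, int b) { return a * b; }

// bind bakes in the arguments, packaged_task splits off the future,
// and an outer lambda erases the return type so everything fits in
// one uniform void() queue slot
std::function<void()> wrapCall(std::future<int>& fut)
{
    auto bound = std::bind(multiply, 6, 7);            // callable as int()
    std::packaged_task<int()> task(std::move(bound));  // int(), with a future
    fut = task.get_future();

    // shared_ptr makes the move-only task "copyable enough"
    // for std::function
    auto shared = std::make_shared<std::packaged_task<int()>>(std::move(task));
    return [shared]{ (*shared)(); };                   // void()
}
```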

Instead of using a std::queue, though, I am going to use a std::list. The reason is that there is a list::splice function that allows you to essentially append one list onto another. This will facilitate sending little queues of commands all at once. One might consider a std::forward_list instead, since we don’t really need to move in two directions, but a forward_list really behaves as a stack, which means we’d need to consume it in the opposite order we built it. Unless we do some reversing, but at that point… what are you doing? Just use the other list.

All that wrapping enables our list to be, essentially, a std::list<std::function<void()>> which means we finally have a well defined task collection that we can have in one place and centralise its management and consumption. Nice.
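A small sketch of that queue in action (illustrative names), splicing a batch in and then draining everything:

```cpp
#include <functional>
#include <list>
#include <utility>

using TaskQueue = std::list<std::function<void()>>;

// Append a whole batch of tasks at once, then drain the queue;
// splice relinks list nodes instead of copying the functions over
int runQueue(TaskQueue& mainQueue, TaskQueue&& batch)
{
    mainQueue.splice(mainQueue.end(), batch);

    int executed = 0;
    while (!mainQueue.empty())
    {
        mainQueue.front()();
        mainQueue.pop_front();
        ++executed;
    }
    return executed;
}
```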

I’ve mentioned before that a lot of this is an iterative process. You make some assumptions, work with them, then discover there’s either more details (and there were so many details) or you were wrong about some stuff and you reevaluate those assumptions….

Even in this streamlined, curated form, I haven’t gone through all of the things. Why don’t we take a look at the actual final implementation, discover that and add more wrappers along the way for good measure?

Step 4: Getting our hands dirty

All right. We’ve done as much planning as we can do at this point before getting our hands dirty, more or less. For a larger project or more involved process we could document our plans and whatnot, draw them on a whiteboard, get other people in the team to bang some heads together… but since Warp Drive is a one-man show, as long as I keep what I need in mind and scribbled in my notebook it should be good enough.

Really, keep a notebook and a pen ready on your desk if you can. It helps a lot at times

With all those considerations in mind, we can start actually writing down some code. I’ll start with the changes in the Game class because those are the simplest. The relevant parts look like this:

//game.cpp
using GLTaskQueue = std::list< std::function<void()> >;

class Game
{
    /* ... */
    GLTaskQueue glqueue;
    GLTaskQueue glactive;
    std::mutex queueLock;
};


void Game::queueGLTask(std::function<void ()> &&call)
{
    if(static_cast<bool>(call) )
    {
        queueLock.lock();
            glqueue.emplace_back(call);
        queueLock.unlock();
    }
}

void Game::queueGLTasks(GLTaskQueue &&queue)
{
    if(!queue.empty())
    {
        queueLock.lock();
            glqueue.splice(glqueue.end(), queue);
        queueLock.unlock();
    }
}

void Game::draw()
{
    DisplayManager::instance()->clearDisplay();
    DisplayManager::instance()->updateMatrices();

    if(queueLock.try_lock())
    {
        glqueue.swap(glactive);
        queueLock.unlock();
    }
    while(!glactive.empty())
    {
        auto t = glactive.front();
        t();
        glactive.pop_front();
    }
    if(currentState < 0)
    {
        return;
    }
    states[static_cast<unsigned int>(currentState)]->draw();
    if (fps)
    {
        drawFPS();
    }
}

Nothing here should be too unexpected. Just an alias for our queue of functions, and then adding to Game a mutex to prevent races, plus two queues so we can double-buffer: we add to one queue and consume from the other.

When it comes to adding to the queue, we’re taking rvalue references exclusively. We’re taking ownership of the tasks that get forwarded and we’re forbidden from copying anything. This is because std::packaged_task is non-copyable (though, spoilers, some copying was still happening). Which, if you think about it, makes sense since it’s tied to a future. If you accidentally copied it, would the copy then have a different future? If so, the one you’re holding is never going to get a value. If both tasks point to the same future, that’s also no good: it can only receive a value once.

Taking an rvalue reference to the queue specifically also makes things more efficient. It tells your code that the queue we’re receiving is ours and no one else has it, so the splice() function can be as simple as just tying the forward and backward pointers on either end and calling it a day.

If you’re not familiar with move semantics and rvalue references, this Stack Overflow thread has a literal excess of information and explanations about it. For our purposes: if we do list B = std::move(list A); (std::move is just a cast to an rvalue reference), the expected implementation is that we just grab the pointer to list A’s head and make it the value of list B’s head, while making list A’s head point to nullptr instead. So we don’t copy or allocate anything, which is very quick, but list A is now empty (left in a valid but unspecified state).

Given we’re making this a multithreaded system, we want to guarantee correctness for any number of threads we might have (I mean, I can’t guarantee some specific game wouldn’t have a third thread accessing this for whatever reason), so we just guard it with a std::mutex. The mutex’s lock() function will block the thread if the mutex is not available. So we’re assuming that if you NEED to submit this call, you just need to wait; there’s no other way.

Being a simple mutex, we don’t really have a way of checking for deadlocks, which I’m just assuming won’t happen because I’m that much of an optimist. A more robust solution would be to use a std::timed_mutex, which lets you specify how long you’ll wait trying to get the lock and, should that fail, you take it from there and deal with it in whatever manner is appropriate.
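To give an idea of what that would look like, here’s a sketch of try_lock_for in action. The worker thread holds the lock until told to release it, so the timed attempt fails deterministically instead of deadlocking:

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Sketch: std::timed_mutex lets us bail out instead of blocking forever.
bool timedLockDemo()
{
    std::timed_mutex mtx;
    std::atomic<int> stage{0};

    std::thread worker([&]
    {
        mtx.lock();
        stage = 1;                                   // lock held; let main proceed
        while (stage != 2) { std::this_thread::yield(); }
        mtx.unlock();
    });

    while (stage != 1) { std::this_thread::yield(); }

    // The worker holds the mutex: this attempt times out rather than deadlocks
    bool gotIt = mtx.try_lock_for(std::chrono::milliseconds(20));

    stage = 2;                                       // tell the worker to release
    worker.join();

    bool gotItNow = mtx.try_lock_for(std::chrono::milliseconds(20));
    if (gotItNow) { mtx.unlock(); }

    return !gotIt && gotItNow;   // first attempt failed, second succeeded
}
```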

Finally, in the Draw() function, we consume the queue. The first thing we do, though, is check whether the queue is being messed with. If the queue IS locked, we just ignore it and move on. There’s no reason for the draw loop to wait on the queue; it can just process it the next time around. If it’s not locked, try_lock WILL lock it. Then we swap the queue that’s receiving commands with the one we’ll process. The swap itself is very quick, since for lists it just exchanges a few internal pointers. After this very quick operation, we release the lock so that the other threads can resume submitting things, and from here on we just call every function on the queue, discarding them afterwards, and proceed to the regular draw loop stuff. Nice and simple.
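Putting the submit and consume sides together, the whole double-buffered arrangement boils down to something like this (names here are illustrative, not Warp Drive’s actual ones):

```cpp
#include <functional>
#include <list>
#include <mutex>
#include <utility>

// Minimal sketch of the double-buffered task queue. Producers splice into
// 'incoming' under the mutex; the draw loop swaps it with 'processing'
// only if the lock is free, then runs the tasks without holding the lock.
struct TaskQueues
{
    std::mutex mtx;
    std::list<std::function<void()>> incoming;
    std::list<std::function<void()>> processing;

    void submit(std::list<std::function<void()>>&& q)
    {
        std::lock_guard<std::mutex> lock(mtx);   // producers are willing to wait
        incoming.splice(incoming.end(), q);      // O(1) pointer surgery
    }

    std::size_t consume()
    {
        if (mtx.try_lock())                      // never block the draw loop
        {
            std::swap(incoming, processing);     // very quick under the lock
            mtx.unlock();
        }
        std::size_t ran = processing.size();
        for (auto& task : processing) { task(); }
        processing.clear();
        return ran;
    }
};
```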

This workflow means the update loop should never block the draw loop. When I was talking about the “Hot Seat OpenGL” option, this was a problem: the threads could block each other, but not here. The update thread will wait on the draw thread for confirmation that the OpenGL commands were executed, but the draw thread never waits on the update thread for anything, not even to manage the queue. If that’s busy, it ignores it and carries on drawing stuff.

So that’s the Game side out of the way. With that, let’s get started on the real meat of the thing, let’s look at our new file: gltask.hpp.

 

The first thing we’ll do is add yet another wrapper to the mix, but before we go there, let’s see why:

When I was laying down the requirements I wanted, I said I wanted the thread-checking and whatnot to be done internally to the “runMyFunction” call. This means that internally one of two things can happen:

 

  • We’re on a separate thread so we get a packaged_task that we want to send to Game; OR
  • We’re on the main thread so we can avoid some of the wrapping and just call the packaged_task

 

We know that we need the future back anyway but we may or may not get a task back from this whole thing. We COULD just queue the task there and then but that prevents us from queuing batches of tasks, so really, we need a std::pair. However we’re going to have our own std::pair:

//gltask.hpp

template<typename T>
class GLPackage
{
    public:
        GLPackage(std::function<void()>&& t, std::future<T>&& f)
            : task{std::move(t)}
            , future{std::move(f)} {}
        GLPackage(std::future<T>&& f): task{}, future{std::move(f)} {}
        GLPackage(const GLPackage& other) = delete;
        GLPackage(GLPackage&& other) = default;
        ~GLPackage() = default;

        explicit operator bool() const noexcept
        {
            return static_cast<bool>(task);
        }
        std::function<void()>&& Task(){ return std::move(task);}
        std::future<T>& Future(){return future;}

        T get() {return future.get();}

    private:
        std::function<void()> task;
        std::future<T> future;
};

template<typename T>
void pushGLTask(GLTaskQueue& q, GLPackage<T>& p)
{
    if(static_cast<bool>(p) )
    {
        q.push_back(p.Task());
    }
}

Right from the get-go, GLPackage is non-copyable, which we really want. Secondly, it also helps us with the task/no-task situation. std::function provides a conversion operator to bool, which we really just forward here. A std::function is true if it’s properly initialised and callable. If you default-construct a std::function, it is false, which is the case when we construct our GLPackage using just a future. In this way, asking if the package is true is really asking if there’s a function to be moved. Once the function is moved away, what’s left behind is false as well (strictly speaking the moved-from state is unspecified, but in practice it’s empty). Whatever is in the std::future is also the template parameter for GLPackage.
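The bool conversion behaviour is quick to demonstrate on its own:

```cpp
#include <functional>

// A std::function converts to true only when it holds a callable target
bool emptyIsTruthy()
{
    std::function<void()> f;          // default-constructed: no target
    return static_cast<bool>(f);      // false
}

bool assignedIsTruthy()
{
    std::function<void()> f = []{};   // holds a callable target
    return static_cast<bool>(f);      // true
}
```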

The free-floating function, pushGLTask, makes use of this. It abstracts this check from external-calling code and adds the Package’s function to a queue IF there is one to be added. So, where do these packages come from, then?

//gltask.hpp

// Code here adapted from
// https://stackoverflow.com/questions/34109641/c11-wrappers-for-opengl-calls

template <typename F, typename ...Args>
auto wdglcall(
    std::enable_if_t<!std::is_void<std::result_of_t<F(Args...)>>::value, const char> *text,
    int line,
    const char *file,
    F && f, Args &&... args
    )
{
    auto bound = std::bind(f, std::forward<Args>(args)...);
    
    std::string errMsg{ std::string(file)
                      + std::string("@")
                      + std::to_string(line)
                      + std::string(" -> ")
                      + std::string(text)};
    
    std::packaged_task< decltype(bound()) (void)> task ( 
                    [b{std::move(bound)}, msg{errMsg}]()
                    {
                        //This is the bit that changes with void
                        auto ret = b();
                        DisplayManager::instance()->checkGLError(msg);
                        return ret;
                    }                                   );
    
    auto fut = task.get_future();
    
    if(std::this_thread::get_id() != Game::instance()->mainThreadID())
    {
        auto ptr = std::make_shared<decltype(task)>(std::move(task));
        auto func = [ p{ptr}]() mutable {(*p)();};
        return GLPackage(std::move(func), std::move(fut));
        
    }
    else
    {
        task();
    }
    
    return GLPackage(std::move(fut));
}

template <typename F, typename ...Args>
auto wdglcall(
    std::enable_if_t<std::is_void<std::result_of_t<F(Args...)>>::value, const char> *text,
    int line,
    const char *file,
    F && f, Args &&... args
    )
{
    .
    .
    .
}

#define WDGL(fn, ...) wdglcall(#fn, __LINE__, __FILE__, gl##fn, __VA_ARGS__)
#define WDGLv(fn) wdglcall(#fn, __LINE__, __FILE__, gl##fn)

All right, so let’s take this by parts as it’s a bit scary on the surface. Also, if you prefer to check the actual source file, you can find it here (minor differences in formatting for this post).

SO, we’re looking at a templated function which takes as parameters an F, which we’re assuming is a function of some sort, and zero or more args of whatever types. The first parameter, the enable_if one, is probably the weirdest part if you’re not used to heavier template shenanigans (I myself have to look up and double-check this type of thing pretty much every time it comes up, as it’s not a tool I have to bring out every day, so don’t worry if you’re a bit confused). That parameter makes use of a very useful peculiarity in template specialisation known as SFINAE.

“Instantiating a template” for particular types is called specialising. When the compiler is specialising a template, Substitution Failure Is Not An Error. To understand what this means, let’s “translate” that enable_if parameter:

 

We have an argument to this function, called “text” which is a pointer to const char. Enable this IF the result of calling F, with arguments of whatever types we get from Args, is NOT void.

 

The “enable if” part means that if that result IS void, what the compiler sees when it’s reading this is garbage. It tries to specialise the template and has a substitution failure. But, as in the acronym, this is NOT an error, so compilation proceeds as normal; the compiler just can’t use this template, so it gets discarded. Then, further down, it finds the second version of this template, which it WILL be able to instantiate, because that one is missing the “not” on its enable_if line.

If the compiler either cannot instantiate the function/class at all OR gets two different specialisations and it can’t tell which one is the best, most specialised one, THEN you have an error. For us, this means we can write different behaviour of the same function for different cases. In our case, if F’s return is void, we have one behaviour, if it’s anything else, then we have another implementation, which is what’s on the snippet.
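Here’s a stripped-down version of that exact trick, away from all the OpenGL plumbing (it uses std::result_of_t to match the post’s code; note that result_of is deprecated since C++17 in favour of std::invoke_result_t):

```cpp
#include <string>
#include <type_traits>
#include <utility>

// One overload is enabled when F's result is non-void, the other when it
// is void. The compiler silently discards whichever substitution fails.
template <typename F, typename... Args>
std::enable_if_t<!std::is_void<std::result_of_t<F(Args...)>>::value, std::string>
describeCall(F&& f, Args&&... args)
{
    (void)std::forward<F>(f)(std::forward<Args>(args)...);  // discard result
    return "non-void";
}

template <typename F, typename... Args>
std::enable_if_t<std::is_void<std::result_of_t<F(Args...)>>::value, std::string>
describeCall(F&& f, Args&&... args)
{
    std::forward<F>(f)(std::forward<Args>(args)...);        // nothing to return
    return "void";
}
```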

The rest of those arguments shouldn’t be as spooky. Of note, though, is the double ampersand on the parameters. In C++, a && on a type that’s going to be deduced at compile time is different from a && on an explicitly hard-coded type. If you want to understand this more deeply you can check out this blog post on isocpp.org by Scott Meyers, but I’ll give the gist here. When you use && on a type that’s going to be deduced, because of the way reference collapsing (references to references) works, it ends up meaning that your code will take either an lvalue or an rvalue reference, and it’ll know which one it got. This is normally known as a “Universal Reference”, or a “Forwarding Reference”.

Sidenote here to say that Meyers’s stuff is fantastic, and when I was learning C++11/14, his Effective Modern C++ book helped me tremendously. It’s well divided, easy to follow and a pleasure to read. I very highly recommend it if you still don’t feel confident in the stuff that was added to C++ back then, even if it’s now “outdated”. You can even see it in my notebook picture, because I was double-checking some details as I was writing not only this post but also the code that’s here.

And the very first thing we do here IS forward some of those references, in the std::bind call. What this means in this context is that we want to make sure we’re passing these args to f exactly the way we received them. The combination of std::forward and a universal reference is intended for this: if you get a reference, you pass it on as a reference; if you get an rvalue, you pass it on as an rvalue. We already talked about bind, and here is where we use it to bake all those arguments in, so that bound is a function that returns whatever f returns but takes no arguments, regardless of what f is.
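The argument-baking part in isolation looks like this:

```cpp
#include <functional>

// std::bind bakes the arguments in: 'bound' takes nothing and returns
// whatever the wrapped callable returns
int callBound()
{
    auto add = [](int a, int b) { return a + b; };
    auto bound = std::bind(add, 2, 3);   // a nullary callable now
    return bound();                      // same as add(2, 3)
}
```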

Next we assemble a debug string (nothing to see here) and then create our packaged_task, the main attraction of our solution. Our packaged task takes a function signature as its template parameter. We know it takes no arguments, as that’s what we used bind for, but for the return, we ask the compiler to give us “whatever is the type of what comes out of calling bound()”. If you’ve never seen decltype in action, now you know what it’s about. What we’re packaging, though, is not just bound, because we’ll use the opportunity to build in some error checking as well. This is what errMsg is for. Checking every OpenGL call like this IS very inefficient, and at the time of writing the only control I have over it is some preprocessor shenanigans that make that call a no-op on release builds.
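Minus the error-checking wrapper, the decltype-plus-packaged_task combination is just this:

```cpp
#include <functional>
#include <future>
#include <utility>

// decltype(bound()) asks the compiler for "the type of calling bound()",
// so the packaged_task's signature adapts to whatever the call returns
int packageBound()
{
    auto bound = std::bind([](int a, int b) { return a * b; }, 6, 7);
    std::packaged_task<decltype(bound())(void)> task(std::move(bound));
    auto fut = task.get_future();
    task();            // run now; in the engine this may happen later,
                       // on the main thread
    return fut.get();
}
```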

As indicated by the comment, this is the bit that makes us need to account for void, since you can’t have a variable of the type void to return. Also note here that we’re using lambda explicit capture initialisation to make sure we’re not reintroducing arguments and to avoid copying.

Almost there!

 

After we have our task, we immediately grab our future so we can hold on to it and return it. We also now check whether we’re running on the main thread. If we ARE, we just call the function and return a package that only has the future (which will already be instantly “gettable”, with no need to block waiting for a promise to be fulfilled) and done. If we’re NOT on the main thread, though, there’s a couple more things to look out for….

So we have this packaged_task which cannot be copied or we’ll all have a bad time. BUT std::function is copyable. So between here and when it’s used to create a new task in the queue, I have run into the problem where something was trying to copy the task. A way to get around this is to move it to the heap. Because of this trigger-happy copying, I also haven’t used a unique_ptr, going for the shared one instead, which does come with a tiny performance cost.
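The move-to-the-heap trick in miniature: the std::function copies the shared_ptr (which is fine) instead of the packaged_task (which would not compile):

```cpp
#include <functional>
#include <future>
#include <memory>
#include <utility>

// A std::function requires a copyable target, so we park the non-copyable
// packaged_task behind a shared_ptr and let copies bump the refcount
int heapWrappedTask()
{
    std::packaged_task<int()> task([]{ return 7; });
    auto fut = task.get_future();

    auto ptr = std::make_shared<std::packaged_task<int()>>(std::move(task));
    std::function<void()> func = [p{ptr}]() mutable { (*p)(); };

    std::function<void()> copy = func;   // copying is fine now: it only
    copy();                              // copies the pointer, not the task
    return fut.get();
}
```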

Again we use a lambda to wrap that shared_ptr into a functor, but we run into another detail: things that are capture-initialised by a lambda are const inside its body, because a lambda’s call operator is const by default, and the call operator of the thing we’re invoking may well not be. This sounds bizarre at first, but do consider that you could be wrapping some functor that has internal state, such as a random number generator, and then you have no guarantee on the constness of THAT call operator (std::packaged_task’s, for one, is not const). In order to circumvent this, we declare our lambda mutable.
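A tiny illustration of why mutable matters for stateful captures:

```cpp
// A lambda's operator() is const by default, so by-value captures can't
// be mutated (nor can non-const call operators on them be invoked)
// unless the lambda is declared mutable
int mutableCounter()
{
    auto next = [n = 0]() mutable { return ++n; };  // drop 'mutable' and
                                                    // ++n won't compile
    next();
    next();
    return next();
}
```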

FINALLY, we package up both our std::function which wraps a lambda that calls a pointer to a std::packaged_task that wraps a lambda that calls a std::function which is a wrapper over a function with the arguments baked in, as well as the future from that task. And we return that.

As you might have surmised from the parameters to the wdglcall function, I then provide a couple of C-style macros to call this. I need two because, unlike variadic templates, variadic macros (before C++20) only handle ONE or more arguments. So if the function takes none (like glFinish()), that one breaks. The only reason the macros are here to begin with, though, is that the function is doing double duty. Not only is it packaging the call and sorting out the multithreading situation; it’s also doing some automatic error checking and debugging output on the OpenGL side. I mention in a comment that I’ve adapted a lot of this from a StackOverflow question, and THIS is what OP wanted. None of this packaged_task witchcraft; he wanted to wrap his calls so that he could check for errors automatically.

Well then, that was quite the adventure. Let’s have a look at whether it was even worth it.

Smoke and mirrors in place: Our solution in action

So the problem was:

I have a system that will only take calls from one specific thread and I need to make those calls from other threads sometimes

And what I wanted from the solution was:

  • Well, I don’t want to be doing any of that thread-checking when I’m making a call. I might be calling a wrapper or even (oof) a macro to do this but I just want my call to be straightforward
  • I also want my calls to be consistent. I don’t want to have different syntax for different calls. The One wrapper to abstract them all.
  • I need (as I found out as I coded) to be able to guarantee not only that my calls are executed in the order I request, I ALSO might need them to be executed without other stuff going on between them. Because you don’t want to be surprised by someone else loading a different shader or binding another array while you’re sorting stuff out
  • As part of that thread checking stuff, I don’t want to have to bother, when coding, if I am building for single or multithreaded. I want to code it once, compile however I need and done

And this is why we had to go through all of that fiddly fiddly mess. So, how does our code look now, using this? Well, Ray was the class that brought us here, let’s see how it is doing these days:

//ray.cpp

void Ray::draw() const
{
    Matrix44 model;
    model.setTranslation(origin.X(),origin.Y(),origin.Z());
    Matrix44 transform;
    transform.setScaling(1000.f);

    GLTaskQueue q;
    std::vector< GLPackage<void> > v;

    v.push_back( WDGL( UniformMatrix4fv
                     , transformUniform
                     , 1, GL_FALSE
                     , transform.Elements().data() ));
    v.push_back( WDGL( UniformMatrix4fv
                     , modelUniform
                     , 1, GL_FALSE
                     , model.Elements().data() ));
    v.push_back( WDGL( Uniform4f
                     , ambientUniform
                     , colour.R(), colour.G()
                     , colour.B(), colour.A() ));

    for(auto& task: v)
    {
        pushGLTask(q, task);
    }
    Game::instance()->queueGLTasks(std::move(q));
    for(auto& task: v)
    {
        task.get();
    }

    VAO.draw("wire",true);
}

So now, for functions that are going to make more than one OpenGL call, we need an additional GLTaskQueue (which is a std::list<std::function<void(void)>>) and we use a vector of GLPackages to make our life easier.

Every time we make an OpenGL call, we use the macro. The syntax doesn’t differ too much from a regular OpenGL call, except that the macro prepends the “gl” to the function name. This is mostly a style thing (I like the names of the functions clean), though it also means that if, for some reason, this is called with a non-GL function, the compiler will say no.

Every time we call the macro, we store the resulting GLPackage in a vector. We can only use the vector here because the return types of all these functions are the same; here they’re all void. If they weren’t, it’d be no problem, just less convenient. After packaging all the calls we run through the vector once, calling pushGLTask on each GLPackage. This, if you recall, will add each task to the list IF it hasn’t been called already; otherwise it’s a no-op and q is actually empty after this.

Without caring too much about whether that list is indeed empty or not, we just send it to Game, which does nothing on its side if it is. And after having sent them on their way, we run through the vector again, this time calling get() on our GLPackages. This just forwards to the futures, so we don’t even need to bother extracting them. The get will block if the function hasn’t been called yet and move on if it has.
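That blocking behaviour is the whole synchronisation story between the two threads. Reduced to its essence:

```cpp
#include <chrono>
#include <future>
#include <thread>

// get() blocks until the task has actually run on the other thread,
// which is exactly the handshake the update/draw split relies on
int crossThreadGet()
{
    std::packaged_task<int()> task([]{ return 99; });
    auto fut = task.get_future();

    std::thread other([&task]
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        task();                     // stand-in for the draw thread
    });

    int result = fut.get();         // blocks here until task() has run
    other.join();
    return result;
}
```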

Honestly, I don’t see how you could get much cleaner than this while still making sure all those details are sorted, so this, in my estimation, is a success. There’s not a lot of extra code, the OpenGL calls themselves are still pretty straightforwardly identifiable, and I very nearly got away with the unified call syntax. I still need a slightly different one if the function takes no arguments, but those are very, very few.

Final thoughts and further work

Let’s get the “buts” and “wells” out of the way first. This way we can end on a high note =)

As much as I’m claiming success here, the work is not ENTIRELY done. That call to VAO.draw() at the end will make OpenGL calls, so really they should all be part of the same queue; I just haven’t implemented the overloads there yet. But the idea is simple: “if you also receive a GLTaskQueue, don’t call anything, just queue the calls up and let your calling code handle it”. Additionally, I still need to triple-check some of the reference stuff in wdglcall, and the actual source has some commented-out code which needs to go away permanently, things like that. No biggie, but some details. Plus, I’m not entirely convinced I understand perfectly why I had constness issues with the lambda that went to Game but not with the internal one that was packaged, so there is that to revisit/watch out for.

In terms of the solution itself I DO like it, quite a lot actually. I mentioned in the intro that the ideas on which I’ve built my solution could be generalised to any situation where you need to forward tasks to a system in the form of actual function calls. OpenGL here is particularly challenging too, order is important, returns and arguments are variable…

I got excited about sharing this solution because, to me, the specific problem looks like one for which no one has a REAL good answer (and no, even this one is not a REAL good answer, with all the hoops I end up having to jump through). This sort of system, though, becomes a nice tool on your belt once you know it (or at least of it). It’s also an opportunity to see some of those template shenanigans in action, since they are usually very much “behind the curtains” sort of stuff. Implementing this was quite tricky (as I kept finding all these new details), but it feels great to have done it, and to look at and code Warp Drive now with this system in place.

I wasn’t quite as concise as a 3-paragraph StackOverflow answer, but I hope I’ve managed to give you some information which is both useful and interesting.

What's next?

This section is a bit more nebulous than usual. This was the last post I had properly planned and coded for the blog, but I AM working on Warp Drive. In fact, I had to forbid myself from coding more until I’d finished this post, because I just started doing this when I should’ve been writing Multi Your Threading 9 and 10. Now that this is out, I am finally released from my bonds! There are a couple of projects I want to build on top of Warp Drive, and chances are I’ll find a lot that needs work on the engine itself along the way. So though I still don’t know WHAT will be next and exactly when, I think there’s a fairly high chance it’s going to be more Warp Drive. So, let us wait and see… First thing is that, now that nothing is exploding, I have to figure out these rays….
