Game Design, Programming and running a one-man games business…

CPU/GPU concurrency in video games

I’m no graphics programming expert, and not really any kind of programming expert, unless you want a strategy game coded with its own engine, in C++, for the Windows platform, in which case *cracks knuckles* I’m pretty experienced. (Actually I don’t know how to crack my knuckles).

What I do know is what to look out for when you are worried about performance. One of the things I learned REALLY early on came when I made a game called Kombat Kars (probably in DirectX 5) and was working on particle systems. To make it clear just how many aeons ago this was, let’s take a look at an epic image of the rear boxart (yes! retail!)

Kombat Kars (2001) Windows box cover art - MobyGames

Yup, it’s not the Frostbite engine.

Anyway, I was working on optimizing the drawing of vertex buffers full of particles, or asteroids or whatever, and I was depressed to discover after doing some cunning batching of my draw calls, that the performance went DOWN. Yup. Making the game more efficient in how few draw calls it made, made the game run SLOWER.

How can that be?

Actually super-easy, barely an inconvenience, but to understand why, you need to conceptually understand what’s going on in the box when you run a PC game under Windows. You basically have two processors. One of them is on the motherboard and is general purpose (the CPU), the other is on the video card and specialized for processing vertices and shaders and so on (the GPU). It used to be 95% CPU work and 5% GPU work. These days the GPU is often the most expensive and powerful component in the box. On a lot of setups, the capabilities are fairly equal.

It’s that equality of power that can actually cause problems. The peak performance of the machine is when the CPU is 100% busy (all threads!) and AT THE SAME TIME the GPU is 100% busy (multiple streams at once etc…). This is almost impossible to achieve, but it’s possible to actually make things worse than they should be when you get too obsessed with batching.

If you don’t care about performance you code like this:

PrepareAMesh();
RenderAMesh();
PrepareAMesh();
RenderAMesh();
PrepareA..

Then one day you read some articles saying that the reason your frankly low-poly indie game runs at 20fps is that you have WAY too many draw calls. You read about batching, and your new code looks like this:

for(int n = 0; n < lots; n++)
    PrepareAMesh();
RenderAllThoseMeshes();
for(int n = 0; n < lots; n++)
    PrepareAMesh();
RenderAllThoseMeshes();

And all is good in the world, because suddenly you are not flushing the queues on the video card every nanosecond, and it’s doing what it likes to do, what it was BORN to do, which is to stream through a whole ton of data like a sieve and throw polygons at the screen fast! But hold on… things can go wrong…

for(int n=0; n < eleventybillion; n++)
  PrepareAMesh();
RenderTheWholeDarnedGame();

This can actually be a REALLY BAD IDEA. Why? Surely batches are good, right…? Well… to an extent. It really depends on how you structure the code. It *might* be that during all those bazillion PrepareAMesh() calls, the GPU has run out of things to do. Maybe it hasn’t done ANYTHING yet this frame. It finished the last frame, and now it’s basically watching Netflix, waiting to hear from you some day…

…and once the CPU calls the GPU to render all bazillion polygons, depending on how you structure the code, the CPU may be doing nothing. Maybe this is the end of the frame, and the CPU has to sit on its ass waiting for a Flip() or present() call to come back from the GPU some time, maybe next week, after the rendering is finished. Only then can it start thinking about the next frame.

This is the CPU/GPU concurrency issue. You can be TOO BATCHY. You can inadvertently set things up so that the GPU is always waiting for the CPU and the CPU is always waiting for the GPU. This is BAD for performance.
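To put some toy numbers on this, here is a minimal sketch of a timing model. Everything in it is an assumption made up for illustration: the `FrameCosts` struct, the `SimulateFrameTime` function, the cost values, and the simplification that the GPU can start a batch the moment the CPU submits it. It is not a measurement of any real game or driver, just a way to see the shape of the problem.

```cpp
#include <algorithm>
#include <cassert>

// Toy cost model. All values are arbitrary time units, not measurements.
struct FrameCosts {
    int meshCount;       // meshes to draw this frame
    double cpuPerMesh;   // CPU time to prepare one mesh
    double gpuPerMesh;   // GPU time to render one mesh
    double perDrawCall;  // fixed CPU overhead paid once per draw call
};

// Returns the time at which the GPU finishes the last batch of the frame.
// Assumption: the CPU prepares a batch, submits it, then moves straight on
// to preparing the next batch while the GPU renders the submitted one.
double SimulateFrameTime(const FrameCosts& costs, int batchSize) {
    assert(batchSize > 0);
    double cpuClock = 0.0;  // when the CPU finishes submitting the current batch
    double gpuClock = 0.0;  // when the GPU finishes its current work
    for (int first = 0; first < costs.meshCount; first += batchSize) {
        const int count = std::min(batchSize, costs.meshCount - first);
        cpuClock += count * costs.cpuPerMesh + costs.perDrawCall;
        // The GPU cannot start a batch before it has been submitted,
        // nor before it has finished the previous batch.
        const double gpuStart = std::max(cpuClock, gpuClock);
        gpuClock = gpuStart + count * costs.gpuPerMesh;
    }
    return gpuClock;
}
```

With 1000 meshes, a cost of 1 unit per mesh on each side and 20 units of overhead per draw call, this model gives a frame time of 21001 for batches of 1 (draw call overhead dominates), 2020 for one giant batch of 1000 (the GPU idles while the CPU prepares, then the CPU idles while the GPU renders), and 1300 for batches of 100, where the two sides overlap. The exact sweet spot depends entirely on the made-up costs, but the pattern of "too few batches is also slow" is the point.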

Luckily, free apps like VTune let you analyze this. FWIW Democracy 4 has no such problems at all, but to show you how it looks, here is the output of a very brief snippet of the VTune CPU/GPU concurrency analyzer:

You can see near the bottom how busy the GPU and CPU are. Luckily for me, they both keep pretty busy, even if I zoom in a lot to see the span of individual frames, but if your zoomed-in CPU/GPU concurrency view shows big empty blocks within a frame, you have some optimizing to do.

The reason this catches out so many experienced coders is that it *sounds wrong*. Surely batching is good, right? It is… but you have to remember that if the GPU would otherwise be sat on its ass eating crisps, even doing a bunch of small inefficient batches of 50-100 verts each is MORE efficient than just letting it sit idle.
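One way to avoid being too batchy is to cap the batch size and submit whenever the cap is reached, so the GPU gets something to chew on before the CPU has prepared the whole frame. The sketch below is hypothetical throughout: `Vertex`, `ThresholdBatcher` and the submit callback are invented names, and the callback stands in for whatever real draw call your engine makes. The right threshold is something you would have to find by profiling.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Vertex { float x, y, z; };

// Hypothetical helper: collects vertices and hands them to a submit
// callback as soon as the pending batch reaches flushThreshold vertices.
class ThresholdBatcher {
public:
    using SubmitFn = std::function<void(const std::vector<Vertex>&)>;

    ThresholdBatcher(std::size_t flushThreshold, SubmitFn submit)
        : flushThreshold_(flushThreshold), submit_(std::move(submit)) {}

    // Append one mesh; submit the batch if it has grown large enough.
    void Add(const std::vector<Vertex>& meshVertices) {
        pending_.insert(pending_.end(), meshVertices.begin(), meshVertices.end());
        if (pending_.size() >= flushThreshold_) {
            Flush();
        }
    }

    // Submit whatever is pending (call this at the end of the frame).
    void Flush() {
        if (pending_.empty()) {
            return;
        }
        submit_(pending_);  // stand-in for the engine's real draw call
        pending_.clear();
    }

private:
    std::size_t flushThreshold_;
    SubmitFn submit_;
    std::vector<Vertex> pending_;
};
```

In use, you would call `Add()` once per prepared mesh inside the loop and `Flush()` once after it, so the last partial batch still gets drawn.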

Think of the CPU/GPU as a team trying to do the dishes. The CPU is washing ’em, the GPU is drying ’em. Don’t let either of them stand idle.


2 thoughts on CPU/GPU concurrency in video games

  1. hi cliffski!

    Thank you for continuing to share the development details! I’m curious, do you have some specific approach to memory management in C++? Maybe some custom allocators, reference counting, object pooling, etc?

  2. I do a lot of object pooling with stuff that would otherwise get created each frame or even each turn, but I do not write my own direct memory management or override new, delete etc. I don’t use ref counting, as I don’t have many cases where it would be helpful. I prefer explicit pools of objects that get deleted when the app closes to having ref counting decide when to free an object.
