Background: I use DirectX 9 to develop Gratuitous Space Battles 2 using my own engine.

I’ve been doing my best to reduce the number of draw calls per frame for complex scenes in GSB2. Basically I have a lot of stuff with different textures and render states, and they are being drawn from front to back rather than z-buffer sorted (for reasons concerning sprites & high-quality alpha-blended edges). What this means is, when you have 16 identical laser beam turrets, you may not be able to draw them as a single batch, because in between them you might be drawing other stuff. As a result you get 16 draw calls instead of one. Ouch. That causes driver slowdown, DirectX slowdown, and inefficiency on the GPU, which prefers big batches.

Of course you can immediately see that it would be fine to go through the draw list (one of several, actually, for composition reasons) and spot all those cases where you have 2 or more turrets (or any sprites) that use the same texture and which do NOT have anything in between them that overlaps, then grab that second turret and draw it ‘early’ with the first one. And in fact, that works just fine: suddenly lots of draw calls get optimized away (the green ones)! In this case, out of 1,159 draw calls, 713 get optimized away into batches.
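That pass can be sketched roughly as below. The `Sprite`/`Rect` types and `BuildBatchedOrder` name are hypothetical stand-ins for the engine’s real renderables; textures are compared by pointer, as in the post, and a later sprite is only pulled forward if nothing drawn in between overlaps it:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types; the real engine's renderables differ.
struct Rect { float x1, y1, x2, y2; };

inline bool Overlaps(const Rect& a, const Rect& b)
{
    return a.x1 < b.x2 && b.x1 < a.x2 && a.y1 < b.y2 && b.y1 < a.y2;
}

struct Sprite
{
    const void* texture;   // compared by pointer, as described above
    Rect bounds;
};

// Returns a draw order where later sprites sharing a texture are pulled
// forward next to an earlier one, provided nothing drawn in between
// overlaps them (so the front-to-back composite is unchanged).
std::vector<size_t> BuildBatchedOrder(const std::vector<Sprite>& list)
{
    std::vector<size_t> order;
    std::vector<bool> done(list.size(), false);
    for (size_t i = 0; i < list.size(); ++i)
    {
        if (done[i]) continue;
        order.push_back(i);
        done[i] = true;
        // Look ahead for batchable sprites using the same texture.
        for (size_t j = i + 1; j < list.size(); ++j)
        {
            if (done[j] || list[j].texture != list[i].texture) continue;
            bool blocked = false;
            for (size_t k = i + 1; k < j && !blocked; ++k)
                if (!done[k] && Overlaps(list[k].bounds, list[j].bounds))
                    blocked = true;
            if (!blocked) { order.push_back(j); done[j] = true; }
        }
    }
    return order;
}
```

The nested loops make the worst-case quadratic cost described below easy to see.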

There is an immediate problem though. This is extremely slow, even with every optimization trick in the book. Assume you have a list of 1,000 objects to draw (not inconceivable for big battles). In the worst case, that means comparing object 1 to 999 different others and doing a bounds check for each, then object 2 to 998, then 3 to 997, and so on. That is a LOT of function calls (inlineable, I know…), a lot of bounding box comparisons, and a lot of texture comparisons (only a pointer compare, but I need to extract that texture pointer from each renderable object, and at 500,000 de-references per frame even that adds up).

Now granted, this is all total worst case. Some of those 1,000 objects aren’t batchable, some of them *will* quit early as something overlaps, and when I do batch later ones with earlier ones, those later ones themselves don’t need to be checked, as they have been optimized away.

The trouble is, after profiling, it is still about 70% slower for the CPU to do this than not do it. The big problem here is that I’m making stuff light for the GPU and heavy on the CPU. Is that a good idea? Maybe… But as it happens, if I assume a dual-core (or better) PC, it doesn’t matter, because it is FREE. I have other threads just sat there. If I find a slot in my main loop between building the draw list and actually having to render from it, I can multi-thread that new, slower batching code and actually have the whole app run faster, thanks to fewer draw calls later on. The Visual C++ concurrency profiler shows it works:

[Image: Visual C++ concurrency profiler timeline]

Previously I’d have gone through and built up the draw list (that’s the blue), then done some particle drawing preparation (green), then batched my draw calls (purple) and then drawn everything (yellow). Because the particle stuff actually gets put into a different list, I can mess around with my slow batching in a new thread while the main thread prepares the particle stuff. Hence the purple bar is now on a new thread and works alongside the main one. As it happens I also have 2 more threads doing some particle emitter stuff at the same time as well, so briefly I’m at 100% utilization on 4 threads, possibly 4 cores (other processes, such as the music streaming or the driver, might be on one of them).
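The restructured frame can be sketched like so, with std::thread standing in for whatever worker system the engine actually uses, and trivial stubs in place of the real batching and particle work:

```cpp
#include <atomic>
#include <thread>

// Stubs standing in for the engine's real work (hypothetical names).
std::atomic<bool> g_batched{false};
std::atomic<bool> g_particlesReady{false};

void BatchDrawList()    { g_batched = true; }        // the slow O(n^2) pass
void PrepareParticles() { g_particlesReady = true; } // separate list, no sharing

bool RenderFrame()
{
    // Run the slow batching on a worker thread while the main thread
    // prepares particles; the two touch different lists, so no locking.
    std::thread batcher(BatchDrawList);
    PrepareParticles();
    batcher.join();   // drawing must see the finished batches
    // Drawing from the batched list would happen here.
    return g_batched && g_particlesReady;
}
```

The only synchronization point is the join before rendering, which is why the batching is effectively free as long as it finishes before the particle preparation does.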

So in a sense, yay! Faster code, but what a nightmare to measure. It depends on scene complexity, relative CPU/GPU speed, number of cores and god knows what else. However, it is worth remembering that sometimes slower code will make your game run faster; it just depends where that slow code runs.

7 Responses to “Draw list sorting and concurrency issues”

  1. Thomas says:

    Ouch! Algorithms with quadratic runtime are rarely nice. Just a couple of thoughts here, don’t know the exact situation so this might or might not be useful…

    I’ll assume that the number of overlaps is small compared to the number of sprites (as it hopefully is). Then we can do the whole thing in O(n) time as follows:

    – Use a spatial hash or quadtree to find overlaps. Best-case (no overlaps) it will run in roughly O(n) time, if you choose the right cell size.
    – This gives you a dependency graph, with an edge from A to B if A must be drawn before B. Apply a topological sort. This takes O(n + e) time, where e is the number of edges. Use the texture ID as a tie breaker.
    – This gives you a list of textures, in the order they should be drawn. Sprites that can be drawn with the same texture will be adjacent to each other, so you just have to run through the list and gather up the batches.

    It may sound like a lot of operations, but there are only about 1,000 of them instead of 1,000,000 :)
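    A minimal uniform-grid spatial hash along these lines might look like the following (the cell size, key packing and assumption of non-negative screen coordinates are all illustrative choices, not anything from the engine):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Buckets axis-aligned boxes into fixed-size grid cells; only boxes
// sharing a cell need a real overlap test.
struct SpatialHash
{
    float cell;
    std::unordered_map<uint64_t, std::vector<int>> cells;

    explicit SpatialHash(float cellSize) : cell(cellSize) {}

    static uint64_t Key(int cx, int cy)
    {
        return (uint64_t(uint32_t(cx)) << 32) | uint32_t(cy);
    }

    // Insert a box id into every cell it touches.
    void Insert(int id, float x1, float y1, float x2, float y2)
    {
        for (int cy = int(y1 / cell); cy <= int(y2 / cell); ++cy)
            for (int cx = int(x1 / cell); cx <= int(x2 / cell); ++cx)
                cells[Key(cx, cy)].push_back(id);
    }

    // Candidate overlaps: only ids that share at least one cell.
    std::vector<int> Query(float x1, float y1, float x2, float y2) const
    {
        std::vector<int> out;
        for (int cy = int(y1 / cell); cy <= int(y2 / cell); ++cy)
            for (int cx = int(x1 / cell); cx <= int(x2 / cell); ++cx)
            {
                auto it = cells.find(Key(cx, cy));
                if (it != cells.end())
                    out.insert(out.end(), it->second.begin(), it->second.end());
            }
        return out;
    }
};
```

    With the right cell size, most sprites land alone in their cells and a query returns a handful of candidates instead of the whole list.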

    Alternatively, maybe you could also reuse the batching result from the previous frame as a starting point for the current frame. It doesn’t have to be optimal right away, as long as it eventually converges to the optimal solution over the course of multiple frames. For example, I’ve used bubble sort quite effectively when doing back-to-front drawing; you just bail out as soon as everything is sorted, which is typically after one or two iterations.

  2. cliffski says:

    Indeed, I had considered improving things with some sort of quadtree system, because I really don’t need the granularity of a real box:box overlap test for starters. I can probably reject quite a few things with a much quicker test that way.
    I agree a quadratic algorithm is obviously bananas; luckily it doesn’t really work out that way in practice, and it can trivially be capped time- or operation-wise.

    I like the idea of trying to persist information over frames, but I’m not sure how in practice this can be done. Anything could change in any frame. Or are you suggesting only updating the quadtree data every x frames, and just using it as a quick reject? In which case that might be pretty cool.

    I’ve only applied this to one of my draw lists (the ‘unlit’ list) right now, and not even to all of the possible components in there, so there may be greater savings to come (some stuff isn’t getting batched yet and should be, so right now it just ‘pollutes’ the list without ever getting skipped).

  3. Alex says:

    You don’t need anything as complicated as a quad tree or topological sort.

    Create an NxN array of texture pointers (N should be small, e.g. 8). As you batch each quad, draw its coverage into the array conservatively (think of it as a low-res texture coverage mask over the whole screen). Whenever you write over a different non-null texture pointer, flush that batch and erase it from the coverage array.

    By adjusting N you can easily trade CPU work for more draw calls. The mask does not actually have to be square; you could have it fit your aspect ratio, or experiment with a horizontal or vertical bias.
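    The core of that coverage mask might look like this sketch (the grid size, `Cover` function and conflict-handling convention are illustrative; a real version would also flush and clear the conflicting batch):

```cpp
#include <algorithm>

// A small NxN grid of texture pointers over the screen (illustrative).
const int N = 8;
const void* g_mask[N][N]; // zero-initialized: nothing covered yet

// Map a screen coordinate to a clamped cell index.
inline int Cell(float v, float extent)
{
    int c = int(v / extent * N);
    return std::min(std::max(c, 0), N - 1);
}

// Conservatively mark the cells a quad touches with its texture.
// Returns true if a *different* texture already covered one of them,
// i.e. that texture's batch must be flushed before drawing this quad.
bool Cover(const void* tex, float x1, float y1, float x2, float y2,
           float screenW, float screenH)
{
    bool conflict = false;
    for (int cy = Cell(y1, screenH); cy <= Cell(y2, screenH); ++cy)
        for (int cx = Cell(x1, screenW); cx <= Cell(x2, screenW); ++cx)
        {
            if (g_mask[cy][cx] && g_mask[cy][cx] != tex)
                conflict = true;
            g_mask[cy][cx] = tex;
        }
    return conflict;
}
```

    Because the mask is tiny and fixed-size, each quad costs a handful of array writes rather than comparisons against every other quad, which is the CPU-versus-draw-call trade the comment describes.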

    You should also consider using texture arrays or 3D textures to group textures of the same size (I assume you don’t want to use sprite sheets because of mipmapping).

  4. Cliffski says:

    Actually I already use sprite sheets, although not in all cases; nothing goes up to the borders, so mip-mapping isn’t an issue.
    AFAIK texture arrays are not in DX9 :(

  5. Arowx says:

    Hi Cliffski,

    What do you think of DirectX 12 as it could mean larger Gratuitous Space Battles?

    Could you take GSB into the third dimension with the additional rendering performance it offers?

  6. Out of curiosity, didn’t you already have the screen set up into “levels” that dictate what can be seen and what can’t be seen?

    Maybe use that info to batch each of these “level’s” effects together. Does the GPU make the scene in multiple levels, or is that only in the CPU?

    If the GPU, can you just draw it all and depend on the GPU being fast enough to handle the workload? How badly would it affect GPU performance?

    What about presorting based on the effect? Presumably you know each effect in your engine, so you could create a list for each effect to be drawn that is added to, rather than forcing you to sort each time. Combine this with the knowledge the engine has of what is occluded, and what level it should be drawn on, and you could optimize this issue away.

    • Cliffski says:

      The problem is texture changes. A draw call can only reference one texture (generally), so 50 items with the same effect but fifty different textures require fifty draw calls. Atlasing is non-trivial for complex reasons relating to dynamic textures and mip-maps.