Thoughts on multi-threaded 2D game development in directx9 January 4, 2014 cliffski Am I the only person doing this? probably. I often am. Most people have moved on from DX9 (I know it so well there is big opportunity cost to updating) or use OpenGL, and very few people are doing 2D games where performance is an issue. I am taking early steps with Gratuitous Space Battles 2, and my aim is to have it run at 60 FPS on average hardware with 2 1920×1080 monitors. I also intend to get it running ok for bigger setups too. That’s a lot of pixels, and due to all sorts of fancyness I’m adding to GSB 2.0, it means a lot of processing. a REAL lot. So…multithreading! it’s about time i ventured forth. To date, my only multi-threading efforts have been the asynch server communication in GSB 1.0 for challenge uploads etc, and the loading screen for GTB and Democracy 3. Actual mid-game multihthreading has scared me until now. I hate middleware so I’m not using any libraries, just raw calls to CreateThread, TerminateThread and so on… This might make it more complex, but means I have complete control over stuff. My first experiments were not exactly encouraging. I attempted to speed up the position calculations of asteroids. Now to cut a long story short, I use D3DTLVERTEX style stuff (not hardware Transform and lighting) and for good reason i won’t bore you with. The upshot is, I have a lot of non-directx transform stuff to do for anything drawn on the screen. An ideal case for multithreading! So I wrote code to split up the asteroids into 8 chunks (test case of an 8 core chip), and gave each processor a list of asteroids to process. Result? SLOWER. Actually quite a bit slower. Some fiddling with AQTime (My profiler) let me analyze cache misses for each thread, and I also profiled it as 1 thread. The cache miss rate went through the roof. Basically, my transform code was relying on some global camera data, and I suspect that either: a) Referencing he camera data was a bottleneck with each thread blocking each other from getting it or… b) The memory locations of the asteroid transform data was laid out in such a way that all the different threads kept fighting for the same cache lines and generally getting in each others way. I spent a lot of time reading and fiddling and decided that it wasn’t working (although did manage a decent few speedups in other ways). I then decided that if lots of threads sharing the same job wasn’t going to help, maybe lots of threads doing different (unrelated) jobs would…? And this is more of a success. I have a function called ProcessFrame() which does a lot of non-directx stuff, such as the aforementioned asteroid transforming, updating engine glows, updating explosion plumes, particle effects and distortion waves blah blah… Until recently, it just did them one after the other. I then realized that although a lot of them accessed the same data (camera position stuff mostly), none of them altered it, and the tasks were quite discrete. So I packaged them up and sent them to different threads, and then spun in the main thread waiting for them to finish. result? 21% faster. yay? not bad, but not 800% faster, which would have been theoretically do-able(not really but…) Of course the missing link was that I am then left waiting for the slowest thread. Plus if I have more than 8 tasks, I run out of CPUs. So I re-coded it to have a queue of tasks, and when a thread finished a task, it checked the queue, and only reported it was done when the queue was empty. This was way more efficient, and easier to scale to available cores. Result? 41% faster! Now obviously a 41% processing speedup is good (although this is pre-render, not render, so probably only a 20% FPS boost) but I can’t help thinking that if not 800%, a 200% speedup of that bit of code must be possible. Debugging cache-misses is hard, as even aqtime will bluescreen occasionally on windows 7 when profiling it. I’m pretty sure it’s some cache, false-sharing issue going on. In the meantime, GSB is now faster as a result, even if i spend no more time attempting to multithread it (and i will… I’ve only just got going). Anyone else attempting this sort of thing?