Getting threads to rapidly schedule tasks in C++, Windows

July 4, 2017 cliffski

Posted on July 4, 2017

very techy post…

My latest game uses the same multi-threading code I’ve used before. This is C++ code under windows, on windows 10, using Visual Studio 2013 update 5. My multithreading code basically does this:

Create 8 threads (if an 8 core system). Create 8 events which are tracked by the 8 threads. At the start of the app all of the threads are running, all of the events are unset. As I get to a point where I wwant to doi some multithreaded code, i add tasks to my thread manager, and when I add them, I see which of my threads does not have a task scheduled, and I set that event:

PCurrentTasks[t] = pitem;
SetEvent(StartEvent[t]);

Thats done inside a CriticalSection, which I initialized with a spincount of 0 (although a 2,000 spiun count makes no difference).

Each of the threads has a threadprocess thats essentially this:

WaitForSingleObject(SIM_GetThreadManager()->StartEvent[thread], INFINITE);

When I get past that, I enter the critical section, grab the task assigned to me, do the actual thread work, then I immediately check to see if there is another task (again, this happens in a critical section). if so…I process that task, if not, I’m back in my loop of WaitForSingleObject().

Everything works just fine… BUT!

If I have a big queue of things to do, my concurrency charts from vtune show stuff like this:

Seriously, WTF is happening there? Those gaps are huge, according to the scale they are maybe 0.2ms long. What the hell is taking so long? Is there some fundamental problem with how I am multithreading using events? Does SetEvent() take a billion years to actually process? If feels like my code should be way more bunched up than it is.

13 thoughts on Getting threads to rapidly schedule tasks in C++, Windows

James says:

July 4, 2017 at 7:10 pm

Long shot.. but mutexes are pretty slow (If its an OS mutex then that requires a context switch and all sorts)

Would a lockless queue be a better way to distribute jobs to worker threads? To achieve the effect of a lockless queue, you can implement a ring buffer using nothing but atomic operations.. which are pretty fast. You can go turbo geeky in get into this: https://mikeash.com/pyblog/friday-qa-2012-02-03-ring-buffers-and-mirrored-memory-part-i.html
1. cliffski says:
  
  July 4, 2017 at 7:31 pm
  
  Interesting thanks for the link. I think I’ve just made some big progress actually, mostly by staying within the thread, and ignoring the event system as long as more items are queued up.
Sam says:

July 4, 2017 at 8:34 pm

I’m a bit of a lurker here, I find your posts super interesting, so thank you for sharing.

I’m curious why you are still on VS 2013, and haven’t moved up to 2015 or 2017 yet?
1. cliffski says:
  
  July 4, 2017 at 9:26 pm
  
  Mostly I dont see a compelling reason to change, and it always means some admin hassle getting the new build to work!
  1. ddengster says:
    
    July 7, 2017 at 11:57 am
    
    Recommend to upgrade to 2017, just did it recently. 2013 was laggy as hell, so was that version of intellisense. 2017 is more performant.
    1. cliffski says:
      
      July 7, 2017 at 12:16 pm
      
      intellisense wise I beef it up with visual assist which is tons better anyway
Sam Swain says:

July 4, 2017 at 11:11 pm

Sounds very similar to mine, except I don’t bother working out which is free, I just trigger them all and let them fight it out to pick up tasks. The logic being that jobs typically come along in groups. Cost of them all checking for work is negligible. Could try that. :)
gradbot says:

July 5, 2017 at 3:49 am

It sounds like the critical section is the problem. Threads calling EnterCriticalSection or WaitForSingleObject can be blocked waiting on the windows scheduler even though the critical section is not taken or the event has been signaled.

TryEnterCriticalSection could solve your problem. If you know there is more work but can’t take any yet because of the lock then you could spin your thread instead of giving up control to the OS.
1. CdrJameson says:
  
  July 5, 2017 at 3:11 pm
  
  Sounds familiar to some problems I ran into a few years back.
  SetEvent/WaitForSingleObject aren’t like calling a callback, where the waiting function is called immediately, it’s more like SetEvent says “Blockage removed” so the now unblocked thread becomes a candidate for consideration to get some processor time when the scheduler next gets round to having a look.
  
  Possibly if you can do something after the SetEvent() (eg. a Sleep(0)) to wake up the scheduler you might get an improvement, but that might not work on modern Windows versions.
  
  Someone on Stack Overflow seemed to be having this problem over 7 years ago, so it’s not new: https://stackoverflow.com/questions/1306118/overhead-due-to-use-of-events
  
  Also, some interesting stuff on Windows thread scheduling in this sample chapter from Windows Internals: http://download.microsoft.com/download/1/4/0/14045A9E-C978-47D1-954B-92B9FD877995/97807356648739_SampleChapters.pdf
  1. cliffski says:
    
    July 5, 2017 at 3:28 pm
    
    Yup thats exactly the issue. I have found that when I complete a task, if I just go in there and check 9inmside a critical section) for another task *within the thread body* rather than relying on the event check instead, I can get much better speeds, although thats only for when I build up a queue of events. Still…its much better.
Ben Collier says:

July 5, 2017 at 9:43 am

Probably nonsense but…

I almost wonder if you could be the victim of an emergent behaviour. If all of your calc_route procedures are of almost exactly the same length, and you have some holdoff time if you discover that all the threads are busy, then you could be holding off for those long delays on all your threads.

If there is a holdoff time, try randomising it instead of saying “wait for x microseconds before rechecking if there’s a thread free”.
1. cliffski says:
  
  July 5, 2017 at 9:45 am
  
  Those tasks definitely take different lengths of time, so sadly its not that :D
Shad Sterling says:

July 6, 2017 at 4:52 am

The thing that strikes me is that in each visible place where most of the threads have a gap, there is exactly one thread that does not. Is that the case generally (not just in this screenshot)? Could one of the paths be doing too much in a critical section?

Comments are currently closed.

Cliffs Solar Panels:
	CO2 emission reduced 445.05 kg
	Equivalent trees planted 26.93 trees
	Equivalent lightbulbs 6973.96 lightbulbs