{"id":6414,"date":"2022-11-05T15:24:47","date_gmt":"2022-11-05T15:24:47","guid":{"rendered":"https:\/\/www.positech.co.uk\/cliffsblog\/?p=6414"},"modified":"2022-11-05T15:24:48","modified_gmt":"2022-11-05T15:24:48","slug":"optimization-for-fun","status":"publish","type":"post","link":"https:\/\/www.positech.co.uk\/cliffsblog\/2022\/11\/05\/optimization-for-fun\/","title":{"rendered":"Optimization for fun!"},"content":{"rendered":"\n<p>I am well aware that my game Democracy 4 is not exactly slow with huge framerate issues. However, optimization is fun! or at least it should be, but in practice, getting profiling to work on remote PCs is not exactly easy. I have basically used every profiling software imaginable and still have not got one that I think really does the job well&#8230;<\/p>\n\n\n\n<p>I have basically wasted about an hour today trying to work out why I couldn&#8217;t get the Intel VTune Amplifier stuff to work with event-based profiling and get rid of this pesky error that was clearly nonsense about &#8216;not able to recognize processor&#8217;&#8230; until I finally realized that I actually have an AMD chip in my (relatively) new PC so&#8230;yeah&#8230; That drove me to try out the AMD uProf profiler, which is something I had not used before.<\/p>\n\n\n\n<p>It took me a moment to realize that this software, good though it is, does not suggest to you &#8216;hey if you run me in administrator mode I will show you 50x more config options&#8217; but luckily I worked that out. My first act was a brief run of Democracy 4, starting a new game then immediately going to the next turn. 
In the list of functions taking up all the time (and ignoring Windows system functions) I get this list:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"797\" height=\"320\" src=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-1.png\" alt=\"\" class=\"wp-image-6416\" srcset=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-1.png 797w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-1-680x273.png 680w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-1-768x308.png 768w\" sizes=\"auto, (max-width: 797px) 100vw, 797px\" \/><\/figure>\n\n\n\n<p>Which is about what I would expect. The game is implemented as a custom-coded neural network structure, hence the terminology. Mostly everything is a neuron, and most of the processing is where each neural effect (the links between neurons) processes its equation, and then neurons do some math on their inputs and outputs.<\/p>\n\n\n\n<p>The inner machinery of the neural network ultimately comes down to that top item there: <\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SIM_EquationProcessor::Interpret Value.<\/pre>\n\n\n\n<p>This is code that basically takes those equations in the game&#8217;s csv files like this:<\/p>\n\n\n\n<p>StateHealthService,0-(0.4*x),2<\/p>\n\n\n\n<p>And actually calculates a value from that. There are 2,000 voters with about 10 connections each, pre-processed on a new game 32 times, so that&#8217;s 640,000 equations right there, plus all of the actual simulation stuff layered before that. 
In other words, that equation processor probably runs a million times on a new game, and the equation might have 5 values in it, so max case is 5 million values get interpreted when you click on &#8216;new game&#8217;.<\/p>\n\n\n\n<p>Can I speed it up?<\/p>\n\n\n\n<p>First step is to see how stable these values are anyway, so I&#8217;ll do an identical profile run and check that the +\/- errors on different profiling runs are small&#8230;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"316\" src=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-2.png\" alt=\"\" class=\"wp-image-6417\" srcset=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-2.png 800w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-2-680x269.png 680w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-2-768x303.png 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p>I think that&#8217;s pretty close. I definitely have numbers here that are in the same ballpark. So now let&#8217;s try some optimisations to speed this puppy up. 
Looking at the top function with a double-click gives me a whole bunch more data:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"406\" src=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-3-1024x406.png\" alt=\"\" class=\"wp-image-6418\" srcset=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-3-1024x406.png 1024w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-3-680x269.png 680w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-3-768x304.png 768w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-3.png 1161w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The function is much longer than this, but that&#8217;s mostly catering to relatively rare edge cases. Looking at the bits that actually have numbers on them shows pretty clearly that it&#8217;s pretty much all about the pesky strcmp() call. A separate piece of code has already parsed the full equation of 0-(0.4*x), so I have a bunch of char buffers for each variable, declared like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">char Vals[MAX_VARIABLES][32];<\/pre>\n\n\n\n<p>The thing is, do I need the overhead of calling strcmp() when I am only really checking for whether the first letter is x? Sadly I cannot JUST check that, because that would prevent me having a named variable starting with an x. Let&#8217;s imagine this equation:<\/p>\n\n\n\n<p>0-(0.4*xylophone)<\/p>\n\n\n\n<p>Obviously not very likely, but theoretically possible. If the length of the string in the buffer is 1, and the first letter is x, then that&#8217;s a hit, but the question is, will inlining 2 manual checks be faster than a strcmp function call? 
Let&#8217;s replace that code with<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">if(Vals[valindex][0] == 'x' &amp;&amp; Vals[valindex][1] == '\\0')<\/pre>\n\n\n\n<p>And check out the results:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"429\" src=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-4-1024x429.png\" alt=\"\" class=\"wp-image-6419\" srcset=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-4-1024x429.png 1024w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-4-680x285.png 680w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-4-768x322.png 768w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-4.png 1158w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Hmmm. Actually WORSE as far as I can tell. So it looks like whether we strcmp or not, just checking the value of two bytes at that point is slow, probably because it&#8217;s not immediately available memory? It&#8217;s notable that the code at line 232 is super fast by comparison, as it&#8217;s just checking a bool value we cached earlier. Maybe I should try that? When I parse the function, just keep a bool for each Value, saying if it&#8217;s &#8216;x&#8217; or not?<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"799\" height=\"372\" src=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-5.png\" alt=\"\" class=\"wp-image-6420\" srcset=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-5.png 799w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-5-680x317.png 680w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-5-768x358.png 768w\" sizes=\"auto, (max-width: 799px) 100vw, 799px\" \/><\/figure>\n\n\n\n<p>Whoahh. 
This looks like a pretty major speedup. 326 cycles versus 1,143. What the hell? Why didn&#8217;t I do this earlier? Let&#8217;s look at the line by line&#8230;<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"409\" src=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-6-1024x409.png\" alt=\"\" class=\"wp-image-6421\" srcset=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-6-1024x409.png 1024w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-6-680x272.png 680w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-6-768x307.png 768w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-6.png 1164w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>This is awesome. I then tried to make this code inline, but it seemed to not make things any better. I haven&#8217;t fully explored uProf yet, but it does do cool flame graphs:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"505\" src=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-7-1024x505.png\" alt=\"\" class=\"wp-image-6422\" srcset=\"https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-7-1024x505.png 1024w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-7-680x335.png 680w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-7-768x379.png 768w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-7-1536x758.png 1536w, https:\/\/www.positech.co.uk\/cliffsblog\/wp-content\/uploads\/2022\/11\/image-7-2048x1010.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Profiling UIs are great fun 
:D<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I am well aware that my game Democracy 4 is not exactly slow with huge framerate issues. However, optimization is fun! or at least it should be, but in practice, getting profiling to work on remote PCs is not exactly easy. I have basically used every profiling software imaginable and still have not got one<\/p>\n<p class=\"text-right\"><span class=\"screen-reader-text\">Continue Reading&#8230; Optimization for fun!<\/span><a class=\"btn btn-secondary continue-reading\" href=\"https:\/\/www.positech.co.uk\/cliffsblog\/2022\/11\/05\/optimization-for-fun\/\">Continue Reading&#8230;<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"class_list":["post-6414","post","type-post","status-publish","format-standard","hentry","category-programming"],"_links":{"self":[{"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/posts\/6414","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/comments?post=6414"}],"version-history":[{"count":1,"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/posts\/6414\/revisions"}],"predecessor-version":[{"id":6423,"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/posts\/6414\/revisions\/6423"}],"wp:attachment":[{"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/media?parent=6414"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/categories?pos
t=6414"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.positech.co.uk\/cliffsblog\/wp-json\/wp\/v2\/tags?post=6414"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}