Reading back from GPU memory in directx9

April 14, 2014 cliffski

Yeah you read that right, I’m reading back from the card. Yes, I feel kinda dirty. What am I talking about? (Skip this if you are a gfx coder…)

***Generally speaking, games create ‘textures’ in memory on the graphics card, so the data is actually stored there. We write data *to* the card, and then we forget about it. We tell the card to draw chunks of that data to the screen, and it does so. What you never do is read back *from* that same data. In other words, you draw stuff to the screen, but have no way of actually looking *at* the screen from back where you generally are in CPU / RAM land. The reason for this is that everyone understands reading back to be slow, and there are very few reasons to do it.***

I have some technique, the details of which I won’t bore you with, which requires me to draw the scene in a certain way, then blur that scene, and then check the color value of specific pixels. I cannot find any way to do this without reading back from the video card. I should say this is for Gratuitous Space Battles 2.

Theoretically, I could maintain a system-memory-only version of the scene, render to it there, blur it, and read from it without ever touching the card’s GPU or RAM. This would mean no sneaky use of the video card bus for data transfer. The problem is, I suspect this would be slower. The GPU is good at blurring and rendering, and in fact all of the gfx I draw to the scene is already loaded in the card’s RAM. Make no bones about it, I have to compose this scene on the card, in card RAM. And if I want to access specific pixel colors, I need to get that data back.

So what I’m doing now is a call to GetRenderTargetData to grab the data and stick it into a system memory texture I created earlier specifically for this purpose. BTW did I mention I have to do this every frame? Once the data is there, I call LockRect() on the whole texture, quickly zip through my list of points, then call UnlockRect() as soon as I can. So what happens?
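To make the "zip through my list of points" step concrete, here is a minimal sketch in plain C++. It is an illustration only, not the actual engine code. `LockedRect` is a hypothetical stand-in for D3DLOCKED_RECT (same two fields, `Pitch` and `pBits`), `Point` and `SamplePoints` are made-up names, and the pixel format is assumed to be 32 bits per pixel (something like D3DFMT_A8R8G8B8). In the real thing the buffer would come from LockRect() on the system memory copy.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for D3DLOCKED_RECT: same two fields, same meaning.
struct LockedRect {
    int   Pitch;  // bytes per row; may be larger than width * 4
    void* pBits;  // first byte of the top row
};

struct Point { int x; int y; };  // hypothetical screen-space sample position

// Read one 32-bit pixel. Rows must be stepped by Pitch, not by width * 4,
// because the driver is free to pad each row.
inline std::uint32_t ReadPixel(const LockedRect& lr, int x, int y) {
    const std::uint8_t* row = static_cast<const std::uint8_t*>(lr.pBits)
                            + static_cast<std::ptrdiff_t>(y) * lr.Pitch;
    std::uint32_t value = 0;
    std::memcpy(&value, row + static_cast<std::ptrdiff_t>(x) * 4, sizeof(value));
    return value;
}

// Collect the colors at every point of interest in one pass, so the lock
// can be released as soon as this returns. No bounds checking is done here.
std::vector<std::uint32_t> SamplePoints(const LockedRect& lr,
                                        const std::vector<Point>& pts) {
    std::vector<std::uint32_t> colors;
    colors.reserve(pts.size());
    for (const Point& p : pts) {
        colors.push_back(ReadPixel(lr, p.x, p.y));
    }
    return colors;
}
```

Copying the handful of values out and unlocking immediately keeps the locked window short; any decisions based on those colors can be made afterwards from the returned vector.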

Well if I look at the contention analysis in Visual Studio, it shows me that this causes a lot of thread blocking. It’s pretty much all of the thread blocking. This is clearly sub-optimal. But if I look at the actual game running in 1920×1200 res in FRAPS, the whole thing runs at a consistent 59-60 FPS. My video card is an Nvidia GeForce GTX 670. In other words, it really isn’t a problem. Am I over-reacting to what was once a taboo, and now is not? Are people calling LockRect() on textures just for giggles these days? Is my engine sufficiently meek that it leaves plenty of spare room in each frame to put up with this clunky technique?

I’ve also considered that I may be screwing up by doing this close to the end of a frame (sadly this is a requirement of my engine, unless I let a certain effect *lag* a frame). If it happened mid-frame I suspect the thread-blocking that prevents the end frame Present() wouldn’t be so bad. Sadly I can’t move it.
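For what the "lag a frame" alternative might look like, here is a minimal sketch of just the bookkeeping, in plain C++ with no Direct3D calls. It is an assumption-laden illustration, not the engine's code: `DelayedReadback`, `Frame` and `Push` are hypothetical names, and a `std::vector` stands in for each system memory copy. The idea is to store this frame's copy and consume the one made `Latency` frames earlier.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical ring of readback slots. Each frame stores its copy and gets
// back the copy made `Latency` frames ago, once enough frames have passed.
template <std::size_t Latency>
class DelayedReadback {
    static_assert(Latency > 0, "need at least one frame of latency");

public:
    using Frame = std::vector<std::uint32_t>;  // stand-in for one readback

    // Returns true and fills `oldest` once the ring has warmed up.
    bool Push(Frame current, Frame& oldest) {
        bool ready = false;
        if (count_ == Latency) {
            oldest = std::move(slots_[head_]);  // copy made Latency frames ago
            ready = true;
        } else {
            ++count_;                           // still warming up
        }
        slots_[head_] = std::move(current);
        head_ = (head_ + 1) % Latency;
        return ready;
    }

private:
    std::array<Frame, Latency> slots_{};
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};
```

With `DelayedReadback<1>`, the first `Push` returns false and every later one hands back the previous frame's data. Whether consuming older data actually shortens the stall depends on how and when the driver completes the copy, so that part would need profiling.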

I’ve also wondered if a series of smaller LockRect() calls that don’t fill the screen might be quicker, but I doubt it; I think it’s the mere lockiness, not the area of memory, that matters. I can easily allow the effect to be toggled under options BTW, so if it is a frame-rate killer for some people, they can just turn it off.
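If someone did want to try a smaller lock, one way is a single lock over the bounding rectangle of the sample points. The sketch below only computes that rectangle; it is a hypothetical helper, not part of the engine. `Rect` mirrors the Win32 RECT layout (right and bottom are exclusive), which is what LockRect() accepts as its optional region, and `Point` and `BoundingRect` are made-up names.

```cpp
#include <algorithm>
#include <vector>

// Mirrors Win32 RECT: left/top inclusive, right/bottom exclusive.
struct Rect { long left; long top; long right; long bottom; };

struct Point { int x; int y; };  // hypothetical sample position

// Smallest rectangle that contains every sample point.
// Returns false when there are no points, leaving `out` untouched.
bool BoundingRect(const std::vector<Point>& pts, Rect& out) {
    if (pts.empty()) {
        return false;
    }
    Rect r = { pts[0].x, pts[0].y, pts[0].x + 1L, pts[0].y + 1L };
    for (const Point& p : pts) {
        r.left   = std::min<long>(r.left,   p.x);
        r.top    = std::min<long>(r.top,    p.y);
        r.right  = std::max<long>(r.right,  p.x + 1L);
        r.bottom = std::max<long>(r.bottom, p.y + 1L);
    }
    out = r;
    return true;
}
```

One detail to watch: when a sub-rectangle is locked, the returned pointer refers to the top-left of that rectangle, so each point has to be offset by `left` and `top` before indexing. Whether any of this is faster than locking the whole surface is exactly the open question above.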

11 thoughts on Reading back from GPU memory in directx9

ac says:

April 14, 2014 at 12:28 pm

(is’ here all old news probably … w’e)

One way to approach such an issue is to look at the product as if you were the end user.
GTX670 performance goes for around $150-$250, but that is not relevant. What is relevant is that this game on the surface appears 2D even if technically it is not. So there are going to be some (possibly vocal) people who can’t read specs, reviewing it on some Intel HD4000 or worse and going in believing it will run like any 2D game.

From the benchmark study I’ve done, those Intel HDxxxx chips are still slower in many operations than some 2006 desktop GPU like the 8800GT, and have more bugs and corner cases all around. An Nvidia desktop GPU tends to be the best case for gaming. As a producer, even if the game isn’t catering to the lowest common denominator, it’s still a good idea to do what they do in music production: make sure the product is good with low-end equipment. Personally, however, I would only do this to the degree where optimizations can be done without affecting the end result on the high-end setups. Consoles tend to go the other way around, and then there’s all the ranting about mediocre PC ports.

However, you would think that such an operation (reading from the GPU) would be much faster with the GPU sitting next to the CPU. I certainly would like to believe so (and with the new Iris Pro in still-expensive portable computers, the eDRAM cache might help a lot here too). So there’s a possibility this isn’t a problem at all.
cliffski says:

April 14, 2014 at 12:31 pm

Indeed, I have 4 different PCs here, all with different video cards, and the older Intel cards will definitely get a thorough profiling before I decide on any default settings.
ac says:

April 14, 2014 at 12:54 pm

I would add that if any game developer is doing pioneering/bleeding edge/novel kind of things, it might be wise to have some sort of thing where you can test the tech broadly with low risk.

Typical methods are going to public beta/demo (these do have the risk of sales affecting feedback). An atypical and less risky one, which might be very interesting in GSB2’s case, is giving away (in return for test results) some sort of “benchmarking screensaver” well ahead of release (with a bouncing semi-transparent disclaimer about it being a tech test and not representative of the final product), the purpose of which is just to collect the data you need for determining what tricks you can use in terms of the target specs or optimizations.

Personally if I was running the show for the couple “bleeding edge visual tech” game projects I follow, I would be extremely interested in trying to get into deals for shipping screensavers with new Samsung/LG OLED displays (just HD res for now). Some reviews have pointed out they may still have burn-in issues for static high brightness content. I don’t have any idea how to best get started in such though, beside the usual “find the person who’s in charge and get your demo seen by them”.
Cygon says:

April 14, 2014 at 1:27 pm

I believe the main issue with that is that you lose all parallelism between CPU and GPU (i.e. you’re going back to the old 80s scheme of “burn the CPU as hard as you can to gain performance”).

The graphics driver can queue up commands to keep executing while your code is already preparing the *next* frame. At the very moment you lock a render target, you’re completely eliminating the parallelism – the driver has to wait until the GPU has finished executing every command up to that point, so first your CPU is idling, waiting for the GPU to finish drawing, then the GPU is idling, waiting for the CPU to finish checking those pixels.

While it is a very sub-optimal design for a renderer, the video memory access path has received a lot of love since things like OpenCL, CUDA and PhysX were rolled out, so the readback itself should be reasonably fast.
cliffski says:

April 14, 2014 at 3:10 pm

Indeed, I fully realize the complete stall that takes place in this case. This is why it’s a pity I can’t find a way to trigger it when the card is idle anyway, but the nature of the effect is it has to come at the end of a frame when all my processing is already done. In theory I could do some ‘look ahead’ processing at that point while the GPU stalls…
z0r says:

April 14, 2014 at 4:02 pm

I came across the same problem in my own engine, where I render a scene, and I just want to grab that one pixel that is under my mouse cursor.
What I did to solve this is just render everything normally, and when I needed the pixel under my mouse, I created a render target of just 1×1 and rendered the screen texture stretched in such a way that only the desired pixel was rendered to that 1×1 texture. Grabbing only this texture from video memory is pretty fast. (And I agree this is ugly to do, but it does the job.)
cliffski says:

April 14, 2014 at 4:22 pm

Very interesting. Did you test to see if the 1 pixel grab vs fullscreen grab had any speed difference?
Long says:

April 15, 2014 at 8:02 pm

Just curious (since I haven’t done ANY of this), but can you look up the frame buffer from the shader code and use it directly?
cliffski says:

April 15, 2014 at 8:08 pm

Not realistically, because I’m actually changing quite a lot of what I’m drawing depending on a single pixel. It’s just hugely involved to do it that way.
Michal says:

April 15, 2014 at 9:26 pm

How about transferring just what’s needed and not the whole image? I’d create a dedicated shader that would read pixels from the screen (set up as an input texture) and then write them to a dedicated “transfer” texture or buffer, set up as a render target.

Some nVidia cards can transfer data in parallel to rendering. Check out Shalini Venkataraman’s GTC talks.
Vico87 says:

April 23, 2014 at 12:38 pm

I did some reading back a while ago, and it was surprisingly fast (but still really slow compared to a fully GPU-based solution). What I did is render to a 512×512 R32_FLOAT texture, then copy that to CPU memory, do some processing on it, then write the results back to GPU memory (1024×256 texture, also R32_FLOAT), to use in rendering the final image. I did this each frame, with the CPU-side processing completely parallelized across 6 cores (AMD Phenom II X6 CPU). Apart from the multithreading, I did not optimize the CPU code much. I got away with a stable 70-80 FPS on a GTX460. Average CPU usage was around 40%, because each frame it was waiting for the data, then crunching at 100%, then writing the results back. A fully GPU-based solution (where the processing was done in a compute shader instead of on the CPU) ran at around 250-300 FPS. I ran it on a computer with a low-end CPU and a high-end GPU, and got pretty much the same performance with the GPU-based implementation, but a way slower one when the CPU was involved (as far as I remember it was an older Core2 Duo, and achieved around 15-20 FPS).
