I am sorry if it appears this question has been done to death. I've done plenty of research, however, and it seems there is no well-known solution to what seems a simple problem: take a screenshot in Windows.
There's a catch of course - the screenshot is to be manipulated in some way (GPU side, with shaders/etc), so there is no option of a slow copy to system memory. Instead the copy must somehow stay in graphics memory. GetFrontBuffer and the like are limited in this sense (they don't work, full stop - I've checked).
I am aware of several closed questions on the Stack Exchange network ("Don't bother, not possible") and two with open bounties that amount to solving this problem.
Windows 7 introduced some changes to the graphics system, so now there is a compositing window manager, etc. Apparently GDI is also now 'hardware accelerated', so I was hoping this would have exposed a simple path for a possible solution:
GDI desktop window device context (in GPU memory) -> some Direct2D or Direct3D surface
In my particular case, just getting the DC in GPU memory is sufficient, but I am looking for a general solution.
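To make the GDI -> Direct3D idea concrete, here's the kind of thing I have in mind: blit the desktop DC into a DC handed out by a Direct3D 9 surface, so the pixels (hopefully) never leave video memory. This is only a sketch - whether the BitBlt actually stays GPU side is exactly the part I can't confirm, and error handling is omitted:

```cpp
// Sketch only: copy the desktop DC into a D3D9 offscreen surface via GDI.
// Assumes 'device' is an already-created IDirect3DDevice9*.
#include <windows.h>
#include <d3d9.h>

IDirect3DSurface9* CaptureDesktopToSurface(IDirect3DDevice9* device, int width, int height)
{
    IDirect3DSurface9* surface = nullptr;
    // GDI-compatible format so the surface can hand out a DC.
    device->CreateOffscreenPlainSurface(width, height, D3DFMT_X8R8G8B8,
                                        D3DPOOL_DEFAULT, &surface, nullptr);

    HDC surfaceDC = nullptr;
    surface->GetDC(&surfaceDC);            // DC backed by the D3D surface

    HDC desktopDC = GetDC(nullptr);        // DC for the composited desktop
    BitBlt(surfaceDC, 0, 0, width, height, desktopDC, 0, 0, SRCCOPY);
    ReleaseDC(nullptr, desktopDC);

    surface->ReleaseDC(surfaceDC);
    return surface;                        // hopefully still resident in video memory
}
```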
So, how does one screenshot GPU side in Windows?
TL;DR at end in bold if you don't want rationale/context (which I'm providing since it's always good to explain the core issue and not simply ask for help with Method X which might not be the best approach)
I frequently do software performance analysis on older hardware, which shows up race conditions, single-frame graphical glitches and other issues more readily than more modern silicon does.
Often, it would be really cool to be able to take screenshots of a misbehaving application that might render garbage for one or two frames or display erroneous values for a few fractions of a second. Unfortunately, problems most frequently arise when the systems in question are swapping heavily to disk, making it consistently unlikely that the screenshots I try to take will contain the bugs I'm trying to capture.
The obvious solution would be a capture device, and I definitely want to explore pixel-perfect image and video recording in the future when I have the resources for that (it sounds like a hugely fun opportunity to explore FPGAs).
I recently realized, however, that the kernel is what is performing the swapping, and that if I move screenshotting into kernelspace, well, I don't have to wait for my screenshot keystroke to make its way through the X input layer, into the screenshot program, wait for that to do its XSHM dance and get the screenshot data, all while the system is heavily I/O loaded (e.g., a 5-second system load of >10) - I can simply have the kernel memcpy() the displayed area of video memory to a preallocated buffer at the exact fraction of a second I hit PrtSc!
TL;DR: Where should I start looking to figure out how to "portably" (within the sense of Linux having different graphics drivers, each with different architectural designs) access the currently-displayed area of video memory?
I get the impression I should be looking at libdrm, possibly within KMS, but I would really appreciate some pointers on what actually accesses video memory.
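To make that concrete, here is roughly where I'd start poking from userspace with libdrm - purely a sketch, assuming KMS is in use and the card node is /dev/dri/card0; how to actually map the framebuffer handle afterwards is exactly the driver-dependent part I'm asking about:

```c
/* Sketch: find the framebuffer currently being scanned out, via libdrm.
   Needs root / DRM master; mapping the handle afterwards is driver-specific. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR);
    drmModeRes *res = drmModeGetResources(fd);

    for (int i = 0; i < res->count_crtcs; i++) {
        drmModeCrtc *crtc = drmModeGetCrtc(fd, res->crtcs[i]);
        if (crtc && crtc->buffer_id) {               /* this CRTC is scanning out */
            drmModeFB *fb = drmModeGetFB(fd, crtc->buffer_id);
            printf("CRTC %u: %ux%u, pitch %u, bpp %u, handle %u\n",
                   crtc->crtc_id, fb->width, fb->height,
                   fb->pitch, fb->bpp, fb->handle);
            /* next step: map fb->handle (dumb-buffer ioctl or PRIME export)
               and memcpy the visible region -- the driver-dependent part */
            drmModeFreeFB(fb);
        }
        drmModeFreeCrtc(crtc);
    }
    drmModeFreeResources(res);
    close(fd);
    return 0;
}
```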
I'm also guessing there are probably some caveats and gotchas to reading video memory directly on certain chipsets? I don't expect my code to make it into the Linux kernel (who knows, but I doubt it) but I'd still like whatever I build to be fairly portable across computers for convenience.
NOTE: I am not using compositing with the systems in question, in case this changes anything. I'm interested to know whether I could write a compositing-compatible system; I suspect this would be nontrivial.
Recently, I started developing an operating system in NASM and C. I have already made a boot loader, kernel, filesystem, etc. So far I used the VGA text mode directly, writing to the address 0x000B8000. Then I decided to switch to video mode instead of text mode. I chose the maximal display resolution, 320x200, but then I realised there are three problems. Firstly, there are only 256 different colors. Secondly, the resolution is too small. Thirdly, writing to the address 0x000A0000 is too slow. I tried to do some animations, but it is very laggy and sometimes it waits more than one second before the next frame.
I have searched on the internet for explanations of how to switch to higher resolutions such as 1920x1080 and how to use 256*256*256 colors instead of just 256. Everything I found said that it is very hard to use higher resolutions because you must develop drivers for all the different types of graphics cards, and for some cards there is no documentation, so you must use reverse engineering.
I really want to introduce high-resolution graphics to my operating system. Is it really hard or is there any easy method? Any suggestions on how I can solve this?
Nearly every graphics adapter supports VESA framebuffer semantics; you can configure almost every video mode with that. The drawback is that you cannot use vendor-specific features (accelerated graphics, etc.).
The VESA X server, for example, works with almost any graphics adapter (but the model-specific drivers are considerably faster).
See also: https://en.wikipedia.org/wiki/VESA_BIOS_Extensions
You can do high-res VESA graphics in assembly and it should be fast enough (especially in the beginning phase, while you are learning and not doing very fancy 3D stuff).
First of all, make sure you are using a good emulator/virtual machine for testing. I was using QEMU and it was way too slow to do any graphics at only 640x480x24bpp. I switched to VirtualBox and, though it starts up quite slowly, I have never looked back.
As for the programming part, I encourage you to look at a project called Pure64. You can find it on GitHub. Go to src/init/isa.asm and look at the end of the file - there is some code to do VESA initializations. I am actually using Pure64 to set up a clean 64bit environment and I am doing VESA graphics so I can say that it works fine.
The VESA init consists of two parts - getting the mode info and setting the video mode. Once you get the mode info, you get a video base pointer to a contiguous region of memory where you can write your pixels without switching banks or doing complicated stuff. At least in 64-bit mode.
The only problem I had with this is that I could not get 32bpp mode working. I can do 24bpp, which is RRGGBB - 3 bytes per pixel (exactly like HTML/CSS color codes). As with anything that consists of 3 bytes on a binary computer, this makes some things a bit more complex (at least for a beginner). Getting 4 bytes per pixel to work still eludes me. Maybe this is a limitation of VirtualBox or something.
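Once you have the linear framebuffer pointer and the pitch (bytes per scanline) from the mode info, plotting a pixel in 24bpp is just pointer arithmetic. A tiny sketch (the base pointer and pitch are whatever your VESA init returned; the byte order shown is the usual little-endian BGR layout, so verify it on your setup):

```c
/* Sketch: write one 0xRRGGBB pixel into a 24bpp linear VESA framebuffer. */
#include <stdint.h>

void put_pixel_24bpp(uint8_t *framebuffer, uint32_t pitch,
                     uint32_t x, uint32_t y, uint32_t rgb)
{
    uint8_t *p = framebuffer + y * pitch + x * 3;  /* 3 bytes per pixel */
    p[0] = rgb & 0xFF;          /* blue  */
    p[1] = (rgb >> 8) & 0xFF;   /* green */
    p[2] = (rgb >> 16) & 0xFF;  /* red   */
}
```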
This all means that for basic hi-res graphics there is no need to do a lot of hardware-specific things. If you are on mildly current hardware, you should do fine.
Context
I've been programming mainly as a hobby for some time now, mostly in C# and Java. I made many applications (Windows Forms or Java forms) that required animated content. In Java I would use Graphics.drawX() and redraw as a function of time. When the animations were happening frequently, the resolution would diminish or the application would slow down. I never gave it thought until I played a video game on the same computer that had so much trouble rendering a simple Java app. How can my computer instantly render a complex moving environment but struggle displaying a home-made 2048 game? I figured it must be either because I am misusing the draw functions, or because those functions are not optimized for real-time rendering.
Question:
How can I directly access the display without having to go through preprogrammed functions?
I realize this may be hard in higher-level languages, so let's say in C on a Windows OS. (But I would appreciate any answer relating to any language and/or OS.)
I know it's a really vague question, but I can't seem to find the right words to Google it appropriately. Thank you very much for your help!
You can't (or maybe I should say should never) try to access the graphics driver directly on Windows. You used to have to write directly to video memory to do graphics prior to Windows, as DOS did not support graphics or display management, and the stability of those programs was always a bit dicey. On Windows, the OS owns the screen and you have to work through it to access it.
The very concept of a Windows-based OS is that the OS owns the display and gives applications access to a virtual display so that the OS can hide it or move it around. In most cases this does not cause a speed problem; but in certain cases, like gaming, you need more speed, so DirectX allows you to transfer some of those tasks to the graphics card to get the speed you need.
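To give a rough idea of what "working through DirectX" looks like at the lowest level, here is the sort of boilerplate that hands an existing window over to the GPU (a Direct3D 11 sketch; hwnd is assumed to be a window you already created, and error handling is omitted):

```cpp
// Sketch: create a D3D11 device and swap chain for an existing HWND.
#include <windows.h>
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

void InitD3D(HWND hwnd, ID3D11Device** device, ID3D11DeviceContext** context,
             IDXGISwapChain** swapChain)
{
    DXGI_SWAP_CHAIN_DESC scd = {};
    scd.BufferCount       = 1;
    scd.BufferDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    scd.BufferUsage       = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    scd.OutputWindow      = hwnd;
    scd.SampleDesc.Count  = 1;
    scd.Windowed          = TRUE;

    // The OS still owns the screen; this just gives you a hardware-backed
    // surface to draw into, which DXGI then presents on your behalf.
    D3D11CreateDeviceAndSwapChain(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                  nullptr, 0, D3D11_SDK_VERSION,
                                  &scd, swapChain, device, nullptr, context);
}
```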
For more info on DirectX, check out Microsoft's Graphics and Gaming Resources
I have a WPF application that acquires images from a camera, processes these images, and displays them. The processing part has become burdensome for the CPU, so I've looked at moving this processing to the GPU and running custom CUDA kernels against them. The basic process is as follows:
1) acquire image from camera
2) load image onto GPU
3) call CUDA kernel to process image
4) display processed image
A WPF-to-CUDA-to-Display Control strategy is what I'm trying to figure out.
It seems natural that once the image is loaded onto the GPU that it would not have to be unloaded in order to be displayed. I've read that this can be done with OpenGL, but do I really need to learn OpenGL and include it in my project in order to do a fast display of a CUDA-processed image?
I understand (I think) the issues of calling CUDA kernels from C#. My plan is to either build an unmanaged library around my CUDA calls, which I later wrap for C# -- OR -- try to decide on which one of the managed wrappers (managedCUDA, Cudafy, etc.) to try. I worry about using one of the prebuilt wrappers because they all appear to be lightly supported...but maybe I have the wrong impression.
Anyway, I'm feeling a bit overwhelmed after days of researching the possible options. Any advice would be greatly appreciated.
The process of taking a result of CUDA computation and using it directly on the device for a graphics activity is called "interop". There is OpenGL "interop" and there is DirectX "interop". There are plenty of CUDA sample codes demonstrating how to interact with computed images.
To go directly from computed data on the device, to display, without a trip to the host, you will need to use one of these 2 APIs (OpenGL or DirectX).
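For the OpenGL flavor, the interop boils down to registering a GL texture with CUDA, mapping it, and letting your kernel write into the mapped array. A bare-bones sketch of the host-side calls (the GL setup and the kernel itself are assumed to already exist):

```cpp
// Sketch: let a CUDA kernel write into an existing OpenGL texture (GL interop).
// Assumes a GL context is current and 'tex' is an allocated GL_TEXTURE_2D.
#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

void ProcessIntoTexture(GLuint tex)
{
    cudaGraphicsResource* resource = nullptr;
    cudaGraphicsGLRegisterImage(&resource, tex, GL_TEXTURE_2D,
                                cudaGraphicsRegisterFlagsSurfaceLoadStore);

    cudaGraphicsMapResources(1, &resource, 0);
    cudaArray* array = nullptr;
    cudaGraphicsSubResourceGetMappedArray(&array, resource, 0, 0);

    // ... bind 'array' to a surface/texture object and launch the kernel here ...

    cudaGraphicsUnmapResources(1, &resource, 0);   // GL may now sample the texture
    cudaGraphicsUnregisterResource(resource);
}
```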
You mentioned two of the managed interfaces I've heard of, so it seems like you're aware of the options there.
If the processing time is significant compared to (much larger than) the time taken to transfer the image from host to device, you might consider starting out by just transferring the image from host to device, processing it, and then transferring it back, where you can then use the same plumbing you have been using to display it. You can then decide if the additional effort for interop is worth it.
If you can profile your code to figure out how long the image processing takes on the host, and then prototype something on the device to find out how much faster it is, that will be instructive.
You may find that the processing time is so long you can even benefit from the double-copy arrangement. Or you may find the processing time is so short on the host (compared to just the cost to transfer to the device) that the CUDA acceleration would not be useful.
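For reference, a sketch of what the plain double-copy version looks like on the unmanaged side - the kernel body here is just a placeholder; the point is the transfer pattern whose cost you would be measuring:

```cpp
// Sketch: host -> device -> process -> host, no graphics interop involved.
#include <cuda_runtime.h>

__global__ void processImage(unsigned char* pixels, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pixels[i] = 255 - pixels[i];   // placeholder processing
}

void ProcessFrame(unsigned char* hostPixels, int n)
{
    unsigned char* devPixels = nullptr;
    cudaMalloc(&devPixels, n);

    cudaMemcpy(devPixels, hostPixels, n, cudaMemcpyHostToDevice);   // copy #1
    processImage<<<(n + 255) / 256, 256>>>(devPixels, n);
    cudaMemcpy(hostPixels, devPixels, n, cudaMemcpyDeviceToHost);   // copy #2

    cudaFree(devPixels);
    // hostPixels now goes back through the existing WPF display path
}
```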
WPF has a control named D3DImage to show DirectX content directly on screen, and in the managedCuda samples package you can find a version of the original fluids sample from the CUDA Toolkit using it (together with SlimDX). You don't have to use managedCuda to use CUDA in C#, but you can look at it to see how things can be done: managedCuda samples
My .Net Winforms application creates three OpenGL rendering contexts in my main window, and then allows the user to popup other windows where each window has two more rendering contexts (using a splitter). At around the 26th rendering context, things start to go REALLY slow. Instead of taking a few milliseconds to render a frame, the new rendering context takes between 5 and 10 seconds. It still works, just REALLY SLOW! And OpenGL does NOT return any errors (glGetError).
The other windows work fine. Just the new rendering contexts after a certain number slow down. If I close those windows, everything is fine -- until I reopen enough windows to pass the limit. Each rendering context has its own thread, and each one uses a simple shader. The slow down appears to happen when I upload a texture. But the size of the texture has no effect on how many contexts I can create, nor does the size of the OpenGL window.
I'm running on NVIDIA cards and see this on different GPUs with different amounts of memory and different driver versions. What's the deal? Is there some limit to how many rendering contexts an application can create?
Does anyone else have an application with LOTS of rendering contexts going at the same time?
As Nathan Kidd correctly said, the limit is implementation-specific, and all you can do is to run some tests on common hardware.
I was bored at today's department meeting, so I tried to piece together a bit of code which creates OpenGL contexts and tries some rendering. I tried rendering with and without textures, with and without a forward-compatible OpenGL context.
It turned out that the limit is pretty high for GeForce cards (maybe even no limit). For a desktop Quadro, there was a limit of 128 contexts that were able to repaint correctly; the program was able to create 128 more contexts with no errors, but the windows contained rubbish.
It was even more interesting on an ATI Radeon 6950: there the redrawing stopped at window #105, and creating rendering context #200 failed.
If you want to try for yourself, the program can be found here: Max OpenGL Contexts test (there is full source code + win32 binaries).
That's the result. One piece of advice - avoid using multiple contexts where possible. Multiple contexts are understandable in an application running on multiple monitors, but applications on a single monitor should stick to a single context. Context switching is slow. And that's not all. Applications where OpenGL windows are overlapped by other windows require hardware clipping regions. There is one hardware clipping region on GeForce, and eight or more on Quadro (CAD applications often use windows and menus that overlap the OpenGL window, in contrast with games). In case more regions are needed, rendering falls back to software - so again - having lots of OpenGL windows (contexts) is not a very good idea.
The best bet is that there is no real answer to this question. It probably depends on some internal limitation of the driver, the hardware or even the OS. Something you might want to check is the number of available texture units using glGetIntegerv(GL_MAX_TEXTURE_UNITS, ...), but that may or may not be indicative.
A common solution to avoid this is to create multiple viewports within a single context rather than multiple contexts in a single window. It shouldn't be too hard to unite the two contexts that share a window into a single context with two viewports and some kind of UI widget to serve as the splitter (see the sketch below). Multiple windows are a different story, and you may want to consider completely re-thinking your UI design if there is an actual need for 26 separate OpenGL windows.
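Something along these lines works for the splitter case with a single context (sketch only; it assumes the context is current and covers the whole client area):

```cpp
// Sketch: render two "panes" in one OpenGL context using viewports + scissor.
#include <GL/gl.h>

void DrawSplitView(int width, int height, int splitterX)
{
    glEnable(GL_SCISSOR_TEST);

    // Left pane
    glViewport(0, 0, splitterX, height);
    glScissor(0, 0, splitterX, height);
    glClearColor(0.1f, 0.1f, 0.1f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... draw left-pane scene ...

    // Right pane
    glViewport(splitterX, 0, width - splitterX, height);
    glScissor(splitterX, 0, width - splitterX, height);
    glClearColor(0.2f, 0.2f, 0.2f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... draw right-pane scene ...

    glDisable(GL_SCISSOR_TEST);
}
```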
It's hard for me right now to think of a real UI use case that would actually require 26 different OpenGL windows operating simultaneously. Maybe another option is to create a pool of, say, 5-10 contexts and reuse them only in the windows (tabs?) that are currently visible to the user. I didn't try it, but it should be possible to create a context inside a plain window that contains nothing else and then reparent that window to whichever top-level window it is needed in.
EDIT -
Well, actually it's not that hard to think of one. The latest Chrome (9.x.x), which supports WebGL, may want to open many tabs, each with a WebGL context... I wonder if they handle this in any way. I just tried it and ran out of memory after 13 tabs... That would actually be a good check for you as well, to see if it's something you're doing wrong or if Chrome and Firefox (4.0.x beta) have the same problem.
Given the diverse nature of OpenGL drivers, your best bet is probably to check the behavior of the major drivers (AMD / Intel / NVIDIA / the MS software renderer) and run a test on first startup. E.g., if you can see that NVIDIA always slows down like you saw, then just run a quick loop until you see where the limit is on that machine (or rather, card). It's not much fun, but I think it's pretty hard to reliably push the limits otherwise.
In other words "best bet" is just like previously answered, you can't know beforehand.
If you go through that much trouble to set OpenGL up in an over-the-top multi-threaded fashion, you might as well benefit from it and consider switching to Vulkan. See, by design, the OpenGL architecture funnels all the hard-earned context/thread-separated drawing operations into one single driver thread, which then redistributes all these calls across virtual hardware threads that map onto each context. The driver is in essence a huge bottleneck, because it is not itself threaded, despite any GLEW MX sitting around. It is simply not designed to handle this well.
That said, I am curious whether you used an older version of GLEW, or whether you do all the extension handling in some other way, since the latest GLEW libs no longer support MX. One more reason to switch.