How long does it take for OpenGL to actually update the screen?

I have a simple OpenGL test app in C which draws different things in response to key input. (Mesa 8.0.4, tried with Mesa-EGL and with GLFW, Ubuntu 12.04LTS on a PC with NVIDIA GTX650). The draws are quite simple/fast (rotating triangle type of stuff). My test code does not limit the framerate deliberately in any way, it just looks like this:
while (true)
{
    draw();
    swap_buffers();
}
I have timed this very carefully, and I find that the time from one eglSwapBuffers() (or glfwSwapBuffers) call to the next is ~16.6 milliseconds. The time from after a call to eglSwapBuffers() to just before the next call is only a little bit less than that, even though what is drawn is very simple. The time that the swap buffers call takes is well under 1ms.
However, the time from the app changing what it's drawing in response to the key press to the change actually showing up on screen is >150ms (approx 8-9 frames worth). This is measured with a camera recording of the screen and keyboard at 60fps. (Note: It is true I do not have a way to measure how long it takes from key press to the app getting it. I am assuming it is <<150ms).
Therefore, the questions:
Where are graphics buffered between a call to swap buffers and actually showing up on screen? Why the delay? It sure looks like the app is drawing many frames ahead of the screen at all times.
What can an OpenGL application do to cause an immediate draw to screen? (ie: no buffering, just block until draw is complete; I don't need high throughput, I do need low latency)
What can an application do to make the above immediate draw happen as fast as possible?
How can an application know what is actually on screen right now? (Or, how long/how many frames the current buffering delay is?)

Where are graphics buffered between a call to swap buffers and actually showing up on screen? Why the delay? It sure looks like the app is drawing many frames ahead of the screen at all times.
The drawing commands are queued and executed into the back buffer; the swap waits until the next vsync if you have set a swap interval, and at that vsync the back buffer is displayed.
What can an OpenGL application do to cause an immediate draw to screen? (ie: no buffering, just block until draw is complete; I don't need high throughput, I do need low latency)
Using glFinish will ensure everything is drawn before that call returns, but you have no control over when it actually gets to the screen other than the swap interval setting.
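For example, a minimal sketch of a latency-oriented loop along these lines (assuming an EGL setup; display, surface and draw() are the application's own):

/* Keep the driver from queueing frames ahead: finish each frame on the GPU
 * before presenting it. This trades throughput for latency. */
eglSwapInterval(display, 1);   /* present on vsync; 0 would skip the wait but may tear */

while (running)
{
    draw();
    glFinish();                        /* block until the GPU has completed the frame */
    eglSwapBuffers(display, surface);  /* then queue the swap for the next vsync */
}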
What can an application do to make the above immediate draw happen as fast as possible?
How can an application know what is actually on screen right now? (Or, how long/how many frames the current buffering delay is?)
Generally you can use a sync object (something like http://www.khronos.org/registry/egl/extensions/NV/EGL_NV_sync.txt) to find this out.
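As one concrete option, a minimal sketch using the core GL sync-object API (glFenceSync/glClientWaitSync, GL 3.2+/ARB_sync) instead of that NV EGL extension; it tells you when the GPU has actually finished the commands for a given frame:

/* After issuing the frame's drawing commands: */
GLsync frame_fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
eglSwapBuffers(display, surface);

/* Later: poll (timeout 0 ns) or wait for the GPU to reach the fence. */
GLenum status = glClientWaitSync(frame_fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED)
{
    /* The GPU has finished rendering that frame; any remaining delay is in
     * the display system (compositor, vsync, scan-out). */
}
glDeleteSync(frame_fence);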
Are you sure the method of measuring latency is correct? What if the key input actually has significant delay on your PC? Have you measured the latency from the event being received in your code to the point just after swapbuffers?

You must understand that the GPU has its own dedicated memory on board. At the most basic level this memory is used to hold the encoded pixels you see on your screen (it is also used for graphics hardware acceleration and other things, but that is unimportant now). Because it takes time to load a frame from your main RAM into GPU RAM, you can get a flickering effect: for a brief moment you see the background instead of what is supposed to be displayed. Although this copying happens extremely fast, it is noticeable to the human eye and quite annoying.
To counter this, we use a technique called double buffering. Basically, double buffering works by having an additional frame buffer in your GPU RAM (there can be one or more, depending on the graphics library and the GPU, but two is enough) and using a pointer to indicate which frame should be displayed. While the first frame is being displayed, you are already creating the next one in your code, using some draw() function on an image structure in main RAM; this image is then copied to GPU RAM (while the previous frame is still being displayed), and when you call eglSwapBuffers() the pointer switches to your back buffer (I'm guessing this from your question; I'm not familiar with OpenGL, but the idea is quite universal). As you can imagine, this pointer switch requires very little time. I hope you see now that writing an image directly to the screen actually causes much more delay (and annoying flickering).
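A toy sketch of that pointer swap, with made-up buffer names and sizes, just to show why the swap itself costs essentially nothing:

#include <stdint.h>

#define WIDTH  640
#define HEIGHT 480

static uint32_t framebuffer_a[WIDTH * HEIGHT];
static uint32_t framebuffer_b[WIDTH * HEIGHT];
static uint32_t *front = framebuffer_a;   /* being scanned out to the display */
static uint32_t *back  = framebuffer_b;   /* being drawn into by the application */

static void swap_buffers(void)
{
    uint32_t *tmp = front;   /* the "swap" is just exchanging two pointers, */
    front = back;            /* which is why it takes almost no time        */
    back  = tmp;
}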
Also, ~16.6 milliseconds does not sound like that much. I think most of that time is spent creating and setting up the required data structures rather than in the drawing computations themselves (you could test this by drawing just the background).
Lastly, I'd like to add that I/O is usually pretty slow (the slowest part of most programs), and 150 ms is not that long at all (still twice as fast as the blink of an eye).

Ah, yes, you've discovered one of the peculiarities of the interaction between OpenGL and display systems that only a few people actually understand (and to be frank, I didn't fully understand it either until about 2 years ago). So here is what is happening:
SwapBuffers does two things:
it queues a (private) command into the command queue that is also used for OpenGL drawing calls, which essentially flags a buffer swap to the graphics system
it makes OpenGL flush all queued drawing commands (to the back buffer)
Apart from that, SwapBuffers does nothing by itself. But those two things have interesting consequences. One is that SwapBuffers will return immediately. But as soon as the "the back buffer is to be swapped" flag is set (by the queued command), the back buffer becomes locked against any operation that would alter its contents. So as long as no call is made that would alter the contents of the back buffer, nothing will block. But commands that would alter the contents of the back buffer will stall the OpenGL command queue until the back buffer has been swapped and released for further commands.
Now the length of the OpenGL command queue is an abstract thing. But the usual behavior is that one of the OpenGL drawing commands will block, waiting for the queue to drain once the buffer swap has happened.
I suggest you spray your program with logging statements, using some high-performance, high-resolution timer as the clock source, to see where exactly the delay happens.
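A minimal sketch of that kind of instrumentation, assuming a POSIX clock_gettime(CLOCK_MONOTONIC) clock; draw(), display and surface stand for the application's own objects:

#include <stdio.h>
#include <time.h>

/* High-resolution monotonic timestamp in microseconds. */
static long long now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

#define LOG_POINT(tag) fprintf(stderr, "%-16s %lld us\n", (tag), now_us())

/* Instrumented render-loop body; draw(), display and surface are assumed to
 * come from the application's existing EGL/GL setup. */
void render_one_frame(void)
{
    LOG_POINT("frame start");
    draw();
    LOG_POINT("after draw");
    eglSwapBuffers(display, surface);
    LOG_POINT("after swap");
}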

Latency will be determined both by the driver, and by the display itself. Even if you wrote directly to the hardware, you would be limited by the latter.
The application can only do so much (i.e. draw fast, process inputs as closely as possible to or during drawing, perhaps even modify the buffer at the time of flip) to mitigate this. After that you're at the mercy of other engineers, both hardware and software.
And you can't tell what the latency is without external monitoring, as you've done.
Also, don't assume your input (keyboard to app) is low latency either!

Related

Animation using a single frame buffer - how is it possible?

I'm using the device STM32F746. I know it has a hardware 2D Graphics accelerator.
I know how to do animation using double buffering.
But according to this
https://www.touchgfx.com/news/high-quality-graphics-using-only-internal-memory/
They are claiming that they use only one framebuffer for animation.
How is that possible, and what techniques are used on the STM32F746?
It is double buffering. One buffer is stored in MCU memory, where the next frame is prepared and composed. The other buffer is in the LCD driver's memory; the data is transferred there from the MCU when it is ready and is displayed on the LCD at the required refresh rate.
That's why that library requires so much of MCU memory.
Despite being accepted, that answer is wrong.
In fact those controllers have their own LCD-driving circuit and thus do not require an external driver. They use part of the internal memory as the screen buffer and constantly refresh the image on the LCD from it.
In the library, only that part of memory is used. The write operations are synchronized with the LCD refresh, so flickering is avoided.
So only one buffer is used: the same buffer holds the image being output and is used to compose the next frame.
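A rough sketch of that single-buffer synchronization idea; lcd_current_scanline() and draw_region() are hypothetical placeholders for whatever the LTDC/driver on the board exposes, so this only illustrates the principle, not working STM32 code:

/* Single framebuffer: compose the next frame in the same buffer the LCD
 * controller is scanning out, but only touch a region that scan-out is not
 * currently reading, so a half-drawn region is never displayed. */
void update_framebuffer_region(int y_top, int y_bottom)
{
    /* Wait until the read position has left the region we want to redraw
     * (a line/vblank interrupt would replace this busy-wait in real code). */
    while (lcd_current_scanline() >= y_top && lcd_current_scanline() <= y_bottom)
        ;

    /* Must complete before scan-out wraps around to this region again. */
    draw_region(y_top, y_bottom);
}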

Asynchronously exit loop via interupt or similar (MSP430/C)

I have run into a problem that I am rather stumped on, because every solution I can think of has an issue that keeps it from fully working. I am working on a game on the MSP430FF529 which, when first powered up, draws two images to the screen in an infinite loop using cycle delays. I would like it so that when the user presses the start button (a simple high-edge trigger on a port), the program immediately stops drawing those screens, no matter what part of the process it's in, and starts executing the rest of the code that runs the game.
I could put the function that puts the images on screen in a do-while loop, but then it wouldn't be asynchronous, since the image currently being drawn would have to finish before moving on.
I'd use the break statement, but I don't think that works from ISRs, only when it's directly in the loop.
I could put the entire rest of the program in the ISR I use for the start-button press, so that the screen drawing is essentially never returned to, but that's really messy, poor coding, and would cause a lot of problems later.
Essentially, I want to make it so that when the button is pressed the program will immediately jump to the part of the program that is the actual game and forget about drawing those images on the screen. Is it possible to somehow have an ISR that doesn't return to what was currently happening after the code in the routine is executed? Basically, once the program starts moving forward (the start button is pressed) I don't want to come back to the function that draws the images unless I explicitly call it again.
The only thing I can think of is the goto statement, which I feel would not actually be too bad in this particular instance, though I want to avoid using it for fear of it becoming a habit, since it is a poor solution in most cases. However, that might not even work, because I have a feeling that using goto in an ISR would really mess up the stack.
Any ideas? Any suggestions are appreciated.
What you want is basically a "context switch". You should modify the saved program counter and stack pointer that will be restored when you return from the ISR, and then do the normal ISR return so that the interrupt mask is cleared, the stack is restored, and so on. As noted in the comments to your question, this likely requires some manual assembly code.
I'm not familiar with the MSP430, but on other architectures this state lives in a structure of saved registers on the kernel stack or interrupt-context stack (or maybe just "the stack" on some microcontrollers), or in some special registers; it is saved automatically by the CPU when it jumps to your ISR. So you have to change those saved values where they are stored.
If you relax your requirement from "immediately" to "so fast that the user doesn't notice", you can put an if (button_pressed) check into some loop in the image-drawing routine.
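A minimal sketch of that flag-based approach (TI/IAR-style ISR syntax; the button pin, vector and draw_image_1()/draw_image_2() are assumptions, adjust for your board and toolchain):

#include <msp430.h>
#include <stdbool.h>

extern void draw_image_1(void);   /* the application's existing drawing routines */
extern void draw_image_2(void);

static volatile bool start_pressed = false;   /* set by the ISR, polled by the loop */

/* Button ISR: just record the press and return. Assumes P1.1 interrupts were
 * enabled during init (P1IE |= BIT1) and GIE is set. */
#pragma vector = PORT1_VECTOR
__interrupt void port1_isr(void)
{
    P1IFG &= ~BIT1;               /* clear the pending interrupt flag */
    start_pressed = true;
}

void draw_intro_images(void)
{
    while (!start_pressed)
    {
        draw_image_1();           /* check the flag between (or inside) draws */
        if (start_pressed)
            break;
        draw_image_2();
    }
    /* fall through to the game code */
}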
If you really want to abort the image drawing immediately, you can do so by resetting the MCU (for example, by writing a wrong password to the WDT). In the application initialization code, check if one of the causes of the reset was your own software:
bool start_button = false;
for (;;) {
    int cause = SYSRSTIV;
    if (cause == SYSRSTIV_WDTKEY)
        start_button = true;
    if (cause == SYSRSTIV_NONE)
        break;
    // you might handle debugging of other reset causes here ...
}
if (!start_button)
    draw_images();
else
    actual_game();
(This assumes that your code never accidentally writes a wrong WDT password, but even if that happens, you're only skipping the intro images.)
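For completeness, a sketch of the button ISR that could trigger that reset; the port/pin and vector are assumptions, but any write to WDTCTL without the WDTPW password forces a PUC, which SYSRSTIV then reports as a key violation:

#include <msp430.h>

#pragma vector = PORT1_VECTOR
__interrupt void start_button_isr(void)
{
    /* Deliberately write WDTCTL without the 0x5A password byte: the MCU
     * resets immediately, and the startup check above sees SYSRSTIV_WDTKEY
     * and skips the intro images. */
    WDTCTL = 0;
}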

Fastest way to display a screen buffer captured from other PC

My problem is the following:
I have a pointer to a framebuffer which is constantly changed by some thread.
I want to display this framebuffer via the OpenGL API. My trivial choice is to use glTexImage2D and upload the framebuffer again and again, every frame. This re-uploading is necessary because the framebuffer is changed outside of the OpenGL API. I think there could be some methods or tricks to speed this up, such as:
By finding out the changes inside the framebuffer (Is it even possible?)
Some method for fast re-loading of image
Could OpenGL directly use the framebuffer pointer? (So less framebuffer copying)
I'm not sure whether the above approaches are valid or not. I hope you can give some advice.
By finding out the changes inside the framebuffer (Is it even possible?)
That would reduce the required bandwidth, because it's effectively video compression. However, the CPU load is notably higher, and it's much slower than just DMA-copying the data.
Some method for fast re-loading of image
Use glTexSubImage2D instead of glTexImage2D (note the …Sub…). glTexImage2D goes through a full texture initialization each time it's called, which is rather costly.
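For the plain (non-PBO) case, a minimal sketch of this; tex, WIDTH, HEIGHT, framebuffer_ptr and draw_textured_quad() stand in for the application's own objects:

/* One-time setup: allocate the texture storage once. */
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, WIDTH, HEIGHT, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);

/* Every frame: only update the existing storage, never reallocate. */
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, WIDTH, HEIGHT,
                GL_RGBA, GL_UNSIGNED_BYTE, framebuffer_ptr);
draw_textured_quad();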
If your program does additional things with OpenGL rather than just displaying the image, you can speed things up further by reducing the time the program spends waiting for things to complete, using Pixel Buffer Objects. The essential gist is:
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboID);
void *pbuf = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
thread image_load_thread = start_thread_copy_image_to_pbo(image, pbuf);
do_other_things_with_opengl();
join_thread(image_load_thread);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboID);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindTexture(…);
glTexSubImage2D(…, NULL); /* will read from the bound PBO at offset 0 */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
draw_textured_quad();
Instead of creating and joining a thread each time, you may as well use a thread pool and condition variables for inter-thread synchronization.
Could OpenGL directly use the framebuffer pointer? (So less framebuffer copying)
Stuff almost always has to be copied around. Don't worry about copies if they are necessary anyway. The Pixel Buffer Objects outlined above may or may not be backed by a shadow copy in system RAM. Essentially, glMapBuffer maps the buffer into your process' address space so you can decode directly into it. But there is no guarantee that a further copy is avoided.

Using threads, how should I deal with something which ideally should happen in sequential order?

I have an image generator which would benefit from running in threads. I am intending to use POSIX threads, and have written some mock up code based on https://computing.llnl.gov/tutorials/pthreads/#ConVarSignal to test things out.
In the intended program, when the GUI is in use, I want the generated lines to appear from the top to the bottom one by one (the image generation can be very slow).
It should also be noted, the data generated in the threads is not the actual image data. The thread data is read and transformed into RGB data and placed into the actual image buffer. And within the GUI, the way the thread generated data is translated to RGB data can be changed during image generation without stopping image generation.
However, there is no guarantee from the thread scheduler that the threads will run in the order I want, which unfortunately makes the transformation of the thread-generated data trickier, implying the undesirable solution of keeping an array of bool values to indicate which lines are done.
How should I deal with this?
Currently I have a watcher thread to report when the image is complete (which really should drive a progress bar, but I've not got that far yet; it uses pthread_cond_wait instead), and several render threads doing while(next_line());
next_line() locks a mutex, reads the value of img_next_line, increments it and unlocks the mutex. It then renders the line, locks a second mutex to update lines_done, checks it against height, signals if complete, unlocks, and returns 0 if complete or 1 if not.
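As a rough reconstruction of what that next_line() might look like (render_line() and the globals are assumed names based on the description above, not the asker's actual code):

#include <pthread.h>

extern void render_line(int line);       /* the application's slow per-line renderer */

static pthread_mutex_t next_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t done_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cond  = PTHREAD_COND_INITIALIZER;

static int img_next_line = 0;            /* next row to hand out */
static int lines_done    = 0;            /* rows finished so far */
static int height        = 480;          /* image height, set by the application */

/* Returns 1 while there is more work to do, 0 once the image is complete. */
int next_line(void)
{
    pthread_mutex_lock(&next_mutex);
    int line = img_next_line++;
    pthread_mutex_unlock(&next_mutex);

    if (line >= height)
        return 0;                        /* nothing left to hand out */

    render_line(line);

    pthread_mutex_lock(&done_mutex);
    int complete = (++lines_done >= height);
    if (complete)
        pthread_cond_signal(&done_cond); /* wake the watcher thread */
    pthread_mutex_unlock(&done_mutex);

    return complete ? 0 : 1;
}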
Given that threads may well be executing in parallel on different cores, it's pretty much inevitable that the results will arrive out of order. I think your approach of tracking what's complete with a set of flags is quite reasonable.
It's possible that the overall effect might be nicer if you used threads at a different granularity. Say, give each thread 20 lines to work on rather than one. Then on completion you'd have bigger blocks available to draw, and maybe drawing stripes would look OK?
Just accept that the rows will be done in a non-deterministic order; it sounds like that is happening because they take different lengths of time to render, in which case forcing a completion order will waste CPU time.
This may sound silly, but as a user I don't want to see the image rendered slowly, line by line, from top to bottom. It makes a slow process seem even slower, because the user has already completely predicted what will happen next. Better to just render lines when they are ready, even if they are scattered all over the place (either as single lines or, better yet, as blocks, as others have suggested). It makes the result look more random and therefore more captivating and less boring to a user like me.

printf slows down my program

I have a small C program to calculate hashes (for hash tables). The code looks quite clean I hope, but there's something unrelated to it that's bugging me.
I can easily generate about one million hashes in about 0.2-0.3 seconds (benchmarked with /usr/bin/time). However, when I'm printf()ing them in the for loop, the program slows down to about 5 seconds.
Why is this?
How can I make it faster? mmap()ing stdout maybe?
How is the C standard library designed with regard to this, and how could it be improved?
How could the kernel support this better? How would it need to be modified to make the throughput on local "files" (sockets, pipes, etc.) REALLY fast?
I'm looking forward to interesting and detailed replies. Thanks.
PS: this is for a compiler-construction toolset, so don't be shy about getting into details. While that has nothing to do with the problem itself, I just wanted to point out that details interest me.
Addendum
I'm looking for more programmatic approaches and explanations. Indeed, piping does the job, but I don't have control over what the "user" does.
Of course, I'm running a test right now, which wouldn't be done by "normal users". BUT that doesn't change the fact that a simple printf() slows down a process, which is the problem I'm trying to find an optimal programmatic solution for.
Addendum - Astonishing results
The reference time is for plain printf() calls to a TTY, which takes about 4 min 20 s.
Testing on a /dev/pts (e.g. Konsole) speeds the output up to about 5 seconds.
Using setbuffer() in my testing code with a size of 16384 takes about the same amount of time, and almost the same for 8192: about 6 seconds.
setbuffer() apparently has no effect when used: it takes the same amount of time (about 4 min on a TTY, about 5 seconds on a PTS).
The astonishing thing is that if I start the test on TTY1 and then switch to another TTY, it takes just the same as on a PTS: about 5 seconds.
Conclusion: the kernel does something which has to do with accessibility and user friendliness. HUH!
Normally, it should be equally slow regardless of whether you stare at the TTY while it's active or switch over to another TTY.
Lesson: when running output-intensive programs, switch to another TTY!
Unbuffered output is very slow.
By default stdout is fully buffered; however, when attached to a terminal, stdout is either unbuffered or line-buffered.
Try to switch on buffering for stdout using setvbuf(), like this (the buffer must stay valid for as long as stdout is used, so make it static or global):
static char buffer[8192];
setvbuf(stdout, buffer, _IOFBF, sizeof(buffer));
You could store your strings in a buffer and output them to a file (or console) at the end or periodically, when your buffer is full.
If outputting to a console, scrolling is usually a killer.
If you are printf()ing to the console, it's usually extremely slow. I'm not sure why, but I believe it doesn't return until the console has graphically shown the output string. Additionally, you can't mmap() stdout.
Writing to a file should be much faster (but still orders of magnitude slower than computing a hash; all I/O is slow).
You can try redirecting the output in the shell from the console to a file. Using this, logs gigabytes in size can be created in just seconds.
I/O is always slow in comparison to straight computation. The system has to wait for more components to be available in order to use them. It then has to wait for the response before it can carry on. Conversely, if it's simply computing, then it's only really moving data between the RAM and CPU registers.
I've not tested this, but it may be quicker to append your hashes to a string and then just print the string at the end. Although if you're using C, not C++, this may prove to be a pain!
Questions 3 and 4 are beyond me, I'm afraid.
As I/O is always much slower than CPU computation, you might first store all values in the fastest storage available: use RAM if you have enough, and files if not, although those are much slower than RAM.
Printing the values can then be done afterwards, or in parallel by another thread, so the calculating thread(s) need not wait for printf to return.
I discovered long ago using this technique something that should have been obvious.
Not only is I/O slow, especially to the console, but formatting decimal numbers is not fast either. If you can write the numbers in binary into big buffers, and write those to a file, you'll find it's a lot faster.
Besides, who's going to read them? There's no point printing them all in a human-readable format if nobody needs to read all of them.
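If the raw values are all that's needed, a minimal sketch of that binary-dump idea (the hash type, count and file name are assumptions for illustration):

#include <stdio.h>
#include <stdint.h>

#define N_HASHES 1000000

int main(void)
{
    static uint32_t hashes[N_HASHES];

    /* Stand-in for the real hash computation. */
    for (int i = 0; i < N_HASHES; i++)
        hashes[i] = (uint32_t)i * 2654435761u;

    /* One large binary write instead of a million formatted printf() calls. */
    FILE *f = fopen("hashes.bin", "wb");
    if (!f)
        return 1;
    fwrite(hashes, sizeof hashes[0], N_HASHES, f);
    fclose(f);
    return 0;
}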
Why not create the strings on demand rather than at the point of construction? There is no point in outputting 40 screens of data in one second; how could you possibly read it? Why not create the output as required, display just the last screenful, and produce more as the user scrolls?
Why not use sprintf to print to a string, build a concatenated string of all the results in memory, and print it at the end?
By switching to sprintf you can clearly see how much time is spent in the format conversion and how much is spent displaying the result to the console, and change the code appropriately.
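A minimal sketch of that sprintf-then-print-once approach (buffer sizing and the hash source are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>

#define N_HASHES 1000000

int main(void)
{
    /* Each "0x%08x\n" line is 11 characters; +1 for the final NUL. */
    char *out = malloc((size_t)N_HASHES * 11 + 1);
    if (!out)
        return 1;

    size_t len = 0;
    for (uint32_t i = 0; i < N_HASHES; i++) {
        uint32_t hash = i * 2654435761u;              /* stand-in for the real hash */
        len += (size_t)sprintf(out + len, "0x%08" PRIx32 "\n", hash);
    }

    fwrite(out, 1, len, stdout);   /* one big write instead of a million small ones */
    free(out);
    return 0;
}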
Console output is by definition slow; creating a hash only manipulates a few bytes of memory. Console output has to go through many layers of the operating system, which will have code to handle thread/process locking etc., until it eventually reaches the display driver, which may be a 9600-baud device, or a large bitmap display where simple operations like scrolling the screen can involve moving megabytes of memory.
I guess the terminal type is using some buffered output operations, so when you do a printf the output does not appear within microseconds; it is stored in the buffer memory of the terminal subsystem.
This could also be impacted by other things causing a slowdown; perhaps there's a memory-intensive operation running other than your program. In short, there are far too many things that could all be happening at the same time: paging, swapping, heavy I/O by another process, the configuration of memory in use, maybe a memory upgrade, and so on.
It might be better to concatenate the strings until a certain limit is reached and then write them all out at once, or even to use pthreads to carry out the desired processing.
Edited:
As for 2 and 3, they are beyond me. For 4, I am not familiar with Sun, but I do know of and have messed with Solaris. There may be a kernel option to use a virtual tty; I'll admit it's been a while since I messed with kernel configs and recompiled one, so my memory of this may not be great. Have a root around in the options to see.
user@host:/usr/src/linux $ make menuconfig   (or xconfig if running under X)
This will bring up the kernel configuration menu; have a dig around in the video settings section under the devices sub-tree.
Edited:
But there's also the possibility of a tweak you put into the kernel by adding a file to the proc filesystem (if such a thing exists), or possibly a switch passed to the kernel, something like fastio (this is imaginative and does not imply it actually exists).
Hope this helps,
Best regards,
Tom.
