Fastest way to display a screen buffer captured from another PC - C

My problem is the following:
I have a pointer that stores a framebuffer which is constantly changed by some thread.
I want to display this framebuffer via the OpenGL API. My trivial choice is to use glTexImage2D and reload the framebuffer every time. This reloading is necessary because the framebuffer is changed outside of the OpenGL API. I think there could be some methods or tricks to speed this up, such as:
By finding out the changes inside the framebuffer (Is it even possible?)
Some method for fast re-loading of image
Could OpenGL directly use the pointer to the framebuffer? (So less framebuffer copying)
I'm not sure whether the above approaches are valid. I'd appreciate any advice.

By finding out the changes inside the framebuffer (Is it even possible?)
That would reduce the required bandwidth, because it's effectively video compression. However, the CPU load is notably higher and it's much slower than just DMA-copying the data.
Some method for fast re-loading of image
Use glTexSubImage2D instead of glTexImage2D (note the …Sub…). glTexImage2D goes through a full texture initialization each time it's called, which is rather costly.
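For illustration, the per-frame upload could look like the following sketch (width, height, the GL_RGBA8 format and the framebuffer pointer are assumptions about the captured image):
/* one-time setup: allocate the texture storage once */
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL); /* NULL: storage only, no data yet */

/* every frame: only re-upload the pixels, no reallocation */
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, framebuffer);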
If your program does additional things with OpenGL rather than just displaying the image, you can speed things up further by reducing the time the program waits for things to complete, using Pixel Buffer Objects. The essential gist is:
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboID);
void *pbuf = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
thread image_load_thread = start_thread_copy_image_to_pbo(image, pbuf);
do_other_things_with_opengl();
join_thread(image_load_thread);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboID);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindTexture(…);
glTexSubImage2D(…, NULL); /* will read from PBO */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
draw_textured_quad();
Instead of creating and joining a thread every frame, you may as well use a thread pool and condition variables for inter-thread synchronization, as sketched below.
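A compact sketch of that variant, assuming pthreads (none of this is from the original answer, it only illustrates the hand-off): one persistent loader thread waits on a condition variable, copies the captured frame into the currently mapped PBO pointer, and signals back when it is done.
#include <pthread.h>
#include <string.h>

struct copy_job {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    const void     *src;     /* captured frame */
    void           *dst;     /* mapped PBO pointer */
    size_t          size;
    int             pending; /* 1 = copy requested, 0 = copy finished */
};

static void *loader_thread(void *arg) {
    struct copy_job *job = arg;
    for (;;) {
        pthread_mutex_lock(&job->lock);
        while (!job->pending)
            pthread_cond_wait(&job->cond, &job->lock);
        memcpy(job->dst, job->src, job->size);
        job->pending = 0;
        pthread_cond_signal(&job->cond);   /* wake the render thread */
        pthread_mutex_unlock(&job->lock);
    }
    return NULL;
}
The render thread sets src/dst/size and pending = 1 and signals right after mapping the PBO, then (where the join used to be) locks again and waits until pending is back to 0.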
Could OpenGL directly use the pointer to the framebuffer? (So less framebuffer copying)
Stuff almost always has to be copied around. Don't worry about copies if they are necessary anyway. The Pixel Buffer Objects I outlined above may or may not be backed by system RAM. Essentially you can glMapBuffer into your process' address space and decode directly into the buffer you're given. But that's no guarantee that a further copy is avoided.

Related

Lock-free read-only shared memory: don't care about memory order, only ensure visibility?

So I'm emulating a small microprocessor in C that has an internal flash storage represented as an array of chars. The emulation is entirely single-threaded and operates on this flash storage as well as on some register variables.
What I want to do is have a second "read" thread that periodically (roughly every 10 ms, about the monitor refresh rate) polls the data from that array and displays it in some sort of window. (The flash storage is only 32 KiB in size, so it can be displayed as a 512x512 black-and-white image.)
The thing is that the main emulation thread should incur an absolutely minimal performance overhead for this (or optimally not care about the second thread at all). A rw-mutex is out of the question, since that would tank the performance of my emulator. Best case, as I said, the emulator stays completely oblivious of the existence of the read thread.
I don't care about memory order either, since it doesn't matter if some changes become visible to the read thread earlier than others, as long as they all become visible at some point.
Is this possible at all in C/C++, or at least through some sort of memory_barrier() function that I could call in my emulation thread about every 1000th clock cycle, which would then assure visibility to my read thread?
Is it enough to just use volatile on the flash memory? Would this affect performance in any significant way?
I don’t want to stall the complete emulation thread just to copy over the flash array to some different place.
Pseudo code:
int main() {
    char flash[32 * 1024];
    flash_binary(flash, "program.bin");

    // periodically displays the content of flash
    pthread_t *read_thread = create_read_thread(flash);

    // does something with flash, highly performance critical
    emulate_cpu(flash);

    kill_read_thread(read_thread);
}
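A sketch of the fence idea described above, using C11 atomics. This is only an illustration, not a full answer: the plain char accesses to flash[] from two threads formally remain a data race under the C memory model, even though release/acquire fences make the stores visible on common hardware. cycle_count and render_flash_as_image() are hypothetical names.
#include <stdatomic.h>

/* emulation thread, inside the main loop */
if ((cycle_count % 1000) == 0)
    atomic_thread_fence(memory_order_release);   /* publish recent writes to flash[] */

/* read thread, roughly every 10 ms */
atomic_thread_fence(memory_order_acquire);       /* pick up writes published so far */
render_flash_as_image(flash);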

Animation using a single framebuffer: how is it possible?

I'm using the STM32F746 device. I know it has a hardware 2D graphics accelerator.
I know how to do animation using double buffering.
But according to this
https://www.touchgfx.com/news/high-quality-graphics-using-only-internal-memory/
They are claiming that they use only one framebuffer for animation.
How is that possible, and what techniques are used on the STM32F746?
It is double buffering. One buffer is stored in the MCU memory, where the next frame is prepared and composed. The other buffer is in the LCD driver's memory; data is transferred there from the MCU when it is ready and displayed on the LCD at the required refresh rate.
That's why that library requires so much MCU memory.
Although that answer was accepted, it is wrong.
In fact, those controllers have their own LCD-driving circuit and thus do not require an external driver. They use part of the internal memory as the screen buffer and constantly refresh the image on the LCD.
In the library, only that one part of memory is used. The write operations are synchronized with the LCD refresh, so flickering is avoided.
So only one buffer is used: the same buffer contains the output image and is used to compose the next frame.
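A hypothetical sketch of that technique (these are not real TouchGFX or HAL calls; they only illustrate synchronizing writes with the scan-out position of the single buffer):
/* Redraw a region of the one visible framebuffer without tearing:
   wait until the display controller's read position is outside the
   lines that are about to be modified, then draw into the same buffer. */
void update_frame(void)
{
    int top    = dirty_region_top();     /* placeholder */
    int bottom = dirty_region_bottom();  /* placeholder */

    /* busy-wait (or use the LTDC line interrupt) while scan-out
       is still inside the region we want to redraw */
    while (current_scanline() >= top && current_scanline() <= bottom)
        ;

    draw_dirty_region();  /* compose the next frame in place */
}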

When does OpenCL data transfer occur?

I've seen a few questions here on Stack Overflow dealing with the same issues, but no definite answer. I thought I'd ask again, with a bunch of questions of my own. All relate to the subject matter at hand.
So, do we know when the data transfer from the host to the OpenCL device occurs? Can you tell me the exact memory transfer behaviour of the functions below (that is, what data is transferred or created, if any, when these functions are invoked)?
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
The first two don't even produce events, so we can't time them, but surely some data transferring is happening here.
Is there a way to transfer data to a device without first setting it as a kernel arg?
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device. Why would that not be desirable, since, that way, we could avoid further data transfer commands (and surely the driver implements this in the most efficient way)?
Does transferred data (say, as part of a kernel arg) stay on the device for further manipulation after a kernel returns? If not, is there a way to do just that?
Buffer copies are tied to command queues. The easiest way to synchronize a command queue with the host is clFinish().
clCreateBuffer()
clEnqueueWriteBuffer() <-------- you can get event data from this
(set the blocking parameter to false to queue everything quickly)
(set blocking to true if you want the write synchronized here)
clSetKernelArg()
clEnqueueWriteBuffer() <----- it could be here too
clEnqueueNDRangeKernel()
clEnqueueWriteBuffer() <----- or here (to quickly re-set an array?)
clFinish() <--------- this ensures all queued commands are executed before it returns
Now you can query that event's data to check when the command started and when it ended.
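For example, with CL_QUEUE_PROFILING_ENABLE set when the queue was created, the write's timestamps can be read back like this (queue, buf, size and host_ptr are placeholders):
cl_event ev;
cl_ulong t_start = 0, t_end = 0;
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_ptr, 0, NULL, &ev);
clFinish(queue);  /* make sure the command has completed */
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);
/* (t_end - t_start) is the device-side duration in nanoseconds */
clReleaseEvent(ev);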
To let a buffer stay on the device, you should create it on the device first and then not migrate it to another device. Using only the CL_MEM_READ_WRITE flag in clCreateBuffer() is enough to make it a real device-side buffer until you release it.
CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR use host memory, which the device maps for its cores. This is faster for streaming data in and out because it avoids extra data movement on the host side. If you always need device memory, such as fast GDDR5 or HBM, then you should not use these flags.
Copy to the device once, then use it as much as you want, provided the device has its own memory, of course. For example, Intel HD Graphics 400 doesn't have its own memory and shares RAM, so it is much faster to use the CL_MEM_..._HOST_PTR flags, especially USE_HOST_PTR.
To check whether a device shares RAM with the CPU, query the CL_DEVICE_HOST_UNIFIED_MEMORY property of the device.
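For instance (device being the cl_device_id you are targeting):
cl_bool unified = CL_FALSE;
clGetDeviceInfo(device, CL_DEVICE_HOST_UNIFIED_MEMORY,
                sizeof(unified), &unified, NULL);
/* CL_TRUE: the device shares RAM with the CPU, prefer the
   CL_MEM_..._HOST_PTR flags; CL_FALSE: it has dedicated memory */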
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device
Even without map/unmap commands prior to kernel execution, my computer behaves the same, but I'm using map/unmap just to be safe, and it doesn't cost too many cycles.
Edit: if you want to make sure a command doesn't start before you want it to, you can add a user event to the event-list input parameter of the buffer-write command. Then you can fire the user event to let the writing start, because commands wait for all events in the list to be fired and completed before continuing (if any are specified in the event-list parameter).
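A minimal sketch of that gating pattern, assuming context, queue, buf, size and host_ptr already exist:
cl_event gate = clCreateUserEvent(context, NULL);
cl_event write_done;
clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, host_ptr,
                     1, &gate, &write_done);   /* waits for 'gate' to complete */
/* ... fill host_ptr with the data to upload ... */
clSetUserEventStatus(gate, CL_COMPLETE);       /* now the write may actually start */
clWaitForEvents(1, &write_done);
clReleaseEvent(gate);
clReleaseEvent(write_done);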

Using multiple command queues in OpenCL and mapped buffers. Do I get a conflict?

I use OpenCL on the Snapdragon 800 platform. Since the GPU memory is shared, I can map a memory buffer into main memory and write to it directly. This avoids a memory copy between GPU and RAM.
I wanted to know if I could write into the mapped memory with my CPU and execute other OpenCL programs in other command queues at the same time.
If you want a little bit background, continue reading:
I am using a webcam to capture images, and the webcam library has a function like getImage(). This function blocks execution until there is a new frame. For 30 fps, that's 33 ms in the worst case. During this time, my buffer is mapped, because OpenCL gives me a pointer and I have to forward this pointer to the webcam library. When the call is done, I can unmap the OpenCL buffer.
I have advanced image processing algorithms implemented on the GPU, and NOT ALL OF THEM USE THE MAPPED BUFFER.
While a buffer is mapped, it is not valid to execute any commands that use that buffer. Executing commands that use different buffers is fine.
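A sketch of the allowed pattern, with placeholder names (bufA is the camera buffer being mapped on queue0; bufB is the only buffer touched by the kernel running on queue1):
cl_int err;
size_t gws = 1024;  /* placeholder global work size */
void *ptr = clEnqueueMapBuffer(queue0, bufA, CL_TRUE, CL_MAP_WRITE,
                               0, size, 0, NULL, NULL, &err);

/* bufA is mapped now, so it must not appear in any enqueued command,
   but kernels using other buffers can keep running on another queue */
clSetKernelArg(other_kernel, 0, sizeof(cl_mem), &bufB);
clEnqueueNDRangeKernel(queue1, other_kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

webcam_get_image(ptr);  /* blocking camera call writes directly into bufA */
clEnqueueUnmapMemObject(queue0, bufA, ptr, 0, NULL, NULL);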
In OpenCL 2.0, the new shared-virtual memory feature allows you to concurrently access a buffer from the host and device.

How long does it take for OpenGL to actually update the screen?

I have a simple OpenGL test app in C which draws different things in response to key input. (Mesa 8.0.4, tried with Mesa-EGL and with GLFW, Ubuntu 12.04 LTS on a PC with an NVIDIA GTX 650.) The draws are quite simple/fast (rotating-triangle type of stuff). My test code does not deliberately limit the framerate in any way; it just looks like this:
while (true)
{
    draw();
    swap_buffers();
}
I have timed this very carefully, and I find that the time from one eglSwapBuffers() (or glfwSwapBuffers()) call to the next is ~16.6 milliseconds. The time from just after a call to eglSwapBuffers() to just before the next call is only a little less than that, even though what is drawn is very simple. The swap-buffers call itself takes well under 1 ms.
However, the time from the app changing what it draws in response to the key press to the change actually showing up on screen is >150 ms (approx. 8-9 frames' worth). This is measured with a camera recording the screen and keyboard at 60 fps. (Note: it is true that I have no way to measure how long it takes from the key press to the app receiving it. I am assuming it is <<150 ms.)
Therefore, the questions:
Where are graphics buffered between a call to swap buffers and actually showing up on screen? Why the delay? It sure looks like the app is drawing many frames ahead of the screen at all times.
What can an OpenGL application do to cause an immediate draw to screen? (ie: no buffering, just block until draw is complete; I don't need high throughput, I do need low latency)
What can an application do to make the above immediate draw happen as fast as possible?
How can an application know what is actually on screen right now? (Or, how long/how many frames the current buffering delay is?)
Where are graphics buffered between a call to swap buffers and actually showing up on screen? Why the delay? It sure looks like the app is drawing many frames ahead of the screen at all times.
The command is queued; whatever is drawn goes to the back buffer, waits until the next vsync if you have set a swap interval, and at the next vsync that buffer should be displayed.
What can an OpenGL application do to cause an immediate draw to screen? (ie: no buffering, just block until draw is complete; I don't need high throughput, I do need low latency)
Using glFinish will ensure everything is drawn before the API returns, but it gives no control over when the frame actually gets to the screen, other than the swap interval setting.
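For example, a minimal low-latency loop might look like this (GLFW names assumed, since the question mentions GLFW; the same idea applies with eglSwapInterval):
glfwSwapInterval(0);   /* or 1 to keep vsync while still limiting queued frames */
while (running) {
    draw();
    glfwSwapBuffers(window);
    glFinish();        /* block until the GPU has actually finished this frame */
}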
What can an application do to make the above immediate draw happen as fast as possible?
How can an application know what is actually on screen right now? (Or, how long/how many frames the current buffering delay is?)
Generally you can use a sync extension (something like http://www.khronos.org/registry/egl/extensions/NV/EGL_NV_sync.txt) to find this out.
Are you sure your method of measuring latency is correct? What if the key input actually has significant delay on your PC? Have you measured the latency from the event being received in your code to the point just after swapbuffers?
You must understand that the GPU has specially dedicated memory available on board. At the most basic level this memory is used to hold the encoded pixels you see on your screen (it is also used for graphics hardware acceleration and other things, but that is unimportant now). Because it takes time to load a frame from main RAM into GPU RAM, you can get a flickering effect: for a brief moment you see the background instead of what is supposed to be displayed. Although this copying happens extremely fast, it is noticeable to the human eye and quite annoying.
To counter this, we use a technique called double buffering. Basically, double buffering works by having an additional frame buffer in your GPU RAM (this can be one or many, depending on the graphics library you are working with and the GPU, but two is enough) and using a pointer to indicate which frame should be displayed. Thus while the first frame is being displayed, you are already creating the next one in your code using some draw() function on an image structure in main RAM; this image is then copied to GPU RAM (while the previous frame is still being displayed), and then, when calling eglSwapBuffers(), the pointer switches to your back buffer (I guessed this from your question; I'm not familiar with OpenGL, but this is quite universal). You can imagine this pointer switch does not require very much time. I hope you see now that directly writing an image to the screen would actually cause much more delay (and annoying flickering).
Also, ~16.6 milliseconds does not sound like that much. I think most of the time is lost creating/setting the required data structures and not really in the drawing computations themselves (you could test this by just drawing the background).
Lastly, I'd like to add that I/O is usually pretty slow (the slowest part of most programs) and that 150 ms is not that long at all (still twice as fast as a blink of an eye).
Ah yes, you've discovered one of the peculiarities of the interaction of OpenGL and display systems that only a few people actually understand (and to be frank, I didn't fully understand it until about 2 years ago either). So what is happening here:
SwapBuffers does two things:
it queues a (private) command to the command queue that's also used for OpenGL drawing calls, which essentially flags a buffer swap to the graphics system
it makes OpenGL flush all queued drawing commands (to the back buffer)
Apart from that, SwapBuffers does nothing by itself. But those two things have interesting consequences. One is that SwapBuffers will return immediately. But as soon as the "the back buffer is to be swapped" flag is set (by the queued command), the back buffer becomes locked for any operation that would alter its contents. So as long as no call is made that would alter the contents of the back buffer, things will not block. Commands that would alter the contents of the back buffer halt the OpenGL command queue until the back buffer has been swapped and released for further commands.
Now the length of the OpenGL command queue is an abstract thing. But the usual behavior is that one of the OpenGL drawing commands will block, waiting for the queue to flush in response to the swap having happened.
I suggest you spray your program with logging statements, using some high-performance, high-resolution timer as the clock source, to see where exactly the delay happens.
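For example, a simple monotonic timestamp helper for such log statements (POSIX clock_gettime assumed):
#include <stdio.h>
#include <time.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

/* usage around the suspected delay: */
printf("before swap: %.3f\n", now_ms());
swap_buffers();
printf("after swap:  %.3f\n", now_ms());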
Latency will be determined both by the driver, and by the display itself. Even if you wrote directly to the hardware, you would be limited by the latter.
The application can only do so much (i.e. draw fast, process inputs as closely as possible to or during drawing, perhaps even modify the buffer at the time of flip) to mitigate this. After that you're at the mercy of other engineers, both hardware and software.
And you can't tell what the latency is without external monitoring, as you've done.
Also, don't assume your input (keyboard to app) is low latency either!
