Using an iMX6 Quad in a bare metal application. No cameras or other video devices attached - just graphics from main memory. I'm trying to determine when it is safe to change the buffer addresses in the CPMEM (EBA0 and EBA1) to show a completely different "screen" on the display. In other words, is it safe to change these addresses once the "old" frame has started to be processed? Or do I have to stop processing altogether, switch the addresses in the CPMEM, then start processing again?
Haven't got far enough to try anything yet.
Related
I'm using the device STM32F746. I know it has a hardware 2D Graphics accelerator.
I know how to do animation using double buffering.
But according to this
https://www.touchgfx.com/news/high-quality-graphics-using-only-internal-memory/
They are claiming that they use only one framebuffer for animation.
How is that possible, and what techniques are used to do it on the STM32F746?
This is still double buffering. One buffer is stored in MCU memory, where the next frame is prepared and composed. The other buffer is in the LCD driver's memory; data is transferred to it from the MCU when the frame is ready, and it is displayed on the LCD at the required refresh rate.
That's why that library requires so much of MCU memory.
Although that answer was accepted, it is wrong.
In fact, those controllers have their own LCD-driving circuit and thus do not require an external driver. They use part of internal memory as the screen buffer and constantly refresh the image on the LCD from it.
The library uses only that part of memory. Write operations are synchronized with the LCD refresh, which avoids flickering.
So only one buffer is used: the same buffer holds the output image and is used to compose the next frame.
I just started to work in a new company and I'm new in the embedded world.
They gave me a task, I have done it and it's working but I don't know if I did it the right way.
I will describe the task and what I have done.
I was asked to hide a small piece of the DDR from the Linux OS so that some HW feature can write to it. After that, I need to be able to read this small piece of memory out to a file.
To hide a chunk of the DDR from Linux, I changed the Linux memory argument to the real memory size minus (the size I needed + a small margin for safety). I got this idea, and the idea for the driver I will describe in a second, from this post.
After that, Linux sees less memory than the HW has; the top section of the DDR is hidden from the kernel, and I can use it for my storage without worry.
I think I did this part right, which is not something I can say about the next part.
For the next part, to be able to read the piece of DDR I reserved, I wrote a char device driver. It works: it reads the reserved DDR chunk into a file piece by piece, with each piece no larger than a value I chose. I can't do it in one copy because that would require allocating a big buffer, and I don't have enough RAM for that.
Now I have read about block devices and started to think that a block device might fit my program better, but I'm not really sure. First, it's working, and if it's not broken... Second, I have never written a block device driver (I had never written a char device driver either until the one described above), so I'm not sure this is the time to use a block device over a char device.
This depends on the intended use, but according to your description a character device is much more likely to be what you want. The difference:
a character device takes simple read and write commands and gets no help from the kernel. This is suitable for reading or writing from devices (and from anything that resembles a device, both if it is an actual stream that's read sequentially or supports 'seek' and can read the same data over and over again).
a block device hooks into the kernel's memory paging system and is capable of serving as a back-end for virtual memory pages. It can host a swap space, be the storage for a file system, etc. It is a much more complex beast than a character device. You need this only for something that stores a large amount of data that needs to be accessed by mapping it into the address space of a process (normally this is needed only if you put a file system on it).
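The piece-by-piece copy described in the question can be pictured with a small user-space sketch. It is only an illustration under assumptions: CHUNK_SIZE and dump_region are hypothetical names, the reserved memory is modelled as an ordinary byte array, and memcpy stands in for the copy a real driver would do towards user space.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096  /* hypothetical per-read limit chosen by the driver author */

/* Copies `total` bytes from `region` to `out` in pieces of at most CHUNK_SIZE,
 * the way a reader loops over read() on a character device.
 * Returns the number of bytes actually written. */
size_t dump_region(const unsigned char *region, size_t total, FILE *out)
{
    unsigned char chunk[CHUNK_SIZE];
    size_t done = 0;

    assert(region != NULL && out != NULL);
    while (done < total) {
        size_t n = total - done;
        if (n > CHUNK_SIZE)
            n = CHUNK_SIZE;
        memcpy(chunk, region + done, n);  /* stand-in for the driver's copy to user space */
        if (fwrite(chunk, 1, n, out) != n)
            break;                        /* stop on a short write */
        done += n;
    }
    return done;
}
```

Only one CHUNK_SIZE buffer is ever allocated, which is the property the question needs; nothing here requires the paging or file-system hooks of a block device.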
I've seen a few questions here on Stack overflow dealing with the same issues, but no definite answer. I thought I'll ask again, with a bunch of questions of my own. All relate to the subject matter at hand.
So, do we know when the data transfer from host to the openCL device occurs? Can you tell me the exact memory transfer operation of the functions below (that is, what data is transferred or created, if any, when these functions are invoked?):
clCreateBuffer()
clSetKernelArg()
clEnqueueNDRangeKernel()
The first two don't even produce events, so we can't time them, but surely some data transferring is happening here.
Is there a way to transfer data to a device without first setting it as a kernel arg?
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device. Why would that not be desirable, since, that way, we could avoid further data transfer commands (and surely the driver implements this in the most efficient way)?
Does transferred data (say, as part of a kernel arg) stay at the device for further manipulation after a kernel returns? If not, is there a way to do just that?
Buffer copies are related to command queues. The easiest way to sync a command queue with the host is clFinish().
clCreateBuffer()
clEnqueueWriteBuffer() <-------- you can get event data from this
(set blocking parameter to false to queue everything quickly)
(set blocking to true if you sync write here)
clSetKernelArg()
clEnqueueWriteBuffer() <----- it could be here too
clEnqueueNDRangeKernel()
clEnqueueWriteBuffer() <----- or here (too quickly re-set an array?)
clFinish() <--------- this ensures all queued commands are executed before this
now you can query data of that event to check when it started and when ended
To let a buffer stay on the device, create it on the device first and then don't migrate it to another device. Using only the CL_MEM_READ_WRITE flag in clCreateBuffer() is enough to make it a real device-side buffer until you release it.
CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR uses host memory, which the device maps for its cores. This is faster for streaming data in and out because no extra data movement is needed on the host side. If you always need to use device memory such as fast GDDR5 or HBM, then you should not use these flags.
Copy to device once, use as much as you want. If device has its own memory of course. For example, Intel HD Graphics 400 doesn't have its own memory and shares RAM so it is much faster to use CL_MEM_..._HOST_PTR flags and especially USE_HOST_PTR.
To check if device shares RAM with CPU, you query CL_DEVICE_HOST_UNIFIED_MEMORY property of device.
It appears (from preliminary testing of my own) that a mem object created with CL_MEM_USE_HOST_PTR gets directly manipulated by the device
Even without map/unmap commands prior to kernel execution, my computer behaves the same, but I'm using map/unmap just to be safe, and it doesn't cost many cycles.
Edit: if you want to make sure a command doesn't start before you want, you can add a user event in event list input parameter of bufferwrite command. Then you can trigger the user event to let writing start because commands wait for all events in the list to be fired+completed before continuing (if there are any specified in event list input parameter)
My problem is the following:
I have a pointer that stores a framebuffer which is constantly changed by some thread.
I want to display this framebuffer via OpenGL APIs. My trivial choice is to use glTexImage2D and load the framebuffer again and again at every time. This loading over the framebuffer is necessary because the framebuffer is changed by outside of the OpenGL APIs. I think that there could be some methods or tricks to speed up such as:
By finding out the changes inside the framebuffer (Is it even possible?)
Some method for fast re-loading of image
Could OpenGL directly use the pointer to the framebuffer? (So less framebuffer copying)
I'm not sure whether the above approaches are valid. I would appreciate any advice.
By finding out the changes inside the framebuffer (Is it even possible?)
That would reduce the required bandwidth, because it's effectively video compression. However, the CPU load is notably higher and it's much slower than just DMA-copying the data.
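For illustration only, here is a minimal sketch of what such change detection would involve, assuming the application keeps a copy of the previous frame and that rows are tightly packed. The function name find_dirty_rows and its parameters are hypothetical. The per-row memcmp over the whole frame is exactly the extra CPU work mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Finds the first and last rows that differ between two frames of equal size.
 * Returns 1 and fills *first / *last if any row changed, otherwise returns 0. */
int find_dirty_rows(const unsigned char *prev, const unsigned char *cur,
                    size_t height, size_t row_bytes,
                    size_t *first, size_t *last)
{
    int found = 0;

    assert(prev != NULL && cur != NULL && first != NULL && last != NULL);
    for (size_t y = 0; y < height; y++) {
        if (memcmp(prev + y * row_bytes, cur + y * row_bytes, row_bytes) != 0) {
            if (!found) {
                *first = y;   /* remember the topmost changed row */
                found = 1;
            }
            *last = y;        /* keeps advancing to the bottommost changed row */
        }
    }
    return found;
}
```

The resulting row range could then be uploaded as a partial update through the yoffset and height parameters of glTexSubImage2D, but whether that beats a plain full-frame upload has to be measured.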
Some method for fast re-loading of image
Use glTexSubImage2D instead of glTexImage2D (note the …Sub…). glTexImage2D goes through a full texture initialization each time it's called, which is rather costly.
If your program does additional things with OpenGL rather than just displaying the image, you can speed things up further, by reducing the time the program is waiting for things to complete by using Pixel Buffer Objects. The essential gist is
/* map the PBO so another thread can fill it */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboID);
void *pbuf = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
thread image_load_thread = start_thread_copy_image_to_pbo(image, pbuf);
do_other_things_with_opengl();
join_thread(image_load_thread);
/* the copy is done: unmap, then let the texture upload read from the PBO */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboID);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindTexture(…);
glTexSubImage2D(…, NULL); /* will read from PBO */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
draw_textured_quad();
Instead of creating and joining the thread, you may as well use a thread pool and condition variables for inter-thread synchronization.
Could OpenGL directly use the pointer to the framebuffer? (So less framebuffer copying)
Stuff almost always has to be copied around. Don't worry about copies if they are necessary anyway. The Pixel Buffer Objects I outlined above may or may not work with a shadow copy in system RAM. Essentially, you can glMapBuffer into your process's address space and decode directly into the buffer you're given. But that's no guarantee that a further copy is avoided.
I have a simple OpenGL test app in C which draws different things in response to key input. (Mesa 8.0.4, tried with Mesa-EGL and with GLFW, Ubuntu 12.04LTS on a PC with NVIDIA GTX650). The draws are quite simple/fast (rotating triangle type of stuff). My test code does not limit the framerate deliberately in any way, it just looks like this:
while (true)
{
draw();
swap_buffers();
}
I have timed this very carefully, and I find that the time from one eglSwapBuffers() (or glfwSwapBuffers) call to the next is ~16.6 milliseconds. The time from after a call to eglSwapBuffers() to just before the next call is only a little bit less than that, even though what is drawn is very simple. The time that the swap buffers call takes is well under 1ms.
However, the time from the app changing what it's drawing in response to the key press to the change actually showing up on screen is >150ms (approx 8-9 frames worth). This is measured with a camera recording of the screen and keyboard at 60fps. (Note: It is true I do not have a way to measure how long it takes from key press to the app getting it. I am assuming it is <<150ms).
Therefore, the questions:
Where are graphics buffered between a call to swap buffers and actually showing up on screen? Why the delay? It sure looks like the app is drawing many frames ahead of the screen at all times.
What can an OpenGL application do to cause an immediate draw to screen? (ie: no buffering, just block until draw is complete; I don't need high throughput, I do need low latency)
What can an application do to make the above immediate draw happen as fast as possible?
How can an application know what is actually on screen right now? (Or, how long/how many frames the current buffering delay is?)
Where are graphics buffered between a call to swap buffers and actually showing up on screen? Why the delay? It sure looks like the app is drawing many frames ahead of the screen at all times.
The swap command is queued. Whatever was drawn to the back buffer waits until the next vsync (if you have set swapInterval), and at that vsync the buffer should be displayed.
What can an OpenGL application do to cause an immediate draw to screen? (ie: no buffering, just block until draw is complete; I don't need high throughput, I do need low latency)
Calling glFinish ensures everything is drawn before the call returns, but there is no control over when the result actually gets to the screen other than the swapInterval setting.
What can an application do to make the above immediate draw happen as fast as possible?
How can an application know what is actually on screen right now? (Or, how long/how many frames the current buffering delay is?)
Generally, you can use sync objects (something like http://www.khronos.org/registry/egl/extensions/NV/EGL_NV_sync.txt) to find this out.
Are you sure the method of measuring latency is correct? What if the key input actually has a significant delay on your PC? Have you measured the latency from the event being received in your code to the point after swapbuffers?
You must understand that the GPU has specially dedicated memory available (on board). At the most basic level this memory is used to hold the encoded pixels you see on your screen (it is also used for graphics hardware acceleration and other stuff, but that is unimportant now). Because it takes time loading a frame from your main RAM to your GPU RAM you can get a flickering effect: for a brief moment you see the background instead of what is supposed to be displayed. Although this copying happens extremely fast, it is noticeable to the human eye and quite annoying.
To counter this, we use a technique called double buffering. Basically double buffering works by having an additional frame buffer in your GPU RAM (this can be one or many, depending on graphics library you are working with and the GPU, but two is enough to work) and using a pointer to indicate which frame should be displayed. Thus while the first frame is being displayed, you are already creating the next in your code using some draw() function on an image structure in main RAM, this image is then copied to your GPU RAM (while still displaying the previous frame) and then when calling eglSwapBuffers() the pointer switches to your back buffer (I guessed it from your question, I'm not familiar with OpenGL, but this is quite universal). You can imagine this pointer switch does not require very much time. I hope you see now that directly writing an image to the screen actually causes much more delay (and annoying flickering).
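The pointer switch described above can be sketched in a few lines. This is a simplified model, not how any particular driver implements it: both buffers live in one ordinary struct rather than in GPU RAM, and the names DoubleBuffer, db_back, db_front and db_swap are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FB_PIXELS (4 * 4)   /* tiny hypothetical frame */

typedef struct {
    uint32_t buffers[2][FB_PIXELS];
    int front;              /* index of the buffer currently being displayed */
} DoubleBuffer;

/* Buffer the application draws into while the front buffer is shown. */
uint32_t *db_back(DoubleBuffer *db)
{
    assert(db != NULL);
    return db->buffers[1 - db->front];
}

/* Buffer the display reads from. */
const uint32_t *db_front(const DoubleBuffer *db)
{
    assert(db != NULL);
    return db->buffers[db->front];
}

/* The "swap": no pixel data moves, only the index changes. */
void db_swap(DoubleBuffer *db)
{
    assert(db != NULL);
    db->front = 1 - db->front;
}
```

Drawing into db_back() never disturbs what db_front() returns until db_swap() is called, which is why the swap itself is cheap.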
Also ~16.6 milliseconds does not sound like that much. I think most time is lost creating/setting the required data structures and not really in the drawing computations itself (you could test this by just drawing the background).
Lastly, I'd like to add that I/O is usually pretty slow (the slowest part of most programs), and 150ms is not that long at all (still twice as fast as the blink of an eye).
Ah, yes you've discovered one of the peculiarities of the interaction of OpenGL and display systems only few people actually understand (and to be frank I didn't fully understand it until about 2 years ago as well). So what is happening here:
SwapBuffers does two things:
it queues a (private) command to the command queue that's used also for OpenGL drawing calls that essentially flags a buffer swap to the graphics system
it makes OpenGL flush all queued drawing commands (to the back buffer)
Apart from that, SwapBuffers does nothing by itself. But those two things have interesting consequences. One is that SwapBuffers will return immediately. But as soon as the "back buffer is to be swapped" flag is set (by the queued command), the back buffer becomes locked for any operation that would alter its contents. So as long as no call is made that would alter the contents of the back buffer, things will not block. And commands that would alter the contents of the back buffer will halt the OpenGL command queue until the back buffer has been swapped and released for further commands.
Now the length of the OpenGL command queue is an abstract thing. But the usual behavior is, that one of the OpenGL drawing commands will block, waiting for the queue to flush in response to swap buffers having happened.
I suggest you spray your program with logging statements using some high performance, high resolution timer as clock source to see where exactly the delay happens.
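A minimal sketch of such logging, assuming a POSIX system where clock_gettime with CLOCK_MONOTONIC is available; the helper names now_ns and log_stamp are made up for this example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime under strict C modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;   /* silence unused warning when NDEBUG is defined */
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Prints the time elapsed since start_ns, in milliseconds, to stderr. */
void log_stamp(const char *label, uint64_t start_ns)
{
    fprintf(stderr, "%-12s %10.3f ms\n", label,
            (double)(now_ns() - start_ns) / 1e6);
}
```

In the loop from the question, one would take a start time once and call log_stamp before and after draw() and swap_buffers(), and again when the key event is received, to see which call actually blocks.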
Latency will be determined both by the driver, and by the display itself. Even if you wrote directly to the hardware, you would be limited by the latter.
The application can only do so much (i.e. draw fast, process inputs as closely as possible to or during drawing, perhaps even modify the buffer at the time of flip) to mitigate this. After that you're at the mercy of other engineers, both hardware and software.
And you can't tell what the latency is without external monitoring, as you've done.
Also, don't assume your input (keyboard to app) is low latency either!