OpenGL - Is it efficient to call glBufferSubData (nearly) each frame? - c

I have a spritesheet that contains a simple sprite animation. I can successfully extract each animation frame sprite and display it as a texture. However, I managed to do that by calling glBufferSubData to change the texture coordinates (before the game loop, at initialization). I want to play a simple sprite animation using these extracted textures, and I guess I will do it by changing the texture coordinates each frame (except if animation is triggered by user input). Anyways, this results in calling glBufferSubData almost every frame (to change texture data), and my question is that is this approach efficient? If not, how can I solve the issue? (In Reddit, I saw a comment saying that the goal must be to minimize the traffic between CPU and GPU memory in modern OpenGL, and I guess my approach violates this goal.) Thanks in advance.
For anyone interested, here is my approach:
void set_sprite_texture_from_spritesheet(Sprite* sprite, const char* path, int x_offset, int y_offset, int sprite_width, int sprite_height)
{
float* uv_coords = get_uv_coords_from_spritesheet(path, x_offset, y_offset, sprite_width, sprite_height);
for (int i = 0; i < 8; i++)
{
/* 8 means that I am changing a total of 8 texture coordinates (2 for each 4 vertices) */
edit_vertex_data_by_index(sprite, &uv_coords[i], (i / 2) * 5 + 3 + (i % 2 != 0));
/*
the last argument in this function gives the index of the desired texture coordinate
(5 is for stride, 3 for offset of texture coordinates in each row)
*/
}
free(uv_coords);
sprite->texture = load_texture(path); /* loads the texture -
since the UV coordinate is
adjusted based on the spritesheet
I am loading the entire spritesheet as
a texture.
*/
}
void edit_vertex_data_by_index(Sprite *sprite, float *data, unsigned int start_index)
{
glBindBuffer(GL_ARRAY_BUFFER, sprite->vbo);
glBufferSubData(GL_ARRAY_BUFFER, start_index * sizeof(float), sizeof(data), data);
glBindBuffer(GL_ARRAY_BUFFER, 0);
/*
My concern is that if I call this almost every frame, it could be not efficient, but I am not sure.
*/
}

Editing buffers is fine. Literally every game has buffers that change every frame. Buffers are how you get the data to the GPU so it can render it! (And uniforms. Your driver is likely to secretly put uniforms in buffers though!)
Yes, you should minimize the amount of buffer updates. You should minimize everything, really. The less stuff the computer does, the faster it can do it! That doesn't mean you should avoid doing stuff entirely. It means you should only do as much stuff as you need to, instead of doing wasteful stuff that you don't need.
Every time you call an OpenGL function, the driver takes some time to check how to process your request, which buffer is bound, that it's big enough, that the GPU isn't using it at the same time, etc. You want to do as few calls as possible, because that way, the driver has to check all this stuff less often.
You are doing 8 separate glBufferSubData calls in this function. If you put the UV coordinates all next to each other in the buffer, you could update them all at once with 1 call. And if you have lots of animated sprites, you should try to put all of their UV coordinates in one big array, and update the whole array in one call - all the sprites at once.
And loading textures from paths is really slow. Maybe your program can load 100 textures per second but that still means you blew half your frame time budget on texture loading. The texture hasn't changed anyway so why would you load it again?

Related

What is the most efficient way to put multiple colours on a window, especially in frame-by-frame format?

I am making a game with C and X11. I've been trying for quite a while to find a way to put different coloured pixels on a window, frame by frame. I've seen fully developed games get thousands of frames per second. What is the most efficient way of doing this?
I have seen 2-coloured bitmaps with XImages, allocating 256 colours on a fade of black-white, and using XPutPixel with XImages (which I wasn't able to figure how to create an XImage properly that could later have pixels put on it).
I have made this for loop that creates a random image, but it is, obviously, pixel-by-pixel instead of frame-by-frame and takes 18 seconds to render one entire frame.
XColor pixel;
for (int x = 0; x < currentWindowWidth; x++) {
for (int y = 0; y < currentWindowHeight; y++) {
pixel.red = rand() % 256 * 256; //Converting 16-bit colour to 8-bit colour
pixel.green = rand() % 256 * 256;
pixel.blue = rand() % 256 * 256;
XAllocColor(display, XDefaultColormap(display, screenNumber), &pixel); //This probably takes the most time,
XSetForeground(display, graphics, pixel.pixel); //as does this.
XDrawPoint(display, window, graphics, x, y);
}
}
After three or so more weeks of testing things off and on, I finally figured out how to do it, and it was rather simple. As I said in the OP, XAllocColor and XSetForground take quite a bit of time (relatively) to work. XDrawPoint also was slow, as it does more than just put a pixel at a point on an image.
First I tested how Xlib's colour format works (for the unsigned long int represented as pixel.pixel, which was what I needed XAllocColor for), and it appears to have 100% red set to 16711680, 100% green set to 65280, and 100% blue set to 255, which is obviously a pattern. I found the maximum to be a 50% of all colours, 4286019447, which is a solid grey.
Next, I made sure my XVisualInfo would be supported by my system with a test using XMatchVisualInfo([expected visual info values]). That ensures the depth I will use and the TrueColor class works.
Finally, I made an XImage copied from the root window's image for manipulation. I used XPutPixel for each pixel on the window and set it to a random value between 0 and 4286019448, creating the random image. I then used XPutImage to paste the image to the window.
Here's the final code:
if (!XMatchVisualInfo(display, screenNumber, 24, TrueColor, &visualInfo)) {
exit(0);
}
frameImage = XGetImage(display, rootWindow, 0, 0, screenWidth, screenHeight, AllPlanes, ZPixmap);
while (1) {
for (unsigned short x = 0; x < currentWindowWidth; x += pixelSize) {
for (unsigned short y = 0; y < currentWindowHeight; y += pixelSize) {
XPutPixel(frameImage, x, y, rand() % 4286019447);
}
}
XPutImage(display, window, graphics, frameImage, 0, 0, 0, 0, currentWindowWidth, currentWindowHeight);
}
This puts a random image on the screen, at a stable 140 frames per second on fullscreen. I don't necessarily know if this is the most efficient way, but it works way better than anything else I've tried. Let me know if there is any way to make it better.
Thousands of frames per second is not possible. The monitor frequency is about 100 Hz, or 100 cycles per second, that's roughly the maximum frame rate. This is still very fast. Human eye wouldn't pick up faster frame rates.
The monitor response time is about 5ms, so any single point on the screen cannot be refreshed more than 200 times per second.
8-bit is 1 byte, so 8-bit image uses one byte per pixel, each pixel is from 0 to 256. The pixel doesn't have red, blue, green component. Instead each pixel points to an index in the color table. The color table holds 256 colors. There is a trick where you keep the pixels the same and change the color table, this makes the image fade in and out or do other weird things.
In a 24-bit image, each pixel has blue, red, green component. Each color is 1 byte, so each pixel is 3 bytes, or 24 bits.
uint8_t red = rand() % 256;
uint8_t grn = rand() % 256;
uint8_t blu = rand() % 256;
A 16-bit image uses an odd format to store red, blue, green. 16 is not divisible by 3, often times 2 colors are assigned 5-bits, and the 3rd color gets 6-bits. Then you have to fit these colors on one uint16_t sized pixel. It's probably not worth it to explore this.
The slowness of your routine is because you are painting one pixel at a time. You should paint to a buffer instead, and render the buffer once per frame. You might consider using other frame works like SDL. Other games may use things like OpenGL which takes advantage of GPU optimization for matrix operation etc.
You must use a GPU. GPUs have a highly parallel architecture optimized for graphics (hence the name). To access the GPU you will use an API like OpenGL or Vulkan or make use of a Game Engine.

DirectShow data copy is TOO slow

Have USB 3.0 HDMI Capture device. It uses YUY2 format (2 bytes per pixel) and 1920x1080 resolution.
Video capture Output Pin connects directly to Video Render input Pin.
And all works good. It shows me 1920x1080 without any freezes.
But I need to make screenshot every second. So this is what I do:
void CaptureInterface::ScreenShoot() {
IMemInputPin* p_MemoryInputPin = nullptr;
hr = p_RenderInputPin->QueryInterface(IID_IMemInputPin, (void**)&p_MemoryInputPin);
IMemAllocator* p_MemoryAllocator = nullptr;
hr = p_MemoryInputPin->GetAllocator(&p_MemoryAllocator);
IMediaSample* p_MediaSample = nullptr;
hr = p_MemoryAllocator->GetBuffer(&p_MediaSample, 0, 0, 0);
long buff_size = p_MediaSample->GetSize(); //buff_size = 4147200 Bytes
BYTE* buff = nullptr;
hr = p_MediaSample->GetPointer(&buff);
//BYTE CaptureInterface::ScreenBuff[1920*1080*2]; defined in header
//--------- TOO SLOW (1.5 seconds for 4 MBytes) ----------
std::memcpy(ScreenBuff, buff, buff_size);
//--------------------------------------------
p_MediaSample->Release();
p_MemoryAllocator->Release();
p_MemoryInputPin->Release();
return;
}
Any other operations with this buffer is very slow too.
But If I use memcpy on other data (2 arrays in my class for example same size 4MB) It is very fast. <0.01sec
Video memory is (might be) slow to read back by its nature (e.g. VMR9 IBasicVideo->GetCurrentImage very slow and you can find other references). You normally want to grab the data before it actually reaches video memory.
Additionally, the way you read data is not quite reliable. You don't know what frame you are actually copying and it might so happen that you even read blackness or garbage, or vice versa your acquiring access to buffer freezes the main video streaming. This is because you are grabbing an unused buffer from pool of available buffers rather than a buffer that corresponds to specific video frame. Your getting an image from such buffer happen in a fragile assumption that unused data from previously streamed frame was initialized and is not yet overwritten by anything else.

Playing YUV video with SDL: Line wraps too late when resolution is not multiple of 4

I've been battling with an issue when playing certain sources of uncompressed YUV 4:2:0 planar video data with SDL_Overlay (SDL 1.2.5).
I have no problems playing, say, 640x480 video. But I have just attempted playing a video with the resolution 854x480, and I get a strange effect. The line wraps 1-2 pixels too late (causing a shear-like transformation) and the chroma disappears, to be replaced with alternating R, G or B on each line. See this screenshot
The YUV data itself is correct, as I can save it to a file and play it in another player. It is not padded at this point - the pitch matches the line length.
My suspicion is that some issue occurs when the resolution is not a multiple of 4. Perhaps SDL_Surface expects an SDL_Overlay to have a chroma resolution as a multiple of 2?
Adding to my suspicion, I note that the RGB SDL_Surface that I create at a size of 854*480 has a pitch of 2564, not the 3*854 = 2562 I would expect.
If I add 1 or 2 pixels to the width of the SDL_Surface (but keep the overlay and rectangle the same), it works fine, albeit with a black border to the right. Of course this then breaks with videos which are a multiple of four.
Setup
screen = SDL_SetVideoMode(width, height, 24, SDL_SWSURFACE|SDL_ANYFORMAT|SDL_ASYNCBLIT);
if ( screen == NULL ) {
return 0;
}
YUVOverlay = SDL_CreateYUVOverlay(width, height, SDL_IYUV_OVERLAY, screen);
Ydata = new unsigned char[luma_size];
Udata = new unsigned char[chroma_size];
Vdata = new unsigned char[chroma_size];
YUVOverlay->pixels[0] = Ydata;
YUVOverlay->pixels[1] = Udata;
YUVOverlay->pixels[2] = Vdata;
SDL_DisplayYUVOverlay(YUVOverlay, dest);
Rendering loop:
SDL_LockYUVOverlay(YUVOverlay);
memcpy(Ydata, buffer, luma_size);
memcpy(Udata, buffer+luma_size, chroma_size);
memcpy(Vdata, buffer+luma_size+chroma_size, chroma_size);
int i = SDL_DisplayYUVOverlay(YUVOverlay, dest);
SDL_UnlockYUVOverlay(YUVOverlay);
The easiest fix for me to do is increase the RGB SDL_Surface size so that it is a multiple of 4 in each dimension. But then this adds a black border.
Is there a correct way of fixing this issue? Should I try playing with padding on my YUV data?
Each plane of your input data must start on an address divisible by 8, and the stride of each row must be divisible by 8. To be clear: your chroma planes need to obey this too.
This requirement seems to be from the SDL library's use of MMX multimedia instructions on an x86 cpu. See the comments in src/video/SDL_yuv_mmx.c in the distribution.
update: I looked at the actual assembly code, and there are additional assumptions not mentioned in the source code comments. This is for SDL 1.2.14. In addition to the modulo 8 assumption described above, the code assumes that both the input luma and input chroma planes are packed perfectly (i.e. width == stride).

Suggestions on optimizing a Z-buffer implementation?

I'm writing a 3D graphics library as part of a project of mine, and I'm at the point where everything works, but not well enough.
In particular, my main headache is that my pixel fill-rate is horribly slow -- I can't even manage 30 FPS when drawing a triangle that spans half of an 800x600 window on my target machine (which is admittedly an older computer, but it should be able to manage this . . .)
I ran gprof on my executable, and I end up with the following interesting lines:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
43.51 9.50 9.50 vSwap
34.86 17.11 7.61 179944 0.04 0.04 grInterpolateHLine
13.99 20.17 3.06 grClearDepthBuffer
<snip>
0.76 21.78 0.17 624 0.27 12.46 grScanlineFill
The function vSwap is my double-buffer swapping function, and it also performs vsyching, so it makes sense to me that the test program will spend much of its time waiting in there. grScanlineFill is my triangle-drawing function, which creates an edge list and then calls grInterpolateHLine to actually fill in the triangle.
My engine is currently using a Z-buffer to perform hidden surface removal. If we discount the (presumed) vsynch overhead, then it turns out that the test program is spending something like 85% of its execution time either clearing the depth buffer, or writing pixels according to the values in the depth buffer. My depth buffer clearing function is simplicity itself: copy the maximum value of a float into each element. The function grInterpolateHLine is:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
for(; x1 <= x2; x1 ++, z += zstep) {
if(z < grDepthBuffer[x1 + y*VIDEO_WIDTH]) {
vSetPixel(x1, y, colour);
grDepthBuffer[x1 + y*VIDEO_WIDTH] = z;
}
}
}
I really don't see how I can improve that, especially considering that vSetPixel is a macro.
My entire stock of ideas for optimization has been whittled down to precisely one:
Use an integer/fixed-point depth buffer.
The problem that I have with integer/fixed-point depth buffers is that interpolation can be very annoying, and I don't actually have a fixed-point number library yet. Any further thoughts out there? Any advice would be most appreciated.
You should have a look at the source code to something like Quake - considering what it could achieve on a Pentium, 15 years ago. Its z-buffer implementation used spans rather than per-pixel (or fragment) depth. Otherwise, you could look at the rasterization code in Mesa.
Hard to really tell what higher order optimizations can be done without seeing the rest of the code. I have a couple of minor observation, though.
There's no need to calculate x1 + y * VIDEO_WIDTH more than once in grInterpolateHLine. i.e.:
void grInterpolateHLine(int x1, int x2, int y, float z, float zstep, int colour) {
int offset = x1 + (y * VIDEO_WIDTH);
for(; x1 <= x2; x1 ++, z += zstep, offset++) {
if(z < grDepthBuffer[offset]) {
vSetPixel(x1, y, colour);
grDepthBuffer[offset] = z;
}
}
}
Likewise, I'm guessing that your vSetPixel does a similar calculation, so you should be able to use the same offset there as well, and then you only need to increment offset and not x1 in each loop iteration. Chances are this can be extended back to the function that calls grInterpolateHLine, and you would then only need to do the multiplication once per triangle.
There are some other things you could do with the depth buffer. Most of the time if the first pixel of the line either fails or passes the depth test, then the rest of the line will have the same result. So after the first test you can write a more efficient assembly block to test the entire line in one shot, then if it passes you can use a more efficient block memory setter to block-set the pixel and depth values instead of doing them one at a time. You would only need to test/set per pixel if the line is only partially occluded.
Also, not sure what you mean by older computer, but if your target computer is multi-core then you can break it up among multiple cores. You can do this for the buffer clearing function as well. It can help quite a bit.
I ended up solving this by replacing the Z-buffer with the Painter's Algorithm. I used SSE to write a Z-buffer implementation that created a bitmask w/the pixels to paint (plus the range optimization suggested by Gerald), and it still ran far too slowly.
Thank you, everyone, for your input.

How can this function be optimized? (Uses almost all of the processing power)

I'm in the process of writing a little game to teach myself OpenGL rendering as it's one of the things I haven't tackled yet. I used SDL before and this same function, while still performing badly, didn't go as over the top as it does now.
Basically, there is not much going on in my game yet, just some basic movement and background drawing. When I switched to OpenGL, it appears as if it's way too fast. My frames per second exceed 2000 and this function uses up most of the processing power.
What is interesting is that the program in it's SDL version used 100% CPU but ran smoothly, while the OpenGL version uses only about 40% - 60% CPU but seems to tax my graphics card in such a way that my whole desktop becomes unresponsive. Bad.
It's not a too complex function, it renders a 1024x1024 background tile according to the player's X and Y coordinates to give the impression of movement while the player graphic itself stays locked in the center. Because it's a small tile for a bigger screen, I have to render it multiple times to stitch the tiles together for a full background. The two for loops in the code below iterate 12 times, combined, so I can see why this is ineffective when called 2000 times per second.
So to get to the point, this is the evil-doer:
void render_background(game_t *game)
{
int bgw;
int bgh;
int x, y;
glBindTexture(GL_TEXTURE_2D, game->art_background);
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_WIDTH, &bgw);
glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_HEIGHT, &bgh);
glBegin(GL_QUADS);
/*
* Start one background tile too early and end one too late
* so the player can not outrun the background
*/
for (x = -bgw; x < root->w + bgw; x += bgw)
{
for (y = -bgh; y < root->h + bgh; y += bgh)
{
/* Offsets */
int ox = x + (int)game->player->x % bgw;
int oy = y + (int)game->player->y % bgh;
/* Top Left */
glTexCoord2f(0, 0);
glVertex3f(ox, oy, 0);
/* Top Right */
glTexCoord2f(1, 0);
glVertex3f(ox + bgw, oy, 0);
/* Bottom Right */
glTexCoord2f(1, 1);
glVertex3f(ox + bgw, oy + bgh, 0);
/* Bottom Left */
glTexCoord2f(0, 1);
glVertex3f(ox, oy + bgh, 0);
}
}
glEnd();
}
If I artificially limit the speed by called SDL_Delay(1) in the game loop, I cut the FPS down to ~660 ± 20, I get no "performance overkill". But I doubt that is the correct way to go on about this.
For the sake of completion, these are my general rendering and game loop functions:
void game_main()
{
long current_ticks = 0;
long elapsed_ticks;
long last_ticks = SDL_GetTicks();
game_t game;
object_t player;
if (init_game(&game) != 0)
return;
init_player(&player);
game.player = &player;
/* game_init() */
while (!game.quit)
{
/* Update number of ticks since last loop */
current_ticks = SDL_GetTicks();
elapsed_ticks = current_ticks - last_ticks;
last_ticks = current_ticks;
game_handle_inputs(elapsed_ticks, &game);
game_update(elapsed_ticks, &game);
game_render(elapsed_ticks, &game);
/* Lagging stops if I enable this */
/* SDL_Delay(1); */
}
cleanup_game(&game);
return;
}
void game_render(long elapsed_ticks, game_t *game)
{
game->tick_counter += elapsed_ticks;
if (game->tick_counter >= 1000)
{
game->fps = game->frame_counter;
game->tick_counter = 0;
game->frame_counter = 0;
printf("FPS: %d\n", game->fps);
}
render_background(game);
render_objects(game);
SDL_GL_SwapBuffers();
game->frame_counter++;
return;
}
According to gprof profiling, even when I limit the execution with SDL_Delay(), it still spends about 50% of the time rendering my background.
Turn on VSYNC. That way you'll calculate graphics data exactly as fast as the display can present it to the user, and you won't waste CPU or GPU cycles calculating extra frames inbetween that will just be discarded because the monitor is still busy displaying a previous frame.
First of all, you don't need to render the tile x*y times - you can render it once for the entire area it should cover and use GL_REPEAT to have OpenGL cover the entire area with it. All you need to do is to compute the proper texture coordinates once, so that the tile doesn't get distorted (stretched). To make it appear to be moving, increase the texture coordinates by a small margin every frame.
Now down to limiting the speed. What you want to do is not to just plug a sleep() call in there, but measure the time it takes to render one complete frame:
function FrameCap (time_t desiredFrameTime, time_t actualFrameTime)
{
time_t delay = 1000 / desiredFrameTime;
if (desiredFrameTime > actualFrameTime)
sleep (desiredFrameTime - actualFrameTime); // there is a small imprecision here
}
time_t startTime = (time_t) SDL_GetTicks ();
// render frame
FrameCap ((time_t) SDL_GetTicks () - startTime);
There are ways to make this more precise (e.g. by using the performance counter functions on Windows 7, or using microsecond resolution on Linux), but I think you get the general idea. This approach also has the advantage of being driver independent and - unlike coupling to V-Sync - allowing an arbitrary frame rate.
At 2000 FPS it only takes 0.5 ms to render the entire frame. If you want to get 60 FPS then each frame should take about 16 ms. To do this, first render your frame (about 0.5 ms), then use SDL_Delay() to use up the rest of the 16 ms.
Also, if you are interested in profiling your code (which isn't needed if you are getting 2000 FPS!) then you may want to use High Resolution Timers. That way you could tell exactly how long any block of code takes, not just how much time your program spends in it.

Resources