NDS Homebrew: Multiple Animation Speeds for Sprites - c

I have been experimenting with the devkitARM toolchain for NDS homebrew recently. Something I would like to understand better, however, is how to control sprite animation speeds. The only way I know of doing this is by 'counting frames'. For example, the following code could be placed into the "animate_simple" example included with devkitPro:
int main(void) {
    int frame = 0;
    while(1) {
        /* advance the animation once per main-loop iteration */
        if(++frame > 9)
            frame = 0;
    }
    return 0;
}
This is generally fine, but it ensures that all the animation initialized in the main loop runs at a set speed. How would I go about having two different sprites, each animating at different speeds? Any insight would be greatly appreciated.

Use a separate frame counter for each sprite. For example, you can create a sprite struct:
typedef struct _Sprite Sprite;
struct _Sprite
{
    int frame;
    int count;
    int delay; /* handles the speed of animation */
    int limit;
};
// initialize all fields
void sprite_update(Sprite* s)
{
    if ( ( s->count++ % s->delay ) == 0 )
    {
        /* pre-increment so that frame never exceeds limit */
        if ( ++s->frame > s->limit ) s->frame = 0;
    }
}
Give delay a small value for faster animation, and a larger value for slower animation.
Sample Usage:
Sprite my_spr, my_spr_2;
/* initialize the fields, or write some function for initializing them. */
my_spr.delay = 5; /* fast */
my_spr_2.delay = 10; /* slow */
/* < in Main Loop > */
sprite_update(&my_spr);
sprite_update(&my_spr_2);
Since you are targeting only one platform, the best way to control animation speed is to count frames, as you are already doing. A good thing about programming for consoles is that you don't have to set timed delays: all consoles of the same model usually run at the same speed, so the animation speed you get on your machine (or emulator) is the speed everyone else gets. Differing processor speeds are a real headache when programming for the PC.
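As a rough, self-contained sketch of the per-sprite counter idea (plain C, no libnds calls): the struct mirrors the fields above, but this variant resets count when it reaches delay instead of using the modulo, so the counter never overflows. sprite_frame_after is a hypothetical helper, added only to make the behaviour easy to check.

```c
/* Per-sprite animation state; field names follow the Sprite struct above. */
typedef struct {
    int frame;  /* current animation frame */
    int count;  /* main-loop iterations since the last frame change */
    int delay;  /* iterations per animation frame (assumed >= 1) */
    int limit;  /* index of the last frame */
} Sprite;

/* Advance the sprite by one main-loop iteration. */
static void sprite_update(Sprite *s)
{
    if (++s->count >= s->delay) {
        s->count = 0;
        if (++s->frame > s->limit)
            s->frame = 0;
    }
}

/* Hypothetical helper: frame shown after `ticks` iterations, from frame 0. */
static int sprite_frame_after(int delay, int limit, int ticks)
{
    Sprite s = { 0, 0, delay, limit };
    for (int i = 0; i < ticks; i++)
        sprite_update(&s);
    return s.frame;
}
```

After 20 iterations, a sprite with delay 5 has advanced four frames while one with delay 10 has advanced two, even though both are updated from the same loop.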


Trying to find a good way to control servos at different speeds using johnny-five.io

I am currently trying to figure out a configuration in controlling servos connected to Arduinos using Nodebot johnny-five.io hosted on an RPi. My main goal is to make a hexapod from scratch; I don't like the idea of kits because it's all cookie cutter code and parts that you put together and it's more or less a remote controlled car where you didn't learn anything.
I just learned the basics about servos (I'm choosing servos over stepper motors). Unfortunately, by default, servo speed can't be controlled via PWM, only position. The way around this is a loop that increments the servo 1 degree (or more) at a time, with an X ms delay per iteration, until the servo reaches the desired position. This is fine if you're only controlling one servo at a time, or moving X servos from one set position to another. But I'm aiming for fluid motion: I want one leg to start moving before another leg has stopped. To accomplish this, I would need an infinite loop that checks the input states set by the control commands the API receives.
The problem here is that while loops are not asynchronous. So, I need to find a way to kick a loop off that sets the different servos at different speed ranges and different positions, and at the end of the loop checks for new input state updates. And I need to do this without creating a memory leak.
One way would be to create a set of dependency scripts that worked asynchronously for each of the servos (3 servos per leg, 6 legs, 18 servos thus 18 mini dependencies), but I'm not sure if that would be too much overhead or put much strain onto the RPi.
Am I overthinking this?
You could create a simple servo class and give each instance its own speed and starting position using an 'update' method. Use a longer delay to have the servo move slower. In the main loop you can continuously check for some input, tell the servos to move if necessary, using the update method.
#include <Servo.h>

class HexapodLegServo
{
    Servo servo;               // Servo instance
    int pos;                   // Position/angle of the servo
    int delay_millis;          // 'Delay' between each update
    unsigned long prev_millis; // Time of the previous update
    int degree_change;         // Angle increase each update
    int start_pos;             // Position to start at
    int pin;                   // Pin servo is connected to
public:
    HexapodLegServo (int whichPin, int delayMillis=15, int startPos=0)
    {
        pin = whichPin;
        delay_millis = delayMillis;
        start_pos = startPos;
        degree_change = 1;
        prev_millis = 0;
    }
    void attachToPin()
    {
        servo.attach(pin);
    }
    void setStartPos() // Set initial position of the servo
    {
        pos = start_pos;
        servo.write(pos);
    }
    void update() // Servo sweeps from end to end, and back
    {
        if (millis()-prev_millis > delay_millis)
        {
            prev_millis = millis();
            pos += degree_change;
            servo.write(pos);
            if ((pos <= 0) || (pos >= 180))
                degree_change *= -1; // Start moving in the reverse direction
        }
    }
};

// Main script
const int BUTTON = 4; // Button on pin 4
HexapodLegServo right_leg(9);
HexapodLegServo left_leg(10, 30, 90);
void setup()
{
    pinMode(BUTTON, INPUT_PULLUP); // Using a button to tell servos when to move
    right_leg.attachToPin();
    right_leg.setStartPos();
    left_leg.attachToPin();
    left_leg.setStartPos();
}
void loop()
{
    if (digitalRead(BUTTON) == LOW) // If button is pushed (can be any type of input)
    { // update the position of the servos continuously
        right_leg.update();
        left_leg.update();
    }
}
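The same non-blocking pattern can be sketched in plain C with the clock passed in as a parameter, which makes it easy to try off the board. This is only an illustration, not johnny-five or Arduino API: ServoState, servo_step and servo_pos_after are hypothetical names, and on real hardware now_ms would come from millis() and every position change would be followed by a servo write.

```c
/* One servo's motion state; all names here are illustrative. */
typedef struct {
    int pos;                 /* current angle in degrees */
    int target;              /* angle to move toward */
    unsigned long interval;  /* milliseconds between 1-degree steps */
    unsigned long last_step; /* time of the previous step, in ms */
} ServoState;

/* Move at most one degree toward the target if the interval has elapsed.
 * Returns 1 if the position changed (the caller would then write s->pos
 * to the hardware), 0 otherwise. Never blocks. */
static int servo_step(ServoState *s, unsigned long now_ms)
{
    if (s->pos == s->target || now_ms - s->last_step < s->interval)
        return 0;
    s->last_step = now_ms;
    s->pos += (s->target > s->pos) ? 1 : -1;
    return 1;
}

/* Hypothetical helper: position after simulating `duration_ms` milliseconds
 * of a main loop that runs once per millisecond, starting at time 0. */
static int servo_pos_after(int start, int target,
                           unsigned long interval, unsigned long duration_ms)
{
    ServoState s = { start, target, interval, 0 };
    for (unsigned long now = 1; now <= duration_ms; now++)
        servo_step(&s, now);
    return s.pos;
}
```

Because each servo carries its own interval and target, one loop can drive all 18 servos at different speeds, and a new target can be assigned at any time without waiting for a move to finish.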

Audio samples producer multiple threads OSX

This question is a follow-up to a former question (Audio producer threads with OSX AudioComponent consumer thread and callback in C), including a test example, which works and behaves as expected but does not quite answer the question. I have substantially rephrased the question, and re-coded the example, so that it only contains plain-C code. (I've found out that few Objective-C portions of code in the former example only caused confusion and distracted the reader from what's essential in the question.)
In order to take advantage of multiple processor cores and to keep the CoreAudio pull-model render thread as lightweight as possible, the LPCM sample producer routine clearly has to "sit" on a different thread, outside the real-time-priority render thread/callback. It must feed the samples to a circular buffer (TPCircularBuffer in this example), from which the system pulls data in quanta of inNumberFrames.
The Grand Central Dispatch API offers a simple solution, which I arrived at through my own research (including trial-and-error coding). This solution is elegant, since it doesn't block anything and doesn't create a conflict between the push and pull models. Yet GCD, which is supposed to take care of "sub-threading", falls far short of the specific parallelization requirements of the producer's worker threads, so I had to explicitly spawn a number of POSIX threads, depending on the number of logical cores available. Although the results are already remarkable in terms of speeding up the computation, I still feel a bit uncomfortable mixing POSIX threads and GCD. This applies in particular to the variable wait_interval: I would like to compute it properly, rather than by predicting how many PCM samples the render thread may require for the next cycle.
Here's the shortened and simplified (pseudo)code for my test program, in plain-C.
Controller declaration:
#include "TPCircularBuffer.h"
#include <AudioToolbox/AudioToolbox.h>
#include <AudioUnit/AudioUnit.h>
#include <dispatch/dispatch.h>
#include <sys/sysctl.h>
#include <pthread.h>
typedef struct {
TPCircularBuffer buffer;
AudioComponentInstance toneUnit;
Float64 sampleRate;
AudioStreamBasicDescription streamFormat;
Float32* f; //array of updated frequencies
Float32* a; //array of updated amps
Float32* prevf; //array of prev. frequencies
Float32* preva; //array of prev. amps
Float32* val;
int* arg;
int* previous_arg;
UInt32 frames;
int state;
Boolean midif; //wip
} MyAudioController;
MyAudioController gen;
dispatch_semaphore_t mSemaphore;
Boolean multithreading, NF;
typedef struct data{
    int tid;
    int cpuCount;
} data;
Controller management:
void setup (void){
    // Initialize circular buffer
    TPCircularBufferInit(&(gen.buffer), kBufferLength);
    // Create the semaphore
    mSemaphore = dispatch_semaphore_create(0);
    // Setup audio
}
void dealloc (void) {
    // Release buffer resources
    // Clean up semaphore
    // dispose of audio
}
Dispatcher call (launching producer queue from the main thread):
void dproducer (Boolean on, Boolean multithreading, Boolean NF)
if (on == true)
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
Threadable producer routine:
void producerSum(Boolean on)
int rc;
int num = getCPUnum();
pthread_t threads[num];
data thread_args[num];
void* resulT;
static Float32 frames [FR_MAX];
Float32 wait_interval;
int bytesToCopy;
Float32 floatmax;
wait_interval = FACT*(gen.frames)/(gen.sampleRate);
Float32 damp = 1./(Float32)(gen.frames);
bytesToCopy = gen.frames*sizeof(Float32);
memset(frames, 0, FR_MAX*sizeof(Float32));
availableBytes = 0;
fbuffW = (Float32**)calloc(num + 1, sizeof(Float32*));
for (int i=0; i<num; ++i)
fbuffW[i] = (Float32*)calloc(gen.frames, sizeof(Float32));
thread_args[i].tid = i;
thread_args[i].cpuCount = num;
rc = pthread_create(&threads[i], NULL, producerTN, (void *) &thread_args[i]);
for (int i=0; i<num; ++i) rc = pthread_join(threads[i], &resulT);
for(UInt32 samp = 0; samp < gen.frames; samp++)
for(int i = 0; i < num; i++)
frames[samp] += fbuffW[i][samp];
//code for managing producer state and GUI updates
{ ... }
float *head = TPCircularBufferHead(&(gen.buffer), &availableBytes);
memcpy(head,(const void*)frames,MIN(bytesToCopy, availableBytes));//copies frames to head
dispatch_semaphore_wait(mSemaphore, dispatch_time(DISPATCH_TIME_NOW, wait_interval * NSEC_PER_SEC));
if(gen.state == stopped){gen.state = idle; on = false;}
for(int i = 0; i <= num; i++)
A single producer thread may look somewhat like this:
void *producerT (void *TN)
Float32 samples[FR_MAX];
data threadData;
threadData = *((data *)TN);
int tid = threadData.tid;
int step = threadData.cpuCount;
int *ret = calloc(1,sizeof(int));
do_something(tid, step, &samples);
{ … }
return (void*)ret;
Here is the render callback (CoreAudio real-time consumer thread):
static OSStatus audioRenderCallback(void *inRefCon,
                                    AudioUnitRenderActionFlags *ioActionFlags,
                                    const AudioTimeStamp *inTimeStamp,
                                    UInt32 inBusNumber,
                                    UInt32 inNumberFrames,
                                    AudioBufferList *ioData) {
    MyAudioController *THIS = (MyAudioController *)inRefCon;
    // An event happens in the render thread- signal whoever is waiting
    if (THIS->state == active) dispatch_semaphore_signal(mSemaphore);
    // Mono audio rendering: we only need one target buffer
    const int channel = 0;
    Float32* targetBuffer = (Float32 *)ioData->mBuffers[channel].mData;
    // Pull samples from circular buffer
    int32_t availableBytes;
    Float32 *buffer = TPCircularBufferTail(&THIS->buffer, &availableBytes);
    //copy circularBuffer content to target buffer
    int bytesToCopy = ioData->mBuffers[channel].mDataByteSize;
    memcpy(targetBuffer, buffer, MIN(bytesToCopy, availableBytes));
    { … };
    TPCircularBufferConsume(&THIS->buffer, availableBytes);
    THIS->frames = inNumberFrames;
    return noErr;
}
Grand Central Dispatch already takes care of dispatching operations to multiple processor cores and threads. In typical real-time audio rendering or processing, one never needs to wait on a signal or semaphore, as the circular buffer consumption rate is very predictable, and drifts extremely slowly over time. The AVAudioSession API (if available) and Audio Unit API and callback allow you to set and determine the callback buffer size, and thus the maximum rate at which the circular buffer can change. Thus you can dispatch all render operations on a timer, render the exact number needed per timer period, and let the buffer size and state compensate for any jitter in thread dispatch time.
In extremely long running audio renders, you might want to measure the drift between timer operations and real-time audio consumption (sample rate), and tweak the number of samples rendered or the timer offset.
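The "exact number needed per timer period" is simply the sample rate multiplied by the period, with the fractional part carried over to the next tick. A minimal sketch of that bookkeeping in plain C follows; FramePacer, frames_for_tick and frames_over are hypothetical names, not CoreAudio or GCD API, and the timer itself is assumed to exist elsewhere.

```c
/* Tracks how many frames a timer-driven producer should render per tick. */
typedef struct {
    double sample_rate;  /* frames per second, e.g. 44100.0 */
    double period_s;     /* timer period in seconds */
    double carry;        /* fractional frames left over from earlier ticks */
} FramePacer;

/* Return the whole number of frames to render on this tick, carrying the
 * fractional remainder forward so the long-run average matches the rate. */
static unsigned frames_for_tick(FramePacer *p)
{
    double exact = p->sample_rate * p->period_s + p->carry;
    unsigned whole = (unsigned)exact;
    p->carry = exact - (double)whole;
    return whole;
}

/* Hypothetical helper: total frames produced over `ticks` timer periods. */
static unsigned long frames_over(double sample_rate, double period_s, int ticks)
{
    FramePacer p = { sample_rate, period_s, 0.0 };
    unsigned long total = 0;
    for (int i = 0; i < ticks; i++)
        total += frames_for_tick(&p);
    return total;
}
```

At 44100 Hz with a 1/128 s timer, each tick asks for 344 or 345 frames and 128 ticks add up to exactly 44100. Drift compensation of the kind described above could be applied by nudging carry up or down.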

Data error in dynamic object array in arduino

Currently I'm using Arduino for my project, and what I want is an array that stores a set of sensors. I understand that the resources available for a dynamic array are limited. But by limiting the number of items in the array and using a struct instead of a class, I managed to cut the SRAM cost. So without further ado, here's my code:
#define MAX_SENSOR 6
namespace Sensors
{
    typedef struct
    {
        byte SlavePin;
        byte LDRPin;
        byte RedPin;
        byte BluePin;
    } Sensor;
    Sensor _sensors[MAX_SENSOR];
    byte _len = 0;
    void Add(Sensor s)
    {
        if (_len > MAX_SENSOR)
            return;
        _len++;
        _sensors[_len] = s;
    }
    Sensor Get(byte index)
    {
        return _sensors[index];
    }
}
And here's how I use it.
#include "Sensors.h"
void setup()
for (int i = 0; i < 6; i++)
Sensors::Sensor sen;
sen.SlavePin = 0;
Serial.print("Length = ");
for (int j = 0; j < Sensors::_len; j++)
Serial.print(" = ");
void loop() { /* Nothing goes here */ }
This code works and it compiles successfully. But when I run it, the serial window shows this :
Length : 6
Sensor 0:0
Sensor 1:0
Sensor 2:1
Sensor 3:2
Sensor 4:3
Sensor 5:4
Apparently, the first and second items in the array have the same value, and honestly, I don't know why.
Here's the output that I'm expecting :
Length : 6
Sensor 0:0
Sensor 1:1
Sensor 2:2
Sensor 3:3
Sensor 4:4
Sensor 5:5
Any help would be very appreciated. And BTW, I'm sorry if this kind of thread had already existed.
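The first call to Add() places the structure at index 1, because _len is incremented before the element is stored. Index 0 is never written, so it keeps its zero-initialized value:
byte _len = 0;
void Add(Sensor s)
{
    if (_len > MAX_SENSOR)
        return;
    _len++;
    // On first call _len will be 1
    _sensors[_len] = s;
}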
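A corrected version might look like the following sketch. It is written as plain C so it can be compiled off the board: the byte typedef stands in for the Arduino type, and sensors_add and sensors_fill are hypothetical names. It stores at the current length before incrementing, and it checks with >= so that a full array is rejected. A plain > test would still let writes run past the end of the array.

```c
#define MAX_SENSOR 6

typedef unsigned char byte; /* stand-in for the Arduino `byte` type */

typedef struct {
    byte SlavePin;
    byte LDRPin;
    byte RedPin;
    byte BluePin;
} Sensor;

static Sensor sensors[MAX_SENSOR];
static byte sensor_count = 0;

/* Store first, then advance the count, so the first sensor lands at index 0.
 * Returns 1 on success, 0 if the array is already full. */
static int sensors_add(Sensor s)
{
    if (sensor_count >= MAX_SENSOR) /* index MAX_SENSOR is out of bounds */
        return 0;
    sensors[sensor_count++] = s;
    return 1;
}

/* Hypothetical helper: add `n` sensors whose SlavePin is 0, 1, 2, ...
 * and return how many were accepted. */
static int sensors_fill(int n)
{
    int accepted = 0;
    for (int i = 0; i < n; i++) {
        Sensor s = { (byte)i, 0, 0, 0 };
        accepted += sensors_add(s);
    }
    return accepted;
}
```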
I understand the design intent of this code, but consider that this is a wasteful approach for a microcontroller.
Implementing Add() increases the code size. A library for a desktop computer would surely rate the code size a fair trade off for safety. A library for a microcontroller is harder to rate as good use of scarce memory.
Implementing Get() increases code size and execution time. Again, this seems like a good design for typical desktop environment and a library that you want to be safe. On a microcontroller, this is wrong.
The factor I see as the key decider between good and bad is permanent cost versus one-time savings. The safe version of Sensor costs code space and execution time on every system deployed and every second the program is running. The benefit comes only on the first day, when you run and debug the program.

How can this function be optimized? (Uses almost all of the processing power)

I'm in the process of writing a little game to teach myself OpenGL rendering as it's one of the things I haven't tackled yet. I used SDL before and this same function, while still performing badly, didn't go as over the top as it does now.
Basically, there is not much going on in my game yet, just some basic movement and background drawing. When I switched to OpenGL, it appears as if it's way too fast. My frames per second exceed 2000 and this function uses up most of the processing power.
What is interesting is that the SDL version of the program used 100% CPU but ran smoothly, while the OpenGL version uses only about 40% - 60% CPU but seems to tax my graphics card in such a way that my whole desktop becomes unresponsive. Bad.
It's not a very complex function: it renders a 1024x1024 background tile according to the player's X and Y coordinates to give the impression of movement, while the player graphic itself stays locked in the center. Because it's a small tile for a bigger screen, I have to render it multiple times to stitch the tiles together into a full background. The two for loops in the code below iterate 12 times combined, so I can see why this is inefficient when called 2000 times per second.
So to get to the point, this is the evil-doer:
void render_background(game_t *game)
{
    int bgw;
    int bgh;
    int x, y;

    glBindTexture(GL_TEXTURE_2D, game->art_background);
    glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_WIDTH, &bgw);
    glGetTexLevelParameteriv(GL_TEXTURE_2D, 0, GL_TEXTURE_HEIGHT, &bgh);

    glBegin(GL_QUADS);
    /*
     * Start one background tile too early and end one too late
     * so the player can not outrun the background
     */
    for (x = -bgw; x < root->w + bgw; x += bgw)
    {
        for (y = -bgh; y < root->h + bgh; y += bgh)
        {
            /* Offsets */
            int ox = x + (int)game->player->x % bgw;
            int oy = y + (int)game->player->y % bgh;
            /* Top Left */
            glTexCoord2f(0, 0);
            glVertex3f(ox, oy, 0);
            /* Top Right */
            glTexCoord2f(1, 0);
            glVertex3f(ox + bgw, oy, 0);
            /* Bottom Right */
            glTexCoord2f(1, 1);
            glVertex3f(ox + bgw, oy + bgh, 0);
            /* Bottom Left */
            glTexCoord2f(0, 1);
            glVertex3f(ox, oy + bgh, 0);
        }
    }
    glEnd();
}
If I artificially limit the speed by calling SDL_Delay(1) in the game loop, I cut the FPS down to ~660 ± 20 and get no "performance overkill". But I doubt that is the correct way to go about this.
For the sake of completion, these are my general rendering and game loop functions:
void game_main()
{
    long current_ticks = 0;
    long elapsed_ticks;
    long last_ticks = SDL_GetTicks();
    game_t game;
    object_t player;

    if (init_game(&game) != 0)
        return;
    game.player = &player;
    /* game_init() */
    while (!game.quit)
    {
        /* Update number of ticks since last loop */
        current_ticks = SDL_GetTicks();
        elapsed_ticks = current_ticks - last_ticks;
        last_ticks = current_ticks;
        game_handle_inputs(elapsed_ticks, &game);
        game_update(elapsed_ticks, &game);
        game_render(elapsed_ticks, &game);
        /* Lagging stops if I enable this */
        /* SDL_Delay(1); */
    }
}

void game_render(long elapsed_ticks, game_t *game)
{
    game->tick_counter += elapsed_ticks;
    if (game->tick_counter >= 1000)
    {
        game->fps = game->frame_counter;
        game->tick_counter = 0;
        game->frame_counter = 0;
        printf("FPS: %d\n", game->fps);
    }
}
According to gprof profiling, even when I limit the execution with SDL_Delay(), it still spends about 50% of the time rendering my background.
Turn on VSYNC. That way you'll calculate graphics data exactly as fast as the display can present it to the user, and you won't waste CPU or GPU cycles calculating extra frames in between that will just be discarded because the monitor is still busy displaying a previous frame.
First of all, you don't need to render the tile x*y times - you can render it once for the entire area it should cover and use GL_REPEAT to have OpenGL cover the entire area with it. All you need to do is to compute the proper texture coordinates once, so that the tile doesn't get distorted (stretched). To make it appear to be moving, increase the texture coordinates by a small margin every frame.
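Computing those texture coordinates is plain arithmetic. The sketch below shows one way to do it, assuming a single quad that covers the whole screen; TileCoords and tile_coords are hypothetical names, and the scroll direction is an arbitrary choice (negate the player terms to scroll the other way).

```c
/* Texture coordinates for one screen-sized quad that tiles a background. */
typedef struct {
    float u0, v0;  /* coordinates at the quad's top-left corner */
    float u1, v1;  /* coordinates at the quad's bottom-right corner */
} TileCoords;

/* With GL_REPEAT wrapping, coordinates outside [0, 1] repeat the texture,
 * so a span of screen_w / tile_w in u shows that many tiles undistorted.
 * The player's position only shifts the starting coordinate. */
static TileCoords tile_coords(int screen_w, int screen_h,
                              int tile_w, int tile_h,
                              float player_x, float player_y)
{
    TileCoords t;
    t.u0 = player_x / (float)tile_w;
    t.v0 = player_y / (float)tile_h;
    t.u1 = t.u0 + (float)screen_w / (float)tile_w;
    t.v1 = t.v0 + (float)screen_h / (float)tile_h;
    return t;
}
```

To use it, set GL_TEXTURE_WRAP_S and GL_TEXTURE_WRAP_T to GL_REPEAT with glTexParameteri on the bound texture, then draw one quad with (u0, v0) and (u1, v1) at opposite corners instead of one quad per tile.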
Now down to limiting the speed. What you want to do is not to just plug a sleep() call in there, but measure the time it takes to render one complete frame:
/* Sleep away whatever is left of the frame's time budget (all times in ms) */
void FrameCap (Uint32 desiredFrameTime, Uint32 actualFrameTime)
{
    if (desiredFrameTime > actualFrameTime)
        SDL_Delay (desiredFrameTime - actualFrameTime); // there is a small imprecision here
}

Uint32 desiredFrameTime = 1000 / 60; // e.g. a 60 FPS cap
Uint32 startTime = SDL_GetTicks ();
// render frame
FrameCap (desiredFrameTime, SDL_GetTicks () - startTime);
There are ways to make this more precise (e.g. by using the performance counter functions on Windows 7, or using microsecond resolution on Linux), but I think you get the general idea. This approach also has the advantage of being driver independent and - unlike coupling to V-Sync - allowing an arbitrary frame rate.
At 2000 FPS it only takes 0.5 ms to render the entire frame. If you want to get 60 FPS then each frame should take about 16 ms. To do this, first render your frame (about 0.5 ms), then use SDL_Delay() to use up the rest of the 16 ms.
Also, if you are interested in profiling your code (which isn't needed if you are getting 2000 FPS!) then you may want to use High Resolution Timers. That way you could tell exactly how long any block of code takes, not just how much time your program spends in it.
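On Linux, one way to get such a timer is POSIX clock_gettime with CLOCK_MONOTONIC. A minimal sketch follows; now_ns, time_call_ns and busy_work are illustrative names, and busy_work merely stands in for the block of code being measured.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Current monotonic time in nanoseconds (POSIX clock_gettime). */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Time one call of `fn` and return the elapsed nanoseconds. */
static long long time_call_ns(void (*fn)(void))
{
    long long start = now_ns();
    fn();
    return now_ns() - start;
}

/* Placeholder workload standing in for a block of rendering code. */
static void busy_work(void)
{
    volatile unsigned sum = 0;
    for (unsigned i = 0; i < 100000u; i++)
        sum += i;
}
```

Calling time_call_ns(busy_work) returns the wall-clock duration of that single call, which is the kind of per-block figure a sampling profiler does not give directly.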

Easy way to display a continuously updating image in C/Linux

I'm a scientist who is quite comfortable with C for numerical computation, but I need some help with displaying the results. I want to be able to display a continuously updated bitmap in a window, which is calculated from realtime data. I'd like to be able to update the image quite quickly (e.g. faster than 1 frame/second, preferably 100 fps). For example:
char image_buffer[width*height*3];//rgb data
initializewindow();
for (t=0;t<t_end;t++)
{
    getdata(data);//get some realtime data
    docalcs(image_buffer, data);//process the data into an image
    drawimage(image_buffer);//draw the image
}
What's the easiest way to do this on linux (Ubuntu)? What should I use for initializewindow() and drawimage()?
If all you want to do is display the data (i.e. no need for a GUI), you might want to take a look at SDL: it's straightforward to create a surface from your pixel data and then display it on screen.
Inspired by Artelius' answer, I also hacked up an example program:
#include <SDL/SDL.h>
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define WIDTH 256
#define HEIGHT 256

static _Bool init_app(const char * name, SDL_Surface * icon, uint32_t flags)
{
    if(SDL_Init(flags) < 0)
        return 0;
    SDL_WM_SetCaption(name, name);
    SDL_WM_SetIcon(icon, NULL);
    return 1;
}

static uint8_t * init_data(uint8_t * data)
{
    for(size_t i = WIDTH * HEIGHT * 3; i--; )
        data[i] = (i % 3 == 0) ? (i / 3) % WIDTH :
            (i % 3 == 1) ? (i / 3) / WIDTH : 0;
    return data;
}

static _Bool process(uint8_t * data)
{
    for(SDL_Event event; SDL_PollEvent(&event);)
        if(event.type == SDL_QUIT) return 0;
    for(size_t i = 0; i < WIDTH * HEIGHT * 3; i += 1 + rand() % 3)
        data[i] -= rand() % 8;
    return 1;
}

static void render(SDL_Surface * sf)
{
    SDL_Surface * screen = SDL_GetVideoSurface();
    if(SDL_BlitSurface(sf, NULL, screen, NULL) == 0)
        SDL_UpdateRect(screen, 0, 0, 0, 0);
}

static int filter(const SDL_Event * event)
{ return event->type == SDL_QUIT; }

#define mask32(BYTE) (*(uint32_t *)(uint8_t [4]){ [BYTE] = 0xff })

int main(int argc, char * argv[])
{
    (void)argc, (void)argv;
    static uint8_t buffer[WIDTH * HEIGHT * 3];

    _Bool ok =
        init_app("SDL example", NULL, SDL_INIT_VIDEO) &&
        SDL_SetVideoMode(WIDTH, HEIGHT, 24, SDL_HWSURFACE);
    assert(ok);

    SDL_Surface * data_sf = SDL_CreateRGBSurfaceFrom(
        init_data(buffer), WIDTH, HEIGHT, 24, WIDTH * 3,
        mask32(0), mask32(1), mask32(2), 0);

    SDL_SetEventFilter(filter);
    for(; process(buffer); SDL_Delay(10))
        render(data_sf);
    return 0;
}
I'd recommend SDL too. However, there's a bit of understanding you need to gather if you want to write fast programs, and that's not the easiest thing to do.
I would suggest this O'Reilly article as a starting point.
But I shall boil down the most important points from a computations perspective.
Double buffering
What SDL calls "double buffering" is generally called page flipping.
This basically means that on the graphics card, there are two chunks of memory called pages, each one large enough to hold a screen's worth of data. One is made visible on the monitor, the other one is accessible by your program. When you call SDL_Flip(), the graphics card switches their roles (i.e. the visible one becomes program-accessible and vice versa).
The alternative is, rather than swapping the roles of the pages, instead copy the data from the program-accessible page to the monitor page (using SDL_UpdateRect()).
Page flipping is fast, but has a drawback: after page flipping, your program is presented with a buffer that contains the pixels from 2 frames ago. This is fine if you need to recalculate every pixel every frame.
However, if you only need to modify smallish regions on the screen every frame, and the rest of the screen does not need to change, then UpdateRect can be a better way (see also: SDL_UpdateRects()).
This of course depends on what it is you're computing and how you're visualising it. Analyse your image-generating code - maybe you can restructure it to get something more efficient out of it?
Note that if your graphics hardware doesn't support page flipping, SDL will gracefully use the other method for you.
Choosing a surface type is another question you face. Basically, software surfaces live in RAM, hardware surfaces live in Video RAM, and OpenGL surfaces are managed by OpenGL magic.
Depending on your hardware, OS, and SDL version, programmatically modifying the pixels of a hardware surface can involve a LOT of memory copying (VRAM to RAM, and then back!). You don't want this to happen every frame. In such cases, software surfaces work better. But then, you can't take advantage of double buffering or hardware-accelerated blits.
Blits are block-copies of pixels from one surface to another. This works well if you want to draw a whole lot of identical icons on a surface. Not so useful if you're generating a temperature map.
OpenGL lets you do much more with your graphics hardware (3D acceleration for a start). Modern graphics cards have a lot of processing power, but it's kind of hard to use unless you're making a 3D simulation. Writing code for a graphics processor is possible but quite different to ordinary C.
Here's a quick demo SDL program that I made. It's not supposed to be a perfect example, and may have some portability problems. (I will try to edit a better program into this post when I get time.)
#include "SDL.h"
#include <assert.h>
#include <math.h>

/* This macro simplifies accessing a given pixel component on a surface. */
#define pel(surf, x, y, rgb) ((unsigned char *)((surf)->pixels))[(y)*((surf)->pitch)+(x)*3+(rgb)]

int main(int argc, char *argv[])
{
    int x, y, t;
    /* Event information is placed in here */
    SDL_Event event;
    /* This will be used as our "handle" to the screen surface */
    SDL_Surface *scr;

    SDL_Init(SDL_INIT_VIDEO);
    /* Get a 640x480, 24-bit software screen surface */
    scr = SDL_SetVideoMode(640, 480, 24, SDL_SWSURFACE);
    assert(scr);

    /* Ensures we have exclusive access to the pixels */
    SDL_LockSurface(scr);
    for(y = 0; y < scr->h; y++)
        for(x = 0; x < scr->w; x++)
        {
            /* This is what generates the pattern based on the xy co-ord */
            t = ((x*x + y*y) & 511) - 256;
            if (t < 0)
                t = -(t + 1);
            /* Now we write to the surface */
            pel(scr, x, y, 0) = 255 - t; //red
            pel(scr, x, y, 1) = t; //green
            pel(scr, x, y, 2) = t; //blue
        }
    SDL_UnlockSurface(scr);

    /* Copies the `scr' surface to the _actual_ screen */
    SDL_UpdateRect(scr, 0, 0, 0, 0);

    /* Now we wait for an event to arrive */
    while(SDL_WaitEvent(&event))
    {
        /* Any of these event types will end the program */
        if (event.type == SDL_QUIT
         || event.type == SDL_KEYDOWN
         || event.type == SDL_KEYUP)
            break;
    }
    SDL_Quit();
    return 0;
}
GUI stuff is a regularly-reinvented wheel, and there's no reason to not use a framework.
I'd recommend using either QT4 or wxWidgets. If you're using Ubuntu, GTK+ will suffice as it talks to GNOME and may be more comfortable to you (QT and wxWidgets both require C++).
Have a look at GTK+, QT, and wxWidgets.
Here are tutorials for all three:
Hello World, wxWidgets
GTK+ 2.0 Tutorial, GTK+
Tutorials, QT4
In addition to Jed Smith's answer, there are also lower-level frameworks, like OpenGL, which is often used for game programming. Given that you want to use a high frame rate, I'd consider something like that. GTK and the like aren't primarily intended for rapidly updating displays.
In my experience Xlib via MIT-SHM extension was significantly faster than SDL surfaces, not sure I used SDL in the most optimal way though.
