Can't run GPU_FFT in an infinite loop - c

I am currently trying to compute FFTs using the GPU_FFT library provided with the Pi. I can successfully take the FFT of my data once. However, when I call it inside a while(1) loop, it computes the FFT 1023 times and stops on the 1024th. I've read about someone who faced exactly the same issue on the official Raspberry Pi forum, but I couldn't figure out how to fix mine.
The code below is the function that computes the FFT:
void computeFFT(float *Input, float *Output) {
    struct GPU_FFT_COMPLEX *base;
    struct GPU_FFT *fft;
    int i;

    int ret = gpu_fft_prepare(mb, log2_N, GPU_FFT_FWD, 1, &fft);

    base = fft->in;
    for (i = 0; i < N; i++) {
        base[i].re = Input[i];
        base[i].im = 0.0;
    }

    gpu_fft_execute(fft);

    base = fft->out;
    for (i = 0; i < N; i++)
        Output[i] = base[i].re;

    gpu_fft_release(fft);
}
and this is my while(1) loop:
while (1) {
    if (mb != 0)
        mb = mbox_open();

    adc1Value = a2d(adc1Channel);
    voltage1 = (2.5 / 4096) * adc1Value;

    QueueGet(Queue);
    QueuePut(voltage1);

    computeFFT(Queue, OutData);
    printf("%f\n", OutData[0]); // print the signal value at 0 Hz (DC bin)
}

mbox_close(mb);
free(OutData);
return 0;
I believe the problem is related to where I call the mbox_open and mbox_close functions (see the sketch after the list of things I have tried).
Things that I have tried:
Replacing Queue with a new array of random numbers generated with rand().
Changing the location of the mbox calls, as suggested.
Disabling the a2d function.
Changing the FFT length to 512, 2048 and 4096.
None of these solved the problem.
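What I am planning to try next, based on that suspicion, is to open the mailbox and call gpu_fft_prepare() only once before the loop, execute the same plan on every pass, and only release/close after the loop. A rough, untested sketch (Queue, OutData, N and log2_N as defined elsewhere in my code):

int mb = mbox_open();                      /* once, before the loop */
struct GPU_FFT *fft;
struct GPU_FFT_COMPLEX *base;
int i;

if (gpu_fft_prepare(mb, log2_N, GPU_FFT_FWD, 1, &fft) != 0)
    return -1;                             /* prepare failed */

while (1) {
    adc1Value = a2d(adc1Channel);
    voltage1 = (2.5 / 4096) * adc1Value;
    QueueGet(Queue);
    QueuePut(voltage1);

    base = fft->in;
    for (i = 0; i < N; i++) {
        base[i].re = Queue[i];
        base[i].im = 0.0f;
    }

    gpu_fft_execute(fft);                  /* reuse the same plan every pass */

    base = fft->out;
    for (i = 0; i < N; i++)
        OutData[i] = base[i].re;

    printf("%f\n", OutData[0]);            /* value at 0 Hz */
}

gpu_fft_release(fft);                      /* only after the loop */
mbox_close(mb);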

Related

Confused about Passing user data to PortAudio Callbacks

This is my first post here and I'm fairly new to programming, especially in C. A couple of weeks ago I started working through The Audio Programming Book (MIT Press) and have been expanding on some of the examples to try to understand things further.
I think my question comes down to how I'm trying to pass data (retrieved from the user in an initialization function) to a PortAudio callback. I feel like what I've done isn't that different from the examples (both from the book and PortAudio's own, like paex_sine.c), but for some reason I can't get my code to work and I've been banging my head against a wall trying to understand why. I've searched pretty extensively for solutions or example code to study, but I kind of don't know what I don't know, so that hasn't returned much.
How do I get user data into the callback?
Am I just not understanding how pointers and structs work and trying to force them to do things they don't want to?
Or, am I just overlooking something really obvious?
The following code either gives a really high pitched output, short high pitched blips, or no (audible) output:
#include <stdio.h>
#include <math.h>
#include "portaudio.h"

#define FRAME_BLOCK_LEN 64
#define SAMPLING_RATE 44100
#define TWO_PI (3.14159265f * 2.0f)

PaStream *audioStream;
double si = 0;

typedef struct
{
    float frequency;
    float phase;
}
paTestData;

int audio_callback (const void *inputBuffer, void *outputBuffer,
                    unsigned long framesPerBuffer,
                    const PaStreamCallbackTimeInfo* timeinfo,
                    PaStreamCallbackFlags statusFlags,
                    void *userData )
{
    paTestData *data = (paTestData*)userData;
    float *out = (float*)outputBuffer;
    unsigned long i;

    // data->frequency = 400;
    for(i = 0; i < framesPerBuffer; i++){
        si = TWO_PI * data->frequency / SAMPLING_RATE; // calculate sampling-incr
        *out++ = sin(data->phase);
        *out++ = sin(data->phase);
        data->phase += si; // add sampling-incr to phase
    }
    return paContinue;
}

void init_stuff()
{
    float frequency;
    int i;
    PaStreamParameters outputParameters;
    paTestData data;

    printf("type the modulator frequency in Hz: ");
    scanf("%f", &data.frequency); // get modulator frequency
    printf("you chose data.frequency %.2f\n", data.frequency);
    data.phase = 0.0;

    printf("initializing Portaudio. Please wait...\n");
    Pa_Initialize(); // initialize Portaudio

    outputParameters.device = Pa_GetDefaultOutputDevice(); /* default output device */
    outputParameters.channelCount = 2;                     /* stereo output */
    outputParameters.sampleFormat = paFloat32;             /* 32 bit floating point output */
    outputParameters.suggestedLatency = Pa_GetDeviceInfo( outputParameters.device )->defaultLowOutputLatency;
    outputParameters.hostApiSpecificStreamInfo = NULL;

    Pa_OpenStream( // open paStream object
        &audioStream,      // portaudio stream object
        NULL,              // input params
        &outputParameters, // output params
        SAMPLING_RATE,     // SampleRate
        FRAME_BLOCK_LEN,   // frames per buffer
        paNoFlag,          // set no Flag
        audio_callback,    // callback function address
        &data );           // user data
    Pa_StartStream(audioStream); // start the callback mechanism
    printf("running... press space bar and enter to exit\n");
}

void terminate_stuff()
{
    Pa_StopStream(audioStream);  // stop callback mechanism
    Pa_CloseStream(audioStream); // destroy audio stream object
    Pa_Terminate();              // terminate portaudio
}

int main(void)
{
    init_stuff();
    while(getchar() != ' ') Pa_Sleep(100);
    terminate_stuff();
    return 0;
}
Uncommenting data->frequency = 400; at least plays a 400 Hz sine wave, but that ignores any user input collected in init_stuff().
If I put a printf("%f\n", data->frequency); inside the callback, it prints 0.000000 or something like -146730090609497866240.000000.
It's pretty unpredictable, and this really makes me think it's pointer-related.
My goal for this code is to eventually incorporate envelope generators to change the pitch and possibly incorporate wavetable oscillators so I'm not calculating sin(x) for every iteration.
I can get envelopes and wavetables to work while using a blocking API like portsf that's used in the book, but trying to adapt any of that code from earlier chapters to use PortAudio callbacks is turning my brain to mush.
Thanks so much!
The problem you're having with your callback data is that it goes out of scope, and its memory is deallocated, as soon as init_stuff finishes execution.
You should allocate memory for your callback data using malloc (or new in C++) and pass that pointer to the callback as the user data.
For example:
void init_stuff()
{
    float frequency;
    int i;
    PaStreamParameters outputParameters;
    paTestData *data = (paTestData *) malloc(sizeof(paTestData));

    printf("type the modulator frequency in Hz: ");
    scanf("%f", &(data->frequency)); // get modulator frequency
    printf("you chose data.frequency %.2f\n", data->frequency);
    data->phase = 0.0;
    ...
    Pa_OpenStream( // open paStream object
        &audioStream,      // portaudio stream object
        NULL,              // input params
        &outputParameters, // output params
        SAMPLING_RATE,     // SampleRate
        FRAME_BLOCK_LEN,   // frames per buffer
        paNoFlag,          // set no Flag
        audio_callback,    // callback function address
        data );            // user data
    ...
I wasn't able to get the original code working using malloc, but based on both suggestions I realized another workable solution: since init_stuff() returning caused my data to be deallocated, for now I am simply making all of my assignments and the call to Pa_OpenStream() from main (see the sketch below).
Works beautifully, and I can now send whatever data I want to the callback. Thanks for the help!
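For reference, the main-scope arrangement I ended up with looks roughly like this (a trimmed sketch rather than my exact code); the key point is that data lives in main(), so it stays valid for the whole life of the stream:

int main(void)
{
    PaStreamParameters outputParameters;
    paTestData data;                       /* lives until main returns */

    printf("type the modulator frequency in Hz: ");
    scanf("%f", &data.frequency);
    data.phase = 0.0f;

    Pa_Initialize();
    outputParameters.device = Pa_GetDefaultOutputDevice();
    outputParameters.channelCount = 2;
    outputParameters.sampleFormat = paFloat32;
    outputParameters.suggestedLatency =
        Pa_GetDeviceInfo(outputParameters.device)->defaultLowOutputLatency;
    outputParameters.hostApiSpecificStreamInfo = NULL;

    Pa_OpenStream(&audioStream, NULL, &outputParameters,
                  SAMPLING_RATE, FRAME_BLOCK_LEN, paNoFlag,
                  audio_callback, &data);  /* &data remains valid for the callback */
    Pa_StartStream(audioStream);

    while (getchar() != ' ') Pa_Sleep(100);
    terminate_stuff();
    return 0;
}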

A lot of 0's received when using cudaMemcpy()

I've just started to learn CUDA and I wanted to fill an array (a 2D array represented as a 1D array) with random numbers. I followed other posts in order to generate the random numbers, but I don't know if there is a problem with the generation of the numbers, with copying the memory back from the device, or with something else. The problem is that, although I have tried filling each cell of the array with the id of the thread attending it in order to see the results after copying into host memory, I receive an array that is filled with 0 in every position after recovering the data with cudaMemcpy().
I'm programming in Visual Studio 2013, with CUDA 7.5, on an i5 2500k as my processor and a GTX 960 graphics card.
Here are main and the kernel where I try to fill the array. I'll include the cuRAND initialization too. If you need to see anything else, just tell me.
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
    int id = threadIdx.x;
    curand_init(seed, id, 0, &state[id]);
}

__global__ void poblar(int * adn, curandState * state){
    curandState localState = state[threadIdx.x];
    int random = curand(&localState);
    adn[threadIdx.x] = random;
    // It doesn't matter if I use the following instruction instead, the result is still a lot of 0's
    //adn[threadIdx.x] = threadIdx.x;
}

int main()
{
    const int adnLength = NUMCROMOSOMAS * SIZECROMOSOMAS; // 256 * 128 (32,768)
    const size_t adnSize = adnLength * sizeof(int);
    int adnCPU[adnLength];
    int * adnDevice;

    cudaError_t error = cudaSetDevice(0);
    if (error != cudaSuccess)
        exit(-EXIT_FAILURE);

    curandState * randState;
    error = cudaMalloc(&randState, adnLength * sizeof(curandState));
    if (error != cudaSuccess){
        cudaFree(randState);
        exit(-EXIT_FAILURE);
    }

    //Here is initialized cuRand
    setup_cuRand<<<1, adnLength>>>(randState, unsigned(time(NULL)));

    error = cudaMalloc((void **)&adnDevice, adnSize);
    if (error == cudaErrorMemoryAllocation){ // cudaSuccess){
        cudaFree(adnDevice);
        cudaFree(randState);
        printf("\n error");
        exit(-EXIT_FAILURE);
    }

    poblar<<<1, adnLength>>>(adnDevice, randState);
    error = cudaMemcpy(adnCPU, adnDevice, adnSize, cudaMemcpyDeviceToHost);

    //After here, for any i, adnCPU[i] is 0 and I cannot figure out what is wrong
    if (error == cudaSuccess){
        for (int i = 0; i < NUMCROMOSOMAS; i++){
            for (int j = 0; j < SIZECROMOSOMAS; j++){
                printf("%i,", adnCPU[(i*SIZECROMOSOMAS) + j]);
            }
            printf("\n");
        }
    }
    return 0;
}
EDIT after the answer solved it: there was one particularity beyond the answer given, which is that you need a lower number of threads (half of that quantity worked for me) in order to seed the random numbers correctly with cuRAND. For some reason I could create the threads fine, but I couldn't seed the pseudo-random generator.
The maximum number of threads per block is 1024 on your hardware, so you cannot schedule a launch with adnLength threads in a single block if adnLength is larger than 1024.
The error you are getting is most probably a launch configuration error, and it is returned by cudaPeekAtLastError, since it occurs before any GPU work, right after the triple-angle-bracket call. cudaMemcpy may not report it, even though it does return errors from previous asynchronous calls.
The error you would most likely see is cudaErrorLaunchOutOfResources.
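To make that concrete, here is a rough sketch (not tested on the asker's card) of splitting the launch across several blocks and checking the launch result right after the kernel call. Both kernels would also need to switch from threadIdx.x to a global index; 256 threads per block is an arbitrary choice here, not a required value.

// Sketch: global indexing plus a multi-block launch, with an immediate error check.
__global__ void poblar(int *adn, curandState *state)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    curandState localState = state[id];
    adn[id] = curand(&localState);                   // setup_cuRand needs the same indexing change
}

// Host side, replacing the single-block launches:
const int THREADS = 256;                             // assumed block size (<= 1024)
int blocks = (adnLength + THREADS - 1) / THREADS;    // adnLength = 32768 -> 128 blocks
// 32768 is a multiple of 256, so no bounds check is needed in the kernels here;
// add an if (id < adnLength) guard if that ever stops being true.

setup_cuRand<<<blocks, THREADS>>>(randState, (unsigned)time(NULL));
poblar<<<blocks, THREADS>>>(adnDevice, randState);

cudaError_t launchErr = cudaPeekAtLastError();       // catches bad launch configurations
if (launchErr != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(launchErr));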

Calculating the delay between write and read on I2C in Linux

I am currently working with I2C on Arch Linux ARM and am not quite sure how to calculate the absolute minimum delay required between a write and a read. If I don't have this delay, the read naturally does not come through. I have just applied usleep(1000) between the two calls, which works, but that value was found empirically and has to be optimized to the real value (somehow). But how?
Here is the write_and_read function I am using:
int write_and_read(int handler, char *buffer, const int bytesToWrite, const int bytesToRead) {
    write(handler, buffer, bytesToWrite);
    usleep(1000);
    int r = read(handler, buffer, bytesToRead);
    if(r != bytesToRead) {
        return -1;
    }
    return 0;
}
Normally there's no need to wait. If your writing and reading functions are somehow threaded in the background (why would you do that?), then synchronizing them is mandatory.
I2C is a very simple linear communication, and all the devices I have used were able to produce the output data within microseconds.
Are you using 100 kHz, 400 kHz or 1 MHz I2C?
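If the device genuinely needs time before its data is ready, then rather than hard-coding a delay you could poll the read in short steps and stop as soon as the expected byte count arrives. A rough sketch; the 10 µs step and 1 ms ceiling are arbitrary values to adjust for your device, not measured figures:

#include <unistd.h>

/* Sketch: retry the read in small steps instead of one fixed usleep().
 * Returns 0 on success, -1 if the device never produced the data in time. */
int write_and_read_poll(int handler, char *buffer,
                        const int bytesToWrite, const int bytesToRead)
{
    if (write(handler, buffer, bytesToWrite) != bytesToWrite)
        return -1;

    int waited_us = 0;
    while (waited_us < 1000) {                 /* 1 ms upper bound (assumed) */
        int r = read(handler, buffer, bytesToRead);
        if (r == bytesToRead)
            return 0;
        usleep(10);                            /* 10 us step (assumed) */
        waited_us += 10;
    }
    return -1;
}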
Edited:
After some discussion, I suggest you try this:
void dataRequest() {
    Wire.write(0x76);
    x = 0;
}

void dataReceive(int numBytes)
{
    x = numBytes;
    for (int i = 0; i < numBytes; i++) {
        Wire.read();
    }
}
Here x is a global variable defined in the header and set to 0 in setup(). You can then add a simple condition to the main loop, e.g. if x > 0, print something with Serial.print() as a debug message and reset x to 0 (see the sketch below).
This way you are not blocking the I2C operation with the serial traffic.
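The main-loop check could look something like this (a small sketch; x is the global flag set in dataReceive() above):

void loop() {
    if (x > 0) {                  // dataReceive() has run since the last check
        Serial.print("received bytes: ");
        Serial.println(x);        // debug output happens outside the I2C handlers
        x = 0;                    // reset the flag
    }
}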

C on embedded system w/ linux kernel - mysterious adc read issue

I'm developing on an AD Blackfin BF537 DSP running uClinux, with a total of 32 MB of SDRAM available. I have an ADC attached, which I can access using a simple, blocking call to read().
The most interesting part of my code is below. Running the program seems to work just fine: I get a nice data package that I can fetch from the SD card and plot. However, if I comment out the float calculation part (as noted in the code), I get only zeroes in the ft_all.raw file. The same happens if I change the optimization level from -O3 to -O0.
I've tried countless combinations of all sorts of things, and sometimes it works, sometimes it does not. Earlier (with minor modifications to the code below), it would only work when optimization was disabled. It may also break if I add something else further down in the file.
My suspicion is that the data transferred by the read() call may not have been transferred fully (is that possible, even though it returns the correct number of bytes?). This is also the first time I initialize pointers with direct memory addresses, and I have no idea how the compiler reacts to that; perhaps I missed something here?
I've spent days on this issue now and I'm getting desperate; I would really appreciate some help with this one! Thanks in advance.
// Clear the top 16M of memory for data processing
memset((int *)0x01000000, 0x0000, (size_t)SIZE_16M);

/* Prep some pointers for data processing */
int16_t *buffer;
int16_t *buf16I, *buf16Q;
buffer = (int16_t *)(0x1000000);
buf16I = (int16_t *)(0x1600000);
buf16Q = (int16_t *)(0x1680000);

/* Read data from ADC */
int rbytes = read(Sportfd, (int16_t*)buffer, 0x200000);
if (rbytes != 0x200000) {
    printf("could not sample data! %X\n", rbytes);
    goto end;
} else {
    printf("Read %X bytes\n", rbytes);
}

FILE *outfd;
int wbytes;

/* Commenting out this region results in all zeroes in ft_all.raw */
float a, b;
int c;
b = 0;
for (c = 0; c < 1000; c++) {
    a = c;
    b = b + pow(a, 3);
}
printf("b is %.2f\n", b);

/* Only the 12 LSBs of each 32-bit word are actual data.
 * First 20 bits of nothing, then 12 bits I, then 20 bits of
 * nothing, then 12 bits Q, etc...
 * Below, the I and Q parts are scaled by a factor of 16
 * and extracted to buf16I and buf16Q.
 * */
int32_t *buf32;
buf32 = (int32_t *)buffer;
uint32_t i = 0;
uint32_t n = 0;
while (n < 0x80000) {
    buf16I[i] = buf32[n] << 4;
    n++;
    buf16Q[i] = buf32[n] << 4;
    i++;
    n++;
}

printf("Saving to /mnt/sd/d/ft_all.raw...");
outfd = fopen("/mnt/sd/d/ft_all.raw", "w+");
if (outfd == NULL) {
    printf("Could not open file.\n");
}
wbytes = fwrite((int*)0x1600000, 1, 0x100000, outfd);
fclose(outfd);
if (wbytes < 0x100000) {
    printf("wbytes not correct (= %d) \n", (int)wbytes);
}
printf(" done.\n");
Edit: The code seems to work perfectly well if I use read() to read data from a simple file rather than from the ADC. This leads me to believe that the rather hacky-looking code that extracts the I and Q parts of the input is working as intended. Inspecting the assembly generated by the compiler confirms this.
I'm trying to get in touch with the developer of the ADC driver to see if he has an explanation of this behaviour.
The ADC is connected through a SPORT and is opened like this:
sportfd = open("/dev/sport1", O_RDWR);
ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
And here are the options used when configuring the SPORT:
spconf->int_clk = 1;
spconf->word_len = 32;
spconf->serial_clk = SPORT_CLK;
spconf->fsync_clk = SPORT_CLK/34;
spconf->fsync = 1;
spconf->late_fsync = 1;
spconf->act_low = 1;
spconf->dma_enabled = 1;
spconf->tckfe = 0;
spconf->rckfe = 1;
spconf->txse = 0;
spconf->rxse = 1;
A bfin_sport.h file from Analog Devices is also included: https://gist.github.com/tausen/5516954
Update
After a long night of debugging with the previous developer on the project, it turned out the issue was not related to the code shown above at all. As Chris suggested, it was indeed an issue with the SPORT driver and the ADC configuration.
While debugging, this error message appeared whenever the data was "broken": bfin_sport: sport ffc00900 status error: TUVF. While this doesn't make much sense in the application, it was clear from printing the data that something was out of sync: whenever the status error was shown, the data in buffer was of the form 0x12000000, 0x34000000, ... rather than 0x00000012, 0x00000034, .... It is then clear why buf16I and buf16Q contained only zeroes (since I am extracting the 12 LSBs).
Putting a few calls to usleep() between the stages of ADC initialization and configuration seems to have fixed the issue (roughly as sketched below); I'm hoping it stays that way!
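In code, the workaround amounts to something like the following; the delay lengths here are guesses for illustration, not the values actually used:

sportfd = open("/dev/sport1", O_RDWR);
usleep(10000);                              /* settle time between init stages (guessed value) */
ioctl(sportfd, SPORT_IOC_CONFIG, spconf);
usleep(10000);                              /* again before configuring/starting the ADC itself */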

Measuring the maximum and minimum execution time of a specific function

I am doing some trivial benchmarking of writing x lines of the same text to a file using two methods:
Direct fwrite.
A separate thread, with communication done via an asynchronous queue (the main thread inserts on one side and the other thread reads from the other). This method is meant to minimize the slowest writes (due to flushing).
This is a snippet of the code which should give a basic idea of the program:
int i;
char * buf;
int buf_size;
double local_start, local_end, global_start, global_end;
double slowest, fastest;
double local_time_difference;

buf = "A string to be printed to a file \n";
buf_size = strlen(buf);
fastest = MAX_WRITE_TIME;
slowest = 0;

logger_init(atoi(argv[1]));
global_start = get_time();

for(i = 0 ; i < 100000000 ; i++)
{
    local_start = get_time();
    logger_write(buf, buf_size);
    local_end = get_time();

    local_time_difference = local_end - local_start;
    if(local_time_difference < fastest && local_time_difference != 0)
        fastest = local_time_difference;
    if(local_time_difference > slowest)
        slowest = local_time_difference;

    if(i % 10000 == 0)
        usleep(1);
}

global_end = get_time();
printf("Fastest: %1.9f\nSlowest: %1.9f\nTotal Time: %1.9f\n", fastest, slowest, global_end - global_start);
logger_destroy();
The get_time procedure returns the time in seconds as a double (so differences have microsecond resolution):
double get_time()
{
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec*1e-6;
}
Depending on the argument passed to logger_init, logger_write will either write directly to the file or insert the message into the queue (whose size must not exceed some particular limit); a GAsyncQueue is used for this, roughly as sketched below.
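The queued path looks roughly like this (a simplified sketch rather than the real logger code; the queue cap and file name are placeholders):

#include <glib.h>
#include <stdio.h>

#define QUEUE_LIMIT 1024                       /* assumed cap, not the real limit */

static GAsyncQueue *queue;
static FILE *logfile;

/* Writer thread: pops messages off the queue and writes them out. */
static gpointer writer_thread(gpointer unused)
{
    (void)unused;
    for (;;) {
        char *msg = g_async_queue_pop(queue);  /* blocks until a message arrives */
        fputs(msg, logfile);
        g_free(msg);
    }
    return NULL;
}

void logger_init(int use_thread)
{
    logfile = fopen("out.log", "w");           /* placeholder file name */
    queue = g_async_queue_new();
    if (use_thread)
        g_thread_new("log-writer", writer_thread, NULL);
}

/* Queued variant of logger_write: copy the buffer and hand it to the writer. */
void logger_write(const char *buf, int buf_size)
{
    while (g_async_queue_length(queue) > QUEUE_LIMIT)
        g_usleep(1);                           /* crude back-pressure to bound the queue */
    g_async_queue_push(queue, g_strndup(buf, buf_size));
}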
The method I'm currently using to calculate the fastest and slowest write certainly works, but my question is: is there a tool or profiler that would do this for me? I.e. give me statistics about each function (maximum, minimum and average call time)?
Tools that I've tried so far but had no luck with:
gprof
Zoom
Kcachegrind
VTune
TL;DR
I am looking for a tool that gives me the min, max and average execution time of a particular function, not just the overall time taken.
Use the correct high-resolution OS API functions for benchmarking (a sketch follows below).
Don't calculate the execution times from inside the measured region itself, especially not with floating-point arithmetic.
Why are you calling a sleep function? Are you trying to force a context switch or something like that? The OS will likely handle that better and more efficiently than your program.
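On the first point, a monotonic, higher-resolution replacement for the gettimeofday()-based get_time() might look like this (a sketch; CLOCK_MONOTONIC is not affected by wall-clock adjustments, and older glibc may need -lrt at link time):

#include <time.h>

/* Returns elapsed time in seconds, using the monotonic clock. */
static double get_time_monotonic(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}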
