Simple reverb algorithm when buffer is small - C

I'm trying to implement the simple delay/reverb described in this post https://stackoverflow.com/a/5319085/1562784 and I have a problem. On Windows, where I record 16-bit/16 kHz samples and get 8k samples per recording callback call, it works fine. But on Linux I get much smaller chunks from the soundcard, around 150 samples. Because of that I modified the delay/reverb code to buffer samples:
#define REVERB_BUFFER_LEN 8000

static void reverb(int16_t* Buffer, int N)
{
    int i;
    float decay = 0.5f;
    static int16_t sampleBuffer[REVERB_BUFFER_LEN] = {0};

    // Make room at the end of the buffer to append the new samples
    for (i = 0; i < REVERB_BUFFER_LEN - N; i++)
        sampleBuffer[i] = sampleBuffer[i + N];
    // Copy the new chunk of audio samples to the end of the buffer
    for (i = 0; i < N; i++)
        sampleBuffer[REVERB_BUFFER_LEN - N + i] = Buffer[i];
    // Perform the effect
    for (i = 0; i < REVERB_BUFFER_LEN - 1600; i++)
    {
        sampleBuffer[i + 1600] += (int16_t)((float)sampleBuffer[i] * decay);
    }
    // Copy the output samples
    for (i = 0; i < N; i++)
        Buffer[i] = sampleBuffer[REVERB_BUFFER_LEN - N + i];
}
This results in white noise on the output, so clearly I'm doing something wrong.
On Linux I record at 16-bit/16 kHz, same as on Windows, and I'm running Linux in VMware.
Thank you!
Update:
As indicated in the answer below, I was 'reverbing' old samples over and over again. A simple 'if' solved the problem:
for (i = 0; i < REVERB_BUFFER_LEN - 1600; i++)
{
    if ((i + 1600) >= REVERB_BUFFER_LEN - N)
        sampleBuffer[i + 1600] += (int16_t)((float)sampleBuffer[i] * decay);
}

Your loop that performs the actual reverb effect runs multiple times over the same samples, on different calls to the function. This is because you save old samples in the buffer but perform the reverb on all of the samples each time, which will likely cause them to overflow at some point.
You should only perform the reverb on the new samples, not on ones which have already been modified. I would also recommend checking for overflow and clipping to the min/max values instead of wrapping in that case.
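For example, a minimal saturating-add helper might look like this (a sketch to illustrate the clipping idea; it is not part of the original answer):
#include <stdint.h>

/* Add two 16-bit samples and clip to the valid range instead of wrapping. */
static int16_t add_clip_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return (int16_t)sum;
}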
A probably better way to perform the reverb, which will work for any input buffer size, is to maintain a circular buffer of size REVERB_SAMPLES (1600 in your case) containing the most recent samples:
#define REVERB_SAMPLES 1600   /* 100 ms at 16 kHz, as stated above */

void reverb(int16_t* buf, int len)
{
    const float decay = 0.5f;                        /* decay factor (0.5f as in the question) */
    static int16_t reverb_buf[REVERB_SAMPLES] = {0}; /* the last REVERB_SAMPLES output samples */
    static int reverb_pos = 0;

    for (int i = 0; i < len; i++) {
        int16_t new_value = buf[i] + reverb_buf[reverb_pos] * decay;
        reverb_buf[reverb_pos] = new_value;
        buf[i] = new_value;
        reverb_pos = (reverb_pos + 1) % REVERB_SAMPLES;
    }
}
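With this approach the capture callback can simply feed each chunk through the function, whatever the chunk size. A short usage sketch (the callback name and signature are made up for illustration):
/* Hypothetical capture callback: the soundcard delivers `count` 16-bit
   mono samples per call; the chunk size no longer matters. */
void on_capture(int16_t *samples, int count)
{
    reverb(samples, count);   /* apply the effect in place */
    /* ... pass `samples` on to playback or encoding ... */
}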

Related

How can I store the 50ms before and after an audio event in a circular buffer?

I am processing a dataset of 17 hours of .wav audio (16-bit PCM, 192 kHz) to simulate the "real-time" processing that will be embedded in an ESP32, Arduino DUE or Raspberry Pi, depending on the results.
How am I handling that now?
First I cut the 17-hour file into 1-minute samples, then I created a program in C that turns each file into a .CSV (skipping the entire .wav header and taking only the data field).
PS: I chose CSV to have the data in a more convenient layout for running tests in Scilab to validate the algorithms.
I then run the generated .CSV file through a second program, which opens it and fills a circular buffer with 130 ms of data (24900 values). When the buffer is full, the code starts calculating the RMS (Root Mean Square) over a moving window with 10 ms overlap; the window size is 30 ms. When I get a value greater than 1000 it is considered an event.
Below you can see an illustration of the problem, showing the window with the 50 ms before and after an event that I mean.
PS: In the illustration, Inicio, Fim and Janela mean, respectively, Start, End and Window.
My question is:
How should I save these 50ms before and after the event, since the event can occur anywhere in the buffer? And what should I do if the event lasts for more than one window?
Some data to help with understanding:
130 ms = 24900 values from my .csv file
50 ms = 9600 values
30 ms = 5700 values
10 ms = 1920 values
I've searched several sources, but most DSP and data-structures books treat these topics superficially, just illustrating what a circular buffer is and not how to deal with it in a useful way.
Here is my code sketch, which seems to be taking the wrong approach to the problem, but I really have no idea how to proceed; in this case I created a data set from 1 to 100 to ease debugging:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define window_size 3   // 30 ms
#define buffer_size 13  // 130 ms = 50 ms + 30 ms + 50 ms

int main()
{
    // Define variables.
    int buffer[buffer_size] = {0};  // circular buffer holding 130 ms
    int write = 0;
    int i = 0, j = 0;
    int read = 0;
    int read1 = 0;
    int write1 = 0;
    int counter_elements = 0;
    int number_lines = 0;
    int save_line = 0;
    char c;
    char str[1024];                 // holds the characters for the char-to-int conversion
    int inicio = 0, fim = 0;
    // RMS
    int soma_quadrado = 0;          // sum of squares
    int rms = 0;
    int pre_amostragem[5] = {0};    // samples before the event
    // Variables for reading and manipulating the file.
    FILE *fp;
    FILE *LOG;
    FILE *log_rms_final;

    // Open the file and check for NULL.
    if((fp = fopen("generator.txt","r")) == NULL)  // name of the csv to open
    {
        printf("Error! Can't open the file.\n");
        exit(1);
    }
    // Store the RMS values.
    LOG = fopen("RMSValues.csv", "a");
    // Store the 50 ms before and after an event.
    log_rms_final = fopen("Log_RMS.csv","a");

    int lines = 0;
    while(!feof(fp))
    {
        fgets(str, 1024, fp);               // read one line (up to 1023 characters) into str
        buffer[write] = atoi(str);
        write = (write + 1) % buffer_size;  // circular
        counter_elements++;
        c = fgetc(fp);
        if(c == '\n')
        {
            lines++;
        }
        printf("%d\n", lines);
        // If the buffer is full...
        if(counter_elements == buffer_size)
        {
            // ...square and sum the samples in the window...
            read1 = read;
            for(i = 0; i < window_size; i++)
            {
                soma_quadrado += buffer[read1]*buffer[read1];
                read1 = (read1 + 1) % buffer_size;
            }
            // ...and compute the RMS.
            rms = sqrt(soma_quadrado/window_size);
            fprintf(LOG, "\n %d", rms);     // store it
            if(rms > 1000)
            {
                printf("rms: %d\n", rms);
                // Store the 50 ms before the event and the window.
                write1 = write;
                for(j = 0; j < 5; j++)
                {
                    write1 = (write1 + (buffer_size - 1)) % buffer_size;
                    pre_amostragem[j] = buffer[write1];
                }
                fprintf(log_rms_final, "%s", "\n");
                for(j = 4; j >= 0; j--)
                {
                    fprintf(log_rms_final, "%d - pre \n", pre_amostragem[j]);
                }
                fprintf(log_rms_final, "%s", "\n");
                /*
                for(j = 0; j < window_size; j++)
                {
                    fprintf(log_rms_final, "%d - janela\n", buffer[read1]);
                    read1 = (read1 + 1) % buffer_size;
                }
                */
                fprintf(log_rms_final, "%s", "\n");
                // Store the 50 ms after the event.
                /*
                fseek(log_rms_final, save_line - 3, save_line);
                for(j = 0; j < 5; j++){
                    fgets(str, 1024, fp);
                    fprintf(log_rms_final, "%d - pós \n", atoi(str));
                }
                */
            }
            soma_quadrado = 0;
            rms = 0;
            read = (read + 1) % buffer_size;
            counter_elements = counter_elements - 2;
        }
        soma_quadrado = 0;
        rms = 0;
    }
    fclose(fp);
    fclose(LOG);
    fclose(log_rms_final);
    return 0;
}
Some identifiers and output labels are in Portuguese (soma_quadrado = sum of squares, pre_amostragem = pre-event samples, inicio/fim = start/end, janela = window, pós = after), but they aren't essential to understanding the problem.
I am giving you an algorithm for the solution here (a C sketch follows below).
Always record the last 50 ms (or even 60 ms) of data in a circular buffer.
If you detect a start event:
  Copy the previous 50 ms from the circular buffer to a final buffer.
  Continue writing received data into the final buffer, starting at the 50 ms position.
If you detect an end event:
  Continue writing into the final buffer for 50 ms more.
  Then start writing into the circular buffer again.
If you have multiple events, you need multiple final buffers and can repeat the process.
As mentioned in the comments, this solution also works for a window size greater than 50 ms; you just need to size the final buffer accordingly.
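A minimal C sketch of that state machine, fed one sample at a time (the sample counts, the buffer sizes and the way the RMS result is passed in are assumptions for illustration, not taken from the question):
/* Sketch only: PRE_SAMPLES/POST_SAMPLES assume 50 ms at 192 kHz, and
   FINAL_MAX is an arbitrary capacity; adjust to the real rates. */
#define PRE_SAMPLES   9600
#define POST_SAMPLES  9600
#define FINAL_MAX     (16 * 9600)

static int circ[PRE_SAMPLES];      /* always holds the last 50 ms          */
static int circ_pos = 0;
static int final_buf[FINAL_MAX];   /* pre + event + post samples           */
static int final_len = 0;

enum state { IDLE, IN_EVENT, POST };
static enum state st = IDLE;
static int post_left = 0;

/* Feed one sample at a time; event_active is the result of the RMS test. */
void process_sample(int x, int event_active)
{
    if (st == IDLE) {
        circ[circ_pos] = x;
        circ_pos = (circ_pos + 1) % PRE_SAMPLES;
        if (event_active) {
            /* start event: copy the previous 50 ms, oldest sample first */
            for (int i = 0; i < PRE_SAMPLES; i++)
                final_buf[i] = circ[(circ_pos + i) % PRE_SAMPLES];
            final_len = PRE_SAMPLES;
            st = IN_EVENT;
        }
    } else if (st == IN_EVENT) {
        if (final_len < FINAL_MAX)
            final_buf[final_len++] = x;   /* event samples                 */
        if (!event_active) {              /* end event detected            */
            st = POST;
            post_left = POST_SAMPLES;
        }
    } else {                              /* POST: write 50 ms more        */
        if (final_len < FINAL_MAX)
            final_buf[final_len++] = x;
        if (--post_left == 0) {
            /* final_buf[0..final_len) now holds 50 ms + event + 50 ms;
               hand it off, then go back to filling the circular buffer.   */
            st = IDLE;
            final_len = 0;
        }
    }
}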

FFTW plan segmentation fault

I am using FFTW3 to perform an FFT on multiple columns of data (i.e. multi-channel audio, where I want the transform of each channel). This works fine on OS X, but porting the code over to Linux gives me a segfault.
const int fftwFlags = FFTW_PRESERVE_INPUT|FFTW_PATIENT;

struct fft {
    fftw_complex **complexSig;
    double **realSig;
    fftw_plan forwardR2C;
    int fftLen;
    int numChan;
};

void createFFT(struct fft *fft) {
    int bufLen = 1024;
    int numChan = 4;
    fft->fftLen = bufLen;
    fft->numChan = numChan;

    fft->realSig = fftw_malloc(sizeof(double *) * numChan);
    for(int i = 0; i < numChan; i++) {
        fft->realSig[i] = fftw_malloc(sizeof(double) * bufLen);
    }
    fft->complexSig = fftw_malloc(sizeof(fftw_complex *) * numChan);
    for(int i = 0; i < numChan; i++) {
        fft->complexSig[i] = fftw_malloc(sizeof(fftw_complex) * bufLen);
    }

    fft->forwardR2C = fftw_plan_many_dft_r2c(1, &fft->fftLen, fft->numChan, *fft->realSig, &fft->fftLen, 1, fft->fftLen, *fft->complexSig, &fft->fftLen, 1, fft->fftLen, fftwFlags);
}
valgrind shows that the FFTW planner is attempting to access past the end of this array (by 8 bytes, one sample), resulting in the segmentation fault. When I increase the amount of memory allocated to realSig to bufLen * 2, the error goes away.
I am sure this is an error in how I am telling FFTW to read my data, but I cannot spot it!
You seem to be assuming that successive malloc calls will be contiguous, which of course they are unlikely to be (you probably just "got lucky" on OS X). You can fix this quite easily though by making one large allocation, e.g.
void createFFT(struct fft *fft)
{
    const int bufLen = 1024;
    const int numChan = 4;
    fft->fftLen = bufLen;
    fft->numChan = numChan;

    // array of numChan row pointers
    fft->realSig = fftw_malloc(sizeof(double *) * numChan);
    // one large contiguous block of size `numChan * bufLen`
    fft->realSig[0] = fftw_malloc(sizeof(double) * numChan * bufLen);
    for(int i = 1; i < numChan; i++) // init the remaining row pointers
    {
        fft->realSig[i] = fft->realSig[i - 1] + bufLen;
    }
    // ...
}
Note: when you're done you just need to:
fftw_free(fft->realSig[0]);
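The complexSig array can be laid out the same way. A sketch of that remaining piece, plus a matching cleanup routine (destroyFFT is not part of the original answer, just an illustration of how the allocations pair up):
// continuation of createFFT (sketch): lay out complexSig the same way
fft->complexSig = fftw_malloc(sizeof(fftw_complex *) * numChan);
fft->complexSig[0] = fftw_malloc(sizeof(fftw_complex) * numChan * bufLen);
for(int i = 1; i < numChan; i++)
{
    fft->complexSig[i] = fft->complexSig[i - 1] + bufLen;
}

// matching cleanup (sketch): free each contiguous block once, then the pointer arrays
void destroyFFT(struct fft *fft)
{
    fftw_destroy_plan(fft->forwardR2C);
    fftw_free(fft->realSig[0]);
    fftw_free(fft->realSig);
    fftw_free(fft->complexSig[0]);
    fftw_free(fft->complexSig);
}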

Concatenating uint8_t to a char*

I'm really new to C and I'm having a bit of trouble creating a char* from various uint8_t values.
My idea is to create a char* where each location holds a number from a matrix.
For example if I have a matrix with:
[1][2][3][4]
[5][6][7][8]
[9][0][1][2]
[3][4][5][6]
I'd like a char* that's "1234567890123456".
What I'm doing, but which isn't working, is:
char* string = malloc(sizeof(char)*matrix->height*matrix->width);
for (int i = 0; i < matrix->height ; ++i) {
    for (int j = 0; j < matrix->width ; ++j) {
        string[i*matrix->height+j] = matrix->value[i][j];
    }
}
Of course it's not working, but I'm a bit lost on how to proceed and I can't find more information regarding this problem.
Any help would be nice,
thanks
Since you're trying to print a string, you need the ASCII character for each digit. So simply add '0' to each number, like so:
char* string = malloc(sizeof(char)*(matrix->height*matrix->width + 1));
for (int i = 0; i < matrix->height ; ++i) {
    for (int j = 0; j < matrix->width ; ++j) {
        string[i*matrix->width+j] = matrix->value[i][j] + '0';
    }
}
string[matrix->height*matrix->width] = 0; // null terminator
Note however this isn't exactly the most portable solution.
Also, notice that you want to multiply i by the width, because if you didn't have a square matrix your calculation wouldn't work correctly.
It's also unnecessary to multiply by sizeof(char), since sizeof(char) is defined to be 1 regardless of how many bits a byte has.
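For completeness, a short usage sketch (assuming the corrected loop above has already run on the example matrix):
printf("%s\n", string);   // prints "1234567890123456" for the example matrix
free(string);             // the buffer came from malloc, so release it when done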

CUDA programming - passing nested structs to kernels

I'm new to CUDA C and I'm trying to parallelize the following piece of the slave_sort function, which, as you will see, is already parallelized to work with POSIX threads.
I have the following structs:
typedef struct{
    long densities[MAX_RADIX];
    long ranks[MAX_RADIX];
    char pad[PAGE_SIZE];
} prefix_node;

struct global_memory {
    long Index;                                          /* process ID */
    struct prefix_node prefix_tree[2 * MAX_PROCESSORS];
} *global;
void slave_sort(){
    .
    .
    .
    long *rank_me_mynum;
    struct prefix_node* n;
    struct prefix_node* r;
    struct prefix_node* l;
    .
    .
    MyNum = global->Index;
    global->Index++;
    n = &(global->prefix_tree[MyNum]);
    for (i = 0; i < radix; i++) {
        n->densities[i] = key_density[i];
        n->ranks[i] = rank_me_mynum[i];
    }
    offset = MyNum;
    level = number_of_processors >> 1;
    base = number_of_processors;
    while ((offset & 0x1) != 0) {
        offset >>= 1;
        r = n;
        l = n - 1;
        index = base + offset;
        n = &(global->prefix_tree[index]);
        if (offset != (level - 1)) {
            for (i = 0; i < radix; i++) {
                n->densities[i] = r->densities[i] + l->densities[i];
                n->ranks[i] = r->ranks[i] + l->ranks[i];
            }
        } else {
            for (i = 0; i < radix; i++) {
                n->densities[i] = r->densities[i] + l->densities[i];
            }
        }
        base += level;
        level >>= 1;
    }
MyNum is the processor index (the process ID). After porting the code to a kernel, I want MyNum to be represented by blockIdx.x. The problem is that I get confused by the structs; I don't know how to pass them to the kernel. Can anyone help me?
Is the following code right?
__global__ void testkernel(prefix_node *prefix_tree, long *dev_rank_me_mynum, long *key_density, long radix)
    int i = threadIdx.x + blockIdx.x*blockDimx.x;
    prefix_node *n;
    prefix_node *l;
    prefix_node *r;
    long offset;
    .
    .
    .
    n = &prefix_tree[blockIdx.x];
    if((i%numthreads) == 0){
        for(int j=0; j<radix; j++){
            n->densities[j] = key_density[j + radix*blockIdx.x];
            n->ranks[i] = dev_rank_me_mynum[j + radix*blockIdx.x];
        }
        .
        .
        .
    }
int main(...){
    long *dev_rank_me_mynum;
    long *key_density;
    prefix_node *prefix_tree;
    long radix = 1024;

    cudaMalloc((void**)&dev_rank_me_mynum, radix*numblocks*sizeof(long));
    cudaMalloc((void**)&key_density, radix*numblocks*sizeof(long));
    cudaMalloc((void**)&prefix_tree, numblocks*sizeof(prefix_node));

    testkernel<<<numblocks,numthreads>>>(prefix_tree, dev_rank_me_mynum, key_density, radix);
}
The host API code you have posted in your edit looks fine. The prefix_node structure only contains statically declared arrays, so all that is needed is a single cudaMalloc call to allocate memory for the kernel to work on. Your method of passing prefix_tree to the kernel is also fine.
The kernel code, although incomplete and containing a couple of obvious typos, is another story. It seems that your intention is to only have a single thread per block operate on one "node" of the prefix_tree. That will be terribly inefficient and utilise only a small portion of the GPU's total capacity. For example why do this:
prefix_node *n = &prefix_tree[blockIdx.x];
if((i%numthreads) == 0){
    for(int j=0; j<radix; j++){
        n->densities[j] = key_density[j + radix*blockIdx.x];
        n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x];
    }
    .
    .
    .
}
when you could do this:
prefix_node *n = &prefix_tree[blockIdx.x];
for(int j=threadIdx.x; j<radix; j+=blockDim.x){
    n->densities[j] = key_density[j + radix*blockIdx.x];
    n->ranks[j] = dev_rank_me_mynum[j + radix*blockIdx.x];
}
which coalesces the memory reads and uses as many threads in the block as you choose to run, rather than just one, and should be many times faster as a result. So perhaps you should rethink your strategy of directly translating the serial C code you posted into a kernel....

Mono to Stereo conversion

I have the following issue: I get a block of samples (uint16_t*) representing audio data, and the device generating them captures mono sound, so obviously I have mono audio data on 1 channel. I need to pass this data to another device, which expects interleaved stereo data (so, 2 channels). What I want to do is basically duplicate the single channel so that both channels of the stereo data contain the same samples. Can you point me to an efficient algorithm for doing this?
Thanks,
f.
If you just want interleaved stereo samples then you could use a function like this:
void interleave(const uint16_t * in_L, // mono input buffer (left channel)
const uint16_t * in_R, // mono input buffer (right channel)
uint16_t * out, // stereo output buffer
const size_t num_samples) // number of samples
{
for (size_t i = 0; i < num_samples; ++i)
{
out[i * 2] = in_L[i];
out[i * 2 + 1] = in_R[i];
}
}
To generate stereo from a single mono buffer then you would just pass the same pointer for in_L and in_R, e.g.
interleave(mono_buffer, mono_buffer, stereo_buffer, num_samples);
You might want to do the conversion in place to save some memory, depending on how little memory the device in question has. In that case you could use something like this instead of Paul R's approach:
void interleave(uint16_t buf[], const int len)
{
    for (int i = len / 2 - 1, j = len - 1; i >= 0; --i) {
        buf[j--] = buf[i];
        buf[j--] = buf[i];
    }
}
When getting the sound data from the mono device, you allocate a buffer that's twice as big as needed and pass that to the mono device. This will fill half the buffer with mono audio. You then pass that buffer to the above function, which converts it to stereo. And finally you pass the buffer to the stereo device. You save an extra allocation and thus use 33% less memory for the conversion.
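A sketch of that flow using the in-place interleave() above (read_mono_device() and write_stereo_device() are hypothetical placeholders for the real capture and playback calls, not actual APIs):
#include <stdint.h>
#include <stdlib.h>

void capture_and_play(size_t num_samples)
{
    /* allocate twice the mono size so the expansion can happen in place */
    uint16_t *buf = malloc(2 * num_samples * sizeof(uint16_t));
    if (buf == NULL)
        return;
    read_mono_device(buf, num_samples);       /* mono fills the first half  */
    interleave(buf, (int)(2 * num_samples));  /* expand to stereo in place  */
    write_stereo_device(buf, 2 * num_samples);
    free(buf);
}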
Pass the same pointer for both channels? If that violates restrict rules, use memcpy().
Sorry, but your question is otherwise too broad. API? OS? CPU architecture?
You are going to have to copy the buffer and duplicate it. As you haven't told us the format or how it is terminated, I can't give exact code, but it will look like a simple for loop.
int_16* allocateInterleaved(int_16* data, int length)
{
    int i;
    int_16 *copy = malloc(sizeof(int_16)*length*2);
    if(copy == NULL) {
        /* handle error */
    }
    for(i = 0; i < length; i++) {
        copy[2*i] = data[i];
        copy[2*i+1] = data[i];
    }
    return copy;
}
Forgive any glaring typos, my C is a bit rusty. typedef whatever type you need for a signed 16-bit integer as int_16. Don't forget to free the copy buffer, or better yet reuse it.
You need to interleave the data, but if the frame length is anything greater than one, none of the above solutions will work. The code below accounts for variable frame lengths.
void Interleave(BYTE* left, BYTE* right, BYTE* stereo, int numSamples_in, int frameSize)
{
    int writeIndex = 0;
    for (size_t j = 0; j < numSamples_in; j++)
    {
        for (int k = 0; k < frameSize; k++)
        {
            int index = j * frameSize + k;
            stereo[k + writeIndex] = left[index];
            stereo[k + writeIndex + frameSize] = right[index];
        }
        writeIndex += 2 * frameSize;
    }
}
