FFT Resolution in constrained environments - c

I am sampling data from a sensor at 10 kilosamples per second. I am going to collect 512 samples continuously from this sensor and then try to do an FFT on them. Here is the problem: I am constrained to a 16-point FFT. From what I understand, I divide my 512-sample frame into blocks of 16 and take the FFT of each block individually. Once I have done that, I just merge the results side by side.
My questions:
If my sampling frequency is 10 kilo samples per second, and my FFT size is 16, then my bin size should be 625 Hz, right?
Second, am I correct in merging the FFT outputs as above?
I will be absolutely grateful for a response.

You could also do 2 layers of radix-16 FFTs and bit shuffles, plus 1 layer of radix-2 FFT butterflies, to produce the same result as an FFT of length 512.

If you collect data in 512-sample chunks but are constrained to 16-point FFT, you will have to perform the FFT 32 times for each chunk and average the results (either for each chunk or for the entire recording - your choice).
The sampling rate determines the upper limit of the frequency values you assign to the FFT results, and it doesn't matter whether you are looking at 512 samples or 16 samples at a time. Your top frequency is going to be 1/2 the sample rate = 5 kHz.
The bin spacing (sample rate divided by FFT length) for FFT lengths of 2, 4, 8, 16, 32, ... will be (in Hz):
5000
2500
1250
625
312.5
...
and so on, depending on how many samples you pass to the FFT.
I'm not going to ask why you're restricted to 16-point FFT!

If you are using a 16-point FFT, the resolution you get will be low. It can capture frequencies from 0 to 5 kHz with only 9 unique bins (N/2 + 1 for real input), spaced 625 Hz apart.
Regarding your question about the bin size, I don't understand why you need it.
I think to get better results, you can also average the sampled points to fit your 16-point FFT.
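One reading of "average the sampled points" is block averaging: every 32 consecutive samples are replaced by their mean, turning 512 samples into 16. The sketch below assumes that reading; the function names are hypothetical. Be aware that this lowers the effective sample rate by the same factor, and that a plain average is only a crude low-pass filter, so content above the new Nyquist frequency can still alias.

```c
#include <assert.h>
#include <stddef.h>

/* Replace each group of `factor` input samples by its mean.
 * Returns the number of output samples written (in_count / factor). */
size_t block_average(const double *in, size_t in_count, size_t factor,
                     double *out)
{
    assert(factor > 0);
    size_t out_count = in_count / factor;

    for (size_t i = 0; i < out_count; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < factor; j++)
            sum += in[i * factor + j];
        out[i] = sum / (double)factor;
    }
    return out_count;
}

/* Bin spacing after averaging: (fs / factor) / fft_size. */
double decimated_bin_spacing_hz(double sample_rate_hz, size_t factor,
                                size_t fft_size)
{
    assert(factor > 0 && fft_size > 0);
    return sample_rate_hz / (double)factor / (double)fft_size;
}
```

For the numbers in the question (10 kS/s, 512 samples, factor 32), the effective rate becomes 312.5 Hz, so the 16-point FFT has bins about 19.5 Hz apart but only covers 0 to 156.25 Hz.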

Related

Result from audio FFT function makes it near impossible to inspect low/mid frequencies

I am trying to build a graphical audio spectrum analyzer on Linux. I run an FFT function on each buffer of PCM samples/frames fed to the audio hardware so I can see which frequencies are the most prevalent in the audio output. Everything works, except the results from the FFT function only allocate a few array elements (bins) to the lower and mid frequencies. I understand that audio is logarithmic, and the FFT works with linear data. But with so little allocation to low/mid frequencies, I'm not sure how I can separate things cleanly to show the frequency distribution graphically. I have tried with window sizes of 256 up to 1024 bytes, and while the larger windows give more resolution in the low/mid range, it's still not that much. I am also applying a Hann function to each chunk of data to smooth out the window boundaries.
For example, I test using a mono audio file that plays tones at 120, 440, 1000, 5000, 15000 and 20000 Hz. These should be somewhat evenly distributed throughout the spectrum when interpreting them logarithmically. However, since FFTW works linearly, with a 256 element or 1024 element array only about 10% of the return array actually holds values up to about 5 kHz. The remainder of the array from FFTW contains frequencies above 10-15 kHz.
Here's roughly the result I'm after:
But this is what I'm actually getting:
Again, I understand this is probably working as designed, but I still need a way to get more resolution in the bottom and mids so I can separate the frequencies better.
What can I do to make this work?
What you are seeing is indeed the expected outcome of an FFT (Fourier Transform). The logarithmic f-axis that you're expecting is achieved by the Constant-Q transform.
Now, the implementation of the Constant-Q transform is non-trivial. The Fourier Transform has become popular precisely because there is a fast implementation (the FFT). In practice, the constant-Q transform is often implemented by using an FFT, and combining multiple high-frequency bins. This discards resolution in the higher bins; it doesn't give you more resolution in the lower bins.
To get more frequency resolution in the lower bins of the FFT, just use a longer window. But if you also want to keep the time resolution, you'll have to use a hop size that's smaller than the window size. In other words, your FFT windows will overlap.
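The trade-off between window size and hop size can be made concrete with a little bookkeeping. The helpers below are a sketch with hypothetical names; the sample rate in the note afterwards is an assumed example, since the question does not state one.

```c
#include <assert.h>
#include <stddef.h>

/* Number of full analysis windows that fit in `total` samples when each
 * window starts `hop` samples after the previous one. */
size_t frame_count(size_t total, size_t window, size_t hop)
{
    assert(window > 0 && hop > 0);
    if (total < window)
        return 0;
    return (total - window) / hop + 1;
}

/* First sample index of frame `index`. */
size_t frame_start(size_t index, size_t hop)
{
    return index * hop;
}

/* Frequency resolution depends only on the window length. */
double bin_spacing_hz(double sample_rate_hz, size_t window)
{
    assert(window > 0);
    return sample_rate_hz / (double)window;
}

/* Time resolution (interval between spectra) depends only on the hop. */
double frame_interval_s(double sample_rate_hz, size_t hop)
{
    assert(sample_rate_hz > 0.0);
    return (double)hop / sample_rate_hz;
}
```

Assuming 44100 Hz audio, a 4096-sample window with a 1024-sample hop gives bins about 10.8 Hz apart while still producing a new spectrum roughly every 23 ms; consecutive windows then overlap by 75%.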

Need a FFT for an near infinite set of data points

I need to perform a Fourier transform on a long stream of data. I made a DFT .c file that works fine, the downside is of course the speed. It is slow AF.
I am looking for a way to perform the FFT on a long stream of data.
All the FFT libs require an array of max 1024, 2048 or some even 4096 data points.
I get the data from an ADC that runs at around 128000 Hz and I need to measure data for between 1 and 10 seconds. This means an array of 128 000 to 1 280 000 samples. In my code I check the frequencies 0 to 2000. It took around 400 core ticks for one sin+cos calculation. The core runs at 480 MHz, so that costs around 1 us.
This means 2000 frequencies * 128 000 samples * 1 us = +/- 256 seconds(4 min) of analysis per 1 second of data.
And when 10 secs are used, it would cost 40 mins.
Does anyone know a faster way or a FFT solution that supports a near "infinite" data array?
If your calculation involves floating point, avoid double precision if you don't need that level of precision.
If your ADC resolution is not that high (say, less than 16 bits), you can consider using fixed-point arithmetic. This can help reduce computation time, especially if your machine does not support hardware floating-point calculation. Please refer to: Q number format
If you are using an ARM-based controller, you may want to check out this:
http://www.keil.com/pack/doc/CMSIS/DSP/html/index.html
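To illustrate the fixed-point suggestion, here is a minimal Q15 sketch (1 sign bit, 15 fractional bits, values in [-1, 1)). The type and function names are hypothetical and this is not the CMSIS-DSP API; it only shows the idea of scaling, multiplying in a wider integer, and shifting back.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;   /* value = raw / 32768 */

/* Convert a double in roughly [-1, 1) to Q15 with rounding and saturation. */
q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;
    long v = (long)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    return (q15_t)v;
}

double q15_to_double(q15_t v)
{
    return (double)v / 32768.0;
}

/* Multiply two Q15 values: the 32-bit product is Q30, so add half an LSB
 * and shift right by 15 to return to Q15.
 * Assumes arithmetic right shift of negative values, as on common compilers. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    int32_t rounded = (product + (1 << 14)) >> 15;
    if (rounded > INT16_MAX) rounded = INT16_MAX;   /* only for -1 * -1 */
    return (q15_t)rounded;
}
```

A DFT inner loop would multiply Q15 samples by Q15 sine and cosine table entries and accumulate the products in a 32-bit or 64-bit integer, avoiding floating point entirely.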

Computing the discrete fourier transform of audio data with FFTW

I am quite new to signal processing so forgive me if I rant on a bit. I have downloaded and installed FFTW for Windows. The documentation is OK but I still have queries.
My overall aim is to capture raw audio data sampled at 44100 samps/sec from the sound card on the computer (this task is already implemented using libraries and my code), and then perform the DFT on blocks of this audio data.
I am only interested in finding a range of frequency components in the audio and I will not be performing any inverse DFT. In this case, is a real to real transformation all that is necessary, hence the fftw_plan_r2r_1d() function?
My blocks of data to be transformed are 11025 samples long. My function is called as shown below. This will result in a spectrum array of 11025 bins. How do I know the maximum frequency component in the result?
I believe that the bin spacing is Fs/n, 44100/11025, so 4 Hz. Does that mean I will have a frequency spectrum in the array from 0 Hz all the way up to 44100 Hz in steps of 4, or only up to the Nyquist frequency (half the sampling rate), 22050 Hz?
This would be a problem for me as I only wish to search for frequencies from 60Hz up to 3000Hz. Is there some way to limit the transform range?
I don't see any arguments for the function, or maybe there is another way?
Many thanks in advance for any help with this.
p = fftw_plan_r2r_1d(11025, audioData, spectrum, FFTW_REDFT00, FFTW_ESTIMATE);
To answer some of your individual questions from the above:
you need a real-to-complex transform, not real-to-real
you will calculate the magnitude of the complex output bins at the frequencies of interest (magnitude = sqrt(re*re + im*im))
the frequency resolution is indeed Fs / N = 44100 / 11025 = 4 Hz, i.e. the width of each output bin is 4 Hz
for a real-to-complex transform you get N/2 + 1 output bins giving you frequencies from 0 to Fs / 2
you just ignore frequencies in which you are not interested - the FFT is very efficient so you can afford to "waste" unwanted output bins (unless you are only interested in a relatively small number of output frequencies)
Additional notes:
plan creation does not actually perform an FFT - typically you create a plan once and then use it many times (by calling fftw_execute)
for performance you probably want to use the single precision calls (e.g. fftwf_execute rather than fftw_execute, and similarly for plan creation etc)
Some useful related questions/answers on StackOverflow:
How do I obtain the frequencies of each value in an FFT?
How to get frequency from fft result?
How to generate the audio spectrum using fft in C++?
There are many more similar questions and answers which you might also want to read - search for the fft and fftw tags.
Also note that dsp.stackexchange.com is the preferred site for questions on DSP theory rather than actual specific programming problems.

Array Sum Benchmark on GPU - Odd Results?

I am currently doing some benchmark tests using OpenCL on an AMD Radeon HD 7870.
The code that I have written in JOCL (the Java bindings for OpenCL) simply adds two 2D arrays (z= x + y) but it does so many times (z=x+y+y+y+y+y+y...).
The size of the two arrays I am adding is 500 by 501 and I am looping over the number of iterations I want to add them together on the GPU. So first I add them once, then ten times, then one thousand times, etc.
The maximum number of iterations that I loop to is 100,000,000. Below is what the log file looks like when I run my code (counter is the number of times my program executes in 5 seconds):
Number of Iterations: 1
Counter: 87
FLOPS Rate: 0.0043310947 GFLOPs/s
Number of Iterations: 10
Counter: 88
FLOPS Rate: 0.043691948 GFLOPs/s
Number of Iterations: 100
Counter: 84
FLOPS Rate: 0.41841218 GFLOPs/s
Number of Iterations: 1000
Counter: 71
FLOPS Rate: 3.5104263 GFLOPs/s
Number of Iterations: 10000
Counter: 8
FLOPS Rate: 3.8689642 GFLOPs/s
Number of Iterations: 100000
Counter: 62
FLOPS Rate: 309.70895 GFLOPs/s
Number of Iterations: 1000000
Counter: 17
FLOPS Rate: 832.0814 GFLOPs/s
Number of Iterations: 10000000
Counter: 2
FLOPS Rate: 974.4635 GFLOPs/s
Number of Iterations: 100000000
Counter: 1
FLOPS Rate: 893.7945 GFLOPs/s
Do these numbers make sense? I feel that 0.97 TeraFLOPS is quite high and that I must be calculating the number of FLOPs incorrectly.
Also, I believe that the number of FLOPs I am calculating should at one point level out with an increase in the number of iterations but that is not so evident here. It seems that if I continue to increase the number of iterations, the calculated FLOPS will also increase which also leads me to believe that I am doing something wrong.
Just for reference, I am calculating the FLOPS in the following way:
FLOPS = counter * 500 * 501 * iterations / time_elapsed
Any help with this issue will be greatly appreciated.
Thank you
EDIT:
I have now done this same benchmark test looping over a range of iterations (the amount of times I add y to x) as well as array sizes. I have generated the following surface plot as can be seen at this GitHub repository
https://github.com/ke0m/Senior_Design/blob/master/JOCL/Graphing/GoodGPUPlot.PNG
I have asked the opinion of others on this plot and they mention to me that while the numbers I am calculating are feasible, they are artificially high. They say this is evident in the steep slope in the plot that does not really make any physical sense. One suggested idea as to why the slope is so steep is because the compiler converts the variable that controls the iterations (of type int) to a short and therefore forces this number to stay below 32000 (approximately). That means that I am doing less work on the GPU than I think I am and calculating a higher GFLOPS value.
Can anyone confirm this idea or offer any other ideas as to why the plot looks the way it does?
Thank you again
counter * 500 * 501 * iterations - If this is calculated with integers, the result is likely to be too large for an integer register. If so, convert to floating point before calculating.
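The overflow concern can be checked directly. The question's code is Java (JOCL), where `int` is 32 bits; the sketch below is in C with explicit widths and hypothetical function names, and only illustrates doing the operation count in floating point before dividing by the elapsed time.

```c
#include <assert.h>
#include <stdint.h>

/* Total additions performed: counter * rows * cols * iterations,
 * computed in double so the product cannot wrap around. */
double operation_count(uint64_t counter, uint64_t rows, uint64_t cols,
                       uint64_t iterations)
{
    return (double)counter * (double)rows * (double)cols * (double)iterations;
}

/* Nonzero if the same product would still fit in a signed 32-bit integer. */
int operation_count_fits_int32(uint64_t counter, uint64_t rows, uint64_t cols,
                               uint64_t iterations)
{
    return operation_count(counter, rows, cols, iterations)
           <= (double)INT32_MAX;
}

/* Rate in GFLOP/s for a run that took `seconds`. */
double gflops(uint64_t counter, uint64_t rows, uint64_t cols,
              uint64_t iterations, double seconds)
{
    assert(seconds > 0.0);
    return operation_count(counter, rows, cols, iterations) / seconds / 1e9;
}
```

For the largest case in the log, 500 * 501 * 100,000,000 is about 2.5e13, far beyond the 32-bit signed limit of about 2.1e9, so an all-integer calculation would wrap and report a meaningless rate.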
I did a matrix-matrix multiplication kernel that uses local memory optimization. On my HD7870 at stock settings, it does just about 500 billion sums and 500 billion multiplications per second, which makes 1 Teraflops. This is quite close to your calculations if your card is at stock settings too.
Yes, your calculations make sense since the gpu's peak is about 2.5 Tflops/s and you are doing the calculations in local memory / register space which is needed to get close to peak values of card.
You are doing only additions so you just add 1 per iteration(not doing any multiplication leaves one pipeline per core empty I assume so you have nearly half of the peak).
1 flops per a=b+c
so you are right about the flops values.
But when you don't give the GPU a "resonance condition for total item number", such as a multiple of 512 (a multiple of the maximum local item size), 256, or 1280 (the number of cores), your GPU will not compute at full efficiency and performance will degrade for small arrays.
Also, if you don't supply enough total warps, threads will not be able to hide the latency of main memory, just as in the 1, 10, and 100 iteration cases. Hiding memory latency needs multiple warps on a compute unit so that all the ALUs and ADDR units (I mean all pipelines) are occupied most of the time. Occupancy is very important here because there are so few operations per memory operation. If you decrease the workgroup size from 256 to 64, this can increase occupancy and so hide more latency.
Trial and error can give you an optimum peak performance. Otherwise your kernel is bottlenecked by main memory bandwidth and thread start/stop latencies.
Here:
HD 7870 SGEMM with 9x16x16 pblocking algorithm: 1150 Gflops/s for square matrix size=8208
Additionally, divisions and special functions can be counted as 50 to 200 flops per item, depending on which version is used (like a software rsqrt() vs. a hardware rsqrt() approximation).
Try array sizes that are multiples of 256 with a high iteration count like 1M, and try 64 or 128 as local items per compute unit. If you could multiply them at the same time, you could reach a higher flops throughput. You can add a multiplication of y by 2 or 3 to use the multiplication pipelines too! This way you may approach a higher flops value than before.
x=y+z*2.0f+z*3.0f+z*4.0f+z*5.0f ---->8 flops
or, to guard against the compiler's automatic optimizations,
x=y+z*randomv+z*randomval2+z*randomval3+z*randomval4
instead of
x=y+z+z+z+z ----->4 flops
Edit: I don't know whether the HD7870 uses different (an extra batch of) ALUs for double-precision (64-bit fp) operations. If it does, you can use them for mixed-precision operations to get 10% more flops throughput, because the HD7870 is capable of 64-bit at 1/8 of 32-bit speed! You can make your card explode this way.

How to extract semi-precise frequencies from a WAV file using Fourier Transforms

Let us say that I have a WAV file. In this file, is a series of sine tones at precise 1 second intervals. I want to use the FFTW library to extract these tones in sequence. Is this particularly hard to do? How would I go about this?
Also, what is the best way to write tones of this kind into a WAV file? I assume I would only need a simple audio library for the output.
My language of choice is C
To get the power spectrum of a section of your file:
collect N samples, where N is a power of 2 - if your sample rate is 44.1 kHz for example and you want to sample approx every second then go for say N = 32768 samples.
apply a suitable window function to the samples, e.g. Hanning
pass the windowed samples to an FFT routine - ideally you want a real-to-complex FFT but if all you have is a complex-to-complex FFT then pass 0 for all the imaginary input parts
calculate the squared magnitude of your FFT output bins (re * re + im * im)
(optional) calculate 10 * log10 of each magnitude squared output bin to get a magnitude value in dB
Now that you have your power spectrum you just need to identify the peak(s), which should be pretty straightforward if you have a reasonable S/N ratio. Note that frequency resolution improves with larger N. For the above example of 44.1 kHz sample rate and N = 32768 the frequency resolution of each bin is 44100 / 32768 = 1.35 Hz.
You are basically interested in estimating a spectrum, assuming you have already gone past the stage of reading the WAV and converting it into a discrete time signal.
Among the various methods, the most basic is the periodogram, which amounts to taking a windowed Discrete Fourier Transform (with an FFT) and keeping its squared magnitude. This corresponds to Paul's answer. You need a window which spans several periods of the lowest frequency you want to detect. Example: if your sinusoids can be as low as 10 Hz (period = 100 ms), you should take a window of 200 ms or 300 ms or so (or more). However, the periodogram has some disadvantages, though it's simple to compute and it's more than enough if high precision is not required:
The raw periodogram is not a good spectral estimate because of spectral bias and the fact that the variance at a given frequency does not decrease as the number of samples used in the computation increases.
The periodogram can perform better by averaging several windows, with a judicious choice of the widths (Bartlett's method). And there are many other methods for estimating the spectrum (AR modelling).
Actually, you are not exactly interested in estimating a full spectrum, but only the location of a single frequency. This can be done seeking a peak of an estimated spectrum (done as explained), but also by more specific and powerful (and complicated) methods (Pisarenko, MUSIC algorithm). They would probably be overkill in your case.
WAV files contain linear pulse code modulated (LPCM) data. That just means that it is a sequence of amplitude values at a fixed sample rate. A RIFF header is contained at the beginning of the file to convey information like sampling rate and bits per sample (e.g. 8 kHz signed 16-bit).
The format is very simple and you could easily roll your own. However, there are several libraries available to speed the process such as libsndfile. Simple Direct-media Layer (SDL)/SDL_mixer and PortAudio are two nice libraries for playback.
As for feeding the data into FFTW, you would need to buffer 1 second chunks (determine size by the sample rate and bits per sample). Then convert all of the samples to IEEE floating-point (i.e. float or double depending on the FFTW configuration--libsndfile can do this for you). Next create another array to hold the frequency domain output. Finally, create and execute an FFTW plan by passing both buffers to fftw_plan_dft_r2c_1d and calling fftw_execute with the returned fftw_plan handle.

Resources