I'm computing the incremental mean of my input data (which is an array of 6 elements, so I'll end up with 6 means).
This is the code I'm using every time a new input array is available (obviously I also update the number of samples, etc.):
computing_mean: for (int i = 0; i < 6; i++) {
    temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) / number_of_samples;
    // Possible optimization?
    // temp_mean[i] = temp_mean[i] + divide(input[i] - temp_mean[i], number_of_samples);
}
where all the data in the code are arrays or single values of the following type:
typedef ap_fixed <36,24,AP_RND_CONV,AP_SAT> decimalNumber;
From my synthesis report, this loop has a latency of 324 cycles and an iteration latency of 54, caused mainly by the division operation.
Are there any ways I can improve the speed of the division? I tried using hls_math and the divide function, but it doesn't seem to work with my type of data.
EDIT 1: I'm including the performance profiler output from inside Vivado HLS. I'll add self-contained, reproducible code later in another edit.
As you can see, the majority of the time is spent in SDIV.
Other than trigonometric functions like sin() (FSIN = ~50-170 cycles) and cos() (FCOS = ~50-120 cycles), or things like sqrt() (FSQRT = ~22 cycles), division will always be the most painful.
FDIV is 15 cycles. FADD and FMUL are both 5.
There are occasions where you can skip division and do bit-shifting instead, if you're working with integer data and the number you're dividing by is a power of 2, but that's about it.
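For example, a minimal sketch of that shift trick (plain C, made-up values); note that the shift only matches integer division for non-negative values:
int sum = 1234;             // some integer total
int mean_div   = sum / 8;   // integer division by a power of 2
int mean_shift = sum >> 3;  // same result as long as sum is non-negative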
You can look up the approximate CPU cycle cost of any given instruction in tables like this. FDIV is an example of an expensive one.
That being said, one thing you could try is to compute the division factor in advance, then apply it using multiplication instead:
double inverse_n = 1.0 / number_of_samples;
temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) * inverse_n;
I'm not sure that's saving a whole lot, but if you really do need to shave off cycles, it's worth a shot.
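Applied to your loop, a minimal sketch could look like the following. The extra reciprocal type is my own guess at something with enough fractional bits to hold 1/number_of_samples; you still pay for one division per new input array, but only one instead of six, and the loop body itself becomes a multiply:
// Hypothetical wider-fraction type for the reciprocal; adjust to the precision you need.
typedef ap_fixed<36,4,AP_RND_CONV,AP_SAT> reciprocalNumber;

// Computed once per new input array, outside the loop:
reciprocalNumber inverse_n = reciprocalNumber(1) / number_of_samples;

computing_mean: for (int i = 0; i < 6; i++) {
    // One multiplication per element instead of one division
    temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) * inverse_n;
}
Whether this actually helps depends on how many fractional bits 1/number_of_samples needs to keep your mean accurate, so check the results against the original version.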
This is something I've been struggling with since I started working on the actual code I'm dealing with right now.
My advisor has been writing this code for the past ten years and, at some point, had to store values that we would usually keep in matrices or tensors.
In practice we work with a matrix with six independent components calculated from the virial theorem (in a molecular dynamics simulation), and he is in the habit of storing six 1D arrays, one for each component, at each recorded step, i.e. xy(n), xz(n), yz(n)..., n being the number of records.
I assume that a single array s(n,3,3) could be more efficient, as the values will be stored closer to one another (xy(n) and xz(n) have no reason to be stored side by side in memory) and would raise fewer problems with corrupted memory or wrong memory accesses. I tried to discuss it in the lab but eventually no one seemed to care, and again, this is just an assumption.
This would not have bugged me if everything in the code weren't stored like that. Every 3D quantity is stored in 3 different arrays instead of 1, and this feels wrong to me performance-wise.
Is there any measurable effect for long calculations and large data sizes? I decided to post here after resolving an error I had due to a wrong memory access with one of these arrays, as I find the combined layout more readable and the data easier to work with (s = s + ... instead of six lines of xy = xy + ..., for example).
The fact that the columns are close to each other is not very important, especially if the leading dimension n is large. Your CPU has multiple prefetch streams and can prefetch from several different arrays or columns simultaneously.
If you make some random access in an array A(n,3,3) where A is allocatable, the dimensions are not known at compile time. Therefore, the address of a random element A(i,j,k) will be address_of(A(1,1,1)) + (i-1) + (j-1)*n + (k-1)*3*n, and it will have to be calculated at execution time for every random access to the array. The calculation of the address involves 3 integer multiplications (3 CPU cycles each) and at least 3 adds (1 cycle each). Regular (predictable) accesses, however, can be optimized by the compiler using relative addresses.
If you have separate 1-index arrays, the calculation of the address involves only one integer add (1 cycle), so you get a performance penalty of at least 11 cycles for each access when using a single 3-index array.
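To make the difference concrete, here is the same arithmetic written out in C-style code with 0-based indices (the layout mimics Fortran's column-major order; the function and variable names are illustrative only):
#include <stddef.h>

// Single 3-index array A(n,3,3): element (i,j,k) lives at offset i + j*n + k*3*n,
// so each random access costs a few integer multiplies plus adds.
double get_packed(const double *A, size_t n, size_t i, size_t j, size_t k) {
    return A[i + j*n + k*3*n];
}

// Separate 1-index array: each access is just base address + i.
double get_split(const double *xy, size_t i) {
    return xy[i];
}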
Moreover, if you have 9 different arrays, each one of them can be aligned on a cache-line boundary, whereas you would be forced to use padding at the end of lines to ensure this behavior with a single array.
So I would say that in the particular case of A(n,3,3), as the last two indices are small and known at compile time, you can safely do the transformation into 9 different arrays to potentially gain some performance.
Note that if you often use the data of the 9 arrays at the same index i in a random order, reorganizing the data as A(3,3,n) will give you a clear performance increase. If A is in double precision, A(4,4,n) could be even better: if A is aligned on a 64-byte boundary, every A(1,1,i) will be located at the first position of a cache line.
Assuming that you always loop along n and inside each iteration need to access all the components of the matrix, storing the array as s(6,n) or s(3,3,n) will benefit from cache optimization:
do i=1,n
! do some calculation with s(:,i)
enddo
However, if your inner loop looks like this:
resultarray(i)=xx(i)+yy(i)+zz(i)+2*(xy(i)+yz(i)+xz(i))
don't bother to change the array layout, because you may break the SIMD optimization.
I am currently trying to implement the Baum-Welch algorithm in C, but I run into the following problem with the gamma function:
gamma(i,t) = alpha(i,t) * beta(i,t) / sum over `i` of(alpha(i,t) * beta(i,t))
Unfortunately, for large enough observation sets, alpha drops rapidly to 0 as t increases, and beta drops rapidly to 0 as t decreases, meaning that, due to rounding down, there is never a spot where both alpha and beta are non-zero, which makes things rather problematic.
Is there a way around this problem, or should I just try to increase the precision of the values? I fear the problem may just pop up again if I try this approach, as alpha and beta drop by about one order of magnitude per observation.
You should do these computations, and generally all computations for probability models, in log-space:
lg_gamma(i, t) = (lg_alpha(i, t) + lg_beta(i, t)
- logsumexp over i of (lg_alpha(i, t) + lg_beta(i, t)))
where lg_gamma(i, t) represents the logarithm of gamma(i, t), etc., and logsumexp is the function described here. At the end of the computation, you can convert to probabilities using exp, if needed (that's typically only needed for displaying probabilities, but even there logs may be preferable).
The base of the logarithm is not important, as long as you use the same base everywhere. I prefer the natural logarithm, because log saves typing compared to log2 :)
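A minimal sketch of the logsumexp part (the array and variable names are just illustrative; only exp and log from math.h are assumed):
#include <math.h>
#include <float.h>

/* log(sum_i exp(v[i])), computed without underflow by factoring out the maximum. */
double logsumexp(const double *v, int n) {
    double vmax = -DBL_MAX;
    for (int i = 0; i < n; i++)
        if (v[i] > vmax) vmax = v[i];

    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += exp(v[i] - vmax);

    return vmax + log(sum);
}

/* Then, for a fixed t:
   tmp[i]      = lg_alpha[i] + lg_beta[i];
   lg_gamma[i] = tmp[i] - logsumexp(tmp, n_states);  */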
I wrote code that numerically evaluates Legendre polynomials up to some high n-th order. For example:
....
case 8
p = (6435*x.^8-12012*x.^6+6930*x.^4-1260*x.^2+35)/128; return
case 9
...
If the vector x is long, this can become slow. I saw that there is a performance difference between, say, x.^4 and x.*x.*x.*x and thought I could use this to improve my code. I used timeit and found that for:
x=linspace(0,10,1e6);
f1 = @() power(x,4);
f2 = @() x.^4;
f3 = @() x.^2.^2;
f4 = @() x.*x.*x.*x;
f4 is faster by a factor of 2 than the rest. However, when I go to x.^6 there is very little difference between (x.*x.*x).^2 and x.*x.*x.*x.*x.*x (while all other options are slower).
Is there a way to tell what will be the most efficient way to take a power of a vector?
Can you explain why there is such a big difference in performance?
This is not exactly an answer to your question, but it may solve your problem:
x2 = x.*x; % or x.^2 or power(x,2), whichever is most efficient
p = ((((6435*x2-12012).*x2+6930).*x2-1260).*x2+35)/128;
This way you do the power just once, and only with exponent 2. This trick can be applied to all Legendre polynomials (in the odd-degree polynomials one x2 is replaced by x).
Here are some thoughts:
power(x,4) and x.^4 are equivalent (just read the doc).
x.*x.*x.*x is probably optimized to something like x.^2.^2
x.^2.^2 is probably evaluated as: Take the square of each element (fast), and take the square of that again (fast again).
x.^4 is probably directly evaluated as: Take the fourth power of each element (slow).
It is not so strange to see that 2 fast operations take less time than 1 slow operation. Just too bad that the optimization is not performed in the power 4 case, but perhaps it won't always work or come at a cost (input checking, memory?).
About the timings: actually there is much more than a factor-of-2 difference!
Since you call them through anonymous functions, the function-handle overhead is added in each case, making the relative differences smaller:
y=x;tic,power(x,4);toc
y=x;tic,x.^4;toc
y=x;tic,x.^2.^2;toc
y=x;tic,x.*x.*x.*x;toc
will give:
Elapsed time is 0.034826 seconds.
Elapsed time is 0.029186 seconds.
Elapsed time is 0.003891 seconds.
Elapsed time is 0.003840 seconds.
So, it is nearly a factor 10 difference. However, note that the time difference in seconds is still minor, so for most practical applications I would just go for the simple syntax.
It seems as though MathWorks has special-cased squares in its power function (unfortunately, it's all built-in closed source that we cannot see). In my testing on R2013b, it appears as though .^, power, and realpow use the same algorithm. For squares, I believe they have special-cased it to be x.*x.
1.0x (4.4ms): @()x.^2
1.0x (4.4ms): @()power(x,2)
1.0x (4.5ms): @()x.*x
1.0x (4.5ms): @()realpow(x,2)
6.1x (27.1ms): @()exp(2*log(x))
For cubes, the story is different. They're no longer special-cased. Again, .^, power, and realpow all are similar, but much slower this time:
1.0x (4.5ms): @()x.*x.*x
1.0x (4.6ms): @()x.*x.^2
5.9x (26.9ms): @()exp(3*log(x))
13.8x (62.3ms): @()power(x,3)
14.0x (63.2ms): @()x.^3
14.1x (63.7ms): @()realpow(x,3)
Let's jump up to the 16th power to see how these algorithms scale:
1.0x (8.1ms): @()x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x
2.2x (17.4ms): @()x.^2.^2.^2.^2
3.5x (27.9ms): @()exp(16*log(x))
7.9x (63.8ms): @()power(x,16)
7.9x (63.9ms): @()realpow(x,16)
8.3x (66.9ms): @()x.^16
So: .^, power and realpow all run in constant time with respect to the exponent, unless it was special-cased (-1 also appears to have been special-cased). Using the exp(n*log(x)) trick is also constant time with respect to the exponent, and faster. The only result I don't quite understand is why the repeated squaring is slower than the repeated multiplication.
As expected, increasing the size of x by a factor of 100 increases the time similarly for all algorithms.
So, moral of the story? When using scalar integer exponents, always do the multiplication yourself. There's a whole lot of smarts in power and friends (the exponent can be floating point, a vector, etc.). The only exceptions are where MathWorks has done the optimization for you. In 2013b, that seems to be x^2 and x^(-1). Hopefully they'll add more as time goes on. But, in general, exponentiation is hard and multiplication is easy. In performance-sensitive code, I don't think you can go wrong by always typing x.*x.*x.*x. (Of course, in your case, follow Luis' advice and make use of the intermediate results within each term!)
function powerTest(x)
f{1} = @() x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x;
f{2} = @() x.^2.^2.^2.^2;
f{3} = @() exp(16.*log(x));
f{4} = @() x.^16;
f{5} = @() power(x,16);
f{6} = @() realpow(x,16);
for i = 1:length(f)
t(i) = timeit(f{i});
end
[t,idxs] = sort(t);
fcns = f(idxs);
for i = 1:length(fcns)
fprintf('%.1fx (%.1fms):\t%s\n',t(i)/t(1),t(i)*1e3,func2str(fcns{i}));
end
I have a two-column vector with times and speeds for a subset of data, like so:
5 40
10 37
15 34
20 39
And so on. I want to take the Fourier transform of the speeds to get a frequency. How would I go about doing this with a fast Fourier transform (FFT)?
If my vector name is sampleData, I have tried
fft(sampleData);
but that gives me a vector of complex numbers. How would I go about getting sensible data that I can plot?
The Fourier transform will yield a complex vector: when you fft, you get a vector of frequency components, each with a spectral phase. These phases can be extremely important! (They contain most of the information of the time-domain signal; you won't see interference effects without them, etc.) If you want to plot the power spectrum, you can use
plot(abs(fft(sampleData)));
To complete the story, you'll probably need to fftshift and also produce a frequency vector. Here's more elaborate code:
% Assuming 'time' is the 1st col, and 'sampleData' is the 2nd col:
N=length(sampleData);
f=window(@hamming,N)';
dt=mean(diff(time));
df=1/(N*dt); % the frequency resolution (df=1/max_T)
if mod(N,2)==0
f_vec= df*((1:N)-1-N/2); % frequency vector for EVEN length vector
else
f_vec= df*((1:N)-0.5-N/2);
end
fft_data= fftshift(fft(fftshift(sampleData.*f))) ;
plot(f_vec,abs(fft_data))
I would recommend that you back up and think about what you are trying to accomplish, and whether an FFT is an appropriate tool for your situation. You say that you "want to ... get a frequency", but what exactly do you mean by that? Do you know that this data has exactly one frequency component, and want to know what the frequency is? Do you want to know both the frequency and phase of the component? Do you just want to get a rough idea of how many discrete frequency components are present? Are you interested in the spectrum of the noise in your measurement? There are many questions you can ask about "frequencies" in a data set, and whether or not an FFT and/or power spectrum is the best approach to getting an answer depends on the question.
In a comment above you asked "Is there some way to correlate the power spectrum to the time values?" This strikes me as a confused question, but also makes me think that maybe the question you are really trying to answer is "I have a signal whose frequency varies with time, and I want to get an estimate of the frequency vs time". I'm sure I've seen a question along those lines within the past few months here on SO, so I would search for that.
Given is an array of 320 elements (int16), which represents an audio signal (16-bit LPCM) of 20 ms duration. I am looking for a very simple and very fast method to decide whether this array contains active audio (like speech or music) rather than noise or silence. I don't need a very high-quality decision, but it must be very fast.
My first idea was to add up all the squares or absolute values of the elements and compare the sum with a threshold, but such a method is very slow on my system, even though it is O(n).
You're not going to get much faster than a sum-of-squares approach.
One optimization that you may not be doing so far is to use a running total. That is, at each time step, instead of summing the squares of the last n samples, keep a running total and update it with the square of the most recent sample. To keep the running total from growing without bound over time, add an exponential decay. In pseudocode:
decay_constant=0.999; // Some suitable value smaller than 1
total=0;
for t=1,...
// Exponential decay
total=total*decay_constant;
// Add in the square of the latest sample
total+=current_sample*current_sample;
if total>threshold
// do something
end
end
Of course, you'll have to tune the decay constant and threshold to suit your application. If this isn't fast enough to run in real time, you have a seriously underpowered DSP...
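For concreteness, a minimal sketch of that running total for int16 samples; the decay constant and threshold here are placeholders you would tune for your 20 ms frames:
#include <stdint.h>

static double running_energy = 0.0;   /* exponentially decayed sum of squares */

/* Call once per sample; returns nonzero while the signal looks "active". */
int is_active(int16_t sample) {
    const double decay     = 0.999;  /* placeholder, tune to your sample rate */
    const double threshold = 1.0e7;  /* placeholder, tune to your signal level */

    running_energy = running_energy * decay
                   + (double)sample * (double)sample;

    return running_energy > threshold;
}
If floating point is too expensive on your target, the same idea works in integer arithmetic by replacing the decay with something like total -= total >> 10 on a 64-bit accumulator.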
You might try calculating two simple "statistics". The first would be the spread (max - min): silence will have a very low spread. The second would be variety: divide the range of possible values into, say, 16 brackets (i.e. value ranges) and, as you go through the elements, determine which bracket each element falls into. Noise will have similar counts in all brackets, whereas music or speech should prefer some of them while neglecting others.
This should be possible to do in just one pass through the array, and you do not need complicated arithmetic, just some addition and comparison of values.
Also consider some approximation, for example taking only every fourth value, thus reducing the number of checked elements to 80. For an audio signal, this should be okay.
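A one-pass sketch of those two statistics for a 320-sample int16 frame, checking every fourth sample as suggested (the bracket trick below just maps the signed range onto 16 disjoint buckets; their order doesn't matter for counting variety):
#include <stdint.h>

/* Spread (max-min) and the number of value brackets actually used. */
void frame_stats(const int16_t *frame, int n, int *spread, int *brackets_used) {
    int16_t lo = frame[0], hi = frame[0];
    int counts[16] = {0};

    for (int i = 0; i < n; i += 4) {       /* check only every fourth sample */
        int16_t v = frame[i];
        if (v < lo) lo = v;
        if (v > hi) hi = v;
        counts[(uint16_t)v >> 12]++;       /* 16 brackets over the int16 range */
    }

    *spread = (int)hi - (int)lo;
    int used = 0;
    for (int b = 0; b < 16; b++)
        if (counts[b] > 0) used++;
    *brackets_used = used;
}
You would then compare spread and brackets_used against thresholds of your own choosing.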
I did something like this a while back. After some experimentation I arrived at a solution that worked sufficiently well in my case.
I used the rate of change in the cube of the running average over about 120 ms. When there is silence (that is, only noise), the expression should hover around zero. As soon as the rate starts increasing over a couple of runs, you probably have some action going on.
rate = cur_avg^3 - prev_avg^3
I used a cube because the square just wasn't aggressive enough. If the cube is too slow for you, try using the square and a bit shift instead. Hope this helps.
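For reference, a rough sketch of that rate test; the smoothing factor stands in for however you compute the running average over ~120 ms, and averaging the absolute sample value is my own assumption:
#include <stdint.h>
#include <stdlib.h>

static double cur_avg  = 0.0;
static double prev_avg = 0.0;

/* Call once per sample; returns the rate of change of the cubed running average. */
double cube_rate(int16_t sample) {
    const double alpha = 1.0 / 960.0;   /* ~120 ms at 8 kHz; adjust to your sample rate */

    prev_avg = cur_avg;
    cur_avg  = (1.0 - alpha) * cur_avg + alpha * (double)abs(sample);

    return cur_avg * cur_avg * cur_avg - prev_avg * prev_avg * prev_avg;
}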
It is clear that the complexity must be at least O(n). Some simple algorithm that calculates a value range is probably good enough for the moment, but I would look up Voice Activity Detection on the web and search for related code samples.