I wrote code that numerically evaluates Legendre polynomials up to some high n-th order. For example:
....
case 8
p = (6435*x.^8-12012*x.^6+6930*x.^4-1260*x.^2+35)/128; return
case 9
...
If the vector x is long this can become slow. I saw that there is a performance difference between, say, x.^4 and x.*x.*x.*x and thought I could use this to improve my code. I've used timeit and found that for:
x = linspace(0,10,1e6);
f1 = @() power(x,4);
f2 = @() x.^4;
f3 = @() x.^2.^2;
f4 = @() x.*x.*x.*x;
f4 is faster than the rest by a factor of about 2. However, when I go to x.^6 there is very little difference between (x.*x.*x).^2 and x.*x.*x.*x.*x.*x (while all other options are slower).
Is there a way to tell what will be the most efficient way to take a power of a vector?
Can you explain why there is such a big difference in performance?
This is not exactly an answer to your question, but it may solve your problem:
x2 = x.*x; % or x.^2 or power(x,2), whichever is most efficient
p = ((((6435*x2-12012).*x2+6930).*x2-1260).*x2+35)/128;
This way you compute the power just once, and only with exponent 2. The same trick applies to all Legendre polynomials (the odd-degree polynomials are x times a polynomial in x2, so you factor out one x).
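For example, for an odd degree you factor out one x and apply the same nested (Horner) scheme to the rest; a sketch for P7, which is not one of the cases shown in the question:
x2 = x.*x;
% P7(x) = (429*x.^7 - 693*x.^5 + 315*x.^3 - 35*x)/16,
% written as x times a polynomial in x2:
p = x .* (((429*x2 - 693).*x2 + 315).*x2 - 35) / 16;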
Here are some thoughts:
power(x,4) and x.^4 are equivalent (just read the doc).
x.*x.*x.*x is probably optimized to something like x.^2.^2
x.^2.^2 is probably evaluated as: Take the square of each element (fast), and take the square of that again (fast again).
x.^4 is probably directly evaluated as: Take the fourth power of each element (slow).
It is not so strange to see that 2 fast operations take less time than 1 slow operation. It is just too bad that the optimization is not performed in the power-4 case, but perhaps it does not always work or comes at a cost (input checking, memory?).
About the timings: actually there is much more of a difference than a factor of 2!
Because you call them through function handles, the function-call overhead is added in each case, making the relative differences smaller:
y=x;tic,power(x,4);toc
y=x;tic,x.^4;toc
y=x;tic,x.^2.^2;toc
y=x;tic,x.*x.*x.*x;toc
will give:
Elapsed time is 0.034826 seconds.
Elapsed time is 0.029186 seconds.
Elapsed time is 0.003891 seconds.
Elapsed time is 0.003840 seconds.
So, it is nearly a factor 10 difference. However, note that the time difference in seconds is still minor, so for most practical applications I would just go for the simple syntax.
It seems as though MathWorks has special-cased squares in its power function (unfortunately it is all built-in, closed-source code that we cannot see). In my testing on R2013b, it appears as though .^, power, and realpow use the same algorithm. For squares, I believe they have special-cased it to be x.*x.
1.0x (4.4ms): @()x.^2
1.0x (4.4ms): @()power(x,2)
1.0x (4.5ms): @()x.*x
1.0x (4.5ms): @()realpow(x,2)
6.1x (27.1ms): @()exp(2*log(x))
For cubes, the story is different. They're no longer special-cased. Again, .^, power, and realpow all are similar, but much slower this time:
1.0x (4.5ms): @()x.*x.*x
1.0x (4.6ms): @()x.*x.^2
5.9x (26.9ms): @()exp(3*log(x))
13.8x (62.3ms): @()power(x,3)
14.0x (63.2ms): @()x.^3
14.1x (63.7ms): @()realpow(x,3)
Let's jump up to the 16th power to see how these algorithms scale:
1.0x (8.1ms): @()x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x
2.2x (17.4ms): @()x.^2.^2.^2.^2
3.5x (27.9ms): @()exp(16*log(x))
7.9x (63.8ms): @()power(x,16)
7.9x (63.9ms): @()realpow(x,16)
8.3x (66.9ms): @()x.^16
So: .^, power, and realpow all run in constant time with regard to the exponent, unless it has been special-cased (-1 also appears to have been special-cased). The exp(n*log(x)) trick also runs in constant time with regard to the exponent, and is faster. The only result I don't quite understand is why the repeated squaring (x.^2.^2.^2.^2) is slower than the explicit multiplication.
As expected, increasing the size of x by a factor of 100 increases the time similarly for all algorithms.
So, moral of the story? When using scalar integer exponents, always do the multiplication yourself. There is a whole lot of smarts in power and friends (the exponent can be floating point, a vector, etc.). The only exceptions are where MathWorks has done the optimization for you; in R2013b it seems to be x.^2 and x.^(-1). Hopefully they will add more as time goes on. But, in general, exponentiation is hard and multiplication is easy. In performance-sensitive code, I don't think you can go wrong by always typing x.*x.*x.*x. (Of course, in your case, follow Luis' advice and reuse the intermediate results within each term!)
function powerTest(x)
f{1} = @() x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x.*x;
f{2} = @() x.^2.^2.^2.^2;
f{3} = @() exp(16.*log(x));
f{4} = @() x.^16;
f{5} = @() power(x,16);
f{6} = @() realpow(x,16);
for i = 1:length(f)
t(i) = timeit(f{i});
end
[t,idxs] = sort(t);
fcns = f(idxs);
for i = 1:length(fcns)
fprintf('%.1fx (%.1fms):\t%s\n',t(i)/t(1),t(i)*1e3,func2str(fcns{i}));
end
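Building on the "do the multiplication yourself" advice, here is a small helper, offered only as a sketch (it is not from the answers above), that evaluates an integer power with element-wise multiplications via binary exponentiation:
function y = intpow(x, n)
% INTPOW  Element-wise x.^n using only .* (binary exponentiation).
% Assumes n is a non-negative integer scalar.
y = ones(size(x));
b = x;
while n > 0
    if mod(n, 2) == 1
        y = y .* b;        % fold in the current power of the base
    end
    b = b .* b;            % square the base
    n = floor(n / 2);
end
end
This performs O(log n) element-wise multiplications instead of n-1; note, though, that in the timings above a straight x.*x chain still came out fastest, so measure with timeit before committing to it.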
I'm computing the incremental mean of my input data (which is an array of 6 elements, so I'll end up with 6 means).
This is the code I'm using every time a new input array is available (obviously I also update the number of samples, etc.):
computing_mean: for (int i = 0; i < 6; i++) {
    temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) / number_of_samples;
    // Possible optimization?
    // temp_mean[i] = temp_mean[i] + divide(input[i] - temp_mean[i], number_of_samples);
}
Where all the data in the code are arrays or single values of the following type:
typedef ap_fixed <36,24,AP_RND_CONV,AP_SAT> decimalNumber;
From my synthesis report, this loop has a latency of 324 cycles and an iteration latency of 54 cycles, caused mainly by the division operation.
Are there any ways I can improve the speed of the division? I tried using hls_math and the divide function, but it doesn't seem to work with my type of data.
EDIT 1: I'm including the performance profiler output from Vivado HLS. I'll add self-contained, reproducible code later in another edit.
As you can see, the majority of the time is spent in SDIV.
Other than trigonometric functions like sin() (FSIN = ~50-170 cycles) and cos() (FCOS = ~50-120 cycles), or things like sqrt() (FSQRT = ~22 cycles), division will always be the most painful.
FDIV is 15 cycles. FADD and FMUL are both 5.
There are occasions where you can skip division and do bit-shifting instead, if you're working with integer data and the number you're dividing by is a power of 2, but that's about it.
You can look up the approximate CPU cycle cost of any given instruction in tables like this. FDIV is an example of an expensive one.
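As a tiny illustration of that shift trick (it does not apply to the code in the question, where the divisor is a runtime value rather than a constant power of two):

#include <stdint.h>

/* For unsigned integers, dividing by a power of two gives the same result
 * as a right shift by the corresponding number of bits. */
uint32_t div_by_8(uint32_t n) { return n / 8; }
uint32_t shr_by_3(uint32_t n) { return n >> 3; }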
That being said, one thing you could try is to compute the division factor in advance, then apply it using multiplication instead:
double inverse_n = 1.0 / number_of_samples;  // note 1.0, not 1, to avoid integer division; compute this once, outside the loop
temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) * inverse_n;
I'm not sure that's saving a whole lot, but if you really do need to shave off cycles, it's worth a shot.
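A minimal sketch of that idea using the question's fixed-point type; the function name and signature are made up for illustration, and with only 12 fractional bits in ap_fixed<36,24,...> the precomputed reciprocal loses precision for large sample counts, so a wider intermediate type may be needed:

#include "ap_fixed.h"

typedef ap_fixed<36,24,AP_RND_CONV,AP_SAT> decimalNumber;

// Hoist the single division out of the loop so that the loop body only
// contains subtract, multiply, and add.
void update_means(decimalNumber temp_mean[6], const decimalNumber input[6],
                  decimalNumber number_of_samples) {
    decimalNumber inverse_n = decimalNumber(1) / number_of_samples; // one division per call
computing_mean:
    for (int i = 0; i < 6; i++) {
        temp_mean[i] += (input[i] - temp_mean[i]) * inverse_n;
    }
}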
A search method has time complexity O(n^2), where n is the number of states in the space to be searched. If it takes 1 second to search a space of a thousand states, roughly how long will it take to search a space of a million states?
I have found that it is approximately 12 days, but I think the way I found it is not quite right.
I scaled the 1 second by the square of the size ratio, (10^6 / 10^3)^2 = 10^6 seconds, then divided by 86,400 (seconds in a day) and got about 11.6, so approximately 12 days. Is there a better and more rigorous solution?
There is not nearly enough information to answer this question; see any description of Big-O.
O(N^2) means only that the algorithm's execution time will eventually be dominated by an N^2 term. As N grows large, the ratio between two execution times asymptotically approaches the square of the ratio of the problem sizes. It says nothing about the execution time for particular values.
Let's keep this simple and assume some set-up overhead: an O(N) array initialization plus a constant system start-up cost. This makes the execution time
t = a * N^2 + b * N + c
for some values of a, b, and c. Even if we know that this is the form of the equation, we do not have enough information to solve for the coefficients given only one (t, N) data point, so we cannot derive t for N = 10^6.
I suspect that whoever posed this problem is looking for the invalid solution, the one that makes the unwarranted assumption that at N = 1000 all smaller terms are already insignificant. In that case, simply scale up by the square of the size ratio:
N2 / N1 = 10^6 / 10^3 = 10^3
Scale up by the square of that ratio: (10^3)^2 = 10^6
That gives you 10^6 times the original 1 second.
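Spelling out the arithmetic:
t2 = t1 * (N2 / N1)^2 = 1 second * (10^3)^2 = 10^6 seconds
10^6 seconds / 86,400 seconds per day ≈ 11.6 days
which is roughly the 12 days computed in the question.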
Suppose I have an expression, a triple sum over arrays x, y, and z, whose value I need to compute, where the bounds are finite and known. What is the fastest or most efficient way to calculate such a sum in scipy/numpy? It could be done with nested for loops, but is there a better way?
How about
np.dot(x[:amax], np.cumsum(y[:amax] * np.sum(z[cmin:cmax])))
np.einsum may be an option for these kinds of sums, too. As nevsan showed, though, since b is bounded by a you need to use np.cumsum first, and np.einsum should not be faster in the given example.
it could look like this:
y_acc = np.add.accumulate(y[:amax])  # same as cumsum
result = np.einsum('i,i,j->', x[:amax], y_acc, z[cmin:cmax])
However, this is incredibly slow, because einsum does not exploit the fact that the z summation only needs to be done once, so you have to reformulate it by hand:
result = np.einsum('i,i->', x[:amax], y_acc) * z[cmin:cmax].sum()
In this case, though, that should still be slower than nevsan's np.dot-based approach, since dot is normally better optimized (i.e. np.einsum('i,i->', a, b) is slower than np.dot(a, b)). However, if you have more arrays to sum over, it may be a nice option.
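A small self-contained check of the vectorized expression against the naive nested loops, assuming the sum has the form sum over a of (sum over b <= a of (sum over c of x[a]*y[b]*z[c])), which is the shape the answers above vectorize; the array sizes and bounds below are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
x, y, z = rng.random(50), rng.random(50), rng.random(50)
amax, cmin, cmax = 40, 10, 30

# naive nested loops
total = 0.0
for a in range(amax):
    for b in range(a + 1):          # b runs from 0 up to and including a
        for c in range(cmin, cmax):
            total += x[a] * y[b] * z[c]

# vectorized: cumsum handles the b <= a bound, the z sum factors out
vec = np.dot(x[:amax], np.cumsum(y[:amax])) * z[cmin:cmax].sum()
print(np.allclose(total, vec))      # expected: True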
I'd like to have a MATLAB array fill a column with numbers in increments of 0.001. I am working with arrays of around 200,000,000 rows and so would like to use the most efficient method possible. I had considered using the following code:
for i = 1 : size(array,1)
array(i,1) = i * 0.001;
end
There must be a more efficient way of doing this..?
Well, the accepted answer is pretty close to being fast, but not fast enough. You should use:
s=size(array,1);
step=0.0001;
array(:,1)=[step:step:s*step];
There are two issues with the accepted answer
you don't need to transpose
you should include the step inside the vector, instead of multiplying
and here is a comparison (sorry, I am running 32-bit MATLAB)
array=rand(10000);
s=size(array,1);
step=0.0001;
tic
for i=1:100000
array(:,1)=[step:step:s*step];
end
toc
and
tic
for i=1:100000
array(:, 1)=[1:s]'*step;
end
toc
the results are:
Elapsed time is 3.469108 seconds.
Elapsed time is 5.304436 seconds.
and without transposing in the second example
Elapsed time is 3.524345 seconds.
I suppose in your case things would be worse.
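For a fairer measurement than a hand-rolled tic/toc loop, timeit can also be used. A quick sketch (n is chosen smaller than the 200,000,000 rows in the question so it fits comfortably in memory):
n = 1e7;
step = 0.001;
f1 = @() (step:step:n*step).';   % step built into the colon expression
f2 = @() (1:n).' * step;         % colon first, then scale
fprintf('with step: %.4f s, scaled: %.4f s\n', timeit(f1), timeit(f2));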
array(:,1) = [1:size(array,1)]' * 0.001;
MATLAB is more efficient when loops are vectorized; see also the performance tips from MathWorks.
If such vectorization is infeasible due to space limitations, you might want to consider rewriting your for-loop in C as a MEX function.
You can also try this:
n = 20000000;  % number of rows (avoid naming the variable "size", which shadows the built-in function)
array(1:n,1) = (1:n).' * 0.001;
Given is an array of 320 elements (int16), which represents an audio signal (16-bit LPCM) of 20 ms duration. I am looking for a very simple and very fast method to decide whether this array contains active audio (like speech or music) rather than noise or silence. I don't need a very high-quality decision, but it must be very fast.
My first thought was to sum the squares or absolute values of the elements and compare the sum with a threshold, but such a method is very slow on my system, even though it is O(n).
You're not going to get much faster than a sum-of-squares approach.
One optimization that you may not be doing so far is to use a running total. That is, in each time step, instead of summing the squares of the last n samples, keep a running total and update it with the square of the most recent sample. To keep the running total from growing without bound over time, add an exponential decay. In pseudocode:
decay_constant = 0.999; // some suitable value smaller than 1
total = 0;
for t = 1, ...
    // exponential decay
    total = total * decay_constant;
    // add in the square of the latest sample
    total += current_sample * current_sample;
    if total > threshold
        // do something
    end
end
Of course, you'll have to tune the decay constant and threshold to suit your application. If this isn't fast enough to run in real time, you have a seriously underpowered DSP...
You might try calculating two simple statistics. The first would be the spread (max - min); silence will have a very low spread. The second would be "variety": divide the range of possible sample values into, say, 16 equal brackets and, as you go through the elements, count which bracket each element falls into. Noise will have similar counts in all brackets, whereas music or speech should favour some brackets while neglecting others.
This can be done in a single pass through the array, and you do not need complicated arithmetic, just some additions and comparisons of values.
Also consider some approximation, for example taking only every fourth value, which reduces the number of checked elements to 80. For an audio signal, this should be okay.
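A one-pass sketch of those two statistics for a 320-sample int16 frame; the bucket count of 16 comes from the suggestion above, while the function name and the mapping of the full int16 range onto the buckets are illustrative choices:

#include <stdint.h>

/* Compute spread (max - min) and "variety" (how many of 16 equal-width
 * buckets of the int16 range are hit) in a single pass. */
void frame_stats(const int16_t *samples, int n, int *spread, int *variety)
{
    int16_t lo = samples[0], hi = samples[0];
    int hits[16] = {0};
    for (int i = 0; i < n; i++) {        /* use i += 4 to check only every fourth value */
        int16_t s = samples[i];
        if (s < lo) lo = s;
        if (s > hi) hi = s;
        hits[((int)s + 32768) >> 12]++;  /* map [-32768, 32767] onto buckets 0..15 */
    }
    *spread = (int)hi - (int)lo;
    *variety = 0;
    for (int b = 0; b < 16; b++)
        if (hits[b]) (*variety)++;
}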
I did something like this a while back. After some experimentation I arrived at a solution that worked sufficiently well in my case.
I used the rate of change of the cube of the running average over about 120 ms. When there is silence (that is, only noise), the expression should hover around zero. As soon as the rate starts increasing over a couple of runs, there is probably some action going on.
rate = cur_avg^3 - prev_avg^3
I used a cube because the square just wasn't aggressive enough. If the cube is too slow for you, try using the square and a bit shift instead. Hope this helps.
Clearly the complexity has to be at least O(n). Some simple algorithm that computes a value range is probably good enough for the moment, but I would search the web for Voice Activity Detection and related code samples.