I am trying to understand how the cascaded biquad filtering is optimized for Arm processors in CMSIS using Neon extensions.
The code is ifdefed under #if defined(ARM_MATH_NEON) here, and documentation is here.
The NEON intrinsics are used when there are more than 4 biquads cascaded. I am puzzled how could any kind of parallel instruction execution be done if output from one biduaq is fed as input to the next one? Could anyone explain what is done in parallel in that peace of code?
A biquad cascade can be parallelized by offsetting them in time.
If you compute 4 biquads at a time, the last cascade biquad doesn’t operate on the results from the previous biquad in the same batch of 4, but on results saved from the previous batch of 4. That removes the dependencies within each batch. Thus it takes 4 steps of latency to propagate data diagonally from the first to the last biquad, but thruput finishes 4 biquads per time step, or 4x higher thruput than computing biquads one at a time.
Here’s the formula from the documentation:
y[ n ] = b0 * x[ n ] + d1;
d1 = b1 * x[ n ] + a1 * y[ n ] + d2;
d2 = b2 * x[ n ] + a2 * y[ n ];
Let’s get rid of the mutable state by renaming variables, for 2 iterations of the loop:
// Iteration 1
y[ n ] = b0 * x[ n ] + d1_0;
const float d1_1 = b1 * x[ n ] + a1 * y[ n ] + d2_0;
const float d2_1 = b2 * x[ n ] + a2 * y[ n ];
// Iteration 2
y[ n + 1 ] = b0 * x[ n + 1 ] + d1_1;
const float d1_2 = b1 * x[ n + 1 ] + a1 * y[ n + 1 ] + d2_1;
const float d2_2 = b2 * x[ n + 1 ] + a2 * y[ n + 1 ];
When it’s written that way, it’s obvious you can substitute variables, and compute 2 iterations in parallel, here’s how:
// Rewriting iterations to only use data available before the #1
y[ n ] = b0 * x[ n ] + d1_0;
y[ n + 1 ] = b0 * x[ n + 1 ] + b1 * x[ n ] + a1 * b0 * x[ n ] + d1_0 + d2_0;
const float d1_2 = b1 * x[ n + 1 ] + a1 * y[ n + 1 ] + b2 * x[ n ] + a2 * y[ n ];
const float d2_2 = b2 * x[ n + 1 ] + a2 * y[ n + 1 ];
Pretty sure I have screwed up the algebra above, but I hope you got the idea. The approach removes data dependency at the cost of more computations.
That particular implementation does that that for 4 iterations instead of 2, by shifting vectors and doing lots of extra computations. Here’s the main NEON loop with HLSL-style comments about what is happening with the lanes of the YnV SIMD vector.
float32x4_t YnV = s;
// YnV.w += t1.w * dV.val[ 0 ].x;
s = vextq_f32( zeroV, dV.val[ 0 ], 3 );
YnV = vmlaq_f32( YnV, t1, s );
// YnV.zw += t2.zw * dV.val[ 0 ].xy;
s = vextq_f32( zeroV, dV.val[ 0 ], 2 );
YnV = vmlaq_f32( YnV, t2, s );
// YnV.yzw += t3.yzw * dV.val[ 0 ].xyz
s = vextq_f32( zeroV, dV.val[ 0 ], 1 );
YnV = vmlaq_f32( YnV, t3, s );
// And finally the all-lanes version without shifts:
// YnV.xyzw += t4.xyzw * XnV.xyzw
YnV = vmlaq_f32( YnV, t4, XnV );
Related
How do you invert 4x3 matrices that are only translation and rotation, no scale? The sort of thing you would use to do an OpenGL Matrix inverse (just without scaling)?
Assuming your TypeMatrix3x4 is a [3][4] matrix, and you are only transforming a 1:1 scale, rotation and translation matrix, the following code seems to work -
This transposes the rotation matrix and applies the inverse of the translation.
TypeMatrix3x4 InvertHmdMatrix34( TypeMatrix3x4 mtoinv )
{
int i, j;
TypeMatrix3x4 out = { 0 };
for( i = 0; i < 3; i++ )
for( j = 0; j < 3; j++ )
out.m[j][i] = mtoinv.m[i][j];
for ( i = 0; i < 3; i++ )
{
out.m[i][3] = 0;
for( j = 0; j < 3; j++ )
out.m[i][3] += out.m[i][j] * -mtoinv.m[j][3];
}
return out;
}
You can solve that for any 3 dimensional affine transformation whose 3x3 transformation matrix is invertible. This allows you to include scaling and non conformant applications. The only requirement is for the 3x3 matrix to be invertible.
Simply extend your 3x4 matrix to 4x4 by adding a row all zeros except the last element, and invert that matrix. For example, as shown below:
[[a b c d] [[x] [[x']
[e f g h] * [y] = [y']
[i j k l] [z] [z']
[0 0 0 1]] [1]] [1 ]] (added row)
It's easy to see that this 4x4 matrix, applied to your vector produces exactly the same vector as before the extension.
If you get the inverse of that matrix, you'll have:
[[A B C D] [[x'] [[x]
[E F G H] * [y'] = [y]
[I J K L] [z'] [z]
[0 0 0 1]] [1 ] [1]]
It's easy to see that it this works in one direction, it needs to be in the reverse direction, if A is the image of B, then B will be the inverse throug the inverse transformation, the only requisite is the matrix to be invertible.
More on... if you have a list of vectors you want to process, you can apply Gauss elimination method to an extended matrix of the form:
[[a b c d x0' x1' x2' ... xn']
[e f g h y0' y1' y2' ... yn']
[i j k l z0' z1' z2' ... zn']
[0 0 0 1 1 1 1 ... 1 ]]
to obtain the inverses of all the vectors you do the Gauss elimination vector to get from above:
[[1 0 0 0 x0 x1 x2 ... xn ]
[0 1 0 0 y0 y1 y2 ... yn ]
[0 0 1 0 z0 z1 z2 ... zn ]
[0 0 0 1 1 1 1 ... 1 ]]
and you will solve n problems in one shot, because the column vectors above will be the ones, that once transformed produce the former ones.
You can get a simple implementation I wrote to teach my son about linear algebra of Gauss/Jordan elimination method here. It's opensource (BSD license) and you can modify/adapt it to your needs. This method uses the last approach, and you can use it out of the box by trying the sist_lin program.
If you want the inverse transformation, put the following contents in the matrix, and apply Gauss elimination to:
a b c d 1 0 0 0
e f g h 0 1 0 0
i j k l 0 0 1 0
0 0 0 1 0 0 0 1
as input to sist_lin and you get:
1 0 0 0 A B C D <-- these are the coefs of the
0 1 0 0 E F G H inverse transformation
0 0 1 0 I J K L
0 0 0 1 0 0 0 1
you will have:
a * x + b * y + c * z + d = X
e * x + f * y + g * z + h = Y
i * x + j * y + k * z + l = Z
0 * x + 0 * y + 0 * z + 1 = 1
and
A * X + B * Y + C * Z + D = x
E * X + F * Y + G * Z + H = y
I * X + J * Y + K * Z + L = z
0 * X + 0 * Y + 0 * Z + 1 = 1
I am implementing the Runge-Kutta-Fehlberg method with adaptive step-size (RK45). I define and call my Butcher tableau in a notebook with
module FehlbergTableau
using StaticArrays
export A, B, CH, CT
A = #SVector [ 0 , 2/9 , 1/3 , 3/4 , 1 , 5/6 ]
B = #SMatrix [ 0 0 0 0 0
2/9 0 0 0 0
1/12 1/4 0 0 0
69/128 -243/128 135/64 0 0
-17/12 27/4 -27/5 16/15 0
65/432 -5/16 13/16 4/27 5/144 ]
CH = #SVector [ 47/450 , 0 , 12/25 , 32/225 , 1/30 , 6/25 ]
CT = #SVector [ -1/150 , 0 , 3/100 , -16/75 , -1/20 , 6/25 ]
end
using .FehlbergTableau
If I code the algorithm for RK45 straightforwardly as
function infinitesimal_flow(A::SVector{6,Float64}, B::SMatrix{6,5,Float64}, CH::SVector{6,Float64}, CT::SVector{6,Float64}, t0::Float64, Δt::Float64, J∇H::Function, x0::SVector{N,Float64}) where N
k1 = Δt * J∇H( t0 + Δt*A[1], x0 )
k2 = Δt * J∇H( t0 + Δt*A[2], x0 + B[2,1]*k1 )
k3 = Δt * J∇H( t0 + Δt*A[3], x0 + B[3,1]*k1 + B[3,2]*k2 )
k4 = Δt * J∇H( t0 + Δt*A[4], x0 + B[4,1]*k1 + B[4,2]*k2 + B[4,3]*k3 )
k5 = Δt * J∇H( t0 + Δt*A[5], x0 + B[5,1]*k1 + B[5,2]*k2 + B[5,3]*k3 + B[5,4]*k4 )
k6 = Δt * J∇H( t0 + Δt*A[6], x0 + B[6,1]*k1 + B[6,2]*k2 + B[6,3]*k3 + B[6,4]*k4 + B[6,5]*k5 )
TE = CT[1]*k1 + CT[2]*k2 + CT[3]*k3 + CT[4]*k4 + CT[5]*k5 + CT[6]*k6
xt = x0 + CH[1]*k1 + CH[2]*k2 + CH[3]*k3 + CH[4]*k4 + CH[5]*k5 + CH[6]*k6
norm(TE), xt
end
and compare it with the more compact implementation
function infinitesimal_flow_2(A::SVector{6,Float64}, B::SMatrix{6,5,Float64}, CH::SVector{6,Float64}, CT::SVector{6,Float64}, t0::Float64,Δt::Float64,J∇H::Function, x0::SVector{N,Float64}) where N
k = MMatrix{N,6}(0.0I)
TE = zero(x0); xt = x0
for i=1:6
# EDIT: this is wrong! there should be a new variable here, as pointed
# out by Lutz Lehmann: xs = x0
for j=1:i-1
# xs += B[i,j] * k[:,j]
x0 += B[i,j] * k[:,j] #wrong
end
k[:,i] = Δt * J∇H(t0 + Δt*A[i], x0)
TE += CT[i]*k[:,i]
xt += CH[i]*k[:,i]B[i,j] * k[:,j]
end
norm(TE), xt
end
Then the first function, which defines variables explicitly, is much faster:
J∇H(t::Float64, X::SVector{N,Float64}) where N = #SVector [ -X[2]^2, X[1] ]
x0 = SVector{2}([0.0, 1.0])
infinitesimal_flow(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)
infinitesimal_flow_2(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)
#btime infinitesimal_flow($A, $B, $CH, $CT, 0.0, 1e-2, $J∇H, $x0)
>> 19.387 ns (0 allocations: 0 bytes)
#btime infinitesimal_flow_2($A, $B, $CH, $CT, 0.0, 1e-2, $J∇H, $x0)
>> 50.985 ns (0 allocations: 0 bytes)
I cannot find a type instability or anything to justify the lag, and for more complex tableaus it is mandatory that I use the algorithm in loop form. What am I doing wrong?
P.S.: The bottleneck in infinitesimal_flow_2 is the line k[:,i] = Δt * J∇H(t0 + Δt*A[i], x0).
Each stage of the RK method computes its evaluation point directly from the base point of the RK step. This is explicit in the first method. In the second method you would have to reset the point computation in each stage, such as in
for i=1:6
xs = x0
for j=1:i-1
xs += B[i,j] * k[:,j]
end
k[:,i] = Δt * J∇H(t0 + Δt*A[i], xs)
...
The slightest error in the step computation can catastrophically throw off the step-size controller, forcing the step size to fall towards zero and thus the effort to increase drastically. An example is the 4101 error in RKF45
I have a matrix A like this:
A = [ 1 0 2 4; 2 3 1 0; 0 0 3 4 ]
A has only unique row elements except zero, and each row has at least 2 non-zero elements.
I want to create a new matrix B from A,where each row in B contains the first two non-zero elements of the corresponding row in A.
B = [ 1 2 ; 2 3 ; 3 4 ]
It is easy with loops but I need vectorized solution.
Here's a vectorized approach:
A = [1 0 2 4; 2 3 1 0; 0 0 3 4]; % example input
N = 2; % number of wanted nonzeros per row
[~, ind] = sort(~A, 2); % sort each row of A by the logical negation of its values.
% Get the indices of the sorting
ind = ind(:, 1:N); % keep first N columns
B = A((1:size(A,1)).' + (ind-1)*size(A,1)); % generate linear index and use into A
Here is another vectorised approach.
A_bool = A > 0; A_size = size(A); A_rows = A_size(1);
A_boolsum = cumsum( A_bool, 2 ) .* A_bool; % for each row, and at each column,
% count how many nonzero instances
% have occurred up to that column
% (inclusive), and then 'zero' back
% all original zero locations.
[~, ColumnsOfFirsts ] = max( A_boolsum == 1, [], 2 );
[~, ColumnsOfSeconds ] = max( A_boolsum == 2, [], 2 );
LinearIndicesOfFirsts = sub2ind( A_size, [1 : A_rows].', ColumnsOfFirsts );
LinearIndicesOfSeconds = sub2ind( A_size, [1 : A_rows].', ColumnsOfSeconds );
Firsts = A(LinearIndicesOfFirsts );
Seconds = A(LinearIndicesOfSeconds);
Result = horzcat( Firsts, Seconds )
% Result =
% 1 2
% 2 3
% 3 4
PS. Matlab / Octave common subset compatible code.
Given an array {1,3,5,7}, its subparts are defined as {1357,135,137,157,357,13,15,17,35,37,57,1,3,5,7}.
I have to find the sum of all these numbers in the new array. In this case sum comes out to be 2333.
Please help me find a solution in O(n). My O(n^2) solution times out.
link to the problem is here or here.
My current attempt( at finding a pattern) is
for(I=0 to len) //len is length of the array
{
for(j=0 to len-i)
{
sum+= arr[I]*pow(10,j)*((len-i) C i)*pow(2,i)
}
}
In words - len-i C i = (number of integers to right) C weight. (combinations {from permutation and combination})
2^i = 2 power (number of integers to left)
Thanks
You can easily solve this problem with a simple recursive.
def F(arr):
if len(arr) == 1:
return (arr[0], 1)
else:
r = F(arr[:-1])
return (11 * r[0] + (r[1] + 1) * arr[-1], 2 * r[1] + 1)
So, how does it work? It is simple. Let say we want to compute the sum of all subpart of {1,3,5,7}. Let assume that we know the number of combinatiton of {1,3,5} and the sum of subpart of {1,3,5} and we can easily compute the {1,3,5,7} using the following formula:
SUM_SUBPART({1,3,5,7}) = 11 * SUM_SUBPART({1,3,5}) + NUMBER_COMBINATION({1,3,5}) * 7 + 7
This formula can easily be derived by observing. Let say we have all combination of {1,3,5}
A = [135, 13, 15, 35, 1, 3, 5]
We can easily create a list of {1,3,5,7} by
A = [135, 13, 15, 35, 1, 3, 5] +
[135 * 10 + 7,
13 * 10 + 7,
15 * 10 + 7,
35 * 10 + 7,
1 * 10 + 7,
3 * 10 + 7,
5 * 10 + 7] + [7]
Well, you could look at at the subparts as sums of numbers:
1357 = 1000*1 + 100*3 + 10*5 + 1*7
135 = 100*1 + 10*3 + 1*5
137 = 100*1 + 10*3 + 1*7
etc..
So, all you need to do is sum up the numbers you have, and then according to the number of items work out what is the multiplier:
Two numbers [x, y]:
[x, y, 10x+y, 10y+x]
=> your multiplier is 1 + 10 + 1 = 12
Three numbers [x, y, z]:
[x, y, z,
10x+y, 10x+z,
10y+x, 10y+z,
10z+x, 10z+y,
100x+10y+z, 100x10z+y
.
. ]
=> you multiplier is 1+10+10+1+1+100+100+10+10+1+1=245
You can easily work out the equation for n numbers....
If you expand invisal's recursive solution you get this explicit formula:
subpart sum = sum for k=0 to N-1: 11^(N-k) * 2^k * a[k]
This suggests the following O(n) algorithm:
multiplier = 1
for k from 0 to N-1:
a[k] = a[k]*multiplier
multiplier = multiplier*2
multiplier = 1
sum = 0
for k from N-1 to 0:
sum = sum + a[k]*multiplier
multiplier = multiplier*11
Multiplication and addition should be done modulo M of course.
I am trying to implement the Karatsuba algorithm in C.
I work with char strings (which are digits in a certain base), and although I think I have understood most of the Karatsuba algorithm, I do not get where I should split the strings to multiply.
For example, where should I cut 123 * 123, and where should I cut 123 * 12?
I can't get to a solution that works with both these calculations.
I tried to cut it in half and flooring the result when the number if odd, but it did not work, and ceiling does not work too.
Any clue?
Let a, b, c, and d be the parts of the strings.
Let's try with 123 * 12
First try (a = 1, b = 23, c = 1, d = 2) (fail)
z0 = a * c = 1
z1 = b * d = 46
z2 = (a + b) * (c + d) - z0 - z1 = 24 * 3 - 1 - 46 = 72 - 1 - 46 = 25
z0_padded = 100
z2_padded = 250
z0_padded + z1 + z2_padded = 100 + 46 + 250 = 396 != 123 * 12
Second try (a = 12, b = 3, c = 12, d = 0) (fail)
z0 = 144
z1 = 0
z2 = 15 * 12 - z1 - z0 = 180 - 144 = 36
z0_padded = 14400
z2_padded = 360
z0_padded + z1 + z2_padded = 14760 != 1476
Third try (a = 12, b = 3, c = 0, d = 12) (success)
z0 = 0
z1 = 36
z2 = 15 * 12 - z0 - z1 = 144
z0_padded = 0
z2_padded = 1440
z0_padded + z1 + z2_padded = 1476 == 1476
Let's try with 123 * 123
First try (a = 1, b = 23, c = 1, d = 23) (fail)
z0 = 1
z1 = 23 * 23 = 529
z2 = 24 * 24 - z0 - z1 = 46
z0_padded = 100
z2_padded = 460
z0_padded + z1 + z2_padded = 561 != 15129
Second try (a = 12, b = 3, c = 12, d = 3) (success)
z0 = 12 * 12 = 144
z1 = 3 * 3 = 9
z2 = 15 * 15 - z0 - z1 = 72
z0_padded = 14400
z2_padded = 720
z0_padded + z1 + z2_padded = 15129 == 15129
Third try (a = 12, b = 3, c = 1, d = 23) (fail)
z0 = 12
z1 = 3 * 23 = 69
z2 = 15 * 24 - z0 - z1 = 279
z0_padded = 1200
z2_padded = 2799
z0_padded + z1 = z2_padded = 4068 != 15129
Here, I do not get where I messed this up. Note that my padding method adds n zeroes at the end of a number where n = m * 2 and m equals the size of the longest string divided by two.
EDIT
Now that I have understood that b and d must be of the same length, it works almost everytime, but there are still exceptions: for example 1234*12
a = 123
b = 4
c = 1
d = 2
z0 = 123
z1 = 8
z2 = 127 * 3 - 123 - 8 = 250
z0_padded = 1230000
z2_padded = 25000
z0_padded + z1 + z2_padded = 1255008 != 14808
Here, assuming I split the strings correctly, the problem is the padding, but I do not get how I should pad. I read on Wikipedia that I should pad depending on the size of the biggest string (see a few lines up), there should be another solution.
The Karatsuba algorithm is a nice way to perform multiplications.
If you want it to work, b and d must be of the same length.
Here are two possibilities to compute 123x12 :
a= 1;b=23;c=0;d=12;
a=12;b= 3;c=1;d= 2;
Let's explain how it works for the second case :
123=12×10+3
12= 1×10+2
123×12=(12×10+3)×(1×10+2)
123×12=12×1×100+ (12×2+3×1)×10+3×2
123×12=12×1×100+((12+3)×(1+2)-12×1-3×2)×10+3×2
Let's explain how it works for the first case :
123=1×100+23
12=0×100+12
123×12=(1×100+23)×(0×100+12)
123×12=1×0×10000+ (1×12+23×0)×100+23×12
123×12=1×0×10000+((1+23)×(0+12)-1×0-23×12)×100+23×12
It also works with 10^k, 2^k or n instead of 10 or 100.