I am implementing the Runge-Kutta-Fehlberg method with adaptive step-size (RK45). I define and call my Butcher tableau in a notebook with
module FehlbergTableau
using StaticArrays
export A, B, CH, CT
A = @SVector [ 0 , 2/9 , 1/3 , 3/4 , 1 , 5/6 ]
B = @SMatrix [ 0 0 0 0 0
2/9 0 0 0 0
1/12 1/4 0 0 0
69/128 -243/128 135/64 0 0
-17/12 27/4 -27/5 16/15 0
65/432 -5/16 13/16 4/27 5/144 ]
CH = @SVector [ 47/450 , 0 , 12/25 , 32/225 , 1/30 , 6/25 ]
CT = @SVector [ -1/150 , 0 , 3/100 , -16/75 , -1/20 , 6/25 ]
end
using .FehlbergTableau
If I code the algorithm for RK45 straightforwardly as
function infinitesimal_flow(A::SVector{6,Float64}, B::SMatrix{6,5,Float64}, CH::SVector{6,Float64}, CT::SVector{6,Float64}, t0::Float64, Δt::Float64, J∇H::Function, x0::SVector{N,Float64}) where N
k1 = Δt * J∇H( t0 + Δt*A[1], x0 )
k2 = Δt * J∇H( t0 + Δt*A[2], x0 + B[2,1]*k1 )
k3 = Δt * J∇H( t0 + Δt*A[3], x0 + B[3,1]*k1 + B[3,2]*k2 )
k4 = Δt * J∇H( t0 + Δt*A[4], x0 + B[4,1]*k1 + B[4,2]*k2 + B[4,3]*k3 )
k5 = Δt * J∇H( t0 + Δt*A[5], x0 + B[5,1]*k1 + B[5,2]*k2 + B[5,3]*k3 + B[5,4]*k4 )
k6 = Δt * J∇H( t0 + Δt*A[6], x0 + B[6,1]*k1 + B[6,2]*k2 + B[6,3]*k3 + B[6,4]*k4 + B[6,5]*k5 )
TE = CT[1]*k1 + CT[2]*k2 + CT[3]*k3 + CT[4]*k4 + CT[5]*k5 + CT[6]*k6
xt = x0 + CH[1]*k1 + CH[2]*k2 + CH[3]*k3 + CH[4]*k4 + CH[5]*k5 + CH[6]*k6
norm(TE), xt
end
and compare it with the more compact implementation
function infinitesimal_flow_2(A::SVector{6,Float64}, B::SMatrix{6,5,Float64}, CH::SVector{6,Float64}, CT::SVector{6,Float64}, t0::Float64,Δt::Float64,J∇H::Function, x0::SVector{N,Float64}) where N
k = MMatrix{N,6}(0.0I)
TE = zero(x0); xt = x0
for i=1:6
# EDIT: this is wrong! there should be a new variable here, as pointed
# out by Lutz Lehmann: xs = x0
for j=1:i-1
# xs += B[i,j] * k[:,j]
x0 += B[i,j] * k[:,j] #wrong
end
k[:,i] = Δt * J∇H(t0 + Δt*A[i], x0)
TE += CT[i]*k[:,i]
xt += CH[i]*k[:,i]
end
norm(TE), xt
end
Then the first function, which defines variables explicitly, is much faster:
J∇H(t::Float64, X::SVector{N,Float64}) where N = @SVector [ -X[2]^2, X[1] ]
x0 = SVector{2}([0.0, 1.0])
infinitesimal_flow(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)
infinitesimal_flow_2(A, B, CH, CT, 0.0, 1e-2, J∇H, x0)
@btime infinitesimal_flow($A, $B, $CH, $CT, 0.0, 1e-2, $J∇H, $x0)
>> 19.387 ns (0 allocations: 0 bytes)
@btime infinitesimal_flow_2($A, $B, $CH, $CT, 0.0, 1e-2, $J∇H, $x0)
>> 50.985 ns (0 allocations: 0 bytes)
I cannot find a type instability or anything to justify the lag, and for more complex tableaus it is mandatory that I use the algorithm in loop form. What am I doing wrong?
P.S.: The bottleneck in infinitesimal_flow_2 is the line k[:,i] = Δt * J∇H(t0 + Δt*A[i], x0).
Each stage of the RK method computes its evaluation point directly from the base point of the RK step. This is explicit in the first method. In the second method you would have to reset the point computation in each stage, such as in
for i=1:6
xs = x0
for j=1:i-1
xs += B[i,j] * k[:,j]
end
k[:,i] = Δt * J∇H(t0 + Δt*A[i], xs)
...
The slightest error in the step computation can catastrophically throw off the step-size controller, forcing the step size to fall towards zero and thus the effort to increase drastically. An example is the 4101 error in RKF45
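For reference, here is the loop version with that fix applied: a sketch reusing the question's setup (StaticArrays, and LinearAlgebra for norm); the only changes are the fresh xs in every stage and the corrected xt update.
function infinitesimal_flow_2(A::SVector{6,Float64}, B::SMatrix{6,5,Float64}, CH::SVector{6,Float64}, CT::SVector{6,Float64}, t0::Float64, Δt::Float64, J∇H::Function, x0::SVector{N,Float64}) where N
    k = MMatrix{N,6}(0.0I)           # stage increments, one column per stage (as in the question)
    TE = zero(x0); xt = x0
    for i = 1:6
        xs = x0                      # restart from the base point of the step for every stage
        for j = 1:i-1
            xs += B[i,j] * k[:,j]
        end
        k[:,i] = Δt * J∇H(t0 + Δt*A[i], xs)
        TE += CT[i] * k[:,i]         # truncation-error estimate
        xt += CH[i] * k[:,i]         # solution update
    end
    norm(TE), xt
end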
How do you invert 4x3 matrices that are only translation and rotation, no scale? The sort of thing you would use to do an OpenGL Matrix inverse (just without scaling)?
Assuming your TypeMatrix3x4 is a [3][4] matrix, and you are only transforming a 1:1 scale, rotation and translation matrix, the following code seems to work -
This transposes the rotation matrix and applies the inverse of the translation.
TypeMatrix3x4 InvertHmdMatrix34( TypeMatrix3x4 mtoinv )
{
int i, j;
TypeMatrix3x4 out = { 0 };
for( i = 0; i < 3; i++ )
for( j = 0; j < 3; j++ )
out.m[j][i] = mtoinv.m[i][j];
for ( i = 0; i < 3; i++ )
{
out.m[i][3] = 0;
for( j = 0; j < 3; j++ )
out.m[i][3] += out.m[i][j] * -mtoinv.m[j][3];
}
return out;
}
You can solve that for any 3-dimensional affine transformation whose 3x3 linear part is invertible, which allows you to include scaling and non-conformal mappings such as shears; the only requirement is that the 3x3 block be invertible.
Simply extend your 3x4 matrix to 4x4 by adding a row of all zeros except the last element, and invert that matrix. For example, as shown below:
[[a b c d] [[x] [[x']
[e f g h] * [y] = [y']
[i j k l] [z] [z']
[0 0 0 1]] [1]] [1 ]] (added row)
It's easy to see that this 4x4 matrix, applied to your vector produces exactly the same vector as before the extension.
If you get the inverse of that matrix, you'll have:
[[A B C D] [[x'] [[x]
[E F G H] * [y'] = [y]
[I J K L] [z'] [z]
[0 0 0 1]] [1 ] [1]]
It's easy to see that if this works in one direction, it also works in the reverse direction: if A is the image of B, then B is the image of A through the inverse transformation. The only requisite is that the matrix be invertible.
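As a quick numeric illustration of the extend-and-invert idea (a sketch in Julia only because it has a built-in matrix inverse; the values of M34 are made up):
using LinearAlgebra
# An affine 3x4 matrix that also scales, so the rigid-body
# shortcut from the first answer would not apply:
M34 = [2.0 0.0 0.0  5.0;
       0.0 1.0 0.0 -1.0;
       0.0 0.0 3.0  2.0]
M44   = vcat(M34, [0.0 0.0 0.0 1.0])  # append the row (0 0 0 1)
inv44 = inv(M44)                      # invert the extended 4x4 matrix
inv34 = inv44[1:3, :]                 # drop the added row again
# M44 * inv44 ≈ I, and inv34 applied to (x', y', z', 1) gives back (x, y, z)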
Moreover, if you have a list of vectors you want to process, you can apply the Gauss elimination method to an extended matrix of the form:
[[a b c d x0' x1' x2' ... xn']
[e f g h y0' y1' y2' ... yn']
[i j k l z0' z1' z2' ... zn']
[0 0 0 1 1 1 1 ... 1 ]]
To obtain the preimages of all the vectors, run Gauss elimination on the matrix above until you get:
[[1 0 0 0 x0 x1 x2 ... xn ]
[0 1 0 0 y0 y1 y2 ... yn ]
[0 0 1 0 z0 z1 z2 ... zn ]
[0 0 0 1 1 1 1 ... 1 ]]
and you will solve n problems in one shot, because the column vectors on the right-hand side are the ones that, once transformed, produce the original ones.
You can get a simple implementation of the Gauss/Jordan elimination method, which I wrote to teach my son linear algebra, here. It's open source (BSD license) and you can modify/adapt it to your needs. It uses the last approach, and you can use it out of the box by trying the sist_lin program.
If you want the inverse transformation, put the following contents in the matrix, and apply Gauss elimination to:
a b c d 1 0 0 0
e f g h 0 1 0 0
i j k l 0 0 1 0
0 0 0 1 0 0 0 1
as input to sist_lin and you get:
1 0 0 0 A B C D <-- these are the coefs of the
0 1 0 0 E F G H inverse transformation
0 0 1 0 I J K L
0 0 0 1 0 0 0 1
you will have:
a * x + b * y + c * z + d = X
e * x + f * y + g * z + h = Y
i * x + j * y + k * z + l = Z
0 * x + 0 * y + 0 * z + 1 = 1
and
A * X + B * Y + C * Z + D = x
E * X + F * Y + G * Z + H = y
I * X + J * Y + K * Z + L = z
0 * X + 0 * Y + 0 * Z + 1 = 1
In my work, I have to deal with large matrices.
For example, I use the following ones.
using LinearAlgebra
#Pauli matrices
σ₁ = [0 1; 1 0]
σ₂ = [0 -im; im 0]
τ₁ = [0 1; 1 0]
τ₃ = [1 0; 0 -1]
#Trigonometric functions in real space
function EYE(Lx,Ly,Lz)
N = Lx*Ly*Lz
mat = Matrix{Complex{Float64}}(I, N, N)
return mat
end
function SINk₁(Lx,Ly,Lz)
N = Lx*Ly*Lz
mat = zeros(Complex{Float64},N,N)
for ix = 1:Lx
for iy = 1:Ly
for iz = 1:Lz
for dx in -1:1
jx = ix + dx
jx += ifelse(jx > Lx,-Lx,0)
jx += ifelse(jx < 1,Lx,0)
for dy in -1:1
jy = iy + dy
jy += ifelse(jy > Ly,-Ly,0)
jy += ifelse(jy < 1,Ly,0)
for dz in -1:1
jz = iz + dz
ii = (iz-1)*Lx*Ly + (ix-1)*Ly + iy
jj = (jz-1)*Lx*Ly + (jx-1)*Ly + jy
if 1 <= jz <= Lz
if dx == +1 && dy == 0 && dz == 0
mat[ii,jj] += -(im/2)
end
if dx == -1 && dy == 0 && dz == 0
mat[ii,jj] += im/2
end
end
end
end
end
end
end
end
return mat
end
function COSk₃(Lx,Ly,Lz)
N = Lx*Ly*Lz
mat = zeros(Complex{Float64},N,N)
for ix = 1:Lx
for iy = 1:Ly
for iz = 1:Lz
for dx in -1:1
jx = ix + dx
jx += ifelse(jx > Lx,-Lx,0)
jx += ifelse(jx < 1,Lx,0)
for dy in -1:1
jy = iy + dy
jy += ifelse(jy > Ly,-Ly,0)
jy += ifelse(jy < 1,Ly,0)
for dz in -1:1
jz = iz + dz
ii = (iz-1)*Lx*Ly + (ix-1)*Ly + iy
jj = (jz-1)*Lx*Ly + (jx-1)*Ly + jy
if 1 <= jz <= Lz
if dx == 0 && dy == 0 && dz == +1
mat[ii,jj] += 1/2
end
if dx == 0 && dy == 0 && dz == -1
mat[ii,jj] += 1/2
end
end
end
end
end
end
end
end
return mat
end
Then, I calculate
kron(SINk₁(Lx,Ly,Lz),kron(σ₁,τ₁)) + kron(EYE(Lx,Ly,Lz) + COSk₃(Lx,Ly,Lz),kron(σ₂,τ₃))
This calculation, however, takes a long time for large Lx, Ly, Lz:
Lx = Ly = Lz = 15
@time kron(SINk₁(Lx,Ly,Lz),kron(σ₁,τ₁)) + kron(EYE(Lx,Ly,Lz) + COSk₃(Lx,Ly,Lz),kron(σ₂,τ₃))
4.692591 seconds (20 allocations: 8.826 GiB, 6.53% gc time)
Lx = Ly = Lz = 20
@time kron(SINk₁(Lx,Ly,Lz),kron(σ₁,τ₁)) + kron(EYE(Lx,Ly,Lz) + COSk₃(Lx,Ly,Lz),kron(σ₂,τ₃))
52.687861 seconds (20 allocations: 49.591 GiB, 2.69% gc time)
Are there faster ways to calculate the Kronecker products and the addition, or more suitable definitions of EYE(Lx,Ly,Lz), SINk₁(Lx,Ly,Lz), COSk₃(Lx,Ly,Lz)?
The problem you're having is really simple: you are building dense matrices with (4·Lx·Ly·Lz)² complex entries, i.e. O(L⁶) memory for Lx = Ly = Lz = L. Since almost all of these values are 0, this is a huge waste. You should almost certainly be using sparse arrays (the SparseArrays standard library). This should bring your runtime and memory use down to reasonable levels.
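A minimal sketch of that change, relying only on the standard SparseArrays library; EYEsp is a hypothetical sparse stand-in for the question's EYE, and SINk₁/COSk₃ are assumed to be modified as described in the comments:
using SparseArrays, LinearAlgebra
# Inside SINk₁ and COSk₃, replace the dense allocation
#     mat = zeros(Complex{Float64}, N, N)
# with a sparse one (the element updates in the loops stay the same):
#     mat = spzeros(Complex{Float64}, N, N)
# Sparse identity instead of the dense EYE:
EYEsp(Lx, Ly, Lz) = sparse(I, Lx*Ly*Lz, Lx*Ly*Lz)
σ₁ = sparse([0 1; 1 0]);    τ₁ = sparse([0 1; 1 0])
σ₂ = sparse([0 -im; im 0]); τ₃ = sparse([1 0; 0 -1])
# kron and + of sparse matrices return sparse matrices, so the full
# dense (4·Lx·Ly·Lz)² array is never materialized:
H = kron(SINk₁(Lx,Ly,Lz), kron(σ₁,τ₁)) +
    kron(EYEsp(Lx,Ly,Lz) + COSk₃(Lx,Ly,Lz), kron(σ₂,τ₃))
Filling a sparse matrix entry by entry is not the fastest way to build it (collecting row/column/value vectors and calling sparse once is quicker), but even this direct swap removes the memory blow-up.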
I am trying to understand how the cascaded biquad filtering is optimized for Arm processors in CMSIS using Neon extensions.
The code is ifdefed under #if defined(ARM_MATH_NEON) here, and documentation is here.
The NEON intrinsics are used when there are more than 4 biquads cascaded. I am puzzled: how could any kind of parallel instruction execution be done if the output from one biquad is fed as input to the next one? Could anyone explain what is done in parallel in that piece of code?
A biquad cascade can be parallelized by offsetting them in time.
If you compute 4 biquads at a time, the last biquad of the cascade doesn't operate on the results from the previous biquad in the same batch of 4, but on results saved from the previous batch of 4. That removes the dependencies within each batch. It therefore takes 4 steps of latency to propagate data diagonally from the first to the last biquad, but the throughput is 4 biquads per time step, i.e. 4x higher than computing the biquads one at a time.
Here’s the formula from the documentation:
y[ n ] = b0 * x[ n ] + d1;
d1 = b1 * x[ n ] + a1 * y[ n ] + d2;
d2 = b2 * x[ n ] + a2 * y[ n ];
Let’s get rid of the mutable state by renaming variables, for 2 iterations of the loop:
// Iteration 1
y[ n ] = b0 * x[ n ] + d1_0;
const float d1_1 = b1 * x[ n ] + a1 * y[ n ] + d2_0;
const float d2_1 = b2 * x[ n ] + a2 * y[ n ];
// Iteration 2
y[ n + 1 ] = b0 * x[ n + 1 ] + d1_1;
const float d1_2 = b1 * x[ n + 1 ] + a1 * y[ n + 1 ] + d2_1;
const float d2_2 = b2 * x[ n + 1 ] + a2 * y[ n + 1 ];
When it’s written that way, it’s obvious you can substitute variables, and compute 2 iterations in parallel, here’s how:
// Rewriting iterations to only use data available before the #1
y[ n ] = b0 * x[ n ] + d1_0;
y[ n + 1 ] = b0 * x[ n + 1 ] + b1 * x[ n ] + a1 * b0 * x[ n ] + a1 * d1_0 + d2_0;
const float d1_2 = b1 * x[ n + 1 ] + a1 * y[ n + 1 ] + b2 * x[ n ] + a2 * y[ n ];
const float d2_2 = b2 * x[ n + 1 ] + a2 * y[ n + 1 ];
The algebra gets tedious, but I hope you get the idea: the approach removes the serial data dependency at the cost of extra computations.
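As a quick numeric sanity check of that substitution for y[n+1] (a throwaway Julia snippet with made-up coefficients; it is not part of the CMSIS code):
# Arbitrary biquad coefficients and initial state
b0, b1, b2, a1, a2 = 0.3, 0.2, 0.1, -0.4, 0.05
d1_0, d2_0 = 0.7, -0.2
x1, x2 = 1.0, 2.0            # two consecutive input samples
# Sequential (stateful) form
y1   = b0*x1 + d1_0
d1_1 = b1*x1 + a1*y1 + d2_0
y2   = b0*x2 + d1_1
# Rewritten form that only uses state available before sample 1
y2r  = b0*x2 + b1*x1 + a1*b0*x1 + a1*d1_0 + d2_0
y2 ≈ y2r                     # true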
That particular implementation does that for 4 iterations instead of 2, by shifting vectors and doing lots of extra computations. Here's the main NEON loop, with HLSL-style comments about what is happening with the lanes of the YnV SIMD vector.
float32x4_t YnV = s;
// YnV.w += t1.w * dV.val[ 0 ].x;
s = vextq_f32( zeroV, dV.val[ 0 ], 3 );
YnV = vmlaq_f32( YnV, t1, s );
// YnV.zw += t2.zw * dV.val[ 0 ].xy;
s = vextq_f32( zeroV, dV.val[ 0 ], 2 );
YnV = vmlaq_f32( YnV, t2, s );
// YnV.yzw += t3.yzw * dV.val[ 0 ].xyz
s = vextq_f32( zeroV, dV.val[ 0 ], 1 );
YnV = vmlaq_f32( YnV, t3, s );
// And finally the all-lanes version without shifts:
// YnV.xyzw += t4.xyzw * XnV.xyzw
YnV = vmlaq_f32( YnV, t4, XnV );
I am attempting to multiply several matrices using a loop in C. I obtain the expected answer in R, but cannot obtain the expected answer in C. I suspect the problem is related to the += operator, which seems to double the value of the product after the first iteration of the loop.
I am not very familiar with C and have not been able to replace the += statement with one that returns the expected answer.
Thank you for any advice.
First, here is the R code that returns the expected answer:
B0 = -0.40
B1 = 0.20
mycov1 = exp(B0 + -2 * B1) / (1 + exp(B0 + -2 * B1))
mycov2 = exp(B0 + -1 * B1) / (1 + exp(B0 + -1 * B1))
mycov3 = exp(B0 + 0 * B1) / (1 + exp(B0 + 0 * B1))
mycov4 = exp(B0 + 1 * B1) / (1 + exp(B0 + 1 * B1))
trans1 = matrix(c(1 - 0.25 - mycov1, mycov1, 0.25 * 0.80, 0,
0, 1 - 0.50, 0, 0.50 * 0.75,
0, 0, 1, 0,
0, 0, 0, 1),
nrow=4, ncol=4, byrow=TRUE)
trans2 = matrix(c(1 - 0.25 - mycov2, mycov2, 0.25 * 0.80, 0,
0, 1 - 0.50, 0, 0.50 * 0.75,
0, 0, 1, 0,
0, 0, 0, 1),
nrow=4, ncol=4, byrow=TRUE)
trans3 = matrix(c(1 - 0.25 - mycov3, mycov3, 0.25 * 0.80, 0,
0, 1 - 0.50, 0, 0.50 * 0.75,
0, 0, 1, 0,
0, 0, 0, 1),
nrow=4, ncol=4, byrow=TRUE)
trans4 = matrix(c(1 - 0.25 - mycov4, mycov4, 0.25 * 0.80, 0,
0, 1 - 0.50, 0, 0.50 * 0.75,
0, 0, 1, 0,
0, 0, 0, 1),
nrow=4, ncol=4, byrow=TRUE)
trans2b <- trans1 %*% trans2
trans3b <- trans2b %*% trans3
trans4b <- trans3b %*% trans4
trans4b
#
# This is the expected answer
#
# [,1] [,2] [,3] [,4]
# [1,] 0.01819965 0.1399834 0.3349504 0.3173467
# [2,] 0.00000000 0.0625000 0.0000000 0.7031250
# [3,] 0.00000000 0.0000000 1.0000000 0.0000000
# [4,] 0.00000000 0.0000000 0.0000000 1.0000000
#
Here is my C code. The C code is fairly long because I do not know C well enough to be efficient:
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
char quit;
int main(){
int i, j, k, ii, jj, kk ;
double B0, B1, mycov ;
double trans[4][4] = {0} ;
double prevtrans[4][4] = {{1,0,0,0},
{0,1,0,0},
{0,0,1,0},
{0,0,0,1}};
B0 = -0.40 ;
B1 = 0.20 ;
for (i=1; i <= 4; i++) {
mycov = exp(B0 + B1 * (-2+i-1)) / (1 + exp(B0 + B1 * (-2+i-1))) ;
trans[0][0] = 1 - 0.25 - mycov ;
trans[0][1] = mycov ;
trans[0][2] = 0.25 * 0.80 ;
trans[0][3] = 0 ;
trans[1][0] = 0 ;
trans[1][1] = 1 - 0.50 ;
trans[1][2] = 0 ;
trans[1][3] = 0.50 * 0.75 ;
trans[2][0] = 0 ;
trans[2][1] = 0 ;
trans[2][2] = 1 ;
trans[2][3] = 0 ;
trans[3][0] = 0 ;
trans[3][1] = 0 ;
trans[3][2] = 0 ;
trans[3][3] = 1 ;
for (ii=0; ii<4; ii++){
for(jj=0; jj<4; jj++){
for(kk=0; kk<4; kk++){
trans[ii][jj] += trans[ii][kk] * prevtrans[kk][jj] ;
}
}
}
prevtrans[0][0] = trans[0][0] ;
prevtrans[0][1] = trans[0][1] ;
prevtrans[0][2] = trans[0][2] ;
prevtrans[0][3] = trans[0][3] ;
prevtrans[1][0] = trans[1][0] ;
prevtrans[1][1] = trans[1][1] ;
prevtrans[1][2] = trans[1][2] ;
prevtrans[1][3] = trans[1][3] ;
prevtrans[2][0] = trans[2][0] ;
prevtrans[2][1] = trans[2][1] ;
prevtrans[2][2] = trans[2][2] ;
prevtrans[2][3] = trans[2][3] ;
prevtrans[3][0] = trans[3][0] ;
prevtrans[3][1] = trans[3][1] ;
prevtrans[3][2] = trans[3][2] ;
prevtrans[3][3] = trans[3][3] ;
}
printf("To close this program type 'quit' and hit the return key\n");
printf(" \n");
scanf("%d", &quit);
return 0;
}
Here is the final matrix returned by the above C code:
0.4821 3.5870 11.68 381.22
0 1 0 76.875
0 0 5 0
0 0 0 5
This line
trans[ii][jj] += trans[ii][kk] * prevtrans[kk][jj] ;
is not right. You're modifying trans in place while you are still using it to compute the resultant matrix, so you need another matrix to store the result of the multiplication temporarily. Note also the order of the factors: the R code multiplies the accumulated product on the left (trans1 %*% trans2, then that result %*% trans3, and so on), so prevtrans has to be the left factor here. Then use:
double temp[4][4]; // scratch matrix, declared alongside trans and prevtrans
// Store the resultant matrix in temp.
for (ii=0; ii<4; ii++){
for(jj=0; jj<4; jj++){
temp[ii][jj] = 0.0;
for(kk=0; kk<4; kk++){
temp[ii][jj] += prevtrans[ii][kk] * trans[kk][jj] ; // accumulated product on the left, matching the R order
}
}
}
// Transfer the data from temp to trans
for (ii=0; ii<4; ii++){
for(jj=0; jj<4; jj++){
trans[ii][jj] = temp[ii][jj];
}
}
I am trying to implement the Karatsuba algorithm in C.
I work with char strings (which are digits in a certain base), and although I think I have understood most of the Karatsuba algorithm, I do not get where I should split the strings to multiply.
For example, where should I cut 123 * 123, and where should I cut 123 * 12?
I can't get to a solution that works with both these calculations.
I tried to cut it in half, flooring the result when the length is odd, but it did not work, and ceiling does not work either.
Any clue?
Let a, b, c, and d be the parts of the strings.
Let's try with 123 * 12
First try (a = 1, b = 23, c = 1, d = 2) (fail)
z0 = a * c = 1
z1 = b * d = 46
z2 = (a + b) * (c + d) - z0 - z1 = 24 * 3 - 1 - 46 = 72 - 1 - 46 = 25
z0_padded = 100
z2_padded = 250
z0_padded + z1 + z2_padded = 100 + 46 + 250 = 396 != 123 * 12
Second try (a = 12, b = 3, c = 12, d = 0) (fail)
z0 = 144
z1 = 0
z2 = 15 * 12 - z1 - z0 = 180 - 144 = 36
z0_padded = 14400
z2_padded = 360
z0_padded + z1 + z2_padded = 14760 != 1476
Third try (a = 12, b = 3, c = 0, d = 12) (success)
z0 = 0
z1 = 36
z2 = 15 * 12 - z0 - z1 = 144
z0_padded = 0
z2_padded = 1440
z0_padded + z1 + z2_padded = 1476 == 1476
Let's try with 123 * 123
First try (a = 1, b = 23, c = 1, d = 23) (fail)
z0 = 1
z1 = 23 * 23 = 529
z2 = 24 * 24 - z0 - z1 = 46
z0_padded = 100
z2_padded = 460
z0_padded + z1 + z2_padded = 561 != 15129
Second try (a = 12, b = 3, c = 12, d = 3) (success)
z0 = 12 * 12 = 144
z1 = 3 * 3 = 9
z2 = 15 * 15 - z0 - z1 = 72
z0_padded = 14400
z2_padded = 720
z0_padded + z1 + z2_padded = 15129 == 15129
Third try (a = 12, b = 3, c = 1, d = 23) (fail)
z0 = 12
z1 = 3 * 23 = 69
z2 = 15 * 24 - z0 - z1 = 279
z0_padded = 1200
z2_padded = 2790
z0_padded + z1 + z2_padded = 4059 != 15129
Here, I do not get where I messed this up. Note that my padding method adds n zeroes at the end of a number where n = m * 2 and m equals the size of the longest string divided by two.
EDIT
Now that I have understood that b and d must be of the same length, it works almost every time, but there are still exceptions: for example 1234*12
a = 123
b = 4
c = 1
d = 2
z0 = 123
z1 = 8
z2 = 127 * 3 - 123 - 8 = 250
z0_padded = 1230000
z2_padded = 25000
z0_padded + z1 + z2_padded = 1255008 != 14808
Here, assuming I split the strings correctly, the problem is the padding, but I do not get how I should pad. I read on Wikipedia that I should pad depending on the size of the biggest string (see a few lines up), but there must be another solution.
The Karatsuba algorithm is a nice way to perform multiplications.
If you want it to work, b and d must be of the same length.
Here are two possibilities to compute 123x12 :
a= 1;b=23;c=0;d=12;
a=12;b= 3;c=1;d= 2;
Let's explain how it works for the second case:
123=12×10+3
12= 1×10+2
123×12=(12×10+3)×(1×10+2)
123×12=12×1×100+ (12×2+3×1)×10+3×2
123×12=12×1×100+((12+3)×(1+2)-12×1-3×2)×10+3×2
Let's explain how it works for the first case:
123=1×100+23
12=0×100+12
123×12=(1×100+23)×(0×100+12)
123×12=1×0×10000+ (1×12+23×0)×100+23×12
123×12=1×0×10000+((1+23)×(0+12)-1×0-23×12)×100+23×12
It also works with 10^k, 2^k or n instead of 10 or 100.
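In general, if both numbers are split m digits from the right, so that x = a×10^m + b and y = c×10^m + d (b and d are the lowest m digits of x and y), then
x×y = (a×c)×10^(2m) + ((a+b)×(c+d) - a×c - b×d)×10^m + b×d
so a×c is always padded with 2m zeros and the middle term with m zeros, where m is the number of low digits split off, not half of the longest operand. For the 1234×12 example from the edit, splitting one digit from the right gives a = 123, b = 4, c = 1, d = 2, m = 1, and 123×100 + (127×3 - 123 - 8)×10 + 8 = 12300 + 2500 + 8 = 14808.
Here is a small recursive sketch of that splitting rule, written in Julia on integers rather than digit strings just to show where the cut and the padding go; the same structure carries over to the C string version:
function karatsuba(x::Integer, y::Integer)
    # Base case: single-digit factors are multiplied directly.
    (x < 10 || y < 10) && return x * y
    # Split the same number of low digits off both operands.
    m = min(ndigits(x), ndigits(y)) ÷ 2
    p = 10^m
    a, b = divrem(x, p)          # x = a*p + b
    c, d = divrem(y, p)          # y = c*p + d
    z0 = karatsuba(a, c)
    z1 = karatsuba(b, d)
    z2 = karatsuba(a + b, c + d) - z0 - z1
    return z0 * p^2 + z2 * p + z1
end
karatsuba(123, 12) == 1476, karatsuba(1234, 12) == 14808 and karatsuba(123, 123) == 15129 all hold.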