I've used Julia for some time, but still know little about it, especially for the paralell computing.
I want to obtain a new array with the existing data, as the array is very large, I want to do it in a parallel way, and the code is written as follows:
ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.#threads for k = 1:D
ψ1[k] = #views sum(ψ[GetBasis(k - 1, N)])
end
I run it with julia -t 4, But it turn out to be very slow, compared to the non parallel code as follows,
ψ1 = [#views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]
I have no idea about why this happens, and GetBasis() is just a function to generate a Array, Array{Int64,1}(N).
I would like to ask how I could improve the first code, or, is there some way I can modify the second code to also run it in a parallel way? As the array can be very large, and I want to find a way to speed it up...
Thanks a lot, and look forward to your replies!
The complete code can be found as follows
function GetBasis(n, N)
xxN = collect(0:N-1)
BI = BitFlip.(n,xxN).+1
end
function BitFlip(n, i)
n = n ⊻ (1 << i)
end
N=24
D=2^N
ψ1 = Array{Complex}(undef, D)
ψ = rand(Complex{Float64},D)
Threads.#threads for k = 1:D
ψ1[k] = #views sum(ψ[GetBasis(k - 1, N)])
end
ψ2 = [#views sum(ψ[GetBasis(k - 1, N)]) for k=1:D]
ψ1 = Array{Complex}(undef, D)
This is an abstract type, which is very slow.
You should define the type of the complex components, e.g. Complex{Float64}.
Alternatively, use similar.
When all is done properly the multithreaded code is faster indeed.
function GetBasis(n, N)
BI = BitFlip.(n,0:N-1).+1
end
function BitFlip(n, i)
n = n ⊻ (1 << i)
end
const N=24
const D=2^N
const ψ = rand(Complex{Float64},D)
const ψ1 = Vector{Complex{Float64}}(undef, D)
using BenchmarkTools
Threads.nthreads() # should return 4 or more
# set JULIA_NUM_THREADS environment variable
Now testing:
julia> GC.gc()
julia> #btime ψ2 = [#views sum(ψ[GetBasis(k - 1, N)]) for k=1:D];
5.591 s (16777218 allocations: 4.50 GiB)
julia> GC.gc()
julia> #btime Threads.#threads for k = 1:D
#inbounds ψ1[k] = #views sum(ψ[GetBasis(k - 1, N)])
end
2.293 s (16777237 allocations: 4.25 GiB)
Note the amount of memory used by this code - you need to run garbage collector before running the benchamrk and the test will be less meaningful when you have less than 16GB RAM in your machine.
Related
Let's say i have an array with 10 elments:
arr = [1,2,3,4,5,6,7,8,9,10]
Then I want to define a function that takes this arr as parameter to perform
a calculation, let's say for this example the calculation is the difference of means, for example:
If N=2 (That means group the elements of arr in groups of size 2 sequentially):
results=[]
result_1 = 1+2/2 - 3+4/2
result_2 = 3+4/2 - 5+6/2
result_3 = 5+6/2 - 7+8/2
result_4 = 7+8/2 - 9+10/2
The output would be:
results = [-2,-2,-2,-2]
If N=3 (That means group the elements of arr in groups of size 3 sequentially):
results=[]
result_1 = 1+2+3/3 - 4+5+6/3
result_2 = 4+5+6/3 - 7+8+9/3
The output would be:
results = [-3,-3]
I want to do this defining two functions:
Function 1 - Creates the arrays that will be used as input for 2nd function:
Parameters: array, N
returns: k groups of arrays -> seems to be ((length(arr)/N) - 1)
Function 2 - Will be the fucntion that gets the arrays (2 by 2) and perfoms the calculations, in this case, difference of means.
Parameters: array1,array2....arr..arr..
returns: list of the results
Important Note
My idea is to apply these fucntions to a stream of data and the calculation will be the PSI (Population Stability Index)
So, if my stream has 10k samples and I set the first function to N=1000, then the output to the second function will be 1k samples + next 1k samples.
The process will be repetead till the end of the datastream
I was trying to do this in python (I already have the PSI code ready) but now I decided to use Julia for it, but I am pretty new to Julia. So, if anyone can give me some light here will be very helpfull.
In Julia if you have a big Vector and you want to calculate some statistics on groups of 3 elements you could do:
julia> a = collect(1:15); #creates a Vector [1,2,...,15]
julia> mean.(eachcol(reshape(a,3,length(a)÷3)))
5-element Vector{Float64}:
2.0
5.0
8.0
11.0
14.0
Note that both reshape and eachcol are non-allocating so no data gets copied in the process.
If the length of a is not divisible by 3 you could truncate it before reshaping - to avoid allocation use view for that:
julia> a = collect(1:16);
julia> mean.(eachcol(reshape(view(a,1:(length(a)÷3)*3),3,length(a)÷3)))
5-element Vector{Float64}:
2.0
5.0
8.0
11.0
14.0
Depending on what you actually want to do you might also want to take a look at OnlineStats.jl https://github.com/joshday/OnlineStats.jl
Well, I use JavaScript instead Of Python, But it would be same thing in python...
You need a chunks function, that take array and chunk_size (N), lets say chunks([1,2,3,4], 2) -> [[1,2], [3,4]]
we have a sum method that add all element in array, sum([1,2]) -> 3;
Both JavaScript and python support corouting, that you can use for lazy evaluation, And its called Generator function, this type of function can pause its execution, and can resume on demand! This is useful for calculate stream of data.
let arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
// Javascript Doesn't support `chunks` method yet, So we need to create one...
Array.prototype.chunks = function* (N) {
let chunk = [];
for (let value of this) {
chunk.push(value)
if (chunk.length == N) {
yield chunk;
chunk = []
}
}
}
Array.prototype.sum = function () {
return this.reduce((a, b) => a + b)
}
function* fnName(arr, N) {
let chunks = arr.chunks(N);
let a = chunks.next().value.sum();
for (let b of chunks) {
yield (a / N) - ((a = b.sum()) / N)
}
}
console.log([...fnName(arr, 2)])
console.log([...fnName(arr, 3)])
I have an nx4 matrix A representing n spheres, and an mx3 matrix B representing m points. I need to test whether these m points are inside any of the spheres. I can do this using a for loop, but with large n and m this method is very inefficient. How can I vectorize this operation? My current method is
A = [0.8622 1.1594 0.7457 0.6925;
1.4325 0.2559 0.0520 0.4687;
1.8465 0.3979 0.2850 0.4259;
1.4387 0.8713 1.6585 0.4616;
0.2383 1.5208 0.5415 0.9417;
1.6812 0.2045 0.1290 0.1972];
B = [0.5689 0.9696 0.8196;
0.5211 0.4462 0.6254;
0.9000 0.4894 0.2202;
0.4192 0.9229 0.4639];
for i=1:size(B,1)
mask = vecnorm(A(:, 1:3) - B(i,:), 2, 2) < A(:, 4);
if sum(mask) > 0
C(i) = true;
else
C(i) = false;
end %if
end %for
I tested the method suggested by #LuisMendo, and it seems it only speeds up the calculation for quite small m and n, but for large m and n, say, around 10000 for my problem, the improvement is very limited. But #NickyMattsson gave me some hint. Because logical operation in matlab is faster than vecnorm, I first use a rough check to find the spheres near the point, and then do a fine check:
A = [0.8622 1.1594 0.7457 0.6925;
1.4325 0.2559 0.0520 0.4687;
1.8465 0.3979 0.2850 0.4259;
1.4387 0.8713 1.6585 0.4616;
0.2383 1.5208 0.5415 0.9417;
1.6812 0.2045 0.1290 0.1972];
B = [0.5689 0.9696 0.8196;
0.5211 0.4462 0.6254;
0.9000 0.4894 0.2202;
0.4192 0.9229 0.4639];
ids = 1:size(A, 1);
for i=1:size(B,1)
% first a rough check
xbound = abs(A(:, 1) - B(i, 1)) < A(:, 4);
ybound = abs(A(:, 2) - B(i, 2)) < A(:, 4);
zbound = abs(A(:, 3) - B(i, 3)) < A(:, 4);
nears = ids(xbound & ybound & zbound);
if isempty(nears)
C(i) = false;
else
% then a fine check
mask = vecnorm(A(nears, 1:3) - B(i,:), 2, 2) < A(nears, 4);
if sum(mask) > 0
C(i) = true;
else
C(i) = false;
end
end
end
This may reduce the time to 1/2 or 1/3, which is acceptable, and if I divide m and n into batches it may be even faster without too heavy memory burden. #CrisLuengo mentioned the R*-tree method, but it seems that the implementation is quite complicated XD
This uses implicit expansion to compute all distances between points and sphere centers, and then to compare those with the sphere radii:
C = any(vecnorm(permute(B, [1 3 2]) - permute(A(:,1:3), [3 1 2]), 2, 3) < A(:,4).', 2);
This is probably faster than the loop approach, but also more memory-intensive, because an intermediate m×n×3 array is computed.
It seems I am stuck on the following problem with numpy.
I have an array X with shape: X.shape = (nexp, ntime, ndim, npart)
I need to compute binned statistics on this array along npart dimension, according to the values in binvals (and some bins), but keeping all the other dimensions there, because I have to use the binned statistic to remove some bias in the original array X. Binning values have shape binvals.shape = (nexp, ntime, npart).
A complete, minimal example, to explain what I am trying to do. Note that, in reality, I am working on large arrays and with several hunderds of bins (so this implementation takes forever):
import numpy as np
np.random.seed(12345)
X = np.random.randn(24).reshape(1,2,3,4)
binvals = np.random.randn(8).reshape(1,2,4)
bins = [-np.inf, 0, np.inf]
nexp, ntime, ndim, npart = X.shape
cleanX = np.zeros_like(X)
for ne in range(nexp):
for nt in range(ntime):
indices = np.digitize(binvals[ne, nt, :], bins)
for nd in range(ndim):
for nb in range(1, len(bins)):
inds = indices==nb
cleanX[ne, nt, nd, inds] = X[ne, nt, nd, inds] - \
np.mean(X[ne, nt, nd, inds], axis = -1)
Looking at the results of this may make it clearer?
In [8]: X
Out[8]:
array([[[[-0.20470766, 0.47894334, -0.51943872, -0.5557303 ],
[ 1.96578057, 1.39340583, 0.09290788, 0.28174615],
[ 0.76902257, 1.24643474, 1.00718936, -1.29622111]],
[[ 0.27499163, 0.22891288, 1.35291684, 0.88642934],
[-2.00163731, -0.37184254, 1.66902531, -0.43856974],
[-0.53974145, 0.47698501, 3.24894392, -1.02122752]]]])
In [10]: cleanX
Out[10]:
array([[[[ 0. , 0.67768523, -0.32069682, -0.35698841],
[ 0. , 0.80405255, -0.49644541, -0.30760713],
[ 0. , 0.92730041, 0.68805503, -1.61535544]],
[[ 0.02303938, -0.02303938, 0.23324375, -0.23324375],
[-0.81489739, 0.81489739, 1.05379752, -1.05379752],
[-0.50836323, 0.50836323, 2.13508572, -2.13508572]]]])
In [12]: binvals
Out[12]:
array([[[ -5.77087303e-01, 1.24121276e-01, 3.02613562e-01,
5.23772068e-01],
[ 9.40277775e-04, 1.34380979e+00, -7.13543985e-01,
-8.31153539e-01]]])
Is there a vectorized solution? I thought of using scipy.stats.binned_statistic, but I seem to be unable to understand how to use it for this aim. Thanks!
import numpy as np
np.random.seed(100)
nexp = 3
ntime = 4
ndim = 5
npart = 100
nbins = 4
binvals = np.random.rand(nexp, ntime, npart)
X = np.random.rand(nexp, ntime, ndim, npart)
bins = np.linspace(0, 1, nbins + 1)
d = np.digitize(binvals, bins)[:, :, np.newaxis, :]
r = np.arange(1, len(bins)).reshape((-1, 1, 1, 1, 1))
m = d[np.newaxis, ...] == r
counts = np.sum(m, axis=-1, keepdims=True).clip(min=1)
means = np.sum(X[np.newaxis, ...] * m, axis=-1, keepdims=True) / counts
cleanX = X - np.choose(d - 1, means)
Ok, I think I got it, mainly based on the answer by #jdehesa.
clean2 = np.zeros_like(X)
d = np.digitize(binvals, bins)
for i in range(1, len(bins)):
m = d == i
minds = np.where(m)
sl = [*minds[:2], slice(None), minds[2]]
msum = m.sum(axis=-1)
clean2[sl] = (X - \
(np.sum(X * m[...,np.newaxis,:], axis=-1) /
msum[..., np.newaxis])[..., np.newaxis])[sl]
Which gives the same results as my original code.
On the small arrays I have in the example here, this solution is approximately three times as fast as the original code. I expect it to be way faster on larger arrays.
Update:
Indeed it's faster on larger arrays (didn't do any formal test), but despite this, it just reaches the level of acceptable in terms of performance... any further suggestion on extra vectoriztaions would be very welcome.
I have the following line of code in Julia:
X=[(i,i^2) for i in 1:100 if i^2%5==0]
Basically, it returns a list of tuples (i,i^2) from i=1 to 100 if the remainder of i^2 and 5 is zero. What I want to do is, in the array comprehension, break out of the for loop if i^2 becomes larger than 1000. However, if I implement
X=[(i,i^2) for i in 1:100 if i^2%5==0 else break end]
I get the error: syntax: expected "]".
Is there any way to easily break out of this for loop inside the array? I've tried looking online, but nothing came up.
It's a "fake" for-loop, so you can't break it. Take a look at the lowered code below:
julia> foo() = [(i,i^2) for i in 1:100 if i^2%5==0]
foo (generic function with 1 method)
julia> #code_lowered foo()
LambdaInfo template for foo() at REPL[0]:1
:(begin
nothing
#1 = $(Expr(:new, :(Main.##1#3)))
SSAValue(0) = #1
#2 = $(Expr(:new, :(Main.##2#4)))
SSAValue(1) = #2
SSAValue(2) = (Main.colon)(1,100)
SSAValue(3) = (Base.Filter)(SSAValue(1),SSAValue(2))
SSAValue(4) = (Base.Generator)(SSAValue(0),SSAValue(3))
return (Base.collect)(SSAValue(4))
end)
The output shows that array comprehension is implemented via Base.Generator which takes an iterator as input. It only supports the [if cond(x)::Bool] "guard" for now, so there is no way to use break here.
For your specific case, a workaround is to use isqrt:
julia> X=[(i,i^2) for i in 1:isqrt(1000) if i^2%5==0]
6-element Array{Tuple{Int64,Int64},1}:
(5,25)
(10,100)
(15,225)
(20,400)
(25,625)
(30,900)
I don't think so. You could always just
tmp(i) = (j = i^2; j > 1000 ? false : j%5==0)
X=[(i,i^2) for i in 1:100 if tmp(i)]
Using a for loop is considered idiomatic in Julia and could be more readable in this instance. Also, it could be faster.
Specifically:
julia> using BenchmarkTools
julia> tmp(i) = (j = i^2; j > 1000 ? false : j%5==0)
julia> X1 = [(i,i^2) for i in 1:100 if tmp(i)];
julia> #btime [(i,i^2) for i in 1:100 if tmp(i)];
471.883 ns (7 allocations: 528 bytes)
julia> X2 = [(i,i^2) for i in 1:isqrt(1000) if i^2%5==0];
julia> #btime [(i,i^2) for i in 1:isqrt(1000) if i^2%5==0];
281.435 ns (7 allocations: 528 bytes)
julia> function goodsquares()
res = Vector{Tuple{Int,Int}}()
for i=1:100
if i^2%5==0 && i^2<=1000
push!(res,(i,i^2))
elseif i^2>1000
break
end
end
return res
end
julia> X3 = goodsquares();
julia> #btime goodsquares();
129.123 ns (3 allocations: 304 bytes)
So, another 2x improvement is nothing to disregard and the long function gives plenty of room for illuminating comments.
I have the antenna array factor expression here:
I have coded the array factor expression as given below:
lambda = 1;
M = 100;N = 200; %an M x N array
dx = 0.3*lambda; %inter-element spacing in x direction
m = 1:M;
xm = (m - 0.5*(M+1))*dx; %element positions in x direction
dy = 0.4*lambda;
n = 1:N;
yn = (n - 0.5*(N+1))*dy;
thetaCount = 360; % no of theta values
thetaRes = 2*pi/thetaCount; % theta resolution
thetas = 0:thetaRes:2*pi-thetaRes; % theta values
phiCount = 180;
phiRes = pi/phiCount;
phis = -pi/2:phiRes:pi/2-phiRes;
cmpWeights = rand(N,M); %complex Weights
AF = zeros(phiCount,thetaCount); %Array factor
tic
for i = 1:phiCount
for j = 1:thetaCount
for p = 1:M
for q = 1:N
AF(i,j) = AF(i,j) + cmpWeights(q,p)*exp((2*pi*1j/lambda)*(xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i))));
end
end
end
end
How can I vectorize the code for calculating the Array Factor (AF).
I want the line:
AF(i,j) = AF(i,j) + cmpWeights(q,p)*exp((2*pi*1j/lambda)*(xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i))));
to be written in vectorized form (by modifying the for loop).
Approach #1: Full-throttle
The innermost nested loop generates this every iteration - cmpWeights(q,p)*exp((2*pi*1j/lambda)*(xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i)))), which are to summed up iteratively to give us the final output in AF.
Let's call the exp(.... part as B. Now, B basically has two parts, one is the scalar (2*pi*1j/lambda) and the other part
(xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i))) that is formed from the variables that are dependent on
the four iterators used in the original loopy versions - i,j,p,q. Let's call this other part as C for easy reference later on.
Let's put all that into perspective:
Loopy version had AF(i,j) = AF(i,j) + cmpWeights(q,p)*exp((2*pi*1j/lambda)*(xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i)))), which is now equivalent to AF(i,j) = AF(i,j) + cmpWeights(q,p)*B, where B = exp((2*pi*1j/lambda)*(xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i)))).
B could be simplified to B = exp((2*pi*1j/lambda)* C), where C = (xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i))).
C would depend on the iterators - i,j,p,q.
So, after porting onto a vectorized way, it would end up as this -
%// 1) Define vectors corresponding to iterators used in the loopy version
I = 1:phiCount;
J = 1:thetaCount;
P = 1:M;
Q = 1:N;
%// 2) Create vectorized version of C using all four vector iterators
mult1 = bsxfun(#times,sin(thetas(J)),cos(phis(I)).'); %//'
mult2 = bsxfun(#times,sin(thetas(J)),sin(phis(I)).'); %//'
mult1_xm = bsxfun(#times,mult1(:),permute(xm,[1 3 2]));
mult2_yn = bsxfun(#times,mult2(:),yn);
C_vect = bsxfun(#plus,mult1_xm,mult2_yn);
%// 3) Create vectorized version of B using vectorized C
B_vect = reshape(exp((2*pi*1j/lambda)*C_vect),phiCount*thetaCount,[]);
%// 4) Final output as matrix multiplication between vectorized versions of B and C
AF_vect = reshape(B_vect*cmpWeights(:),phiCount,thetaCount);
Approach #2: Less-memory intensive
This second approach would reduce the memory traffic and it uses the distributive property of exponential - exp(A+B) = exp(A)*exp(B).
Now, the original loopy version was this -
AF(i,j) = AF(i,j) + cmpWeights(q,p)*exp((2*pi*1j/lambda)*...
(xm(p)*sin(thetas(j))*cos(phis(i)) + yn(q)*sin(thetas(j))*sin(phis(i))))
So, after using the distributive property, we would endup with something like this -
K = (2*pi*1j/lambda)
part1 = K*xm(p)*sin(thetas(j))*cos(phis(i));
part2 = K*yn(q)*sin(thetas(j))*sin(phis(i));
AF(i,j) = AF(i,j) + cmpWeights(q,p)*exp(part1)*exp(part2);
Thus, the relevant vectorized approach would become something like this -
%// 1) Define vectors corresponding to iterators used in the loopy version
I = 1:phiCount;
J = 1:thetaCount;
P = 1:M;
Q = 1:N;
%// 2) Define the constant used at the start of EXP() call
K = (2*pi*1j/lambda);
%// 3) Perform the sine-cosine operations part1 & part2 in vectorized manners
mult1 = K*bsxfun(#times,sin(thetas(J)),cos(phis(I)).'); %//'
mult2 = K*bsxfun(#times,sin(thetas(J)),sin(phis(I)).'); %//'
%// Perform exp(part1) & exp(part2) in vectorized manners
part1_vect = exp(bsxfun(#times,mult1(:),xm));
part2_vect = exp(bsxfun(#times,mult2(:),yn));
%// Perform multiplications with cmpWeights for final output
AF = reshape(sum((part1_vect*cmpWeights.').*part2_vect,2),phiCount,[])
Quick Benchmarking
Here are the runtimes with the input data listed in the question for the original loopy approach and proposed approach #2 -
---------------------------- With Original Approach
Elapsed time is 358.081507 seconds.
---------------------------- With Proposed Approach #2
Elapsed time is 0.405038 seconds.
The runtimes suggests a crazy performance improvement with Approach #2!
The basic trick is to figure out what things are constant, and what things depend on the subscript term - and therefore are matrix terms.
Within the sum:
C(n,m) is a matrix
2π/λ is a constant
sin(θ)cos(φ) is a constant
x(m) and y(n) are vectors
So the two things I would do are:
Expand the xm and ym into matrices using meshgrid()
Take all the constant term stuff outside the loop.
Like this:
...
piFactor = 2 * pi * 1j / lambda;
[xgrid, ygrid] = meshgrid(xm, ym); % xgrid and ygrid will be size (N, M)
for i = 1:phiCount
for j = 1:thetaCount
xFactor = sin(thetas(j)) * cos(phis(i));
yFactor = sin(thetas(j)) * sin(phis(i));
expFactor = exp(piFactor * (xgrid * xFactor + ygrid * yFactor)); % expFactor is size (N, M)
elements = cmpWeights .* expFactor; % elements of sum, size (N, M)
AF(i, j) = AF(i, j) + sum(elements(:)); % sum and then integrate.
end
end
You could probably figure out how to vectorise the outer loop too, but hopefully that gives you a starting point.