It's related to this question
I want to know how to calculate the median along a specific dimension of a huge array, for example one of size (20, 1920, 1080, 3). I am not sure whether there is any practical purpose, but I just wanted to check how well median works in Julia.
With numpy it takes ~0.5 seconds to calculate medians on (3, 1920, 1080, 3). It is very fast on an array of zeros (less than 2 seconds on (120, 1920, 1080, 3)) and slower but still fine on real images (20 seconds on (120, 1920, 1080, 3)).
Python code:
import cv2
import sys
import numpy as np
import time
ZEROES=True
N_IMGS=20
print("n_imgs:", N_IMGS)
print("use dummy data:", ZEROES)
imgs_paths = sys.argv[1:]
imgs_paths.sort()
imgs_paths_sparse = imgs_paths[::30]
imgs_paths = imgs_paths_sparse[:N_IMGS]  # keep the first N_IMGS paths (a slice, not a single element)
if ZEROES:
    imgs_arr = np.zeros((N_IMGS,1080,1920,3), dtype=np.float32)
else:
    imgs = map(cv2.imread, imgs_paths)
    imgs_arr = np.array(list(imgs), dtype=np.float32)
start = time.time()
imgs_median = np.median(imgs_arr, 0)
end = time.time()
print("time:", end - start)
cv2.imwrite('/tmp/median.png', imgs_median)
In Julia I can only calculate the median of (3, 1920, 1080, 3). Beyond that, my earlyoom process kills the Julia process because of the huge amount of memory used.
I tried approach similar to what I tried first on max:
function median1(imgs_arr)
    a = imgs_arr
    b = reshape(cat(a..., dims=1), tuple(length(a), size(a[1])...))
    imgs_max = Statistics.median(b, dims=1)
    return imgs_max
end
Or even more simple case:
import Statistics
a = zeros(3,1080,1920,3)
@time Statistics.median(a, dims=1)
10.609627 seconds (102.64 M allocations: 2.511 GiB, 3.37% gc time)
...
So, it takes 10 seconds vs 0.5 seconds on numpy.
I have only 4 CPU cores, so the difference is not simply down to parallelization.
Is there a reasonably simple way to optimize it?
Or at least to take slices and compute them one by one without using so much memory?
It's hard to know if the fact that the images are loaded separately is a key part of the problem here or not since the setup for the problem in Julia is missing and it's a bit hard for Julia programmers to follow the Python setup or know how much we need to match it. You either need to:
Load or move the image data so that they are, in fact, part of the same array and then take the median of that;
Make a set of spatially unrelated values in different arrays abstractly behave as though they are part of a single array and then take the median of that collection via a method that's generic enough to handle this abstraction.
Fredrik's answer implicitly assumes that you have already loaded the image data so that they're all part of the same contiguous array. If that's the case, however, then you don't even need JuliennedArrays, you can just use the median function from the Statistics stdlib:
julia> a = rand(3, 1080, 1920, 3);
julia> using Statistics
julia> median(a, dims=1)
1×1080×1920×3 Array{Float64,4}:
[:, :, 1, 1] =
0.63432 0.205958 0.216221 0.571541 … 0.238637 0.285947 0.901014
[:, :, 2, 1] =
0.821851 0.486859 0.622313 … 0.917329 0.417657 0.724073
If you can load the data like this, it's the best approach: this is by far the most efficient representation of a bunch of same-sized images and makes vectorized operations across images easy and efficient. The first dimension is the most efficient one to do operations across because Julia is column-major, so elements along the first dimension (within a column) are stored contiguously.
The best way to get the images into contiguous memory is to pre-allocate an uninitialized array of the right type and dimensions and then read the data into the array using some in-place API. For some reason your Julia code appears to have loaded the images as a vector of individual arrays while your Python code seems to have loaded all of the images into a single array?
The approach of reshaping and concatenating is an extreme case of the second approach where you move all of the data all at once before then applying a vectorized median operation. Obviously, that involves moving a lot of data around, which is pretty inefficient.
Due to memory locality, it may be more efficient to copy a single slice of the data into a temporary array and compute the median of that. That can be done pretty easily with an array comprehension:
julia> v_of_a = [rand(1080, 1920, 3) for _ = 1:3]
3-element Array{Array{Float64,3},1}:
[0.7206652600431633 0.7675119703509619 … 0.7117084561740263 0.8736518021960584; 0.8038479801395197 0.3159392943734012 … 0.976319025405266 0.3278606124069767; … ; 0.7424260315304789 0.4748658164109498 … 0.9942311708400311 0.37048961459068086; 0.7832577306186075 0.13184454935145773 … 0.5895094390350453 0.5470111170897787]
[0.26401298651503025 0.9113932653115289 … 0.5828647778524962 0.752444909740893; 0.5673144007678044 0.8154276504227804 … 0.2667436824684424 0.4895443896447764; … ; 0.2641913584303701 0.16639100493266934 … 0.1860616855126005 0.04922131616483538; 0.4968214514330498 0.994935452055218 … 0.28097239922248685 0.4980189891952156]
julia> [median(a[i,j,k] for a in v_of_a) for i=1:1080, j=1:1920, k=1:3]
1080×1920×3 Array{Float64,3}:
[:, :, 1] =
0.446895 0.643648 0.694714 … 0.221553 0.711708 0.225268
0.659251 0.457686 0.672072 0.731218 0.449915 0.129987
0.573196 0.328747 0.668702 0.355231 0.656686 0.303168
0.243656 0.702642 0.45708 0.23415 0.400252 0.482792
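For comparison with the NumPy code in the question, the same slice-at-a-time idea can be sketched in Python. This is only a minimal sketch under assumptions: `median_in_bands` and its `band` parameter are names made up for this illustration, and it assumes the images are already in memory as a list of same-shaped arrays. Only one band of rows from every image is copied into a temporary block at a time, so the full 4D array is never built.

```python
import numpy as np

def median_in_bands(images, band=64):
    """Per-pixel median over a list of same-shaped images, one band of rows at a time.

    Hypothetical helper for illustration; `band` is the number of rows per block.
    """
    first = images[0]
    out = np.empty(first.shape, dtype=np.float64)
    for start in range(0, first.shape[0], band):
        stop = min(start + band, first.shape[0])
        # Copy only this band of every image into one small contiguous block.
        block = np.stack([img[start:stop] for img in images], axis=0)
        out[start:stop] = np.median(block, axis=0)
    return out

# Small random stand-ins for real images.
rng = np.random.default_rng(0)
images = [rng.random((108, 192, 3)) for _ in range(5)]
result = median_in_bands(images, band=32)
```

The peak extra memory is roughly the size of one block (number of images × band × width × channels) rather than the whole stack, at the cost of one copy per band.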
Try JuliennedArrays.jl
julia> a = zeros(3,1080,1920,3);
julia> using JuliennedArrays
julia> @time map(median, Slices(a,1));
0.822429 seconds (6.22 M allocations: 711.915 MiB, 20.15% gc time)
As Stefan commented below, the built-in median does the same thing, but much slower
julia> @time median(a, dims=1);
7.450394 seconds (99.80 M allocations: 2.368 GiB, 4.47% gc time)
at least as of
julia> VERSION
v"1.5.0-DEV.876"
Related
I need to prepare "flattened" versions of 2D FFT frequencies with shape Nx^2 × 2. Those are basically constructed like a ravel(meshgrid(fftfreqs1d,fftfreqs1d)) in Matlab or Python.
This appears to be no big deal in Python, but can hang for reasonable array sizes in Julia, especially when I want to build a StaticArray out of the intermediate results. To make it more confusing, @btime pretends that my arrays are created in no time, while they are clearly not.
My question is why this happens and how it is done right.
I am aware that in Julia it might be a waste to keep the full 2D fftfreqs in memory instead of using the 1D versions and a loop, but let us assume for a moment that I need it this way.
Julia
using FFTW, BenchmarkTools  # fftfreq and @btime
function my_freqs1(Nnu::Int,T)
    dx = 2.0 / Nnu
    freq1d = fftfreq(Nnu).*dx
    nu = hcat( vec([ i for i in freq1d, j in freq1d ]),
               vec([ j for i in freq1d, j in freq1d ]))
    return nu
end;
@btime my_freqs1(100,Float64)
28.528 μs (10 allocations: 312.80 KiB)
Julia, converting to a static array (in the hope for better performance of other code later on)
function my_freqs2(Nnu::Int,T)
    ### the same as above ###
    return SMatrix{Nnu^2,2,T}(nu)
end;
@btime my_freqs2(100,Float64)
94.540 μs (36 allocations: 470.38 KiB)
Python
def my_fftfreqs(xy):
    freqs = np.fft.fftfreq(np.shape(xy)[0],d=xy[1]-xy[0])
    fx,fy = np.meshgrid(freqs,freqs,indexing="ij")
    freq_list = np.transpose(np.asarray( [np.ravel(fx),np.ravel(fy)] ))
    return freq_list
%time f=my_fftfreqs(np.linspace(0,1,100));
CPU times: user 1.08 ms, sys: 0 ns, total: 1.08 ms
Wall time: 600 µs
My observation is that while Python's %time reports a much longer time, it actually runs in a very reasonable time, whereas the Julia version has a noticeable delay, and the version with the static array hangs for a long time and crashes completely for larger sizes.
Please help me to understand how I would do this correctly in Julia and why creating a static array seems to be such a bad idea.
Rather than making a SMatrix{Nnu^2,2} I think you probably want to make a Vector{SVector{2}}. The former will require recompiling for each new value of Nnu which is fairly inefficient.
You may also consider:
using FFTW
my_freqs3(ν) = fftfreq(ν)*2/ν |>
(w -> [repeat(w, inner=length(w)) repeat(w, outer=length(w))])
# or
my_freqs3alt(ν) = ( w = fftfreq(ν)*2/ν ;
[repeat(w, inner=length(w)) repeat(w, outer=length(w))] )
which is more Julian and "if-I-understand-correctly" is equivalent.
Usually shorter/simpler functions are also more efficient.
Julia features used:
Unicode nu variable.
Piping |> operator.
Definition with no function keyword.
repeat standard library vector filling function.
Matlab-like hcat [v1 v2] notation.
Multi-statement block enclosed in ( ) separated by ;.
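For readers coming from the Python side of the question, here is a small sketch of the same inner/outer repeat construction in NumPy. It is an illustration rather than a port of the Julia code: `flat_freq_pairs` and `flat_freq_pairs_meshgrid` are hypothetical names, the frequency scaling is left as a plain `d` argument, and the assumption is that `np.repeat` stands in for `repeat(w, inner=...)` and `np.tile` for `repeat(w, outer=...)`.

```python
import numpy as np

def flat_freq_pairs(n, d=1.0):
    """Return an (n*n, 2) array listing every (fx, fy) pair of 1D FFT frequencies."""
    w = np.fft.fftfreq(n, d=d)
    # np.repeat repeats each element n times; np.tile repeats the whole vector n times.
    return np.column_stack([np.repeat(w, n), np.tile(w, n)])

def flat_freq_pairs_meshgrid(n, d=1.0):
    """Same table built the meshgrid way, as in the question's Python code."""
    w = np.fft.fftfreq(n, d=d)
    fx, fy = np.meshgrid(w, w, indexing="ij")
    return np.column_stack([fx.ravel(), fy.ravel()])

pairs = flat_freq_pairs(4)
```

Both functions produce the same table, so the repeat/tile form can replace the meshgrid without building the intermediate 2D grids.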
New to julia, so this is probably very easy.
I have an n-by-m array and a vector of length n and want to repeat each row of the array the number of times in the corresponding element of the vector. For example:
mat = rand(3,6)
v = vec([2 3 1])
The result should be a 6-by-6 array. I tried the repeat function but
repeat(mat, inner = v)
yields a 6×18×1 Array{Float64,3} instead, so it takes v to be the dimensions along which to repeat the elements. In Matlab I would use repelem(mat, v, 1), and I hope Julia offers something similar. My actual matrix is a lot bigger and I will have to call the function many times, so this operation needs to be as fast as possible.
It has been discussed to add a similar thing to Julia Base, but currently it is not implemented yet AFAIK. You can achieve what you want using the inverse_rle function from StatsBase.jl:
julia> using StatsBase
julia> row_idx = inverse_rle(axes(v, 1), v)
6-element Array{Int64,1}:
1
1
2
2
2
3
and now you can write:
mat[row_idx, :]
or
@view mat[row_idx, :]
(the second option creates a view which might be relevant in your use case if you say that your mat is large and you need to do such indexing many times - which option is faster will depend on your exact use case).
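The index-expansion idea is not specific to Julia. As a point of comparison only, here is a minimal NumPy sketch of the same two steps (expand the counts into row indices, then index with them); the variable names are made up for this example and the indices are 0-based.

```python
import numpy as np

mat = np.arange(18.0).reshape(3, 6)
counts = np.array([2, 3, 1])

# Expand the counts into row indices: row 0 twice, row 1 three times, row 2 once.
row_idx = np.repeat(np.arange(len(counts)), counts)

# Index with the expanded rows to get the 6-by-6 result.
expanded = mat[row_idx, :]
```

In NumPy the indexing step always copies; `np.repeat(mat, counts, axis=0)` gives the same result in one call.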
Going through Julia's performance tips, I haven't found any suggestions regarding how to speed up code with three-dimensional arrays.
From my understanding d-element Array{Array{Float64,2},1} would perform best when d (the third dimension) is small. However, I am not sure whether this is the case when d is large.
Is there any tutorial on this topic for Julia?
Example 1a (d=50)
x = [zeros(100, 10) for d=1:50];
@time for d=1:50
    x[d] = rand(100,10);
end
0.000100 seconds (50 allocations: 396.875 KB)
Example 1b (d=50)
y=zeros(100, 10, 50);
@time for d=1:50
    y[:,:,d] = rand(100,10);
end
0.000257 seconds (200 allocations: 400.781 KB)
Example 2a (d=50000)
x = [zeros(100, 10) for d=1:50000];
@time for d=1:50000
    x[d] = rand(100,10);
end
0.410813 seconds (99.49 k allocations: 388.328 MB, 81.88% gc time)
Example 2b (d=50000)
y=zeros(100, 10, 50000);
@time for d=1:50000
    y[:,:,d] = rand(100,10);
end
0.185929 seconds (298.98 k allocations: 392.898 MB, 6.83% gc time)
From my understanding d-element Array{Array{Float64,2},1} would perform best when d (the third dimension) is small. However, I am not sure whether this is the case when d is large.
No, it's moreso how you use it. A = Array{Array{Float64,2},1} is an array of pointers to matrices. The value of an array is the pointer or the reference. Thus A[i] returns a reference, i.e. it's cheap. A2 = Array{Float64,3} is a contiguous array of floats. It's really just an indexing setup over a linear slab of memory (and has a linear index A2[i] which runs through the whole thing using that linear form).
The latter has some advantages because it is contiguous. There's no indirection, so looping over all of A2's values will be faster. A has to dereference two pointers to get a value, so a simple 3D loop will be slower if you don't know to dereference each internal matrix only once. Also, you can get views to the matrices via @view A2[:,:,1] etc., but you have to take note that A2[:,:,1] by itself will make a copy of the matrix. A[1] is naturally a view because it returns the reference to the matrix, and if you want to copy you'd have to explicitly do copy(A[1]). Because A is just a linear array of pointers, push!ing a new matrix onto it is cheap since it's just increasing a relatively small array (and push! is automatically amortized) to add a new pointer on the end (this is why things like DifferentialEquations.jl use arrays of arrays to build timeseries instead of the more traditional matrix).
So they are different tools with different advantages and disadvantages.
As for your timings, you're doing two different things. x[d] = rand(100,10) is creating a new matrix and adding its reference to x. y[:,:,d] = rand(100,10) is creating a new matrix and looping through the values of y to change the values of y. You can see why that's slower. But what you're leaving out is the allocation-free cases.
function f2()
    y=zeros(100, 10, 50);
    @time for i in eachindex(y)
        y[i] = rand()
    end
    y
end
In the small case this matches the array creation. You can't naively do this on case one, but as I said, if you dereference the pointer for the matrix once you do really well:
function f()
    x = [zeros(100, 10) for d=1:5000];
    @time @inbounds for d=1:50
        xd = x[d]
        for i in eachindex(xd)
            xd[i] = rand()
        end
    end
    x
end
So arrays of arrays can be great data structures in the right cases. The library RecursiveArrayTools.jl was created to take better advantage of them. For example, A3 = VectorOfArray(A) gives A3 the same indexing structure as A2 by lazily translating A3[i,j,k] to A[k][i,j]. It keeps the advantages of A, but will automatically make sure to broadcast in the correct way, like f. Another tool like this is the ArrayPartition, which allows for heterogeneous typing in a broadcast-performant way.
So yeah, it's not always the right tool, but these heterogeneous and recursive arrays are great tools when used correctly.
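The reference-versus-copy distinction behind the timings above is not unique to Julia. As a rough illustration only, here is the same contrast in NumPy, with a Python list of matrices standing in for the array of arrays. The names are made up for this sketch, and because NumPy is row-major by default the "slab" index is placed first rather than last.

```python
import numpy as np

x = [np.zeros((100, 10)) for _ in range(3)]   # list of separate matrices
y = np.zeros((3, 100, 10))                    # one contiguous block

new = np.random.rand(100, 10)
x[0] = new   # rebinds the slot to the new matrix: no element is copied
y[0] = new   # copies all 1000 values into y's own memory

stored_by_reference = x[0] is new
copied_into_block = not np.shares_memory(y, new)
```

Both flags come out true: the list slot holds the very same matrix object, while the contiguous block holds an independent copy of its values.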
I need to split the variable z::Array{Complex128,1} into two arrays for the real and imaginary parts. One way to do this is to make new variables ::Array{Float64,1} and fill them element by element:
for i = 1:size(z)[1]
    ri[i] = z[i].re
    ii[i] = z[i].im
end
Is there a way to do this that doesn't involve copying data, like somehow manipulating strides and offsets of z?
In the common case where copying is not an issue, just do real.(z) and imag.(z). I include this to help future readers who have a similar issue, but who might not care about copying.
As you suggest, you can manipulate strides of z to avoid copying data. Simply
zfl = reinterpret(Float64, z)
zre = @view zfl[1:2:end-1]
zim = @view zfl[2:2:end]
Combined, we observe that there is no data copying (the allocations are due to the heap-allocated array views, and are minimal).
julia> z = Vector{ComplexF64}(undef, 100000);
julia> function reimvec(z)
           zfl = reinterpret(Float64, z)
           zre = @view zfl[1:2:end-1]
           zim = @view zfl[2:2:end]
           zre, zim
       end
reimvec (generic function with 1 method)
julia> @time reimvec(z);
0.000005 seconds (9 allocations: 400 bytes)
As we can see, behind the scenes, such an array is strided:
julia> strides(reimvec(z)[1])
(2,)
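The same reinterpret-and-stride trick can be sketched in NumPy for comparison. This is an illustration under assumptions, not part of the Julia answer: the variable names mirror the ones above, and note that NumPy reports strides in bytes (16 here) where Julia reports them in elements (2).

```python
import numpy as np

z = np.arange(5) + 1j * np.arange(5, 10)   # complex128 vector

# Reinterpret the same buffer as float64: re0, im0, re1, im1, ...
zfl = z.view(np.float64)
zre = zfl[0::2]   # every other float, starting at the real parts
zim = zfl[1::2]   # every other float, starting at the imaginary parts

zre[0] = 42.0     # writes through to z, since nothing was copied
```

Because `zre` and `zim` are views, modifying them modifies `z` itself, which is the behaviour to keep in mind when avoiding the copy.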
This appears to be a simple issue, but I've been struggling trying to efficiently split a 2D array:
import time
import numpy as np
start_time = time.time()
M = np.ones((400,400))
for i in range(10000):
    e = np.array_split(M, 20)
print(time.time() - start_time)
However, this process takes ~6 seconds compared to ~0.5 seconds when implemented in Mathematica with the Partition function, which can become a liability when the array gets much larger. Is there any way for me to speed up the process?
np.array_split may be useful when splitting an array into uneven pieces. Here, the size of each item in e is the same, so you could just use reshape:
e = M.reshape(20,-1)
This will be exceedingly fast, since it requires no copying of the array, only a change to the array's shape attribute.
e will be a 2D NumPy array of shape (20, 8000), not a list of NumPy arrays.
In [56]: M = np.ones((400,400))
In [60]: %timeit M.reshape(20,-1)
1000000 loops, best of 3: 447 ns per loop
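To make the relationship between the two results concrete, here is a small check, written as a sketch with made-up variable names. It assumes, as above, that the number of rows divides evenly into 20 pieces. The three-dimensional reshape is an extra variant, not part of the answer above, for the case where each piece should stay two-dimensional.

```python
import numpy as np

M = np.arange(400 * 400, dtype=float).reshape(400, 400)

pieces = np.array_split(M, 20)     # list of 20 arrays, each of shape (20, 400)
e = M.reshape(20, -1)              # shape (20, 8000), no data copied
blocks = M.reshape(20, 20, 400)    # keeps each piece 2D: blocks[i] matches pieces[i]

# Row i of e is piece i flattened; blocks[i] is piece i as is.
same_rows = all(np.array_equal(e[i], pieces[i].ravel()) for i in range(20))
same_blocks = all(np.array_equal(blocks[i], pieces[i]) for i in range(20))
```

Both reshapes return views of `M`, so writing into `e` or `blocks` also changes `M`.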