Array assembly and StaticArrays under Julia: Why is my performance so bad?

I need to prepare "flattened" versions of 2D FFT frequencies with shape Nx^2 × 2. These are basically constructed like a ravel(meshgrid(fftfreqs1d, fftfreqs1d)) in MATLAB or Python.
This appears to be no big deal in Python, but it can hang for reasonable array sizes in Julia, especially when I want to build a StaticArray out of the intermediate results. To make it more confusing, @btime claims that my arrays are created in no time, while they clearly are not.
My question is why this happens and how to do it right.
I am aware that in Julia it might be a waste to keep the full 2D frequencies in memory instead of using the 1D versions and a loop, but let us assume for a moment that I need it this way.
Julia
function my_freqs1(Nnu::Int, T)
    dx = 2. / Nnu
    freq1d = fftfreq(Nnu) .* dx
    nu = hcat( vec([ i for i in freq1d, j in freq1d ]),
               vec([ j for i in freq1d, j in freq1d ]) )
    return nu
end;
@btime my_freqs1(100, Float64)
28.528 μs (10 allocations: 312.80 KiB)
Julia, converting to a static array (in the hope of better performance in other code later on):
function my_freqs2(Nnu::Int, T)
    ### the same as above ###
    return SMatrix{Nnu^2,2,T}(nu)
end;
@btime my_freqs2(100, Float64)
94.540 μs (36 allocations: 470.38 KiB)
Python
def my_fftfreqs(xy):
    freqs = np.fft.fftfreq(np.shape(xy)[0], d=xy[1]-xy[0])
    fx, fy = np.meshgrid(freqs, freqs, indexing="ij")
    freq_list = np.transpose(np.asarray([np.ravel(fx), np.ravel(fy)]))
    return freq_list
%time f = my_fftfreqs(np.linspace(0, 1, 100));
CPU times: user 1.08 ms, sys: 0 ns, total: 1.08 ms
Wall time: 600 µs
My observation is that while Python's %time reports a much longer time, the function actually runs in a very reasonable time, whereas the Julia version has a noticeable delay, and the StaticArray version hangs for a long time and crashes outright for larger sizes.
Please help me understand how I would do this correctly in Julia, and whether (and if so, why) creating a static array is such a bad idea.

Rather than making an SMatrix{Nnu^2,2}, I think you probably want to make a Vector{SVector{2}}; a sketch follows below. The former requires recompiling for each new value of Nnu, which is fairly inefficient.
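A minimal, hedged sketch of that approach (the function name my_freqs_sv and the default T=Float64 are mine, not from the question):
using FFTW, StaticArrays
# Build a Vector of 2-element SVectors instead of one big SMatrix.
# The element type SVector{2,T} is the same for every Nnu, so changing
# the grid size triggers no recompilation.
function my_freqs_sv(Nnu::Int, T=Float64)
    freq1d = fftfreq(Nnu) .* (T(2) / Nnu)
    return [SVector{2,T}(fi, fj) for fi in freq1d for fj in freq1d]
end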

You may also consider:
using FFTW
my_freqs3(ν) = fftfreq(ν)*2/ν |>
    (w -> [repeat(w, inner=length(w)) repeat(w, outer=length(w))])
# or
my_freqs3alt(ν) = ( w = fftfreq(ν)*2/ν ;
    [repeat(w, inner=length(w)) repeat(w, outer=length(w))] )
which is more Julian and, if I understand correctly, equivalent (a quick check follows the feature list below).
Usually shorter/simpler functions are also more efficient.
Julia features used:
Unicode ν variable.
Piping |> operator.
Definition with no function keyword.
repeat standard library vector-filling function.
Matlab-like hcat [v1 v2] notation.
Multi-statement block enclosed in ( ) separated by ;.
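A hedged way to check the claimed equivalence at the REPL: sorting the rows with sortslices sidesteps the difference in row/column ordering between the two constructions, and ≈ absorbs floating-point rounding from the slightly different scaling order.
julia> sortslices(my_freqs3(100), dims=1) ≈ sortslices(my_freqs1(100, Float64), dims=1)  # expected: true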

Related

Median of multiple arrays in Julia

It's related to this question
I want to know how to calculate the median along a specific dimension of a huge array, for example one of size (20, 1920, 1080, 3). I'm not sure whether there is any practical purpose, but I just wanted to check how well median works in Julia.
It takes ~0.5 seconds to calculate medians on (3, 1920, 1080, 3) with numpy. It works very fast on an array of zeros (less than 2 seconds on (120, 1920, 1080, 3)) and works not so fast but fine on real images (20 seconds on (120, 1920, 1080, 3)).
Python code:
import cv2
import sys
import numpy as np
import time
ZEROES = True
N_IMGS = 20
print("n_imgs:", N_IMGS)
print("use dummy data:", ZEROES)
imgs_paths = sys.argv[1:]
imgs_paths.sort()
imgs_paths_sparse = imgs_paths[::30]
imgs_paths = imgs_paths_sparse[:N_IMGS]
if ZEROES:
    imgs_arr = np.zeros((N_IMGS, 1080, 1920, 3), dtype=np.float32)
else:
    imgs = map(cv2.imread, imgs_paths)
    imgs_arr = np.array(list(imgs), dtype=np.float32)
start = time.time()
imgs_median = np.median(imgs_arr, 0)
end = time.time()
print("time:", end - start)
cv2.imwrite('/tmp/median.png', imgs_median)
In Julia I can only calculate the median of (3, 1920, 1080, 3). Beyond that, my earlyoom process kills the Julia process because of the huge amount of memory used.
I tried an approach similar to what I had first tried with max:
function median1(imgs_arr)
    a = imgs_arr
    b = reshape(cat(a..., dims=1), tuple(length(a), size(a[1])...))
    imgs_max = Statistics.median(b, dims=1)
    return imgs_max
end
Or an even simpler case:
import Statistics
a = zeros(3, 1080, 1920, 3)
@time Statistics.median(a, dims=1)
10.609627 seconds (102.64 M allocations: 2.511 GiB, 3.37% gc time)
...
So, it takes 10 seconds vs 0.5 seconds with numpy.
I have only 4 CPU cores, so it's not simply a matter of parallelization.
Is there a more or less simple way to optimize this somehow?
Or at least a way to take slices and compute them one by one without overusing memory?
It's hard to know whether the fact that the images are loaded separately is a key part of the problem, since the Julia setup is missing and it's a bit hard for Julia programmers to follow the Python setup or to know how closely we need to match it. You either need to:
Load or move the image data so that they are, in fact, part of the same array and then take the median of that;
Make a set of spatially unrelated values in different arrays abstractly behave as though they are part of a single array and then take the median of that collection via a method that's generic enough to handle this abstraction.
Fredrik's answer implicitly assumes that you have already loaded the image data so that they're all part of the same contiguous array. If that's the case, however, then you don't even need JuliennedArrays, you can just use the median function from the Statistics stdlib:
julia> a = rand(3, 1080, 1920, 3);
julia> using Statistics
julia> median(a, dims=1)
1×1080×1920×3 Array{Float64,4}:
[:, :, 1, 1] =
0.63432 0.205958 0.216221 0.571541 … 0.238637 0.285947 0.901014
[:, :, 2, 1] =
0.821851 0.486859 0.622313 … 0.917329 0.417657 0.724073
If you can load the data like this, it's the best approach: this is by far the most efficient representation of a bunch of same-sized images and makes vectorized operations across images easy and efficient. The first dimension is the most efficient one to operate across because Julia is column-major, so the first dimension (columns) is stored contiguously.
The best way to get the images into contiguous memory is to pre-allocate an uninitialized array of the right type and dimensions, then read the data into that array using some in-place API. For some reason your Julia code appears to have loaded the images as a vector of individual arrays, while your Python code seems to have loaded all of the images into a single array. A sketch of the pre-allocation pattern follows.
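A hedged sketch of that pattern (n_imgs, img_paths, and load_image are hypothetical placeholders, not an API from the question):
# Pre-allocate one contiguous block, then fill each image slice in place.
imgs_arr = Array{Float32}(undef, n_imgs, 1080, 1920, 3)
for (k, path) in enumerate(img_paths)
    # load_image is a stand-in for whatever loader you use; the broadcast
    # assignment copies into the pre-allocated slab without reallocating.
    @views imgs_arr[k, :, :, :] .= load_image(path)
end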
The approach of reshaping and concatenating is an extreme case of the second approach, where you move all of the data at once and then apply a vectorized median operation. Obviously that involves moving a lot of data around, which is pretty inefficient.
Due to memory locality, it may be more efficient to copy a single slice of the data into a temporary array and compute the median of that. That can be done pretty easily with an array comprehension:
julia> v_of_a = [rand(1080, 1920, 3) for _ = 1:3]
3-element Array{Array{Float64,3},1}:
[0.7206652600431633 0.7675119703509619 … 0.7117084561740263 0.8736518021960584; 0.8038479801395197 0.3159392943734012 … 0.976319025405266 0.3278606124069767; … ; 0.7424260315304789 0.4748658164109498 … 0.9942311708400311 0.37048961459068086; 0.7832577306186075 0.13184454935145773 … 0.5895094390350453 0.5470111170897787]
[0.26401298651503025 0.9113932653115289 … 0.5828647778524962 0.752444909740893; 0.5673144007678044 0.8154276504227804 … 0.2667436824684424 0.4895443896447764; … ; 0.2641913584303701 0.16639100493266934 … 0.1860616855126005 0.04922131616483538; 0.4968214514330498 0.994935452055218 … 0.28097239922248685 0.4980189891952156]
julia> [median(a[i,j,k] for a in v_of_a) for i=1:1080, j=1:1920, k=1:3]
1080×1920×3 Array{Float64,3}:
[:, :, 1] =
0.446895 0.643648 0.694714 … 0.221553 0.711708 0.225268
0.659251 0.457686 0.672072 0.731218 0.449915 0.129987
0.573196 0.328747 0.668702 0.355231 0.656686 0.303168
0.243656 0.702642 0.45708 0.23415 0.400252 0.482792
Try JuliennedArrays.jl
julia> a = zeros(3,1080,1920,3);
julia> using JuliennedArrays
julia> @time map(median, Slices(a,1));
0.822429 seconds (6.22 M allocations: 711.915 MiB, 20.15% gc time)
As Stefan commented below, the built-in median does the same thing, but much slower:
julia> @time median(a, dims=1);
7.450394 seconds (99.80 M allocations: 2.368 GiB, 4.47% gc time)
at least as of VERSION v"1.5.0-DEV.876".

Julia: three dimensional arrays (performance)

Going through Julia's performance tips, I haven't found any suggestions regarding how to speed up code with three-dimensional arrays.
From my understanding, a d-element Array{Array{Float64,2},1} would perform best when d (the third dimension) is small. However, I am not sure whether this is the case when d is large.
Is there any tutorial on this topic for Julia?
Example 1a (d=50)
x = [zeros(100, 10) for d=1:50];
@time for d=1:50
    x[d] = rand(100,10);
end
0.000100 seconds (50 allocations: 396.875 KB)
Example 1b (d=50)
y = zeros(100, 10, 50);
@time for d=1:50
    y[:,:,d] = rand(100,10);
end
0.000257 seconds (200 allocations: 400.781 KB)
Example 2a (d=50000)
x = [zeros(100, 10) for d=1:50000];
@time for d=1:50000
    x[d] = rand(100,10);
end
0.410813 seconds (99.49 k allocations: 388.328 MB, 81.88% gc time)
Example 2b (d=50000)
y = zeros(100, 10, 50000);
@time for d=1:50000
    y[:,:,d] = rand(100,10);
end
0.185929 seconds (298.98 k allocations: 392.898 MB, 6.83% gc time)
From my understanding d-element Array{Array{Float64,2},1} would perform best when d (the third dimension) is small. However, I am not sure whether this is the case when d is large.
No, it's more about how you use it. A = Array{Array{Float64,2},1} is an array of pointers to matrices. The value of an array is the pointer, or reference. Thus A[i] returns a reference, i.e. it's cheap. A2 = Array{Float64,3} is a contiguous array of floats. It's really just an indexing setup over a linear slab of memory (and has a linear index A2[i] which runs through the whole thing using that linear form).
The latter has some advantages because it is contiguous. There's no indirection, so looping over all of A2's values will be faster. A has to dereference two pointers to get a value, so a simple 3D loop will be slower if you don't know to dereference each internal matrix only once. Also, you can get views into the matrices via @view A2[:,:,1] etc., but you have to note that A2[:,:,1] by itself makes a copy of the matrix. A[1] is naturally a view because it returns the reference to the matrix, and if you want a copy you'd have to explicitly do copy(A[1]). Because A is just a linear array of pointers, push!ing a new matrix onto it is cheap, since it only grows a relatively small array (and push! is automatically amortized) to add a new pointer at the end (this is why things like DifferentialEquations.jl use arrays of arrays to build timeseries instead of the more traditional matrix).
So they are different tools with different advantages and disadvantages; a small sketch below contrasts them.
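A sketch (mine) of the copy/view behavior described above:
A  = [rand(100, 10) for _ in 1:50]   # vector of matrices: A[k] stores a reference
A2 = rand(100, 10, 50)               # one contiguous slab of floats
m = A[1]               # no copy: returns the reference to the first matrix
c = A2[:, :, 1]        # copies the 100×10 slice into a new matrix
v = @view A2[:, :, 1]  # no copy: a view into the contiguous slab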
As for your timings, you're doing two different things. x[d] = rand(100,10) creates a new matrix and stores its reference in x. y[:,:,d] = rand(100,10) creates a new matrix and then loops through the values of y, copying into them. You can see why that's slower. But what you're leaving out is the allocation-free case:
function f2()
    y = zeros(100, 10, 50)
    @time for i in eachindex(y)
        y[i] = rand()
    end
    y
end
In the small case this matches the array creation. You can't naively do this in case one, but as I said, if you dereference the pointer for each matrix only once, you do really well:
function f()
    x = [zeros(100, 10) for d=1:50]
    @time @inbounds for d=1:50
        xd = x[d]
        for i in eachindex(xd)
            xd[i] = rand()
        end
    end
    x
end
So arrays of arrays can be great data structures in the right cases. The library RecursiveArrayTools.jl was created to take better advantage of them. For example, A3 = VectorOfArray(A) gives A3 the same indexing structure as A2 by lazily translating A3[i,j,k] into A[k][i,j] (a short sketch follows below). It keeps the advantages of A, but will automatically make sure to broadcast in the correct way, like f above. Another tool like this is ArrayPartition, which allows heterogeneous typing in a broadcast-performant way.
So yeah, it's not always the right tool, but these heterogeneous and recursive arrays are great tools when used correctly.
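A minimal, hedged sketch of the VectorOfArray wrapper mentioned above:
using RecursiveArrayTools
A  = [rand(100, 10) for _ in 1:50]  # vector of matrices
A3 = VectorOfArray(A)               # lazy 3D-style indexing over the same data
A3[3, 2, 7] == A[7][3, 2]           # true: the last index selects the inner array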

Basic operations combining two SharedArrays

I've spent the last month or so learning Julia and I'm very impressed. In particular, I'm analysing a large amount of climate model output; I put all of it into SharedArrays and adjust and plot it in parallel. So far it's very quick and efficient, and I've built up quite a library of code. My current problem is in creating a function that can do basic operations on two SharedArrays. I've successfully written a function that takes two arrays and a specification of how to combine them. The code is based on the example in the parallel section of the Julia docs and uses the myrange function shown there:
function myrange(q::SharedArray)
    idx = indexpids(q)
    # @show idx
    if idx == 0
        # This worker is not assigned a piece
        print("NO WORKERS ASSIGNED")
        return 1:0
    end
    nchunks = length(procs(q))
    splits = [round(Int, s) for s in linspace(0, length(q), nchunks+1)]
    splits[idx]+1:splits[idx+1]
end
function combine_arrays_chunk!(array_1, array_2, output_array, func, length_range)
    # @show length_range
    for i in length_range
        output_array[i] = func(array_1[i], array_2[i])
        # hardwired example for func = +
        # output_array[i] = +(array_1[i], array_2[i])
    end
    output_array
end
combine_arrays_shared_chunk!(array_1, array_2, output_array, func) =
    combine_arrays_chunk!(array_1, array_2, output_array, func, myrange(array_1))
function combine_arrays_shared(array_1::SharedArray, array_2::SharedArray, func)
    if size(array_1) != size(array_2)
        return print("inputs not of the same size")
    end
    output_array = SharedArray(Float64, size(array_1))
    @sync begin
        for p in procs(array_1)
            @async remotecall_wait(p, combine_arrays_shared_chunk!, array_1, array_2, output_array, func)
        end
    end
    output_array
end
This works, so one can do
strain_div = combine_arrays_shared(eps_1, eps_2, +);
strain_tot = combine_arrays_shared(eps_1, eps_2, hypot);
with the correct results and the output as a SharedArray, as required. But... it's quite slow. It's actually quicker to convert the SharedArrays to normal arrays on one processor, calculate, and then convert back to a SharedArray (for my test cases anyway, with each array approx 200 MB; when I move up to GBs, I guess not). I can hardwire combine_arrays_shared to only do addition (or some other function), and then I get the speed increase, but with the function being passed into combine_arrays_shared the whole thing is slow (10 times slower than the hardwired addition).
I've looked at the FastAnonymous.jl package, but I can't see how it would work in this case. I tried, and failed. Any ideas?
I might just resort to writing a different combine_arrays_... function for each basic function I use, or having the func argument as an option and calling different functions from within combine_arrays_shared, but I want it to be more elegant! Also, this is a good way to learn more about Julia.
Harry
This question actually has nothing to do with SharedArrays; it's really "how do I pass functions as arguments and get good performance?"
The way FastAnonymous works (and the way closures will soon work in Julia) is to create a type with a call method. If you're having trouble with FastAnonymous for some reason, you can always do it manually:
julia> immutable Foo end
julia> Base.call(f::Foo, x, y) = x*y
call (generic function with 1036 methods)
julia> function applyf(f, X)
           s = zero(eltype(X))
           for x in X
               s += f(x, x)
           end
           s
       end
applyf (generic function with 1 method)
julia> X = rand(10^6);
julia> f = Foo()
Foo()
# Run the function once with each type of argument to JIT-compile
julia> applyf(f, X)
333375.63216645207
julia> applyf(*, X)
333375.63216645207
# Compile anything used by @time
julia> @time 1
0.000004 seconds (148 allocations: 10.151 KB)
1
# Now let's benchmark
julia> @time applyf(f, X)
0.002860 seconds (5 allocations: 176 bytes)
333433.439233112
julia> @time applyf(*, X)
0.142411 seconds (4.00 M allocations: 61.035 MB, 19.24% gc time)
333433.439233112
Note the big increase in speed and greatly-reduced memory consumption.
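A hedged aside for readers on Julia 1.0 and later: every function now has its own concrete type, so a plain higher-order call specializes automatically and the manual callable type above is no longer needed. A minimal sketch (applyf2 is my own name, to avoid clashing with the definition above):
applyf2(f, X) = sum(x -> f(x, x), X)  # compiles a method specialized on typeof(f)
X = rand(10^6)
applyf2(*, X)  # specialized on typeof(*); no boxed-Function overhead per call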

2D array iteration speed in swift (Beta 4)

I've been battling with speed issues in Swift, mainly with arrays, currently running the latest 'beta 4' release. I've broken the code out into a playground to try to show the issues.
I set up a 2D array, then iterate over it, setting each element.
import UIKit
func getCurrentMillitime() -> NSTimeInterval {
    let date: NSDate = NSDate()
    return date.timeIntervalSince1970 * 1000;
}
let startTime = getCurrentMillitime()
let X = 40
let Y = 50
var distanceGrid = [[CGFloat]](count: X, repeatedValue: [CGFloat](count: Y, repeatedValue: CGFloat(0.0)))
for xi in 0..<X {
    for yi in 0..<Y {
        distanceGrid[xi][yi] = 1.1
        //println("x:\(xi) y:\(yi) d:\(distanceGrid[xi][yi])")
    }
}
let endTime = getCurrentMillitime()
let computationTime = endTime - startTime
println("Time to compute \(computationTime) ms")
Run the above code and you'll get:
Time to compute 2370.203125 ms
which surely can't be right!.. Am I being a numpty?
Two things to consider about Swift performance:
It's very much up in the air during the beta.
Many of Swift's performance tricks depend on the optimizer. Especially when generics are involved (every array is a generic Array<T>), Swift uses a more expressive, debugger-friendly implementation at -O0, but optimizes it away to a higher-performance implementation at -O or -Ofast. (Note that -Ofast also takes away bounds checks and other safety features, so it's not a great idea for production builds.)
Also, note that your current example measures both the time to create a 2D array with init(count:repeatedValue:) and the time to iterate over it. If you're out to measure only the latter, you should set your startTime after creating the array.
It's obvious that the Swift beta is struggling with arrays.
Even with a one-dimensional array, the difference compared to Objective-C is huge.
I've mixed an Objective-C class into a Swift program and had both languages create and fill an array of 1,000,000 elements. This is what I got on some MacBook:
Elapsed time by Swift method: 2.7078 sec
Elapsed time by objective-c method: 0.033815 seconds
Code: (var nrOfElements = 1000000)
// Swift
let startTime = NSDate();
var stringList = Array<String>(count: nrOfElements, repeatedValue: String())
for i in 0..<nrOfElements {
    stringList[i] = "Some string";
}
let endTime = NSDate();
println("Elapsed time by Swift method: " +
    NSString(format: "%.4f", endTime.timeIntervalSinceDate(startTime)) + " sec");
// Objective-C
NSDate *startTime = [NSDate date];
NSMutableArray *stringList = [NSMutableArray arrayWithCapacity:10];
for (int i = 0; i < nrOfElements; i++) {
    [stringList addObject:@"Some string"];
}
NSDate *endTime = [NSDate date];
printf("%s\n", [[NSString stringWithFormat:@"Elapsed time by objective-c method: %f seconds", [endTime timeIntervalSinceDate:startTime]] UTF8String]);
I found no difference between Beta 3 and Beta 4, so improving array handling apparently isn't high on the priority list.
Processing increasingly larger arrays gives proportionally higher process times.
Handling multi-dimensional arrays is even more costly when increasing the number of higher-dimension elements.
Pre-creating the array in Swift is indeed faster than append.
Let's hope that things will be adequately repaired in the final version.
In the language guide under "Subscripts", you'll find a 2D (struct) implementation of a 2D array. But it is rather slow at assigning values once you go above a thousand elements.
Creating a local 2D array and setting it into the struct for easy access is much faster.
It's also faster to create an array with repeated values and overwrite them than to append values to an array.
For about 100k values it takes ~9 seconds with the struct, 1.5 seconds with append, and 0.6 seconds with overwriting repeated values.
I kinda like the struct idea, but it is so slow.
I sure hope it's a beta issue.
I agree with you that even a beta version cannot behave like software of the eighties on hardware of the seventies, so I did some more digging into Swift's array-handling capabilities and stumbled upon astonishing results. We learned already that Swift's array performance is poor compared to Objective-C and other languages such as C++, C#, and Java.
In my previous tests I measured the time to create and fill a local-scope array of one million elements. As we saw, Objective-C did this about 80 times faster. It gets worse when we compare arrays declared at global class scope: there, Objective-C appears to be about 500 times faster!
But hey, when we have finally filled up this globally declared array with useful data, we can work with it smoothly, right? Wrong!
I printed 10 elements of the large array and the nightmare deepened. Printing 10 elements of a local-scope array took 0.0004 seconds, as one might expect. But printing the same elements of our globally declared array took... 1 minute and 11 seconds. This seems too bad to be true, and I'm sure the Swift developers are on it as we speak.

Non-monolithic arrays in Haskell

I have accepted an answer to the question below, but it seems I misunderstood how arrays in Haskell work. I thought they were just beefed-up lists. Keep that in mind when reading the question below.
I've found that monolithic arrays in Haskell are quite inefficient when used for larger arrays.
I haven't been able to find a non-monolithic implementation of arrays in Haskell. What I need is O(1)-time lookup on a multidimensional array.
Is there an implementation of arrays that supports this?
EDIT: I seem to have misunderstood the term monolithic. The problem is that it seems like arrays in Haskell are treated like lists. I might be wrong though.
EDIT2: Short example of inefficient code:
fibArray n = a where
    bnds = (0, n)
    a = array bnds [ (i, f i) | i <- range bnds ]
    f 0 = 0
    f 1 = 1
    f i = a!(i-1) + a!(i-2)
This is an array of length n+1 where the i-th field holds the i-th Fibonacci number. But since arrays in Haskell have O(n) time lookup, computing it takes O(n²) time.
You're confusing linked lists in Haskell with arrays.
Linked lists are the data types that use the following syntax:
[1,2,3,5]
defined as:
data [a] = [] | a : [a]
These are classical recursive data types, supporting O(n) indexing and O(1) prepend.
If you're looking for multidimensional data with O(1) lookup, instead you should use a true array or matrix data structure. Good candidates are:
Repa - fast, parallel, multidimensional arrays (Tutorial)
Vector - an efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimisation framework (Tutorial)
HMatrix - a purely functional interface to basic linear algebra and other numerical computations, internally implemented using GSL, BLAS and LAPACK
Arrays have O(1) indexing. The problem is that each element is calculated lazily. So this is what happens when you run this in ghci:
*Main> :set +s
*Main> let t = 100000
(0.00 secs, 556576 bytes)
*Main> let a = fibArray t
Loading package array-0.4.0.0 ... linking ... done.
(0.01 secs, 1033640 bytes)
*Main> a!t -- result omitted
(1.51 secs, 570473504 bytes)
*Main> a!t -- result omitted
(0.17 secs, 17954296 bytes)
*Main>
Note that lookup is very fast once a value has already been computed. The array function creates an array of pointers to thunks that will eventually be evaluated to produce values. The first time you evaluate an element, you pay this cost. Here are the first few expansions of the thunk for evaluating a!t:
a!t -> a!(t-1) + a!(t-2) -> a!(t-2) + a!(t-3) + a!(t-2) -> a!(t-3) + a!(t-4) + a!(t-3) + a!(t-2)
It's not the cost of the calculations per se that's expensive; rather, it's the need to create and traverse this very large thunk.
I tried strictifying the values in the list passed to array, but that seemed to result in an endless loop.
One common way around this is to use a mutable array, such as an STArray. The elements can be updated as they're available during the array creation, and the end result is frozen and returned. In the vector package, the create and constructN functions provide easy ways to do this.
-- constructN :: Unbox a => Int -> (Vector a -> a) -> Vector a
import qualified Data.Vector.Unboxed as V
import Data.Int

fibVec :: Int -> V.Vector Int64
fibVec n = V.constructN (n+1) c
  where
    c v | V.length v == 0 = 0
    c v | V.length v == 1 = 1
    c v | V.length v == 2 = 1
    c v = let len = V.length v
          in v V.! (len-1) + v V.! (len-2)
BUT, the fibVec function only works with unboxed vectors. Regular vectors (and arrays) aren't strict enough, leading back to the same problem you've already found. And unfortunately there isn't an Unboxed instance for Integer, so if you need unbounded integer types (this fibVec has already overflowed in this test) you're stuck with creating a mutable array in IO or ST to enable the necessary strictness.
Referring specifically to your fibArray example, try this and see if it speeds things up a bit:
-- gradually calculate the m-th item in steps of k
-- to prevent a STACK OVERFLOW, etc.
gradualth m k arr
  | m <= v = pre `seq` arr!m
  where
    pre = foldl1 (\a b -> a `seq` arr!b) [u, u+k .. m]
    (u, v) = bounds arr
For me, with let a = fibArray 50000, gradualth 50000 10 a ran at 0.65 of the run time of just calling a!50000 right away.
