Basic operations combining two SharedArrays

I've spent the last month or so learning Julia and I'm very impressed. In particular, I'm analysing a large amount of climate model output; I put it all into SharedArrays and adjust and plot it in parallel. So far it's been very quick and efficient, and I've built up quite a library of code. My current problem is creating a function that can do basic operations on two shared arrays. I've successfully written a function that takes two arrays and a function specifying how to combine them. The code is based on the example in the parallel section of the Julia documentation and uses the myrange function shown there:
function myrange(q::SharedArray)
    idx = indexpids(q)
    #@show idx
    if idx == 0
        # This worker is not assigned a piece
        return 1:0, 1:0
        print("NO WORKERS ASSIGNED")
    end
    nchunks = length(procs(q))
    splits = [round(Int, s) for s in linspace(0, length(q), nchunks+1)]
    splits[idx]+1:splits[idx+1]
end
function combine_arrays_chunk!(array_1, array_2, output_array, func, length_range)
    #@show length_range
    for i in length_range
        output_array[i] = func(array_1[i], array_2[i])
        # hardwired example for func = +
        # output_array[i] = +(array_1[i], array_2[i])
    end
    output_array
end
combine_arrays_shared_chunk!(array_1, array_2, output_array, func) = combine_arrays_chunk!(array_1, array_2, output_array, func, myrange(array_1))

function combine_arrays_shared(array_1::SharedArray, array_2::SharedArray, func)
    if size(array_1) != size(array_2)
        return print("inputs not of the same size")
    end
    output_array = SharedArray(Float64, size(array_1))
    @sync begin
        for p in procs(array_1)
            @async remotecall_wait(p, combine_arrays_shared_chunk!, array_1, array_2, output_array, func)
        end
    end
    output_array
end
This works, so one can do
strain_div = combine_arrays_shared(eps_1,eps_2,+);
strain_tot = combine_arrays_shared(eps_1,eps_2,hypot);
with the correct results and the output as a shared array, as required. But ... it's quite slow. It's actually quicker to convert the SharedArrays to normal arrays, do the calculation on one processor, and then convert back to a SharedArray (for my test cases anyway, with each array approximately 200 MB; when I move up to GBs I guess that will change). I can hardwire combine_arrays_shared to only do addition (or some other function), and then I get the speed increase, but with the function being passed as an argument to combine_arrays_shared the whole thing is slow (10 times slower than the hardwired addition).
I've looked at the FastAnonymous.jl package but I can't see how it would work in this case. I tried, and failed. Any ideas?
I might just resort to writing a different combine_arrays_... function for each basic function I use, or having the func argument as an option and calling different functions from within combine_arrays_shared, but I want it to be more elegant! Also, this is a good way to learn more about Julia.
Harry

This question actually has nothing to do with SharedArrays, and is just "how do I pass functions-as-arguments and get better performance?"
The way FastAnonymous works---similar to the way closures will soon work in Julia---is to create a type with a call method. If you're having trouble with FastAnonymous for some reason, you can always do it manually:
julia> immutable Foo end
julia> Base.call(f::Foo, x, y) = x*y
call (generic function with 1036 methods)
julia> function applyf(f, X)
           s = zero(eltype(X))
           for x in X
               s += f(x, x)
           end
           s
       end
applyf (generic function with 1 method)
julia> X = rand(10^6);
julia> f = Foo()
Foo()
# Run the function once with each type of argument to JIT-compile
julia> applyf(f, X)
333375.63216645207
julia> applyf(*, X)
333375.63216645207
# Compile anything used by @time
julia> @time 1
  0.000004 seconds (148 allocations: 10.151 KB)
1
# Now let's benchmark
julia> @time applyf(f, X)
  0.002860 seconds (5 allocations: 176 bytes)
333433.439233112
julia> @time applyf(*, X)
  0.142411 seconds (4.00 M allocations: 61.035 MB, 19.24% gc time)
333433.439233112
Note the big increase in speed and greatly-reduced memory consumption.
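To tie this back to the SharedArray code in the question: you could wrap each operation in a small callable type and pass an instance of it through combine_arrays_shared. A rough sketch in the same Julia 0.4-era style (Adder is a made-up name, not from the question, and the last line is only indicative):

# Hypothetical callable type standing in for the generic function `+`
immutable Adder end
Base.call(::Adder, x, y) = x + y

# Passed in place of `+`, so the chunk loop specializes on the concrete type Adder:
# strain_div = combine_arrays_shared(eps_1, eps_2, Adder())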


Array assembly and StaticArrays under Julia: Why is my performance so bad?

I need to prepare "flattened" versions of 2D FFT frequencies with shape Nx^2 × 2. These are basically constructed like ravel(meshgrid(fftfreqs1d, fftfreqs1d)) in MATLAB or Python.
This appears to be no big deal in Python, but it can hang for reasonable array sizes in Julia, especially when I want to build a StaticArray out of the intermediate results. To make it more confusing, @btime claims that my arrays are created in no time, while they clearly are not.
My question is why this happens and how it is done right.
I am aware that in Julia it might be a waste to keep the full 2D fftfreqs in memory instead of using the 1D versions and a loop, but let us assume for a moment that I need it this way.
Julia:
function my_freqs1(Nnu::Int, T)
    dx = 2. / Nnu
    freq1d = fftfreq(Nnu) .* dx
    nu = hcat( vec([ i for i in freq1d, j in freq1d ]),
               vec([ j for i in freq1d, j in freq1d ]) )
    return nu
end;
@btime my_freqs1(100, Float64)
28.528 μs (10 allocations: 312.80 KiB)
Julia, converting to a static array (in the hope of better performance in other code later on):
function my_freqs2(Nnu::Int, T)
    ### the same as above ###
    return SMatrix{Nnu^2,2,T}(nu)
end;
@btime my_freqs2(100, Float64)
94.540 μs (36 allocations: 470.38 KiB)
Python:
def my_fftfreqs(xy):
    freqs = np.fft.fftfreq(np.shape(xy)[0], d=xy[1]-xy[0])
    fx, fy = np.meshgrid(freqs, freqs, indexing="ij")
    freq_list = np.transpose(np.asarray([np.ravel(fx), np.ravel(fy)]))
    return freq_list
%time f = my_fftfreqs(np.linspace(0, 1, 100));
CPU times: user 1.08 ms, sys: 0 ns, total: 1.08 ms
Wall time: 600 µs
My observation is that while Python's %time reports a much longer time, it actually runs in a very reasonable time, whereas the Julia version has a noticeable delay, and the version with the static array hangs for a long time and crashes completely for larger sizes.
Please help me understand how I would do this correctly in Julia and why creating a static array seems to be such a bad idea here.
Rather than making an SMatrix{Nnu^2,2}, I think you probably want to make a Vector{SVector{2}}. The former requires recompiling for each new value of Nnu, which is fairly inefficient.
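As a rough sketch of that suggestion (my code, not from the question; it assumes fftfreq from FFTW and keeps the same ordering as my_freqs1, and my_freqs_sv is a made-up name):

using FFTW, StaticArrays

function my_freqs_sv(Nnu::Int, ::Type{T}) where T
    w = fftfreq(Nnu) .* (2 / Nnu)
    # a Vector of 2-element SVectors; the element type no longer depends on Nnu
    return vec([SVector{2,T}(i, j) for i in w, j in w])
end

# usage: my_freqs_sv(100, Float64)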
You may also consider:
using FFTW
my_freqs3(ν) = fftfreq(ν)*2/ν |>
    (w -> [repeat(w, inner=length(w)) repeat(w, outer=length(w))])
# or
my_freqs3alt(ν) = ( w = fftfreq(ν)*2/ν ;
    [repeat(w, inner=length(w)) repeat(w, outer=length(w))] )
which is more Julian and, if I understand correctly, equivalent.
Usually shorter/simpler functions are also more efficient.
Julia features used:
Unicode nu variable.
Piping |> operator.
Definition with no function keyword.
repeat standard library vector filling function.
Matlab-like hcat [v1 v2] notation.
Multi-statement block enclosed in ( ) separated by ;.

Updating a static array in a nested function without making a temporary array (Julia)

I have been banging my head against a wall trying to use static arrays in julia.
https://github.com/JuliaArrays/StaticArrays.jl
They are fast but updating them is a pain. This is no surprise, they are meant to be immutable!
But it is continuously recommended to me that I use static arrays even though I have to update them. In my case the static arrays are small, just length 3, and I have a vector of them, but I only update one length-3 SVector at a time.
Option 1
There is a really neat package called Setfield that allows you to do inplace updates of SVectors in Julia.
https://github.com/jw3126/Setfield.jl
The catch... it updates the local copy. So if you are in a nested function, it updates the local copy. That comes with some bookkeeping, since you have to update the local copy in place and then return that copy and update the actual array of interest. You can't pass in your desired array and update it in place, at least not that I can figure out! Now, I do not mind bookkeeping, but I feel like updating a local copy, returning the value, updating another local copy, then returning the values and finally updating the actual array must come with a speed penalty. I could be wrong.
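To make the bookkeeping pattern concrete, here is a tiny made-up illustration (bump_first is not my real code; the real MCVE is further down):

using Setfield, StaticArrays

function bump_first(v::SVector{3,Float64})
    v = @set v[1] = v[1] + 1.0   # @set returns a new SVector; only the local `v` is rebound
    return v
end

vs = [zero(SVector{3,Float64}) for _ in 1:3]
vs[2] = bump_first(vs[2])        # the caller has to store the returned copy back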
Option 2
It bugs me that in order to update a static array I must do
exampleSVector::SVector{3,Float64} <-- just to make clear its type and size
exampleSVector = [value1, value2, value3]
This will update the desired array even if it is inside a function, which is nice and the goal, but if you do this inside a function it creates a temporary array. And this kills me because my function is in a loop that gets called 4+ million times, so this creates a ton of allocations and slows things down.
How do I update an SVector for the Option 2 scenario without creating a temporary array?
For the Option 1 scenario, can I update the actual array of interest rather than the local copy?
If this requires a simple example code, please say so in the comments, and I will make one. My thinking is that it is answerable without one, but I will make one if it is needed.
EDIT:
MCVE code - Option 1 works, option 2 does not.
using Setfield
using StaticArrays

struct Keep
    dreaming::Vector{SVector{3,Float64}}
end

function INNER!(vec::SVector{3,Float64}, pre::SVector{3,Float64})
    # pretend series of calculations
    for i = 1:3  # illustrate use of Setfield (used in real code for this)
        pre = @set pre[i] = rand() * i * 1000
    end
    # more pretend calculations
    x = 25.0  # assume more calculations equals x

    ################## OPTION 1 ########################
    vec = @set vec = x * [ pre[1], pre[2], pre[3] ]   # UNCOMMENT FOR OPTION 1
    return vec                                        # UNCOMMENT FOR OPTION 1

    ################## OPTION 2 ########################
    # vec = x * [ pre[1], pre[2], pre[3] ]            # UNCOMMENT FOR OPTION 2
    # nothing                                         # UNCOMMENT FOR OPTION 2
end

function OUTER!(always::Keep)
    preAllocate = SVector{3}(0.0, 0.0, 0.0)
    for i = 1:length(always.dreaming)
        always.dreaming[i] = INNER!(always.dreaming[i], preAllocate)  # UNCOMMENT FOR OPTION 1
        # INNER!(always.dreaming[i], preAllocate)                     # UNCOMMENT FOR OPTION 2
    end
end

code = Keep([zero(SVector{3}) for i = 1:5])
OUTER!(code)
println(code.dreaming)
I hope that I have understood your question correctly. It's a bit hard with an MWE like this, which does a lot of things that are mostly redundant and a bit confusing.
There seem to be two alternative interpretations here: Either you really need to update ('mutate') an SVector, but your MWE fails to demonstrate why. Or, you have convinced yourself that you need to mutate, but you actually don't.
I have decided to focus on alternative 2: You don't really need to 'mutate'. Rewriting your code from that point of view simplifies it greatly.
I couldn't find any reason for you to mutate any static vectors here, so I just removed that. The behaviour of the INNER! function with the inputs was very confusing. You provide two inputs but don't use either of them, so I removed those inputs.
function inner()
    pre = @SVector [rand() * 1000i for i in 1:3]
    x = 25
    return pre .* x
end

function outer!(always::Keep)
    always.dreaming .= inner.()  # notice the dot in inner.()
end

code = Keep([zero(SVector{3}) for i in 1:5])
outer!(code)
display(code.dreaming)
This runs fast and with zero allocations. In general with StaticArrays, don't try to mutate things, just create new instances.
Even though it's not clear from your MWE, there may be some legitimate reason why you may want to 'mutate' an SVector. In that case you can use the setindex method of StaticArrays; you don't need Setfield.jl:
julia> v = rand(SVector{3})
3-element SArray{Tuple{3},Float64,1,3}:
0.4730258499237898
0.23658547518737905
0.9140206579322541
julia> v = setindex(v, -3.1, 2)
3-element SArray{Tuple{3},Float64,1,3}:
0.4730258499237898
-3.1
0.9140206579322541
To clarify: setindex (without a !) does not mutate its input, but creates a new instance with one index value changed.
If you really do need to 'mutate', perhaps you can make a new MWE that shows this. I would recommend that you try to simplify it a bit, because it is quite confusing now. For example, the inclusion of the type Keep seems entirely unnecessary and distracting. Just make a Vector of SVectors and show what you want to do with that.
Edit: Here's an attempt based on the comments below. As far as I understand it now, the question is about modifying a vector of SVectors. You cannot really mutate the SVectors, but you can replace them using a convenient syntax, setindex, where you can keep some of the elements and change some of the others:
oldvec = [zero(SVector{3}) for _ in 1:5]
replacevec = [rand(SVector{3}) for _ in 1:5]
Now we replace the second element of each element of oldvec with the corresponding one in replacevec. First a one-liner:
oldvec .= setindex.(oldvec, getindex.(replacevec, 2), 2)
Then an even faster one with a loop:
for i in eachindex(oldvec, replacevec)
    @inbounds oldvec[i] = setindex(oldvec[i], replacevec[i][2], 2)
end
There are two types of static arrays: mutable ones (type names starting with M) and immutable ones (starting with S). Just use the mutable ones! Have a look at the example below:
julia> mut = MVector{3,Int64}(1:3);
julia> mut[1]=55
55
julia> mut
3-element MArray{Tuple{3},Int64,1,3}:
55
2
3
julia> immut = SVector{3,Int64}(1:3);
julia> immut[1]=55
ERROR: setindex!(::SArray{Tuple{3},Int64,1,3}, value, ::Int) is not defined.
Let us see a simple benchmark (ordinary array vs. mutable static vs. immutable static):
using BenchmarkTools
julia> ord = [1,2,3];
julia> @btime $ord.*$ord
39.680 ns (1 allocation: 112 bytes)
3-element Array{Int64,1}:
1
4
9
julia> @btime $mut.*$mut
8.533 ns (1 allocation: 32 bytes)
3-element MArray{Tuple{3},Int64,1,3}:
3025
4
9
julia> @btime $immut.*$immut
2.133 ns (0 allocations: 0 bytes)
3-element SArray{Tuple{3},Int64,1,3}:
1
4
9
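Applied to the scenario in the question (a vector of length-3 static vectors, updating one element at a time), a minimal sketch could look like this; it is my illustration, and touch! is a made-up name:

using StaticArrays

dreaming = [zero(MVector{3,Float64}) for _ in 1:5]

function touch!(v::MVector{3,Float64})
    for i in 1:3
        v[i] = 1.5 * i     # MVector is mutable, so this writes in place
    end
    return nothing
end

touch!(dreaming[2])        # dreaming[2] is updated with no temporary array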

Unexpected memory allocation when using array views (julia)

I'm trying to search for a desired pattern (the template, variable temp) in the array x. The length of the template is 9.
I'm doing something like:
function check_alloc{T <: ZeroOne}(x :: AbstractArray{T}, temp :: AbstractArray{T})
    s = 0
    for i in 1 : 1000
        myView = view(x, i : i + 9)
        if myView == temp
            s += 1
        end
    end
    return s
end
and I obtain unexpected memory allocations (46 KB) in this short loop. Why does this happen, and how can I prevent the memory allocations and performance degradation?
The reason you're getting allocations is because view(A, i:i+9) creates a small object called a SubArray. This is just a "wrapper" that essentially stores a reference to A and the indices you passed in (i:i+9). Because the wrapper is small (~40 bytes for a one-dimensional object), there are two reasonable choices for storing it: on the stack or on the heap. "Allocations" refer only to heap memory, so if Julia can store the wrapper on the stack it would report no allocations (and would also be faster).
Unfortunately, some SubArray objects currently (as of late 2017) have to be stored on the heap. The reason is because Julia is a garbage-collected language, which means that if A is a heap-allocated object that is no longer in use, then A might be freed from memory. The key point is this: currently, references to A from other variables are counted only if those variables are stored on the heap. Consequently, if all SubArrays were stored on the stack, you would have a problem for code like this:
function create()
    A = rand(1000)
    getfirst(view(A, 1:10))
end
function getfirst(v)
    gc()  # this triggers garbage collection
    first(v)
end
Because create doesn't use A again after that call to getfirst, it's not "protecting" A. The risk is that the gc call could end up freeing the memory associated with A (and thus breaking any usage of entries in v itself, since v relies on A), unless having v protects A from being garbage-collected. But currently, stack-allocated variables can't protect heap-allocated memory: the garbage collector only scans variables that are on the heap.
You can watch this in action with your original function, modified to be slightly less restrictive by getting rid of the (irrelevant, for these purposes) T<:ZeroOne and allowing any T.
function check_alloc(x::AbstractArray{T}, temp::AbstractArray{T}) where T
    s = 0
    for i in 1 : 1000
        myView = view(x, i : i + 9)
        if myView == temp
            s += 1
        end
    end
    return s
end

a = collect(1:1010);   # this uses heap-allocated memory
b = collect(1:10);
@time check_alloc(a, b);   # ignore the first due to JIT-compilation
@time check_alloc(a, b)

a = 1:1010   # this doesn't require heap-allocated memory
@time check_alloc(a, b);   # ignore due to JIT-compilation
@time check_alloc(a, b)
From the first one (with a = collect(1:1010)), you get
julia> @time check_alloc(a, b)
0.000022 seconds (1.00 k allocations: 47.031 KiB)
(notice this is ~47 bytes per iteration, consistent with the size of the SubArray wrapper) but from the second (with a = 1:1010) you get
julia> @time check_alloc(a, b)
0.000020 seconds (4 allocations: 160 bytes)
There's an "obvious" fix to this problem: change the garbage collector so that stack-allocated variables can protect heap-allocated memory. That will happen some day, but it's an extremely complex operation to support properly. So for now, the rule is that any object that contains a reference to heap-allocated memory must be stored on the heap.
There's one final subtlety: Julia's compiler is quite smart, and in some cases elides the creation of the SubArray wrapper (basically, it rewrites your code in a way that uses the parent array object and the indices separately so that it never needs the wrapper itself). For that to work, Julia has to be able to inline any function calls into the function that created the view. Unfortunately, here == is slightly too big for the compiler to be willing to inline it. If you manually write out the operations that will be performed, then the compiler will elide the view and you'll also avoid allocations.
This at least works for arbitrarily sized temp and x, but still has ~KB allocations.
function check_alloc{T}(x :: AbstractArray{T}, temp :: AbstractArray{T})
    s = 0
    pl = length(temp)
    for i in 1:length(x)-pl+1
        @views if x[i:i+pl-1] == temp
            s += 1
        end
    end
    return s
end
EDIT: As suggested by @Sairus in the comments, one can do something in the spirit of this:
function check_alloc2{T}(x :: AbstractArray{T}, temp :: AbstractArray{T})
    s = 0
    pl = length(temp)
    plr = 1:pl
    for i in 1:length(x)-pl+1
        same = true
        for k in plr
            @inbounds if x[i+k-1] != temp[k]
                same = false
                break
            end
        end
        if same
            s += 1
        end
    end
    return s
end
This has no allocations:
julia> using BenchmarkTools
julia> a = collect(1:1000);
julia> b = collect(5:12);
julia> @btime check_alloc2($a,$b);
1.195 μs (0 allocations: 0 bytes)
As of Julia 1.7.0 (maybe even earlier), the first code from @carstenbauer with a view no longer allocates (on the heap):
function check_alloc(x :: AbstractArray{T}, temp :: AbstractArray{T}) where T
    s = 0
    pl = length(temp)
    for i in 1:length(x)-pl+1
        @views if x[i:i+pl-1] == temp
            s += 1
        end
    end
    return s
end

using BenchmarkTools
a = collect(1:1000);
b = collect(5:12);
@btime check_alloc($a, $b);
# returns
#   8.495 μs (0 allocations: 0 bytes)

Julia - get real part of complex array

I need to split the variable z::Array{Complex128,1} into two arrays for the real and imaginary parts. One way to do this is to make new ::Array{Float64,1} variables and fill them element by element:
for i = 1:size(z)[1]
    ri[i] = z[i].re
    ii[i] = z[i].im
end
Is there a way to do this that doesn't involve copying data, like somehow manipulating strides and offsets of z?
In the common case where copying is not an issue, just do real.(z) and imag.(z). I include this to help future readers who have a similar issue, but who might not care about copying.
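For example (my snippet, just to spell out the copying variant):

z  = [1.0 + 2.0im, 3.0 - 4.0im]
ri = real.(z)   # Float64 copies of the real parts
ii = imag.(z)   # Float64 copies of the imaginary parts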
As you suggest, you can manipulate strides of z to avoid copying data. Simply
zfl = reinterpret(Float64, z)
zre = @view zfl[1:2:end-1]
zim = @view zfl[2:2:end]
Combined, we observe that there is no data copying (the allocations are due to the heap-allocated array views, and are minimal).
julia> z = Vector{ComplexF64}(100000);
julia> function reimvec(z)
           zfl = reinterpret(Float64, z)
           zre = @view zfl[1:2:end-1]
           zim = @view zfl[2:2:end]
           zre, zim
       end
reimvec (generic function with 1 method)
julia> @time reimvec(z);
  0.000005 seconds (9 allocations: 400 bytes)
As we can see, behind the scenes, such an array is strided:
julia> strides(reimvec(z)[1])
(2,)
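As a quick sanity check (my addition, assuming Julia 1.0-era reinterpret semantics), the views really do alias z, so writing through them mutates the original complex vector:

z = zeros(ComplexF64, 4)
zre, zim = reimvec(z)
zre[1] = 1.5
zim[1] = -2.0
z[1]    # gives 1.5 - 2.0im -- no data was copied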

Shared array usage in Julia

I need to parallelise a certain task over a number of workers.
To that purpose I need all workers to have access to a matrix that stores the data.
I thought that the data matrix could be implemented as a Shared Array in order to minimise data movement.
In order to get me started with Shared Arrays, I am trying the following very simple example, which gives me what I think is unexpected behaviour:
julia -p 2

# the data matrix
D = SharedArray(Float64, 2, 3)
# initialise the data matrix with dummy values
for ii = 1:length(D)
    D[ii] = rand()
end
# Define some kind of dummy computation involving the shared array
f = x -> x + sum(D)
# call function on worker
@time fetch(@spawnat 2 f(1.0))
The last command gives me the following error:
ERROR: On worker 2:
UndefVarError: D not defined
in anonymous at none:1
in anonymous at multi.jl:1358
in anonymous at multi.jl:904
in run_work_thunk at multi.jl:645
in run_work_thunk at multi.jl:654
in anonymous at task.jl:58
in remotecall_fetch at multi.jl:731
in call_on_owner at multi.jl:777
in fetch at multi.jl:795
I thought that the Shared Array D should be visible to all workers?
I am clearly missing something basic. Thanks in advance.
Although the underlying data is shared with all workers, the declaration of D is not. You will still need to pass in the reference to D, so something like
f = (x, SA) -> x + sum(SA)
@time fetch(@spawnat 2 f(1.0, D))
should work. You can change D on the main process and see that it is in fact using the same data:
julia> # call function on worker
       @time fetch(@spawnat 2 f(1.0, D))
  0.325254 seconds (225.62 k allocations: 9.701 MB, 5.88% gc time)
4.405613684678047
julia> D[1] += 1
1.2005544517241717
julia> # call function on worker
       @time fetch(@spawnat 2 f(1.0, D))
  0.004548 seconds (637 allocations: 45.490 KB)
5.405613684678047
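Equivalently (my variant, not strictly necessary), you can define a named function on all workers with @everywhere and still pass the SharedArray explicitly:

@everywhere g(x, SA) = x + sum(SA)   # g is a made-up name for illustration
@time fetch(@spawnat 2 g(1.0, D))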
This works, without declaring D, through a closure within a function.
function dothis()
    D = SharedArray{Float64}(2, 3)
    # initialise the data matrix with dummy values
    for ii = 1:length(D)
        D[ii] = ii   # not rand() anymore
    end
    # Define some kind of dummy computation involving the shared array
    f = x -> x + sum(D)
    # call function on worker
    @time fetch(@spawnat 2 f(1.0))
end

julia> dothis()
  1.507047 seconds (206.04 k allocations: 11.071 MiB, 0.72% gc time)
22.0
julia> dothis()
  0.012596 seconds (363 allocations: 19.527 KiB)
22.0
So although this answers the OP's question, and the SharedArray is visible to all workers -- is this legitimate?
