Build Dictionary of Arrays Efficiently in julia - arrays

I want to save the (x,y) coordinates in a grid network that are visited by different individuals. Let say I have 1000 individuals and the network size is x = 1:100 and y=1:100. I am using Dict() and here is a sample code about what I want to do:
individuals = 1:1000
x = 1:100
y = 1:100
function Visited_nodes()
nodes_of_inds =Dict{Int64, Array{Tuple{Int64, Int64}}}()
for ind in individuals
dum_array = Array{Tuple{Int64, Int64}}(0)
for i in x
for j in y
if rand()<0.2 # some conditions
push!(dum_array, (i,j))
end
end
end
nodes_of_inds[ind]=unique(dum_array)
end
return nodes_of_inds
end
#time nodes_of_inds = Visited_nodes()
# result: 1.742297 seconds (12.31 M allocations: 607.035 MB, 6.72% gc time)
But this is not efficient. I appreciate any advice how to make it more efficient.

Please see the performance tips. Very first piece of advice there: avoid global variables. individuals, x, and y are all non-constant global variables. Make them arguments to your function instead. That change alone speeds up your function by an order of magnitude.
By construction, you're not going to have any duplicate tuples in your dum_array, so you don't need to call unique. That shaves off another factor of two.
Finally, Array{T} isn't a concrete type. Julia's arrays also encode the dimensionality as a type parameter, which must be included for the dictionary of arrays to be efficient. Use Array{T, 1} or Vector{T} instead. This isn't a major consideration within the time of this function, though.
The major thing that's left is just the O(length(individuals)*length(x)*length(y)) computational complexity. Doing anything ten million times will add up quickly, no matter how efficient it is.

#Matt B., thanks for your response. About the global variables, I tried a simplified version of my code and it did not help the performance.
Let say I read my input data from a couple of csv files and I have three functions with different arguments:
function Read_input_data()
# read input data
individuals = readcsv("file1")
x = readcsv("file2")
y = readcsv("file3")
A = readcsv("file4")
B = readcsv("file5") # and a few other files
# call different functions
result_1 = Function1(individuals , x, y)
result_2 = Function2(result_1 ,y, A, B)
result_3 = Function3(result_2 , individuals, A, B)
return result_1, result_2, result_3
end
result_1, result_2, result_3 = Read_input_data()
I do not know why the performance is not better compared to when I define everything global! I appreciate any if you can comment about this!

Related

How do I split results into separate variables in matlab?

I'm pretty new to matlab, so I'm guessing there is some shortcut way to do this but I cant seem to find it
results = eqs\soltns;
A = results(1);
B = results(2);
C = results(3);
D = results(4);
E = results(5);
F = results(6);
soltns is a 6x1 vector and eqs is a 6x6 matrix, and I want the results of the operation in their own separate variables. It didn't let me save it like
[A, B, C, D, E, F] = eqs\soltns;
Which I feel like would make sense, but it doesn't work.
Up to now, I have never come across a MATLAB function doing this directly (but maybe I'm missing something?). So, my solution would be to write a function distribute on my own.
E.g. as follows:
result = [ 1 2 3 4 5 6 ];
[A,B,C,D,E,F] = distribute( result );
function varargout = distribute( vals )
assert( nargout <= numel( vals ), 'To many output arguments' )
varargout = arrayfun( #(X) {X}, vals(:) );
end
Explanation:
nargout is special variable in MATLAB function calls. Its value is equal to the number of output parameters that distribute is called with. So, the check nargout <= numel( vals ) evaluates if enough elements are given in vals to distribute them to the output variables and raises an assertion otherwise.
arrayfun( #(X) {X}, vals(:) ) converts vals to a cell array. The conversion is necessary as varargout is also a special variable in MATLAB's function calls, which must be a cell array.
The special thing about varargout is that MATLAB assigns the individual cells of varargout to the individual output parameters, i.e. in the above call to [A,B,C,D,E,F] as desired.
Note:
In general, I think such expanding of variables is seldom useful. MATLAB is optimized for processing of arrays, separating them to individual variables often only complicates things.
Note 2:
If result is a cell array, i.e. result = {1,2,3,4,5,6}, MATLAB actually allows to split its cells by [A,B,C,D,E,F] = result{:};
One way as long as you know the size of results in advance:
results = num2cell(eqs\soltns);
[A,B,C,D,E,F] = results{:};
This has to be done in two steps because MATLAB does not allow for indexing directly the results of a function call.
But note that this method is hard to generalize for arbitrary sizes. If the size of results is unknown in advance, it would probably be best to leave results as a vector in your downstream code.

Julia: How to efficiently sort subarrays of 2 large arrays in parallel?

I have large 1D arrays a and b, and an array of pointers I that separates them into subarrays. My a and b barely fit into RAM and are of different dtypes (one contains UInt32s, the other Rational{Int64}s), so I don’t want to join them into a 2D array, to avoid changing dtypes.
For each i in I[2:end], I wish to sort the subarray a[I[i-1],I[i]-1] and apply the same permutation to the corresponding subarray b[I[i-1],I[i]-1]. My attempt at this is:
function sort!(a,b)
p=sortperm(a);
a[:], b[:] = a[p], b[p]
end
Threads.#threads for i in I[2:end]
sort!( a[I[i-1], I[i]-1], b[I[i-1], I[i]-1] )
end
However, already on a small example, I see that sort! does not alter the view of a subarray:
a, b = rand(1:10,10), rand(-1000:1000,10) .//1
sort!(a,b); println(a,"\n",b) # works like it should
a, b = rand(1:10,10), rand(-1000:1000,10) .//1
sort!(a[1:5],b[1:5]); println(a,"\n",b) # does nothing!!!
Any help on how to create such function sort! (as efficient as possible) are welcome.
Background: I am dealing with data coming from sparse arrays:
using SparseArrays
n=10^6; x=sprand(n,n,1000/n); #random matrix with 1000 entries per column on average
x = SparseMatrixCSC(n,n,x.colptr,x.rowval,rand(-99:99,nnz(x)).//1); #chnging entries to rationals
U = randperm(n) #permutation of rows of matrix x
a, b, I = U[x.rowval], x.nzval, x.colptr;
Thus these a,b,I serve as good examples to my posted problem. What I am trying to do is sort the row indices (and corresponding matrix values) of entries in each column.
Note: I already asked this question on Julia discourse here, but received no replies nor comments. If I can improve on the quality of the question, don't hesitate to tell me.
The problem is that a[1:5] is not a view, it's just a copy. instead make the view like
function sort!(a,b)
p=sortperm(a);
a[:], b[:] = a[p], b[p]
end
Threads.#threads for i in I[2:end]
sort!(view(a, I[i-1]:I[i]-1), view(b, I[i-1]:I[i]-1))
end
is what you are looking for
ps.
the #view a[2:3], #view(a[2:3]) or the #views macro can help making thins more readable.
First of all, you shouldn't redefine Base.sort! like this. Now, sort! will shadow Base.sort! and you'll get errors if you call sort!(a).
Also, a[I[i-1], I[i]-1] and b[I[i-1], I[i]-1] are not slices, they are just single elements, so nothing should happen if you sort them either with views or not. And sorting arrays in a moving-window way like this is not correct.
What you want to do here, since your vectors are huge, is call p = partialsortperm(a[i:end], i:i+block_size-1) repeatedly in a loop, choosing a block_size that fits into memory, and modify both a and b according to p, then continue to the remaining part of a and find next p and repeat until nothing remains in a to be sorted. I'll leave the implementation as an exercise for you, but you can come back if you get stuck on something.

Function handle applied to array in MATLAB

I've searched and there are many answers to this kind of question, suggesting functions like arrayfun, bsxfun, and so on. I haven't been able to resolve the issue due to dimension mismatches (and probably a fundamental misunderstanding as to how MATLAB treats anonymous function handles).
I have a generic function handle of more than one variable:
f = #(x,y) (some function of x, y)
Heuristically, I would like to define a new function handle like
g = #(x) sum(f(x,1:3))
More precisely, the following does exactly what I need, but is tedious to write out for larger arrays (say, 1:10 instead of 1:3):
g = #(x) f(x,1)+f(x,2)+f(x,3)
I tried something like
g = #(x) sum(arrayfun(#(y) f(x,y), 1:3))
but this does not work as soon as the size of x exceeds 1x1.
Thanks in advance.
Assuming you cannot change the definition of f to be more vector-friendly, you can use your last solution by specifying a non-uniform output and converting the output cell array to a matrix:
g = #(x) sum(cell2mat(arrayfun(#(y) f(x,y), 1:3,'UniformOutput',false)),2);
This should work well if f(x,y) outputs column vectors and you wish to sum them together. For rows vectors, you can use
g = #(x) sum(cell2mat(arrayfun(#(y) f(x,y), 1:3,'UniformOutput',false).'));
If the arrays are higher in dimension, I actually think a function accumulator would be quicker and easier. For example, consider the (extremely simple) function:
function acc = accumfun(f,y)
acc = f(y(1));
for k = 2:numel(y)
acc = acc + f(y(k));
end
end
Then, you can make the one-liner
g = #(x) accumfun(#(y) f(x,y),y);

Efficiently calculating weighted distance in MATLAB

Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically less than 1000 pairs). Within the grander scheme of the program i am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as i can tell, no solution to this particular problem has been posted. The statstics toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, i would like to avoid using functions from the statistics toolbox (i am not certain the user of the function will have access to those toolboxes).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% format weights for multiplication
wts = repmat(wts,[numA,1,numB]);
% get featural differences between A and B pairs
A = repmat(A,[1 1 numB]);
B = repmat(permute(B,[3,2,1]),[numA,1,1]);
differences = abs(A-B).^r;
% weigh difference values before combining them
differences = differences.*wts;
differences = differences.^(1/r);
% combine features to get distance
D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% use for-loops to generate differences
D = zeros(numA,numB);
for i=1:numA
for j=1:numB
differences = abs(A(i,:) - B(j,:)).^(1/r);
differences = differences.*wts;
differences = differences.^(1/r);
D(i,j) = sum(differences,2);
end
end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
Its clear that the repmat-and-permute version works much more quickly than the double-for-loop version, at least for smaller datasets. But i also know that calls to repmat often slow things down, however. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!
EDIT
#Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing euclidean distance [link], so i use that method in the euclidean case, and i took #Divakar's advice for city-block. It is still not as fast as pdist2, but its must faster than either of the approaches i laid out earlier in this post, and easily accepts weightings.
You can replace repmat by bsxfun. Doing so avoids explicit repetition, therefore it's more memory-efficient, and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else
error('Function only accepts "cityblock" and "euclidean" distance')
end
differences = abs(bsxfun(#minus, A, permute(B, [3 2 1]))).^r;
differences = bsxfun(#times, differences, wts).^(1/r);
D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 ("cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix-multiplication, which must speed up things. The implementation would look something like this -
%// Calculate absolute elementiwse subtractions
absm = abs(bsxfun(#minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplications with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);

Growing arrays in Haskell

I have the following (imperative) algorithm that I want to implement in Haskell:
Given a sequence of pairs [(e0,s0), (e1,s1), (e2,s2),...,(en,sn)], where both "e" and "s" parts are natural numbers not necessarily different, at each time step one element of this sequence is randomly selected, let's say (ei,si), and based in the values of (ei,si), a new element is built and added to the sequence.
How can I implement this efficiently in Haskell? The need for random access would make it bad for lists, while the need for appending one element at a time would make it bad for arrays, as far as I know.
Thanks in advance.
I suggest using either Data.Set or Data.Sequence, depending on what you're needing it for. The latter in particular provides you with logarithmic index lookup (as opposed to linear for lists) and O(1) appending on either end.
"while the need for appending one element at a time would make it bad for arrays" Algorithmically, it seems like you want a dynamic array (aka vector, array list, etc.), which has amortized O(1) time to append an element. I don't know of a Haskell implementation of it off-hand, and it is not a very "functional" data structure, but it is definitely possible to implement it in Haskell in some kind of state monad.
If you know approx how much total elements you will need then you can create an array of such size which is "sparse" at first and then as need you can put elements in it.
Something like below can be used to represent this new array:
data MyArray = MyArray (Array Int Int) Int
(where the last Int represent how many elements are used in the array)
If you really need stop-and-start resizing, you could think about using the simple-rope package along with a StringLike instance for something like Vector. In particular, this might accommodate scenarios where you start out with a large array and are interested in relatively small additions.
That said, adding individual elements into the chunks of the rope may still induce a lot of copying. You will need to try out your specific case, but you should be prepared to use a mutable vector as you may not need pure intermediate results.
If you can build your array in one shot and just need the indexing behavior you describe, something like the following may suffice,
import Data.Array.IArray
test :: Array Int (Int,Int)
test = accumArray (flip const) (0,0) (0,20) [(i, f i) | i <- [0..19]]
where f 0 = (1,0)
f i = let (e,s) = test ! (i `div` 2) in (e*2,s+1)
Taking a note from ivanm, I think Sets are the way to go for this.
import Data.Set as Set
import System.Random (RandomGen, getStdGen)
startSet :: Set (Int, Int)
startSet = Set.fromList [(1,2), (3,4)] -- etc. Whatever the initial set is
-- grow the set by randomly producing "n" elements.
growSet :: (RandomGen g) => g -> Set (Int, Int) -> Int -> (Set (Int, Int), g)
growSet g s n | n <= 0 = (s, g)
| otherwise = growSet g'' s' (n-1)
where s' = Set.insert (x,y) s
((x,_), g') = randElem s g
((_,y), g'') = randElem s g'
randElem :: (RandomGen g) => Set a -> g -> (a, g)
randElem = undefined
main = do
g <- getStdGen
let (grownSet,_) = growSet g startSet 2
print $ grownSet -- or whatever you want to do with it
This assumes that randElem is an efficient, definable method for selecting a random element from a Set. (I asked this SO question regarding efficient implementations of such a method). One thing I realized upon writing up this implementation is that it may not suit your needs, since Sets cannot contain duplicate elements, and my algorithm has no way to give extra weight to pairings that appear multiple times in the list.

Resources