numpy array split/partition efficiency

This appears to be a simple issue, but I've been struggling trying to efficiently split a 2D array:
import time
import numpy as np

start_time = time.time()
M = np.ones((400,400))
for i in range(10000):
    e = np.array_split(M, 20)  # 20 row blocks of shape (20, 400)
print(time.time() - start_time)
However, this takes ~6 seconds, compared to ~0.5 seconds when implemented in Mathematica with the Partition function, which can become a liability when the array gets much larger. Is there any way for me to speed up the process?

np.array_split may be useful when splitting an array into uneven pieces. Here, the size of each item in e is the same, so you could just use reshape:
e = M.reshape(20,-1)
This will be exceedingly fast, since it requires no copying of the array, only a change to the array's shape attribute.
e will be a 2D NumPy array of shape (20, 8000), not a list of NumPy arrays.
In [56]: M = np.ones((400,400))
In [60]: %timeit M.reshape(20,-1)
1000000 loops, best of 3: 447 ns per loop
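One caveat: M.reshape(20,-1) flattens each group of 20 rows into a single row of length 8000. If you want the same 20 blocks of shape (20, 400) that np.array_split returns, a 3D reshape is one way to sketch it. The names pieces and blocks below are just illustrative, and this only works when the number of rows divides evenly:

```python
import numpy as np

M = np.arange(400 * 400, dtype=float).reshape(400, 400)

# What the question computes: a list of 20 arrays, each of shape (20, 400).
pieces = np.array_split(M, 20)

# A single 3D array whose first axis indexes the same 20 blocks.
# -1 lets NumPy infer the rows per block (400 / 20 = 20).
blocks = M.reshape(20, -1, M.shape[1])

# blocks is a view of M, so no data is copied.
is_view = np.shares_memory(M, blocks)
```

Here blocks[k] holds the same values as pieces[k], and flattening blocks[k] gives row k of M.reshape(20,-1).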


Median of multiple arrays in Julia

It's related to this question
I want to know how to calculate the median along a specific dimension of a huge array, for example one with size (20, 1920, 1080, 3). I am not sure whether there is any practical purpose, but I just wanted to check how well median works in Julia.
It takes ~0.5 seconds to calculate medians on (3,1920,1080,3) with numpy. It is very fast on an array of zeros (less than 2 seconds on (120, 1920, 1080,3)) and slower but still fine on real images (20 seconds on (120, 1920, 1080,3)).
Python code:
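import cv2
import sys
import numpy as np
import time

# Assumed settings: both names are used below but not defined in the posted code.
N_IMGS = 120    # number of images to load (the question also mentions 3 and 20)
ZEROES = False  # True: benchmark on an all-zeros array instead of real images

print("n_imgs:", N_IMGS)
print("use dummy data:", ZEROES)
imgs_paths = sys.argv[1:]
imgs_paths_sparse = imgs_paths[::30]
imgs_paths = imgs_paths_sparse[:N_IMGS]  # slice; a bare index would return one path
if ZEROES:
    imgs_arr = np.zeros((N_IMGS,1080,1920,3), dtype=np.float32)
else:
    imgs = map(cv2.imread, imgs_paths)
    imgs_arr = np.array(list(imgs), dtype=np.float32)
start = time.time()
imgs_median = np.median(imgs_arr, 0)  # median across images (axis 0)
end = time.time()
print("time:", end - start)
cv2.imwrite('/tmp/median.png', imgs_median)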
In Julia I can only calculate the median of (3, 1920, 1080,3). Beyond that, my earlyoom process kills the Julia process because of the huge amount of memory used.
I tried an approach similar to what I first tried for max:
function median1(imgs_arr)
    a = imgs_arr
    b = reshape(cat(a..., dims=1), tuple(length(a), size(a[1])...))
    imgs_max = Statistics.median(b, dims=1)
    return imgs_max
end
Or an even simpler case:
import Statistics
a = zeros(3,1080,1920,3)
@time Statistics.median(a, dims=1)
10.609627 seconds (102.64 M allocations: 2.511 GiB, 3.37% gc time)
So it takes 10 seconds vs 0.5 seconds with numpy.
I have only 4 CPU cores, so the gap is not simply down to parallelization.
Is there a reasonably simple way to optimize it?
Or at least to take slices and compute them one by one without excessive memory use?
It's hard to know if the fact that the images are loaded separately is a key part of the problem here or not since the setup for the problem in Julia is missing and it's a bit hard for Julia programmers to follow the Python setup or know how much we need to match it. You either need to:
Load or move the image data so that they are, in fact, part of the same array and then take the median of that;
Make a set of spatially unrelated values in different arrays abstractly behave as though they are part of a single array and then take the median of that collection via a method that's generic enough to handle this abstraction.
Fredrik's answer implicitly assumes that you have already loaded the image data so that they're all part of the same contiguous array. If that's the case, however, then you don't even need JuliennedArrays, you can just use the median function from the Statistics stdlib:
julia> a = rand(3, 1080, 1920, 3);
julia> using Statistics
julia> median(a, dims=1)
1×1080×1920×3 Array{Float64,4}:
[:, :, 1, 1] =
0.63432 0.205958 0.216221 0.571541 … 0.238637 0.285947 0.901014
[:, :, 2, 1] =
0.821851 0.486859 0.622313 … 0.917329 0.417657 0.724073
If you can load the data like this, it's the best approach: this is by far the most efficient representation of a bunch of same-sized images and makes vectorized operations across images easy and efficient. The first dimension is the most efficient one to do operations across because Julia is column-major, so the first dimension (columns) is stored contiguously.
The best way to get the images into contiguous memory is to pre-allocate an uninitialized array of the right type and dimensions and then read the data into the array using some in-place API. For some reason your Julia code appears to have loaded the images as a vector of individual arrays while your Python code seems to have loaded all of the images into a single array?
The approach of reshaping and concatenating is an extreme case of the second approach where you move all of the data all at once before then applying a vectorized median operation. Obviously, that involves moving a lot of data around, which is pretty inefficient.
Due to memory locality, it may be more efficient to copy a single slice of the data into a temporary array and compute the median of that. That can be done pretty easily with an array comprehension:
julia> v_of_a = [rand(1080, 1920, 3) for _ = 1:3]
3-element Array{Array{Float64,3},1}:
[0.7206652600431633 0.7675119703509619 … 0.7117084561740263 0.8736518021960584; 0.8038479801395197 0.3159392943734012 … 0.976319025405266 0.3278606124069767; … ; 0.7424260315304789 0.4748658164109498 … 0.9942311708400311 0.37048961459068086; 0.7832577306186075 0.13184454935145773 … 0.5895094390350453 0.5470111170897787]
[0.26401298651503025 0.9113932653115289 … 0.5828647778524962 0.752444909740893; 0.5673144007678044 0.8154276504227804 … 0.2667436824684424 0.4895443896447764; … ; 0.2641913584303701 0.16639100493266934 … 0.1860616855126005 0.04922131616483538; 0.4968214514330498 0.994935452055218 … 0.28097239922248685 0.4980189891952156]
julia> [median(a[i,j,k] for a in v_of_a) for i=1:1080, j=1:1920, k=1:3]
1080×1920×3 Array{Float64,3}:
[:, :, 1] =
0.446895 0.643648 0.694714 … 0.221553 0.711708 0.225268
0.659251 0.457686 0.672072 0.731218 0.449915 0.129987
0.573196 0.328747 0.668702 0.355231 0.656686 0.303168
0.243656 0.702642 0.45708 0.23415 0.400252 0.482792
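On the NumPy side, the same idea of working on one slice at a time, which the question asks about, can be sketched as follows. The helper name chunked_median and the rows_per_chunk parameter are made up for illustration, and the sketch assumes the images are held as a list of same-shaped arrays. Only one band of rows from every image is stacked at a time, so the temporary stays small:

```python
import numpy as np

def chunked_median(images, rows_per_chunk=64):
    """Per-pixel median across a list of same-shaped arrays, one band of rows at a time."""
    height = images[0].shape[0]
    out = np.empty(images[0].shape, dtype=np.float64)
    for start in range(0, height, rows_per_chunk):
        stop = min(start + rows_per_chunk, height)
        # Copy just this band of rows from every image into one temporary stack.
        block = np.stack([img[start:stop] for img in images], axis=0)
        # Median across the image axis, written into the matching band of the result.
        out[start:stop] = np.median(block, axis=0)
    return out

# Small synthetic stand-in for real images (5 "images" of 8 x 6 pixels, 3 channels).
rng = np.random.default_rng(0)
imgs = [rng.random((8, 6, 3)) for _ in range(5)]
result = chunked_median(imgs, rows_per_chunk=3)
```

The band size trades peak memory against the number of np.median calls; the value that works best for full-size frames would need to be measured.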
Try JuliennedArrays.jl
julia> a = zeros(3,1080,1920,3);
julia> using JuliennedArrays
julia> @time map(median, Slices(a,1));
0.822429 seconds (6.22 M allocations: 711.915 MiB, 20.15% gc time)
As Stefan commented below, the built-in median does the same thing, but much slower
julia> @time median(a, dims=1);
7.450394 seconds (99.80 M allocations: 2.368 GiB, 4.47% gc time)
at least as of
julia> VERSION
v"1.5.0-DEV.876"

Julia: three dimensional arrays (performance)

Going through Julia's performance tips, I haven't found any suggestions on how to speed up code that uses three-dimensional arrays.
From my understanding d-element Array{Array{Float64,2},1} would perform best when d (the third dimension) is small. However, I am not sure whether this is the case when d is large.
Is there any tutorial on this topic for Julia?
Example 1a (d=50)
x = [zeros(100, 10) for d=1:50];
@time for d=1:50
    x[d] = rand(100,10);
end
0.000100 seconds (50 allocations: 396.875 KB)
Example 1b (d=50)
y=zeros(100, 10, 50);
@time for d=1:50
    y[:,:,d] = rand(100,10);
end
0.000257 seconds (200 allocations: 400.781 KB)
Example 2a (d=50000)
x = [zeros(100, 10) for d=1:50000];
@time for d=1:50000
    x[d] = rand(100,10);
end
0.410813 seconds (99.49 k allocations: 388.328 MB, 81.88% gc time)
Example 2b (d=50000)
y=zeros(100, 10, 50000);
@time for d=1:50000
    y[:,:,d] = rand(100,10);
end
0.185929 seconds (298.98 k allocations: 392.898 MB, 6.83% gc time)
"From my understanding d-element Array{Array{Float64,2},1} would perform best when d (the third dimension) is small. However, I am not sure whether this is the case when d is large."
No, it's more about how you use it. A = Array{Array{Float64,2},1} is an array of pointers to matrices. The value stored in each element is the pointer, i.e. the reference. Thus A[i] returns a reference, which is cheap. A2 = Array{Float64,3} is a contiguous array of floats. It's really just an indexing scheme over a linear slab of memory (and it has a linear index A2[i] which runs through the whole thing in that linear form).
The latter has some advantages because it is contiguous. There's no indirection, so looping over all of A2's values will be faster. A has to dereference two pointers to get a value, so a simple 3D loop will be slower if you don't know to dereference each internal matrix only once. Also, you can get views of the matrices via @view A2[:,:,1] etc., but note that A2[:,:,1] by itself will make a copy of the matrix. A[1] is naturally a view because it returns the reference to the matrix, and if you want a copy you have to explicitly do copy(A[1]). Because A is just a linear array of pointers, push!ing a new matrix onto it is cheap, since it only grows a relatively small array (and push! is automatically amortized) to add a new pointer on the end (this is why things like DifferentialEquations.jl use arrays of arrays to build timeseries instead of the more traditional matrix).
So they are different tools with different advantages and disadvantages.
As for your timings, you're doing two different things. x[d] = rand(100,10) is creating a new matrix and adding its reference to x. y[:,:,d] = rand(100,10) is creating a new matrix and looping through the values of y to change the values of y. You can see why that's slower. But what you're leaving out is the allocation-free cases.
function f2()
    y=zeros(100, 10, 50);
    @time for i in eachindex(y)
        y[i] = rand()
    end
end
In the small case this matches the array creation. You can't naively do this on case one, but as I said, if you dereference the pointer for the matrix once you do really well:
function f()
    x = [zeros(100, 10) for d=1:5000];
    @time @inbounds for d=1:50
        xd = x[d]
        for i in eachindex(xd)
            xd[i] = rand()
        end
    end
end
So arrays of arrays can be great data structures in the right cases. The library RecursiveArrayTools.jl was created to take better advantage of them. For example, A3 = VectorOfArray(A) gives A3 the same indexing structure as A2 by lazily transforming A3[i,j,k] to A[k][i,j]. It keeps the advantages of A, but automatically broadcasts in the efficient way that f does. Another tool like this is the ArrayPartition, which allows for heterogeneous typing in a broadcast-performant way.
So yeah, it's not always the right tool, but these heterogeneous and recursive arrays are great tools when used correctly.
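For readers more at home in NumPy, the difference between the two assignments in the timings can be mimicked roughly as below. This is only an analogy under stated assumptions: a Python list of 2D arrays stands in for Array{Array{Float64,2},1}, a 3D ndarray stands in for Array{Float64,3}, and NumPy's slicing rules differ from Julia's (NumPy slices are views by default), so only the assignment behaviour is being illustrated.

```python
import numpy as np

d = 50
x = [np.zeros((100, 10)) for _ in range(d)]  # list of references to separate matrices
y = np.zeros((100, 10, d))                   # one contiguous block of floats

new = np.random.rand(100, 10)

# Like x[d] = rand(100,10): the list slot is rebound to the new matrix, nothing is copied.
x[0] = new
rebinds = x[0] is new

# Like y[:,:,d] = rand(100,10): all 1000 values are copied into y's existing memory.
y[:, :, 0] = new
copies = not np.shares_memory(y, new)

# Allocation-free variant for the list case: grab the inner matrix once, then fill it.
xd = x[1]
xd[...] = 1.0
```

After this runs, rebinds and copies are both true, which is the asymmetry the timings in the question were measuring.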

Broadcast function that changes dimension of the input array

Given some function f that accepts a 1D array and returns a 2D array, is it possible to apply it efficiently to each row of an NxM array A?
More specifically, I want to apply np.triu to each row of the NxM array A and then concatenate all the results. I can achieve this by
B = np.dstack(map(np.triu, A))
which gives an MxMxN array. However, this is not very efficient for large N. Unfortunately, np.apply_along_axis cannot be used here because f changes the dimension.
Knowing the power of NumPy for efficient broadcasting, I am almost sure that there exists a better solution for my problem.
Here's a vectorized approach using broadcasting -
Bout = A.T*(np.tri(A.shape[1],dtype=bool).T[...,None])
Runtime test and output verification -
In [319]: A = np.random.randint(0,20,(400,100))
In [320]: %timeit np.dstack(map(np.triu, A))
10 loops, best of 3: 69.9 ms per loop
In [321]: %timeit A.T*(np.tri(A.shape[1],dtype=bool).T[...,None])
10 loops, best of 3: 24.8 ms per loop
In [322]: B = np.dstack(map(np.triu, A))
In [323]: Bout = A.T*(np.tri(A.shape[1],dtype=bool).T[...,None])
In [324]: np.allclose(B,Bout)
Out[324]: True
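To make the broadcasting step easier to follow, here is a small self-contained sketch for Python 3, where map returns an iterator, so a list is passed to np.dstack instead. The np.tile call is my own way of spelling out the 1D-to-2D step (each row repeated M times before taking the upper triangle); it is an assumption about what applying np.triu to a single row is meant to produce, and the names upper, B_loop and B_vec are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 20, size=(6, 4))  # N=6 rows, M=4 columns, kept small for checking
N, M = A.shape

# Loop reference: one upper-triangular M x M matrix per row, stacked on a third axis.
B_loop = np.dstack([np.triu(np.tile(row, (M, 1))) for row in A])

# Broadcast version: an (M, M, 1) boolean mask times (1, M, N) data gives (M, M, N).
upper = np.triu(np.ones((M, M), dtype=bool))
B_vec = upper[:, :, None] * A.T[None, :, :]
```

Both arrays have shape (M, M, N), and entry [i, j, n] is A[n, j] when j >= i and 0 otherwise.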

Fast Random Permutation of Binary Array

For my project, I wish to quickly generate random permutations of a binary array of fixed length and a given number of 1s and 0s. Given these random permutations, I wish to add them elementwise.
I am currently using numpy's ndarray object, which is convenient for adding elementwise. My current code is as follows:
# n is the length of the array. I want to run this across a range of
# n=100 to n=1000.
row = np.zeros(n)
# m_list is a given list of integers. I am iterating over many possible
# combinations of possible values for m in m_list. For example, m_list
# could equal [5, 100, 201], for n = 500.
for m in m_list:
    row += np.random.permutation(np.concatenate([np.ones(m), np.zeros(n - m)]))
My question is, is there any faster way to do this? According to timeit, 1000000 calls of "np.random.permutation(np.concatenate([np.ones(m), np.zeros(n - m)]))" takes 49.6 seconds. For my program's purposes, I'd like to decrease this by an order of magnitude. Can anyone suggest a faster way to do this?
Thank you!
For me, the version with the array allocation moved outside the loop
was faster, but not by much (8% or so, measured with cProfile):
row = np.zeros(n, dtype=np.float64)
wrk = np.zeros(n, dtype=np.float64)
for m in m_list:
    wrk[0:m] = 1.0
    wrk[m:n] = 0.0
    row += np.random.permutation(wrk)
You might try shuffling wrk in place with shuffle(wrk) instead of getting a new array back from permutation, but for me the difference was negligible.
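For completeness, the in-place variant might look like the sketch below. It uses NumPy's Generator API (np.random.default_rng) rather than the legacy np.random functions, which is my choice and not something from the question; the function name sum_shuffled_rows and the seed are illustrative. I make no speed claim here, so time it on your own n and m_list.

```python
import numpy as np

def sum_shuffled_rows(n, m_list, rng):
    """Add up one random binary row (m ones, n - m zeros) for each m in m_list."""
    row = np.zeros(n)
    wrk = np.empty(n)        # scratch buffer reused on every iteration
    for m in m_list:
        wrk[:m] = 1.0        # reset to m ones followed by n - m zeros
        wrk[m:] = 0.0
        rng.shuffle(wrk)     # permutes wrk in place, no new array is returned
        row += wrk
    return row

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
row = sum_shuffled_rows(500, [5, 100, 201], rng)
```

Each pass adds exactly m ones, so row.sum() equals sum(m_list) regardless of the permutation drawn.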

Converting 2D cell of 2D matrices (consistent sizes) into 4D matlab double

Searching around here, one finds many questions about how to convert cell arrays of doubles into one big matrix.
In my application I have a two-dimensional cell array (let's call it celldata, of size m times n) of double matrices that are all the same size (let's say a times b).
I want to convert that data structure into one big 4D double (m times n times a times b).
At the moment I do that by
but maybe there are other methods doing that directly? Maybe with a call like
cat([3 4],celldata{:,:})
or similar.
I think
cell2mat(permute(celldata, [3 4 1 2]))
will do the trick. However,
%// create some bogus data
m = 1.1e2;
n = 1.2e2;
a = 1.3e2;
b = 1.4e2;
celldata = cellfun(@(~) randi(10, a,b, 'uint8'), cell(m,n), 'UniformOutput', false);
%// new method
cell2mat(permute(celldata, [3 4 1 2]));
%// your current method
Elapsed time is 1.745495 seconds. % cell2mat/permute
Elapsed time is 0.305368 seconds. % reshape/cat
cell2mat is a matlab m-file (with necessary inefficiencies in the loop due to compatibility issues), while reshape and cat are built-ins. This is where that difference comes from.
I'd stick with your current method :)
Now, let me ask why you'd want to do this conversion in the first place. Is it an indexing problem? Because
prevents you from having to do the conversion, so you can index like
I don't see other reasons, because matrix/vector operations don't work anyway on 4D arrays...
