Why is slice faster than view() when constructing a multidimensional array from a vector?

Consider the following Vector:
numbers = Int32[1,2,3,4,5,6,7,8,9,10]
If I want to create a 2x5 matrix with the result:
1 2 3 4 5
6 7 8 9 10
I can't use reshape(numbers,2,5) or else I'll get:
1 3 5 7 9
2 4 6 8 10
Using a slice or view(), you can extract the top and bottom rows, convert each to a matrix row, and then use vcat().
I'm not saying a slice or view() is the only or best way of doing it; perhaps there is a faster way using reshape() that I just haven't figured out.
numbers = Int32[1,2,3,4,5,6,7,8,9,10]
println("Using Slice:")
@time numbers_slice_matrix_top = permutedims(numbers[1:5])
@time numbers_slice_matrix_bottom = permutedims(numbers[6:10])
@time vcat(numbers_slice_matrix_top,numbers_slice_matrix_bottom)
println("Using view():")
@time numbers_view_matrix_top = permutedims(view(numbers,1:5))
@time numbers_view_matrix_bottom = permutedims(view(numbers,6:10))
@time vcat(numbers_view_matrix_top,numbers_view_matrix_bottom)
Output:
Using Slice:
0.026763 seconds (5.48 k allocations: 329.155 KiB, 99.78% compilation time)
0.000015 seconds (3 allocations: 208 bytes)
0.301833 seconds (177.09 k allocations: 10.976 MiB, 93.30% compilation time)
Using view():
0.103084 seconds (72.25 k allocations: 4.370 MiB, 99.90% compilation time)
0.000011 seconds (2 allocations: 112 bytes)
0.503787 seconds (246.63 k allocations: 14.537 MiB, 99.85% compilation time)
Why is slice faster? In a few rare cases view() was faster, but not by much.
From the view() documentation:
For example, if x is an array and v = @view x[1:10], then v acts like
a 10-element array, but its data is actually accessing the first 10
elements of x. Writing to a view, e.g. v[3] = 2, writes directly to
the underlying array x (in this case modifying x[3]).
I don't know enough, but from my understanding, view() is slower because it has to reach the data of the original Vector through another array (the view), while a slice creates a copy, so we don't have to worry about touching the original Vector.

Your results actually show that view is faster, not slicing. The point is that only the second test measures the time to run the code, while in tests 1 and 3 you are measuring the time to compile the code.
This is a common misunderstanding of how to run benchmarks in Julia. When a Julia function is run for the first time, it needs to be compiled to machine code. In production code, compile times normally do not matter, because you compile only once, for a fraction of a second, and then run computations for many minutes, hours, or days.
More than that, your code is using a global variable, so in such a microbenchmark you are also measuring "how long does it take to resolve a global variable's type", which is slow in Julia and not representative of production code.
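For illustration, one way to sidestep both issues without any packages is to wrap the work in a function (a minimal sketch, not from the original answer; rows_view is a name made up here): the function is compiled on its first call, and later timings measure only the run time.
function rows_view(v)
    vcat(permutedims(view(v, 1:5)), permutedims(view(v, 6:10)))
end

rows_view(numbers)         # first call triggers compilation
@time rows_view(numbers)   # subsequent calls measure only the run time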
Here is the correct way to run the benchmark using BenchmarkTools:
julia> @btime vcat(permutedims($numbers[1:5]),permutedims($numbers[6:10]));
202.326 ns (7 allocations: 448 bytes)
julia> @btime vcat(permutedims(view($numbers,1:5)),permutedims(view($numbers,6:10)));
88.736 ns (1 allocation: 96 bytes)
Note the interpolation symbol $, which splices numbers into the benchmarked expression so it behaves like a type-stable local variable rather than an untyped global.

reshape(numbers, 5, 2)' can also be used to create the desired 2x5 matrix.
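To spell out why this works (a small sketch based on the snippet above): reshape fills column-major, and the adjoint then flips the result into the desired row layout without copying the data.
numbers = Int32[1,2,3,4,5,6,7,8,9,10]
reshape(numbers, 5, 2)                # 5x2: column 1 is 1:5, column 2 is 6:10
reshape(numbers, 5, 2)'               # lazy 2x5 adjoint: row 1 is 1:5, row 2 is 6:10
permutedims(reshape(numbers, 5, 2))   # same layout as a plain, materialized Matrix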

Related

Numpy is very slow for basic array operations

I have code where I do a lot of basic arithmetic calculations with a bunch of numerical data that is in multiple arrays. I have realized that in most conceivable operations, numpy classes are always slower than the default python ones. Why is this?
For example, I have a simple snippet where all I do is update one numpy array element with another retrieved from a second numpy array, or update it with the mathematical product of two other numpy array elements. It should be a basic operation, yet it is always at least 2-3x slower than if I do it with lists.
First I thought that it was because I hadn't harmonized the data structures and the interpreter had to do a lot of unnecessary conversions. So I recoded the whole thing, replaced every float with numpy.float64 and every list with numpy.ndarray, so the data is numpy.float64 all across the code and no unnecessary conversions are needed.
The code is still 2-3 times slower than if I just use list and float.
For example:
ALPHA = [[random.uniform(*a_param) for k in range(l2)] for l in range(l1)]
COEFF = [[random.uniform(*c_param) for k in range(l2)] for l in range(l1)]
summa = 0.0
for l in range(l1):
    for k in range(l2):
        summa += COEFF[l][k] * ALPHA[l][k]
will always be 2-3x faster than:
ALPHA = numpy.random.uniform(*a_param, (l1,l2))
COEFF = numpy.random.uniform(*c_param, (l1,l2))
summa = 0.0
for l in range(l1):
    for k in range(l2):
        summa += COEFF[l][k] * ALPHA[l][k]
How is this possible? Am I doing something wrong, since numpy is supposed to speed things up?
For the record, I am using Python 3.5.3 and numpy 1.12.1; should I update?
Modifying a single element of a NumPy array is not expected to be faster than modifying a single element of a Python list. The speedup from using NumPy comes when you perform "vectorized" operations on entire arrays (or subsets of arrays). Try assigning the first 10000 elements of a NumPy array to be equal to the first 10000 elements of another, and compare that with using lists.
If your data and/or operations are very small (one or just a few elements), you are probably better off not using NumPy.
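For instance, a quick sketch of that comparison (array names here are illustrative, not from the question):
import numpy as np

a = np.zeros(100000)
b = np.random.uniform(size=100000)

a[:10000] = b[:10000]     # one vectorized statement: the loop runs in compiled C inside NumPy

for i in range(10000):    # element by element: every iteration pays Python-level
    a[i] = b[i]           # indexing and boxing overhead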
I tried two things:
Running your two blocks of code. For me, they were about the same speed.
Writing a new function that exploits numpy's vectorized math. This is several times faster than the other methods.
Here are my functions:
import random
import numpy as np

def with_lists(l1, l2):
    ALPHA = [[random.uniform(0, 1) for k in range(l2)] for l in range(l1)]
    COEFF = [[random.uniform(0, 1) for k in range(l2)] for l in range(l1)]
    summa = 0.0
    for l in range(l1):
        for k in range(l2):
            summa += COEFF[l][k] * ALPHA[l][k]
    return summa

def with_arrays(l1, l2):
    ALPHA = np.random.uniform(size=(l1,l2))
    COEFF = np.random.uniform(size=(l1,l2))
    summa = 0.0
    for l in range(l1):
        for k in range(l2):
            summa += COEFF[l][k] * ALPHA[l][k]
    return summa

def with_ufunc(l1, l2):
    """Avoid the loop completely by exploiting numpy's
    elementwise math."""
    ALPHA = np.random.uniform(size=(l1,l2))
    COEFF = np.random.uniform(size=(l1,l2))
    return np.sum(COEFF * ALPHA)
When I compare the speed (I'm using the %timeit magic in IPython), I get the following:
>>> %timeit with_lists(10, 10)
107 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit with_arrays(10, 10)
91.9 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit with_ufunc(10, 10)
12.6 µs ± 589 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The third function, without loops, is about 10 to 30 times faster on my machine, depending on the values of l1 and l2.
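If even the temporary array allocated by COEFF * ALPHA matters, the same sum can be written as a dot product (an optional refinement beyond the answer above, not part of it):
def with_dot(l1, l2):
    ALPHA = np.random.uniform(size=(l1, l2))
    COEFF = np.random.uniform(size=(l1, l2))
    # ravel() returns flat views of the arrays; dot() sums the elementwise
    # products without building an intermediate l1-by-l2 array.
    return ALPHA.ravel().dot(COEFF.ravel())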

Is there a fast way to expand matrices n times by duplicating each line?

For example, [1 1 ; 2 2 ; 3 3] becomes
[1 1
1 1
1 1
2 2
2 2
2 2
3 3
3 3
3 3]
I am using this in Julia:
expander(orig,mult::Int) = orig[ceil(Int,(1:size(orig,1)*mult)/mult),:];
and the following in Matlab:
function expanded = expander(original,multiplier)
expanded = original(ceil((1:size(original,1)*multiplier)/multiplier),:);
end
Another matlab only way to do it is this:
expanded = kron(original,ones(multiplier,1));
I would prefer a superfast julia option if it exists.
This doesn't prove that kron is fastest, but I compared its time to how long it would just take to populate a similarly sized Array with ones, and kron did quite well:
original = [1 1 ; 2 2 ; 3 3];
multiplier = 3*10^6;
@time begin
    for idx = 1:100
        expanded = kron(original,ones(multiplier));
    end
end
## 9.199143 seconds (600 allocations: 15.646 GB, 9.05% gc time)
@time begin
    for idx = 1:100
        myones = [ones(multiplier*size(original,1)) ones(multiplier*size(original,1))];
    end
end
## 12.746123 seconds (800 allocations: 26.822 GB, 14.86% gc time)
Update: In response to comments by David Sanders, here are the tests wrapped in a function. The reason I originally did the tests globally, which I know isn't best practice, is that it seemed quite plausible to me that the objects might get created globally.
function kron_test(original, multiplier)
    for idx = 1:100
        expanded = kron(original,ones(multiplier));
    end
end

function ones_test(original, multiplier)
    for idx = 1:100
        myones = [ones(multiplier*size(original,1)) ones(multiplier*size(original,1))];
    end
end
## times given after first function call to compile
@time kron_test(original, multiplier); ## 11.107632 seconds (604 allocations: 15.646 GB, 23.98% gc time)
@time ones_test(original, multiplier); ## 15.849761 seconds (604 allocations: 26.822 GB, 33.50% gc time)
Personally, I'd just use repeat:
repeat(original, inner=(multiplier, 1))
Unlike kron, it's very readable and understandable. Unfortunately it is quite a bit slower. Even so, I'd only use kron if you've identified it as a performance bottleneck. While it's faster for computers to execute, it's much slower for humans to understand what's going on… and the performance of repeat should eventually get better (it's issue #15553).
Edit: As of Julia 1.2, repeat has indeed gotten significantly faster. It now rivals kron:
julia> @btime kron($original,ones($multiplier));
81.039 ms (6 allocations: 160.22 MiB)
julia> @btime repeat($original, inner=($multiplier, 1));
84.087 ms (27 allocations: 137.33 MiB)
You could do
a = [1 1 ; 2 2 ; 3 3]
a = a' #matrices are in column major order in julia, should be faster this way
a = repmat(a,1,n)
a = sortcols(a)
Unfortunately I have no clue whether this method is "superfast", but it's relatively simple and intuitive.
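As a side note, repmat and sortcols were removed in Julia 1.0; a rough modern translation of this approach (my sketch, not the original answer) would be:
a = [1 1; 2 2; 3 3]
b = repeat(a', 1, 3)        # 2x9: the three columns repeated n=3 times
b = sortslices(b, dims=2)   # group identical columns together
expanded = permutedims(b)   # back to a 9x2 matrix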

Sorting coordinates of point cloud in accordance with X, Y or Z value

A is a series of points coordinates in 3D (X,Y,Z), for instance:
>> A = [1 2 0;3 4 7;5 6 9;9 0 5;7 8 4]
A =
1 2 0
3 4 7
5 6 9
9 0 5
7 8 4
I would like to sort the matrix with respect to "Y" (second column) values.
Here is the code that I am using:
>> tic;[~, loc] = sort(A(:,2));
SortedA = A(loc,:)
toc;
SortedA =
9 0 5
1 2 0
3 4 7
5 6 9
7 8 4
Elapsed time is 0.001525 seconds.
However, it can be very slow for a large set of data. I would appreciate it if anyone knows a more efficient approach.
Introductory Discussion
This answer mainly discusses how one can harness a compute-capable GPU to solve the stated problem. The solution code presented in the question was -
[~, loc] = sort(A(:,2));
SortedA = A(loc,:);
There are essentially two parts to it -
Select the second column, sort them and get the sorted indices.
Index into the rows of input matrix with the sorted indices.
Now, Part 1 is compute-intensive and could be ported to the GPU, while Part 2 is just indexing work and can stay on the CPU.
Proposed solution
So, considering all these, an efficient GPU solution would be -
gA = gpuArray(A(:,2)); %// Port only the second column of input matrix to GPU
[~, gloc] = sort(gA); %// compute sorted indices on GPU
SortedA = A(gather(gloc),:); %// get the sorted indices back to CPU with `gather`
%// and then use them to get sorted A
Benchmarking
Presented next is the benchmark code comparing the GPU version against the original solution. Keep in mind that since the GPU code runs on different hardware than the CPU-only solution, the benchmark results might vary from system to system.
Here's the benchmark code -
N = 3000000; %// datasize (number of rows in input)
A = rand(N,3); %// generate random large input
disp('------------------ With original solution on CPU')
tic
[~, loc] = sort(A(:,2));
SortedA = A(loc,:);
toc, clear SortedA loc
disp('------------------ With proposed solution on GPU')
tic
gA = gpuArray(A(:,2));
[~, gloc] = sort(gA);
SortedA = A(gather(gloc),:);
toc
Here are the benchmark results -
------------------ With original solution on CPU
Elapsed time is 0.795616 seconds.
------------------ With proposed solution on GPU
Elapsed time is 0.465643 seconds.
So, if you have a decent enough GPU, it's high time to try out the GPU for sorting-related problems, all the more so with MATLAB providing such easy GPU porting.
System Configuration
MATLAB Version: 8.3.0.532 (R2014a)
Operating System: Windows 7
RAM: 3GB
CPU Model: Intel® Pentium® Processor E5400 (2M Cache, 2.70 GHz)
GPU Model: GTX 750Ti 2GB
Try sortrows, specifying column 2:
Asorted = sortrows(A,2)
Simpler, but actually slower now that I test it... Apparently sortrows is not so great if you're sorting on only one column. It's probably best when you sort on multiple columns in a certain order.
MATLAB does have a feature called sortrows() to do this, but in my experience it tends to be as slow as what you're doing for a general unstructured matrix.
Test:
N = 1e4;
A = rand(N,N);
tic;[~, loc] = sort(A(:,2));
SortedA = A(loc,:);
toc;
tic; sortrows(A,2); toc;
Gives:
Elapsed time is 0.515903 seconds.
Elapsed time is 0.525725 seconds.

Trying to compare elements of one array with every element of another array in MATLAB

I'm using Matlab, and I'm trying to come up with a vectorized solution for comparing the elements of one array to every element of another array. Specifically I want to find the difference and see if this difference is below a certain threshold.
Ex: a = [1 5 10 15] and b=[12 13 14 15], threshold = 6
So the elements in a that satisfy the threshold are 10 and 15, since each comes within 6 of at least one of the values in b, while 1 and 5 do not. Currently I have a for loop going through the elements of a, subtracting an equivalently sized matrix from b (for 5 it would be a = [5 5 5 5]). This obviously takes a long time, so I'm trying to find a vectorized solution. Additionally, my data is currently stored in cells, where each cell element has size [1 2], and I have been using the cellfun function to perform my subtraction; I'm not sure if this complicates comparing each [1 2] block against the [1 2] blocks of the second cell. A vectorized solution on its own is fine; there is no need to do the threshold analysis. I just added it for a little more background.
Thanks in advance,
Manwei Chan
Use bsxfun:
>> ind = any(abs(bsxfun(@minus,a(:).',b(:)))<threshold)
ind =
0 0 1 1
>> a(ind)
ans =
10 15
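On MATLAB R2016b or newer, implicit expansion lets you write the same thing without bsxfun (an equivalent sketch of the answer above):
ind = any(abs(a(:).' - b(:)) < threshold);  % 4x4 matrix of |differences|, reduced over rows
a(ind)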

Appending in a for loop / recursion / strange error

I have a matlab/octave for loop which gives me an Inf error message along with incorrect data.
I'm trying to get 240, 120, 60, 30, 15, ... where every number is divided by two, and that result is then divided by two again.
But the code below gives me wrong values: when the number hits 30 and 5, and a couple of others, it doesn't divide by two.
ang=240;
for aa=2:2:10
    ang=[ang;ang/aa];
end
240
120
60
30
40
20
10
5
30
15
7.5
3.75
5
2.5
1.25
0.625
24
12
6
3
4
2
1
0.5
3
1.5
0.75
0.375
0.5
0.25
0.125
0.0625
PS: I will be accessing these values from different arrays; that's why I used a for loop, so I can access the values using their indexes.
In addition to the divide-by-zero error you were starting with (fixed in the edit), the approach you're taking isn't actually doing what you think it is. If you print out each step, you'll see why.
Instead of that approach, I suggest taking more of a "matlab way": avoid the loop by making use of vectorized operations.
orig = 240;
divisor = 2.^(0:5); #% vector of 2 to the power of [0 1 2 3 4 5]
ans = orig./divisor;
output:
ans = [240 120 60 30 15 7.5]
Try the following:
ang=240;
for aa=1:5
    % sz=size(ang,1);
    % ang=[ang;ang(sz)/2];
    ang=[ang;ang(end)/2];
end
You should be getting warning: division by zero if you're running it in Octave. That says pretty much everything.
When you divide by zero, you get Inf. Because of your recursion... you see the problem.
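Presumably the pre-edit loop started its counter at zero, something like this reconstruction (the question has since been edited, so this is an assumption):
ang = 240;
for aa = 0:2:10          % first pass divides by aa = 0
    ang = [ang; ang/aa]; % 240/0 = Inf, and the Inf then propagates
end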
You can simultaneously generalise and vectorise by using logic:
ang=240; %Replace 240 with any positive integer you like
ang=ang*2.^-(0:log2(ang));
ang=ang(1:sum(ang==floor(ang)));
This will work for any positive integer (to make it work for negatives as well, replace log2(ang) with log2(abs(ang))), and will produce the vector down to the point at which it goes odd, at which point the vector ends. It's also faster than jitendra's solution:
octave:26> tic; for i=1:100000 ang=240; ang=ang*2.^-(0:log2(ang)); ang=ang(1:sum(ang==floor(ang))); end; toc;
Elapsed time is 3.308 seconds.
octave:27> tic; for i=1:100000 ang=240; for aa=1:5 ang=[ang;ang(end)/2]; end; end; toc;
Elapsed time is 5.818 seconds.
