I am trying to do simple math operations on every element of a Jython array in the following manner:
import math
for i in xrange(x*y*z):
    medfiltArray[i] = 2 * math.sqrt(medfiltArray[i] + (3.0/8.0))
    InputImgArray[i] = 2 * math.sqrt(InputImgArray[i] + (3.0/8.0))
The problem is that my array is large (8388608 elements) and the process takes a little more than 12 seconds. Is there a more efficient way to do this whole process? I found a slightly faster way (about 7 seconds):
medfiltArray = map(lambda x: 2 * math.sqrt(x + (3.0/8.0) ) , medfiltArray)
The advantage of the for loop over this method is that I can modify several arrays of the same size simultaneously and therefore save up on net time. But despite all this, this is still very slow. In MATLAB modifying a matrix would take less than a second:
img = 2 * sqrt(img + (3/8));
Any tips on modifying arrays in Jython would be very appreciated. Thanks !!!
Python comes with batteries included but no good matrix batteries. Fortunately NumPy fixes that but unfortunately I don't know of the Jython alternatives from personal experience, only what a couple searches reveal: jnumeric (seems outdated), http://acs.lbl.gov/ACSSoftware/colt/ (outdated as well?), http://mail.scipy.org/pipermail/numpy-discussion/2012-August/063751.html and its SO link: Using NumPy and Cpython with Jython ..
In any case a simple CPython/NumPy example could look like this:
import numpy as np
# dummy init values:
x = 800
y = 100
z = 100
length = x*y*z
medfiltArray = np.arange(length, dtype='f')
InputImgArray = np.arange(length, dtype='f')
# m is a constant, no reason to recalculate it 8million times
m = (3.0/8.0)
medfiltArray = 2 * np.sqrt(medfiltArray + m)
InputImgArray = 2 * np.sqrt(InputImgArray + m)
# timed, it runs in:
# real 0m0.161s
# user 0m0.131s
# sys 0m0.032s
Good luck finding your Jython alternative, I hope this sets you onto the right path.
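If switching to NumPy is not an option, part of the same idea (computing the constant once instead of eight million times) carries over to plain Python. The following is a minimal sketch, not a benchmarked solution: the function name stabilize is hypothetical, and how much time it saves under Jython is an assumption that would need to be measured.

```python
import math

def stabilize(values):
    """Return 2 * sqrt(v + 3/8) for every v in values."""
    sqrt = math.sqrt      # bind the function once, outside the loop
    offset = 3.0 / 8.0    # constant hoisted out of the loop
    return [2.0 * sqrt(v + offset) for v in values]

# Toy input standing in for the real image data.
medfiltArray = stabilize([0.0, 1.0, 2.0])
```

The list comprehension builds a new list instead of modifying the input in place, so each array is processed with its own call.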
There is a fast vector and matrix Java library called Vectorz. Vectorz can be imported in Jython and does the computation described in my question in about 200 ms. The user has to switch from the Python (or Java) arrays in Jython to Vectorz arrays. There is also another solution: if you are doing image processing (like me), there is a program called ImageJ with extensive functionality. I am working on an ImageJ plugin, and for these math operations you can also use the internal ImageJ math commands:
IJ.run(InputImg, "32-bit", "");
IJ.run(InputImg, "Add...", "value=0.375 stack");
IJ.run(InputImg, "Square Root", "stack");
IJ.run(InputImg, "Multiply...", "value=2 stack");
This takes only .1 sec.
According to the Numba docs, numpy array creation functions zeros and ones should be supported. However, testing this with simple functions leads to a nopython error when I import the zeros function from numpy. However, if I do import numpy as np and use np.zeros, there is no problem. Is there some difference in the functions I'm getting from numpy? I'd prefer only to import the functions I need, rather than the entire numpy library.
This code snippet fails:
from numpy import array
from numpy import zeros
from numpy.random import rand
from numba import njit, prange
# @njit()
@njit(parallel=True)
def prange_test(A):
    s = 0
    z = zeros((3, 3))
    for i in prange(A.shape[0]):
        s += A[i]
    return s
A = rand(10)
test = prange_test(A)
This code snippet works:
from numpy import array
from numpy.random import rand
from numba import njit, prange
import numpy as np
@njit(parallel=True)
def prange_test(A):
    s = 0
    z = np.zeros((3, 3))
    for i in prange(A.shape[0]):
        s += A[i]
    return s
A = rand(10)
test = prange_test(A)
I'm using Numba version 0.35.0, Numpy version 1.13.2
Let's go step by step
a ) the @numba.njit( parallel = True ) decorator's parallel option is (cit.) "experimental" in its efforts to auto-detect chances in the code to introduce some form of parallelism.
b ) the code is almost exactly the code-snippet from the numba documentation, using almost exactly the same prange()-constructor code-block, but inside an @autojit decorated example:
from numba import autojit, prange
@autojit
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum
c ) the error message reports a problem inside this auto-detect transformation, at line 12, which is only weakly referenced and might be s += A[i]. It points to some problem in the "automated understanding" of the intent expressed in the Intermediate Representation of the code-block, where the prange-index ought to be used - Var($parfor_index_tuple_var.14) - but some type-related or tuple-decoupling-related problem could not be resolved by the numba.jit-LLVM translator. The traceback also mentions that call_parallel_gufunc had problems detecting the upper bound of the prange-constructor, stop = load_range( stop ), whereas the numba documentation so far mentions that only CPU-directed parallel code is supported ( not any { GPU | guvectorize | et al }-non-CPU-kernel(s) ). A better documented MCVE, together with the matching error Traceback, would be appreciated here, instead of a weakly referring PNG-picture.
d ) last but not least, the numba requires as a mandatory step in the documentation the parallel=True to be used only (cit.) "in conjunction with nopython=True"
How to proceed?
1 ) test the above copied numba-published code as-is, to see whether the newer release of numba still keeps all the promises that were already working in the previous releases. I.e. use the @numba.autojit decorator and re-run the exact code copy to { POSACK | NACK } this test.
2 ) test the code, POSACK-ed from step 1, this time under the @numba.njit( parallel = True, nopython = True ) decorator ( no other change except the decorator ) to { POSACK | NACK } the influence of the decorator-policy.
3 ) test the code, POSACK-ed from step 2, this time with other modifications
Conceptual remarks:
With all due respect to the numba-team, there could hardly be a worse example of parallel and prange() anti-pattern than this one.
Besides the awfully immense overhead costs of the [PAR]-process section setup and an absolutely nothing to efficiently compute in parallel ( just notice the actual value dependency-graph .. ) the criticism on the Amdahl's Law initial, add-on overheads-agnostic, formulation shows how much one can pay for principally just worse than original performance. Parallel process scheduling typically has exactly the opposite motivation.
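The point about overheads can be made concrete with a small calculation. The sketch below evaluates the classic Amdahl's Law speedup, 1 / ((1 - p) + p / n), and a variant with an extra setup-cost term added to the denominator. The overhead term and the function name are assumptions made for illustration, not a formula taken from the numba documentation.

```python
def amdahl_speedup(p, n, overhead=0.0):
    """Speedup for a parallel fraction p on n workers.

    overhead is an assumed setup cost, expressed as a fraction of the
    original serial run time; overhead=0.0 gives the classic formula.
    """
    return 1.0 / ((1.0 - p) + p / n + overhead)

classic = amdahl_speedup(0.5, 2)                   # 1 / 0.75
with_setup = amdahl_speedup(0.5, 2, overhead=0.5)  # 1 / 1.25
```

With a setup cost of half the original run time, the result drops below 1, i.e. the "parallel" version is slower than the serial one.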
If indeed interested in smarter code-execution, use numba.jit having much better performance/cost ratio:
shave off any residual type-analyses related parts of the IR-code using explicit announcements of the calling-interface signatures
avoid memory allocations inside the performance-tuned code, rather pre-allocate and pass as another parameter
extend calling interface, so as to avoid things well known at the caller side to be deferred into the numba-automated code-analyses
@numba.jit( 'float64( float64[:], int64, float64[:,:] )', nogil = True, nopython = True )
def prange_test( vectorA,        #
                 vectorAshape0,  # avoids numba-code to speculate on type
                 arrayZ          # avoids "local" new memory allocation
                 ):
    sum = 0
    ...
    return sum
Performance?
import numpy as np
from zmq import Stopwatch; aClk = Stopwatch()
def a_just_vectorised_sum( vectorA ):
    return vectorA.sum()
A = np.random.rand( 1000000 )
aClk.start(); s = a_just_vectorised_sum( A ); aClk.stop()
1145L
1190L
1188L
Benchmark. Always. Always on a real-world sized dataset. Never rely on schoolbook-sized artifacts, but go into real-world scales.
Results show that the 1.000.000 cell-sized vector took about 1,200 [us] ~ 0.0012 [s] to sum(), leaving less than about 1.2 [ns] per cell sum()-ed. This sets a yardstick to compare any other implementation against.
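For readers without pyzmq, a similar measurement can be sketched with the standard library. This example swaps the zmq Stopwatch for timeit and the NumPy array for a plain list, so the absolute numbers will not be comparable with the ones above; the helper name best_time_us is hypothetical.

```python
import timeit

def best_time_us(func, repeat=5):
    """Return the best wall-clock time of func() over several runs, in microseconds."""
    timings = timeit.repeat(func, number=1, repeat=repeat)
    return min(timings) * 1e6

# Real-world sized input, not a toy.
data = [float(i) for i in range(1000000)]
elapsed_us = best_time_us(lambda: sum(data))
per_cell_ns = elapsed_us * 1000.0 / len(data)
```

Taking the minimum over several runs reduces the influence of unrelated system activity on the yardstick.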
Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically less than 1000 pairs). Within the grander scheme of the program I am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as I can tell, no solution to this particular problem has been posted. The statistics toolbox offers pdist and pdist2, which accept many different distance functions, but not weighting. I have seen extensions of these functions that allow for weighting, but these extensions do not allow users to select different distance functions.
Ideally, I would like to avoid using functions from the statistics toolbox (I am not certain the user of the function will have access to those toolboxes).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
    % get some information about the data
    numA = size(A,1);
    numB = size(B,1);
    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else
        error('Function only accepts "cityblock" and "euclidean" distance')
    end
    % format weights for multiplication
    wts = repmat(wts,[numA,1,numB]);
    % get featural differences between A and B pairs
    A = repmat(A,[1 1 numB]);
    B = repmat(permute(B,[3,2,1]),[numA,1,1]);
    differences = abs(A-B).^r;
    % weigh difference values before combining them
    differences = differences.*wts;
    differences = differences.^(1/r);
    % combine features to get distance
    D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
    % get some information about the data
    numA = size(A,1);
    numB = size(B,1);
    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else
        error('Function only accepts "cityblock" and "euclidean" distance')
    end
    % use for-loops to generate differences
    D = zeros(numA,numB);
    for i=1:numA
        for j=1:numB
            % raise to the power r, matching pairdist1
            differences = abs(A(i,:) - B(j,:)).^r;
            differences = differences.*wts;
            differences = differences.^(1/r);
            D(i,j) = sum(differences,2);
        end
    end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
It's clear that the repmat-and-permute version works much more quickly than the double-for-loop version, at least for smaller datasets. But I also know that calls to repmat often slow things down. So I am wondering if anyone in the SO community has any advice to offer to improve the efficiency of either function!
EDIT
@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on github [link]. I ended up finding a pretty good vectorized method for computing euclidean distance [link], so I use that method in the euclidean case, and I took @Divakar's advice for city-block. It is still not as fast as pdist2, but it's much faster than either of the approaches I laid out earlier in this post, and easily accepts weightings.
You can replace repmat by bsxfun. Doing so avoids explicit repetition, therefore it's more memory-efficient, and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
    if strcmp(distancemetric,'cityblock')
        r=1;
    elseif strcmp(distancemetric,'euclidean')
        r=2;
    else
        error('Function only accepts "cityblock" and "euclidean" distance')
    end
    differences = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
    differences = bsxfun(@times, differences, wts).^(1/r);
    D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 ("cityblock" case), you can use bsxfun to get elementwise subtractions and then use matrix multiplication, which should speed things up. The implementation would look something like this -
%// Calculate absolute elementwise subtractions
absm = abs(bsxfun(@minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplications with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);
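For comparison, the same broadcasting idea can be sketched in Python with NumPy, where bsxfun-style expansion is implicit. This is an illustrative sketch under assumptions: the function name pairdist is hypothetical, and the Euclidean branch uses the conventional weighted form sqrt(sum(w * d^2)), which takes the root after summing and is therefore not a line-by-line port of the MATLAB functions above.

```python
import numpy as np

def pairdist(A, B, wts, metric="cityblock"):
    """Weighted pairwise distances between rows of A (nA x k) and B (nB x k)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    wts = np.asarray(wts, dtype=float)
    # Broadcasting builds an (nA, nB, k) array of absolute feature differences.
    diff = np.abs(A[:, None, :] - B[None, :, :])
    if metric == "cityblock":
        # Weighted sum over the feature axis as a matrix-vector product.
        return diff @ wts
    if metric == "euclidean":
        # Assumed convention: root taken after the weighted sum.
        return np.sqrt((diff ** 2) @ wts)
    raise ValueError('only "cityblock" and "euclidean" are supported')

D = pairdist([[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.0]], [0.5, 0.5])
```

As in the MATLAB version, the intermediate array has nA * nB * k elements, so memory use grows with the product of the two point counts.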
Suppose that f(x,y) is a bivariate function as follows:
function [ f ] = f(x,y)
    UN=@(g)1.6*(1-acos(g)/pi)-0.8;
    f= 1+UN(cos(0.5*pi*x+y));
end
How to improve execution time for function F(N) with the following code:
function [VAL] = F(N)
    x=0:4/N:4;
    y=0:2*pi/1000:2*pi;
    VAL=zeros(N+1,3);
    for i = 1:N+1
        val = zeros(1,N+1);
        for j = 1:N+1
            val(j) = trapz(y,f(0,y).*f(x(i),y).*f(x(j),y))/2/pi;
        end
        val = fftshift(fft(val))/N;
        l = (length(val)+1)/2;
        VAL(i,:)= val(l-1:l+1);
    end
    VAL = fftshift(fft(VAL,[],1),1)/N;
    L = (size(VAL,1)+1)/2;
    VAL = VAL(L-1:L+1,:);
end
Note that N=2^p where p>10, so please consider the memory limitations while optimizing the code using ndgrid, arrayfun, etc.
FYI: The code intends to find the central 3-by-3 submatrix of the fftn of
fun=@(a,b) trapz(y,f(0,y).*f(a,y).*f(b,y))/2/pi;
where a,b are in [0,4]. The key idea is that we can save memory using the code above, especially when N is very large. But the execution time is still an issue because of the nested loops. See the figure below for N=2^2:
This is not a full answer, but some possibly helpful hints:
0) The trivial: Are you sure you need numerics? Can't you do the computation analytically?
1) Do not use function handles:
function [ f ] = f(x,y)
    f= 1+1.6*(1-acos(cos(0.5*pi*x+y))/pi)-0.8;
end
2) Simplify analytically: acos(cos(x)) is the same as abs(mod(x + pi, 2 * pi) - pi), which should compute slightly faster. Or, instead of sampling and then numerically integrating, first integrate analytically and sample the result.
3) The FFT is a very efficient algorithm to compute the full DFT, but you don't need the full DFT. Since you only want the central 3 x 3 coefficients, it might be more efficient to directly apply the DFT definition and evaluate the formula only for those coefficients that you want. That should be both fast and memory-efficient.
4) If you repeatedly do this computation, it might be helpful to precompute DFT coefficients. Here, dftmtx from the Signal Processing toolbox can assist.
5) To get rid of the loops, think about the problem not in the form of computation instructions, but a single matrix operation. If you consider your input N x N matrix as a vector with N² elements, and your output 3 x 3 matrix as a 9-element vector, then the whole operation you apply (numerical integration via trapz and DFT via fft) appears to be a simple linear transform, which it should be possible to express as an N² x 9 matrix.
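Hint 3 can be sketched as follows. For an odd-length sequence, the three central entries of fftshift(fft(val)) are the DFT coefficients at frequencies -1, 0 and 1, so they can be computed straight from the DFT definition with a cost proportional to three times the sequence length. The sketch is in Python with only the standard library; the function name central_dft is hypothetical and the cosine input is a toy check, not the integrand from the question.

```python
import cmath
import math

def central_dft(values, freqs=(-1, 0, 1)):
    """Evaluate X_k = sum_n x_n * exp(-2*pi*i*k*n/M) only for the given k."""
    m = len(values)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / m) for n, x in enumerate(values))
        for k in freqs
    ]

# Check on one period of a cosine sampled at M = 9 points:
# the coefficients at k = -1 and k = 1 should be M / 2, and the one at k = 0 zero.
M = 9
signal = [math.cos(2.0 * math.pi * n / M) for n in range(M)]
coeffs = central_dft(signal)
```

Applying the same idea along the second dimension would give the 3-by-3 block without ever forming a full FFT.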
I have around 3000 files. Each file has around 55000 rows/identifiers and around 100 columns. I need to calculate row-wise correlation or weighted covariance for each file (depending upon the number of columns in the file). The number of rows is the same in all the files. I would like to know what the most effective way is to calculate the correlation matrix for each file. I have tried Perl and C++ but it is taking a lot of time to process a file -- Perl takes 6 days, C takes more than a day. Typically, I don't want to take more than 15-20 minutes per file.
Now, I would like to know if I could process it faster using some trick or something. Here is my pseudo code:
while (using the file handler)
    read the file line by line
    store the column values in hash1, where the key is the identifier
    store the mean and ssxx (sum of squared deviations of x from the mean) in hash2 and hash3 respectively (I used a hash of hashes in Perl) by calling the mean and ssxx functions
end
close file handler
for loop traversing the hash (this is a nested for loop, as I need the values of 2 different identifiers to calculate the correlation coefficient)
    calculate ssxy by calling the ssxy function, i.e. the sum of the products of the deviations of x and y from their means
    calculate the correlation coefficient
end
Now, I am calculating the correlation coefficient for a pair only once and I am not calculating the correlation coefficient for the same identifier. I have taken care of that using my nested for loop. Do you think there is a way to calculate the correlation coefficient faster? Any hints/advice would be great. Thanks!
EDIT1:
My Input File looks like this -- for the first 10 identifiers:
"Ident_01" 6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02" 449.803 367.757 302.633 318.037 331.55
"Ident_03" 16.4878 198.937 220.376 91.352 237.983
"Ident_04" 26.4878 398.937 130.376 92.352 177.983
"Ident_05" 36.4878 298.937 430.376 93.352 167.983
"Ident_06" 46.4878 498.937 560.376 94.352 157.983
"Ident_07" 56.4878 598.937 700.376 95.352 147.983
"Ident_08" 66.4878 698.937 990.376 96.352 137.983
"Ident_09" 76.4878 798.937 120.376 97.352 117.983
"Ident_10" 86.4878 898.937 450.376 98.352 127.983
EDIT2: here is snippet/subroutines or functions that I wrote in perl
use List::Util qw(sum);   # provides sum(), used in mean()
## Pearson Correlation Coefficient
sub correlation {
    my( $arr1, $arr2) = @_;
    my $ssxy = ssxy( $arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean} );
    my $cor = $ssxy / sqrt( $arr1->{ssxx} * $arr2->{ssxx} );
    return $cor ;
}
## Mean
sub mean {
    my $arr1 = shift;
    my $mu_x = sum( @$arr1) /scalar(@$arr1);
    return($mu_x);
}
## Sum of Squared Deviations of x to the mean i.e. ssxx
sub ssxx {
    my ( $arr1, $mean_x ) = @_;
    my $ssxx = 0;
    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxx = $ssxx + ( $arr1->[$i] - $mean_x )**2;
    }
    return($ssxx);
}
## Sum of Squared Deviations of xy to the mean i.e. ssxy
sub ssxy {
    my( $arr1, $arr2, $mean_x, $mean_y ) = @_;
    my $ssxy = 0;
    ## looping over all the samples
    for( my $i = 0; $i < @$arr1; $i++ ){
        $ssxy = $ssxy + ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
    }
    return ($ssxy);
}
Have you searched CPAN? There is a method, gsl_stats_correlation, for computing Pearson's correlation. It is in Math::GSL::Statistics, a module that binds to the GNU Scientific Library.
gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array reference $data1 and $data2 which must both be of the same length $n.
$$r = \frac{\operatorname{cov}(x, y)}{\hat\sigma_x \hat\sigma_y} = \frac{\frac{1}{n-1} \sum (x_i - \hat x)(y_i - \hat y)}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat x)^2}\,\sqrt{\frac{1}{n-1} \sum (y_i - \hat y)^2}}$$
While minor improvements might be possible, I would suggest investing in learning PDL. The documentation on matrix operations may be useful.
@Sinan and @Praveen have the right idea for how to do this within perl. I would suggest that the overhead inherent in perl means you will never get the efficiency that you are looking for. I would suggest that you work on optimizing your C code.
First step would be to set the -O3 flag for maximum code optimization.
From there, I would change your ssxx code so that it subtracts the mean from each data point in place: x[i] -= mean. This means that you no longer need to subtract the mean in your ssxy code, so you do the subtraction once instead of 55001 times.
I would check the disassembly to guarantee that the (x-mean)**2 is compiled to a multiplication, instead of 2^(2 * log(x - mean)), or just write it that way instead.
What sort of data structure are you using for your data? A double** with memory allocated for each row will lead to extra calls to (the slow function) malloc. Also, it is more likely to lead to memory thrashing with the allocated memory being located in different places. Ideally, you should have as few calls to malloc for as large as possible blocks of memory, and using pointer arithmetic to traverse the data.
More optimizations should be possible. If you post your code, I can make some suggestions.
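The advice about subtracting the mean once can be taken one step further: if each row is centered and scaled by sqrt(ssxx) once, the correlation of any pair reduces to a plain dot product. A minimal Python sketch of that idea, with hypothetical function names and toy data rather than the real input files:

```python
import math

def normalize(row):
    """Center a row and scale it to unit length (done once per identifier)."""
    mean = sum(row) / len(row)
    centered = [v - mean for v in row]
    norm = math.sqrt(sum(c * c for c in centered))  # sqrt(ssxx)
    return [c / norm for c in centered]

def correlation(unit_x, unit_y):
    """Pearson correlation of two rows that were already normalized."""
    return sum(a * b for a, b in zip(unit_x, unit_y))

rows = {
    "Ident_a": [1.0, 2.0, 3.0],
    "Ident_b": [2.0, 4.0, 6.0],
    "Ident_c": [3.0, 2.0, 1.0],
}
unit = {key: normalize(values) for key, values in rows.items()}
r_ab = correlation(unit["Ident_a"], unit["Ident_b"])
r_ac = correlation(unit["Ident_a"], unit["Ident_c"])
```

A constant row has ssxx equal to zero and would cause a division by zero in normalize, so such rows need to be handled separately.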
Does there exist a function similar to that of numpy's * operator for two arrays to multiply their elements in an element-wise manner, returning an array of the similar type?
For example:
#Lets define:
a = [0,1,2,3]
b = [1,2,3,4]
d = [[1,2] , [3,4], [5,6]]
e = [3,4,5]
#I want:
a * 2 == [2*0, 1*2, 2*2, 2*3]
a * b == [0*1, 1*2, 2*3, 3*4]
d * e == [[1*3, 2*3], [3*4, 4*4], [5*5, 6*5]]
d * d == [[1*1, 2*2], [3*3, 4*4], [5*5, 6*6]]
Note how * IS NOT regular matrix multiplication it is element-wise multiplication.
My current best solution is to write some c code, which does this, and import a compiled dll.
There must exist a better solution.
EDIT:
Using LabVIEW 2011 - Needs to be fast.
The first two multiplications can be done by using the 'multiply' primitive. Make sure the arrays in the second case are of the same length.
For the third multiplication you can use a for loop (with auto-indexing). This is needed because you need to instruct LabVIEW what the basic index is.
The last multiplication can (again) be done using the multiply primitive.
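LabVIEW code is graphical, so the block diagram cannot be reproduced as text here, but the intended semantics of the four cases can be written down in Python for reference. This is only a sketch of what each step computes (the helper names are hypothetical), with the third case looping over rows the way the auto-indexed for loop would:

```python
def scale(array, factor):
    """Array times scalar."""
    return [x * factor for x in array]

def multiply(left, right):
    """Element-wise product of two 1-D arrays of equal length."""
    if len(left) != len(right):
        raise ValueError("arrays must have the same length")
    return [x * y for x, y in zip(left, right)]

def multiply_rows(matrix, factors):
    """Multiply row i of a 2-D array by factors[i]."""
    return [scale(row, f) for row, f in zip(matrix, factors)]

def multiply_2d(left, right):
    """Element-wise product of two 2-D arrays of the same shape."""
    return [multiply(r1, r2) for r1, r2 in zip(left, right)]

a = [0, 1, 2, 3]
b = [1, 2, 3, 4]
d = [[1, 2], [3, 4], [5, 6]]
e = [3, 4, 5]
```

Here scale(a, 2), multiply(a, b), multiply_rows(d, e) and multiply_2d(d, d) correspond to the four products listed in the question.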
My result is different (opposite) from the previous posters. I generated a 4x1000 array of random numbers (magnitude 1000) which I multiplied by a 4x4 array of integers (1,2,3,4,...). I did this 100,000 times using the matrix multiplication VI and also using for loops to perform the operation on the arrays. I'm seeing times on the order of 0.328s for the matrix VIs and 0.051s for the for loops. Using a compiled DLL may be faster than LabVIEW, but this does not seem to be true for the built-in functions.
This is certainly not what I expected, but it is consistent over many cycles. The VI is standard execution thread. All data types are set before the timed operations - no coercion takes place in the loops. The operations are performed separately, staged by a flat sequence structure, as is the time measurement. Parallelism is turned off.