I have around 3000 files. Each file has around 55000 rows/identifiers and roughly 100 columns. I need to calculate the row-wise correlation or weighted covariance for each file (depending on the number of columns in the file). The number of rows is the same in all files. What is the most effective way to calculate the correlation matrix for each file? I have tried Perl and C++, but both take a lot of time per file -- Perl takes 6 days, C more than a day. Ideally, I don't want to spend more than 15-20 minutes per file.
Now, I would like to know whether I could process the files faster using some trick. Here is my pseudocode:
while (using the file handler)
reading the file line by line
Storing the column values in hash1 where the key is the identifier
Storing the mean and ssxx (Sum of Squared Deviations of x from the mean) in hash2 and hash3 respectively (I used a hash of hashes in Perl) by calling the mean and ssxx functions
end
close file handler
for loop traversing the hash (this is a nested for loop, as I need the values of 2 different identifiers to calculate the correlation coefficient)
calculate ssxy by calling the ssxy function, i.e. the Sum of Squared Deviations of x and y from their means
calculate correlation coefficient.
end
Note that I calculate the correlation coefficient for each pair only once, and I do not calculate the correlation of an identifier with itself; I take care of both in my nested for loop. Do you think there is a way to calculate the correlation coefficients faster? Any hints/advice would be great. Thanks!
EDIT1:
My Input File looks like this -- for the first 10 identifiers:
"Ident_01" 6453.07 8895.79 8145.31 6388.25 6779.12
"Ident_02" 449.803 367.757 302.633 318.037 331.55
"Ident_03" 16.4878 198.937 220.376 91.352 237.983
"Ident_04" 26.4878 398.937 130.376 92.352 177.983
"Ident_05" 36.4878 298.937 430.376 93.352 167.983
"Ident_06" 46.4878 498.937 560.376 94.352 157.983
"Ident_07" 56.4878 598.937 700.376 95.352 147.983
"Ident_08" 66.4878 698.937 990.376 96.352 137.983
"Ident_09" 76.4878 798.937 120.376 97.352 117.983
"Ident_10" 86.4878 898.937 450.376 98.352 127.983
EDIT2: Here are the snippets/subroutines that I wrote in Perl:
## Pearson Correlation Coefficient
sub correlation {
my( $arr1, $arr2 ) = @_;
my $ssxy = ssxy( $arr1->{string}, $arr2->{string}, $arr1->{mean}, $arr2->{mean} );
my $cor = $ssxy / sqrt( $arr1->{ssxx} * $arr2->{ssxx} );
return $cor ;
}
## Mean
sub mean {
my $arr1 = shift;
my $mu_x = sum( @$arr1 ) / scalar( @$arr1 );
return($mu_x);
}
## Sum of Squared Deviations of x to the mean i.e. ssxx
sub ssxx {
my ( $arr1, $mean_x ) = @_;
my $ssxx = 0;
## looping over all the samples
for( my $i = 0; $i < scalar( @$arr1 ); $i++ ){
$ssxx = $ssxx + ( $arr1->[$i] - $mean_x )**2;
}
return($ssxx);
}
## Sum of Squared Deviations of xy to the mean i.e. ssxy
sub ssxy {
my( $arr1, $arr2, $mean_x, $mean_y ) = @_;
my $ssxy = 0;
## looping over all the samples
for( my $i = 0; $i < scalar( @$arr1 ); $i++ ){
$ssxy = $ssxy + ( $arr1->[$i] - $mean_x ) * ( $arr2->[$i] - $mean_y );
}
return ($ssxy);
}
Have you searched CPAN? The function gsl_stats_correlation computes Pearson's correlation coefficient. It lives in Math::GSL::Statistics, a module that binds to the GNU Scientific Library.
gsl_stats_correlation($data1, $stride1, $data2, $stride2, $n) - This function efficiently computes the Pearson correlation coefficient between the array references $data1 and $data2, which must both be of the same length $n:
r = \frac{\operatorname{cov}(x, y)}{\hat\sigma_x \hat\sigma_y} = \frac{\frac{1}{n-1} \sum (x_i - \hat{x})(y_i - \hat{y})}{\sqrt{\frac{1}{n-1} \sum (x_i - \hat{x})^2}\,\sqrt{\frac{1}{n-1} \sum (y_i - \hat{y})^2}}
While minor improvements might be possible, I would suggest investing in learning PDL. The documentation on matrix operations may be useful.
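Whichever tool you pick, the matrix formulation is the part worth internalising. Below is a minimal NumPy sketch of that idea (not PDL; the file name and column range are assumptions based on the sample input in EDIT1): once each row is centred and divided by the square root of its ssxx, the entire correlation matrix is a single matrix product. Keep in mind that for 55000 rows the result has 55000 x 55000 entries (tens of GB in double precision), so you may need to compute and write it out in blocks of rows.
import numpy as np

# Hypothetical loading step: skip the quoted identifier column, read the ~100 numeric columns.
data = np.loadtxt("file1.txt", usecols=range(1, 101))         # shape: (n_rows, n_cols)

# Centre each row and divide by sqrt(ssxx).
centered = data - data.mean(axis=1, keepdims=True)
standardized = centered / np.sqrt((centered ** 2).sum(axis=1, keepdims=True))

# The full Pearson correlation matrix is then one matrix product.
corr = standardized @ standardized.T                          # shape: (n_rows, n_rows)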
@Sinan and @Praveen have the right idea for how to do this within Perl. However, I would suggest that the overhead inherent in Perl means you will never get the efficiency you are looking for; work on optimizing your C code instead.
The first step would be to set the -O3 flag for maximum code optimization.
From there, I would change your ssxx code so that it subtracts the mean from each data point in place: x[i] -= mean. This means that you no longer need to subtract the mean in your ssxy code, so you do the subtraction once instead of 55001 times.
I would check the disassembly to make sure that the squaring of (x - mean) is compiled to a single multiplication rather than a generic pow()-style call (effectively exp(2 * log(x - mean))), or just write it as an explicit multiplication.
What sort of data structure are you using for your data? A double** with memory allocated for each row will lead to extra calls to (the slow function) malloc. It is also more likely to lead to memory thrashing, with the allocated memory located in different places. Ideally, you should make as few calls to malloc as possible, for blocks of memory that are as large as possible, and use pointer arithmetic to traverse the data.
More optimizations should be possible. If you post your code, I can make some suggestions.
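To make the centring point concrete, here is a tiny NumPy sketch (Python rather than C, purely as an illustration, using the first two sample rows from EDIT1): once each row has had its mean subtracted up front, ssxx, ssyy and ssxy are plain dot products and the pair loop does no further subtraction.
import numpy as np

x = np.array([6453.07, 8895.79, 8145.31, 6388.25, 6779.12])   # "Ident_01"
y = np.array([449.803, 367.757, 302.633, 318.037, 331.55])    # "Ident_02"

# Subtract each mean once, in advance.
xc = x - x.mean()
yc = y - y.mean()

# ssxx, ssyy and ssxy reduce to dot products; the correlation follows directly.
r = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))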
Related
Suppose I have a CuArray with random zeros and ones and I want to get a random index of the CuArray corresponding to the value one. For instance,
m = 100;
A = CuArray(rand([0, 1], m));
i = rand(1:m);
while A[i]!=1
i = rand(1:m);
end
Is there a function I can use so that I don't need the while loop?
Your construction of A has the following equivalent representation:
using Distributions, Random   # Random provides shuffle
n_ones = rand(Binomial(m, 0.5))
one_inds = shuffle(1:m)[1:n_ones]
A = zeros(Int, m)
A[one_inds] .= 1
That is, you first choose the number of ones you are going to set (from a binomial distribution, since you have m independent choices), and then select that many indices without repetition (by taking the first n_ones elements of a shuffled index range).
Written this way, choosing a random index of a one is just
rand(one_inds)
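For readers who want the same trick outside Julia, here is a NumPy sketch of the underlying idea (CPU-side, just for illustration): collect the indices of the ones once, then a random index of a one is a single draw, with no rejection loop.
import numpy as np

m = 100
A = np.random.randint(0, 2, size=m)    # a 0/1 array, analogous to the CuArray contents
one_inds = np.flatnonzero(A)           # indices where A == 1, computed once
i = np.random.choice(one_inds)         # uniformly random index of a one (assumes at least one 1)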
I want to save the (x, y) coordinates in a grid network that are visited by different individuals. Let's say I have 1000 individuals and the network size is x = 1:100 and y = 1:100. I am using a Dict(), and here is some sample code showing what I want to do:
individuals = 1:1000
x = 1:100
y = 1:100
function Visited_nodes()
nodes_of_inds =Dict{Int64, Array{Tuple{Int64, Int64}}}()
for ind in individuals
dum_array = Array{Tuple{Int64, Int64}}(0)
for i in x
for j in y
if rand()<0.2 # some conditions
push!(dum_array, (i,j))
end
end
end
nodes_of_inds[ind]=unique(dum_array)
end
return nodes_of_inds
end
@time nodes_of_inds = Visited_nodes()
# result: 1.742297 seconds (12.31 M allocations: 607.035 MB, 6.72% gc time)
But this is not efficient. I would appreciate any advice on how to make it more efficient.
Please see the performance tips. Very first piece of advice there: avoid global variables. individuals, x, and y are all non-constant global variables. Make them arguments to your function instead. That change alone speeds up your function by an order of magnitude.
By construction, you're not going to have any duplicate tuples in your dum_array, so you don't need to call unique. That shaves off another factor of two.
Finally, Array{T} isn't a concrete type. Julia's arrays also encode the dimensionality as a type parameter, which must be included for the dictionary of arrays to be efficient. Use Array{T, 1} or Vector{T} instead. This isn't a major consideration within the time of this function, though.
The major thing that's left is just the O(length(individuals)*length(x)*length(y)) computational complexity. Doing anything ten million times will add up quickly, no matter how efficient it is.
@Matt B., thanks for your response. Regarding the global variables, I tried a simplified version of my code and it did not help the performance.
Let's say I read my input data from a couple of CSV files and I have three functions with different arguments:
function Read_input_data()
# read input data
individuals = readcsv("file1")
x = readcsv("file2")
y = readcsv("file3")
A = readcsv("file4")
B = readcsv("file5") # and a few other files
# call different functions
result_1 = Function1(individuals , x, y)
result_2 = Function2(result_1 ,y, A, B)
result_3 = Function3(result_2 , individuals, A, B)
return result_1, result_2, result_3
end
result_1, result_2, result_3 = Read_input_data()
I do not know why the performance is not better compared to when I define everything as a global! I would appreciate any comments on this!
Several posts exist about efficiently calculating pairwise distances in MATLAB. These posts tend to concern quickly calculating euclidean distance between large numbers of points.
I need to create a function which quickly calculates the pairwise differences between smaller numbers of points (typically fewer than 1000 pairs). Within the grander scheme of the program I am writing, this function will be executed many thousands of times, so even small gains in efficiency are important. The function needs to be flexible in two ways:
On any given call, the distance metric can be euclidean OR city-block.
The dimensions of the data are weighted.
As far as I can tell, no solution to this particular problem has been posted. The statistics toolbox offers pdist and pdist2, which accept many different distance functions, but no weighting. I have seen extensions of these functions that allow for weighting, but those extensions do not allow users to select different distance functions.
Ideally, I would like to avoid using functions from the statistics toolbox (I am not certain that the user of the function will have access to that toolbox).
I have written two functions to accomplish this task. The first uses tricky calls to repmat and permute, and the second simply uses for-loops.
function [D] = pairdist1(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% format weights for multiplication
wts = repmat(wts,[numA,1,numB]);
% get featural differences between A and B pairs
A = repmat(A,[1 1 numB]);
B = repmat(permute(B,[3,2,1]),[numA,1,1]);
differences = abs(A-B).^r;
% weigh difference values before combining them
differences = differences.*wts;
differences = differences.^(1/r);
% combine features to get distance
D = permute(sum(differences,2),[1,3,2]);
end
AND:
function [D] = pairdist2(A, B, wts, distancemetric)
% get some information about the data
numA = size(A,1);
numB = size(B,1);
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else error('Function only accepts "cityblock" and "euclidean" distance')
end
% use for-loops to generate differences
D = zeros(numA,numB);
for i=1:numA
for j=1:numB
differences = abs(A(i,:) - B(j,:)).^r;
differences = differences.*wts;
differences = differences.^(1/r);
D(i,j) = sum(differences,2);
end
end
end
Here are the performance tests:
A = rand(10,3);
B = rand(80,3);
wts = [0.1 0.5 0.4];
distancemetric = 'cityblock';
tic
D1 = pairdist1(A,B,wts,distancemetric);
toc
tic
D2 = pairdist2(A,B,wts,distancemetric);
toc
Elapsed time is 0.000238 seconds.
Elapsed time is 0.005350 seconds.
It's clear that the repmat-and-permute version runs much more quickly than the double-for-loop version, at least for smaller datasets. However, I also know that calls to repmat can often slow things down. So I am wondering if anyone in the SO community has advice to offer on improving the efficiency of either function!
EDIT
@Luis Mendo offered a nice cleanup of the repmat-and-permute function using bsxfun. I compared his function with my original on datasets of varying size:
As the data become larger, the bsxfun version becomes the clear winner!
EDIT #2
I have finished writing the function and it is available on GitHub [link]. I ended up finding a pretty good vectorized method for computing Euclidean distance [link], so I use that method in the Euclidean case, and I took @Divakar's advice for city-block. It is still not as fast as pdist2, but it is much faster than either of the approaches I laid out earlier in this post, and it easily accepts weightings.
You can replace repmat by bsxfun. Doing so avoids explicit repetition, so it's more memory-efficient and probably faster:
function D = pairdist1(A, B, wts, distancemetric)
if strcmp(distancemetric,'cityblock')
r=1;
elseif strcmp(distancemetric,'euclidean')
r=2;
else
error('Function only accepts "cityblock" and "euclidean" distance')
end
differences = abs(bsxfun(@minus, A, permute(B, [3 2 1]))).^r;
differences = bsxfun(@times, differences, wts).^(1/r);
D = permute(sum(differences,2),[1,3,2]);
end
For r = 1 (the "cityblock" case), you can use bsxfun to get the elementwise subtractions and then use matrix multiplication, which should speed things up. The implementation would look something like this -
%// Calculate absolute elementwise subtractions
absm = abs(bsxfun(@minus,permute(A,[1 3 2]),permute(B,[3 1 2])));
%// Perform matrix multiplications with the given weights and reshape
D = reshape(reshape(absm,[],size(A,2))*wts(:),size(A,1),[]);
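For completeness, the same broadcasting-plus-matrix-multiplication idea in a NumPy sketch (Python purely for illustration; sizes mirror the test data in the question):
import numpy as np

A = np.random.rand(10, 3)
B = np.random.rand(80, 3)
wts = np.array([0.1, 0.5, 0.4])

# Absolute elementwise differences for every (A-row, B-row) pair via broadcasting: 10 x 80 x 3
absm = np.abs(A[:, None, :] - B[None, :, :])

# Weighted sum over the feature axis as a single matrix-vector product: 10 x 80
D = (absm.reshape(-1, A.shape[1]) @ wts).reshape(A.shape[0], B.shape[0])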
For my project, I wish to quickly generate random permutations of a binary array of fixed length and a given number of 1s and 0s. Given these random permutations, I wish to add them elementwise.
I am currently using numpy's ndarray object, which is convenient for adding elementwise. My current code is as follows:
# n is the length of the array. I want to run this across a range of
# n=100 to n=1000.
row = np.zeros(n)
# m_list is a given list of integers. I am iterating over many possible
# combinations of possible values for m in m_list. For example, m_list
# could equal [5, 100, 201], for n = 500.
for m in m_list:
    row += np.random.permutation(np.concatenate([np.ones(m), np.zeros(n - m)]))
My question is: is there any faster way to do this? According to timeit, 1,000,000 calls of "np.random.permutation(np.concatenate([np.ones(m), np.zeros(n - m)]))" take 49.6 seconds. For my program's purposes, I'd like to decrease this by an order of magnitude. Can anyone suggest a faster approach?
Thank you!
For me, the version with the array allocation moved outside the loop was faster, but not by much - around 8%, according to cProfile:
row = np.zeros(n, dtype=np.float64)
wrk = np.zeros(n, dtype=np.float64)
for m in m_list:
    wrk[0:m] = 1.0
    wrk[m:n] = 0.0
    row += np.random.permutation(wrk)
You might also try np.random.shuffle(wrk) to shuffle in place instead of having permutation return another array, but for me the difference was negligible.
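For reference, a minimal sketch of that in-place variant (same n and m_list as in the question):
import numpy as np

row = np.zeros(n)
wrk = np.zeros(n)
for m in m_list:
    wrk[:m] = 1.0            # first m entries set to one...
    wrk[m:] = 0.0            # ...the rest set to zero
    np.random.shuffle(wrk)   # shuffle in place; no new array is allocated
    row += wrk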
Suppose that f(x,y) is a bivariate function as follows:
function [ f ] = f(x,y)
UN=@(g)1.6*(1-acos(g)/pi)-0.8;
f= 1+UN(cos(0.5*pi*x+y));
end
How can I improve the execution time of the function F(N) with the following code?
function [VAL] = F(N)
x=0:4/N:4;
y=0:2*pi/1000:2*pi;
VAL=zeros(N+1,3);
for i = 1:N+1
val = zeros(1,N+1);
for j = 1:N+1
val(j) = trapz(y,f(0,y).*f(x(i),y).*f(x(j),y))/2/pi;
end
val = fftshift(fft(val))/N;
l = (length(val)+1)/2;
VAL(i,:)= val(l-1:l+1);
end
VAL = fftshift(fft(VAL,[],1),1)/N;
L = (size(VAL,1)+1)/2;
VAL = VAL(L-1:L+1,:);
end
Note that N=2^p where p>10, so please consider the memory limitations while optimizing the code using ndgrid, arrayfun, etc.
FYI: The code intends to find the central 3-by-3 submatrix of the fftn of
fun=@(a,b) trapz(y,f(0,y).*f(a,y).*f(b,y))/2/pi;
where a, b are in [0, 4]. The key idea is that the code above saves memory, especially when N is very large. But the execution time is still an issue because of the nested loops. See the figure below for N=2^2:
This is not a full answer, but some possibly helpful hints:
0) The trivial: Are you sure you need numerics? Can't you do the computation analytically?
1) Do not use function handles:
function [ f ] = f(x,y)
f= 1+1.6*(1-acos(cos(0.5*pi*x+y))/pi)-0.8
end
2) Simplify analytically: acos(cos(x)) is the same as abs(mod(x + pi, 2 * pi) - pi), which should compute slightly faster. Or, instead of sampling and then numerically integrating, first integrate analytically and sample the result.
3) The FFT is a very efficient algorithm to compute the full DFT, but you don't need the full DFT. Since you only want the central 3 x 3 coefficients, it might be more efficient to directly apply the DFT definition and evaluate the formula only for those coefficients that you want. That should be both fast and memory-efficient (see the sketch after this list).
4) If you repeatedly do this computation, it might be helpful to precompute DFT coefficients. Here, dftmtx from the Signal Processing toolbox can assist.
5) To get rid of the loops, think about the problem not in the form of computation instructions, but a single matrix operation. If you consider your input N x N matrix as a vector with N² elements, and your output 3 x 3 matrix as a 9-element vector, then the whole operation you apply (numerical integration via trapz and DFT via fft) appears to be a simple linear transform, which it should be possible to express as an N² x 9 matrix.
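To make hint 3 concrete, here is a small NumPy sketch (Python purely for illustration; a MATLAB version would be analogous) that evaluates only a handful of DFT coefficients straight from the definition instead of running a full FFT:
import numpy as np

def selected_dft_coeffs(v, ks=(-1, 0, 1)):
    # X_k = sum_n v[n] * exp(-2j*pi*k*n/N), evaluated only for the requested k.
    v = np.asarray(v, dtype=complex)
    n = np.arange(len(v))
    return np.array([np.sum(v * np.exp(-2j * np.pi * k * n / len(v))) for k in ks])

# After fftshift, the central three coefficients correspond to k = -1, 0, 1.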