We know that using functions like colMeans() and colSums() is a lot faster than their apply() equivalents, as these are precompiled (in C, C++, ...?).
Are these functions also automatically parallelized in these languages?
Maybe the overhead cost of parallelizing does not really pay off, as it is only a very simple function, right? But still, for very large matrices it could make sense?
Nothing is automatically parallelized in R. However, these exact functions exist in the Rfast package with a parallel argument.
Even without parallelization, the Rfast implementation is faster.
x = matrix(rnorm(1000*100),ncol=100)
microbenchmark::microbenchmark(.colSums(x,1000,100), colSums(x),Rfast::colsums(x,parallel=0),times=1000)
Unit: microseconds
expr min lq mean median uq max neval
.colSums(x, 1000, 100) 124.870 125.725 127.28580 125.726 126.153 301.911 1000
colSums(x) 132.567 133.423 136.30507 134.705 135.134 282.668 1000
Rfast::colsums(x, parallel = 0) 79.541 80.824 84.00742 81.252 82.107 307.470 1000
Bigger matrices make parallelization relevant (benchmarks from a 2-core laptop):
x = matrix(rnorm(1000*1000),ncol=1000)
microbenchmark::microbenchmark(.colSums(x,1000,1000), colSums(x),Rfast::colsums(x,parallel=0),
Rfast::colsums(x,parallel=1),times=1000)
Unit: microseconds
expr min lq mean median uq max neval
.colSums(x, 1000, 1000) 1313.268 1336.3600 1366.5152 1344.486 1355.1760 2468.310 1000
colSums(x) 1350.473 1366.2950 1401.0574 1374.420 1386.8210 2826.241 1000
Rfast::colsums(x, parallel = 0) 755.205 769.7440 801.5617 779.152 796.6850 1796.068 1000
Rfast::colsums(x, parallel = 1) 604.249 637.8185 680.3327 651.289 674.1675 1808.042 1000
x = matrix(rnorm(100000*100),ncol=100)
microbenchmark::microbenchmark(.colSums(x,100000,100), colSums(x),Rfast::colsums(x,parallel=0),
Rfast::colsums(x,parallel=1),times=1000)
Unit: milliseconds
expr min lq mean median uq max neval
.colSums(x, 1e+05, 100) 12.910733 12.971671 13.124897 13.011441 13.087346 26.914502 1000
colSums(x) 12.944944 13.012511 13.142074 13.058268 13.128400 14.879138 1000
Rfast::colsums(x, parallel = 0) 6.988414 7.009796 7.165310 7.068810 7.154765 16.826589 1000
Rfast::colsums(x, parallel = 1) 4.061679 4.251977 4.373995 4.296451 4.385827 7.610196 1000
Background
I am simulating the motion of N charged particles in molecular dynamics with Fortran 90 and OpenMP. The analytical expression of the force applied to each ion i is known and is a function of the position of ion i and of the other ions (r_x, r_y, r_z). I compute the Coulomb interaction between each pair of ions using a parallelised doubly nested do loop. I determine the acceleration (a2_x, a2_y, a2_z) of each ion at the end of the loop (then update velocity and position with velocity-Verlet).
Method
I use the following code in my program to compute the Coulomb forces applied to each ion. I compute the acceleration (a2_x) at the next time step, starting from the position (r_x) at the current time step. It is a 3D problem; I include all the lines, but most of them are just the same thing for x, y and z, so on a first read you can just consider the _x variables to see how this works.
I parallelize my loop over C threads; ia and ib are arrays used to split the N ions into C parts. For instance, for C=4 threads and N=16 ions (see edit remarks below):
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
Then the Coulomb forces are computed as follows:
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1 ! index of the current thread (1-based)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im,1), ib(im,1) ! loop over the ions assigned to this thread
do j = 1, N ! loop over all ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv * r2inv * r2inv * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
a2_x(i) = a2_x(i) - rji(1)*r2inv
a2_y(i) = a2_y(i) - rji(2)*r2inv
a2_z(i) = a2_z(i) - rji(3)*r2inv
enddo
enddo
!$omp end parallel
Problem
I am trying to optimize this time-consuming part of my program. The number of operations is quite high and scales quickly with N. Can you give me your opinion on this code? I have some specific questions.
I have been told I should have the positions r_x, r_y and r_z as private variables, which seems counter-intuitive to me because I want to enter that loop using the previously defined positions of the ions, so I use firstprivate. Is that right?
I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't rji and r2inv be shared? Because to compute the distance between ions i and j, I go "beyond" threads, if you see what I mean: I need information about ions handled by two different threads.
Is the way I split the ions in the first do loop optimal?
For each ion i, I loop over all ions j, which would induce a division by zero when the distance between ion i and itself is computed. To prevent this, I have a softening variable set to a very small value so the distance is never exactly zero. I do this to avoid an if (i == j) test that would be time consuming.
Is the square root also time consuming?
For any additional detail feel free to ask.
Edit (Remarks)
My computer has a 10-core Xeon W-2155 CPU and 32 GB of RAM. I intend to simulate around 1000 ions, while thinking about 4000, which requires a lot of time.
I have this Coulomb subroutine among other subroutines that may consume CPU time. For instance, one routine that may be time consuming is devoted to generating random numbers for each ion, depending on whether it is already excited or not, and applying the correct effect depending on whether it absorbs a photon or not. So that is a lot of RNG calls and if statements for each ion.
Edit (Test of the propositions)
Using !$omp do in combination with schedule(dynamic,1), schedule(guided) or schedule(nonmonotonic:dynamic), and/or collapse(2), did not improve the run time; it made it at least three times longer. It was suggested that the number of elements in my simulation (N) is too low to see a significant improvement. If I ever try to simulate a much higher number of elements (4096, 8192, ...) I will try those options again.
Using !$omp do rather than a home-made distribution of the ions among the cores turned out to be equivalent in terms of run time. Since it is easier to implement, I will keep it.
Replacing the inverse dsqrt by an exponentiation to the power -1/2 turned out to be equivalent in terms of run time.
Delaying the square root and combining it with the third power of r2inv was also equivalent, so I replaced the whole series of operations by a single exponentiation to the power -1.5.
Same idea with rji(1)*r2inv: I compute rji*r2inv once beforehand and only use the result in the following lines.
Edit 2 (Test with !$omp reduction(+:...))
I have tested the program with the following configurations:
original, which is the program I present in my question;
!$omp do schedule(dynamic,1);
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(dynamic,1);
!$omp reduction(+:a2_x,a2_y,a2_z) with schedule(guided) and do i = 2, N; do j = 1, i-1 for the loops (half the work);
for 1024 and 16384 ions. It turns out my original version is still faster for me, but the reduction versions are not as "catastrophic" as the previous tests without reduction.
Version                               N = 1024    N = 16384
original                              84 s        15194 s
schedule(dynamic,1)                   1300 s      not tested
reduction and schedule(dynamic,1)     123 s       24860 s
reduction and schedule(guided) (2)    121 s       24786 s
What is weird is that @PierU still gets a faster computation with reduction, while for me it is not optimal. Where could such a difference come from?
Hypotheses
Does the fact that I have 10 cores make the workload much lighter on each core for a given number of ions?
I use double precision; maybe single precision would be faster?
Do you have the AVX-512 instruction set? It has specific hardware to compute the inverse square root much faster (see this article).
The bottleneck is elsewhere in my program. I am aware I should only test the Coulomb part, but I wanted to test it in the context of my program, to see if it really shortens the computation time. I have a section with a lot of where statements and RNG calls; perhaps I should work on that.
Generally speaking, the variables that you just need to read in the parallel region can be shared. However, having firstprivate copies for each thread can give better performance in some cases (the copies can be in the local cache of each core), particularly for variables that are repeatedly read.
Definitely not! If you do that, there will be a race condition on these variables.
Looks OK, but it is generally simpler (and at worst just as efficient) to use an !$OMP DO directive instead of manually distributing the work to the different threads:
!$OMP DO
do i = 1, N ! loop over all ions
do j = 1, N ! loop over all ions
Why not, provided that you are able to choose a softening value that doesn't alter your simulation (this is something you have to test against the if solution).
It is somewhat, but at some point you cannot avoid an exponentiation. I would delay the sqrt and the division like this:
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
Dividing the work by 2
The forces are symmetrical and can be computed only once for a given (i,j) pair. This also naturally avoids the i==j case and the softening value. A reduction clause is needed on the a2* arrays, as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, and a dynamic clause is needed. This is actually a case where manually distributing the iterations to the threads can be more efficient ;) ...
!$omp parallel default(none) &
!$omp private(im, i,j,rji,r2inv) &
!$omp firstprivate(r_x,r_y,r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1 ! loop over all ions
do j = i+1, N ! loop over some ions
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
r2inv = r2inv**(-1.5) * alpha(1) ! alpha is 1/4.pi.eps0
! computation of the accelerations
rji(:) = rji(:)*r2inv
a2_x(i) = a2_x(i) - rji(1)
a2_y(i) = a2_y(i) - rji(2)
a2_z(i) = a2_z(i) - rji(3)
a2_x(j) = a2_x(j) + rji(1)
a2_y(j) = a2_y(j) + rji(2)
a2_z(j) = a2_z(j) + rji(3)
enddo
enddo
!$omp end do
!$omp end parallel
Alternatively, a guided clause could be used, with some changes in the iterations to have the low workloads in the first ones:
!$omp do schedule(guided)
do i = 2, N ! loop over all ions
do j = 1, i-1 ! loop over some ions
TIMING
I have timed the latter code (work divided by 2) on an old Core i5 from 2011 (4 cores), compiled with gfortran 12.
Configurations: no OpenMP / OpenMP with 1 thread / 4 threads with no explicit schedule (that is, static by default) / schedule(dynamic) / schedule(nonmonotonic:dynamic) / schedule(guided). guided was timed with 2 code versions: (1) with do i=1,N-1; do j=i+1,N, (2) with do i=2,N; do j=1,i-1.
              N=256    N=1024   N=4096   N=16384   N=65536
no omp        0.0016   0.026    0.41     6.9       116
1 thread      0.0019   0.027    0.48     8.0       118
4 threads     0.0014   0.013    0.20     3.4       55
dynamic       0.0018   0.020    0.29     5.3       84
nonmonotonic  -        -        0.29     5.2       85
guided (1)    0.0014   0.013    0.21     3.7       61
guided (2)    0.0009   0.093    0.13     2.2       38
The guided schedule with the low-workload iterations first wins, and I get some speed-up even for low values of N. It is important to note, however, that the behaviour can differ on a different CPU and with a different compiler.
I have also timed the code with do i=1,N; do j=1,N (as the work is balanced between iterations, there is no need for sophisticated schedule clauses):
              N=256    N=1024   N=4096   N=16384   N=65536
no omp        0.0028   0.047    0.72     11.5      183
4 threads     0.0013   0.019    0.25     4.0       71
I did not see how you limited the number of threads; this could be an additional !$omp directive.
The following could be effective with schedule(static).
Should "if ( i==j ) cycle" be included?
Is N = 16 for your code example?
dimension rjm(3)
!$omp parallel do default(none) &
!$omp private (i,j, rji, r2inv, rjm) &
!$omp shared (a2_x, a2_y, a2_z, r_x,r_y,r_z, N, softening, alpha ) &
!$omp schedule (static)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = 1,16 ! loop over all ions; ignore ia(im,1), ib(im,1); is N = 16 ?
rjm = 0
do j = 1, N ! loop over all ions
if ( i==j ) cycle
rji(1) = r_x(j) - r_x(i) ! distance between the ion i and j over x
rji(2) = r_y(j) - r_y(i) ! over y
rji(3) = r_z(j) - r_z(i) ! over z
! then compute the inverse square root of distance between the current ion i and the neighbor j
r2inv = sqrt ( rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening )
r2inv = alpha(1) / ( r2inv * r2inv * r2inv ) ! alpha is 1/4.pi.eps0
! computation of the accelerations
! rjm = rjm + rji*r2inv
rjm(1) = rjm(1) + rji(1)*r2inv
rjm(2) = rjm(2) + rji(2)*r2inv
rjm(3) = rjm(3) + rji(3)*r2inv
end do
a2_x(i) = a2_x(i) - rjm(1)
a2_y(i) = a2_y(i) - rjm(2)
a2_z(i) = a2_z(i) - rjm(3)
end do
!$omp end parallel do
I have no experience of using firstprivate to shift shared variables onto the stack for improved performance. Is it worthwhile?
I am reviewing some code which I believe runs based on probability. I would like to verify whether that is true. Does the following code snippet run 80% of the time? I don't quite get why 1000 is used if our job is merely to run the code 80% of the time.
if(rand()%1000<1000*0.8){
...
}
It will run approximately 80% of the time.
rand() returns a number between 0 and RAND_MAX, which is probably 2,147,483,647.
rand() % 1000 reduces that range to 0-999, although some numbers in the first half or so of the range will be slightly more common because RAND_MAX is not evenly divisible by 1,000
1000 * 0.8 is just 800
The use of 1,000 here is arbitrary. A clearer way of representing 80% would be:
if (rand() % 100 < 80)
or just:
if (rand() < RAND_MAX * 0.8)
I'm trying to optimize the value N used to split an array into chunks for vectorized processing, so that it runs the quickest on different machines. I have some test code below.
#example use random values
clear all,
t=rand(1,556790);
inner_freq=rand(8193,6);
N=100; # use N chunks
nn = int32(linspace(1, length(t)+1, N+1))
aa_sig_combined=zeros(size(t));
total_time_so_far=0;
for ii=1:N
tic;
ind = nn(ii):nn(ii+1)-1;
aa_sig_combined(ind) = sum(diag(inner_freq(1:end-1,2)) * cos(2 .* pi .* inner_freq(1:end-1,1) * t(ind)) .+ repmat(inner_freq(1:end-1,3),[1 length(ind)]));
toc
total_time_so_far=total_time_so_far+sum(toc)
end
fprintf('- Complete test in %4.4fsec or %4.4fmins\n',total_time_so_far,total_time_so_far/60);
This takes 162.7963 sec, or 2.7133 min, to complete when N=100 on a 16 GB i7 machine running Ubuntu.
Is there a way to find out what value N should be to get this to run the fastest on different machines?
PS: I'm running Octave 3.8.1 on a 16 GB i7 with Ubuntu 14.04, but it will also be running on as little as a 1 GB Raspberry Pi 2.
This is the Matlab test script that I used to time each parameter. The return is used to break it after the first iteration as it looks like the rest of the iterations are similar.
%example use random values
clear all;
t=rand(1,556790);
inner_freq=rand(8193,6);
N=100; % use N chunks
nn = int32( linspace(1, length(t)+1, N+1) );
aa_sig_combined=zeros(size(t));
D = diag(inner_freq(1:end-1,2));
A = inner_freq(1:end-1,1); % factor from the original cos() argument, defined outside the loop (was missing)
for ii=1:N
ind = nn(ii):nn(ii+1)-1;
tic;
cosPara = 2 * pi * A * t(ind);
toc;
cosResult = cos( cosPara );
sumParaA = D * cosResult;
toc;
sumParaB = repmat(inner_freq(1:end-1,3),[1 length(ind)]);
toc;
aa_sig_combined(ind) = sum( sumParaA + sumParaB );
toc;
return;
end
The output is as follows. Note that I have a slow computer.
Elapsed time is 0.156621 seconds.
Elapsed time is 17.384735 seconds.
Elapsed time is 17.922553 seconds.
Elapsed time is 18.452994 seconds.
As you can see, the cos operation is what's taking so long. You are running cos on an 8192x5568 matrix (45,613,056 elements), so it makes sense that it takes so long.
If you wish to improve performance, use parfor as it appears each iteration is independent. Assuming you had 100 cores to run your 100 iterations, your script would be done in 17 seconds + parfor overhead.
Within the cos calculation, you might want to look into whether another method exists to calculate the cosine faster and in a more parallel fashion than the stock method.
Another minor optimization is the following line. It ensures that the diag function isn't called within the loop, as the diagonal matrix is constant. You don't want an 8192x8192 diagonal matrix to be generated every time! I just stored it outside the loop, and it gives a bit of a performance boost as well.
D = diag(inner_freq(1:end-1,2));
Note that I didn't use the MATLAB profiler as it didn't work for me, but you should use it in the future for more functionalized code.
I have a dataset (Data) which is a vector of, let's say, 1000 real numbers. I would like to extract, 100 times, 10 contiguous numbers at random from Data. I don't know how to use datasample for that purpose.
Thanks in advance for your help.
You can just pick 100 random numbers between 1 and 991:
I = randi(991, 100, 1)
Then use them as the starting points to index 10 contiguous elements:
cell2mat(arrayfun(@(x) Data(x:x+9), I, 'uni', false))
Here you have a snippet, but instead of using datasample, I used randi to generate random indices.
n_times = 100;
l_data = length(Data);
index_random = randi(l_data-9,n_times,1); % '- 9' so as not to go past the end of the vector when reading the 10 items
for ind1 = 1:n_times
random_number(ind1,:) = Data(index_random(ind1):index_random(ind1)+9)
end
This is similar to Dan's answer, but avoids using cells and arrayfun, so it may be faster.
Let Ns denote the number of contiguous numbers you want (10 in your example), and Nt the number of times (100 in your example). Then:
result = Data(bsxfun(@plus, randi(numel(Data)-Ns+1, Nt, 1), 0:Ns-1)); %// Nt x Ns
Here is another solution, close to @Luis's, but with cumsum instead of bsxfun:
A = rand(1,1000); % The vector to sample
sz = size(A,2);
N = 100; % no. of samples
B = 10; % size of one sample
first = randi(sz-B+1,N,1); % the starting point for all blocks
rand_blocks = A(cumsum([first ones(N,B-1)],2)); % the result
This results in an N-by-B matrix (rand_blocks), each row of which is one sample. Of course, this could be one-lined, but that won't make it faster, and I want to keep it clear. For small N or B this method is slightly faster; if N or B becomes very large, then the bsxfun method is slightly faster. This ranking is not affected by the size of A.
An RPC server is given which receives millions of requests a day. Each request i takes processing time Ti to get processed. We want to find the 65th percentile processing time (when processing times are sorted according to their values in increasing order) at any moment. We cannot store the processing times of all past requests, as the number of requests is very large. So the answer need not be the exact 65th percentile; you can give an approximate answer, i.e. a processing time which will be around the exact 65th percentile.
Hint: It has something to do with how a histogram (i.e. an overview) is stored for very large data without storing all of the data.
Take one day's data. Use it to figure out what size to make your buckets (say one day's data shows that the vast majority (95%?) of your data is within 0.5 seconds of 1 second; ridiculous values, but hang in).
To get the 65th percentile, you'll want at least 20 buckets in that range, but be generous and make it 80. So you divide your 1-second window (1 second ± 0.5 seconds) into 80 buckets by making each 1/80th of a second wide.
Each bucket covers 1/80th of a second. Bucket 0 runs from (center - deviation) = (1 - 0.5) = 0.5 s to 0.5 + 1/80 s. Bucket 1 runs from 0.5 + 1/80 s to 0.5 + 2/80 s. Etc.
For every value, find out which bucket it falls in, and increment a counter for that bucket.
To find 65th percentile, get the total count, and walk the buckets from zero until you get to 65% of that total.
Whenever you want to reset, set the counters all to zero.
If you always want to have good data available, keep two of these, and alternate resetting them, using the one you reset least recently as having more useful data.
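A minimal sketch of this bucket approach in Python (the bucket range, bucket count and the choice to return the upper edge of the winning bucket are illustrative assumptions, not prescribed above):

class BucketPercentile:
    """Approximate streaming percentile using fixed-width buckets."""

    def __init__(self, lo=0.5, hi=1.5, n_buckets=80):
        # lo/hi and n_buckets would be chosen from one day's data, as described above
        self.lo, self.hi = lo, hi
        self.width = (hi - lo) / n_buckets
        # two extra buckets catch values below lo and above hi
        self.counts = [0] * (n_buckets + 2)
        self.total = 0

    def add(self, t):
        if t < self.lo:
            i = 0
        elif t >= self.hi:
            i = len(self.counts) - 1
        else:
            i = 1 + int((t - self.lo) / self.width)
        self.counts[i] += 1
        self.total += 1

    def percentile(self, p=0.65):
        # walk the buckets until the running count reaches p of the total
        target = p * self.total
        running = 0
        for i, c in enumerate(self.counts):
            running += c
            if self.total > 0 and running >= target:
                return self.lo + i * self.width  # upper edge of bucket i
        return self.hi

    def reset(self):
        self.counts = [0] * len(self.counts)
        self.total = 0

Keeping two such histograms and alternating which one gets reset, as suggested above, then just means calling reset() on one while reading percentile() from the other.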
Use an updown filter:
if q < x:
    q += .01 * (x - q)    # up a little
else:
    q += .005 * (x - q)   # down a little
Here a quantile estimator q tracks the x stream, moving a little towards each x. If both factors were .01, it would move up as often as down, tracking the 50th percentile. With .01 up and .005 down, it floats up to the 67th percentile; in general, it tracks the up / (up + down) th percentile. Bigger up/down factors track faster but are noisier; you'll have to experiment on your real data. (I have no idea how to analyze updowns; I would appreciate a link.)
The updown() below works on long vectors X, Q in order to plot them:
#!/usr/bin/env python
# updown filter demo (Python 3)
import sys
import numpy as np
import pylab as pl

def updown( X, Q, up=.01, down=.01 ):
    """ updown filter: running ~ up / (up + down) th percentile
        here vecs X in, Q out to plot
    """
    q = X[0]
    for j, x in np.ndenumerate(X):
        if q < x:
            q += up * (x - q)      # up a little
        else:
            q += down * (x - q)    # down a little
        Q[j] = q
    return q

#...............................................................................
if __name__ == "__main__":
    N = 1000
    up = .01
    down = .005
    plot = 0
    seed = 1
    exec( "\n".join( sys.argv[1:] ) )  # python this.py N= up= down=

    np.random.seed(seed)
    np.set_printoptions( 2, threshold=100, suppress=True )  # .2f
    title = "updown random.exponential: N %d up %.2g down %.2g" % (N, up, down)
    print(title)

    X = np.random.exponential( size=N )
    Q = np.zeros(N)
    updown( X, Q, up=up, down=down )
    # M = np.zeros(N)
    # updown( X, M, up=up, down=up )
    print("last 10 Q:", Q[-10:])

    if plot:
        fig = pl.figure( figsize=(8,3) )
        pl.title(title)
        x = np.arange(N)
        pl.plot( x, X, "," )
        pl.plot( x, Q )
        pl.ylim( 0, 2 )
        png = "updown.png"
        print("writing", png, file=sys.stderr)
        pl.savefig( png )
        pl.show()
An easier way to get the value that represents a given percentile of a list or array is the scoreatpercentile function in the scipy.stats module.
>>> import scipy.stats as ss
>>> ss.scoreatpercentile(v, 65)
There's a sibling function, percentileofscore, to return the percentile given the value.
You will need to store a running sum and a total count.
Then check out standard deviation calculations.
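A rough sketch of that idea in Python: keep a running count, sum and sum of squares, then approximate the 65th percentile from the mean and standard deviation. The normality assumption and the z-value of about 0.385 for the 65th percentile are my additions, so this is only reasonable if the processing times are roughly bell-shaped:

import math

class RunningMoments:
    """Running count, sum and sum of squares of processing times."""

    def __init__(self):
        self.n = 0
        self.total = 0.0       # running sum
        self.total_sq = 0.0    # running sum of squares

    def add(self, t):
        self.n += 1
        self.total += t
        self.total_sq += t * t

    def approx_percentile_65(self):
        if self.n == 0:
            return 0.0
        mean = self.total / self.n
        var = max(self.total_sq / self.n - mean * mean, 0.0)
        # z ~ 0.385 is roughly the 65th-percentile point of a standard normal
        return mean + 0.385 * math.sqrt(var)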