I'm trying to time the execution of a function before attempting to optimize it. (The code is Elixir, but I'm using Erlang's :timer.tc.)
My general approach is "run it lots of times, then calculate the average duration." But the average decreases dramatically the more times I run it (up to a point).
An example:
some_func = fn ->
# not my actual function; it's a pure function,
# but exhibits the same speedup
:rand.uniform()
end
run_n_times = fn (count, func) ->
Enum.each(1..count, fn (_i) ->
func.()
end)
end
n = 20
{microseconds, :ok} = :timer.tc(run_n_times, [n, some_func])
IO.puts "#{microseconds / n} microseconds per call (#{microseconds} total for #{n} calls)"
Outputs for increasing values of n are like this (lightly formatted):
174.8 microseconds per call (3496 total for 20 calls )
21.505 microseconds per call (4301 total for 200 calls )
4.5755 microseconds per call (9151 total for 2000 calls )
0.543415 microseconds per call (108683 total for 200000 calls )
0.578474 microseconds per call (578474 total for 1000000 calls )
0.5502955 microseconds per call (1100591 total for 2000000 calls )
0.556457 microseconds per call (2225828 total for 4000000 calls )
0.544754125 microseconds per call (4358033 total for 8000000 calls )
Why does a function run faster the more I call it, and what does this imply for benchmarking? E.g., is there a rule of thumb like "run something >= 200k times in order to benchmark"?
Since your function is very fast (it does basically nothing), I think what you're seeing here is the overhead of the setup and not any speedup in the function itself. Before the function runs at all, you have to construct a range, construct an anonymous function, and call Enum.each. For small numbers of repetitions, these fixed costs probably contribute more to the overall runtime of the benchmark than the repetitions themselves.
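To make the fixed-overhead argument concrete, here is a minimal arithmetic sketch in Python. The two constants are assumed round figures chosen only to resemble the magnitudes in the question, not measurements, and mean_us_per_call is a name made up for illustration. If the total time is a fixed setup cost plus a constant cost per call, the reported average falls roughly like 1/n until the per-call cost dominates:

```python
def mean_us_per_call(n, fixed_overhead_us, cost_per_call_us):
    """Average time per call when total = fixed overhead + n * per-call cost."""
    return (fixed_overhead_us + n * cost_per_call_us) / n


# Assumed values for illustration only (not measured).
FIXED_OVERHEAD_US = 3500.0
COST_PER_CALL_US = 0.55

for n in (20, 200, 2_000, 200_000, 8_000_000):
    avg = mean_us_per_call(n, FIXED_OVERHEAD_US, COST_PER_CALL_US)
    print(f"{n:>9} calls: {avg:.4f} microseconds per call")
```

Under this model the average only approaches the true per-call cost once n times the per-call cost is much larger than the fixed overhead, which is one way to pick a repetition count instead of using a fixed rule of thumb.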
I second what Paweł Obrok wrote in his answer. You could optimize your code by calling the function multiple times inside the loop:
run_n_times = fn (count, func) ->
Enum.each(1..count, fn (_i) ->
func.()
func.()
func.()
func.()
func.()
func.()
func.()
func.()
func.()
func.()
end)
end
That's 10 calls, but you could do 100 or 1000 of them. The more calls you make per loop iteration, the smaller the share of loop overhead in the measurement.
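The same idea as a small Python sketch, for readers who want to try it outside Elixir. seconds_per_call is a hypothetical helper, and random.random merely stands in for the function under test; the absolute numbers will differ from the BEAM's.

```python
import random
import time


def seconds_per_call(func, loops):
    """Call func 10 times per loop iteration; return average seconds per call."""
    start = time.perf_counter()
    for _ in range(loops):
        # Ten explicit calls, so the loop bookkeeping is paid once per ten calls.
        func()
        func()
        func()
        func()
        func()
        func()
        func()
        func()
        func()
        func()
    elapsed = time.perf_counter() - start
    return elapsed / (loops * 10)


per_call = seconds_per_call(random.random, 100_000)
print(f"{per_call * 1e6:.4f} microseconds per call")
```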
I don't know exactly what Erlang does, but if you do the same in JavaScript with a modern JavaScript engine, the first few calls will be interpreted (slow). Then the engine figures out that you are calling this function a lot and compiles it with a quick and dirty compiler. Another hundred calls, and the engine figures out what's happening and compiles it again, with a proper compiler this time. After another thousand calls, it gets compiled again with a highly optimising compiler. That would give exactly the kind of numbers that you found.
Related
I need to prepare "flattened" versions of 2D fftfrequencies in the shape Nx^2 * 2. Those are basically constructed like a ravel(meshgrid(fftfreqs1d,fftfreqs1d)) in matlab or python.
This appears to be no big deal in Python, but can hang for reasonable array sizes in Julia, especially when I want to build a StaticArray out of the intermediate results. To make it more confusing, @btime reports that my arrays are created in almost no time, while they clearly are not.
My question is why this happens and how to do it right.
I am aware that in Julia it might be a waste to keep the full 2D fftfreqs in memory instead of using the 1D versions and a loop, but let us assume for a moment that I need it this way.
Julia
function my_freqs1(Nnu::Int,T)
dx = 2. /Nnu
freq1d = fftfreq(Nnu).*dx
nu = hcat( vec([ i for i in freq1d, j in freq1d ]),
vec([ j for i in freq1d, j in freq1d ]))
return nu
end;
@btime my_freqs1(100,Float64)
28.528 μs (10 allocations: 312.80 KiB)
Julia, converting to a static array (in the hope for better performance of other code later on)
function my_freqs2(Nnu::Int,T)
### the same as above ###
return SMatrix{Nnu^2,2,T}(nu)
end;
@btime my_freqs2(100,Float64)
94.540 μs (36 allocations: 470.38 KiB)
Python
def my_fftfreqs(xy):
freqs = np.fft.fftfreq(np.shape(xy)[0],d=xy[1]-xy[0])
fx,fy = np.meshgrid(freqs,freqs,indexing="ij")
freq_list = np.transpose(np.asarray( [np.ravel(fx),np.ravel(fy)] ))
return freq_list
%time f=my_fftfreqs(np.linspace(0,1,100));
CPU times: user 1.08 ms, sys: 0 ns, total: 1.08 ms
Wall time: 600 µs
My observation is that while Python's %time reports a much longer time, the Python version actually runs in a very reasonable time, whereas the Julia version has a noticeable delay, and the version with the static array hangs for a long time and crashes completely for larger sizes.
Please help me understand how I would do this correctly in Julia, and why creating a static array seems to be such a bad idea here.
Rather than making a SMatrix{Nnu^2,2} I think you probably want to make a Vector{SVector{2}}. The former will require recompiling for each new value of Nnu which is fairly inefficient.
You may also consider:
using FFTW
my_freqs3(ν) = fftfreq(ν)*2/ν |>
(w -> [repeat(w, inner=length(w)) repeat(w, outer=length(w))])
# or
my_freqs3alt(ν) = ( w = fftfreq(ν)*2/ν ;
[repeat(w, inner=length(w)) repeat(w, outer=length(w))] )
which is more Julian and "if-I-understand-correctly" is equivalent.
Usually shorter/simpler functions are also more efficient.
Julia features used:
Unicode nu variable.
Piping |> operator.
Definition with no function keyword.
repeat standard library vector filling function.
Matlab-like hcat [v1 v2] notation.
Multi-statement block enclosed in ( ) separated by ;.
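For readers more at home in Python, the inner/outer repeat construction can be sketched without any libraries. This is an illustration under stated assumptions: the fftfreq helper below is a hand-written stand-in that follows the ordering of numpy.fft.fftfreq, and flattened_freq_pairs is a made-up name. The first column changes slowest (the inner repeat) and the second changes fastest (the outer repeat).

```python
def fftfreq(n, d=1.0):
    """Sample frequencies in the same order as numpy.fft.fftfreq."""
    non_negative = list(range(0, (n - 1) // 2 + 1))
    negative = list(range(-(n // 2), 0))
    return [k / (n * d) for k in non_negative + negative]


def flattened_freq_pairs(n):
    """Return n*n rows (a, b): a repeats 'inner', b repeats 'outer'."""
    w = [f * 2 / n for f in fftfreq(n)]
    return [(a, b) for a in w for b in w]


for row in flattened_freq_pairs(4):
    print(row)
```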
I have recently been investigating the complexity of accessing a Fortran array. Thanks to the comments, I include a complete example here.
program main
implicit none
integer, parameter :: mp = SELECTED_REAL_KIND(15,307)
integer, parameter :: Np=10, rep=100
integer*8, parameter :: Ng(7) = (/1E3,1E4,1E5,1E6,1E7,1E8,1E9/)
real(mp), allocatable :: x(:)
real(mp) :: time1, time2
integer*8 :: i,j,k, Ngj
real(mp) :: temp
integer :: g
! print to screen
print *, 'calling program main'
do j=1,SIZE(Ng) !test with different Ng
!initialization with each Ng. Don't count for complexity.
Ngj = Ng(j)
if(ALLOCATED(x)) DEALLOCATE(x)
ALLOCATE(x(Ngj))
x = 0.0_mp
!!===This is the part I want to check the complexity===!!
call CPU_TIME(time1)
do k=1,rep
do i=1,Np
call RANDOM_NUMBER(temp)
g = floor( Ngj*temp ) + 1
x( g ) = x( g ) + 1.0_mp
end do
end do
call CPU_TIME(time2)
print *, 'Ng: ',Ngj,(time2-time1)/rep, '(sec)'
end do
! print to screen
print *, 'program main...done.'
contains
end program
At first I thought its complexity was O(Np). But this is the time measurement for Np=10:
calling program main
Ng: 1000 7.9000000000000080E-007 (sec)
Ng: 10000 4.6000000000000036E-007 (sec)
Ng: 100000 3.0999999999999777E-007 (sec)
Ng: 1000000 4.8000000000001171E-007 (sec)
Ng: 10000000 7.3999999999997682E-007 (sec)
Ng: 100000000 2.1479999999999832E-005 (sec)
Ng: 1000000000 4.5719999999995761E-005 (sec)
program main...done.
This Ng dependence is weak and appears only for very large Ng, but it is not swamped when Np is increased; increasing Np just multiplies that time scaling by a constant factor.
Also, the scaling slope seems to increase when I use more complicated subroutines instead of the random number call.
Computing temp and g was verified to be independent of Ng.
There are two questions with this situation:
Based on comments, this kind of measurement does not only include intended arithmetic operations but also costs related to memory cache or compiler. Would there be a more correct way to measure the complexity?
Concerning the issues mentioned in the comments, like memory cache, page missing, or compiler, are they inevitable as the array size increases? or is there any way to avoid these costs?
How do I understand this complexity? What cost have I failed to account for? I guess the cost of accessing an element in an array does depend on the size of the array. A few Stack Overflow posts say that array access costs only O(1) for some languages. I think this should also hold for Fortran, but I do not know why that is not the case.
Aside from what you more or less explicitly ask the program to do (performing the loop, getting random numbers, etc.), a number of other events occur, such as the loading of the runtime environment and input/output processing. To get useful timings, you must either perfectly isolate the code to be timed or arrange for the actual computation to take a lot more time than the rest of the code.
Is there any way to avoid this cost?
This is in reply 1 :-)
Now, for a solution: I completed your example and let it run for hundreds of millions of iterations. See below:
program time_random
integer, parameter :: rk = selected_real_kind(15)
integer, parameter :: Ng = 100
real(kind=rk), dimension(Ng) :: x = 0
real(kind=rk) :: temp
integer :: g, Np
write(*,*) 'Enter number of loops'
read(*,*) Np
do i=1,Np
call RANDOM_NUMBER(temp)
g = floor( Ng*temp ) + 1
x(g) = x(g) + 1
end do
write(*,*) x
end program time_random
I compiled it with gfortran -O3 -Wall -o time_random time_random.f90 and timed it with the time command from bash. Beware that this is very crude (which explains why I made the number of iterations so large). It is also very simple to set up:
for ii in 100000000 200000000 300000000 400000000 500000000 600000000
do
time echo $ii | ./time_random 1>out
done
You can now collect the timings and observe a linear complexity. My computer reports 14 ns per iteration.
Remarks:
I used selected_real_kind to specify the real kind.
I write x after the loop to ensure that the loop is not optimized away.
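One way to turn the collected timings into a per-iteration cost is a straight-line fit: the intercept absorbs start-up and IO, and the slope is the cost per iteration. Below is a minimal Python sketch; fit_line is a hypothetical helper, and the timing values are placeholders to be replaced with the numbers reported by time.

```python
def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Placeholder data: iteration counts from the shell loop, with made-up timings.
iterations = [100_000_000, 200_000_000, 300_000_000,
              400_000_000, 500_000_000, 600_000_000]
seconds = [1.41, 2.81, 4.21, 5.61, 7.01, 8.41]

startup, per_iteration = fit_line(iterations, seconds)
print(f"fixed cost: {startup:.3f} s, per iteration: {per_iteration * 1e9:.2f} ns")
```

If the points do not fall close to a line, that is itself a hint that something other than the loop body (cache effects, paging) is contributing.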
I've spent the last month or so learning Julia and I'm very impressed. In particular, I'm analysing large amounts of climate model output; I put all this into SharedArrays and adjust and plot it all in parallel. So far it's very quick and efficient, and I've got quite a library of code. My current problem is in creating a function that can do basic operations on two shared arrays. I've successfully written a function that takes two arrays and how you want to process them. The code is based on the example in the parallel section of the Julia docs and uses the myrange function as shown there
function myrange(q::SharedArray)
idx = indexpids(q)
#@show (idx)
if idx == 0
# This worker is not assigned a piece
return 1:0, 1:0
print("NO WORKERS ASSIGNED")
end
nchunks = length(procs(q))
splits = [round(Int, s) for s in linspace(0,length(q),nchunks+1)]
splits[idx]+1:splits[idx+1]
end
function combine_arrays_chunk!(array_1,array_2,output_array,func, length_range);
#@show (length_range)
for i in length_range
output_array[i] = func(array_1[i], array_2[i]);
#hardwired example for func = +
#output_array[i] = +(array_1[i], array_2[i]);
end
output_array
end
combine_arrays_shared_chunk!(array_1,array_2,output_array,func) = combine_arrays_chunk!(array_1,array_2,output_array,func, myrange(array_1));
function combine_arrays_shared(array_1::SharedArray,array_2::SharedArray,func)
if size(array_1)!=size(array_2)
return print("inputs not of the same size")
end
output_array=SharedArray(Float64,size(array_1));
@sync begin
for p in procs(array_1)
@async remotecall_wait(p, combine_arrays_shared_chunk!, array_1,array_2,output_array,func)
end
end
output_array
end
This works, so one can do
strain_div = combine_arrays_shared(eps_1,eps_2,+);
strain_tot = combine_arrays_shared(eps_1,eps_2,hypot);
with the correct results and the output as a shared array, as required. But ... it's quite slow. It's actually quicker to combine the shared arrays as normal arrays on one processor, calculate, and then convert back to a shared array (for my test cases anyway, with each array approx 200MB; when I move up to GBs I guess not). I can hardwire the combine_arrays_shared function to only do addition (or some other function), and then you get the speed increase, but with the function being passed as an argument to combine_arrays_shared the whole thing is slow (10 times slower than the hardwired addition).
I've looked at the FastAnonymous.jl package but I can't see how it would work in this case. I tried, and failed. Any ideas?
I might just resort to writing a different combine_arrays_... function for each basic function I use, or having the func argument as a option and call different functions from within combine_arrays_shared, but I want it to be more elegant! Also this is good way to learn more about Julia.
Harry
This question actually has nothing to do with SharedArrays, and is just "how do I pass functions-as-arguments and get better performance?"
The way FastAnonymous works---and similar to the way closures will work in julia soon---is to create a type with a call method. If you're having trouble with FastAnonymous for some reason, you can always do it manually:
julia> immutable Foo end
julia> Base.call(f::Foo, x, y) = x*y
call (generic function with 1036 methods)
julia> function applyf(f, X)
s = zero(eltype(X))
for x in X
s += f(x, x)
end
s
end
applyf (generic function with 1 method)
julia> X = rand(10^6);
julia> f = Foo()
Foo()
# Run the function once with each type of argument to JIT-compile
julia> applyf(f, X)
333375.63216645207
julia> applyf(*, X)
333375.63216645207
# Compile anything used by @time
julia> @time 1
0.000004 seconds (148 allocations: 10.151 KB)
1
# Now let's benchmark
julia> @time applyf(f, X)
0.002860 seconds (5 allocations: 176 bytes)
333433.439233112
julia> @time applyf(*, X)
0.142411 seconds (4.00 M allocations: 61.035 MB, 19.24% gc time)
333433.439233112
Note the big increase in speed and greatly-reduced memory consumption.
Basically, in my project I am trying to write a list of strings to a file like this:
val mutable rodata_list : (string*string) list = []
.....
let zip1 ll =
List.map (fun (h,e) -> h^e) ll in
let oc = open_out_gen [Open_append; Open_creat] 0o666 "final_data.s" in
List.iter (fun l -> Printf.fprintf oc "%s\n" l) (zip1 rodata_list);
Here is my problem: rodata_list can usually grow to as many as 800,000 elements, and the above code on our server (64-bit, 32 core Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz) takes about 3.5 seconds. The OCaml version I use is 4.01.0.
This is not acceptable, especially as I have 4 pieces of code like this that write to a file. In total they could take over 15 seconds.
I tried this:
Printf.fprintf oc "%s\n" (String.concat "\n" (zip1 rodata_list));
But there was no obvious improvement.
So I am wondering how to optimize this part. I appreciate any solutions. Thank you!
Don't use ^ to concatenate a bunch of strings in performance critical code, as it will lead to quadratic complexity;
Try not to rely on *printf functions, when performance matters (although in OCaml 4.02 it is pretty fast);
Don't apply several iterations to a list in a row, since OCaml doesn't perform deforestation. Try to do as many operations per iteration as possible;
If you're using lists of 1 million elements, then you're actually doing something wrong. Try to use a different data structure;
So, given the advice above, we have the following:
List.iter (fun (x,y) ->
output_string oc x;
output_string oc y;
output_char oc '\n') rodata_list
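The same single-pass idea, sketched in Python for comparison. write_pairs is a name made up here, and nothing is claimed about how Python's buffered IO compares with OCaml's. The point is only that each pair is written piece by piece, with no concatenated string and no intermediate list:

```python
import io


def write_pairs(out, pairs):
    """Write each (head, tail) pair on its own line in one pass."""
    for head, tail in pairs:
        out.write(head)
        out.write(tail)
        out.write("\n")


buffer = io.StringIO()  # stands in for an open file
write_pairs(buffer, [("hello", "world"), ("foo", "bar")])
print(buffer.getvalue(), end="")
```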
Also, any optimization should start with profiling. To get a profile, you need to compile with profiling information, for example like this:
ocamlbuild myprogram.p.native
Then you can run the program to collect the profile, which can be read with gprof. My guess is that you will spend all the time not in the actual IO, or even in concatenation, but in garbage collection, since your zip will create millions of strings.
How fast it should be
So, to prove that you're actually trying to optimize the wrong part of your code, I wrote this small program:
let rec init_rev acc = function
| 0 -> acc
| n -> init_rev (("hello", "world") :: acc) (n-1)
let () = List.iter (fun (x,y) ->
print_string x;
print_endline y) (init_rev [] 1000_000)
It creates a list of one million elements and outputs it:
$ ocamlbuild main.native
$ time ./main.native > data.txt
real 0m0.998s
user 0m0.211s
sys 0m0.783s
This is on a MacBook laptop. Moreover, we spend most of the time in the system, with only about 200ms in OCaml. And a simple loop of 1000_000 iterations, without creating a list, takes only 11ms.
So, profile.
Here's an array A of length N whose values are between 1 and N (no duplicates).
I want to get the array B which satisfies B[A[i]]=i, for i in [1,N]
e.g.
for A=[4,2,1,3], I want to get
B=[3,2,4,1]
I've written Fortran code with OpenMP as shown below; array A is provided by another procedure. For N = 1024^3 (~10^9), it takes about 40 seconds, and assigning more threads helps very little (it takes a similar time for OMP_NUM_THREADS=1, 4 or 16). It seems OpenMP does not work well for very large N. (However, it works well for N=10^7.)
I wonder if there is a better algorithm for filling B, or a way to make OpenMP effective.
the code:
subroutine fill_inverse_array(leng, A, B)
use omp_lib
implicit none
integer*4, intent(in) :: leng
integer*4, intent(in) :: A(leng)
integer*4, intent(out) :: B(leng)
integer*4 :: i ! loop index: a local variable, so it takes no intent
!$omp parallel do private(i) firstprivate(leng) shared(A, B)
do i=1,leng
B(A(i))=i
enddo
!$omp end parallel do
end subroutine
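It's a slow day here so I ran some tests. I managed to squeeze out a useful increase in speed by rewriting the expression inside the loop, from B(A(i))=i to B(i) = A(A(i)). Be aware that the two agree for the example A=[4,2,1,3], but in general A(A(i)) is the inverse of A only when applying A three times returns every element to its starting position. I think the rewrite has a positive impact on performance because it is a little more cache-friendly.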
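Whether B(i) = A(A(i)) really reproduces B(A(i)) = i depends on the permutation, so it is worth checking on your own data. A minimal Python sketch (the helper names are made up for illustration, and values are 1-based as in the question) compares the two for the example from the question and for a 4-cycle:

```python
def inverse_permutation(a):
    """Return b such that b[a[i]] = i, with 1-based values stored in a list."""
    b = [0] * len(a)
    for i, ai in enumerate(a, start=1):
        b[ai - 1] = i
    return b


def apply_twice(a):
    """Return c such that c[i] = a[a[i]], with 1-based values stored in a list."""
    return [a[ai - 1] for ai in a]


for a in ([4, 2, 1, 3], [2, 3, 4, 1]):
    same = inverse_permutation(a) == apply_twice(a)
    print(a, "inverse:", inverse_permutation(a), "A(A):", apply_twice(a), "equal:", same)
```

The first permutation consists of a 3-cycle and a fixed point, which is why the two results coincide there; the 4-cycle shows a case where they differ.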
I used the following code to test various alternatives:
A = random_permutation(length)
CALL system_clock(st1)
B = A(A)
CALL system_clock(nd1)
CALL system_clock(st2)
DO i = 1, length
B(i) = A(A(i))
END DO
CALL system_clock(nd2)
CALL system_clock(st3)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(i) = A(A(i))
END DO
!$omp end parallel do
CALL system_clock(nd3)
CALL system_clock(st4)
DO i = 1, length
B(A(i)) = i
END DO
CALL system_clock(nd4)
CALL system_clock(st5)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(A(i)) = i
END DO
!$omp end parallel do
CALL system_clock(nd5)
As you can see, there are 5 timed sections in this code. The first is a simple one-line revision of your original code, to provide a baseline. This is followed by an unparallelised and then a parallelised version of your loop, rewritten along the lines I outlined above. Sections 4 and 5 reproduce your original order of operations, first unparallelised, then parallelised.
Over a series of four runs I got the following average times. In all cases I was using arrays of 10**9 elements and 8 threads. I tinkered a little and found that using 16 threads (with hyperthreading) gave very little further improvement, but that 8 was a definite improvement on fewer. The average timings:
Sec 1: 34.5s
Sec 2: 32.1s
Sec 3: 6.4s
Sec 4: 31.5s
Sec 5: 8.6s
Make of those numbers what you will. As noted above, I suspect that my version is marginally faster than your version because it makes better use of cache.
I'm using Intel Fortran 14.0.1.139 on a 64-bit Windows 7 machine with 10GB RAM. I used the '/O2' option for compiler optimisation.