Complexity of accessing a Fortran array in a loop

Recently I have been investigating the complexity of accessing a Fortran array. Thanks to the comments, I include a complete example here.
program main
implicit none
integer, parameter :: mp = SELECTED_REAL_KIND(15,307)
integer, parameter :: Np=10, rep=100
integer*8, parameter :: Ng(7) = (/1E3,1E4,1E5,1E6,1E7,1E8,1E9/)
real(mp), allocatable :: x(:)
real(mp) :: time1, time2
integer*8 :: i,j,k, Ngj
real(mp) :: temp
integer :: g
! print to screen
print *, 'calling program main'
do j=1,SIZE(Ng) !test with different Ng
!initialization with each Ng. Don't count for complexity.
Ngj = Ng(j)
if(ALLOCATED(x)) DEALLOCATE(x)
ALLOCATE(x(Ngj))
x = 0.0_mp
!!===This is the part I want to check the complexity===!!
call CPU_TIME(time1)
do k=1,rep
do i=1,Np
call RANDOM_NUMBER(temp)
g = floor( Ngj*temp ) + 1
x( g ) = x( g ) + 1.0_mp
end do
end do
call CPU_TIME(time2)
print *, 'Ng: ',Ngj,(time2-time1)/rep, '(sec)'
end do
! print to screen
print *, 'program main...done.'
contains
end program
At first I thought its complexity was O(Np). But these are the timings for Np=10:
calling program main
Ng: 1000 7.9000000000000080E-007 (sec)
Ng: 10000 4.6000000000000036E-007 (sec)
Ng: 100000 3.0999999999999777E-007 (sec)
Ng: 1000000 4.8000000000001171E-007 (sec)
Ng: 10000000 7.3999999999997682E-007 (sec)
Ng: 100000000 2.1479999999999832E-005 (sec)
Ng: 1000000000 4.5719999999995761E-005 (sec)
program main...done.
This dependence on Ng is weak and appears only for very large Ng, but it does not become negligible as Np grows; increasing Np just multiplies the timings by a constant factor.
Also, the scaling slope seems to increase when I use more complicated subroutines instead of RANDOM_NUMBER.
Computing temp and g was verified to be independent of Ng.
There are two questions about this situation:
Based on the comments, this kind of measurement includes not only the intended arithmetic operations but also costs related to the memory cache or the compiler. Is there a more correct way to measure the complexity?
Concerning the issues mentioned in the comments, such as the memory cache, page faults, or the compiler: are they inevitable as the array size increases, or is there a way to avoid these costs?

How do I understand this complexity? What cost did I fail to account for? I guess the cost of accessing an element of an array does depend on the size of the array. A few Stack Overflow posts say that array access costs only O(1) in some languages. I think that should also hold for Fortran, but I do not know why it does not seem to be the case here.
Aside from what you more or less explicitly ask of the program (performing the loop, getting random numbers, etc.), a number of other events occur, such as the loading of the runtime environment and input/output processing. To get useful timings, you must either perfectly isolate the code to be timed or arrange for the actual computation to take much more time than the rest of the code.
Is there any way to avoid this cost?
This is in reply 1 :-)
Now, for a solution: I completed your example and let it run for hundreds of millions of iterations. See below:
program time_random
integer, parameter :: rk = selected_real_kind(15)
integer, parameter :: Ng = 100
real(kind=rk), dimension(Ng) :: x = 0
real(kind=rk) :: temp
integer :: g, Np
write(*,*) 'Enter number of loops'
read(*,*) Np
do i=1,Np
call RANDOM_NUMBER(temp)
g = floor( Ng*temp ) + 1
x(g) = x(g) + 1
end do
write(*,*) x
end program time_random
I compiled it with gfortran -O3 -Wall -o time_random time_random.f90 and timed it with the time command from bash. Beware that this is very crude (which explains why I made the number of iterations so large), but it is also very simple to set up:
for ii in 100000000 200000000 300000000 400000000 500000000 600000000
do
time echo $ii | ./time_random 1>out
done
You can now collect the timings and observe a linear complexity. My computer reports 14 ns per iteration.
Remarks:
I used selected_real_kind to specify the real kind.
I write x after the loop to ensure that the loop is not optimized away.
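As a rough illustration of the approach above (run the loop for several iteration counts, then read the per-iteration cost off the slope), here is a minimal Python sketch. Python is used only for illustration; the names run_loop, fit_line and measure are hypothetical, the iteration counts are arbitrary, and interpreter overhead means the absolute numbers will not match the Fortran timings.

```python
import random
import time


def run_loop(n_cells, n_iter, rng):
    """Increment n_iter randomly chosen cells of an n_cells-long list."""
    x = [0.0] * n_cells
    for _ in range(n_iter):
        g = rng.randrange(n_cells)  # 0-based analogue of floor(Ng*temp) + 1
        x[g] += 1.0
    return x  # returned so that the work is observable, like "write(*,*) x"


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def measure(n_cells, iteration_counts, seed=0):
    """Time run_loop for each iteration count and fit a straight line."""
    rng = random.Random(seed)
    timings = []
    for n_iter in iteration_counts:
        start = time.perf_counter()
        run_loop(n_cells, n_iter, rng)
        timings.append(time.perf_counter() - start)
    return fit_line(iteration_counts, timings)


if __name__ == "__main__":
    # Hypothetical iteration counts, chosen only to keep the demo short.
    slope, intercept = measure(100, [200_000, 400_000, 600_000, 800_000])
    print(f"per-iteration cost ~ {slope * 1e9:.1f} ns")
    print(f"fixed overhead     ~ {intercept * 1e3:.3f} ms")
```

The slope estimates the cost per iteration, while the intercept absorbs whatever does not scale with the loop count (start-up, allocation, output), which is the part the answer warns about.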

Related

Optimizing the value N to split arrays up for vectorizing an array so it runs the quickest

I'm trying to optimize the value N used to split arrays into chunks for a vectorized computation, so that it runs as fast as possible on different machines. I have some test code below:
#example use random values
clear all,
t=rand(1,556790);
inner_freq=rand(8193,6);
N=100; # use N chunks
nn = int32(linspace(1, length(t)+1, N+1))
aa_sig_combined=zeros(size(t));
total_time_so_far=0;
for ii=1:N
tic;
ind = nn(ii):nn(ii+1)-1;
aa_sig_combined(ind) = sum(diag(inner_freq(1:end-1,2)) * cos(2 .* pi .* inner_freq(1:end-1,1) * t(ind)) .+ repmat(inner_freq(1:end-1,3),[1 length(ind)]));
toc
total_time_so_far=total_time_so_far+sum(toc)
end
fprintf('- Complete test in %4.4fsec or %4.4fmins\n',total_time_so_far,total_time_so_far/60);
This takes 162.7963sec or 2.7133mins to complete when N=100 on a 16gig i7 machine running ubuntu
Is there a way to find out what value N should be to get this to run the fastest on different machines?
PS: I'm running Octave 3.8.1 on 16gig i7 ubuntu 14.04 but it will also be running on even a 1 gig raspberry pi 2.
This is the Matlab test script that I used to time each parameter. The return is used to break it after the first iteration as it looks like the rest of the iterations are similar.
%example use random values
clear all;
t=rand(1,556790);
inner_freq=rand(8193,6);
N=100; % use N chunks
nn = int32( linspace(1, length(t)+1, N+1) );
aa_sig_combined=zeros(size(t));
D = diag(inner_freq(1:end-1,2));
for ii=1:N
ind = nn(ii):nn(ii+1)-1;
tic;
cosPara = 2 * pi * inner_freq(1:end-1,1) * t(ind);
toc;
cosResult = cos( cosPara );
sumParaA = D * cosResult;
toc;
sumParaB = repmat(inner_freq(1:end-1,3),[1 length(ind)]);
toc;
aa_sig_combined(ind) = sum( sumParaA + sumParaB );
toc;
return;
end
The output is indicated as follows. Note that I have a slow computer.
Elapsed time is 0.156621 seconds.
Elapsed time is 17.384735 seconds.
Elapsed time is 17.922553 seconds.
Elapsed time is 18.452994 seconds.
As you can see, the cos operation is what's taking so long. You are running cos on an 8192x5568 matrix (45,613,056 elements), so it makes sense that it takes so long.
If you wish to improve performance, use parfor as it appears each iteration is independent. Assuming you had 100 cores to run your 100 iterations, your script would be done in 17 seconds + parfor overhead.
Within the cos calculation, you might want to look into whether another method exists that computes cos faster and with more parallelism than the stock one.
Another minor optimization is this line. It ensures that the diag function isn't called within the loop as the diagonal matrix is constant. You don't want a 8192x8192 diagonal matrix to be generated every time! I just stored it outside the loop and it gives a bit of a performance boost as well.
D = diag(inner_freq(1:end-1,2));
Note that I didn't use the MATLAB profiler (the profile command) as it didn't work for me, but you should use it in the future for code that is split into functions.
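The stage-by-stage timing idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: timed_stages and its argument names are hypothetical, the timer is restarted for every stage (unlike the cumulative toc calls above), and nested lists stand in for the matrices.

```python
import math
import time


def timed_stages(freqs, amps, offsets, t_chunk):
    """Compute sum_k(amps[k] * cos(2*pi*freqs[k]*t) + offsets[k]) for each t,
    recording how long each stage takes on its own."""
    timings = {}

    start = time.perf_counter()
    phases = [[2.0 * math.pi * f * t for t in t_chunk] for f in freqs]
    timings["phase"] = time.perf_counter() - start

    start = time.perf_counter()
    cosines = [[math.cos(p) for p in row] for row in phases]
    timings["cos"] = time.perf_counter() - start

    start = time.perf_counter()
    result = [
        sum(a * row[j] + c for a, row, c in zip(amps, cosines, offsets))
        for j in range(len(t_chunk))
    ]
    timings["scale_and_sum"] = time.perf_counter() - start

    return result, timings


if __name__ == "__main__":
    # Tiny hypothetical inputs, just to show the call shape.
    values, stage_times = timed_stages([1.0, 2.0], [0.5, 0.25], [0.0, 0.1],
                                       [0.0, 0.1, 0.2])
    for stage, seconds in stage_times.items():
        print(f"{stage}: {seconds:.6f} s")
```

Restarting the timer per stage gives each stage's own cost directly, so the expensive step stands out without subtracting consecutive readings.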

Read data from file into array

I am having trouble reading data from a file into an array. I am new to programming and Fortran, so any information would be a blessing.
This is a sample of the text file the program reads from:
!60
!USS Challenger 1.51 12712.2 1.040986E+11
!USS Cortez 9.14 97822.5 2.009091E+09
!USS Dauntless 5.76 27984.0 2.599167E+09
!USS Enterprise 2.48 9136.3 1.478474E+10
!USS Excalibur 3.83 32253.0 1.286400E+10
Altogether there are 60 ships. The variables are separated by spaces and are as follows: warp factor, distance in light years, actual velocity.
This is the code I currently have; it gives the error "The module or main program array 'm' at (1) must have constant shape".
PROGRAM engine_performance
IMPLICIT NONE
INTEGER :: i ! loop index
REAL,PARAMETER :: c = 2.997925*10**8 ! light years (m/s)
REAL,PARAMETER :: n = 60 ! number of flights
REAL :: warpFactor ! warp factor
REAL :: distanceLy ! distance in light years
REAL :: actualTT ! actual time travel
REAL :: velocity ! velocity in m/s
REAL :: TheoTimeT ! theoretical time travel
REAL :: average ! average of engine performance
REAL :: median ! median of engine performance
INTEGER, DIMENSION (3,n), ALLOCATABLE :: M
OPEN(UNIT=10, FILE="TrekTimes.txt")
DO i = 1,n
READ(*,100) warpFactor, distanceLY, actualTT
100 FORMAT(T19,F4.2,1X,F7.1,&
1X,ES 12.6)
WRITE(*,*) M
END DO
CLOSE (10)
END PROGRAM engine_performance
The first time I read your code I mistakenly read M as an array in which your program would store the numbers from the file of ships. On closer inspection I realise (a) that M is an array of integers and (b) the read statement later in the code reads each line of the input file but doesn't store warpFactor, distanceLY, actualTT anywhere.
Making a wild guess that M ought to be the representation of the numeric factors associated with each ship, here's how to fix your code. If the wild guess is wide of the mark, clarify what you are trying to do and any remaining problems with your code.
First, fix the declaration of M
REAL, DIMENSION (:,:), ALLOCATABLE :: M
The term (3,n) can't be used in the declaration of an allocatable array. Because n was previously declared to be a real constant, it is not valid as the specification of an array extent. If it were valid, the declaration would fix the array's dimensions at (3,60), which means that the array couldn't be allocatable.
So, also change the declaration of n to
INTEGER :: n
It's no longer a parameter, you're going to read its value from the first line of the file, which is why it's in the file in the first place.
Second, I think you have rows and columns switched in your array. The file has 60 rows (of ships), each of which has 3 columns of numeric data. So when it comes time to allocate M use the statement
ALLOCATE(M(n,3))
Of course, prior to that you'll have had to read n from the first line in the file, but that shouldn't be a serious problem.
Third, read the values into the array. Change
READ(*,100) warpFactor, distanceLY, actualTT
to
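READ(10,100) M(i,:)
which will read the whole of row i. Note the unit number 10: READ(*,...) reads from standard input rather than from the file opened with UNIT=10.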
Finally, those leading ! on each line of the file -- get rid of them (use an editor or re-create the file without them). They prevent the easy reading of values. With the ! present reading n requires a format specification, without it it's just read(10,*).
Oh, and absolutely finally: you should probably, after you've got this program working, direct your attention to the topic of defined types in your favourite Fortran tutorial, and learn how to declare type(starship) for added expressiveness and ease of programming.
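To illustrate the overall shape of the fix (read the count from the first line, then store one record per ship in a named type), here is a small Python sketch. It is not a translation of the Fortran: Starship and parse_ships are hypothetical names, the leading ! characters are assumed to have been removed as suggested above, and fields are split on whitespace from the right instead of using the fixed-column FORMAT.

```python
from dataclasses import dataclass


@dataclass
class Starship:
    name: str
    warp_factor: float
    distance_ly: float
    actual_tt: float  # third numeric column, named after the question's actualTT


def parse_ships(lines):
    """Parse a count line followed by 'name warp distance value' records."""
    rows = [line.strip() for line in lines if line.strip()]
    count = int(rows[0])  # the first line holds the number of ships
    ships = []
    for row in rows[1:count + 1]:
        # Split from the right so that names containing spaces stay intact.
        name, warp, dist, actual = row.rsplit(None, 3)
        ships.append(Starship(name, float(warp), float(dist), float(actual)))
    return ships


if __name__ == "__main__":
    sample = [
        "2",
        "USS Challenger 1.51 12712.2 1.040986E+11",
        "USS Cortez 9.14 97822.5 2.009091E+09",
    ]
    for ship in parse_ships(sample):
        print(ship)
```

Grouping the three numbers with the ship's name plays the same role as the type(starship) suggestion: each record carries its own labelled fields instead of living in an anonymous row of M.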

Code becomes slower as more boxed arrays are allocated

Edit: It turns out that things generally (not just array/ref operations) slow down the more arrays have been created, so I guess this might just be measuring increased GC times and might not be as strange as I thought. But I'd really like to know (and learn how to find out) what's happening here though, and if there's some way to mitigate this effect in code that creates lots of smallish arrays. Original question follows.
In investigating some weird benchmarking results in a library, I stumbled upon some behavior I don't understand, though it might be really obvious. It seems that the time taken for many operations (creating a new MutableArray, reading or modifying an IORef) increases in proportion to the number of arrays in memory.
Here's the first example:
module Main
where
import Control.Monad
import qualified Data.Primitive as P
import Control.Concurrent
import Data.IORef
import Criterion.Main
import Control.Monad.Primitive(PrimState)
main = do
let n = 100000
allTheArrays <- newIORef []
defaultMain $
[ bench "array creation" $ do
newArr <- P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ())
atomicModifyIORef' allTheArrays (\l-> (newArr:l,()))
]
We're creating a new array and adding it to a stack. As criterion does more samples and the stack grows, array creation takes more time, and this seems to grow linearly and regularly.
Even more odd, IORef reads and writes are affected, and we can see the atomicModifyIORef' getting faster presumably as more arrays are GC'd.
main = do
let n = 1000000
arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
-- print $ length arrs -- THIS WORKS TO MAKE THINGS FASTER
arrsRef <- newIORef arrs
defaultMain $
[ bench "atomic-mods of IORef" $
-- nfIO $ -- OR THIS ALSO WORKS
replicateM 1000 $
atomicModifyIORef' arrsRef (\(a:as)-> (as,()))
]
Either of the two commented-out lines gets rid of this behavior, but I'm not sure why (maybe after we force the spine of the list, the elements can actually be collected).
Questions
What's happening here?
Is it expected behavior?
Is there a way I can avoid this slowdown?
Edit: I assume this has something to do with GC taking longer, but I'd like to understand more precisely what's happening, especially in the first benchmark.
Bonus example
Finally, here's a simple test program that can be used to pre-allocate some number of arrays and time a bunch of atomicModifyIORefs. This seems to exhibit the slow IORef behavior.
import Control.Monad
import System.Environment
import qualified Data.Primitive as P
import Control.Concurrent
import Control.Concurrent.Chan
import Control.Concurrent.MVar
import Data.IORef
import Criterion.Main
import Control.Exception(evaluate)
import Control.Monad.Primitive(PrimState)
import qualified Data.Array.IO as IO
import qualified Data.Vector.Mutable as V
import System.CPUTime
import System.Mem(performGC)
import System.Environment
main :: IO ()
main = do
[n] <- fmap (map read) getArgs
arrs <- replicateM (n) $ (P.newArray 64 () :: IO (P.MutableArray (PrimState IO) ()))
arrsRef <- newIORef arrs
t0 <- getCPUTimeDouble
cnt <- newIORef (0::Int)
replicateM_ 1000000 $
(atomicModifyIORef' cnt (\n-> (n+1,())) >>= evaluate)
t1 <- getCPUTimeDouble
-- make sure these stick around
readIORef cnt >>= print
readIORef arrsRef >>= (flip P.readArray 0 . head) >>= print
putStrLn "The time:"
print (t1 - t0)
A heap profile with -hy shows mostly MUT_ARR_PTRS_CLEAN, which I don't completely understand.
If you want to reproduce, here is the cabal file I've been using
name: small-concurrency-benchmarks
version: 0.1.0.0
build-type: Simple
cabal-version: >=1.10
executable small-concurrency-benchmarks
main-is: Main.hs
build-depends: base >=4.6
, criterion
, primitive
default-language: Haskell2010
ghc-options: -O2 -rtsopts
Edit: Here's another test program, that can be used to compare slowdown with heaps of the same size of arrays vs [Integer]. It takes some trial and error adjusting n and observing profiling to get comparable runs.
main4 :: IO ()
main4= do
[n] <- fmap (map read) getArgs
let ns = [(1::Integer).. n]
arrsRef <- newIORef ns
print $ length ns
t0 <- getCPUTimeDouble
mapM (evaluate . sum) (tails [1.. 10000])
t1 <- getCPUTimeDouble
readIORef arrsRef >>= (print . sum)
print (t1 - t0)
Interestingly, when I test this I find that the same heap size-worth of arrays affects performance to a greater degree than [Integer]. E.g.
          Baseline   20M    200M
Lists:    0.7        1.0    4.4
Arrays:   0.7        2.6    20.4
Conclusions (WIP)
1. This is most likely due to GC behavior.
2. But mutable boxed arrays seem to lead to more severe slowdowns (see above). Setting +RTS -A200M brings performance of the array garbage version in line with the list version, supporting that this has to do with GC.
3. The slowdown is proportional to the number of arrays allocated, not the number of total cells in the array. Here is a set of runs showing, for a similar test to main4, the effects of the number of arrays allocated both on the time taken to allocate and on a completely unrelated "payload". This is for 16777216 total cells (divided amongst however many arrays):
Array size   Array create time   Time for "payload"
8            3.164               14.264
16           1.532               9.008
32           1.208               6.668
64           0.644               3.78
128          0.528               2.052
256          0.444               3.08
512          0.336               4.648
1024         0.356               0.652
And running this same test on 16777216*4 cells shows basically identical payload times as above, only shifted down two places.
4. From what I understand about how GHC works, and looking at (3), I think this overhead might simply come from having pointers to all these arrays sticking around in the remembered set, and whatever overhead that causes for the GC.
You are paying linear overhead every minor GC per mutable array that remains live and gets promoted to the old generation. This is because GHC unconditionally places all mutable arrays on the mutable list and traverses the entire list every minor GC. See https://ghc.haskell.org/trac/ghc/ticket/7662 for more information, as well as my mailing list response to your question: http://www.haskell.org/pipermail/glasgow-haskell-users/2014-May/024976.html
I think you're definitely seeing GC effects. I had a related issue in cassava (https://github.com/tibbe/cassava/issues/49#issuecomment-34929984) where the GC time was increasing linearly with increasing heap size.
Try to measure how the GC time and mutator time increase as you hold on to more and more arrays in memory.
You can reduce GC time by playing with the +RTS options. For example, try setting -A to your L3 cache size.

Fastest way to get Inverse Mapping from Values to Indices in fortran

Here's an array A of length N whose values are between 1 and N (no duplicates).
I want to get the array B which satisfies B[A[i]]=i for i in [1,N],
e.g.
for A=[4,2,1,3], I want to get
B=[3,2,4,1]
I've written Fortran code with OpenMP as shown below; array A is supplied by another procedure. For N = 1024^3 (~10^9), it takes about 40 seconds, and assigning more threads does little to help (it takes a similar time for OMP_NUM_THREADS=1, 4 or 16). It seems OpenMP does not work well for very large N. (However, it works well for N=10^7.)
I wonder whether there is a better algorithm for filling B, or a way to make OpenMP effective.
the code:
subroutine fill_inverse_array(leng, A, B)
use omp_lib
implicit none
integer*4, intent(in) :: leng
integer*4, intent(in) :: A(leng)
integer*4, intent(out) :: B(leng)
integer*4 :: i ! loop index (a local variable, so it cannot carry intent(in))
!$omp parallel do private(i) firstprivate(leng) shared(A, B)
do i=1,leng
B(A(i))=i
enddo
!$omp end parallel do
end subroutine
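It's a slow day here so I ran some tests. I managed to squeeze out a useful increase in speed by rewriting the expression inside the loop, from B(A(i))=i to B(i) = A(A(i)). I think this has a positive impact on performance because it is a little more cache-friendly. Be aware that the two forms give the same B for the example A=[4,2,1,3], but they are not equivalent for an arbitrary permutation: A(A(i)) applies A twice, which equals the inverse only when every cycle of A has length 1 or 3.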
I used the following code to test various alternatives:
A = random_permutation(length)
CALL system_clock(st1)
B = A(A)
CALL system_clock(nd1)
CALL system_clock(st2)
DO i = 1, length
B(i) = A(A(i))
END DO
CALL system_clock(nd2)
CALL system_clock(st3)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(i) = A(A(i))
END DO
!$omp end parallel do
CALL system_clock(nd3)
CALL system_clock(st4)
DO i = 1, length
B(A(i)) = i
END DO
CALL system_clock(nd4)
CALL system_clock(st5)
!$omp parallel do shared(A,B,length) private(i)
DO i = 1, length
B(A(i)) = i
END DO
!$omp end parallel do
CALL system_clock(nd5)
As you can see, there are 5 timed sections in this code. The first is a simple one-line revision of your original code, to provide a baseline. This is followed by an unparallelised and then a parallelised version of your loop, rewritten along the lines I outlined above. Sections 4 and 5 reproduce your original order of operations, first unparallelised, then parallelised.
Over a series of four runs I got the following average times. In all cases I was using arrays of 10**9 elements and 8 threads. I tinkered a little and found that using 16 threads (hyperthreads) gave very little improvement, but that 8 was a definite improvement on fewer. The average timings:
Sec 1: 34.5s
Sec 2: 32.1s
Sec 3: 6.4s
Sec 4: 31.5s
Sec 5: 8.6s
Make of those numbers what you will. As noted above, I suspect that my version is marginally faster than your version because it makes better use of cache.
I'm using Intel Fortran 14.0.1.139 on a 64-bit Windows 7 machine with 10GB RAM. I used the '/O2' option for compiler optimisation.
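To make the relationship B[A[i]] = i concrete, and to check when A(A(i)) gives the same result, here is a small Python sketch. It uses 0-based indices (so the question's A=[4,2,1,3] becomes [3,1,0,2]); the function names are hypothetical and nothing here says anything about performance.

```python
def inverse_permutation(a):
    """Return b such that b[a[i]] == i for a 0-based permutation a."""
    b = [0] * len(a)
    for i, target in enumerate(a):
        b[target] = i
    return b


def apply_twice(a):
    """Return the list whose i-th entry is a[a[i]]."""
    return [a[a[i]] for i in range(len(a))]


if __name__ == "__main__":
    example = [3, 1, 0, 2]   # the question's A = [4,2,1,3], shifted to 0-based
    print(inverse_permutation(example), apply_twice(example))

    four_cycle = [1, 2, 3, 0]  # a single cycle of length 4
    print(inverse_permutation(four_cycle), apply_twice(four_cycle))
```

For the question's example both functions return the same list, because that permutation consists of one fixed point and one cycle of length 3. For the 4-cycle they differ, so B(i) = A(A(i)) should not be used as a general replacement for B(A(i)) = i.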

Haskell Constant Propagation on Data Structures?

I want to know how deeply Haskell evaluates data structures at compile time.
Consider the following list:
simpleTableMultsList :: [Int]
simpleTableMultsList = [n*m | n <- [1 ..9],m <- [1 ..9]]
This list gives a representation of the multiplication table for 1 through 9. Now, suppose we want to change it so that we represent the product of two one digit numbers as a pair of numbers (first digit, second digit). Then we may consider
simpleTableMultsList :: [(Int,Int)]
simpleTableMultsList = [(k `div` 10, k `rem` 10) | n <- [1 ..9],m <- [1 ..9],let k = n*m]
Now we can implement multiplication on one digit numbers as a table lookup. YAY!! However, we want to be more efficient than this! So we want to make this structure an unboxed array. Haskell gives a really great way to do this using
import qualified Data.Array.Unboxed as A
Then we can do:
simpleTableMults :: A.Array (Int,Int) (Int,Int)
simpleTableMults = A.listArray ((1,1),(9,9)) simpleTableMultsList
Now if I want a constant time multiplication of two one digit numbers n and m, I can do:
simpleTableMults ! (n,m)
This is great! Now suppose I compile this module we've been working on. Does the simpleTableMults get fully evaluated so that when I run the computation simpleTableMults ! (n,m), the program literally makes a lookup in memory ... or does it have to build the data structure in memory first. Since it is an unboxed array, my understanding is that the Array must be created at once and is completely strict in its elements -- so that all the elements of the array are fully evaluated.
So really my question is: when does this evaluation occur, and can I force it to occur at compile time?
------- Edit ---------------
I tried to dig further on this! I tried compiling and examining information about the core. It seems GHC is performing a lot of reductions on the code at compile time. I wish I knew more about core to be able to tell. If we compile with
ghc -O2 -ddump-simpl-stats Main.hs
We can see that 98 beta reductions are performed, an unpack-list operation is carried out, many things are unfolded, and a bunch of inlines are performed (around 150). It even tells you where the beta reductions occur, ... since the word IxArray keeps coming up, I am all the more curious whether some sort of simplification is occurring. Now the interesting thing from my point of view is that adding
simpleTableMults = D.deepseq t t
where t = A.listArray ((1,1),(9,9)) simpleTableMultsList
increases the number of beta reductions, inlines, and simplifications quite substantially at compile time. It would be really great if I could load the compiled program into a debugger of some sort and "view" the data structure! I am, as it stands, more mystified than before.
------ Edit 2 -------------
I still don't know what beta reductions are being performed. However, I did find out some interesting things based on sassa-nf's response. For the following experiment, I used the ghc-heap-view package. I changed the way Array was represented in the source according to the Sassa-NF answer. I loaded the program into GHCi, and immediately called
:printHeap simpleTableMults
And, as expected, I got an index too large exception. But under the suggested unpacked datatype, I got a let expression with a toArray and a bunch of _thunks, and some _funs. Not really sure yet what these mean ... The other interesting thing is that by using seq, or some other strictness forcing in the source code, I ended up with all _thunks inside of the let. I can upload the exact emission if that helps.
Also, if I perform a single indexing, the array gets completely evaluated in all cases.
Also, there is no way to call ghci with optimizations, so I might not be getting the same results as when compiled with GHC -O2.
Let's exaggerate:
import qualified Data.Array.Unboxed as A
simpleTableMults :: A.Array (Int,Int) (Int,Int)
simpleTableMults = A.listArray ((1,1),(10000,2000))
[(k `div` 10, k `rem` 10) | n <- [1 ..10000],m <- [1 ..2000],let k = n*m]
main = print $ simpleTableMults A.! (10000,1000)
Then
ghc -O2 -prof b.hs
b +RTS -hy
......Out of memory
hp2ps b.exe.hp
What happened?! You can see the heap consumption graph go above 1GB, and then it died.
Well, the pair is computed eagerly, but the projections of the pair are lazy, so we end up with tons of thunks to compute k `div` 10 and k `rem` 10.
import qualified Data.Array.Unboxed as A
data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !Int deriving (Show)
simpleTableMults :: A.Array (Int,Int) P
simpleTableMults = A.listArray ((1,1),(10000,2000))
[P (k `div` 10) (k `rem` 10) |
n <- [1 ..10000],m <- [1 ..2000],let k = n*m]
main = print $ simpleTableMults A.! (10000,1000)
This one is fine, because we eagerly computed the pair.
