Trick to read data from hard drive faster between successive compilations


I am developing code with a compiled language (Fortran 95) that does certain calculations on a huge galaxy catalog. Each time I implement some change, I compile and run the code, and it takes about 3 minutes just reading the ASCII file with the galaxy data from disk. This is a waste of time.
Had I started this project in IDL or Matlab, then it would be different, because the variables containing the array data would be kept in memory between different compilations.
However, I think something could be done to speed up that unnerving reading from disk, like having the files in a fake RAM partition or something.

Instead of going into details on RAM disks, I propose you switch from ASCII databases to binary ones. Here is a very simplistic example: an array of random numbers, stored as ASCII (ASCII.txt) and as binary data (binary.bin):
program writeArr
  use,intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  real(REAL64),allocatable :: tmp(:,:)
  integer :: uFile, i
  allocate( tmp(10000,10000) )
  ! Fill the array with random numbers
  call random_number(tmp)
  ! Formatted write (newunit= picks a free unit number)
  open(newunit=uFile, file='ASCII.txt',form='formatted', &
       status='replace',action='write')
  do i=1,size(tmp,2)
    write(uFile,*) tmp(:,i)
  enddo !i
  close(uFile)
  ! Unformatted write
  open(newunit=uFile, file='binary.bin',form='unformatted', &
       status='replace',action='write')
  write(uFile) tmp
  close(uFile)
end program
Here is the result in terms of sizes:
:> ls -lah ASCII.txt binary.bin
-rw-rw-r--. 1 elias elias 2.5G Feb 20 20:59 ASCII.txt
-rw-rw-r--. 1 elias elias 763M Feb 20 20:59 binary.bin
So, you save a factor of ~3.35 in terms of storage.
Now comes the fun part: reading it back in...
program readArr
  use,intrinsic :: ISO_Fortran_env, only: REAL64
  implicit none
  real(REAL64),allocatable :: tmp(:,:)
  integer :: uFile, i
  integer :: count_rate, iTime1, iTime2
  allocate( tmp(10000,10000) )
  ! Get the count rate
  call system_clock(count_rate=count_rate)
  ! Formatted read (newunit= picks a free unit number)
  open(newunit=uFile, file='ASCII.txt',form='formatted', &
       status='old',action='read')
  call system_clock(iTime1)
  do i=1,size(tmp,2)
    read(uFile,*) tmp(:,i)
  enddo !i
  call system_clock(iTime2)
  close(uFile)
  print *,'ASCII read ',real(iTime2-iTime1,REAL64)/real(count_rate,REAL64)
  ! Unformatted read
  open(newunit=uFile, file='binary.bin',form='unformatted', &
       status='old',action='read')
  call system_clock(iTime1)
  read(uFile) tmp
  call system_clock(iTime2)
  close(uFile)
  print *,'Binary read ',real(iTime2-iTime1,REAL64)/real(count_rate,REAL64)
end program
The result is
ASCII read 37.250999999999998
Binary read 1.5460000000000000
So, a factor of >24!
So instead of thinking of anything else, please switch to a binary file format first.
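If the original catalog has to stay in ASCII, one possible workflow is to parse it once and keep a binary copy for all later runs. The following is only a minimal sketch of that idea: the file names catalog_demo.txt and catalog_demo.bin, the 3 x 1000 array shape, and the random stand-in data are assumptions made for the demo, not part of the question.

```fortran
program cached_read
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Hypothetical file names and array shape, chosen only for this demo.
  character(len=*), parameter :: ascii_file = 'catalog_demo.txt'
  character(len=*), parameter :: cache_file = 'catalog_demo.bin'
  integer, parameter :: n_cols = 3, n_rows = 1000
  real(real64) :: catalog(n_cols, n_rows)
  integer :: u, i
  logical :: have_ascii, have_cache

  ! Create a small stand-in ASCII catalog if none is present.
  inquire(file=ascii_file, exist=have_ascii)
  if (.not. have_ascii) then
    call random_number(catalog)
    open(newunit=u, file=ascii_file, form='formatted', &
         status='new', action='write')
    do i = 1, n_rows
      write(u, *) catalog(:, i)
    end do
    close(u)
  end if

  inquire(file=cache_file, exist=have_cache)
  if (have_cache) then
    ! Fast path: a single unformatted read of the whole array.
    open(newunit=u, file=cache_file, form='unformatted', &
         access='stream', status='old', action='read')
    read(u) catalog
    close(u)
    print *, 'read binary cache'
  else
    ! Slow path, taken once: parse the ASCII file, then store the cache.
    open(newunit=u, file=ascii_file, form='formatted', &
         status='old', action='read')
    do i = 1, n_rows
      read(u, *) catalog(:, i)
    end do
    close(u)
    open(newunit=u, file=cache_file, form='unformatted', &
         access='stream', status='new', action='write')
    write(u) catalog
    close(u)
    print *, 'parsed ASCII file and wrote binary cache'
  end if
end program cached_read
```

The first run takes the slow path and later runs take the fast one. The cache file has to be deleted by hand whenever the ASCII catalog changes, since this sketch does not compare time stamps.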

Related

Complexity for accessing fortran array in a loop

I have recently been investigating the complexity of accessing a Fortran array. Thanks to the comments, I include a complete example here.
program main
implicit none
integer, parameter :: mp = SELECTED_REAL_KIND(15,307)
integer, parameter :: Np=10, rep=100
integer*8, parameter :: Ng(7) = (/1E3,1E4,1E5,1E6,1E7,1E8,1E9/)
real(mp), allocatable :: x(:)
real(mp) :: time1, time2
integer*8 :: i,j,k, Ngj
real(mp) :: temp
integer :: g
! print to screen
print *, 'calling program main'
do j=1,SIZE(Ng) !test with different Ng
!initialization with each Ng. Don't count for complexity.
Ngj = Ng(j)
if(ALLOCATED(x)) DEALLOCATE(x)
ALLOCATE(x(Ngj))
x = 0.0_mp
!!===This is the part I want to check the complexity===!!
call CPU_TIME(time1)
do k=1,rep
do i=1,Np
call RANDOM_NUMBER(temp)
g = floor( Ngj*temp ) + 1
x( g ) = x( g ) + 1.0_mp
end do
end do
call CPU_TIME(time2)
print *, 'Ng: ',Ngj,(time2-time1)/rep, '(sec)'
end do
! print to screen
print *, 'program main...done.'
contains
end program
At first I thought its complexity was O(Np). But this is the time measurement for Np=10:
calling program main
Ng: 1000 7.9000000000000080E-007 (sec)
Ng: 10000 4.6000000000000036E-007 (sec)
Ng: 100000 3.0999999999999777E-007 (sec)
Ng: 1000000 4.8000000000001171E-007 (sec)
Ng: 10000000 7.3999999999997682E-007 (sec)
Ng: 100000000 2.1479999999999832E-005 (sec)
Ng: 1000000000 4.5719999999995761E-005 (sec)
program main...done.
This Ng dependence is weak and appears only for very large Ng, but it does not go away when Np is increased; increasing Np just multiplies the timings by a constant factor.
Also, the scaling slope seems to increase when I use more complicated subroutines instead of the random number generator.
Computing temp and g was verified to be independent of Ng.
I have two questions about this situation:
1. Based on the comments, this kind of measurement includes not only the intended arithmetic operations but also costs related to the memory cache or the compiler. Is there a more correct way to measure the complexity?
2. Are the issues mentioned in the comments, such as the memory cache, page faults, or the compiler, inevitable as the array size increases? Or is there any way to avoid these costs?

"How do I understand this complexity? What is the cost that I missed to account for? I guess the cost of accessing an element in an array does depend on the size of the array. A few Stack Overflow posts say that array access costs only O(1) for some languages. I think it should also hold for Fortran, but I do not know why that is not the case."
Aside from what you ask of the program more or less explicitly (performing the loop, getting random numbers, etc.), a number of events occur, such as the loading of the runtime environment and input/output processing. To make useful timings, you must either perfectly isolate the code to time or arrange for the actual computation to take a lot more time than the rest of the code.
"Is there any way to avoid this cost?"
See the reply to the first question :-)
Now, for a solution: I completed your example and let it run for hundreds of millions of iterations. See below:
program time_random
  implicit none
  integer, parameter :: rk = selected_real_kind(15)
  integer, parameter :: Ng = 100
  real(kind=rk), dimension(Ng) :: x = 0
  real(kind=rk) :: temp
  integer :: g, Np, i
  write(*,*) 'Enter number of loops'
  read(*,*) Np
  do i=1,Np
    call RANDOM_NUMBER(temp)
    g = floor( Ng*temp ) + 1
    x(g) = x(g) + 1
  end do
  ! Printing x keeps the compiler from optimizing the loop away
  write(*,*) x
end program time_random
I compiled it with gfortran -O3 -Wall -o time_random time_random.f90 and timed it with the time builtin from bash. Beware that this is very crude (which explains why I made the number of iterations so large). It is also very simple to set up:
for ii in 100000000 200000000 300000000 400000000 500000000 600000000
do
time echo $ii | ./time_random 1>out
done
You can now collect the timings and observe a linear complexity. My computer reports 14 ns per iteration.
Remarks:
I used selected_real_kind to specify the real kind.
I write x after the loop to ensure that the loop is not optimized away.
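As an alternative to timing the whole executable from the shell, the loop can be timed inside the program for several array sizes. This is a rough sketch rather than a definitive benchmark: the iteration count n_iter and the list of sizes are arbitrary choices for illustration, and system_clock is used with a 64-bit counter so that the clock resolution is as fine as the compiler provides.

```fortran
program time_access
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  ! Assumed values: enough iterations that the loop dominates the timing.
  integer, parameter :: n_iter = 5000000
  integer, parameter :: sizes(4) = [1000, 100000, 1000000, 10000000]
  real(real64), allocatable :: x(:)
  real(real64) :: r, ns_per_access, checksum
  integer(int64) :: t0, t1, rate
  integer :: j, k, g

  call system_clock(count_rate=rate)
  do j = 1, size(sizes)
    allocate(x(sizes(j)))
    x = 0.0_real64
    call system_clock(t0)
    do k = 1, n_iter
      call random_number(r)
      ! Map r in [0,1) to an index in 1..sizes(j); min() guards rounding.
      g = min(int(r*sizes(j)) + 1, sizes(j))
      x(g) = x(g) + 1.0_real64
    end do
    call system_clock(t1)
    ns_per_access = 1.0e9_real64*real(t1 - t0, real64) &
                    /(real(rate, real64)*real(n_iter, real64))
    ! Using the array afterwards keeps the loop from being optimized away.
    checksum = sum(x)
    print '(a,i10,a,f8.2,a,f12.1)', 'Ng = ', sizes(j), &
          '  ns/access = ', ns_per_access, '  checksum = ', checksum
    deallocate(x)
  end do
end program time_access
```

Each measurement includes the cost of random_number, which does not depend on the array size. If the time per access grows only once the array no longer fits in cache, that points to the memory hierarchy rather than to the indexing operation itself.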

How to read a .dat file from bottom to top in fortran?

I'm new to Fortran and am trying to turn the following A.dat file into the desired B.dat, and then overwrite A.dat with it. In other words, I want to read A.dat's rows from bottom to top; in this example, the first row should be swapped with the third (last) one. Can anyone show me how to do that in Fortran 90?
A.dat's contents    B.dat's contents (desired)
111001              1111
110110              110110
1111                111001
So far, with @High Performance Mark's help, I tried the following:
program test
real, dimension(:), allocatable :: x
Integer (kind=8) :: n
integer(kind = 4) :: i
open (unit=99, file='A.dat', access='sequential', form='formatted')
open(unit=20, file='B.dat', access='sequential', form='formatted')
do i=3,1,-1
read(99,*) n
write(*,*) n
write(20, *) n
end do
close(20)
end program test
But I am stuck at the step "write a new file in reverse order" (the program above just reads A.dat's contents and writes them to the terminal and to B.dat in the same order). How should I do that?
P.S. Machine info:
"Linux 3.16.6-200.fc20.x86_64 (fedora 20)"
"gcc version 4.8.3 20140911 (Red Hat 4.8.3-7) (GCC)"
"using .f90"

The easiest way to do this in Fortran will demonstrate why it's not something you want to do in Fortran.
program futility
implicit none
call execute_command_line('tac A.dat > B.dat')
end program
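If a version without an external command is needed, one possible approach is to read every line into memory and write the lines back in reverse order. This is a minimal sketch under two assumptions that are not in the question: no line is longer than max_len characters, and the file is small enough to hold in memory. It also creates a three-line demo A.dat when no such file exists.

```fortran
program reverse_lines
  implicit none
  integer, parameter :: max_len = 256   ! assumed upper bound on line length
  character(len=max_len), allocatable :: lines(:)
  character(len=max_len) :: buffer
  integer :: u, ios, n, i
  logical :: have_input

  ! Create a small demo input only if A.dat is not already present.
  inquire(file='A.dat', exist=have_input)
  if (.not. have_input) then
    open(newunit=u, file='A.dat', status='new', action='write')
    write(u, '(a)') '111001', '110110', '1111'
    close(u)
  end if

  ! First pass: count the lines.
  open(newunit=u, file='A.dat', status='old', action='read')
  n = 0
  do
    read(u, '(a)', iostat=ios) buffer
    if (ios /= 0) exit
    n = n + 1
  end do

  ! Second pass: store every line.
  allocate(lines(n))
  rewind(u)
  do i = 1, n
    read(u, '(a)') lines(i)
  end do
  close(u)

  ! Write the lines in reverse order.
  open(newunit=u, file='B.dat', status='replace', action='write')
  do i = n, 1, -1
    write(u, '(a)') trim(lines(i))
  end do
  close(u)
end program reverse_lines
```

Because all lines are held in memory before anything is written, the output file name could also be A.dat itself, which would overwrite the input in place.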

Read data from file into array

I am having trouble reading data from a file into an array. I am new to programming and Fortran, so any information would be a blessing.
This is a sample of the text file the program is reading from:
!60
!USS Challenger 1.51 12712.2 1.040986E+11
!USS Cortez 9.14 97822.5 2.009091E+09
!USS Dauntless 5.76 27984.0 2.599167E+09
!USS Enterprise 2.48 9136.3 1.478474E+10
!USS Excalibur 3.83 32253.0 1.286400E+10
Altogether there are 60 ships. The variables are separated by spaces and are as follows: warp factor, distance in light years, actual velocity.
This is the code I currently have. It gives the error "The module or main program array 'm' at (1) must have constant shape".
PROGRAM engine_performance
IMPLICIT NONE
INTEGER :: i ! loop index
REAL,PARAMETER :: c = 2.997925*10**8 ! light years (m/s)
REAL,PARAMETER :: n = 60 ! number of flights
REAL :: warpFactor ! warp factor
REAL :: distanceLy ! distance in light years
REAL :: actualTT ! actual time travel
REAL :: velocity ! velocity in m/s
REAL :: TheoTimeT ! theoretical time travel
REAL :: average ! average of engine performance
REAL :: median ! median of engine performance
INTEGER, DIMENSION (3,n), ALLOCATABLE :: M
OPEN(UNIT=10, FILE="TrekTimes.txt")
DO i = 1,n
READ(*,100) warpFactor, distanceLY, actualTT
100 FORMAT(T19,F4.2,1X,F7.1,&
1X,ES 12.6)
WRITE(*,*) M
END DO
CLOSE (10)
END PROGRAM engine_performance

The first time I read your code I mistakenly read M as an array in which your program would store the numbers from the file of ships. On closer inspection I realise (a) that M is an array of integers and (b) the read statement later in the code reads each line of the input file but doesn't store warpFactor, distanceLY, actualTT anywhere.
Making a wild guess that M ought to be the representation of the numeric factors associated with each ship, here's how to fix your code. If the wild guess is wide of the mark, clarify what you are trying to do and any remaining problems with your code.
First, fix the declaration of M
REAL, DIMENSION (:,:), ALLOCATABLE :: M
The term (3,n) can't be used in the declaration of an allocatable array. Because n is previously declared to be a real constant, it's not valid as the specification of an array extent. Even if it were, the declaration would fix the array's dimensions at (3,60), which means that the array couldn't be allocatable.
So, also change the declaration of n to
INTEGER :: n
It's no longer a parameter, you're going to read its value from the first line of the file, which is why it's in the file in the first place.
Second, I think you have rows and columns switched in your array. The file has 60 rows (of ships), each of which has 3 columns of numeric data. So when it comes time to allocate M use the statement
ALLOCATE(M(n,3))
Of course, prior to that you'll have had to read n from the first line in the file, but that shouldn't be a serious problem.
Third, read the values into the array. Change
READ(*,100) warpFactor, distanceLY, actualTT
to
READ(10,100) M(i,:)
which will read the whole of row i. Note the unit number: READ(*,...) reads from standard input, not from the file opened on unit 10.
Finally, those leading ! on each line of the file -- get rid of them (use an editor or re-create the file without them). They prevent the easy reading of values. With the ! present reading n requires a format specification, without it it's just read(10,*).
Oh, and absolutely finally: you should probably, after you've got this program working, direct your attention to the topic of defined types in your favourite Fortran tutorial, and learn how to declare type(starship) for added expressiveness and ease of programming.
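The steps above, together with the suggested derived type, could look roughly like the following. This is a sketch only: the type starship and its component names, the file name TrekDemo.txt, and the fixed column widths in row_fmt are all assumptions, because the exact column layout of the original file cannot be recovered from the post. The program writes its own two-ship demo file so that it can run on its own.

```fortran
program read_ships
  implicit none
  ! Hypothetical derived type; component names are illustrative.
  type :: starship
    character(len=18) :: name
    real :: warp_factor
    real :: distance_ly
    real :: third_value      ! third numeric column of the file
  end type starship
  ! Assumed fixed-width layout: 18-character name, then three numbers.
  character(len=*), parameter :: row_fmt = '(a18, f5.2, 1x, f8.1, 1x, es13.6)'
  type(starship) :: demo(2)
  type(starship), allocatable :: fleet(:)
  integer :: u, n, i

  ! Write a small demo file: the ship count first, then one row per ship.
  demo(1) = starship('USS Challenger', 1.51, 12712.2, 1.040986e+11)
  demo(2) = starship('USS Cortez', 9.14, 97822.5, 2.009091e+09)
  open(newunit=u, file='TrekDemo.txt', status='replace', action='write')
  write(u, '(i0)') size(demo)
  do i = 1, size(demo)
    write(u, row_fmt) demo(i)
  end do
  close(u)

  ! Read n from the first line, allocate, then read one ship per line.
  open(newunit=u, file='TrekDemo.txt', status='old', action='read')
  read(u, *) n
  allocate(fleet(n))
  do i = 1, n
    read(u, row_fmt) fleet(i)
  end do
  close(u)

  do i = 1, n
    print '(a, 1x, f5.2, 1x, f8.1, 1x, es13.6)', fleet(i)
  end do
end program read_ships
```

A fixed-width format is used for the name because ship names contain a space, which list-directed input would treat as a separator.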

Trying to read a 2 column input file into two separate arrays and having a lot of trouble

This is likely a trivial question, but for some reason I'm having a lot of trouble solving this problem. I'm reading from an input file which has two sets of numbers in two columns. The first column is a list of integers representing time (e.g. 0530). The second column is a list of real data, 5 digits long with 3 digits after the decimal point (e.g. 19.213). The two columns have 3 spaces between them. I would like to read these into separate arrays in my program. I've set the dimensions of the arrays to the maximum possible length (1440), as shown below. I'd like to use these arrays in a function eventually, but I can't even get the input to work properly. Thanks for the help.
PROGRAM readtest1
IMPLICIT NONE
INTEGER, DIMENSION(1440) :: t
REAL, DIMENSION(1440) :: tuvr
OPEN(1, FILE='AP2412.tv', STATUS='old', ACTION='read')
OPEN(2, FILE='timetuvr.txt', STATUS='replace', ACTION='write')
READ(1,100) t, tuvr
100 FORMAT(I5, F8.3)
WRITE(2,100) t, tuvr
END PROGRAM readtest1
Oh, and when I compile and run the program I get the error "FORTRAN runtime error: Expected REAL for item 2 in formatted transfer, got INTEGER".
I believe that Fortran is reading directly down the column, which is causing this problem, but I'm unsure how to fix it. Do I need a double loop?

read (...) t, tuvr reads the entire arrays at once, in a block: first all of t, then all of tuvr. You want to read them one element at a time, since that is how they are laid out in the file. Like this:
do i=1, 1440
read (1, '(i5,f8.3)' ) t(i), tuvr(i)
end do
Depending on whether or not the numbers in the file are perfectly in columns, you might find it necessary to use list-directed IO: read (1, *) t(i), tuvr(i). This method is very flexible and easy to use.
If the file might have fewer than 1440 lines, try something like this, which detects the end of file and counts how many lines were read:
program test
  use, intrinsic :: iso_fortran_env
  implicit none
  integer, parameter :: ArrayLen = 1440
  INTEGER, DIMENSION(ArrayLen) :: t
  REAL, DIMENSION(ArrayLen) :: tuvr
  integer :: i, ReadCode, num
  ! Open the input file on unit 1 (file name taken from the question)
  open (1, file='AP2412.tv', status='old', action='read')
  num = 0
  ReadLoop: do i=1, ArrayLen
    read (1, '(i5,f8.3)', iostat=ReadCode ) t(i), tuvr(i)
    if ( ReadCode /= 0 ) then
      if ( ReadCode == iostat_end ) then
        exit ReadLoop
      else
        write ( *, '( / "Error on read: ", I0 )' ) ReadCode
        stop
      end if
    end if
    num = num + 1
  end do ReadLoop
  close (1)
  ! num now holds the number of lines actually read
  write ( *, '( "Lines read: ", I0 )' ) num
end program test

In Fortran 90, what is a good way to write an array to a text file, row-wise?

I am new to Fortran, and I would like to be able to write a two-dimensional array to a text file, in a row-wise manner (spaces between columns, and each row on its own line). I have tried the following, and it seems to work in the following simple example:
PROGRAM test3
IMPLICIT NONE
INTEGER :: i, j, k, numrows, numcols
INTEGER, DIMENSION(:,:), ALLOCATABLE :: a
numrows=5001
numcols=762
ALLOCATE(a(numrows,numcols))
k=1
DO i=1,SIZE(a,1)
DO j=1,SIZE(a,2)
a(i,j)=k
k=k+1
END DO
END DO
OPEN(UNIT=12, FILE="aoutput.txt", ACTION="write", STATUS="replace")
DO i=1,numrows
WRITE(12,*) (a(i,j), j=1,numcols)
END DO
END PROGRAM test3
As I said, this seems to work fine in this simple example: the resulting text file, aoutput.txt, contains the numbers 1-762 on line 1, numbers 763-1524 on line 2, and so on.
But when I use the above ideas (i.e., the fifth-to-last through second-to-last lines of code above) in a more complicated program, I run into trouble: each row is delimited (by a new line) only intermittently, it seems. (I have not posted my entire complicated program here, and probably will not, because it is rather long.) The lack of consistent row delimiters in my complicated program probably suggests another bug in my code, not in the four-line write-to-file routine above, since the simple example above appears to work okay. Still, I am wondering: can you please help me think whether there is a better row-wise write-to-text-file routine that I should be using?
Thank you very much for your time. I really appreciate it.

There are a few issues here.
The fundamental one is that you shouldn't use text as a data format for sizable chunks of data. It's big and it's slow. Text output is good for something you're going to read yourself; you aren't going to sit down with a printout of 3.81 million integers and flip through them. As the code below demonstrates, the correct text output is about 10x slower, and 50% bigger, than the binary output. If you move to floating point values, there are also precision loss issues with using ASCII strings as a data interchange format.
If your aim is to interchange data with Matlab, it's fairly easy to write the data into a format Matlab can read; you can use the matOpen/matPutVariable API from Matlab, or just write it out as an HDF5 array that Matlab can read. Or you can just write out the array in raw Fortran binary as below and have Matlab read it.
If you must use ASCII to write out huge arrays (which, as mentioned, is a bad and slow idea) then you're running into problems with default record lengths in list-directed IO. Best is to generate at runtime a format string which correctly describes your output, and safest on top of this for such large (~5000 character wide!) lines is to set the record length explicitly to something larger than what you'll be printing out so that the Fortran IO library doesn't helpfully break up the lines for you.
In the code below,
WRITE(rowfmt,'(A,I0,A)') '(',numcols,'(1X,I7))'
generates the string rowfmt, which in this case would be (762(1X,I7)); this is the format you'll use for printing out. The field width is I7 because the largest value, 5001*762 = 3810762, has seven digits (a narrower field such as I6 would print asterisks for it). The RECL option to OPEN sets the record length to be something bigger than 8*numcols + 1.
PROGRAM test3
  IMPLICIT NONE
  INTEGER :: i, j, k, numrows, numcols
  INTEGER, DIMENSION(:,:), ALLOCATABLE :: a
  CHARACTER(LEN=30) :: rowfmt
  INTEGER :: txtclock, binclock
  REAL :: txttime, bintime
  numrows=5001
  numcols=762
  ALLOCATE(a(numrows,numcols))
  k=1
  DO i=1,SIZE(a,1)
    DO j=1,SIZE(a,2)
      a(i,j)=k
      k=k+1
    END DO
  END DO
  CALL tick(txtclock)
  ! Build the row format at runtime, e.g. (762(1X,I7))
  WRITE(rowfmt,'(A,I0,A)') '(',numcols,'(1X,I7))'
  OPEN(UNIT=12, FILE="aoutput.txt", ACTION="write", STATUS="replace", &
       RECL=(8*numcols+10))
  DO i=1,numrows
    WRITE(12,FMT=rowfmt) (a(i,j), j=1,numcols)
  END DO
  CLOSE(UNIT=12)
  txttime = tock(txtclock)
  CALL tick(binclock)
  OPEN(UNIT=13, FILE="boutput.dat", ACTION="write", STATUS="replace", &
       FORM="unformatted")
  WRITE(13) a
  CLOSE(UNIT=13)
  bintime = tock(binclock)
  PRINT *, 'ASCII time = ', txttime
  PRINT *, 'Binary time = ', bintime
CONTAINS
  SUBROUTINE tick(t)
    INTEGER, INTENT(OUT) :: t
    CALL system_clock(t)
  END SUBROUTINE tick
  ! returns the time in seconds elapsed since the clock reading t
  REAL FUNCTION tock(t)
    INTEGER, INTENT(IN) :: t
    INTEGER :: now, clock_rate
    call system_clock(now,clock_rate)
    tock = real(now - t)/real(clock_rate)
  END FUNCTION tock
END PROGRAM test3
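When the binary file is meant to be read by another tool, stream access may be more convenient than the default sequential unformatted form, because a sequential unformatted file also contains compiler-specific record markers around the data. The following is a small sketch of that variant, not part of the timing comparison above; the file name boutput_stream.dat and the 4 x 3 demo shape are assumptions.

```fortran
program stream_write
  use, intrinsic :: iso_fortran_env, only: int32
  implicit none
  integer, parameter :: numrows = 4, numcols = 3   ! small demo shape
  integer(int32) :: a(numrows, numcols), b(numrows, numcols)
  integer :: u, i, j, nbytes

  do j = 1, numcols
    do i = 1, numrows
      a(i, j) = (i - 1)*numcols + j
    end do
  end do

  ! access='stream' writes only the array elements, in column-major order,
  ! without the record markers of a sequential unformatted file.
  open(newunit=u, file='boutput_stream.dat', form='unformatted', &
       access='stream', status='replace', action='write')
  write(u) a
  close(u)

  ! The size is reported in file storage units, which are bytes on
  ! common compilers such as gfortran.
  inquire(file='boutput_stream.dat', size=nbytes)
  print *, 'file size: ', nbytes, ' data bytes: ', 4*numrows*numcols

  ! Read the file back to confirm the round trip.
  open(newunit=u, file='boutput_stream.dat', form='unformatted', &
       access='stream', status='old', action='read')
  read(u) b
  close(u)
  print *, 'round trip ok: ', all(a == b)
end program stream_write
```

A reader in another language then only needs to know the element type, the array shape, and that Fortran stores arrays column by column.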

This may be a very roundabout and time-consuming way of doing it, but anyway... You could simply print each array element separately, using advance='no' (to suppress insertion of a newline character after what was being printed) in your write statement. Once you're done with a line you use a 'normal' write statement to get the newline character, and start again on the next line. Here's a small example:
program testing
  implicit none
  integer :: i, j, k
  k = 1
  do i=1,4
    do j=1,10
      ! 1X (not a bare X) is the standard-conforming one-space descriptor
      write(*, '(I2,1X)', advance='no') k
      k = k + 1
    end do
    write(*, *) '' ! this gives you the line break
  end do
end program testing
When you run this program the output is as follows:
 1  2  3  4  5  6  7  8  9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40

Using an "*" is list-directed IO -- Fortran will make the decisions for you, and some behaviors aren't specified. You could gain more control by using a format statement. If you wanted to positively identify row boundaries, you could write a marker symbol after each row. Something like:
DO i=1,numrows
WRITE(12,*) a(i,:)
write (12, '("X")' )
END DO
Addendum several hours later:
Perhaps with large values of numcols the lines are too long for some programs that you are using to examine the file? For the output statement, try:
WRITE(12, '( 10(2X, I11) )' ) a(i,:)
which will break each row of the matrix, if it has more than 10 columns, into multiple, shorter lines in the file.
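If keeping each matrix row on a single line is the goal, another option is the unlimited repeat count "*" in a format, which was introduced with Fortran 2008 and therefore assumes a reasonably recent compiler. This is a minimal sketch with a small demo matrix and the hypothetical file name aoutput_rows.txt; on some compilers an explicit RECL may still be needed for very long lines.

```fortran
program write_rows
  implicit none
  integer, parameter :: numrows = 3, numcols = 5   ! small demo shape
  integer :: a(numrows, numcols)
  integer :: u, i, j
  character(len=64) :: first_line

  do i = 1, numrows
    do j = 1, numcols
      a(i, j) = (i - 1)*numcols + j
    end do
  end do

  open(newunit=u, file='aoutput_rows.txt', action='write', status='replace')
  do i = 1, numrows
    ! '*' repeats the group for every item in the row; ':' stops format
    ! processing after the last item, so no trailing blank is written.
    write(u, '(*(i0, :, 1x))') a(i, :)
  end do
  close(u)

  ! Read the first line back as text to show what was written.
  open(newunit=u, file='aoutput_rows.txt', action='read', status='old')
  read(u, '(a)') first_line
  close(u)
  print '(a)', trim(first_line)
end program write_rows
```

With I0 each number takes only as many characters as it needs, so the columns are separated by single blanks but are not aligned.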
