Optimal I/O practice in numerical Fortran - arrays

I have some iterative Fortran code which at each integration step produces some output. What is the best practice in terms of speed/accuracy for getting each of these steps saved to disk?
My current approach involves declaring some large array, at each integration step saving the output to a row of the array, and then finally saving a cropped version of the total array to file. A pseudo-example is shown below.
program IO_example
  implicit none
  integer, parameter :: dp = selected_real_kind(33,4931)
  integer, parameter :: nrows = 1000000, ncols = 6
  real(kind=dp), dimension(nrows,ncols) :: BigDataArray
  real(kind=dp), dimension(ncols) :: RowVector
  real(kind=dp), dimension(:,:), allocatable :: SmallDataArray
  integer :: i !for iterating
  i = 1
  do while (condition) !placeholder for the real stopping test
    !Update RowVector
    BigDataArray(i,:) = RowVector
    i = i+1
  enddo
  !First reallocate to create a smaller array; only i-1 rows were filled
  allocate(SmallDataArray(i-1,ncols))
  SmallDataArray = BigDataArray(1:i-1, :)
  !Now save
  open(unit=10,file='BinaryData',status='replace',form='unformatted')
  write(10) SmallDataArray
  close(10)
end program IO_example
Now this works fine, but is this the best way to do it, or is some other approach more favourable? By best I am particularly referring to speed (how much do writing to the array and writing to file slow down the code), although accuracy issues are also important (I understand these are avoided by writing in binary unformatted. See this StackOverflow answer).
Some potential issues I can foresee are the SmallDataArray being larger than the available RAM (especially in quad precision), so that it cannot be built and written to disk. In addition, the number of iterations could become greater than nrows (in this case I suppose one can just increase nrows, but at what point does this start to impact performance?)
Thanks in advance for any help.

This is probably an extended comment, taking advantage of some formatting, and verges close to an opinion, but there are one or two matters which are amenable to measurement and which you might care to test for yourself.
I'm not sure what role BigDataArray plays in your code, since you don't seem to need all the data in memory after it has been computed. You could probably drop it altogether and simply accumulate results into SmallDataArray. If BigDataArray has 10^6 rows, then maybe give SmallDataArray 10^5 rows, and fill it up 10 times. Or, if you're not certain at the outset how many rows to allocate to Big, then don't, just set Small to 10^5 and fill it up as many times as necessary, exiting when the computation converges.
(And don't get hung up on the numbers I've chosen, the best size for Small is something you probably ought to experiment with.)
Once the code has filled Small write it to file, go back to row 1 and carry on.
If you follow this approach you will eliminate at least a couple of potential performance issues: the repeated allocation of Small (not sure what that's about anyway), and the movement of data when you copy a bunch of rows from Big to Small (which gains you nothing in terms of computation performance and is unnecessary for writing the data to the file).
As you seem to know, the rule when writing data to file (which is very slow computationally) is to write large volumes in one go, but it's difficult to state how large that volume should be without at least some measurements and some testing, so go measure and test.
By dropping Big altogether you remove that burden from the memory while the code runs. And if you do need all of Big at the end of the calculation, you could always read it back in (subject to memory being available of course).
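The fill-and-flush idea above can be sketched as follows. This is a minimal illustration in C, not a translation of the asker's program: compute_row is a hypothetical stand-in for one integration step, run_and_save and its parameters are invented names, and the file is raw binary (C's fwrite does not add the record markers that Fortran unformatted sequential files carry).

```c
#include <stdio.h>
#include <stddef.h>

#define NCOLS 6   /* values per step, as in the question */

/* Hypothetical stand-in for one integration step: fills one output row. */
static void compute_row(size_t step, double *row)
{
    for (size_t c = 0; c < NCOLS; ++c)
        row[c] = (double)step + 0.1 * (double)c;
}

/* Runs total_steps steps, holding at most chunk_rows rows in buf
   (buf must have room for chunk_rows * NCOLS doubles). Each time the
   buffer fills, or the run ends, the filled rows go to disk in a single
   fwrite and the buffer is reused from row 0.
   Returns the number of rows written, or -1 on error. */
long run_and_save(const char *path, size_t total_steps,
                  double *buf, size_t chunk_rows)
{
    if (buf == NULL || chunk_rows == 0)
        return -1;

    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    size_t filled = 0;   /* rows currently held in buf */
    long written = 0;    /* rows already flushed to disk */

    for (size_t step = 0; step < total_steps; ++step) {
        compute_row(step, &buf[filled * NCOLS]);
        ++filled;

        /* Flush when the buffer is full or this was the last step. */
        if (filled == chunk_rows || step + 1 == total_steps) {
            if (fwrite(buf, sizeof(double) * NCOLS, filled, fp) != filled) {
                fclose(fp);
                return -1;
            }
            written += (long)filled;
            filled = 0;
        }
    }

    return fclose(fp) == 0 ? written : -1;
}
```

The chunk size is the knob to experiment with: calling run_and_save with different chunk_rows values and timing each run is one way to do the measuring suggested above.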
Finally, let me get some retaliation in first: if your response to this 'answer' is something akin to "Oh, that doesn't answer my real question, it only answers the simplified question I asked but I have all these other issues to consider would you mind taking a look at these too ..." then you can take it that my response to that is (a) unprintable and (b) boils down to "Yes, I would mind".

Related

Counting FLOPs and size of data and check whether function is memory-bound or cpu-bound

I am going to analyse and optimize some C code, so I first have to check whether the functions I want to optimize are memory-bound or CPU-bound. In general I know how to do this, but I have some questions about counting floating point operations and analysing the size of the data that is used. Look at the following for-loop, which I want to analyse. The values of the array are doubles (that is, 8 bytes each):
for(int j=0 ;j<N;j++){
for(int i=1 ;i<Nt;i++){
matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i]*sigma;
}
}
1) How many floating point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations within the arrays, too (matrix[j*Nt+i], which are 2 more FLOP for this array)?
2) How much data is transferred? 2*((Nt-1)*N)*8 Byte or 3*((Nt-1)*N)*8 Byte? I mean, every entry of the matrix has to be loaded. After the calculation, the new value is saved to that index of the array (so that is 1 load and 1 store). But this value is used for the next calculation. Is another load operation needed for that, or is this value (matrix[j*Nt+i-1]) already available without a load operation?
Thx a lot!!!
With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code is actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like you ask about 2*... or 3*...) are completely moot because they vary in practice depending on lots of details.
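As a rough sketch of the "measure it" advice, the loop from the question can be wrapped in a function and timed, and a nominal operation count divided by the elapsed time. The function names here are made up for illustration, and the count of 3 per inner iteration (two multiplies and one add) is only a bookkeeping convention: the index arithmetic in matrix[j*Nt+i] is integer work rather than floating point, and the compiler may fuse or reorder the floating point operations.

```c
#include <stddef.h>
#include <time.h>

/* The loop from the question: each row is a first-order recurrence. */
void recurrence(double *matrix, size_t N, size_t Nt, double mu, double sigma)
{
    for (size_t j = 0; j < N; ++j)
        for (size_t i = 1; i < Nt; ++i)
            matrix[j * Nt + i] = matrix[j * Nt + i - 1] * mu
                               + matrix[j * Nt + i] * sigma;
}

/* Nominal floating point operations: 2 multiplies + 1 add per inner
   iteration, with N * (Nt - 1) inner iterations in total. */
double nominal_flops(size_t N, size_t Nt)
{
    return Nt > 0 ? 3.0 * (double)N * (double)(Nt - 1) : 0.0;
}

/* CPU seconds taken by one call of recurrence(), measured with clock(). */
double time_recurrence(double *matrix, size_t N, size_t Nt,
                       double mu, double sigma)
{
    clock_t start = clock();
    recurrence(matrix, N, Nt, mu, sigma);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Dividing nominal_flops(N, Nt) by the time measured on a realistically large matrix gives an achieved rate that can be compared with the machine's peak; clock() is coarse, so very small inputs will simply report zero seconds.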

How much efficiency would be lost if a hash table is implemented with a 2d array but the second dimension of the array is never accessed?

I need to make a hash table that can eventually be used to write a full assembler.
Basically I will have something like:
foo 100,
and I will need to hash foo and then store the 100 (the address of the command). I was thinking I should just use a 2d array. The second dimension of the array would only be accessed when recording the address (just an int) or when returning the address. There would be no searching done in the second dimension.
If I implement the hash table this way, would it be inefficient? If it is very inefficient, what would be a better way to implement the table?
Edit: I haven't written any code yet. In fact I don't even know what language I'm going to use yet. I want to write it in C so it will be more of a challenge, but I might write it in Java if I feel pressured for time.
If you have every other int in the array unused then in addition to memory waste you're going to use the cache poorly as the cache lines will be underused.
But normally I wouldn't worry about such things when writing an assembler as it's not something very performance demanding as say graphics or heavy computations. At least, I wouldn't rush into optimizing too early.
It is, however, important to keep in mind that once you start assembling large pieces of code (~100,000 lines of assembly) generated automatically (say, from C/C++ code by a compiler), performance will become more and more important as the user experience (wait times) degrades. At that point there will be many candidates for optimization: I/O, parsing, symbol look up, generation of as short as possible jump instructions if they can have multiple encodings for shorter and longer jumps. Expressions and macros will contribute too. You may even consider minimizing white space and comments in the input assembly code in the first place.
Without being able to see any code, there is no reason that this would have to be inefficient. The only way it could be is if you preallocated a bunch of memory that you did not end up using; however, without seeing the algorithm you have in mind, it is impossible to tell.
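For comparison, here is one possible layout that avoids unused slots: each table entry is a struct holding the label and its address side by side, with linear probing on collisions. This is only a sketch under assumptions; the capacity, the name-length limit, the djb2 hash and all identifiers are choices made for the example rather than anything from the question.

```c
#include <string.h>

#define TABLE_SIZE   1024   /* hypothetical capacity; keep it above the label count */
#define NAME_MAX_LEN 32     /* hypothetical limit, including the terminating '\0' */

struct symbol {
    char name[NAME_MAX_LEN];   /* label text; an empty string marks a free slot */
    int  address;              /* address recorded for the label */
};

struct symtab {
    struct symbol slots[TABLE_SIZE];
};

/* djb2 string hash. */
static unsigned long hash_name(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Marks every slot as free. */
void symtab_init(struct symtab *t)
{
    memset(t, 0, sizeof *t);
}

/* Inserts a label or updates its address. Returns 0 on success,
   -1 if the name is empty or too long, or the table is full. */
int symtab_put(struct symtab *t, const char *name, int address)
{
    size_t len = strlen(name);
    if (len == 0 || len >= NAME_MAX_LEN)
        return -1;

    size_t start = hash_name(name) % TABLE_SIZE;
    for (size_t k = 0; k < TABLE_SIZE; ++k) {
        struct symbol *s = &t->slots[(start + k) % TABLE_SIZE];
        if (s->name[0] == '\0' || strcmp(s->name, name) == 0) {
            memcpy(s->name, name, len + 1);
            s->address = address;
            return 0;
        }
    }
    return -1;
}

/* Looks a label up. Returns 1 and stores its address on a hit, 0 on a miss. */
int symtab_get(const struct symtab *t, const char *name, int *address)
{
    if (name[0] == '\0')
        return 0;

    size_t start = hash_name(name) % TABLE_SIZE;
    for (size_t k = 0; k < TABLE_SIZE; ++k) {
        const struct symbol *s = &t->slots[(start + k) % TABLE_SIZE];
        if (s->name[0] == '\0')
            return 0;   /* reached a free slot: the label is absent */
        if (strcmp(s->name, name) == 0) {
            *address = s->address;
            return 1;
        }
    }
    return 0;
}
```

For the example in the question, symtab_put(&table, "foo", 100) records the label and symtab_get(&table, "foo", &addr) later returns it. Entries are never deleted here, which is what keeps the simple probing loop valid.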

Is it possible to create a float array of 10^13 elements in C?

I am writing a program in C to solve an optimisation problem, for which I need to create an array of type float with on the order of 10^13 elements. Is it practically possible to do so on a machine with 20GB memory?
A float in C occupies 4 bytes (assuming IEEE floating point arithmetic, which is pretty close to universal nowadays). That means 10^13 elements are naïvely going to require 4×10^13 bytes of space. That's quite a bit (40 TB, a.k.a. quite a lot of disk for a desktop system, and rather more than most people can afford when it comes to RAM) so you need to find another approach.
Is the data sparse (i.e., mostly zeroes)? If it is, you can try using a hash table or tree to store only the values which are anything else; if your data is sufficiently sparse, that'll let you fit everything in. Also be aware that processing 10^13 elements will take a very long time. Even if you could process a billion items a second (very fast, even now) it would still take 10^4 seconds (several hours) and I'd be willing to bet that in any non-trivial situation you'll not be able to get anything near that speed. Can you find some way to make not just the data storage sparse but also the processing, so that you can leave that massive bulk of zeroes alone?
Of course, if the data is non-sparse then you're doomed. In that case, you might need to find a smaller, more tractable problem instead.
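The arithmetic behind those figures can be checked with a few lines of C. The helper names are hypothetical, the element size is passed in explicitly (4 bytes for the float assumed above), and 20 GB is taken as 20×10^9 bytes for simplicity.

```c
#include <stdint.h>

/* Bytes needed for count elements of elem_size bytes each.
   The caller is assumed to pass values whose product fits in 64 bits. */
uint64_t bytes_needed(uint64_t count, uint64_t elem_size)
{
    return count * elem_size;
}

/* Returns 1 if count elements of elem_size bytes fit in ram_bytes, else 0.
   Uses a division so that the comparison itself cannot overflow. */
int fits_in_memory(uint64_t count, uint64_t elem_size, uint64_t ram_bytes)
{
    if (count == 0 || elem_size == 0)
        return 1;
    return count <= ram_bytes / elem_size;
}

/* Seconds needed to touch count items at a steady items_per_second. */
double seconds_at_rate(uint64_t count, double items_per_second)
{
    return (double)count / items_per_second;
}
```

With count = 10^13 and elem_size = 4, bytes_needed gives 4×10^13 bytes (40 TB), fits_in_memory reports that this does not fit in 20 GB, and seconds_at_rate at 10^9 items per second gives 10^4 seconds, a little under three hours.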
I suppose if you had a 64 bit machine with a lot of swap space, you could just declare an array of size 10^13 and it may work.
But for a data set of this size it becomes important to consider carefully the nature of the problem. Do you really need random access read and write operations for all 10^13 elements? Is the array at all sparse? Could you express this as a map/reduce problem? If so, sequential access to 10^13 elements is much more practical than random access.

Nested for loops extremely slow in MATLAB (preallocated)

I am trying to learn MATLAB and one of the first problems I encountered was to guess the background from an image sequence with a static camera and moving objects. For a start I just want to do a mean or median on pixels over time, so it's just a single function I would like to apply to one of the rows of the 4 dimensional array.
I have loaded my RGB images in a 4 dimensional array with the following dimensions:
uint8 [ num_images, width, height, RGB ]
Here is the function I wrote which includes 4 nested loops. I use preallocation but still, it is extremely slow. In C++ I believe this function could run at least 10x-20x faster, and I think on CUDA it could actually run in real time. In MATLAB it takes about 20 seconds with the 4 nested loops. My stack is 100 images with 640x480x3 dimensions.
function background = calc_background(stack)
tic;
si = size(stack,1);
sy = size(stack,2);
sx = size(stack,3);
sc = size(stack,4);
background = zeros(sy,sx,sc);
A = zeros(si,1);
for x = 1:sx
for y = 1:sy
for c = 1:sc
for i = 1:si
A(i) = stack(i,y,x,c);
end
background(y,x,c) = median(A);
end
end
end
background = uint8(background);
disp(toc);
end
Could you tell me how to make this code much faster? I have tried experimenting with somehow getting the data directly from the array using only the indexes and it seems MUCH faster. It completes in 3 seconds vs. 20 seconds, so that’s a 7x performance difference, just by writing a smaller function.
function background = calc_background2(stack)
tic;
% bad code, confusing
% background = uint8(squeeze(median(stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ))));
% good code (credits: Laurent)
background = uint8(squeeze(median(stack,1)));
disp(toc);
end
So now I don't understand if MATLAB could be this fast then why is the nested loop version so slow? I am not making any dynamic resizing and MATLAB must be running the same 4 nested loops inside.
Why is this happening?
Is there any way to make nested loops run fast, like it would happen naturally in C++?
Or should I get used to the idea of programming MATLAB in this crazy one line statements way to get optimal performance?
Update
Thank you for all the great answers, now I understand a lot more. My original code with stack(:, 1:size(stack,2), 1:size(stack,3), 1:3 ) didn't make any sense; it is exactly the same as stack. I was just lucky with median's default option of using the 1st dimension for its working range.
I think it's better to ask how to write efficient code in a separate question, so I asked it here:
How to write vectorized functions in MATLAB
If I understand your question, you're asking why Matlab is faster for matrix operations than for procedural programming calls. The answer is simply that that's how it's designed. If you really want to know what makes it that way, you can read this newsletter from Matlab's website which discusses some of the underlying technology, but you probably won't get a great answer, as the software is proprietary. I also found some relevant pages by simply googling, and this old SO question also seems to address your question.
Matlab is an interpreted language, meaning that it must evaluate each line of code of your script.
Evaluating is a lengthy process since it must parse, 'compile' and interpret each line*.
Using for loops with simple operations means that matlab takes far more time parsing/compiling than actually executing your code.
Builtin functions, on the other hand are coded in a compiled language and heavily optimized. They're very fast, hence the speed difference.
Bottom line: we're very used to procedural language and for loops, but there's almost always a nice and fast way to do the same things in a vectorized way.
* To be complete and to pay honour to whom honour is due: recent versions of Matlab actually try to accelerate loops by analyzing repeated operations and compiling chunks of repetitive operations into native executable code. This is called Just In Time compilation (JIT) and was pointed out by Jonas in the following comments.
Original answer:
If I understood well (and you want the median of the first dimension) you might try:
background = uint8(squeeze(median(stack,1)));
Well, the difference between both is their method of executing code. To sketch it very roughly: in C you feed your code to a compiler which will try to optimize your code or at any rate convert it to machine code. This takes some time, but when you actually execute your program, it is in machine code already and therefore executes very fast. Your compiler can take a lot of time trying to optimize the code for you; in general you don't care whether it takes 1 minute or 10 minutes to compile a distribution-ready program.
MATLAB (and other interpreted languages) don't generally work that way. When you execute your program, an interpreter will interpret each line of code and transform it into a sequence of machine code on the fly. This is a bit slower if you write for-loops as it has to interpret the code over and over again (at least in principle, there are other overheads which might matter more for the newest versions of MATLAB). Here the hurdle is the fact that everything has to be done at runtime: the interpreter can perform some optimizations, but it is not useful to perform time-consuming optimizations that might increase performance by a lot in some cases as they will cause performance to suffer in most other cases.
You might ask what you gain by using MATLAB? You gain flexibility and clear semantics. When you want to do a matrix multiplication, you just write it as such; in C this would yield a double for loop. You have to worry very little about data types, memory management, ...
Behind the scenes, MATLAB uses compiled code (Fortran/C/C++ if I'm not mistaken) to perform large operations: so a matrix multiplication is really performed by a piece of machine code which was compiled from another language. For smaller operations, this is the case as well, but you won't notice the speed of these calculations as most of your time is spent in management code (passing variables, allocating memory, ...).
To sum it all up: yes you should get used to such compact statements. If you see a line of code like Laurent's example, you immediately see that it computes a median of stack. Your code requires 11 lines of code to express the same, so when you are looking at code like yours (which might be embedded in hundreds of lines of other code), you will have a harder time understanding what is happening and pinpointing where a certain operation is performed.
To argue even further: you shouldn't program in MATLAB in the same way as you'd program in C/C++; nor should you do the other way round. Each language has its stronger and weaker points, learn to know them and use each language for what it's made for. E.g. you could write a whole compiler or webserver in MATLAB but in general that will be really slow as MATLAB was not intended to handle or concatenate strings (it can, but it might be very slow).

Array index efficiency (specifically, temporary vars)

I've read a fair amount of stuff about efficiency of array indices vs. pointers, and how it doesn't really matter unless you're doing something a lot. However, I am doing this a lot.
The code in question has an array of structs. (Two different ones, for two different types actually, but whatever). Since my background is mostly in higher level languages, I defaulted to using a standard particles[i].whatever format. However, I'm not sure if that's a good idea. For a single access, I know it doesn't matter much, but as it stands now, one of my two main functions calls particles[i].something 8 times, and boxes[boxnum].something 4 times per particle, per iteration.
Currently it takes roughly a second to do 5000 particles and 5000 iterations. This means that I'm dealing with these accesses upwards of [including the other function] 200 million times a second. At that frequency, every little bit matters (especially since I'll end up running this code on time on someone else's cluster).
So my question is if it's worth it to do something along the lines of using a pointer to the struct instead of the array access, if gcc will magically do that for me, or if it really doesn't matter.
Thanks
~~Zeb
EDIT: OK, so compiler magic means I shouldn't worry about it. Thanks.
You suggest a profiler, but I can't seem to make gprof tell me any finer-grained information than the time functions take... and I already know that. Is there anything that'll tell me that on a line-by-line basis?
If you're iterating through the array sequentially, then you might benefit from using a pointer that increments over each loop. There is a slight bit less arithmetic than dereferencing an array. However the compilers are pretty good at optimizing things so you might not see any gain.
Best to run a profiler to see where the problem actually lies. You may be surprised at how far you are from guessing the bottleneck.
I don't think you will see any difference. At the very most you are dropping one integer multiplication per array access, and in all probability gcc will recognise what's happening and optimize it all to pointer arithmetic anyway.
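To make the comparison concrete, here are the two styles side by side in a small sketch. The struct and its fields are hypothetical stand-ins for the asker's particle data; both functions perform the same update.

```c
#include <stddef.h>

/* Hypothetical particle record; the real struct is not shown in the question. */
struct particle {
    double x;   /* position */
    double v;   /* velocity */
};

/* Array-index style, as in the question: particles[i].something. */
void advance_indexed(struct particle *particles, size_t n, double dt)
{
    for (size_t i = 0; i < n; ++i)
        particles[i].x += particles[i].v * dt;
}

/* Explicit pointer style: one pointer stepped through the array. */
void advance_pointer(struct particle *particles, size_t n, double dt)
{
    struct particle *end = particles + n;
    for (struct particle *p = particles; p != end; ++p)
        p->x += p->v * dt;
}
```

One way to see whether the two really differ on a given setup is to compile both with optimization enabled (for example gcc -O2 -S) and compare the generated assembly, or simply to time them on realistic particle counts.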
If you want to improve the running time of your algorithm you should probably try to improve the algorithm per se, not your implementation (unless you have something obviously not optimized).
For example: are there any steps you could skip? Are there any values you could store for the next iteration? This would take more memory, but could save you a lot of time!
