Fortran memory management and subroutines/functions - arrays

At the moment I am working on code for numerical simulations in Fortran 95. My platform is Windows, and I use the Microsoft Visual Studio environment with the Intel Fortran compiler.
This code, like many in this field, builds a system of equations to be solved. Numerically, this happens by storing a square matrix and a vector of known values. Now, in order to optimize memory, the matrices are stored in a convenient form, such as the compressed sparse row (CSR) format or similar, so that zero values are not stored.
Given this brief introduction, here are my doubts.
Since at compile time I do not know the dimensions of my arrays, I just declare them as:
REAL, DIMENSION(:), ALLOCATABLE :: myArray
and once I retrieve the required size of such a vector, I call
ALLOCATE(myArray(N))
where N is the number of elements that I want to allocate.
At this point the memory holds no values yet; the allocation only reserves the space, and a memory availability check is done in order to avoid an overflow. Is that right?
Now, as I fill it with values, the occupied space ramps up. As I understand it, a Fortran array, whether a 1D vector or a multi-dimensional array, occupies one contiguous block of memory, one slot per value, filled in column-major order. That is to say, a 2D array of dimension 1000x1000 is stored in 1M contiguous "boxes", ordered by column (first the first column is stored, then the second, and so on).
If this is true, so that the underlying data layout is the same, is the index-computation time the only difference in accessing a particular value between a multi-dimensional array and a 1D vector?
Is the RESHAPE command then changing only the way the program "sees" the array?
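For example, this is how I picture the layout and RESHAPE (a small illustrative sketch of my own):
PROGRAM layout_demo
  IMPLICIT NONE
  REAL :: A(2,3), V(6)
  A = RESHAPE((/ 1., 2., 3., 4., 5., 6. /), (/ 2, 3 /))  ! fills column by column
  V = RESHAPE(A, (/ 6 /))                                ! same memory order, new shape
  ! Column-major: A(i,j) corresponds to V(i + (j-1)*2), so A(2,3) is V(6)
  PRINT *, A(2,3), V(6)                                  ! both print 6.0
END PROGRAM layout_demo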
The array I need for my purposes is defined in a module that every subroutine/function shares. In particular, one subroutine allocates and fills it. Back in the main program there is no problem with that, since I display some statistics about it to the user. Let us say we allocated 400M REAL*4 values, with about 1.5 GB of memory used.
However, once I get into another subroutine, the program stops with forrtl: severe(170): Program Exception - Stack Overflow. I ran out of memory. But how can that be, if the matrix is already allocated and I did not allocate anything more? Notice that: the subroutine uses the same module, so the variables are already declared; my RAM still has about 1.3 GB free; and the stop occurs at the first line of the subroutine.
Do subroutines (and functions) duplicate the data? I thought Fortran passed the address of my variables in that case, avoiding copies and working directly on the values.
Finally, like many of you, I enjoyed the C++ standard library functions, such as vector::push_back and so on. Fortran has no such beautiful routines, but some very useful intrinsics are still there: masking an array using WHERE, COUNT or MERGE can help you handle some operations effectively.
However, they are very slow when my matrix is bigger than 1M entries. In that case, even a sequential search-and-substitute loop is faster than creating a mask or using WHERE. How is that possible? Aren't they multithreaded?
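For example, here is a sketch of the kind of operation I mean (using myArray from above, with an INTEGER counter i):
! Replace all negative entries with zero: the masked intrinsic way...
WHERE (myArray < 0.0) myArray = 0.0
! ...versus the equivalent explicit loop, which I find faster on large arrays:
DO i = 1, SIZE(myArray)
  IF (myArray(i) < 0.0) myArray(i) = 0.0
END DO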
Thank you in advance for your patience!! All suggestions are very welcome!!

Comment space is limited, so I am posting this as an answer. Obviously you are running out of stack space, not out of memory. The stack size of the main thread on Windows is fixed at link time (the default is 1 MiB), and any larger stack allocation could result in a stack overflow. This could happen for many reasons, but mainly:
the subroutine that you call uses big stack arrays (e.g. non-ALLOCATABLE arrays);
you pass a non-contiguous array subsection to the subroutine, e.g. myArray(1:10:2), and you don't have an explicit interface for that subroutine. In this case the compiler would make a temporary copy of the data being passed, most likely on the stack, which could exhaust the stack space and trigger the exception.
I would guess the first point is the one relevant to your case, since the exception occurs when you enter the subroutine (probably in the prologue, where stack space for all local variables is being reserved). You might instruct Intel Fortran to enable heap arrays in the project settings (the /heap-arrays option) and see if it helps (I am not sure whether the Windows version enables heap arrays by default or not).
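To illustrate the first point, here is a minimal sketch (the names are mine) of the difference between an automatic array, which typically lives on the stack, and an ALLOCATABLE one, which lives on the heap:
SUBROUTINE work_stack(n)
  INTEGER, INTENT(IN) :: n
  REAL :: tmp(n)               ! automatic array: reserved on the stack at entry;
                               ! a large n can overflow it before any statement runs
  tmp = 0.0
END SUBROUTINE work_stack

SUBROUTINE work_heap(n)
  INTEGER, INTENT(IN) :: n
  REAL, ALLOCATABLE :: tmp(:)  ! allocatable array: taken from the heap
  ALLOCATE(tmp(n))
  tmp = 0.0
  DEALLOCATE(tmp)
END SUBROUTINE work_heap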
Without even a single line of your code shown, it is quite hard to guess the source of the problem and to solve it.

Related

Passing the size of an array as an argument to a subroutine in Fortran

I was wondering about the overhead of querying the size of an array in Fortran. The old Fortran way (before Fortran 90) was to pass the size of the array as an argument to the subroutine:
subroutine asub(nelem,ar)
integer,intent(in)::nelem
real*8,intent(in)::ar(nelem)
! do stuff with nelem such as allocate other arrays
end subroutine asub
Since the size intrinsic of Fortran 90, it can be done this way:
subroutine asub(ar)
real*8,intent(in)::ar(:)
! do stuff with size(ar) such as allocate other arrays
end subroutine asub
Is method 2 bad performance-wise if asub is called a million times?
I am asking because I am working on a relatively big code where some array sizes are global variables (not even passed as subroutine arguments), which is really bad in my opinion. Method 1 would require a lot of work to propagate the array sizes through the whole code, while method 2 is clearly faster to implement in my case.
Thanks!
nelem is a number that you need to read from memory; size(ar) is also a number that you need to read from memory. You need to inquire the value just once, and then you presumably do a lot of computation over nelem elements. The overhead of inquiring the size will be completely negligible.
OK, size(ar) looks like a function call, but the compiler can simply insert a read of the right value from the array descriptor. And even if it remains a function call, it will still be called just once.
Differences, if any, will be elsewhere, mainly as described in the Q/A linked by francescalus, Passing arrays to subroutines in Fortran: Assumed shape vs explicit shape. Depending on what the compiler can assume about the array being contiguous in memory, it will be able to optimize better or worse (e.g. SIMD vectorization).
As always, where performance matters, you should test and measure. Remember to enable all relevant compiler optimizations.
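For instance, here is a sketch (Fortran 2008, illustrative only) of giving the compiler that contiguity guarantee explicitly:
subroutine asub(ar)
  ! contiguous (F2008) promises the dummy argument occupies contiguous
  ! memory, which can enable better vectorization of the loops inside
  real*8, intent(in), contiguous :: ar(:)
  ! size(ar) is read from the array descriptor, effectively for free
  print *, size(ar)
end subroutine asub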

Can we know the length of the pointer returned by mxRealloc or mxMalloc?

Given a pointer returned by mxGetPr or mxRealloc, are we still able to get its length? Since MATLAB manages the memory behind these pointers, does it store metadata for us to query?
Your question is a bit unclear, so let me try to explain the two functions:
mxGetPr is called on an existing mxArray numeric array to retrieve a pointer to its data (to be exact, a pointer to its real double data). If you want to know the length of this data, you can query the original array itself using mxGetNumberOfElements.
mxRealloc and related functions are similar to the standard malloc family of functions available in C. So if you're using them, you know what size the allocations are, since you're the one allocating the memory!
The purpose of mxRealloc and related functions is to allow MATLAB to auto-manage memory to some extent: when a MEX-function returns, MATLAB takes care of releasing any registered heap memory allocated with mxMalloc and the like.
Now, writing good code means you should free your own memory (relying on this automatic memory management can slow things down), but it does come in handy in certain cases (think of throwing an error inside a MEX function without having to rely on ugly goto statements to ensure resources are freed on exit).
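A minimal sketch (untested, but using only standard MEX API calls) of both points, querying the length through the mxArray rather than the raw pointer:
#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* The length lives with the mxArray, not with the raw data pointer. */
    double *data = mxGetPr(prhs[0]);
    size_t n = mxGetNumberOfElements(prhs[0]);

    /* Scratch buffer: MATLAB releases it automatically if an error is thrown. */
    double *tmp = (double *) mxMalloc(n * sizeof(double));
    for (size_t i = 0; i < n; ++i)
        tmp[i] = 2.0 * data[i];

    mxFree(tmp);  /* still good practice to free it yourself */
}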

Shrink memory of an array of pointers, possible?

I am having difficulty finding a possible solution, so I decided to post my question. I am writing a program in C, and:
I am generating a huge array containing a lot of pointers to ints; it is allocated dynamically and filled during runtime, so beforehand I don't know which pointers will be added, or how many. The problem is that there are just too many of them, so I need to shrink the space somehow.
Is there any package or tool available which could encode my entries somehow, or change the representation so that I save space?
Another question: I also thought about writing my information to a file. Is this then kept in memory the whole time, or only when I reopen the file again?
It seems like you are looking for a simple dynamic array (the abstract data type dynamic array, that is). There are many implementations of this out there. You simply start with a small dynamic array and push new items to the back, just as you would with a vector in C++ or Java. One existing implementation is GLib's GArray. You will only allocate the memory you need.
If you have to (or want to) do it manually, the usual method is to store the capacity and the size of the array along with the data pointer in a struct, and call realloc() from within push_back() whenever you need more space, as sketched below. Usually you should grow the array by a factor of 1.3 to 1.4, but a factor of 2 will do if you're not expecting a HUGE array. If you call remove and your size falls below a certain threshold (e.g. capacity/2), you shrink the array again with realloc().
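A minimal sketch (my own names, growth factor 2) of that manual approach:
#include <stdlib.h>

typedef struct {
    int **data;       /* the array of int pointers */
    size_t size;      /* elements in use */
    size_t capacity;  /* elements allocated */
} ptr_array;

/* Append p, growing the buffer by a factor of 2 when full. */
int push_back(ptr_array *a, int *p)
{
    if (a->size == a->capacity) {
        size_t new_cap = a->capacity ? 2 * a->capacity : 16;
        int **tmp = realloc(a->data, new_cap * sizeof *tmp);
        if (!tmp) return -1;  /* out of memory; the old buffer is still valid */
        a->data = tmp;
        a->capacity = new_cap;
    }
    a->data[a->size++] = p;
    return 0;
}
Initialize with ptr_array a = {NULL, 0, 0}; the first push then grows from an empty buffer, since realloc(NULL, n) behaves like malloc(n).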

Working with large local arrays: is there a faster way than malloc?

I am working with large arrays in C for numerical calculations.
Within one of the functions I make use of some temporary local (disposable) arrays. Normally I would just declare these as double v[N] and then not worry about having to free the memory manually.
The problem I have noticed is that when my N gets very large (greater than 4 million), my program fails to create the variable. Hence I resorted to malloc() and free(), which allows me to run the program successfully. The problem is that my execution time almost doubles, from 24 s to 40 s, when I use dynamic allocation (for a smaller N).
It is possible for me to modify the code to avoid creating these temporary arrays in the first place, but it would hurt the readability of the code and require considerable effort. I am currently using a preprocessor macro to access the vector like a 2D matrix. Is there another solution that avoids the CPU cost while still letting me address the data like a matrix?
When you declare a variable local to the function, you are working with automatically allocated variables, which go on the stack, and unfortunately the stack size is limited. Using malloc means that the variable will be allocated on the heap, and the time difference is what you pay for that dynamic allocation.
I see two possible solutions:
use a static global array (and reuse it when necessary), so that the compiler will be able to optimize accesses to it (see the sketch after this list);
change the stack size so that you won't have problems with larger arrays local to your functions; this can even be done dynamically.
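A minimal sketch (illustrative, my own names) of the first option, allocating the scratch space once for the whole run and reusing it:
#include <stddef.h>

#define N 4000000

/* Allocated once in static storage; no per-call malloc/free cost. */
static double scratch[N];

void compute(double *out, const double *in, size_t n)
{
    /* use scratch[] as the temporary workspace instead of a local
       double v[n] (stack) or a malloc'd buffer (per-call overhead) */
    for (size_t i = 0; i < n; ++i)
        scratch[i] = 2.0 * in[i];
    for (size_t i = 0; i < n; ++i)
        out[i] = scratch[i] + in[i];
}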

Does initialization of a 2D array in a C program waste too much time?

I am writing a C program which has to use a 2D array to store previously processed data for later use.
The size of this 2D array is 33x33: matrix[33][33].
I define it as a global variable, so it will be initialized only once. Does this definition cost a lot of time while the program is running? I ask because I found my program has become slower than the previous version, which did not use this matrix to store data.
Additional:
I initialize this matrix as a global variable like this:
int map[33][33];
In one function, A, I need to store all 33x33 data values into this matrix.
In another function, B, I fetch a small 3x3 matrix from map[33][33] for my next step of processing.
The above 2 steps are repeated about 8000 times. So, will this affect the program's running efficiency?
Or, my other guess is that the program has become slower because of a couple of if-else branch statements that were recently added to the program.
How were you doing it before? The only problem I can think of is that extracting a 3x3 sub-matrix from a 33x33 integer matrix is going to cause you caching issues every time you extract the sub-matrix.
On most modern machines the cache line is 64 bytes in size. That's enough for 16 elements of a 4-byte int matrix. So for each extra row of the 3x3 sub-matrix you will be performing a new cache-line fetch. If the matrix gets hammered very regularly, then it will probably sit mostly in the level-2 cache (or maybe even level 1, if that is big enough), but if you are doing lots of other data calculations in between each sub-matrix fetch, then you will be paying 3 expensive cache-line fetches each time you grab the sub-matrix.
However, even then it's unlikely you'd see a HUGE difference in performance. As stated elsewhere, we need to see the before and after code to be able to hazard a guess at why performance has got worse.
Simplifying slightly, there are three kinds of variables in C: static, automatic, and dynamic.
Static variables exist throughout the lifetime of the program, and include both global variables and local variables declared static. They are either initialized to zeroes (the default) or to explicitly specified data. If they are zeroes, the linker places them in fresh memory pages that the operating system initializes to zeroes (this takes a tiny amount of time). If they are explicitly initialized, the linker puts the data into a memory area in the executable and the operating system loads it from there (this requires reading the data from disk into memory).
Automatic variables are allocated on the stack, and if they are initialized, this happens every time they are allocated. (If not, they have no defined value, or perhaps a random value, and so initialization takes no time.)
Dynamic variables are allocated using malloc, and you have to initialize them yourself, and that again takes a bit of time.
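A small sketch (illustrative) of the three kinds:
#include <stdlib.h>

int zeroed[33][33];          /* static: placed in zero-initialized pages      */
int filled[3] = {1, 2, 3};   /* static: data loaded from the executable image */

void f(void)
{
    int automatic[33];       /* automatic: on the stack, no defined value     */
    int *dynamic = malloc(33 * sizeof *dynamic);  /* dynamic: on the heap,
                                                     initialize it yourself   */
    free(dynamic);
}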
It is highly probable that your slowdown is not caused by the initialization. To make sure of this, you should measure it by profiling your program and seeing where the time is spent. Unfortunately, profiling may be difficult for initialization done by the compiler/linker/operating system, especially for the parts that happen before your program starts executing.
If you want to measure how much time it takes to initialize your array, you could write a dummy program that does nothing but includes the array.
However, since 33*33 is a fairly small number, either your matrix items are very large, your computer is very slow, or your 33 is larger than mine.
No, there is no difference in runtime between initializing an array once (with whatever method) and not initializing it.
If you found a difference between your 2 versions, that must be due to differences in the implementation of the algorithm (or a different algorithm).
Well, I wouldn't expect it to (something like that should take much less than a second), but an easy way to find out would be to simply put a print statement at the start of main().
That way you can see if global, static variable initialization is really causing this. Is there anything else in your program that you've changed lately?
EDIT: One way to get a clearer idea of what's taking so long would be to use a debugger like GDB or a profiler like gprof.
If your program accesses the matrix a lot while running (even if it is not being updated at all), the address calculation for an element involves a multiplication by 33. Doing a lot of this could have the effect of slowing down your program.
How did your previous program version store the data, if not in a matrix? How were you able to read a sub-matrix if you did not have the big matrix?
Many answers talk about the time spent initializing, but I don't think that was the question. Anyway, on modern processors initializing such a small array takes just a few microseconds, and it is only done once, at program start.
If you need to fetch a sub-matrix from any position, there is probably no faster method than using a static 2D array. However, depending on the processor architecture, accessing the array could be faster if the array dimensions (or just the last dimension) are a power of 2 (e.g. 32, 64, etc.), since this would allow using a shift instead of a multiply, as sketched below.
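A sketch (illustrative) of that padding:
/* Pad the last dimension to a power of two: element addresses become
   base + (i*64 + j)*sizeof(int), and the multiply by 64 compiles to a shift. */
int map2[33][64];   /* only the first 33 columns are actually used */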
If the accessed sub-matrices do not overlap (i.e. you would only access indexes 0, 3, 6, etc.), then using a 3-dimensional or 4-dimensional array could speed up the access:
int map[11][11][3][3];
This makes each sub-matrix a contiguous block of memory, which can be copied with a single block-copy command.
Further, each sub-matrix may fit in a single cache line.
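A sketch (illustrative) of fetching one tile as a single block copy:
#include <string.h>

int map[11][11][3][3];   /* an 11x11 grid of contiguous 3x3 tiles */

/* Copy the (bi, bj)-th 3x3 tile; each tile is one contiguous 36-byte block. */
void fetch_tile(int dst[3][3], int bi, int bj)
{
    memcpy(dst, map[bi][bj], sizeof(int [3][3]));
}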
Theoretically, using an N-dimensional array shouldn't make a performance difference, as all of them resolve to a contiguous memory reservation by the compiler:
int _1D[1089];
int _2D[33][33];
int _3D[3][11][33];
should give similar allocation/deallocation speed.
You need to benchmark your program. If you don't need the initialization, don't make the variable static, or (maybe) allocate it yourself from the heap using malloc():
mystery_type *matrix;
matrix = malloc(33 * 33 * sizeof *matrix);
