Fortran global work array vs. local dynamically allocated arrays

I am working with an older F77 code that has been upgraded to F9X. It still has some of the older "legacy" code structure, and I'm curious about the performance implications of adding code in the legacy way versus the modern way. We have a separate F9X code that we are trying to integrate into this older code, and we want to use as many of its procedures as possible instead of rewriting our own versions. Also note: assume that none of these procedures are explicitly interfaced.
Specifically, the old code has one large rank-1 work array that is allocated in the main program and as this array is passed deeper into procedures, it is split apart and used where it is needed. Essentially there is one allocation/deallocation and the only overhead with this array involves finding the starting indices (trivial) of needed temporary arrays and passing these sections of the work array into the procedure.
Our new code generally uses lower-level procedures from the old code, in which multiple dummy arrays originally came from the older code's global work array. Instead of the hassle of creating our own work array, finding starting indices, and passing all these array sections with their starting indices, I could just create dynamically allocated arrays where they are needed. However, these procedures can be called thousands (possibly millions for some lower-level routines) of times during execution, and I am concerned about the overhead of allocating and deallocating every time one of these procedures is used. Also, these temporary arrays could contain many millions of double precision elements.
I've also dabbled with automatic arrays but stopped when I started encountering stack overflow issues, and now I almost exclusively use dynamic arrays. I've heard different things about the stack and heap with regard to how memory for different kinds of arrays is stored, but I really don't know the difference or which is better (performance, efficiency, etc.).
Long story short: are these dynamically allocated (or automatic) arrays going to be significantly less efficient due to overhead? I also realize that dynamically allocated arrays are more robust over the life of the code, but what I am really after is performance. A 5% performance gain could mean many hours saved in code execution.
I realize I might not get a definitive answer to this due to differences in compiler optimizations and other factors but I'm curious if anyone might have some knowledge/experience with anything similar. Thanks for your help.

I think that any answers are going to be guesses and speculation. My guess: array creation is going to be a very low CPU load. Unless these subroutines do a negligible amount of computation, the differing overhead of the different array types won't be noticeable. But the only way to be sure is to try the two methods and time them, e.g., with the Fortran intrinsic cpu_time.
Automatic arrays are usually placed on the stack, but some compilers place large automatic arrays on the heap, and some compilers have an option to change this behavior. Allocatable arrays are probably on the heap.
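For what it's worth, the trade-off is easy to sketch in C even though your code is Fortran (all names here are hypothetical): one big workspace handed out by offsets, versus a fresh heap allocation inside each hot routine. Timing both under your real call counts is the only way to know which wins.
#include <stdlib.h>

/* Strategy 1: one workspace allocated once; procedures receive
   slices of it identified by a starting offset (no per-call malloc). */
static double *work;          /* allocated once at startup */
static size_t next_free = 0;  /* bump offset into the workspace */

void work_init(size_t total)
{
    work = malloc(total * sizeof *work);  /* the one allocation */
}

double *work_slice(size_t n)
{
    double *p = work + next_free;  /* just pointer arithmetic */
    next_free += n;                /* caller must track/reset this */
    return p;
}

/* Strategy 2: allocate and free inside the routine on every call. */
void hot_routine(size_t n)
{
    double *tmp = malloc(n * sizeof *tmp);  /* per-call heap traffic */
    /* ... compute with tmp ... */
    free(tmp);
}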

Related

What is the most suitable alternative for Linked List?

I am working in embedded C on a task-related implementation in an OS, and I have implemented a linked list. It now needs to minimize the use of pointers to satisfy MISRA C, so I am searching for the best alternative to a linked list for task operations in an embedded OS.
It'd be easy to use a static array of structures to completely avoid pointers (you'd just use array indexes and not pointers); a sketch follows the two lists below. This has both advantages and disadvantages.
The disadvantages are:
you have to implement your own allocator (to allocate and free "array elements" within the static array)
the memory used for the array can't be used for any other purpose when it's not being used for the linked list
you have to determine a "max. number of elements that could possibly be needed"
it has all the same problems as pointers. E.g. you can access an array element that was freed, free the same array element multiple times, use an index that's out of bounds (including the equivalent of NULL if you decide to do something like use -1 to represent NULL_ELEMENT), etc.
The advantages are:
by implementing your own allocator you can catch mistakes that malloc() won't, e.g. checking that something isn't already free when freeing it and returning an error instead of trashing your own metadata
allocation can typically be simpler/faster, because you're only allocating/freeing one "thing" (array element) at a time and don't need to worry about allocating/freeing a variable number of contiguous "things" (bytes) at a time
entries in your list are more likely to be closer (in memory) to each other (unlike for malloc() where your entries are scattered among everything else you allocate), and this can improve performance (cache locality)
you have a "max. number of elements that could possibly be needed" to make it far easier to track down problems like (e.g.) memory leaks; and (where memory is limited) make it easier to determine things like worst case memory footprint
it satisfies pointless requirements (like "no pointers") despite not avoiding anything these requirements are intended to avoid
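For example, here is a minimal sketch of such an index-based allocator (the sizes and names are made up); the free list is threaded through the array itself using indexes, so no pointers appear anywhere:
#define MAX_TASKS    32
#define NULL_ELEMENT (-1)

struct task_node {
    int data;   /* payload (whatever the task list stores) */
    int next;   /* index of next node, or NULL_ELEMENT     */
};

static struct task_node pool[MAX_TASKS];
static int free_head = NULL_ELEMENT;

/* Chain every element into the free list once at startup. */
void pool_init(void)
{
    int i;
    for (i = 0; i < MAX_TASKS; i++) {
        pool[i].next = free_head;
        free_head = i;
    }
}

/* Returns an index, or NULL_ELEMENT when the pool is exhausted. */
int pool_alloc(void)
{
    int i = free_head;
    if (i != NULL_ELEMENT) {
        free_head = pool[i].next;
    }
    return i;
}

void pool_free(int i)
{
    pool[i].next = free_head;
    free_head = i;
}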
Now it needs to minimize the use of pointers to satisfy MISRA C
I used to work with some embedded engineers. They built low-end (and high-end) routers and gateways. Rather than dynamically allocating memory, they used fixed buffers provisioned at boot. They then tracked indexes into the array of provisioned buffers.
Static arrays and indexes beg for a Cursor data structure. Your first search hit is Cursor Implementation of Linked Lists from
Data Structures and Algorithm Analysis in C++, 2nd ed. by Mark Weiss. (I actually used that book in college years ago).

How is conditional initialization handled and is it a good practice?

I am trying to decide between several possible practices. Say my function has a number of if() blocks that work on data unique to each block.
Should I declare and initialize the local (for the block) data inside the block? Does this have runtime performance cost (due to runtime allocation in the stack)?
Or should I declare and/or initialize all variables at function entry, so that it is done in one, possibly faster, block of operations?
Or should I separate the if() blocks into different functions, even though they are only a couple of lines long and each used only once in the program?
Or am I overlooking another, cleaner option? Is the question even answerable in its current, general form?
Should I declare and initialize the local (for the block) data inside the block?
Absolutely: this tends to make programs more readable.
Does this have runtime performance cost (due to runtime allocation in the stack)?
No: all allocations are done upfront - the space on the stack is reserved for variables in all branches upon entering a function, not when the branch is entered. Moreover, this could even save you some space, because the space allocated for variables in non-overlapping branches can be reused by the compiler.
Or should I declare and/or initialize all variables at function entry, so that it is done in one, possibly faster, block of operations?
No, this is not faster, and could be slightly more wasteful.
Or should I separate the if() blocks into different functions, even though they are only a couple of lines long and each used only once in the program?
That would probably have a negative impact on readability of your program.
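To illustrate (with hypothetical names): both of the following variables are typically carved out of one stack adjustment at function entry, and the compiler may even give them the same slot since the blocks don't overlap:
void process(int mode)
{
    if (mode == 1) {
        int counter = 0;     /* exists only in this block */
        /* ... work with counter ... */
    }
    if (mode == 2) {
        double ratio = 1.0;  /* may reuse counter's stack space */
        /* ... work with ratio ... */
    }
}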
It's good practice to keep the scope of a variable as small as possible. If you declare all the variables once at the beginning of the function and don't use most of them on a given path, you gain nothing and may consume more memory. Another advantage of keeping scopes small is that you can reuse the same names (you don't have to invent new names each time you do something trivial).
Of the options you state, declaring and initializing the data local to each block inside that block is what will serve your purpose. Forget the rest.
For completeness: another, usually less important consideration is stack padding control / packing, which is intuitively more difficult if you don't declare everything upfront.
See this for more information, although let me emphasize the following paragraph before anyone does anything crazy:
Usually, for the small number of scalar variables in your C programs, bumming out the few bytes you can get by changing the order of declaration won't save you enough to be significant. The technique becomes more interesting when applied to nonscalar variables - especially structs.
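To make that concrete, a small illustration (sizes are typical for a 64-bit ABI with 8-byte-aligned doubles, not guaranteed):
struct padded {       /* typically 24 bytes: 1 + 7 padding + 8 + 1 + 7 padding */
    char   a;
    double b;
    char   c;
};

struct reordered {    /* typically 16 bytes: 8 + 1 + 1 + 6 padding */
    double b;
    char   a;
    char   c;
};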
Now the answer concerning performance.
Should I declare and initialize the local (for the block) data inside the block? Does this have runtime performance cost (due to runtime allocation in the stack)?
Allocation of local variables is practically free. In most cases it will really be free, because the update of the stack pointer is performed by the same instruction that writes the value to the stack. Deallocation is either free as well (when something is popped off the stack) or done once at return (when a stack frame has been created).
Or should I declare and/or initialize all variables at function entry, so that it is done in one, possibly faster, block of operations?
While allocation is virtually free, running constructors/destructors is not. While this does not apply to variables of primitive types, it applies to virtually all user-defined types, including smart pointers and the like. If you declare a smart pointer at the beginning of the function but only use it half of the time, you construct, and subsequently destruct, the smart pointer twice as often as needed.
Also, if you declare a variable where you have the information to initialize it to your needs, you can construct it directly in the state you want it to have, instead of first default-constructing it only to change its value afterwards (using the assignment operator in many cases). So, from a performance perspective, you should always declare variables late and only in the blocks that need them.
Or should I separate the if() blocks into different functions, even though they are only a couple of lines long and each used only once in the program?
No, this is completely counterproductive from a performance perspective. Each function call has overhead; I think it's between 10 and 20 cycles most of the time. You can do quite a bit of calculation in that time.

Memory management in Lua

I am making a level editor for a simple game in Lua, and the tiles are represented by integers in a 2D array. When I read the level description in from a file, it may happen that this 2D array is sparsely populated. How does Lua manage memory? Will it keep those holes in the array, or will it be smart about it and not waste any space?
The question itself is irrelevant in a practical sense. You have one of two cases:
Your tilemaps are reasonably small.
Your tilemaps are big enough such that compression is important for fitting in memory constraints.
If #1 is the case, then you shouldn't care. It doesn't matter how memory efficient Lua is or isn't, because your tilemaps aren't big enough for it to ever matter.
If #2 is the case, then you shouldn't care either. Why? Because if fitting in memory is important to you, and you're likely to run out, then you shouldn't leave it to the vagaries of how Lua happens to manage the memory for arrays.
If memory is important, you should build a specialized data structure that Lua can use, but is written in C. That way, you can have explicit control over memory management; your tilemaps will therefore take up as much or as little memory as you choose for them to.
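As a rough sketch of what that might look like (plain Lua C API, Lua 5.1-style; the module name and the dense int-per-tile layout are my assumptions, not a recommendation):
#include <string.h>
#include <lua.h>
#include <lauxlib.h>

typedef struct {
    int w, h;
    int tiles[1];   /* actually w*h entries, sized in the allocation below */
} Tilemap;

/* tilemap.new(w, h) -> userdata holding a dense w*h grid of ints */
static int tilemap_new(lua_State *L)
{
    int w = (int)luaL_checkinteger(L, 1);
    int h = (int)luaL_checkinteger(L, 2);
    size_t bytes = sizeof(Tilemap) + ((size_t)w * h - 1) * sizeof(int);
    Tilemap *tm = (Tilemap *)lua_newuserdata(L, bytes);
    tm->w = w;
    tm->h = h;
    memset(tm->tiles, 0, (size_t)w * h * sizeof(int));
    return 1;   /* the userdata is left on the Lua stack */
}

int luaopen_tilemap(lua_State *L)
{
    lua_newtable(L);
    lua_pushcfunction(L, tilemap_new);
    lua_setfield(L, -2, "new");
    return 1;
}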
As for the actual question, it rather depends on how you build your "array". Lua tables are associative arrays by nature, but their implementation is split between an "array part" and a "table part". In general though, if you store elements sparsely, then the elements will be sparsely stored in memory (for some definition of "sparse"). As long as you don't do something silly like:
for i = 1, max_table_size do
    my_tilemap[i] = 0
end
Then again, you may want to do that for performance reasons. This ensures that you have a big array rather than a sparse table. Since the array elements are references rather than values, they only take up maybe 16 bytes per element. Once you decide to put something real in an entry (an actual tile), you can. Indexing into the array would be fast in this case, though since the table part is a hash-table, it's not exactly slow.

In C, does using static variables in a function make it faster?

My function will be called thousands of times. If I want to make it faster, will changing the local function variables to static be of any use? My logic is that, because static variables are persistent between function calls, they are allocated only the first time, and thus every subsequent call will not allocate memory for them and will become faster, because the memory allocation step is not done.
Also, if the above is true, then would using global variables instead of parameters be faster for passing information to the function each time it is called? I think space for parameters is also allocated on every function call, to allow for recursion (that's why recursion uses more memory), but since my function is not recursive, if my reasoning is correct, then removing the parameters should in theory make it faster.
I know these things I want to do are horrible programming habits, but please tell me whether it is wise. I am going to try it anyway, but please give me your opinion.
The overhead of local variables is zero. Each time you call a function, you are already setting up the stack for the parameters, return values, etc. Adding local variables means that you're adding a slightly bigger number to the stack pointer (a number which is computed at compile time).
Also, local variables are probably faster due to cache locality.
If you are only calling your function "thousands" of times (not millions or billions), then you should be looking at your algorithm for optimization opportunities after you have run a profiler.
Re: cache locality (read more here):
Frequently accessed global variables probably have temporal locality. They also may be copied to a register during function execution, but will be written back into memory (cache) after a function returns (otherwise they wouldn't be accessible to anything else; registers don't have addresses).
Local variables will generally have both temporal and spatial locality (they get that by virtue of being created on the stack). Additionally, they may be "allocated" directly to registers and never be written to memory.
The best way to find out is to actually run a profiler. This can be as simple as executing several timed tests using both methods and then averaging out the results and comparing, or you may consider a full-blown profiling tool which attaches itself to a process and graphs out memory use over time and execution speed.
Do not perform random micro code-tuning because you have a gut feeling it will be faster. Compilers all have slightly different implementations of things and what is true on one compiler on one environment may be false on another configuration.
To tackle that comment about fewer parameters: the process of "inlining" functions essentially removes the overhead related to calling a function. Chances are a small function will be automatically in-lined by the compiler, but you can suggest a function be inlined as well.
In a different language, C++, the upcoming standard (C++11) supports perfect forwarding and move semantics with rvalue references, which removes the need for temporaries in certain cases and can reduce the cost of calling a function.
I suspect you're prematurely optimizing, however, you should not be this concerned with performance until you've discovered your real bottlenecks.
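If you go the simple timed-tests route mentioned above, a crude harness is enough to start with (a sketch; clock() resolution limits what it can detect):
#include <time.h>

/* Time many iterations of a candidate function and report seconds. */
double time_variant(void (*fn)(void), long iterations)
{
    clock_t start = clock();
    long i;
    for (i = 0; i < iterations; i++) {
        fn();
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}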
Absolutely not! The only "performance" difference is when variables are initialised:
int anint = 42;
vs
static int anint = 42;
In the first case the integer will be set to 42 every time the function is called; in the second case it will be set to 42 when the program is loaded.
However, the difference is so trivial as to be barely noticeable. It's a common misconception that storage has to be allocated for "automatic" variables on every call. This is not so: C uses the already allocated space in the stack for these variables.
Static variables may actually slow you down, as some aggressive optimisations are not possible on static variables. Also, since locals are in a contiguous area of the stack, they are easier to cache efficiently.
There is no one answer to this. It will vary with the CPU, the compiler, the compiler flags, the number of local variables you have, what the CPU's been doing before you call the function, and quite possibly the phase of the moon.
Consider two extremes: if you have only one or a few local variables, they might easily be stored in registers rather than allocated memory locations at all. If register "pressure" is sufficiently low, this may happen without executing any instructions at all.
At the opposite extreme there are a few machines (e.g., IBM mainframes) that don't have stacks at all. In this case, what we'd normally think of as stack frames are actually allocated as a linked list on the heap. As you'd probably guess, this can be quite slow.
When it comes to accessing the variables, the situation's somewhat similar -- access to a machine register is pretty well guaranteed to be faster than anything allocated in memory can possibly hope for. OTOH, it's possible for access to variables on the stack to be pretty slow -- it normally requires something like an indexed indirect access, which (especially with older CPUs) tends to be fairly slow. OTOH, access to a global (which a static is, even though its name isn't globally visible) typically requires forming an absolute address, which some CPUs penalize to some degree as well.
Bottom line: even the advice to profile your code may be misplaced -- the difference may easily be so tiny that even a profiler won't detect it dependably, and the only way to be sure is to examine the assembly language that's produced (and spend a few years learning assembly language well enough to say anything meaningful when you do look at it). The other side of this is that when you're dealing with a difference you can't even measure dependably, the chances that it'll have a material effect on the speed of real code are so remote that it's probably not worth the trouble.
It looks like the static vs. non-static question has been completely covered, but on the topic of global variables: often these will slow down a program's execution rather than speed it up.
The reason is that tightly scoped variables make it easy for the compiler to heavily optimise; if the compiler has to look all over your application for instances where a global might be used, then its optimising won't be as good.
This is compounded when you introduce pointers, say you have the following code:
int myFunction()
{
    SomeStruct a, b;
    SomeStruct *A = &a, *B = &b;  /* point at distinct locals */
    FillOutSomeStruct(B);
    memcpy(A, B, sizeof(*A));     /* was: sizeof(A) and a missing ')' */
    return A->result;             /* A is a pointer, so -> rather than . */
}
the compiler knows that the pointer A and B can never overlap and so it can optimise the copy. If A and B are global then they could possibly point to overlapping or identical memory, this means the compiler must 'play it safe' which is slower. The problem is generally called 'pointer aliasing' and can occur in lots of situations not just memory copies.
http://en.wikipedia.org/wiki/Pointer_alias
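One way to hand that no-overlap guarantee back to the compiler is C99's restrict qualifier; a sketch with hypothetical names:
#include <stddef.h>

/* restrict promises the compiler that dst and src never overlap,
   so it is free to reorder and vectorize the copy. */
void copy_results(int * restrict dst, const int * restrict src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}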
Using static variables may make a function a tiny bit faster. However, this will cause problems if you ever want to make your program multi-threaded. Since static variables are shared between function invocations, invoking the function simultaneously in different threads will result in undefined behaviour. Multi-threading is the type of thing you may want to do in the future to really speed up your code.
Most of the things you mentioned are referred to as micro-optimizations. Generally, worrying about these kinds of things is a bad idea. It makes your code harder to read and harder to maintain. It's also highly likely to introduce bugs. You'll likely get more bang for your buck doing optimizations at a higher level.
As M2tM suggests, running a profiler is also a good idea. Check out gprof for one which is quite easy to use.
You can always time your application to truly determine what is fastest. Here is what I understand: (all of this depends on the architecture of your processor, btw)
C functions create a stack frame, which is where passed parameters, local variables, and the return address back to the caller are put. There is no memory management allocation here; it is usually a simple pointer movement and that's it. Accessing data off the stack is also pretty quick. Penalties usually come into play when you're dealing with pointers.
As for global or static variables, they're the same...from the standpoint that they're going to be allocated in the same region of memory. Accessing these may use a different method of access than local variables, depends on the compiler.
The major difference between your scenarios is memory footprint, not so much speed.
Using static variables can actually make your code significantly slower. Static variables must exist in a 'data' region of memory. In order to use that variable, the function must execute a load instruction to read from main memory, or a store instruction to write to it. If that region is not in the cache, you lose many cycles. A local variable that lives on the stack will most surely have an address that is in the cache, and might even be in a cpu register, never appearing in memory at all.
I agree with the others' comments about profiling to find out stuff like that, but generally speaking, function static variables should be slower. If you want them, what you are really after is a global. Function statics (in C++, when dynamically initialized) insert code/data to check whether the thing has been initialized already, and that check runs every time your function is called.
Profiling may not see the difference, disassembling and knowing what to look for might.
I suspect you are only going to get a variation of a few clock cycles per loop (on average, depending on the compiler, etc.). Sometimes the change will be a dramatic improvement or dramatically slower, and that won't necessarily be because the variable's home has moved to/from the stack. Let's say you save four clock cycles per function call for 10000 calls on a 2 GHz processor. Very rough calculation: 20 microseconds saved. Is 20 microseconds a lot or a little compared to your current execution time?
You will likely get more of a performance improvement by making all of your char and short variables into ints, among other things. Micro-optimization is a good thing to know, but it takes lots of time spent experimenting, disassembling, and timing the execution of your code, and understanding that fewer instructions does not necessarily mean faster, for example.
Take your specific program, disassemble both the function in question and the code that calls it. With and without the static. If you gain only one or two instructions and this is the only optimization you are going to do, it is probably not worth it. You may not be able to see the difference while profiling. Changes in where the cache lines hit could show up in profiling before changes in the code for example.

C: can using a lot of structs make a program slow?

I am coding a breakout clone. I had one version in which the structures were only one level deep. This version runs at 70 fps.
For more clarity in the code I decided it should have more abstractions, and I created more structs. Most of the time I now have structures nested two or three levels deep. This version runs at 30 fps.
Since there are some other differences besides the structures, I ask you: can using a lot of structs in C slow down the code significantly?
For example on the second version, I am using:
struct Breakout
{
    Ball ball;
    Paddle paddle;
    Level* levels;
};

struct Level
{
    Bricks* bricks;
};
So I am using, for example, breakout.levels[level_in_play].bricks[i].visible many times. Could this be a possible cause?
Thanks.
Doing a lot of pointer dereferences can be a performance hit. When you split a big struct up into smaller structs, two things happen:
Accessing a member of a sub-struct requires an additional pointer dereference and memory fetch, which is slightly slower, and
You can reduce the locality of reference, which causes more cache misses and page faults and can drastically reduce performance.
The locality of reference one is probably what is biting you here. If possible, try to allocate related structs in the same malloc block, which increases the likelihood that they will be cached together.
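A sketch of that idea using the question's (assumed) types, so a level and its bricks land next to each other in memory:
#include <stdlib.h>

/* One malloc for the level header plus its bricks; assumes Bricks
   needs no stricter alignment than Level. */
Level *level_create(size_t brick_count)
{
    Level *lvl = malloc(sizeof(Level) + brick_count * sizeof(Bricks));
    if (lvl != NULL) {
        lvl->bricks = (Bricks *)(lvl + 1);  /* bricks follow the header */
    }
    return lvl;
}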
Adding extra layers of dereferencing can cause a little (very little) slowdown. The reason is that each -> the compiler sees means an extra memory lookup and offset. For instance, c->b->a requires the compiler to load the pointer c, dereference it and offset to b, dereference that and offset to a, then load a from memory: a chain of dependent loads. Doing c.b.a requires only c's address, a compile-time offset, and a single direct load of a. That is one load versus three.
Unless this type of work is being done a ton in small, tight loops, it won't amount to squat for time. If you are doing this in heavy inner loops though (and your compiler isn't helping you), then it could add up. For those cases, consider caching the lowest level struct pointer and working from there.
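For the question's inner loop, that caching looks something like this (the loop bound and the visibility handling are placeholders):
Bricks *bricks = breakout.levels[level_in_play].bricks;  /* hoisted once */
int i;
for (i = 0; i < brick_count; i++) {
    if (bricks[i].visible) {
        /* ... draw/update the brick without re-walking the chain ... */
    }
}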
That said, any time you bring up performance, step one is to profile. Without a profile you are guessing. You have made an assertion that struct dereferencing is the root of your performance problem, but without an up-to-date and valid profile (on a release build) you are guessing and probably wasting time.
In the first place, it's easy and tempting to guess what the problem is. The sneaky thing about guesses is - they are sometimes right. But why guess, when you can find out for drop-dead sure what's taking the time. I recommend this approach.
That said, here's my guess. malloc and free, if you single-step through them at the assembly language level, are probably doing a lot more than you thought. I only allocate memory for structures if I know I will not be doing it at particularly high frequency. If I must allocate/deallocate them dynamically, at high frequency, it helps to have a free list of used copies, so I can just grab them off the list rather than going to malloc all the time.
Nevertheless, take some stackshots. Chances are you can fix a series of problems and make it a whole lot faster.
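A minimal sketch of the free list mentioned above (the Node type and its next field are hypothetical): freed objects are pushed onto a list and handed back out before malloc is ever called again.
#include <stdlib.h>

typedef struct Node {
    struct Node *next;   /* link used both in the list and the pool */
    /* ... payload ... */
} Node;

static Node *free_nodes = NULL;   /* singly linked through Node.next */

Node *node_get(void)
{
    Node *n = free_nodes;
    if (n != NULL) {
        free_nodes = n->next;     /* fast path: reuse a freed node */
        return n;
    }
    return malloc(sizeof(Node));  /* slow path: a real allocation */
}

void node_put(Node *n)
{
    n->next = free_nodes;         /* push back for cheap reuse */
    free_nodes = n;
}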
Not "lots of structs" in an of itself, other that the potential for a greater number of memory accesses. "lots of indirection" however is more likely to be a cause. You have to consider how many memory accesses are generated in order to get to the actual data. Data proximity and size may also affect caching, but that is much harder to analyse.
Also since you mentioned in a comment that you are performing dynamic memory allocation, the time taken to find an allocate a block is non-deterministic and variable. If you are repeatedly allocating and freeing blocks during execution of the algorithm (rather than pre-allocating at initialisation for example), this can cause both degradation and variability in performance.
If you have a profiling tool, profile the code to see where the performance hit occurs.
