Code optimization - c

I have a big structure (with a lot of member variables). A pointer to this structure is passed to many functions in my code. Some member variables of this structure are used very often, in almost all functions.
If I put those frequently used member variables at the beginning of the structure declaration, will it optimize the code for MCPS (million cycles per second, i.e. the time consumed by the code)? If I put frequently accessed members at the top, will they be accessed more efficiently, in less time, than if they are placed randomly in the structure or at the bottom of the structure declaration? If yes, what is the logic?
If I have a structure member being accessed in some function as follows:
structurepointer1->member_variable
Will it help in optimizing it in MCPS aspect if I assign it to a local variable and then access the local variable, as shown below?
local_variable = structurepointer1->member_variable;
If yes, then how does it help?

1) The position of a field in a structure should have no effect on its access time except to the extent that, if your structure is very large and spans multiple pages, it may be a good idea to position members that are often used in quick succession close together in order to increase locality of reference and try to decrease cache misses.
2) Maybe / maybe not. In fact it may make things slower. If the variable is not volatile, your compiler may be smart enough to store the field in a register anyway. Even if not, your processor will cache its value, but this may not help if its uses are somewhat far apart, with lots of other memory accesses in between. If the value would either have been stored in a register or have stayed in your processor's cache, then assigning it to a local will only be unnecessary extra work.
Standard Optimizations Disclaimer: Always profile before optimizing. Make sure that what you are trying to optimize is worth optimizing. Always profile your attempted optimizations and make sure they actually made things faster (and not slower).
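To make the locality point in 1) concrete, here is a minimal sketch. The structure `big_state`, its field names, the helper `same_cache_line`, and the 64-byte cache line size are all assumptions made for illustration; real line sizes vary by processor, and the result only holds if the structure itself starts on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines; check your target's documentation. */
#define CACHE_LINE_BYTES 64u

/* Hypothetical large structure with the hot fields grouped at the front. */
struct big_state {
    int  hot_counter;
    int  hot_flags;
    char cold_buffer[4096];
    int  cold_stat;
};

/* Returns 1 when two member offsets land in the same cache line,
   assuming the structure itself starts on a cache-line boundary. */
int same_cache_line(size_t offset_a, size_t offset_b, size_t line_bytes)
{
    assert(line_bytes > 0);
    return offset_a / line_bytes == offset_b / line_bytes;
}
```

With this layout, `hot_counter` and `hot_flags` share a line, while `hot_counter` and `cold_stat` do not. Whether that matters for your code is still something only a profile can tell you.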

First, the obligatory disclaimer: for all performance questions, you must profile the code to see where improvements can be made.
In general though, anything you can do to keep your data in the processor cache will help. Putting the most commonly accessed items close together will facilitate this.

I know this is not really answering your question, but before you delve into super-optimizing your code, go through this presentation http://dl.fefe.de/optimizer-isec.pdf. I saw it live and it was a good eye opening experience showing compilers are getting far more advanced in optimization than we tend to think and readable code is more important than small optimizations.
On 2, you are most likely better off not declaring a local variable. The compiler is usually smart enough to figure out when and how a variable is used and to keep it in a register.
Also, I would second Mark Ransom's suggestion, profile the code before making assumptions about bottlenecks.

I think your question is related to data alignment and data structure padding. Modern compilers handle this automatically most of the time, trying to avoid the alignment faults that could occur on memory accesses. You can read about this here. Of course, you can change the alignment of your data, but I think you would need to specify some compiler options to disable auto-alignment and rearrange the fields of the structure to match the architecture you are targeting.
I would say this is a very low level optimization.

The location of the field in the structure is irrelevant, as its offset is calculated by the compiler. A more promising optimization is to make sure that your most-used fields are aligned to the word size of your processor.
If you are using the variable local to a function, this should have no impact. If you are passing it to other functions (separate from the larger structure), then that might help a bit.

As with all of the other answers, you need to run a profile baseline before optimizing, to make sure changes are effective. If you're worried about execution time, profile your algorithms and optimize them before you worry about the code a compiler creates; that gives more bang for the buck.
Also, if you want to know what is going to happen, you should consider compiling your C code into assembly output. This will give you an idea of what the compiler is going to do and how you may go about further "fine tuning".
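As a sketch of what such a comparison could look like, the two functions below read the member either through the pointer or through a local copy. The names `struct sample`, `sum_direct` and `sum_local` are hypothetical. With GCC or Clang, compiling with `-S` (for example together with `-O2`) writes the assembly to a `.s` file, so you can check whether the two variants actually differ on your target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical structure standing in for the question's big structure. */
struct sample {
    int member_variable;
    int other;
};

/* Variant A: read the member through the pointer inside the loop. */
int sum_direct(const struct sample *s, int n)
{
    assert(s != NULL);
    int total = 0;
    for (int i = 0; i < n; i++)
        total += s->member_variable * i;
    return total;
}

/* Variant B: copy the member to a local variable first. */
int sum_local(const struct sample *s, int n)
{
    assert(s != NULL);
    int m = s->member_variable;
    int total = 0;
    for (int i = 0; i < n; i++)
        total += m * i;
    return total;
}
```

Both variants compute the same value; an optimizing compiler may well emit identical code for them, which is exactly what the assembly listing lets you verify.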
Structure access is almost always indexed indirect access. The assembly code effectively reads memory using the pointer to the structure as the base plus an offset to reach the right field. This used to be an expensive operation, but on modern CPUs it's probably not that slow.
This depends on the locality of the data being accessed. First and foremost, accessing the structure the first time will be the most expensive. Accessing the data afterwards can be quick if the data is already in a processor register; however, this may not be the case, depending on the processor used. Storing to a local variable should be less expensive, since the memory access instructions for such an operation are less expensive. Again, I think processors nowadays are fast enough that the gain from this optimization is minimal.
I still think that there are probably better places to optimize your code. It is good, though, that there is still someone out there who thinks about this, in a world of code bloat ;) In embedded computing, you still need to worry about these things.

This depends on the size of your fields and caching details. Look at using valgrind for profiling this.
If you are doing this dereferencing a lot, it will cost time. A decent optimizing compiler will effectively perform the optimization you describe, storing the value in a local variable. It will do a better job than you will, and it will do it in an architecture-specific way.
What you want to do in this situation, overall, is make sure that you test the correctness and the performance of each optimization you are trying. Otherwise you are poking around in the dark.
Remember that fine optimizations at the C line level will virtually never trump higher-order algorithm/design optimizations.

Yes, it can help. But as people have already stated, it depends, and it can even be counterproductive.
The reason why I think it can help has to do with pointer aliasing. If you access your variables via a pointer, and the compiler cannot guarantee that the structure was not changed elsewhere (via your pointer or another one), it will generate code to reload or save the variable even if it could have held the value in a register. Here is an example to show what I mean:
calc = structurepointer1->member_variable * x + c;
/* Do something in function which doesn't involve member_variable; */
function(structurepointer1);
calc2 = structurepointer1->member_variable * y;
The compiler will make a memory access for both references to member_variable, because it cannot be sure that the called function has not modified that field.
If you're sure the function doesn't change that value, doing this would save one memory access:
int temp = structurepointer1->member_variable;
calc = temp * x + c;
function(structurepointer1);
calc2 = temp * y;
There's also another reason you can use a local variable for your member variables, it can make the code much more readable.
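The "if you're sure" condition above matters for correctness, not just speed. The sketch below uses a hypothetical callee `bump` that does modify the member, to show that a cached local copy then goes stale; `struct state` and all function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct state {
    int member_variable;
};

/* Hypothetical callee that DOES modify the member. */
static void bump(struct state *s)
{
    s->member_variable += 1;
}

/* Reloads the member after the call, so it sees the update. */
int reload_after_call(struct state *s, int y)
{
    assert(s != NULL);
    bump(s);
    return s->member_variable * y;
}

/* Caches the member before the call, so it keeps using the old value. */
int cached_before_call(struct state *s, int y)
{
    assert(s != NULL);
    int temp = s->member_variable;
    bump(s);
    return temp * y;
}
```

Starting from a member value of 2 and `y` of 10, the reloading version returns 30 and the caching version returns 20. The manual copy is only safe when the callee really leaves the field alone.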

Related

Using Structs in Functions

I have a function and i'm accessing a struct's members a lot of times in it.
What I was wondering about is what is the good practice to go about this?
For example:
struct s
{
    int x;
    int y;
};
and I have allocated memory for 10 objects of that struct using malloc.
So, whenever I need to use only one of the objects in a function, I usually create a pointer (or receive one as an argument) and point it to the required object. (My superior told me to avoid array indexing because it adds a calculation when accessing any member of the struct.)
But is this the right way? I understand that dereferencing is not as expensive as creating a copy, but what if I'm dereferencing a number of times (like 20 to 30) in the function?
Would it be better if I created temporary variables for the struct members (only the ones I need; I certainly don't use all the members), copied the values over, and then set the actual struct's values before returning?
Also, is this unnecessary micro optimization? Please note that this is for embedded devices.
This is for an embedded system. So, I can't make any assumptions about what the compiler will do. I can't make any assumptions about word size, or the number of registers, or the cost of accessing off the stack, because you didn't tell me what the architecture is. I used to do embedded code on 8080s when they were new...
OK, so what to do?
Pick a real section of code and code it up. Code it up each of the different ways you have listed above. Compile it. Find the compiler option that forces it to print out the assembly code that is produced. Compile each piece of code with every different set of optimization options. Grab the reference manual for the processor and count the cycles used by each case.
Now you will have real data on which to base a decision. Real data is much better than the opinions of a million highly experienced expert programmers. Sit down with your lead programmer and show him the code and the data. He may well show you better ways to code it. If so, recode it his way, compile it, and count the cycles used by his code. Show him how his way worked out.
At the very worst you will have spent a weekend learning something very important about the way your compiler works. You will have examined N ways to code things times M different sets of optimization options. You will have learned a lot about the instruction set of the machine. You will have learned how good, or bad, the compiler is. You will have had a chance to get to know your lead programmer better. And, you will have real data.
Real data is the kind of data that you must have to answer this question. Without that data, nothing anyone tells you is anything but an ego-based guess. Data answers the question.
Bob Pendleton
First of all, indexing an array is not very expensive (only like one operation more expensive than a pointer dereference, or sometimes none, depending on the situation).
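As a small sketch of the two access styles being compared, using the question's `struct s`: the function names `sum_x_indexed` and `sum_x_pointer` are hypothetical. Both compute the same result, and compilers commonly turn the indexed loop into pointer arithmetic themselves, though that is worth confirming in the generated assembly for your target.

```c
#include <assert.h>
#include <stddef.h>

struct s {
    int x;
    int y;
};

/* Access through array indexing. */
int sum_x_indexed(const struct s *items, size_t count)
{
    assert(items != NULL);
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += items[i].x;
    return total;
}

/* Access through a moving pointer. */
int sum_x_pointer(const struct s *items, size_t count)
{
    assert(items != NULL);
    int total = 0;
    for (const struct s *p = items; p != items + count; p++)
        total += p->x;
    return total;
}
```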
Secondly, most compilers will perform what is called RVO or return value optimisation when returning structs by value. This is where the caller allocates space for the return value of the function it calls, and secretly passes the address of that memory to the function for it to use, and the effect is that no copies are made. It does this automatically, so
struct mystruct blah = func();
Only constructs one object, passes it to func for it to use transparently to the programmer, and no copying need be done.
What I do not know is if you assign an array index the return value of the function, like this:
someArray[0] = func();
will the compiler pass the address of someArray[0] and do RVO that way, or will it just not do that optimisation? You'll have to get a more experienced programmer to answer that. I would guess that the compiler is smart enough to do it though, but it's just a guess.
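For anyone who wants to inspect this on their own compiler, here is a minimal sketch of both forms. The layout of `struct mystruct` and the helper names are assumptions; the code only shows that both assignments yield the same data and makes no claim about whether a copy is elided, which has to be read off the generated assembly.

```c
#include <assert.h>

/* Hypothetical structure; the original does not define mystruct. */
struct mystruct {
    int id;
    int values[4];
};

/* Returns a structure by value. */
struct mystruct func(void)
{
    struct mystruct result = { 7, { 1, 2, 3, 4 } };
    return result;
}

/* Initialises a named variable from the return value. */
int first_via_variable(void)
{
    struct mystruct blah = func();
    return blah.values[0];
}

/* Assigns the return value to an array element. */
int first_via_array(void)
{
    struct mystruct someArray[2];
    someArray[0] = func();
    assert(someArray[0].id == 7);   /* sanity check on the copied data */
    return someArray[0].values[0];
}
```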
And yes, I would call it micro optimisation. But we're C programmers. And that's how we roll.
Generally, the case in which you want to make a copy of a passed struct in C is when you want to manipulate the data without touching the original. That is to say, your changes are not reflected in the struct itself but only in the return value. As for which is more expensive, it depends on a lot of things, many of which change from implementation to implementation, so I would need more specific information to be more helpful. Though I would expect that in an embedded environment your memory is at a greater premium than your processing power. Really, this reads like needless micro-optimization; your compiler should handle it.
In this case creating temp variable on the stack will be faster. But if your structure is much bigger then you might be better with dereferencing.

Global Variables performance effect (c, c++)

I'm currently developing a very fast algorithm, with one part of it being an extremely fast scanner and statistics function.
In this quest, i'm after any performance benefit.
Therefore, I'm also interested in keeping the code "multi-thread" friendly.
Now for the question :
i've noticed that putting some very frequently accessed variables and arrays into "Global", or "static local" (which does the same), there is a measurable performance benefit (in the range of +10%).
I'm trying to understand why, and to find a solution about it, since i would prefer to avoid using these types of allocation.
Note that i don't think the difference comes from "allocation", since allocating a few variables and small array on the stack is almost instantaneous. I believe the difference comes from "accessing" and "modifying" data.
In this search, i've found this old post from stackoverflow :
C++ performance of global variables
But i'm very disappointed by the answers there. Very little explanation, mostly ranting about "you should not do that" (hey, that's not the question !) and very rough statements like 'it doesn't affect performance', which is obviously incorrect, since i'm measuring it with precise benchmark tools.
As said above, i'm looking for an explanation, and, if it exists, a solution to this issue. So far, i've got the feeling that calculating the memory address of a local (dynamic) variable costs a bit more than a global (or local static). Maybe something like an ADD operation difference. But that doesn't help finding a solution...
It really depends on your compiler, platform, and other details. However, I can describe one scenario where global variables are faster.
In many cases, a global variable is at a fixed offset. This allows the generated instructions to simply use that address directly. (Something along the lines of MOV AX,[MyVar].)
However, if you have a variable that's relative to the current stack pointer or a member of a class or array, some math is required to take the address of the array and determine the address of the actual variable.
Obviously, if you need to place some sort of mutex on your global variable in order to keep it thread-safe, then you'll almost certainly more than lose any performance gain.
Creating local variables can be literally free if they are POD types. You likely are overflowing a cache line with too many stack variables or other similar alignment-based causes which are very specific to your piece of code. I usually find that non-local variables significantly decrease performance.
It's hard to beat static allocation for speed, and while the 10% is a pretty small difference, it could be due to address calculation.
But if you're looking for speed,
your example in a comment while(p<end)stats[*p++]++; is an obvious candidate for unrolling, such as:
static int stats[M];
static int index_array[N];
int *p = index_array, *pend = p+N;
// ... initialize the arrays ...
while (pend - p >= 8){ /* at least 8 elements left; also safe when N < 8 */
stats[p[0]]++;
stats[p[1]]++;
stats[p[2]]++;
stats[p[3]]++;
stats[p[4]]++;
stats[p[5]]++;
stats[p[6]]++;
stats[p[7]]++;
p += 8;
}
while(p<pend) stats[*p++]++;
Don't count on the compiler to do it for you. It might or might not be able to figure it out.
Other possible optimizations come to mind, but they depend on what you're actually trying to do.
If you have something like
int stats[256]; while (p<end) stats[*p++]++;
static int stats[256]; while (p<end) stats[*p++]++;
you are not really comparing the same thing because for the first instance you are not doing an initialization of your array. Written explicitly the second line is equivalent to
static int stats[256] = { 0 }; while (p<end) stats[*p++]++;
So to be a fair comparison you should have the first read
int stats[256] = { 0 }; while (p<end) stats[*p++]++;
Your compiler might be able to deduce much more if it has the variables in a known state.
Now then, there could be runtime advantage of the static case, since the initialization is done at compile time (or program startup).
To test if this makes up for your difference you should run the same function with the static declaration and the loop several times, to see if the difference vanishes if your number of invocations grows.
But as others have said already, the best approach is to inspect the assembly that your compiler produces to see what the actual differences in the generated code are.
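A minimal sketch of the fair comparison described above: both versions start from a zeroed array, but the static one is zeroed only once and keeps its contents between calls. The function names and the `key` parameter are assumptions added so the result can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Automatic array: explicitly zeroed on every call. */
int count_with_auto(const unsigned char *p, size_t n, unsigned char key)
{
    assert(p != NULL);
    int stats[256] = { 0 };
    const unsigned char *end = p + n;
    while (p < end)
        stats[*p++]++;
    return stats[key];
}

/* Static array: zeroed once before the program starts,
   so counts accumulate across calls. */
int count_with_static(const unsigned char *p, size_t n, unsigned char key)
{
    assert(p != NULL);
    static int stats[256];
    const unsigned char *end = p + n;
    while (p < end)
        stats[*p++]++;
    return stats[key];
}
```

Calling each function twice on the same input shows the difference: the automatic version reports the same count both times, while the static version's count doubles. A benchmark that ignores this is timing two different computations.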

Best Practices for PIC18 Stack/Memory Management?

The limited stack size of budget PICs is a problem area and I have adjusted my code to accommodate this reality. I currently adopt a rough paradigm of grouping closely related functions into a module and declaring all variables global static in the module (to reduce the amount of variables stored in the auto psect, and issues of mutability are only relevant in ISRs, which I account for.) I don't do this because it is good practice, but the reality is you have a finite amount of space to allocate all local function vars that exist in an entire project. In the embedded world of 8/16 bit chips, is this an appropriate method, provided I'm sure to take necessary precautions? I also do things like allocate > 256 bytes of RAM for Ethernet (I know it should be 1500 as standard MTU, but we have a custom situation and very limited RAM) buffers and have to access that memory via pointers so I can avoid the semantics of memory banking. Am I doing it wrong? My app works, but I am 100% open to suggestions for improvement. [c]
I know this was asked 4 years ago, but it still has not been properly answered. I believe what the OP is asking is whether their approach to working around a limitation of the HiTech PICC18 C compiler is valid and/or best practice. As mentioned in a later comment, the limitation (a rather bad one, and not well advertised by Hitech) is "the Hi-Tech compiler only allows up 256 bytes of auto variables". Actually the limitation is worse than that, as it is a total of 256 bytes for local variables and parameters. The linker warning when this is exceeded is pretty cryptic too. Provided that functions are on different branches of the call tree, the compiler can overlap the variables to reuse the space. This means that you can effectively have more than 256 bytes. But note that the interrupt handler (or handlers, if you use the priority scheme) has its own call tree that shares the 256 byte local/param block.
Locals
The two solutions to reduce the space required for locals are: make the locals global or make them static. Making them static keeps the scope the same and is safe provided the function is not called from interrupts (reentrancy is not allowed by the compiler anyway). This is probably the preferred option. The drawback is that the compiler cannot reuse those variables' locations to reduce overall memory consumption. Moving the variables to global scope allows reuse, but the reuse must be managed by the programmer. Probably the best balance is to make simple variables static but to make large chunks of memory, like string buffers, global and carefully reuse them.
Be careful with initialisation.
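void foo(void)
{
    int myvar = 5;      /* set to 5 on every call */
}
must change to
void foo(void)
{
    static int myvar;   /* a static initializer would run only once */
    myvar = 5;          /* so assign explicitly on every call */
}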
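The reason for the separate assignment can be seen in a small sketch (function names are hypothetical): a static variable with an initializer is set once, before the program starts, and then keeps whatever value the previous call left behind.

```c
#include <assert.h>

/* Initialised once, before the program starts; the value persists. */
int next_init_once(void)
{
    static int myvar = 5;
    return myvar++;
}

/* Assigned on every call, matching the behaviour of an automatic variable. */
int next_assigned_each_call(void)
{
    static int myvar;
    myvar = 5;
    return myvar++;
}

/* Usage sketch: the explicitly assigned version restarts at 5 each time. */
void show_difference(void)
{
    assert(next_assigned_each_call() == 5);
    assert(next_assigned_each_call() == 5);
}
```

Successive calls to `next_init_once` return 5, 6, 7 and so on, which is not what code converted from an automatic variable expects.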
Parameters
If you go around passing large amounts of data down the call tree in parameters, you will quickly run into the same 256 byte limitation. Your best option here may be to pass a pointer to a globally allocated struct (or structs) of "options". Alternatively, you can have global settings variables that are set by the top caller and read by callees down the tree. Which approach is better really depends on the design of the software.
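A minimal sketch of the pointer-to-options approach follows. The structure `options`, its fields, the global `g_options`, and the three functions are all hypothetical names; the point is only that each level of the call tree carries one pointer instead of several separate parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical settings block, allocated once at global scope. */
struct options {
    unsigned char retries;
    unsigned char timeout_ticks;
    unsigned int  baud;
};

static struct options g_options = { 3, 50, 9600 };

/* Deepest callee: reads what it needs through the pointer. */
unsigned int total_wait_ticks(const struct options *opt)
{
    assert(opt != NULL);
    return (unsigned int)opt->retries * opt->timeout_ticks;
}

/* Intermediate level: forwards one pointer, not three values. */
unsigned int plan_transfer(const struct options *opt)
{
    return total_wait_ticks(opt);
}

/* Top caller: hands down the address of the global block. */
unsigned int start_transfer(void)
{
    return plan_transfer(&g_options);
}
```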
I've struggled with the same issues as the OP and I think the best option in the long run is to move away from using the Hitech compiler. The optimisation decision the compiler writers took to allocate all locals/params in one block is only really appropriate for the very small ram size PICS. For large PICS you will run out of local/param far before you hit the ram size of the device. Then you have to start hacking your code around to fit the compiler which is perverse.
In summary... Yes your approach is valid. But do consider simply making locals static if that is appropriate as, in general, reducing the scope makes your code safer.
Whereas the C18 compiler used some FSRs (pointers) to manage the data stack, it sounds like the new XC8 compiler from Microchip uses a compiled stack, so you should know exactly how much space is taken up by the stack at compile time. You will also know exactly where each stack variable is stored. I read all about this in the XC8 user's guide and it sounds great. That feature should make this question be moot, assuming you are using XC8.
My experience with compilers/linkers for chips with limited memory is that, as long as you don't use recursive functions and inform the compiler about that, then the compiler is very capable of determining the minimal amount of stack-space that is needed.
I have even seen compilers that give each variable with automatic storage a globally fixed address (no stack at all), where several variables got allocated to overlapping memory, as long as their lifetimes did not overlap.
The general advice when doing (speed or space) optimisations is: make measurements to prove that your optimisation actually has a positive effect.
Since you are nearly out of memory, you have to count each byte of RAM. Using local (auto) variables lets the compiler reuse the memory where you need it (locally in the function). When you move the variables to the global static address space, you give each variable a unique location. That's a waste of address space.
The Microchip compiler allows different variables to share the same address. I don't have the docs at hand, but this can be done with a pragma.
But what you need is an analysis of RAM requirements. When you see that the stack cannot hold all the variables, but auto variables would reduce the global memory use, you should consider increasing the stack size using the startup code and the linker script.
Best practice is to choose hardware that fits the requirements.
There are microcontrollers around that cost only a few dollars more but save hundreds or thousands of dollars in development costs. If this is hobby development, your effort may not count. But in the real world you often find hardware that was designed with only hardware costs in view.
The PIC18 in particular is not the best example of compact code, which can also be a problem with the flash memory.
This might sound obvious, but try not to use 16-bit variables on 8-bit processors. 16-bit variables are fine and needed on bigger architectures, but on limited (8-bit) architectures 16-bit arithmetic is a quick way to deplete both RAM and ROM in no time.
If you try to increment a 16-bit variable, the compiler will include a 16-bit increment library routine, which in most cases consumes a lot of space.
Also, try not to divide or multiply, as on some controllers these operations are implemented in software.
Personally, I always go for char, and when I need to divide, I shift right n times to divide by 2^n.
hope this helps!
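The shift trick mentioned in the answer above can be sketched as follows. The function name `div_pow2` is hypothetical, and the sketch is limited to unsigned 8-bit values, where a right shift by n is the same as dividing by 2^n and discarding the remainder.

```c
#include <assert.h>

/* Divides an unsigned 8-bit value by 2^n using a right shift.
   Only valid for n < 8 on an 8-bit value. */
unsigned char div_pow2(unsigned char value, unsigned char n)
{
    assert(n < 8);
    return (unsigned char)(value >> n);
}
```

For example, `div_pow2(200, 3)` gives 25, the same as 200 / 8. Many compilers already make this substitution for division by a constant power of two, so it is worth checking the generated code before rewriting by hand.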
A bit late, but you should also have a closer look at the C18 compiler user guide (if you were using this compiler).
You could decrease the stack usage dramatically by statically allocating local variables (overriding the auto keyword). Even better, you can use the overlay storage identifier, which allows variables with non-overlapping lifetimes to be placed at the same address, minimizing RAM use. (The C18 compiler must operate in Non-Extended mode.)

In C, does using static variables in a function make it faster?

My function will be called thousands of times. If i want to make it faster, will changing the local function variables to static be of any use? My logic behind this is that, because static variables are persistent between function calls, they are allocated only the first time, and thus, every subsequent call will not allocate memory for them and will become faster, because the memory allocation step is not done.
Also, if the above is true, then would using global variables instead of parameters be faster to pass information to the function every time it is called? i think space for parameters is also allocated on every function call, to allow for recursion (that's why recursion uses up more memory), but since my function is not recursive, and if my reasoning is correct, then taking off parameters will in theory make it faster.
I know these things I want to do are horrible programming habits, but please, tell me if it is wise. I am going to try it anyway but please give me your opinion.
The overhead of local variables is zero. Each time you call a function, you are already setting up the stack for the parameters, return values, etc. Adding local variables means that you're adding a slightly bigger number to the stack pointer (a number which is computed at compile time).
Also, local variables are probably faster due to cache locality.
If you are only calling your function "thousands" of times (not millions or billions), then you should be looking at your algorithm for optimization opportunities after you have run a profiler.
Re: cache locality (read more here):
Frequently accessed global variables probably have temporal locality. They also may be copied to a register during function execution, but will be written back into memory (cache) after a function returns (otherwise they wouldn't be accessible to anything else; registers don't have addresses).
Local variables will generally have both temporal and spatial locality (they get that by virtue of being created on the stack). Additionally, they may be "allocated" directly to registers and never be written to memory.
The best way to find out is to actually run a profiler. This can be as simple as executing several timed tests using both methods and then averaging out the results and comparing, or you may consider a full-blown profiling tool which attaches itself to a process and graphs out memory use over time and execution speed.
Do not perform random micro code-tuning because you have a gut feeling it will be faster. Compilers all have slightly different implementations of things and what is true on one compiler on one environment may be false on another configuration.
To tackle that comment about fewer parameters: the process of "inlining" functions essentially removes the overhead related to calling a function. Chances are a small function will be automatically in-lined by the compiler, but you can suggest a function be inlined as well.
In a different language, C++, the new standard coming out supports perfect forwarding, and perfect move semantics with rvalue references which removes the need for temporaries in certain cases which can reduce the cost of calling a function.
I suspect you're prematurely optimizing, however, you should not be this concerned with performance until you've discovered your real bottlenecks.
Absolutely not! The only "performance" difference is when variables are initialised:
int anint = 42;
vs
static int anint = 42;
In the first case the integer will be set to 42 every time the function is called; in the second case it will be set to 42 when the program is loaded.
However, the difference is so trivial as to be barely noticeable. It's a common misconception that storage has to be allocated for "automatic" variables on every call. This is not so; C uses the already allocated space on the stack for these variables.
Static variables may actually slow you down, as some aggressive optimisations are not possible on static variables. Also, as locals are in a contiguous area of the stack, they are easier to cache efficiently.
There is no one answer to this. It will vary with the CPU, the compiler, the compiler flags, the number of local variables you have, what the CPU's been doing before you call the function, and quite possibly the phase of the moon.
Consider two extremes: if you have only one or a few local variables, they might easily be stored in registers rather than be allocated memory locations at all. If register "pressure" is sufficiently low, this may happen without executing any instructions at all.
At the opposite extreme there are a few machines (e.g., IBM mainframes) that don't have stacks at all. In this case, what we'd normally think of as stack frames are actually allocated as a linked list on the heap. As you'd probably guess, this can be quite slow.
When it comes to accessing the variables, the situation's somewhat similar -- access to a machine register is pretty well guaranteed to be faster than anything allocated in memory can possibly hope for. OTOH, it's possible for access to variables on the stack to be pretty slow -- it normally requires something like an indexed indirect access, which (especially with older CPUs) tends to be fairly slow. OTOH, access to a global (which a static is, even though its name isn't globally visible) typically requires forming an absolute address, which some CPUs penalize to some degree as well.
Bottom line: even the advice to profile your code may be misplaced -- the difference may easily be so tiny that even a profiler won't detect it dependably, and the only way to be sure is to examine the assembly language that's produced (and spend a few years learning assembly language well enough to know what to say when you do look at it). The other side of this is that when you're dealing with a difference you can't even measure dependably, the chance that it'll have a material effect on the speed of real code is so remote that it's probably not worth the trouble.
It looks like static vs non-static has been completely covered, but on the topic of global variables: often these will slow down a program's execution rather than speed it up.
The reason is that tightly scoped variables make it easy for the compiler to optimise heavily. If the compiler has to look all over your application for places where a global might be used, then its optimising won't be as good.
This is compounded when you introduce pointers, say you have the following code:
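int myFunction(void)
{
    SomeStruct a, b;               /* two distinct local objects */
    SomeStruct *A = &a, *B = &b;   /* so A and B cannot overlap */
    FillOutSomeStruct(B);
    memcpy(A, B, sizeof *A);       /* size of the struct, not of the pointer */
    return A->result;
}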
the compiler knows that the pointer A and B can never overlap and so it can optimise the copy. If A and B are global then they could possibly point to overlapping or identical memory, this means the compiler must 'play it safe' which is slower. The problem is generally called 'pointer aliasing' and can occur in lots of situations not just memory copies.
http://en.wikipedia.org/wiki/Pointer_alias
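One way to tell the compiler that pointers do not alias is the C99 `restrict` qualifier. It is not mentioned in the answer above and is offered here only as a related sketch; the function names are hypothetical. `restrict` is a promise made by the programmer, and breaking it is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>

/* The restrict qualifiers promise that dst, src and factor refer to
   non-overlapping memory, so *factor need not be reloaded after each
   store to dst[i]. */
void scale_into(int *restrict dst, const int *restrict src,
                const int *restrict factor, size_t n)
{
    assert(dst != NULL && src != NULL && factor != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * *factor;
}

/* Usage sketch with three distinct local objects. */
int scale_demo(void)
{
    int src[3] = { 1, 2, 3 };
    int dst[3] = { 0, 0, 0 };
    int factor = 10;
    scale_into(dst, src, &factor, 3);
    return dst[0] + dst[1] + dst[2];
}
```

Without the qualifiers, the compiler may have to assume that a store to `dst[i]` could change `*factor` and reload it on every iteration.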
Using static variables may make a function a tiny bit faster. However, this will cause problems if you ever want to make your program multi-threaded. Since static variables are shared between function invocations, invoking the function simultaneously in different threads will result in undefined behaviour. Multi-threading is the type of thing you may want to do in the future to really speed up your code.
Most of the things you mentioned are referred to as micro-optimizations. Generally, worrying about these kind of things is a bad idea. It makes your code harder to read, and harder to maintain. It's also highly likely to introduce bugs. You'll likely get more bang for your buck doing optimizations at a higher level.
As M2tM suggests, running a profiler is also a good idea. Check out gprof for one which is quite easy to use.
You can always time your application to truly determine what is fastest. Here is what I understand: (all of this depends on the architecture of your processor, btw)
C functions create a stack frame, which is where passed parameters and local variables are put, as well as the return address pointing back to where the caller called the function. There is no memory management allocation here. It's usually a simple pointer movement and that's it. Accessing data off the stack is also pretty quick. Penalties usually come into play when you're dealing with pointers.
As for global or static variables, they're the same...from the standpoint that they're going to be allocated in the same region of memory. Accessing these may use a different method of access than local variables, depends on the compiler.
The major difference between your scenarios is memory footprint, not so much speed.
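The suggestion at the start of this answer, to time the application, can be sketched with the standard clock() function. The helper name time_calls and the sample workload are hypothetical; for real measurements use a release build, many iterations, and several runs.

```c
#include <time.h>

/* Hypothetical workload: volatile keeps the compiler from removing it. */
volatile long sink;

void sample_work(void)
{
    sink += 1;
}

/* Runs fn() `iterations` times and returns the elapsed CPU time in
   seconds, as measured by clock(). */
double time_calls(void (*fn)(void), long iterations)
{
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls once for the version with local variables and once for the version with statics gives two numbers to compare. The resolution of clock() is coarse, so short runs can easily report zero.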
Using static variables can actually make your code significantly slower. Static variables must exist in a 'data' region of memory. In order to use that variable, the function must execute a load instruction to read from main memory, or a store instruction to write to it. If that region is not in the cache, you lose many cycles. A local variable that lives on the stack will most surely have an address that is in the cache, and might even be in a cpu register, never appearing in memory at all.
I agree with the others' comments about profiling to find out stuff like that, but generally speaking, function static variables should be slower. If you want them, what you are really after is a global. Function statics can insert code/data to check whether the thing has already been initialized, and that check runs every time your function is called. (This applies to C++ statics with run-time initializers; in C, a static's initializer must be a constant expression, so no such check is needed.)
Profiling may not show the difference; disassembling, and knowing what to look for, might.
I suspect you are only going to get a variation of a few clock cycles per loop (on average, depending on the compiler, etc.). Sometimes the change will be a dramatic improvement or dramatically slower, and that won't necessarily be because the variable's home has moved to/from the stack. Let's say you save four clock cycles per function call for 10000 calls on a 2 GHz processor. Very rough calculation: 20 microseconds saved. Is 20 microseconds a lot or a little compared to your current execution time?
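The rough calculation above is just cycles saved per call, times the number of calls, divided by the clock rate. A tiny helper (hypothetical name) makes it easy to plug in your own numbers:

```c
/* Seconds saved = cycles saved per call * number of calls / clock rate. */
double seconds_saved(double cycles_per_call, double calls, double clock_hz)
{
    return cycles_per_call * calls / clock_hz;
}
```

With the figures from the text, seconds_saved(4, 10000, 2e9) is 40000 cycles divided by 2e9 cycles per second, which is 20 microseconds.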
You will likely get more of a performance improvement by making all of your char and short variables into ints, among other things. Micro-optimization is a good thing to know, but it takes lots of time experimenting, disassembling, and timing the execution of your code, and understanding that, for example, fewer instructions does not necessarily mean faster.
Take your specific program and disassemble both the function in question and the code that calls it, with and without the static. If you gain only one or two instructions and this is the only optimization you are going to do, it is probably not worth it. You may not be able to see the difference while profiling. Changes in where the cache lines hit could show up in profiling before changes in the code do, for example.

C: using a lot of structs can make a program slow?

I am coding a breakout clone. I had one version in which the structures were only one level deep. This version runs at 70 fps.
For more clarity in the code I decided it should have more abstractions, and created more structs. Most of the time I now have structures nested two to three levels deep. This version runs at 30 fps.
Since there are some other differences besides the structures, I ask you: can using a lot of structs in C slow down the code significantly?
For example on the second version, I am using:
struct Breakout
{
    Ball ball;
    Paddle paddle;
    Level* levels;
};
struct Level
{
    Bricks* bricks;
};
So I frequently use expressions such as breakout.levels[level_in_play].bricks[i].visible. Could this be a possible cause?
Thanks.
Doing a lot of pointer dereferences can be a performance hit. When you split a big struct up into smaller structs that are reached through pointers, two things happen:
1) Accessing a member of a sub-struct requires an additional pointer dereference and memory fetch, which is slightly slower.
2) You can reduce the locality of reference, which causes more cache misses and page faults and can drastically reduce performance.
The locality of reference one is probably what is biting you here. If possible, try to allocate related structs in the same malloc block, which increases the likelihood that they will be cached together.
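One way to keep related structs in the same malloc block is a C99 flexible array member. This is only a sketch: the Level and Brick layouts below are assumptions for illustration, not the asker's actual definitions, and level_create is a hypothetical helper.

```c
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    int visible;
} Brick;

typedef struct {
    size_t brick_count;
    Brick bricks[];      /* flexible array member (C99) */
} Level;

/* Allocates the Level header and all of its bricks in one malloc
   block, so they sit next to each other in memory.
   Returns NULL if the allocation fails. */
Level *level_create(size_t brick_count)
{
    Level *level = malloc(sizeof *level
                          + brick_count * sizeof level->bricks[0]);
    if (level == NULL)
        return NULL;
    level->brick_count = brick_count;
    for (size_t i = 0; i < brick_count; i++)
        level->bricks[i].visible = 1;
    return level;
}
```

The whole level is released with a single free(level). Compared with a separate malloc for the bricks array, this saves one allocation and one pointer dereference per access, and keeps the header and the first bricks close together in memory.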
Adding extra layers of dereferencing can cause a little (very little) overhead. The reason is that each -> the compiler sees means the generated code has to do an extra memory lookup and offset. For instance, evaluating c->b->a, where b is a pointer member, means loading the pointer c, adding the offset of b, loading b, adding the offset of a, and then loading a: three loads, each depending on the result of the one before. With c.b.a, where the structs are nested by value, the offsets of b and a fold into a single constant, so it takes the address of c, a single add, and one direct load of a.
Unless this type of work is being done a ton in small, tight loops, it won't amount to squat for time. If you are doing this in heavy inner loops though (and your compiler isn't helping you), then it could add up. For those cases, consider caching the lowest level struct pointer and working from there.
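Caching the lowest-level struct pointer might look like the sketch below. The type layouts are assumptions loosely based on the question's Breakout, Level and bricks names, and both function names are hypothetical.

```c
#include <stddef.h>

typedef struct { int visible; } Brick;
typedef struct { Brick *bricks; size_t brick_count; } Level;
typedef struct { Level *levels; size_t level_in_play; } Breakout;

/* Walks the full chain of indirections on every iteration. */
size_t count_visible_chained(const Breakout *game)
{
    size_t count = 0;
    for (size_t i = 0; i < game->levels[game->level_in_play].brick_count; i++)
        if (game->levels[game->level_in_play].bricks[i].visible)
            count++;
    return count;
}

/* Resolves the chain once, then works from the cached pointer. */
size_t count_visible_cached(const Breakout *game)
{
    const Level *level = &game->levels[game->level_in_play];
    const Brick *bricks = level->bricks;
    size_t n = level->brick_count;
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (bricks[i].visible)
            count++;
    return count;
}
```

An optimising compiler will often do this hoisting itself when it can prove nothing in the loop modifies the intermediate pointers, so the manual version mostly matters in debug builds or when aliasing prevents the optimisation.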
That said, anytime you bring up performance, step one is to profile. Without a profile, you are guessing. You have made an assertion that struct dereferencing is the root of your performance problem, but without an up-to-date and valid profile (on a release build) you are guessing and probably wasting time.
In the first place, it's easy and tempting to guess what the problem is. The sneaky thing about guesses is - they are sometimes right. But why guess, when you can find out for drop-dead sure what's taking the time. I recommend this approach.
That said, here's my guess. malloc and free, if you single-step through them at the assembly language level, are probably doing a lot more than you thought. I only allocate memory for structures if I know I will not be doing it at particularly high frequency. If I must allocate/deallocate them dynamically, at high frequency, it helps to have a free list of used copies, so I can just grab them off the list rather than going to malloc all the time.
Nevertheless, take some stackshots. Chances are you can fix a series of problems and make it a whole lot faster.
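The free list mentioned above can be sketched as follows. Node, node_alloc and node_release are hypothetical names, and this version is not thread-safe; it only shows the idea of recycling blocks instead of going back to malloc and free each time.

```c
#include <stdlib.h>

typedef struct Node {
    struct Node *next;   /* link used only while the node is on the free list */
    int payload;
} Node;

static Node *free_list = NULL;

/* Reuses a node from the free list if one is available,
   otherwise falls back to malloc. */
Node *node_alloc(void)
{
    if (free_list != NULL) {
        Node *node = free_list;
        free_list = node->next;
        return node;
    }
    return malloc(sizeof(Node));
}

/* Puts the node back on the free list instead of calling free. */
void node_release(Node *node)
{
    node->next = free_list;
    free_list = node;
}
```

After a warm-up period, allocation and release become a couple of pointer assignments. The nodes on the list are still owned by the program and should be freed at shutdown if that matters for your leak checker.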
Not "lots of structs" in and of itself, other than the potential for a greater number of memory accesses. "Lots of indirection", however, is more likely to be a cause. You have to consider how many memory accesses are generated in order to get to the actual data. Data proximity and size may also affect caching, but that is much harder to analyse.
Also, since you mentioned in a comment that you are performing dynamic memory allocation: the time taken to find and allocate a block is non-deterministic and variable. If you are repeatedly allocating and freeing blocks during execution of the algorithm (rather than pre-allocating at initialisation, for example), this can cause both degradation and variability in performance.
If you have a profiling tool, profile the code to see where the performance hit occurs.

Resources