Are there performance problems with non-local jumps? - c

I am using non-local jumps (setjmp, longjmp). I would like to know whether they can be a performance problem. Does setjmp save the whole stack, or just some pointers?
Thanks.

setjmp has to save sufficient information for the program to continue execution when longjmp is called. This will typically consist of the current stack pointer, along with the current values of any other CPU registers that could affect the computation.
I can't comment on whether this causes a "performance problem", because I don't know what you want to compare it against.
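For context, a minimal sketch of the usual error-handling pattern these functions are meant for (the names here are illustrative):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf on_error;

static void do_work(void)
{
    /* ... on failure, unwind straight back to the setjmp site ... */
    longjmp(on_error, 1);
}

int main(void)
{
    if (setjmp(on_error) == 0) {   /* saves the stack pointer and registers */
        do_work();
    } else {
        fprintf(stderr, "recovered via longjmp\n");
    }
    return 0;
}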

The quick answer is: not very likely. If setjmp ever becomes a noticeable bottleneck in your program, I'd tend to say your program design needs an overhaul.

Like Jens said, if it ever becomes a noticeable bottleneck, redesign it since that is not how setjmp is supposed to be used.
As for your question:
This probably depends on which architecture you are running your program on and on exactly what the compiler does with your code. On ARM, a goto is probably translated into a single branch instruction, which is quite fast. setjmp and longjmp, on the other hand, need to save and restore all registers in order to resume execution after the jump. On an ARMv7-A with NEON support, this would require saving roughly 16 32-bit registers and up to 16 128-bit registers, which is quite a bit of extra work compared to a simple branch.
I have no idea if less work is required on x86, but I would suspect that goto is a lot cheaper there too.
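If you want an actual number for your platform, a rough timing sketch like this one (standard C only; the iteration count is arbitrary) shows the cost of a setjmp/longjmp pair:

#include <setjmp.h>
#include <stdio.h>
#include <time.h>

static jmp_buf env;

int main(void)
{
    volatile long i;                  /* volatile so it survives the longjmp */
    clock_t t0 = clock();
    for (i = 0; i < 10000000L; i++) {
        if (setjmp(env) == 0)         /* save registers */
            longjmp(env, 1);          /* restore them and resume at setjmp */
    }
    printf("%.3f s for 10M setjmp/longjmp pairs\n",
           (double)(clock() - t0) / CLOCKS_PER_SEC);
    return 0;
}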

Related

C function code length vs. processor cache

As C is a procedure-oriented language, when working with C I always end up with sequential code, running from top to bottom as one or a few C functions.
Sometimes I write functions of 1000 lines, because I think function calls have overhead. While this avoids duplicating code, I'd say I still duplicate less than 5% of the code in these long functions.
So, what is the effect of long functions on the processor cache? Will long functions prevent better CPU cache usage? Do CPU caches work by caching a whole C function? If the processor cache doesn't like long functions, would it be more efficient to use function calls instead?
Readability should, in general, always come first, and you can pretty much regard this as a "last resort" kind of optimisation which will not buy you a significant performance gain.
Today's CPUs are caching the instructions as well as the data. In general, you should optimise the layout of the data and the memory access patterns, but the way in which instructions are arranged also matters for the utilisation of the instruction cache.
Calling a non-inlined function is in fact an unconditional jump, much like a jmp instruction. This jump makes the CPU start fetching instructions from another (possibly far) location in memory. If this new location isn't found in the instruction cache, the CPU will stall until the corresponding memory is brought there. In theory, if the code contains no jumps and branches, the CPU could prefetch instructions as aggressively as possible.
Also, you never really know how far "too far" is. Jumping a few kilobytes forwards or backwards may well be a cache hit, since a typical instruction cache today is around 32 kilobytes.
It's a very tricky optimisation to do right, and I would advise you to look at your data layout and memory access patterns first.
The other concern is the overhead of passing the arguments on the stack or in registers. With today's CPUs this is less of a problem, since the whole stack is usually "hot" in the data cache, and register renaming can even eliminate register-to-register moves to a no-op.
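If you want to observe the call overhead this answer describes, one quick experiment is to pin one version as a real call and let the other inline; the noinline attribute is GCC/Clang-specific, and the function names are made up:

/* Compile with optimisation, e.g. gcc -O2, and time each variant. */
__attribute__((noinline)) static int far_add(int a, int b) { return a + b; }
static inline int near_add(int a, int b) { return a + b; }

int hot_loop(const int *v, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s = far_add(s, v[i]);   /* swap in near_add and compare timings */
    return s;
}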

In C, does using static variables in a function make it faster?

My function will be called thousands of times. If I want to make it faster, will changing the local function variables to static be of any use? My logic is that, because static variables are persistent between function calls, they are allocated only the first time, so every subsequent call will be faster because the memory-allocation step is skipped.
Also, if the above is true, would using global variables instead of parameters be a faster way to pass information to the function each time it is called? I think space for parameters is also allocated on every function call, to allow for recursion (that's why recursion uses more memory), but since my function is not recursive, and if my reasoning is correct, removing the parameters should in theory make it faster.
I know these are horrible programming habits, but please tell me whether it is wise. I am going to try it anyway, but please give me your opinion.
The overhead of local variables is zero. Each time you call a function, you are already setting up the stack for the parameters, return values, etc. Adding local variables means that you're adding a slightly bigger number to the stack pointer (a number which is computed at compile time).
Also, local variables are probably faster due to cache locality.
If you are only calling your function "thousands" of times (not millions or billions), then you should be looking at your algorithm for optimization opportunities after you have run a profiler.
Re: cache locality:
Frequently accessed global variables probably have temporal locality. They also may be copied to a register during function execution, but will be written back into memory (cache) after a function returns (otherwise they wouldn't be accessible to anything else; registers don't have addresses).
Local variables will generally have both temporal and spatial locality (they get that by virtue of being created on the stack). Additionally, they may be "allocated" directly to registers and never be written to memory.
The best way to find out is to actually run a profiler. This can be as simple as executing several timed tests using both methods and then averaging out the results and comparing, or you may consider a full-blown profiling tool which attaches itself to a process and graphs out memory use over time and execution speed.
Do not perform random micro-tuning because you have a gut feeling it will be faster. Compilers all have slightly different implementations of things, and what is true on one compiler in one environment may be false in another configuration.
To tackle that comment about fewer parameters: the process of "inlining" functions essentially removes the overhead related to calling a function. Chances are a small function will be automatically inlined by the compiler, but you can also suggest that a function be inlined (see the sketch below).
In a different language, C++, the new standard (C++11) supports perfect forwarding and move semantics with rvalue references, which removes the need for temporaries in certain cases and can reduce the cost of calling a function.
I suspect you're prematurely optimizing, though; you should not be this concerned with performance until you've discovered your real bottlenecks.
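As a rough illustration of that inlining suggestion (C99 and later; whether the compiler honours the hint is entirely up to the compiler):

static inline int square(int x)
{
    return x * x;   /* definition visible at the call site, so calls can be
                       replaced with the body outright */
}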
Absolutely not! The only "performance" difference is in when the variables are initialised:
int anint = 42;
vs
static int anint = 42;
In the first case the integer is set to 42 every time the function is called; in the second case it is set to 42 when the program is loaded.
However, the difference is so trivial as to be barely noticeable. It's a common misconception that storage has to be allocated for "automatic" variables on every call. This is not so: C uses already-allocated space on the stack for these variables.
Static variables may actually slow you down, as some aggressive optimisations are not possible on static variables. Also, since locals live in a contiguous area of the stack, they are easier to cache efficiently.
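The persistence itself is the only legitimate reason to reach for a function static; a minimal sketch:

int counter(void)
{
    static int n = 0;   /* initialised once, when the program is loaded */
    return ++n;         /* returns 1, 2, 3, ... across successive calls */
}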
There is no one answer to this. It will vary with the CPU, the compiler, the compiler flags, the number of local variables you have, what the CPU's been doing before you call the function, and quite possibly the phase of the moon.
Consider two extremes: if you have only one or a few local variables, they might easily be stored in registers rather than allocated memory locations at all. If register "pressure" is sufficiently low, this may happen without executing any instructions at all.
At the opposite extreme there are a few machines (e.g., IBM mainframes) that don't have stacks at all. In this case, what we'd normally think of as stack frames are actually allocated as a linked list on the heap. As you'd probably guess, this can be quite slow.
When it comes to accessing the variables, the situation is somewhat similar -- access to a machine register is pretty well guaranteed to be faster than anything allocated in memory can possibly manage. On the other hand, access to variables on the stack can be pretty slow -- it normally requires something like an indexed indirect access, which (especially with older CPUs) tends to be fairly slow. Then again, access to a global (which a static is, even though its name isn't globally visible) typically requires forming an absolute address, which some CPUs penalize to some degree as well.
Bottom line: even the advice to profile your code may be misplaced -- the difference may easily be so tiny that even a profiler won't detect it dependably, and the only way to be sure is to examine the assembly language that's produced (and spend a few years learning assembly language well enough to say anything meaningful when you do look at it). The other side of this is that when you're dealing with a difference you can't even measure dependably, the chance that it'll have a material effect on the speed of real code is so remote that it's probably not worth the trouble.
It looks like static vs non-static has been completely covered, but on the topic of global variables: these will often slow a program's execution down rather than speed it up.
The reason is that tightly scoped variables make it easy for the compiler to optimise heavily; if the compiler has to look all over your application for the places where a global might be used, its optimisation won't be as good.
This is compounded when you introduce pointers. Say you have the following code:
int myFunction(void)
{
    SomeStruct a, b;
    SomeStruct *A = &a, *B = &b;   /* point at distinct locals */

    FillOutSomeStruct(B);
    memcpy(A, B, sizeof(*A));
    return A->result;
}
the compiler knows that the pointers A and B can never overlap, and so it can optimise the copy. If A and B were global they could possibly point to overlapping or identical memory, which means the compiler must 'play it safe', which is slower. The problem is generally called 'pointer aliasing' and can occur in lots of situations, not just memory copies.
http://en.wikipedia.org/wiki/Pointer_alias
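As an aside, C99 added the restrict qualifier so you can hand the compiler that no-aliasing promise yourself; a minimal sketch, with SomeStruct as a stand-in matching the example above:

#include <string.h>

typedef struct { int result; } SomeStruct;

/* restrict asserts that dst and src never overlap, so the compiler may
   optimise the copy as if both were distinct locals */
void copy_struct(SomeStruct *restrict dst, const SomeStruct *restrict src)
{
    memcpy(dst, src, sizeof *dst);
}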
Using static variables may make a function a tiny bit faster. However, this will cause problems if you ever want to make your program multi-threaded. Since static variables are shared between function invocations, invoking the function simultaneously in different threads will result in undefined behaviour. Multi-threading is the type of thing you may want to do in the future to really speed up your code.
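A small sketch of that hazard (the function and names are invented for illustration):

#include <stdio.h>

const char *format_id(int id)
{
    static char buf[32];                      /* one buffer shared by every caller */
    snprintf(buf, sizeof buf, "ID-%d", id);   /* data race if two threads run this */
    return buf;
}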
Most of the things you mentioned are referred to as micro-optimizations. Generally, worrying about these kinds of things is a bad idea. It makes your code harder to read and harder to maintain. It's also highly likely to introduce bugs. You'll likely get more bang for your buck doing optimizations at a higher level.
As M2tM suggests, running a profiler is also a good idea. Check out gprof for one which is quite easy to use.
You can always time your application to truly determine what is fastest. Here is what I understand: (all of this depends on the architecture of your processor, btw)
C functions create a stack frame, which is where passed parameters and local variables are put, as well as the return pointer back to where the caller called the function. There is no memory-management allocation here; it's usually a simple pointer adjustment and that's it. Accessing data off the stack is also pretty quick. Penalties usually come into play when you're dealing with pointers.
As for global or static variables, they're the same in the sense that they're going to be allocated in the same region of memory. Accessing them may use a different addressing mode than local variables; it depends on the compiler.
The major difference between your scenarios is memory footprint, not so much speed.
Using static variables can actually make your code significantly slower. Static variables must exist in a 'data' region of memory. In order to use that variable, the function must execute a load instruction to read from main memory, or a store instruction to write to it. If that region is not in the cache, you lose many cycles. A local variable that lives on the stack will most surely have an address that is in the cache, and might even be in a cpu register, never appearing in memory at all.
I agree with the other comments about profiling to find out things like that, but generally speaking, function statics should be slower. If you want them, what you are really after is a global. Function statics insert code/data to check whether the thing has been initialized already, and that check runs every time your function is called.
Profiling may not show the difference; disassembling and knowing what to look for might.
I suspect you are only going to see a variation of a few clock cycles per loop (on average, depending on the compiler, etc). Sometimes the change will be a dramatic improvement or dramatically slower, and that won't necessarily be because the variable's home has moved to/from the stack. Let's say you save four clock cycles per function call for 10000 calls on a 2 GHz processor. Very rough calculation: 40,000 cycles at 2 GHz is 20 microseconds saved. Is 20 microseconds a lot or a little compared to your current execution time?
You will likely get more of a performance improvement by making all of your char and short variables into ints, among other things. Micro-optimization is a good thing to know, but it takes lots of time experimenting, disassembling, and timing the execution of your code, and understanding that fewer instructions does not necessarily mean faster, for example.
Take your specific program and disassemble both the function in question and the code that calls it, with and without the static. If you gain only one or two instructions and this is the only optimization you are going to do, it is probably not worth it. You may not be able to see the difference while profiling. Changes in where the cache lines hit could show up in profiling before changes in the code itself, for example.

Why do we need to suggest that a variable be stored in a register?

As I understand it, in C we can use the keyword "register" to suggest to the compiler that a variable should be stored in a CPU register. Isn't it true that all variables involved in CPU instructions will eventually be stored in CPU registers for execution?
The register keyword is a way of telling the compiler that the variable is heavily used. It's true that values must usually be loaded temporarily into registers to perform calculations on them. The name comes from the idea that a compiler might keep the variable in a register for the entire duration that it is in scope, rather than only temporarily when it is being used in a calculation.
The keyword is obsolete for the purpose of optimisation, since modern compilers can determine when a variable is heavily used (and when it does not have its address taken) without help from the programmer.
You should not use the register keyword. It is an antique relic, maintained for backward compatibility. Most compilers will ignore it (by default).
There could be exceptions, but they are very rare; consult your compiler manual.
Isn't it true that all variables that involved in CPU instructions will be eventually stored in CPU registers for execution?
Yes, that is true. But CPU registers are limited in number, so variables are usually LOADed/STOREd from 'normal' memory and live in a register only briefly. The register keyword is (was) a way of marking high-priority variables that should occupy a register longer, like the i in for(i = 0; ...).
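A sketch of that historical usage (a modern compiler would allocate these registers on its own, without the hints):

#include <stddef.h>

long sum_array(const int *a, size_t n)
{
    register long sum = 0;            /* hint: keep in a register; note that
                                         you may not take sum's address now */
    for (register size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}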
In old times, compilers weren't as smart as they are today. It was a hint from the programmer to the compiler that a variable should be stored in a register to allow fast access/modification. Today, almost any decent compiler implements register-allocation algorithms clever enough to beat human intuition.
Most variables will be loaded into registers for a short while...as long as necessary to do what needs to be done with them. The register keyword hints that they should be kept there.
Compilers' optimization has gotten so much better, though, that the register keyword isn't very helpful. In fact, if your compiler respects it at all (and many don't), it could even mess you up (by tying the compiler's hands, making certain optimizations impossible). So it's a pretty bad idea these days.

Mechanism of the Boehm Weiser Garbage Collector

I was reading the paper "Garbage Collection in an Uncooperative Environment" and wondering how hard it would be to implement it. The paper describes a need to collect all addresses from the processor (in addition to the stack). The stack part seems intuitive. Is there any way to collect addresses from the registers other than enumerating each register explicitly in assembly? Let's assume x86_64 on a POSIX-like system such as Linux or Mac OS X.
SetJmp
Since Boehm and Weiser actually implemented their GC, a basic source of information is the source code of that implementation (it is open source).
To collect the register values, you may want to subvert the setjmp() function, which saves a copy of the registers in a custom structure (at least those registers which are supposed to be preserved across function calls). But that structure is not standardized (its contents are nominally opaque) and setjmp() may be specially handled by the C compiler, making it a bit delicate to use for anything other than a longjmp() (which is already quite hard as it is). A piece of inline assembly seems much easier and safer.
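For what it's worth, the classic trick looks roughly like this; scan_region is a hypothetical routine standing in for the collector's conservative root scanner:

#include <setjmp.h>

void scan_region(void *lo, void *hi);    /* hypothetical GC root scanner */

void mark_registers_as_roots(void)
{
    jmp_buf env;
    setjmp(env);                 /* spills (at least the callee-saved)
                                    registers into env, on the stack */
    scan_region(&env, &env + 1); /* scan the buffer conservatively */
}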
The first hard part of the GC implementation seems to be reliably detecting the start and end of the stacks (note the plural: there may be threads, each with its own stack). This requires delving into ill-documented details of the OS ABI. When my desktop system was an Alpha machine running FreeBSD, the Boehm-Weiser implementation could not run on it (although it supported Linux on the same processor).
The second hard part will be when trying to go generational, trapping write accesses by playing with page access rights. This again will require reading some documentation of questionable existence, and some inline assembly.
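A heavily simplified sketch of that page-protection trick on POSIX; remember_dirty is a hypothetical GC hook, and real code would query the page size with sysconf rather than assume 4096:

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE 4096                  /* assumed page size, for brevity */

void remember_dirty(void *page);   /* hypothetical GC hook */

static void on_write(int sig, siginfo_t *si, void *ctx)
{
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
    (void)sig; (void)ctx;
    remember_dirty(page);                          /* log the dirty page */
    mprotect(page, PAGE, PROT_READ | PROT_WRITE);  /* let the write retry */
}

void install_write_barrier(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = on_write;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
}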
I think they use the flushrs assembly instruction to put the registers on the stack, but that is an IA-64 (Itanium) instruction rather than x86_64. I am sure someone on Stack Overflow will correct me if this is wrong.
It is not hard to implement a naive collector: it's just an algorithm after all. The hard bits are as stated, but I will add the worst ones: tracking exceptions is nasty, and stopping threads is even worse: that one can't be done at all on some platforms. There's also the problem of trapping all pointers that get handed over to the OS and lost from the program temporarily (happens a lot in Windows window message handlers).
My own multi-threaded GC is similar to the Boehm collector: more or less standard C++ with a few hacks (using jmp_buf is more or less certain to work) and a slightly less hostile environment (no exceptions). But it stops the world by cooperation, which is very bad: if you have a busy CPU, the idle ones wait for it. Boehm uses signals or other OS features to try to stop threads, but the support is very flaky.
And note also that the Intel IA-64 processor has two stacks per thread (a memory stack and a register backing store), which makes it a bit hard to account for this kind of thing generically.

Is there a way to force a variable to be stored in the cache in C?

I just had a phone interview where I was asked this question. I am aware of ways to store in register or heap or stack, but cache specifically?
Not in C as a language. In GCC as a compiler - look for __builtin_prefetch.
You might be interested in reading What every programmer should know about memory.
Edit:
Just to clear up some confusion: caches are physically separate memories in hardware, but not in the software abstraction of the machine. A word in a cache is always associated with an address in main memory. This is different from the CPU registers, which are named/addressed separately from RAM.
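A small sketch of the GCC builtin mentioned above; the prefetch distance of 16 elements is a made-up starting point you would tune by measurement:

#include <stddef.h>

long sum_with_prefetch(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)                       /* args: (address, rw, locality) */
            __builtin_prefetch(&data[i + 16], 0, 1);
        sum += data[i];
    }
    return sum;
}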
In C, as defined by the C standard? No.
In C, as in some specific implementation on a specific platform? Maybe.
As cache is a CPU concept and is meaningless for the C language (and C also targets processors that have no cache at all; that is unlikely today, but was quite common in the old days), the answer is definitely no.
Trying to optimize such things by hand is also usually a bad idea.
What you can do is keep the job easy for the compiler: keep loops very short and have them do only one thing (good for the instruction cache), iterate over memory blocks in the right order (prefer accesses to consecutive cells in memory over sparse accesses), and avoid reusing the same variables for different purposes (it introduces read-after-write dependencies). If you are attentive to such details, the program is more likely to be efficiently optimized by the compiler, and memory accesses are more likely to be cached.
But it will still depend on the actual hardware, and even then the compiler cannot guarantee it.
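For instance, a sketch of the "right order" point: iterating a 2-D array row by row touches consecutive addresses, so each cache line fetched is fully used (the array sizes are arbitrary):

#define ROWS 1024
#define COLS 1024

long sum_matrix(const int m[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)       /* row-major order: consecutive */
        for (int j = 0; j < COLS; j++)   /* addresses, good spatial locality */
            sum += m[i][j];
    return sum;  /* swapping the loops would stride by COLS ints per access
                    and waste most of each cache line */
}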
It depends on the platform, so if you were speaking to a company targeting current-generation consoles, you would need to know the PowerPC data-cache intrinsics/instructions. On various platforms, you would also need to know the false-sharing rules. Also, you can't cache from memory explicitly marked as uncached.
Without more context about the actual job or company or question, this would probably be best answered by talking about what not to do to keep memory references in the data cache.
If you are trying to force something to be stored in the CPU cache, I would recommend that you avoid trying to do so unless you have an overwhelmingly good reason. Manually manipulating the CPU cache can have all sorts of unintended consequences, not the least among them being coherency in multi-core or multi-CPU applications. This is something that is done by the CPU at run-time and is generally transparent to the programmer and the compiler for a good reason.
The specific answer will depend on your compiler and platform. If you are targeting a MIPS architecture, there is a CACHE instruction (assembly) which allows you to do CPU cache manipulations.
