Functions in our code make it slower? - c

When we compile code and execute it, in assembly, in which our code gets converted, functions are stored in a non sequential manner. So every time a function is called, the processor needs to throw away the instructions in the pipeline. Doesn't this affect the performance of the program?
PS: I'm not considering the time invested in developing such programs without functions. Purely on the performance level. Are there any ways in which compilers deal with this to reduce it?

So every time a function is called, the processor needs to throw away the instructions in the pipeline.
No, everything after the decode stage is still good. The CPU knows not to keep decoding after an unconditional branch (like a jmp, call, or ret). Only the instructions that have been fetched but not yet decoded are ones that shouldn't run. Until the target address is decoded from the instruction, there's nothing useful for beginning of the pipeline to do, so you get bubbles in the pipeline until the target address is known. Decoding branch instructions as early as possible thus minimizes the penalty for taken branches.
In the classic RISC pipeline, the stages are IF ID EX MEM WB (fetch, decode, execute, mem, write-back (results to registers). So when ID decodes a branch instruction, the pipeline throws away the instruction currently being fetched in IF, and the instruction currently being decoded in ID (because it's the instruction after the branch).
"Hazard" is the term for things that prevent a steady stream of instructions from going through the pipeline at one per clock. Branches are a Control Hazard. (Control as in flow-control, as opposed to data.)
If the branch target isn't in L1 I-cache, the pipeline will have to wait for instructions to streaming in from memory before the IF pipeline stage can produce a fetched instruction. I-cache misses always create a pipeline bubble. Prefetching usually avoids this for non-branching code.
More complex CPUs decode far enough ahead to detect branches and re-steer fetch soon enough to hide this bubble. This may involve a queue of decoded instructions to hide the fetch bubble.
Also, instead of actually decoding to detect branch instructions, the CPU can check every instruction address against a "Branch Target Buffer" cache. If you get a hit, you know the instruction is a branch even though you haven't decoded it yet. The BTB also holds the target address, so you can start fetching from there right away (if it's an unconditional branch or your CPU supports speculative execution based on branch prediction).
ret is actually the harder case: the return address is in a register or on the stack, not encoded directly into the instruction. It's an unconditional indirect branch. Modern x86 CPUs maintain an internal return-address predictor stack, and perform very badly when you mis-match call/ret instructions. E.g. call label / label: pop ebx is terrible for position-independent 32bit code to get EIP into EBX. That will cause a mis-predict for the next 15 or so rets up the call tree.
I think I've read that a return-address predictor stack is used by some other non-x86 microarchitectures.
See Agner Fog's microarchitecture pdf to learn more about how x86 CPUs behave (also see the x86 tag wiki), or read a computer architecture textbook to learn about simple RISC pipelines.
For more about caches and memory (mostly focused on data caching / prefetching), see Ulrich Drepper's What Every Programmer Should Know About Memory.
An unconditional branch is quite cheap, like usually a couple cycles at worst (not including I-cache misses).
The big cost of a function call is when the compiler can't see the definition of the target function, and has to assume it clobbers all the call-clobbered registers in the calling convention. (In x86-64 SystemV, all the float/vector registers, and about 8 integer registers.) This requires either spilling to memory or keeping live data in call-preserved registers. But that means the function has to save/restore those register to not break the caller.
Inter-procedural optimization to let functions take advantage of knowing which registers other functions actually clobber, and which they don't, is something compilers can do within the same compilation unit. Or even across compilation units with link-time whole-program optimization. But it can't extend across dynamic-linking boundaries, because the compiler isn't allowed to make code that will break with a differently-compiled version of the same shared library.
Are there any ways in which compilers deal with this to reduce it?
They inline small functions, or even large static functions that are only called once.
e.g.
int foo(void) { return 1; }
mov eax, 1 #,
ret
int bar(int x) { return foo() + x;}
lea eax, [rdi+1] # D.2839,
ret
As #harold points out, overdoing it with inlining can cause cache misses, too, because it inflates your code size so much that not all of your hot code fits in cache.
Intel SnB-family designs have a small but very fast uop cache that caches decoded instructions. It only holds at most 1536 uops IIRC, in lines of 6 uops each. Executing from uop cache instead of from the decoders shortens the branch-mispredict penalty from 19 to 15 cycles, IIRC (something like that, but those numbers are probably not actually correct for any specific uarch). There's also a significant frontend throughput boost compared to the decoders, esp. for long instructions which are common in vector code.

Related

How to prevent LDM/STM instructions expansion in ARM Compiler 5 armcc inline assembler?

I'm trying to generate AXI bus burst accesses using STM/LDM instructions in inline assembly in .c file compiled with ARM Compiler 5 armcc.
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
But ARM Compiler armcc User Guide, paragraph 7.18 is saying:
"All LDM and STM instructions are expanded into a sequence of LDR and STR instructions with equivalent effect. However, the compiler might subsequently recombine the separate instructions into an LDM or STM during optimization."
And that is what really happens in practice, LDM/STM are expanded into a set of LDR/STR in some cases and order of these instuctions is arbitrary.
This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).
To resolve this it's possible to use embedded assembler instead of inline assembler, but this leads to extra function calls-returns what affects performance.
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
Target CPU: Cortex M0+ (ARMv6-M).
Edit:
Slave devices are all on-chip devices, most of them are non-memory devices. For every register of non-memory slave that supports burst access region of address space is reserved (for example [0x10000..0x10100]), I'm not completely sure why, maybe CPU or bus doesn't support fixed (non-incremental) addresses. HW ignores offsets within this region. Full request can be 16 bytes for example and first word of the full request is first word written (even if offset is non-zero).
So I'm wondering if there is a way to generate LDM/STM properly without losing performance? We were able to do this in GCC, but didn't find any solution for armcc.
A little bit about compiler optimizations. Register allocation is one of it's toughest jobs. The heart of any compiler's code generation is probably around when it allocates physical CPU registers. Most compilers are using Single static assignment or SSA to rename your 'C' variables into a bunch of pseudo variable (or time order variables).
In order for your STMIA and LDMIA to work you need the loads and stores to be consistent. Ie, if it is stmia [rx], {r3,r7} and a restore like ldmia [rx], {r4,r8} with the 'r3' mapping to the new 'r4' and the stored 'r7' mapping to the restored 'r8'. This is not simple for any compiler to implement generically as 'C' variables will be assigned according to need. Different versions of the same variable maybe in different registers. To make the stm/ldm work those variable must be assigned so that register increments in the right order. Ie, for the ldmia above if the compiler want the stored r7 in r0 (maybe a return value?), there is no way for it to create a good ldm instruction without generating additional code.
You may have gotten gcc to generate this, but it was probably luck. If you proceed with only gcc, you will probably find it doesn't work as well.
See: ldm/stm and gcc for issues with GCC stm/ldm.
Taking your example,
inline void STMIA2(uint32_t addr, uint32_t w0, uint32_t w1)
{
__asm {
STMIA addr!, { w0, w1 }
}
}
The value of inline is that the whole function body may be put right in the code. The caller might have the w0 and w1 in registers R8 and R4. If the function is not inline, then the compile must place them in R1 and R2 but may have generated extra moves. It is difficult for any compiler to fulfil the requirements of the ldm/stm generically.
This affects performance since HW we use optimized for bursts processing. Also this breaks functional correctness because HW we use takes into consideration sequence of words and ignores offsets (but compiler think that it's safe to change the order of instructions).
If the hardware is a particular non-memory slave peripheral on the bus, then you can wrap the functionality to write to this slave in an external wrapper and force the register allocation (see AAPCS) so that ldm/stm will work. This will result in a performance hit which could be mitigated by some custom assembler in the driver for the device.
However, it sounds like the device might be memory? In this case, you have a problem. Normally, memory devices like this will use a cache only? If your CPU has an MPU (memory protection unit) and can enable both data and code cache, then you might resolve this issue. Cache lines will always be burst accesses. Care only needs to be taken in the code to setup the MPU and the data cache. OPs Cortex-M0+ has no cache and the devices are non-memory so this will not be possible (nor needed).
If your device is memory and you have no data cache then your issue is probably unresolvable (without massive effort) and you need different hardware. Or you can wrap it like the peripheral device and take a performance hit; loosing the benefits of the random access of the memory device.

C function code length VS processor cache

As C is procedure oriented language, while working with C, I always end up with sequential code, running from top to bottom as one or few C functions.
Sometime, I code functions of 1000 lines. Because I think function calls has overhead. While this doesn't duplicate code, I can say I duplicate less than 5% code in long functions.
So, what are effects of long functions over processor cache? Will long functions prevent from better CPU cache usage? Does CPU caches works like caching whole C function? If processor cache doesn't like long functions, then will it be more efficient to have function calls?
Readability should, in general, always come first, and you can pretty much regard this as a "last resort" kind of optimisation which will not buy you a significant performance gain.
Today's CPUs are caching the instructions as well as the data. In general, you should optimise the layout of the data and the memory access patterns, but the way in which instructions are arranged also matters for the utilisation of the instruction cache.
Calling a non-inlined function is in fact an unconditional jump, much like a jmp instruction. This jump makes the CPU start fetching instructions from another (possibly far) location in memory. If this new location isn't found in the instruction cache, the CPU will stall until the corresponding memory is brought there. In theory, if the code contains no jumps and branches, the CPU could prefetch instructions as aggressively as possible.
Also, you never really know how far is "too far". Jumping a few kilobytes forwards or backwards might well be a cache hit, since the usual instruction cache today is about 32 kilobytes.
It's a very tricky optimisation to do right, and I would advise you to look at your data layout and memory access patterns first.
The other concern is the overhead of passing the arguments on the stack or in registers. With today's CPUs this is less of a problem, since the whole stack is usually "hot" in the data cache, and register renaming can even eliminate register-to-register moves to a no-op.

Cost of push vs. mov (stack vs. near memory), and the overhead of function calls

Question:
Is accessing the stack the same speed as accessing memory?
For example, I could choose to do some work within the stack, or I could do work directly with a labelled location in memory.
So, specifically: is push ax the same speed as mov [bx], ax? Likewise is pop ax the same speed as mov ax, [bx]? (assume bx holds a location in near memory.)
Motivation for Question:
It is common in C to discourage trivial functions that take parameters.
I've always thought that is because not only must the parameters get pushed onto the stack and then popped off the stack once the function returns, but also because the function call itself must preserve the CPU's context, which means more stack usage.
But assuming one knows the answer to the headlined question, it should be possible to quantify the overhead that the function uses to set itself up (push / pop / preserve context, etc.) in terms of an equivalent number of direct memory accesses. Hence the headlined question.
(Edit: Clarification: near used above is as opposed to far in the segmented memory model of 16-bit x86 architecture.)
Nowadays your C compiler can outsmart you. It may inline simple functions and if it does that, there will be no function call or return and, perhaps, there will be no additional stack manipulations related to passing and accessing formal function parameters (or an equivalent operation when the function is inlined but the available registers are exhausted) if everything can be done in registers or, better yet, if the result is a constant value and the compiler can see that and take advantage of it.
Function calls themselves can be relatively cheap (but not necessarily zero-cost) on modern CPUs, if they're repeated and if there's a separate instruction cache and various predicting mechanisms, helping with efficient code execution.
Other than that, I'd expect the performance implications of the choice "local var vs global var" to depend on the memory usage patterns. If there's a memory cache in the CPU, the stack is likely to be in that cache, unless you allocate and deallocate large arrays or structures on it or have deep function calls or deep recursion, causing cache misses. If the global variable of interest is accessed often or if its neighbors are accessed often, I'd expect that variable to be in the cache most of the time as well. Again, if you're accessing large spans of memory that can't fit into the cache, you'll have cache misses and possibly reduced performance (possibly because there may or may not be a better, cache-friendly way of doing what you want to do).
If the hardware is pretty dumb (no or small caches, no prediction, no instruction reordering, no speculative execution, nothing), clearly you want to reduce the memory pressure and the number of function calls because each and everyone will count.
Yet another factor is instruction length and decoding. Instructions to access an on-stack location (relative to the stack pointer) can be shorter than instructions to access an arbitrary memory location at a given address. Shorter instructions may be decoded and executed faster.
I'd say there's no definitive answer for all cases because performance depends on:
your hardware
your compiler
your program and its memory accessing patterns
For the clock-cycle-curious...
For those who would like to see specific clock cycles, instruction / latency tables for a variety of modern x86 and x86-64 CPUs are available here (thanks to hirschhornsalz for pointing these out).
You then get, on a Pentium 4 chip:
push ax and mov [bx], ax (red boxed) are virtually identical in their efficiency with identical latencies and throughputs.
pop ax and mov ax, [bx] (blue boxed) are similarly efficient, with identical throughputs despite mov ax, [bx] having twice the latency of pop ax
As far as the follow-on question in the comments (3rd comment):
indirect addressing (i.e. mov [bx], ax) is not materially different than direct addressing (i.e. mov [loc], ax), where loc is a variable holding an immediate value, e.g. loc equ 0xfffd.
Conclusion: Combine this with Alexey's thorough answer, and there's a pretty solid case for the efficiency of using the stack and letting the compiler decide when a function should be inlined.
(Side note: In fact, even as far back as the 8086 from 1978, using the stack was still not less efficient than corresponding mov's to memory as can be seen from these old 8086 instruction timing tables.)
Understanding Latency & Throughput
A bit more may be needed to understand timing tables for modern CPUs. These should help:
definitions of latency and throughput
a useful analogy for latency and throughput, and their relation to instruction processing pipelines)

Do function pointers force an instruction pipeline to clear?

Modern CPUs have extensive pipelining, that is, they are loading necessary instructions and data long before they actually execute the instruction.
Sometimes, the data loaded into the pipeline gets invalidated, and the pipeline must be cleared and reloaded with new data. The time it takes to refill the pipeline can be considerable, and cause a performance slowdown.
If I call a function pointer in C, is the pipeline smart enough to realize that the pointer in the pipeline is a function pointer, and that it should follow that pointer for the next instructions? Or will having a function pointer cause the pipeline to clear and reduce performance?
I'm working in C, but I imagine this is even more important in C++ where many function calls are through v-tables.
edit
#JensGustedt writes:
To be a real performance hit for function calls, the function that you
call must be extremely brief. If you observe this by measuring your
code, you definitively should revisit your design to allow that call
to be inlined
Unfortunately, that may be the trap that I fell into.
I wrote the target function small and fast for performance reasons.
But it is referenced by a function-pointer so that it can easily be replaced with other functions (Just make the pointer reference a different function!). Because I refer to it via a function-pointer, I don't think it can be inlined.
So, I have an extremely brief, not-inlined function.
On some processors an indirect branch will always clear at least part of the pipeline, because it will always mispredict. This is especially the case for in-order processors.
For example, I ran some timings on the processor we develop for, comparing the overhead of an inline function call, versus a direct function call, versus an indirect function call (virtual function or function pointer; they're identical on this platform).
I wrote a tiny function body and measured it in a tight loop of millions of calls, to determine the cost of just the call penalty. The "inline" function was a control group measuring just the cost of the function body (basically a single load op). The direct function measured the penalty of a correctly predicted branch (because it's a static target and the PPC's predictor can always get that right) and the function prologue. The indirect function measured the penalty of a bctrl indirect branch.
614,400,000 function calls:
inline: 411.924 ms ( 2 cycles/call )
direct: 3406.297 ms ( ~17 cycles/call )
virtual: 8080.708 ms ( ~39 cycles/call )
As you can see, the direct call costs 15 cycles more than the function body, and the virtual call (exactly equivalent to a function pointer call) costs 22 cycles more than the direct call. That happens to be approximately how many pipeline stages there are between the start of the pipline (instruction fetch) and the end of the branch ALU. Therefore on this architecture, an indirect branch (aka a virtual call) causes a clear of 22 pipeline stages 100% of the time.
Other architectures may vary. You should make these determinations from direct empirical measurements, or from the CPU's pipeline specifications, rather than assumptions about what processors "should" predict, because implementations are so different. In this case the pipeline clear occurs because there is no way for the branch predictor to know where the bctrl will go until it has retired. At best it could guess that it's to the same target as the last bctrl, and this particular CPU doesn't even try that guess.
Calling a function pointer is not fundamentally different from calling a virtual method in C++, nor, for that matter, is it fundamentally different from a return. The processor, in looking ahead, will recognize that a branch via pointer is coming up and will decide if it can, in the prefetch pipeline, safely and effectively resolve the pointer and follow that path. This is obviously more difficult and expensive than following a regular relative branch, but, since indirect branches are so common in modern programs, it's something that most processors will attempt.
As Oli said, "clearing" the pipeline would only be necessary if there was a mis-prediction on a conditional branch, which has nothing to do with whether the branch is by offset or by variable address. However, processors may have policies that predict differently depending on the type of branch address -- in general a processor would be less likely to agressively follow an indirect path off of a conditional branch because of the possibility of a bad address.
A call through a function pointer doesn't necessarily cause a pipeline clear, but it may, depending on the scenario. The key is whether the CPU can effectively predict the destination of the branch ahead of time.
The way that modern "big" out-of-order cores handle indirect calls1 is roughly as follows:
Once you've executed the indirect branch a few times, the indirect branch predictor will try to predict the address to which the branch will occur in the future.
The first indirect branch predictors were very simple, capable of "predicting" only a single, fixed location.
Later predictors including those on most modern CPUs are much more complex, often capable of predicting well a repeated pattern of indirect jumps and also correlating the jump target with the direction of earlier conditional or indirect branches.
If the prediction is successful, the indirect call has a cost similar to a normal direct call, and this cost is largely "out of line" with the rest of the code (i.e., doesn't participate in dependency chains) so the impact on final runtime of the code is likely to be small unless the calls are very dense.
On the other hand, if the prediction is unsuccessful, you get a full misprediction, similar to a branch direction misprediction. You can't put a fixed number on the cost of this misprediction, since it depends on the surrounding code, but it usually causes a bubble of about 20 cycles in the front-end, and the overall cost in runtime often ends up similar.
So given those basics we can make some educated guesses at what happens in some specific scenarios:
A function pointer always points to the same function will almost always1 be well predicted and cost about the same as a regular function call.
A function pointer that alternates randomly between several targets will almost always be mispredicted. At best, we can hope the predictor always predicts whatever target is most common, so in the worst-case that the targets are chosen uniformly at random between N targets the prediction success rate is bounded by 1 / N (i.e., goes to zero as N goes to infinity). In this respect, indirect branches have a worse worst-case behavior than conditional branches, which generally have a worst-case misprediction rate of 50%2.
The prediction rate for a function pointer with behavior somewhere in the middle, e.g., somewhat predictable (e.g., following a repeating pattern), will depend heavily on the details of the hardware and the sophistication of the predictor. Modern Intel chips have quite good indirect predictors, but details haven't been publicly released. Conventional wisdom holds that they are using some indirect variant of an TAGE predictors used also for conditional branches.
1 A case that would mispredict even for a single target include the first time (or few times) the function is encountered, since the predictor can't predict indirect calls it hasn't seen yet! Also, the size of prediction resources in the CPU is limited, so if the function pointer hasn't been used in a while, eventually the prediction resources will be used for other branches and you'll suffer a misprediction the next time you call it.
2 Indeed, a very simple conditional predictor that simply predicts the direction most often seen recently should have a 50% prediction rate on totally randomly branch directions. To get significantly worse than 50% result, you'd have to design an adversarial algorithm which essentially models the predictor and always chooses to branch in the direction opposite of the model.
There's not a great deal of difference between a function-pointer call and a "normal" call, other than an extra level of indirection. So potentially there's a greater latency involved; if the destination address is not already in cache or registers, then the CPU potentially has to wait while it's retrieved from main memory.
So the answer is; yes, the pipeline can stall, but this is no different to normal function calls. And as usual, mechanisms such as branch prediction and out-of-order execution can help minimise the penalty.

How come INC instruction of x86 is not atomic? [duplicate]

This question already has answers here:
Can num++ be atomic for 'int num'?
(13 answers)
Closed 5 years ago.
I've read that INC instruction of x86 is not atomic. My question is how come? Suppose we are incrementing a 64 bit integer on x86-64, we can do it with one instruction, since INC instruction works with both memory variables and register. So how come its not atomic?
Why would it be? The processor core still needs to read the value stored at the memory location, calculate the increment of it, and then store it back. There's a latency between reading and storing, and in the mean time another operation could have affected that memory location.
Even with out-of-order execution, processor cores are 'smart' enough not to trip over their own instructions and wouldn't be responsible for modifying this memory in the time gap. However, another core could have issued an instruction that modifies that location, a DMA transfer could have affected that location, or other hardware touched that memory location somehow.
Modern x86 processors as part of their execution pipeline "compile" x86 instructions into a lower-level set of operations; Intel calls these uOps, AMD rOps, but what it boils down to is that certain type of single x86 instructions get executed by the actual functional units in the CPU as several steps.
That means, for example, that:
INC EAX
gets executed as a single "mini-op" like uOp.inc eax (let me call it that - they're not exposed).
For other operands things will look differently, like:
INC DWORD PTR [ EAX ]
the low-level decomposition though would look more like:
uOp.load tmp_reg, [ EAX ]
uOp.inc tmp_reg
uOp.store [ EAX ], tmp_reg
and therefore is not executed atomically. If on the other hand you prefix by saying LOCK INC [ EAX ], that'll tell the "compile" stage of the pipeline to decompose in a different way in order to ensure the atomicity requirement is met.
The reason for this is of course as mentioned by others - speed; why make something atomic and necessarily slower if not always required ?
You really don't want a guaranteed atomic operation unless you need it, from Agner Fog's Software optimization resources: instruction_tables.pdf (1996 – 2017):
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices
then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor
systems. This also applies to the XCHG instruction with a memory operand.

Resources