How to identify a loop on the instruction level?

Can a direct branch instruction with a target address lower than the address of the branch instruction itself be considered the beginning of a loop? Is this condition sufficient, or are there other situations (compiler optimizations, etc.) in which similar behaviour shows up at the instruction level?
What other approach would you recommend? The obvious one is storing a list of target addresses encountered: if a target address is taken by the same instruction more than once, it marks the beginning of a loop. The downside is that it takes memory to store all the addresses and time to check them.
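A minimal sketch of that bookkeeping approach in C (the instrumentation hook on_branch and the table size are hypothetical; a real tracer would call this from its branch-event callback):

#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 4096   /* hypothetical capacity; grow or chain as needed */

static uint64_t seen_branch[TABLE_SIZE];
static uint64_t seen_target[TABLE_SIZE];

/* Call once per executed branch. Returns true when this (branch, target)
   pair has been observed before, i.e. the target is a loop-head candidate.
   Open addressing with linear probing; address 0 marks an empty slot. */
bool on_branch(uint64_t branch, uint64_t target)
{
    uint64_t h = (branch ^ (target >> 3)) % TABLE_SIZE;
    for (uint64_t i = 0; i < TABLE_SIZE; i++) {
        uint64_t idx = (h + i) % TABLE_SIZE;
        if (seen_branch[idx] == 0) {         /* first sighting: record it */
            seen_branch[idx] = branch;
            seen_target[idx] = target;
            return false;
        }
        if (seen_branch[idx] == branch && seen_target[idx] == target)
            return true;                     /* repeated pair: back-edge candidate */
    }
    return false;                            /* table full; a sketch ignores this */
}

The backward-branch heuristic (target address lower than the branch address) is much cheaper and catches most compiler-generated loops, but as the question suspects, it is neither necessary nor sufficient on its own.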

Related

Why does allowing for recursion make C slower/inefficient on 8-bit CPUs

Answers to this question about compiler efficiency for 8-bit CPUs seem to imply that allowing for recursion makes the C language inefficient on these architectures. I don't understand how recursive calls to a function differ from just repeated calls to various functions.
I would like to understand why this is so (or why seemingly learned people think it is). I could guess that maybe these architectures just don't have the stack space, or perhaps the push/pop is inefficient, but these are just guesses.
Because to efficiently implement the C stack, you need the ability to efficiently load and store at arbitrary offsets within the current frame. For example, the 8086 processor provided based and indexed addressing modes, which allowed loading a stack variable in a single instruction.
On the 6502, you can only do this via the X or Y register, and since those are the only general-purpose index registers, reserving one as a data stack pointer is extremely costly. The Z80 can do it with its IX or IY registers, but not with the stack pointer register. However, indexed load instructions on the Z80 take a long time to execute, so it is still costly; on top of that, you either permanently reserve a second register as the frame pointer, or have to reload it from the SP register any time you want to access variables.
By comparison, if recursive calls are not supported, a second instance of a function cannot start while an existing call is still in progress. This means only a single set of variables is needed at a time, so you can allocate each function its own static piece of memory for its variables. Since that memory has a fixed location, you can then use fixed-address loads. Some implementations of Fortran used this approach.
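In C terms, the transformation looks roughly like this (sum3 is a hypothetical example of mine, not from the question):

/* Ordinary C: parameters and the local live in a stack frame,
   so the compiler needs indexed addressing to reach them. */
int sum3(int a, int b, int c)
{
    int t = a + b;
    return t + c;
}

/* With recursion ruled out, a compiler for a 6502-class CPU can give
   the function one static set of slots at fixed addresses instead: */
static int sum3_a, sum3_b, sum3_c, sum3_t;

int sum3_static(void)
{
    sum3_t = sum3_a + sum3_b;   /* plain fixed-address loads and stores */
    return sum3_t + sum3_c;     /* the caller stores arguments into the slots */
}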

How to find the basic block given an instruction location in that basic block?

Suppose I have an instruction's location. I would like to find the basic block that contains it. Let's define "finding the basic block" as finding the location of the entry point of the basic block that contains the desired instruction. Assume that address space randomization is disabled, so all program sections and libraries are loaded at the same virtual addresses every time the program is executed. How might I go about doing this?
You can do this under restrictive assumptions.
First, the code can't be self-modifying in any general sense. This would make the problem undecidable.
Second, you need a complete list of jump targets. Certainly debugging information will include this. But if you don't have debugging information, you can still deduce much by disassembling: find all branch and jump instructions and take their immediate targets. Jump tables implementing switch statements are also useful. The hard case is function pointers. Good reverse engineering tools do this quite well: they disassemble code even when little is known about its structure. On the other hand, they can't be perfect: interspersed data and code can always be confused with each other.
Third, you'll need a list of all jump/branch instruction addresses in the program.
With these lists in hand, you're good to go. Each basic block starts at a jump target and runs either to the instruction before the next target or to a jump/branch instruction (inclusive), whichever comes first. An algorithm that accepts an instruction address and searches the lists for the associated block's beginning and end is straightforward.
Actually, it's simplest to merge the lists into a single one and use binary search. The entries before and after the searched address define the block it lies in.
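A sketch of that lookup in C (the array name and the assumption that block boundaries are precomputed and sorted are mine, not from the answer):

#include <stddef.h>
#include <stdint.h>

/* bounds[] holds every block-start address (jump targets plus the address
   following each jump/branch), sorted ascending, with n entries.
   Returns the start address of the block containing addr, or 0 if addr
   precedes the first block. */
uint64_t block_start(const uint64_t *bounds, size_t n, uint64_t addr)
{
    size_t lo = 0, hi = n;           /* binary search: last bound <= addr */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (bounds[mid] <= addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? bounds[lo - 1] : 0;
}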
This is an extremely difficult question. In fact, you cannot even hope to know where the basic blocks are in the general case at the assembly level.
The problem comes from the fact that assembly is a jump-based language and, by definition, a basic block is a sequence of instructions into which no jump lands (except at its first instruction).
Even if you have executed 99% of the program, you can never know whether the last instruction will not land in the middle of something you believed to be a basic block. And, of course, I am speaking about only ONE EXECUTION; this has to hold for ANY EXECUTION.
So, recovering the CFG of a binary program (and thus its basic blocks) is as hard as the halting problem (see Turing's diagonal argument).
You should maybe give more details about what you really need, because the general question, as you stated it, is simply not solvable.
Two things need to happen:
You need to keep debug information containing the mapping.
The optimization level must be low enough for the mapping to be unambiguous.
In short, you need support from your toolchain; even more so if you actually want more information than just an instruction pointer at which a new variable goes live, with no information about the variable itself.
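For instance, with DWARF debug info kept (prog here is my placeholder binary name), the standard binutils tools already do the address-to-source mapping and provide the raw material for block boundaries:

addr2line -f -e ./prog 0x401234    # prints the enclosing function and file:line
objdump -d ./prog                  # disassembly from which jump targets can be collected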

How to use pld instruction in ARM

I use it like this.
__pld(pin[0], pin[1], pin[2], pin[3], pin[4]);
But I get this error.
undefined reference to `__pld'
What am I missing? Do I need to include a header file or something? I am using an ARM Cortex-A8; does it even support the pld instruction?
As shown in this answer, you can use inline assembler as per Clark. __builtin_prefetch is also a good suggestion. An important thing to know is how the pld instruction acts on your ARM: on some processors it does nothing, on others it brings the data into the cache. It is only going to be effective for a read (or read/modify/write). The other thing to note is that, where it does work, it fetches an entire cache line, so the example fetching the pin array doesn't need to specify all the members.
You will get more performance by ensuring that pld data is cache-line aligned. Another issue, judging from the code shown: you only gain performance on variables you read. In some cases you are just writing to the pin array, and there is no value in prefetching those items. The ARM has a write buffer, so writes are batched together and burst to the SDRAM chip automatically.
Grouping all read data together on a cache line will show the largest improvement; the whole line can be prefetched with a single pld. Also, when you unroll a loop, the compiler can see these reads and schedule them earlier where possible, so the data is already in the cache; at least on some ARM CPUs.
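A sketch of what that looks like, using the pin array from the question (the function name sum_pins and the array size are my assumptions; rw=0 requests a read prefetch, locality=1 means low temporal locality):

#include <stdint.h>

#define NPINS 5
extern uint32_t pin[NPINS];        /* assumed: read-mostly data, grouped together */

uint32_t sum_pins(void)
{
    /* One prefetch covers the whole array if it fits in one cache line;
       there is no need to prefetch each element separately. */
    __builtin_prefetch(&pin[0], 0, 1);

    uint32_t s = 0;
    for (int i = 0; i < NPINS; i++)
        s += pin[i];
    return s;
}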
Also, you may consider,
__attribute__((optimize("prefetch-loop-arrays")))
in the spirit of the accepted answer to the other question; probably the compiler will have already enabled this at -O3 if it is effective on the CPU you have specified.
Various compiler options can be specified with --param NAME=VALUE to give the compiler hints about the memory subsystem. This can be a very potent combination if you get the parameters right:
prefetch-latency
simultaneous-prefetches
l1-cache-line-size
l1-cache-size
l2-cache-size
min-insn-to-prefetch-ratio
prefetch-min-insn-to-mem-ratio
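For example, an invocation using some of these parameters might look like this (the values are illustrative guesses, not tuned numbers for any particular core; l1-cache-size is in kilobytes):

gcc -O3 -mcpu=cortex-a8 -fprefetch-loop-arrays \
    --param l1-cache-line-size=64 \
    --param l1-cache-size=32 \
    --param simultaneous-prefetches=6 \
    -c foo.c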
Make sure you specify an -mcpu to the compiler that supports pld. If all is right, the compiler should do this automatically for you; however, sometimes you may need to do it manually.
For reference, here is gcc-4.7.3's ARM prefetch loop arrays code activation.
/* Enable sw prefetching at -O3 for CPUS that have prefetch, and we have deemed
   it beneficial (signified by setting num_prefetch_slots to 1 or more.)  */
if (flag_prefetch_loop_arrays < 0
    && HAVE_prefetch
    && optimize >= 3
    && current_tune->num_prefetch_slots > 0)
  flag_prefetch_loop_arrays = 1;
Try http://www.ethernut.de/en/documents/arm-inline-asm.html
In GCC it might look like this (example from http://communities.mentor.com/community/cs/archives/arm-gnu/msg01553.html), with a usage of pld:
__asm__ __volatile__(
    "pld\t[%0]"
    :
    : "r" (first));
You may want to look at gcc's __builtin_prefetch. I reproduced it here for your convenience:
This function is used to minimize cache-miss latency by moving data into a cache before it is accessed. You can insert calls to __builtin_prefetch into code for which you know addresses of data in memory that is likely to be accessed soon. If the target supports them, data prefetch instructions will be generated. If the prefetch is done early enough before the access then the data will be in the cache by the time it is accessed.
The value of addr is the address of the memory to prefetch. There are two optional arguments, rw and locality. The value of rw is a compile-time constant one or zero; one means that the prefetch is preparing for a write to the memory address and zero, the default, means that the prefetch is preparing for a read. The value locality must be a compile-time constant integer between zero and three. A value of zero means that the data has no temporal locality, so it need not be left in the cache after the access. A value of three means that the data has a high degree of temporal locality and should be left in all levels of cache possible. Values of one and two mean, respectively, a low or moderate degree of temporal locality. The default is three.
for (i = 0; i < n; i++)
  {
    a[i] = a[i] + b[i];
    __builtin_prefetch (&a[i+j], 1, 1);
    __builtin_prefetch (&b[i+j], 0, 1);
    /* ... */
  }
Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid. For example, a prefetch of p->next will not fault if p->next is not a valid address, but evaluation will fault if p is not a valid address.
If the target does not support data prefetch, the address expression is evaluated if it includes side effects but no other code is generated and GCC does not issue a warning.
undefined reference to `__pld'
To answer the question about the undefined reference, __pld is an ARM compiler intrinsic. See __pld intrinsic in the ARM manual.
Perhaps GCC does not recognize the ARM intrinsic.

Do function pointers force an instruction pipeline to clear?

Modern CPUs have extensive pipelining; that is, they load necessary instructions and data long before they actually execute an instruction.
Sometimes, the data loaded into the pipeline gets invalidated, and the pipeline must be cleared and reloaded with new data. The time it takes to refill the pipeline can be considerable, and cause a performance slowdown.
If I call a function pointer in C, is the pipeline smart enough to realize that the pointer in the pipeline is a function pointer, and that it should follow that pointer for the next instructions? Or will having a function pointer cause the pipeline to clear and reduce performance?
I'm working in C, but I imagine this is even more important in C++ where many function calls are through v-tables.
edit
@JensGustedt writes:
To be a real performance hit for function calls, the function that you call must be extremely brief. If you observe this by measuring your code, you definitively should revisit your design to allow that call to be inlined.
Unfortunately, that may be the trap that I fell into.
I wrote the target function small and fast for performance reasons.
But it is referenced by a function-pointer so that it can easily be replaced with other functions (Just make the pointer reference a different function!). Because I refer to it via a function-pointer, I don't think it can be inlined.
So, I have an extremely brief, not-inlined function.
On some processors an indirect branch will always clear at least part of the pipeline, because it will always mispredict. This is especially the case for in-order processors.
For example, I ran some timings on the processor we develop for, comparing the overhead of an inline function call, versus a direct function call, versus an indirect function call (virtual function or function pointer; they're identical on this platform).
I wrote a tiny function body and measured it in a tight loop of millions of calls, to determine the cost of just the call penalty. The "inline" function was a control group measuring just the cost of the function body (basically a single load op). The direct function measured the penalty of a correctly predicted branch (because it's a static target and the PPC's predictor can always get that right) and the function prologue. The indirect function measured the penalty of a bctrl indirect branch.
614,400,000 function calls:
  inline:   411.924 ms  (  2 cycles/call )
  direct:  3406.297 ms  ( ~17 cycles/call )
  virtual: 8080.708 ms  ( ~39 cycles/call )
As you can see, the direct call costs 15 cycles more than the function body, and the virtual call (exactly equivalent to a function-pointer call) costs 22 cycles more than the direct call. That happens to be approximately how many pipeline stages there are between the start of the pipeline (instruction fetch) and the end of the branch ALU. Therefore, on this architecture, an indirect branch (aka a virtual call) causes a clear of 22 pipeline stages 100% of the time.
Other architectures may vary. You should make these determinations from direct empirical measurements, or from the CPU's pipeline specifications, rather than assumptions about what processors "should" predict, because implementations are so different. In this case the pipeline clear occurs because there is no way for the branch predictor to know where the bctrl will go until it has retired. At best it could guess that it's to the same target as the last bctrl, and this particular CPU doesn't even try that guess.
Calling a function pointer is not fundamentally different from calling a virtual method in C++, nor, for that matter, is it fundamentally different from a return. The processor, in looking ahead, will recognize that a branch via pointer is coming up and will decide if it can, in the prefetch pipeline, safely and effectively resolve the pointer and follow that path. This is obviously more difficult and expensive than following a regular relative branch, but, since indirect branches are so common in modern programs, it's something that most processors will attempt.
As Oli said, "clearing" the pipeline would only be necessary if there was a misprediction on a conditional branch, which has nothing to do with whether the branch is by offset or by variable address. However, processors may have policies that predict differently depending on the type of branch address; in general, a processor would be less likely to aggressively follow an indirect path off of a conditional branch because of the possibility of a bad address.
A call through a function pointer doesn't necessarily cause a pipeline clear, but it may, depending on the scenario. The key is whether the CPU can effectively predict the destination of the branch ahead of time.
The way that modern "big" out-of-order cores handle indirect calls[1] is roughly as follows:
Once you've executed the indirect branch a few times, the indirect branch predictor will try to predict the address to which the branch will occur in the future.
The first indirect branch predictors were very simple, capable of "predicting" only a single, fixed location.
Later predictors, including those on most modern CPUs, are much more complex, often capable of predicting a repeated pattern of indirect jumps well, and of correlating the jump target with the direction of earlier conditional or indirect branches.
If the prediction is successful, the indirect call has a cost similar to a normal direct call, and this cost is largely "out of line" with the rest of the code (i.e., doesn't participate in dependency chains) so the impact on final runtime of the code is likely to be small unless the calls are very dense.
On the other hand, if the prediction is unsuccessful, you get a full misprediction, similar to a branch direction misprediction. You can't put a fixed number on the cost of this misprediction, since it depends on the surrounding code, but it usually causes a bubble of about 20 cycles in the front-end, and the overall cost in runtime often ends up similar.
So given those basics we can make some educated guesses at what happens in some specific scenarios:
A function pointer that always points to the same function will almost always[1] be well predicted and cost about the same as a regular function call.
A function pointer that alternates randomly between several targets will almost always be mispredicted. At best, we can hope the predictor always predicts whichever target is most common, so when the targets are chosen uniformly at random among N targets the prediction success rate is bounded by 1/N (i.e., it goes to zero as N goes to infinity). In this respect, indirect branches have worse worst-case behavior than conditional branches, which generally have a worst-case misprediction rate of 50%[2].
The prediction rate for a function pointer with behavior somewhere in the middle, e.g., somewhat predictable (such as following a repeating pattern), will depend heavily on the details of the hardware and the sophistication of the predictor. Modern Intel chips have quite good indirect predictors, but details haven't been publicly released. Conventional wisdom holds that they use some indirect variant of the TAGE predictors used also for conditional branches.
[1] Cases that would mispredict even for a single target include the first time (or first few times) the function is encountered, since the predictor can't predict indirect calls it hasn't seen yet! Also, the size of prediction resources in the CPU is limited, so if the function pointer hasn't been used in a while, the prediction resources will eventually be reused for other branches and you'll suffer a misprediction the next time you call it.
[2] Indeed, a very simple conditional predictor that simply predicts the direction most often seen recently should have a 50% prediction rate on totally random branch directions. To get a significantly worse result than 50%, you'd have to design an adversarial algorithm that essentially models the predictor and always chooses to branch in the direction opposite to the model.
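A minimal benchmark sketch of the two extremes described above (function names and the iteration count are mine; the volatile pointer array keeps the compiler from folding the indirect calls into direct ones):

#include <stdio.h>

static long f0(long x) { return x + 1; }
static long f1(long x) { return x + 2; }

/* volatile: the compiler must really load the pointer and call
   indirectly, rather than devirtualizing the call. */
static long (* volatile fp[2])(long) = { f0, f1 };

int main(void)
{
    long acc = 0;
    unsigned r = 12345;

    /* Case 1: monomorphic target; the indirect predictor locks on and
       the call should cost roughly as much as a direct call. */
    for (long i = 0; i < 100000000; i++)
        acc += fp[0](i);

    /* Case 2: pseudo-random target; expect roughly 50% mispredictions
       on most cores. Time the two loops separately to compare. */
    for (long i = 0; i < 100000000; i++) {
        r = r * 1103515245u + 12345u;    /* simple LCG */
        acc += fp[(r >> 16) & 1](i);
    }

    printf("%ld\n", acc);   /* keep acc live */
    return 0;
}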
There's not a great deal of difference between a function-pointer call and a "normal" call, other than an extra level of indirection. So potentially there's a greater latency involved; if the destination address is not already in cache or registers, then the CPU potentially has to wait while it's retrieved from main memory.
So the answer is; yes, the pipeline can stall, but this is no different to normal function calls. And as usual, mechanisms such as branch prediction and out-of-order execution can help minimise the penalty.

Why isn't all code compiled position independent?

When compiling shared libraries in gcc the -fPIC option compiles the code as position independent. Is there any reason (performance or otherwise) why you would not compile all code position independent?
It adds an indirection. With position independent code you have to load the address of your function and then jump to it. Normally the address of the function is already present in the instruction stream.
Yes, there are performance reasons. Some accesses are effectively behind another layer of indirection needed to obtain the absolute position in memory.
There is also the GOT (global offset table), which stores offsets of global variables. To me, this just looks like an IAT fixup table, which is classified as position-dependent by Wikipedia and a few other sources.
http://en.wikipedia.org/wiki/Position_independent_code
In addition to the accepted answer: one thing that hurts PIC code performance a lot is the lack of IP-relative addressing on 32-bit x86. With IP-relative addressing you could ask for data that is X bytes from the current instruction pointer, which would make PIC code a lot simpler.
Jumps and calls are usually EIP-relative, so those don't really pose a problem. However, accessing data requires a little extra trickery. Sometimes a register is temporarily reserved as a "base pointer" to the data the code requires. For example, a common technique is to abuse the way calls work on x86:
    call label_1
    .dd 0xdeadbeef
    .dd 0xfeedf00d
    .dd 0x11223344
label_1:
    pop ebp             ; now ebp holds the address of the first dataword;
                        ; this works because the call pushes the address
                        ; of the **next** instruction
    ; real code follows
    mov eax, [ebp + 4]  ; for example, accessing the 0xfeedf00d dataword in a PIC way
This and other techniques add a layer of indirection to the data accesses. For example, the GOT (Global offset table) used by gcc compilers.
x86-64 added a "RIP relative" mode which makes things a lot simpler.
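To illustrate, for a global access such as the one below, typical compiler output differs roughly as sketched in the comments (the instruction sequences are paraphrased, not exact gcc output):

extern int g;

int get(void)
{
    return g;
    /* Non-PIC, 32-bit:  mov eax, [g]             ; absolute address, patched at link/load
       PIC, 32-bit:      call get_pc_thunk        ; fetch EIP into a register
                         add  ecx, OFFSET_TO_GOT  ; locate the GOT
                         mov  ecx, [ecx + g@GOT]  ; load g's address from the GOT
                         mov  eax, [ecx]          ; finally load the value
       PIC, x86-64:      mov  eax, [rip + g]      ; RIP-relative when g binds locally;
                                                  ; otherwise one GOT load first */
}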
Because implementing completely position independent code adds a constraint to the code generator which can prevent the use of faster operations, or add extra steps to preserve that constraint.
This might be an acceptable trade-off to get multiprocessing without a virtual memory system, where you trust processes to not invade each other's memory and might need to load a particular application at any base address.
In many modern systems the performance trade-offs are different, and a relocating loader is often less expensive (it costs time only when code is first loaded) than the best an optimizer can do with free rein. Also, the availability of virtual address spaces removes most of the motivation for position independence in the first place.
Position-independent code has a performance overhead on most architectures, because it requires an extra register.
So this is done for performance reasons.
Also, virtual memory hardware in most modern processors (used by most modern OSes) means that lots of code (all user space apps, barring quirky use of mmap or the like) doesn't need to be position independent. Every program gets its own address space which it thinks starts at zero.
Nowadays, operating systems and compilers make all code position independent by default. Try compiling without the -fPIC flag: the code will compile fine, but you will just get a warning. OSes like Windows use a technique called memory mapping to achieve this.
