What bytes to emit for an ARM equivalent of EBFE?

In x86, if you want to cause an infinite loop, you can emit an ebfe, basically a jump to the current instruction. What's the ARM equivalent of an EBFE?

That would be 0xeafffffe -- an unconditional branch to itself. On a little-endian system the word is stored as the bytes fe ff ff ea.
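To see where that word comes from, here is a minimal Python sketch of the A32 B encoding: a condition field, the opcode bits 101, and a 24-bit signed word offset measured from the instruction's address plus 8. The helper name encode_arm_b and its parameters are made up for illustration.

```python
import struct

def encode_arm_b(instr_addr, target_addr, cond=0xE):
    """Encode an A32 'B' (branch, no link) as a 32-bit word.

    Hypothetical helper for illustration; cond=0xE means 'always'.
    """
    # In ARM state the PC reads as the instruction's own address + 8.
    offset = target_addr - (instr_addr + 8)
    if offset % 4 != 0:
        raise ValueError("branch target must be word-aligned")
    imm24 = offset >> 2
    if not -(1 << 23) <= imm24 < (1 << 23):
        raise ValueError("branch target out of range")
    return (cond << 28) | (0b101 << 25) | (imm24 & 0xFFFFFF)

word = encode_arm_b(0x8000, 0x8000)      # branch to itself
print(hex(word))                         # 0xeafffffe
print(struct.pack("<I", word).hex())     # little-endian bytes: feffffea
```

The offset for a self-branch is -8 bytes, i.e. -2 words, which is where the fffffe in the low 24 bits comes from, much like the fe (-2) in the x86 eb fe.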

Related

Branch and ARM program counter

My understanding is the ARM program counter points to two instructions ahead of the currently executing instruction.
How does this work with conditional branching or even a plain branch?
If you are executing op1, have a branch at op2 and then op3, does the PC point to op3? Or does it point to the next instruction contiguous from op2?
How can you do PC relative addressing with branch instructions present? Do you need to add nops?
The PC register in ARM points to two instructions ahead of the current instruction in the address space, not in the flow of execution. So while op1 executes, the PC points to the instruction after op2 (that is, op3) in the given example. Whether the subsequent instruction is a branch or not is irrelevant to the encoding, so no nops are needed.
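As a small sketch of that point, the PC-relative offset an assembler would compute in ARM state depends only on addresses. The function name and the example addresses below are assumptions chosen for illustration.

```python
def arm_pc_relative_offset(instr_addr, target_addr):
    """Offset to encode for a PC-relative access in ARM state.

    Hypothetical helper: the PC reads as instr_addr + 8, no matter
    what kind of instructions sit at instr_addr + 4 and + 8.
    """
    pc_value = instr_addr + 8
    return target_addr - pc_value

# Assumed layout: op1 at 0x100, op2 (a branch) at 0x104, op3 at 0x108,
# and some data at 0x120.
print(arm_pc_relative_offset(0x100, 0x120))  # 24: measured from op3's address
print(arm_pc_relative_offset(0x104, 0x120))  # 20
```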

Functions in our code make it slower?

When we compile code and execute it, in assembly, in which our code gets converted, functions are stored in a non sequential manner. So every time a function is called, the processor needs to throw away the instructions in the pipeline. Doesn't this affect the performance of the program?
PS: I'm not considering the time invested in developing such programs without functions. Purely on the performance level. Are there any ways in which compilers deal with this to reduce it?
"So every time a function is called, the processor needs to throw away the instructions in the pipeline."
No, everything after the decode stage is still good. The CPU knows not to keep decoding after an unconditional branch (like a jmp, call, or ret). Only the instructions that have been fetched but not yet decoded are ones that shouldn't run. Until the target address is decoded from the instruction, there's nothing useful for the beginning of the pipeline to do, so you get bubbles in the pipeline until the target address is known. Decoding branch instructions as early as possible thus minimizes the penalty for taken branches.
In the classic RISC pipeline, the stages are IF ID EX MEM WB (fetch, decode, execute, memory access, write-back of results to registers). So when ID decodes a branch instruction, the pipeline throws away the instruction currently being fetched in IF, and the instruction currently being decoded in ID (because it's the instruction after the branch).
"Hazard" is the term for things that prevent a steady stream of instructions from going through the pipeline at one per clock. Branches are a Control Hazard. (Control as in flow-control, as opposed to data.)
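A toy cycle count makes the cost of a control hazard concrete. This is a rough model under stated assumptions, not a description of any real CPU: one instruction per clock when nothing stalls, a fixed number of bubble cycles per taken branch (the parameter bubbles_per_taken_branch is an assumption), and no cache misses or data hazards.

```python
def pipeline_cycles(n_instructions, taken_branches,
                    stages=5, bubbles_per_taken_branch=1):
    """Estimate total cycles for a simple in-order pipeline.

    Hypothetical model: the first instruction needs `stages` cycles,
    every later one completes a cycle after its predecessor, and each
    taken branch inserts a fixed number of bubble cycles.
    """
    if n_instructions == 0:
        return 0
    fill = stages - 1  # cycles spent filling the pipeline
    return fill + n_instructions + taken_branches * bubbles_per_taken_branch

print(pipeline_cycles(100, 0))                               # 104
print(pipeline_cycles(100, 10))                              # 114
print(pipeline_cycles(100, 10, bubbles_per_taken_branch=2))  # 124
```

Even with a call and a ret in every ten instructions, the bubbles add about 10 to 20 percent in this model, which is why resolving branches early matters.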
If the branch target isn't in L1 I-cache, the pipeline will have to wait for instructions to stream in from memory before the IF pipeline stage can produce a fetched instruction. I-cache misses always create a pipeline bubble. Prefetching usually avoids this for non-branching code.
More complex CPUs decode far enough ahead to detect branches and re-steer fetch soon enough to hide this bubble. This may involve a queue of decoded instructions to hide the fetch bubble.
Also, instead of actually decoding to detect branch instructions, the CPU can check every instruction address against a "Branch Target Buffer" cache. If you get a hit, you know the instruction is a branch even though you haven't decoded it yet. The BTB also holds the target address, so you can start fetching from there right away (if it's an unconditional branch or your CPU supports speculative execution based on branch prediction).
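The lookup described above can be sketched as a small direct-mapped table indexed by fetch address. This is a simplified illustration; the class name, the entry count, and the index/tag split are assumptions, and real BTBs differ in organization and size.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: fetch address -> predicted branch target."""

    def __init__(self, n_entries=512):
        self.n_entries = n_entries
        self.entries = {}  # index -> (tag, target address)

    def _index_and_tag(self, addr):
        word = addr >> 2  # assume 4-byte aligned instructions
        return word % self.n_entries, word // self.n_entries

    def predict(self, fetch_addr):
        """Return the predicted target, or None if no branch is known here."""
        index, tag = self._index_and_tag(fetch_addr)
        entry = self.entries.get(index)
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, branch_addr, target_addr):
        """Record a branch after it has been decoded and resolved."""
        index, tag = self._index_and_tag(branch_addr)
        self.entries[index] = (tag, target_addr)

btb = BranchTargetBuffer()
print(btb.predict(0x4000))        # None: first encounter, must decode
btb.update(0x4000, 0x5000)
print(hex(btb.predict(0x4000)))   # 0x5000: fetch can be re-steered at once
```

A hit tells the fetch stage both that the address holds a branch and where to fetch next, before the instruction bytes have been decoded.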
ret is actually the harder case: the return address is in a register or on the stack, not encoded directly into the instruction. It's an unconditional indirect branch. Modern x86 CPUs maintain an internal return-address predictor stack, and perform very badly when you mis-match call/ret instructions. E.g. call label / label: pop ebx is terrible for position-independent 32bit code to get EIP into EBX. That will cause a mis-predict for the next 15 or so rets up the call tree.
I think I've read that a return-address predictor stack is used by some other non-x86 microarchitectures.
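To illustrate why an unmatched call hurts, here is a toy model of a return-address predictor stack. The 16-entry depth, the class and function names, and the addresses are assumptions for illustration; real hardware may behave differently, for example by treating some call patterns specially.

```python
class ReturnAddressStack:
    """Toy return-address predictor: push on call, pop on ret."""

    def __init__(self, depth=16):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # the oldest prediction is lost
        self.entries.append(return_address)

    def on_ret(self, actual_return_address):
        """Return True if the prediction matched the real return address."""
        predicted = self.entries.pop() if self.entries else None
        return predicted == actual_return_address

def count_ret_mispredicts(call_depth, unbalanced_call):
    ras = ReturnAddressStack()
    return_addresses = [0x1000 + 0x10 * i for i in range(call_depth)]
    for addr in return_addresses:
        ras.on_call(addr)
    if unbalanced_call:
        # Models "call label / label: pop ebx": a call with no matching ret.
        ras.on_call(0x2000)
    mispredicts = 0
    for addr in reversed(return_addresses):
        if not ras.on_ret(addr):
            mispredicts += 1
    return mispredicts

print(count_ret_mispredicts(4, unbalanced_call=False))  # 0
print(count_ret_mispredicts(4, unbalanced_call=True))   # 4
```

In this model the stale entry shifts every prediction by one slot, so each ret up the call tree is mispredicted until the stack has drained.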
See Agner Fog's microarchitecture pdf to learn more about how x86 CPUs behave (also see the x86 tag wiki), or read a computer architecture textbook to learn about simple RISC pipelines.
For more about caches and memory (mostly focused on data caching / prefetching), see Ulrich Drepper's What Every Programmer Should Know About Memory.
An unconditional branch is quite cheap: usually a couple of cycles at worst (not including I-cache misses).
The big cost of a function call is when the compiler can't see the definition of the target function, and has to assume it clobbers all the call-clobbered registers in the calling convention. (In x86-64 SystemV, all the float/vector registers, and about 8 integer registers.) This requires either spilling to memory or keeping live data in call-preserved registers. But that means the function has to save/restore those registers to not break the caller.
Inter-procedural optimization to let functions take advantage of knowing which registers other functions actually clobber, and which they don't, is something compilers can do within the same compilation unit. Or even across compilation units with link-time whole-program optimization. But it can't extend across dynamic-linking boundaries, because the compiler isn't allowed to make code that will break with a differently-compiled version of the same shared library.
"Are there any ways in which compilers deal with this to reduce it?"
They inline small functions, or even large static functions that are only called once.
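e.g. this C source, shown with the compiler's x86-64 asm output (Intel syntax) under each function. The call to foo() is inlined into bar, leaving a single lea:
int foo(void) { return 1; }
mov eax, 1
ret
int bar(int x) { return foo() + x; }
lea eax, [rdi+1]
ret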
As @harold points out, overdoing it with inlining can cause cache misses, too, because it inflates your code size so much that not all of your hot code fits in cache.
Intel SnB-family designs have a small but very fast uop cache that caches decoded instructions. It only holds at most 1536 uops IIRC, in lines of 6 uops each. Executing from uop cache instead of from the decoders shortens the branch-mispredict penalty from 19 to 15 cycles, IIRC (something like that, but those numbers are probably not actually correct for any specific uarch). There's also a significant frontend throughput boost compared to the decoders, esp. for long instructions which are common in vector code.

ARM Assembly loop using PC?

I am currently learning ARM assembly and I have some questions. When reading docs, I've found that register no. 15 is the program counter, which stores the next instruction address, and when an instruction is done, it is incremented by 4 (bytes, or 2 in Thumb mode).
So, my question is: if I run an instruction that sets PC to itself minus 4 bytes, it would return to the previous instruction, wouldn't it? And then do the same over and over again, so it would be an infinite loop?
Thanks, and sorry if it is an obvious question.
Regards,
Pedro.
You have to look on an instruction-by-instruction basis, as for some instructions modifying the PC is unpredictable. But for those where it is legal, modifying the program counter essentially causes a jump to the address you write into the program counter. You don't have to worry about the two-instructions-ahead thing (it is 8 and 4 bytes, not 4 and 2: two instructions ahead).
Yes - a jump/branch instruction is exactly what you're describing - it's an instruction which modifies the PC. If you arrange the result of the jump to put the program counter back where it was then, yes, you'll loop on the spot.
Note that the value you read from the PC is not really the address of the next instruction but the address of the current instruction +4 (in Thumb mode) or +8 (in ARM mode). So in ARM this is 2 instructions later, but in Thumb it may not be (as instructions can be 16-bit or 32-bit).
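The +4 in Thumb state shows up when encoding the same loop-on-the-spot branch there. Below is a minimal sketch of the 16-bit Thumb unconditional B encoding (bits 11100 followed by an 11-bit signed halfword offset); the helper name encode_thumb_b is made up for illustration.

```python
def encode_thumb_b(instr_addr, target_addr):
    """Encode a 16-bit Thumb unconditional 'B' as a halfword.

    Hypothetical helper for illustration.
    """
    # In Thumb state the PC reads as the instruction's own address + 4.
    offset = target_addr - (instr_addr + 4)
    if offset % 2 != 0:
        raise ValueError("branch target must be halfword-aligned")
    imm11 = offset >> 1
    if not -(1 << 10) <= imm11 < (1 << 10):
        raise ValueError("branch target out of range for this encoding")
    return 0xE000 | (imm11 & 0x7FF)

print(hex(encode_thumb_b(0x8000, 0x8000)))  # 0xe7fe: branch to itself
print(hex(encode_thumb_b(0x8000, 0x8004)))  # 0xe000: offset 0 lands at PC, i.e. address + 4
```

So a Thumb branch-to-self encodes an offset of -4 bytes, where the ARM-state equivalent encodes -8.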

Is it atomic to access (load/store) a 32-bit integer when using the ARM Thumb instruction set?

Using an ARM Cortex with the Thumb instruction set and the Keil RealView compiler, is it safe to access a 32-bit integer? Since the Thumb register set is 16 bits, does this mean that fetching a 32-bit int needs 2 machine instructions? If so, accessing 32 bits will not be atomic. If my worry is true, does it mean that int assignment should be protected by a critical region?
Thumb uses the same 32-bit registers as ARM, so there's no issue there. What's halved is the instruction size (and even that is not strictly true for Thumb-2).
Do not worry, you don't need to change your code if you're compiling to Thumb.
The instruction size is 16-Bit in thumb mode, not the register size.
This means that a constant assignment - as in i=1; - can be seen as atomic. Although more than one instruction is generated, only one will modify the memory location of i even if i is int32_t.
But you need a critical section once you do things like i=i+1. That is of course not atomic.
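A deterministic simulation shows why i=i+1 needs protection. It assumes the statement compiles to three steps (load, add, store), which is typical for a load/store architecture but is an assumption here. The schedule string and the context names M (main code) and I (interrupt handler) are made up for illustration.

```python
def run_increments(schedule):
    """Run interleaved 'i = i + 1' sequences and return the final i.

    Each character in `schedule` names the context that executes its
    next step: load, then add, then store.
    """
    memory = {"i": 0}
    registers = {}
    next_step = {}
    for context in schedule:
        step = next_step.get(context, 0)
        if step == 0:
            registers[context] = memory["i"]   # load i into a register
        elif step == 1:
            registers[context] += 1            # add 1 in the register
        elif step == 2:
            memory["i"] = registers[context]   # store the register to i
        next_step[context] = step + 1
    return memory["i"]

print(run_increments("MMMIII"))  # 2: the increments do not overlap
print(run_increments("MIIIMM"))  # 1: interrupt hits between load and store
```

In the second schedule the main code stores a value computed from a stale load, so the interrupt handler's increment is lost. Each individual 32-bit load and store is still a single access; it is the read-modify-write sequence that is not atomic.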

Is it possible to "jump"/"skip" in GDB debugger?

Is it possible to jump to some location/address in the code/executable while debugging in GDB ?
Let's say I have something similar to the following
void f1(void);
void f2(void);
void caller_f1(void)
{
    f1(); // breakpoint
    f2(); // want to skip f2() and jump
}
void caller_f2(void)
{ // jump to this location ??
    f1();
    f2();
}
int main(void)
{
    caller_f1();
    caller_f2();
}
To resume execution at a new address, use jump (short form: j):
jump LINENUM
jump *ADDRESS
The GDB manual suggests using tbreak (temporary breakpoint) before jumping.
The linenum can be any linespec expression, like +1 for the next line.
See @gospes's answer on a related question for a handy skip macro that does exactly that.
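The tbreak-then-jump sequence described above can be generated with a few lines of Python. This is only a sketch: the function name skip_to is hypothetical, and it just builds command strings. Inside a GDB build with Python support, each string could be passed to gdb.execute instead of being printed.

```python
def skip_to(location):
    """Return the GDB commands that move execution to `location`.

    `location` is any form the jump command accepts, such as a line
    number, "+1" for the next line, or "*0x4005a5" for an address.
    The temporary breakpoint makes GDB stop again right after the jump.
    """
    return [f"tbreak {location}", f"jump {location}"]

for command in skip_to("+1"):
    print(command)
# tbreak +1
# jump +1
```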
Using jump is only "safe" in un-optimized code (-O0), and even then only within the current function. It only modifies the program counter; it doesn't change any other registers or memory.
Only gcc -O0 compiles each source statement (or line?) into an independent block of instructions that loads variable values from memory and stores results. This lets you modify variable values with a debugger at any breakpoint, and makes jumping between lines in the machine code work like jumping between lines in the C source.
This is part of why -O0 makes such slow code: not only does the compiler not spend time optimizing, it is required to make slow code that spills/reloads everything after every statement to support asynchronous modification of variables and even program-counter. (Store/reload latency is about 5 cycles on a typical x86, so a 1 cycle add takes 6 cycles in -O0 builds).
gcc's manual suggests using -Og for the usual edit-compile-debug cycle, but even that light level of optimization will break jump and async modification of variables. If you don't want to do that while debugging, it's a good choice, especially for projects where -O0 runs so slowly that it's a problem.
To set program-counter / instruction-pointer to a new address without resuming, you can also use this:
set $pc = 0x4005a5
Copy/paste addresses from the disassembly window (layout asm / layout reg).
This is equivalent to tbreak + jump, but you can't use line numbers, only instruction addresses. (And you don't get a warning + confirmation-request for jumping outside the current function).
Then you can stepi from there. $pc is a generic gdb name for whatever the register is really called in the target architecture. e.g. RIP in x86-64. (See also the bottom of the x86 tag wiki for asm debugging tips for gdb.)
There seems to be a jump command which is exactly what you are looking for:
http://idlebox.net/2010/apidocs/gdb-7.0.zip/gdb_18.html#SEC163
Updated link:
http://web.archive.org/web/20140101193811/http://idlebox.net/2010/apidocs/gdb-7.0.zip/gdb_18.html#SEC163
