Question
Say you have a simple function that returns a value based on a lookup table, for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
static const uint32_t tbl[] = { 0, 1, 2, 3 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r`. */
return tbl[r];
}
uint32_t
lookup1(uint32_t r) {
static const uint32_t tbl[] = { 0, 0, 1, 1 };
if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
__builtin_unreachable();
}
/* Can replace with: `return r / 2`. */
return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is that I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common in NUMA locks to need a cpu_id -> numa_node mapping. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can get, I am hoping for a generic solution that covers most layouts.
Looking at how modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
lookup0(unsigned int): # @lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # @case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # @lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # @case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
GCC can't get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
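To make the brute-force idea concrete, here is a minimal sketch that tests a table against one small family of candidate formulas, `(i >> shift) & mask`. The helper name `fit_shift_mask` and the choice of candidate family are assumptions for illustration, not an existing tool; a real search would try more shapes (division by a constant, add/subtract, and so on).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: search for a shift and a low-bit mask such that
 * tbl[i] == ((i >> shift) & mask) for every index i < n.
 * Only masks of the form 2^k - 1 are tried.
 * Returns 1 and fills *shift_out / *mask_out on success, 0 otherwise. */
int fit_shift_mask(const uint32_t *tbl, size_t n,
                   unsigned *shift_out, uint32_t *mask_out)
{
    for (unsigned shift = 0; shift < 32; ++shift) {
        for (unsigned bits = 0; bits <= 32; ++bits) {
            uint32_t mask = (bits == 32) ? UINT32_MAX
                                         : ((UINT32_C(1) << bits) - 1);
            size_t i = 0;
            while (i < n && tbl[i] == (((uint32_t)i >> shift) & mask)) {
                ++i;
            }
            if (i == n) {            /* every entry matched this formula */
                *shift_out = shift;
                *mask_out = mask;
                return 1;
            }
        }
    }
    return 0;                        /* no formula in this family fits */
}
```

For the two tables in the question this finds `r & 3` for lookup0 and `(r >> 1) & 1` for lookup1; under the stated assumption that inputs are always in range, the mask can then be dropped.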
Edit:
The only true assumption is:
All inputs have an index in the LUT.
All values are non-negative (I think that makes things easier), which will be true for just about any sys-config that's online.
(Edit4) I would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <high_bound>] but nothing outside of that range.
In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut> but that is specific to the one example I gave and would have some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The example lookup0/lookup1 are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology so ~256 - 1024 is where I would expect the upper bound in size but for a generic LUT it could obviously get much larger.
The best "generic" solution I am aware of is the following:
int compute(int r)
{
static const int T[] = {0,0,1,1};
const int lut_size = sizeof(T) / sizeof(T[0]);
int result = 0;
for(int i=0 ; i<lut_size ; ++i)
result += (r == i) * T[i];
return result;
}
At -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code similar to the following:
int compute(int r)
{
return (r == 0) * 0 + (r == 1) * 0 + (r == 2) * 1 + (r == 3) * 1;
}
The GCC and Clang optimizers know that such multiplications can be replaced with conditional moves (developers often use this trick to guide compilers toward generating assembly without conditional branches).
The resulting assembly is the following for Clang:
compute:
xor ecx, ecx
cmp edi, 2
sete cl
xor eax, eax
cmp edi, 3
sete al
add eax, ecx
ret
The same applies to GCC. There are no branches or memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with the fast lea instruction.
A more complete test is available on Godbolt.
Note that this method should work for bigger tables, but if the table is too big, the loop will not be unrolled automatically. You can ask the compiler to unroll more aggressively with compilation flags. That being said, a LUT will likely be faster for a big table, since loading and executing a huge amount of code is slow in this pathological case.
You could pack the array into a long integer and use bitshifts and anding to extract the result.
For example for the table {2,0,3,1} could be handled with:
uint32_t lookup0(uint32_t r) {
static const uint32_t tbl = (2u << 0) | (0u << 8) |
(3u << 16) | (1u << 24);
return (tbl >> (8 * r)) & 0xff;
}
It produces relatively nice assembly:
lookup0: # @lookup0
lea ecx, [8*rdi]
mov eax, 16973826
shr eax, cl
movzx eax, al
ret
Not perfect but branchless and with no indirection.
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks for handling larger arrays, like using longer integers (e.g. uint64_t or the __uint128_t extension).
Another approach is to split the bits of each value across several packed words (for example high and low byte), look each part up separately, and combine the parts with bitwise operations.
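As a sketch of the "longer integer" variant, the code below packs up to 16 entries of 4 bits each into a single uint64_t. The function names `pack_nibbles` and `lookup_nibble` are made up for this example, and it assumes every table value fits in 4 bits.

```c
#include <stdint.h>

/* Pack up to 16 table entries, each in the range 0..15, into one
 * 64-bit word: entry i occupies bits [4*i, 4*i + 3]. */
uint64_t pack_nibbles(const uint8_t *tbl, unsigned n)
{
    uint64_t packed = 0;
    for (unsigned i = 0; i < n && i < 16; ++i) {
        packed |= (uint64_t)(tbl[i] & 0xF) << (4 * i);
    }
    return packed;
}

/* Extract entry r (0..15) with one shift and one mask. */
uint32_t lookup_nibble(uint64_t packed, uint32_t r)
{
    return (uint32_t)((packed >> (4 * r)) & 0xF);
}
```

If the packed word is a compile-time constant, the lookup should reduce to a shift and a mask on an immediate, much like the 32-bit version above, but that is worth checking in the compiler output.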
Related
In C, is indexing an array faster than the ?: operator?
For example, would (const int[]){8, 14}[N > 10] be faster than N > 10? 14 : 8?
Stick with the ternary operator:
It is simpler
It is fewer characters to type
It is easier to read and understand
It is more maintainable
It is likely not the main bottleneck in your application
For the CPU it is a simple comparison
Compilers are clever: if the array solution were faster, compilers would already generate the same code for both variants
Mandatory quote (emphasis mine):
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%
— Donald Knuth • https://wiki.c2.com/?PrematureOptimization
Now that's out of the way, let's compare what compilers actually produce.
#include <stdlib.h>
int ternary(int n) { return n > 10 ? 14 : 8; }
int array(int n) { return (const int[]){8, 14}[n > 10]; }
Compile with (g)cc 10.2.1 in Ubuntu and optimizations enabled:
$ cc -O3 -S -fno-stack-protector -fno-asynchronous-unwind-tables ternary.c
-S stops after compilation and does not assemble. You will end up with a .s file which contains the generated assembly code. (the -fno… flags are to disable additional code generation which is not required for our example).
ternary.s assembly code, lines unrelated to the methods removed:
ternary:
endbr64
cmpl $10, %edi
movl $8, %edx
movl $14, %eax
cmovle %edx, %eax
ret
array:
endbr64
movq .LC0(%rip), %rax
movq %rax, -8(%rsp)
xorl %eax, %eax
cmpl $10, %edi
setg %al
movl -8(%rsp,%rax,4), %eax
ret
.LC0:
.long 8
.long 14
If you compare them, you will notice a lot more instructions for the array version: 6 instructions vs. 4 instructions.
There is no reason to write the more complicated code which every developer has to read twice; the shorter and straight-forward code compiles to more efficient machine code.
Use of the compound literal (and arrays in general) will be much less efficient, as the arrays are actually created (by current real-world compilers) regardless of the optimization level. Worse, they're created on the stack, not just indexed as static constant data (which would still be slower, at least in latency, than an ALU select operation like x86 cmov or AArch64 csel, which most modern ISAs have).
I have tested it using all compilers I use (including Keil and IAR) and some I don't use (icc and clang).
int foo(int N)
{
return (const int[]){8, 14}[N > 10];
}
int bar(int N)
{
return N > 10? 14 : 8;
}
foo:
mov rax, QWORD PTR .LC0[rip] # load 8 bytes from .rodata
mov QWORD PTR [rsp-8], rax # store both elements to the stack
xor eax, eax # prepare a zeroed reg for setcc
cmp edi, 10
setg al # materialize N>10 as a 0/1 integer
mov eax, DWORD PTR [rsp-8+rax*4] # index the array with it
ret
bar:
cmp edi, 10
mov edx, 8 # set up registers with both constants
mov eax, 14
cmovle eax, edx # ALU select operation on FLAGS from CMP
ret
.LC0:
.long 8
.long 14
https://godbolt.org/z/qK65Gv
So, I'm trying to get familiar with assembly and trying to reverse-engineer some code. My problem lies in trying to decode addq, which I understand performs Destination = Source + Destination.
I am using the assumptions that parameters x, y, and z are passed in registers %rdi, %rsi, and %rdx. The return value is stored in %rax.
long someFunc(long x, long y, long z){
1. long temp=(x-z)*x;
2. long temp2= (temp<<63)>>63;
3. long temp3= (temp2 ^ x);
4. long answer=y+temp3;
5. return answer;
}
So far everything above line 4 is exactly what I am wanting. However, line 4 gives me leaq (%rsi,%rdi), %rax rather than addq %rsi, %rax. I'm not sure if this is something I am doing wrong, but I am looking for some insight.
Those instructions aren't equivalent. For LEA, rax is a pure output. For your hoped-for add, it's rax += rsi so the compiler would have to mov %rdi, %rax first. That's less efficient so it doesn't do that.
lea is a totally normal way for compilers to implement dst = src1 + src2, saving a mov instruction. In general, don't expect C operators to compile to instructions named after them. This is especially true of small left-shifts and adds, or multiplies by 3, 5, or 9, because those are prime targets for optimization with LEA; e.g. lea (%rsi, %rsi, 2), %rax implements result = y*3. See Using LEA on values that aren't addresses / pointers? for more. LEA is also useful to avoid destroying either of the inputs, if they're both needed later.
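To see this for yourself, a few tiny functions like the ones below are handy to paste into Godbolt. The function names are made up, and the instruction in each comment is only what GCC and clang commonly emit for x86-64 at -O2; it is an expectation, not a guarantee.

```c
/* Each body is a candidate for a single lea on x86-64. */
long add2(long a, long b)  { return a + b; }      /* lea (%rdi,%rsi), %rax    */
long times3(long y)        { return y * 3; }      /* lea (%rdi,%rdi,2), %rax  */
long times5(long y)        { return y * 5; }      /* lea (%rdi,%rdi,4), %rax  */
long times9_plus1(long y)  { return y * 9 + 1; }  /* lea 1(%rdi,%rdi,8), %rax */
```

Note that none of these sources contain anything that looks like an address calculation; the compiler just picks lea because it is a cheap three-operand shift-and-add.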
Assuming you meant t3 to be the same variable as temp3, clang does compile the way you were expecting: it does a better job of register allocation than GCC, so it can finish with a shorter and more efficient add instruction instead of needing lea, without any extra mov instructions.
(Godbolt). Using add saves code-size (lea needs an indexed addressing mode), and add has slightly better throughput than LEA on most CPUs, like 4/clock instead of 2/clock.
Clang also optimized the shifts into andl $1, %eax / negq %rax to create the 0 or -1 result of that arithmetic right shift = bit-broadcast. It also optimized to 32-bit operand-size for the first few steps because the shifts throw away all but the low bit of temp1.
# side by side comparison, like the Godbolt diff pane
clang: | gcc:
movl %edi, %eax movq %rdi, %rax
subl %edx, %eax subq %rdx, %rdi
imull %edi, %eax imulq %rax, %rdi # temp1
andl $1, %eax salq $63, %rdi
negq %rax sarq $63, %rdi # temp2
xorq %rdi, %rax xorq %rax, %rdi # temp3
addq %rsi, %rax leaq (%rdi,%rsi), %rax # answer
retq ret
Notice that clang chose imul %edi, %eax (into RAX) but GCC chose to multiply into RDI. That's the difference in register allocation that leads to GCC needing an lea at the end instead of an add.
Compilers sometimes even get stuck with an extra mov instruction at the end of a small function when they make poor choices like this, if the last operation wasn't something like addition that can be done with lea as a non-destructive op-and-copy. These are missed-optimization bugs; you can report them on GCC's bugzilla.
Other missed optimizations
GCC and clang could have optimized by using and instead of imul to set the low bit only if both inputs are odd.
Also, since only the low bit of the sub output matters, XOR (add without carry) would have worked, or even addition! (Odd+-even = odd. even+-even = even. odd+-odd = even.) That would have allowed an lea instead of mov/sub as the first instruction.
lea (%rdi,%rdx), %eax # x+z has the same low bit as x-z
and %edi, %eax # low bit matches (x-z)*x
andl $1, %eax # keep only the low bit
negq %rax # temp2
Let's make a truth table for the low bits of x and z to see how this shakes out if we want to optimize more / differently:
# truth table for low bit: input to shifts that broadcasts this to all bits
x&1 | z&1 | x-z = x^z | x*(x-z) = x & (x-z)
0 0 0 0
0 1 1 0
1 0 1 1
1 1 0 0
x & (~z) = BMI1 andn
So temp2 = (x^z) & x & 1 ? -1 : 0. But also temp2 = -((x & ~z) & 1).
We can rearrange that to -((x&1) & ~z) which lets us start with not z and and $1, x in parallel, for better ILP. Or if z might be ready first, we could do operations on it and shorten the critical path from x -> answer, at the expense of z.
Or with a BMI1 andn instruction which does (~z) & x, we can do this in one instruction. (Plus another to isolate the low bit)
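The identity from the truth table can be checked in C. This is a sketch under a few assumptions: the names `some_func_ref` and `some_func_andn` are invented here, and the arithmetic is done on unsigned long so that overflow wraps instead of being undefined behaviour (which is also what the compiled code does in practice).

```c
/* Reference: the original arithmetic, on unsigned values so that
 * overflow wraps. (temp << 63) >> 63 with an arithmetic right shift
 * broadcasts bit 0 of temp, i.e. gives 0 or all-ones. */
long some_func_ref(long x, long y, long z)
{
    unsigned long ux = (unsigned long)x;
    unsigned long temp = (ux - (unsigned long)z) * ux;
    unsigned long temp2 = 0UL - (temp & 1UL);   /* 0 or all-ones */
    unsigned long temp3 = temp2 ^ ux;           /* x or ~x */
    return (long)((unsigned long)y + temp3);
}

/* Same result using temp2 = -((x & ~z) & 1) from the truth table. */
long some_func_andn(long x, long y, long z)
{
    unsigned long ux = (unsigned long)x;
    unsigned long temp2 = 0UL - ((ux & ~(unsigned long)z) & 1UL);
    return (long)((unsigned long)y + (temp2 ^ ux));
}
```

Spot-checking a handful of odd/even combinations of x and z is enough to exercise all four rows of the table, since only the low bits matter for temp2.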
I think this function has the same behaviour for every possible input, so compilers could have emitted it from your source code. This is one possibility you should wish your compiler emitted:
# hand-optimized
# long someFunc(long x, long y, long z)
someFunc:
not %edx # ~z
and $1, %edx
and %edi, %edx # x&1 & ~z = low bit of temp1
neg %rdx # temp2 = 0 or -1
xor %rdi, %rdx # temp3 = x or ~x
lea (%rsi, %rdx), %rax # answer = y + temp3
ret
So there's still no ILP, unless z is ready before x and/or y. Using an extra mov instruction, we could do x&1 in parallel with not z
Possibly you could do something with test/setz or cmov, but IDK if that would beat lea/and (temp1) + and/neg (temp2) + xor + add.
I haven't looked into optimizing the final xor and add, but note that temp3 is basically a conditional NOT of x. You could maybe improve latency at the expense of throughput by calculating both ways at once and selecting between them with cmov. Possibly by involving the 2's complement identity that -x - 1 = ~x. Maybe improve ILP / latency by doing x+y and then correcting that with something that depends on the x and z condition? Since we can't subtract using LEA, it seems best to just NOT and ADD.
# return y + x or y + (~x) according to the condition on x and z
someFunc:
lea (%rsi, %rdi), %rax # y + x
andn %edi, %edx, %ecx # ecx = x & (~z)
not %rdi # ~x
add %rsi, %rdi # y + (~x)
test $1, %cl
cmovnz %rdi, %rax # select between y+x and y+~x
retq
This has more ILP, but needs BMI1 andn to still be only 6 (single-uop) instructions. Broadwell and later have single-uop CMOV; on earlier Intel it's 2 uops.
The other function could be 5 uops using BMI andn.
In this version, the first 3 instructions can all run in the first cycle, assuming x,y, and z are all ready. Then in the 2nd cycle, ADD and TEST can both run. In the 3rd cycle, CMOV can run, taking integer inputs from LEA, ADD, and flag input from TEST. So the total latency from x->answer, y->answer, or z->answer is 3 cycles in this version. (Assuming single-uop / single-cycle cmov). Great if it's on the critical path, not very relevant if it's part of an independent dep chain and throughput is all that matters.
vs. 5 (andn) or 6 cycles (without) for the previous attempt. Or even worse for the compiler output using imul instead of and (3 cycle latency just for that instruction).
I'm trying to understand assembly in x86 more. I have a mystery function here that I know returns an int and takes an int argument.
So it looks like int mystery(int n){}. I can't figure out the function in C however. The assembly is:
mov %edi, %eax
lea 0x0(,%rdi, 8), %edi
sub %eax, %edi
add $0x4, %edi
callq <mystery_util>
repz retq
<mystery_util>
mov %edi, %eax
shr %eax
and $0x1, %edi
and %edi, %eax
retq
I don't understand what the lea does here and what kind of function it could be.
The assembly code appears to be compiler-generated, and was probably produced by GCC, since there is a repz retq after an unconditional branch (call). The fact that mystery_util is reached with a call rather than a tail call (jmp) suggests the code was compiled with -O1 (higher optimization levels would likely inline the function, which didn't happen here). The lack of frame pointers and extra loads/stores indicates that it wasn't compiled with -O0.
Multiplying x by 7 is the same as multiplying x by 8 and subtracting x. That is what the following code is doing:
lea 0x0(,%rdi, 8), %edi
sub %eax, %edi
LEA can compute addresses, but it can be used for simple arithmetic as well. The syntax for a memory operand is displacement(base, index, scale). Scale can be 1, 2, 4, or 8. The computation is displacement + base + index * scale. In your case lea 0x0(,%rdi, 8), %edi is effectively EDI = 0x0 + RDI * 8, or EDI = RDI * 8. With the add $0x4 that follows the sub, the full calculation is n * 7 + 4.
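Written out in C, one statement per instruction, the sequence looks like the sketch below. The function name `seven_n_plus_4` is made up, and unsigned arithmetic is used so that wraparound is well defined.

```c
#include <stdint.h>

/* What the mov / lea / sub / add sequence computes, step by step. */
uint32_t seven_n_plus_4(uint32_t n)
{
    uint32_t eax = n;        /* mov %edi, %eax                       */
    uint32_t edi = n << 3;   /* lea 0x0(,%rdi,8), %edi : n * 8       */
    edi -= eax;              /* sub %eax, %edi         : n * 7       */
    edi += 4;                /* add $0x4, %edi         : n * 7 + 4   */
    return edi;              /* this value is passed to mystery_util */
}
```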
The calculation for mystery_util appears to simply be
n &= (n>>1) & 1;
If I take all these factors together we have a function mystery that passes n * 7 + 4 to a function called mystery_util that returns n &= (n>>1) & 1.
Since mystery_util returns a single bit value (0 or 1) it is reasonable that bool is the return type.
I was curious if I could get a particular version of GCC with optimization level 1 (-O1) to reproduce this assembly code. I discovered that GCC 4.9.x will yield this exact assembly code for this given C program:
#include<stdbool.h>
bool mystery_util(unsigned int n)
{
n &= (n>>1) & 1;
return n;
}
bool mystery(unsigned int n)
{
return mystery_util (7*n+4);
}
The assembly output is:
mystery_util:
movl %edi, %eax
shrl %eax
andl $1, %edi
andl %edi, %eax
ret
mystery:
movl %edi, %eax
leal 0(,%rdi,8), %edi
subl %eax, %edi
addl $4, %edi
call mystery_util
rep ret
You can play with this code on godbolt.
Important Update - Version without bool
I apparently erred in interpreting the question. I assumed the person asking this question determined by themselves that the prototype for mystery was int mystery(int n). I thought I could change that. According to a related question asked on Stackoverflow a day later, it seems int mystery(int n) is given to you as the prototype as part of the assignment. This is important because it means that a modification has to be made.
The change that needs to be made is related to mystery_util. In the code to be reverse engineered are these lines:
mov %edi, %eax
shr %eax
EDI is the first parameter. SHR is a logical shift right. Compilers would only generate this if EDI was an unsigned int (or equivalent). int is a signed type and would generate SAR (arithmetic shift right). This means that the parameter for mystery_util has to be unsigned int (and it follows that the return value is likely unsigned int). That means the code would look like this:
unsigned int mystery_util(unsigned int n)
{
n &= (n>>1) & 1;
return n;
}
int mystery(int n)
{
return mystery_util (7*n+4);
}
mystery now has the prototype given by your professor (bool is removed) and we use unsigned int for the parameter and return type of mystery_util. In order to generate this code with GCC 4.9.x I found you need to use -O1 -fno-inline. This code can be found on godbolt. The assembly output is the same as the version using bool.
If you use unsigned int mystery_util(int n) you would discover that it doesn't quite output what we want:
mystery_util:
movl %edi, %eax
sarl %eax ; <------- SAR (arithmetic shift right) is not SHR
andl $1, %edi
andl %edi, %eax
ret
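The SHR/SAR difference is easy to reproduce in isolation. In this sketch the function names are invented; note that right-shifting a negative signed value is implementation-defined in C, and the comments assume GCC or clang, which both use an arithmetic shift.

```c
#include <stdint.h>

/* Unsigned operand: >> is a logical shift (shr on x86),
 * so zeros are shifted in from the top. */
uint32_t shift_unsigned(uint32_t n) { return n >> 1; }

/* Signed operand: GCC and clang use an arithmetic shift (sar on x86),
 * so the sign bit is copied into the vacated position. */
int32_t shift_signed(int32_t n) { return n >> 1; }
```

For example, the bit pattern 0x80000000 becomes 0x40000000 through the unsigned version but stays negative through the signed one.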
The LEA is just a left-shift by 3, truncating the result to 32 bits (i.e. implicitly zero-extending EDI into RDI). x86-64 System V passes the first integer arg in RDI, so all of this is consistent with one int arg. LEA uses memory-operand syntax and machine encoding, but it's really just a shift-and-add instruction. Using it as part of a multiply by a constant is a common compiler optimization for x86.
The compiler that generated this function missed an optimization here; the first mov could have been avoided with
lea 0x0(,%rdi, 8), %eax # n << 3 = n*8
sub %edi, %eax # eax = n*7
lea 4(%rax), %edi # rdi = 4 + n*7
But instead, the compiler got stuck on generating n*7 in %edi, probably because it applied a peephole optimization for the constant multiply too late to redo register allocation.
mystery_util returns the bitwise AND of the low 2 bits of its arg, in the low bit, so a 0 or 1 integer value, which could also be a bool.
(shr with no count means a count of 1; remember that x86 has a special opcode for shifts with an implicit count of 1. 8086 only has counts of 1 or cl; immediate counts were added later as an extension and the implicit-form opcode is still shorter.)
The LEA performs an address computation, but instead of dereferencing the address, it stores the computed address into the destination register.
In AT&T syntax, lea C(b,c,d), reg means reg = C + b + c*d where C is a constant, b and c are registers, and d is a scale factor from {1,2,4,8}. Hence you can see why LEA is popular for simple math operations: it does quite a bit in a single instruction. (*includes correction from prl's comment below)
There are some strange features of this assembly code: the repz prefix is only strictly defined when applied to certain instructions, and retq is not one of them (though the general behavior of the processor is to ignore it). See Michael Petch's comment below with a link for more info. The use of lea (,rdi,8), edi followed by sub eax, edi to compute arg1 * 7 also seemed strange, but makes sense once prl noted that the scale factor d has to be a constant power of 2. In any case, here's how I read the snippet:
mov %edi, %eax ; eax = arg1
lea 0x0(,%rdi, 8), %edi ; edi = arg1 * 8
sub %eax, %edi ; edi = (arg1 * 8) - arg1 = arg1 * 7
add $0x4, %edi ; edi = (arg1 * 7) + 4
callq <mystery_util> ; call mystery_util(arg1 * 7 + 4)
repz retq ; repz prefix on return is de facto nop.
<mystery_util>
mov %edi, %eax ; eax = arg1
shr %eax ; eax = arg1 >> 1
and $0x1, %edi ; edi = 1 iff arg1 was odd, else 0
and %edi, %eax ; eax = 1 iff smallest 2 bits of arg1 were both 1.
retq
Note the +4 on the 4th line is entirely spurious. It cannot affect the outcome of mystery_util.
So, overall this ASM snippet computes the boolean (arg1 * 7) % 4 == 3.
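Both claims (the +4 is irrelevant, and the result is (arg1 * 7) % 4 == 3) can be checked with a small C model. This is a sketch: the `_model` and `_closed_form` names are made up, and it uses unsigned arithmetic so that 7 * n wraps the same way the 32-bit registers do.

```c
/* C model of the disassembly, using unsigned arithmetic. */
unsigned mystery_util_model(unsigned n)
{
    return (n >> 1) & (n & 1);          /* 1 iff the low two bits are both 1 */
}

unsigned mystery_model(unsigned n)
{
    return mystery_util_model(7u * n + 4u);
}

/* The closed form claimed above; adding 4 never changes the low two bits. */
unsigned mystery_closed_form(unsigned n)
{
    return (7u * n) % 4u == 3u;
}
```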
I want to do the following arithmetic functions in a C pre-processor include statement when I send in the variable x.
#define calc_addr_data_reg(x) (base_offset + (((x) / 7) * 0x20) + data_reg_offset)
How would I go about implementing the division and multiplication operations using bitshifts? In the division operation I only need the quotient.
To answer the questions,
"Is this expression correct in the C Preprocessor?"
I don't see anything wrong with it.
How would I go about implementing the division and multiplication operations using bitshifts? In the division operation I only need the quotient.
The compiler is going to do a better job of optimizing your code than you will in almost all cases. If you have to ask StackOverflow how to do this, then you don't know enough to outperform GCC. I know I certainly don't. But because you asked here's how gcc optimizes it.
@EdHeal,
This needed a little bit more room to respond properly. You're absolutely correct in the example you gave (getters and setters), but in this particular example, inlining the function would slightly increase the size of the binary, assuming that it's called a few times.
GCC compiles the function to:
mov ecx, edx
mov edx, -1840700269
mov eax, edi
imul edx
lea eax, [rdx+rdi]
sar eax, 2
sar edi, 31
sub eax, edi
sal eax, 5
add esi, eax
lea eax, [rsi+rcx]
ret
Which is more bytes than the assembly for calling and getting a return value from the function, which is 3 push statements, a call, a return, and a pop statement (presumably).
with -Os it compiles into:
mov eax, edi
mov ecx, 7
mov edi, edx
cdq
idiv ecx
sal eax, 5
add eax, esi
add eax, edi
ret
Which is fewer bytes than the call, return, pushes and pops.
So in this case it really matters what compiler flags he uses whether or not the code is smaller or larger when inlining.
To Op again:
Explaining what the code up there means:
The next part of this post is ripped directly from: http://porn.quiteajolt.com/2008/04/30/the-voodoo-of-gcc-part-i/
The proper reaction to this monstrosity is “wait what.” Some specific instructions that I think could use more explanation:
movl $-1840700269, -4(%ebp)
-1840700269 = -015555555555 in octal (indicated by the leading zero). I’ll be using the octal representation because it looks cooler.
imull %ecx
This multiplies %ecx and %eax. Both of these registers contain a 32-bit number, so this multiplication could possibly result in a 64-bit number. This can’t fit into one 32-bit register, so the result is split across two: the high 32 bits of the product get put into %edx, and the low 32 get put into %eax.
leal (%edx,%ecx), %eax
This adds %edx and %ecx and puts the result into %eax. lea‘s ostensible purpose is for address calculations, and it would be more clear to write this as two instructions: an add and a mov, but that would take two clock cycles to execute, whereas this takes just one.
Also note that this instruction uses the high 32 bits of the multiplication from the previous instruction (stored in %edx) and then overwrites the low 32 bits in %eax, so only the high bits from the multiplication are ever used.
sarl $2, %edx # %edx = %edx >> 2
Technically, whether or not sar (arithmetic right shift) is equivalent to the >> operator is implementation-defined. gcc guarantees that the operator is an arithmetic shift for signed numbers (“Signed `>>’ acts on negative numbers by sign extension”), and since I’ve already used gcc once, let’s just assume I’m using it for the rest of this post (because I am).
sarl $31, %eax
%eax is a 32-bit register, so it’ll be operating on integers in the range [-2^31, 2^31 - 1]. This produces something interesting: this calculation only has two possible results. If the number is greater than or equal to 0, the shift will reduce the number to 0 no matter what. If the number is less than 0, the result will be -1.
Here’s a pretty direct rewrite of this assembly back into C, with some integer-width paranoia just to be on the safe side, since a few of these steps are dependent on integers being exactly 32 bits wide:
int32_t divideBySeven(int32_t num) {
int32_t eax, ecx, edx, temp; // push %ebp / movl %esp, %ebp / subl $4, %esp
ecx = num; // movl 8(%ebp), %ecx
temp = -015555555555; // movl $-1840700269, -4(%ebp)
eax = temp; // movl -4(%ebp), %eax
// imull %ecx - int64_t casts to avoid overflow
edx = ((int64_t)ecx * eax) >> 32; // high 32 bits
eax = (int64_t)ecx * eax; // low 32 bits
eax = edx + ecx; // leal (%edx,%ecx), %eax
edx = eax; // movl %eax, %edx
edx = edx >> 2; // sarl $2, %edx
eax = ecx; // movl %ecx, %eax
eax = eax >> 31; // sarl $31, %eax
ecx = edx; // movl %edx, %ecx
ecx = ecx - eax; // subl %eax, %ecx
eax = ecx; // movl %ecx, %eax
return eax; // leave / ret
}
Now there’s clearly a whole bunch of inefficient stuff here: unnecessary local variables, a bunch of unnecessary variable swapping, and eax = (int64_t)ecx * eax; is not needed at all (I just included it for completion’s sake). So let’s clean that up a bit. This next listing just has most of the cruft eliminated, with the corresponding assembly above each block:
int32_t divideBySeven(int32_t num) {
// pushl %ebp
// movl %esp, %ebp
// subl $4, %esp
// movl 8(%ebp), %ecx
// movl $-1840700269, -4(%ebp)
// movl -4(%ebp), %eax
int32_t eax, edx;
eax = -015555555555;
// imull %ecx
edx = ((int64_t)num * eax) >> 32;
// leal (%edx,%ecx), %eax
// movl %eax, %edx
// sarl $2, %edx
edx = edx + num;
edx = edx >> 2;
// movl %ecx, %eax
// sarl $31, %eax
eax = num >> 31;
// movl %edx, %ecx
// subl %eax, %ecx
// movl %ecx, %eax
// leave
// ret
eax = edx - eax;
return eax;
}
And the final version:
int32_t divideBySeven(int32_t num) {
int32_t temp = ((int64_t)num * -015555555555) >> 32;
temp = (temp + num) >> 2;
return (temp - (num >> 31));
}
I still have yet to answer the obvious question, “why would they do that?” And the answer is, of course, speed. The integer division instruction used in the very first listing, idiv, takes a whopping 43 clock cycles to execute. But the divisionless method that gcc produces has quite a few more instructions, so is it really faster overall? This is why we have the benchmark.
int main(int argc, char *argv[]) {
int i = INT_MIN;
do {
divideBySeven(i);
i++;
} while (i != INT_MIN);
return 0;
}
Loop over every single possible integer? Sure! I ran the test five times for both implementations and timed it with time. The user CPU times for gcc were 45.9, 45.89, 45.9, 45.99, and 46.11 seconds, while the times for my assembly using the idiv instruction were 62.34, 62.32, 62.44, 62.3, and 62.29 seconds, meaning the naive implementation ran about 36% slower on average. Yeow.
Compiler optimizations are a beautiful thing.
Ok, I'm back, now why does this work?
int32_t divideBySeven(int32_t num) {
int32_t temp = ((int64_t)num * -015555555555) >> 32;
temp = (temp + num) >> 2;
return (temp - (num >> 31));
}
Let's take a look at the first part:
int32_t temp = ((int64_t)num * -015555555555) >> 32;
Why this number?
Well, let's take 2^64 and divide it by 7 and see what pops out.
2^64 / 7 = 2635249153387078802.28571428571428571429
That looks like a mess, what if we convert it into octal?
0222222222222222222222.22222222222222222222222
That's a very pretty repeating pattern, surely that can't be a coincidence. I mean we remember that 7 is 0b111 and we know that when we divide by 99 we tend to get repeating patterns in base 10. So it makes sense that we'd get a repeating pattern in base 8 when we divide by 7.
So where does our number come in?
(int32_t)-1840700269 is the same as (uint32_t)2454267027
* 7 = 17179869189
And finally 17179869184 is 2^34
Which means that 17179869189 is the smallest multiple of 7 above 2^34. Or, to put it another way, 2454267027 is 2^34 / 7 rounded up: it still fits in a uint32_t, and multiplying it by 7 lands just 5 above a power of 2.
What's this number in octal?
022222222223
Why is this important? Well, we want to divide by 7. This number is 2^34/7... approximately. So if we multiply by it and then right-shift by 34 bits, we should get a number very close to the exact quotient.
The last two lines look like they were designed to patch up approximation errors.
Perhaps someone with a little more knowledge and/or expertise in this field can chime in on this.
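Here is my reading of those last two lines, as an annotated sketch. The function name `div7_magic` is mine, and the code assumes that >> on negative signed values is an arithmetic shift, as GCC documents. Write M = 2454267027; the constant in the code is M - 2^32 because it is stored as a signed 32-bit value. The high half of num * (M - 2^32) is floor(num * M / 2^32) - num, so adding num back recovers floor(num * M / 2^32). Shifting right by 2 more makes the total shift 34, and subtracting num >> 31 adds 1 for negative inputs so the result rounds toward zero like C division.

```c
#include <stdint.h>

int32_t div7_magic(int32_t num)
{
    /* M - 2^32, where M = 2454267027 = ceil(2^34 / 7) */
    const int64_t magic = INT64_C(-1840700269);

    /* High 32 bits of the 64-bit product: floor(num * M / 2^32) - num */
    int32_t hi = (int32_t)(((int64_t)num * magic) >> 32);

    /* Add num back, then finish the shift: floor(num * M / 2^34) */
    int32_t q = (hi + num) >> 2;

    /* num >> 31 is 0 for num >= 0 and -1 for num < 0, so this adds 1
     * for negative inputs, turning floor into round-toward-zero. */
    return q - (num >> 31);
}
```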
>>> magic = 2454267027
>>> def div7(a):
...     # True when multiply-and-shift agrees with real division by 7
...     return (magic * a) >> 34 == a // 7
...
>>> for a in range(2**31, 2**32):
...     if not div7(a):
...         print("%s fails" % a)
...
Failures begin at 3435973841 which is, funnily enough 0b11001100110011001100110011010001
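That failure point is what the arithmetic predicts: 7 * 2454267027 exceeds 2^34 by 5, so the estimate is too high by a * 5 / (7 * 2^34), and once a * 5 reaches 2^34 (around a = 3435973837) an input with remainder 6 gets pushed up to the next quotient. For the full unsigned 32-bit range a 33-bit multiplier is needed. The sketch below uses the usual workaround, where the low 32 bits of the multiplier (0x24924925) are applied with a multiply and the implicit top bit is folded in with a subtract, shift, and add. The function name `udiv7_magic` is made up; treat this as an illustration, not as the exact code any particular compiler emits.

```c
#include <stdint.h>

uint32_t udiv7_magic(uint32_t n)
{
    /* 0x24924925 = ceil(2^35 / 7) - 2^32: the low 32 bits of a 33-bit magic */
    uint32_t q = (uint32_t)(((uint64_t)n * UINT32_C(0x24924925)) >> 32);

    /* (n + q) / 2 computed without overflowing 32 bits; this folds in
     * the implicit 2^32 part of the multiplier. */
    uint32_t t = ((n - q) >> 1) + q;

    /* Total effect: floor(n * ceil(2^35 / 7) / 2^35) */
    return t >> 2;
}
```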
I'm trying to do some code optimization to eliminate branches. The original C code is
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
I intend to replace it with assembly code like below
mov a, %rax
mov b, %rbx
mov k, %rcx
xor %rdx %rdx
shl 1, %rcx
cmp %rax, %rax
setb %rdx
add %rdx,%rcx
mov %rcx, k
so I wrote C inline assembly code like below:
#define next(a, b, k)\
__asm__("shl $0x1, %0; \
xor %%rbx, %%rbx; \
cmp %1, %2; \
setb %%rbx; \
addl %%rbx,%0;":"+c"(k) :"g"(a),"g"(b))
When I compile the code I get these errors:
operand type mismatch for `add'
operand type mismatch for `setb'
How can I fix it?
Here are the mistakes in your code:
Error: operand type mismatch for 'cmp' -- One of CMP's operands must be a register. You're probably generating code that's trying to compare two immediates. Change the second operand's constraint from "g" to "r". (See GCC Manual - Extended Asm - Simple Constraints)
Error: operand type mismatch for 'setb' -- SETB only takes 8 bit operands, i.e. setb %bl works while setb %rbx doesn't.
The C expression T = (A < B) should translate to cmp B,A; setb T in AT&T x86 assembler syntax. You had the two operands to CMP in the wrong order. Remember that CMP works like SUB.
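To make the "CMP works like SUB" remark concrete, here is a small portable C model of the carry flag. It is only an illustration (carry_after_cmp is a made-up name, not an intrinsic): after cmp B,A in AT&T syntax the CPU has computed A - B, and CF is the borrow out of that unsigned subtraction.

```c
#include <stdint.h>

/* Model of CF after `cmp b, a` (AT&T operand order), i.e. after a - b.
   The unsigned subtraction wraps modulo 2^64; it wrapped exactly when
   a borrow was needed, which is exactly when a < b. */
static unsigned carry_after_cmp(uint64_t a, uint64_t b) {
    uint64_t diff = a - b;
    return diff > a;   /* 1 if a borrow occurred, else 0 */
}
```

Swapping the operands of CMP therefore gives you (B < A) in the carry flag instead of (A < B).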
Once you realize the first two error messages are produced by the assembler, it follows that the trick to debugging them is to look at the assembler code generated by gcc. Try gcc $CFLAGS -S t.c and compare the problematic lines in t.s with an x86 opcode reference. Focus on the allowed operand codes for each instruction and you'll quickly see the problems.
In the fixed source code posted below, I assume your operands are unsigned since you're using SETB instead of SETL. I switched from using RBX to RCX to hold the temporary value because RCX is a call clobbered register in the ABI and used the "=&c" constraint to mark it as an earlyclobber operand since RCX is cleared before the inputs a and b are read:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
static uint64_t next(uint64_t a, uint64_t b, uint64_t k)
{
uint64_t tmp;
__asm__("shl $0x1, %[k];"
"xor %%rcx, %%rcx;"
"cmp %[b], %[a];"
"setb %%cl;"
"addq %%rcx, %[k];"
: /* outputs */ [k] "+g" (k), [tmp] "=&c" (tmp)
: /* inputs */ [a] "r" (a), [b] "g" (b)
: /* clobbers */ "cc");
return k;
}
int main()
{
uint64_t t, t0, k;
k = next(1, 2, 0);
printf("%" PRIu64 "\n", k);
scanf("%" SCNu64 "%" SCNu64, &t, &t0);
k = next(t, t0, k);
printf("%" PRIu64 "\n", k);
return 0;
}
main() translates to:
<+0>: push %rbx
<+1>: xor %ebx,%ebx
<+3>: mov $0x4006c0,%edi
<+8>: mov $0x1,%bl
<+10>: xor %eax,%eax
<+12>: sub $0x10,%rsp
<+16>: shl %rax
<+19>: xor %rcx,%rcx
<+22>: cmp $0x2,%rbx
<+26>: setb %cl
<+29>: add %rcx,%rax
<+32>: mov %rax,%rbx
<+35>: mov %rax,%rsi
<+38>: xor %eax,%eax
<+40>: callq 0x400470 <printf@plt>
<+45>: lea 0x8(%rsp),%rdx
<+50>: mov %rsp,%rsi
<+53>: mov $0x4006c5,%edi
<+58>: xor %eax,%eax
<+60>: callq 0x4004a0 <__isoc99_scanf@plt>
<+65>: mov (%rsp),%rax
<+69>: mov %rbx,%rsi
<+72>: mov $0x4006c0,%edi
<+77>: shl %rsi
<+80>: xor %rcx,%rcx
<+83>: cmp 0x8(%rsp),%rax
<+88>: setb %cl
<+91>: add %rcx,%rsi
<+94>: xor %eax,%eax
<+96>: callq 0x400470 <printf@plt>
<+101>: add $0x10,%rsp
<+105>: xor %eax,%eax
<+107>: pop %rbx
<+108>: retq
You can see the result of next() being moved into RSI before each printf() call.
Given that gcc (and it looks like gcc inline assembler) produces:
leal (%rdx,%rdx), %eax
xorl %edx, %edx
cmpl %esi, %edi
setl %dl
addl %edx, %eax
ret
from
int f(int a, int b, int k)
{
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
I would think that writing your own inline assembler is a complete waste of time and effort.
As always, BEFORE you start writing inline assembler, check what the compiler actually does. If your compiler doesn't produce this code, then you may need to upgrade the version of compiler to something a bit newer (I reported this sort of thing to Jan Hubicka [gcc maintainer for x86-64 at the time] ca 2001, and I'm sure it's been in gcc for quite some time).
You could just do this and the compiler will not generate a branch:
k = (k<<1) + (a < b) ;
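A quick way to convince yourself the two forms agree (the function names are made up for the comparison, and I'm assuming unsigned 64-bit operands). Whether a branch is emitted is still the compiler's decision, so check the generated asm:

```c
#include <stdint.h>

/* The original if/else form from the question. */
static uint64_t next_branchy(uint64_t a, uint64_t b, uint64_t k) {
    if (a < b)
        k = (k << 1) + 1;
    else
        k = (k << 1);
    return k;
}

/* Same result without an if: in C, (a < b) is an int that is 0 or 1. */
static uint64_t next_bool_add(uint64_t a, uint64_t b, uint64_t k) {
    return (k << 1) + (a < b);
}
```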
But if you must, I fixed some things in your code; now it should work as expected:
__asm__(
"shl $0x1, %0; \
xor %%eax, %%eax; \
cmpl %3, %2; \
setb %%al; \
addl %%eax, %0;"
:"=r"(k) /* output */
:"0"(k), "r"(a),"r"(b) /* input */
:"eax", "cc" /* clobbered register */
);
Note that setb expects a reg8 or mem8, and you should add eax to the clobber list because you change it, as well as cc just to be safe. As for the register constraints, I'm not sure why you used those, but =r and r work just fine.
You also need to add k to both the input and output lists. There's more in the GCC-Inline-Assembly-HOWTO.
Summary:
Branchless might not even be the best choice.
Inline asm defeats some other optimizations, so try other source changes first. E.g. ? : often compiles branchlessly, and you can also use booleans as integer 0/1.
If you use inline-asm, make sure you optimize the constraints as well to make the compiler-generated code outside your asm block efficient.
The whole thing is doable with cmp %[b], %[a] / adc %[k],%[k]. Your hand-written code is worse than what compilers generate, but they are beatable in the small scale for cases where constant-propagation / CSE / inlining didn't make this code (partially) optimize away.
If your compiler generates branchy code, and profiling shows that was the wrong choice (high counts for branch misses at that instruction, e.g. on Linux perf record -ebranch-misses ./my_program && perf report), then yes you should do something to get branchless code.
(Branchy can be an advantage if it's predictable: branching means out-of-order execution of code that uses (k<<1) + 1 doesn't have to wait for a and b to be ready. LLVM recently merged a patch that makes x86 code-gen more branchy by default, because modern x86 CPUs have such powerful branch predictors. Clang/LLVM nightly build (with that patch) does still choose branchless for this C source, at least in a stand-alone function outside a loop).
If this is for a binary search, branchless probably is a good strategy, unless you see the same search often. (Branching + speculative execution means you have a control dependency off the critical path.)
Compile with profile-guided optimization so the compiler has run-time info on which branches almost always go one way. It still might not know the difference between a poorly-predictable branch and one that does overall take both paths but with a simple pattern. (Or that's predictable based on global history; many modern branch-predictor designs index based on branch history, so which way the last few branches went determine which table entry is used for the current branch.)
Related: gcc optimization flag -O3 makes code slower than -O2 shows a case where a sorted array makes for near-perfect branch prediction for a condition inside a loop, and gcc -O3's branchless code (without profile guided optimization) bottlenecks on a data dependency from using cmov. But -O3 -fprofile-use makes branchy code. (Also, a different way of writing it makes lower-latency branchless code that also auto-vectorizes better.)
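For the binary-search case mentioned above, this is roughly what a branchless-friendly search loop looks like in C. It's a sketch of a well-known formulation, not code from the question, and lower_bound_branchless is just a name I picked. The loop trip count depends only on n, and the only data-dependent choice is a select that compilers can (but are not obliged to) turn into cmov:

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the first element >= key in sorted arr[0..n), or n if none. */
static size_t lower_bound_branchless(const uint32_t *arr, size_t n, uint32_t key) {
    if (n == 0)
        return 0;
    const uint32_t *base = arr;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        /* A select, not an if/else with different work on each side. */
        base = (base[half] < key) ? base + half : base;
        len -= half;
    }
    /* One candidate left; step past it if it is still too small. */
    return (size_t)(base - arr) + (base[0] < key);
}
```

If profiling shows the searches are predictable, the ordinary branchy version may still win, as discussed above.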
Inline asm should be your last resort if you can't hand-hold the compiler into making the asm you want, e.g. by writing it as (k<<1) + (a<b) as others have suggested.
Inline asm defeats many optimizations, most obviously constant-propagation (as seen in some other answers, where gcc moves a constant into a register outside the block of inline-asm code). https://gcc.gnu.org/wiki/DontUseInlineAsm.
You could maybe use if(__builtin_constant_p(a)) and so on to use a pure C version when the compiler has constant values for some/all of the variables, but that's a lot more work. (And doesn't work well with Clang, where __builtin_constant_p() is evaluated before function inlining.)
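A minimal sketch of that __builtin_constant_p() idea, assuming GNU C on x86-64 and falling back to plain C everywhere else. The wrapper name next_k_dispatch is hypothetical and the asm constraints are deliberately simple:

```c
/* Use pure C when the compiler can see constant inputs (so it can
   constant-fold), otherwise use a cmp/adc asm sequence. */
static inline unsigned long next_k_dispatch(unsigned long a, unsigned long b,
                                            unsigned long k) {
#if defined(__GNUC__) && defined(__x86_64__)
    if (!(__builtin_constant_p(a) && __builtin_constant_p(b))) {
        __asm__("cmp %[b], %[a]\n\t"   /* CF = (a < b), unsigned */
                "adc %[k], %[k]"       /* k = k + k + CF */
                : [k] "+r" (k)
                : [a] "r" (a), [b] "re" (b)
                : "cc");
        return k;
    }
#endif
    return (k << 1) + (a < b);         /* visible to the optimizer */
}
```

Both paths compute the same value, so the check only affects which code the compiler gets to optimize.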
Even then (once you've limited things to cases where the inputs aren't compile-time constants), it's not possible to give the compiler the full range of options, because you can't use different asm blocks depending on which constraints are matched (e.g. a in a register and b in memory, or vice versa.) In cases where you want to use a different instruction depending on the situation, you're screwed, but here we can use multi-alternative constraints to expose most of the flexibility of cmp.
It's still usually better to let the compiler make near-optimal code than to use inline asm. Inline-asm destroys the ability of the compiler to reuse any temporary results, or spread out the instructions to mix with other compiler-generated code. (Instruction-scheduling isn't a big deal on x86 because of good out-of-order execution, but still.)
That asm is pretty crap. If you get a lot of branch misses, it's better than a branchy implementation, but a much better branchless implementation is possible.
Your a<b is an unsigned compare (you're using setb, the unsigned below condition). So your compare result is in the carry flag. x86 has an add-with-carry instruction. Furthermore, k<<1 is the same thing as k+k.
So the asm you want (compiler-generated or with inline asm) is:
# k in %rax, a in %rdi, b in %rsi for this example
cmp %rsi, %rdi # CF = (a < b) = the carry-out from rdi - rsi
adc %rax, %rax # rax = (k<<1) + CF = (k<<1) + (a < b)
Compilers are smart enough to use add or lea for a left-shift by 1, and some are smart enough to use adc instead of setb, but they don't manage to combine both.
Writing a function with register args and a return value is often a good way to see what compilers might do, although it does force them to produce the result in a different register. (See also this Q&A, and Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”).
// I also tried a version where k is a function return value,
// or where k is a global, so it's in the same register.
unsigned funcarg(unsigned a, unsigned b, unsigned k) {
if( a < b )
k = (k<<1) + 1;
else
k = (k<<1);
return k;
}
On the Godbolt compiler explorer, along with a couple other versions. (I used unsigned in this version, because you had addl in your asm. Using unsigned long makes everything except the xor-zeroing into 64-bit registers. xor %eax,%eax is still the best way to zero RAX.)
# gcc7.2 -O3 When it can keep the value in the same reg, uses add instead of lea
leal (%rdx,%rdx), %eax #, <retval>
cmpl %esi, %edi # b, a
adcl $0, %eax #, <retval>
ret
#clang 6.0 snapshot -O3
xorl %eax, %eax
cmpl %esi, %edi
setb %al
leal (%rax,%rdx,2), %eax
retq
# ICC18, same as gcc but fails to save a MOV
addl %edx, %edx #14.16
cmpl %esi, %edi #17.12
adcl $0, %edx #17.12
movl %edx, %eax #17.12
ret #17.12
MSVC is the only compiler that doesn't make branchless code without hand-holding. ((k<<1) + ( a < b ); gives us exactly the same xor/cmp/setb / lea sequence as clang above, but with the Windows x86-64 calling convention.)
funcarg PROC ; x86-64 MSVC CL19 -Ox
lea eax, DWORD PTR [r8*2+1]
cmp ecx, edx
jb SHORT $LN3@funcarg
lea eax, DWORD PTR [r8+r8] ; conditionally jumped over
$LN3@funcarg:
ret 0
Inline asm
The other answers cover the problems with your implementation pretty well. To debug assembler errors in inline asm, use gcc -O3 -S -fverbose-asm to see what the compiler is feeding to the assembler, with the asm template filled in. You would have seen addl %rax, %ecx or something.
This optimized implementation uses multi-alternative constraints to let the compiler pick either the cmp $imm, r/m, cmp r/m, r, or cmp r, r/m forms of CMP. I used two alternates that split things up not by opcode but by which side included the possible memory operand. ("rme" is like "g" (rmi) but limited to 32-bit sign-extended immediates.)
unsigned long inlineasm(unsigned long a, unsigned long b, unsigned long k)
{
__asm__("cmpq %[b], %[a] \n\t"
"adc %[k],%[k]"
: /* outputs */ [k] "+r,r" (k)
: /* inputs */ [a] "r,rm" (a), [b] "rme,re" (b)
: /* clobbers */ "cc"); // "cc" clobber is implicit for x86, but it doesn't hurt
return k;
}
I put this on Godbolt with callers that inline it in different contexts. gcc7.2 -O3 does what we expect for the stand-alone version (with register args).
inlineasm:
movq %rdx, %rax # k, k
cmpq %rsi, %rdi # b, a
adc %rax,%rax # k
ret
We can look at how well our constraints work by inlining into other callers:
unsigned long call_with_mem(unsigned long *aptr) {
return inlineasm(*aptr, 55555, 4);
}
# gcc
movl $4, %eax #, k
cmpq $55555, (%rdi) #, *aptr_3(D)
adc %rax,%rax # k
ret
With a larger immediate, we get movabs into a register. (But with an "i" or "g" constraint, gcc would emit code that doesn't assemble, or truncates the constant, trying to use a large immediate constant for cmpq.)
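To see which constants count as "larger" here: the "e" constraint accepts values that survive being stored as a 32-bit immediate and sign-extended back to 64 bits. A small C check of that rule (fits_simm32 is my own helper name, not anything from gcc):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v can be encoded as a 32-bit immediate that the CPU
   sign-extends to 64 bits, which is what cmpq $imm32 does. */
static bool fits_simm32(uint64_t v) {
    int64_t s = (int64_t)v;   /* reinterpret as two's complement */
    return s >= INT32_MIN && s <= INT32_MAX;
}
```

So 55555 is fine as a cmpq immediate, while something like 0x100000000 has to go through a register.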
Compare what we get from pure C:
unsigned long call_with_mem_nonasm(unsigned long *aptr) {
return handhold(*aptr, 5, 4);
}
# gcc -O3
xorl %eax, %eax # tmp93
cmpq $4, (%rdi) #, *aptr_3(D)
setbe %al #, tmp93
addq $8, %rax #, k
ret
adc $8, %rax without setc would probably have been better, but we can't get that from inline asm without __builtin_constant_p() on k.
clang often picks the mem alternative if there is one, so it does this: /facepalm. Don't use inline asm.
inlineasm: # clang 5.0
movq %rsi, -8(%rsp)
cmpq -8(%rsp), %rdi
adcq %rdx, %rdx
movq %rdx, %rax
retq
BTW, unless you're going to optimize the shift into the compare-and-add, you can and should have asked the compiler for k<<1 as an input.
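A sketch of what that could look like, again assuming GNU C on x86-64 with a plain-C fallback (next_k_shift_in_c is a made-up name). The shift is written in C so the compiler can pick add/lea/shl or fold it into surrounding code, and the asm only does the compare and the carry add:

```c
static inline unsigned long next_k_shift_in_c(unsigned long a, unsigned long b,
                                              unsigned long k) {
    unsigned long k2 = k << 1;         /* left to the compiler */
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__("cmp %[b], %[a]\n\t"       /* CF = (a < b), unsigned */
            "adc $0, %[k2]"            /* k2 += CF */
            : [k2] "+r" (k2)
            : [a] "r" (a), [b] "re" (b)
            : "cc");
    return k2;
#else
    return k2 + (a < b);
#endif
}
```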