Can a modern C compiler optimize a combination of bit accesses? - c

I would like var to be unequal to FALSE (i.e. nonzero) in case any one of the bits 1, 3, 5, 7, 9, 11, 13 or 15 of input is set.
One solution which seems to be fairly common is this:
int var = 1 & (input >> 1) ||
          1 & (input >> 3) ||
          1 & (input >> 5) ||
          1 & (input >> 7) ||
          1 & (input >> 9) ||
          1 & (input >> 11) ||
          1 & (input >> 13) ||
          1 & (input >> 15);
However, I'm afraid that this would lead the compiler to generate unnecessarily long code.
The following code would also yield the desired result. Would it be more efficient?
int var = input & 0b1010101010101010;
Thanks!

Your second example is not equivalent.
What you wanted was (using non-standard binary literals):
int var = !!(input & 0b1010101010101010);
Or with hex-literals (those are standard):
int var = !!(input & 0xaaaa);
Changes: Use of hexadecimal constants and double-negation (equivalent to != 0).
This presupposes input is not volatile, nor an atomic type.
A good compiler should optimize both to the same instructions (and most modern compilers are good enough).
In the end though, test and measure: most compilers can output the generated assembler code, so you don't even need a disassembler!
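For instance, you could wrap the corrected expression in a small function and look at what your compiler emits for it (the function name here is made up):
int any_odd_bit_set(unsigned int input)
{
    return !!(input & 0xaaaa);
}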

If input is volatile, the compiler would be required to read it once if bit 1 was set, twice if bit 1 was clear but 3 was set, three times if bits 1 and 3 were clear but 5 was set, etc. The compiler may have ways of optimizing the code for doing the individual bit tests, but it would have to test the bits separately.
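To illustrate that (a hypothetical sketch; input_reg is a made-up name standing in for a volatile object such as a memory-mapped register):
extern volatile unsigned int input_reg;   /* hypothetical volatile object */

int poll_bits(void)                       /* hypothetical helper name */
{
    return 1 & (input_reg >> 1) ||        /* always one read of input_reg          */
           1 & (input_reg >> 3);          /* a second read only if bit 1 was clear */
}
Each || operand that gets evaluated is a separate access to the volatile object, so the short-circuiting itself becomes observable behavior the compiler must preserve.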
If input is not volatile, a compiler could optimize the code, but I would not particularly expect it to. I would expect any compiler, however, no matter how ancient, to optimize
int var = ((input & (
    (1 << 1) | (1 << 3) | (1 << 5) | (1 << 7) |
    (1 << 9) | (1 << 11) | (1 << 13) | (1 << 15)
)) != 0);
which would appear to be what you're after.

It's going to depend on the processor and what instructions it has available, as well as how good the optimizing compiler is. I'd suspect that in your case, either of those lines of code will compile to the same instructions.
But we can do better than suspect: you can check for yourself. With gcc, use the -S compiler flag to have it output the assembly it generates. Then you can compare the two versions yourself.
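For example (common gcc/clang invocations; adjust the file name to your setup):
gcc -O2 -S test.c            # writes the generated assembly to test.s
clang -O2 -S -o - test.c     # prints the generated assembly to stdout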

The orthodox solution would be to use the often-forgotten bit-fields to map your flags, like
struct
{
    bool B0: 1;
    bool B1: 1;
    bool B2: 1;
    bool B3: 1;
    bool B4: 1;
    bool B5: 1;
    bool B6: 1;
    bool B7: 1;
    bool B8: 1;
    bool B9: 1;
    bool B10: 1;
    bool B11: 1;
    bool B12: 1;
    bool B13: 1;
    bool B14: 1;
    bool B15: 1;
} input;
and use the expression
bool Var = input.B1 || input.B3 || input.B5 || input.B7 || input.B9 || input.B11 || input.B13 || input.B15;
I doubt that an optimizing compiler will use the single-go masking trick, but honestly I have not tried.
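For what it's worth, here is a sketch of how the bitfield approach might be wired up to a raw 16-bit value (the union and the helper name are my own assumptions; note that bit-field allocation order is implementation-defined, so B1 is not guaranteed to be bit 1 of raw on every compiler/ABI):
#include <stdbool.h>
#include <stdint.h>

union flags16 {
    uint16_t raw;                        /* the raw input word            */
    struct {                             /* one single-bit field per flag */
        bool B0:  1; bool B1:  1; bool B2:  1; bool B3:  1;
        bool B4:  1; bool B5:  1; bool B6:  1; bool B7:  1;
        bool B8:  1; bool B9:  1; bool B10: 1; bool B11: 1;
        bool B12: 1; bool B13: 1; bool B14: 1; bool B15: 1;
    } bits;
};

bool any_odd_flag(uint16_t raw)          /* hypothetical helper name */
{
    union flags16 input = { .raw = raw };
    return input.bits.B1 || input.bits.B3 || input.bits.B5  || input.bits.B7 ||
           input.bits.B9 || input.bits.B11 || input.bits.B13 || input.bits.B15;
}
Whether this ends up as fast as the masking version depends on what the compiler does with the chain of ||, which is the same question as before.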

How well this is handled depends on the compiler.
I've tested a minor variation of this code:
int test(int input) {
    int var = 1 & (input >> 1) ||
              1 & (input >> 3) ||
              1 & (input >> 5) ||
              1 & (input >> 7) ||
              1 & (input >> 9) ||
              1 & (input >> 11) ||
              1 & (input >> 13) ||
              1 & (input >> 15);
    return var != 0;
}
Results
For x64, all compiled with -O2
GCC:
xor eax, eax
and edi, 43690
setne al
ret
Very good. That's precisely the transformation you were hoping for.
Clang:
testw $10922, %di # imm = 0x2AAA
movb $1, %al
jne .LBB0_2
andl $32768, %edi # imm = 0x8000
shrl $15, %edi
movb %dil, %al
.LBB0_2:
movzbl %al, %eax
ret
Yeah, that's a bit odd. Most of the tests were rolled together... except for one. I see no reason why it would do this; maybe someone else can shed some light on that.
And the real surprise, ICC:
movl %edi, %eax #7.32
movl %edi, %edx #8.26
movl %edi, %ecx #9.26
shrl $1, %eax #7.32
movl %edi, %esi #10.26
shrl $3, %edx #8.26
movl %edi, %r8d #11.26
shrl $5, %ecx #9.26
orl %edx, %eax #7.32
shrl $7, %esi #10.26
orl %ecx, %eax #7.32
shrl $9, %r8d #11.26
orl %esi, %eax #7.32
movl %edi, %r9d #12.25
orl %r8d, %eax #7.32
shrl $11, %r9d #12.25
movl %edi, %r10d #13.25
shrl $13, %r10d #13.25
orl %r9d, %eax #7.32
shrl $15, %edi #14.25
orl %r10d, %eax #7.32
orl %edi, %eax #7.32
andl $1, %eax #7.32
ret #15.21
Ok so it optimized it a bit - no branches, and the 1 &'s are rolled together. But this is disappointing.
Conclusion
Your mileage may vary. To be safe, you can of course use the simple version directly, instead of relying on the compiler.
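For reference, the "simple version" spelled out, using the corrected mask expression from the first answer:
int var = (input & 0xaaaa) != 0;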

Related

A reference for converting assembly 'shl', 'OR', 'AND', 'SHR' operations into C?

I'm to convert the following AT&T x86 assembly into C:
movl 8(%ebp), %edx
movl $0, %eax
movl $0, %ecx
jmp .L2
.L1
shll $1, %eax
movl %edx, %ebx
andl $1, %ebx
orl %ebx, %eax
shrl $1, %edx
addl $1, %ecx
.L2
cmpl $32, %ecx
jl .L1
leave
But must adhere to the following skeleton code:
int f(unsigned int x) {
    int val = 0, i = 0;
    while(________) {
        val = ________________;
        x = ________________;
        i++;
    }
    return val;
}
I can tell that the snippet
.L2
cmpl $32, %ecx
jl .L1
can be interpreted as while(i<32). I also know that x is stored in %edx, val in %eax, and i in %ecx. However, I'm having a hard time converting the assembly within the while/.L1 loop into condensed high-level language that fits into the provided skeleton code. For example, can shll, shrl, orl, and andl simply be written using their direct C equivalents (<<,>>,|,&), or is there some more nuance to it?
Is there a standardized guide/"cheat sheet" for Assembly-to-C conversions?
I understand assembly to high-level conversion is not always clear-cut, but there are certainly patterns in assembly code that can be consistently interpreted as certain C operations.
For example, can shll, shrl, orl, and andl simply be written using
their direct C equivalents (<<,>>,|,&), or is there some more nuance
to it?
They can. Let's examine the loop body step by step:
shll $1, %eax // shift left eax by 1, same as "eax<<1" or even "eax*=2"
movl %edx, %ebx
andl $1, %ebx // ebx &= 1
orl %ebx, %eax // eax |= ebx
shrl $1, %edx // shift right edx by 1, same as "edx>>1" = "edx/=2"
gets us to
%eax *=2
%ebx = %edx
%ebx = %ebx & 1
%eax |= %ebx
%edx /= 2
The ABI tells us (movl 8(%ebp), %edx) that %edx is x, and %eax (the return value) is val:
val *=2
%ebx = x // a
%ebx = %ebx & 1 // b
val |= %ebx // c
x /= 2
Combine a, b, c, step 1: insert a into b:
val *=2
%ebx = (x & 1) // b
val |= %ebx // c
x /= 2
Combine a, b, c, step 2: insert b into c:
val *=2
val |= (x & 1)
x /= 2
Final step: combine both 'val =' assignments into one:
val = 2*val | (x & 1)
x /= 2
Putting it all together:
while (i < 32) {
    val = (val << 1) | (x & 1);
    x = x >> 1;
    i++;
}
except that val and the return value should be unsigned, and they aren't in your template. The function returns the bits of x reversed.
The actual answer to your question is more complicated, and is pretty much: no, there is no such guide, and there can't be one, because compilation loses information and you can't recreate that lost information from the assembly. But you can often make a good educated guess.

Counting '1' in number in C

My task was to print all whole numbers from 2 to N for which, in binary, the count of '1' bits is greater than the count of '0' bits.
int CountOnes(unsigned int x)
{
    unsigned int iPassedNumber = x; // number to be modified
    unsigned int iOriginalNumber = iPassedNumber;
    unsigned int iNumbOfOnes = 0;
    while (iPassedNumber > 0)
    {
        iPassedNumber = iPassedNumber >> 1 << 1; // if LSB was '1', it turns to '0'
        if (iOriginalNumber - iPassedNumber == 1) // if the difference == 1, then we increment the count of '1'
        {
            ++iNumbOfOnes;
        }
        iOriginalNumber = iPassedNumber >> 1; // do this to operate with the next bit
        iPassedNumber = iOriginalNumber;
    }
    return (iNumbOfOnes);
}
Here is my function to calculate the number of '1' in binary. It was my homework in college. However, my teacher said that it would be more efficient to
{
    if (n % 2 == 1)
        ++CountOnes;
    else if (n % 2 == 0)
        ++CountZeros;
}
In the end, I just messed up and don't know which is better. What do you think about this?
I used gcc compiler for the experiment below. Your compiler may be different, so you may have to do things a bit differently to get a similar effect.
When trying to figure out the most optimized method for doing something, you want to see what kind of code the compiler produces and look at the CPU's manual to see which operations are fast and which are slow on that particular architecture (although there are general guidelines). And of course, look for ways to reduce the number of instructions the CPU has to perform.
I decided to show you a few different methods (not exhaustive) and give you a sample of how to go about looking at optimization of small functions (like this one) manually. There are more sophisticated tools that help with larger and more complex functions, however this approach should work with pretty much anything:
Note
All assembly code was produced using:
gcc -O99 -o foo -fprofile-generate foo.c
followed by
gcc -O99 -o foo -fprofile-use foo.c
On -fprofile-generate
The double compile really lets gcc work (although -O99 most likely does that already); however, mileage may vary based on which version of gcc you are using.
On with it:
Method I (you)
Here is the disassembly of your function:
CountOnes_you:
.LFB20:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L5
.p2align 4,,10
.p2align 3
.L4:
movl %edi, %edx
xorl %ecx, %ecx
andl $-2, %edx
subl %edx, %edi
cmpl $1, %edi
movl %edx, %edi
sete %cl
addl %ecx, %eax
shrl %edi
jne .L4
rep ret
.p2align 4,,10
.p2align 3
.L5:
rep ret
.cfi_endproc
At a glance
Approximately 9 instructions in a loop, until the loop exits
Method II (teacher)
Here is a function which uses your teacher's algo:
int CountOnes_teacher(unsigned int x)
{
    unsigned int one_count = 0;
    while (x) {
        if (x % 2)
            ++one_count;
        x >>= 1;
    }
    return one_count;
}
Here's the disassembly of that:
CountOnes_teacher:
.LFB21:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L12
.p2align 4,,10
.p2align 3
.L11:
movl %edi, %edx
andl $1, %edx
cmpl $1, %edx
sbbl $-1, %eax
shrl %edi
jne .L11
rep ret
.p2align 4,,10
.p2align 3
.L12:
rep ret
.cfi_endproc
At a glance:
5 instructions in a loop until the loop exits
Method III
Here is Kernighan's method:
int CountOnes_K(unsigned int x) {
    unsigned int count;
    for (count = 0; x; count++) {
        x &= x - 1; // clear the least significant set bit
    }
    return count;
}
Here's the disassembly:
CountOnes_k:
.LFB22:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L19
.p2align 4,,10
.p2align 3
.L18:
leal -1(%rdi), %edx
addl $1, %eax
andl %edx, %edi
jne .L18 ; loop is here
rep ret
.p2align 4,,10
.p2align 3
.L19:
rep ret
.cfi_endproc
At a glance
3 instructions in a loop.
Some commentary before continuing
As you can see, the compiler doesn't really find the best way when you employ % to count (which is what both you and your teacher used).
Kernighan's method is pretty well optimized (fewest operations in the loop). It is instructive to compare Kernighan's method to the naive way of counting; while on the surface it may look the same, it really isn't:
for (c = 0; v; v >>= 1)
{
    c += v & 1;
}
This method sucks compared to Kernighan's: if you have, say, only the 32nd bit set, this loop will run 32 times, whereas Kernighan's will run only once!
But all these methods are still rather sub-par because they loop.
If we combine a couple of other pieces of (implicit) knowledge into our algorithm, we can get rid of loops altogether: namely, the size of our number in bits and the size of a character in bits. With these pieces, and by realizing that we can filter out bits in chunks of 14, 24 or 32 bits given that we have a 64-bit register, we can do much better.
So, for instance, if we look at a 14-bit number, then we can simply count the bits by:
(n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
This uses % but only once, and works for all numbers between 0x0 and 0x3fff.
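As a quick sanity check, the 14-bit trick can be wrapped up like this (the function name is mine; the formula is only valid for 0 <= n <= 0x3fff):
static inline unsigned count_bits14(unsigned n)
{
    return (n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
}
/* e.g. count_bits14(0x3fff) == 14, count_bits14(0x0001) == 1 */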
For 24 bits we use 14 bits and then something similar for the remaining 10 bits:
((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f
+ (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;
But we can generalize this concept by noticing the patterns in the numbers above and realizing that the magic numbers are actually just complements (look at the hex numbers closely: 0x8000 + 0x400 + 0x200 + 0x1) shifted.
We can generalize and then shrink the ideas here, giving us the most optimized method for counting bits (up to 128 bits, no loops, O(1)):
int CountOnes_best(unsigned int n) {
    const unsigned char_bits = sizeof(unsigned char) << 3;
    typedef __typeof__(n) T; // T is unsigned int in this case
    n = n - ((n >> 1) & (T)~(T)0/3);                        // reuse n as a temporary
    n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
    n = (n + (n >> 4)) & (T)~(T)0/255*15;
    return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
}
CountOnes_best:
.LFB23:
.cfi_startproc
movl %edi, %eax
shrl %eax
andl $1431655765, %eax
subl %eax, %edi
movl %edi, %edx
shrl $2, %edi
andl $858993459, %edx
andl $858993459, %edi
addl %edx, %edi
movl %edi, %ecx
shrl $4, %ecx
addl %edi, %ecx
andl $252645135, %ecx
imull $16843009, %ecx, %eax
shrl $24, %eax
ret
.cfi_endproc
This may be a bit of a jump ("how the heck did you get from the previous method to here?"), but just take your time to go over it.
The most optimized method was first mentioned in the Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors; my URL for that is broken. It is also well explained on the very excellent C bit twiddling page.
I highly recommend going over the content of that page it really is a fantastic read.
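As an aside (not part of the original answer): modern gcc and clang also expose a builtin for this, which typically compiles down to a single POPCNT instruction when the target CPU supports it (for example with -mpopcnt or -march=native on x86):
int CountOnes_builtin(unsigned int n)
{
    return __builtin_popcount(n);   /* counts set bits; gcc/clang builtin */
}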
Even better than your teacher's suggestion:
if( n & 1 ) {
    ++ CountOnes;
}
else {
    ++ CountZeros;
}
n % 2 has an implicit divide operation, which the compiler is likely to optimise, but you should not rely on it: divide is a complex operation that takes longer on some platforms. Moreover, there are only two options, 1 or 0, so if it is not a one, it is a zero; there is no need for the second test in the else block.
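A small illustration of why the two tests are not automatically interchangeable (not from the original answer): C defines % to truncate toward zero, so for negative n, n % 2 is 0 or -1, never 1, while n & 1 looks at the raw low bit:
int n  = -3;
int r1 = n % 2;   /* -1: truncating division, so the test (n % 2 == 1) is false */
int r2 = n & 1;   /*  1 on two's-complement machines                            */
With an unsigned operand (or a check written as n % 2 != 0), a good compiler will usually reduce the % to the same single-bit test.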
Your original code is overly complex and hard to follow. If you want to assess the "efficiency" of an algorithm, consider the number of operations performed per iteration, the number of iterations, and also the number of variables involved. In your case there are 10 operations per iteration and three variables (but you omitted to count the zeros, so you'd need four variables to complete the assignment). The following:
unsigned int n = x; // number to be modified
int ones = 0;
int zeroes = 0;
while (n > 0)
{
    if ((n & 1) != 0)
    {
        ++ones;
    }
    else
    {
        ++zeroes;
    }
    n >>= 1;
}
has only 7 operations (counting >>= as two - shift and assign). More importantly perhaps, it is much easier to follow.

intro to x86 assembly

I'm looking over an example on assembly in CSAPP (Computer Systems: A Programmer's Perspective, 2nd edition) and I just want to know if my understanding of the assembly code is correct.
Practice problem 3.23
int fun_b(unsigned x) {
    int val = 0;
    int i;
    for ( ____;_____;_____) {
    }
    return val;
}
The gcc C compiler generates the following assembly code:
x at %ebp+8
// what I've gotten so far
1 movl 8(%ebp), %ebx // ebx: x
2 movl $0, %eax // eax: val, set to 0 since eax is where the return
// value is stored and val is being returned at the end
3 movl $0, %ecx // ecx: i, set to 0
4 .L13: // loop
5 leal (%eax,%eax), %edx // edx = val+val
6 movl %ebx, %eax // val = x (?)
7 andl $1, %eax // x = x & 1
8 orl %edx, %eax // x = (val+val) | (x & 1)
9 shrl %ebx Shift right by 1 // x = x >> 1
10 addl $1, %ecx // i++
11 cmpl $32, %ecx // if i < 32 jump back to loop
12 jne .L13
There was a similar post on the same problem with the solution, but I'm looking for more of a walk-through and explanation of the assembly code line by line.
You already seem to have the meaning of the instructions figured out. The comments on lines 7-8 are slightly wrong, however, because those instructions assign to eax, which is val, not x:
7 andl $1, %eax // val = val & 1 = x & 1
8 orl %edx, %eax // val = (val+val) | (x & 1)
Putting this into the C template could be:
for(i = 0; i < 32; i++, x >>= 1) {
    val = (val + val) | (x & 1);
}
Note that (val + val) is just a left shift by one, so what this function is doing is shifting bits out of x on the right and shifting them into val from the right. As such, it's mirroring the bits.
PS: if the body of the for must be empty you can of course merge it into the third expression.
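That merged, empty-body form could look like this (same semantics, since the comma operator evaluates left to right, so val picks up the current bit of x before x is shifted):
for (i = 0; i < 32; val = (val + val) | (x & 1), x >>= 1, i++)
    ;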

Assembly language to C

So I have the following assembly language code which I need to convert into C. I am confused on a few lines of the code.
I understand that this is a for loop. I have added my comments on each line.
I think the for loop goes like this
for (int i = 1; i > 0; i << what?) {
//Calculate result
}
What is the test condition? And how do I change it?
Looking at the assembly code, what does the variable 'n' do?
This is x86 in AT&T syntax, so the operand order is movl source, dest.
movl 8(%ebp), %esi //Get x
movl 12(%ebp), %ebx //Get n
movl $-1, %edi //This should be result
movl $1, %edx //The i of the loop
.L2:
movl %edx, %eax
andl %esi, %eax
xorl %eax, %edi //result = result ^ (i & x)
movl %ebx, %ecx //Why do we do this? We never use %ebx or %ecx again
sall %cl, %edx //Where did %cl come from?
testl %edx, %edx //Tests if i != what? - condition of the for loop
jne .L2 //Loop again
movl %edi, %eax //Otherwise return result.
sall %cl, %edx shifts %edx left by %cl bits. (%cl, for reference, is the low byte of %ecx.) The subsequent testl tests whether that shift zeroed out %edx.
The jne is called that because it's often used in the context of comparisons, which in ASM are often just subtractions. The flags would be set based on the difference; ZF would be set if the items are equal (since x - x == 0). It's also called jnz in Intel syntax; I'm not sure whether GNU allows that too.
All together, the three instructions translate to i <<= n; if (i != 0) goto L2;. That plus the label seem to make a for loop.
for (i = 1; i != 0; i <<= n) { result ^= i & x; }
Or, more correctly (but achieving the same goal), a do...while loop.
i = 1;
do { result ^= i & x; i <<= n; } while (i != 0);
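Putting it all together as a complete function (a sketch; the parameter and variable names follow the comments in the question, and i is made unsigned so the repeated left shift can legitimately reach zero):
int f(int x, int n)
{
    int result = -1;          /* movl $-1, %edi                   */
    unsigned i = 1;           /* movl $1, %edx                    */
    do {
        result ^= i & x;      /* andl %esi, %eax; xorl %eax, %edi */
        i <<= n;              /* sall %cl, %edx                   */
    } while (i != 0);         /* testl %edx, %edx; jne .L2        */
    return result;            /* movl %edi, %eax                  */
}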

Could someone help explain what this C one liner does?

I can usually figure out most C code but this one is over my head.
#define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x))
an example usage would be something like:
int x = 57;
kroundup32(x);
//x is now 64
A few other examples are:
1 to 1
2 to 2
7 to 8
31 to 32
60 to 64
3000 to 4096
I know it's rounding an integer up to its nearest power of 2, but that's about as far as my knowledge goes.
Any explanations would be greatly appreciated.
Thanks
(--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x))
Decrease x by 1
OR x with (x / 2).
OR x with (x / 4).
OR x with (x / 16).
OR x with (x / 256).
OR x with (x / 65536).
Increase x by 1.
For a 32-bit unsigned integer, this should move a value up to the closest power of 2 that is equal or greater. The OR sections set all the lower bits below the highest bit, so it ends up as a power of 2 minus one, then you add one back to it. It looks like it's somewhat optimized and therefore not very readable; doing it by bitwise operations and bit shifting alone, and as a macro (so no function call overhead).
The bitwise or and shift operations essentially set every bit between the highest set bit and bit zero. This will produce a number of the form 2^n - 1. The final increment adds one to get a number of the form 2^n. The initial decrement ensures that you don't round numbers which are already powers of two up to the next power, so that e.g. 2048 doesn't become 4096.
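A worked trace for the example value from the question (x = 57, assuming a 32-bit unsigned x):
/* x = 57 = 0b00111001                          */
/* --x          ->  56 = 0b00111000             */
/* x |= x >> 1  ->  60 = 0b00111100             */
/* x |= x >> 2  ->  63 = 0b00111111             */
/* x |= x >> 4  ->  63 (no change from here on) */
/* x |= x >> 8  ->  63                          */
/* x |= x >> 16 ->  63                          */
/* ++x          ->  64                          */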
On my machine kroundup32 gives 6.000M rounds/sec, and the next function gives 7.693M rounds/sec:
inline int scan_msb(int x)
{
#if defined(__i386__) || defined(__x86_64__)
    int y;
    __asm__("bsr %1, %0"
            : "=r" (y)
            : "r" (x)
            : "flags"); /* ZF */
    return y;
#else
#error "Implement me for your platform"
#endif
}

inline int roundup32(int x)
{
    if (x == 0) return x;
    else {
        const int bit = scan_msb(x);
        const int mask = ~((~0) << bit);
        if (x & mask) return (1 << (bit + 1));
        else return (1 << bit);
    }
}
So, @thomasrutter, I wouldn't say that it is "highly optimized".
And the corresponding assembly (only the meaningful part, GCC 4.4.4):
kroundup32:
subl $1, %edi
movl %edi, %eax
sarl %eax
orl %edi, %eax
movl %eax, %edx
sarl $2, %edx
orl %eax, %edx
movl %edx, %eax
sarl $4, %eax
orl %edx, %eax
movl %eax, %edx
sarl $8, %edx
orl %eax, %edx
movl %edx, %eax
sarl $16, %eax
orl %edx, %eax
addl $1, %eax
ret
roundup32:
testl %edi, %edi
movl %edi, %eax
je .L6
movl $-1, %edx
bsr %edi, %ecx
sall %cl, %edx
notl %edx
testl %edi, %edx
jne .L10
movl $1, %eax
sall %cl, %eax
.L6:
rep
ret
.L10:
addl $1, %ecx
movl $1, %eax
sall %cl, %eax
ret
For some reason I haven't found an appropriate implementation of scan_msb (something like #define scan_msb(x) if (__builtin_constant_p (x)) ...) within the standard headers of GCC (only __TBB_machine_lg/__TBB_Log2).
