Counting '1' bits in a number in C

My task was to print all whole numbers from 2 to N whose binary representation contains more '1' bits than '0' bits.
int CountOnes(unsigned int x)
{
    unsigned int iPassedNumber = x; // number to be modified
    unsigned int iOriginalNumber = iPassedNumber;
    unsigned int iNumbOfOnes = 0;

    while (iPassedNumber > 0)
    {
        iPassedNumber = iPassedNumber >> 1 << 1; // if LSB was '1', it turns to '0'
        if (iOriginalNumber - iPassedNumber == 1) // if the difference == 1, increment the count of '1's
        {
            ++iNumbOfOnes;
        }
        iOriginalNumber = iPassedNumber >> 1; // move on to the next bit
        iPassedNumber = iOriginalNumber;
    }
    return (iNumbOfOnes);
}
Here is my function to calculate the number of '1' bits in binary. It was homework in college. However, my teacher said that it would be more efficient to do:
{
    if (n % 2 == 1)
        ++CountOnes;
    else // n % 2 == 0
        ++CountZeros;
}
In the end, I just got confused and don't know which is better. What do you think about this?

I used the gcc compiler for the experiments below. Your compiler may be different, so you may have to do things a bit differently to get a similar effect.
When trying to figure out the most optimized way of doing something, you want to see what kind of code the compiler actually produces, and to look at the CPU's manual to see which operations are fast and which are slow on that particular architecture (although there are general guidelines). And of course, look for ways to reduce the number of instructions the CPU has to perform.
I decided to show you a few different methods (not exhaustive) and to give you a sample of how to go about manually examining the optimization of small functions like this one. There are more sophisticated tools that help with larger and more complex functions; however, this approach should work with pretty much anything:
Note
All assembly code was produced using:
gcc -O99 -o foo -fprofile-generate foo.c
followed by
gcc -O99 -o foo -fprofile-use foo.c
On -fprofile-generate:
The double compile really lets gcc go to work (although -O99 most likely does that already); mileage may vary based on which version of gcc you are using.
On with it:
Method I (you)
Here is the disassembly of your function:
CountOnes_you:
.LFB20:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L5
.p2align 4,,10
.p2align 3
.L4:
movl %edi, %edx
xorl %ecx, %ecx
andl $-2, %edx
subl %edx, %edi
cmpl $1, %edi
movl %edx, %edi
sete %cl
addl %ecx, %eax
shrl %edi
jne .L4
rep ret
.p2align 4,,10
.p2align 3
.L5:
rep ret
.cfi_endproc
At a glance
Approximately 9 instructions in a loop, until the loop exits
Method II (teacher)
Here is a function which uses your teacher's algo:
int CountOnes_teacher(unsigned int x)
{
    unsigned int one_count = 0;

    while (x) {
        if (x % 2)
            ++one_count;
        x >>= 1;
    }
    return one_count;
}
Here's the disassembly of that:
CountOnes_teacher:
.LFB21:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L12
.p2align 4,,10
.p2align 3
.L11:
movl %edi, %edx
andl $1, %edx
cmpl $1, %edx
sbbl $-1, %eax
shrl %edi
jne .L11
rep ret
.p2align 4,,10
.p2align 3
.L12:
rep ret
.cfi_endproc
At a glance:
5 instructions in a loop until the loop exits
Method III
Here is Kernighan's method:
int CountOnes_K(unsigned int x)
{
    unsigned int count;

    for (count = 0; x; count++) {
        x &= x - 1; // clear the least significant set bit
    }
    return count;
}
Here's the disassembly:
CountOnes_k:
.LFB22:
.cfi_startproc
xorl %eax, %eax
testl %edi, %edi
je .L19
.p2align 4,,10
.p2align 3
.L18:
leal -1(%rdi), %edx
addl $1, %eax
andl %edx, %edi
jne .L18 ; loop is here
rep ret
.p2align 4,,10
.p2align 3
.L19:
rep ret
.cfi_endproc
At a glance
3 instructions in a loop.
Some commentary before continuing
As you can see, the compiler doesn't really use the best way when you employ % to count (which both you and your teacher did).
Kernighan's method is pretty well optimized (fewest operations in the loop). It is instructive to compare Kernighan's method to the naive way of counting; while on the surface they may look the same, they're really not!
for (c = 0; v; v >>= 1)
{
    c += v & 1;
}
This method sucks compared to Kernighan's. Here, if you have say only the 32nd bit set, the loop will run 32 times, whereas Kernighan's runs once per set bit, as the comparison below illustrates!
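To make the difference concrete, here is a small harness (a sketch; the helper names are mine) that counts the loop iterations of both methods for a value with only the 32nd bit set:
#include <stdio.h>

static int naive_iters(unsigned int v)
{
    int i = 0;
    for (; v; v >>= 1)
        i++;
    return i;
}

static int kernighan_iters(unsigned int v)
{
    int i = 0;
    for (; v; v &= v - 1)
        i++;
    return i;
}

int main(void)
{
    /* only the top bit set: the naive loop walks all 32 bit positions,
       Kernighan's loop runs once per set bit */
    printf("naive: %d, kernighan: %d\n",
           naive_iters(0x80000000u), kernighan_iters(0x80000000u));
    return 0;
}
This prints "naive: 32, kernighan: 1".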
But all these methods are still rather sub-par because they loop.
If we fold a couple of other pieces of (implicit) knowledge into our algorithm, we can get rid of loops altogether: namely, the size of our number in bits and the size of a character in bits. With these pieces, and by realizing that we can filter out bits in chunks of 14, 24 or 32 bits given that we have a 64-bit register, we can do the following.
So for instance, if we look at a 14-bit number then we can simply count the bits by:
(n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
This uses % but only once, and it works for all numbers between 0x0 and 0x3fff.
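As a sanity check, here is a hedged sketch (the function names are mine) that wraps the expression above and verifies it against a simple loop over the entire 14-bit range:
#include <assert.h>
#include <stdio.h>

static int popcount14(unsigned int n)
{
    /* the 14-bit trick from above */
    return (int)((n * 0x200040008001ULL & 0x111111111111111ULL) % 0xf);
}

static int popcount_loop(unsigned int n)
{
    int c = 0;
    for (; n; n >>= 1)
        c += n & 1;
    return c;
}

int main(void)
{
    unsigned int n;
    for (n = 0; n <= 0x3fff; n++)
        assert(popcount14(n) == popcount_loop(n));
    printf("14-bit trick verified for 0x0..0x3fff\n");
    return 0;
}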
For 24 bits we apply the same idea twice, once to the low 12 bits and once to the high 12 bits:
((n & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f
+ (((n & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
% 0x1f;
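Wrapped as a function (a sketch; the name is mine), each half of the sum counts 12 of the 24 bits:
static int popcount24(unsigned int n)
{
    return (int)(((n & 0xfff) * 0x1001001001001ULL
                    & 0x84210842108421ULL) % 0x1f)
         + (int)((((n & 0xfff000) >> 12) * 0x1001001001001ULL
                    & 0x84210842108421ULL) % 0x1f);
}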
But we can generalize this concept by noticing the pattern in the numbers above: each magic multiplier is just a single 1 bit replicated at regular, shifted offsets (look at the hex numbers closely: 0x200040008001 is 0x200000000000 + 0x40000000 + 0x8000 + 0x1).
We can generalize and then shrink the ideas here, giving us the most optimized method for counting bits (up to 128 bits) (no loops) O(1):
int CountOnes_best(unsigned int n)
{
    const unsigned char_bits = sizeof(unsigned char) << 3;
    typedef __typeof__(n) T; // T is unsigned int in this case

    n = n - ((n >> 1) & (T)~(T)0/3);                       // reuse n as a temporary
    n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
    n = (n + (n >> 4)) & (T)~(T)0/255*15;
    return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
}
CountOnes_best:
.LFB23:
.cfi_startproc
movl %edi, %eax
shrl %eax
andl $1431655765, %eax
subl %eax, %edi
movl %edi, %edx
shrl $2, %edi
andl $858993459, %edx
andl $858993459, %edi
addl %edx, %edi
movl %edi, %ecx
shrl $4, %ecx
addl %edi, %ecx
andl $252645135, %ecx
imull $16843009, %ecx, %eax
shrl $24, %eax
ret
.cfi_endproc
This may be a bit of a jump ("how the heck did you get from the previous method to here?"), but just take your time to go over it.
The most optimized method was first mentioned in the Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors; my URL for it is broken. It is also well explained on the very excellent C bit twiddling page.
I highly recommend going over the content of that page it really is a fantastic read.
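To tie this back to the original assignment, here is a hedged sketch that uses CountOnes_best from above to print every number from 2 to N with more '1' bits than '0' bits. It assumes zeros are counted only among the significant bits (up to the highest set bit), which the assignment leaves ambiguous, and the bound N = 64 is just an example:
#include <stdio.h>

int CountOnes_best(unsigned int n)
{
    const unsigned char_bits = sizeof(unsigned char) << 3;
    typedef __typeof__(n) T;

    n = n - ((n >> 1) & (T)~(T)0/3);
    n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);
    n = (n + (n >> 4)) & (T)~(T)0/255*15;
    return (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * char_bits;
}

static int bit_width(unsigned int n)
{
    int w = 0;
    for (; n; n >>= 1)
        w++;
    return w;
}

int main(void)
{
    const unsigned int N = 64; /* example bound, not part of the assignment */
    unsigned int n;

    for (n = 2; n <= N; n++) {
        int ones = CountOnes_best(n);
        int zeros = bit_width(n) - ones;
        if (ones > zeros)
            printf("%u\n", n);
    }
    return 0;
}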

Even better than your teacher's suggestion:
if( n & 1 ) {
    ++CountOnes;
}
else {
    ++CountZeros;
}
n % 2 has an implicit divide operation which the compiler is likely to optimise, but you should not rely on it - divide is a complex operation that takes longer on some platforms. Moreover, there are only two possibilities, 1 or 0, so if it is not a one, it is a zero - there is no need for the second test in the else block.
Your original code is overly complex and hard to follow. If you want to assess the "efficiency" of an algorithm, consider the number of operations performed per iteration, the number of iterations, and the number of variables involved. In your case there are 10 operations per iteration and three variables (but you omitted counting the zeros, so you'd need a fourth variable to complete the assignment). The following:
unsigned int n = x; // number to be modified
int ones = 0;
int zeroes = 0;

while( n > 0 )
{
    if( (n & 1) != 0 )
    {
        ++ones;
    }
    else
    {
        ++zeroes;
    }
    n >>= 1;
}
has only 7 operations (counting >>= as two - shift and assign). More importantly perhaps, it is much easier to follow.
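For completeness, here is the same loop wrapped into a self-contained function (a sketch; the names are mine) so it can be compiled and tried directly:
#include <stdio.h>

static void count_bits(unsigned int x, int *ones, int *zeroes)
{
    unsigned int n = x;

    *ones = 0;
    *zeroes = 0;
    while (n > 0) {
        if ((n & 1) != 0)
            ++*ones;
        else
            ++*zeroes;
        n >>= 1;
    }
}

int main(void)
{
    int ones, zeroes;

    count_bits(0xF0u, &ones, &zeroes);           /* 11110000 in binary */
    printf("ones=%d zeroes=%d\n", ones, zeroes); /* ones=4 zeroes=4 */
    return 0;
}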

Related

Trying to reverse engineer a function

I'm trying to understand assembly in x86 more. I have a mystery function here that I know returns an int and takes an int argument.
So it looks like int mystery(int n){}. I can't figure out the function in C however. The assembly is:
mov %edi, %eax
lea 0x0(,%rdi, 8), %edi
sub %eax, %edi
add $0x4, %edi
callq <mystery_util>
repz retq
<mystery_util>:
mov %edi, %eax
shr %eax
and $0x1, %edi
and %edi, %eax
retq
I don't understand what the lea does here and what kind of function it could be.
The assembly code appears to be computer generated, probably compiled by GCC, since there is a repz retq after an unconditional branch (call). The fact that a call (rather than a tail-call jmp) is used to reach mystery_util suggests the code was compiled with -O1 (higher optimization levels would likely have inlined the function, which didn't happen here). The lack of frame pointers and of extra loads/stores indicates it wasn't compiled with -O0.
Multiplying x by 7 is the same as multiplying x by 8 and subtracting x. That is what the following code is doing:
lea 0x0(,%rdi, 8), %edi
sub %eax, %edi
LEA can compute addresses but it can be used for simple arithmetic as well. The syntax for a memory operand is displacement(base, index, scale). Scale can be 1, 2, 4, or 8. The computation is displacement + base + index * scale. In your case lea 0x0(,%rdi, 8), %edi is effectively EDI = 0x0 + RDI * 8, or EDI = RDI * 8. The full calculation is n * 7 + 4.
The calculation for mystery_util appears to simply be
n &= (n>>1) & 1;
If I take all these factors together we have a function mystery that passes n * 7 + 4 to a function called mystery_util that returns n &= (n>>1) & 1.
Since mystery_util returns a single bit value (0 or 1) it is reasonable that bool is the return type.
I was curious if I could get a particular version of GCC with optimization level 1 (-O1) to reproduce this assembly code. I discovered that GCC 4.9.x will yield this exact assembly code for this given C program:
#include <stdbool.h>

bool mystery_util(unsigned int n)
{
    n &= (n>>1) & 1;
    return n;
}

bool mystery(unsigned int n)
{
    return mystery_util(7*n + 4);
}
The assembly output is:
mystery_util:
movl %edi, %eax
shrl %eax
andl $1, %edi
andl %edi, %eax
ret
mystery:
movl %edi, %eax
leal 0(,%rdi,8), %edi
subl %eax, %edi
addl $4, %edi
call mystery_util
rep ret
You can play with this code on godbolt.
Important Update - Version without bool
I apparently erred in interpreting the question. I assumed the person asking this question had determined by themselves that the prototype for mystery was int mystery(int n). I thought I could change that. According to a related question asked on Stack Overflow a day later, it seems int mystery(int n) was given as the prototype as part of the assignment. This is important because it means that a modification has to be made.
The change that needs to be made is related to mystery_util. In the code to be reverse engineered are these lines:
mov %edi, %eax
shr %eax
EDI is the first parameter. SHR is a logical shift right. Compilers would only generate this if EDI were an unsigned int (or equivalent). int is a signed type and would generate SAR (arithmetic shift right). This means that the parameter for mystery_util has to be unsigned int (and it follows that the return value is likely unsigned int as well). That means the code would look like this:
unsigned int mystery_util(unsigned int n)
{
    n &= (n>>1) & 1;
    return n;
}

int mystery(int n)
{
    return mystery_util(7*n + 4);
}
mystery now has the prototype given by your professor (bool is removed) and we use unsigned int for the parameter and return type of mystery_util. In order to generate this code with GCC 4.9.x I found you need to use -O1 -fno-inline. This code can be found on godbolt. The assembly output is the same as the version using bool.
If you use unsigned int mystery_util(int n) you would discover that it doesn't quite output what we want:
mystery_util:
movl %edi, %eax
sarl %eax ; <------- SAR (arithmetic shift right) is not SHR
andl $1, %edi
andl %edi, %eax
ret
The LEA is just a left-shift by 3, truncating the result to 32 bits (writing to EDI implicitly zero-extends into RDI). x86-64 System V passes the first integer arg in RDI, so all of this is consistent with one int arg. LEA uses memory-operand syntax and machine encoding, but it's really just a shift-and-add instruction. Using it as part of a multiply by a constant is a common compiler optimization for x86.
The compiler that generated this function missed an optimization here; the first mov could have been avoided with
lea 0x0(,%rdi, 8), %eax # n << 3 = n*8
sub %edi, %eax # eax = n*7
lea 4(%rax), %edi # rdi = 4 + n*7
But instead, the compiler got stuck on generating n*7 in %edi, probably because it applied a peephole optimization for the constant multiply too late to redo register allocation.
mystery_util returns the bitwise AND of the low 2 bits of its arg, in the low bit, so a 0 or 1 integer value, which could also be a bool.
(shr with no count means a count of 1; remember that x86 has a special opcode for shifts with an implicit count of 1. 8086 only has counts of 1 or cl; immediate counts were added later as an extension and the implicit-form opcode is still shorter.)
The LEA performs an address computation, but instead of dereferencing the address, it stores the computed address into the destination register.
In AT&T syntax, lea C(b,c,d), reg means reg = C + b + c*d where C is a constant, and b,c are registers and d is a scalar from {1,2,4,8}. Hence you can see why LEA is popular for simple math operations: it does quite a bit in a single instruction. (*includes correction from prl's comment below)
There are some strange features of this assembly code: the repz prefix is only strictly defined when applied to certain instructions, and retq is not one of them (though the general behavior of the processor is to ignore it); see Michael Petch's comment below with a link for more info. The use of lea (,rdi,8), edi followed by sub eax, edi to compute arg1 * 7 also seemed strange, but it makes sense once prl noted that the scale d has to be a constant power of 2. In any case, here's how I read the snippet:
mov %edi, %eax ; eax = arg1
lea 0x0(,%rdi, 8), %edi ; edi = arg1 * 8
sub %eax, %edi ; edi = (arg1 * 8) - arg1 = arg1 * 7
add $0x4, %edi ; edi = (arg1 * 7) + 4
callq <mystery_util> ; call mystery_util(arg1 * 7 + 4)
repz retq ; repz prefix on return is de facto nop.
<mystery_util>:
mov %edi, %eax ; eax = arg1
shr %eax ; eax = arg1 >> 1
and $0x1, %edi ; edi = 1 iff arg1 was odd, else 0
and %edi, %eax ; eax = 1 iff the lowest 2 bits of arg1 were both 1.
retq
Note that the +4 on the 4th line is entirely spurious: it only changes bits above the two that mystery_util inspects, so it cannot affect the outcome.
So, overall this ASM snippet computes the boolean (arg1 * 7) % 4 == 3.
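As a hedged check of that summary (the function names are mine): the snippet tests whether the two lowest bits of 7*n + 4 are both set, and since +4 cannot change the value mod 4, that is the same as (7*n) % 4 == 3. Unsigned arithmetic is used so the modulo behaves like the bit test:
#include <stdbool.h>
#include <stdio.h>

static bool mystery_bits(unsigned int n)
{
    unsigned int m = 7u * n + 4u;       /* the lea/sub/add sequence */
    return ((m >> 1) & 1u) & (m & 1u);  /* the mystery_util body    */
}

static bool mystery_mod(unsigned int n)
{
    return (7u * n) % 4u == 3u;
}

int main(void)
{
    unsigned int n;

    for (n = 0; n < 1000000u; n++) {
        if (mystery_bits(n) != mystery_mod(n)) {
            printf("mismatch at %u\n", n);
            return 1;
        }
    }
    printf("both forms agree on the tested range\n");
    return 0;
}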

A reference for converting assembly 'shl', 'OR', 'AND', 'SHR' operations into C?

I'm to convert the following AT&T x86 assembly into C:
movl 8(%ebp), %edx
movl $0, %eax
movl $0, %ecx
jmp .L2
.L1:
shll $1, %eax
movl %edx, %ebx
andl $1, %ebx
orl %ebx, %eax
shrl $1, %edx
addl $1, %ecx
.L2:
cmpl $32, %ecx
jl .L1
leave
But must adhere to the following skeleton code:
int f(unsigned int x) {
int val = 0, i = 0;
while(________) {
val = ________________;
x = ________________;
i++;
}
return val;
}
I can tell that the snippet
.L2:
cmpl $32, %ecx
jl .L1
can be interpreted as while(i<32). I also know that x is stored in %edx, val in %eax, and i in %ecx. However, I'm having a hard time converting the assembly within the while/.L1 loop into condensed high-level language that fits into the provided skeleton code. For example, can shll, shrl, orl, and andl simply be written using their direct C equivalents (<<,>>,|,&), or is there some more nuance to it?
Is there a standardized guide/"cheat sheet" for Assembly-to-C conversions?
I understand assembly to high-level conversion is not always clear-cut, but there are certainly patterns in assembly code that can be consistently interpreted as certain C operations.
For example, can shll, shrl, orl, and andl simply be written using
their direct C equivalents (<<,>>,|,&), or is there some more nuance
to it?
They can. Let's examine the loop body step by step:
shll $1, %eax // shift eax left by 1, same as "eax <<= 1" or even "eax *= 2"
movl %edx, %ebx
andl $1, %ebx // ebx &= 1
orl %ebx, %eax // eax |= ebx
shrl $1, %edx // shift edx right by 1, same as "edx >>= 1" or "edx /= 2"
gets us to
%eax *=2
%ebx = %edx
%ebx = %ebx & 1
%eax |= %ebx
%edx /= 2
The ABI tells us (from movl 8(%ebp), %edx) that %edx is x, and %eax (the return value) is val:
val *=2
%ebx = x // a
%ebx = %ebx & 1 // b
val |= %ebx // c
x /= 2
Combine a, b, c. Step 1: substitute a into b:
val *=2
%ebx = (x & 1) // b
val |= %ebx // c
x /= 2
Step 2: substitute b into c:
val *=2
val |= (x & 1)
x /= 2
final step: combine both 'val =' into one
val = 2*val | (x & 1)
x /= 2
while (i < 32) { val = (val << 1) | (x & 1); x = x >> 1; i++; }
except that val and the return value should be unsigned, and they aren't in your template. The function returns the bits of x reversed.
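For clarity, here is the provided skeleton with those blanks filled in (keeping the template's int types, with the signedness caveat just mentioned):
int f(unsigned int x)
{
    int val = 0, i = 0;
    while (i < 32) {
        val = (val << 1) | (x & 1); /* shll / andl / orl */
        x = x >> 1;                 /* shrl              */
        i++;
    }
    return val;
}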
The actual answer to your question is more complicated, and it is pretty much: no, there is no such guide, and there can't be one, because compilation loses information and you can't recreate that lost information from the assembly. But you can often make a good educated guess.

GCC optimization missed opportunity

I'm compiling this C code:
int mode; // use aa if true, else bb
int aa[2];
int bb[2];
inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }
int slow() { return auto1() - auto0(); }
int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
Both slow() and fast() functions are meant to do the same thing, though fast() does it with one branch statement instead of two. I wanted to check if GCC would collapse the two branches into one. I've tried this with GCC 4.4 and 4.7, with various levels of optimization such as -O2, -O3, -Os, and -Ofast. It always gives the same strange results:
slow():
movl mode(%rip), %ecx
testl %ecx, %ecx
je .L10
movl aa+4(%rip), %eax
movl aa(%rip), %edx
subl %edx, %eax
ret
.L10:
movl bb+4(%rip), %eax
movl bb(%rip), %edx
subl %edx, %eax
ret
fast():
movl mode(%rip), %esi
testl %esi, %esi
jne .L18
movl bb+4(%rip), %eax
subl bb(%rip), %eax
ret
.L18:
movl aa+4(%rip), %eax
subl aa(%rip), %eax
ret
Indeed, only one branch is generated in each function. However, slow() seems to be inferior in a surprising way: it uses one extra load in each branch, for aa[0] and bb[0]. The fast() code uses them straight from memory in the subls without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
Edit: here are the results with clang 4.2 on Mac OS:
slow():
movq _aa@GOTPCREL(%rip), %rax ; rax = aa (both ints at once)
movq _bb@GOTPCREL(%rip), %rcx ; rcx = bb
movq _mode@GOTPCREL(%rip), %rdx ; rdx = mode
cmpl $0, (%rdx) ; mode == 0 ?
leaq 4(%rcx), %rdx ; rdx = bb[1]
cmovneq %rax, %rcx ; if (mode != 0) rcx = aa
leaq 4(%rax), %rax ; rax = aa[1]
cmoveq %rdx, %rax ; if (mode == 0) rax = bb
movl (%rax), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
fast():
movq _mode@GOTPCREL(%rip), %rax ; rax = mode
cmpl $0, (%rax) ; mode == 0 ?
je LBB1_2 ; if (mode != 0) {
movq _aa@GOTPCREL(%rip), %rcx ; rcx = aa
jmp LBB1_3 ; } else {
LBB1_2: ; // (mode == 0)
movq _bb@GOTPCREL(%rip), %rcx ; rcx = bb
LBB1_3: ; }
movl 4(%rcx), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
Interesting: clang generates branchless conditionals for slow() but one branch for fast()! On the other hand, slow() does three loads (two of which are speculative, one will be unnecessary) vs. two for fast(). The fast() implementation is more "obvious," and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.
The reason is that in the initial intermediate code, emitted for slow(), the memory load and the subtraction are in different basic blocks:
slow ()
{
int D.1405;
int mode.3;
int D.1402;
int D.1379;
# BLOCK 2 freq:10000
mode.3_5 = mode;
if (mode.3_5 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:5000
D.1402_6 = aa[1];
D.1405_10 = aa[0];
goto <bb 5>;
# BLOCK 4 freq:5000
D.1402_7 = bb[1];
D.1405_11 = bb[0];
# BLOCK 5 freq:10000
D.1379_3 = D.1402_17 - D.1405_12;
return D.1379_3;
}
whereas in fast() they are in the same basic block:
fast ()
{
int D.1377;
int D.1376;
int D.1374;
int D.1373;
int mode.1;
int D.1368;
# BLOCK 2 freq:10000
mode.1_2 = mode;
if (mode.1_2 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:3900
D.1373_3 = aa[1];
D.1374_4 = aa[0];
D.1368_5 = D.1373_3 - D.1374_4;
goto <bb 5>;
# BLOCK 4 freq:6100
D.1376_6 = bb[1];
D.1377_7 = bb[0];
D.1368_8 = D.1376_6 - D.1377_7;
# BLOCK 5 freq:10000
return D.1368_1;
}
GCC relies on the instruction combining pass to handle cases like this (i.e. apparently not on the peephole optimization pass), and combining works within the scope of a basic block. That's why the subtraction and load are combined into a single insn in fast(), while they aren't even considered for combining in slow().
Later, in the basic block reordering pass, the subtraction in slow() is duplicated and moved into the basic blocks that contain the loads. Now there's an opportunity for the combiner to, well, combine the load and the subtraction, but unfortunately the combiner pass is not run again (and perhaps it cannot be run that late in the compilation process, with hard registers already allocated and so on).
I don't have an answer as to why GCC is unable to optimize the code the way you want it to, but I have a way to re-organize your code to achieve similar performance. Instead of organizing your code the way you have done so in slow() or fast(), I would recommend that you define an inline function that returns either aa or bb based on mode without needing a branch:
inline int * xx () { static int *xx[] = { bb, aa }; return xx[!!mode]; }
inline int kwiky(int *xx) { return xx[1] - xx[0]; }
int kwik() { return kwiky(xx()); }
When compiled by GCC 4.7 with -O3:
movl mode, %edx
xorl %eax, %eax
testl %edx, %edx
setne %al
movl xx.1369(,%eax,4), %edx
movl 4(%edx), %eax
subl (%edx), %eax
ret
With the definition of xx(), you can redefine auto0() and auto1() like so:
inline int auto0() { return xx()[0]; }
inline int auto1() { return xx()[1]; }
And, from this, you should see that slow() now compiles into code similar or identical to kwik().
Have you tried modifying the compiler's internal parameters (--param name=value in the man page)? Those are not changed by any optimization level (with three minor exceptions).
Some of them control code reduction/deduplication.
For some parameters in this section you can read things like "larger values can exponentially increase compilation time".

Assembly language to C

So I have the following assembly language code which I need to convert into C. I am confused about a few lines of the code.
I understand that this is a for loop. I have added my comments on each line.
I think the for loop goes like this
for (int i = 1; i > 0; i << what?) {
//Calculate result
}
What is the test condition? And how do I change it?
Looking at the assembly code, what does the variable 'n' do?
This is x86 in AT&T syntax, so the operand order is movl source, dest
movl 8(%ebp), %esi //Get x
movl 12(%ebp), %ebx //Get n
movl $-1, %edi //This should be result
movl $1, %edx //The i of the loop
.L2:
movl %edx, %eax
andl %esi, %eax
xorl %eax, %edi //result = result ^ (i & x)
movl %ebx, %ecx //Why do we do this? As we never use %ebx or %ecx again
sall %cl, %edx //Where did %cl come from?
testl %edx, %edx //Tests if i != what? - condition of the for loop
jne .L2 //Loop again
movl %edi, %eax //Otherwise return result.
sall %cl, %edx shifts %edx left by %cl bits. (%cl, for reference, is the low byte of %ecx.) The subsequent testl tests whether that shift zeroed out %edx.
The jne is called that because it's often used in the context of comparisons, which in ASM are often just subtractions. The flags are set based on the difference; ZF is set if the items are equal (since x - x == 0). It's also called jnz in Intel syntax; I'm not sure whether GNU allows that too.
All together, the three instructions translate to i <<= n; if (i != 0) goto L2;. That plus the label seem to make a for loop.
for (i = 1; i != 0; i <<= n) { result ^= i & x; }
Or, more correctly (but achieving the same goal), a do...while loop.
i = 1;
do {
    result ^= i & x;
    i <<= n;
} while (i != 0);
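Putting it all together, the routine could be reconstructed roughly as follows (a sketch: i is made unsigned so that the left shift wrapping to 0 is well-defined, and 0 < n < 32 is assumed, since C's << is undefined for larger counts while x86's sall masks the count):
int f(int x, int n)
{
    int result = -1;        /* movl $-1, %edi  */
    unsigned int i = 1;     /* movl $1, %edx   */

    do {
        result ^= i & x;    /* andl / xorl     */
        i <<= n;            /* sall %cl, %edx  */
    } while (i != 0);       /* testl / jne     */
    return result;          /* movl %edi, %eax */
}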

Could someone help explain what this C one liner does?

I can usually figure out most C code but this one is over my head.
#define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x))
an example usage would be something like:
int x = 57;
kroundup32(x);
//x is now 64
A few other examples are:
1 to 1
2 to 2
7 to 8
31 to 32
60 to 64
3000 to 4096
I know it rounds an integer up to its nearest power of 2, but that's about as far as my knowledge goes.
Any explanations would be greatly appreciated.
Thanks
(--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x))
Decrease x by 1
OR x with (x / 2).
OR x with (x / 4).
OR x with (x / 16).
OR x with (x / 256).
OR x with (x / 65536).
Increase x by 1.
For a 32-bit unsigned integer, this should move a value up to the closest power of 2 that is equal or greater. The OR sections set all the lower bits below the highest bit, so it ends up as a power of 2 minus one, then you add one back to it. It looks like it's somewhat optimized and therefore not very readable; doing it by bitwise operations and bit shifting alone, and as a macro (so no function call overhead).
The bitwise or and shift operations essentially set every bit between the highest set bit and bit zero. This will produce a number of the form 2^n - 1. The final increment adds one to get a number of the form 2^n. The initial decrement ensures that you don't round numbers which are already powers of two up to the next power, so that e.g. 2048 doesn't become 4096.
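Here is a short trace of those steps for the 57 -> 64 example from the question (a hedged sketch assuming 32-bit unsigned values):
#include <stdio.h>
#include <stdint.h>

#define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, \
                       (x)|=(x)>>8, (x)|=(x)>>16, ++(x))

int main(void)
{
    uint32_t y = 57 - 1;   /* 56: 111000 */
    y |= y >> 1;           /* 60: 111100 */
    y |= y >> 2;           /* 63: 111111 */
    y |= y >> 4;           /* 63: higher shifts change nothing here */
    y |= y >> 8;
    y |= y >> 16;
    printf("%u\n", y + 1); /* 64 */

    uint32_t x = 57;
    kroundup32(x);
    printf("%u\n", x);     /* 64, same result via the macro */
    return 0;
}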
On my machine kroundup32 gives 6.000M rounds/sec.
And the next function gives 7.693M rounds/sec:
inline int scan_msb(int x)
{
#if defined(__i386__) || defined(__x86_64__)
    int y;
    __asm__("bsr %1, %0"
            : "=r" (y)
            : "r" (x)
            : "flags"); /* ZF */
    return y;
#else
#error "Implement me for your platform"
#endif
}

inline int roundup32(int x)
{
    if (x == 0) return x;
    else {
        const int bit = scan_msb(x);
        const int mask = ~((~0) << bit);

        if (x & mask) return (1 << (bit + 1));
        else return (1 << bit);
    }
}
So, @thomasrutter, I wouldn't say that it is "highly optimized".
And here is the corresponding assembly (only the meaningful part), for GCC 4.4.4:
kroundup32:
subl $1, %edi
movl %edi, %eax
sarl %eax
orl %edi, %eax
movl %eax, %edx
sarl $2, %edx
orl %eax, %edx
movl %edx, %eax
sarl $4, %eax
orl %edx, %eax
movl %eax, %edx
sarl $8, %edx
orl %eax, %edx
movl %edx, %eax
sarl $16, %eax
orl %edx, %eax
addl $1, %eax
ret
roundup32:
testl %edi, %edi
movl %edi, %eax
je .L6
movl $-1, %edx
bsr %edi, %ecx
sall %cl, %edx
notl %edx
testl %edi, %edx
jne .L10
movl $1, %eax
sall %cl, %eax
.L6:
rep
ret
.L10:
addl $1, %ecx
movl $1, %eax
sall %cl, %eax
ret
For some reason I haven't found an appropriate implementation of scan_msb (like #define scan_msb(x) if (__builtin_constant_p (x)) ...) within the standard headers of GCC (only __TBB_machine_lg/__TBB_Log2).
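For what it's worth, GCC does expose a builtin that can serve as scan_msb; a hedged sketch (like BSR, the result is undefined for x == 0, and a 32-bit unsigned int is assumed):
static inline int scan_msb_builtin(unsigned int x)
{
    /* __builtin_clz counts leading zero bits; undefined for x == 0 */
    return 31 - __builtin_clz(x);
}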
