Left shift of only part of a number - c

I need to find the fastest equivalent of the following C code.
int d = 1 << x; /* d=pow(2,x) */
int j = 2*d*(i / d) + (i % d);
What I thought of doing is to shift the upper 32 - x bits of i left by one.
For example the following i with x=5:
1010 1010 1010 1010
will become (the shifted-out bit carries into bit 16):
1 0101 0101 0100 1010
Is there an assembly command for that? How can I perform this operation fast?

Divisions are slow:
int m = (1 << x) - 1;
int j = (i << 1) - (i & m);
Update:
Or probably faster:
int j = i + (i & (~0 << x));
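As a sanity check, here is a throwaway harness (mine, not part of the answer) that brute-forces both identities against the original division/modulo formula:
#include <assert.h>
#include <stdio.h>

int main(void)
{
    for (int x = 0; x < 15; x++) {
        int d = 1 << x;
        int m = d - 1;

        for (int i = 0; i < 65536; i++) {
            int ref = 2 * d * (i / d) + (i % d); /* original formula */

            assert(((i << 1) - (i & m)) == ref);  /* subtraction form */
            assert((i + (i & (~0 << x))) == ref); /* addition form; ~0 << x is the
                                                     answer's expression (strictly
                                                     implementation-defined for signed ~0) */
        }
    }
    puts("all three forms agree");
    return 0;
}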

x86 32-bit assembly (AT&T syntax):
/* int MaskedShiftByOne(int val, int lowest_bit_to_shift) */
mov 8(%esp), %ecx
mov $1, %eax
shl %cl, %eax             # does 1 << lowest_bit_to_shift (the count must be in %cl)
mov 4(%esp), %ecx
dec %eax                  # (1 << ...) - 1 == 0xf..f (lower bitmask)
mov %eax, %edx
not %edx                  # complement - higher mask
and %ecx, %edx            # higher bits
and %ecx, %eax            # lower bits
lea (%eax, %edx, 2), %eax # low + 2 * high
ret
This should work both on Linux and Windows.
Edit: the i + (i & (~0 << x)) is shorter:
mov 4(%esp), %edx         # i
mov 8(%esp), %ecx         # x (the shift count must be in %cl)
mov $-1, %eax
shl %cl, %eax             # ~0 << x
and %edx, %eax            # i & (~0 << x)
add %edx, %eax            # + i
ret
Moral: Don't ever start with assembly. If you really need it, disassemble highly-optimized compiler output ...

Shift the upper 32 - x bits left by one.
unsigned i = 0xAAAAu;
int x = 5;
i = (i & ((1 << x) - 1)) | ((i & ~((1 << x) - 1)) << 1); // 0x1554A;
Some explanations:
(1 << x) - 1 makes a mask that zeroes the upper 32 - x bits.
~((1 << x) - 1) makes a mask that zeroes the lower x bits.
After the bits are zeroed, we shift the upper part and OR the two together.
Try this on Codepad.

int m = (1 << x) - 1;
int j = ((i & ~m) << 1) | (i & m);
There is no single assembly instruction that does what you want, but the solution above is quicker than the original since it avoids the division.
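Wrapped up as a compilable function (the name and the unsigned types are mine, purely illustrative):
/* Insert a zero bit at position x: bits above x shift left by one,
   bits below x stay put. */
unsigned insert_zero_bit(unsigned i, unsigned x)
{
    unsigned m = (1u << x) - 1; /* mask selecting the low x bits */

    return ((i & ~m) << 1) | (i & m);
}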

Intel syntax:
mov ecx,[esp+4] ; ecx = x
mov eax,[esp+8] ; eax = i
ror eax,cl      ; rotate the low x bits up to the top
inc cl          ; count = x + 1 (inc doesn't touch CF)
clc             ; clear CF (ror left the last rotated bit there)
rcl eax,cl      ; 33-bit rotate through CF: the high part moves up one, the low bits drop back down
ret
Moral: Highly-optimized compiler output... isn't.

Related

Need help constructing a long loop(long x, int n) function in C given 64-bit assembly instructions

I have the following assembly code from the C function long loop(long x, int n)
with x in %rdi and n in %esi, on a 64-bit machine. I've written comments on what I think the assembly instructions are doing.
loop:
movl %esi, %ecx // store the value of n in register ecx
movl $1, %edx // store the value of 1 in register edx (rdx): the initial mask
movl $0, %eax // store the value of 0 in register eax (rax): the initial return value
jmp .L2
.L3:
movq %rdi, %r8 // store the value of x in register r8
andq %rdx, %r8 // store the value of (x & mask) in r8
orq %r8, %rax // update the return value rax by (x & mask) | [rax]
salq %cl, %rdx // update the mask rdx by [rdx] << n
.L2:
testq %rdx, %rdx // test mask & mask (sets the flags)
jne .L3 // if (mask & mask) != 0, jump to .L3
rep; ret
I have the following C function which needs to correspond to the assembly code:
long loop(long x, int n) {
    long result = _____;
    long mask;
    // for (mask = ______; mask ________; mask = ______) { // filled in as:
    for (mask = 1; mask != 0; mask <<= n) {
        result |= ________;
    }
    return result;
}
I need some help filling in the blanks; I'm not 100% sure what the assembly instructions do, but I've given it my best shot by commenting on each line.
You've pretty much got it in your comments.
long loop(long x, int n) {
    long result = 0;
    long mask;

    for (mask = 1; mask != 0; mask <<= n) {
        result |= (x & mask);
    }
    return result;
}
Because result is the return value, and the return value is stored in %rax, movl $0, %eax loads 0 into result initially.
Inside the for loop, %r8 holds the value that is or'd with result, which, like you mentioned in your comments, is just x & mask.
The function copies every nth bit to result.
For the record, the implementation is full of missed optimizations, especially if we're tuning for Sandybridge-family where bts reg,reg is only 1 uop with 1c latency, but shl %cl is 3 uops. (BTS is also 1 uop on Intel P6 / Atom / Silvermont CPUs)
bts is only 2 uops on AMD K8/Bulldozer/Zen. BTS reg,reg masks the shift count the same way x86 integer shifts do, so bts %rdx, %rax implements rax |= 1ULL << (rdx&0x3f). i.e. setting bit n in RAX.
(This code is clearly designed to be simple to understand, and doesn't even use the most well-known x86 peephole optimization, xor-zeroing, but it's fun to see how we can implement the same thing efficiently.)
More obviously, doing the and inside the loop is bad. Instead we can just build up a mask with every nth bit set, and return x & mask. This has the added advantage that with a non-ret instruction following the conditional branch, we don't need the rep prefix as padding for the ret even if we care about tuning for the branch predictors in AMD Phenom CPUs. (Because it isn't the first byte after a conditional branch.)
# x86-64 System V: x in RDI, n in ESI
mask_bitstride: # give the function a meaningful name
mov $1, %eax # mask = 1
mov %esi, %ecx # unsigned bitidx = n (new tmp var)
# the loop always runs at least 1 iteration, so just fall into it
.Lloop: # do {
bts %rcx, %rax # rax |= 1ULL << bitidx
add %esi, %ecx # bitidx += n
cmp $63, %ecx # sizeof(long)*CHAR_BIT - 1
jbe .Lloop # }while(bitidx <= maxbit); // unsigned condition
and %rdi, %rax # return x & mask
ret # not following a JCC so no need for a REP prefix even on K10
We assume n is in the 0..63 range, because otherwise the C would have undefined behaviour. Outside that range, this implementation differs from the shift-based implementation in the question: the shl version would treat n==64 as an infinite loop, because the shift count 0x40 & 0x3f = 0 means mask would never change, whereas this bitidx += n version would exit after the first iteration, because bitidx immediately becomes greater than 63, i.e. out of range.
A less extreme case: n=65 would make the shl version copy all the bits (effective shift count of 1), while this version would copy only the low bit.
Both versions create an infinite loop for n=0. I used an unsigned compare so negative n will exit the loop promptly.
On Intel Sandybridge-family, the inner loop in the original is 7 uops. (mov = 1 + and=1 + or=1 + variable-count-shl=3 + macro-fused test+jcc=1). This will bottleneck on the front-end, or on ALU throughput on SnB/IvB.
My version is only 3 uops, and can run about twice as fast. (1 iteration per clock.)
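In C, the mask-then-AND idea looks roughly like this (my sketch, assuming 64-bit long as in the x86-64 System V ABI; like both asm versions it loops forever for n == 0, and exits promptly for negative n thanks to the unsigned comparison):
long mask_bitstride(long x, int n)
{
    unsigned long mask = 0;
    unsigned bitidx = 0;

    do {                       /* always runs at least one iteration */
        mask |= 1UL << bitidx; /* set bits 0, n, 2n, ... */
        bitidx += (unsigned)n;
    } while (bitidx <= 63);
    return x & (long)mask;
}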

intro to x86 assembly

I'm looking over an example on assembly in CSAPP (Computer Systems: A Programmer's Perspective, 2nd edition) and I just want to know if my understanding of the assembly code is correct.
Practice problem 3.23
int fun_b(unsigned x) {
    int val = 0;
    int i;
    for ( ____; _____; _____ ) {
    }
    return val;
}
The gcc C compiler generates the following assembly code:
x at %ebp+8
// what I've gotten so far
1 movl 8(%ebp), %ebx // ebx: x
2 movl $0, %eax // eax: val, set to 0 since eax is where the return
// value is stored and val is being returned at the end
3 movl $0, %ecx // ecx: i, set to 0
4 .L13: // loop
5 leal (%eax,%eax), %edx // edx = val+val
6 movl %ebx, %eax // val = x (?)
7 andl $1, %eax // x = x & 1
8 orl %edx, %eax // x = (val+val) | (x & 1)
9 shrl %ebx // x = x >> 1 (shift right by 1)
10 addl $1, %ecx // i++
11 cmpl $32, %ecx // compare i with 32
12 jne .L13 // if i != 32, jump back to the loop
There was a similar post on the same problem with the solution, but I'm looking for more of a walk-through and explanation of the assembly code line by line.
You already seem to have the meaning of the instructions figured out. The comments on lines 7 and 8 are slightly wrong, however, because those lines assign to eax, which is val, not x:
7 andl $1, %eax // val = val & 1 = x & 1
8 orl %edx, %eax // val = (val+val) | (x & 1)
Putting this into the C template could be:
for (i = 0; i < 32; i++, x >>= 1) {
    val = (val + val) | (x & 1);
}
Note that (val + val) is just a left shift, so what this function is doing is shifting bits out of x on the right and shifting them into val from the right. As such, it mirrors the bits.
PS: if the body of the for must be empty you can of course merge it into the third expression.
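For completeness, the filled-in template as a whole function (on two's-complement targets, fun_b(1) returns a value with only bit 31 set, confirming the mirroring):
int fun_b(unsigned x)
{
    int val = 0;
    int i;

    for (i = 0; i < 32; i++, x >>= 1) {
        val = (val + val) | (x & 1); /* shift val left, take the low bit of x */
    }
    return val;
}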

Changing endianness, is a union more efficient than bitshifts?

I was asked, as a challenge, to change the endianness of an int. The idea I had was to use bitshifts:
int swap_endianess(int color)
{
    int a;
    int r;
    int g;
    int b;

    a = (color & (255 << 24)) >> 24;
    r = (color & (255 << 16)) >> 16;
    g = (color & (255 << 8)) >> 8;
    b = (color & 255);
    return (b << 24 | g << 16 | r << 8 | a);
}
But someone told me that it was easier to use a union containing an int and an array of four chars (if an int is stored on 4 chars), fill the int, and then reverse the array.
union u_color
{
    int color;
    char c[4];
};

int swap_endianess(int color)
{
    union u_color ucol;
    char tmp;

    ucol.color = color;
    tmp = ucol.c[0];
    ucol.c[0] = ucol.c[3];
    ucol.c[3] = tmp;
    tmp = ucol.c[1];
    ucol.c[1] = ucol.c[2];
    ucol.c[2] = tmp;
    return (ucol.color);
}
Which of those two is the more efficient way of swapping bytes? Are there more efficient ways of doing this?
EDIT
After testing on an i7, the union way takes about 24 seconds (measured with the time command), while the bitshift way takes about 15 seconds, over 2,000,000,000 iterations.
The thing is that if I compile with -O1, both methods take only 1 second, and 0.001 seconds with -O2 or -O3.
The bitshift method compiles to bswap in assembly with -O2 and -O3, but not the union way; gcc seems to recognize the naive pattern but not the complicated union way of doing it. To conclude, read the bottom line of user3386109's answer below.
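For context, the timing loop was presumably something along these lines (my reconstruction, not the asker's actual code; the volatile sink keeps the optimizer from deleting the calls outright):
int swap_endianess(int color); /* either version above */

int main(void)
{
    volatile int sink = 0;

    for (long n = 0; n < 2000000000L; n++)
        sink = swap_endianess((int)n);
    return 0;
}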
Here is the correct code for a byte swap function
#include <stdint.h>

uint32_t changeEndianess( uint32_t value )
{
    uint32_t r, g, b, a;

    r = (value >> 24) & 0xff;
    g = (value >> 16) & 0xff;
    b = (value >> 8) & 0xff;
    a = value & 0xff;
    return (a << 24) | (b << 16) | (g << 8) | r;
}
Here's a function that tests the byte swap function
#include <stdio.h>
#include <stdlib.h> /* arc4random (BSD/macOS) */

void testEndianess( void )
{
    uint32_t value = arc4random();
    uint32_t result = changeEndianess( value );

    printf( "%08x %08x\n", value, result );
}
Using the LLVM compiler with full optimization, the resulting assembly code for the testEndianess function is
0x93d0: calll 0xc82e ; call `arc4random`
0x93d5: movl %eax, %ecx ; copy `value` into register CX
0x93d7: bswapl %ecx ; <--- this is the `changeEndianess` function
0x93d9: movl %ecx, 0x8(%esp) ; put 'result' on the stack
0x93dd: movl %eax, 0x4(%esp) ; put 'value' on the stack
0x93e1: leal 0x6536(%esi), %eax ; compute address of the format string
0x93e7: movl %eax, (%esp) ; put the format string on the stack
0x93ea: calll 0xc864 ; call 'printf'
In other words, the LLVM compiler recognizes the entire changeEndianess function and implements it as a single bswapl instruction.
Side note for those wondering why the call to arc4random is necessary. Given this code
void testEndianess( void )
{
    uint32_t value = 0x11223344;
    uint32_t result = changeEndianess( value );

    printf( "%08x %08x\n", value, result );
}
the compiler generates this assembly
0x93dc: leal 0x6524(%eax), %eax ; compute address of format string
0x93e2: movl %eax, (%esp) ; put the format string on the stack
0x93e5: movl $0x44332211, 0x8(%esp) ; put 'result' on the stack
0x93ed: movl $0x11223344, 0x4(%esp) ; put 'value' on the stack
0x93f5: calll 0xc868 ; call 'printf'
In other words, given a hardcoded value as input, the compiler precomputes the result of the changeEndianess function, and puts that directly into the assembly code, bypassing the function entirely.
The bottom line. Write your code the way it makes sense to write your code, and let the compiler do the optimizing. Compilers these days are amazing. Using tricky optimizations in source code (e.g. unions) may defeat the optimizations built into the compiler, actually resulting in slower code.
You can also use this code which might be slightly more efficient:
#include <stdint.h>

extern uint32_t
change_endianness(uint32_t x)
{
    x = (x & 0x0000FFFFLU) << 16 | (x & 0xFFFF0000LU) >> 16;
    x = (x & 0x00FF00FFLU) << 8 | (x & 0xFF00FF00LU) >> 8;
    return (x);
}
This is compiled by gcc on amd64 to the following assembly:
change_endianness:
roll $16, %edi
movl %edi, %eax
andl $16711935, %edi
andl $-16711936, %eax
salq $8, %rdi
sarq $8, %rax
orl %edi, %eax
ret
To get an even better result, you might want to employ inline assembly. The i386 and amd64 architectures provide a bswap instruction to do what you want. As user3386109 explained, compilers might recognize the “naïve” approach and emit bswap instructions, something that doesn't happen with the approach above. The two-step approach is, however, better in case the compiler is not smart enough to detect that it can use bswap.
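For example, with GCC-style inline assembly (a sketch assuming an x86/amd64 target; with GCC you could equally call the __builtin_bswap32 builtin instead):
#include <stdint.h>

static inline uint32_t bswap32_asm(uint32_t x)
{
    __asm__ ("bswap %0" : "+r" (x)); /* swap all four bytes in place */
    return x;
}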

Can a modern C compiler optimize a combination of bit accesses?

I would like var to be unequal to FALSE (i.e. nonzero) in case any of the bits 1, 3, 5, 7, 9, 11, 13 or 15 of input is set.
One solution which seem to be fairly common is this:
int var = 1 & (input >> 1) ||
          1 & (input >> 3) ||
          1 & (input >> 5) ||
          1 & (input >> 7) ||
          1 & (input >> 9) ||
          1 & (input >> 11) ||
          1 & (input >> 13) ||
          1 & (input >> 15);
However, I'm afraid that that would lead the compiler to generate unnecessarily long code.
Following code would also yield the desired result. Would it be more efficient?
int var = input & 0b1010101010101010;
Thanks!
Your second example is not equivalent.
What you wanted was (using non-standard binary literals):
int var = !!(input & 0b1010101010101010);
Or with hex-literals (those are standard):
int var = !!(input & 0xaaaa);
Changes: Use of hexadecimal constants and double-negation (equivalent to != 0).
This presupposes input is not volatile, nor an atomic type.
A good compiler should optimize both to the same instructions (and most modern compilers are good enough).
In the end though, test and measure; most compilers will output the produced assembler code, so you don't even need a disassembler!
If input is volatile, the compiler would be required to read it once if bit 1 was set, twice if bit 1 was clear but bit 3 was set, three times if bits 1 and 3 were clear but bit 5 was set, etc. The compiler may have ways of optimizing the code for doing the individual bit tests, but would have to test the bits separately.
If input is not volatile, a compiler could optimize the code, but I would not particularly expect it to. I would expect any compiler, however, no matter how ancient, to optimize
int var = ((input & (
    (1 << 1) | (1 << 3) | (1 << 5) | (1 << 7) |
    (1 << 9) | (1 << 11) | (1 << 13) | (1 << 15)
)) != 0);
which would appear to be what you're after.
It's going to depend on the processor and what instructions it has available, as well as how good the optimizing compiler is. I'd suspect that in your case, either of those lines of code will compile to the same instructions.
But we can do better than suspect, you can check for yourself. With gcc, use the -S compiler flag to have it output the assembly it generates. Then you can compare them yourself.
The orthodox solution would be to use the oft-forgotten bit-fields to map your flags, like
#include <stdbool.h>

struct
{
    bool B0: 1;
    bool B1: 1;
    bool B2: 1;
    bool B3: 1;
    bool B4: 1;
    bool B5: 1;
    bool B6: 1;
    bool B7: 1;
    bool B8: 1;
    bool B9: 1;
    bool B10: 1;
    bool B11: 1;
    bool B12: 1;
    bool B13: 1;
    bool B14: 1;
    bool B15: 1;
} input;
and use the expression
bool Var = input.B1 || input.B3 || input.B5 || input.B7 || input.B9 || input.B11 || input.B13 || input.B15;
I doubt that an optimizing compiler will use the single-go masking trick, but honestly I have not tried.
How well this is handled depends on the compiler.
I've tested a minor variation of this code:
int test(int input) {
    int var = 1 & (input >> 1) ||
              1 & (input >> 3) ||
              1 & (input >> 5) ||
              1 & (input >> 7) ||
              1 & (input >> 9) ||
              1 & (input >> 11) ||
              1 & (input >> 13) ||
              1 & (input >> 15);
    return var != 0;
}
Results
For x64, all compiled with -O2
GCC:
xor eax, eax
and edi, 43690
setne al
ret
Very good. That's precisely the transformation you were hoping for.
Clang:
testw $10922, %di # imm = 0x2AAA
movb $1, %al
jne .LBB0_2
andl $32768, %edi # imm = 0x8000
shrl $15, %edi
movb %dil, %al
.LBB0_2:
movzbl %al, %eax
ret
Yeah, that's a bit odd. Most of the tests were rolled together... except for one. I see no reason why it would do this; maybe someone else can shed some light on that.
And the real surprise, ICC:
movl %edi, %eax #7.32
movl %edi, %edx #8.26
movl %edi, %ecx #9.26
shrl $1, %eax #7.32
movl %edi, %esi #10.26
shrl $3, %edx #8.26
movl %edi, %r8d #11.26
shrl $5, %ecx #9.26
orl %edx, %eax #7.32
shrl $7, %esi #10.26
orl %ecx, %eax #7.32
shrl $9, %r8d #11.26
orl %esi, %eax #7.32
movl %edi, %r9d #12.25
orl %r8d, %eax #7.32
shrl $11, %r9d #12.25
movl %edi, %r10d #13.25
shrl $13, %r10d #13.25
orl %r9d, %eax #7.32
shrl $15, %edi #14.25
orl %r10d, %eax #7.32
orl %edi, %eax #7.32
andl $1, %eax #7.32
ret #15.21
OK, so it optimized it a bit: no branches, and the 1 &'s are rolled together. But this is disappointing.
Conclusion
Your mileage may vary. To be safe, you can of course use the simple version directly, instead of relying on the compiler.

C - Rotate a 64 bit unsigned integer

Today I have been trying to write a function which rotates a given 64-bit integer n bits to the right, but also to the left if n is negative. Of course, bits rotated out of the integer shall be rotated back in on the other side.
I have kept the function quite simple.
void rotate(uint64_t *i, int n)
{
    uint64_t one = 1;

    if (n > 0) {
        do {
            int storeBit = *i & one;

            *i = *i >> 1;
            if (storeBit == 1)
                *i |= 0x80000000000000;
            n--;
        } while (n > 0);
    }
}
Possible inputs are:
uint64_t num = 0x2;
rotate(&num, 1); // num should be 0x1
rotate(&num, -1); // num should be 0x2 again
rotate(&num, 62); // num should be 0x8
Unfortunately, I could not figure it out. I was hoping someone could help me.
EDIT: Now the code is online. Sorry it took a while; I had some difficulties with the editor. I only did the rotation to the right; the rotation to the left is missing because I did not get to it.
uint64_t rotate(uint64_t v, int n) {
    n = n & 63U;
    if (n)
        v = (v >> n) | (v << (64 - n));
    return v;
}
gcc -O3 produces:
.cfi_startproc
andl $63, %esi
movq %rdi, %rdx
movq %rdi, %rax
movl %esi, %ecx
rorq %cl, %rdx
testl %esi, %esi
cmovne %rdx, %rax
ret
.cfi_endproc
not perfect, but reasonable.
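A small driver (mine, assuming the rotate() above is in the same file) to check it against the inputs from the question:
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t num = 0x2;

    printf("%" PRIx64 "\n", rotate(num, 1));  /* 1: rotated right by one */
    printf("%" PRIx64 "\n", rotate(num, -1)); /* 4: -1 & 63 == 63, i.e. rotate left by one */
    printf("%" PRIx64 "\n", rotate(num, 62)); /* 8: right by 62 == left by two */
    return 0;
}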
int storeBit = *i & one;
On this line you are assigning a 64-bit unsigned integer to an int, which is probably only 4 bytes; I think your problem is related to this. On little-endian machines, things will get complicated if you rely on operations that aren't well defined.
Also,
if (n > 0)
doesn't handle negative n.
