Say I have 2 binary inputs named IN and MASK. Actual field size could be 32 to 256 bits depending on what instruction set is used to accomplish the task. Both inputs change every call.
Inputs:
IN = ...1100010010010100...
MASK = ...0001111010111011...
Output:
OUT = ...0001111010111000...
edit: another example result from some comment discussion
IN = ...11111110011010110...
MASK = ...01011011001111110...
Output:
OUT = ...01011011001111110...
I want to keep each contiguous group of adjacent 1 bits of MASK that a 1 bit of IN falls within. (Is there a general term for this kind of operation? Maybe I'm not phrasing my searches properly.) I'm trying to find a way to do this that is a bit faster. I'm open to using any x86 or x86 SIMD extensions that can get this done in a minimum of CPU cycles. A wider SIMD data type is preferred, as it allows me to process more data at once.
The best naive solution I've come up with is the following pseudocode, which manually shifts left until there are no more matching bits, then repeats shifting right:
// (using the variables above)
testL = testR = OUT = (IN & MASK);
LoopL:
testL = (testL << 1) & MASK;
if (testL != 0) {
OUT = OUT | testL;
goto LoopL;
}
LoopR:
testR = (testR >> 1) & MASK;
if (testR != 0) {
OUT = OUT | testR;
goto LoopR;
}
return OUT;
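For reference, here is a runnable 64-bit C version of the same idea (just a sketch; select_bits_naive is a placeholder name):
#include <stdint.h>

/* Spread the selected bits of IN & MASK left and right until they stop
   matching MASK. The number of iterations depends on the group lengths. */
static uint64_t select_bits_naive(uint64_t in, uint64_t mask) {
    uint64_t out = in & mask;
    uint64_t testL = out, testR = out;

    while ((testL = (testL << 1) & mask) != 0)
        out |= testL;
    while ((testR = (testR >> 1) & mask) != 0)
        out |= testR;

    return out;
}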
I guess @fuz's comment was on the right track.
The following example shows how the SSE and AVX2 code below works.
The algorithm starts with IN_reduced = IN & MASK because we are not interested
in IN bits at positions where MASK is 0.
IN = . . . 0 0 0 0 . . . . p q r s . . .
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
IN_reduced = IN & MASK = . . 0 0 0 0 0 0 . . 0 p q r s 0 . .
If any of the bits p, q, r, s is 1, then IN_reduced + MASK has a carry bit 1 at position X, which is immediately to the left of the requested contiguous bits.
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
IN_reduced = . . 0 0 0 0 0 0 . . 0 p q r s 0 . .
IN_reduced + MASK = . . 0 1 1 1 1 . . . 1 . . . . . .
X
(IN_reduced + MASK) >>1 = . . . 0 1 1 1 1 . . . 1 . . . . . .
With >> 1 this carry bit 1 is shifted to the same column as bit p
(the first bit of the contiguous bits).
Now, (IN_reduced + MASK) >> 1 is actually the average of IN_reduced and MASK. In order to avoid possible overflow of the addition we use the overflow-free average avg(a, b) = (a & b) + ((a ^ b) >> 1) (see @Harold's comment, see also here and here).
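As a small sanity check of that identity (not from the original answer), with 8-bit values 200 and 100:
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t a = 200, b = 100;                          /* a + b = 300 does not fit in 8 bits */
    uint8_t avg = (uint8_t)((a & b) + ((a ^ b) >> 1)); /* 64 + 86: every term fits in 8 bits */
    printf("%u\n", avg);                               /* prints 150 = (200 + 100) / 2       */
    return 0;
}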
With average = avg(IN_reduced, MASK) we get
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
IN_reduced = . . 0 0 0 0 0 0 . . 0 p q r s 0 . .
average = . . . 0 1 1 1 1 . . . 1 . . . . . .
MASK >> 1 = . . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 .
leading_bits = (~(MASK>>1))&average = . . . 0 0 0 0 0 . . . 1 0 0 0 0 . .
We can isolate the leading carry bits with leading_bits = (~(MASK>>1)) & average, because MASK>>1 is zero at the positions of the carry bits that we are interested in.
With normal addition the carry propagates from right to left. Here we use a
reverse addition: with a carry from left to right.
Reverse adding MASK and leading_bits, i.e.
rev_added = bit_swap(bit_swap(MASK) + bit_swap(leading_bits)),
zeros the bits at the wanted positions (the groups that should be kept).
With OUT = (~rev_added) & MASK we get the result.
MASK = . . 0 1 1 1 1 0 . . 0 1 1 1 1 0 . .
leading_bits = . . . 0 0 0 0 0 . . . 1 0 0 0 0 . .
rev_added (MASK,leading_bits) = . . . 1 1 1 1 0 . . . 0 0 0 0 1 . .
OUT = ~rev_added & MASK = . . 0 0 0 0 0 0 . . . 1 1 1 1 0 . .
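For reference, here is a minimal scalar uint64_t sketch of the same steps (not part of the SIMD code below; bit_reverse64 is a hypothetical helper using the GCC/Clang builtin __builtin_bswap64):
#include <stdint.h>

/* Reverse the 64 bits of x: reverse the bits within each byte, then reverse the byte order. */
static uint64_t bit_reverse64(uint64_t x) {
    x = ((x & 0x5555555555555555ULL) << 1) | ((x >> 1) & 0x5555555555555555ULL);
    x = ((x & 0x3333333333333333ULL) << 2) | ((x >> 2) & 0x3333333333333333ULL);
    x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4) | ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL);
    return __builtin_bswap64(x);
}

/* Keep the contiguous groups of 1 bits in mask that contain at least one 1 bit of in. */
static uint64_t select_bits_u64(uint64_t in, uint64_t mask) {
    uint64_t in_reduced   = in & mask;
    uint64_t average      = in_reduced + ((in_reduced ^ mask) >> 1); /* avg(in_reduced, mask)            */
    uint64_t leading_bits = ~(mask >> 1) & average;                  /* carry out of each selected group */
    uint64_t rev_added    = bit_reverse64(bit_reverse64(mask) + bit_reverse64(leading_bits));
    return ~rev_added & mask;
}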
The algorithm was not thoroughly tested, but the output looks ok.
The code block below contains two separate programs: the upper half is the SSE code and the lower half is the AVX2 code (they are combined into one block to avoid bloating the answer with two large code blocks). The SSE version works with 2 x 64-bit elements and the AVX2 version with 4 x 64-bit elements.
With gcc 9.1, the algorithm compiles to about 29 instructions,
aside from 4 vmovdqa-s for loading some constants, which are likely
hoisted out of the loop in a real world application (after inlining).
These 29 instructions are a mix of 9 shuffles (vpshufb) that execute
on port 5 (p5) on Intel Skylake, and many other instructions that may
execute on p0, p1 or p5.
Therefore, a performance of about 3 instructions per cycle might be possible.
In that case the throughput would be about 1 function call (inlined)
per 10 cycles. In the AVX2 case this means 4 uint64_t OUT results per
about 10 cycles.
Note that the performance is independent of the data(!), which I think is a great
benefit of this approach. The solution is branchless and loopless, and
cannot suffer from branch mispredictions.
/* gcc -O3 -m64 -Wall -march=skylake select_bits.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int print_sse_128_bin(__m128i x);
__m128i bit_128_k(unsigned int k);
__m128i mm_bitreverse_epi64(__m128i x);
__m128i mm_revadd_epi64(__m128i x, __m128i y);
/* Select specific pieces of contiguous bits from `MASK` based on selector `IN` */
__m128i mm_select_bits_epi64(__m128i IN, __m128i MASK){
__m128i IN_reduced = _mm_and_si128(IN, MASK);
/* Compute the average of IN_reduced and MASK with avg(a,b)=(a&b)+((a^b)>>1) */
/* (IN_reduced & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* ((IN & MASK) & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* IN_reduced + ((IN_reduced ^ MASK) >>1) */
__m128i tmp = _mm_xor_si128(IN_reduced, MASK);
__m128i tmp_div2 = _mm_srli_epi64(tmp, 1);
__m128i average = _mm_add_epi64(IN_reduced, tmp_div2); /* average is the average */
__m128i MASK_div2 = _mm_srli_epi64(MASK, 1);
__m128i leading_bits = _mm_andnot_si128(MASK_div2, average);
__m128i rev_added = mm_revadd_epi64(MASK, leading_bits);
__m128i OUT = _mm_andnot_si128(rev_added, MASK);
/* Uncomment the next lines to check the arithmetic */ /*
printf("IN ");print_sse_128_bin(IN );
printf("MASK ");print_sse_128_bin(MASK );
printf("IN_reduced ");print_sse_128_bin(IN_reduced );
printf("tmp ");print_sse_128_bin(tmp );
printf("tmp_div2 ");print_sse_128_bin(tmp_div2 );
printf("average ");print_sse_128_bin(average );
printf("MASK_div2 ");print_sse_128_bin(MASK_div2 );
printf("leading_bits ");print_sse_128_bin(leading_bits );
printf("rev_added ");print_sse_128_bin(rev_added );
printf("OUT ");print_sse_128_bin(OUT );
printf("\n");*/
return OUT;
}
int main(){
__m128i IN = _mm_set_epi64x(0b11111110011010110, 0b1100010010010100);
__m128i MASK = _mm_set_epi64x(0b01011011001111110, 0b0001111010111011);
__m128i OUT;
printf("Example 1 \n");
OUT = mm_select_bits_epi64(IN, MASK);
printf("IN ");print_sse_128_bin(IN);
printf("MASK ");print_sse_128_bin(MASK);
printf("OUT ");print_sse_128_bin(OUT);
printf("\n\n");
/* 0b7654321076543210765432107654321076543210765432107654321076543210 */
IN = _mm_set_epi64x(0b1000001001001010000010000000100000010000000000100000000111100011,
0b11111110011010111);
MASK = _mm_set_epi64x(0b1110011110101110111111000000000111011111101101111100011111000001,
0b01011011001111111);
printf("Example 2 \n");
OUT = mm_select_bits_epi64(IN, MASK);
printf("IN ");print_sse_128_bin(IN);
printf("MASK ");print_sse_128_bin(MASK);
printf("OUT ");print_sse_128_bin(OUT);
printf("\n\n");
return 0;
}
int print_sse_128_bin(__m128i x){
for (int i = 127; i >= 0; i--){
printf("%1u", _mm_testnzc_si128(bit_128_k(i), x));
if (((i & 7) == 0) && (i > 0)) printf(" ");
}
printf("\n");
return 0;
}
/* From my answer here https://stackoverflow.com/a/39595704/2439725, adapted to 128-bit */
inline __m128i bit_128_k(unsigned int k){
__m128i indices = _mm_set_epi32(96, 64, 32, 0);
__m128i one = _mm_set1_epi32(1);
__m128i kvec = _mm_set1_epi32(k);
__m128i shiftcounts = _mm_sub_epi32(kvec, indices);
__m128i kbit = _mm_sllv_epi32(one, shiftcounts);
return kbit;
}
/* Copied from Harold's answer https://stackoverflow.com/a/46318399/2439725 */
/* Adapted to epi64 and __m128i: bit reverse two 64 bit elements */
inline __m128i mm_bitreverse_epi64(__m128i x){
__m128i shufbytes = _mm_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
__m128i luthigh = _mm_setr_epi8(0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
__m128i lutlow = _mm_slli_epi16(luthigh, 4);
__m128i lowmask = _mm_set1_epi8(15);
__m128i rbytes = _mm_shuffle_epi8(x, shufbytes);
__m128i high = _mm_shuffle_epi8(lutlow, _mm_and_si128(rbytes, lowmask));
__m128i low = _mm_shuffle_epi8(luthigh, _mm_and_si128(_mm_srli_epi16(rbytes, 4), lowmask));
return _mm_or_si128(low, high);
}
/* Add in the reverse direction: With a carry from left to */
/* right, instead of right to left */
inline __m128i mm_revadd_epi64(__m128i x, __m128i y){
x = mm_bitreverse_epi64(x);
y = mm_bitreverse_epi64(y);
__m128i sum = _mm_add_epi64(x, y);
return mm_bitreverse_epi64(sum);
}
/* End of SSE code */
/************* AVX2 code starts here ********************************************/
/* gcc -O3 -m64 -Wall -march=skylake select_bits256.c */
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
int print_avx_256_bin(__m256i x);
__m256i bit_256_k(unsigned int k);
__m256i mm256_bitreverse_epi64(__m256i x);
__m256i mm256_revadd_epi64(__m256i x, __m256i y);
/* Select specific pieces of contiguous bits from `MASK` based on selector `IN` */
__m256i mm256_select_bits_epi64(__m256i IN, __m256i MASK){
__m256i IN_reduced = _mm256_and_si256(IN, MASK);
/* Compute the average of IN_reduced and MASK with avg(a,b)=(a&b)+((a^b)>>1) */
/* (IN_reduced & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* ((IN & MASK) & MASK) + ((IN_reduced ^ MASK) >>1) = */
/* IN_reduced + ((IN_reduced ^ MASK) >>1) */
__m256i tmp = _mm256_xor_si256(IN_reduced, MASK);
__m256i tmp_div2 = _mm256_srli_epi64(tmp, 1);
__m256i average = _mm256_add_epi64(IN_reduced, tmp_div2); /* average is the average */
__m256i MASK_div2 = _mm256_srli_epi64(MASK, 1);
__m256i leading_bits = _mm256_andnot_si256(MASK_div2, average);
__m256i rev_added = mm256_revadd_epi64(MASK, leading_bits);
__m256i OUT = _mm256_andnot_si256(rev_added, MASK);
/* Uncomment the next lines to check the arithmetic */ /*
printf("IN ");print_avx_256_bin(IN );
printf("MASK ");print_avx_256_bin(MASK );
printf("IN_reduced ");print_avx_256_bin(IN_reduced );
printf("tmp ");print_avx_256_bin(tmp );
printf("tmp_div2 ");print_avx_256_bin(tmp_div2 );
printf("average ");print_avx_256_bin(average );
printf("MASK_div2 ");print_avx_256_bin(MASK_div2 );
printf("leading_bits ");print_avx_256_bin(leading_bits );
printf("rev_added ");print_avx_256_bin(rev_added );
printf("OUT ");print_avx_256_bin(OUT );
printf("\n");*/
return OUT;
}
int main(){
__m256i IN = _mm256_set_epi64x(0b11111110011010110,
0b1100010010010100,
0b1000001001001010000010000000100000010000000000100000000111100011,
0b11111110011010111
);
__m256i MASK = _mm256_set_epi64x(0b01011011001111110,
0b0001111010111011,
0b1110011110101110111111000000000111011111101101111100011111000001,
0b01011011001111111);
__m256i OUT;
printf("Example \n");
OUT = mm256_select_bits_epi64(IN, MASK);
printf("IN ");print_avx_256_bin(IN);
printf("MASK ");print_avx_256_bin(MASK);
printf("OUT ");print_avx_256_bin(OUT);
printf("\n");
return 0;
}
int print_avx_256_bin(__m256i x){
for (int i=255;i>=0;i--){
printf("%1u",_mm256_testnzc_si256(bit_256_k(i),x));
if (((i&7) ==0)&&(i>0)) printf(" ");
}
printf("\n");
return 0;
}
/* From my answer here https://stackoverflow.com/a/39595704/2439725 */
inline __m256i bit_256_k(unsigned int k){
__m256i indices = _mm256_set_epi32(224,192,160,128,96,64,32,0);
__m256i one = _mm256_set1_epi32(1);
__m256i kvec = _mm256_set1_epi32(k);
__m256i shiftcounts = _mm256_sub_epi32(kvec, indices);
__m256i kbit = _mm256_sllv_epi32(one, shiftcounts);
return kbit;
}
/* Copied from Harold's answer https://stackoverflow.com/a/46318399/2439725 */
/* Adapted to epi64: bit reverse four 64 bit elements */
inline __m256i mm256_bitreverse_epi64(__m256i x){
__m256i shufbytes = _mm256_setr_epi8(7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
__m256i luthigh = _mm256_setr_epi8(0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15, 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15);
__m256i lutlow = _mm256_slli_epi16(luthigh, 4);
__m256i lowmask = _mm256_set1_epi8(15);
__m256i rbytes = _mm256_shuffle_epi8(x, shufbytes);
__m256i high = _mm256_shuffle_epi8(lutlow, _mm256_and_si256(rbytes, lowmask));
__m256i low = _mm256_shuffle_epi8(luthigh, _mm256_and_si256(_mm256_srli_epi16(rbytes, 4), lowmask));
return _mm256_or_si256(low, high);
}
/* Add in the reverse direction: With a carry from left to */
/* right, instead of right to left */
inline __m256i mm256_revadd_epi64(__m256i x, __m256i y){
x = mm256_bitreverse_epi64(x);
y = mm256_bitreverse_epi64(y);
__m256i sum = _mm256_add_epi64(x, y);
return mm256_bitreverse_epi64(sum);
}
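As a rough usage sketch (not part of the original answer), the AVX2 function could be applied to arrays of uint64_t like this, assuming n is a multiple of 4 and unaligned loads/stores are acceptable:
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

__m256i mm256_select_bits_epi64(__m256i IN, __m256i MASK); /* from the code above */

/* Process n 64-bit elements, 4 at a time. */
void select_bits_array(uint64_t *out, const uint64_t *in, const uint64_t *mask, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m256i vin   = _mm256_loadu_si256((const __m256i *)&in[i]);
        __m256i vmask = _mm256_loadu_si256((const __m256i *)&mask[i]);
        _mm256_storeu_si256((__m256i *)&out[i], mm256_select_bits_epi64(vin, vmask));
    }
}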
Output of the SSE code with the debugging section uncommented:
Example 1
IN 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010110 00000000 00000000 00000000 00000000 00000000 00000000 11000100 10010100
MASK 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111011
IN_reduced 00000000 00000000 00000000 00000000 00000000 00000000 10110100 01010110 00000000 00000000 00000000 00000000 00000000 00000000 00000100 10010000
tmp 00000000 00000000 00000000 00000000 00000000 00000000 00000010 00101000 00000000 00000000 00000000 00000000 00000000 00000000 00011010 00101011
tmp_div2 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00010100 00000000 00000000 00000000 00000000 00000000 00000000 00001101 00010101
average 00000000 00000000 00000000 00000000 00000000 00000000 10110101 01101010 00000000 00000000 00000000 00000000 00000000 00000000 00010001 10100101
MASK_div2 00000000 00000000 00000000 00000000 00000000 00000000 01011011 00111111 00000000 00000000 00000000 00000000 00000000 00000000 00001111 01011101
leading_bits 00000000 00000000 00000000 00000000 00000000 00000000 10100100 01000000 00000000 00000000 00000000 00000000 00000000 00000000 00010000 10100000
rev_added 00000000 00000000 00000000 00000000 00000000 00000000 01001001 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000001 01000111
OUT 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111000
IN 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010110 00000000 00000000 00000000 00000000 00000000 00000000 11000100 10010100
MASK 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111011
OUT 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111110 00000000 00000000 00000000 00000000 00000000 00000000 00011110 10111000
Example 2
IN 10000010 01001010 00001000 00001000 00010000 00000010 00000001 11100011 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010111
MASK 11100111 10101110 11111100 00000001 11011111 10110111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
IN_reduced 10000010 00001010 00001000 00000000 00010000 00000010 00000001 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110100 01010111
tmp 01100101 10100100 11110100 00000001 11001111 10110101 11000110 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000010 00101000
tmp_div2 00110010 11010010 01111010 00000000 11100111 11011010 11100011 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00010100
average 10110100 11011100 10000010 00000000 11110111 11011100 11100100 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110101 01101011
MASK_div2 01110011 11010111 01111110 00000000 11101111 11011011 11100011 11100000 00000000 00000000 00000000 00000000 00000000 00000000 01011011 00111111
leading_bits 10000100 00001000 10000000 00000000 00010000 00000100 00000100 00000001 00000000 00000000 00000000 00000000 00000000 00000000 10100100 01000000
rev_added 00010000 01100001 00000010 00000001 11000000 01110000 00100000 00100000 00000000 00000000 00000000 00000000 00000000 00000000 01001001 00000000
OUT 11100111 10001110 11111100 00000000 00011111 10000111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
IN 10000010 01001010 00001000 00001000 00010000 00000010 00000001 11100011 00000000 00000000 00000000 00000000 00000000 00000001 11111100 11010111
MASK 11100111 10101110 11111100 00000001 11011111 10110111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
OUT 11100111 10001110 11111100 00000000 00011111 10000111 11000111 11000001 00000000 00000000 00000000 00000000 00000000 00000000 10110110 01111111
The following approach needs only a single loop, with the number of iterations equal to the number of 'groups' found.
I don't know if it will be more efficient than your approach; there are 6 arithmetic/bitwise operations in each iteration.
In pseudocode (C-like):
OUT = 0;
a = MASK;
while (a)
{
e = a & ~(a + (a & (-a)));
if (e & IN) OUT |= e;
a ^= e;
}
Here's how it works, step by step, using 11010111 as an example mask:
OUT = 0
a = MASK 11010111
c = a & (-a) 00000001 keeps rightmost one only
d = a + c 11011000 clears rightmost group (and sets the bit to its immediate left)
e = a & ~d 00000111 keeps rightmost group only
if (e & IN) OUT |= e; adds group to OUT
a = a ^ e 11010000 clears rightmost group, so we can proceed with the next group
c = a & (-a) 00010000
d = a + c 11100000
e = a & ~d 00010000
if (e & IN) OUT |= e;
a = a ^ e 11000000
c = a & (-a) 01000000
d = a + c 00000000 (ignoring carry when adding)
e = a & ~d 11000000
if (e & IN) OUT |= e;
a = a ^ e 00000000 done
As pointed out by @PeterCordes, some operations could be optimized using x86 BMI1 instructions (see the sketch after this list):
c = a & (-a): blsi
e = a & ~d: andn
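A possible sketch with the BMI1 intrinsics from <immintrin.h> (compile with -mbmi; the function name is made up here):
#include <immintrin.h>
#include <stdint.h>

/* Same single loop: _blsi_u64(a) = a & -a, _andn_u64(x, y) = ~x & y. */
static uint64_t select_bits_bmi(uint64_t in, uint64_t mask) {
    uint64_t out = 0;
    uint64_t a = mask;
    while (a) {
        uint64_t e = _andn_u64(a + _blsi_u64(a), a); /* isolate the rightmost group of a     */
        if (e & in)
            out |= e;                                /* keep the group if IN has a bit in it */
        a ^= e;                                      /* clear the group and continue         */
    }
    return out;
}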
This approach is good for processor architectures that do not support bitwise reversal. On architectures that do have a dedicated instruction to reverse the order of bits in an integer, wim's answer is more efficient.
I'm using the code below for reading bits/bytes of a structure.
When DEBUG is 1, each line is printed with two different methods; otherwise only one method is used.
Code:
#include <stdio.h>
#define DEBUG 0
typedef struct n{
int a;
int b;
int (*add)(struct n*, int,int);
int (*sub)(struct n*, int,int);
} num;
int add (num *st, int a, int b){}
int sub(num *st, int a, int b){}
int main(){
num* var = calloc(1,sizeof(num));
var->add = add;
var->sub = sub;
var->a = 13;
var->b = 53;
long int *byte = (long int*)var;
int i;
int j;
for(i=0;i<6;i++){
# if DEBUG == 1
for(j=0;j<64;j++){
printf("%d",( *(byte+i) & (1UL<<(63-j)) )?1:0 );
putchar( ((j+1)%8 == 0)?' ':'\0' );
}
putchar('\n');
# endif
printf("0x%06x %06x\n",(*(byte+i) >> 32), *(byte+i));
}
printf("\n0x%x\n",var->add);
printf("0x%x\n",var->sub);
}
Output 1: (DEBUG==1)
00000000 00000000 00000000 00110101 00000000 00000000 00000000 00001101
0x000035 00000d
00000000 00000000 00000000 00000000 00000000 01000000 00000111 01110000
0x000000 400770
00000000 00000000 00000000 00000000 00000000 01000000 00000111 01100000
0x000000 400760
00000000 00000000 00000000 00000000 00000000 00000010 00001111 11100001
0x000000 020fe1
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0x000000 000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0x000000 000000
0x400770
0x400760
Output 2: (DEBUG==0)
0x000000 000000
0x000000 000000
0x000000 000000
0x000000 020fe1
0x000000 000000
0x000000 000000
0x400640
0x400630
As you can see, everything except one line is zero in Output 2.
I'm just curious to know why this is happening.
Also, enlighten me if there's a better way to print bits/bytes.
NB: I'm using an online compiler (onlinegdb.com) for testing.
What you are printing here are the addresses of the functions:
printf("\n0x%x\n",var->add);
printf("0x%x\n",var->sub);
When you enable DEBUG, you add the following code to your program:
# if DEBUG == 1
for(j=0;j<64;j++){
printf("%d",( *(byte+i) & (1UL<<(63-j)) )?1:0 );
putchar( ((j+1)%8 == 0)?' ':'\0' );
}
putchar('\n');
# endif
This extra code in the code segment shifts things around, so the addresses of the functions (add and sub) may change.
Note: you are not following the strict aliasing rule here:
long int *byte = (long int*)var;
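A safer way to dump an object's bytes is to view it through an unsigned char pointer, which the aliasing rules always allow. A minimal sketch (the helper name print_bits is made up):
#include <stdio.h>

/* Print each byte of an object as bits, from the lowest address upward. */
static void print_bits(const void *obj, size_t size) {
    const unsigned char *p = obj;
    for (size_t i = 0; i < size; i++) {
        for (int b = 7; b >= 0; b--)
            putchar(((p[i] >> b) & 1) ? '1' : '0');
        putchar(' ');
    }
    putchar('\n');
}

/* Usage: print_bits(var, sizeof *var); */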
This appears to be an issue with the compiler you are using; I don't see any specific reason why your code would print zeros. When your code is compiled with GCC and run, the results are as follows:
Debug = 1
00000000 00000000 00000000 00001101 00000000 00000000 00000000 00001101
0x00000d 00000d
00000000 00000000 00000000 00110101 00000000 00000000 00000000 00110101
0x000035 000035
00000000 01000000 00010101 01100000 00000000 01000000 00010101 01100000
0x401560 401560
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0x000000 000000
00000000 01000000 00010101 01110010 00000000 01000000 00010101 01110010
0x401572 401572
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0x000000 000000
0x401560
0x401572
And Debug = 0:
0x00000d 00000d
0x000035 000035
0x401560 401560
0x000000 000000
0x401572 401572
0x000000 000000
0x401560
0x401572
I'd recommend getting GCC on your machine and compiling your code with -Wall. There are a handful of things in this code that make it hard to understand and could make for unpredictable behavior that -Wall will warn you about.
I have a small query in C. I am using the bitwise left shift on the number 69, which is 01000101 in binary:
01000101 << 8
and I get the answer 100010100000000.
Shouldn't it be all 8 zeros, i.e. 00000000, since we shift all 8 bits to the left and then pad with zeros?
It is because the default data type of an integer literal (int) is, on most CPUs nowadays, wider than 8 bits (typically 32 bits), and thus when you apply
69 << 8 //note 69 is int
It is actually applied like this
00000000 00000000 00000000 01000101 << 8
Thus you get the result
00000000 00000000 01000101 00000000
If you use, say, unsigned char specifically, then it won't happen:
unsigned char a = 69 << 8; //resulting in 0
This is because though 69 << 8 itself will still result in
01000101 00000000
But the above value will be converted to an 8-bit unsigned char, resulting in:
00000000
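A small demo of both cases (just a sketch to verify the point):
#include <stdio.h>

int main(void) {
    int as_int = 69 << 8;            /* 69 is promoted to int before the shift       */
    unsigned char as_char = 69 << 8; /* the int result 0x4500 is truncated to 8 bits */

    printf("%d\n", as_int);          /* prints 17664 (0x4500) */
    printf("%d\n", as_char);         /* prints 0              */
    return 0;
}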
Bit shift operators act on entire objects, not individual bytes. If the object storing 69 is wider than 1 byte (int is typically 4 bytes for example), then the bits that are shifted outside of the first (lowest/rightmost) byte overflow and are "pushed into" the second byte. For example:
00000000 00000000 00000000 01000101 //The number 69, stored in a 32 bit object
00000000 00000000 01000101 00000000 //shifted left by 8
If you had stored the number in a 1-byte variable, such as a char, the result would indeed have been zero.
01000101 //The number 69, stored in an 8 bit object
(01000101) 00000000 //shifted left by 8
^^^^^^^^
these bits have been shifted outside the size of the object.
The same thing would happen if you shifted an int by 32 (note that shifting by the full width of the type is formally undefined behavior in C, but conceptually all the bits are shifted out).
00000000 00000000 00000000 01000101 //The number 69, stored in a 32 bit int
00000000 00000000 01000101 00000000 //shifted left by 8
00000000 01000101 00000000 00000000 //shifted left by 16
01000101 00000000 00000000 00000000 //shifted left by 24
00000000 00000000 00000000 00000000 //shifted left by 32, overflow
I have this program:
#include <stdio.h>
int main(void)
{
unsigned char unit_id[] = { 0x2B, 0xC, 0x6B, 0x54}; // 8-bit (1 byte)
unsigned long long int unit_id_val; //64-bit (8 bytes)
int i;
// loops 4 times
for(i=0;i<sizeof(unit_id)/sizeof(char);i++){
unit_id_val |= unit_id[i] << (8 * i);
}
printf("the unit id is %llu\n", unit_id_val);
return 0;
}
hex to binary conversions:
0x2B = 00101011
0xC = 00001100
0x6B = 01101011
0x54 = 01010100
unit_id_val is 8 bytes (I use 5 bytes for unit_id_val below to simplify things)
1) first iteration 8*0=0 so no left shift occurs:
00101011 = 00101011 << 0
00000000 00000000 00000000 00000000 00000000 |= 00101011
So the result should be:
00000000 00000000 00000000 00000000 00101011
2) Second iteration 8*1=8, so left shift all bits of unsigned char 0xC by 8:
00000000 = 00101011 << 8
00000000 00000000 00000000 00000000 00101011 |= 00000000
So the result should be:
00000000 00000000 00000000 00000000 00101011
3) Third iteration 8*2=16, so left shift all bits of unsigned char 0x6B by 16:
00000000 = 01101011 << 16
00000000 00000000 00000000 00000000 00101011 |= 00000000
So the result should be:
00000000 00000000 00000000 00000000 00101011
4) Fourth iteration 8*3=24, so left shift all bits of unsigned char 0x54 by 24:
00000000 = 01010100 << 24
00000000 00000000 00000000 00000000 00101011 |= 00000000
So the result should be:
00000000 00000000 00000000 00000000 00101011
00101011 is 43
But when you run this program you get
1416301611
which is binary:
00000000 01010100 01101011 00001100 00101011
I am not understanding something here. I am following the precedence chart by evaluating the primary expression operator () before the left shift operator <<, before the compound assignment operator |=. Yet I do not understand why I get the result I get.
00000000 = 00101011 << 8
Ok first your second element is 0x0C (i.e., binary 00001100 not 00101011), so you are actually doing:
(unsigned char) 0x0C << 8
and the result of this expression is not 0 but 0x0C00 as the bitwise << operator does an integer promotion of its left operand, so it is actually equivalent to:
(int) (unsigned char) 0x0C << 8
You never initialize unit_id_val, and then you |= your shifted byte values into it, so whatever bits happened to be set in the uninitialized value will still be set, so your output will look like random garbage. Add
unit_id_val = 0;
before your loop.
In addition, whenever you do ANY operation in C, the operands are always converted by the standard conversions. In particular, that means any integer type smaller than an int will first be converted to int. So even though unsigned char is only 8 bits, when you do unit_id[i] << (8 * i), the 8-bit value from unit_id[i] will be converted to int (presumably 32 bits on your machine) before the shift. There's no way to do any sort of computation on integers smaller than an int in C -- even if you cast them, they'll be implicitly converted back to int.
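Putting both points together (initialize the accumulator and, if you ever shift by 32 or more, widen before shifting), a corrected sketch would be:
#include <stdio.h>

int main(void) {
    unsigned char unit_id[] = { 0x2B, 0x0C, 0x6B, 0x54 };
    unsigned long long unit_id_val = 0;           /* must be initialized */

    for (size_t i = 0; i < sizeof unit_id; i++) {
        /* widen before shifting so shift counts of 32 or more are also safe */
        unit_id_val |= (unsigned long long)unit_id[i] << (8 * i);
    }

    printf("the unit id is %llu\n", unit_id_val); /* prints 1416301611 (0x546B0C2B) */
    return 0;
}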
See this line:
unit_id_val |= unit_id[i] << (8 * i);
unit_id_val = unit_id_val | unit_id[i] << (8 * i);
The problem is that you are using a variable that is only declared but never initialized, and in C the value of an uninitialized local variable is garbage/junk. So you will get an unpredictable value every time.
Suppose I have the number 'numb'=1025 [00000000 00000000 00000100 00000001] represented:
On Little-Endian Machine:
00000001 00000100 00000000 00000000
On Big-Endian Machine:
00000000 00000000 00000100 00000001
Now, if I apply a left shift by 10 bits (i.e. numb <<= 10), I should have:
[A] On Little-Endian Machine:
As I noticed in GDB, little endian does the left shift in 3 steps. [I have shown 3 steps only to better illustrate the processing.]
Treat the no. in Big-Endian Convention:
00000000 00000000 00000100 00000001
Apply Left-Shift:
00000000 00010000 00000100 00000000
Represent the Result again in Little-Endian:
00000000 00000100 00010000 00000000
[B]. On Big-Endian Machine:
00000000 00010000 00000100 00000000
My question is: if I directly apply a left shift on the little-endian representation, it should give:
numb:
00000001 00000100 00000000 00000000
numb << 10:
00010000 00000000 00000000 00000000
But actually, it gives:
00000000 00000100 00010000 00000000
It is only to achieve this second result that I have shown the three hypothetical steps above.
Please explain to me why the two results are different: the actual outcome of numb << 10 is different from what I expected.
Endianness is about the way values are stored in memory. When a value is loaded into the processor, regardless of endianness, the bit-shift instruction operates on the value in the processor's register. Therefore, loading from memory to the processor is the equivalent of converting to big endian; the shift operation comes next, and then the new value is stored back to memory, which is where the little-endian byte order comes into effect again.
Update, thanks to @jww: On PowerPC the vector shifts and rotates are endian-sensitive. You can have a value in a vector register and a shift will produce different results on little-endian and big-endian.
No; bit shifts, like every other part of C, are defined in terms of values, not representations. Left-shift by 1 is multiplication by 2, right-shift by 1 is division by 2. (As always when using bitwise operations, beware of signedness. Everything is best defined for unsigned integral types.)
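A small sketch that illustrates this: the shifted value is the same on any machine; only the byte order you see when you dump the memory differs:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    uint32_t numb = 1025;
    uint32_t shifted = numb << 10;           /* value: 1025 * 1024 = 1049600 = 0x00100400 */

    unsigned char bytes[sizeof shifted];
    memcpy(bytes, &shifted, sizeof shifted); /* the memory view depends on endianness */

    printf("value: %u\n", shifted);
    printf("bytes in memory order:");
    for (size_t i = 0; i < sizeof bytes; i++)
        printf(" %02x", bytes[i]);
    printf("\n");                            /* little endian: 00 04 10 00, big endian: 00 10 04 00 */
    return 0;
}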
The accepted answer points out that endianness is a concept of the memory view, but I don't think it answers the question directly.
Some answers tell me that bitwise operations don't depend on endianness and that the processor may represent the bytes internally however it likes; in other words, endianness gets abstracted away.
But when we do bitwise calculations on paper, for example, don't we need to state the endianness in the first place? Most of the time we choose an endianness implicitly.
For example, assume we have a line of code like this
0x1F & 0xEF
How would you calculate the result by hand, on paper?
MSB 0001 1111 LSB
1110 1111
result: 0000 1111
So here we use a big-endian layout to do the calculation. You can also use little endian and get the same result.
By the way, when we write numbers in code, I think it's like a big-endian format: in 123456 or 0x1F the most significant digits start from the left.
Again, as soon as we write the binary form of a value on paper, I think we've already chosen an endianness and we are viewing the value as we would see it in memory.
So back to the question: a shift operation << should be thought of as shifting from the LSB (least significant byte) towards the MSB (most significant byte).
Then as for the example in the question:
numb=1025
Little Endian
LSB 00000001 00000100 00000000 00000000 MSB
So << 10 is a 10-bit shift from LSB towards MSB.
Comparing the << 10 operation in both layouts, step by step:
MSB LSB
00000000 00000000 00000100 00000001 numb(1025)
00000000 00010000 00000100 00000000 << 10
LSB MSB
00000000 00000100 00010000 00000000 numb(1025) << 10, and put in a Little Endian Format
LSB MSB
00000001 00000100 00000000 00000000 numb(1025) in Little Endian format
00000010 00001000 00000000 00000000 << 1
00000100 00010000 00000000 00000000 << 2
00001000 00100000 00000000 00000000 << 3
00010000 01000000 00000000 00000000 << 4
00100000 10000000 00000000 00000000 << 5
01000000 00000000 00000001 00000000 << 6
10000000 00000000 00000010 00000000 << 7
00000000 00000001 00000100 00000000 << 8
00000000 00000010 00001000 00000000 << 9
00000000 00000100 00010000 00000000 << 10 (check this final result!)
Wow! I get the expected result as the OP described!
The reasons the OP didn't get the expected result are:
It seems that he didn't shift from LSB to MSB.
When shifting bits in the little-endian layout, you should realize (thank god I realized it) that:
LSB 10000000 00000000 MSB << 1 is
LSB 00000000 00000001 MSB, not
LSB 01000000 00000000 MSB
Because for each individual byte (8 bits), we are actually still writing it in an MSB 00000000 LSB big-endian form.
So it's like
LSB[ (MSB 10000000 LSB) (MSB 00000000 LSB) ]MSB
To sum up:
Though bitwise operations are said to be abstracted away from endianness, when we calculate bitwise operations by hand we still need to know which endianness we are using as we write down the binary form on paper, and we need to make sure all the operands use the same endianness.
The OP didn't get the expected result because he did the shifting wrong.
Whichever shift instruction shifts out the higher-order bits first is considered the left shift. Whichever shift instruction shifts out the lower-order bits first is considered the right shift. In that sense, the behavior of >> and << for unsigned numbers will not depend on endianness.
Computers don't write numbers down the way we do. The value simply shifts. If you insist on looking at it byte-by-byte (even though that's not how the computer does it), you could say that on a little-endian machine, the first byte shifts left, the excess bits go into the second byte, and so on.
(By the way, little-endian makes more sense if you write the bytes vertically rather than horizontally, with higher addresses on top. Which happens to be how memory map diagrams are commonly drawn.)