C unsigned int array and bit shifts

C unsigned int array and bit shifts - c

If i have an array of short unsigned ints.
Would shifting array[k+1] left by 8 bits, put 8 bits into the lower half of array[k+1]?
Or do they simply drop off as they have gone outside of the allocated space for the element?

They drop off. You can't affect the other bits this way. Try it:
#include <stdio.h>
void print_a (short * a)
{
int i;
for (i = 0; i < 3; i++)
printf ("%d:%X\n", i, a[i]);
}
int main ()
{
short a[3] = {1, -1, 3};
print_a (a);
a[1] <<= 8;
print_a (a);
return 0;
}
Output is
0:1
1:FFFFFFFF
2:3
0:1
1:FFFFFF00
2:3

They drop off the data type totally, not carrying over to the next array element.
If you want that sort of behavior, you have to code it yourself with something like (left shifting the entire array by four bits):
#include <stdio.h>
int main(void) {
int i;
unsigned short int a[4] = {0xdead,0x1234,0x5678,0xbeef};
// Output "before" variables.
for (i = 0; i < sizeof(a)/sizeof(*a); i++)
printf ("before %d: 0x%04x\n", i, a[i]);
printf ("\n");
// This left-shifts the array by left-shifting the current
// element and bringing in the top bit of the next element.
// It is in a loop for all but hte last element.
// Then it just left-shifts the last element (no more data
// to shift into that one).
for (i = 0; i < sizeof(a)/sizeof(*a)-1; i++)
a[i] = (a[i] << 8) | (a[i+1] >> 8);
a[i] = (a[i] << 8);
// Print the "after" variables.
for (i = 0; i < sizeof(a)/sizeof(*a); i++)
printf ("after %d: 0x%04x\n", i, a[i]);
return 0;
}
This outputs:
before 0: 0xdead
before 1: 0x1234
before 2: 0x5678
before 3: 0xbeef
after 0: 0xad12
after 1: 0x3456
after 2: 0x78be
after 3: 0xef00

The way to think about this is that in C (and for most programming languages) the implementation for array[k] << 8 involves loading array[k] into a register, shifting the register, and then storing the register back into array[k]. Thus array[k+1] will remain untouched.
As an example, foo.c:
unsigned short array[5];
void main() {
array[3] <<= 8;
}
Will generate the following instructions:
movzwl array+6(%rip), %eax
sall $8, %eax
movw %ax, array+6(%rip)
This loads array[3] into %eax, modifies it, and stores it back.

Shifting an unsigned int left by 8 bits will fill the lower 8 bits with zeros. The top 8 bits will be discarded, it doesn't matter that they are in an array.
Incidentally, whether 8 bits is half of an unsigned int depends on your system, but on 32-bit systems, 8 bits is typically a quarter of an unsigned int.
unsigned int x = 0x12345678;
// x is 0x12345678
x <<= 8;
// x is 0x34567800

Be aware that the C definition of the int data type does not specify how many bits it contains and is system dependent. An int was originally intended to be the "natural" word size of the processor, but this isn't always so and you could find int contains 16, 32, 64 or even some odd number like 24 bits.
The only thing you are guaranteed is an unsigned int can hold all the values between 0 and UINT_MAX inclusive, where UINT_MAX must be at least 65535 - so the int types must contain at least 16 bits to hold the required range of values.
So shifting an array of integer by 8 bits will change each int individually, but be aware that this shift will not necessarily be 'half of the array'

Related

Combining four 16bit numbers into single 64bit number

I need to combine four numbers in hex format into single number. The first option that I thought of was to do left shift by n*16 (n=0,1,2,3..) for each number.
This works fine when the numbers are 0xABCD.
If a number is 0x000A, the leading zeroes are ignored and whole thing stops working (not performs as expected).
I need to have all the leading zeroes because I have to know the position of 1's in the 64bit number.
user.profiles is a 64bit value where the each part of tmp_arr is shifted to the left and stored. Am I missing something here? Or am I just going crazy?
for(int i = 0; i < 4; i++)
{
EE_ReadVariable(EE_PROFILES_1 + i, &tmp_arr[i]); // increment the address by i (D1->D2->D3->D4)
user.profiles |= (tmp_arr[i] << (i*16)); // shift the value by multiples of 16 to get 64bit number
}

C allows type-punning using unions, so you could have something like this:
union value_union
{
uint16_t v16[4];
uint64_t v64;
};
// ...
union value_union values;
values.v16[0] = value1;
values.v16[1] = value2;
values.v16[2] = value3;
values.v16[3] = value4;
printf("64-bit value = 0x%016"PRIx64"\n", values.v64);

In order to write embedded C, it is very important that you know of Implicit type promotion rules.
tmp_arr[i] << (i*16) on a 32 bit system like STM32 promotes the tmp_arr[i] argument to 32 bit signed int. This comes with two complications:
If you happen to shift a value into the sign bit of this 32 bit int, you get an undefined behavior bug.
If you shift beyond the size of this 32 bit int, you get an undefined behavior bug (and data shifted out is lost.
You need to use 64 bit unsigned arithmetic for this (which will be fairly inefficient on a 32 bitter):
user.profiles |= (uint64_t)tmp_arr[i] << i*16;
The size and type of what's on the left side of the assignment operator is completely irrelevant here.
Also, when coding for embedded systems get rid of sloppy int and the other "primitive" default types, use the types of stdint.h only. In the average embedded system, you rarely ever want any signed types, they just create problems. For STM32, you'll want to use uint32_t in most cases.

As I mentioned in the comments, you need to cast your uint16_t temporary to uint64_t before shifting up and you should check for errors.
The "missing" leading zeroes probably comes from using the wrong format for printf.
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
uint64_t get_profiles() {
uint64_t rv = 0;
uint16_t tmp;
for(int i = 0; i < 4; i++) {
uint16_t res = EE_ReadVariable(EE_PROFILES_1 + i, &tmp);
switch(res) {
case 0: rv |= (uint64_t)tmp << i*16; break;
case 1: /* variable not found */ break; // deal with error
case NO_VALID_PAGE: /* no valid page found */ break; // deal with error
}
}
return rv;
}
int main () {
// ...
user.profiles = get_profiles();
printf("%016" PRIx64 "\n", user.profiles); // prints leading zeroes
}

You can avoid casting and the need to force intermediate 64-bit arithmetic by shifting the target and masking into the least significant 16 bits in two separate implicitly 64 bit operations, thus:
user.profiles = 0u ;
for(int i = 0; i < 4; i++)
{
EE_ReadVariable( EE_PROFILES_1 + i, &tmp_arr[i] ) ;
user.profiles <<= 16 ;
user.profiles |= tmp_arr[i] ;
}
In the shift-assignment and the OR-assignment, the right-hand operand it is implicitly promoted the same type as the right hand.
If tmp_arr is not used then:
user.profiles = 0u ;
for(int i = 0; i < 4; i++)
{
uint16_t tmp = 0 ;
EE_ReadVariable( EE_PROFILES_1 + i, &tmp ) ;
user.profiles <<= 16 ;
user.profiles |= tmp ;
}

Iterate through bits in C

I have a big char *str where the first 8 chars (which equals 64 bits if I'm not wrong), represents a bitmap. Is there any way to iterate through these 8 chars and see which bits are 0? I'm having alot of trouble understanding the concept of bits, as you can't "see" them in the code, so I can't think of any way to do this.

Imagine you have only one byte, a single char my_char. You can test for individual bits using bitwise operators and bit shifts.
unsigned char my_char = 0xAA;
int what_bit_i_am_testing = 0;
while (what_bit_i_am_testing < 8) {
if (my_char & 0x01) {
printf("bit %d is 1\n", what_bit_i_am_testing);
}
else {
printf("bit %d is 0\n", what_bit_i_am_testing);
}
what_bit_i_am_testing++;
my_char = my_char >> 1;
}
The part that must be new to you, is the >> operator. This operator will "insert a zero on the left and push every bit to the right, and the rightmost will be thrown away".
That was not a very technical description for a right bit shift of 1.

Here is a way to iterate over each of the set bits of an unsigned integer (use unsigned rather than signed integers for well-defined behaviour; unsigned of any width should be fine), one bit at a time.
Define the following macros:
#define LSBIT(X) ((X) & (-(X)))
#define CLEARLSBIT(X) ((X) & ((X) - 1))
Then you can use the following idiom to iterate over the set bits, LSbit first:
unsigned temp_bits;
unsigned one_bit;
temp_bits = some_value;
for ( ; temp_bits; temp_bits = CLEARLSBIT(temp_bits) ) {
one_bit = LSBIT(temp_bits);
/* Do something with one_bit */
}
I'm not sure whether this suits your needs. You said you want to check for 0 bits, rather than 1 bits — maybe you could bitwise-invert the initial value. Also for multi-byte values, you could put it in another for loop to process one byte/word at a time.

It's true for little-endian memory architecture:
const int cBitmapSize = 8;
const int cBitsCount = cBitmapSize * 8;
const unsigned char cBitmap[cBitmapSize] = /* some data */;
for(int n = 0; n < cBitsCount; n++)
{
unsigned char Mask = 1 << (n % 8);
if(cBitmap[n / 8] & Mask)
{
// if n'th bit is 1...
}
}

In the C language, chars are 8-bit wide bytes, and in general in computer science, data is organized around bytes as the fundamental unit.
In some cases, such as your problem, data is stored as boolean values in individual bits, so we need a way to determine whether a particular bit in a particular byte is on or off. There is already an SO solution for this explaining how to do bit manipulations in C.
To check a bit, the usual method is to AND it with the bit you want to check:
int isBitSet = bitmap & (1 << bit_position);
If the variable isBitSet is 0 after this operation, then the bit is not set. Any other value indicates that the bit is on.

For one char b you can simply iterate like this :
for (int i=0; i<8; i++) {
printf("This is the %d-th bit : %d\n",i,(b>>i)&1);
}
You can then iterate through the chars as needed.
What you should understand is that you cannot manipulate directly the bits, you can just use some arithmetic properties of number in base 2 to compute numbers that in some way represents some bits you want to know.
How does it work for example ? In a char there is 8 bits. A char can be see as a number written with 8 bits in base 2. If the number in b is b7b6b5b4b3b2b1b0 (each being a digit) then b>>i is b shifted to the right by i positions (in the left 0's are pushed). So, 10110111 >> 2 is 00101101, then the operation &1 isolate the last bit (bitwise and operator).

If you want to iterate through all char.
char *str = "MNO"; // M=01001101, N=01001110, O=01001111
int bit = 0;
for (int x = strlen(str)-1; x > -1; x--){ // Start from O, N, M
printf("Char %c \n", str[x]);
for(int y=0; y<8; y++){ // Iterate though every bit
// Shift bit the the right with y step and mask last position
if( str[x]>>y & 0b00000001 ){
printf("bit %d = 1\n", bit);
}else{
printf("bit %d = 0\n", bit);
}
bit++;
}
}
output
Char O
bit 0 = 1
bit 1 = 1
bit 2 = 1
bit 3 = 1
bit 4 = 0
bit 5 = 0
bit 6 = 1
bit 7 = 0
Char N
bit 8 = 0
bit 9 = 1
bit 10 = 1
...

Memory layout of struct having bitfields

I have this C struct: (representing an IP datagram)
struct ip_dgram
{
unsigned int ver : 4;
unsigned int hlen : 4;
unsigned int stype : 8;
unsigned int tlen : 16;
unsigned int fid : 16;
unsigned int flags : 3;
unsigned int foff : 13;
unsigned int ttl : 8;
unsigned int pcol : 8;
unsigned int chksm : 16;
unsigned int src : 32;
unsigned int des : 32;
unsigned char opt[40];
};
I'm assigning values to it, and then printing its memory layout in 16-bit words like this:
//prints 16 bits at a time
void print_dgram(struct ip_dgram dgram)
{
unsigned short int* ptr = (unsigned short int*)&dgram;
int i,j;
//print only 10 words
for(i=0 ; i<10 ; i++)
{
for(j=15 ; j>=0 ; j--)
{
if( (*ptr) & (1<<j) ) printf("1");
else printf("0");
if(j%8==0)printf(" ");
}
ptr++;
printf("\n");
}
}
int main()
{
struct ip_dgram dgram;
dgram.ver = 4;
dgram.hlen = 5;
dgram.stype = 0;
dgram.tlen = 28;
dgram.fid = 1;
dgram.flags = 0;
dgram.foff = 0;
dgram.ttl = 4;
dgram.pcol = 17;
dgram.chksm = 0;
dgram.src = (unsigned int)htonl(inet_addr("10.12.14.5"));
dgram.des = (unsigned int)htonl(inet_addr("12.6.7.9"));
print_dgram(dgram);
return 0;
}
I get this output:
00000000 01010100
00000000 00011100
00000000 00000001
00000000 00000000
00010001 00000100
00000000 00000000
00001110 00000101
00001010 00001100
00000111 00001001
00001100 00000110
But I expect this:
The output is partially correct; somewhere, the bytes and nibbles seem to be interchanged. Is there some endianness issue here? Are bit-fields not good for this purpose? I really don't know. Any help? Thanks in advance!

No, bitfields are not good for this purpose. The layout is compiler-dependant.
It's generally not a good idea to use bitfields for data where you want to control the resulting layout, unless you have (compiler-specific) means, such as #pragmas, to do so.
The best way is probably to implement this without bitfields, i.e. by doing the needed bitwise operations yourself. This is annoying, but way easier than somehow digging up a way to fix this. Also, it's platform-independent.
Define the header as just an array of 16-bit words, and then you can compute the checksum easily enough.

The C11 standard says:
An implementation may allocate any addressable storage unit large
enough to hold a bitfield. If enough space remains, a bit-field that
immediately follows another bit-field in a structure shall be packed
into adjacent bits of the same unit. If insufficient space remains,
whether a bit-field that does not fit is put into the next unit or
overlaps adjacent units is implementation-defined. The order of
allocation of bit-fields within a unit (high-order to low-order or
low-order to high-order) is implementation-defined.
I'm pretty sure this is undesirable, as it means there might be padding between your fields, and that you can't control the order of your fields. Not just that, but you're at the whim of the implementation in terms of network byte order. Additionally, imagine if an unsigned int is only 16 bits, and you're asking to fit a 32-bit bitfield into it:
The expression that specifies the width of a bit-field shall be an
integer constant expression with a nonnegative value that does not
exceed the width of an object of the type that would be specified were
the colon and expression omitted.
I suggest using an array of unsigned chars instead of a struct. This way you're guaranteed control over padding and network byte order. Start off with the size in bits that you want your structure to be, in total. I'll assume you're declaring this in a constant such as IP_PACKET_BITCOUNT: typedef unsigned char ip_packet[(IP_PACKET_BITCOUNT / CHAR_BIT) + (IP_PACKET_BITCOUNT % CHAR_BIT > 0)];
Write a function, void set_bits(ip_packet p, size_t bitfield_offset, size_t bitfield_width, unsigned char *value) { ... } which allows you to set the bits starting at p[bitfield_offset / CHAR_BIT] bit bitfield_offset % CHARBIT to the bits found in value, up to bitfield_width bits in length. This will be the most complicated part of your task.
Then you could define identifiers for VER_OFFSET 0 and VER_WIDTH 4, HLEN_OFFSET 4 and HLEN_WIDTH 4, etc to make modification of the array seem less painless.

Although question was asked long time back, there's no answer with explaination of your result. I'll answer it, hopefully it'll be useful to someone.
I'll illustrate the bug using first 16 bits of your data structure.
Please Note: This explaination is guarranteed to be true only with the set of your processor and compiler. If any of these changes, behaviour may change.
Fields:
unsigned int ver : 4;
unsigned int hlen : 4;
unsigned int stype : 8;
Assigned to:
dgram.ver = 4;
dgram.hlen = 5;
dgram.stype = 0;
Compiler starts assigning bit fields starting with offset 0. This means first byte of your data structure is stored in memory as:
Bit offset: 7 4 0
-------------
| 5 | 4 |
-------------
First 16 bits after assignment look like this:
Bit offset: 15 12 8 4 0
-------------------------
| 5 | 4 | 0 | 0 |
-------------------------
Memory Address: 100 101
You are using Unsigned 16 pointer to dereference memory address 100. As a result address 100 is treated as LSB of a 16 bit number. And 101 is treated as MSB of a 16 bit number.
If you print *ptr in hex you'll see this:
*ptr = 0x0054
Your loop is running on this 16 bit value and hence you get:
00000000 0101 0100
-------- ---- ----
0 5 4
Solution:
Change order of elements to
unsigned int hlen : 4;
unsigned int ver : 4;
unsigned int stype : 8;
And use unsigned char * pointer to traverse and print values.
It should work.
Please note, as others've said, this behavior is platform and compiler specific. If any of these changes, you need to verify that memory layout of your data structure is correct.

For Chinese users, I think you can refer blog for more details, really good.
In summary, due to endianness, there is byte order as well as bit order. Bit order is the order how each bit of one byte saved in memory. Bit order has same rule with byte order in sense of endianness issue.
For your picture, it's designed in network order which is big endian. So your struct defination is actually for big endian. Per your output, your PC is little endian, so you need change struct field orders when use.
The way to show each bits is incorrect since when get by char, the bit order has changed from machine order (little endian in your case) to normal order which we human use. You may change it as following per refered blog.
void
dump_native_bits_storage_layout(unsigned char *p, int bytes_num)
{
union flag_t {
unsigned char c;
struct base_flag_t {
unsigned int p7:1,
p6:1,
p5:1,
p4:1,
p3:1,
p2:1,
p1:1,
p0:1;
} base;
} f;
for (int i = 0; i < bytes_num; i++) {
f.c = *(p + i);
printf("%d%d%d%d %d%d%d%d ",
f.base.p7,
f.base.p6,
f.base.p5,
f.base.p4,
f.base.p3,
f.base.p2,
f.base.p1,
f.base.p0);
}
printf("\n");
}
//prints 16 bits at a time
void print_dgram(struct ip_dgram dgram)
{
unsigned char* ptr = (unsigned short int*)&dgram;
int i,j;
//print only 10 words
for(i=0 ; i<10 ; i++)
{
dump_native_bits_storage_layout(ptr, 1);
/* for(j=7 ; j>=0 ; j--)
{
if( (*ptr) & (1<<j) ) printf("1");
else printf("0");
if(j%8==0)printf(" ");
}*/
ptr++;
//printf("\n");
}
}

#unwind
A typical use case of Bit Fields is interpreting/emulation of byte code or CPU instructions with given layout. "Don't use it, because you cannot control it" is the answer for children.
#Bruce
For Intel/GCC I see a packed LITTLE ENDIAN bit layout, i.e. in struct ip_dgram field ver is represented by bits 0..3, field hlen is represented by bits 4..7 ...
For correctness of operation it is required to verify the memory layout against your design at runtime.
struct ModelIndicator
{
int a:4;
int b:4;
int c:4;
};
union UModelIndicator
{
ModelIndicator i;
int v;
};
// test packed little endian
static bool verifyLayoutModel()
{
UModelIndicator um;
um.v = 0;
um.i.a = 2; // 0..3
um.i.b = 3; // 4..7
um.i.c = 9; // 8..11
return um.v = (9 << 8) + (3 << 4) + 2;
}
int main()
{
if (!verifyLayoutModel())
{
std::cerr << "Invalid memory layout" << std::endl;
return -1;
}
// ...
}
At the earliest, when above test fails, you need to consider compiler pragmas or adjust your structures accordingly, resp. verifyLayoutModel().

I agree with what unwind said. Bit fields are compiler dependent.
If you need the bits to be in a specific order, pack the data into a pointer to a character array. Increment the buffer the size of the element being packed. Pack the next element.
pack( char** buffer )
{
if ( buffer & *buffer )
{
//pack ver
//assign first 4 bits to 4.
*((UInt4*) *buffer ) = 4;
*buffer += sizeof(UInt4);
//assign next 4 bits to 5
*((UInt4*) *buffer ) = 5;
*buffer += sizeof(UInt4);
... continue packing
}
}

Compiler dependant or not, It depends whether you want to write a very fast program or if you want one that works with different compilers. To write for C a fast, compact application, use a stuct with bit fields/. If you want a slow general purpose program , long code it.

MMX operation (add 16bit is not done)

I got some vectors containing unsigned chars that represent pixels from a frame.
I got this function working without the MMX improvement, but I frustrated whit MMX that doesnt work ... So:
I need to add two unsigned chars (the sum need to be done as a 16bit instead of a 8bit cause unsigned char goes from 0-255 as known) and divide them by two (shift right 1). The code I have done so far is below, but the values are wrong, the adds_pu16 doesnt add the 16bit just 8:
MM0 = _mm_setzero_si64(); //all zeros
MM1 = TO_M64(lv1+k); //first 8 unsigned chars
MM2 = TO_M64(lv2+k); //second 8 unsigned chars
MM3 =_mm_unpacklo_pi8(MM0,MM1); //get first 4chars from MM1 and add Zeros
MM4 =_mm_unpackhi_pi8(MM0,MM1); //get last 4chars from MM1 and add Zeros
MM5 =_mm_unpacklo_pi8(MM0,MM2); //same as above for line 2
MM6 =_mm_unpackhi_pi8(MM0,MM2);
MM1 = _mm_adds_pu16(MM3,MM5); //add both chars as a 16bit sum (255+255 max range)
MM2 = _mm_adds_pu16(MM4,MM6);
MM3 = _mm_srai_pi16(MM1,1); //right shift (division by 2)
MM4 = _mm_srai_pi16(MM2,1);
MM1 = _mm_packs_pi16(MM3,MM4); //pack the 2 MMX registers into one
v2 = TO_UCHAR(MM1); //put results in the destination array
New developments:
Thanks for that king_nak!!
I wrote a simple version of what I am trying to do:
int main()
{
char A[8]={255,155,2,3,4,5,6,7};
char B[8]={255,155,2,3,4,5,6,7};
char C[8];
char D[8];
char R[8];
__m64* pA=(__m64*) A;
__m64* pB=(__m64*) B;
__m64* pC=(__m64*) C;
__m64* pD=(__m64*) D;
__m64* pR=(__m64*) R;
_mm_empty();
__m64 MM0 = _mm_setzero_si64();
__m64 MM1 = _mm_unpacklo_pi8(*pA,MM0);
__m64 MM2 = _mm_unpackhi_pi8(*pA,MM0);
__m64 MM3 = _mm_unpacklo_pi8(*pB,MM0);
__m64 MM4 = _mm_unpackhi_pi8(*pB,MM0);
__m64 MM5 = _mm_add_pi16(MM1,MM3);
__m64 MM6 = _mm_add_pi16(MM2,MM4);
printf("SUM:\n");
*pC= _mm_add_pi16(MM1,MM3);
*pD= _mm_add_pi16(MM2,MM4);
for(int i=0; i<8; i++) printf("\t%d ", (C[i])); printf("\n");
for(int i=0; i<8; i++) printf("\t%d ", D[i]); printf("\n");
printf("DIV:\n");
*pC= _mm_srai_pi16(MM5,1);
*pD= _mm_srai_pi16(MM6,1);
for(int i=0; i<8; i++) printf("\t%d ", (C[i])); printf("\n");
for(int i=0; i<8; i++) printf("\t%d ", D[i]); printf("\n");
MM1= _mm_srai_pi16(MM5,1);
MM2= _mm_srai_pi16(MM6,1);
printf("Final Result:\n");
*pR= _mm_packs_pi16(MM1,MM2);
for(int i=0; i<8; i++) printf("\t%d ", (R[i])); printf("\n");
return(0);
}
And the results are:
SUM:
-2 1 54 1 4 0 6 0
8 0 10 0 12 0 14 0
DIV:
-1 0 -101 0 2 0 3 0
4 0 5 0 6 0 7 0
Final Result:
127 127 2 3 4 5 6 7
Well the small numbers are ok while the big numbers which give 127 are wrong. This is a problem, what am I doing wrong :s

You should switch the operands in the _mm_unpacklo_pi8 calls. As you do it, the value bytes are in the higher bytes of the word (e.g. AB and 00 packed to AB00). After addition and shifting, the values will be greater then 0x7F, und thus saturated to that value by the pack instruction.
With switched operands, the math is done on values like 00AB, and the result will fit into a signed byte.
UPATE:
After your additional info, I see that the problem is with the _mm_packs_pi16. This is the assembly instruction packsswb, which will saturate signed bytes. E.g. Values > 127 will be set to 127. (255+255)>>1 is 255, and (155+155)>>1 is 155...
Use _mm_packs_pu16 instead. This treats the values as unsigned bytes, and you get the desired results (255/155).

I think i found the problem:
The arguments of the unpack instructions are in the wrong order. If you look at the registers as a whole, it looks like the individual chars are zero-extended to shorts, but in fact, they are zero-padded. Just swap around mm0 and the other register in each case and it should work.
Also, you don't need saturated add, a normal PADDW is sufficient. The maximum value you will get is 0xff+0xff=0x01fe, which doesn't have to be saturated.
Edit: What's more, PACKSSWB doesn't quite do what you want. PACKUSWB is the correct instruction, saturation will get you wrong results.
Here's a solution (Also replaced the shifts with logical ones and used different pseudo-registers in some places):
mm0=pxor(mm0,mm0) =[00,00,00,00,00,00,00,00]
mm1 =[a0,10,ff,18,7f,f0,ff,cc]
mm2 =[c0,20,ff,00,70,26,ff,01]
mm3=punpcklbw(mm1,mm0) =[00a0,0010,00ff,0018]
mm4=punpckhbw(mm1,mm0) =[007f,00f0,00ff,00cc]
mm5=punpcklbw(mm2,mm0) =[00c0,0020,00ff,0000]
mm6=punpckhbw(mm2,mm0) =[0070,0026,00ff,0001]
mm5=paddw(mm3,mm5) =[0160,0030,01fe,0018]
mm6=paddw(mm4,mm6) =[00ef,0116,01fe,00cd]
mm3=psrlwi(mm5,1) =[00b0,0018,00ff,000c]
mm4=psrlwi(mm6,1) =[0077,008b,00ff,0066]
mm1=packuswb(mm3,mm4) =[b0,18,ff,0c,77,8b,ff,66]

As an aside, you don't need a 16 bit intermediate to calculate the average of two 8 bit values. The formulation:
(a >> 1) + (b >> 1) + (a & b & 1)
gives the correct result with only 8 bit intermediates necessary. Perhaps you can utilise this to improve your throughput, if you have 8 bit vector instructions available.

Bit reversal of an integer, ignoring integer size and endianness

Given an integer typedef:
typedef unsigned int TYPE;
or
typedef unsigned long TYPE;
I have the following code to reverse the bits of an integer:
TYPE max_bit= (TYPE)-1;
void reverse_int_setup()
{
TYPE bits= (TYPE)max_bit;
while (bits <<= 1)
max_bit= bits;
}
TYPE reverse_int(TYPE arg)
{
TYPE bit_setter= 1, bit_tester= max_bit, result= 0;
for (result= 0; bit_tester; bit_tester>>= 1, bit_setter<<= 1)
if (arg & bit_tester)
result|= bit_setter;
return result;
}
One just needs first to run reverse_int_setup(), which stores an integer with the highest bit turned on, then any call to reverse_int(arg) returns arg with its bits reversed (to be used as a key to a binary tree, taken from an increasing counter, but that's more or less irrelevant).
Is there a platform-agnostic way to have in compile-time the correct value for max_int after the call to reverse_int_setup(); Otherwise, is there an algorithm you consider better/leaner than the one I have for reverse_int()?
Thanks.

#include<stdio.h>
#include<limits.h>
#define TYPE_BITS sizeof(TYPE)*CHAR_BIT
typedef unsigned long TYPE;
TYPE reverser(TYPE n)
{
TYPE nrev = 0, i, bit1, bit2;
int count;
for(i = 0; i < TYPE_BITS; i += 2)
{
/*In each iteration, we swap one bit on the 'right half'
of the number with another on the left half*/
count = TYPE_BITS - i - 1; /*this is used to find how many positions
to the left (and right) we gotta move
the bits in this iteration*/
bit1 = n & (1<<(i/2)); /*Extract 'right half' bit*/
bit1 <<= count; /*Shift it to where it belongs*/
bit2 = n & 1<<((i/2) + count); /*Find the 'left half' bit*/
bit2 >>= count; /*Place that bit in bit1's original position*/
nrev |= bit1; /*Now add the bits to the reversal result*/
nrev |= bit2;
}
return nrev;
}
int main()
{
TYPE n = 6;
printf("%lu", reverser(n));
return 0;
}
This time I've used the 'number of bits' idea from TK, but made it somewhat more portable by not assuming a byte contains 8 bits and instead using the CHAR_BIT macro. The code is more efficient now (with the inner for loop removed). I hope the code is also slightly less cryptic this time. :)
The need for using count is that the number of positions by which we have to shift a bit varies in each iteration - we have to move the rightmost bit by 31 positions (assuming 32 bit number), the second rightmost bit by 29 positions and so on. Hence count must decrease with each iteration as i increases.
Hope that bit of info proves helpful in understanding the code...

The following program serves to demonstrate a leaner algorithm for reversing bits, which can be easily extended to handle 64bit numbers.
#include <stdio.h>
#include <stdint.h>
int main(int argc, char**argv)
{
int32_t x;
if ( argc != 2 )
{
printf("Usage: %s hexadecimal\n", argv[0]);
return 1;
}
sscanf(argv[1],"%x", &x);
/* swap every neigbouring bit */
x = (x&0xAAAAAAAA)>>1 | (x&0x55555555)<<1;
/* swap every 2 neighbouring bits */
x = (x&0xCCCCCCCC)>>2 | (x&0x33333333)<<2;
/* swap every 4 neighbouring bits */
x = (x&0xF0F0F0F0)>>4 | (x&0x0F0F0F0F)<<4;
/* swap every 8 neighbouring bits */
x = (x&0xFF00FF00)>>8 | (x&0x00FF00FF)<<8;
/* and so forth, for say, 32 bit int */
x = (x&0xFFFF0000)>>16 | (x&0x0000FFFF)<<16;
printf("0x%x\n",x);
return 0;
}
This code should not contain errors, and was tested using 0x12345678 to produce 0x1e6a2c48 which is the correct answer.

typedef unsigned long TYPE;
TYPE reverser(TYPE n)
{
TYPE k = 1, nrev = 0, i, nrevbit1, nrevbit2;
int count;
for(i = 0; !i || (1 << i && (1 << i) != 1); i+=2)
{
/*In each iteration, we swap one bit
on the 'right half' of the number with another
on the left half*/
k = 1<<i; /*this is used to find how many positions
to the left (or right, for the other bit)
we gotta move the bits in this iteration*/
count = 0;
while(k << 1 && k << 1 != 1)
{
k <<= 1;
count++;
}
nrevbit1 = n & (1<<(i/2));
nrevbit1 <<= count;
nrevbit2 = n & 1<<((i/2) + count);
nrevbit2 >>= count;
nrev |= nrevbit1;
nrev |= nrevbit2;
}
return nrev;
}
This works fine in gcc under Windows, but I'm not sure if it's completely platform independent. A few places of concern are:
the condition in the for loop - it assumes that when you left shift 1 beyond the leftmost bit, you get either a 0 with the 1 'falling out' (what I'd expect and what good old Turbo C gives iirc), or the 1 circles around and you get a 1 (what seems to be gcc's behaviour).
the condition in the inner while loop: see above. But there's a strange thing happening here: in this case, gcc seems to let the 1 fall out and not circle around!
The code might prove cryptic: if you're interested and need an explanation please don't hesitate to ask - I'll put it up someplace.

#ΤΖΩΤΖΙΟΥ
In reply to ΤΖΩΤΖΙΟΥ 's comments, I present modified version of above which depends on a upper limit for bit width.
#include <stdio.h>
#include <stdint.h>
typedef int32_t TYPE;
TYPE reverse(TYPE x, int bits)
{
TYPE m=~0;
switch(bits)
{
case 64:
x = (x&0xFFFFFFFF00000000&m)>>16 | (x&0x00000000FFFFFFFF&m)<<16;
case 32:
x = (x&0xFFFF0000FFFF0000&m)>>16 | (x&0x0000FFFF0000FFFF&m)<<16;
case 16:
x = (x&0xFF00FF00FF00FF00&m)>>8 | (x&0x00FF00FF00FF00FF&m)<<8;
case 8:
x = (x&0xF0F0F0F0F0F0F0F0&m)>>4 | (x&0x0F0F0F0F0F0F0F0F&m)<<4;
x = (x&0xCCCCCCCCCCCCCCCC&m)>>2 | (x&0x3333333333333333&m)<<2;
x = (x&0xAAAAAAAAAAAAAAAA&m)>>1 | (x&0x5555555555555555&m)<<1;
}
return x;
}
int main(int argc, char**argv)
{
TYPE x;
TYPE b = (TYPE)-1;
int bits;
if ( argc != 2 )
{
printf("Usage: %s hexadecimal\n", argv[0]);
return 1;
}
for(bits=1;b;b<<=1,bits++);
--bits;
printf("TYPE has %d bits\n", bits);
sscanf(argv[1],"%x", &x);
printf("0x%x\n",reverse(x, bits));
return 0;
}
Notes:
gcc will warn on the 64bit constants
the printfs will generate warnings too
If you need more than 64bit, the code should be simple enough to extend
I apologise in advance for the coding crimes I committed above - mercy good sir!

There's a nice collection of "Bit Twiddling Hacks", including a variety of simple and not-so simple bit reversing algorithms coded in C at http://graphics.stanford.edu/~seander/bithacks.html.
I personally like the "Obvious" algorigthm (http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious) because, well, it's obvious. Some of the others may require less instructions to execute. If I really need to optimize the heck out of something I may choose the not-so-obvious but faster versions. Otherwise, for readability, maintainability, and portability I would choose the Obvious one.

Here is a more generally useful variation. Its advantage is its ability to work in situations where the bit length of the value to be reversed -- the codeword -- is unknown but is guaranteed not to exceed a value we'll call maxLength. A good example of this case is Huffman code decompression.
The code below works on codewords from 1 to 24 bits in length. It has been optimized for fast execution on a Pentium D. Note that it accesses the lookup table as many as 3 times per use. I experimented with many variations that reduced that number to 2 at the expense of a larger table (4096 and 65,536 entries). This version, with the 256-byte table, was the clear winner, partly because it is so advantageous for table data to be in the caches, and perhaps also because the processor has an 8-bit table lookup/translation instruction.
const unsigned char table[] = {
0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
0x08,0x88,0x48,0xC8,0x28,0xA8,0x68,0xE8,0x18,0x98,0x58,0xD8,0x38,0xB8,0x78,0xF8,
0x04,0x84,0x44,0xC4,0x24,0xA4,0x64,0xE4,0x14,0x94,0x54,0xD4,0x34,0xB4,0x74,0xF4,
0x0C,0x8C,0x4C,0xCC,0x2C,0xAC,0x6C,0xEC,0x1C,0x9C,0x5C,0xDC,0x3C,0xBC,0x7C,0xFC,
0x02,0x82,0x42,0xC2,0x22,0xA2,0x62,0xE2,0x12,0x92,0x52,0xD2,0x32,0xB2,0x72,0xF2,
0x0A,0x8A,0x4A,0xCA,0x2A,0xAA,0x6A,0xEA,0x1A,0x9A,0x5A,0xDA,0x3A,0xBA,0x7A,0xFA,
0x06,0x86,0x46,0xC6,0x26,0xA6,0x66,0xE6,0x16,0x96,0x56,0xD6,0x36,0xB6,0x76,0xF6,
0x0E,0x8E,0x4E,0xCE,0x2E,0xAE,0x6E,0xEE,0x1E,0x9E,0x5E,0xDE,0x3E,0xBE,0x7E,0xFE,
0x01,0x81,0x41,0xC1,0x21,0xA1,0x61,0xE1,0x11,0x91,0x51,0xD1,0x31,0xB1,0x71,0xF1,
0x09,0x89,0x49,0xC9,0x29,0xA9,0x69,0xE9,0x19,0x99,0x59,0xD9,0x39,0xB9,0x79,0xF9,
0x05,0x85,0x45,0xC5,0x25,0xA5,0x65,0xE5,0x15,0x95,0x55,0xD5,0x35,0xB5,0x75,0xF5,
0x0D,0x8D,0x4D,0xCD,0x2D,0xAD,0x6D,0xED,0x1D,0x9D,0x5D,0xDD,0x3D,0xBD,0x7D,0xFD,
0x03,0x83,0x43,0xC3,0x23,0xA3,0x63,0xE3,0x13,0x93,0x53,0xD3,0x33,0xB3,0x73,0xF3,
0x0B,0x8B,0x4B,0xCB,0x2B,0xAB,0x6B,0xEB,0x1B,0x9B,0x5B,0xDB,0x3B,0xBB,0x7B,0xFB,
0x07,0x87,0x47,0xC7,0x27,0xA7,0x67,0xE7,0x17,0x97,0x57,0xD7,0x37,0xB7,0x77,0xF7,
0x0F,0x8F,0x4F,0xCF,0x2F,0xAF,0x6F,0xEF,0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF};
const unsigned short masks[17] =
{0,0,0,0,0,0,0,0,0,0X0100,0X0300,0X0700,0X0F00,0X1F00,0X3F00,0X7F00,0XFF00};
unsigned long codeword; // value to be reversed, occupying the low 1-24 bits
unsigned char maxLength; // bit length of longest possible codeword (<= 24)
unsigned char sc; // shift count in bits and index into masks array
if (maxLength <= 8)
{
codeword = table[codeword << (8 - maxLength)];
}
else
{
sc = maxLength - 8;
if (maxLength <= 16)
{
codeword = (table[codeword & 0X00FF] << sc)
| table[codeword >> sc];
}
else if (maxLength & 1) // if maxLength is 17, 19, 21, or 23
{
codeword = (table[codeword & 0X00FF] << sc)
| table[codeword >> sc] |
(table[(codeword & masks[sc]) >> (sc - 8)] << 8);
}
else // if maxlength is 18, 20, 22, or 24
{
codeword = (table[codeword & 0X00FF] << sc)
| table[codeword >> sc]
| (table[(codeword & masks[sc]) >> (sc >> 1)] << (sc >> 1));
}
}

How about:
long temp = 0;
int counter = 0;
int number_of_bits = sizeof(value) * 8; // get the number of bits that represent value (assuming that it is aligned to a byte boundary)
while(value > 0) // loop until value is empty
{
temp <<= 1; // shift whatever was in temp left to create room for the next bit
temp |= (value & 0x01); // get the lsb from value and set as lsb in temp
value >>= 1; // shift value right by one to look at next lsb
counter++;
}
value = temp;
if (counter < number_of_bits)
{
value <<= counter-number_of_bits;
}
(I'm assuming that you know how many bits value holds and it is stored in number_of_bits)
Obviously temp needs to be the longest imaginable data type and when you copy temp back into value, all the extraneous bits in temp should magically vanish (I think!).
Or, the 'c' way would be to say :
while(value)
your choice

We can store the results of reversing all possible 1 byte sequences in an array (256 distinct entries), then use a combination of lookups into this table and some oring logic to get the reverse of integer.

Here is a variation and correction to TK's solution which might be clearer than the solutions by sundar. It takes single bits from t and pushes them into return_val:
typedef unsigned long TYPE;
#define TYPE_BITS sizeof(TYPE)*8
TYPE reverser(TYPE t)
{
unsigned int i;
TYPE return_val = 0
for(i = 0; i < TYPE_BITS; i++)
{/*foreach bit in TYPE*/
/* shift the value of return_val to the left and add the rightmost bit from t */
return_val = (return_val << 1) + (t & 1);
/* shift off the rightmost bit of t */
t = t >> 1;
}
return(return_val);
}

The generic approach hat would work for objects of any type of any size would be to reverse the of bytes of the object, and the reverse the order of bits in each byte. In this case the bit-level algorithm is tied to a concrete number of bits (a byte), while the "variable" logic (with regard to size) is lifted to the level of whole bytes.

Here's my generalization of freespace's solution (in case we one day get 128-bit machines). It results in jump-free code when compiled with gcc -O3, and is obviously insensitive to the definition of foo_t on sane machines. Unfortunately it does depend on shift being a power of 2!
#include <limits.h>
#include <stdio.h>
typedef unsigned long foo_t;
foo_t reverse(foo_t x)
{
int shift = sizeof (x) * CHAR_BIT / 2;
foo_t mask = (1 << shift) - 1;
int i;
for (i = 0; shift; i++) {
x = ((x & mask) << shift) | ((x & ~mask) >> shift);
shift >>= 1;
mask ^= (mask << shift);
}
return x;
}
int main() {
printf("reverse = 0x%08lx\n", reverse(0x12345678L));
}

In case bit-reversal is time critical, and mainly in conjunction with FFT, the best is to store the whole bit reversed array. In any case, this array will be smaller in size than the roots of unity that have to be precomputed in FFT Cooley-Tukey algorithm. An easy way to compute the array is:
int BitReverse[Size]; // Size is power of 2
void Init()
{
BitReverse[0] = 0;
for(int i = 0; i < Size/2; i++)
{
BitReverse[2*i] = BitReverse[i]/2;
BitReverse[2*i+1] = (BitReverse[i] + Size)/2;
}
} // end it's all