Bitwise memmove

What is the best way to implement a bitwise memmove? The method should take an additional destination and source bit-offset and the count should be in bits too.
I saw that ARM provides a non-standard _membitmove, which does exactly what I need, but I couldn't find its source.
BIND's bitset code includes isc_bitstring_copy, but it's not efficient.
I'm aware that the C standard library doesn't provide such a method, but I also couldn't find any third-party code providing a similar method.

Assuming "best" means "easiest", you can copy bits one by one. Conceptually, an address of a bit is an object (struct) that has a pointer to a byte in memory and an index of a bit in the byte.
struct pointer_to_bit
{
uint8_t* p;
int b;
};
void membitmovebl(
void *dest,
const void *src,
int dest_offset,
int src_offset,
size_t nbits)
{
// Create pointers to bits
struct pointer_to_bit d = {dest, dest_offset};
struct pointer_to_bit s = {(uint8_t *)src, src_offset}; // cast drops const; s is only ever read
// Bring the bit offsets to range (0...7)
d.p += d.b / 8; // replace division by right-shift if bit offset can be negative
d.b %= 8; // replace "%=8" by "&=7" if bit offset can be negative
s.p += s.b / 8;
s.b %= 8;
// Determine whether it's OK to loop forward
if (d.p < s.p || (d.p == s.p && d.b <= s.b))
{
// Copy bits one by one
for (size_t i = 0; i < nbits; i++)
{
// Read 1 bit
int bit = (*s.p >> s.b) & 1;
// Write 1 bit
*d.p &= ~(1 << d.b);
*d.p |= bit << d.b;
// Advance pointers
if (++s.b == 8)
{
s.b = 0;
++s.p;
}
if (++d.b == 8)
{
d.b = 0;
++d.p;
}
}
}
else
{
// Copy stuff backwards - essentially the same code but ++ replaced by --
}
}
If you want to write a version optimized for speed, you will have to do copying by bytes (or, better, words), unroll loops, and handle a number of special cases (memmove does that; you will have to do more because your function is more complicated).
P.S. Oh, seeing that you call isc_bitstring_copy inefficient, you probably want the speed optimization. You can use the following idea:
Start copying bits individually until the destination is byte-aligned (d.b == 0). Then it is easy to copy 8 bits at once, doing some bit twiddling. Do this until there are fewer than 8 bits left to copy; then continue copying bits one by one.
// Copy 8 bits from s to d and advance pointers
*d.p = *s.p++ >> s.b;
*d.p++ |= *s.p << (8 - s.b);
P.P.S Oh, and seeing your comment on what you are going to use the code for, you don't really need to implement all the versions (byte/halfword/word, big/little-endian); you only want the easiest one - the one working with words (uint32_t).
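To make the stubbed-out backward branch in the listing above concrete, here is a minimal sketch of the complete bit-by-bit copy that handles both directions. The name membitmove_sketch, the parameter order, and the LSB-first bit numbering within each byte are assumptions made for illustration; this is not ARM's _membitmove, and it is written for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Read bit i of a bit string that starts at p (bit 0 = LSB of p[0]). */
static int get_bit(const uint8_t *p, size_t i)
{
    return (p[i / 8] >> (i % 8)) & 1;
}

/* Write bit i of a bit string that starts at p. */
static void put_bit(uint8_t *p, size_t i, int bit)
{
    uint8_t mask = (uint8_t)(1u << (i % 8));
    if (bit)
        p[i / 8] |= mask;
    else
        p[i / 8] &= (uint8_t)~mask;
}

/* Hypothetical name and signature: copy nbits bits; the regions may overlap. */
void membitmove_sketch(void *dest, size_t dest_bit,
                       const void *src, size_t src_bit, size_t nbits)
{
    /* Fold whole bytes of each offset into the pointers. */
    uint8_t *d = (uint8_t *)dest + dest_bit / 8;
    const uint8_t *s = (const uint8_t *)src + src_bit / 8;
    size_t i;

    dest_bit %= 8;
    src_bit %= 8;

    if ((uintptr_t)d < (uintptr_t)s || (d == s && dest_bit <= src_bit)) {
        /* Destination starts at or before the source: copy forward. */
        for (i = 0; i < nbits; i++)
            put_bit(d, dest_bit + i, get_bit(s, src_bit + i));
    } else {
        /* Destination starts after the source: copy backward. */
        for (i = nbits; i-- > 0; )
            put_bit(d, dest_bit + i, get_bit(s, src_bit + i));
    }
}
```

Once this reference version is in place, a faster byte-wise or word-wise version can be checked against it on random inputs.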

Here is a partial implementation (not tested). There are obvious efficiency and usability improvements.
Copy n bytes from src to dest (not overlapping src), and shift bits at dest rightwards by bit bits, 0 <= bit <= 7. This assumes that the least significant bits are at the right of the bytes
void memcpy_with_bitshift(unsigned char *dest, unsigned char *src, size_t n, int bit)
{
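size_t i;
memcpy(dest, src, n);
for (i = 0; i < n; i++) {
dest[i] >>= bit; // shift in place (a bare "dest[i] >> bit" would discard the result)
}
for (i = 0; i + 1 < n; i++) { // stop one short so dest[i+1] stays inside the buffer
dest[i+1] |= (unsigned char)(src[i] << (8 - bit));
}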
}
Some improvements to be made:
Don't overwrite first bit bits at beginning of dest.
Merge loops
Have a way to copy a number of bits not divisible by 8
Fix for >8 bits in a char
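As a rough sketch of the "merge loops" improvement, the two passes can be folded into one by carrying the bits that fall off each source byte into the next destination byte. The function name copy_shifted_right is hypothetical, and the code assumes 8-bit bytes, non-overlapping buffers, and 0 <= bit <= 7. It still zeroes the top bit bits of dest[0] and still copies whole bytes only, so the remaining items on the list are left open.

```c
#include <stddef.h>

/* Copy n bytes from src to dest, shifting the whole bit stream right by
 * `bit` positions (0..7). Bits shifted out of src[i] enter dest[i+1]. */
void copy_shifted_right(unsigned char *dest, const unsigned char *src,
                        size_t n, unsigned bit)
{
    unsigned carry = 0;  /* bits that fell off the previous byte */

    for (size_t i = 0; i < n; i++) {
        unsigned cur = src[i];
        dest[i] = (unsigned char)((cur >> bit) | carry);
        carry = (cur << (8 - bit)) & 0xFFu;  /* becomes 0 when bit == 0 */
    }
}
```

Because the carry is held in a local variable, nothing is ever written past dest[n-1].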

Related

Internet Checksum function move 8bits or not

This is my implementation of the Internet Checksum (RFC 1071):
static unsigned short
compute_checksum(unsigned short *addr, unsigned int count) {
register unsigned long sum = 0;
while (count > 1) {
sum += * addr++;
count -= 2;
}
//if any bytes left, pad the bytes and add
if(count > 0) {
sum+=*(unsigned char*)addr;// left move 8 bits or not?
}
//Fold sum to 16 bits: add the carry to the result
while (sum>>16) {
sum = (sum & 0xffff) + (sum >> 16);
}
//one's complement
sum = ~sum;
return ((unsigned short)sum);
}
When we reach the odd trailing byte, why don't we need to shift it left by 8 bits, like this? The RFC doesn't shift it left either. Why? I think this is the right version:
sum += (*(unsigned char*)addr << 8) & 0xFF00;

The code you posted from the RFC is correct for a little-endian machine. On a big-endian machine, your shifted solution would be necessary.
With an odd number of bytes, a (theoretical) 0 byte is added on to the end of the sequence. So the last byte XX should be treated as a short with byte sequence XX 00, which needs to be handled differently depending on the endianness of your machine.
Here's one way to handle it correctly for either endianness:
if (count > 0) {
unsigned char temp[2];
temp[0] = *(unsigned char *) addr;
temp[1] = 0;
sum += *(unsigned short *) temp;
}
For those of you who don't believe the RFC code is wrong, I refer you to this Linux source, where it is clear that the little-endian case and the big-endian case must be treated differently in the way I described. The Linux code is a little more complicated because it handles unaligned buffers.
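One way to sidestep the endianness question entirely is to build each 16-bit word from individual bytes, so the odd trailing byte XX is always summed as XX 00. The sketch below is an illustration under that assumption, not the RFC's reference code; the name inet_checksum_bytes is hypothetical. It returns the checksum as a plain number, so the high byte has to be written first when storing it into a packet.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum computed byte by byte; works on any host byte order. */
uint16_t inet_checksum_bytes(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint64_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];  /* big-endian 16-bit word */
        p += 2;
        len -= 2;
    }
    if (len > 0)
        sum += (uint32_t)p[0] << 8;           /* odd byte XX counts as XX 00 */

    while (sum >> 16)                         /* fold the carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;                    /* one's complement */
}
```

For the byte sequence 00 01 f2 03 f4 f5 f6 f7 the folded sum is 0xddf2, so the function returns 0x220d.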

A short is at least 16 bits; there's no guarantee that it isn't 32 or 64 bits.
You should be using uint16_t from <stdint.h>.
You do kind of need to shift; the last add would be
{
uint16_t tmp=0;
memcpy(&tmp,addr,1); // destination first: copy the odd byte into the first byte of tmp
sum += tmp;
}
which preserves the byte's position in memory, so on a little-endian machine the value is not shifted,
but on a big-endian machine it is, compared to its value as a char.
sum += ( *addr & *((uint16_t*)"\xff"));
but that code may not work unless you have some trick to word-align the string.
Be aware that the result is in network byte order, so
if you need it in host byte order, use the
ntohs() function to convert it.
foo=compute_checksum(blah,blah_size);
printf("the internet checksum is %04x\n",(unsigned)ntohs(foo));

Easy way to convert a string of 0's and 1's into a character? Plain C

I'm doing a steganography project where I read in bytes from a ppm file and add the least significant bit to an array. So once 8 bytes are read in, I would have 8 bits in my array, which should equal some character in a hidden message. Is there an easy way to convert an array of 0's and 1's into an ascii value? For example, the array: char bits[] = {0,1,1,1,0,1,0,0} would equal 't'. Plain C
Thanks for all the answers. I'm gonna give some of these a shot.

A simple for loop would work - something like
unsigned char ascii = 0;
unsigned char i;
for(i = 0; i < 8; i++)
ascii |= (bits[7 - i] << i);
There might be a faster way to do this, but this is a start at least.

I wouldn't store the bits in an array -- I'd OR them with a char.
So you start off with a char value of 0: char bit = 0;
When you get the first bit, OR it with what you have: bit |= bit_just_read;
Keep doing that with each bit, shifting appropriately; i.e., after you get the next bit, do bit |= (next_bit << 1);. And so forth.
After you read your 8 bits, bit will be the appropriate ASCII value, and you can print it out or do whatever with it you want to do.

I agree with mipadi, don't bother storing in an array first, that's kind of pointless. Since you have to loop or otherwise keep track of the array index while reading it in, you might as well do it in one go. Something like this, perhaps?
bits = 0;
for ( i = 0; i < 8; ++i ) {
lsb = get_byte_from_ppm_somehow() & 0x01;
bits = (bits << 1) | lsb; // "bits <<= 1 | lsb" would shift by (1 | lsb) instead
}
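For reference, here is a small self-contained sketch of the same shift-and-OR accumulation, reading from an array so that it can be tried without a PPM file. The helper name bits_to_char is hypothetical, and it assumes the first element is the most significant bit, as in the question's example.

```c
#include <stddef.h>

/* Pack 8 values (each 0 or 1, most significant bit first) into one byte. */
unsigned char bits_to_char(const unsigned char bits[8])
{
    unsigned char value = 0;

    for (size_t i = 0; i < 8; i++)
        value = (unsigned char)((value << 1) | (bits[i] & 1u));
    return value;
}
```

With the question's array {0,1,1,1,0,1,0,0} this yields 0x74, which is 't' in ASCII.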

As long as the bit order is correct, this should work and compile down pretty small.
If the bit order is backwards, then you should be able to change the initial value of mask to 1 and the mask shift to <<=, and you might need to use (0xff & mask) as the do{}while condition if your compiler doesn't do what it's supposed to with byte-sized variables.
Don't forget to do something for the magic functions that I included where I didn't know what you wanted or how you did something
#include <stdint.h> // needed for uint8_t
...
uint8_t acc, lsb, mask;
uint8_t buf[SOME_SIZE];
size_t len = 0;
while (is_there_more_ppm_data()) {
acc = 0;
mask = 0x80; // This is the high bit
do {
if (!is_there_more_ppm_data()) {
// I don't know what you think should happen if you run out on a non-byte boundary
EARLY_END_OF_DATA();
break;
}
lsb = 1 & get_next_ppm_byte();
acc |= lsb ? mask : 0; // You could use an if statement
mask >>= 1;
} while (mask);
buf[len] = acc; // NOTE: I didn't worry about running off the end of buf, but you should.
len++;
}

Turn a large chunk of memory backwards, fast

I need to rewrite about 4KB of data in reverse order, at bit level (last bit of last byte becoming first bit of first byte), as fast as possible. Are there any clever snippets to do it?
Rationale: The data is the display contents of the LCD screen in an embedded device that is usually positioned so that the screen is at shoulder level. The screen has a "6 o'clock" orientation, that is, it is meant to be viewed from below, like lying flat or hanging above eye level. This is fixable by rotating the screen 180 degrees, but then I need to reverse the screen data (generated by a library), which is 1 bit = 1 pixel, starting with the upper left of the screen. The CPU isn't very powerful, and the device has enough work already, plus several frames a second would be desirable, so performance is an issue; RAM not so much.
edit:
Single core, ARM 9 series. 64MB, (to be scaled down to 32MB later), Linux. The data is pushed from system memory to the LCD driver over 8-bit IO port.
The CPU is 32bit and performs much better at this word size than at byte level.

There's a classic way to do this. Let's say unsigned int is your 32-bit word. I'm using C99 because the restrict keyword lets the compiler perform extra optimizations in this speed-critical code that would otherwise be unavailable. These keywords inform the compiler that "src" and "dest" do not overlap. This also assumes you are copying an integral number of words, if you're not, then this is just a start.
I also don't know which bit shifting / rotation primitives are fast on the ARM and which are slow. This is something to consider. If you need more speed, consider disassembling the output from the C compiler and going from there. If using GCC, try -O2, -O3, and -Os to see which one is fastest. You might reduce stalls in the pipeline by doing two words at the same time.
This uses 23 operations per word, not counting load and store. However, these 23 operations are all very fast and none of them access memory. I don't know if a lookup table would be faster or not.
void
copy_rev(unsigned int *restrict dest,
unsigned int const *restrict src,
unsigned int n)
{
unsigned int i, x;
for (i = 0; i < n; ++i) {
x = src[i];
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
dest[n-1-i] = x;
}
}
This page is a great reference: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious
Final note: Looking at the ARM assembly reference, there is a "REV" opcode which reverses the byte order in a word (it is available from ARMv6 onward, so check that your core supports it). This would shave 7 operations per loop off the above code.

The fastest way would probably be to store the reverse of all possible byte values in a look-up table. The table would take only 256 bytes.
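A minimal sketch of that idea, assuming separate source and destination buffers and a table filled once at startup. The names init_rev_table and reverse_copy are hypothetical, and whether this beats the shift-and-mask version on a particular ARM core would have to be measured.

```c
#include <stddef.h>
#include <stdint.h>

static uint8_t rev_table[256];  /* rev_table[b] = b with its 8 bits reversed */

/* Fill the table once, before the first call to reverse_copy(). */
void init_rev_table(void)
{
    for (unsigned i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (unsigned b = 0; b < 8; b++)
            if (i & (1u << b))
                r |= (uint8_t)(0x80u >> b);
        rev_table[i] = r;
    }
}

/* Write src to dst in reverse bit order; dst and src must not overlap. */
void reverse_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = rev_table[src[n - 1 - i]];
}
```

The last source byte becomes the first destination byte, and each byte is bit-reversed on the way through the table.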

Build a 256 element lookup table of byte values that are bit-reversed from their index.
{0x00, 0x80, 0x40, 0xc0, etc}
Then iterate through your array copying using each byte as an index into your lookup table.
If you are writing assembly language, the x86 instruction set has an XLAT instruction that does just this sort of lookup. Although it may not actually be faster than C code on modern processors.
You can do this in place if you iterate from both ends towards the middle. Because of cache effects, you may find it's faster to swap in 16 byte chunks (assuming a 16 byte cache line).
Here's the basic code (not including the cache line optimization)
// bit reversing lookup table
typedef unsigned char BYTE;
extern const BYTE g_RevBits[256];
void ReverseBitsInPlace(BYTE * pb, int cb)
{
int iter = cb/2;
for (int ii = 0, jj = cb-1; ii < iter; ++ii, --jj)
{
BYTE b1 = g_RevBits[pb[ii]];
pb[ii] = g_RevBits[pb[jj]];
pb[jj] = b1;
}
if (cb & 1) // if the number of bytes was odd, reverse the bits of the middle byte in place
{
pb[cb/2] = g_RevBits[pb[cb/2]];
}
}
// initialize the bit reversing lookup table using macros to make it less typing.
#define BITLINE(n) \
0x0##n, 0x8##n, 0x4##n, 0xC##n, 0x2##n, 0xA##n, 0x6##n, 0xE##n,\
0x1##n, 0x9##n, 0x5##n, 0xD##n, 0x3##n, 0xB##n, 0x7##n, 0xF##n
const BYTE g_RevBits[256] = {
BITLINE(0), BITLINE(8), BITLINE(4), BITLINE(C),
BITLINE(2), BITLINE(A), BITLINE(6), BITLINE(E),
BITLINE(1), BITLINE(9), BITLINE(5), BITLINE(D),
BITLINE(3), BITLINE(B), BITLINE(7), BITLINE(F),
};

The Bit Twiddling Hacks site is always a good starting point for these kinds of problems. Take a look here for fast bit reversal. Then it's up to you to apply it to each byte/word of your memory block.
EDIT:
Inspired by Dietrich Epp's answer and looking at the ARM instruction set, there is an RBIT opcode that reverses the bits contained in a register (it was introduced with ARMv6T2, so check that your core supports it). So if performance is critical, you might consider using some assembly code.

Loop through the half of the array, convert and exchange bytes.
for( int i = 0; i < arraySize / 2; i++ ) {
char inverted1 = invert( array[i] );
char inverted2 = invert( array[arraySize - i - 1] );
array[i] = inverted2;
array[arraySize - i - 1] = inverted1;
}
For conversion use a precomputed table - an array of 2^CHAR_BIT (CHAR_BIT will most likely be 8) elements where at position "I" the result of inverting the byte with value "I" is stored. This will be very fast - one pass - and consume only 2^CHAR_BIT bytes for the table.

It looks like this code takes about 50 clocks per bit swap on my i7 XPS 8500 machine: 7.6 seconds for a million array flips, single-threaded. It prints some ASCII art based on patterns of 1s and 0s. I rotated the picture 180 degrees after reversing the bit array, using a graphics editor, and they look identical to me. A double-reversed image comes out the same as the original.
As for pluses, it's a complete solution. It swaps bits from the back of a bit array to the front, vs operating on ints/bytes and then needing to swap ints/bytes in an array.
Also, this is a general purpose bit library, so you might find it handy in the future for solving other, more mundane problems.
Is it as fast as the accepted answer? I think it's close, but without working code to benchmark it's impossible to say. Feel free to cut and paste this working program.
// Reverse BitsInBuff.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "time.h"
#include "memory.h"
//
// Manifest constants
#define uchar unsigned char
#define BUFF_BYTES 510 //400 supports a display of 80x40 bits
#define DW 80 // Display Width
// ----------------------------------------------------------------------------
uchar mask_set[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
uchar mask_clr[] = { 0xfe, 0xfd, 0xfb, 0xf7, 0xef, 0xdf, 0xbf, 0x7f };
//
// Function Prototypes
static void PrintIntBits(long x, int bits);
void BitSet(uchar * BitArray, unsigned long BitNumber);
void BitClr(uchar * BitArray, unsigned long BitNumber);
void BitTog(uchar * BitArray, unsigned long BitNumber);
uchar BitGet(uchar * BitArray, unsigned long BitNumber);
void BitPut(uchar * BitArray, unsigned long BitNumber, uchar value);
//
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt);
static void PrintIntBits(long x, int bits);
// -----------------------------------------------------------------------------
// Reverse the bit ordering in an array
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt) {
unsigned long front=0, back = BitKnt-1;
uchar temp;
while( front<back ) {
temp = BitGet(Buff, front); // copy front bit to temp before overwriting
BitPut(Buff, front, BitGet(Buff, back)); // copy back bit to front bit
BitPut(Buff, back, temp); // copy saved value of front in temp to back of bit array
front++;
back--;
}
return Buff;
}
// ---------------------------------------------------------------------------
// ---------------------------------------------------------------------------
int _tmain(int argc, _TCHAR* argv[]) {
int i, j, k, LoopKnt = 1000001;
clock_t start; // clock() returns clock_t, not time_t
uchar Buff[BUFF_BYTES];
memset(Buff, 0, sizeof(Buff));
// make an ASCII art picture
for(i=0, k=0; i<(sizeof(Buff)*8)/DW; i++) {
for(j=0; j<DW/2; j++) {
BitSet(Buff, (i*DW)+j+k);
}
k++;
}
// print ASCII art picture
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
i=LoopKnt;
start = clock();
while( i-- ) {
ReverseBitsInArray((uchar *)Buff, BUFF_BYTES * 8);
}
// print ASCII art pic flipped upside-down and rotated left
printf("\nMilliseconds elapsed = %ld", (long)((clock() - start) * 1000 / CLOCKS_PER_SEC));
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
printf("\n\nBenchmark time for %d loops\n", LoopKnt);
getchar();
return 0;
}
// -----------------------------------------------------------------------------
// Scaffolding...
static void PrintIntBits(long x, int bits) {
unsigned long long z=1;
int i=0;
z = z << (bits-1);
for (; z > 0; z >>= 1) {
printf("%s", ((x & z) == z) ? "#" : ".");
}
}
// These routines do bit manipulations on a bit array of unsigned chars
// ---------------------------------------------------------------------------
void BitSet(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] |= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitClr(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] &= mask_clr[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitTog(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] ^= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
uchar BitGet(uchar *buff, unsigned long BitNumber) {
return (uchar) ((buff[BitNumber >> 3] >> (BitNumber & 7)) & 1);
}
// ----------------------------------------------------------------------------
void BitPut(uchar *buff, unsigned long BitNumber, uchar value) {
if(value) { // if the bit at buff[BitNumber] is true.
BitSet(buff, BitNumber);
} else {
BitClr(buff, BitNumber);
}
}
Below is the code listing for an optimization using a new buffer, instead of swapping bits in place. Given that only 2030:4080 BitSet()s are needed because of the if() test, and about half the BitGet()s and BitPut()s are eliminated by eliminating temp, I suspect memory access time is a large, fixed cost to these kinds of operations, providing a hard limit to optimization.
Using a look-up approach, and CONDITIONALLY swapping bytes, rather than bits, reduces by a factor of 8 the number of memory accesses, and testing for a 0 byte gets amortized across 8 bits, rather than 1.
Using these two approaches together, testing to see if the entire 8-bit char is 0 before doing ANYTHING, including the table lookup, and the write, is likely going to be the fastest possible approach, but would require an extra 512 bytes for the new, destination bit array, and 256 bytes for the lookup table. The performance payoff might be quite dramatic though.
// -----------------------------------------------------------------------------
// Reverse the bit ordering in new array
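uchar *ReverseBitsInNewArray(uchar *Dst, const uchar *Src, const int BitKnt) {
int front = 0, back = BitKnt - 1;
memset(Dst, 0, (BitKnt + 7) / 8); // clear every destination byte, including a partial last one
while( back >= 0 ) { // visit every source bit; stopping where front meets back fills only half of Dst
if(BitGet((uchar *)Src, back--)) { // memset() has already set all bits in Dst to 0,
BitSet(Dst, front); // so only set the bit if the Src bit is 1
}
front++;
}
return Dst;
}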

To reverse a single byte x you can handle the bits one at a time:
unsigned char a = 0;
for (i = 0; i < 8; ++i) {
a += (unsigned char)(((x >> i) & 1) << (7 - i));
}
You can create a cache of these results in an array so that you can quickly reverse a byte just by making a single lookup instead of looping.
Then you just have to reverse the byte array, and when you write the data apply the above mapping. Reversing a byte array is a well documented problem, e.g. here.

Single Core?
How much memory?
Is the display buffered in memory and pushed to the device, or is the only copy of the pixels in the screens memory?

The data is pushed from system memory to the LCD driver over 8-bit IO port.
Since you'll be writing to the LCD one byte at a time, I think the best idea is to perform the bit reversal right when sending the data to the LCD driver rather than as a separate pre-pass. Something along these lines should be faster than any of the other answers:
void send_to_LCD(uint8_t* data, int len, bool rotate) {
if (rotate)
for (int i=len-1; i>=0; i--)
write(reverse(data[i]));
else
for (int i=0; i<len; i++)
write(data[i]);
}
Where write() is the function that sends a byte to the LCD driver and reverse() one of the single-byte bit reversal methods described in the other answers.
This approach avoids the need to store two copies of the video data in RAM and also avoids the read-invert-write round trip. Also note that this is the simplest implementation: it could be trivially adapted to load, say, 4 bytes at a time from memory if this were to yield better performance. A smart vectorizing compiler may even be able to do it for you.

How to shift an array of bytes by 12-bits

I want to shift the contents of an array of bytes by 12 bits to the left.
For example, starting with this array of type uint8_t shift[10]:
{0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0A, 0xBC}
I'd like to shift it to the left by 12 bits, resulting in:
{0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xAB, 0xC0, 0x00}

Hurray for pointers!
This code works by looking ahead 12 bits for each byte and copying the proper bits forward. 12 bits is the bottom half (nybble) of the next byte and the top half of 2 bytes away.
unsigned char length = 10;
unsigned char data[10] = {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0A,0xBC};
unsigned char *shift = data;
while (shift < data+(length-2)) {
*shift = (*(shift+1)&0x0F)<<4 | (*(shift+2)&0xF0)>>4;
shift++;
}
*(data+length-2) = (*(data+length-1)&0x0F)<<4;
*(data+length-1) = 0x00;
Justin wrote:
@Mike, your solution works, but does not carry.
Well, I'd say a normal shift operation does just that (called overflow), and just lets the extra bits fall off the right or left. It's simple enough to carry if you wanted to - just save the 12 bits before you start to shift. Maybe you want a circular shift, to put the overflowed bits back at the bottom? Maybe you want to realloc the array and make it larger? Return the overflow to the caller? Return a boolean if non-zero data was overflowed? You'd have to define what carry means to you.
unsigned char overflow[2];
*overflow = (*data&0xF0)>>4;
*(overflow+1) = (*data&0x0F)<<4 | (*(data+1)&0xF0)>>4;
while (shift < data+(length-2)) {
/* normal shifting */
}
/* now would be the time to copy it back if you want to carry it somewhere */
*(data+length-2) = (*(data+length-1)&0x0F)<<4 | (*(overflow)&0x0F);
*(data+length-1) = *(overflow+1);
/* You could return a 16-bit carry int,
* but endian-ness makes that look weird
* if you care about the physical layout */
unsigned short carry = *(overflow+1)<<8 | *overflow;

Here's my solution, but even more importantly my approach to solving the problem.
I approached the problem by
drawing the memory cells and drawing arrows from the destination to the source.
making a table showing the above drawing.
labeling each row in the table with the relative byte address.
This showed me the pattern:
let iL be the low nybble (half byte) of a[i]
let iH be the high nybble of a[i]
iH = (i+1)L
iL = (i+2)H
This pattern holds for all bytes.
Translating into C, this means:
a[i] = (iH << 4) OR iL
a[i] = ((a[i+1] & 0x0f) << 4) | ((a[i+2] & 0xf0) >> 4)
We now make three more observations:
since we carry out the assignments left to right, we don't need to store any values in temporary variables.
we will have a special case for the tail: all 12 bits at the end will be zero.
we must avoid reading undefined memory past the array. since we never read more than a[i+2], this only affects the last two bytes
So, we
handle the general case by looping for N-2 bytes and performing the general calculation above
handle the next-to-last byte by setting iH = (i+1)L
handle the last byte by setting it to 0
given a with length N, we get:
for (i = 0; i < N - 2; ++i) {
a[i] = ((a[i+1] & 0x0f) << 4) | ((a[i+2] & 0xf0) >> 4);
}
a[N-2] = (a[N-1] & 0x0f) << 4;
a[N-1] = 0;
And there you have it... the array is shifted left by 12 bits. It could easily be generalized to shifting N bits, noting that there will be M assignment statements where M = number of bits modulo 8, I believe.
The loop could be made more efficient on some machines by translating to pointers
for (p = a, p2=a+N-2; p != p2; ++p) {
*p = ((*(p+1) & 0x0f) << 4) | ((*(p+2) & 0xf0) >> 4);
}
and by using the largest integer data type supported by the CPU.
(I've just typed this in, so now would be a good time for somebody to review the code, especially since bit twiddling is notoriously easy to get wrong.)

Let's make it the best way to shift N bits in the array of 8-bit integers.
N - Total number of bits to shift
F = (N / 8) - Full 8 bit integers shifted
R = (N % 8) - Remaining bits that need to be shifted
I guess from here you would have to find the most optimal way to make use of this data to move around ints in an array. Generic algorithms would be to apply the full integer shifts by starting from the right of the array and moving each integer F indexes. Zero fill the newly empty spaces. Then finally perform an R bit shift on all of the indexes, again starting from the right.
In the case of shifting a byte such as 0xAB by R bits, you can calculate the overflow by doing a bitwise AND, and the shift using the bitshift operator:
// 0xAB shifted left by 4 bits:
(0xAB & 0xF0) >> 4 // is the overflow (0x0A)
(0xAB << 4) & 0xFF // is the shifted value (0xB0)
Keep in mind that the mask simply selects the 4 bits that fall off: 0xF0, or 0b11110000. This is easy to calculate or build dynamically, or you can even use a simple static lookup table.
I hope that is generic enough. I'm not good with C/C++ at all so maybe someone can clean up my syntax or be more specific.
Bonus: If you're crafty with your C you might be able to fudge multiple array indexes into a single 16, 32, or even 64 bit integer and perform the shifts. But that is probably not very portable and I would recommend against this. Just a possible optimization.
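The F/R split described above can be sketched as follows. This is one possible reading of the idea, not a definitive implementation: the name shift_left_bits is hypothetical, index 0 is assumed to hold the most significant byte (as in the question), and memmove does the whole-byte step so that the overlapping move is handled for us.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift a big-endian byte array left by nbits, filling with zeros. */
void shift_left_bits(uint8_t *a, size_t len, size_t nbits)
{
    size_t full = nbits / 8;            /* F: whole bytes to shift */
    unsigned rem = (unsigned)(nbits % 8); /* R: remaining bits */

    if (len == 0)
        return;
    if (full >= len) {                  /* everything is shifted out */
        memset(a, 0, len);
        return;
    }

    memmove(a, a + full, len - full);   /* move bytes toward index 0 */
    memset(a + len - full, 0, full);    /* zero the vacated tail */

    if (rem == 0)
        return;
    for (size_t i = 0; i + 1 < len; i++)
        a[i] = (uint8_t)((a[i] << rem) | (a[i + 1] >> (8 - rem)));
    a[len - 1] = (uint8_t)(a[len - 1] << rem);
}
```

Calling it with the question's ten-byte array and nbits = 12 gives F = 1 and R = 4, and leaves 0xAB, 0xC0, 0x00 in the last three bytes.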

Here is a working solution, using temporary variables:
void shift_4bits_left(uint8_t* array, uint16_t size)
{
int i;
uint8_t shifted = 0x00;
uint8_t overflow = (0xF0 & array[0]) >> 4; // wraps the top nibble of array[0] into the last byte (a rotate); start from 0 for a plain shift
for (i = (size - 1); i >= 0; i--)
{
shifted = (array[i] << 4) | overflow;
overflow = (0xF0 & array[i]) >> 4;
array[i] = shifted;
}
}
Call this function 3 times for a 12-bit shift.
Mike's solution may be faster, since this one uses temporary variables.

The 32 bit version... :-) Handles 1 <= count <= num_words
#include <stdio.h>
unsigned int array[] = {0x12345678,0x9abcdef0,0x12345678,0x9abcdef0,0x66666666};
int main(void) {
int count;
unsigned int *from, *to;
from = &array[0];
to = &array[0];
count = 5;
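while (count-- > 1) {
unsigned int high = *from++ << 12; /* read and advance first; modifying and reading from in one expression is undefined behaviour */
*to++ = high | ((*from >> 20) & 0xfff);
}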
*to = (*from<<12);
printf("%x\n", array[0]);
printf("%x\n", array[1]);
printf("%x\n", array[2]);
printf("%x\n", array[3]);
printf("%x\n", array[4]);
return 0;
}

@Joseph, notice that the variables are 8 bits wide, while the shift is 12 bits wide. Your solution works only for N <= variable size.
If you can assume your array length is a multiple of 8 bytes, you can cast the array into an array of uint64_t and then work on that. If it isn't a multiple of 8, you can work in 64-bit chunks on as much as you can and handle the remainder one byte at a time.
This may be a bit more coding, but I think it's more elegant in the end.

There are a couple of edge-cases which make this a neat problem:
the input array might be empty
the last and next-to-last bytes need to be treated specially, because they have zero bits shifted into them
Here's a simple solution which loops over the array copying the low-order nibble of the next byte into its high-order nibble, and the high-order nibble of the next-next (+2) byte into its low-order nibble. To save dereferencing the look-ahead pointer twice, it maintains a two-element buffer with the "last" and "next" bytes:
void shl12(uint8_t *v, size_t length) {
if (length == 0) {
return; // nothing to do
}
if (length > 1) {
uint8_t last_byte, next_byte;
next_byte = *(v + 1);
for (size_t i = 0; i + 2 < length; i++, v++) {
last_byte = next_byte;
next_byte = *(v + 2);
*v = ((last_byte & 0x0f) << 4) | (((next_byte) & 0xf0) >> 4);
}
// the next-to-last byte is half-empty
*(v++) = (next_byte & 0x0f) << 4;
}
// the last byte is always empty
*v = 0;
}
Consider the boundary cases, which activate successively more parts of the function:
When length is zero, we bail out without touching memory.
When length is one, we set the one and only element to zero.
When length is two, we set the high-order nibble of the first byte to the low-order nibble of the second byte (that is, bits 12-15), and the second byte to zero. We don't activate the loop.
When length is greater than two we hit the loop, shuffling the bytes across the two-element buffer.
If efficiency is your goal, the answer probably depends largely on your machine's architecture. Typically you should maintain the two-element buffer, but handle a machine word (32/64 bit unsigned integer) at a time. If you're shifting a lot of data it will be worthwhile treating the first few bytes as a special case so that you can get your machine word pointers word-aligned. Most CPUs access memory more efficiently if the accesses fall on machine word boundaries. Of course, the trailing bytes have to be handled specially too so you don't touch memory past the end of the array.

Bit reversal of an integer, ignoring integer size and endianness

Given an integer typedef:
typedef unsigned int TYPE;
or
typedef unsigned long TYPE;
I have the following code to reverse the bits of an integer:
TYPE max_bit= (TYPE)-1;
void reverse_int_setup()
{
TYPE bits= (TYPE)max_bit;
while (bits <<= 1)
max_bit= bits;
}
TYPE reverse_int(TYPE arg)
{
TYPE bit_setter= 1, bit_tester= max_bit, result= 0;
for (result= 0; bit_tester; bit_tester>>= 1, bit_setter<<= 1)
if (arg & bit_tester)
result|= bit_setter;
return result;
}
One just needs first to run reverse_int_setup(), which stores an integer with the highest bit turned on, then any call to reverse_int(arg) returns arg with its bits reversed (to be used as a key to a binary tree, taken from an increasing counter, but that's more or less irrelevant).
Is there a platform-agnostic way to get, at compile time, the value that max_bit holds after the call to reverse_int_setup()? Otherwise, is there an algorithm you consider better/leaner than the one I have for reverse_int()?
Thanks.

#include<stdio.h>
#include<limits.h>
#define TYPE_BITS (sizeof(TYPE)*CHAR_BIT)
typedef unsigned long TYPE;
TYPE reverser(TYPE n)
{
TYPE nrev = 0, i, bit1, bit2;
int count;
for(i = 0; i < TYPE_BITS; i += 2)
{
/*In each iteration, we swap one bit on the 'right half'
of the number with another on the left half*/
count = TYPE_BITS - i - 1; /*this is used to find how many positions
to the left (and right) we gotta move
the bits in this iteration*/
bit1 = n & ((TYPE)1 << (i/2)); /*Extract 'right half' bit; cast so the shift is done in TYPE, not int*/
bit1 <<= count; /*Shift it to where it belongs*/
bit2 = n & ((TYPE)1 << ((i/2) + count)); /*Find the 'left half' bit*/
bit2 >>= count; /*Place that bit in bit1's original position*/
nrev |= bit1; /*Now add the bits to the reversal result*/
nrev |= bit2;
}
return nrev;
}
int main()
{
TYPE n = 6;
printf("%lu", reverser(n));
return 0;
}
This time I've used the 'number of bits' idea from TK, but made it somewhat more portable by not assuming a byte contains 8 bits and instead using the CHAR_BIT macro. The code is more efficient now (with the inner for loop removed). I hope the code is also slightly less cryptic this time. :)
The need for using count is that the number of positions by which we have to shift a bit varies in each iteration - we have to move the rightmost bit by 31 positions (assuming 32 bit number), the second rightmost bit by 29 positions and so on. Hence count must decrease with each iteration as i increases.
Hope that bit of info proves helpful in understanding the code...

The following program demonstrates a leaner algorithm for reversing bits, which can easily be extended to handle 64-bit numbers.
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
int main(int argc, char**argv)
{
    uint32_t x; /* unsigned, so the right shifts never drag in sign bits */
    if ( argc != 2 )
    {
        printf("Usage: %s hexadecimal\n", argv[0]);
        return 1;
    }
    if ( sscanf(argv[1], "%" SCNx32, &x) != 1 )
    {
        printf("Not a hexadecimal number: %s\n", argv[1]);
        return 1;
    }
    /* swap neighbouring bits */
    x = (x&0xAAAAAAAA)>>1 | (x&0x55555555)<<1;
    /* swap neighbouring pairs of bits */
    x = (x&0xCCCCCCCC)>>2 | (x&0x33333333)<<2;
    /* swap neighbouring nibbles */
    x = (x&0xF0F0F0F0)>>4 | (x&0x0F0F0F0F)<<4;
    /* swap neighbouring bytes */
    x = (x&0xFF00FF00)>>8 | (x&0x00FF00FF)<<8;
    /* swap the two 16-bit halves, which completes a 32-bit reversal */
    x = (x&0xFFFF0000)>>16 | (x&0x0000FFFF)<<16;
    printf("0x%" PRIx32 "\n", x);
    return 0;
}
This code should not contain errors, and was tested using 0x12345678 to produce 0x1e6a2c48 which is the correct answer.
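The 64-bit extension mentioned at the top of this answer could look roughly like the sketch below. The function name reverse64 is hypothetical; the only changes are wider masks and one extra step that swaps the two 32-bit halves.

```c
#include <stdint.h>

/* Hypothetical helper: the same swap-by-halving idea for 64-bit values. */
uint64_t reverse64(uint64_t x)
{
    /* swap neighbouring bits, then pairs, nibbles, bytes, 16-bit words */
    x = (x & UINT64_C(0xAAAAAAAAAAAAAAAA)) >> 1  | (x & UINT64_C(0x5555555555555555)) << 1;
    x = (x & UINT64_C(0xCCCCCCCCCCCCCCCC)) >> 2  | (x & UINT64_C(0x3333333333333333)) << 2;
    x = (x & UINT64_C(0xF0F0F0F0F0F0F0F0)) >> 4  | (x & UINT64_C(0x0F0F0F0F0F0F0F0F)) << 4;
    x = (x & UINT64_C(0xFF00FF00FF00FF00)) >> 8  | (x & UINT64_C(0x00FF00FF00FF00FF)) << 8;
    x = (x & UINT64_C(0xFFFF0000FFFF0000)) >> 16 | (x & UINT64_C(0x0000FFFF0000FFFF)) << 16;
    /* finally swap the two 32-bit halves */
    x = x >> 32 | x << 32;
    return x;
}
```

With this sketch, reverse64(0x12345678) gives 0x1e6a2c4800000000, i.e. the 32-bit result above moved into the upper half.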

typedef unsigned long TYPE;
TYPE reverser(TYPE n)
{
TYPE k = 1, nrev = 0, i, nrevbit1, nrevbit2;
int count;
for(i = 0; !i || (1 << i && (1 << i) != 1); i+=2)
{
/*In each iteration, we swap one bit
on the 'right half' of the number with another
on the left half*/
k = 1<<i; /*this is used to find how many positions
to the left (or right, for the other bit)
we gotta move the bits in this iteration*/
count = 0;
while(k << 1 && k << 1 != 1)
{
k <<= 1;
count++;
}
nrevbit1 = n & (1<<(i/2));
nrevbit1 <<= count;
nrevbit2 = n & 1<<((i/2) + count);
nrevbit2 >>= count;
nrev |= nrevbit1;
nrev |= nrevbit2;
}
return nrev;
}
This works fine in gcc under Windows, but I'm not sure if it's completely platform independent. A few places of concern are:
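the condition in the for loop - it assumes that when you left shift 1 beyond the leftmost bit, you get either a 0 with the 1 'falling out' (what I'd expect and what good old Turbo C gives iirc), or the 1 circles around and you get a 1 (what seems to be gcc's behaviour). In standard C, shifting by the width of the type or more is undefined behaviour, so neither result can be relied upon.
the condition in the inner while loop: see above. But there's a strange thing happening here: in this case, gcc seems to let the 1 fall out and not circle around! (Here k is an unsigned long shifted one position at a time, which is well defined and always drops the top bit.)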
The code might prove cryptic: if you're interested and need an explanation please don't hesitate to ask - I'll put it up someplace.

In reply to ΤΖΩΤΖΙΟΥ's comments, here is a modified version of the above, which depends on an upper limit for the bit width.
#include <stdio.h>
#include <stdint.h>
typedef int32_t TYPE;
TYPE reverse(TYPE x, int bits)
{
TYPE m=~0;
switch(bits)
{
/* every case deliberately falls through to the next one */
case 64: /* swap the two 32-bit halves */
x = (x&0xFFFFFFFF00000000&m)>>32 | (x&0x00000000FFFFFFFF&m)<<32;
case 32: /* swap 16-bit words */
x = (x&0xFFFF0000FFFF0000&m)>>16 | (x&0x0000FFFF0000FFFF&m)<<16;
case 16: /* swap bytes */
x = (x&0xFF00FF00FF00FF00&m)>>8 | (x&0x00FF00FF00FF00FF&m)<<8;
case 8: /* swap nibbles, then pairs of bits, then single bits */
x = (x&0xF0F0F0F0F0F0F0F0&m)>>4 | (x&0x0F0F0F0F0F0F0F0F&m)<<4;
x = (x&0xCCCCCCCCCCCCCCCC&m)>>2 | (x&0x3333333333333333&m)<<2;
x = (x&0xAAAAAAAAAAAAAAAA&m)>>1 | (x&0x5555555555555555&m)<<1;
}
return x;
}
int main(int argc, char**argv)
{
TYPE x;
TYPE b = (TYPE)-1;
int bits;
if ( argc != 2 )
{
printf("Usage: %s hexadecimal\n", argv[0]);
return 1;
}
for(bits=1;b;b<<=1,bits++);
--bits;
printf("TYPE has %d bits\n", bits);
sscanf(argv[1],"%x", &x);
printf("0x%x\n",reverse(x, bits));
return 0;
}
Notes:
gcc will warn on the 64bit constants
the printfs will generate warnings too
If you need more than 64bit, the code should be simple enough to extend
I apologise in advance for the coding crimes I committed above - mercy good sir!

There's a nice collection of "Bit Twiddling Hacks", including a variety of simple and not-so simple bit reversing algorithms coded in C at http://graphics.stanford.edu/~seander/bithacks.html.
I personally like the "Obvious" algorithm (http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious) because, well, it's obvious. Some of the others may require fewer instructions to execute. If I really need to optimize the heck out of something I may choose the not-so-obvious but faster versions. Otherwise, for readability, maintainability, and portability I would choose the Obvious one.
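For readers who don't want to follow the link, here is a sketch along the lines of that "Obvious" entry, wrapped in a function whose name (reverse_obvious) is made up for this example. It pushes the bits of v into r one at a time and then shifts r up to account for the leading zeros of v.

```c
#include <limits.h>

/* Hypothetical wrapper around the bit-by-bit approach. */
unsigned int reverse_obvious(unsigned int v)
{
    unsigned int r = v;                       /* starts with the lowest bit of v in place */
    int s = (int)(sizeof v * CHAR_BIT) - 1;   /* extra shift still needed at the end */

    for (v >>= 1; v; v >>= 1) {
        r <<= 1;          /* make room for the next bit */
        r |= v & 1;       /* append the current lowest bit of v */
        s--;
    }
    r <<= s;              /* account for the high-order zero bits of v */
    return r;
}
```

The loop stops as soon as v runs out of set bits, so small inputs take only a few iterations.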

Here is a more generally useful variation. Its advantage is its ability to work in situations where the bit length of the value to be reversed -- the codeword -- is unknown but is guaranteed not to exceed a value we'll call maxLength. A good example of this case is Huffman code decompression.
The code below works on codewords from 1 to 24 bits in length. It has been optimized for fast execution on a Pentium D. Note that it accesses the lookup table as many as 3 times per use. I experimented with many variations that reduced that number to 2 at the expense of a larger table (4096 and 65,536 entries). This version, with the 256-byte table, was the clear winner, partly because it is so advantageous for table data to be in the caches, and perhaps also because the processor has an 8-bit table lookup/translation instruction.
const unsigned char table[] = {
0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
0x08,0x88,0x48,0xC8,0x28,0xA8,0x68,0xE8,0x18,0x98,0x58,0xD8,0x38,0xB8,0x78,0xF8,
0x04,0x84,0x44,0xC4,0x24,0xA4,0x64,0xE4,0x14,0x94,0x54,0xD4,0x34,0xB4,0x74,0xF4,
0x0C,0x8C,0x4C,0xCC,0x2C,0xAC,0x6C,0xEC,0x1C,0x9C,0x5C,0xDC,0x3C,0xBC,0x7C,0xFC,
0x02,0x82,0x42,0xC2,0x22,0xA2,0x62,0xE2,0x12,0x92,0x52,0xD2,0x32,0xB2,0x72,0xF2,
0x0A,0x8A,0x4A,0xCA,0x2A,0xAA,0x6A,0xEA,0x1A,0x9A,0x5A,0xDA,0x3A,0xBA,0x7A,0xFA,
0x06,0x86,0x46,0xC6,0x26,0xA6,0x66,0xE6,0x16,0x96,0x56,0xD6,0x36,0xB6,0x76,0xF6,
0x0E,0x8E,0x4E,0xCE,0x2E,0xAE,0x6E,0xEE,0x1E,0x9E,0x5E,0xDE,0x3E,0xBE,0x7E,0xFE,
0x01,0x81,0x41,0xC1,0x21,0xA1,0x61,0xE1,0x11,0x91,0x51,0xD1,0x31,0xB1,0x71,0xF1,
0x09,0x89,0x49,0xC9,0x29,0xA9,0x69,0xE9,0x19,0x99,0x59,0xD9,0x39,0xB9,0x79,0xF9,
0x05,0x85,0x45,0xC5,0x25,0xA5,0x65,0xE5,0x15,0x95,0x55,0xD5,0x35,0xB5,0x75,0xF5,
0x0D,0x8D,0x4D,0xCD,0x2D,0xAD,0x6D,0xED,0x1D,0x9D,0x5D,0xDD,0x3D,0xBD,0x7D,0xFD,
0x03,0x83,0x43,0xC3,0x23,0xA3,0x63,0xE3,0x13,0x93,0x53,0xD3,0x33,0xB3,0x73,0xF3,
0x0B,0x8B,0x4B,0xCB,0x2B,0xAB,0x6B,0xEB,0x1B,0x9B,0x5B,0xDB,0x3B,0xBB,0x7B,0xFB,
0x07,0x87,0x47,0xC7,0x27,0xA7,0x67,0xE7,0x17,0x97,0x57,0xD7,0x37,0xB7,0x77,0xF7,
0x0F,0x8F,0x4F,0xCF,0x2F,0xAF,0x6F,0xEF,0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF};
const unsigned short masks[17] =
{0,0,0,0,0,0,0,0,0,0X0100,0X0300,0X0700,0X0F00,0X1F00,0X3F00,0X7F00,0XFF00};
// Reverses the low maxLength bits of codeword.
// The function wrapper and its name are illustrative; the logic is unchanged.
//   codeword:  value to be reversed, occupying the low 1-24 bits
//   maxLength: bit length of longest possible codeword (1..24)
unsigned long reverse_codeword(unsigned long codeword, unsigned char maxLength)
{
    unsigned char sc; // shift count in bits and index into masks array
    if (maxLength <= 8)
    {
        codeword = table[codeword << (8 - maxLength)];
    }
    else
    {
        sc = maxLength - 8;
        if (maxLength <= 16)
        {
            codeword = (table[codeword & 0x00FF] << sc)
                | table[codeword >> sc];
        }
        else if (maxLength & 1) // if maxLength is 17, 19, 21, or 23
        {
            codeword = (table[codeword & 0x00FF] << sc)
                | table[codeword >> sc]
                | (table[(codeword & masks[sc]) >> (sc - 8)] << 8);
        }
        else // if maxLength is 18, 20, 22, or 24
        {
            codeword = (table[codeword & 0x00FF] << sc)
                | table[codeword >> sc]
                | (table[(codeword & masks[sc]) >> (sc >> 1)] << (sc >> 1));
        }
    }
    return codeword;
}
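When checking a table-driven version like this, a slow bit-by-bit reference can be handy. The sketch below is not part of the optimized code; the name reverse_low_bits is hypothetical, and it simply reverses the low `length` bits of a value.

```c
/* Hypothetical reference implementation: reverse the low `length` bits
   of `codeword` one bit at a time (length must not exceed the width of
   unsigned long). */
unsigned long reverse_low_bits(unsigned long codeword, unsigned length)
{
    unsigned long result = 0;
    for (unsigned i = 0; i < length; i++) {
        result = (result << 1) | (codeword & 1);  /* append lowest bit */
        codeword >>= 1;                           /* move to the next bit */
    }
    return result;
}
```

Comparing its output with the table version for every codeword of every length from 1 to 24 would exercise all four branches.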

How about:
unsigned long long temp = 0; // unsigned, and at least as wide as value
int counter = 0;
int number_of_bits = sizeof(value) * 8; // get the number of bits that represent value (assuming 8-bit bytes)
while(value > 0) // loop until value is empty (value must be unsigned)
{
temp <<= 1; // shift whatever was in temp left to create room for the next bit
temp |= (value & 0x01); // get the lsb from value and set as lsb in temp
value >>= 1; // shift value right by one to look at next lsb
counter++;
}
value = temp;
if (counter > 0 && counter < number_of_bits)
{
value <<= number_of_bits - counter; // move the reversed bits up past the leading zeros of the original value
}
(I'm assuming that you know how many bits value holds and it is stored in number_of_bits)
Obviously temp needs to be the longest imaginable data type and when you copy temp back into value, all the extraneous bits in temp should magically vanish (I think!).
Or, the 'c' way would be to say :
while(value)
your choice

We can store the results of reversing all possible 1-byte values in an array (256 distinct entries), then use a combination of lookups into this table and some shifting and OR-ing to get the reverse of the whole integer.
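A minimal sketch of that idea for 32-bit values, assuming 8-bit bytes. The names rev8, init_rev8 and reverse32_table are made up for this example, and the table is filled at start-up rather than written out as 256 literals.

```c
#include <stdint.h>

static uint8_t rev8[256];   /* rev8[b] is b with its 8 bits reversed */

/* Fill the lookup table; call once before using reverse32_table(). */
void init_rev8(void)
{
    for (int i = 0; i < 256; i++) {
        uint8_t r = 0;
        for (int bit = 0; bit < 8; bit++)
            if (i & (1 << bit))
                r |= (uint8_t)(0x80 >> bit);   /* bit k moves to bit 7-k */
        rev8[i] = r;
    }
}

/* Reverse each byte via the table and put the bytes in opposite order. */
uint32_t reverse32_table(uint32_t x)
{
    return ((uint32_t)rev8[x & 0xFF] << 24) |
           ((uint32_t)rev8[(x >> 8) & 0xFF] << 16) |
           ((uint32_t)rev8[(x >> 16) & 0xFF] << 8) |
            (uint32_t)rev8[(x >> 24) & 0xFF];
}
```

After init_rev8(), reverse32_table(0x12345678) yields 0x1e6a2c48, matching the result quoted in the earlier answer.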

Here is a variation and correction to TK's solution which might be clearer than the solutions by sundar. It takes single bits from t and pushes them into return_val:
typedef unsigned long TYPE;
#define TYPE_BITS (sizeof(TYPE)*8)
TYPE reverser(TYPE t)
{
unsigned int i;
TYPE return_val = 0;
for(i = 0; i < TYPE_BITS; i++)
{/*foreach bit in TYPE*/
/* shift the value of return_val to the left and add the rightmost bit from t */
return_val = (return_val << 1) + (t & 1);
/* shift off the rightmost bit of t */
t = t >> 1;
}
return(return_val);
}

The generic approach that would work for objects of any type and any size is to reverse the order of the bytes of the object, and then reverse the order of the bits in each byte. In this case the bit-level algorithm is tied to a concrete number of bits (a byte), while the "variable" logic (with regard to size) is lifted to the level of whole bytes.
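A sketch of this two-level approach, assuming 8-bit bytes. The names reverse_byte and reverse_object_bits are hypothetical, and the object is treated as a plain array of unsigned char.

```c
#include <stddef.h>

/* Reverse the 8 bits of one byte: swap nibbles, then pairs, then bits. */
static unsigned char reverse_byte(unsigned char b)
{
    b = (unsigned char)((b & 0xF0) >> 4 | (b & 0x0F) << 4);
    b = (unsigned char)((b & 0xCC) >> 2 | (b & 0x33) << 2);
    b = (unsigned char)((b & 0xAA) >> 1 | (b & 0x55) << 1);
    return b;
}

/* Reverse all bits of an object of `size` bytes, in place: bytes are
   swapped end for end, and each byte is bit-reversed on the way. */
void reverse_object_bits(void *obj, size_t size)
{
    unsigned char *p = obj;
    size_t i = 0, j = size;
    while (i < j) {
        j--;
        unsigned char lo = reverse_byte(p[i]);
        unsigned char hi = reverse_byte(p[j]);
        p[i] = hi;   /* when i == j this just stores the reversed middle byte */
        p[j] = lo;
        i++;
    }
}
```

For a native unsigned integer, calling reverse_object_bits(&x, sizeof x) reverses its value bits on both little-endian and big-endian layouts, since bit k of the memory image always ends up at the mirrored position.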

Here's my generalization of freespace's solution (in case we one day get 128-bit machines). It results in jump-free code when compiled with gcc -O3, and is obviously insensitive to the definition of foo_t on sane machines. Unfortunately it does depend on shift being a power of 2!
#include <limits.h>
#include <stdio.h>
typedef unsigned long foo_t;
foo_t reverse(foo_t x)
{
int shift = sizeof (x) * CHAR_BIT / 2;
foo_t mask = ((foo_t)1 << shift) - 1; /* cast first: a plain int 1 cannot be shifted by 32 */
int i;
for (i = 0; shift; i++) {
x = ((x & mask) << shift) | ((x & ~mask) >> shift);
shift >>= 1;
mask ^= (mask << shift);
}
return x;
}
int main() {
printf("reverse = 0x%08lx\n", reverse(0x12345678L));
}

When bit reversal is time-critical, mainly in conjunction with an FFT, the best option is to precompute and store the whole array of bit-reversed indices. In any case, this array is smaller than the table of roots of unity that the Cooley-Tukey FFT algorithm has to precompute. An easy way to compute the array is:
int BitReverse[Size]; // Size is power of 2
void Init()
{
BitReverse[0] = 0;
for(int i = 0; i < Size/2; i++)
{
BitReverse[2*i] = BitReverse[i]/2;
BitReverse[2*i+1] = (BitReverse[i] + Size)/2;
}
}
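To show how such a table would typically be used, here is a self-contained sketch with a fixed size of 8. The names SIZE, bit_reverse, init_bit_reverse and bit_reverse_permute are assumptions for this example; the table-building loop is the same recurrence as above.

```c
#define SIZE 8   /* assumption: a power of two fixed at compile time */

static int bit_reverse[SIZE];

/* Entry 2*i is entry i shifted right by one; entry 2*i+1 additionally
   gets the top bit (SIZE/2) set. */
void init_bit_reverse(void)
{
    bit_reverse[0] = 0;
    for (int i = 0; i < SIZE / 2; i++) {
        bit_reverse[2 * i]     = bit_reverse[i] / 2;
        bit_reverse[2 * i + 1] = (bit_reverse[i] + SIZE) / 2;
    }
}

/* Reorder data[0..SIZE-1] into bit-reversed index order, in place.
   Each pair is swapped once, when i is the smaller index. */
void bit_reverse_permute(double *data)
{
    for (int i = 0; i < SIZE; i++) {
        int j = bit_reverse[i];
        if (i < j) {
            double tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

For SIZE 8 the table comes out as 0, 4, 2, 6, 1, 5, 3, 7, which is the usual input ordering for an in-place radix-2 FFT.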