Which is faster: logarithm or lookup table? [duplicate] - c

I have a byte I'm using for bitflags. I know that one and only one bit in the byte is set at any give time.
Ex: unsigned char b = 0x20; //(00100000) 6th most bit set
I currently use the following loop to determine which bit is set:
int getSetBitLocation(unsigned char b) {
int i=0;
while( !((b >> i++) & 0x01) ) { ; }
return i;
}
How do I most efficiently determine the position of the set bit? Can I do this without iteration?

Can I do this without iteration?
It is indeed possible.
How do I most efficiently determine the position of the set bit?
You can try this algorithm. It splits the char in half to search for the top bit, shifting to the low half each time:
int getTopSetBit(unsigned char b) {
int res = 0;
if(b>15){
b = b >> 4;
res = res + 4;
}
if(b>3){
b = b >> 2;
res = res + 2;
}
//thanks #JasonD
return res + (b>>1);
}
It uses two comparisons (three for uint16s, four for uint32s...). and it might be faster than your loop. It is definitely not shorter.
Based on the idea by Anton Kovalenko (hashed lookup) and the comment by 6502 (division is slow), I also suggest this implementation (8-bit => 3-bit hash using a de-Bruijn sequence)
int[] lookup = {7, 0, 5, 1, 6, 4, 3, 2};
int getBitPosition(unsigned char b) {
// return lookup[(b | (b>>1) | (b>>2) | (b>>4)) & 0x7];
return lookup[((b * 0x1D) >> 4) & 0x7];
}
or (larger LUT, but uses just three terms instead of four)
int[] lookup = {0xFF, 0, 1, 4, 2, 0xFF, 5, 0xFF, 7, 3, 0xFF, 0xFF, 6, 0xFF, 0xFF, 0xFF};
int getBitPosition(unsigned char b) {
return lookup[(b | (b>>3) | (b>>4)) & 0xF];
}

Lookup table is simple enough, and you can reduce its size if the set of values is sparse. Let's try with 11 elements instead of 128:
unsigned char expt2mod11_bits[11]={0xFF,0,1,0xFF,2,4,0xFF,7,3,6,5};
unsigned char pos = expt2mod11_bits[b%11];
assert(pos < 8);
assert(1<<pos == b);
Of course, it's not necessarily more effective, especially for 8 bits, but the same trick can be used for larger sizes, where full lookup table would be awfully big. Let's see:
unsigned int w;
....
unsigned char expt2mod19_bits[19]={0xFF,0,1,13,2,0xFF,14,6,3,8,0xFF,12,15,5,7,11,4,10,9};
unsigned char pos = expt2mod19_bits[w%19];
assert(pos < 16);
assert(1<<pos == w);

This is a quite common problem for chess programs that use 64 bits to represent positions (i.e. one 64-bit number to store where are all the white pawns, another for where are all the black ones and so on).
With this representation there is sometimes the need to find the index 0...63 of the first or last set bit and there are several possible approaches:
Just doing a loop like you did
Using a dichotomic search (i.e. if x & 0x00000000ffffffffULL is zero there's no need to check low 32 bits)
Using special instruction if available on the processor (e.g. bsf and bsr on x86)
Using lookup tables (of course not for the whole 64-bit value, but for 8 or 16 bits)
What is faster however really depends on your hardware and on real use cases.
For 8 bits only and a modern processor I think that probably a lookup table with 256 entries is the best choice...
But are you really sure this is the bottleneck of your algorithm?

unsigned getSetBitLocation(unsigned char b) {
unsigned pos=0;
pos = (b & 0xf0) ? 4 : 0; b |= b >>4;
pos += (b & 0xc) ? 2 : 0; b |= b >>2;
pos += (b & 0x2) ? 1 : 0;
return pos;
}
It would be hard to do it jumpfree. Maybe with the Bruin sequences ?

Based on log2 calculation in Find the log base 2 of an N-bit integer in O(lg(N)) operations:
int getSetBitLocation(unsigned char c) {
// c is in {1, 2, 4, 8, 16, 32, 64, 128}, returned values are {0, 1, ..., 7}
return (((c & 0xAA) != 0) |
(((c & 0xCC) != 0) << 1) |
(((c & 0xF0) != 0) << 2));
}

Easiest thing is to create a lookup table. The simplest one will be sparse (having 256 elements) but it would technically avoid iteration.
This comment here technically avoids iteration, but who are we kidding, it is still doing the same number of checks: How to write log base(2) in c/c++
Closed form would be log2(), a la, log2() + 1 But I'm not sure how efficient that is - possibly the CPU has an instruction for taking base 2 logrithms?

if you define
const char bytes[]={1,2,4,8,16,32,64,128}
and use
struct byte{
char data;
int pos;
}
void assign(struct byte b,int i){
b.data=bytes[i];
b.pos=i
}
you don't need to determine the position of the set bit

A lookup table is fast and easy when CHAR_BIT == 8, but on some systems, CHAR_BIT == 16 or 32 and a lookup table becomes insanely bulky. If you're considering a lookup table, I'd suggest wrapping it; make it a "lookup table function", instead, so that you can swap the logic when you need to optimise.
Using divide and conquer, by performing a binary search on a sorted array, involves comparisons based on log2 CHAR_BIT. That code is more complex, involving an initialisation of an array of unsigned char to use as a lookup table for a start. Once you have such the array initialised, you can use bsearch to search it, for example:
#include <stdio.h>
#include <stdlib.h>
void uchar_bit_init(unsigned char *table) {
for (size_t x = 0; x < CHAR_BIT; x++) {
table[x] = 1U << x;
}
}
int uchar_compare(void const *x, void const *y) {
char const *X = x, *Y = y;
return (*X > *Y) - (*X < *Y);
}
size_t uchar_bit_lookup(unsigned char *table, unsigned char value) {
unsigned char *position = bsearch(lookup, c, sizeof lookup, 1, char_compare);
return position ? position - table + 1 : 0;
}
int main(void) {
unsigned char lookup[CHAR_BIT];
uchar_bit_init(lookup);
for (;;) {
int c = getchar();
if (c == EOF) { break; }
printf("Bit for %c found at %zu\n", c, uchar_bit_lookup(lookup, c));
}
}
P.S. This sounds like micro-optimisation. Get your solution done (abstracting the operations required into these functions), then worry about optimisations based on your profiling. Make sure your profiling targets the system that your solution will run on if you're going to focus on micro-optimisations, because the efficiency of micro-optimisations differ widely as hardware differs even slightly... It's usually a better idea to buy a faster PC ;)

Related

C variable smaller then 8-bit

I'm writing C implementation of Conway's Game of Life and pretty much done with the code, but I'm wondering what is the most efficient way to storage the net in the program.
The net is two dimensional and stores whether cell (x, y) is alive (1) or dead (0). Currently I'm doing it with unsigned char like that:
struct:
typedef struct {
int rows;
int cols;
unsigned char *vec;
} net_t;
allocation:
n->vec = calloc( n->rows * n->cols, sizeof(unsigned char) );
filling:
i = ( n->cols * (x - 1) ) + (y - 1);
n->vec[i] = 1;
searching:
if( n->vec[i] == 1 )
but I don't really need 0-255 values - I only need 0 - 1, so I'm feeling that doing it like that is a waste of space, but as far as I know 8-bit char is the smallest type in C.
Is there any way to do it better?
Thanks!
The smallest declarable / addressable unit of memory you can address/use is a single byte, implemented as unsigned char in your case.
If you want to really save on space, you could make use of masking off individual bits in a character, or using bit fields via a union. The trade-off will be that your code will execute a bit slower, and will certainly be more complicated.
#include <stdio.h>
union both {
struct {
unsigned char b0: 1;
unsigned char b1: 1;
unsigned char b2: 1;
unsigned char b3: 1;
unsigned char b4: 1;
unsigned char b5: 1;
unsigned char b6: 1;
unsigned char b7: 1;
} bits;
unsigned char byte;
};
int main ( ) {
union both var;
var.byte = 0xAA;
if ( var.bits.b0 ) {
printf("Yes\n");
} else {
printf("No\n");
}
return 0;
}
References
Union and Bit Fields, Accessed 2014-04-07, <http://www.rightcorner.com/code/CPP/Basic/union/sample.php>
Access Bits in a Char in C, Accessed 2014-04-07, <https://stackoverflow.com/questions/8584577/access-bits-in-a-char-in-c>
Struct - Bit Field, Accessed 2014-04-07, <http://cboard.cprogramming.com/c-programming/10029-struct-bit-fields.html>
Unless you're working on an embedded platform, I wouldn't be too concerned about the size your net takes up by using an unsigned char to store only a 1 or 0.
To address your specific question: char is the smallest of the C data types. char, signed char, and unsigned char are all only going to take up 1 byte each.
If you want to make your code smaller you can use bitfields to decrees the amount of space you take up, but that will increase the complexity of your code.
For a simple exercise like this, I'd be more concerned about readability than size. One way you can make it more obvious what you're doing is switch to a bool instead of a char.
#include <stdbool.h>
typedef struct {
int rows;
int cols;
bool *vec;
} net_t;
You can then use true and false which, IMO, will make your code much easier to read and understand when all you need is 1 and 0.
It will take up at least as much space as the way you're doing it now, but like I said, consider what's really important in the program you're writing for the platform you're writing it for... it's probably not the size.
The smallest type on C as i know are the char (-128, 127), signed char (-128, 127), unsigned char (0, 255) types, all of them takes a whole byte, so if you are storing multiple bits values on different variables, you can instead use an unsigned char as a group of bits.
unsigned char lives = 128;
At this moment, lives have a 128 decimal value, which it's 10000000 in binary, so now you can use a bitwise operator to get a single value from this variable (like an array of bits)
if((lives >> 7) == 1) {
//This code will run if the 8 bit from right to left (decimal 128) it's true
}
It's a little complex, but finally you'll end up with a bit array, so instead of using multiple variables to store single TRUE / FALSE values, you can use a single unsigned char variable to store 8 TRUE / FALSE values.
Note: As i have some time out of the C/C++ world, i'm not 100% sure that it's "lives >> 7", but it's with the '>' symbol, a little research on it and you'll be ready to go.
You're correct that a char is the smallest type - and it is typically (8) bits, though this is a minimum requirement. And sizeof(char) or (unsigned char) is (1). So, consider using an (unsigned) char to represent (8) columns.
How many char's are required per row? It's (cols / 8), but we have to round up for an integer value:
int byte_cols = (cols + 7) / 8;
or:
int byte_cols = (cols + 7) >> 3;
which you may wish to store with in the net_t data structure. Then:
calloc(n->rows * n->byte_cols, 1) is sufficient for a contiguous bit vector.
Address columns and rows by x and y respectively. Setting (x, y) (relative to 0) :
n->vec[y * byte_cols + (x >> 3)] |= (1 << (x & 0x7));
Clearing:
n->vec[y * byte_cols + (x >> 3)] &= ~(1 << (x & 0x7));
Searching:
if (n->vec[y * byte_cols + (x >> 3)] & (1 << (x & 0x7)))
/* ... (x, y) is set... */
else
/* ... (x, y) is clear... */
These are bit manipulation operations. And it's fundamentally important to learn how (and why) this works. Google the term for more resources. This uses an eighth of the memory of a char per cell, so I certainly wouldn't consider it premature optimization.

Bit Rotation in C

The Problem: Exercise 2-8 of The C Programming Language, "Write a function rightrot(x,n) that returns the value of the integer x, rotated to the right by n positions."
I have done this every way that I know how. Here is the issue that I am having. Take a given number for this exercise, say 29, and rotate it right one position.
11101 and it becomes 11110 or 30. Let's say for the sake of argument that the system we are working on has an unsigned integer type size of 32 bits. Let's further say that we have the number 29 stored in an unsigned integer variable. In memory the number will have 27 zeros ahead of it. So when we rotate 29 right one using one of several algorithms mine is posted below, we get the number 2147483662. This is obviously not the desired result.
unsigned int rightrot(unsigned x, int n) {
return (x >> n) | (x << (sizeof(x) * CHAR_BIT) - n);
}
Technically, this is correct, but I was thinking that the 27 zeros that are in front of 11101 were insignificant. I have also tried a couple of other solutions:
int wordsize(void) { // compute the wordsize on a given machine...
unsigned x = ~0;
int b;
for(b = 0; x; b++)
x &= x-1;
return x;
}
unsigned int rightrot(unsigned x, int n) {
unsigned rbit;
while(n --) {
rbit = x >> 1;
x |= (rbit << wordsize() - 1);
}
return x;
This last and final solution is the one where I thought that I had it, I will explain where it failed once I get to the end. I am sure that you will see my mistake...
int bitcount(unsigned x) {
int b;
for(b = 0; x; b++)
x &= x-1;
return b;
}
unsigned int rightrot(unsigned x, int n) {
unsigned rbit;
int shift = bitcount(x);
while(n--) {
rbit = x & 1;
x >>= 1;
x |= (rbit << shift);
}
}
This solution gives the expected answer of 30 that I was looking for, but if you use a number for x like oh say 31 (11111), then there are issues, specifically the outcome is 47, using one for n. I did not think of this earlier, but if a number like 8 (1000) is used then mayhem. There is only one set bit in 8, so the shift is most certainly going to be wrong. My theory at this point is that the first two solutions are correct (mostly) and I am just missing something...
A bitwise rotation is always necessarily within an integer of a given width. In this case, as you're assuming a 32-bit integer, 2147483662 (0b10000000000000000000000000001110) is indeed the correct answer; you aren't doing anything wrong!
0b11110 would not be considered the correct result by any reasonable definition, as continuing to rotate it right using the same definition would never give you back the original input. (Consider that another right rotation would give 0b1111, and continuing to rotate that would have no effect.)
In my opinion, the spirit of the section of the book which immediately precedes this exercise would have the reader do this problem without knowing anything about the size (in bits) of integers, or any other type. The examples in the section do not require that information; I don't believe the exercises should either.
Regardless of my belief, the book had not yet introduced the sizeof operator by section 2.9, so the only way to figure the size of a type is to count the bits "by hand".
But we don't need to bother with all that. We can do bit rotation in n steps, regardless of how many bits there are in the data type, by rotating one bit at a time.
Using only the parts of the language that are covered by the book up to section 2.9, here's my implementation (with integer parameters, returning an integer, as specified by the exercise): Loop n times, x >> 1 each iteration; if the old low bit of x was 1, set the new high bit.
int rightrot(int x, int n) {
int lowbit;
while (n-- > 0) {
lowbit = x & 1; /* save low bit */
x = (x >> 1) & (~0u >> 1); /* shift right by one, and clear the high bit (in case of sign extension) */
if (lowbit)
x = x | ~(~0u >> 1); /* set the high bit if the low bit was set */
}
return x;
}
You could find the location of the first '1' in the 32-bit value using binary search. Then note the bit in the LSB location, right shift the value by the required number of places, and put the LSB bit in the location of the first '1'.
int bitcount(unsigned x) {
int b;
for(b = 0; x; b++)
x &= x-1;
return b;
}
unsigned rightrot(unsigned x,int n) {
int b = bitcount(x);
unsigned a = (x&~(~0<<n))<<(b-n+1);
x>> = n;
x| = a;
}

Finding trailing 0s in a binary number

How to find number of trailing 0s in a binary number?Based on K&R bitcount example of finding 1s in a binary number i modified it a bit to find the trailing 0s.
int bitcount(unsigned x)
{
int b;
for(b=0;x!=0;x>>=1)
{
if(x&01)
break;
else
b++;
}
I would like to review this method.
Here's a way to compute the count in parallel for better efficiency:
unsigned int v; // 32-bit word input to count zero bits on right
unsigned int c = 32; // c will be the number of zero bits on the right
v &= -signed(v);
if (v) c--;
if (v & 0x0000FFFF) c -= 16;
if (v & 0x00FF00FF) c -= 8;
if (v & 0x0F0F0F0F) c -= 4;
if (v & 0x33333333) c -= 2;
if (v & 0x55555555) c -= 1;
On GCC on X86 platform you can use __builtin_ctz(no)
On Microsoft compilers for X86 you can use _BitScanForward
They both emit a bsf instruction
Another approach (I'm surprised it's not mentioned here) would be to build a table of 256 integers, where each element in the array is the lowest 1 bit for that index. Then, for each byte in the integer, you look up in the table.
Something like this (I haven't taken any time to tweak this, this is just to roughly illustrate the idea):
int bitcount(unsigned x)
{
static const unsigned char table[256] = { /* TODO: populate with constants */ };
for (int i=0; i<sizeof(x); ++i, x >>= 8)
{
unsigned char r = table[x & 0xff];
if (r)
return r + i*8; // Found a 1...
}
// All zeroes...
return sizeof(x)*8;
}
The idea with some of the table-driven approaches to a problem like this is that if statements cost you something in terms of branch prediction, so you should aim to reduce them. It also reduces the number of bit shifts. Your approach does an if statement and a shift per bit, and this one does one per byte. (Hopefully the optimizer can unroll the for loop, and not issue a compare/jump for that.) Some of the other answers have even fewer if statements than this, but a table approach is simple and easy to understand. Of course you should be guided by actual measurements to see if any of this matters.
I think your method is working (allthough you might want to use unsigned int). You check the last digit each time, and if it's zero, you discard it an increment the number of trailing zero-bits.
I think for trailing zeroes you don't need a loop.
Consider the following:
What happens with the number (in binary representation, of course) if you subtract 1? Which digits change, which stay the same?
How could you combine the original number and the decremented version such that only bits representing trailing zeroes are left?
If you apply the above steps correctly, you can just find the highest bit set in O(lg n) steps (look here if you're interested in how to do).
Should be:
int bitcount(unsigned char x)
{
int b;
for(b=0; b<7; x>>=1)
{
if(x&1)
break;
else
b++;
}
return b;
}
or even
int bitcount(unsigned char x)
{
int b;
for(b=0; b<7 && !(x&1); x>>=1) b++;
return b;
}
or even (yay!)
int bitcount(unsigned char x)
{
int b;
for(b=0; b<7 && !(x&1); b++) x>>=1;
return b;
}
or ...
Ah, whatever, there are 100500 millions methods of doing this. Use whatever you need or like.
We can easily get it using bit operations, we don't need to go through all the bits. Pseudo code:
int bitcount(unsigned x) {
int xor = x ^ (x-1); // this will have (1 + #trailing 0s) trailing 1s
return log(i & xor); // i & xor will have only one bit 1 and its log should give the exact number of zeroes
}
int countTrailZero(unsigned x) {
if (x == 0) return DEFAULT_VALUE_YOU_NEED;
return log2 (x & -x);
}
Explanation:
x & -x returns the number of right most bit set with 1.
e.g. 6 -> "0000,0110", (6 & -6) -> "0000,0010"
You can deduct this by two complement:
x = "a1b", where b represents all trailing zeros.
then
-x = !(x) + 1 = !(a1b) + 1 = (!a)0(!b) + 1 = (!a)0(1...1) + 1 = (!a)1(0...0) = (!a)1b
so
x & (-x) = (a1b) & (!a)1b = (0...0)1(0...0)
you can get the number of trailing zeros just by doing log2.

Turn a large chunk of memory backwards, fast

I need to rewrite about 4KB of data in reverse order, at bit level (last bit of last byte becoming first bit of first byte), as fast as possible. Are there any clever sniplets to do it?
Rationale: The data is display contents of LCD screen in an embedded device that is usually positioned in a way that the screen is on your shoulders level. The screen has "6 o'clock" orientation, that is to be viewed from below - like lying flat or hanging above your eyes level. This is fixable by rotating the screen 180 degrees, but then I need to reverse the screen data (generated by library), which is 1 bit = 1 pixel, starting with upper left of the screen. The CPU isn't very powerful, and the device has enough work already, plus several frames a second would be desirable so performance is an issue; RAM not so much.
edit:
Single core, ARM 9 series. 64MB, (to be scaled down to 32MB later), Linux. The data is pushed from system memory to the LCD driver over 8-bit IO port.
The CPU is 32bit and performs much better at this word size than at byte level.
There's a classic way to do this. Let's say unsigned int is your 32-bit word. I'm using C99 because the restrict keyword lets the compiler perform extra optimizations in this speed-critical code that would otherwise be unavailable. These keywords inform the compiler that "src" and "dest" do not overlap. This also assumes you are copying an integral number of words, if you're not, then this is just a start.
I also don't know which bit shifting / rotation primitives are fast on the ARM and which are slow. This is something to consider. If you need more speed, consider disassembling the output from the C compiler and going from there. If using GCC, try O2, O3, and Os to see which one is fastest. You might reduce stalls in the pipeline by doing two words at the same time.
This uses 23 operations per word, not counting load and store. However, these 23 operations are all very fast and none of them access memory. I don't know if a lookup table would be faster or not.
void
copy_rev(unsigned int *restrict dest,
unsigned int const *restrict src,
unsigned int n)
{
unsigned int i, x;
for (i = 0; i < n; ++i) {
x = src[i];
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
x = ((x >> 1) & 0x55555555U) | ((x & 0x555555555) << 1);
dest[n-1-i] = x;
}
}
This page is a great reference: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious
Final note: Looking at the ARM assembly reference, there is a "REV" opcode which reverses the byte order in a word. This would shave 7 operations per loop off the above code.
Fastest way would probably to store the reverse of all possible byte values in a look-up table. The table would take only 256 bytes.
Build a 256 element lookup table of byte values that are bit-reversed from their index.
{0x00, 0x80, 0x40, 0xc0, etc}
Then iterate through your array copying using each byte as an index into your lookup table.
If you are writing assembly language, the x86 instruction set has an XLAT instruction that does just this sort of lookup. Although it may not actually be faster than C code on modern processors.
You can do this in place if you iterate from both ends towards the middle. Because of cache effects, you may find it's faster to swap in 16 byte chunks (assuming a 16 byte cache line).
Here's the basic code (not including the cache line optimization)
// bit reversing lookup table
typedef unsigned char BYTE;
extern const BYTE g_RevBits[256];
void ReverseBitsInPlace(BYTE * pb, int cb)
{
int iter = cb/2;
for (int ii = 0, jj = cb-1; ii < iter; ++ii, --jj)
{
BYTE b1 = g_RevBits[pb[ii]];
pb[ii] = g_RevBits[pb[jj]];
pb[jj] = b1;
}
if (cb & 1) // if the number of bytes was odd, swap the middle one in place
{
pb[cb/2] = g_RevBits[pb[cb/2]];
}
}
// initialize the bit reversing lookup table using macros to make it less typing.
#define BITLINE(n) \
0x0##n, 0x8##n, 0x4##n, 0xC##n, 0x2##n, 0xA##n, 0x6##n, 0xE##n,\
0x1##n, 0x9##n, 0x5##n, 0xD##n, 0x3##n, 0xB##n, 0x7##n, 0xF##n,
const BYTE g_RevBits[256] = {
BITLINE(0), BITLINE(8), BITLINE(4), BITLINE(C),
BITLINE(2), BITLINE(A), BITLINE(6), BITLINE(E),
BITLINE(1), BITLINE(9), BITLINE(5), BITLINE(D),
BITLINE(3), BITLINE(B), BITLINE(7), BITLINE(F),
};
The Bit Twiddling Hacks site is alwas a good starting point for these kind of problems. Take a look here for fast bit reversal. Then its up to you to apply it to each byte/word of your memory block.
EDIT:
Inspired by Dietrich Epps answer and looking at the ARM instruction set, there is a RBIT opcode that reverses the bits contained in a register. So if performance is critical, you might consider using some assembly code.
Loop through the half of the array, convert and exchange bytes.
for( int i = 0; i < arraySize / 2; i++ ) {
char inverted1 = invert( array[i] );
char inverted2 = invert( array[arraySize - i - 1] );
array[i] = inverted2;
array[arraySize - i - 1] = inverted1;
}
For conversion use a precomputed table - an array of 2CHAR_BIT (CHAR_BIT will most likely be 8) elements where at position "I" the result of byte with value "I" inversion is stored. This will be very fast - one pass - and consume only 2CHAR_BIT for the table.
It looks like this code takes about 50 clocks per bit swap on my i7 XPS 8500 machine. 7.6 seconds for a million array flips. Single threaded. It prints some ASCI art based on patterns of 1s and 0s. I rotated the pic left 180 degrees after reversing the bit array, using a graphic editor, and they look identical to me. A double-reversed image comes out the same as the original.
As for pluses, it's a complete solution. It swaps bits from the back of a bit array to the front, vs operating on ints/bytes and then needing to swap ints/bytes in an array.
Also, this is a general purpose bit library, so you might find it handy in the future for solving other, more mundane problems.
Is it as fast as the accepted answer? I think it's close, but without working code to benchmark it's impossible to say. Feel free to cut and paste this working program.
// Reverse BitsInBuff.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "time.h"
#include "memory.h"
//
// Manifest constants
#define uchar unsigned char
#define BUFF_BYTES 510 //400 supports a display of 80x40 bits
#define DW 80 // Display Width
// ----------------------------------------------------------------------------
uchar mask_set[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
uchar mask_clr[] = { 0xfe, 0xfd, 0xfb, 0xf7, 0xef, 0xdf, 0xbf, 0x7f };
//
// Function Prototypes
static void PrintIntBits(long x, int bits);
void BitSet(uchar * BitArray, unsigned long BitNumber);
void BitClr(uchar * BitArray, unsigned long BitNumber);
void BitTog(uchar * BitArray, unsigned long BitNumber);
uchar BitGet(uchar * BitArray, unsigned long BitNumber);
void BitPut(uchar * BitArray, unsigned long BitNumber, uchar value);
//
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt);
static void PrintIntBits(long x, int bits);
// -----------------------------------------------------------------------------
// Reverse the bit ordering in an array
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt) {
unsigned long front=0, back = BitKnt-1;
uchar temp;
while( front<back ) {
temp = BitGet(Buff, front); // copy front bit to temp before overwriting
BitPut(Buff, front, BitGet(Buff, back)); // copy back bit to front bit
BitPut(Buff, back, temp); // copy saved value of front in temp to back of bit arra)
front++;
back--;
}
return Buff;
}
// ---------------------------------------------------------------------------
// ---------------------------------------------------------------------------
int _tmain(int argc, _TCHAR* argv[]) {
int i, j, k, LoopKnt = 1000001;
time_t start;
uchar Buff[BUFF_BYTES];
memset(Buff, 0, sizeof(Buff));
// make an ASCII art picture
for(i=0, k=0; i<(sizeof(Buff)*8)/DW; i++) {
for(j=0; j<DW/2; j++) {
BitSet(Buff, (i*DW)+j+k);
}
k++;
}
// print ASCII art picture
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
i=LoopKnt;
start = clock();
while( i-- ) {
ReverseBitsInArray((uchar *)Buff, BUFF_BYTES * 8);
}
// print ASCII art pic flipped upside-down and rotated left
printf("\nMilliseconds elapsed = %d", clock() - start);
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
printf("\n\nBenchmark time for %d loops\n", LoopKnt);
getchar();
return 0;
}
// -----------------------------------------------------------------------------
// Scaffolding...
static void PrintIntBits(long x, int bits) {
unsigned long long z=1;
int i=0;
z = z << (bits-1);
for (; z > 0; z >>= 1) {
printf("%s", ((x & z) == z) ? "#" : ".");
}
}
// These routines do bit manipulations on a bit array of unsigned chars
// ---------------------------------------------------------------------------
void BitSet(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] |= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitClr(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] &= mask_clr[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitTog(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] ^= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
uchar BitGet(uchar *buff, unsigned long BitNumber) {
return (uchar) ((buff[BitNumber >> 3] >> (BitNumber & 7)) & 1);
}
// ----------------------------------------------------------------------------
void BitPut(uchar *buff, unsigned long BitNumber, uchar value) {
if(value) { // if the bit at buff[BitNumber] is true.
BitSet(buff, BitNumber);
} else {
BitClr(buff, BitNumber);
}
}
Below is the code listing for an optimization using a new buffer, instead of swapping bytes in place. Given that only 2030:4080 BitSet()s are needed because of the if() test, and about half the GetBit()s and PutBits() are eliminated by eliminating TEMP, I suspect memory access time is a large, fixed cost to these kinds of operations, providing a hard limit to optimization.
Using a look-up approach, and CONDITIONALLY swapping bytes, rather than bits, reduces by a factor of 8 the number of memory accesses, and testing for a 0 byte gets amortized across 8 bits, rather than 1.
Using these two approaches together, testing to see if the entire 8-bit char is 0 before doing ANYTHING, including the table lookup, and the write, is likely going to be the fastest possible approach, but would require an extra 512 bytes for the new, destination bit array, and 256 bytes for the lookup table. The performance payoff might be quite dramatic though.
// -----------------------------------------------------------------------------
// Reverse the bit ordering in new array
uchar *ReverseBitsInNewArray(uchar *Dst, const uchar *Src, const int BitKnt) {
int front=0, back = BitKnt-1;
memset(Dst, 0, BitKnt/BitsInByte);
while( front < back ) {
if(BitGet(Src, back--)) { // memset() has already set all bits in Dst to 0,
BitSet(Dst, front); // so only reset if Src bit is 1
}
front++;
}
return Dst;
To reverse a single byte x you can handle the bits one at a time:
unsigned char a = 0;
for (i = 0; i < 8; ++i) {
a += (unsigned char)(((x >> i) & 1) << (7 - i));
}
You can create a cache of these results in an array so that you can quickly reverse a byte just by making a single lookup instead of looping.
Then you just have to reverse the byte array, and when you write the data apply the above mapping. Reversing a byte array is a well documented problem, e.g. here.
Single Core?
How much memory?
Is the display buffered in memory and pushed to the device, or is the only copy of the pixels in the screens memory?
The data is pushed from system memory to the LCD driver over 8-bit IO
port.
Since you'll be writing to the LCD one byte at a time, I think the best idea is to perform the bit reversal right when sending the data to the LCD driver rather than as a separate pre-pass. Something along those lines should be faster than any of the other answers:
void send_to_LCD(uint8_t* data, int len, bool rotate) {
if (rotate)
for (int i=len-1; i>=0; i--)
write(reverse(data[i]));
else
for (int i=0; i<len; i++)
write(data[i]);
}
Where write() is the function that sends a byte to the LCD driver and reverse() one of the single-byte bit reversal methods described in the other answers.
This approach avoids the need to store two copies of the video data in ram and also avoids the read-invert-write roundtrip. Also note that this is the simplest implementation: it could be trivially adapted to load, say, 4 bytes at a time from memory if this were to yield better performance. A smart vectorizing compiler may be even able to do it for you.

Bit reversal of an integer, ignoring integer size and endianness

Given an integer typedef:
typedef unsigned int TYPE;
or
typedef unsigned long TYPE;
I have the following code to reverse the bits of an integer:
TYPE max_bit= (TYPE)-1;
void reverse_int_setup()
{
TYPE bits= (TYPE)max_bit;
while (bits <<= 1)
max_bit= bits;
}
TYPE reverse_int(TYPE arg)
{
TYPE bit_setter= 1, bit_tester= max_bit, result= 0;
for (result= 0; bit_tester; bit_tester>>= 1, bit_setter<<= 1)
if (arg & bit_tester)
result|= bit_setter;
return result;
}
One just needs first to run reverse_int_setup(), which stores an integer with the highest bit turned on, then any call to reverse_int(arg) returns arg with its bits reversed (to be used as a key to a binary tree, taken from an increasing counter, but that's more or less irrelevant).
Is there a platform-agnostic way to have in compile-time the correct value for max_int after the call to reverse_int_setup(); Otherwise, is there an algorithm you consider better/leaner than the one I have for reverse_int()?
Thanks.
#include<stdio.h>
#include<limits.h>
#define TYPE_BITS sizeof(TYPE)*CHAR_BIT
typedef unsigned long TYPE;
TYPE reverser(TYPE n)
{
TYPE nrev = 0, i, bit1, bit2;
int count;
for(i = 0; i < TYPE_BITS; i += 2)
{
/*In each iteration, we swap one bit on the 'right half'
of the number with another on the left half*/
count = TYPE_BITS - i - 1; /*this is used to find how many positions
to the left (and right) we gotta move
the bits in this iteration*/
bit1 = n & (1<<(i/2)); /*Extract 'right half' bit*/
bit1 <<= count; /*Shift it to where it belongs*/
bit2 = n & 1<<((i/2) + count); /*Find the 'left half' bit*/
bit2 >>= count; /*Place that bit in bit1's original position*/
nrev |= bit1; /*Now add the bits to the reversal result*/
nrev |= bit2;
}
return nrev;
}
int main()
{
TYPE n = 6;
printf("%lu", reverser(n));
return 0;
}
This time I've used the 'number of bits' idea from TK, but made it somewhat more portable by not assuming a byte contains 8 bits and instead using the CHAR_BIT macro. The code is more efficient now (with the inner for loop removed). I hope the code is also slightly less cryptic this time. :)
The need for using count is that the number of positions by which we have to shift a bit varies in each iteration - we have to move the rightmost bit by 31 positions (assuming 32 bit number), the second rightmost bit by 29 positions and so on. Hence count must decrease with each iteration as i increases.
Hope that bit of info proves helpful in understanding the code...
The following program serves to demonstrate a leaner algorithm for reversing bits, which can be easily extended to handle 64bit numbers.
#include <stdio.h>
#include <stdint.h>
int main(int argc, char**argv)
{
int32_t x;
if ( argc != 2 )
{
printf("Usage: %s hexadecimal\n", argv[0]);
return 1;
}
sscanf(argv[1],"%x", &x);
/* swap every neigbouring bit */
x = (x&0xAAAAAAAA)>>1 | (x&0x55555555)<<1;
/* swap every 2 neighbouring bits */
x = (x&0xCCCCCCCC)>>2 | (x&0x33333333)<<2;
/* swap every 4 neighbouring bits */
x = (x&0xF0F0F0F0)>>4 | (x&0x0F0F0F0F)<<4;
/* swap every 8 neighbouring bits */
x = (x&0xFF00FF00)>>8 | (x&0x00FF00FF)<<8;
/* and so forth, for say, 32 bit int */
x = (x&0xFFFF0000)>>16 | (x&0x0000FFFF)<<16;
printf("0x%x\n",x);
return 0;
}
This code should not contain errors, and was tested using 0x12345678 to produce 0x1e6a2c48 which is the correct answer.
typedef unsigned long TYPE;
TYPE reverser(TYPE n)
{
TYPE k = 1, nrev = 0, i, nrevbit1, nrevbit2;
int count;
for(i = 0; !i || (1 << i && (1 << i) != 1); i+=2)
{
/*In each iteration, we swap one bit
on the 'right half' of the number with another
on the left half*/
k = 1<<i; /*this is used to find how many positions
to the left (or right, for the other bit)
we gotta move the bits in this iteration*/
count = 0;
while(k << 1 && k << 1 != 1)
{
k <<= 1;
count++;
}
nrevbit1 = n & (1<<(i/2));
nrevbit1 <<= count;
nrevbit2 = n & 1<<((i/2) + count);
nrevbit2 >>= count;
nrev |= nrevbit1;
nrev |= nrevbit2;
}
return nrev;
}
This works fine in gcc under Windows, but I'm not sure if it's completely platform independent. A few places of concern are:
the condition in the for loop - it assumes that when you left shift 1 beyond the leftmost bit, you get either a 0 with the 1 'falling out' (what I'd expect and what good old Turbo C gives iirc), or the 1 circles around and you get a 1 (what seems to be gcc's behaviour).
the condition in the inner while loop: see above. But there's a strange thing happening here: in this case, gcc seems to let the 1 fall out and not circle around!
The code might prove cryptic: if you're interested and need an explanation please don't hesitate to ask - I'll put it up someplace.
#ΤΖΩΤΖΙΟΥ
In reply to ΤΖΩΤΖΙΟΥ 's comments, I present modified version of above which depends on a upper limit for bit width.
#include <stdio.h>
#include <stdint.h>
typedef int32_t TYPE;
TYPE reverse(TYPE x, int bits)
{
TYPE m=~0;
switch(bits)
{
case 64:
x = (x&0xFFFFFFFF00000000&m)>>16 | (x&0x00000000FFFFFFFF&m)<<16;
case 32:
x = (x&0xFFFF0000FFFF0000&m)>>16 | (x&0x0000FFFF0000FFFF&m)<<16;
case 16:
x = (x&0xFF00FF00FF00FF00&m)>>8 | (x&0x00FF00FF00FF00FF&m)<<8;
case 8:
x = (x&0xF0F0F0F0F0F0F0F0&m)>>4 | (x&0x0F0F0F0F0F0F0F0F&m)<<4;
x = (x&0xCCCCCCCCCCCCCCCC&m)>>2 | (x&0x3333333333333333&m)<<2;
x = (x&0xAAAAAAAAAAAAAAAA&m)>>1 | (x&0x5555555555555555&m)<<1;
}
return x;
}
int main(int argc, char**argv)
{
TYPE x;
TYPE b = (TYPE)-1;
int bits;
if ( argc != 2 )
{
printf("Usage: %s hexadecimal\n", argv[0]);
return 1;
}
for(bits=1;b;b<<=1,bits++);
--bits;
printf("TYPE has %d bits\n", bits);
sscanf(argv[1],"%x", &x);
printf("0x%x\n",reverse(x, bits));
return 0;
}
Notes:
gcc will warn on the 64bit constants
the printfs will generate warnings too
If you need more than 64bit, the code should be simple enough to extend
I apologise in advance for the coding crimes I committed above - mercy good sir!
There's a nice collection of "Bit Twiddling Hacks", including a variety of simple and not-so simple bit reversing algorithms coded in C at http://graphics.stanford.edu/~seander/bithacks.html.
I personally like the "Obvious" algorigthm (http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious) because, well, it's obvious. Some of the others may require less instructions to execute. If I really need to optimize the heck out of something I may choose the not-so-obvious but faster versions. Otherwise, for readability, maintainability, and portability I would choose the Obvious one.
Here is a more generally useful variation. Its advantage is its ability to work in situations where the bit length of the value to be reversed -- the codeword -- is unknown but is guaranteed not to exceed a value we'll call maxLength. A good example of this case is Huffman code decompression.
The code below works on codewords from 1 to 24 bits in length. It has been optimized for fast execution on a Pentium D. Note that it accesses the lookup table as many as 3 times per use. I experimented with many variations that reduced that number to 2 at the expense of a larger table (4096 and 65,536 entries). This version, with the 256-byte table, was the clear winner, partly because it is so advantageous for table data to be in the caches, and perhaps also because the processor has an 8-bit table lookup/translation instruction.
const unsigned char table[] = {
0x00,0x80,0x40,0xC0,0x20,0xA0,0x60,0xE0,0x10,0x90,0x50,0xD0,0x30,0xB0,0x70,0xF0,
0x08,0x88,0x48,0xC8,0x28,0xA8,0x68,0xE8,0x18,0x98,0x58,0xD8,0x38,0xB8,0x78,0xF8,
0x04,0x84,0x44,0xC4,0x24,0xA4,0x64,0xE4,0x14,0x94,0x54,0xD4,0x34,0xB4,0x74,0xF4,
0x0C,0x8C,0x4C,0xCC,0x2C,0xAC,0x6C,0xEC,0x1C,0x9C,0x5C,0xDC,0x3C,0xBC,0x7C,0xFC,
0x02,0x82,0x42,0xC2,0x22,0xA2,0x62,0xE2,0x12,0x92,0x52,0xD2,0x32,0xB2,0x72,0xF2,
0x0A,0x8A,0x4A,0xCA,0x2A,0xAA,0x6A,0xEA,0x1A,0x9A,0x5A,0xDA,0x3A,0xBA,0x7A,0xFA,
0x06,0x86,0x46,0xC6,0x26,0xA6,0x66,0xE6,0x16,0x96,0x56,0xD6,0x36,0xB6,0x76,0xF6,
0x0E,0x8E,0x4E,0xCE,0x2E,0xAE,0x6E,0xEE,0x1E,0x9E,0x5E,0xDE,0x3E,0xBE,0x7E,0xFE,
0x01,0x81,0x41,0xC1,0x21,0xA1,0x61,0xE1,0x11,0x91,0x51,0xD1,0x31,0xB1,0x71,0xF1,
0x09,0x89,0x49,0xC9,0x29,0xA9,0x69,0xE9,0x19,0x99,0x59,0xD9,0x39,0xB9,0x79,0xF9,
0x05,0x85,0x45,0xC5,0x25,0xA5,0x65,0xE5,0x15,0x95,0x55,0xD5,0x35,0xB5,0x75,0xF5,
0x0D,0x8D,0x4D,0xCD,0x2D,0xAD,0x6D,0xED,0x1D,0x9D,0x5D,0xDD,0x3D,0xBD,0x7D,0xFD,
0x03,0x83,0x43,0xC3,0x23,0xA3,0x63,0xE3,0x13,0x93,0x53,0xD3,0x33,0xB3,0x73,0xF3,
0x0B,0x8B,0x4B,0xCB,0x2B,0xAB,0x6B,0xEB,0x1B,0x9B,0x5B,0xDB,0x3B,0xBB,0x7B,0xFB,
0x07,0x87,0x47,0xC7,0x27,0xA7,0x67,0xE7,0x17,0x97,0x57,0xD7,0x37,0xB7,0x77,0xF7,
0x0F,0x8F,0x4F,0xCF,0x2F,0xAF,0x6F,0xEF,0x1F,0x9F,0x5F,0xDF,0x3F,0xBF,0x7F,0xFF};
const unsigned short masks[17] =
{0,0,0,0,0,0,0,0,0,0X0100,0X0300,0X0700,0X0F00,0X1F00,0X3F00,0X7F00,0XFF00};
unsigned long codeword; // value to be reversed, occupying the low 1-24 bits
unsigned char maxLength; // bit length of longest possible codeword (<= 24)
unsigned char sc; // shift count in bits and index into masks array
if (maxLength <= 8)
{
codeword = table[codeword << (8 - maxLength)];
}
else
{
sc = maxLength - 8;
if (maxLength <= 16)
{
codeword = (table[codeword & 0X00FF] << sc)
| table[codeword >> sc];
}
else if (maxLength & 1) // if maxLength is 17, 19, 21, or 23
{
codeword = (table[codeword & 0X00FF] << sc)
| table[codeword >> sc] |
(table[(codeword & masks[sc]) >> (sc - 8)] << 8);
}
else // if maxlength is 18, 20, 22, or 24
{
codeword = (table[codeword & 0X00FF] << sc)
| table[codeword >> sc]
| (table[(codeword & masks[sc]) >> (sc >> 1)] << (sc >> 1));
}
}
How about:
long temp = 0;
int counter = 0;
int number_of_bits = sizeof(value) * 8; // get the number of bits that represent value (assuming that it is aligned to a byte boundary)
while(value > 0) // loop until value is empty
{
temp <<= 1; // shift whatever was in temp left to create room for the next bit
temp |= (value & 0x01); // get the lsb from value and set as lsb in temp
value >>= 1; // shift value right by one to look at next lsb
counter++;
}
value = temp;
if (counter < number_of_bits)
{
value <<= counter-number_of_bits;
}
(I'm assuming that you know how many bits value holds and it is stored in number_of_bits)
Obviously temp needs to be the longest imaginable data type and when you copy temp back into value, all the extraneous bits in temp should magically vanish (I think!).
Or, the 'c' way would be to say :
while(value)
your choice
We can store the results of reversing all possible 1 byte sequences in an array (256 distinct entries), then use a combination of lookups into this table and some oring logic to get the reverse of integer.
Here is a variation and correction to TK's solution which might be clearer than the solutions by sundar. It takes single bits from t and pushes them into return_val:
typedef unsigned long TYPE;
#define TYPE_BITS sizeof(TYPE)*8
TYPE reverser(TYPE t)
{
unsigned int i;
TYPE return_val = 0
for(i = 0; i < TYPE_BITS; i++)
{/*foreach bit in TYPE*/
/* shift the value of return_val to the left and add the rightmost bit from t */
return_val = (return_val << 1) + (t & 1);
/* shift off the rightmost bit of t */
t = t >> 1;
}
return(return_val);
}
The generic approach hat would work for objects of any type of any size would be to reverse the of bytes of the object, and the reverse the order of bits in each byte. In this case the bit-level algorithm is tied to a concrete number of bits (a byte), while the "variable" logic (with regard to size) is lifted to the level of whole bytes.
Here's my generalization of freespace's solution (in case we one day get 128-bit machines). It results in jump-free code when compiled with gcc -O3, and is obviously insensitive to the definition of foo_t on sane machines. Unfortunately it does depend on shift being a power of 2!
#include <limits.h>
#include <stdio.h>
typedef unsigned long foo_t;
foo_t reverse(foo_t x)
{
int shift = sizeof (x) * CHAR_BIT / 2;
foo_t mask = (1 << shift) - 1;
int i;
for (i = 0; shift; i++) {
x = ((x & mask) << shift) | ((x & ~mask) >> shift);
shift >>= 1;
mask ^= (mask << shift);
}
return x;
}
int main() {
printf("reverse = 0x%08lx\n", reverse(0x12345678L));
}
In case bit-reversal is time critical, and mainly in conjunction with FFT, the best is to store the whole bit reversed array. In any case, this array will be smaller in size than the roots of unity that have to be precomputed in FFT Cooley-Tukey algorithm. An easy way to compute the array is:
int BitReverse[Size]; // Size is power of 2
void Init()
{
BitReverse[0] = 0;
for(int i = 0; i < Size/2; i++)
{
BitReverse[2*i] = BitReverse[i]/2;
BitReverse[2*i+1] = (BitReverse[i] + Size)/2;
}
} // end it's all

Resources