Turn a large chunk of memory backwards, fast - c

I need to rewrite about 4KB of data in reverse order, at the bit level (the last bit of the last byte becoming the first bit of the first byte), as fast as possible. Are there any clever snippets to do it?
Rationale: The data is the display contents of an LCD screen in an embedded device that is usually mounted so the screen sits at about shoulder level. The screen has a "6 o'clock" orientation, that is, it is meant to be viewed from below - lying flat or hanging above eye level. This is fixable by rotating the screen 180 degrees, but then I need to reverse the screen data (generated by a library), which is 1 bit = 1 pixel, starting with the upper left of the screen. The CPU isn't very powerful and the device has enough work already, plus several frames a second would be desirable, so performance is an issue; RAM not so much.
edit:
Single core, ARM9 series, 64MB RAM (to be scaled down to 32MB later), Linux. The data is pushed from system memory to the LCD driver over an 8-bit IO port.
The CPU is 32-bit and performs much better at this word size than at the byte level.

There's a classic way to do this. Let's say unsigned int is your 32-bit word. I'm using C99 because the restrict keyword lets the compiler perform extra optimizations in this speed-critical code that would otherwise be unavailable: it tells the compiler that "src" and "dest" do not overlap. This also assumes you are copying an integral number of words; if you're not, then this is just a start.
I also don't know which bit shifting / rotation primitives are fast on the ARM and which are slow; that is something to consider. If you need more speed, consider disassembling the output from the C compiler and going from there. If using GCC, try -O2, -O3, and -Os to see which one is fastest. You might also reduce pipeline stalls by processing two words at a time.
This uses 23 operations per word, not counting load and store. However, these 23 operations are all very fast and none of them access memory. I don't know if a lookup table would be faster or not.
void
copy_rev(unsigned int *restrict dest,
         unsigned int const *restrict src,
         unsigned int n)
{
    unsigned int i, x;
    for (i = 0; i < n; ++i) {
        x = src[i];
        x = (x >> 16) | (x << 16);
        x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
        x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
        x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
        x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
        dest[n-1-i] = x;
    }
}
This page is a great reference: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious
Final note: Looking at the ARM assembly reference, there is a "REV" opcode which reverses the byte order in a word. This would shave 7 operations per loop off the above code.
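For instance, with GCC or Clang the byte swap can be left to the compiler: __builtin_bswap32() typically compiles to a single REV where the core has it (ARMv6 and later; on older cores it falls back to shifts). A hedged sketch of the shortened loop:
void copy_rev_bswap(unsigned int *restrict dest,
                    unsigned int const *restrict src,
                    unsigned int n)
{
    unsigned int i, x;
    for (i = 0; i < n; ++i) {
        x = __builtin_bswap32(src[i]); /* reverse byte order, ideally one REV */
        x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
        x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
        x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
        dest[n-1-i] = x; /* mirror the word position as before */
    }
}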

The fastest way would probably be to store the reverse of all possible byte values in a look-up table. The table would take only 256 bytes.

Build a 256 element lookup table of byte values that are bit-reversed from their index.
{0x00, 0x80, 0x40, 0xc0, etc}
Then iterate through your array copying using each byte as an index into your lookup table.
If you are writing assembly language, the x86 instruction set has an XLAT instruction that does just this sort of lookup, although it may not actually be faster than C code on modern processors.
You can do this in place if you iterate from both ends towards the middle. Because of cache effects, you may find it's faster to swap in 16 byte chunks (assuming a 16 byte cache line).
Here's the basic code (not including the cache line optimization)
// bit reversing lookup table
typedef unsigned char BYTE;
extern const BYTE g_RevBits[256];
void ReverseBitsInPlace(BYTE * pb, int cb)
{
    int iter = cb/2;
    for (int ii = 0, jj = cb-1; ii < iter; ++ii, --jj)
    {
        BYTE b1 = g_RevBits[pb[ii]];
        pb[ii] = g_RevBits[pb[jj]];
        pb[jj] = b1;
    }
    if (cb & 1) // if the number of bytes was odd, swap the middle one in place
    {
        pb[cb/2] = g_RevBits[pb[cb/2]];
    }
}
// initialize the bit reversing lookup table using macros to make it less typing.
#define BITLINE(n) \
0x0##n, 0x8##n, 0x4##n, 0xC##n, 0x2##n, 0xA##n, 0x6##n, 0xE##n,\
0x1##n, 0x9##n, 0x5##n, 0xD##n, 0x3##n, 0xB##n, 0x7##n, 0xF##n,
const BYTE g_RevBits[256] = {
BITLINE(0), BITLINE(8), BITLINE(4), BITLINE(C),
BITLINE(2), BITLINE(A), BITLINE(6), BITLINE(E),
BITLINE(1), BITLINE(9), BITLINE(5), BITLINE(D),
BITLINE(3), BITLINE(B), BITLINE(7), BITLINE(F),
};

The Bit Twiddling Hacks site is always a good starting point for this kind of problem. Take a look here for fast bit reversal. Then it's up to you to apply it to each byte/word of your memory block.
EDIT:
Inspired by Dietrich Epp's answer and looking at the ARM instruction set, there is an RBIT opcode that reverses the bits contained in a register. So if performance is critical, you might consider using some assembly code.
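A minimal sketch of what that could look like with GCC extended inline assembly; note that RBIT only exists on ARMv6T2/ARMv7 and later cores, so it would not help on a plain ARM9:
/* Sketch only: requires a core with the RBIT instruction (ARMv6T2/ARMv7+). */
static inline unsigned int rbit32(unsigned int x)
{
    unsigned int r;
    __asm__("rbit %0, %1" : "=r"(r) : "r"(x)); /* reverse all 32 bits of x */
    return r;
}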

Loop through half of the array, converting and exchanging bytes.
for( int i = 0; i < arraySize / 2; i++ ) {
    char inverted1 = invert( array[i] );
    char inverted2 = invert( array[arraySize - i - 1] );
    array[i] = inverted2;
    array[arraySize - i - 1] = inverted1;
}
For conversion use a precomputed table - an array of 2^CHAR_BIT (CHAR_BIT will most likely be 8) elements where at position "i" the bit-reversed value of the byte "i" is stored. This will be very fast - one pass - and the table consumes only 2^CHAR_BIT bytes.
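For completeness, a sketch of what invert() and its table could look like (names are illustrative, assuming CHAR_BIT == 8):
static unsigned char rev_table[256]; /* rev_table[v] == bit-reversed v */

static void init_rev_table(void)
{
    int v, i;
    for (v = 0; v < 256; v++) {
        unsigned char r = 0;
        for (i = 0; i < 8; i++)
            if (v & (1 << i))
                r |= 1 << (7 - i); /* mirror bit i into bit 7-i */
        rev_table[v] = r;
    }
}

static unsigned char invert(unsigned char b)
{
    return rev_table[b];
}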

It looks like this code takes about 50 clocks per bit swap on my i7 XPS 8500 machine - 7.6 seconds for a million array flips, single threaded. It prints some ASCII art based on patterns of 1s and 0s. I rotated the picture 180 degrees after reversing the bit array, using a graphic editor, and they look identical to me. A double-reversed image comes out the same as the original.
As for pluses, it's a complete solution. It swaps bits from the back of a bit array to the front, vs operating on ints/bytes and then needing to swap ints/bytes in an array.
Also, this is a general purpose bit library, so you might find it handy in the future for solving other, more mundane problems.
Is it as fast as the accepted answer? I think it's close, but without working code to benchmark it's impossible to say. Feel free to cut and paste this working program.
// Reverse BitsInBuff.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "time.h"
#include "memory.h"
//
// Manifest constants
#define uchar unsigned char
#define BUFF_BYTES 510 //400 supports a display of 80x40 bits
#define DW 80 // Display Width
// ----------------------------------------------------------------------------
uchar mask_set[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
uchar mask_clr[] = { 0xfe, 0xfd, 0xfb, 0xf7, 0xef, 0xdf, 0xbf, 0x7f };
//
// Function Prototypes
static void PrintIntBits(long x, int bits);
void BitSet(uchar * BitArray, unsigned long BitNumber);
void BitClr(uchar * BitArray, unsigned long BitNumber);
void BitTog(uchar * BitArray, unsigned long BitNumber);
uchar BitGet(uchar * BitArray, unsigned long BitNumber);
void BitPut(uchar * BitArray, unsigned long BitNumber, uchar value);
//
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt);
static void PrintIntBits(long x, int bits);
// -----------------------------------------------------------------------------
// Reverse the bit ordering in an array
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt) {
unsigned long front=0, back = BitKnt-1;
uchar temp;
while( front<back ) {
temp = BitGet(Buff, front); // copy front bit to temp before overwriting
BitPut(Buff, front, BitGet(Buff, back)); // copy back bit to front bit
BitPut(Buff, back, temp); // copy saved value of front bit to back of bit array
front++;
back--;
}
return Buff;
}
// ---------------------------------------------------------------------------
// ---------------------------------------------------------------------------
int _tmain(int argc, _TCHAR* argv[]) {
int i, j, k, LoopKnt = 1000001;
time_t start;
uchar Buff[BUFF_BYTES];
memset(Buff, 0, sizeof(Buff));
// make an ASCII art picture
for(i=0, k=0; i<(sizeof(Buff)*8)/DW; i++) {
for(j=0; j<DW/2; j++) {
BitSet(Buff, (i*DW)+j+k);
}
k++;
}
// print ASCII art picture
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
i=LoopKnt;
start = clock();
while( i-- ) {
ReverseBitsInArray((uchar *)Buff, BUFF_BYTES * 8);
}
// print ASCII art pic flipped upside-down and rotated left
printf("\nMilliseconds elapsed = %d", clock() - start);
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
printf("\n\nBenchmark time for %d loops\n", LoopKnt);
getchar();
return 0;
}
// -----------------------------------------------------------------------------
// Scaffolding...
static void PrintIntBits(long x, int bits) {
unsigned long long z=1;
int i=0;
z = z << (bits-1);
for (; z > 0; z >>= 1) {
printf("%s", ((x & z) == z) ? "#" : ".");
}
}
// These routines do bit manipulations on a bit array of unsigned chars
// ---------------------------------------------------------------------------
void BitSet(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] |= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitClr(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] &= mask_clr[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitTog(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] ^= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
uchar BitGet(uchar *buff, unsigned long BitNumber) {
return (uchar) ((buff[BitNumber >> 3] >> (BitNumber & 7)) & 1);
}
// ----------------------------------------------------------------------------
void BitPut(uchar *buff, unsigned long BitNumber, uchar value) {
if(value) { // if the bit at buff[BitNumber] is true.
BitSet(buff, BitNumber);
} else {
BitClr(buff, BitNumber);
}
}
Below is the code listing for an optimization using a new buffer, instead of swapping bytes in place. Given that only about 2030 of the 4080 BitSet()s are needed because of the if() test, and about half the BitGet()s and BitPut()s are eliminated by eliminating TEMP, I suspect memory access time is a large, fixed cost to these kinds of operations, providing a hard limit to optimization.
Using a look-up approach and CONDITIONALLY swapping bytes, rather than bits, reduces the number of memory accesses by a factor of 8, and the test for a 0 byte gets amortized across 8 bits, rather than 1.
Using these two approaches together, testing to see if the entire 8-bit char is 0 before doing ANYTHING, including the table lookup, and the write, is likely going to be the fastest possible approach, but would require an extra 512 bytes for the new, destination bit array, and 256 bytes for the lookup table. The performance payoff might be quite dramatic though.
// -----------------------------------------------------------------------------
// Reverse the bit ordering in new array
uchar *ReverseBitsInNewArray(uchar *Dst, const uchar *Src, const int BitKnt) {
int front=0, back = BitKnt-1;
memset(Dst, 0, BitKnt/BitsInByte); // BitsInByte == 8
while( front < BitKnt ) { // every bit must be visited when copying into a new array
if(BitGet(Src, back--)) { // memset() has already set all bits in Dst to 0,
BitSet(Dst, front); // so only set the bit if the Src bit is 1
}
front++;
}
return Dst;
}
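The byte-oriented variant described above (reverse whole bytes through a lookup table, skipping zero bytes) might look roughly like this sketch; g_RevBits is the 256-entry table from the earlier answer, and the zero test only pays off if the source really is mostly zero:
// Sketch: ByteKnt is the buffer length in bytes; Dst must not overlap Src.
uchar *ReverseBytesInNewArray(uchar *Dst, const uchar *Src, int ByteKnt) {
    int i;
    memset(Dst, 0, ByteKnt);
    for (i = 0; i < ByteKnt; i++) {
        uchar b = Src[ByteKnt - 1 - i];
        if (b) // Dst is already zeroed, so only write non-zero bytes
            Dst[i] = g_RevBits[b];
    }
    return Dst;
}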

To reverse a single byte x you can handle the bits one at a time:
unsigned char a = 0;
for (int i = 0; i < 8; ++i) {
    a += (unsigned char)(((x >> i) & 1) << (7 - i));
}
You can create a cache of these results in an array so that you can quickly reverse a byte just by making a single lookup instead of looping.
Then you just have to reverse the byte array, and when you write the data apply the above mapping. Reversing a byte array is a well documented problem, e.g. here.
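Put together, a minimal sketch of that idea (build the 256-entry cache once with the loop above, then write a bit-reversed, byte-reversed copy):
static unsigned char rev[256];

void build_rev_table(void)
{
    int v, i;
    for (v = 0; v < 256; v++) {
        unsigned char a = 0;
        for (i = 0; i < 8; i++)
            a += (unsigned char)(((v >> i) & 1) << (7 - i));
        rev[v] = a;
    }
}

void reverse_buffer(unsigned char *dst, const unsigned char *src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = rev[src[n - 1 - i]]; /* last byte first, bits mirrored */
}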

Single Core?
How much memory?
Is the display buffered in memory and pushed to the device, or is the only copy of the pixels in the screen's memory?

The data is pushed from system memory to the LCD driver over 8-bit IO
port.
Since you'll be writing to the LCD one byte at a time, I think the best idea is to perform the bit reversal right when sending the data to the LCD driver rather than as a separate pre-pass. Something along those lines should be faster than any of the other answers:
void send_to_LCD(uint8_t* data, int len, bool rotate) {
    if (rotate)
        for (int i=len-1; i>=0; i--)
            write(reverse(data[i]));
    else
        for (int i=0; i<len; i++)
            write(data[i]);
}
Where write() is the function that sends a byte to the LCD driver and reverse() one of the single-byte bit reversal methods described in the other answers.
This approach avoids the need to store two copies of the video data in ram and also avoids the read-invert-write roundtrip. Also note that this is the simplest implementation: it could be trivially adapted to load, say, 4 bytes at a time from memory if this were to yield better performance. A smart vectorizing compiler may be even able to do it for you.
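For example, a hedged sketch of the word-at-a-time variant: it assumes a little-endian CPU, len being a multiple of 4, <string.h>/<stdint.h> available, write() as above, and rev32() being any of the 32-bit bit-reversal routines from the other answers. Reversing all 32 bits of a word also reverses its byte order, so the bytes can be emitted low to high:
void send_to_LCD_rotated_words(const uint8_t* data, int len) {
    for (int i = len - 4; i >= 0; i -= 4) {
        uint32_t w;
        memcpy(&w, data + i, 4);   /* alignment-safe 4-byte load */
        w = rev32(w);              /* bit-reverse the whole word */
        write( w        & 0xff);   /* = reverse(data[i+3]) */
        write((w >> 8)  & 0xff);   /* = reverse(data[i+2]) */
        write((w >> 16) & 0xff);   /* = reverse(data[i+1]) */
        write((w >> 24) & 0xff);   /* = reverse(data[i])   */
    }
}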

Which is faster: logarithm or lookup table? [duplicate]

I have a byte I'm using for bit flags. I know that one and only one bit in the byte is set at any given time.
Ex: unsigned char b = 0x20; // (00100000) the sixth bit from the right is set
I currently use the following loop to determine which bit is set:
int getSetBitLocation(unsigned char b) {
    int i=0;
    while( !((b >> i++) & 0x01) ) { ; }
    return i;
}
How do I most efficiently determine the position of the set bit? Can I do this without iteration?
Can I do this without iteration?
It is indeed possible.
How do I most efficiently determine the position of the set bit?
You can try this algorithm. It splits the char in half to search for the top bit, shifting to the low half each time:
int getTopSetBit(unsigned char b) {
    int res = 0;
    if(b>15){
        b = b >> 4;
        res = res + 4;
    }
    if(b>3){
        b = b >> 2;
        res = res + 2;
    }
    //thanks #JasonD
    return res + (b>>1);
}
It uses two comparisons (three for uint16s, four for uint32s...), and it might be faster than your loop. It is definitely not shorter.
Based on the idea by Anton Kovalenko (hashed lookup) and the comment by 6502 (division is slow), I also suggest this implementation (8-bit => 3-bit hash using a de Bruijn-like sequence):
static const int lookup[8] = {7, 0, 5, 1, 6, 4, 3, 2};
int getBitPosition(unsigned char b) {
    // return lookup[(b | (b>>1) | (b>>2) | (b>>4)) & 0x7];
    return lookup[((b * 0x1D) >> 4) & 0x7];
}
or (larger LUT, but uses just three terms instead of four)
static const int lookup[16] = {0xFF, 0, 1, 4, 2, 0xFF, 5, 0xFF, 7, 3, 0xFF, 0xFF, 6, 0xFF, 0xFF, 0xFF};
int getBitPosition(unsigned char b) {
    return lookup[(b | (b>>3) | (b>>4)) & 0xF];
}
A lookup table is simple enough, and you can reduce its size if the set of values is sparse. Let's try with 11 elements instead of 256:
unsigned char expt2mod11_bits[11]={0xFF,0,1,0xFF,2,4,0xFF,7,3,6,5};
unsigned char pos = expt2mod11_bits[b%11];
assert(pos < 8);
assert(1<<pos == b);
Of course, it's not necessarily more effective, especially for 8 bits, but the same trick can be used for larger sizes, where full lookup table would be awfully big. Let's see:
unsigned int w;
....
unsigned char expt2mod19_bits[19]={0xFF,0,1,13,2,0xFF,14,6,3,8,0xFF,12,15,5,7,11,4,10,9};
unsigned char pos = expt2mod19_bits[w%19];
assert(pos < 16);
assert(1<<pos == w);
This is a quite common problem for chess programs that use 64 bits to represent positions (i.e. one 64-bit number to store where all the white pawns are, another for where all the black ones are, and so on).
With this representation there is sometimes the need to find the index 0...63 of the first or last set bit and there are several possible approaches:
Just doing a loop like you did
Using a dichotomic search (i.e. if x & 0x00000000ffffffffULL is zero there's no need to check low 32 bits)
Using special instruction if available on the processor (e.g. bsf and bsr on x86)
Using lookup tables (of course not for the whole 64-bit value, but for 8 or 16 bits)
What is faster however really depends on your hardware and on real use cases.
For 8 bits only and a modern processor I think that probably a lookup table with 256 entries is the best choice...
But are you really sure this is the bottleneck of your algorithm?
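If you do go the special-instruction route from the list above, GCC and Clang expose it portably as a builtin (a sketch; it returns a 0-based position, unlike the 1-based loop in the question, and is undefined for b == 0):
int getSetBitLocation(unsigned char b) {
    return __builtin_ctz(b); /* count trailing zeros = index of the single set bit */
}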
unsigned getSetBitLocation(unsigned char b) {
    unsigned pos=0;
    pos = (b & 0xf0) ? 4 : 0; b |= b >>4;
    pos += (b & 0xc) ? 2 : 0; b |= b >>2;
    pos += (b & 0x2) ? 1 : 0;
    return pos;
}
It would be hard to do it jump-free. Maybe with de Bruijn sequences?
Based on log2 calculation in Find the log base 2 of an N-bit integer in O(lg(N)) operations:
int getSetBitLocation(unsigned char c) {
    // c is in {1, 2, 4, 8, 16, 32, 64, 128}, returned values are {0, 1, ..., 7}
    return (((c & 0xAA) != 0) |
            (((c & 0xCC) != 0) << 1) |
            (((c & 0xF0) != 0) << 2));
}
Easiest thing is to create a lookup table. The simplest one will be sparse (having 256 elements) but it would technically avoid iteration.
This comment here technically avoids iteration, but who are we kidding, it is still doing the same number of checks: How to write log base(2) in c/c++
A closed form would be log2(), i.e. log2() + 1, but I'm not sure how efficient that is - possibly the CPU has an instruction for taking base-2 logarithms?
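As a sketch, that would be something like the following (floating-point math, so probably slower than a lookup table; it keeps the question's 1-based convention):
#include <math.h>

int getSetBitLocation(unsigned char b) {
    return (int)log2((double)b) + 1; /* 1-based, matching the loop in the question */
}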
if you define
const unsigned char bytes[] = {1, 2, 4, 8, 16, 32, 64, 128};
and use
struct byte {
    unsigned char data;
    int pos;
};
void assign(struct byte *b, int i) {
    b->data = bytes[i];
    b->pos = i;
}
you don't need to determine the position of the set bit
A lookup table is fast and easy when CHAR_BIT == 8, but on some systems, CHAR_BIT == 16 or 32 and a lookup table becomes insanely bulky. If you're considering a lookup table, I'd suggest wrapping it; make it a "lookup table function", instead, so that you can swap the logic when you need to optimise.
Using divide and conquer, by performing a binary search on a sorted array, involves about log2(CHAR_BIT) comparisons. That code is more complex, involving initialisation of an array of unsigned char to use as a lookup table for a start. Once you have such an array initialised, you can use bsearch to search it, for example:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>

void uchar_bit_init(unsigned char *table) {
    for (size_t x = 0; x < CHAR_BIT; x++) {
        table[x] = 1U << x;
    }
}

int uchar_compare(void const *x, void const *y) {
    unsigned char const *X = x, *Y = y;
    return (*X > *Y) - (*X < *Y);
}

size_t uchar_bit_lookup(unsigned char *table, unsigned char value) {
    unsigned char *position = bsearch(&value, table, CHAR_BIT, 1, uchar_compare);
    return position ? position - table + 1 : 0;
}

int main(void) {
    unsigned char lookup[CHAR_BIT];
    uchar_bit_init(lookup);
    for (;;) {
        int c = getchar();
        if (c == EOF) { break; }
        printf("Bit for %c found at %zu\n", c, uchar_bit_lookup(lookup, (unsigned char)c));
    }
}
P.S. This sounds like micro-optimisation. Get your solution done (abstracting the operations required into these functions), then worry about optimisations based on your profiling. Make sure your profiling targets the system that your solution will run on if you're going to focus on micro-optimisations, because the efficiency of micro-optimisations differ widely as hardware differs even slightly... It's usually a better idea to buy a faster PC ;)

make the moving bits more efficient

in our company's latest project, we want to shift a char array left by half a byte (4 bits), e.g.
char buf[] = {0x12, 0x34, 0x56, 0x78, 0x21};
we want to make the buf like
0x23, 0x45, 0x67, 0x82, 0x10
how do I make the process more efficient, can you make the time complexity less than O(N) if there are N bytes to be processed?
SOS...
Without more context, I would go as far as questioning the need for an actual array. If you have 4 bytes, they can easily be represented using a uint32_t, and then you can perform an O(1) shift operation:
uint32_t x = 0x12345678;
uint32_t offByHalf = x << 4;
This way, you would replace array access with bit masking, like this:
array[i]
would be equivalent to
(x >> 8 * (3 - i)) & 0xff
And who knows, arithmetic may even be faster than memory access. But don't take my word for it, benchmark it.
No, if you want to actually shift the array, you'll need to hit every element at least once so it'll be O(n). There's no getting around that. You can do it with something like the following:
#include <stdio.h>
void shiftNybbleLeft (unsigned char *arr, size_t sz) {
    for (int i = 1; i < sz; i++)
        arr[i-1] = ((arr[i-1] & 0x0f) << 4) | (arr[i] >> 4);
    arr[sz-1] = (arr[sz-1] & 0x0f) << 4;
}

int main (int argc, char *argv[]) {
    unsigned char buf[] = {0x12, 0x34, 0x56, 0x78};
    shiftNybbleLeft (buf, sizeof (buf));
    for (int i = 0; i < sizeof (buf); i++)
        printf ("0x%02x ", buf[i]);
    putchar ('\n');
    return 0;
}
which gives you:
0x23 0x45 0x67 0x80
That's not to say you can't make it more efficient (a). If you instead modify your extraction code so that it behaves differently, you can avoid the shifting operation.
In other words, don't shift the array, simply set an offset variable and use that to modify the extraction process. Examine the following code:
#include <stdio.h>
unsigned char getByte (unsigned char *arr, size_t index, size_t shiftSz) {
    if ((shiftSz % 2) == 0)
        return arr[index + shiftSz / 2];
    return ((arr[index + shiftSz / 2] & 0x0f) << 4)
         | (arr[index + shiftSz / 2 + 1] >> 4);
}

int main (int argc, char *argv[]) {
    unsigned char buf[] = {0x12, 0x34, 0x56, 0x78};
    //shiftNybbleLeft (buf, sizeof (buf));
    for (int i = 0; i < 4; i++)
        printf ("buf[1] with left shift %d nybbles -> 0x%02x\n",
            i, getByte (buf, 1, i));
    return 0;
}
}
With shiftSz set to 0, it's as if the array isn't shifted. By setting shiftSz to non-zero, an O(1) operation, getByte() will actually return the element as if you had shifted it by that amount. The output is as you would expect:
buf[1] with left shift 0 nybbles -> 0x34
buf[1] with left shift 1 nybbles -> 0x45
buf[1] with left shift 2 nybbles -> 0x56
buf[1] with left shift 3 nybbles -> 0x67
Now that may seem a contrived example (because it is) but there's ample precedent in using tricks like that to avoid potentially costly operations. You'd probably also want to add some bounds checking to catch problems with referencing outside the array.
Keep in mind that there's a trade-off. What you gain by not having to shift the array may be offset to some degree by the calculations done during extraction. Whether it's actually worth it depends on how you use the data. If the array is large but you don't extract that many values from it, this trick may be worth it.
As another example of using "tricks" to prevent costly operations, I've seen text editors that don't bother shifting the contents of lines either (when deleting a character for example). Instead they simply set the character to a 0 code point and take care of it when displaying the line (ignoring the 0 code points).
They'll generally clean up eventually but often in the background where it won't interfere with your editing speed.
(a) Though you may want to actually make sure this is necessary.
One of your comments stated that your arrays are about 500 entries in length and I can tell you that my not-supremely-grunty development box can shift that array one nybble to the left at the rate of about half a million times every single second.
So, even if your profiler states that a large proportion of time is being spent in there, that doesn't necessarily mean it's a large amount of time.
You should only look into optimising code if there's a specific, identified bottleneck.
I'll tackle the only objectively answerable part of the question, which is:
can you make the time complexity less than O(N) if there are N bytes to be processed?
If you need the entire output array then no, you cannot do better than O(N).
If you only need certain elements of the output array, then you can compute just those.
It may not compile well due to alignment, but you can try using a bitfield offset in a struct.
struct __attribute__((packed)) shifted{
char offset:4; // dump data
char data[N]; // rest of data
};
or on some systems
struct __attribute__((packed)) shifted{
char offset:4; // dump data
char data[N]; // rest of data
char last:4; // to make an even byte
};
struct shifted *shifted_buf=&buf;
//now operate on shifted_buf->data
Or you can try making it a union
union __attribute__((packed)) {
char old[N];
struct{
char offset:4;
char buf[N];
char last:4; // to make an even byte
}shifted;
}data;
The alternative would be to cast to an array of int and <<4 for each int, reducing it to N/4, but this is dependent on endianness.

Bitwise memmove

What is the best way to implement a bitwise memmove? The method should take an additional destination and source bit-offset and the count should be in bits too.
I saw that ARM provides a non-standard _membitmove, which does exactly what I need, but I couldn't find its source.
BIND's bitset includes isc_bitstring_copy, but it's not efficient
I'm aware that the C standard library doesn't provide such a method, but I also couldn't find any third-party code providing a similar method.
Assuming "best" means "easiest", you can copy bits one by one. Conceptually, an address of a bit is an object (struct) that has a pointer to a byte in memory and an index of a bit in the byte.
struct pointer_to_bit
{
    uint8_t* p;
    int b;
};

void membitmovebl(
    void *dest,
    const void *src,
    int dest_offset,
    int src_offset,
    size_t nbits)
{
    // Create pointers to bits
    struct pointer_to_bit d = {dest, dest_offset};
    struct pointer_to_bit s = {(uint8_t *)src, src_offset}; // cast away const; s is only read

    // Bring the bit offsets to range (0...7)
    d.p += d.b / 8; // replace division by right-shift if bit offset can be negative
    d.b %= 8;       // replace "%=8" by "&=7" if bit offset can be negative
    s.p += s.b / 8;
    s.b %= 8;

    // Determine whether it's OK to loop forward
    if (d.p < s.p || (d.p == s.p && d.b <= s.b))
    {
        // Copy bits one by one
        for (size_t i = 0; i < nbits; i++)
        {
            // Read 1 bit
            int bit = (*s.p >> s.b) & 1;
            // Write 1 bit
            *d.p &= ~(1 << d.b);
            *d.p |= bit << d.b;
            // Advance pointers
            if (++s.b == 8)
            {
                s.b = 0;
                ++s.p;
            }
            if (++d.b == 8)
            {
                d.b = 0;
                ++d.p;
            }
        }
    }
    else
    {
        // Copy stuff backwards - essentially the same code but ++ replaced by --
    }
}
If you want to write a version optimized for speed, you will have to do copying by bytes (or, better, words), unroll loops, and handle a number of special cases (memmove does that; you will have to do more because your function is more complicated).
P.S. Oh, seeing that you call isc_bitstring_copy inefficient, you probably want the speed optimization. You can use the following idea:
Start copying bits individually until the destination is byte-aligned (d.b == 0). Then, it is easy to copy 8 bits at once, doing some bit twiddling. Do this until there are less than 8 bits left to copy; then continue copying bits one by one.
// Copy 8 bits from s to d and advance pointers
// (assumes s.b != 0; if s.b == 0 the source is already byte-aligned and a plain byte copy will do)
*d.p = *s.p++ >> s.b;
*d.p++ |= *s.p << (8 - s.b);
P.P.S Oh, and seeing your comment on what you are going to use the code for, you don't really need to implement all the versions (byte/halfword/word, big/little-endian); you only want the easiest one - the one working with words (uint32_t).
Here is a partial implementation (not tested). There are obvious efficiency and usability improvements.
Copy n bytes from src to dest (not overlapping src), and shift bits at dest rightwards by bit bits, 0 <= bit <= 7. This assumes that the least significant bits are at the right of the bytes
#include <string.h> /* for memcpy */

void memcpy_with_bitshift(unsigned char *dest, unsigned char *src, size_t n, int bit)
{
    size_t i;
    memcpy(dest, src, n);
    for (i = 0; i < n; i++) {
        dest[i] >>= bit;
    }
    for (i = 0; i + 1 < n; i++) {
        dest[i+1] |= (src[i] << (8 - bit));
    }
}
Some improvements to be made:
Don't overwrite first bit bits at beginning of dest.
Merge loops
Have a way to copy a number of bits not divisible by 8
Fix for >8 bits in a char
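A sketch folding in the first two of those improvements (single merged loop, first `bit` bits of dest preserved, no write past the end; still assumes 8-bit chars and 1 <= bit <= 7):
void memcpy_with_bitshift2(unsigned char *dest, const unsigned char *src,
                           size_t n, int bit)
{
    size_t i;
    unsigned char keep = dest[0] & (unsigned char)(0xFFu << (8 - bit)); /* bits not to overwrite */
    for (i = 0; i < n; i++) {
        unsigned char lo = src[i] >> bit;                                  /* this byte, shifted */
        unsigned char hi = i ? (unsigned char)(src[i-1] << (8 - bit)) : 0; /* carry from previous byte */
        dest[i] = hi | lo;
    }
    dest[0] |= keep; /* restore the first `bit` bits of dest */
}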

How to define and work with an array of bits in C?

I want to create a very large array on which I write '0's and '1's. I'm trying to simulate a physical process called random sequential adsorption, where units of length 2, dimers, are deposited onto an n-dimensional lattice at a random location, without overlapping each other. The process stops when there is no more room left on the lattice for depositing more dimers (lattice is jammed).
Initially I start with a lattice of zeroes, and the dimers are represented by a pair of '1's. As each dimer is deposited, the site on the left of the dimer is blocked, due to the fact that the dimers cannot overlap. So I simulate this process by depositing a triple of '1's on the lattice. I need to repeat the entire simulation a large number of times and then work out the average coverage %.
I've already done this using an array of chars for 1D and 2D lattices. At the moment I'm trying to make the code as efficient as possible, before working on the 3D problem and more complicated generalisations.
This is basically what the code looks like in 1D, simplified:
int main()
{
/* Define lattice */
array = (char*)malloc(N * sizeof(char));
total_c = 0;
/* Carry out RSA multiple times */
for (i = 0; i < 1000; i++)
rand_seq_ads();
/* Calculate average coverage efficiency at jamming */
printf("coverage efficiency = %lf", total_c/1000);
return 0;
}
void rand_seq_ads()
{
/* Initialise array, initial conditions */
memset(array, 0, N * sizeof(char));
available_sites = N;
count = 0;
/* While the lattice still has enough room... */
while(available_sites != 0)
{
/* Generate random site location */
x = rand();
/* Deposit dimer (if site is available) */
if(array[x] == 0)
{
array[x] = 1;
array[x+1] = 1;
count += 1;
available_sites += -2;
}
/* Mark site left of dimer as unavailable (if its empty) */
if(array[x-1] == 0)
{
array[x-1] = 1;
available_sites += -1;
}
}
/* Calculate coverage %, and add to total */
c = (double)count / N;
total_c += c;
}
For the actual project I'm doing, it involves not just dimers but trimers, quadrimers, and all sorts of shapes and sizes (for 2D and 3D).
I was hoping that I would be able to work with individual bits instead of bytes, but I've been reading around and as far as I can tell you can only change 1 byte at a time, so either I need to do some complicated indexing or there is a simpler way to do it?
Thanks for your answers
If I am not too late, this page gives awesome explanation with examples.
An array of int can be used to deal with an array of bits. Assuming the size of an int to be 4 bytes, when we talk about an int we are dealing with 32 bits. Say we have int A[10]; that means we are working with 10*4*8 = 320 bits: each element of the array holds 4 bytes, and each byte holds 8 bits.
So, to set the kth bit in array A:
// NOTE: if using "uint8_t A[]" instead of "int A[]" then divide by 8, not 32
void SetBit( int A[], int k )
{
int i = k/32; //gives the corresponding index in the array A
int pos = k%32; //gives the corresponding bit position in A[i]
unsigned int flag = 1; // flag = 0000.....00001
flag = flag << pos; // flag = 0000...010...000 (shifted k positions)
A[i] = A[i] | flag; // Set the bit at the k-th position in A[i]
}
or in the shortened version
void SetBit( int A[], int k )
{
A[k/32] |= 1 << (k%32); // Set the bit at the k-th position in A[i]
}
similarly to clear kth bit:
void ClearBit( int A[], int k )
{
A[k/32] &= ~(1 << (k%32));
}
and to test if the kth bit:
int TestBit( int A[], int k )
{
return ( (A[k/32] & (1 << (k%32) )) != 0 ) ;
}
As said above, these manipulations can be written as macros too:
// Due to operator precedence, wrap 'k' in parentheses in case it
// is passed as an expression, e.g. i + 1, otherwise the first
// part evaluates to "A[i + (1/32)]" not "A[(i + 1)/32]"
#define SetBit(A,k) ( A[(k)/32] |= (1 << ((k)%32)) )
#define ClearBit(A,k) ( A[(k)/32] &= ~(1 << ((k)%32)) )
#define TestBit(A,k) ( A[(k)/32] & (1 << ((k)%32)) )
typedef unsigned long bfield_t[ size_needed/(8*sizeof(long)) ];
// long because that's probably what your cpu is best at
// The size_needed (in bits) should be evenly divisible by 8*sizeof(long), or
// you could use (8*sizeof(long)-1+size_needed)/(8*sizeof(long)) to force it to round up
Now, each long in a bfield_t can hold sizeof(long)*8 bits.
You can calculate the index of the long holding the bit you need by:
bindex = index / (8 * sizeof(long) );
and your bit number by
b = index % (8 * sizeof(long) );
You can then look up the long you need and then mask out the bit you need from it.
result = my_field[bindex] & (1<<b);
or
result = 1 & (my_field[bindex]>>b); // if you prefer them to be in bit0
The first one may be faster on some CPUs, or may save you shifting back up if you need
to perform operations between the same bit in multiple bit arrays. It also mirrors
the setting and clearing of a bit in the field more closely than the second implementation.
set:
my_field[bindex] |= 1<<b;
clear:
my_field[bindex] &= ~(1<<b);
You should remember that you can use bitwise operations on the longs that hold the fields
and that's the same as the operations on the individual bits.
You'll probably also want to look into the ffs, fls, ffc, and flc functions if available. ffs should always be available in strings.h. It's there just for this purpose - a string of bits.
Anyway, it is find first set and essentially:
int ffs(int x) {
    int c = 0;
    while (!(x & 1)) {
        c++;
        x >>= 1;
    }
    return c; // note: the real ffs() returns 1-based positions and 0 for x == 0
}
This is a common operation for processors to have an instruction for and your compiler will probably generate that instruction rather than calling a function like the one I wrote. x86 has an instruction for this, by the way. Oh, and ffsl and ffsll are the same function except take long and long long, respectively.
You can use | (bitwise or) and << (left shift).
For example, (1 << 3) results in "00001000" in binary. So your code could look like:
char eightBits = 0;
//Set the 5th and 6th bits from the right to 1
eightBits |= (1 << 4);
eightBits |= (1 << 5);
//eightBits now looks like "00110000".
Then just scale it up with an array of chars and figure out the appropriate byte to modify first.
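A sketch of that scaling-up (byte index k/8, bit index k%8; names are illustrative):
void set_bit(unsigned char *bits, size_t k)
{
    bits[k / 8] |= (unsigned char)(1u << (k % 8)); /* pick the byte, then the bit */
}

int test_bit(const unsigned char *bits, size_t k)
{
    return (bits[k / 8] >> (k % 8)) & 1;
}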
For more efficiency, you could define a list of bitfields in advance and put them in an array:
#define BIT8 0x01
#define BIT7 0x02
#define BIT6 0x04
#define BIT5 0x08
#define BIT4 0x10
#define BIT3 0x20
#define BIT2 0x40
#define BIT1 0x80
char bits[8] = {BIT1, BIT2, BIT3, BIT4, BIT5, BIT6, BIT7, BIT8};
Then you avoid the overhead of the bit shifting and you can index your bits, turning the previous code into:
eightBits |= (bits[2] | bits[3]);
Alternatively, if you can use C++, you could just use an std::vector<bool> which is internally defined as a vector of bits, complete with direct indexing.
bitarray.h:
#include <inttypes.h> // defines uint32_t
//typedef unsigned int bitarray_t; // if you know that int is 32 bits
typedef uint32_t bitarray_t;
#define RESERVE_BITS(n) (((n)+0x1f)>>5)
#define DW_INDEX(x) ((x)>>5)
#define BIT_INDEX(x) ((x)&0x1f)
#define getbit(array,index) (((array)[DW_INDEX(index)]>>BIT_INDEX(index))&1)
#define putbit(array, index, bit) \
((bit)&1 ? ((array)[DW_INDEX(index)] |= 1<<BIT_INDEX(index)) \
: ((array)[DW_INDEX(index)] &= ~(1<<BIT_INDEX(index))) \
, 0 \
)
Use:
bitarray_t arr[RESERVE_BITS(130)] = {0, 0x12345678,0xabcdef0,0xffff0000,0};
int i = getbit(arr,5);
putbit(arr,6,1);
int x=2; // the least significant bit is 0
putbit(arr,6,x); // sets bit 6 to 0 because 2&1 is 0
putbit(arr,6,!!x); // sets bit 6 to 1 because !!2 is 1
EDIT the docs:
"dword" = "double word" = 32-bit value (unsigned, but that's not really important)
RESERVE_BITS: number_of_bits --> number_of_dwords
RESERVE_BITS(n) is the number of 32-bit integers enough to store n bits
DW_INDEX: bit_index_in_array --> dword_index_in_array
DW_INDEX(i) is the index of dword where the i-th bit is stored.
Both bit and dword indexes start from 0.
BIT_INDEX: bit_index_in_array --> bit_index_in_dword
If i is the number of some bit in the array, BIT_INDEX(i) is the number
of that bit in the dword where the bit is stored.
And the dword is known via DW_INDEX().
getbit: bit_array, bit_index_in_array --> bit_value
putbit: bit_array, bit_index_in_array, bit_value --> 0
getbit(array,i) fetches the dword containing the bit i and shifts the dword right, so that the bit i becomes the least significant bit. Then, a bitwise and with 1 clears all other bits.
putbit(array, i, v) first of all checks the least significant bit of v; if it is 0, we have to clear the bit, and if it is 1, we have to set it.
To set the bit, we do a bitwise or of the dword that contains the bit and the value of 1 shifted left by bit_index_in_dword: that bit is set, and other bits do not change.
To clear the bit, we do a bitwise and of the dword that contains the bit and the bitwise complement of 1 shifted left by bit_index_in_dword: that value has all bits set to one except the only zero bit in the position that we want to clear.
The macro ends with , 0 because otherwise it would return the value of dword where the bit i is stored, and that value is not meaningful. One could also use ((void)0).
It's a trade-off:
(1) use 1 byte for each 2 bit value - simple, fast, but uses 4x memory
(2) pack bits into bytes - more complex, some performance overhead, uses minimum memory
If you have enough memory available then go for (1), otherwise consider (2).
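If you go with (2), a sketch of what packed accessors could look like (four 2-bit values per byte; names are illustrative):
unsigned char get2(const unsigned char *a, size_t i)
{
    return (a[i / 4] >> ((i % 4) * 2)) & 0x3; /* extract the i-th 2-bit value */
}

void set2(unsigned char *a, size_t i, unsigned char v)
{
    size_t byte = i / 4;
    unsigned shift = (unsigned)((i % 4) * 2);
    a[byte] = (unsigned char)((a[byte] & ~(0x3u << shift)) | ((v & 0x3u) << shift));
}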

How to shift an array of bytes by 12-bits

I want to shift the contents of an array of bytes by 12-bit to the left.
For example, starting with this array of type uint8_t shift[10]:
{0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0A, 0xBC}
I'd like to shift it to the left by 12-bits resulting in:
{0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xAB, 0xC0, 0x00}
Hurray for pointers!
This code works by looking ahead 12 bits for each byte and copying the proper bits forward. 12 bits is the bottom half (nybble) of the next byte and the top half of 2 bytes away.
unsigned char length = 10;
unsigned char data[10] = {0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0A,0xBC};
unsigned char *shift = data;

while (shift < data+(length-2)) {
    *shift = (*(shift+1)&0x0F)<<4 | (*(shift+2)&0xF0)>>4;
    shift++;
}
*(data+length-2) = (*(data+length-1)&0x0F)<<4;
*(data+length-1) = 0x00;
Justin wrote:
@Mike, your solution works, but does not carry.
Well, I'd say a normal shift operation does just that (called overflow), and just lets the extra bits fall off the right or left. It's simple enough to carry if you wanted to - just save the 12 bits before you start to shift. Maybe you want a circular shift, to put the overflowed bits back at the bottom? Maybe you want to realloc the array and make it larger? Return the overflow to the caller? Return a boolean if non-zero data was overflowed? You'd have to define what carry means to you.
unsigned char overflow[2];
*overflow = (*data&0xF0)>>4;
*(overflow+1) = (*data&0x0F)<<4 | (*(data+1)&0xF0)>>4;

while (shift < data+(length-2)) {
    /* normal shifting */
}

/* now would be the time to copy it back if you want to carry it somewhere */
*(data+length-2) = (*(data+length-1)&0x0F)<<4 | (*(overflow)&0x0F);
*(data+length-1) = *(overflow+1);

/* You could return a 16-bit carry int,
 * but endian-ness makes that look weird
 * if you care about the physical layout */
unsigned short carry = *(overflow+1)<<8 | *overflow;
Here's my solution, but even more importantly my approach to solving the problem.
I approached the problem by
drawing the memory cells and drawing arrows from the destination to the source.
making a table showing the above drawing.
labeling each row in the table with the relative byte address.
This showed me the pattern:
let iL be the low nybble (half byte) of a[i]
let iH be the high nybble of a[i]
iH = (i+1)L
iL = (i+2)H
This pattern holds for all bytes.
Translating into C, this means:
a[i] = (iH << 4) OR iL
a[i] = ((a[i+1] & 0x0f) << 4) | ((a[i+2] & 0xf0) >> 4)
We now make three more observations:
since we carry out the assignments left to right, we don't need to store any values in temporary variables.
we will have a special case for the tail: all 12 bits at the end will be zero.
we must avoid reading undefined memory past the array. since we never read more than a[i+2], this only affects the last two bytes
So, we
handle the general case by looping for N-2 bytes and performing the general calculation above
handle the next-to-last byte by setting iH = (i+1)L
handle the last byte by setting it to 0
given a with length N, we get:
for (i = 0; i < N - 2; ++i) {
    a[i] = ((a[i+1] & 0x0f) << 4) | ((a[i+2] & 0xf0) >> 4);
}
a[N-2] = (a[N-1] & 0x0f) << 4;
a[N-1] = 0;
And there you have it... the array is shifted left by 12 bits. It could easily be generalized to shifting N bits, noting that there will be M assignment statements where M = number of bits modulo 8, I believe.
The loop could be made more efficient on some machines by translating to pointers
for (p = a, p2 = a+N-2; p != p2; ++p) {
    *p = ((*(p+1) & 0x0f) << 4) | ((*(p+2) & 0xf0) >> 4);
}
and by using the largest integer data type supported by the CPU.
(I've just typed this in, so now would be a good time for somebody to review the code, especially since bit twiddling is notoriously easy to get wrong.)
Let's work out the best way to shift N bits in an array of 8-bit integers.
N - Total number of bits to shift
F = (N / 8) - Full 8 bit integers shifted
R = (N % 8) - Remaining bits that need to be shifted
I guess from here you would have to find the most optimal way to use this data to move the bytes around in the array. A generic algorithm would be to apply the full-byte moves first: move each byte F positions (iterating in the direction that avoids overwriting bytes you still need) and zero-fill the newly emptied positions. Then finally perform an R-bit shift across all of the bytes, carrying bits between adjacent bytes.
In the case of shifting a byte such as 0xAB left by R = 4 bits, you can calculate the overflow with a bitwise AND, and do the shift using the bit-shift operator:
// 0xAB shifted left 4 bits:
(0xAB & 0xF0) >> 4   // is the overflow (0x0A)
(0xAB << 4) & 0xFF   // is the shifted value (0xB0)
Keep in mind that the 4-bit overflow mask is simply 0xF0 (0b11110000). This is easy to calculate, dynamically build, or you can even use a simple static lookup table.
I hope that is generic enough. I'm not good with C/C++ at all so maybe someone can clean up my syntax or be more specific.
Bonus: If you're crafty with your C you might be able to fudge multiple array indexes into a single 16, 32, or even 64 bit integer and perform the shifts. But that is probably not very portable and I would recommend against this. Just a possible optimization.
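A hedged C sketch of that two-phase approach (whole-byte moves first, then the remaining sub-byte shift; left shift toward index 0 with zero fill, assuming 8-bit bytes):
void shift_left_bits(unsigned char *a, size_t len, size_t nbits)
{
    size_t F = nbits / 8;               /* whole bytes to move */
    unsigned R = (unsigned)(nbits % 8); /* remaining bit shift */
    size_t i;
    if (len == 0)
        return;
    for (i = 0; i < len; i++)           /* phase 1: whole-byte moves, zero fill */
        a[i] = (i + F < len) ? a[i + F] : 0;
    if (R) {
        for (i = 0; i + 1 < len; i++)   /* phase 2: sub-byte shift with carry */
            a[i] = (unsigned char)((a[i] << R) | (a[i + 1] >> (8 - R)));
        a[len - 1] = (unsigned char)(a[len - 1] << R);
    }
}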
Here is a working solution, using temporary variables:
void shift_4bits_left(uint8_t* array, uint16_t size)
{
    int i;
    uint8_t shifted = 0x00;
    uint8_t overflow = 0x00; /* start at 0 so zeros are shifted in at the low end
                                (otherwise the top nibble of array[0] wraps around) */
    for (i = (size - 1); i >= 0; i--)
    {
        shifted = (array[i] << 4) | overflow;
        overflow = (0xF0 & array[i]) >> 4;
        array[i] = shifted;
    }
}
Call this function 3 times for a 12-bit shift.
Mike's solution may be faster, due to the use of temporary variables.
The 32 bit version... :-) Handles 1 <= count <= num_words
#include <stdio.h>

unsigned int array[] = {0x12345678,0x9abcdef0,0x12345678,0x9abcdef0,0x66666666};

int main(void) {
    int count;
    unsigned int *from, *to;
    from = &array[0];
    to = &array[0];
    count = 5;
    while (count-- > 1) {
        /* take the top 12 bits of the next word; written without the original's
           unsequenced "*to++ = (*from<<12) | ((*++from>>20)&0xfff)", which is undefined behaviour */
        *to = (*from << 12) | ((*(from + 1) >> 20) & 0xfff);
        to++;
        from++;
    }
    *to = (*from << 12);
    printf("%x\n", array[0]);
    printf("%x\n", array[1]);
    printf("%x\n", array[2]);
    printf("%x\n", array[3]);
    printf("%x\n", array[4]);
    return 0;
}
@Joseph, notice that the variables are 8 bits wide, while the shift is 12 bits wide. Your solution works only for N <= variable size.
If you can assume your array length is a multiple of 8 bytes you can cast the array to an array of uint64_t and then work on that. If it isn't, you can work in 64-bit chunks on as much as you can and handle the remainder byte by byte.
This may be a bit more coding, but I think it's more elegant in the end.
There are a couple of edge-cases which make this a neat problem:
the input array might be empty
the last and next-to-last bits need to be treated specially, because they have zero bits shifted into them
Here's a simple solution which loops over the array copying the low-order nibble of the next byte into its high-order nibble, and the high-order nibble of the next-next (+2) byte into its low-order nibble. To save dereferencing the look-ahead pointer twice, it maintains a two-element buffer with the "last" and "next" bytes:
void shl12(uint8_t *v, size_t length) {
    if (length == 0) {
        return; // nothing to do
    }

    if (length > 1) {
        uint8_t last_byte, next_byte;
        next_byte = *(v + 1);

        for (size_t i = 0; i + 2 < length; i++, v++) {
            last_byte = next_byte;
            next_byte = *(v + 2);
            *v = ((last_byte & 0x0f) << 4) | (((next_byte) & 0xf0) >> 4);
        }

        // the next-to-last byte is half-empty
        *(v++) = (next_byte & 0x0f) << 4;
    }

    // the last byte is always empty
    *v = 0;
}
Consider the boundary cases, which activate successively more parts of the function:
When length is zero, we bail out without touching memory.
When length is one, we set the one and only element to zero.
When length is two, we set the high-order nibble of the first byte to the low-order nibble of the second byte (that is, bits 12-16), and the second byte to zero. We don't activate the loop.
When length is greater than two we hit the loop, shuffling the bytes across the two-element buffer.
If efficiency is your goal, the answer probably depends largely on your machine's architecture. Typically you should maintain the two-element buffer, but handle a machine word (32/64 bit unsigned integer) at a time. If you're shifting a lot of data it will be worthwhile treating the first few bytes as a special case so that you can get your machine word pointers word-aligned. Most CPUs access memory more efficiently if the accesses fall on machine word boundaries. Of course, the trailing bytes have to be handled specially too so you don't touch memory past the end of the array.
