How to access multiple elements of array in one go? - c

I have an array uint8_t data[256]. But each element is single byte.
My data bus is 32 bit long. So, If I want to access 32 bits, I do:
DATA = data[i] + (data[i + 1] << 8) + (data[i + 2] << 16) + (data[i + 3] << 24);
But this translates into 4 separate read requests in the memory of one byte each.
How can I access all the 4 bytes in the form of single transaction?

If you know the endianness of your data (or if you don't care), and your data is aligned (or you have a byte-addressing processor and don't care about efficiency), you can cast data to a uint32_t * and access it in 4-byte chunks, like so:
DATA = ((uint32_t *)data)[i/4];
This of course assumes i is a multiple of 4.

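Just cast data to uint32_t *.
#include <stdio.h>
#include <stdint.h>
uint8_t data[256] = {1,2,3,4,5,6,7,8} ;
int main(int argc, char **argv)
{
int index = 1 ;
uint32_t d = *(uint32_t *)(data + index) ; // unaligned read when index is not a multiple of 4
printf ("%08x\n", d) ;
return 0 ;
}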
Output on a little endian architecture will be
05040302
Output on a big endian architecture will be
02030405
However, depending on the architecture of the processor your program is running on, you might run into memory alignment problems (a performance hit if you address unaligned memory, or even a crash if your processor doesn't support unaligned memory access).
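One way to sidestep the alignment concern is to copy the four bytes into a uint32_t with memcpy instead of dereferencing a cast pointer. The sketch below is only an illustration: the helper names load_u32_native and load_u32_le are made up for this example, and whether the compiler turns either one into a single 32-bit bus transaction depends on the target and optimization level.

```c
#include <stdint.h>
#include <string.h>

/* Load 4 bytes in the machine's native byte order.
   memcpy has no alignment requirement, so p may point anywhere in the array. */
uint32_t load_u32_native(const uint8_t *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Load 4 bytes as a little-endian value, independent of the host byte order. */
uint32_t load_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}
```

With data = {1,2,3,4,5,6,7,8}, load_u32_le(data + 1) yields 0x05040302 on any host, while load_u32_native(data + 1) yields 0x05040302 or 0x02030405 depending on endianness.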

Maybe you should store data as an array of 32-bit:
uint32_t data[64];
DATA = data[i];
DATA = data[i+1];
...

As @dwayne-towell mentioned, you need to take care of the endianness of your data. In a single transaction it could be done as in the example below:
#include <stdio.h>
#include <stdint.h>
int
main()
{
uint8_t data[256];
uint32_t i, *p;
// Add some 32bit numbers
p = (uint32_t *)data;
for (i = 0; i < sizeof(data)/sizeof(uint32_t); ++i) {
*(p++) = i;
}
// Print some 32bit numbers
p = (uint32_t *)data;
for (i = 0; i < sizeof(data)/sizeof(uint32_t); ++i) {
printf("value=%u\n", *(p++));
}
return (0);
}

Related

How to get the leftmost bits from Core Foundation CFDataRef?

I'm trying to get the N leftmost bits from a Core Foundation CFDataRef. New to CF and C, so here's what I've got so far. It's printing out 0, so something's off.
int get_N_leftmost_bits_from_data(int N)
{
// Create a CFDataRef from a 12-byte buffer
const unsigned char * _Nullable buffer = malloc(12);
const CFDataRef data = CFDataCreate(buffer, 12);
const unsigned char *_Nullable bytesPtr = CFDataGetBytePtr(data);
// Create a buffer for the underlying bytes
void *underlyingBytes = malloc(CFDataGetLength(data));
// Copy the first 4 bytes to underlyingBytes
memcpy(underlyingBytes, bytesPtr, 4);
// Convert underlying bytes to an int
int num = atoi(underlyingBytes);
// Shift all the needed bits right
const int numBitsInByte = 8;
const int shiftedNum = num >> ((4 * numBitsInByte) - N);
return shiftedNum;
}
Thank you for the help!
Since you're only concerned about the bits in the first four bytes, you could just copy those bytes into an integer and then perform a bit-shift on that integer.
#include <string.h>
#include <stdint.h>
int main(){
uint8_t bytes[12] = {0};
for(int n = 0; n < sizeof(bytes) ; ++n){
bytes[n] = n+0xF;
//printf("%02x\n",bytes[n]);
}
uint32_t firstfour = 0;
//copy the first four bytes
memcpy(&firstfour,bytes,sizeof(uint32_t));
//get the left most bit
uint32_t bit = firstfour>>31&1;
return 0;
}
You can still perform memcpy in CF
In general, to get n leftmost bits from x, you have to do something like
x >> ((sizeof(x) * 8) - n) (I assume that byte consists of 8 bits)
I don't have access to an Apple device and don't know its API, but a few remarks:
You don't define ret in the code you supplied (and I don't think it is defined by Apple's library).
I was not able to find CFNumberGetInt32() with my search engine and have no clue what it could do. Please post buildable code with enough context to understand it.
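A note on why the question's code prints 0: atoi() parses decimal text, so it is the wrong tool for raw bytes. The sketch below assembles the first four bytes explicitly, treating bytes[0] as the leftmost byte, so the result does not depend on host endianness. The function name leftmost_bits is hypothetical, and the precondition 1 <= n <= 32 is an assumption of this example (a shift by 32 would be undefined).

```c
#include <stdint.h>

/* Return the n most significant bits of the first four bytes,
   treating bytes[0] as the leftmost byte. Requires 1 <= n <= 32. */
uint32_t leftmost_bits(const uint8_t *bytes, unsigned n)
{
    uint32_t word = (uint32_t)bytes[0] << 24
                  | (uint32_t)bytes[1] << 16
                  | (uint32_t)bytes[2] << 8
                  | (uint32_t)bytes[3];
    return word >> (32u - n);
}
```

With a CFDataRef, the pointer returned by CFDataGetBytePtr(data) could be passed as bytes, provided the data holds at least four bytes.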

sending integer greater than 255 with uint8 array

I want to send integer greater than 255 using uint8 array from mobile to Arduino over bluetooth.
Since BLE module that I'm using does not accept Uint16Array, I'm restricted to use Uint8 array only.
My App code :
var data = new Uint8Array(3);
data[0]= 1;
data[1]= 123;
data[2]= 555;
ble.writeWithoutResponse(app.connectedPeripheral.id, SERVICE_UUID, WRITE_UUID, data.buffer, success, failure);
My Device Specific Code :
void SimbleeBLE_onReceive(char *data, int len) {
Serial.print(data[0]); // prints 1
Serial.print(data[1]); // prints 123
Serial.print(data[2]); // failed to print 555
}
Since uint8 only allows integers up to 255, how do I send values greater than that?
You have to split it. You already know (or you should) that an int16 has, well, 16 bits (so it takes two bytes to store it).
Now a very small digression about endianness. Endianness is the order in which the bytes are stored. For instance, if you have the value 0x1234, you can either store it as 0x12 0x34 (big endian) or as 0x34 0x12 (little endian).
I don't know what language you use, so... Normally in C++ you do something like this:
const int datalen = 3;
uint16_t data[datalen];
data[0]= 1;
data[1]= 123;
data[2]= 555;
uint8_t sendingData[datalen * sizeof(uint16_t)];
for (int i = 0; i < datalen; i++)
{
sendingData[i * 2] = (data[i] >> 8) & 0xFF;
sendingData[i * 2 + 1] = data[i] & 0xFF;
}
functionToSendData(sendingData, datalen * sizeof(uint16_t));
This sends in big endian format. If you prefer the little endian one, write
sendingData[i * 2] = data[i] & 0xFF;
sendingData[i * 2 + 1] = (data[i] >> 8) & 0xFF;
A simpler version can be
const int datalen = 3;
uint16_t data[datalen];
data[0]= 1;
data[1]= 123;
data[2]= 555;
functionToSendData((uint8_t*)data, datalen * sizeof(uint16_t));
In the first case you know the endianness of the transmission (it is little or big according to how you code), in the second it depends on the architecture and/or the compiler.
In JavaScript you can use this:
var sendingData = new Uint8Array(data.buffer)
and then send this new array. Credits go to this answer
When you receive it, you will have to do one of these three things to convert it
// Data is big endian
void SimbleeBLE_onReceive(char *receivedData, int len) {
uint16_t *data = new uint16_t[len/2];
for (int i = 0; i < len/2; i++)
data[i] = (((uint16_t)(uint8_t)receivedData[i * 2]) << 8) + (uint8_t)receivedData[i * 2 + 1]; // cast through uint8_t so a signed char is not sign-extended
Serial.print(data[0]);
Serial.print(data[1]);
Serial.print(data[2]);
delete[] data;
}
// Data is little endian
void SimbleeBLE_onReceive(char *receivedData, int len) {
uint16_t *data = new uint16_t[len/2];
for (int i = 0; i < len/2; i++)
data[i] = (uint8_t)receivedData[i * 2] + (((uint16_t)(uint8_t)receivedData[i * 2 + 1]) << 8);
Serial.print(data[0]);
Serial.print(data[1]);
Serial.print(data[2]);
delete[] data;
}
// Trust the compiler
void SimbleeBLE_onReceive(char *receivedData, int len) {
uint16_t *data = (uint16_t *)receivedData;
Serial.print(data[0]);
Serial.print(data[1]);
Serial.print(data[2]);
}
The last method is the most error-prone, since you have to know which endianness the compiler uses, and it has to match the sending one.
If the endianness mismatches you will receive what you think are "random" numbers. It is really easily debugged, though. For instance, you send the value 156 (hexadecimal 0x009C) and receive 39936 (hexadecimal 0x9C00). See? The bytes are swapped. Another example: sending 8942 (hex 0x22EE) and receiving 60962 (hex 0xEE22).
Just to finish, I think you are going to have problems with this, because sometimes you will not receive the bytes "in one block", but separated. For instance, when you send 1 123 555 (in hex and, for instance, big endian this will be six bytes, particularly 00 01 00 7B 02 2B) you may get a call to SimbleeBLE_onReceive with just 3 or 4 bytes, then receive the others. So you will have to define a sort of protocol to mark the start and/or end of the packet, and accumulate the bytes in a buffer until ready to process them all.
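The accumulation idea can be sketched in plain C as follows. This is a minimal illustration, assuming a fixed packet of six bytes (three big-endian 16-bit values); the names packet_acc and acc_push are hypothetical. It has no start marker, so it cannot resynchronize after a lost byte, which a real protocol would need to handle.

```c
#include <stddef.h>
#include <stdint.h>

#define PACKET_BYTES 6  /* assumed fixed size: three 16-bit values */

struct packet_acc {
    uint8_t buf[PACKET_BYTES];
    size_t  used;               /* number of bytes collected so far */
};

/* Feed one received byte. Returns 1 and fills out[] once a whole
   packet has arrived, otherwise returns 0. */
int acc_push(struct packet_acc *a, uint8_t byte, uint16_t out[PACKET_BYTES / 2])
{
    a->buf[a->used++] = byte;
    if (a->used < PACKET_BYTES)
        return 0;
    /* Packet complete: decode big-endian pairs and reset for the next one. */
    for (size_t i = 0; i < PACKET_BYTES / 2; i++)
        out[i] = (uint16_t)((a->buf[2 * i] << 8) | a->buf[2 * i + 1]);
    a->used = 0;
    return 1;
}
```

The receive callback would loop over its len bytes and call acc_push for each one, so it does not matter whether the six bytes arrive in one call or in several.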
Use some form of scaling on the data at the sending and receiving ends to keep it within the 0 to 255 range, for example dividing before sending so the value fits below 255 and multiplying at the receiving end to scale it back up.
Another way, if you know the hi and low range of the data would be to use the Arduino mapping function.
y = map(mapped value, actual lo, actual hi, mapped lo, mapped hi)
If you don't know the full range, you could use the constrain() function.
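To see what scaling costs, here is a small sketch using the same integer arithmetic that the Arduino reference documents for map(). The function name scale_range and the 0 to 1023 input range are assumptions chosen for illustration. The round trip is lossy: the value that comes back is generally only close to the original.

```c
/* Linearly rescale x from [in_lo, in_hi] to [out_lo, out_hi]
   using integer arithmetic (the fractional part is discarded). */
long scale_range(long x, long in_lo, long in_hi, long out_lo, long out_hi)
{
    return (x - in_lo) * (out_hi - out_lo) / (in_hi - in_lo) + out_lo;
}
```

For example, scaling 555 from 0..1023 down to 0..255 gives 138, and scaling 138 back up gives 553 rather than 555, so this approach only suits data that tolerates the rounding.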

Elegant way of getting a uint32_t from four elements of a uint8_t array?

I have a uint8_t array from which I need to take the first 4 elements, 32 bits, to create a size_t (uint32_t on my machine).
Example non-working code:
uint8_t array[8];
array[0] = 128;
array[1] = 128;
array[2] = 0;
array[3] = 0;
size_t size = array[0]; //results in 128
size = *array; //also results in 128
The bytes of the first four indices of that array are 80 80 00 00.
I want that size_t to result in 128 + 256*128 by reading the first 4 bytes of data from that array, little endian. Is there a way to make that size_t initialization read those 4 bytes directly as if the array were any old chunk of memory rather than having to manually add and multiply to find the value I want?
To do it in a portable way (i.e. endian independent) use the shift operators. Something like
uint32_t result = 0;
for(int i=3; i>=0; --i) {
result <<= 8;
result += array[i];
}
// result now holds the value you desire
This is the normal way:
inline uint32_t little_endian_to_uint32(const uint8_t *p)
{
return p[0] + p[1] * 0x100ul + p[2] * 0x10000ul + p[3] * 0x1000000ul;
}
Your compiler ought to generate the optimum assembly code.
Got it. Right after I asked, of course!
uint32_t size = *((uint32_t*) array);

Bitwise memmove

What is the best way to implement a bitwise memmove? The method should take an additional destination and source bit-offset and the count should be in bits too.
I saw that ARM provides a non-standard _membitmove, which does exactly what I need, but I couldn't find its source.
Bind's bitset includes isc_bitstring_copy, but it's not efficient
I'm aware that the C standard library doesn't provide such a method, but I also couldn't find any third-party code providing a similar method.
Assuming "best" means "easiest", you can copy bits one by one. Conceptually, an address of a bit is an object (struct) that has a pointer to a byte in memory and an index of a bit in the byte.
struct pointer_to_bit
{
uint8_t* p;
int b;
};
void membitmovebl(
void *dest,
const void *src,
int dest_offset,
int src_offset,
size_t nbits)
{
// Create pointers to bits
struct pointer_to_bit d = {dest, dest_offset};
struct pointer_to_bit s = {src, src_offset};
// Bring the bit offsets to range (0...7)
d.p += d.b / 8; // replace division by right-shift if bit offset can be negative
d.b %= 8; // replace "%=8" by "&=7" if bit offset can be negative
s.p += s.b / 8;
s.b %= 8;
// Determine whether it's OK to loop forward
if (d.p < s.p || d.p == s.p && d.b <= s.b)
{
// Copy bits one by one
for (size_t i = 0; i < nbits; i++)
{
// Read 1 bit
int bit = (*s.p >> s.b) & 1;
// Write 1 bit
*d.p &= ~(1 << d.b);
*d.p |= bit << d.b;
// Advance pointers
if (++s.b == 8)
{
s.b = 0;
++s.p;
}
if (++d.b == 8)
{
d.b = 0;
++d.p;
}
}
}
else
{
// Copy stuff backwards - essentially the same code but ++ replaced by --
}
}
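For completeness, here is one way the backward branch could be filled in. It is a self-contained sketch rather than the answer's own code: the names get_bit, put_bit and bit_move are hypothetical, bits are numbered LSB-first within each byte as in the code above, and offsets are taken as non-negative size_t values.

```c
#include <stddef.h>
#include <stdint.h>

/* Read bit number pos, counting from p[0] bit 0 (LSB-first). */
static int get_bit(const uint8_t *p, size_t pos)
{
    return (p[pos / 8] >> (pos % 8)) & 1;
}

/* Write bit number pos, counting from p[0] bit 0 (LSB-first). */
static void put_bit(uint8_t *p, size_t pos, int bit)
{
    uint8_t mask = (uint8_t)(1u << (pos % 8));
    if (bit)
        p[pos / 8] |= mask;
    else
        p[pos / 8] &= (uint8_t)~mask;
}

/* Copy nbits bits; the source and destination ranges may overlap. */
void bit_move(uint8_t *dst, size_t dst_off,
              const uint8_t *src, size_t src_off, size_t nbits)
{
    /* Bring both bit offsets into the range 0..7. */
    dst += dst_off / 8; dst_off %= 8;
    src += src_off / 8; src_off %= 8;

    int forward = (uintptr_t)dst < (uintptr_t)src ||
                  ((uintptr_t)dst == (uintptr_t)src && dst_off <= src_off);
    if (forward) {
        /* Destination starts at or before the source: copy low to high. */
        for (size_t i = 0; i < nbits; i++)
            put_bit(dst, dst_off + i, get_bit(src, src_off + i));
    } else {
        /* Destination starts after the source: copy high to low so that
           source bits are read before they are overwritten. */
        for (size_t i = nbits; i-- > 0; )
            put_bit(dst, dst_off + i, get_bit(src, src_off + i));
    }
}
```

Indexing from the start of the range, instead of stepping a pointer and bit counter, keeps the two directions symmetric at the cost of a division per bit, so this favors clarity over speed.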
If you want to write a version optimized for speed, you will have to do copying by bytes (or, better, words), unroll loops, and handle a number of special cases (memmove does that; you will have to do more because your function is more complicated).
P.S. Oh, seeing that you call isc_bitstring_copy inefficient, you probably want the speed optimization. You can use the following idea:
Start copying bits individually until the destination is byte-aligned (d.b == 0). Then, it is easy to copy 8 bits at once, doing some bit twiddling. Do this until there are less than 8 bits left to copy; then continue copying bits one by one.
// Copy 8 bits from s to d and advance pointers
*d.p = *s.p++ >> s.b;
*d.p++ |= *s.p << (8 - s.b);
P.P.S Oh, and seeing your comment on what you are going to use the code for, you don't really need to implement all the versions (byte/halfword/word, big/little-endian); you only want the easiest one - the one working with words (uint32_t).
Here is a partial implementation (not tested). There are obvious efficiency and usability improvements.
Copy n bytes from src to dest (not overlapping src), and shift bits at dest rightwards by bit bits, 0 <= bit <= 7. This assumes that the least significant bits are at the right of the bytes
void memcpy_with_bitshift(unsigned char *dest, unsigned char *src, size_t n, int bit)
{
size_t i;
memcpy(dest, src, n);
for (i = 0; i < n; i++) {
dest[i] >>= bit; // shift each byte right in place
}
for (i = 0; i + 1 < n; i++) {
dest[i+1] |= (unsigned char)(src[i] << (8 - bit)); // carry the shifted-out low bits into the next byte
}
}
Some improvements to be made:
Don't overwrite first bit bits at beginning of dest.
Merge loops
Have a way to copy a number of bits not divisible by 8
Fix for >8 bits in a char
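The "Merge loops" item could look like the sketch below, which carries the shifted-out bits from one byte to the next in a single pass. The name copy_shift_right is hypothetical; it assumes 8-bit bytes and 0 <= bit <= 7, still zeroes the first bit bits of dest[0], and drops the bits shifted out of the last byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes from src to dest, shifting the whole bit stream
   right by `bit` positions (0..7) in one pass. */
void copy_shift_right(uint8_t *dest, const uint8_t *src, size_t n, unsigned bit)
{
    uint8_t carry = 0;  /* low bits of the previous byte, moved to the top */
    for (size_t i = 0; i < n; i++) {
        uint8_t cur = src[i];
        dest[i] = (uint8_t)((cur >> bit) | carry);
        carry = (uint8_t)(cur << (8 - bit));  /* becomes 0 when bit == 0 */
    }
}
```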

Turn a large chunk of memory backwards, fast

I need to rewrite about 4KB of data in reverse order, at bit level (last bit of last byte becoming first bit of first byte), as fast as possible. Are there any clever snippets to do it?
Rationale: The data is display contents of LCD screen in an embedded device that is usually positioned in a way that the screen is on your shoulders level. The screen has "6 o'clock" orientation, that is to be viewed from below - like lying flat or hanging above your eyes level. This is fixable by rotating the screen 180 degrees, but then I need to reverse the screen data (generated by library), which is 1 bit = 1 pixel, starting with upper left of the screen. The CPU isn't very powerful, and the device has enough work already, plus several frames a second would be desirable so performance is an issue; RAM not so much.
edit:
Single core, ARM 9 series. 64MB, (to be scaled down to 32MB later), Linux. The data is pushed from system memory to the LCD driver over 8-bit IO port.
The CPU is 32bit and performs much better at this word size than at byte level.
There's a classic way to do this. Let's say unsigned int is your 32-bit word. I'm using C99 because the restrict keyword lets the compiler perform extra optimizations in this speed-critical code that would otherwise be unavailable. These keywords inform the compiler that "src" and "dest" do not overlap. This also assumes you are copying an integral number of words, if you're not, then this is just a start.
I also don't know which bit shifting / rotation primitives are fast on the ARM and which are slow. This is something to consider. If you need more speed, consider disassembling the output from the C compiler and going from there. If using GCC, try -O2, -O3, and -Os to see which one is fastest. You might reduce stalls in the pipeline by doing two words at the same time.
This uses 23 operations per word, not counting load and store. However, these 23 operations are all very fast and none of them access memory. I don't know if a lookup table would be faster or not.
void
copy_rev(unsigned int *restrict dest,
unsigned int const *restrict src,
unsigned int n)
{
unsigned int i, x;
for (i = 0; i < n; ++i) {
x = src[i];
x = (x >> 16) | (x << 16);
x = ((x >> 8) & 0x00ff00ffU) | ((x & 0x00ff00ffU) << 8);
x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4);
x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2);
x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1);
dest[n-1-i] = x;
}
}
This page is a great reference: http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious
Final note: Looking at the ARM assembly reference, there is a "REV" opcode which reverses the byte order in a word. This would shave 7 operations per loop off the above code.
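As a sketch of that idea in C, the two widest swap steps can be replaced by a byte-swap builtin. This assumes GCC or Clang, which provide __builtin_bswap32; on targets with a byte-reverse instruction the builtin typically compiles to it, and otherwise the compiler falls back to shifts and masks. The function name reverse_bits32 is made up for this example.

```c
#include <stdint.h>

/* Reverse all 32 bits: swap the byte order first, then reverse
   the bits inside each byte with three mask-and-shift steps. */
uint32_t reverse_bits32(uint32_t x)
{
    x = __builtin_bswap32(x);                                /* GCC/Clang builtin */
    x = ((x >> 4) & 0x0f0f0f0fU) | ((x & 0x0f0f0f0fU) << 4); /* swap nibbles */
    x = ((x >> 2) & 0x33333333U) | ((x & 0x33333333U) << 2); /* swap bit pairs */
    x = ((x >> 1) & 0x55555555U) | ((x & 0x55555555U) << 1); /* swap adjacent bits */
    return x;
}
```

Inside copy_rev, the loop body would then reduce to dest[n-1-i] = reverse_bits32(src[i]).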
Fastest way would probably to store the reverse of all possible byte values in a look-up table. The table would take only 256 bytes.
Build a 256 element lookup table of byte values that are bit-reversed from their index.
{0x00, 0x80, 0x40, 0xc0, etc}
Then iterate through your array copying using each byte as an index into your lookup table.
If you are writing assembly language, the x86 instruction set has an XLAT instruction that does just this sort of lookup. Although it may not actually be faster than C code on modern processors.
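A minimal sketch of the table approach, building the 256 entries at startup instead of typing them out. The names init_rev_table and copy_reversed are hypothetical, and this version copies into a separate destination buffer rather than working in place.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill table[v] with the bit-reversed value of v for every byte value. */
void init_rev_table(uint8_t table[256])
{
    for (unsigned v = 0; v < 256; v++) {
        uint8_t r = 0;
        for (unsigned bit = 0; bit < 8; bit++) {
            if (v & (1u << bit))
                r |= (uint8_t)(0x80u >> bit);  /* bit k maps to bit 7-k */
        }
        table[v] = r;
    }
}

/* Write src to dst with both the byte order and the bits in each byte reversed. */
void copy_reversed(uint8_t *dst, const uint8_t *src, size_t n,
                   const uint8_t table[256])
{
    for (size_t i = 0; i < n; i++)
        dst[n - 1 - i] = table[src[i]];
}
```

The first entries come out as 0x00, 0x80, 0x40, 0xc0, matching the sequence listed above.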
You can do this in place if you iterate from both ends towards the middle. Because of cache effects, you may find it's faster to swap in 16 byte chunks (assuming a 16 byte cache line).
Here's the basic code (not including the cache line optimization)
// bit reversing lookup table
typedef unsigned char BYTE;
extern const BYTE g_RevBits[256];
void ReverseBitsInPlace(BYTE * pb, int cb)
{
int iter = cb/2;
for (int ii = 0, jj = cb-1; ii < iter; ++ii, --jj)
{
BYTE b1 = g_RevBits[pb[ii]];
pb[ii] = g_RevBits[pb[jj]];
pb[jj] = b1;
}
if (cb & 1) // if the number of bytes was odd, swap the middle one in place
{
pb[cb/2] = g_RevBits[pb[cb/2]];
}
}
// initialize the bit reversing lookup table using macros to make it less typing.
#define BITLINE(n) \
0x0##n, 0x8##n, 0x4##n, 0xC##n, 0x2##n, 0xA##n, 0x6##n, 0xE##n,\
0x1##n, 0x9##n, 0x5##n, 0xD##n, 0x3##n, 0xB##n, 0x7##n, 0xF##n,
const BYTE g_RevBits[256] = {
BITLINE(0), BITLINE(8), BITLINE(4), BITLINE(C),
BITLINE(2), BITLINE(A), BITLINE(6), BITLINE(E),
BITLINE(1), BITLINE(9), BITLINE(5), BITLINE(D),
BITLINE(3), BITLINE(B), BITLINE(7), BITLINE(F),
};
The Bit Twiddling Hacks site is always a good starting point for these kinds of problems. Take a look here for fast bit reversal. Then it's up to you to apply it to each byte/word of your memory block.
EDIT:
Inspired by Dietrich Epp's answer and looking at the ARM instruction set, there is an RBIT opcode that reverses the bits contained in a register. So if performance is critical, you might consider using some assembly code.
Loop through the half of the array, convert and exchange bytes.
for( int i = 0; i < arraySize / 2; i++ ) {
char inverted1 = invert( array[i] );
char inverted2 = invert( array[arraySize - i - 1] );
array[i] = inverted2;
array[arraySize - i - 1] = inverted1;
}
For conversion use a precomputed table - an array of 2^CHAR_BIT (CHAR_BIT will most likely be 8) elements where position "I" stores the bit-inverted value of the byte "I". This will be very fast - one pass - and consume only 2^CHAR_BIT bytes for the table.
It looks like this code takes about 50 clocks per bit swap on my i7 XPS 8500 machine. 7.6 seconds for a million array flips. Single threaded. It prints some ASCII art based on patterns of 1s and 0s. I rotated the pic left 180 degrees after reversing the bit array, using a graphic editor, and they look identical to me. A double-reversed image comes out the same as the original.
As for pluses, it's a complete solution. It swaps bits from the back of a bit array to the front, vs operating on ints/bytes and then needing to swap ints/bytes in an array.
Also, this is a general purpose bit library, so you might find it handy in the future for solving other, more mundane problems.
Is it as fast as the accepted answer? I think it's close, but without working code to benchmark it's impossible to say. Feel free to cut and paste this working program.
// Reverse BitsInBuff.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include "time.h"
#include "memory.h"
//
// Manifest constants
#define uchar unsigned char
#define BUFF_BYTES 510 //400 supports a display of 80x40 bits
#define DW 80 // Display Width
// ----------------------------------------------------------------------------
uchar mask_set[] = { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
uchar mask_clr[] = { 0xfe, 0xfd, 0xfb, 0xf7, 0xef, 0xdf, 0xbf, 0x7f };
//
// Function Prototypes
static void PrintIntBits(long x, int bits);
void BitSet(uchar * BitArray, unsigned long BitNumber);
void BitClr(uchar * BitArray, unsigned long BitNumber);
void BitTog(uchar * BitArray, unsigned long BitNumber);
uchar BitGet(uchar * BitArray, unsigned long BitNumber);
void BitPut(uchar * BitArray, unsigned long BitNumber, uchar value);
//
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt);
static void PrintIntBits(long x, int bits);
// -----------------------------------------------------------------------------
// Reverse the bit ordering in an array
uchar *ReverseBitsInArray(uchar *Buff, int BitKnt) {
unsigned long front=0, back = BitKnt-1;
uchar temp;
while( front<back ) {
temp = BitGet(Buff, front); // copy front bit to temp before overwriting
BitPut(Buff, front, BitGet(Buff, back)); // copy back bit to front bit
BitPut(Buff, back, temp); // copy saved value of front in temp to back of bit array
front++;
back--;
}
return Buff;
}
// ---------------------------------------------------------------------------
// ---------------------------------------------------------------------------
int _tmain(int argc, _TCHAR* argv[]) {
int i, j, k, LoopKnt = 1000001;
time_t start;
uchar Buff[BUFF_BYTES];
memset(Buff, 0, sizeof(Buff));
// make an ASCII art picture
for(i=0, k=0; i<(sizeof(Buff)*8)/DW; i++) {
for(j=0; j<DW/2; j++) {
BitSet(Buff, (i*DW)+j+k);
}
k++;
}
// print ASCII art picture
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
i=LoopKnt;
start = clock();
while( i-- ) {
ReverseBitsInArray((uchar *)Buff, BUFF_BYTES * 8);
}
// print ASCII art pic flipped upside-down and rotated left
printf("\nMilliseconds elapsed = %d", clock() - start);
for(i=0; i<sizeof(Buff); i++) {
if(!(i % 10)) printf("\n"); // print bits in blocks of 80
PrintIntBits(Buff[i], 8);
}
printf("\n\nBenchmark time for %d loops\n", LoopKnt);
getchar();
return 0;
}
// -----------------------------------------------------------------------------
// Scaffolding...
static void PrintIntBits(long x, int bits) {
unsigned long long z=1;
int i=0;
z = z << (bits-1);
for (; z > 0; z >>= 1) {
printf("%s", ((x & z) == z) ? "#" : ".");
}
}
// These routines do bit manipulations on a bit array of unsigned chars
// ---------------------------------------------------------------------------
void BitSet(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] |= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitClr(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] &= mask_clr[BitNumber & 7];
}
// ----------------------------------------------------------------------------
void BitTog(uchar *buff, unsigned long BitNumber) {
buff[BitNumber >> 3] ^= mask_set[BitNumber & 7];
}
// ----------------------------------------------------------------------------
uchar BitGet(uchar *buff, unsigned long BitNumber) {
return (uchar) ((buff[BitNumber >> 3] >> (BitNumber & 7)) & 1);
}
// ----------------------------------------------------------------------------
void BitPut(uchar *buff, unsigned long BitNumber, uchar value) {
if(value) { // if the bit at buff[BitNumber] is true.
BitSet(buff, BitNumber);
} else {
BitClr(buff, BitNumber);
}
}
Below is the code listing for an optimization using a new buffer, instead of swapping bytes in place. Given that only 2030:4080 BitSet()s are needed because of the if() test, and about half the GetBit()s and PutBits() are eliminated by eliminating TEMP, I suspect memory access time is a large, fixed cost to these kinds of operations, providing a hard limit to optimization.
Using a look-up approach, and CONDITIONALLY swapping bytes, rather than bits, reduces by a factor of 8 the number of memory accesses, and testing for a 0 byte gets amortized across 8 bits, rather than 1.
Using these two approaches together, testing to see if the entire 8-bit char is 0 before doing ANYTHING, including the table lookup, and the write, is likely going to be the fastest possible approach, but would require an extra 512 bytes for the new, destination bit array, and 256 bytes for the lookup table. The performance payoff might be quite dramatic though.
// -----------------------------------------------------------------------------
// Reverse the bit ordering in new array
uchar *ReverseBitsInNewArray(uchar *Dst, const uchar *Src, const int BitKnt) {
int front=0, back = BitKnt-1;
memset(Dst, 0, BitKnt/8); // 8 bits per byte; assumes BitKnt is a multiple of 8
while( back >= 0 ) { // visit every source bit, since nothing is swapped in place here
if(BitGet((uchar *)Src, back--)) { // memset() has already set all bits in Dst to 0,
BitSet(Dst, front); // so only set the bit if the Src bit is 1
}
front++;
}
return Dst;
}
To reverse a single byte x you can handle the bits one at a time:
unsigned char a = 0;
for (int i = 0; i < 8; ++i) {
a += (unsigned char)(((x >> i) & 1) << (7 - i));
}
You can create a cache of these results in an array so that you can quickly reverse a byte just by making a single lookup instead of looping.
Then you just have to reverse the byte array, and when you write the data apply the above mapping. Reversing a byte array is a well documented problem, e.g. here.
Single Core?
How much memory?
Is the display buffered in memory and pushed to the device, or is the only copy of the pixels in the screens memory?
The data is pushed from system memory to the LCD driver over 8-bit IO port.
Since you'll be writing to the LCD one byte at a time, I think the best idea is to perform the bit reversal right when sending the data to the LCD driver rather than as a separate pre-pass. Something along those lines should be faster than any of the other answers:
void send_to_LCD(uint8_t* data, int len, bool rotate) {
if (rotate)
for (int i=len-1; i>=0; i--)
write(reverse(data[i]));
else
for (int i=0; i<len; i++)
write(data[i]);
}
Where write() is the function that sends a byte to the LCD driver and reverse() one of the single-byte bit reversal methods described in the other answers.
This approach avoids the need to store two copies of the video data in ram and also avoids the read-invert-write roundtrip. Also note that this is the simplest implementation: it could be trivially adapted to load, say, 4 bytes at a time from memory if this were to yield better performance. A smart vectorizing compiler may be even able to do it for you.

Resources