I have an array defined as
int data[k];
where k is the size of the array. Each element of the array is either 0 or 1. I want to save the binary data in another array defined as
uint8_t new_data[k/8];
(k is usually a multiple of 8).
How can I do this in C?
Thanks in advance
Assuming k is a multiple of 8, assuming that by "each element is binary" you mean "each int is either 0 or 1", also assuming the bits in data are packed from most significant to least significant and the bytes of new_data are packed as big-endian (all reasonable assumptions), then this is how you do it:
for (int i = 0; i < k/8; ++i)
{
new_data[i] = (data[8*i ] << 7) | (data[8*i+1] << 6)
| (data[8*i+2] << 5) | (data[8*i+3] << 4)
| (data[8*i+4] << 3) | (data[8*i+5] << 2)
| (data[8*i+6] << 1) | data[8*i+7];
}
Assuming new_data starts initialized at 0, data[i] contains only zeroes and ones and that you want to fill lowest bits first:
for(unsigned i = 0; i < k; ++i) {
new_data[i/8] |= data[i]<<(i%8);
}
A possibly faster implementation may be:
for(int i = 0; i < k/8; ++i) {
uint8_t o = 0;
for(int j = 0; j < 8; ++j) {
o |= data[i*8+j]<<j;
}
new_data[i] = o;
}
(notice that this essentially assumes that k is a multiple of 8)
It's generally easier to optimize, as the inner loop has small, known boundaries and it writes on a variable with just that small scope; this is easier for optimizers to handle, and you can see for example that with gcc the inner loop gets completely unrolled.
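When k is not a multiple of 8, the k/8 loops above skip the trailing bits. A minimal sketch that also handles the tail, assuming the MSB-first packing of the first answer; the function name pack_bits_msb_first and the choice to zero-pad the last byte are assumptions made for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Packs k ints (each 0 or 1) into bytes, most significant bit first.
 * out must have room for (k + 7) / 8 bytes; the unused low bits of the
 * last byte are left as 0. Returns the number of bytes written. */
size_t pack_bits_msb_first(const int *data, size_t k, uint8_t *out)
{
    size_t nbytes = (k + 7) / 8;
    for (size_t i = 0; i < nbytes; ++i) {
        uint8_t byte = 0;
        for (size_t j = 0; j < 8; ++j) {
            size_t idx = 8 * i + j;
            /* Past the end of data, pad with 0 bits. */
            unsigned bit = (idx < k) ? ((unsigned)data[idx] & 1u) : 0u;
            byte = (uint8_t)((byte << 1) | bit);
        }
        out[i] = byte;
    }
    return nbytes;
}
```

The destination then needs (k + 7) / 8 elements rather than k / 8.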
This is my first question, so I hope to do this right.
I have a problem where I have to map a key which can be in the range (0, 1, 2) to select a value from the same range (0, 1, 2). I have to repeat this millions of times and I was trying to implement this by using bitwise operations in C, without success.
So let's say I have 16 keys in the range (0, 1, 2) which I want to map to 16 values in the same range by using the following rules:
0 -> 2
1 -> 1
2 -> 1
I can represent the array of 16 keys as 16 2-bit pairs in a 32bit unsigned int. For instance:
0, 1, 2, 1, 2, 0, ... //Original array of keys
00 01 10 01 10 00 ... //2-bit pairs representation of keys in a 32bit int
and I am interested in transforming the unsigned int, following the rules above (i.e. the 2-bit pairs have to be transformed following the rules: 00->10, 01->01, and 10->01), so that I end up with a 32bit unsigned int like:
10 01 01 01 01 10 ... //2-bit pairs transformed using the given rule.
Is there a relatively fast bitwise procedure that would let me apply this transformation efficiently (given that the transformation rules can change)?
I hope I formulated my question clearly. Thanks for any help.
EDIT: I corrected some mistakes, and clarified some points following comments.
EDIT2: Following some suggestions, I add what I hope is a code example:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int i;
unsigned int keys[16];
unsigned int bitKeys = 0;
unsigned int mapping[3];
unsigned int result[16];
unsigned int bitResults = 0;
//Initialize random keys and mapping dict
for(i = 0; i<16; i++) {
keys[i] = rand() % 3;
bitKeys |= keys[i] << (2*i);
}
for(i = 0; i<3; i++)
mapping[i] = rand() % 3;
//Get results without using bitwise operations.
for(i = 0; i<16; i++) {
result[i] = mapping[ keys[i] ];
bitResults |= result[i] << (2*i);
}
//Would it be possible to get bitResults directly from bitKeys efficiently by using bitwise operations?
return 0;
}
This is essentially a problem of simplifying truth tables to minimal Boolean expressions; here we need two expressions, one for each output value bit.
BA QP
00 10
01 01
10 01
11 XX
B: high key bit, A: low key bit, Q: high value bit, P: low value bit
By using any of the many tools available (including our brain) for minimizing combinational logic circuits, we get the expressions
Q = ¬A·¬B
P = A + B
Now that we have the expressions, we can apply them to all keys in a 32-bit variable:
uint32_t keys = 2u<<30|0<<10|1<<8|2<<6|1<<4|2<<2|0; // for example (2u avoids signed overflow)
uint32_t vals = ~keys & ~keys<<1 & 0xAAAAAAAA // value_high is !key_high & !key_low
| (keys>>1 | keys) & 0x55555555; // value_low is key_high | key_low
I would need a solution for any arbitrary mapping.
Here's an example program for arbitrary mappings. For each of the two value bits, there are 2^3 = 8 possible expressions (the same set for both bits); these expressions are:
0 ¬A·¬B A ¬B B ¬A A+B 1
By concatenating the high and low mapping bits, respectively, for keys 0, 1 and 2, we get the index of the expression corresponding to the mapping function. In the following program, the values of all the expressions, even the ones unused by the mapping, are stored in the term array. While this may seem wasteful, it allows computation without branches, which may be a win in the end.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main()
{
int i;
unsigned mapping[3];
// generate example mapping
for (i = 0; i < 3; ++i) mapping[i] = rand() % 3, printf(" %d->%d", i, mapping[i]);
puts("");
// determine the mapping expression index 0..7 for high and low value bit
short h = mapping[0]/2 | mapping[1]/2<<1 | mapping[2]/2<<2;
short l = mapping[0]%2 | mapping[1]%2<<1 | mapping[2]%2<<2;
uint32_t keys = 0x1245689A; // for example
uint32_t b = keys, a = keys<<1;
uint32_t term[8] = { 0, ~a&~b, a, ~b, b, ~a, a|b, -1 }; // all possible terms
uint32_t vals = term[h] & 0xAAAAAAAA // value_high
| term[l]>>1 & 0x55555555; // value_low
printf("%8x\n%8x\n", keys, vals);
}
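A sketch of the same term-table idea wrapped in a reusable function, next to a plain per-key loop it can be checked against. The names map_pairs_reference and map_pairs_terms are made up for illustration, and both functions assume every 2-bit key is 0, 1 or 2:

```c
#include <stdint.h>

/* Reference: map each of the 16 two-bit keys one at a time. */
uint32_t map_pairs_reference(uint32_t keys, const unsigned mapping[3])
{
    uint32_t vals = 0;
    for (int i = 0; i < 16; ++i) {
        unsigned key = (keys >> (2 * i)) & 3u;
        if (key < 3)   /* 3 is not a valid key; its result is left as 0 */
            vals |= (uint32_t)mapping[key] << (2 * i);
    }
    return vals;
}

/* Branch-free version using the table of eight terms. */
uint32_t map_pairs_terms(uint32_t keys, const unsigned mapping[3])
{
    /* Expression index 0..7 for the high and the low value bit. */
    unsigned h = mapping[0] / 2 | (mapping[1] / 2) << 1 | (mapping[2] / 2) << 2;
    unsigned l = mapping[0] % 2 | (mapping[1] % 2) << 1 | (mapping[2] % 2) << 2;

    /* Align A (low key bit) with B (high key bit) at the high position of each pair. */
    uint32_t b = keys, a = keys << 1;
    uint32_t term[8] = { 0, ~a & ~b, a, ~b, b, ~a, a | b, 0xFFFFFFFFu };

    return (term[h] & 0xAAAAAAAAu)          /* value_high */
         | ((term[l] >> 1) & 0x55555555u);  /* value_low  */
}
```

Looping over all 27 mappings and comparing the two functions on a few key words is a cheap way to gain confidence in the term table.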
After thinking about it, and using some of the ideas from other answers, I think I found a general solution. It is based on first estimating the value assuming there are only the keys 10 and 01 (i.e. one bit of the pair determines the other) and then correcting for the key 00. An example code of the solution:
#include <stdio.h>
#include <stdlib.h>
void printBits(size_t const size, void const * const ptr)
{
unsigned char *b = (unsigned char*) ptr;
unsigned char byte;
int i, j;
for (i=size-1;i>=0;i--)
{
for (j=7;j>=0;j--)
{
byte = (b[i] >> j) & 1;
printf("%u", byte);
if(j%2 == 0) printf("|");
}
}
puts("");
}
void test2BitMapping(unsigned int * mapping)
{
int i;
unsigned int keys[16];
unsigned int bitKeys = 0;
unsigned int b = 0;
unsigned int c = 0;
unsigned int d = 0;
unsigned int expand[4] = {0x00000000u, 0x55555555u, 0xAAAAAAAAu, 0xFFFFFFFFu};
unsigned int v12 = 0;
unsigned int v0mask = 0;
unsigned int result[16];
unsigned int bitResults = 0;
unsigned int bitResultsTest = 0;
//Create mapping masks
b = ((1 & mapping[1]) | (2 & mapping[2]));
c = (2 & mapping[1]) | (1 & mapping[2]);
d = mapping[0];
b = expand[b];
c = expand[c];
d = expand[d];
//Initialize random keys
for(i = 0; i<16; i++) {
if(0) { //Test random keys
keys[i] = rand() % 3;
}
else { //Check all keys are generated
keys[i] = i % 3;
}
bitKeys |= keys[i] << (2*i);
}
//Get results without using bitwise operations.
for(i = 0; i<16; i++) {
result[i] = mapping[ keys[i] ];
bitResultsTest |= result[i] << (2*i);
}
//Get results by using bitwise operations.
v12 = ( bitKeys & b ) | ( (~bitKeys) & c );
v0mask = bitKeys | (((bitKeys & 0xAAAAAAAAu) >> 1) | ((bitKeys & 0x55555555u) << 1));
bitResults = ( d & (~v0mask) ) | ( v12 & v0mask );
//Check results
if(0) {
for(i = 0; i<3; i++) {
printf("%d -> %d, ", i, mapping[i]);
}
printf("\n");
printBits(sizeof(unsigned int), &bitKeys);
printBits(sizeof(unsigned int), &bitResults);
printBits(sizeof(unsigned int), &bitResultsTest);
printf("-------\n");
}
if(bitResults != bitResultsTest) {
printf("*********\nDifferent\n*********\n");
}
else {
printf("OK\n");
}
}
int main(void)
{
int i, j, k;
unsigned int mapping[3];
//Test using random mapping
for(k = 0; k < 1000; k++) {
for(i = 0; i<3; i++) {
mapping[i] = rand() % 3;
}
test2BitMapping(mapping);
}
//Test all possible mappings
for(i = 0; i<3; i++) {
for(j = 0; j<3; j++) {
for(k = 0; k<3; k++) {
mapping[0] = i;
mapping[1] = j;
mapping[2] = k;
test2BitMapping(mapping);
}
}
}
return 0;
}
and I am interested in transforming the unsigned int, following the rules above (i.e. the 2-bit pairs have to be transformed following the rules: 00->10, 01->01, and 10->01), so that I end up with a 32bit unsigned int
Certainly this can be done, but the required sequence of operations will be different for each of the 27 distinct mappings from { 0, 1, 2 } to { 0, 1, 2 }. Some can be very simple, such as for the three constant mappings, but others require more complex expressions.
Without having performed a thorough analysis, I'm inclined to guess that the mappings that are neither constant nor permutations, such as the one presented in the example, probably have the greatest minimum complexity. These all share the characteristic that two keys map to the same value, whereas the other key maps to a different one. One way -- not necessarily the best -- to approach finding an expression for such a mapping is to focus first on achieving the general result that the two keys map to one value and the other to a different one, and then move on to transforming the resulting values to the desired ones, if necessary.
For the example presented, for instance,
0 -> 2
1 -> 1
2 -> 1
one could (on a per-key basis) use ((key & 2) >> 1) | ((key & 1) << 1) | key to achieve these preliminary results:
0 -> 0
1 -> 3
2 -> 3
which can be converted to the desired final result by flipping the higher-order bit via an exclusive-or operation.
Note well the bit masking. There are other ways that could be approached for mapping a single key, but for the case of multiple keys stored in contiguous bits of the same integer, you need to be careful to avoid contaminating the computed mappings with data from different keys.
In 16-entry bit-vector form, that would be
uint32_t keys = /*...*/;
uint32_t values = (((keys & 0xAAAAAAAAu) >> 1) | ((keys & 0x55555555u) << 1) | keys)
^ 0xAAAAAAAAu;
That happens to use slightly fewer operations than the expression in your other answer so far, but I am not certain that it is the smallest possible number of operations. In fact, if you are prepared to accept arithmetic operations in addition to bitwise ones, then you can definitely do it with fewer operations:
uint32_t keys = /*...*/;
uint32_t values = 0xAAAAAAAAu
- (((keys & 0xAAAAAAAAu) >> 1) | (keys & 0x55555555u));
Of course, in general, various operations do not all have the same cost as each other, but integer addition and subtraction and bitwise AND, OR, and XOR all have the same cost as each other on most architectures (see, for example, https://www.agner.org/optimize/instruction_tables.pdf).
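A small sketch for checking the XOR form and the subtraction form of the example mapping (0 -> 2, 1 -> 1, 2 -> 1) against a per-key loop. The function names are made up for illustration. The XOR form is written here with the key ORed into its bit-swapped copy, so that keys 1 and 2 both become 3 before the high bit is flipped:

```c
#include <stdint.h>

/* Per-key reference for the mapping 0 -> 2, 1 -> 1, 2 -> 1. */
uint32_t map_reference(uint32_t keys)
{
    static const uint32_t mapping[3] = { 2, 1, 1 };
    uint32_t vals = 0;
    for (int i = 0; i < 16; ++i) {
        uint32_t key = (keys >> (2 * i)) & 3u;
        if (key < 3)   /* 3 is not a valid key; its result is left as 0 */
            vals |= mapping[key] << (2 * i);
    }
    return vals;
}

/* Bitwise-only form: 0 -> 0, 1 -> 3, 2 -> 3, then flip the high bit. */
uint32_t map_xor(uint32_t keys)
{
    uint32_t swapped = ((keys & 0xAAAAAAAAu) >> 1) | ((keys & 0x55555555u) << 1);
    return (swapped | keys) ^ 0xAAAAAAAAu;
}

/* Arithmetic form: each pair computes 2 - (high_bit | low_bit).
 * Every per-pair result is at least 1, so no borrow crosses pairs. */
uint32_t map_sub(uint32_t keys)
{
    return 0xAAAAAAAAu - (((keys & 0xAAAAAAAAu) >> 1) | (keys & 0x55555555u));
}
```

Both forms rely on no pair holding the invalid key 3.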
I am working on a project in which I want to convert a given video input stream into block sections (so it can be used by a hardware codec). This project runs on an STM32 microcontroller with a 200 MHz clock.
The received input is a YCbCr 4:2:2 progressive stream, which basically means the input stream looks like this for every row:
Size: 32 bit word 32 bit word 32 bit word ...
Component: Cr Y1 Cb Y0 Cr Y1 Cb Y0 Cr Y1 Cb Y0 ...
Bits: 8 8 8 8 8 8 8 8 8 8 8 8 ...
This stream needs to be converted into a block format used by a hardware codec. The codec accepts a byte array in a specific order. Currently I am doing this using a nested loop for every 1/8 of an image frame using lookup tables and writing into an empty array:
Defines:
#define ROWS_PER_MCU 8
#define WORDS_PER_MCU 8
#define HORIZONTAL_MCU_PER_INPUTBUFFER 40
#define VERTICAL_MCU_PER_INPUTBUFFER 8
Global variables are declared like this:
typedef struct jpegInputbufferLUT
{
uint8_t JPEG_Y_MCU_LUT[256];
uint8_t JPEG_Cb_MCU_422_LUT[256];
uint8_t JPEG_Cr_MCU_422_LUT[256];
}jpegIndexLUT;
jpegIndexLUT jpegInputLUT;
uint8_t jpegInBuffer[81920];
uint32_t rawBuffer[20480];
Look up tables are created like this:
void JPEG_Init_MCU_LUT(void)
{
uint32_t offset;
/*Y LUT */
for(uint32_t i = 0; i < 16; i++)
{
for(uint32_t j = 0; j < 16; j++)
{
offset = j + (i*8);
if((j>=8) && (i>=8)) offset+= 120;
else if((j>=8) && (i<8)) offset+= 56;
else if((j<8) && (i>=8)) offset+= 64;
jpegInputLUT.JPEG_Y_MCU_LUT[i*16 + j] = offset;
}
}
/*Cb Cr LUT*/
for(uint32_t i = 0; i < 16; i++)
{
for(uint32_t j = 0; j < 16; j++)
{
offset = i*16 + j;
jpegInputLUT.JPEG_Cb_MCU_422_LUT[offset] = (j/2) + (i*8) + 128;
jpegInputLUT.JPEG_Cr_MCU_422_LUT[offset] = (j/2) + (i*8) + 192;
}
}
}
Conversion code:
/* Initialize variables for array conversion */
uint32_t currentMCU = 0;
uint32_t lutOffset = 0;
uint32_t inputOffset = 0;
uint32_t verticalOffset = 0;
/* Convert X rows into MCU blocks for JPEG encoding */
for(uint8_t k = 0; k < VERTICAL_MCU_PER_INPUTBUFFER; k++)
{
for(uint8_t n = 0; n < HORIZONTAL_MCU_PER_INPUTBUFFER; n++)
{
inputOffset = verticalOffset + (n * 8);
lutOffset = 0;
for(uint8_t i = 0; i < ROWS_PER_MCU; i++)
{
for(uint8_t j = 0; j < WORDS_PER_MCU; j++)
{
/* Mask 32 bit according to DCMI input format */
uint32_t rawBufferAddress = inputOffset+j; // Calculate rawBuffer address here so it only has to be calculated once
jpegInBuffer[jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset] + currentMCU] = (rawBuffer[rawBufferAddress] & 0xFF);
jpegInBuffer[jpegInputLUT.JPEG_Cb_MCU_422_LUT[lutOffset] + currentMCU] = ((rawBuffer[rawBufferAddress] >> 8) & 0xFF);
jpegInBuffer[jpegInputLUT.JPEG_Cr_MCU_422_LUT[lutOffset] + currentMCU] = ((rawBuffer[rawBufferAddress] >> 24) & 0xFF);
jpegInBuffer[jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset+1] + currentMCU] = ((rawBuffer[rawBufferAddress] >> 16) & 0xFF);
lutOffset+=2;
}
inputOffset += 320;
}
currentMCU += 256;
}
verticalOffset += 2240;
}
This conversion is currently taking me about 8 ms, and this needs to be done 8 times. This is currently taking up almost all of my available execution time, since I am trying to get 15 fps out of my system.
Is it in any way possible to speed this up? I was thinking maybe sorting the input array instead of just writing into a new buffer, but would swapping 2 elements in an array have a faster execution time than copying values into another array?
Would love to hear your ideas/thoughts on this,
Thanks in advance!
Your program seems to run slower than expected from an STM32. You may need to look into what assembly is produced, compiler optimization settings, if MCU frequency is correct, if memory is too slow, etc. We don't have enough information to give a definite answer why. Your code seems to spend 8 ms * 200M / (8*8*8*40) = 78 cycles for each inner loop iteration. For reference, an stm32f723 only needs about 15 cycles, and an stm32f103 about 28 cycles (the code was adjusted to access smaller arrays in the latter case).
The LUT table is not needed as its content is very regular. Reading LUT values adds more memory reads, which may be a significant contribution. If I got your LUT generation code correctly, it produces the following numbers in the inner loop:
Y1 Cb Cr Y2
0 128 192 1
2 129 193 3
4 130 194 5
6 131 195 7
64 132 196 65
66 133 197 67
68 134 198 69
70 135 199 71
8 136 200 9
etc
The second and third columns are just consecutive numbers. The fourth column equals the first one plus one. And the first number needs a bit flip. You can try the following code (please check that it is correct):
uint32_t lutOffset = 0;
for(uint8_t i = 0; i < ROWS_PER_MCU; i++)
{
for(uint8_t j = 0; j < WORDS_PER_MCU; j++)
{
uint32_t rawBufferAddress = (inputOffset+j) /* % 2048 */;
#if 0
unsigned y_lut1 = jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset];
unsigned Cb_lut = jpegInputLUT.JPEG_Cb_MCU_422_LUT[lutOffset];
unsigned Cr_lut = jpegInputLUT.JPEG_Cr_MCU_422_LUT[lutOffset];
unsigned y_lut2 = jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset+1];
#else
unsigned y_lut1 = lutOffset | (j / 4) << 6 | (j % 4) << 1;
unsigned Cb_lut = 128 + lutOffset + j;
unsigned Cr_lut = 192 + lutOffset + j;
unsigned y_lut2 = y_lut1 + 1;
#endif
jpegInBuffer[y_lut1 + currentMCU] = (rawBuffer[rawBufferAddress] & 0xFF);
jpegInBuffer[Cb_lut + currentMCU] = ((rawBuffer[rawBufferAddress] >> 8) & 0xFF);
jpegInBuffer[Cr_lut + currentMCU] = ((rawBuffer[rawBufferAddress] >> 24) & 0xFF);
jpegInBuffer[y_lut2 + currentMCU] = ((rawBuffer[rawBufferAddress] >> 16) & 0xFF);
}
lutOffset += 8;
inputOffset += 320;
}
This version takes about 20 cycles per iteration on my stm32f103, which is less than 6 ms even at its 72 MHz.
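Since the arithmetic indices are meant to replace the lookup tables exactly, they can be checked on a PC before going to the target. A minimal sketch, assuming the table construction from JPEG_Init_MCU_LUT in the question; the function name check_mcu_indices and the local variable names are made up for illustration:

```c
#include <stdint.h>

/* Returns 1 if the arithmetic index computation matches the lookup
 * tables for the 8 rows x 8 words of one MCU, 0 otherwise. */
int check_mcu_indices(void)
{
    uint8_t y_lut[256], cb_lut[256], cr_lut[256];

    /* Rebuild the tables the same way the question does. */
    for (unsigned i = 0; i < 16; i++) {
        for (unsigned j = 0; j < 16; j++) {
            unsigned offset = j + i * 8;
            if (j >= 8 && i >= 8)      offset += 120;
            else if (j >= 8 && i < 8)  offset += 56;
            else if (j < 8 && i >= 8)  offset += 64;
            y_lut[i * 16 + j]  = (uint8_t)offset;
            cb_lut[i * 16 + j] = (uint8_t)(j / 2 + i * 8 + 128);
            cr_lut[i * 16 + j] = (uint8_t)(j / 2 + i * 8 + 192);
        }
    }

    unsigned lut_index = 0;  /* the question's lutOffset: +2 per word  */
    unsigned row_base = 0;   /* the answer's lutOffset: +8 per row     */
    for (unsigned i = 0; i < 8; i++) {
        for (unsigned j = 0; j < 8; j++) {
            unsigned y1 = row_base | (j / 4) << 6 | (j % 4) << 1;
            unsigned cb = 128 + row_base + j;
            unsigned cr = 192 + row_base + j;
            if (y_lut[lut_index] != y1 || y_lut[lut_index + 1] != y1 + 1 ||
                cb_lut[lut_index] != cb || cr_lut[lut_index] != cr)
                return 0;
            lut_index += 2;
        }
        row_base += 8;
    }
    return 1;
}
```

Only the first 128 table entries are compared, because the conversion loop never reads beyond them.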
UPD. Another option is using one small lookup table instead of bit computations:
static const unsigned x[8] = { 0, 2, 4, 6, 64, 66, 68, 70 };
// unsigned y_lut1 = lutOffset | (j / 4) << 6 | (j % 4) << 1;
unsigned y_lut1 = lutOffset + x[j];
This improves the inner loop timing to 18 (f103) / 7.5 (f723) cycles. For some reason, optimizing this expression for F723 does not work well. I would expect these options to give identical result since the inner loop is unrolled, but who knows.
As an additional optimization, probably not necessary, the output values can be combined into 32-bit words and written one word at a time. This seems possible because LUT values come in blocks of four consecutive ones. For this, the inner loop can be converted to a nested loop of 2 by 4 iterations. Each 4 iterations of the innermost loop will produce one uint32_t for Cb, one uint32_t for Cr and two uint32_t for Y. But it is probably not worth doing.
I measure run time with SysTick:
SysTick->LOAD = SysTick_LOAD_RELOAD_Msk;
SysTick->VAL = 0;
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk;
volatile unsigned t0 = SysTick->VAL;
f();
volatile unsigned t1 = t0 - SysTick->VAL;
I sometimes use output pins too, when connecting a debugger is not practical. Strictly speaking, neither method is guaranteed to work, because the compiler may move code across measurement points, but it has worked as intended for me (with gcc). Assembly inspection is needed to make sure that nothing fishy is going on.
There are any number of micro-optimisations that could be performed here that could provide an improvement. Some may show an improvement in a debug build without compiler optimisation, only to have no advantage with optimisation. It is even possible that some "clever" trick that is faster in debug, if non-idiomatic, could cause the optimiser to generate worse code than it would have had you favoured clarity over performance.
All the obvious micro-optimisations such as loop unrolling the compiler optimiser will likely be able to perform for you without complicating the code or risking introducing errors.
One rather obvious improvement (regardless of whether or not it is faster), would be to change:
for( uint8_t j = 0; j < WORDS_PER_MCU; j++ )
{
/* Mask 32 bit according to DCMI input format */
uint32_t rawBufferAddress = inputOffset+j; // Calculate rawBuffer address here so it only has to be calculated once
...
to:
uint32_t rawBufferAddress = inputOffset ;
for( uint8_t j = 0; j < WORDS_PER_MCU; rawBufferAddress++, j++)
{
/* Mask 32 bit according to DCMI input format */
...
Your "only has to be calculated once" is actually WORDS_PER_MCU calculations, and an increment is likely to be faster than an addition and assignment. At worst it will be no different.
I would similarly suggest moving all the other "end of loop" increments, such as lutOffset+=2, into the third expression of the respective for statement. Not for performance, but for clarity.
I'm wondering if someone knows an effective approach to count the bits set at a specified position across an array?
Assuming that OP wants to count active bits
size_t countbits(uint8_t *array, int pos, size_t size)
{
uint8_t mask = 1 << pos;
uint32_t result = 0;
while(size--)
{
result += *array++ & mask;
}
return result >> pos;
}
You can just loop the array values and test for the bits with a bitwise and operator, like so:
int arr[] = {1,2,3,4,5};
// 1 - 001
// 2 - 010
// 3 - 011
// 4 - 100
// 5 - 101
int i, bitcount = 0;
for (i = 0; i < 5; ++i){
if (arr[i] & (1 << 2)){ //testing and counting the 3rd bit
bitcount++;
}
}
printf("%d", bitcount); //2
Note that I opted for 1 << 2, which tests the 3rd bit from the right (the third least significant bit), just to make it easier to show. bitcount now holds 2, the number of elements whose 3rd bit is set to 1.
In your case you would need to check for the 5th bit, which can be represented as any of:
1 << 4
0x10
16
And the 8th bit:
1 << 7
0x80
128
So adjusting this to your bits would give you:
int i, bitcount8 = 0, bitcount5 = 0;
for (i = 0; i < your_array_size_here; ++i){
if (arr[i] & 0x80){
bitcount8++;
}
if (arr[i] & 0x10){
bitcount5++;
}
}
If you need to count many of them, then this solution isn't great and you'd be better off creating an array of bit counts, and calculating them with another for loop:
int i, j, bitcounts[8] = {0};
for (i = 0; i < your_array_size_here; ++i){
for (j = 0; j < 8; ++j){
//j will be catching each bit with the increasing shift lefts
if (arr[i] & (1 << j)){
bitcounts[j]++;
}
}
}
And in this case you would access the bit counts by their index:
printf("%d", bitcounts[2]); //2
Let the bit position difference (e.g. 7 - 4 in this case) be diff.
If 2^diff > n, then code can add both bits at the same time.
void count(const uint8_t *Array, size_t n, int *bit7sum, int *bit4sum) {
unsigned sum = 0;
unsigned mask = 0x90;
while (n > 0) {
n--;
sum += Array[n] & mask;
}
*bit7sum = sum >> 7;
*bit4sum = (sum >> 4) & 0x07;
}
If the processor has a fast multiply and n is still not too large (n < pow(2,14) in this case, or n < pow(2,8) in the general case), the two bits can first be moved further apart:
void count2(const uint8_t *Array, size_t n, int *bit7sum, int *bit4sum) {
// assume 32 bit or wider unsigned
unsigned sum = 0;
unsigned mask1 = 0x90;
unsigned m = 1 + (1u << 11); // to move bit 7 to the bit 18 place
unsigned mask2 = (1u << 18) | (1u << 4);
while (n > 0) {
n--;
sum += ((Array[n] & mask1)*m) & mask2;
}
*bit7sum = sum >> 18;
*bit4sum = (((1u << 18) - 1) & sum) >> 4;
}
Algorithm: the code uses a mask, a multiply, and another mask to separate the 2 bits. The lower bit remains in its low position while the upper bit is shifted to the upper bits. Then a parallel add occurs.
The loop avoids any branching aside from the loop itself. This can make for fast code. YMMV.
With even larger n, break it down into multiple calls to count2()
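One way the chunking could look, as a sketch rather than a tuned implementation. The name count_chunked, the size_t result type and the chunk length of 16383 (the largest count that fits in the 14 bits reserved for each sum) are assumptions made for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Counts how often bit 7 and bit 4 are set, for arrays of any length.
 * Within a chunk, the bit-4 count lives in bits 4..17 of sum and the
 * bit-7 count in bits 18..31, so each chunk holds at most 2^14 - 1 bytes. */
void count_chunked(const uint8_t *array, size_t n, size_t *bit7sum, size_t *bit4sum)
{
    const size_t chunk_max = 16383;
    const uint32_t mask1 = 0x90;
    const uint32_t m = 1u + (1u << 11);              /* copies bit 7 to bit 18 */
    const uint32_t mask2 = (1u << 18) | (1u << 4);

    *bit7sum = 0;
    *bit4sum = 0;
    while (n > 0) {
        size_t len = n < chunk_max ? n : chunk_max;
        uint32_t sum = 0;
        for (size_t i = 0; i < len; ++i)
            sum += ((array[i] & mask1) * m) & mask2;
        /* Fold this chunk's two packed counts into the wide totals. */
        *bit7sum += sum >> 18;
        *bit4sum += (sum & ((1u << 18) - 1)) >> 4;
        array += len;
        n -= len;
    }
}
```

The inner loop is the same as in count2(); only the splitting and the wide accumulators are new.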
Let's say I have this byte
uint8_t k[8]= {0,0,0,1,1,1,0,0};
Is there a way to get this to become a single integer or hex?
If k represents 8 bytes of the 64-bit integer, go through the array of 8-bit integers, and shift them into the result left-to-right:
uint64_t res = 0;
for (int i = 0 ; i != 8 ; i++) {
res <<= 8;
res |= k[i];
}
The direction of the loop depends on the order in which the bytes of the original int are stored in the k array. The above snippet shows the MSB-to-LSB order; if the array is LSB-to-MSB, start the loop at 7, and go down to zero.
If the bytes represent individual bits, shift by one rather than eight.
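For the bit-per-element reading, a minimal sketch, assuming k[0] is the most significant bit; the function name bits_to_byte_msb_first is made up for illustration:

```c
#include <stdint.h>

/* Combines 8 array elements, each 0 or 1, into one byte.
 * k[0] becomes the most significant bit. */
uint8_t bits_to_byte_msb_first(const uint8_t k[8])
{
    uint8_t res = 0;
    for (int i = 0; i < 8; i++) {
        /* Make room for the next bit, then append it at the bottom. */
        res = (uint8_t)((res << 1) | (k[i] & 1u));
    }
    return res;
}
```

For the array in the question, {0,0,0,1,1,1,0,0}, this gives 0x1C (28 in decimal).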
This should do the trick:
int convertToInt(uint8_t k[8], bool leastSignificantFirst) {
int res = 0;
for (int i = 0; i < 8; ++i) {
if (leastSignificantFirst) {
res |= (k[i] & 1) << i;
} else {
res |= (k[i] & 1) << (7 - i);
}
}
return res;
}
To be on the same page, let's assume sizeof(int)=4 and sizeof(long)=8.
Given an array of integers, what would be an efficient method to logically bitshift the array to either the left or right?
I am contemplating an auxiliary variable such as a long, that will compute the bitshift for the first pair of elements (index 0 and 1) and set the first element (0). Continuing in this fashion the bitshift for elements (index 1 and 2) will be computed, and then index 1 will be set.
I think this is actually a fairly efficient method, but there are drawbacks. I cannot bitshift greater than 32 bits. I think using multiple auxiliary variables would work, but I'm envisioning recursion somewhere along the line.
There's no need to use a long as an intermediary. If you're shifting left, start with the highest-order int; if you're shifting right, start at the lowest. Add in the carry from the adjacent element before you modify it.
void ShiftLeftByOne(int * arr, int len)
{
int i;
for (i = 0; i < len - 1; ++i)
{
arr[i] = (arr[i] << 1) | ((arr[i+1] >> 31) & 1);
}
arr[len-1] = arr[len-1] << 1;
}
This technique can be extended to do a shift of more than 1 bit. If you're doing more than 32 bits, you take the bit count mod 32 and shift by that, while moving the result further along in the array. For example, to shift left by 33 bits, the code will look nearly the same:
void ShiftLeftBy33(int * arr, int len)
{
int i;
for (i = 0; i < len - 2; ++i)
{
arr[i] = (arr[i+1] << 1) | ((arr[i+2] >> 31) & 1);
}
arr[len-2] = arr[len-1] << 1;
arr[len-1] = 0;
}
For anyone else, this is a more generic version of Mark Ransom's answer above for any number of bits and any type of array:
/* This function shifts an array of byte of size len by shft number of
bits to the left. Assumes array is big endian. */
#define ARR_TYPE uint8_t
void ShiftLeft(ARR_TYPE * arr_out, ARR_TYPE * arr_in, int arr_len, int shft)
{
const int int_n_bits = sizeof(ARR_TYPE) * 8;
int msb_shifts = shft % int_n_bits;
int lsb_shifts = int_n_bits - msb_shifts;
int byte_shft = shft / int_n_bits;
int last_byt = arr_len - byte_shft - 1;
for (int i = 0; i < arr_len; i++){
if (i <= last_byt){
int msb_idx = i + byte_shft;
arr_out[i] = arr_in[msb_idx] << msb_shifts;
if (i != last_byt)
arr_out[i] |= arr_in[msb_idx + 1] >> lsb_shifts;
}
else arr_out[i] = 0;
}
}
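The question also asks about shifting right. A sketch of the mirror-image operation, assuming the same big-endian layout (arr[0] holds the most significant bits) and using uint32_t so that the shift is logical rather than arithmetic; the function name shift_right is made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Logically shifts a big-endian array of 32-bit words right by shift
 * bits, in place. Vacated bits at the top are filled with 0. */
void shift_right(uint32_t *arr, size_t len, size_t shift)
{
    size_t word_shift = shift / 32;
    unsigned bit_shift = (unsigned)(shift % 32);

    /* Walk from the least significant word upward, so every source
     * word is read before it is overwritten. */
    for (size_t i = len; i-- > 0; ) {
        uint32_t v = 0;
        if (i >= word_shift) {
            size_t src = i - word_shift;
            v = arr[src] >> bit_shift;
            /* Bring in the low bits of the next more significant word.
             * The bit_shift != 0 test avoids an undefined shift by 32. */
            if (bit_shift != 0 && src > 0)
                v |= arr[src - 1] << (32 - bit_shift);
        }
        arr[i] = v;
    }
}
```

A shift of 33 bits, for instance, moves every word one position toward the end of the array and then shifts by one more bit.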
Take a look at the BigInteger implementation in Java, which internally stores its data as an array of ints. Specifically, you can check out the method shiftLeft(). The syntax is close to C, so it wouldn't be too difficult to write a pair of functions like those. Note too that when it comes to bit shifting you can take advantage of unsigned types in C. In Java, safely shifting data without messing around with the sign usually requires a bigger type to hold the data (i.e. an int to shift a short, a long to shift an int, ...).