byte swapping/reversing a character array


I am looking for a method to reverse the bytes in a character array. I also need to reverse the individual bits of each byte being swapped before placing it in its new position.
For example, say I have a char arr[1000] where arr[0] = 00100011 and arr[999] = 11010110. I want to swap arr[0] and arr[999] and, in addition, reverse the bits in each of them, so the output would be arr[999] = 11000100 (reversed bits of arr[0]) and arr[0] = 01101011 (reversed bits of arr[999]).
I have some code to do the bit reversal inside a byte:
static char reverseByte(char val)
{
char result = 0;
int counter = 8;
while (counter-- > 0)
{
result <<= 1;
result |= (char)(val & 1);
val = (char)(val >> 1);
}
return result;
}
But this would mean running an outer loop to do the byte swap and then running the small loop above for each byte, i.e. 1000 times in the above case. Is this the right approach? Is there a better way to achieve this?
Any help would be greatly appreciated.

How about this?:
#include <stdio.h>
#include <limits.h>
#if CHAR_BIT != 8
#error char is expected to be 8 bits
#endif
unsigned char RevByte(unsigned char b)
{
static const unsigned char t[16] =
{
0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};
return t[b >> 4] | (t[b & 0xF] << 4);
}
void RevBytes(unsigned char* b, size_t c)
{
size_t i;
for (i = 0; i < c / 2; i++)
{
unsigned char t = b[i];
b[i] = RevByte(b[c - 1 - i]);
b[c - 1 - i] = RevByte(t);
}
if (c & 1)
b[c / 2] = RevByte(b[c / 2]);
}
int main(void)
{
int i;
unsigned char buf[16] =
{
0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};
RevBytes(buf, 16);
for (i = 0; i < 16; i++)
printf("0x%02X ", buf[i]);
puts("");
return 0;
}
Output (ideone):
0xF0 0xE0 0xD0 0xC0 0xB0 0xA0 0x90 0x80 0x70 0x60 0x50 0x40 0x30 0x20 0x10 0x00

Reversing the elements of a collection is an old trick - you swap first and last, then first+1 and last-1, then... until first+i is equal to or further along the collection than last-j.
All you have to add onto that is
-flipping the bits of the two entries before swapping them
-if you have one element left in the middle, flip its bits on the spot
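The bit flip itself needs neither a loop nor a table. A minimal sketch, assuming 8-bit bytes (the name rev8 is made up for illustration): swap the two nibbles, then the bit pairs inside each nibble, then adjacent bits.

```c
#include <assert.h>
#include <limits.h>

/* rev8 is a hypothetical helper name; it assumes CHAR_BIT == 8. */
unsigned char rev8(unsigned char b)
{
    assert(CHAR_BIT == 8);
    b = (unsigned char)((b & 0xF0) >> 4 | (b & 0x0F) << 4); /* swap nibbles */
    b = (unsigned char)((b & 0xCC) >> 2 | (b & 0x33) << 2); /* swap bit pairs */
    b = (unsigned char)((b & 0xAA) >> 1 | (b & 0x55) << 1); /* swap adjacent bits */
    return b;
}
```

With the values from the question, rev8(0x23) gives 0xC4 (11000100) and rev8(0xD6) gives 0x6B (01101011).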

To do this - assuming your bit reversal method is correct - if you can spare the extra char in storage, you can simply traverse the bytes. Consider the following method:
static void swapReverseBytes(char* arr, size_t len)
{
size_t i = 0;
char tmp = 0;
if(arr == NULL || len < 1)
return;
for(i = 0 ; i < (len / 2) ; ++i) {
tmp = arr[len - i - 1];
arr[len - i - 1] = reverseByte(arr[i]);
arr[i] = reverseByte(tmp);
}
/* odd length: the middle byte stays in place but still needs its bits reversed */
if(len & 1)
arr[len / 2] = reverseByte(arr[len / 2]);
}
This is a rough sketch (it should compile, I think) and, assuming I have no off-by-one errors, it should work. As mentioned earlier, it is probably fastest to reverse the bits using a LUT. The byte swapping will cost about the same however you write it, since you need to actually move each byte, unless you keep some state about the traversal order of the array. In that case you can use a flag to determine whether the array should be traversed normally (1...n) or in reverse order (n...1). The swapping is then "free" in Big-O terms, but in practice it may (not necessarily) show a performance impact. So if you are really optimizing for speed and an extra int of space isn't too much, this state may be worthwhile for you. Note that this trick only works if the array is internal. If it is exposed to a user who does not know about the flag, nothing will appear to have been flipped. Another option, if you can use C++, is to create a class that encapsulates this functionality for the user.
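A minimal sketch of the flag idea, under the assumption that every read goes through an accessor. RevView, rv_flip, rv_reverse and rv_get are hypothetical names, not part of the code above, and the bit flip is done on access rather than in storage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wrapper: callers read bytes only through rv_get(). */
typedef struct {
    unsigned char *data;
    size_t len;
    int reversed;   /* 0: stored order, 1: byte order and bit order reversed */
} RevView;

/* Reverse the bits of one byte (assumes 8-bit bytes). */
static unsigned char rv_flip(unsigned char b)
{
    unsigned char r = 0;
    for (int k = 0; k < 8; k++) {
        r = (unsigned char)((r << 1) | (b & 1));
        b >>= 1;
    }
    return r;
}

/* "Reversing" is O(1): only the flag changes, no byte is moved. */
void rv_reverse(RevView *v)
{
    assert(v != NULL);
    v->reversed = !v->reversed;
}

/* Each read pays for the index mapping and the bit flip instead. */
unsigned char rv_get(const RevView *v, size_t i)
{
    assert(v != NULL && i < v->len);
    return v->reversed ? rv_flip(v->data[v->len - 1 - i]) : v->data[i];
}
```

Whether this wins depends on how often the array is read after being reversed: the cost moves from one pass over the data to every access.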

Related

Get part of specific length of allocated memory space

I have some billions of bits loaded into RAM using malloc(); I will call this big_set. I also have another set of bits in RAM (I will call it small_set) which are all set to 1. I know its size in bits (I will call it ss_size), but I can't predict it, as it varies on each execution. ss_size can be as small as 100 or as large as hundreds of millions.
I need to do some bitwise operations between small_set and some unpredictable parts of big_set that are ss_size bits long. I can't just pad small_set with zeros on both the most-significant and least-significant sides to make its size equal to big_set's, as that would be very expensive in RAM and CPU (the same operations will be done at the same time with many differently sized small_sets, and I will also do shift operations over small_set, so expanding it would give the CPU many more bits to work on).
Example:
big_set: 100111001111100011000111110001100 (would be billions of bits in reality)
small_set: 111111, so ss_size is 6. (may be an unpredictable number of bits).
I need to take 6-bit-long parts of big_set, e.g. 001100, 000111, etc. Note: not necessarily the Nth group of 6 bits; it could be the 3rd to 9th bits, for instance. I don't know how to get them.
I don't want to make a copy of big_set with everything zeroed except the 6 bits I am taking, as in 000000001111100000000000000000000, as that would also be very expensive in RAM.
The question is: how can I get N bits from anywhere inside big_set, so I can do bitwise operations between them and small_set, where N = ss_size?

I'm not sure the example below answers your question, and I am also not sure that the XOR as implemented works correctly.
But I have tried to show how convoluted the implementation can become when the goal is to save memory.
Here is my example for 40 bits in big_set and 6 bits in small_set:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
void setBitsInMemory(uint8_t * memPtr, size_t from, size_t to)
// sets bits in the memory allocated from memPtr (pointer to the first byte)
// where from and to are numbers of bits to be set
{
for (size_t i = from; i <= to; i++)
{
size_t block = i / 8;
size_t offset = i % 8;
*(memPtr + block) |= 0x1 << offset;
}
}
uint8_t * allocAndBuildSmallSet(size_t bitNum)
// Allocate memory to store bitNum bits and set them to 1
{
uint8_t * ptr = NULL;
size_t byteNum = 1 + bitNum / 8; // number of bytes needed to hold bitNum bits (may include one spare byte)
ptr = (uint8_t*) malloc(byteNum);
if (ptr != NULL)
{
for (size_t i = 0; i < byteNum; i++) ptr[i] = 0;
setBitsInMemory(ptr, 0, bitNum - 1);
}
return ptr;
}
void printBits(uint8_t * memPtr, size_t from, size_t to)
{
for (size_t i = from; i <= to; i++)
{
size_t block = i / 8;
size_t offset = i % 8;
if (*(memPtr + block) & (0x1 << offset) )
printf("1");
else
printf("0");
}
}
void applyXOR(uint8_t * mainMem, size_t start, size_t cnt, uint8_t * pattern, size_t ptrnSize)
// Applies bitwise XOR between cnt bits of mainMem and pattern
// starting from start bit in mainMem and 0 bit in pattern
// if pattern is smaller than cnt, it will be applied cyclically
{
size_t ptrnBlk = 0;
size_t ptrnOff = 0;
for (size_t i = start; i < start + cnt; i++)
{
size_t block = i / 8;
size_t offset = i % 8;
*(mainMem + block) ^= ((*(pattern + ptrnBlk) & (0x1 << ptrnOff)) ? 1 : 0) << offset;
ptrnOff++;
if (ptrnOff == 8) // move on to the next pattern byte
{
ptrnBlk++;
ptrnOff = 0;
}
if ((ptrnBlk * 8 + ptrnOff) >= ptrnSize) // end of pattern reached: start over
{
ptrnBlk = 0;
ptrnOff = 0;
}
}
}
int main(void)
{
uint8_t * big_set;
size_t ss_size;
uint8_t * small_set;
big_set = (uint8_t*)malloc(5); // 5 bytes (40 bit) without initialization
ss_size = 6;
small_set = allocAndBuildSmallSet(ss_size);
printf("Initial big_set:\n");
printBits(big_set, 0, 39);
// some operation for ss_size bits starting from 12th
applyXOR(big_set, 12, ss_size, small_set, ss_size);
// output for visual analysis
printf("\nbig_set after XOR with small_set:\n");
printBits(big_set, 0, 39);
printf("\n");
// free memory
free(big_set);
free(small_set);
}
At my PC I can see the following:

Slice up an uint8_t array

Let's say that I have an array of 16 uint8_t as follows:
uint8_t array[] = {0x13, 0x01, 0x4E, 0x52, 0x31, 0x4A, 0x35, 0x36, 0x4C, 0x11, 0x21, 0xC6, 0x3C, 0x73, 0xC2, 0x41};
This array stores the data contained in a 128-bit register of an external peripheral. Some of the information it holds is stored in fields of 2, 3, 8, 12 bits and so on.
What is the best, most elegant way to slice it up and bit-mask the information I need? (The problem is that some of the fields I need span more than one cell of the array.)
If it helps, this snippet I wrote converts the whole array into a hex string. But casting this into an int is not an option because, well, it's 16 bytes.
int i;
char str[33];
for(i = 0; i < sizeof(array) / sizeof(*array) ; i++) {
sprintf(str+2*i,"%02hX",array[i]);
}
puts(str);
13014E52314A35364C1121C63C73C241

Actually, this problem also occurs when parsing all kinds of bitstreams, such as video or image files, or data compressed by algorithms like LZ*. The approach used there is to implement a bitstream reader.
But in your case the bit sequence is fixed-length and quite short, so one way is to manually extract the field values using bitwise operations.
Or you can use this function that I just wrote, which can extract an arbitrary number of bits from a uint8_t array, starting from the desired bit position:
uint32_t extract_bits(uint8_t *arr, unsigned int bit_index, unsigned int bit_count)
{
/* Assert that we are not requested to extract more than 32 bits */
uint32_t result = 0;
assert(bit_count <= sizeof(result)*8 && arr != NULL);
/* You can additionally check if you are trying to extract bits exceeding the 16 byte range */
assert(bit_index + bit_count <= 16 * 8);
unsigned int arr_id = bit_index / 8;
unsigned int bit_offset = bit_index % 8;
if (bit_offset > 0) {
/* Extract first 'unaligned_bit_count' bits, which happen to be non-byte-aligned.
* When we do extract those bits, the remaining will be byte-aligned so
* we will treat them in a different manner.
*/
unsigned int unaligned_bit_count = 8 - bit_offset;
/* Check if we need less than the remaining unaligned bits */
if (bit_count < unaligned_bit_count) {
result = (arr[arr_id] >> bit_offset) & ((1 << bit_count) - 1);
return result;
}
/* We need them all */
result = arr[arr_id] >> bit_offset;
bit_count -= unaligned_bit_count;
/* Move to next byte element */
arr_id++;
}
while (bit_count > 0) {
/* Try to extract up to 8 bits per iteration */
int bits_to_extract = bit_count > 8 ? 8 : bit_count;
if (bits_to_extract < 8) {
result = (result << bits_to_extract) | (arr[arr_id] & ((1 << bits_to_extract)-1));
}else {
result = (result << bits_to_extract) | arr[arr_id];
}
bit_count -= bits_to_extract;
arr_id++;
}
return result;
}
Here is an example of how it is used.
uint32_t r;
/* Extracts bits [7..8] and places them as most significant bits of 'r' */
r = extract_bits(arr, 7, 2);
/* Extracts bits [4..35] and places them as most significant bits of 'r' */
r = extract_bits(arr, 4, 32);
/* Visualize */
printf("slice=%x\n", r);
And then the visualisation of r is up to you. It can be represented as hex dwords, characters, or however you decide.
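For comparison, a minimal sketch of the bitstream-reader approach mentioned at the start of this answer. BitReader and br_read are hypothetical names, and the reader assumes fields are packed MSB-first (bit 0 is the top bit of the first byte), which may not match how your peripheral lays out its register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reader state: a byte buffer plus a cursor measured in bits. */
typedef struct {
    const uint8_t *data;
    size_t size_bits;   /* total number of bits available */
    size_t pos;         /* index of the next bit to read */
} BitReader;

/* Read 'count' bits (0..32), MSB-first, and advance the cursor. */
uint32_t br_read(BitReader *br, unsigned count)
{
    uint32_t v = 0;
    assert(br != NULL && count <= 32 && br->pos + count <= br->size_bits);
    while (count--) {
        uint8_t byte = br->data[br->pos / 8];
        unsigned bit = (byte >> (7 - br->pos % 8)) & 1u;
        v = (v << 1) | bit;   /* earlier bits end up more significant */
        br->pos++;
    }
    return v;
}
```

With the array from the question, reading 4, then 12, then 8 bits yields 0x1, 0x301 and 0x4E. Consecutive fields are pulled off in order, so no field needs its own hand-computed offset.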

Efficient algorithm for finding a byte in a bit array

Given a bytearray uint8_t data[N] what is an efficient method to find a byte uint8_t search within it even if search is not octet aligned? i.e. the first three bits of search could be in data[i] and the next 5 bits in data[i+1].
My current method involves creating a bool get_bit(const uint8_t* src, struct internal_state* state) function (struct internal_state contains a mask that is bitshifted right, &ed with src and returned, maintaining size_t src_index < size_t src_len) , leftshifting the returned bits into a uint8_t my_register and comparing it with search every time, and using state->src_index and state->src_mask to get the position of the matched byte.
Is there a better method for this?

If you're searching for an eight-bit pattern within a large array, you can implement a sliding window over 16-bit values to check whether the pattern is part of the two bytes forming that 16-bit value.
To be portable you have to take care of endianness issues. My implementation does this by building the 16-bit value manually: the high byte is always the currently iterated byte and the low byte is the following byte. If you do a simple conversion like value = *((unsigned short *)pData) you will run into trouble on x86 processors...
Once value, cmp and mask are set up, cmp and mask are shifted. If the pattern was not found starting within the high byte, the loop continues by checking the next byte as the start byte.
Here is my implementation including some debug printouts (the function returns the bit position or -1 if pattern was not found):
int findPattern(unsigned char *data, int size, unsigned char pattern)
{
int result = -1;
unsigned char *pData;
unsigned char *pEnd;
unsigned short value;
unsigned short mask;
unsigned short cmp;
int tmpResult;
if ((data != NULL) && (size > 0))
{
pData = data;
pEnd = data + size;
while ((pData < pEnd) && (result == -1))
{
if ((pData + 1) < pEnd) /* still at least two bytes to check? */
{
printf("\n\npData = {%02x, %02x, ...};\n", pData[0], pData[1]); /* pData[1] is only valid here */
tmpResult = (int)(pData - data) * 8; /* calculate bit offset according to current byte */
/* avoid endianness troubles by "manually" building value! */
value = *pData << 8;
pData++;
value += *pData;
/* create a sliding window to check if search pattern is within value */
cmp = pattern << 8;
mask = 0xFF00;
while (mask > 0x00FF) /* the low byte is checked within next iteration! */
{
printf("cmp = %04x, mask = %04x, tmpResult = %d\n", cmp, mask, tmpResult);
if ((value & mask) == cmp)
{
result = tmpResult;
break;
}
tmpResult++; /* count bits! */
mask >>= 1;
cmp >>= 1;
}
}
else
{
/* only one chance left if there is only one byte left to check! */
if (*pData == pattern)
{
result = (int)(pData - data) * 8;
}
pData++;
}
}
}
return (result);
}

I don't think you can do much better than this in C:
/*
* Searches for the 8-bit pattern represented by 'needle' in the bit array
* represented by 'haystack'.
*
* Returns the index *in bits* of the first appearance of 'needle', or
* -1 if 'needle' is not found.
*/
int search(uint8_t needle, int num_bytes, uint8_t haystack[num_bytes]) {
if (num_bytes > 0) {
uint16_t window = haystack[0];
if (window == needle) return 0;
for (int i = 1; i < num_bytes; i += 1) {
window = (uint16_t)((window << 8) | haystack[i]);
/* Candidate for unrolling: */
for (int j = 7; j >= 0; j -= 1) {
if (((window >> j) & 0xff) == needle) {
return 8 * i - j;
}
}
}
}
return -1;
}
The main idea is to handle the 87.5% of cases that cross the boundary between consecutive bytes by pairing bytes in a wider data type (uint16_t in this case). You could adjust it to use an even wider data type, but I'm not sure that would gain anything.
What you cannot safely or easily do is anything involving casting part or all of your array to a wider integer type via a pointer (i.e. (uint16_t *)&haystack[i]). You have no guarantee of proper alignment for such a cast, nor of the byte order with which the result might be interpreted.

I don't know if it would be better, but I would use a sliding window.
/* Bits are numbered LSB-first within each byte; n is the number of bytes in data and must be at least 1 */
int findByte(const uint8_t *data, size_t n, uint8_t search)
{
size_t counter = 0;
unsigned feeder = 8; /* number of valid bits currently held in window */
unsigned window = data[0];
while (search ^ (window & 0xff)){
window >>= 1;
feeder--;
if (feeder < 8){
counter++;
if (counter >= n) {
feeder = 0;
break;
}
window |= (unsigned)data[counter] << feeder;
feeder += 8;
}
}
//Returns index of first bit of first sequence occurrence or -1 if sequence is not found
return (feeder > 0) ? (int)((counter+1)*8-feeder) : -1;
}
Also, with some alterations you can use this method to search for a bit sequence of arbitrary length (1 to 64-array_element_size_in_bits).

If AVX2 is acceptable (with earlier versions it didn't work out so well, but you can still do something there), you can search in a lot of places at the same time. I couldn't test this on my machine (only compile it), so the following is meant to give you an idea of how it could be approached rather than to be copy-and-paste code, and I'll try to explain it rather than just code-dump.
The main idea is to read an uint64_t, shift it right by all values that make sense (0 through 7), then for each of those 8 new uint64_t's, test whether the byte is in there. Small complication: for the uint64_t's shifted by more than 0, the highest position should not be counted since it has zeroes shifted into it that might not be in the actual data. Once this is done, the next uint64_t should be read at an offset of 7 from the current one, otherwise there is a boundary that is not checked across. That's fine though, unaligned loads aren't so bad anymore, especially if they're not wide.
So now for some (untested, and incomplete, see below) code,
__m256i needle = _mm256_set1_epi8(find);
size_t i;
for (i = 0; i < n - 6; i += 7) {
// unaligned load here, but that's OK
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
// in the qword right-shifted by 0, all positions are valid
// otherwise, the top position corresponds to an incomplete byte
uint32_t lowmask = 0x7f7f7fffu & _mm256_movemask_epi8(low);
uint32_t highmask = 0x7f7f7f7fu & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
int bitindex = __builtin_ctzll(mask); /* zero-based index of the lowest set bit */
// the bit-index and byte-index are swapped
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}
The funny "bit-index and byte-index are swapped" thing is because searching within a qword is done byte by byte and the results of those comparisons end up in 8 adjacent bits, while the search for "shifted by 1" ends up in the next 8 bits and so on. So in the resulting masks, the index of the byte that contains the 1 is a bit-offset, but the bit-index within that byte is actually the byte-offset, for example 0x8000 would correspond to finding the byte at the 7th byte of the qword that was right-shifted by 1, so the actual index is 8*7+1.
There is also the issue of the "tail", the part of the data left over when all blocks of 7 bytes have been processed. It can be done much the same way, but now more positions contain bogus bytes. Now n - i bytes are left over, so the mask has to have n - i bits set in the lowest byte, and one fewer for all other bytes (for the same reason as earlier, the other positions have zeroes shifted in). Also, if there is exactly 1 byte "left", it isn't really left because it would have been tested already, but that doesn't really matter. I'll assume the data is sufficiently padded that accessing out of bounds doesn't matter. Here it is, untested:
if (i < n - 1) {
// make n-i-1 bits, then copy them to every byte
uint32_t validh = ((1u << (n - i - 1)) - 1) * 0x01010101;
// the lowest position has an extra valid bit, set lowest zero
uint32_t validl = (validh + 1) | validh;
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
uint32_t lowmask = validl & _mm256_movemask_epi8(low);
uint32_t highmask = validh & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
int bitindex = __builtin_ctzll(mask); /* zero-based index of the lowest set bit */
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}
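Since the intrinsics version is untested, a scalar model of the same scheme may help to cross-check the index arithmetic. This is a sketch only, not a drop-in equivalent: find_byte_scalar is a hypothetical name, the qword is assembled byte by byte so that no out-of-bounds read or little-endian host is assumed, and positions are numbered LSB-first as in the shifted-qword approach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the bit position (LSB-first numbering) of the first occurrence of
 * 'find' at any bit offset, or -1 if it does not occur. */
long find_byte_scalar(const uint8_t *data, size_t n, uint8_t find)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i += 7) {            /* blocks overlap by one byte */
        size_t avail = n - i < 8 ? n - i : 8;      /* bytes present in this block */
        uint64_t d = 0;
        for (size_t k = 0; k < avail; k++)
            d |= (uint64_t)data[i + k] << (8 * k);
        for (size_t p = 0; p < avail; p++) {       /* byte offset inside the qword */
            for (unsigned s = 0; s < 8; s++) {     /* right shift applied */
                /* a shifted candidate borrows bits from byte p + 1 */
                if (s > 0 && p + 1 >= avail)
                    break;
                if ((uint8_t)(d >> (8 * p + s)) == find)
                    return (long)(8 * (i + p) + s);
            }
        }
    }
    return -1;
}
```

The returned value has the same form as in the vector code, 8 * (i + byte offset) + shift, so results from the two versions can be compared directly on test data.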

If you are searching a large amount of memory and can afford an expensive setup, another approach is to use a 64K lookup table. For each possible 16-bit value, the table stores a byte containing the bit shift offset at which the matching octet occurs (+1, so 0 can indicate no match). You can initialize it like this:
static uint8_t g_pLookupTable[65536]; /* a malloc() call is not a valid file-scope initializer in C */
void initLUT(uint8_t octet)
{
memset(g_pLookupTable, 0, 65536); // zero out
for(int i = 0; i < 65536; i++)
{
for(int j = 7; j >= 0; j--)
{
if(((i >> j) & 255) == octet)
{
g_pLookupTable[i] = j + 1;
break;
}
}
}
}
Note that the case where the value is shifted 8 bits is not included (the reason will be obvious in a minute).
Then you can scan through your array of bytes like this:
int findByteMatch(uint8_t* pArray, uint8_t octet, int length)
{
if(length > 0)
{
uint16_t index = (uint16_t)pArray[0];
if(index == octet)
return 0;
for(int bit, i = 1; i < length; i++)
{
index = (index << 8) | pArray[i];
if(bit = g_pLookupTable[index])
return (i * 8) - (bit - 1);
}
}
return -1;
}
Further optimization:
Read 32 or however many bits at a time from pArray into a uint32_t and then shift and AND each to get byte one at a time, OR with index and test, before reading another 4.
Pack the LUT into 32K by storing a nybble for each index. This might help it squeeze into the cache on some systems.
It will depend on your memory architecture whether this is faster than an unrolled loop that doesn't use a lookup table.
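A minimal sketch of the nybble-packed variant mentioned above. The names g_packedLUT, initPackedLUT and packedLookup are made up for illustration; the stored values are the same 0..8 codes as in the 64K table, two per byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two 4-bit entries per byte: 65536 entries fit in 32768 bytes. */
static uint8_t g_packedLUT[32768];

void initPackedLUT(uint8_t octet)
{
    memset(g_packedLUT, 0, sizeof g_packedLUT);
    for (uint32_t i = 0; i < 65536; i++) {
        for (int j = 7; j >= 0; j--) {
            if (((i >> j) & 255u) == octet) {
                uint8_t v = (uint8_t)(j + 1);      /* 1..8 fits in a nybble */
                /* even indices use the low nybble, odd indices the high one */
                g_packedLUT[i >> 1] |= (i & 1) ? (uint8_t)(v << 4) : v;
                break;
            }
        }
    }
}

/* Returns shift + 1 for a match inside the 16-bit value, or 0 for no match. */
uint8_t packedLookup(uint16_t index)
{
    uint8_t cell = g_packedLUT[index >> 1];
    assert(sizeof g_packedLUT == 32768);
    return (index & 1) ? (uint8_t)(cell >> 4) : (uint8_t)(cell & 0x0F);
}
```

The extra shift and mask on every lookup is the price for the smaller footprint, so this is worth measuring against the plain 64K table.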

Comparing arbitrary bit sequences in a byte array in c

I have a couple of uint8_t arrays in my C code, and I'd like to compare an arbitrary sequence of bits from one with another. So for example, I have bitarray_1 and bitarray_2, and I'd like to compare bits 13 - 47 from bitarray_1 with bits 5 - 39 of bitarray_2. What is the most efficient way to do this?
Currently it's a huge bottleneck in my program, since I just have a naive implementation that copies the bits into the beginning of a new temporary array, and then uses memcmp on them.

Three words: shift, mask and xor.
Shift to get the same memory alignment for both bit arrays; if the alignments differ, you will have to shift one of the arrays before comparing them. Your example is probably misleading, because bits 13-47 and 5-39 have the same alignment relative to 8-bit addresses. This wouldn't be true if you were comparing, say, bits 14-48 with bits 5-39.
Once everything is aligned and the excess bits at the boundaries are masked off, a xor is enough to compare all the bits at once. Basically you can manage to do it with just one memory read for each array, which should be pretty efficient.
If the memory alignment is the same for both arrays, as in your example, memcmp plus special cases for the upper and lower bounds is probably faster still.
Also, accessing the arrays as uint32_t (or uint64_t on 64-bit architectures) should be more efficient than accessing them as uint8_t.
The principle is simple but, as Andrejs said, the implementation is not painless...
Here is how it goes (similarities with @caf's proposal are no coincidence):
/* compare_bit_sequence() */
int compare_bit_sequence(uint8_t s1[], unsigned s1_off, uint8_t s2[], unsigned s2_off,
unsigned length)
{
const uint8_t mask_lo_bits[] =
{ 0x00, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff };
const uint8_t clear_lo_bits[] =
{ 0xff, 0xfe, 0xfc, 0xf8, 0xf0, 0xe0, 0xc0, 0x80, 0x00 };
uint8_t v1;
uint8_t * max_s1;
unsigned end;
uint8_t lsl;
uint8_t v1_mask;
int delta;
/* Makes sure the offsets are less than 8 bits */
s1 += s1_off >> 3;
s1_off &= 7;
s2 += s2_off >> 3;
s2_off &= 7;
/* Make sure s2 is the sequence with the shorter offset */
if (s2_off > s1_off){
uint8_t * tmp_s;
unsigned tmp_off;
tmp_s = s2; s2 = s1; s1 = tmp_s;
tmp_off = s2_off; s2_off = s1_off; s1_off = tmp_off;
}
delta = s1_off;
/* handle the beginning, s2 incomplete */
if (s2_off > 0){
delta = s1_off - s2_off;
v1 = delta
? (s1[0] >> delta | s1[1] << (8 - delta)) & clear_lo_bits[delta]
: s1[0];
if (length <= 8 - s2_off){
if ((v1 ^ *s2)
& clear_lo_bits[s2_off]
& mask_lo_bits[s2_off + length]){
return NOT_EQUAL;
}
else {
return EQUAL;
}
}
else{
if ((v1 ^ *s2) & clear_lo_bits[s2_off]){
return NOT_EQUAL;
}
length -= 8 - s2_off;
}
s1++;
s2++;
}
/* main loop, we test one group of 8 bits of v2 at each loop */
max_s1 = s1 + (length >> 3);
lsl = 8 - delta;
v1_mask = clear_lo_bits[delta];
while (s1 < max_s1)
{
uint8_t lo = *s1 >> delta; /* read the current byte before advancing s1 */
s1++;
if ((lo | (*s1 << lsl & v1_mask)) ^ *s2++)
{
return NOT_EQUAL;
}
}
/* last group of bits v2 incomplete */
end = length & 7;
if (end && ((*s2 ^ *s1 >> delta) & mask_lo_bits[end]))
{
return NOT_EQUAL;
}
return EQUAL;
}
Not all possible optimisations are used yet. A promising one would be to use larger chunks of data (64 or 32 bits at once instead of 8). You could also detect cases where the offsets are the same for both arrays and use memcmp instead of the main loop, replace modulos % 8 by logical operators & 7, replace '/ 8' by '>> 3', have two branches of code instead of swapping s1 and s2, etc. But the main purpose is achieved: only one memory read and no memory write for each array item, hence most of the work can take place inside processor registers.

bits 13 - 47 of bitarray_1 are the same as bits 5 - 39 of bitarray_1 + 1.
Compare the first 3 bits (5 - 7) with a mask and the other bits (8 - 39) with memcmp().
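A minimal sketch of that mask-plus-memcmp() idea for the case where both sequences start at the same bit offset inside their first byte. equal_same_alignment is a hypothetical name, and bits are assumed to be numbered from the least significant bit of each byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* a and b point at the bytes holding the first bit of each sequence;
 * off (0..7) is the shared bit offset inside that byte; len is in bits.
 * Returns 1 if the sequences are equal, 0 otherwise. */
int equal_same_alignment(const uint8_t *a, const uint8_t *b,
                         unsigned off, size_t len)
{
    assert(a != NULL && b != NULL && off < 8);
    if (len == 0)
        return 1;
    if (off != 0) {
        unsigned head = 8 - off;                 /* bits left in the first byte */
        unsigned take = len < head ? (unsigned)len : head;
        unsigned mask = ((1u << take) - 1u) << off;
        if ((a[0] ^ b[0]) & mask)
            return 0;
        a++;
        b++;
        len -= take;
    }
    if (memcmp(a, b, len / 8) != 0)              /* whole bytes in the middle */
        return 0;
    if (len % 8 != 0) {                          /* partial last byte */
        unsigned mask = (1u << (len % 8)) - 1u;
        return ((a[len / 8] ^ b[len / 8]) & mask) == 0;
    }
    return 1;
}
```

For the example in the question this would be called as equal_same_alignment(bitarray_1 + 1, bitarray_2, 5, 35).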
Rather than shift and copy the bits, maybe representing them differently is faster. You have to measure.
/* code skeleton */
static char bitarray_1_bis[BIT_ARRAY_SIZE*8+1];
static char bitarray_2_bis[BIT_ARRAY_SIZE*8+1];
static const char *lookup_table[] = {
"00000000", "00000001", "00000010" /* ... */
/* 256 strings */
/* ... */ "11111111"
};
/* copy every bit of bitarray_1 to an element of bitarray_1_bis */
for (k = 0; k < BIT_ARRAY_SIZE; k++) {
strcpy(bitarray_1_bis + 8*k, lookup_table[bitarray_1[k]]);
strcpy(bitarray_2_bis + 8*k, lookup_table[bitarray_2[k]]);
}
memcmp(bitarray_1_bis + 13, bitarray_2_bis + 5, 47 - 13 + 1);
You can (and should) limit the copy to the minimum possible.
I have no idea if it's faster, but it wouldn't surprise me if it was. Again, you have to measure.

The easiest way to do this is to convert the more complex case into a simpler case, then solve the simpler case.
In the following code, do_compare() solves the simpler case (where the sequences are never offset by more than 7 bits, s1 is always offset as much or more than s2, and the length of the sequence is non-zero). The compare_bit_sequence() function then takes care of converting the harder case to the easier case, and calls do_compare() to do the work.
This just does a single-pass through the bit sequences, so hopefully that's an improvement on your copy-and-memcmp implementation.
#define NOT_EQUAL 0
#define EQUAL 1
/* do_compare()
*
* Does the actual comparison, but has some preconditions on parameters to
* simplify things:
*
* length > 0
* 8 > s1_off >= s2_off
*/
int do_compare(const uint8_t s1[], const unsigned s1_off, const uint8_t s2[],
const unsigned s2_off, const unsigned length)
{
const uint8_t mask_lo_bits[] =
{ 0xff, 0x01, 0x03, 0x07, 0x0f, 0x1f, 0x3f, 0x7f, 0xff };
const uint8_t mask_hi_bits[] =
{ 0x00, 0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff };
const unsigned msb = (length + s1_off - 1) / 8;
const unsigned s2_shl = s1_off - s2_off;
const unsigned s2_shr = 8 - s2_shl;
unsigned n;
uint8_t s1s2_diff, lo_bits = 0;
for (n = 0; n <= msb; n++)
{
/* Shift s2 so it is aligned with s1, pulling in low bits from
* the high bits of the previous byte, and store in s1s2_diff */
s1s2_diff = lo_bits | (s2[n] << s2_shl);
/* Save the bits needed to fill in the low-order bits of the next
* byte. HERE BE DRAGONS - since s2_shr can be 8, this below line
* only works because uint8_t is promoted to int, and we know that
* the width of int is guaranteed to be >= 16. If you change this
* routine to work with a wider type than uint8_t, you will need
* to special-case this line so that if s2_shr is the width of the
* type, you get lo_bits = 0. Don't say you weren't warned. */
lo_bits = s2[n] >> s2_shr;
/* XOR with s1[n] to determine bits that differ between s1 and s2 */
s1s2_diff ^= s1[n];
/* Look only at differences in the high bits in the first byte */
if (n == 0)
s1s2_diff &= mask_hi_bits[8 - s1_off];
/* Look only at differences in the low bits of the last byte */
if (n == msb)
s1s2_diff &= mask_lo_bits[(length + s1_off) % 8];
if (s1s2_diff)
return NOT_EQUAL;
}
return EQUAL;
}
/* compare_bit_sequence()
*
* Adjusts the parameters to match the preconditions for do_compare(), then
* calls it to do the work.
*/
int compare_bit_sequence(const uint8_t s1[], unsigned s1_off,
const uint8_t s2[], unsigned s2_off, unsigned length)
{
/* Handle length zero */
if (length == 0)
return EQUAL;
/* Makes sure the offsets are less than 8 bits */
s1 += s1_off / 8;
s1_off %= 8;
s2 += s2_off / 8;
s2_off %= 8;
/* Make sure s2 is the sequence with the shorter offset */
if (s1_off >= s2_off)
return do_compare(s1, s1_off, s2, s2_off, length);
else
return do_compare(s2, s2_off, s1, s1_off, length);
}
To do the comparison in your example, you'd call:
compare_bit_sequence(bitarray_1, 13, bitarray_2, 5, 35)
(Note that I am numbering the bits from zero, and assuming that the bit arrays are laid out little-endian, so this will start the comparison from the sixth-least-significant bit in bitarray_2[0], and the sixth-least-significant bit in bitarray_1[1].)

What about writing a function that calculates the offsets into both arrays, applies the mask, shifts the bits and stores the result in an int so you can compare them? If the bit count (35 in your example) exceeds the width of an int, recurse or loop.
Sorry, writing the example will be a pain in the ass.
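A minimal and deliberately slow sketch of that idea, assuming bits are numbered from the least significant bit of each byte. get_bits and bits_equal are hypothetical names; get_bits gathers one bit at a time for clarity rather than speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Collect 'count' bits (1..32) starting at bit position 'pos' into an integer. */
static uint32_t get_bits(const uint8_t *buf, size_t pos, unsigned count)
{
    uint32_t v = 0;
    assert(buf != NULL && count >= 1 && count <= 32);
    for (unsigned k = 0; k < count; k++) {
        size_t bit = pos + k;
        v |= (uint32_t)((buf[bit / 8] >> (bit % 8)) & 1u) << k;
    }
    return v;
}

/* Compare 'length' bits of a (from a_pos) with b (from b_pos), looping in
 * chunks of at most 32 bits. Returns 1 if equal, 0 otherwise. */
int bits_equal(const uint8_t *a, size_t a_pos,
               const uint8_t *b, size_t b_pos, size_t length)
{
    while (length > 0) {
        unsigned chunk = length > 32 ? 32u : (unsigned)length;
        if (get_bits(a, a_pos, chunk) != get_bits(b, b_pos, chunk))
            return 0;
        a_pos += chunk;
        b_pos += chunk;
        length -= chunk;
    }
    return 1;
}
```

The example from the question would be bits_equal(bitarray_1, 13, bitarray_2, 5, 35). Unlike the memcmp-based approaches, this works for any pair of offsets, at the cost of touching every bit individually.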

Here is my unoptimized bit sequence comparison function:
#include <stdio.h>
#include <stdint.h>
// 01234567 01234567
uint8_t bitsA[] = { 0b01000000, 0b00010000 };
uint8_t bitsB[] = { 0b10000000, 0b00100000 };
int bit( uint8_t *bits, size_t bitpoz, size_t len ){
return (bitpoz<len)? !!(bits[bitpoz/8]&(1<<(7-bitpoz%8))): 0;
}
int bitcmp( uint8_t *bitsA, size_t firstA, size_t lenA,
uint8_t *bitsB, size_t firstB, size_t lenB ){
int cmp;
for( size_t i=0; i<lenA || i<lenB; i++ ){
if( (cmp = bit(bitsA,firstA+i,firstA+lenA) -
bit(bitsB,firstB+i,firstB+lenB)) ) return cmp;
}
return 0;
}
int main(){
printf( "cmp: %i\n", bitcmp( bitsA,1,11, bitsB,0,11 ) );
}
EDIT: Here is my (untested) bitstring equality test function:
#include <stdlib.h>
#include <stdint.h>
#define load_64bit(bits,first) (*(uint64_t*)bits<<first | *(bits+8)>>(8-first))
#define load_32bit(bits,first) (*(uint32_t*)bits<<first | *(bits+4)>>(8-first))
#define load_16bit(bits,first) (*(uint16_t*)bits<<first | *(bits+2)>>(8-first))
#define load_8bit( bits,first) ( *bits<<first | *(bits+1)>>(8-first))
static inline uint8_t last_bits( uint8_t *bits, size_t first, size_t size ){
return (first+size>8?load_8bit(bits,first):*bits<<first)>>(8-size);
}
int biteq( uint8_t *bitsA, size_t firstA,
uint8_t *bitsB, size_t firstB, size_t size ){
if( !size ) return 1;
bitsA+=firstA/8; firstA%=8;
bitsB+=firstB/8; firstB%=8;
for(; size>64;size-=64,bitsA+=8,bitsB+=8)
if(load_64bit(bitsA,firstA)!=load_64bit(bitsB,firstB)) return 0;
for(; size>32;size-=32,bitsA+=4,bitsB+=4)
if(load_32bit(bitsA,firstA)!=load_32bit(bitsB,firstB)) return 0;
for(; size>16;size-=16,bitsA+=2,bitsB+=2)
if(load_16bit(bitsA,firstA)!=load_16bit(bitsB,firstB)) return 0;
for(; size> 8;size-= 8,bitsA++, bitsB++ )
if(load_8bit( bitsA,firstA)!=load_8bit( bitsB,firstB)) return 0;
return !size ||
last_bits(bitsA,firstA,size)==last_bits(bitsB,firstB,size);
}
I made a simple measurement tool to see how fast it is:
#include <unistd.h>
#include <stdio.h>
#include <signal.h>
#define SIZE 1000000
uint8_t bitsC[SIZE];
volatile int end_loop;
void sigalrm_hnd( int sig ){ (void)sig; end_loop=1; }
int main(){
uint64_t loop_count; int cmp;
signal(SIGALRM,sigalrm_hnd);
loop_count=0; end_loop=0; alarm(10);
while( !end_loop ){
for( int i=1; i<7; i++ ){
loop_count++;
cmp = biteq( bitsC,i, bitsC,7-i,(SIZE-1)*8 );
if( !cmp ){ printf( "cmp: %i (==0)\n", cmp ); return -1; }
}
}
printf( "biteq: %.2f round/sec\n", loop_count/10.0 );
}
Result:
bitcmp: 8.40 round/sec
biteq: 363.60 round/sec
EDIT2: last_bits() changed.

Byte level length description

I have a protocol that requires a length field of up to 32 bits, which must be
generated at runtime to describe how many bytes are in a given packet.
The code below is kind of ugly, and I am wondering whether it can be refactored
to be slightly more efficient or easier to understand. The catch is that the
code must generate only as many bytes as are needed to describe the length of
the packet: a length of up to 255 takes 1 byte, up to 65535 takes 2 bytes,
and so on.
{
extern char byte_stream[];
int bytes = offset_in_packet;
int n = length_of_packet;
/* Under 4 billion, so this can be represented in 32 bits. */
int t;
/* 32-bit number used for temporary storage. */
/* These are the bytes we will break up n into. */
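unsigned char first, second, third, fourth;
int write_zeros = 0;
/* Set once a non-zero byte has been written, so later zero bytes are kept. */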
t = n & 0xFF000000;
/* We have used AND to "mask out" the first byte of the number. */
/* The only bits which can be on in t are the first 8 bits. */
first = t >> 24;
if (t) {
printf("byte 1: 0x%02x\n",first );
byte_stream[bytes] = first; bytes++;
write_zeros = 1;
}
/* The shift by 24 above scaled the masked value down to between 0 and 255. That was the first, highest byte of n. */
t = n & 0x00FF0000;
second = t >> 16;
if (t || write_zeros) {
printf("byte 2: 0x%02x\n", second );
byte_stream[bytes] = second; bytes++;
write_zeros = 1;
}
t = n & 0x0000FF00;
third = t >> 8;
if ( t || write_zeros) {
printf("byte 3: 0x%02x\n", third );
byte_stream[bytes] = third; bytes++;
write_zeros = 1;
}
t = n & 0x000000FF;
fourth = t;
if (t || write_zeros) {
printf("byte 4: 0x%02x\n", fourth);
byte_stream[bytes] = fourth; bytes++;
}
}

You should really use a fixed-width field for your length.
When the program on the receiving end has to read the length field of your packet, how does it know where the length stops?
If the length of a packet can potentially reach 4 GB, does a 1-3 byte overhead really matter?
Do you see how complex your code has already become?
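For comparison, a minimal sketch of the fixed-width alternative, assuming the length is always sent as 4 bytes with the most significant byte first. The names put_len32 and get_len32 are hypothetical and not part of the asker's protocol.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store n as exactly 4 bytes, most significant first.
   Returns the number of bytes written (always 4). */
size_t put_len32(unsigned char *out, uint32_t n)
{
    out[0] = (unsigned char)(n >> 24);
    out[1] = (unsigned char)(n >> 16);
    out[2] = (unsigned char)(n >> 8);
    out[3] = (unsigned char)n;
    return 4;
}

/* Hypothetical helper: read the 4-byte length back. */
uint32_t get_len32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
            (uint32_t)in[3];
}
```

The receiver always consumes four bytes, so it needs no extra state to know where the length field ends.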

Really you're only doing four calculations, so readability seems far more important here than efficiency. My approach to making something like this more readable is to:
Extract common code to a function.
Put similar calculations together to make the patterns more obvious.
Get rid of the intermediate variable write_zeros and be explicit about the cases in which you output bytes even if they're zero (i.e. a preceding byte was non-zero).
I've changed the random code block into a function and changed a few variables (underscores are giving me trouble in the markdown preview screen). I've also assumed that bytes is being passed in, and that whoever is passing it in will pass us a pointer so we can modify it.
Here's the code:
#include <stdio.h>

/* append byte b to stream, increment index */
/* really needs to check length of stream before appending */
void output( int i, unsigned char b, char stream[], unsigned int *index )
{
    printf("byte %d: 0x%02x\n", i, b);
    stream[(*index)++] = b;
}

void answer( char bytestream[], unsigned int *bytes, unsigned int n)
{
    /* mask out four bytes from word n */
    unsigned char first  = (n & 0xFF000000) >> 24;
    unsigned char second = (n & 0x00FF0000) >> 16;
    unsigned char third  = (n & 0x0000FF00) >> 8;
    unsigned char fourth = (n & 0x000000FF) >> 0;

    /* conditionally output each byte starting with the */
    /* first non-zero byte */
    if (first)
        output( 1, first, bytestream, bytes);
    if (first || second)
        output( 2, second, bytestream, bytes);
    if (first || second || third)
        output( 3, third, bytestream, bytes);
    if (first || second || third || fourth)
        output( 4, fourth, bytestream, bytes);
}
Ever so slightly more efficient, and maybe easier to understand, would be this modification to the last four if statements (note that it always writes the fourth byte, even when n is 0):
if (n>0x00FFFFFF)
output( 1, first, bytestream, bytes);
if (n>0x0000FFFF)
output( 2, second, bytestream, bytes);
if (n>0x000000FF)
output( 3, third, bytestream, bytes);
if (1)
output( 4, fourth, bytestream, bytes);
I agree, however, that compressing this field makes the receiving state machine overly complicated. But if you can't change the protocol, this code is much easier to read.
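To illustrate the receiving side if the protocol cannot change, here is a sketch of a decoder. It assumes the receiver learns the number of length bytes (count, 0 to 4) from somewhere else in the protocol, which the question does not specify. decode_len is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rebuild the length from `count` bytes,
   most significant byte first. At most 4 bytes are consumed. */
uint32_t decode_len(const unsigned char *in, size_t count)
{
    uint32_t n = 0;
    size_t i;

    for (i = 0; i < count && i < 4; i++)
        n = (n << 8) | in[i];   /* shift in one byte at a time */

    return n;
}
```

The decoding itself is trivial; the hard part is knowing count, which is the extra state a fixed-width field avoids.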

Try this loop:
{
    extern char byte_stream[];
    int bytes = offset_in_packet;
    unsigned int n = length_of_packet; /* Under 4 billion, so this can be represented in 32 bits. */
    unsigned int t; /* 32-bit number used for temporary storage. */
    int i;
    int write_zeros = 0; /* set once the first non-zero byte has been written */
    unsigned char curByte;
    for (i = 0; i < 4; i++) {
        /* The mask moves down one byte (8 bits) per iteration. */
        t = n & (0xFF000000u >> (i * 8));
        curByte = t >> (24 - (i * 8));
        if (t || write_zeros) {
            printf("byte %d: 0x%02x\n", i + 1, curByte );
            byte_stream[bytes] = curByte;
            bytes++;
            write_zeros = 1;
        }
    }
}

I'm not sure I understand your question. What exactly are you trying to count? If I understand correctly, you're trying to find the most significant non-zero byte.
You're probably better off using a loop like this:
int i;
int write_zeros = 0;
for (i = 3; i >=0 ; --i) {
t = (n >> (8 * i)) & 0xff;
if (t || write_zeros) {
write_zeros = 1;
printf ("byte %d : 0x%02x\n", 4-i, t);
byte_stream[bytes++] = t;
}
}
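The same loop wrapped into a standalone function, as a sketch. The name encode_len, the uint32_t parameter, and the returned byte count are assumptions for illustration; the original snippet relies on t, n, bytes and byte_stream from the question instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write n using as few bytes as possible, most
   significant byte first. `out` must have room for 4 bytes.
   Returns the number of bytes written (0 when n is 0, as in the loop above). */
size_t encode_len(unsigned char *out, uint32_t n)
{
    size_t count = 0;
    int i;

    for (i = 3; i >= 0; --i) {
        unsigned char b = (unsigned char)((n >> (8 * i)) & 0xFF);
        /* Skip leading zero bytes; once one byte is written, keep all the rest. */
        if (b || count)
            out[count++] = b;
    }
    return count;
}
```

Here the running count doubles as the write_zeros flag: it is non-zero exactly when a byte has already been written.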
