Is it possible to optimize this nested loop?

I am working on a project in which I want to convert a given video input stream into block sections (so it can be used by a hardware codec). The project runs on an STM32 microcontroller with a 200 MHz clock.
The received input is a YCbCr 4:2:2 progressive stream, which basically means the input stream looks like this for every row:
Size:      32 bit word    32 bit word    32 bit word    ...
Component: Cr Y1 Cb Y0    Cr Y1 Cb Y0    Cr Y1 Cb Y0    ...
Bits:      8  8  8  8     8  8  8  8     8  8  8  8     ...
This stream needs to be converted into a block format used by a hardware codec. The codec accepts a byte array in a specific order. Currently I am doing this with a nested loop for every 1/8 of an image frame, using lookup tables and writing into an empty array:
Defines:
#define ROWS_PER_MCU 8
#define WORDS_PER_MCU 8
#define HORIZONTAL_MCU_PER_INPUTBUFFER 40
#define VERTICAL_MCU_PER_INPUTBUFFER 8
Global variables are declared like this:
typedef struct jpegInputbufferLUT
{
uint8_t JPEG_Y_MCU_LUT[256];
uint8_t JPEG_Cb_MCU_422_LUT[256];
uint8_t JPEG_Cr_MCU_422_LUT[256];
}jpegIndexLUT;
jpegIndexLUT jpegInputLUT;
uint8_t jpegInBuffer[81920];
uint32_t rawBuffer[20480];
Look up tables are created like this:
void JPEG_Init_MCU_LUT(void)
{
uint32_t offset;
/*Y LUT */
for(uint32_t i = 0; i < 16; i++)
{
for(uint32_t j = 0; j < 16; j++)
{
offset = j + (i*8);
if((j>=8) && (i>=8)) offset+= 120;
else if((j>=8) && (i<8)) offset+= 56;
else if((j<8) && (i>=8)) offset+= 64;
jpegInputLUT.JPEG_Y_MCU_LUT[i*16 + j] = offset;
}
}
/*Cb Cr LUT*/
for(uint32_t i = 0; i < 16; i++)
{
for(uint32_t j = 0; j < 16; j++)
{
offset = i*16 + j;
jpegInputLUT.JPEG_Cb_MCU_422_LUT[offset] = (j/2) + (i*8) + 128;
jpegInputLUT.JPEG_Cr_MCU_422_LUT[offset] = (j/2) + (i*8) + 192;
}
}
}
Conversion code:
/* Initialize variables for array conversion */
uint32_t currentMCU = 0;
uint32_t lutOffset = 0;
uint32_t inputOffset = 0;
uint32_t verticalOffset = 0;
/* Convert X rows into MCU blocks for JPEG encoding */
for(uint8_t k = 0; k < VERTICAL_MCU_PER_INPUTBUFFER; k++)
{
for(uint8_t n = 0; n < HORIZONTAL_MCU_PER_INPUTBUFFER; n++)
{
inputOffset = verticalOffset + (n * 8);
lutOffset = 0;
for(uint8_t i = 0; i < ROWS_PER_MCU; i++)
{
for(uint8_t j = 0; j < WORDS_PER_MCU; j++)
{
/* Mask 32 bit according to DCMI input format */
uint32_t rawBufferAddress = inputOffset+j; // Calculate rawBuffer address here so it only has to be calculated once
jpegInBuffer[jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset] + currentMCU] = (rawBuffer[rawBufferAddress] & 0xFF);
jpegInBuffer[jpegInputLUT.JPEG_Cb_MCU_422_LUT[lutOffset] + currentMCU] = ((rawBuffer[rawBufferAddress] >> 8) & 0xFF);
jpegInBuffer[jpegInputLUT.JPEG_Cr_MCU_422_LUT[lutOffset] + currentMCU] = ((rawBuffer[rawBufferAddress] >> 24) & 0xFF);
jpegInBuffer[jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset+1] + currentMCU] = ((rawBuffer[rawBufferAddress] >> 16) & 0xFF);
lutOffset+=2;
}
inputOffset += 320;
}
currentMCU += 256;
}
verticalOffset += 2560; /* 8 rows of 320 words per row of MCUs */
}
This conversion is currently taking me about 8 ms, and it needs to be done 8 times per frame. This takes up almost all of my available execution time, since I am trying to get 15 fps out of my system.
Is it in any way possible to speed this up? I was thinking maybe sorting the input array instead of just writing into a new buffer, but would swapping 2 elements in an array have a faster execution time than copying values into another array?
Would love to hear your ideas/thoughts on this,
Thanks in advance!

Your program seems to run slower than expected from an STM32. You may need to look into what assembly is produced, compiler optimization settings, if MCU frequency is correct, if memory is too slow, etc. We don't have enough information to give a definite answer why. Your code seems to spend 8 ms * 200M / (8*8*8*40) = 78 cycles for each inner loop iteration. For reference, an stm32f723 only needs about 15 cycles, and an stm32f103 about 28 cycles (the code was adjusted to access smaller arrays in the latter case).
The LUT table is not needed as its content is very regular. Reading LUT values adds more memory reads, which may be a significant contribution. If I got your LUT generation code correctly, it produces the following numbers in the inner loop:
Y1   Cb   Cr   Y2
0    128  192  1
2    129  193  3
4    130  194  5
6    131  195  7
64   132  196  65
66   133  197  67
68   134  198  69
70   135  199  71
8    136  200  9
etc
The second and third columns are just consecutive numbers. The fourth column equals the first one plus one. And the first number needs a bit flip. You can try the following code (please check that it is correct):
uint32_t lutOffset = 0;
for(uint8_t i = 0; i < ROWS_PER_MCU; i++)
{
for(uint8_t j = 0; j < WORDS_PER_MCU; j++)
{
uint32_t rawBufferAddress = (inputOffset+j) /* % 2048 */;
#if 0
unsigned y_lut1 = jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset];
unsigned Cb_lut = jpegInputLUT.JPEG_Cb_MCU_422_LUT[lutOffset];
unsigned Cr_lut = jpegInputLUT.JPEG_Cr_MCU_422_LUT[lutOffset];
unsigned y_lut2 = jpegInputLUT.JPEG_Y_MCU_LUT[lutOffset+1];
#else
unsigned y_lut1 = lutOffset | (j / 4) << 6 | (j % 4) << 1;
unsigned Cb_lut = 128 + lutOffset + j;
unsigned Cr_lut = 192 + lutOffset + j;
unsigned y_lut2 = y_lut1 + 1;
#endif
jpegInBuffer[y_lut1 + currentMCU] = (rawBuffer[rawBufferAddress] & 0xFF);
jpegInBuffer[Cb_lut + currentMCU] = ((rawBuffer[rawBufferAddress] >> 8) & 0xFF);
jpegInBuffer[Cr_lut + currentMCU] = ((rawBuffer[rawBufferAddress] >> 24) & 0xFF);
jpegInBuffer[y_lut2 + currentMCU] = ((rawBuffer[rawBufferAddress] >> 16) & 0xFF);
}
lutOffset += 8;
inputOffset += 320;
}
This version takes about 20 cycles per iteration on my stm32f103, which is less than 6 ms even at its 72 MHz.
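Since the answer asks for the index arithmetic to be checked, here is a minimal sketch of one way to do that on a PC. It rebuilds the tables the way the question's JPEG_Init_MCU_LUT() does and compares them with the closed-form expressions for every position of one MCU. The names build_luts and count_index_mismatches are hypothetical helpers, not part of the original code.

```c
#include <stdint.h>

/* Tables built the same way as JPEG_Init_MCU_LUT() in the question. */
static void build_luts(uint8_t y[256], uint8_t cb[256], uint8_t cr[256])
{
    for (uint32_t i = 0; i < 16; i++) {
        for (uint32_t j = 0; j < 16; j++) {
            uint32_t offset = j + i * 8;
            if (j >= 8 && i >= 8)      offset += 120;
            else if (j >= 8 && i < 8)  offset += 56;
            else if (j < 8 && i >= 8)  offset += 64;
            y[i * 16 + j]  = (uint8_t)offset;
            cb[i * 16 + j] = (uint8_t)(j / 2 + i * 8 + 128);
            cr[i * 16 + j] = (uint8_t)(j / 2 + i * 8 + 192);
        }
    }
}

/* Hypothetical helper: counts positions where the closed-form indices
   disagree with the tables. 0 means they agree everywhere in one MCU. */
int count_index_mismatches(void)
{
    uint8_t y[256], cb[256], cr[256];
    int mismatches = 0;

    build_luts(y, cb, cr);
    for (uint32_t i = 0; i < 8; i++) {          /* row inside the MCU */
        for (uint32_t j = 0; j < 8; j++) {      /* 32-bit word inside the row */
            uint32_t lut_index = i * 16 + j * 2; /* lutOffset in the question's loop */
            uint32_t row_base  = i * 8;          /* lutOffset in the rewritten loop */
            uint32_t y1 = row_base | ((j / 4) << 6) | ((j % 4) << 1);

            if (y1 != y[lut_index])                    mismatches++;
            if (y1 + 1 != y[lut_index + 1])            mismatches++;
            if (128 + row_base + j != cb[lut_index])   mismatches++;
            if (192 + row_base + j != cr[lut_index])   mismatches++;
        }
    }
    return mismatches;
}
```

A return value of 0 indicates that, for the 8 rows by 8 words visited per MCU, the computed indices and the table entries are interchangeable.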
UPD. Another option is using one small lookup table instead of bit computations:
static const unsigned x[8] = { 0, 2, 4, 6, 64, 66, 68, 70 };
// unsigned y_lut1 = lutOffset | (j / 4) << 6 | (j % 4) << 1;
unsigned y_lut1 = lutOffset + x[j];
This improves the inner loop timing to 18 (f103) / 7.5 (f723) cycles. For some reason, optimizing this expression for F723 does not work well. I would expect these options to give identical result since the inner loop is unrolled, but who knows.
As an additional optimization, probably not necessary, the output values can be combined into 32-bit words and written one word at a time. This seems possible because the output indices come in blocks of four consecutive values. For this, the inner loop can be converted to a nested loop of 2 by 4 iterations. Each 4 iterations of the innermost loop will produce one uint32_t for Cb, one uint32_t for Cr and two uint32_t for Y. It is probably not worth doing, though.
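For illustration only, the word-packing idea could look roughly like the sketch below. It assumes the byte layout described in the question (Y0 in bits 0-7, Cb in bits 8-15, Y1 in bits 16-23, Cr in bits 24-31) and a little-endian target such as a Cortex-M. The function name convert_mcu_packed and its parameters are hypothetical, and whether it is actually faster would have to be measured.

```c
#include <stdint.h>
#include <string.h>

/* Y0 and Y1 of one input word as two consecutive output bytes (Y0 first). */
static uint32_t y_pair(uint32_t w)
{
    return (w & 0xFFu) | ((w >> 8) & 0xFF00u);
}

/* Convert one 16x8 MCU. raw points at the first word of the MCU's top row,
   stride is the row pitch in 32-bit words (320 in the question), and mcu
   receives 256 bytes laid out as Y block 0, Y block 1, Cb, Cr.
   Assumes a little-endian CPU (byte 0 of a word is its low 8 bits). */
void convert_mcu_packed(const uint32_t *raw, uint32_t stride, uint8_t *mcu)
{
    for (uint32_t row = 0; row < 8; row++) {
        for (uint32_t half = 0; half < 2; half++) {
            const uint32_t *w = raw + row * stride + half * 4;
            uint32_t y[2], cb, cr;

            /* Four input words give 8 Y bytes, 4 Cb bytes and 4 Cr bytes. */
            y[0] = y_pair(w[0]) | (y_pair(w[1]) << 16);
            y[1] = y_pair(w[2]) | (y_pair(w[3]) << 16);
            cb = ((w[0] >> 8) & 0xFFu) | (w[1] & 0xFF00u)
               | ((w[2] << 8) & 0xFF0000u) | ((w[3] << 16) & 0xFF000000u);
            cr = (w[0] >> 24) | ((w[1] >> 16) & 0xFF00u)
               | ((w[2] >> 8) & 0xFF0000u) | (w[3] & 0xFF000000u);

            /* memcpy keeps the stores free of alignment and aliasing issues. */
            memcpy(mcu + row * 8 + half * 64, y, 8);
            memcpy(mcu + 128 + row * 8 + half * 4, &cb, 4);
            memcpy(mcu + 192 + row * 8 + half * 4, &cr, 4);
        }
    }
}
```

This replaces 16 byte stores per four input words with three memcpy calls, which an optimising compiler would normally turn into plain word stores.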
I measure run time with SysTick:
SysTick->LOAD = SysTick_LOAD_RELOAD_Msk;
SysTick->VAL = 0;
SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk | SysTick_CTRL_ENABLE_Msk;
volatile unsigned t0 = SysTick->VAL;
f();
volatile unsigned t1 = t0 - SysTick->VAL;
I used output pins sometimes too, when connecting a debugger is not practical. Strictly speaking, both methods are not guaranteed to work because the compiler may move code across measurement points, but it has worked as intended for me (with gcc). Assembly inspection is needed to make sure that nothing fishy is going on.

There are any number of micro-optimisations that could be performed here that could provide an improvement. Some may show an improvement in a debug build without compiler optimisation, only to have no advantage with optimisation. It is even possible that some "clever" trick that is faster in debug, if non-idiomatic, could cause the optimiser to generate worse code than it would have had you favoured clarity over performance.
All the obvious micro-optimisations such as loop unrolling the compiler optimiser will likely be able to perform for you without complicating the code or risking introducing errors.
One rather obvious improvement (regardless of whether or not it is faster), would be to change:
for( uint8_t j = 0; j < WORDS_PER_MCU; j++ )
{
/* Mask 32 bit according to DCMI input format */
uint32_t rawBufferAddress = inputOffset+j; // Calculate rawBuffer address here so it only has to be calculated once
...
to:
uint32_t rawBufferAddress = inputOffset ;
for( uint8_t j = 0; j < WORDS_PER_MCU; rawBufferAddress++, j++)
{
/* Mask 32 bit according to DCMI input format */
...
Your "only has to be calculated once" is actually WORDS_PER_MCU calculations, and an increment is likely to be faster than an addition and assignment. At worst it will be no different.
I would similarly suggest moving all the other "end of loop" increments, such as lutOffset+=2, into the third expression of the respective for statement. Not for performance, but for clarity.

Related

How to sum values in a sequence of bytes in C

I am trying to figure out how to add sequential bytes in a data block, starting at a given offset (sequenceOffset) and running for sequenceLength bytes, by typecasting them to signed 16 bit integers (int16_t). The numbers can be negative and positive. I feel like I am not incrementing the offset properly but cannot figure out how it is meant to be done.
For example:
Summing sequence of 8 bytes at offset 53:
57 AB 2A 2C 4E A4 7A 64
-21673 11306 -23474 25722
You said the sum is: 22848
Should be: -8119
int16_t sumSequence16(const uint8_t* const blockAddress, uint32_t blockLength, uint32_t sequenceOffset,
uint8_t sequenceLength) {
int count = 0;
for (int i = 0; i < blockLength; i++) {
if (*(blockAddress + i) == sequenceOffset) {
count += (int16_t*)(&sequenceOffset);
sequenceOffset++;
}
}
return count;
}

There are some serious problems with your code.
Start by noticing that your code doesn't use sequenceLength at all - that's strange.
Then there is no need to loop over the whole block - you only need to look at the bytes inside the relevant sequence.
This line is very strange:
if (*(blockAddress + i) == sequenceOffset)
^^^^^^^^^^^^^^^^^^^
Reads the data at index i
It compares a data value inside the data block with the sequenceOffset - that doesn't seem correct.
And this part:
(int16_t*)(&sequenceOffset);
is actually a violation of the strict aliasing rule.
Finally, you never mention which endianness the data is stored with. From your example it seems to be little endian so I'll use little endian in the code below:
int16_t sumSequence16(const uint8_t* const blockAddress,
                      const uint32_t sequenceOffset,
                      const uint8_t sequenceLength)
{
    const uint8_t* p = blockAddress + sequenceOffset; // Point to first byte in sequence
    int sum = 0;
    for (uint8_t i = 0; i + 1 < sequenceLength; i += 2) // Stop before a dangling odd byte
    {
        uint16_t u = (uint16_t)((p[i+1] << 8) | p[i]); // MSB in the high byte, LSB in the low byte
        int16_t t = (int16_t)u;                        // Reinterpret as a signed 16 bit value
        sum = sum + t;                                 // Update the running sum
    }
    return (int16_t)sum; // Truncated to 16 bits to match the original return type
}
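As a cross-check against the numbers in the question, the sketch below sums the same eight bytes and should give -8119. It is a variant written for illustration, not the answerer's exact code: the name sum_le16, the size_t parameters and the 32-bit return type (to avoid truncating larger sums) are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum `length` bytes starting at `offset`, read as little-endian signed
   16-bit values. A trailing odd byte is ignored. */
int32_t sum_le16(const uint8_t *block, size_t offset, size_t length)
{
    const uint8_t *p = block + offset;
    int32_t sum = 0;

    for (size_t i = 0; i + 1 < length; i += 2) {
        int32_t raw = p[i] | (p[i + 1] << 8);   /* 0..65535 */
        if (raw >= 0x8000)
            raw -= 0x10000;                     /* two's complement sign */
        sum += raw;
    }
    return sum;
}
```

With the bytes 57 AB 2A 2C 4E A4 7A 64 placed at offset 53 of a block, sum_le16(block, 53, 8) adds -21673, 11306, -23474 and 25722.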

2-bit mapping using bitwise operations in C

This is my first question, so I hope to do this right.
I have a problem where I have to map a key which can be in the range (0, 1, 2) to select a value from the same range (0, 1, 2). I have to repeat this millions of times and I was trying to implement this by using bitwise operations in C, without success.
So let's say I have 16 keys in the range (0, 1, 2) which I want to map to 16 values in the same range by using the following rules:
0 -> 2
1 -> 1
2 -> 1
I can represent the array of 16 keys as 16 2-bit pairs in a 32bit unsigned int. For instance:
0, 1, 2, 1, 2, 0, ... //Original array of keys
00 01 10 01 10 00 ... //2-bit pairs representation of keys in a 32bit int
and I am interested in transforming the unsigned int, following the rules above (i.e. the 2-bit pairs have to be transformed following the rules: 00->10, 01->01, and 10->01), so that I end up with a 32bit unsigned int like:
10 01 01 01 01 10 ... //2-bit pairs transformed using the given rule.
Is there a relatively fast bitwise procedure that would let me apply this transformation efficiently (given that the transformation rules can change)?
I hope I formulated my question clearly. Thanks for any help.
EDIT: I corrected some mistakes, and clarified some points following comments.
EDIT2: Following some suggestions, I add what I hope is a code example:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int i;
unsigned int keys[16];
unsigned int bitKeys = 0;
unsigned int mapping[3];
unsigned int result[16];
unsigned int bitResults = 0;
//Initialize random keys and mapping dict
for(i = 0; i<16; i++) {
keys[i] = rand() % 3;
bitKeys |= keys[i] << (2*i);
}
for(i = 0; i<3; i++)
mapping[i] = rand() % 3;
//Get results without using bitwise operations.
for(i = 0; i<16; i++) {
result[i] = mapping[ keys[i] ];
bitResults |= result[i] << (2*i);
}
//Would it be possible to get bitResults directly from bitKeys efficiently by using bitwise operations?
return 0;
}

This is essentially a problem of simplifying truth tables to minimal Boolean expressions; here we need two expressions, one for each output value bit.
BA QP
00 10
01 01
10 01
11 XX
B: high key bit, A: low key bit, Q: high value bit, P: low value bit
By using any of the many tools available (including our brain) for minimizing combinational logic circuits, we get the expressions
Q = ¬A·¬B
P = A + B
Now that we have the expressions, we can apply them to all keys in a 32-bit variable:
uint32_t keys = 2u<<30|0<<10|1<<8|2<<6|1<<4|2<<2|0; // for example
uint32_t vals = ~keys & ~keys<<1 & 0xAAAAAAAA // value_high is !key_high & !key_low
| (keys>>1 | keys) & 0x55555555; // value_low is key_high | key_low
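To see that the two Boolean expressions really implement 0->2, 1->1, 2->1 on all 16 pairs at once, they can be wrapped in a function and compared against a plain per-key loop. This is only a sketch; map_fixed and map_fixed_reference are hypothetical names introduced here.

```c
#include <stdint.h>

/* 16 packed 2-bit keys mapped with the fixed rule 0->2, 1->1, 2->1. */
uint32_t map_fixed(uint32_t keys)
{
    uint32_t high = ~keys & (~keys << 1) & 0xAAAAAAAAu; /* Q = !A & !B */
    uint32_t low  = (keys | (keys >> 1)) & 0x55555555u; /* P = A | B   */
    return high | low;
}

/* Reference: the same rule applied one key at a time. */
uint32_t map_fixed_reference(uint32_t keys)
{
    static const uint32_t rule[4] = { 2, 1, 1, 0 }; /* entry 3 is a don't-care */
    uint32_t out = 0;

    for (int i = 0; i < 16; i++)
        out |= rule[(keys >> (2 * i)) & 3u] << (2 * i);
    return out;
}
```

For any input that contains no 11 pair, both functions should return the same word. For example, the keys 0, 1, 2 followed by thirteen zeros (0x24) map to 0xAAAAAA96.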
I would need a solution for any arbitrary mapping.
Here's an example program for arbitrary mappings. For each of the two value bits, there are 2^3 = 8 possible expressions (the same set for both bits); these expressions are:
0 ¬A·¬B A ¬B B ¬A A+B 1
By concatenating the high and low mapping bits, respectively, for keys 0, 1 and 2, we get the index of the expression corresponding to the mapping function. In the following program, the values of all the expressions, even the ones unused by the mapping, are stored in the term array. While this may seem wasteful, it allows computation without branches, which may be a win in the end.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main()
{
int i;
unsigned mapping[3];
// generate example mapping
for (i = 0; i < 3; ++i) mapping[i] = rand() % 3, printf(" %d->%d", i, mapping[i]);
puts("");
// determine the mapping expression index 0..7 for high and low value bit
short h = mapping[0]/2 | mapping[1]/2<<1 | mapping[2]/2<<2;
short l = mapping[0]%2 | mapping[1]%2<<1 | mapping[2]%2<<2;
uint32_t keys = 0x1245689A; // for example
uint32_t b = keys, a = keys<<1;
uint32_t term[8] = { 0, ~a&~b, a, ~b, b, ~a, a|b, -1 }; // all possible terms
uint32_t vals = term[h] & 0xAAAAAAAA // value_high
| term[l]>>1 & 0x55555555; // value_low
printf("%8x\n%8x\n", keys, vals);
}
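The same table-of-terms idea can be put into a reusable function, which makes it easy to check against all 27 possible mappings. The sketch below is an illustration under that assumption; the name map_pairs and its signature are not from the original answer.

```c
#include <stdint.h>

/* Apply an arbitrary mapping {0,1,2} -> {0,1,2} to 16 packed 2-bit keys.
   mapping[k] is the value for key k. Pairs equal to 3 must not occur. */
uint32_t map_pairs(uint32_t keys, const unsigned mapping[3])
{
    /* Index of the expression for the high and the low value bit. */
    unsigned h = (mapping[0] >> 1) | ((mapping[1] >> 1) << 1) | ((mapping[2] >> 1) << 2);
    unsigned l = (mapping[0] & 1u) | ((mapping[1] & 1u) << 1) | ((mapping[2] & 1u) << 2);

    /* b holds each key's high bit, a holds its low bit moved to the high position. */
    uint32_t b = keys, a = keys << 1;
    const uint32_t term[8] = { 0, ~a & ~b, a, ~b, b, ~a, a | b, 0xFFFFFFFFu };

    return (term[h] & 0xAAAAAAAAu) | ((term[l] >> 1) & 0x55555555u);
}
```

All eight terms are computed regardless of which two are used, which keeps the function branch-free as described above.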

After thinking about it, and using some of the ideas from other answers, I think I found a general solution. It is based on first estimating the value assuming there are only the keys 10 and 01 (i.e. one bit of the pair determines the other) and then correcting for the key 00. An example code of the solution:
#include <stdio.h>
#include <stdlib.h>
void printBits(size_t const size, void const * const ptr)
{
unsigned char *b = (unsigned char*) ptr;
unsigned char byte;
int i, j;
for (i=size-1;i>=0;i--)
{
for (j=7;j>=0;j--)
{
byte = (b[i] >> j) & 1;
printf("%u", byte);
if(j%2 == 0) printf("|");
}
}
puts("");
}
void test2BitMapping(unsigned int * mapping)
{
int i;
unsigned int keys[16];
unsigned int bitKeys = 0;
unsigned int b = 0;
unsigned int c = 0;
unsigned int d = 0;
unsigned int expand[4] = {0x00000000u, 0x55555555u, 0xAAAAAAAAu, 0xFFFFFFFFu};
unsigned int v12 = 0;
unsigned int v0mask = 0;
unsigned int result[16];
unsigned int bitResults = 0;
unsigned int bitResultsTest = 0;
//Create mapping masks
b = ((1 & mapping[1]) | (2 & mapping[2]));
c = (2 & mapping[1]) | (1 & mapping[2]);
d = mapping[0];
b = expand[b];
c = expand[c];
d = expand[d];
//Initialize random keys
for(i = 0; i<16; i++) {
if(0) { //Test random keys
keys[i] = rand() % 3;
}
else { //Check all keys are generated
keys[i] = i % 3;
}
bitKeys |= keys[i] << (2*i);
}
//Get results without using bitwise operations.
for(i = 0; i<16; i++) {
result[i] = mapping[ keys[i] ];
bitResultsTest |= result[i] << (2*i);
}
//Get results by using bitwise operations.
v12 = ( bitKeys & b ) | ( (~bitKeys) & c );
v0mask = bitKeys | (((bitKeys & 0xAAAAAAAAu) >> 1) | ((bitKeys & 0x55555555u) << 1));
bitResults = ( d & (~v0mask) ) | ( v12 & v0mask );
//Check results
if(0) {
for(i = 0; i<3; i++) {
printf("%d -> %d, ", i, mapping[i]);
}
printf("\n");
printBits(sizeof(unsigned int), &bitKeys);
printBits(sizeof(unsigned int), &bitResults);
printBits(sizeof(unsigned int), &bitResultsTest);
printf("-------\n");
}
if(bitResults != bitResultsTest) {
printf("*********\nDifferent\n*********\n");
}
else {
printf("OK\n");
}
}
int main(void)
{
int i, j, k;
unsigned int mapping[3];
//Test using random mapping
for(k = 0; k < 1000; k++) {
for(i = 0; i<3; i++) {
mapping[i] = rand() % 3;
}
test2BitMapping(mapping);
}
//Test all possible mappings
for(i = 0; i<3; i++) {
for(j = 0; j<3; j++) {
for(k = 0; k<3; k++) {
mapping[0] = i;
mapping[1] = j;
mapping[2] = k;
test2BitMapping(mapping);
}
}
}
return 0;
}

and I am interested in transforming the unsigned int, following the rules above (i.e. the 2-bit pairs have to be transformed following the rules: 00->10, 01->01, and 10->01), so that I end up with a 32bit unsigned int
Certainly this can be done, but the required sequence of operations will be different for each of the 27 distinct mappings from { 0, 1, 2 } to { 0, 1, 2 }. Some can be very simple, such as for the three constant mappings, but others require more complex expressions.
Without having performed a thorough analysis, I'm inclined to guess that the mappings that are neither constant nor permutations, such as the one presented in the example, probably have the greatest minimum complexity. These all share the characteristic that two keys map to the same value, whereas the other key maps to a different one. One way -- not necessarily the best -- to approach finding an expression for such a mapping is to focus first on achieving the general result that the two keys map to one value and the other to a different one, and then move on to transforming the resulting values to the desired ones, if necessary.
For the example presented, for instance,
0 -> 2
1 -> 1
2 -> 1
, one could (on a per-key basis) use key | ((key & 2) >> 1) | ((key & 1) << 1) to achieve these preliminary results:
0 -> 0
1 -> 3
2 -> 3
, which can be converted to the desired final result by flipping the higher-order bit via an exclusive-or operation.
Note well the bit masking. There are other ways that could be approached for mapping a single key, but for the case of multiple keys stored in contiguous bits of the same integer, you need to be careful to avoid contaminating the computed mappings with data from different keys.
In 16-entry bit-vector form, that would be
uint32_t keys = /*...*/;
uint32_t values = (keys | ((keys & 0xAAAAAAAAu) >> 1) | ((keys & 0x55555555u) << 1))
^ 0xAAAAAAAAu;
. That happens to have a couple fewer operations than the expression in your other answer so far, but I am not certain that it is the smallest possible number of operations. In fact, if you are prepared to accept arithmetic operations in addition to bitwise ones, then you can definitely do it with fewer operations:
uint32_t keys = /*...*/;
uint32_t values = 0xAAAAAAAAu
- (((keys & 0xAAAAAAAAu) >> 1) | (keys & 0x55555555u));
Of course, in general, various operations do not all have the same cost as each other, but integer addition and subtraction and bitwise AND, OR, and XOR all have the same cost as each other on most architectures (see, for example, https://www.agner.org/optimize/instruction_tables.pdf).
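A quick way to convince yourself that the exclusive-or form and the subtraction form agree is to run both on the same packed keys. The sketch below assumes the example rule 0->2, 1->1, 2->1. The names map_xor and map_sub are hypothetical, and map_xor ORs the swapped bits with the original key so that keys 1 and 2 both become 3 before the high bit is flipped.

```c
#include <stdint.h>

/* Bitwise-only variant: spread any set bit across the pair, then flip the high bit. */
uint32_t map_xor(uint32_t keys)
{
    uint32_t swapped = ((keys & 0xAAAAAAAAu) >> 1) | ((keys & 0x55555555u) << 1);
    return (swapped | keys) ^ 0xAAAAAAAAu;
}

/* Arithmetic variant: 2 minus (1 if the key is nonzero, else 0), per pair.
   No borrow crosses a pair boundary because each pair subtracts at most 1 from 2. */
uint32_t map_sub(uint32_t keys)
{
    return 0xAAAAAAAAu - (((keys & 0xAAAAAAAAu) >> 1) | (keys & 0x55555555u));
}
```

Both functions expect input without 11 pairs, as in the question.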

Converting binary int to binary uint8_t in c

I have an array defined as
int data[k];
where k is the size of the array. Each element of the array is either 0 or 1. I want to save the binary data in another array defined as
uint8_t new_data[k/8];
(k is usually a multiple of 8).
How can I do this in C?
Thanks in advance

Assuming k is a multiple of 8, assuming that by "each element is binary" you mean "each int is either 0 or 1", also assuming the bits in data are packed from most significant to least significant and the bytes of new_data are packed as big-endian (all reasonable assumptions), then this is how you do it:
for (int i = 0; i < k/8; ++i)
{
new_data[i] = (data[8*i ] << 7) | (data[8*i+1] << 6)
| (data[8*i+2] << 5) | (data[8*i+3] << 4)
| (data[8*i+4] << 3) | (data[8*i+5] << 2)
| (data[8*i+6] << 1) | data[8*i+7];
}

Assuming new_data starts initialized at 0, data[i] contains only zeroes and ones and that you want to fill lowest bits first:
for(unsigned i = 0; i < k; ++i) {
new_data[i/8] |= data[i]<<(i%8);
}
A possibly faster implementation may be:
for(int i = 0; i < k/8; ++i) {
uint8_t o = 0;
for(int j = 0; j < 8; ++j) {
o |= data[i*8 + j]<<j;
}
new_data[i] = o;
}
(notice that this essentially assumes that k is multiple of 8)
It's generally easier to optimize, as the inner loop has small, known boundaries and it writes on a variable with just that small scope; this is easier for optimizers to handle, and you can see for example that with gcc the inner loop gets completely unrolled.
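The two answers differ only in bit order, so a side-by-side sketch may help when choosing between them. The function names below are hypothetical. Both assume that k is a multiple of 8 and that every element of data is 0 or 1 (the `& 1` merely guards against other values).

```c
#include <stddef.h>
#include <stdint.h>

/* data[8*i] becomes the most significant bit of out[i]. */
void pack_bits_msb_first(const int *data, size_t k, uint8_t *out)
{
    for (size_t i = 0; i < k / 8; ++i) {
        uint8_t byte = 0;
        for (size_t j = 0; j < 8; ++j)
            byte = (uint8_t)((byte << 1) | (data[i * 8 + j] & 1));
        out[i] = byte;
    }
}

/* data[8*i] becomes the least significant bit of out[i]. */
void pack_bits_lsb_first(const int *data, size_t k, uint8_t *out)
{
    for (size_t i = 0; i < k / 8; ++i) {
        uint8_t byte = 0;
        for (size_t j = 0; j < 8; ++j)
            byte |= (uint8_t)((data[i * 8 + j] & 1) << j);
        out[i] = byte;
    }
}
```

For the input 1,0,0,0,0,0,1,1 the first function produces 0x83 and the second 0xC1.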

Efficient algorithm for finding a byte in a bit array

Given a bytearray uint8_t data[N] what is an efficient method to find a byte uint8_t search within it even if search is not octet aligned? i.e. the first three bits of search could be in data[i] and the next 5 bits in data[i+1].
My current method involves creating a bool get_bit(const uint8_t* src, struct internal_state* state) function (struct internal_state contains a mask that is bitshifted right, &ed with src and returned, maintaining size_t src_index < size_t src_len) , leftshifting the returned bits into a uint8_t my_register and comparing it with search every time, and using state->src_index and state->src_mask to get the position of the matched byte.
Is there a better method for this?

If you're searching an eight bit pattern within a large array you can implement a sliding window over 16 bit values to check if the searched pattern is part of the two bytes forming that 16 bit value.
To be portable you have to take care of endianness issues, which my implementation does by building the 16 bit value to search manually. The high byte is always the currently iterated byte and the low byte is the following byte. If you do a simple conversion like value = *((unsigned short *)pData) you will run into trouble on x86 processors...
Once value, cmp and mask are set up, cmp and mask are shifted bit by bit. If the pattern was not found within the high byte, the loop continues by checking the next byte as start byte.
Here is my implementation including some debug printouts (the function returns the bit position or -1 if pattern was not found):
int findPattern(unsigned char *data, int size, unsigned char pattern)
{
int result = -1;
unsigned char *pData;
unsigned char *pEnd;
unsigned short value;
unsigned short mask;
unsigned short cmp;
int tmpResult;
if ((data != NULL) && (size > 0))
{
pData = data;
pEnd = data + size;
while ((pData < pEnd) && (result == -1))
{
printf("\n\npData = {%02x, %02x, ...};\n", pData[0], pData[1]);
if ((pData + 1) < pEnd) /* still at least two bytes to check? */
{
tmpResult = (int)(pData - data) * 8; /* calculate bit offset according to current byte */
/* avoid endianness troubles by "manually" building value! */
value = *pData << 8;
pData++;
value += *pData;
/* create a sliding window to check if search pattern is within value */
cmp = pattern << 8;
mask = 0xFF00;
while (mask > 0x00FF) /* the low byte is checked within next iteration! */
{
printf("cmp = %04x, mask = %04x, tmpResult = %d\n", cmp, mask, tmpResult);
if ((value & mask) == cmp)
{
result = tmpResult;
break;
}
tmpResult++; /* count bits! */
mask >>= 1;
cmp >>= 1;
}
}
else
{
/* only one chance left if there is only one byte left to check! */
if (*pData == pattern)
{
result = (int)(pData - data) * 8;
}
pData++;
}
}
}
return (result);
}

I don't think you can do much better than this in C:
/*
* Searches for the 8-bit pattern represented by 'needle' in the bit array
* represented by 'haystack'.
*
* Returns the index *in bits* of the first appearance of 'needle', or
* -1 if 'needle' is not found.
*/
int search(uint8_t needle, int num_bytes, uint8_t haystack[num_bytes]) {
if (num_bytes > 0) {
uint16_t window = haystack[0];
if (window == needle) return 0;
for (int i = 1; i < num_bytes; i += 1) {
window = (uint16_t)((window << 8) | haystack[i]); /* parentheses needed: + binds tighter than << */
/* Candidate for unrolling: */
for (int j = 7; j >= 0; j -= 1) {
if (((window >> j) & 0xff) == needle) { /* parentheses needed: == binds tighter than & */
return 8 * i - j;
}
}
}
}
return -1;
}
The main idea is to handle the 87.5% of cases that cross the boundary between consecutive bytes by pairing bytes in a wider data type (uint16_t in this case). You could adjust it to use an even wider data type, but I'm not sure that would gain anything.
What you cannot safely or easily do is anything involving casting part or all of your array to a wider integer type via a pointer (i.e. (uint16_t *)&haystack[i]). You cannot be assured of proper alignment for such a cast, nor of the byte order with which the result might be interpreted.
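One way to gain confidence in a windowed search of this kind is to compare it with a deliberately naive bit-by-bit reference. The sketch below does that. search_naive and search_window are hypothetical names, both treat the most significant bit of each byte as the first bit, and search_window is a compact restatement of the 16-bit window idea rather than the answerer's exact code.

```c
#include <stdint.h>

/* Reference: try every bit offset, assembling 8 bits MSB-first. */
int search_naive(uint8_t needle, int num_bytes, const uint8_t *haystack)
{
    int num_bits = num_bytes * 8;

    for (int pos = 0; pos + 8 <= num_bits; pos++) {
        unsigned v = 0;
        for (int k = 0; k < 8; k++) {
            int bit = pos + k;
            v = (v << 1) | ((haystack[bit / 8] >> (7 - bit % 8)) & 1u);
        }
        if (v == needle)
            return pos;
    }
    return -1;
}

/* 16-bit sliding window over consecutive byte pairs. */
int search_window(uint8_t needle, int num_bytes, const uint8_t *haystack)
{
    if (num_bytes <= 0)
        return -1;

    uint16_t window = haystack[0];
    if (window == needle)
        return 0;
    for (int i = 1; i < num_bytes; i++) {
        window = (uint16_t)((window << 8) | haystack[i]);
        for (int j = 7; j >= 0; j--) {
            if (((window >> j) & 0xFF) == needle)
                return 8 * i - j;   /* j bits before the start of byte i */
        }
    }
    return -1;
}
```

For the bytes 12 34 56, the pattern 0x23 starts at bit 4, and both functions should report that position.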

I don't know if it would be better, but I would use a sliding window.
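#include <stddef.h>
#include <stdint.h>
/* Sliding-window search; bits are consumed least significant first.
   Returns index of first bit of first sequence occurrence or -1 if sequence is not found */
int find_byte(const uint8_t *data, size_t length, uint8_t search)
{
    if (length == 0) return -1;
    unsigned counter = 0, feeder = 8;   /* feeder: number of valid bits in window */
    unsigned window = data[0];
    while (search ^ (window & 0xff)) {
        window >>= 1;
        feeder--;
        if (feeder < 8) {               /* refill from the next byte */
            counter++;
            if (counter >= length) {
                feeder = 0;             /* marks "not found" */
                break;
            }
            window |= (unsigned)data[counter] << feeder;
            feeder += 8;
        }
    }
    return (feeder > 0) ? (int)((counter + 1) * 8 - feeder) : -1;
}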
Also with some alterations you can use this method to search for arbitrary length (1 to 64-array_element_size_in_bits) bits sequence.

If AVX2 is acceptable (with earlier versions it didn't work out so well, but you can still do something there), you can search in a lot of places at the same time. I couldn't test this on my machine (only compile), so the following is meant to give you an idea of how it could be approached rather than copy-and-paste code, and I'll try to explain it rather than just code-dump.
The main idea is to read an uint64_t, shift it right by all values that make sense (0 through 7), then for each of those 8 new uint64_t's, test whether the byte is in there. Small complication: for the uint64_t's shifted by more than 0, the highest position should not be counted since it has zeroes shifted into it that might not be in the actual data. Once this is done, the next uint64_t should be read at an offset of 7 from the current one, otherwise there is a boundary that is not checked across. That's fine though, unaligned loads aren't so bad anymore, especially if they're not wide.
So now for some (untested, and incomplete, see below) code,
__m256i needle = _mm256_set1_epi8(find);
size_t i;
for (i = 0; i < n - 6; i += 7) {
// unaligned load here, but that's OK
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
// in the qword right-shifted by 0, all positions are valid
// otherwise, the top position corresponds to an incomplete byte
uint32_t lowmask = 0x7f7f7fffu & _mm256_movemask_epi8(low);
uint32_t highmask = 0x7f7f7f7fu & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
int bitindex = __builtin_ctzll(mask); // zero-based index of the lowest set bit
// the bit-index and byte-index are swapped
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}
The funny "bit-index and byte-index are swapped" thing is because searching within a qword is done byte by byte and the results of those comparisons end up in 8 adjacent bits, while the search for "shifted by 1" ends up in the next 8 bits and so on. So in the resulting masks, the index of the byte that contains the 1 is a bit-offset, but the bit-index within that byte is actually the byte-offset, for example 0x8000 would correspond to finding the byte at the 7th byte of the qword that was right-shifted by 1, so the actual index is 8*7+1.
There is also the issue of the "tail", the part of the data left over when all blocks of 7 bytes have been processed. It can be done much the same way, but now more positions contain bogus bytes. Now n - i bytes are left over, so the mask has to have n - i bits set in the lowest byte, and one fewer for all other bytes (for the same reason as earlier, the other positions have zeroes shifted in). Also, if there is exactly 1 byte "left", it isn't really left because it would have been tested already, but that doesn't really matter. I'll assume the data is sufficiently padded that accessing out of bounds doesn't matter. Here it is, untested:
if (i < n - 1) {
// make n-i-1 bits, then copy them to every byte
uint32_t validh = ((1u << (n - i - 1)) - 1) * 0x01010101;
// the lowest position has an extra valid bit, set lowest zero
uint32_t validl = (validh + 1) | validh;
uint64_t d = *(uint64_t*)(data + i);
__m256i x = _mm256_set1_epi64x(d);
__m256i low = _mm256_srlv_epi64(x, _mm256_set_epi64x(3, 2, 1, 0));
__m256i high = _mm256_srlv_epi64(x, _mm256_set_epi64x(7, 6, 5, 4));
low = _mm256_cmpeq_epi8(low, needle);
high = _mm256_cmpeq_epi8(high, needle);
uint32_t lowmask = validl & _mm256_movemask_epi8(low);
uint32_t highmask = validh & _mm256_movemask_epi8(high);
uint64_t mask = lowmask | ((uint64_t)highmask << 32);
if (mask) {
int bitindex = __builtin_ctzll(mask); // zero-based index of the lowest set bit
return 8 * (i + (bitindex & 7)) + (bitindex >> 3);
}
}

If you are searching a large amount of memory and can afford an expensive setup, another approach is to use a 64K lookup table. For each possible 16-bit value, the table stores a byte containing the bit shift offset at which the matching octet occurs (+1, so 0 can indicate no match). You can initialize it like this:
static uint8_t g_pLookupTable[65536]; // a malloc() call is not a valid initializer at file scope
void initLUT(uint8_t octet)
{
memset(g_pLookupTable, 0, 65536); // zero out
for(int i = 0; i < 65536; i++)
{
for(int j = 7; j >= 0; j--)
{
if(((i >> j) & 255) == octet)
{
g_pLookupTable[i] = j + 1;
break;
}
}
}
}
Note that the case where the value is shifted 8 bits is not included (the reason will be obvious in a minute).
Then you can scan through your array of bytes like this:
int findByteMatch(uint8_t* pArray, uint8_t octet, int length)
{
if(length > 0)
{
uint16_t index = (uint16_t)pArray[0];
if(index == octet)
return 0;
for(int bit, i = 1; i < length; i++)
{
index = (index << 8) | pArray[i];
if(bit = g_pLookupTable[index])
return (i * 8) - (bit - 1);
}
}
return -1;
}
Further optimization:
Read 32 or however many bits at a time from pArray into a uint32_t and then shift and AND each to get byte one at a time, OR with index and test, before reading another 4.
Pack the LUT into 32K by storing a nybble for each index. This might help it squeeze into the cache on some systems.
It will depend on your memory architecture whether this is faster than an unrolled loop that doesn't use a lookup table.
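The nybble-packing suggestion could be sketched as follows. Table entries only take the values 0 to 8, so each fits in 4 bits and two entries share a byte. The names packed_lut, init_packed_lut, lookup_packed and find_byte_packed are hypothetical, and the table must be initialised for the same octet that is later searched for.

```c
#include <stdint.h>
#include <string.h>

static uint8_t packed_lut[32768];   /* two 4-bit entries per byte */

/* Entry for a 16-bit value: 0 = no match, else (shift + 1) with shift in 0..7. */
void init_packed_lut(uint8_t octet)
{
    memset(packed_lut, 0, sizeof packed_lut);
    for (uint32_t v = 0; v < 65536; v++) {
        for (int j = 7; j >= 0; j--) {
            if (((v >> j) & 0xFF) == octet) {
                uint8_t entry = (uint8_t)(j + 1);
                packed_lut[v >> 1] |= (uint8_t)(entry << ((v & 1u) * 4));
                break;              /* keep the earliest bit position */
            }
        }
    }
}

static unsigned lookup_packed(uint16_t v)
{
    return (packed_lut[v >> 1] >> ((v & 1u) * 4)) & 0xFu;
}

/* Returns the bit index of the first match, or -1. */
int find_byte_packed(const uint8_t *data, int length, uint8_t octet)
{
    if (length <= 0)
        return -1;

    uint16_t index = data[0];
    if (index == octet)
        return 0;
    for (int i = 1; i < length; i++) {
        index = (uint16_t)((index << 8) | data[i]);
        unsigned bit = lookup_packed(index);
        if (bit)
            return i * 8 - (int)(bit - 1);
    }
    return -1;
}
```

The extra shift and mask per lookup are the price for halving the table, so whether this wins depends on cache size, as noted above.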

C Read char as binary

This is actually part of a project I'm working on using an avr. I'm interfacing via twi with a DS1307 real-time clock IC. It reports information back as a series of 8 chars. It returns in the format:
// Second : ds1307[0]
// Minute : ds1307[1]
// Hour : ds1307[2]
// Day : ds1307[3]
// Date : ds1307[4]
// Month : ds1307[5]
// Year : ds1307[6]
What I would like to do is take each part of the time and read it bit by bit. I can't think of a way to do this. Basically lighting up an led if the bit is a 1, but not if it's a 0.
I'd imagine that there is a rather simple way to do it by bitshifting, but I can't put my finger on the logic to do it.

Checking whether the bit N is set can be done with a simple expression like:
(bitmap & (1ULL << N)) != 0
where bitmap is the integer value (e.g. 64 bit in your case) containing the bits.
Finding the seconds:
(bitmap & 0xFF)
Finding the minute:
(bitmap & 0xFF00) >> 8
Finding the hour:
(bitmap & 0xFF0000) >> 16

If I'm interpreting you correctly, the following iterates over all the bits from lowest to highest. That is, the 8 bits of Seconds, followed by the 8 bits of Minutes, etc.
unsigned char i, j;
for (i = 0; i < sizeof(ds1307); i++)
{
unsigned char value = ds1307[i]; // seconds, minutes, hours etc
for (j = 0; j < 8; j++)
{
if (value & 0x01)
{
// bit is 1
}
else
{
// bit is 0
}
value >>= 1;
}
}

Yes - you can use >> to shift the bits right by one, and & 1 to obtain the value of the least significant bit:
unsigned char ds1307[7];
int i, j;
for (i = 0; i < 7; i++)
for (j = 0; j < 8; j++)
printf("byte %d, bit %d = %u\n", i, j, (ds1307[i] >> j) & 1U);
(This will examine the bits from least to most significant. By the way, your example array only has 7 bytes, not 8...)
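If the goal is to drive one LED per bit, it may be convenient to split a register byte into eight on/off flags first and hand those to whatever routine switches the LEDs. The helper below is a hypothetical sketch of that step only; how the flags reach the hardware is left open.

```c
#include <stdint.h>

/* Split one register byte into 8 flags, least significant bit first.
   flags[n] is 1 when bit n of value is set, otherwise 0. */
void byte_to_flags(uint8_t value, uint8_t flags[8])
{
    for (int n = 0; n < 8; n++)
        flags[n] = (uint8_t)((value >> n) & 1u);
}
```

Calling byte_to_flags(ds1307[0], flags) would then give the eight bits of the seconds register, with flags[0] holding the lowest bit.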

Essentially, if the 6 LEDs to show the seconds in binary format are connected to PORTA2-PORTA7, you can PORTA = ds1307[0] to have the seconds automatically lit up correctly.