Packet framing for very short serial packets - c

We designed a simple fixed-length protocol for an embedded device. Every packet is just two bytes:
bits | 15..12 | 11..4 | 3..0 |
     | OpCode | DATA  | CRC4 |
We use "crc-based framing", i.e. receiver collects two bytes, compute CRC4 and if it matches frame is considered valid. As you can see, there is no start-of-frame or end-of-frame.
There is a catch: recommended message length for CRC4 is 11 bits and here it is computed for 12 bits. As far as I understand that means that CRC error-detection properties degrade (but I'm not sure how much).
(By the way, if anybody needs code for CRC4 (or any other CRC) and does not feel skilled enough to write it themselves, Boost has a very nice CRC library that can compute any CRC.)
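For reference, a minimal bitwise CRC4 sketch in C; this assumes the common CRC-4-ITU generator x^4 + x + 1, no initial value, and MSB-first processing of the 12 message bits, since the exact polynomial and bit order used by the device are not stated here:
#include <stdint.h>

/* Plain long division modulo 2 with generator 10011b (x^4 + x + 1). */
static uint8_t crc4_12bits(uint16_t msg12)      /* lower 12 bits = OpCode + DATA */
{
    uint16_t rem = (uint16_t)(msg12 << 4);      /* append 4 zero bits for the CRC */
    for (int i = 15; i >= 4; i--) {
        if (rem & (1u << i))
            rem ^= (uint16_t)(0x13u << (i - 4));
    }
    return (uint8_t)(rem & 0x0F);
}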
The problem is: this crc-based framing doesn't work, and we get framing errors, i.e. the second byte of one message and the first byte of the following message sometimes form a valid message.
My question is: is there any way to correct the framing without adding any more bytes? We spent quite some time squeezing everything into those two bytes and it would be kind of sad to just throw that away.
We do have a spare bit in the opcode field, though.
Time-based framing will not be very reliable, because our radio channel likes to "spit out" several packets at once.
Maybe there is some other error-detection method that would work better than CRC4?
If we have to append more bytes, what would be the best way to do it?
We could use a start-of-frame byte and byte stuffing (such as COBS) (+2 bytes, but I'm not sure what to do with corrupted messages).
We could use a start-of-frame nibble and widen the CRC to CRC8 (+1 byte).
Something else?

A common way to do what you are asking is to "hunt for framing" at start-up and require N consecutive good packets before accepting any packets. This can be implemented using a state machine with three states: HUNT, LOF (loss of frame), and SYNC.
It could be something like:
#define GOOD_PACKETS_REQUIRED_BEFORE_SYNC 8

enum { HUNT, LOF, SYNC };   /* framing states */

int state = HUNT;
int good_count = 0;

Packet GetPacket(void)
{
    /* GetNextByteFromUART(), IsValidCRC() and packet() are assumed helpers. */
    unsigned char fb = 0;   /* first byte of the candidate packet  */
    unsigned char sb = 0;   /* second byte of the candidate packet */

    while (1)
    {
        if (state == HUNT)
        {
            /* Slide the two-byte window one byte at a time until the CRC matches. */
            fb = sb;
            sb = GetNextByteFromUART();
            if (IsValidCRC(fb, sb))
            {
                state = LOF;
                good_count = 1;
            }
        }
        else if (state == LOF)
        {
            fb = GetNextByteFromUART();
            sb = GetNextByteFromUART();
            if (IsValidCRC(fb, sb))
            {
                good_count++;
                if (good_count >= GOOD_PACKETS_REQUIRED_BEFORE_SYNC)
                {
                    state = SYNC;
                }
            }
            else
            {
                state = HUNT;
                good_count = 0;
            }
        }
        else if (state == SYNC)
        {
            fb = GetNextByteFromUART();
            sb = GetNextByteFromUART();
            if (IsValidCRC(fb, sb))
            {
                return packet(fb, sb);
            }
            /* SYNC lost! Start a new hunt for correct framing. */
            state = HUNT;
            good_count = 0;
        }
    }
}
You can find several standard communication protocols that use this (or a similar) technique, e.g. ATM and E1 (https://en.wikipedia.org/wiki/E-carrier). There are different variants of the principle. For instance, you may want to go from SYNC to LOF when receiving the first bad packet (decrementing good_count) and then go from LOF to HUNT on the second consecutive bad packet. That would cut down the time it takes to re-frame. The above just shows a very simple variant.
Note: in real-world code you probably can't accept a blocking function like the one above. The above code is only provided to describe the principle.
Whether you need a CRC or can make do with a fixed frame word (e.g. 0xB) depends on your medium.

There is a catch: the recommended message length for CRC4 is 11 bits and here it is computed over 12 bits.
No, here it is computed over 16 bits.
As far as I understand that means that CRC error-detection properties degrade (but I'm not sure how much).
Recommendations about CRC likely refer to whether you have a 100% chance of finding a single-bit error or not. All CRCs struggle with multi-bit errors and will not necessarily find them.
When dealing with calculations about CRC reliability of UART, you also have to take the start and stop bits in account. Bit errors may as well strike there, in which case the hardware may or may not assist in finding the error.
second byte from one message and first byte from the following message sometimes form correct message
Of course. You have no synch mechanism, what do you expect? This has nothing to do with CRC.
My question is - is there any way to correct framing without adding any more bytes?
Either you have to sacrifice one bit per byte as a sync flag or increase the packet length. Alternatively, you could use different delays between the data bytes: maybe send the two bytes directly after each other, then insert a delay.
Which method to pick depends on the nature of the data and your specification. Nobody on SO can tell you what your spec looks like.
Maybe there is some other error-detection method that will work better than CRC4?
Not likely. CRC is pretty much the only professional checksum algorithm. The polynomials are picked based on the expected nature of the noise: you pick a polynomial that resembles the noise as little as possible. However, this is mainly of academic interest, as no CRC guru can know what the noise looks like in your specific application.
Alternatives are sums, XOR, parity, counting the number of 1s, etc. All of them are quite bad, probability-wise.
If we have to append more bytes, what would be the best way to do it?
Nobody can answer that question without knowing the nature of the data.

If the CRC is mainly for paranoia (from the comments), you can give up some error-checking robustness and processor time for framing.
Since there is a free bit in the opcode, always set the most-significant bit of the first byte to zero. Then, before transmission but after calculating the CRC, set the most-significant bit of the second byte to one.
A frame is then two consecutive bytes where the first byte's most significant bit is zero and the second byte's is one. If the two bytes fail the CRC check, clear the most significant bit of the second byte and recalculate, to see whether the bit was only set for framing before transmission.
The downside is that the CRC will be calculated twice about half of the time. Also, setting the bit for framing may cause invalid data to match the CRC.
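A rough sketch of the receive-side check described above; TryDecodeFramedPair is a made-up name and IsValidCRC is the same assumed helper as in the state-machine answer earlier:
/* The transmitter keeps the MSB of the first byte at 0 and forces the MSB of
 * the second byte to 1 after computing the CRC, so the receiver tries the pair
 * as received and then with that framing bit cleared. */
int TryDecodeFramedPair(unsigned char fb, unsigned char sb,
                        unsigned char *out_first, unsigned char *out_second)
{
    if ((fb & 0x80) != 0 || (sb & 0x80) == 0)
        return 0;                       /* framing bits wrong: not byte-aligned */
    if (IsValidCRC(fb, sb)) {           /* CRC happens to match with the bit set */
        *out_first = fb;
        *out_second = sb;
        return 1;
    }
    sb &= 0x7F;                         /* undo the transmit-side framing bit */
    if (IsValidCRC(fb, sb)) {
        *out_first = fb;
        *out_second = sb;
        return 1;
    }
    return 0;
}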

Related

elegant (and fast!) way to rearrange columns and rows in an ADC buffer

Abstract:
I am looking for an elegant and fast way to "rearrange" the values in my ADC Buffer for further processing.
Introduction:
On an ARM Cortex-M4 processor I am using 3 ADCs to sample analog values, with DMA and the "double buffer" technique. When I get a "half buffer complete" interrupt, the data in the 1D array are arranged like this:
Ch1S1, Ch2S1, Ch3S1, Ch1S2, Ch2S2, Ch3S2, Ch1S3 ..... Ch1Sn-1, Ch2Sn-1, Ch3Sn-1, Ch1Sn, Ch2Sn, Ch3Sn
Where Sn stands for the sample number and Chn for the channel number.
As I do 2x oversampling, n equals 16; the channel count is 9 in reality (it is 3 in the example above).
Or written in 2D form:
Ch1S1, Ch2S1, Ch3S1,
Ch1S2, Ch2S2, Ch3S2,
Ch1S3 ...
Ch1Sn-1, Ch2Sn-1, Ch3Sn-1,
Ch1Sn, Ch2Sn, Ch3Sn
Where the rows represent the n samples and the columns represent the channels ...
I am using CMSIS-DSP to calculate all the vector stuff, like shifting, scaling, multiplication, once I have "sorted out" the channels. This part is pretty fast.
Issue:
But the code I am using for "reshaping" the 1D buffer array into one accumulated value per channel is pretty poor and slow:
/* i, j, bP, ADC_acc and ADC_DMABuffer are declared elsewhere; bP starts at 0 */
for (i = 0; i < ADC_BUFFER_SZ; i++) {
    for (j = 0; j < MEAS_ADC_CHANNELS; j++) {
        if (i) *(ADC_acc + j) += *(ADC_DMABuffer + bP); // sum up all elements
        else   *(ADC_acc + j)  = *(ADC_DMABuffer + bP); // initialize on first run
        bP++;
    }
}
After this procedure I get a 1D array with one (accumulated) U32 value per channel, but this code is pretty slow: ~4000 clock cycles for 16 samples per channel over 9 channels, or ~27 clock cycles per sample. In order to achieve higher sample rates, this needs to be many times faster than it is right now.
Question(s):
What I am looking for is some elegant way, using the CMSIS-DSP functions, to achieve the same result as above, but much faster. My gut says that I am thinking in the wrong direction, that there must be a solution within the CMSIS-DSP lib, as I am most probably not the first guy who stumbles upon this topic and I most probably won't be the last. So I'm asking for a little push in the right direction, as I guess this could be a severe case of "work-blindness" ...
I was thinking about using the dot-product function arm_dot_prod_q31 together with an array filled with ones for the accumulation task, because I could not find a CMSIS function that would simply sum up a 1D array. But this would not solve the "reshaping" issue; I would still have to copy data around and create new buffers to prepare the vectors for the arm_dot_prod_q31 call ...
Besides that, it feels somehow awkward to use a dot product where I just want to sum up array elements …
I also thought about transforming the ADC buffer into a 16 x 9 or 9 x 16 matrix, but then I could not find anything with which I could easily (= fast & elegant) access rows or columns. That would leave me with yet another issue to solve, which would eventually require creating new buffers and copying data around, as I am missing a function to multiply a matrix with a vector ...
Maybe someone has a hint for me, that points me in the right direction?
Thanks a lot and cheers!
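For reference, a minimal scalar sketch of the accumulation described above, with the per-channel accumulators zeroed up front and the DMA buffer walked strictly sequentially; the uint16_t sample type, the channel count of 9 and the 16 samples are assumptions taken from the question:
#include <stdint.h>

#define MEAS_ADC_CHANNELS 9
#define ADC_SAMPLES       16
#define ADC_BUFFER_SZ     (MEAS_ADC_CHANNELS * ADC_SAMPLES)

void accumulate_channels(const uint16_t *dma_buf, uint32_t *acc)
{
    for (int ch = 0; ch < MEAS_ADC_CHANNELS; ch++)
        acc[ch] = 0;                                   /* zero the accumulators */
    for (int s = 0; s < ADC_SAMPLES; s++)              /* walk the buffer sequentially */
        for (int ch = 0; ch < MEAS_ADC_CHANNELS; ch++)
            acc[ch] += dma_buf[s * MEAS_ADC_CHANNELS + ch];
}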
ARM is a RISC device, so 27 cycles is roughly equal to 27 instructions, IIRC. You may find that you're going to need a higher clock rate to meet your timing requirements. What OS are you running? Do you have access to the cache controller? You may need to lock data buffers into the cache to get high enough performance. Also, keep your sums and raw data as physically close in memory as your system will allow.
I am not convinced your perf issue is entirely the consequence of how you are stepping through your data array, but here's a more streamlined approach than what you are using:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS] = {0};   /* accumulators must start at zero */
for (int idxRaw = 0, idxSum = 0; idxRaw < ADC_BUFFER_SZ; idxRaw++)
{
    sums[idxSum++] += raw[idxRaw];
    if (idxSum == MEAS_ADC_CHANNELS) idxSum = 0;
}
Note that I have not tested the above code, nor even tried to compile it. The algorithm is simple enough that you should be able to get it working quickly.
Writing pointer math in your code will not make it any faster. The compiler will convert array notation to efficient pointer math for you. You definitely don't need two loops.
That said, I often use a pointer for iteration:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS] = {0};
int *itRaw = raw;
int *itRawEnd = raw + ADC_BUFFER_SZ;
int *itSums = sums;
int *itSumsEnd = itSums + MEAS_ADC_CHANNELS;
while (itRaw != itRawEnd)
{
    *itSums += *itRaw;
    itRaw++;
    itSums++;
    if (itSums == itSumsEnd) itSums = sums;
}
But almost never when I am working with a mathematician or scientist, which is often the case in measurement/metrology device development. It's easier to explain array notation to non-C reviewers than the iterator form.
Also, if I have an algorithm description that uses the phrase "for each...", I tend to prefer the for loop form, but when the description uses "while ...", then of course I will probably use the while... form, unless I can skip one or more variable assignment statements by rearranging it to a do..while. But I often stick as close as possible to the original description until after I've passed all the testing criteria, then do rearrangement of loops for code hygiene purposes. It's easier to argue with a domain expert that their math is wrong, when you can easily convince them that you implemented what they described.
Always get it right first, then measure and decide whether to further hone the code. Decades ago, some C compilers for embedded systems could do a better job of optimizing one kind of loop than another. We used to have to keep a wary eye on the machine code they generated, and often developed habits that avoided those worst-case scenarios. That is uncommon today, and almost certainly not the case for your ARM tool chain. But you may have to look into how your compiler's optimization features work and try something different.
Do try to avoid doing value math on the same line as your pointer math. It's just confusing:
*(p1 + offset1) += *(p2 + offset2); // Can and should be avoided.
*(p1++) = *(p2++);                  // Reasonable, especially for experienced coders/reviewers.
p1[offset1] += p2[offset2];         // Okay. Doesn't mix math notation with pointer notation.
p1[offset1 + A*B/C] += p2...;       // Very bad.
// But...
int offset1 = A*B/C;                // Especially helpful when stepping in the debugger.
p1[offset1]... ;                    // Much better.
Hence the iterator form mentioned earlier. Mixing value and pointer math may reduce the lines of code, but it does not reduce the complexity, and it definitely increases the odds of introducing a bug at some point.
A purist could argue that p1[x] is in fact pointer notation in C, but array notation has almost, if not completely, universal binding rules across languages. Intentions are obvious, even to non-programmers. While the examples above are pretty trivial and most C programmers would have no problem reading any of them, when the number of variables involved and the complexity of the math increase, mixing your value math with pointer math quickly becomes problematic. You'll almost never do it for anything non-trivial, so for consistency's sake, just get in the habit of avoiding it altogether.

What is the most efficient way to represent small values in a struct?

Often I find myself having to represent a structure that consists of very small values. For example, Foo has 4 values, a, b, c, d, that range from 0 to 3. Usually I don't care, but sometimes those structures are
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
In that case, I find myself having trouble deciding how to represent Foo efficiently. I have basically 4 options:
struct Foo {
    int a;
    int b;
    int c;
    int d;
};

struct Foo {
    char a;
    char b;
    char c;
    char d;
};

struct Foo {
    char abcd;
};

struct FourFoos {
    int abcd_abcd_abcd_abcd;
};
They use 128, 32, 8, and 8 bits per Foo respectively, ranging from sparse to densely packed. The first example is probably the most idiomatic one, but using it would essentially increase the size of the program by 16 times, which doesn't sound quite right. Moreover, most of the memory would be filled with zeroes and not used at all, which makes me wonder if this isn't a waste. On the other hand, packing them densely brings an additional overhead for reading them.
What is the computationally 'fastest' method for representing small values in a struct?
For dense packing that doesn't incur a large overhead of reading, I'd recommend a struct with bitfields. In your example where you have four values ranging from 0 to 3, you'd define the struct as follows:
struct Foo {
    unsigned char a:2;
    unsigned char b:2;
    unsigned char c:2;
    unsigned char d:2;
};
This has a size of 1 byte, and the fields can be accessed simply, i.e. foo.a, foo.b, etc.
By making your struct more densely packed, that should help with cache efficiency.
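For illustration, a minimal compilable sketch using the struct above; the 1-byte size is typical but ultimately implementation-defined, so it is printed rather than assumed:
#include <stdio.h>

struct Foo {
    unsigned char a:2;
    unsigned char b:2;
    unsigned char c:2;
    unsigned char d:2;
};

int main(void)
{
    struct Foo foo = { 3, 1, 0, 2 };   /* a=3, b=1, c=0, d=2 */
    printf("size = %zu, a=%d b=%d c=%d d=%d\n",
           sizeof foo, foo.a, foo.b, foo.c, foo.d);
    return 0;
}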
Edit:
To summarize the comments:
There's still bit fiddling happening with a bitfield, but it's done by the compiler and will most likely be more efficient than what you would write by hand (not to mention it makes your source code more concise and less prone to introducing bugs). And given the large number of structs you'll be dealing with, the reduction in cache misses gained by using a packed struct such as this will likely make up for the overhead of the bit manipulation it imposes.
Pack them only if space is a consideration - for example, an array of 1,000,000 structs. Otherwise, the code needed to do shifting and masking is greater than the savings in space for the data. Hence you are more likely to have a cache miss on the I-cache than the D-cache.
There is no definitive answer, and you haven't given enough information to allow a "right" choice to be made. There are trade-offs.
Your statement that your "primary goal is time efficiency" is insufficient, since you haven't specified whether I/O time (e.g. to read data from file) is more of a concern than computational efficiency (e.g. how long some set of computations take after a user hits a "Go" button).
So it might be appropriate to write the data as a single char (to reduce time to read or write) but unpack it into an array of four int (so subsequent calculations go faster).
Also, there is no guarantee that an int is 32 bits (which you have assumed in your statement that the first packing uses 128 bits). An int can be 16 bits.
Foo has 4 values, a, b, c, d that range from 0 to 3. Usually I don't care, but sometimes, those structures are ...
There is another option: since the values 0 ... 3 likely indicate some sort of state, you could consider using "flags"
enum {
    A_1 = 1<<0,
    A_2 = 1<<1,
    A_3 = A_1|A_2,
    B_1 = 1<<2,
    B_2 = 1<<3,
    B_3 = B_1|B_2,
    C_1 = 1<<4,
    C_2 = 1<<5,
    C_3 = C_1|C_2,
    D_1 = 1<<6,
    D_2 = 1<<7,
    D_3 = D_1|D_2,
    // you could continue to ... D7_3 for 32/64 bits if it makes sense
};
This isn't much different than using bitfields for most situations, but can drastically reduce your conditional logic.
if (a < 2 && b < 2 && c < 2 && d < 2) // .... (4 comparisons)
// vs.
if ((abcd & (A_2|B_2|C_2|D_2)) != 0)  // (bitop with a constant and a 0-compare)
Depending what kinds of operations you will be doing on the data, it may make sense to use either 4 or 8 sets of abcd and pad out the end with 0s as needed. That could allow up to 32 comparisons to be replaced with a bitop and 0-compare.
For instance, if you wanted to set the "1 bit" on all 8 sets of 4 in a 64 bit variable you can do uint64_t abcd8 = 0x5555555555555555ULL; then to set all the 2 bits you could do abcd8 |= 0xAAAAAAAAAAAAAAAAULL; making all values now 3
Addendum:
On further consideration, you could use a union as your type: either union a char with #dbush's bitfields (the flag operations would still work on the unsigned char), or use char types for each of a, b, c, d and union them with an unsigned int. This would allow both a compact representation and efficient operations, depending on which union member you use.
union Foo {
    char abcd; // Note: you can use flags and bitops on this too
    struct {
        unsigned char a:2;
        unsigned char b:2;
        unsigned char c:2;
        unsigned char d:2;
    };
};
Or even extended further
union Foo {
    uint64_t abcd8; // Note: you can use flags and bitops on these too
    uint32_t abcd4[2];
    uint16_t abcd2[4];
    uint8_t  abcd[8];
    struct {
        unsigned char a:2;
        unsigned char b:2;
        unsigned char c:2;
        unsigned char d:2;
    } _[8];
};
union Foo myfoo = {0xFFFFFFFFFFFFFFFFULL};
//assert(myfoo._[0].a == 3 && myfoo.abcd[0] == 0xFF);
This method does introduce some endianness differences, which would also be a problem if you use a union to cover any other combination of your other methods.
union Foo {
    uint32_t abcd;
    uint32_t dcba; // only here for endian purposes
    struct {       // anonymous struct
        char a;
        char b;
        char c;
        char d;
    };
};
You could experiment and measure with different union types and algorithms to see which parts of the unions are worth keeping, then discard the ones that are not useful. You may find that operating on several char/short/int types simultaneously gets automatically optimized to some combination of AVX/simd instructions whereas using bitfields does not unless you manually unroll them... there is no way to know until you test and measure them.
Fitting your data set in cache is critical. Smaller is always better, because hyperthreading competitively shares the per-core caches between the hardware threads (on Intel CPUs). Comments on this answer include some numbers for costs of cache misses.
On x86, loading 8bit values with sign or zero-extension into 32 or 64bit registers (movzx or movsx) is literally just as fast as plain mov of a byte or 32bit dword. Storing the low byte of a 32bit register also has no overhead. (See Agner Fog's instruction tables and C / asm optimization guides here).
Still x86-specific: [u]int8_t temporaries are ok, too, but avoid [u]int16_t temporaries. (load/store from/to [u]int16_t in memory is fine, but working with 16bit values in registers has big penalties from the operand-size prefix decoding slowly on Intel CPUs.) 32bit temporaries will be faster if you want to use them as an array index. (Using 8bit registers doesn't zero the high 24/56bits, so it takes an extra instruction to zero or sign-extend, to use an 8bit register as an array index, or in an expression with a wider type (like adding it to an int.)
I'm unsure what ARM or other architectures can do as far as efficient zero/sign extension from single-byte loads, or for single-byte stores.
Given this, my recommendation is pack for storage, use int for temporaries. (Or long, but that will increase code size slightly on x86-64, because a REX prefix is needed to specify a 64bit operand size.) e.g.
int a_i = foo[i].a;
int b_i = foo[i].b;
...;
foo[i].a = a_i + b_i;
bitfields
Packing into bitfields will have more overhead, but can still be worth it. Testing a compile-time-constant-bit-position (or multiple bits) in a byte or 32/64bit chunk of memory is fast. If you actually need to unpack some bitfields into ints and pass them to a non-inline function call or something, that will take a couple extra instructions to shift and mask. If this gives even a small reduction in cache misses, this can be worth it.
Testing, setting (to 1) or clearing (to 0) a bit or group of bits can be done efficiently with OR or AND, but assigning an unknown boolean value to a bitfield takes more instructions to merge the new bits with the bits for other fields. This can significantly bloat code if you assign a variable to a bitfield very often. So using int foo:6 and things like that in your structs, because you know foo doesn't need the top two bits, is not likely to be helpful. If you're not saving many bits compared to putting each thing in its own byte/short/int, then the reduction in cache misses won't outweigh the extra instructions (which can add up into I-cache / uop-cache misses, as well as the direct extra latency and work of the instructions).
The x86 BMI1 / BMI2 (Bit-Manipulation) instruction-set extensions will make copying data from a register into some destination bits (without clobbering the surrounding bits) more efficient. BMI1: Haswell, Piledriver. BMI2: Haswell, Excavator (unreleased). Note that like SSE/AVX, this will mean you'd need BMI versions of your functions, and fallback non-BMI versions for CPUs that don't support those instructions. AFAIK, compilers don't have options to see patterns for these instructions and use them automatically. They're only usable via intrinsics (or asm).
Dbush's answer, packing into bitfields, is probably a good choice, depending on how you use your fields. Your fourth option (of packing four separate abcd values into one struct) is probably a mistake, unless you can do something useful with four sequential abcd values (vector-style).
code generically, try both ways
For a data structure your code uses extensively, it makes sense to set things up so you can flip from one implementation to another, and benchmark. Nir Friedman's answer, with getters/setters is a good way to go. However, just using int temporaries and working with the fields as separate members of the struct should work fine. It's up to the compiler to generate code to test the right bits of a byte, for packed bitfields.
prepare for SIMD, if warranted
If you have any code that checks just one or a couple fields of each struct, esp. looping over sequential struct values, then the struct-of-arrays answer given by cmaster will be useful. x86 vector instructions have a single byte as the smallest granularity, so a struct-of-arrays with each value in a separate byte would let you quickly scan for the first element where a == something, using PCMPEQB / PTEST.
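As a hedged illustration of that kind of scan, here is a sketch that uses PCMPEQB/PMOVMSKB via SSE2 intrinsics rather than PTEST, plus the GCC/Clang-specific __builtin_ctz; it is x86-only and the function name is made up:
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first element equal to `wanted` in a struct-of-arrays
 * byte field, or -1 if none is found. */
ptrdiff_t find_first_equal(const uint8_t *a, size_t n, uint8_t wanted)
{
    __m128i v_wanted = _mm_set1_epi8((char)wanted);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, v_wanted));
        if (mask)
            return (ptrdiff_t)(i + (size_t)__builtin_ctz((unsigned)mask));
    }
    for (; i < n; i++)                   /* scalar tail */
        if (a[i] == wanted) return (ptrdiff_t)i;
    return -1;
}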
First, precisely define what you mean by "most efficient". Best memory utilization? Best performance?
Then implement your algorithm both ways and actually profile it on the actual hardware you intend to run it on under the actual conditions you intend to run it under once it's delivered.
Pick the one that better meets your original definition of "most efficient".
Anything else is just a guess. Whatever you choose will probably work fine, but without actually measuring the difference under the exact conditions you'd use the software, you'll never know which implementation would be "more efficient".
I think the only real answer can be to write your code generically, and then profile the full program with all of them. I don't think this will take that much time, though it may look a little more awkward. Basically, I'd do something like this:
template <bool is_packed> class Foo;
using interface_int = char;

template <>
class Foo<true> {
    char m_a, m_b, m_c, m_d;
public:
    void setA(interface_int a) { m_a = a; }
    interface_int getA() { return m_a; }
    ...
};

template <>
class Foo<false> {
    char m_data;
public:
    void setA(interface_int a) { /* bit magic changes m_data */ }
    interface_int getA() { /* bit magic gets a from m_data */ }
};
If you just write your code like this instead of exposing the raw data, it will be easy to switch implementations and profile. The function calls will get inlined and will not impact performance. Note that I just wrote setA and getA instead of a function that returns a reference, which is more complicated to implement.
Code it with ints
Treat the fields as ints.
blah.x in all your code, except the declaration, will be all you will be doing. Integral promotion will take care of most cases.
When you are all done, have 3 equivalent include files: one using ints, one using char, and one using bitfields.
And then profile. Don't worry about it at this stage, because it's premature optimization, and nothing but your chosen include file will change.
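A minimal sketch of that include-file idea, collapsed into one conditional block instead of three separate headers; the macro names are made up:
/* Client code only ever says blah.a, blah.b, ...; you swap the representation
 * by swapping one include (or one -D flag). */
#if   defined(FOO_USE_INT)
    struct Foo { int a, b, c, d; };                    /* "ints" include file      */
#elif defined(FOO_USE_CHAR)
    struct Foo { char a, b, c, d; };                   /* "char" include file      */
#else
    struct Foo { unsigned char a:2, b:2, c:2, d:2; };  /* "bitfield" include file  */
#endif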
Massive Arrays and Out of Memory Errors
the whole program consists of a big array of billions of Foos;
First things first: for #2, you might find yourself or your users (if others run the software) often being unable to allocate this array successfully if it spans gigabytes. A common mistake here is to think that out-of-memory errors mean "no more memory available", when they instead often mean that the OS could not find a contiguous set of unused pages matching the requested memory size. It's for this reason that people often get confused when a request to allocate a one-gigabyte block fails even though they have 30 gigabytes of physical memory free. Once you start allocating memory in sizes that span more than, say, 1% of the typical amount of memory available, it's often time to consider avoiding one giant array to represent the whole thing.
So perhaps the first thing you need to do is rethink the data structure. Instead of allocating a single array of billions of elements, often you'll significantly reduce the odds of running into problems by allocating in smaller chunks (smaller arrays aggregated together). For example, if your access pattern is solely sequential in nature, you can use an unrolled list (arrays linked together). If random access is needed, you might use something like an array of pointers to arrays which each span 4 kilobytes. This requires a bit more work to index an element, but with this kind of scale of billions of elements, it's often a necessity.
Access Patterns
One of the things unspecified in the question are the memory access patterns. This part is critical for guiding your decisions.
For example, is the data structure solely traversed sequentially, or is random access needed? Are all of these fields: a, b, c, d, needed together all the time, or can they be accessed one or two or three at a time?
Let's try to cover all the possibilities. At the scale we're talking about, this:
struct Foo {
    int a1;
    int b1;
    int c1;
    int d1;
};
... is unlikely to be helpful. At this kind of input scale, and accessed in tight loops, your times are generally going to be dominated by the upper levels of memory hierarchy (paging and CPU cache). It no longer becomes quite as critical to focus on the lowest level of the hierarchy (registers and associated instructions). To put it another way, at billions of elements to process, the last thing you should be worrying about is the cost of moving this memory from L1 cache lines to registers and the cost of bitwise instructions, e.g. (not saying it's not a concern at all, just saying it's a much lower priority).
At a small enough scale where the entirety of the hot data fits into the CPU cache and a need for random access, this kind of straightforward representation can show a performance improvement due to the improvements at the lowest level of the hierarchy (registers and instructions), yet it would require a drastically smaller-scale input than what we're talking about.
So even this is likely to be a considerable improvement:
struct Foo {
    char a1;
    char b1;
    char c1;
    char d1;
};
... and this even more:
// Each field packs 4 values with 2 bits each.
struct Foo {
    char a4;
    char b4;
    char c4;
    char d4;
};
* Note that you could use bitfields for the above, but bitfields tend to have caveats associated with them depending on the compiler being used. I've often been careful to avoid them due to the portability issues commonly described, though this may be unnecessary in your case. However, as we adventure into SoA and hot/cold field-splitting territories below, we'll reach a point where bitfields can't be used anyway.
This code also places a focus on horizontal logic which can start to make it easier to explore some further optimization paths (ex: transforming the code to use SIMD), as it's already in a miniature SoA form.
Data "Consumption"
Especially at this kind of scale, and even more so when your memory access is sequential in nature, it helps to think in terms of data "consumption" (how quickly the machine can load data, do the necessary arithmetic, and store the results). A simple mental image I find useful is to imagine the computer as having a "big mouth". It goes faster if we feed it large enough spoonfuls of data at once, not little teeny teaspoons, and with more relevant data packed tightly into a contiguous spoonful.
Hot/Cold Field Splitting
The above code so far is making the assumption that all of these fields are equally hot (accessed frequently), and accessed together. You may have some cold fields or fields that are only accessed in critical code paths in pairs. Let's say that you rarely access c and d, or that your code has one critical loop that accesses a and b, and another that accesses c and d. In that case, it can be helpful to split it off into two structures:
struct Foo1 {
    char a4;
    char b4;
};

struct Foo2 {
    char c4;
    char d4;
};
Again if we're "feeding" the computer data, and our code is only interested in a and b fields at the moment, we can pack more into spoonfuls of a and b fields if we have contiguous blocks that only contain a and b fields, and not c and d fields. In such a case, c and d fields would be data the computer can't digest at the moment, yet it would be mixed into the memory regions in between a and b fields. If we want the computer to consume data as quickly as possible, we should only be feeding it the relevant data of interest at the moment, so it's worth splitting the structures in these scenarios.
SIMD SoA for Sequential Access
Moving towards vectorization, and assuming sequential access, the fastest rate at which the computer can consume data will often be in parallel using SIMD. In such a case, we might end up with a representation like this:
struct Foo1 {
    char* a4n;
    char* b4n;
};
... with careful attention to alignment and padding (the size/alignment should be a multiple of 16 or 32 bytes for AVX or even 64 for futuristic AVX-512) necessary to use faster aligned moves into XMM/YMM registers (and possibly with AVX instructions in the future).
AoSoA for Random/Multi-Field Access
Unfortunately the above representation can start to lose a lot of the potential benefits if a and b are accessed frequently together, especially with a random access pattern. In such a case, a more optimal representation can start looking like this:
struct Foo1 {
    char a4x32[32];
    char b4x32[32];
};
... where we're now aggregating this structure. This makes it so the a and b fields are no longer so spread apart, allowing groups of 32 a and b fields to fit into a single 64-byte cache line and accessed together quickly. We can also fit 128 or 256 a or b elements now into an XMM/YMM register.
Profiling
Normally I try to avoid general wisdom advice in performance questions, but I noticed this one seems to avoid the details that someone with a profiler in hand would typically mention. So I apologize if this comes off a bit as patronizing or if a profiler is already being actively used, but I think the question warrants this section.
As an anecdote, I've often done a better job (I shouldn't!) at optimizing production code written by people who have far superior knowledge than me about computer architecture (I worked with a lot of people who came from the punch card era and can understand assembly code at a glance), and would often get called in to optimize their code (which felt really odd). It's for one simple reason: I "cheated" and used a profiler (VTune). My peers often didn't (they had an allergy to it and thought they understood hotspots just as well as a profiler and saw profiling as a waste of time).
Of course the ideal is to find someone who has both the computer architecture expertise and a profiler in hand, but lacking one or the other, the profiler can give the bigger edge. Optimization still rewards a productivity mindset which hinges on the most effective prioritization, and the most effective prioritization is to optimize the parts that truly matter the most. The profiler gives us detailed breakdowns of exactly how much time is spent and where, along with useful metrics like cache misses and branch mispredictions which even the most advanced humans typically can't predict anywhere near as accurately as a profiler can reveal. Furthermore, profiling is often the key to discovering how the computer architecture works at a more rapid pace by chasing down hotspots and researching why they exist. For me, profiling was the ultimate entry point into better understanding how the computer architecture actually works and not how I imagined it to work. It was only then that the writings of someone as experienced in this regard as Mysticial started to make more and more sense.
Interface Design
One of the things that might start to become apparent here is that there are many optimization possibilities. The answers to this kind of question are going to be about strategies rather than absolute approaches. A lot still has to be discovered in hindsight after you try something, and still iterating towards more and more optimal solutions as you need them.
One of the difficulties here in a complex codebase is leaving enough breathing room in the interfaces to experiment and try different optimization techniques, to iterate and iterate towards faster solutions. If the interface leaves room to seek these kinds of optimizations, then we can optimize all day long and often get some marvelous results if we're measuring things properly even with a trial and error mindset.
Leaving enough breathing room in an implementation to experiment with and explore faster techniques often requires interface designs that accept data in bulk. This is especially true if the interfaces involve indirect function calls (e.g. through a dylib or a function pointer) where inlining is no longer an effective possibility. In such scenarios, leaving room to optimize without cascading interface breakages often means designing away from the mindset of receiving simple scalar parameters in favor of passing pointers to whole chunks of data (possibly with a stride if there are various interleaving possibilities). So while this is straying into pretty broad territory, a lot of the top priorities in optimizing here boil down to leaving enough breathing room to optimize implementations without cascading changes throughout your codebase, and having a profiler in hand to guide you the right way.
TL;DR
Anyway, some of these strategies should help guide you the right way. There are no absolutes here, only guides and things to try out, and always best done with a profiler in hand. Yet when processing data of this enormous scale, it's always worth remembering the image of the hungry monster, and how to most effectively feed it these appropriately-sized and packed spoonfuls of relevant data.
Let's say, you have a memory bus that's a little bit older and can deliver 10 GB/s. Now take a CPU at 2.5 GHz, and you see that you would need to handle at least four bytes per cycle to saturate the memory bus. As such, when you use the definition of
struct Foo {
    char a;
    char b;
    char c;
    char d;
};
and use all four variables in each pass through the data, your code will be CPU bound. You can't gain any speed by a denser packing.
Now, this is different when each pass only performs a trivial operation on one of the four values. In that case, you are better off with a struct of arrays:
struct Foo {
    size_t count;
    char* a; // a[count]
    char* b; // b[count]
    char* c; // c[count]
    char* d; // d[count]
};
You've stated the common and ambiguous C/C++ tag.
Assuming C++, make the data private and add getters/ setters.
No, that will not cause a performance hit - providing the optimizer is turned on.
You can then change the implementation to use the alternatives without any change to your calling code - and therefore more easily finesse the implementation based on the results of the bench tests.
For the record, I'd expect the struct with bit fields as per #dbush to be most likely the fastest given your description.
Note all this is around keeping the data in cache - you may also want to see if the design of the calling algorithm can help with that.
Getting back to the question asked :
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
This is a classic example of when you should write platform specific high performance code that takes time to design for each implementation platform, but the benefits outweigh that cost.
As it's the bottleneck of the entire program you don't look for a general solution, but recognize that this needs to have multiple approaches tested and timed against real data, as the best solution will be platform specific.
It is also possible, as it is a large array of billions of Foos, that the OP should consider using OpenCL or OpenMP as potential solutions so as to maximize the exploitation of available resources on the runtime hardware. This depends a little on what you need from the data, but it's probably the most important aspect of this type of problem - how to exploit available parallelism.
But there is no single right answer to this question, IMO.
The most efficient representation, in terms of performance and execution, is to use the processor's word size. Don't make the processor perform the extra work of packing or unpacking.
Some processors have more than one efficient size. Many ARM processors can operate in 8/32 bit mode. This means that the processor is optimized for handling 8 bit quantities or 32-bit quantities. For a processor like this, I recommend using 8-bit data types.
Your algorithm has a lot to do with the efficiency. If you are moving data or copying data you may want to consider moving data 32-bits at a time (4 8-bit quantities). The idea here is to reduce the number of fetches by the processor.
For performance, write your code to make use of registers, such as using more local variables. Fetching from memory into registers is more costly than using registers directly.
Best of all, check out your compiler optimization settings. Set your compiler to the highest performance (speed) settings. Next, generate assembly language listings of your functions. Review the listings to see how the compiler generated code. Adjust your code to improve the compiler's optimization capabilities.
If what you're after is efficiency of space, then you should consider avoiding structs altogether. The compiler will insert padding into your struct representation as necessary to make its size a multiple of its alignment requirement, which might be as much as 16 bytes (but is more likely to be 4 or 8 bytes, and could after all be as little as 1 byte).
If you use a struct anyway, then which to use depends on your implementation. If #dbush's bitfield approach yields one-byte structures then it's hard to beat that. If your implementation is going to pad the representation to at least four bytes no matter what, however, then this is probably the one to use:
struct Foo {
    char a;
    char b;
    char c;
    char d;
};
Or I guess I would probably use this variant:
struct Foo {
    uint8_t a;
    uint8_t b;
    uint8_t c;
    uint8_t d;
};
Since we're supposing that your struct is taking up a minimum of four bytes, there is no point in packing the data into smaller space. That would be counter-productive, in fact, because it would also make the processor do the extra work packing and unpacking the values within.
For handling large amounts of data, making efficient use of the CPU cache provides a far greater win than avoiding a few integer operations. If your data usage pattern is at least somewhat systematic (e.g. if after accessing one element of your erstwhile struct array, you are likely to access a nearby one next) then you are likely to get a boost in both space efficiency and speed by packing the data as tightly as you can. Depending on your C implementation (or if you want to avoid implementation dependency), you might need to achieve that differently -- for instance, via an array of integers. For your particular example of four fields, each requiring two bits, I would consider representing each "struct" as a uint8_t instead, for a total of 1 byte each.
Maybe something like this:
#include <stdint.h>
#define NUMBER_OF_FOOS 1000000000
#define A 0
#define B 2
#define C 4
#define D 6
#define SET_FOO_FIELD(foos, index, field, value) \
((foos)[index] = (((foos)[index] & ~(3 << (field))) | (((value) & 3) << (field))))
#define GET_FOO_FIELD(foos, index, field) (((foos)[index] >> (field)) & 3)
typedef uint8_t foo;
foo all_the_foos[NUMBER_OF_FOOS];
The field name macros and access macros provide a more legible -- and adjustable -- way to access the individual fields than would direct manipulation of the array (but be aware that these particular macros evaluate some of their arguments more than once). Every bit is used, giving you about as good cache usage as it is possible to achieve through choice of data structure alone.
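A small usage sketch of those macros; the array size is shrunk here purely so the example links and runs on a desktop:
#include <stdint.h>
#include <stdio.h>

#define NUMBER_OF_FOOS 1000   /* reduced from 1000000000 for the example */
#define A 0
#define B 2
#define C 4
#define D 6
#define SET_FOO_FIELD(foos, index, field, value) \
    ((foos)[index] = (((foos)[index] & ~(3 << (field))) | (((value) & 3) << (field))))
#define GET_FOO_FIELD(foos, index, field) (((foos)[index] >> (field)) & 3)

typedef uint8_t foo;
foo all_the_foos[NUMBER_OF_FOOS];

int main(void)
{
    SET_FOO_FIELD(all_the_foos, 42, B, 3);   /* set field b of foo #42 to 3 */
    SET_FOO_FIELD(all_the_foos, 42, D, 1);   /* set field d of foo #42 to 1 */
    printf("b=%d d=%d\n",
           GET_FOO_FIELD(all_the_foos, 42, B),
           GET_FOO_FIELD(all_the_foos, 42, D));
    return 0;
}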
I did video decompression for a while. The fastest thing to do is something like this:
short ABCD; //use a 16 bit data type for your example
and set up some macros. Maybe:
#define GETA ((ABCD >> 12) & 0x000F)
#define GETB ((ABCD >> 8) & 0x000F)
#define GETC ((ABCD >> 4) & 0x000F)
#define GETD (ABCD & 0x000F) // no need to shift D
In practice you should try to be moving 32-bit longs or 64-bit long longs because that's the native move size on most modern processors.
Using a struct will always create the overhead in your compiled code of extra instructions to get from the base address of your struct to the field. So get away from that if you really want to tighten your loop.
Edit:
The above example gives you 4-bit values. If you really just need values of 0..3 then you can do the same thing to pull out your 2-bit numbers, so GETA might look like this:
#define GETA ((ABCD >> 14) & 0x0003)
And if you are really moving billions of things, and I don't doubt it, just fill up a 32-bit variable and shift and mask your way through it.
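For example, a sketch of packing sixteen 2-bit values into one uint32_t and pulling them back out with shift-and-mask; the macro names are made up:
#include <stdint.h>

/* slot ranges from 0 to 15; e.g. SET2(abcd16, 3, 2); then GET2(abcd16, 3) == 2 */
#define GET2(word, slot)      (((word) >> ((slot) * 2)) & 0x3u)
#define SET2(word, slot, val) ((word) = ((word) & ~(0x3u << ((slot) * 2))) | \
                                        (((uint32_t)(val) & 0x3u) << ((slot) * 2)))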
Hope this helps.

Combine two bytes from gyroscope into signed angular rate

I've got two 8-bit chars. They're the product of some 16-bit signed float being broken up into MSB and LSB inside a gyroscope.
The standard method I know of combining two bytes is this:
(signed float) = (((MSB value) << 8) | (LSB value));
Just returns garbage.
How can I do this?
Okay, so, dear me from ~4 years ago:
First of all, the gyroscope you're working with is a MAX21000. The datasheet, as far as future you can see, doesn't actually describe the endianness of the I2C connection, which probably also tripped you up. However, the SPI connection does state that the data is transmitted MSB-first, with the top 8-bits of the axis data in the first byte, and the additional 8 in the next.
To your credit, the datasheet doesn't really go into what type those 16 bits represent - however, that's because it's standardized across manufacturers.
The real reason why you got such meaningless values when converting to float is that the gyro isn't sending a float. Why'd you even think it would?
The gyro sends a plain ol' int16 (short). A simple search for "i2c gyro interface" would have made that clear. How do you get that into a decimal angular rate? You divide by 32,768 (2^15), then multiply by the full-scale range set on the gyro.
Simple! Here, want a code example?
float X_angular_rate = ((int16_t)(((uint16_t)byte_1 << 8) | byte_2) / 32768.0f) * GYRO_SCALE;
However, I think that it's important to note that the data from these gyroscopes alone is not, in itself, as useful as you thought; to my current knowledge, due to their poor zero-rate drift characteristics, MEMS gyros are almost always used in a sensor fusion setup with an accelerometer and a Kalman filter to make a proper IMU.
Any position and attitude derived from dead-reckoning without this added complexity is going to be hopelessly inaccurate after mere minutes, which is why you added an accelerometer to the next revision of the board.
You have shown two bytes, and float is 4 bytes on most systems. What did you do with the other two bytes of the original float you deconstructed? You should preserve and re-construct all four original bytes if possible. If you can't, and you have to omit any bytes, set them to zero, and make them the least significant bits in the fractional part of the float and hopefully you'll get an answer with satisfactory precision.
The layout of an IEEE-754 single-precision float (1 sign bit, 8 exponent bits, 23 fraction bits, from most to least significant) gives you the bit positions, so acting in accordance with the endianness of your system, you should be able to construct a valid float based on how you deconstructed the original. It can really help to write a function to display values as binary numbers, line them up, and display initial, intermediate and end results to ensure that you're really accomplishing what you think (hope) you are.
To get a valid result you have to put something sensible into those bits.
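A minimal sketch of that reconstruction, assuming all four original bytes were kept, that float is a 4-byte IEEE-754 single on both ends, and that b0 is the least significant byte of the little-endian representation:
#include <stdint.h>
#include <string.h>

/* Reassemble a float from 4 raw bytes; memcpy avoids strict-aliasing problems. */
float float_from_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    uint32_t bits = (uint32_t)b0 | ((uint32_t)b1 << 8) |
                    ((uint32_t)b2 << 16) | ((uint32_t)b3 << 24);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}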

Calculating an 8-bit CRC with the C preprocessor?

I'm writing code for a tiny 8-bit microcontroller with only a few bytes of RAM. It has a simple job: transmit 7 16-bit words, then the CRC of those words. The values of the words are chosen at compile time. The CRC specifically is the "remainder of division of
word 0 to word 6 as unsigned number divided by the polynomial x^8 + x^2 + x + 1 (initial value 0xFF)."
Is it possible to calculate the CRC of those bytes at compile time using the C preprocessor?
#define CALC_CRC(a,b,c,d,e,f,g) /* what goes here? */
#define W0 0x6301
#define W1 0x12AF
#define W2 0x7753
#define W3 0x0007
#define W4 0x0007
#define W5 0x5621
#define W6 0x5422
#define CRC CALC_CRC(W0, W1, W2, W3, W4, W5, W6)
It is possible to design a macro which will perform a CRC calculation at compile time. Something like
// Choosing names to be short and hopefully unique.
// cZ0 ... cZ7 are assumed to be precomputed constants: the CRC contribution of each individual bit of n.
#define cZX(n,b,v) (((n) & (1 << (b))) ? (v) : 0)
#define cZY(n,b,w,x,y,z) (cZX(n,(b),w) ^ cZX(n,(b)+1,x) ^ cZX(n,(b)+2,y) ^ cZX(n,(b)+3,z))
#define CRC(n) (cZY(n,0,cZ0,cZ1,cZ2,cZ3) ^ cZY(n,4,cZ4,cZ5,cZ6,cZ7))
should probably work, and will be very efficient if (n) can be evaluated as a compile-time constant; it will simply evaluate to a constant itself. On the other hand, if n is an expression, that expression will end up getting recomputed eight times. Even if n is a simple variable, the resulting code will likely be significantly larger than the fastest non-table-based way of writing it, and may be slower than the most compact way of writing it.
BTW, one thing I'd really like to see in the C and C++ standard would be a means of specifying overloads which would be used for functions declared inline only if particular parameters could be evaluated as compile-time constants. The semantics would be such that there would be no 'guarantee' that any such overload would be used in every case where a compiler might be able to determine a value, but there would be a guarantee that (1) no such overload would be used in any case where a "compile-time-const" parameter would have to be evaluated at runtime, and (2) any parameter which is considered a constant in one such overload will be considered a constant in any functions invoked from it. There are a lot of cases where a function could written to evaluate to a compile-time constant if its parameter is constant, but where run-time evaluation would be absolutely horrible. For example:
#define bit_reverse_byte(n) ( (((n) & 128)>>7)|(((n) & 64)>>5)|(((n) & 32)>>3)|(((n) & 16)>>1)| \
                              (((n) & 8)<<1)|(((n) & 4)<<3)|(((n) & 2)<<5)|(((n) & 1)<<7) )
#define bit_reverse_word(n) (bit_reverse_byte((n) >> 8) | (bit_reverse_byte(n) << 8))
A simple rendering of a non-looped single-byte bit-reverse function in C on the PIC would be about 17-19 single-cycle instructions; a word bit-reverse would be 34, or about 10 plus a byte-reverse function (which would execute twice). Optimal assembly code would be about 15 single-cycle instructions for a byte reverse or 17 for a word reverse. Computing bit_reverse_byte(b) for some byte variable b would take many dozens of instructions totalling many dozens of cycles. Computing bit_reverse_word(w) for some 16-bit word w would probably take hundreds of instructions taking hundreds or thousands of cycles to execute. It would be really nice if one could mark a function to be expanded inline using something like the above formulation in the scenario where it would expand to a total of four instructions (basically just loading the result), but use a function call in scenarios where inline expansion would be heinous.
The simplest checksum algorithm is the so-called longitudinal parity check, which breaks the data into "words" with a fixed number n of bits, and then computes the exclusive or of all those words. The result is appended to the message as an extra word.
To check the integrity of a message, the receiver computes the exclusive or of all its words, including the checksum; if the result is not a word with n zeros, the receiver knows that a transmission error occurred.
(source: Wikipedia)
In your example:
#define CALC_LRC(a,b,c,d,e,f,g) ((a)^(b)^(c)^(d)^(e)^(f)^(g))
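Using the words from the question with that macro, the checksum folds to a compile-time constant (note this is a plain XOR checksum, not the CRC-8 the spec quotes):
#define W0 0x6301
#define W1 0x12AF
#define W2 0x7753
#define W3 0x0007
#define W4 0x0007
#define W5 0x5621
#define W6 0x5422
#define CHECKSUM CALC_LRC(W0, W1, W2, W3, W4, W5, W6)  /* evaluates to a constant at compile time */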
Disclaimer: this is not really a direct answer, but rather a series of questions and suggestions that are too long for a comment.
First Question: Do you have control over both ends of the protocol, e.g. can you choose the checksum algorithm by means of either yourself or a coworker controlling the code on the other end?
If YES to question #1:
You need to evaluate why you need the checksum, what checksum is appropriate, and the consequences of receiving a corrupt message with a valid checksum (which factors into both the what & why).
What is your transmission medium, protocol, bitrate, etc.? Are you expecting/observing bit errors? For example, with SPI or I2C from one chip to another on the same board, if you have bit errors, it's probably the HW engineer's fault, or you need to slow the clock rate, or both. A checksum can't hurt, but shouldn't really be necessary. On the other hand, with an infrared signal in a noisy environment, you'll have a much higher probability of error.
The consequences of a bad message are always the most important question here. So if you're writing the controller for a digital room thermometer and sending a message to update the display 10x a second, one bad value every 1000 messages does very little if any real harm. No checksum or a weak checksum should be fine.
If these 6 bytes fire a missile, set the position of a robotic scalpel, or cause the transfer of money, you better be damn sure you have the right checksum, and may even want to look into a cryptographic hash (which may require more RAM than you have).
For in-between stuff, with noticeable detriment to performance/satisfaction with the product, but no real harm, its your call. For example, a TV that occasionally changes the volume instead of the channel could annoy the hell out of customers--more so than simply dropping the command if a good CRC detects an error, but if you're in the business of making cheap/knock-off TVs that might be OK if it gets product to market faster.
So what checksum do you need?
If either or both ends have HW support for a checksum built into the peripheral (fairly common in SPI for example), that might be a wise choice. Then it becomes more or less free to calculate.
An LRC, as suggested by vulkanino's answer, is the simplest algorithm.
Wikipedia has some decent info on how/why to choose a polynomial if you really need a CRC:
http://en.wikipedia.org/wiki/Cyclic_redundancy_check
If NO to question #1:
What CRC algorithm/polynomial does the other end require? That's what you're stuck with, but telling us might get you a better/more complete answer.
Thoughts on implementation:
Most of the algorithms are pretty light-weight in terms of RAM/registers, requiring only a couple extra bytes. In general, a function will result in better, cleaner, more readable, debugger-friendly code.
You should think of the macro solution as an optimization trick, and like all optimization tricks, jumping to them too early can be a waste of development time and a cause of more problems than it's worth.
Using a macro also has some strange implications you may not have considered yet:
You are aware that the preprocessor can only do the calculation if all the bytes in a message are fixed at compile time, right? If you have a variable in there, the compiler has to generate code. Without a function, that code will be inlined every time it's used (yes, that could mean lots of ROM usage). If all the bytes are variable, that code might be worse than just writing the function in C. Or with a good compiler, it might be better. Tough to say for certain. On the other hand, if a different number of bytes are variable depending on the message being sent, you might end up with several versions of the code, each optimized for that particular usage.

Efficient container for bits

I have a bit array that can be very dense in some parts and very sparse in others. The array can get as large as 2**32 bits. I am turning it into a bunch of tuples containing offset and length to make it more efficient to deal with in memory. However, this sometimes is less efficient with things like 10101010100011. Any ideas on a good way of storing this in memory?
If I understand correctly, you're using tuples of (offset, length) to represent runs of 1 bits? If so, a better approach would be to use runs of packed bitfields. For dense areas, you get a nice efficient array, and in non-dense areas you get implied zeros. For example, in C++, the representation might look like:
// The map key is the offset; the vector's length gives you the length
std::map<unsigned int, std::vector<uint32_t> >
A lookup would consist of finding the key before the bit position in question, and seeing if the bit falls in its vector. If it does, use the value from the vector. Otherwise, return 0. For example:
typedef std::map<unsigned int, std::vector<uint32_t> > bitmap; // for convenience
typedef std::vector<uint32_t> bitfield;                        // also convenience

bool get_bit(const bitmap &bm, unsigned int idx) {
    unsigned int offset = idx / 32;
    bitmap::const_iterator it = bm.upper_bound(offset);
    // it is the element /after/ the one we want
    if (it == bm.begin()) {
        // but it's the first, so we don't have the target element
        return false;
    }
    it--;
    // make offset relative to this element's start
    offset -= it->first;
    // does our word fall within this element?
    if (offset >= it->second.size())
        return false; // nope
    uint32_t bf = it->second[offset];
    // extract the bit of interest
    return (bf & (1u << (idx % 32))) != 0;
}
It would help to know more. By "very sparse/dense," do you mean millions of consecutive zeroes/ones, or do you mean local (how local?) proportions of 0's very close to 0 or 1? Does one or the other value predominate? Are there any patterns that might make run-length encoding effective? How will you use this data structure? (Random access? What kind of distribution of accessed indexes? Are huge chunks never or very rarely accessed?)
I can only guess you aren't going to be randomly accessing and modifying all 4 billion bits at rates of billions of bits/second. Unless it is phenomenally sparse/dense on a local level (such as any million consecutive bits are likely to be the same except for 5 or 10 bits) or full of large scale repetition or patterns, my hunch is that the choice of data structure depends more on how the array is used than on the nature of the data.
How to structure things will depend on your data. To represent large amounts of data compactly, you need long runs of zeros or ones; that is what removes the need to represent every bit explicitly. If that is not the case and you have approximately the same number of ones and zeros, you are better off just storing all of the bits directly.
It might help to think of this as a compression problem. For compression to be effective there has to be a pattern (or a limited set of items used out of the entire space) and an uneven distribution. If all the elements are used and evenly distributed, compression is hard to do, or could take more space than the actual data.
If there are only runs of zeros and ones (longer than just one), using offset and length might make some sense. If the runs are inconsistent, you could just copy the bits as a bit array where you have offset, length, and values.
How efficient the above is will depend on whether you have large runs of ones or zeros. You will want to be careful to make sure you are not using more memory for the representation than simply storing the bits directly would take.
Check out the bison source code and look at its bitset implementation. It provides several flavors of implementation to deal with bit arrays of different densities.
How many of these do you intend to keep in memory at once?
As far as I can see, 2**32 bits = 512M, only half a gig, which isn't very much memory nowadays. Do you have anything better to do with it?
Assuming your server has enough ram, allocate it all at startup, then keep it in memory, the network handling thread can execute in just a few instructions in constant time - it should be able to keep up with any workload.
