Efficiency of data structures in C99 (possibly affected by endianness) - c

I have a couple of questions that are all inter-related. Basically, in the algorithm I am implementing a word w is defined as four bytes, so it can be contained whole in a uint32_t.
However, during the operation of the algorithm I often need to access the various parts of the word. Now, I can do this in two ways:
uint32_t w = 0x11223344;
uint8_t a = (w & 0xff000000) >> 24;
uint8_t b = (w & 0x00ff0000) >> 16;
uint8_t c = (w & 0x0000ff00) >> 8;
uint8_t d = (w & 0x000000ff);
However, part of me thinks that isn't particularly efficient. I thought a better way would be to use union representation like so:
typedef union
{
struct
{
uint8_t d;
uint8_t c;
uint8_t b;
uint8_t a;
};
uint32_t n;
} word32;
Using this method I can assign w.n = 0x11223344; and then access the various parts as I require (w.a == 0x11 on a little-endian system).
However, at this stage I come up against endianness issues, namely, in big endian systems my struct is defined incorrectly so I need to re-order the word prior to it being passed in.
This I can do without too much difficulty. My question is, then, is the first part (various bitwise ands and shifts) efficient compared to the implementation using a union? Is there any difference between the two generally? Which way should I go on a modern, x86_64 processor? Is endianness just a red herring here?
I could inspect the assembly output of course, but my knowledge of compilers is not brilliant. I would have thought a union would be more efficient as it would essentially convert to memory offsets, like so:
mov eax, [r9+8]
Would a compiler realise that is what happening in the bit-shift case above?
If it matters, I'm using C99, specifically my compiler is clang (llvm).
Thanks in advance.

If you need AES, why not use an existing implementation? This can be particularly beneficial on modern Intel processors with hardware support for AES.
The union trick can slow down things due to store-to-load-forwarding (STLF) failures. This may happen, depending on the processor model, if you write data to memory and read it back soon as a different data type (e.g. 32bit vs 8bit).
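For illustration, a minimal sketch of the access pattern that warning is about, using the word32 union defined in the question (the function and parameter names here are made up for the example):
#include <stdint.h>

/* word32 is the union from the question: bytes d,c,b,a overlaid on uint32_t n. */
uint8_t top_byte_after_update(word32 *w, uint32_t fresh_value)
{
    w->n = fresh_value;  /* 32-bit store to memory... */
    return w->a;         /* ...followed closely by an 8-bit load from part of it;
                            on some processor models the load cannot be forwarded
                            from the store buffer and stalls until the store completes */
}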

Such a thing is hard to tell without being able to inspect the real use of these operations in your code:
The shift version will probably do better if you happen to have all your variables in registers anyhow and then do intensive computations on them. Compilers (clang included) are usually quite clever about issuing instructions for partial words and the like.
The union version would perhaps be more efficient if you had to load your bytes from memory most of the time.
In any case I would abstract the access operation into a macro, such that you can modify it easily once you have working code.
For my personal taste I would go for the shift version, since it is conceptually simpler, and only switch to the union if the produced assembler turns out not to be satisfactory.
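For example, a minimal sketch of such a macro abstraction (the macro name is mine); it is written with shifts here but could later be re-implemented with the union or a byte pointer without touching any callers:
#include <stdint.h>

/* Byte i of a 32-bit word, counting from the least significant byte. */
#define WORD_BYTE(w, i) ((uint8_t)(((uint32_t)(w) >> (8 * (i))) & 0xffu))

uint32_t w = 0x11223344;
uint8_t a = WORD_BYTE(w, 3); /* 0x11 */
uint8_t d = WORD_BYTE(w, 0); /* 0x44 */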

I would guess using a union may be more efficient. Of course, the compiler may be able to optimize the shifts into byte loads since they are known during compilation -- in which case both schemes will yield identical code.
Another option (also byte order dependent) is to cast the word to a byte array and access the bytes directly. I.e., something like the following
uint8_t b = ((uint8_t*)&w)[n];
I'm not sure you will see any difference on a real modern 32/64 bit processor, though.
EDIT: It seems like clang produces identical code in both cases.

Given that accessing bits using shift and masking is a common operation I'd expect compilers to be quite smart about it especially if you're using constant shift count and mask.
An option would be to use macros for bit set/get so that you can pick the best strategy at configure time if on a specific platform a compiler happens to be on the dumb side (and wisely chosen names for the macros can also make the code more clear and self explaining).

Related

Comparison uint8_t vs uint16_t while declaring a counter

Assuming we have a counter that counts from 0 to 100, is there an advantage to declaring the counter variable as uint16_t instead of uint8_t?
Obviously if I use uint8_t I could save some space. On a processor with natural wordsize of 16 bits access times would be the same for both I guess. I couldn't think why I would use a uint16_t if uint8_t can cover the range.
Using a wider type than necessary can allow the compiler to avoid having to mask the higher bits.
Suppose you were working on a 16-bit architecture; then using uint16_t could be more efficient. However, if you used uint16_t instead of uint8_t on a 32-bit architecture, you would still have the mask instructions, just masking a different number of bits.
The most efficient type to use in a cross-platform portable way is just plain int or unsigned int, which will always be the correct type to avoid the need for masking instructions, and will always be able to hold numbers up to 100.
If you are in a MISRA or similar regulated environment that forbids the use of native types, then the correct standard-compliant type to use is uint_fast8_t. This guarantees to be the fastest unsigned integer type that has at least 8 bits.
However, all of this is nonsense really. Your primary goal in writing code should be to make it readable, not to make it as fast as possible. Penny-pinching instructions like this makes code convoluted and more likely to have bugs. Also because it is harder to read, the bugs are less likely to be found during code review.
You should only try to optimize like this once the code is finished and you have tested it and found the particular part which is the bottleneck. Masking a loop counter is very unlikely to be the bottleneck in any real code.
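For reference, a sketch of what the two recommendations above look like in a loop (the function name is hypothetical); with optimization enabled, both typically compile to the same code as a plain int counter:
#include <stdint.h>

void clear_first_100(uint8_t *dst)
{
    /* unsigned int: the natural word size, never needs masking */
    for (unsigned int i = 0; i < 100; i++)
        dst[i] = 0;

    /* uint_fast8_t: at least 8 bits, whatever width the platform handles fastest */
    for (uint_fast8_t j = 0; j < 100; j++)
        dst[j] = 0;
}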
Obviously if I use uint8_t I could save some space.
Actually, that's not necessarily obvious! A loop index variable is likely to end up in a register, and if it does there's no memory to be saved. Also, since the definition of the C language says that much arithmetic takes place using type int, it's possible that using a variable smaller than int might actually end up costing you space in terms of extra code emitted by the compiler to convert back and forth between int and your smaller variable. So while it could save you some space, it's not at all guaranteed that it will — and, in any case, the actual savings are going to be almost imperceptibly small in the grand scheme of things.
If you have an array of some number of integers in the range 0-100, using uint8_t is a fine idea if you want to save space. For an individual variable, on the other hand, the arguments are pretty different.
In general, I'd say that there are two reasons not to use type uint8_t (or, equivalently, char or unsigned char) as a loop index:
It's not going to save much data space (if at all), and it might cost code size and/or speed.
If the loop runs over exactly 256 elements (yours didn't, but I'm speaking more generally here), you may have introduced a bug (which you'll discover soon enough): your loop may run forever.
The interviewer was probably expecting #1 as an answer. It's not a guaranteed answer — under plenty of circumstances, using the smaller type won't cost you anything, and evidently there are microprocessors where it can actually save something — but as a general rule, I agree that using an 8-bit type as a loop index is, well, silly. And whether or not you agree, it's certainly an issue to be aware of, so I think it's a fair interview question.
See also this question, which discusses the same sorts of issues.
The interview question doesn't make much sense from a platform-generic point of view. If we look at code such as this:
for(uint8_t i=0; i<n; i++)
array[i] = x;
Then the expression i<n will get carried out on type int or larger because of implicit promotion. Though the compiler may optimize it to use a smaller type if it doesn't affect the result.
As for array[i], the compiler is likely to use a type corresponding to whatever address size the system is using.
What the interviewer was likely fishing for is that uint32_t on a 32-bitter tends to generate faster code in some situations. For those cases you can use uint_fast8_t, but more likely than not the compiler will perform such optimizations regardless.
The only optimization uint8_t blocks the compiler from doing, is to allocate a larger variable than 8 bits on the stack. It doesn't however block the compiler from optimizing out the variable entirely and using a register instead. Such as for example storing it in an index register with the same width as the address bus.
Example with gcc x86_64: https://godbolt.org/z/vYscf3KW9. The disassembly is pretty painful to read, but the compiler just picked CPU registers to store everything regardless of the type of i, giving identical machine code for uint8_t and uint16_t. I would have been surprised if it didn't.
On a processor with natural wordsize of 16 bits access times would be the same for both I guess.
Yes, this is true for all mainstream 16-bitters. Some might even manage faster code if given 8 bits instead of 16. Some exotic systems like DSPs exist, but in the case of, say, a DSP where 1 byte = 16 bits, the compiler doesn't even provide uint8_t to begin with - it is an optional type. One generally doesn't bother with portability to wildly exotic systems, since doing so is a waste of everyone's time and money.
The correct answer: it is senseless to do manual optimization without a specific system in mind. uint8_t is perfectly fine to use for generic, portable code.

What is the most efficient way to represent small values in a struct?

Often I find myself having to represent a structure that consists of very small values. For example, Foo has 4 values, a, b, c, d, that range from 0 to 3. Usually I don't care, but sometimes, those structures are
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
In that case, I find myself having trouble deciding how to represent Foo efficiently. I have basically 4 options:
struct Foo {
int a;
int b;
int c;
int d;
};
struct Foo {
char a;
char b;
char c;
char d;
};
struct Foo {
char abcd;
};
struct FourFoos {
int abcd_abcd_abcd_abcd;
};
They use 128, 32, 8, 8 bits respectively per Foo, ranging from sparse to densely packed. The first example is probably the most linguistic one, but using it would essentially increase the size of the data by 16 times, which doesn't sound quite right. Moreover, most of the memory would be filled with zeroes and not used at all, which makes me wonder if this isn't a waste. On the other hand, packing them densely brings an additional overhead for reading them.
What is the computationally 'fastest' method for representing small values in a struct?
For dense packing that doesn't incur a large overhead of reading, I'd recommend a struct with bitfields. In your example where you have four values ranging from 0 to 3, you'd define the struct as follows:
struct Foo {
unsigned char a:2;
unsigned char b:2;
unsigned char c:2;
unsigned char d:2;
};
This has a size of 1 byte, and the fields can be accessed simply, i.e. foo.a, foo.b, etc.
By making your struct more densely packed, that should help with cache efficiency.
Edit:
To summarize the comments:
There's still bit fiddling happening with a bitfield, however it's done by the compiler and will most likely be more efficient than what you would write by hand (not to mention it makes your source code more concise and less prone to introducing bugs). And given the large amount of structs you'll be dealing with, the reduction of cache misses gained by using a packed struct such as this will likely make up for the overhead of bit manipulation the struct imposes.
Pack them only if space is a consideration - for example, an array of 1,000,000 structs. Otherwise, the code needed to do shifting and masking is greater than the savings in space for the data. Hence you are more likely to have a cache miss on the I-cache than the D-cache.
There is no definitive answer, and you haven't given enough information to allow a "right" choice to be made. There are trade-offs.
Your statement that your "primary goal is time efficiency" is insufficient, since you haven't specified whether I/O time (e.g. to read data from file) is more of a concern than computational efficiency (e.g. how long some set of computations take after a user hits a "Go" button).
So it might be appropriate to write the data as a single char (to reduce time to read or write) but unpack it into an array of four int (so subsequent calculations go faster).
Also, there is no guarantee that an int is 32 bits (which you have assumed in your statement that the first packing uses 128 bits). An int can be 16 bits.
Foo has 4 values, a, b, c, d that, range from 0 to 3. Usually I don't
care, but sometimes, those structures are ...
There is another option: since the values 0 ... 3 likely indicate some sort of state, you could consider using "flags"
enum{
A_1 = 1<<0,
A_2 = 1<<1,
A_3 = A_1|A_2,
B_1 = 1<<2,
B_2 = 1<<3,
B_3 = B_1|B_2,
C_1 = 1<<4,
C_2 = 1<<5,
C_3 = C_1|C_2,
D_1 = 1<<6,
D_2 = 1<<7,
D_3 = D_1|D_2,
//you could continue to ... D7_3 for 32/64 bits if it makes sense
};
This isn't much different than using bitfields for most situations, but can drastically reduce your conditional logic.
if ( a < 2 && b < 2 && c < 2 && d < 2) // .... (4 comparisons)
//vs.
if ( (abcd & (A_2|B_2|C_2|D_2)) == 0 ) //(one bitop with a constant and a 0-compare)
Depending what kinds of operations you will be doing on the data, it may make sense to use either 4 or 8 sets of abcd and pad out the end with 0s as needed. That could allow up to 32 comparisons to be replaced with a bitop and 0-compare.
For instance, if you wanted to set the "1 bit" in all 8 sets of 4 in a 64-bit variable, you can do uint64_t abcd8 = 0x5555555555555555ULL. To then set all the 2 bits as well, you could do abcd8 |= 0xAAAAAAAAAAAAAAAAULL, making every value 3.
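As a concrete sketch of that idea (the helper name is mine), testing whether every value in all 8 packed sets is still below 2 collapses to a single mask-and-compare:
#include <stdbool.h>
#include <stdint.h>

/* Each byte of abcd8 packs one a,b,c,d set using the flag layout above;
   0xAA selects the "2 bit" of every 2-bit field in every byte. */
static bool all_below_two(uint64_t abcd8)
{
    return (abcd8 & 0xAAAAAAAAAAAAAAAAULL) == 0;
}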
Addendum:
On further consideration, you could use a union as your type and either do a union with char and #dbush's bitfields (these flag operations would still work on the unsigned char) or use char types for each a,b,c,d and union them with unsigned int. This would allow both a compact representation and efficient operations depending on what union member you use.
union Foo {
char abcd; //Note: you can use flags and bitops on this too
struct {
unsigned char a:2;
unsigned char b:2;
unsigned char c:2;
unsigned char d:2;
};
};
Or even extended further
union Foo {
uint64_t abcd8; //Note: you can use flags and bitops on these too
uint32_t abcd4[2];
uint16_t abcd2[4];
uint8_t abcd[8];
struct {
unsigned char a:2;
unsigned char b:2;
unsigned char c:2;
unsigned char d:2;
} _[8];
};
union Foo myfoo = {0xFFFFFFFFFFFFFFFFULL};
//assert(myfoo._[0].a == 3 && myfoo.abcd[0] == 0xFF);
This method does introduce some endianness differences, which would also be a problem if you use a union to cover any other combination of your other methods.
union Foo {
uint32_t abcd;
uint32_t dcba; //only here for endian purposes
struct { //anonymous struct
char a;
char b;
char c;
char d;
};
};
You could experiment and measure with different union types and algorithms to see which parts of the unions are worth keeping, then discard the ones that are not useful. You may find that operating on several char/short/int types simultaneously gets automatically optimized to some combination of AVX/simd instructions whereas using bitfields does not unless you manually unroll them... there is no way to know until you test and measure them.
Fitting your data set in cache is critical. Smaller is always better, because hyperthreading competitively shares the per-core caches between the hardware threads (on Intel CPUs). Comments on this answer include some numbers for costs of cache misses.
On x86, loading 8bit values with sign or zero-extension into 32 or 64bit registers (movzx or movsx) is literally just as fast as plain mov of a byte or 32bit dword. Storing the low byte of a 32bit register also has no overhead. (See Agner Fog's instruction tables and C / asm optimization guides here).
Still x86-specific: [u]int8_t temporaries are ok, too, but avoid [u]int16_t temporaries. (load/store from/to [u]int16_t in memory is fine, but working with 16bit values in registers has big penalties from the operand-size prefix decoding slowly on Intel CPUs.) 32bit temporaries will be faster if you want to use them as an array index. (Using 8bit registers doesn't zero the high 24/56bits, so it takes an extra instruction to zero or sign-extend, to use an 8bit register as an array index, or in an expression with a wider type (like adding it to an int.)
I'm unsure what ARM or other architectures can do as far as efficient zero/sign extension from single-byte loads, or for single-byte stores.
Given this, my recommendation is pack for storage, use int for temporaries. (Or long, but that will increase code size slightly on x86-64, because a REX prefix is needed to specify a 64bit operand size.) e.g.
int a_i = foo[i].a;
int b_i = foo[i].b;
...;
foo[i].a = a_i + b_i;
bitfields
Packing into bitfields will have more overhead, but can still be worth it. Testing a compile-time-constant-bit-position (or multiple bits) in a byte or 32/64bit chunk of memory is fast. If you actually need to unpack some bitfields into ints and pass them to a non-inline function call or something, that will take a couple extra instructions to shift and mask. If this gives even a small reduction in cache misses, this can be worth it.
Testing, setting (to 1) or clearing (to 0) a bit or group of bits can be done efficiently with OR or AND, but assigning an unknown boolean value to a bitfield takes more instructions to merge the new bits with the bits for other fields. This can significantly bloat code if you assign a variable to a bitfield very often. So using int foo:6 and things like that in your structs, because you know foo doesn't need the top two bits, is not likely to be helpful. If you're not saving many bits compared to putting each thing in its own byte/short/int, then the reduction in cache misses won't outweigh the extra instructions (which can add up into I-cache / uop-cache misses, as well as the direct extra latency and work of the instructions).
The x86 BMI1 / BMI2 (Bit-Manipulation) instruction-set extensions will make copying data from a register into some destination bits (without clobbering the surrounding bits) more efficient. BMI1: Haswell, Piledriver. BMI2: Haswell, Excavator (unreleased). Note that like SSE/AVX, this will mean you'd need BMI versions of your functions, and fallback non-BMI versions for CPUs that don't support those instructions. AFAIK, compilers don't have options to recognize patterns for these instructions and use them automatically. They're only usable via intrinsics (or asm).
As in dbush's answer, packing into bitfields is probably a good choice, depending on how you use your fields. Your fourth option (of packing four separate abcd values into one struct) is probably a mistake, unless you can do something useful with four sequential abcd values (vector-style).
code generically, try both ways
For a data structure your code uses extensively, it makes sense to set things up so you can flip from one implementation to another, and benchmark. Nir Friedman's answer, with getters/setters is a good way to go. However, just using int temporaries and working with the fields as separate members of the struct should work fine. It's up to the compiler to generate code to test the right bits of a byte, for packed bitfields.
prepare for SIMD, if warranted
If you have any code that checks just one or a couple fields of each struct, esp. looping over sequential struct values, then the struct-of-arrays answer given by cmaster will be useful. x86 vector instructions have a single byte as the smallest granularity, so a struct-of-arrays with each value in a separate byte would let you quickly scan for the first element where a == something, using PCMPEQB / PTEST.
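As a sketch of what that scan could look like with SSE2 intrinsics (PCMPEQB via _mm_cmpeq_epi8, with a movemask instead of PTEST for simplicity; the function name and the assumption of one byte per a value are mine):
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Returns the index of the first element whose a equals target, or n if none.
   Assumes the a values live in their own contiguous byte array (struct-of-arrays). */
size_t find_first_a(const uint8_t *a, size_t n, uint8_t target)
{
    const __m128i needle = _mm_set1_epi8((char)target);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(a + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)
            return i + (size_t)__builtin_ctz((unsigned)mask); /* gcc/clang builtin */
    }
    for (; i < n; i++)  /* scalar tail */
        if (a[i] == target)
            return i;
    return n;
}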
First, precisely define what you mean by "most efficient". Best memory utilization? Best performance?
Then implement your algorithm both ways and actually profile it on the actual hardware you intend to run it on under the actual conditions you intend to run it under once it's delivered.
Pick the one that better meets your original definition of "most efficient".
Anything else is just a guess. Whatever you choose will probably work fine, but without actually measuring the difference under the exact conditions you'd use the software, you'll never know which implementation would be "more efficient".
I think the only real answer can be to write your code generically, and then profile the full program with all of them. I don't think this will take that much time, though it may look a little more awkward. Basically, I'd do something like this:
template <bool is_packed> class Foo;
using interface_int = char;
template <>
class Foo<true> {
char m_data; // all four 2-bit fields packed into one byte
public:
void setA(interface_int a) { /* bit magic changes m_data */ }
interface_int getA() { /* bit magic extracts a from m_data */ return 0; }
...
};
template <>
class Foo<false> {
char m_a, m_b, m_c, m_d;
public:
void setA(interface_int a) { m_a = a; }
interface_int getA() { return m_a; }
};
If you just write your code like this instead of exposing the raw data, it will be easy to switch implementations and profile. The function calls will get inlined and will not impact performance. Note that I just wrote setA and getA instead of a function that returns a reference, this is more complicated to implement.
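Since the question is also tagged c, roughly the same idea in C99 would be a pair of struct layouts selected at compile time behind static inline accessors (a sketch; the names and the FOO_PACKED switch are mine):
#include <stdint.h>

#ifdef FOO_PACKED
typedef struct { uint8_t abcd; } Foo;       /* all four fields in one byte, 2 bits each */
static inline unsigned foo_get_a(const Foo *f) { return f->abcd & 3u; }
static inline void foo_set_a(Foo *f, unsigned v) { f->abcd = (uint8_t)((f->abcd & ~3u) | (v & 3u)); }
#else
typedef struct { uint8_t a, b, c, d; } Foo; /* one byte per field */
static inline unsigned foo_get_a(const Foo *f) { return f->a; }
static inline void foo_set_a(Foo *f, unsigned v) { f->a = (uint8_t)v; }
#endif
Building the program once with -DFOO_PACKED and once without lets you profile both layouts without touching any call sites.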
Code it with ints
Treat the fields as ints.
blah.x in all your code, except for the declaration, will be all you will be doing. Integral promotion will take care of most cases.
When you are all done, have 3 equivalent include files: one using ints, one using char and one using bitfields.
And then profile. Don't worry about it at this stage, because it's premature optimization, and nothing but your chosen include file will change.
Massive Arrays and Out of Memory Errors
the whole program consists of a big array of billions of Foos;
First things first, given that, you might find yourself or your users (if others run the software) often unable to allocate this array successfully if it spans gigabytes. A common mistake here is to think that out-of-memory errors mean "no more memory available", when they instead often mean that the OS could not find a contiguous set of unused pages matching the requested memory size. It's for this reason that people often get confused when they request to allocate a one-gigabyte block only to have it fail even though they have 30 gigabytes of physical memory free. Once you start allocating memory in sizes that span more than, say, 1% of the typical amount of memory available, it's often time to consider avoiding one giant array to represent the whole thing.
So perhaps the first thing you need to do is rethink the data structure. Instead of allocating a single array of billions of elements, often you'll significantly reduce the odds of running into problems by allocating in smaller chunks (smaller arrays aggregated together). For example, if your access pattern is solely sequential in nature, you can use an unrolled list (arrays linked together). If random access is needed, you might use something like an array of pointers to arrays which each span 4 kilobytes. This requires a bit more work to index an element, but with this kind of scale of billions of elements, it's often a necessity.
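A minimal sketch of that pointer-to-chunks idea (the chunk size and names are arbitrary choices for the example):
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_ELEMS 4096u  /* packed Foos per chunk, one byte each */

typedef struct {
    size_t nchunks;
    uint8_t **chunks;
} FooStore;

/* Allocate enough chunks to hold count elements; returns 0 on success. */
static int foostore_init(FooStore *s, size_t count)
{
    s->nchunks = (count + CHUNK_ELEMS - 1) / CHUNK_ELEMS;
    s->chunks = calloc(s->nchunks, sizeof *s->chunks);
    if (!s->chunks) return -1;
    for (size_t c = 0; c < s->nchunks; c++) {
        s->chunks[c] = calloc(CHUNK_ELEMS, 1);
        if (!s->chunks[c]) return -1;  /* caller frees whatever was allocated */
    }
    return 0;
}

/* Random access costs one extra indirection compared to a flat array. */
static uint8_t *foo_at(const FooStore *s, size_t i)
{
    return &s->chunks[i / CHUNK_ELEMS][i % CHUNK_ELEMS];
}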
Access Patterns
One of the things unspecified in the question are the memory access patterns. This part is critical for guiding your decisions.
For example, is the data structure solely traversed sequentially, or is random access needed? Are all of these fields: a, b, c, d, needed together all the time, or can they be accessed one or two or three at a time?
Let's try to cover all the possibilities. At the scale we're talking about, this:
struct Foo {
int a1;
int b1;
int c1;
int d1;
};
... is unlikely to be helpful. At this kind of input scale, and accessed in tight loops, your times are generally going to be dominated by the upper levels of memory hierarchy (paging and CPU cache). It no longer becomes quite as critical to focus on the lowest level of the hierarchy (registers and associated instructions). To put it another way, at billions of elements to process, the last thing you should be worrying about is the cost of moving this memory from L1 cache lines to registers and the cost of bitwise instructions, e.g. (not saying it's not a concern at all, just saying it's a much lower priority).
At a small enough scale where the entirety of the hot data fits into the CPU cache and a need for random access, this kind of straightforward representation can show a performance improvement due to the improvements at the lowest level of the hierarchy (registers and instructions), yet it would require a drastically smaller-scale input than what we're talking about.
So even this is likely to be a considerable improvement:
struct Foo {
char a1;
char b1;
char c1;
char d1;
};
... and this even more:
// Each field packs 4 values with 2-bits each.
struct Foo {
char a4;
char b4;
char c4;
char d4;
};
* Note that you could use bitfields for the above, but bitfields tend to have caveats associated with them depending on the compiler being used. I've often been careful to avoid them due to the portability issues commonly described, though this may be unnecessary in your case. However, as we adventure into SoA and hot/cold field-splitting territories below, we'll reach a point where bitfields can't be used anyway.
This code also places a focus on horizontal logic which can start to make it easier to explore some further optimization paths (ex: transforming the code to use SIMD), as it's already in a miniature SoA form.
Data "Consumption"
Especially at this kind of scale, and even more so when your memory access is sequential in nature, it helps to think in terms of data "consumption" (how quickly the machine can load data, do the necessary arithmetic, and store the results). A simple mental image I find useful is to imagine the computer as having a "big mouth". It goes faster if we feed it large enough spoonfuls of data at once, not little teeny teaspoons, and with more relevant data packed tightly into a contiguous spoonful.
Hot/Cold Field Splitting
The above code so far is making the assumption that all of these fields are equally hot (accessed frequently), and accessed together. You may have some cold fields or fields that are only accessed in critical code paths in pairs. Let's say that you rarely access c and d, or that your code has one critical loop that accesses a and b, and another that accesses c and d. In that case, it can be helpful to split it off into two structures:
struct Foo1 {
char a4;
char b4;
};
struct Foo2 {
char c4;
char d4;
};
Again if we're "feeding" the computer data, and our code is only interested in a and b fields at the moment, we can pack more into spoonfuls of a and b fields if we have contiguous blocks that only contain a and b fields, and not c and d fields. In such a case, c and d fields would be data the computer can't digest at the moment, yet it would be mixed into the memory regions in between a and b fields. If we want the computer to consume data as quickly as possible, we should only be feeding it the relevant data of interest at the moment, so it's worth splitting the structures in these scenarios.
SIMD SoA for Sequential Access
Moving towards vectorization, and assuming sequential access, the fastest rate at which the computer can consume data will often be in parallel using SIMD. In such a case, we might end up with a representation like this:
struct Foo1 {
char* a4n;
char* b4n;
};
... with careful attention to alignment and padding (the size/alignment should be a multiple of 16 or 32 bytes for AVX or even 64 for futuristic AVX-512) necessary to use faster aligned moves into XMM/YMM registers (and possibly with AVX instructions in the future).
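For example, one way to obtain that alignment is posix_memalign (POSIX rather than plain C99; C11 offers aligned_alloc as an alternative); a sketch:
#include <stdlib.h>

/* Returns n bytes aligned to 32 (suitable for aligned AVX loads), or NULL. */
static void *alloc_aligned32(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, n) != 0)
        return NULL;
    return p;
}

/* e.g. for the Foo1 above: foo.a4n = alloc_aligned32(count / 4); free() it as usual. */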
AoSoA for Random/Multi-Field Access
Unfortunately the above representation can start to lose a lot of the potential benefits if a and b are accessed frequently together, especially with a random access pattern. In such a case, a more optimal representation can start looking like this:
struct Foo1 {
char a4x32[32];
char b4x32[32];
};
... where we're now aggregating this structure. This makes it so the a and b fields are no longer so spread apart, allowing groups of 32 a and b fields to fit into a single 64-byte cache line and accessed together quickly. We can also fit 128 or 256 a or b elements now into an XMM/YMM register.
Profiling
Normally I try to avoid general wisdom advice in performance questions, but I noticed this one seems to avoid the details that someone who has profiler in hand would typically mention. So I apologize if this comes off a bit as patronizing or if a profiler is already being actively used, but I think the question warrants this section.
As an anecdote, I've often done a better job (I shouldn't!) at optimizing production code written by people who have far superior knowledge than me about computer architecture (I worked with a lot of people who came from the punch card era and can understand assembly code at a glance), and would often get called in to optimize their code (which felt really odd). It's for one simple reason: I "cheated" and used a profiler (VTune). My peers often didn't (they had an allergy to it and thought they understood hotspots just as well as a profiler and saw profiling as a waste of time).
Of course the ideal is to find someone who has both the computer architecture expertise and a profiler in hand, but lacking one or the other, the profiler can give the bigger edge. Optimization still rewards a productivity mindset which hinges on the most effective prioritization, and the most effective prioritization is to optimize the parts that truly matter the most. The profiler gives us detailed breakdowns of exactly how much time is spent and where, along with useful metrics like cache misses and branch mispredictions which even the most advanced humans typically can't predict anywhere close to as accurate as a profiler can reveal. Furthermore, profiling is often the key to discovering how the computer architecture works at a more rapid pace by chasing down hotspots and researching why they exist. For me, profiling was the ultimate entry point into better understanding how the computer architecture actually works and not how I imagined it to work. It was only then that the writings of someone as experienced in this regard as Mysticial started to make more and more sense.
Interface Design
One of the things that might start to become apparent here is that there are many optimization possibilities. The answers to this kind of question are going to be about strategies rather than absolute approaches. A lot still has to be discovered in hindsight after you try something, and still iterating towards more and more optimal solutions as you need them.
One of the difficulties here in a complex codebase is leaving enough breathing room in the interfaces to experiment and try different optimization techniques, to iterate and iterate towards faster solutions. If the interface leaves room to seek these kinds of optimizations, then we can optimize all day long and often get some marvelous results if we're measuring things properly even with a trial and error mindset.
Leaving enough breathing room in an implementation to experiment with and explore faster techniques often requires interface designs that accept data in bulk. This is especially true if the interfaces involve indirect function calls (ex: through a dylib or a function pointer) where inlining is no longer an effective possibility. In such scenarios, leaving room to optimize without cascading interface breakages often means designing away from the mindset of receiving simple scalar parameters in favor of passing pointers to whole chunks of data (possibly with a stride if there are various interleaving possibilities). So while this is straying into a pretty broad territory, a lot of the top priorities in optimizing here are going to boil down to leaving enough breathing room to optimize implementations without cascading changes throughout your codebase, and having a profiler in hand to guide you the right way.
TL;DR
Anyway, some of these strategies should help guide you the right way. There are no absolutes here, only guides and things to try out, and always best done with a profiler in hand. Yet when processing data of this enormous scale, it's always worth remembering the image of the hungry monster, and how to most effectively feed it these appropriately-sized and packed spoonfuls of relevant data.
Let's say, you have a memory bus that's a little bit older and can deliver 10 GB/s. Now take a CPU at 2.5 GHz, and you see that you would need to handle at least four bytes per cycle to saturate the memory bus. As such, when you use the definition of
struct Foo {
char a;
char b;
char c;
char d;
};
and use all four variables in each pass through the data, your code will be CPU bound. You can't gain any speed by a denser packing.
Now, this is different when each pass only performs a trivial operation on one of the four values. In that case, you are better off with a struct of arrays:
struct Foo {
size_t count;
char* a; //a[count]
char* b; //b[count]
char* c; //c[count]
char* d; //d[count]
};
You've stated the common and ambiguous C/C++ tag.
Assuming C++, make the data private and add getters/ setters.
No, that will not cause a performance hit, provided the optimizer is turned on.
You can then change the implementation to use the alternatives without any change to your calling code - and therefore more easily finesse the implementation based on the results of the bench tests.
For the record, I'd expect the struct with bit fields as per #dbush to be most likely the fastest given your description.
Note all this is around keeping the data in cache - you may also want to see if the design of the calling algorithm can help with that.
Getting back to the question asked :
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
This is a classic example of when you should write platform specific high performance code that takes time to design for each implementation platform, but the benefits outweigh that cost.
As it's the bottleneck of the entire program you don't look for a general solution, but recognize that this needs to have multiple approaches tested and timed against real data, as the best solution will be platform specific.
It is also possible, as it is a large array of billions of Foos, that the OP should consider using OpenCL or OpenMP as potential solutions so as to maximize the exploitation of available resources on the runtime hardware. This is a little dependent on what you need from the data, but it's probably the most important aspect of this type of problem - how to exploit the available parallelism.
But there is no single right answer to this question, IMO.
The most efficient option, in terms of performance / execution, is to use the processor's word size. Don't make the processor perform the extra work of packing or unpacking.
Some processors have more than one efficient size. Many ARM processors can operate in 8/32 bit mode. This means that the processor is optimized for handling 8 bit quantities or 32-bit quantities. For a processor like this, I recommend using 8-bit data types.
Your algorithm has a lot to do with the efficiency. If you are moving data or copying data you may want to consider moving data 32-bits at a time (4 8-bit quantities). The idea here is to reduce the number of fetches by the processor.
For performance, write your code to make use of registers, such as using more local variables. Fetching from memory into registers is more costly than using registers directly.
Best of all, check out your compiler optimization settings. Set your compile for the highest performance (speed) settings. Next, generate assembly language listings of your functions. Review the listing to see how the compiler generated code. Adjust your code to improve the compiler's optimization capabilities.
If what you're after is efficiency of space, then you should consider avoiding structs altogether. The compiler will insert padding into your struct representation as necessary to make its size a multiple of its alignment requirement, which might be as much as 16 bytes (but is more likely to be 4 or 8 bytes, and could after all be as little as 1 byte).
If you use a struct anyway, then which to use depends on your implementation. If #dbush's bitfield approach yields one-byte structures then it's hard to beat that. If your implementation is going to pad the representation to at least four bytes no matter what, however, then this is probably the one to use:
struct Foo {
char a;
char b;
char c;
char d;
};
Or I guess I would probably use this variant:
struct Foo {
uint8_t a;
uint8_t b;
uint8_t c;
uint8_t d;
};
Since we're supposing that your struct is taking up a minimum of four bytes, there is no point in packing the data into smaller space. That would be counter-productive, in fact, because it would also make the processor do the extra work packing and unpacking the values within.
For handling large amounts of data, making efficient use of the CPU cache provides a far greater win than avoiding a few integer operations. If your data usage pattern is at least somewhat systematic (e.g. if after accessing one element of your erstwhile struct array, you are likely to access a nearby one next) then you are likely to get a boost in both space efficiency and speed by packing the data as tightly as you can. Depending on your C implementation (or if you want to avoid implementation dependency), you might need to achieve that differently -- for instance, via an array of integers. For your particular example of four fields, each requiring two bits, I would consider representing each "struct" as a uint8_t instead, for a total of 1 byte each.
Maybe something like this:
#include <stdint.h>
#define NUMBER_OF_FOOS 1000000000
#define A 0
#define B 2
#define C 4
#define D 6
#define SET_FOO_FIELD(foos, index, field, value) \
((foos)[index] = (((foos)[index] & ~(3 << (field))) | (((value) & 3) << (field))))
#define GET_FOO_FIELD(foos, index, field) (((foos)[index] >> (field)) & 3)
typedef uint8_t foo;
foo all_the_foos[NUMBER_OF_FOOS];
The field name macros and access macros provide a more legible -- and adjustable -- way to access the individual fields than would direct manipulation of the array (but be aware that these particular macros evaluate some of their arguments more than once). Every bit is used, giving you about as good cache usage as it is possible to achieve through choice of data structure alone.
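Usage then looks something like this (a hypothetical loop, just to show the macros in context):
#include <stddef.h>  /* size_t; stdint.h is already included above */

void touch_all(void)
{
    for (size_t i = 0; i < NUMBER_OF_FOOS; i++) {
        SET_FOO_FIELD(all_the_foos, i, B, 2);          /* set b of element i to 2 */
        if (GET_FOO_FIELD(all_the_foos, i, A) == 3) {  /* read a of element i */
            /* ... */
        }
    }
}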
I did video decompression for a while. The fastest thing to do is something like this:
short ABCD; //use a 16 bit data type for your example
and set up some macros. Maybe:
#define GETA ((ABCD >> 12) & 0x000F)
#define GETB ((ABCD >> 8) & 0x000F)
#define GETC ((ABCD >> 4) & 0x000F)
#define GETD (ABCD & 0x000F) // no need to shift D
In practice you should try to be moving 32-bit longs or 64-bit long longs, because that's the native MOVE size on most modern processors.
Using a struct will always create the overhead in your compiled code of extra instructions from the base address of your struct to the field. So get away from that if you really want to tighten your loop.
Edit:
The above example gives you 4-bit values. If you really just need values of 0..3 then you can do the same thing to pull out your 2-bit numbers, so GETA might look like this:
#define GETA ((ABCD >> 14) & 0x0003)
And if you are really moving billions of things, and I don't doubt it, just fill up a 32-bit variable and shift and mask your way through it.
Hope this helps.

What is the best approach when working with on-disk data structures

I would like to know how best to work with on-disk data structures given that the storage layout needs to exactly match the logical design. I find that structure alignment & packing do not really help much when you need to have a certain layout for your storage.
My approach to this problem is defining the width of the structure using a preprocessor directive and using that width when allocating the character (byte) arrays that I will write to disk, after appending data that follows the logical structure model.
eg:
typedef struct __attribute__((packed, aligned(1))) foo {
uint64_t some_stuff;
uint8_t flag;
} foo;
If I persist foo on disk, the "flag" value will come at the very end of the data. The advantage is that I can easily read the data back with fread into a foo (via &foo) and then use the struct normally, without any further byte fiddling.
Instead I prefer to do this
#define foo_width (sizeof(uint64_t) + sizeof(uint8_t))
uint8_t *foo = calloc(1, foo_width);
foo[0] = flag_value;
memcpy(foo+1, encode_int64(some_value), sizeof(uint64_t));
Then I just use fwrite and fread to commit and read the bytes but later unpack them in order to use the data stored in various logical fields.
I wonder which approach is best to use given I desire the layout of the on-disk storage to match the logical layout ... this was just an example ...
If anyone knows how efficient each method is with respect to decoding/unpacking bytes vs. copying the structure directly from its on-disk representation, please share. I personally prefer the second approach, since it gives me full control over the storage layout, but I am not ready to sacrifice looks for performance, since this approach requires a lot of loop logic to unpack and traverse through the bytes to the various boundaries in the data.
Thanks.
Based on your requirements (considering looks and performance), the first approach is better, because the compiler will do the hard work for you. In other words, if a tool (the compiler in this case) provides a certain feature, then you do not want to implement it on your own, because in most cases the tool's implementation will be more efficient than yours.
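For reference, a sketch of what the first approach then looks like end to end (the FILE handling is arbitrary; note that the byte order of some_stuff inside the file is still whatever the host uses):
#include <stdint.h>
#include <stdio.h>

typedef struct __attribute__((packed, aligned(1))) foo {
    uint64_t some_stuff;
    uint8_t flag;
} foo;

int save_foo(const foo *f, FILE *fp)
{
    /* sizeof(foo) is 9: no padding, so the on-disk layout matches the struct */
    return fwrite(f, sizeof *f, 1, fp) == 1 ? 0 : -1;
}

int load_foo(foo *f, FILE *fp)
{
    return fread(f, sizeof *f, 1, fp) == 1 ? 0 : -1;
}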
I prefer something close to your second approach, but without memcpy:
void store_i64le(void *dest, uint64_t value)
{ // Generic version which will work with any platform
uint8_t *d = dest;
d[0] = (uint8_t)(value);
d[1] = (uint8_t)(value >> 8);
d[2] = (uint8_t)(value >> 16);
d[3] = (uint8_t)(value >> 24);
d[4] = (uint8_t)(value >> 32);
d[5] = (uint8_t)(value >> 40);
d[6] = (uint8_t)(value >> 48);
d[7] = (uint8_t)(value >> 56);
}
store_i64le(foo+1, some_value);
On a typical ARM, the above store_i64le method would translate into about 30 bytes--a reasonable tradeoff of time, space, and complexity. Not quite optimal from a speed perspective, but not much worse than optimal from a space perspective on something like the Cortex-M0 which doesn't support unaligned writes. Note that the code as written has zero dependence upon machine byte order. If one knew that one was using a little-endian platform whose hardware would convert unaligned 32-bit accesses to a sequence of 8- and 16-bit accesses, one could rewrite the method as
void store_i64le(void *dest, uint64_t value)
{ // For an x86 or little-endian ARM which can handle unaligned 32-bit loads and stores
uint32_t *d = dest;
d[0] = (uint32_t)(value);
d[1] = (uint32_t)(value >> 32);
}
which would be faster on the platforms where it would work. Note that the method would be invoked the same way as the byte-at-a-time version; the caller wouldn't have to worry about which approach to use.
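The read side would mirror it; a sketch of the generic, byte-order-independent counterpart:
#include <stdint.h>

uint64_t load_i64le(const void *src)
{   // Generic version: reassembles a little-endian value regardless of host order
    const uint8_t *s = src;
    return  (uint64_t)s[0]        | ((uint64_t)s[1] << 8)
         | ((uint64_t)s[2] << 16) | ((uint64_t)s[3] << 24)
         | ((uint64_t)s[4] << 32) | ((uint64_t)s[5] << 40)
         | ((uint64_t)s[6] << 48) | ((uint64_t)s[7] << 56);
}

/* e.g.: uint64_t some_value = load_i64le(foo + 1); */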
If you are on Linux or Windows, then just memory-map the file and cast the pointer to the type of the C struct. Whatever you write in this mapped area will be automatically flushed to disk in the most efficient way the OS has available. It will be a lot more efficient than calling "write", and minimal hassle for you.
As others have mentioned, it isn't very portable. To be portable between little-endian and big-endian the common strategy is to write the whole file in big-endian or little-endian and convert as you access it. However, this throws away your speed. A way to preserve your speed is to write an external utility which converts the whole file once, and then run that utility any time you move the structure from one platform to another.
In the case that you have two different platforms accessing a single file over a shared network path, you are in for a lot of pain if you try writing it yourself just because of synchronization issues, so I would suggest an entirely different approach like using sqlite.
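A minimal POSIX sketch of the memory-mapping approach (error handling trimmed; on Windows the equivalent calls are CreateFileMapping/MapViewOfFile):
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct __attribute__((packed, aligned(1))) foo {
    uint64_t some_stuff;
    uint8_t flag;
} foo;

foo *map_foo(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (foo *)p;
}
/* Writes through the returned pointer reach the file; msync()/munmap() when done. */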

Alternative to writing masks for 32 bit microcontrollers

I am working on a project that involves programming 32-bit ARM micro-controllers. As in much embedded software work, setting and clearing bits is an essential and quite repetitive task. A masking strategy for setting and clearing bits is useful when working with smaller micros, but when working with 32-bit micro-controllers it is not really practical to write out masks each time we need to set or clear a single bit.
Writing functions to handle this could be a solution; however, having a function occupies memory, which is not ideal in my case.
Is there any better alternative to handle bit setting/clearing when working with 32 bit micros?
In C or C++, you would typically define macros for bit masks and combine them as desired.
/* widget.h */
#define WIDGET_FOO 0x00000001u
#define WIDGET_BAR 0x00000002u
/* widget_driver.c */
static uint32_t *widget_control_register = (uint32_t*)0x12346578;
int widget_init (void) {
*widget_control_register |= WIDGET_FOO;
if (*widget_control_register & WIDGET_BAR) log(LOG_DEBUG, "widget: bar is set");
}
If you want to define the bit masks from the bit positions rather than as absolute values, define constants based on a shift operation (if your compiler doesn't optimize these constants, it's hopeless).
#define WIDGET_FOO (1u << 0)
#define WIDGET_BAR (1u << 1)
You can define macros to set bits:
/* widget.h */
#define WIDGET_CONTROL_REGISTER_ADDRESS ((uint32_t*)0x12346578)
#define SET_WIDGET_BITS(m) (*WIDGET_CONTROL_REGISTER_ADDRESS |= (m))
#define CLEAR_WIDGET_BITS(m) (*WIDGET_CONTROL_REGISTER_ADDRESS &= ~(uint32_t)(m))
You can define functions rather than macros. This has the advantage of added type verifications during compilations. If you declare the function as static inline (or even just static) in a header, a good compiler will inline the function everywhere, so using a function in your source code won't cost any code memory (assuming that the generated code for the function body is smaller than a function call, which should be the case for a function that merely sets some bits in a register).
/* widget.h */
#define WIDGET_CONTROL_REGISTER_ADDRESS ((uint32_t*)0x12346578)
static inline void set_widget_bits(uint32_t m) {
*WIDGET_CONTROL_REGISTER_ADDRESS |= m;
}
static inline void clear_widget_bits(uint32_t m) {
*WIDGET_CONTROL_REGISTER_ADDRESS &= ~m;
}
The other common idiom for registers providing access to individual bits or groups of bits is to define a struct containing bitfields for each register of your device. This can get tricky, and it is dependent on the C compiler implementation. But it can also be clearer than macros.
A simple device with a one-byte data register, a control register, and a status register could look like this:
typedef struct {
unsigned char data;
unsigned char txrdy:1;
unsigned char rxrdy:1;
unsigned char reserved:2;
unsigned char mode:4;
} COMCHANNEL;
#define CHANNEL_A (*(COMCHANNEL *)0x10000100)
// ...
void sendbyte(unsigned char b) {
while (!CHANNEL_A.txrdy) /*spin*/;
CHANNEL_A.data = b;
}
unsigned char readbyte(void) {
while (!CHANNEL_A.rxrdy) /*spin*/;
return CHANNEL_A.data;
}
Access to the mode field is just CHANNEL_A.mode = 3;, which is a lot clearer than writing something like *CHANNEL_A_MODE = (*CHANNEL_A_MODE &~ CHANNEL_A_MODE_MASK) | (3 << CHANNEL_A_MODE_SHIFT);. Of course, the latter ugly expression would usually be (mostly) covered over by macros.
In my experience, once you established a style for describing your peripheral registers you are best served by following that style over the whole project. The consistency will have huge benefits for future code maintenance, and over the lifetime of a project that factor likely is more important that the relatively small detail of whether you adopted the struct and bitfields or macro style.
If you are coding for a target which has already set a style in its manufacturer provided header files and customary compiler toolchain, adopting that style for your own custom hardware and low level code may be best as it will provide the best match between manufacturer documentation and your coding style.
But if you have the luxury of establishing the style for your development at the outset, your compiler platform is well enough documented to permit you to reliably describe device registers with bitfields, and you expect to use the same compiler for the lifetime of the product, then that is often a good way to go.
You can actually have it both ways too. It isn't that unusual to wrap the bitfield declarations inside a union that describes the physical registers, allowing their values to be easily manipulated all bits at once. (I know I've seen a variation of this where conditional compilation was used to provide two versions of the bitfields, one for each bit order, and a common header file used toolchain-specific definitions to decide which to select.)
typedef struct {
unsigned char data;
union {
struct {
unsigned char txrdy:1;
unsigned char rxrdy:1;
unsigned char reserved:2;
unsigned char mode:4;
} bits;
unsigned char status;
};
} COMCHANNEL;
// ...
#define CHANNEL_A_STATUS_TXRDY 0x01
#define CHANNEL_A_STATUS_RXRDY 0x02
#define CHANNEL_A_MODE_MASK 0xf0
#define CHANNEL_A_MODE_SHIFT 4
// ...
#define CHANNEL_A (*(COMCHANNEL *)0x10000100)
// ...
void sendbyte(unsigned char b) {
while (!CHANNEL_A.bits.txrdy) /*spin*/;
CHANNEL_A.data = b;
}
unsigned char readbyte(void) {
while (!CHANNEL_A.bits.rxrdy) /*spin*/;
return CHANNEL_A.data;
}
Assuming your compiler understands the anonymous union, you can simply refer to CHANNEL_A.status to get the whole byte, or CHANNEL_A.bits.mode to refer to just the mode field.
There are some things to watch for if you go this route. First, you have to have a good understanding of structure packing as defined in your platform. The related issue is the order in which bit fields are allocated across their storage, which can vary. I've assumed that the low order bit is assigned first in my examples here.
There may also be hardware implementation issues to worry about. If a particular register must always be read and written 32 bits at a time, but you have it described as a bunch of small bit fields, the compiler might generate code that violates that rule and accesses only a single byte of the register. There is usually a trick available to prevent this, but it will be highly platform dependent. In this case, using macros with a fixed sized register will be less likely to cause a strange interaction with your hardware device.
These issues are very compiler vendor dependent. Even without changing compiler vendors, #pragma settings, command line options, or more likely optimization level choices can all affect memory layout, padding, and memory access patterns. As a side effect, they will likely lock your project to a single specific compiler toolchain, unless heroic efforts are used to create register definition header files that use conditional compilation to describe the registers differently for different compilers. And even then, you are probably well served to include at least one regression test that verifies your assumptions so that any upgrades to the toolchain (or well-intentioned tweaks to the optimization level) will cause any issues to get caught before they are mysterious bugs in code that "has worked for years".
The good news is that the sorts of deep embedded projects where this technique makes sense are already subject to a number of toolchain lock in forces, and this burden may not be a burden at all. Even if your product development team moves to a new compiler for the next product, it is often critical that firmware for a particular product be maintained with the very same toolchain over its lifetime.
If you use the Cortex M3 you can use bit-banding
Bit-banding maps a complete word of memory onto a single bit in the bit-band region. For example, writing to one of the alias words will set or clear the corresponding bit in the bitband region.
This allows every individual bit in the bit-banding region to be directly accessible from a word-aligned address using a single LDR instruction. It also allows individual bits to be toggled from C without performing a read-modify-write sequence of instructions.
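A sketch of how the alias address is computed for the peripheral bit-band region on a Cortex-M3 (the control-register address used here is a made-up example):
#include <stdint.h>

/* Each bit of the 0x40000000 peripheral region has a 32-bit alias word starting
   at 0x42000000: alias = base + byte_offset*32 + bit*4.  Writing 0 or 1 to the
   alias word clears or sets just that bit, with no read-modify-write. */
#define BITBAND_PERIPH(addr, bit) \
    (*(volatile uint32_t *)(0x42000000u + \
        (((uint32_t)(addr) - 0x40000000u) * 32u) + ((uint32_t)(bit) * 4u)))

#define WIDGET_CTRL_ADDR 0x40010000u  /* hypothetical register */

void bitband_example(void)
{
    BITBAND_PERIPH(WIDGET_CTRL_ADDR, 9) = 1;  /* set bit 9 */
    BITBAND_PERIPH(WIDGET_CTRL_ADDR, 9) = 0;  /* clear bit 9 */
}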
If you have C++ available, and there's a decent compiler available, then something like QFlags is a good idea. It gives you a type-safe interface to bit flags.
It is likely to produce better code than using bitfields in structures, since the bitfields can only be changed one at a time and will likely translate to at least one load/modify/store per each changed bitfield. With a QFlags-like approach, you can get one load/modify/store per each or-assign or and-assign statement. Note that the use of QFlags doesn't require the inclusion of the entire Qt framework. It's a stand-alone header file (after minor tweaks).
At the driver level setting and clearing bits with masks is very common and sometimes the only way. Besides, it's an extremely quick operation; only a few instructions. It may be worthwhile to set up a function that can clear or set certain bits for readability and also reusability.
It's not clear what type of registers you are setting and clearing bits in but in general there are two cases you have to worry about in embedded systems:
Setting and clearing bits in a read/write register
If you want to change a single bit (or a handful of bits) in a read and write register you will first have to read the register, set or clear the appropriate bit using masks and whatever else to get the correct behavior, and then write back to the same register. That way you don't change the other bits.
Writing to separate Set and Clear registers (common in ARM micros)
Sometimes there are separate Set and Clear registers. You can write just a single bit to the Clear register and it will clear that bit. For instance, if you want to clear bit 9 of a register, just write (1<<9) to its Clear register; you don't have to worry about modifying the other bits. The Set register works similarly, as sketched below.
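In code, with hypothetical register addresses, that looks like:
#include <stdint.h>

/* Hypothetical memory-mapped Set/Clear registers for one peripheral. */
#define PORT_SET (*(volatile uint32_t *)0x40020018u)
#define PORT_CLR (*(volatile uint32_t *)0x4002001Cu)

void set_clear_example(void)
{
    PORT_SET = (1u << 9);  /* sets bit 9, leaving the other bits untouched */
    PORT_CLR = (1u << 9);  /* clears bit 9, no read-modify-write needed */
}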
You can set and clear bits with a function that takes up as much memory as doing it with a mask:
#define SET_BIT(variableName, bitNumber) ((variableName) |= (0x00000001u << (bitNumber)))
#define CLR_BIT(variableName, bitNumber) ((variableName) &= ~(0x00000001u << (bitNumber)))
int myVariable = 12;
SET_BIT(myVariable, 0); // myVariable now equals 13
CLR_BIT(myVariable, 2); // myVariable now equals 9
These macros will produce exactly the same assembler instructions as a mask.
Alternatively, you could do this:
#define BIT(n) (0x00000001u << (n))
#define NOT_BIT(n) (~(0x00000001u << (n)))
int myVariable = 12;
myVariable |= BIT(4); //myVariable now equals 28
myVariable &= NOT_BIT(3); //myVariable now equals 20
myVariable |= BIT(5) |
BIT(6) |
BIT(7) |
BIT(8); //myVariable now equals 500

Bitfield manipulation in C

The classic problem of testing and setting individual bits in an integer in C is perhaps one of the most common intermediate-level programming skills. You set and test with simple bitmasks such as
unsigned int mask = 1<<11;
if (value & mask) {....} // Test for the bit
value |= mask; // set the bit
value &= ~mask; // clear the bit
An interesting blog post argues that this is error prone, difficult to maintain, and poor practice. The C language itself provides bit level access which is typesafe and portable:
typedef unsigned int boolean_t;
#define FALSE 0
#define TRUE !FALSE
typedef union {
struct {
boolean_t user:1;
boolean_t zero:1;
boolean_t force:1;
int :28; /* unused */
boolean_t compat:1; /* bit 31 */
};
int raw;
} flags_t;
int
create_object(flags_t flags)
{
boolean_t is_compat = flags.compat;
if (is_compat)
flags.force = FALSE;
if (flags.force) {
[...]
}
[...]
}
But this makes me cringe.
The interesting argument my coworker and I had about this is still unresolved. Both styles work, and I maintain the classic bitmask method is easy, safe, and clear. My coworker agrees it's common and easy, but argues the bitfield union method is worth the extra few lines to make it portable and safer.
Are there any more arguments for either side? In particular, is there some possible failure, perhaps with endianness, that the bitmask method may miss but where the structure method is safe?
Bitfields are not quite as portable as you think, as "C gives no guarantee of the ordering of fields within machine words" (The C book)
Ignoring that, used correctly, either method is safe. Both methods also allow symbolic access to integral variables. You can argue that the bitfield method is easier to write, but it also means more code to review.
If the issue is that setting and clearing bits is error prone, then the right thing to do is to write functions or macros to make sure you do it right.
// off the top of my head
#define SET_BIT(val, bitIndex) ((val) |= (1u << (bitIndex)))
#define CLEAR_BIT(val, bitIndex) ((val) &= ~(1u << (bitIndex)))
#define TOGGLE_BIT(val, bitIndex) ((val) ^= (1u << (bitIndex)))
#define BIT_IS_SET(val, bitIndex) ((val) & (1u << (bitIndex)))
These keep your code readable, if you don't mind that val has to be an lvalue for everything except BIT_IS_SET. If that doesn't make you happy, you can take out the assignment, parenthesize the expression, and use it as val = SET_BIT(val, someIndex);, which is equivalent.
Really, the answer is to consider decoupling what you want from how you do it.
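For illustration, here is a minimal sketch of the "assignment taken out" variant suggested above; the WITH_* names and the demo function are invented for this example:
/* Pure-expression variants: the macro only computes the new value,
   and the caller decides where (and whether) to store it. */
#define WITH_BIT_SET(val, bitIndex)     ((val) |  (1u << (bitIndex)))
#define WITH_BIT_CLEARED(val, bitIndex) ((val) & ~(1u << (bitIndex)))

void demo(void)
{
    unsigned int flags = 0u;
    flags = WITH_BIT_SET(flags, 3);      /* bit 3 is now set     */
    flags = WITH_BIT_CLEARED(flags, 3);  /* bit 3 is now cleared */
}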
Bitfields are great and easy to read, but unfortunately the C language does not specify the layout of bitfields in memory, which means they are essentially useless for dealing with packed data in on-disk formats or binary wire protocols. If you ask me, this decision was a design error in C—Ritchie could have picked an order and stuck with it.
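Where the layout must be fixed (a wire protocol or an on-disk format), the usual workaround is to pack and unpack with explicit shifts and masks, which behave the same on every compiler. The field layout below is invented purely for illustration:
#include <stdint.h>

/* Hypothetical header layout: version in bits 0-3, type in bits 4-7,
   length in bits 8-23, flags in bits 24-31. */
static uint32_t pack_header(unsigned version, unsigned type,
                            unsigned length, unsigned flags)
{
    return  ((uint32_t)(version & 0x0Fu))
          | ((uint32_t)(type    & 0x0Fu)   << 4)
          | ((uint32_t)(length  & 0xFFFFu) << 8)
          | ((uint32_t)(flags   & 0xFFu)   << 24);
}

static unsigned unpack_length(uint32_t header)
{
    return (header >> 8) & 0xFFFFu;
}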
You have to think about this from the perspective of a writer -- know your audience. So there are a couple of "audiences" to consider.
First there's the classic C programmer, who has been bitmasking their whole life and could do it in their sleep.
Second there's the newb, who has no idea what all this | and & stuff is. They were programming PHP at their last job and now they work for you. (I say this as a newb who does PHP.)
If you write to satisfy the first audience (that is, bitmask all day long), you'll make them very happy, and they'll be able to maintain the code blindfolded. However, the newb will likely face a steep learning curve before they are able to maintain your code. They will need to learn about binary operators, how you use these operations to set/clear bits, and so on. You're almost certainly going to have bugs introduced by the newb as he/she learns all the tricks required to get this to work.
On the other hand, if you write to satisfy the second audience, the newbs will have an easier time maintaining the code. They'll have an easier time grokking
flags.force = 0;
than
flags &= 0xFFFFFFFE;
and the first audience will just get grumpy, but it's hard to imagine they wouldn't be able to grok and maintain the new syntax. It's just much harder to screw up, and there won't be new bugs, because the newb will more easily maintain the code. You'll just get lectures about how "back in my day you needed a steady hand and a magnetized needle to set bits... we didn't even HAVE bitmasks!" (thanks, XKCD).
So I would strongly recommend using the fields over the bitmasks to newb-safe your code.
The union usage gives unspecified results according to the C standard (reading a member other than the last one stored), and thus should not be used this way (or at least should not be considered portable).
From the ISO/IEC 9899:1999 (C99) standard:
Annex J - Portability Issues:
1 The following are unspecified:
— The value of padding bytes when storing values in structures or unions (6.2.6.1).
— The value of a union member other than the last one stored into (6.2.6.1).
6.2.6.1 - Language Concepts - Representation of Types - General:
6 When a value is stored in an object of structure or union type, including in a member
object, the bytes of the object representation that correspond to any padding bytes take
unspecified values.[42]) The value of a structure or union object is never a trap
representation, even though the value of a member of the structure or union object may be
a trap representation.
7 When a value is stored in a member of an object of union type, the bytes of the object
representation that do not correspond to that member but do correspond to other members
take unspecified values.
So, if you want to keep the bitfield ↔ integer correspondence and keep portability, I strongly suggest you use the bitmasking method, which, contrary to what the linked blog post claims, is not poor practice.
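As a sketch of what that looks like in practice, here are named masks mirroring the flags in the union example; the particular bit assignments are an assumption for illustration:
#define USER   (1u << 0)
#define ZERO   (1u << 1)
#define FORCE  (1u << 2)
#define COMPAT (1u << 31)

void demo(void)
{
    unsigned int flags = 0u;

    flags |= USER | FORCE;   /* set two flags  */
    flags &= ~FORCE;         /* clear one flag */

    if (flags & COMPAT) {    /* test a flag    */
        /* ... */
    }
}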
What is it about the bitfield approach that makes you cringe?
Both techniques have their place, and the only decision is which one to use:
For simple "one-off" bit fiddling, I use the bitwise operators directly.
For anything more complex - eg hardware register maps, the bitfield approach wins hands down.
Bitfields are more succinct to use (at the expense of slightly more verbosity to write).
Bitfields are more robust (what size is "int", anyway?).
Bitfields are usually just as fast as bitwise operators.
Bitfields are very powerful when you have a mix of single-bit and multiple-bit fields, and extracting a multiple-bit field would otherwise involve loads of manual shifts.
Bitfields are effectively self-documenting: by defining the structure and therefore naming the elements, I know what it's meant to do.
Bitfields also seamlessly handle structures bigger than a single int.
With bitwise operators, typical (bad) practice is a slew of #defines for the bit masks.
The only caveat with bitfields is to make sure the compiler has really packed the object into the size you wanted. I can't remember if this is defined by the standard, so an assert(sizeof(myStruct) == N) is a useful check.
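If your compiler supports C11, the check can even be done at compile time; the expected size of 4 bytes below is an assumption about the target, mirroring the flags_t example above:
/* C11: compilation fails if the compiler did not pack flags_t into 32 bits. */
_Static_assert(sizeof(flags_t) == 4, "flags_t is not packed into 32 bits");

/* Pre-C11 equivalent: a negative array size forces a compile error. */
typedef char flags_t_size_check[(sizeof(flags_t) == 4) ? 1 : -1];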
The blog post you are referring to mentions a raw union field as an alternative access method for bitfields.
The purposes the blog post author used raw for are OK; however, if you plan to use it for anything else (e.g. serialisation of bitfields, setting/checking individual bits), disaster is just waiting for you around the corner. The ordering of bits in memory is architecture-dependent and memory padding rules vary from compiler to compiler (see Wikipedia), so the exact position of each bitfield may differ; in other words, you can never be sure which bit of raw each bitfield corresponds to.
However, if you don't plan to mix the two views, you're better off taking raw out entirely, and you will be safe.
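If the bits do have to leave the program (for example, serialisation), one portable option is to build the integer explicitly from the fields rather than reading raw; the bit positions chosen below are an assumption for illustration:
static unsigned int flags_to_wire(flags_t f)
{
    unsigned int wire = 0u;
    if (f.user)   wire |= 1u << 0;
    if (f.zero)   wire |= 1u << 1;
    if (f.force)  wire |= 1u << 2;
    if (f.compat) wire |= 1u << 31;
    return wire;
}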
Well, you can't go wrong with structure mapping; since both fields are accessible, they can be used interchangeably.
One benefit for bit fields is that you can easily aggregate options:
mask = USER|FORCE|ZERO|COMPAT;
vs
flags.user = true;
flags.force = true;
flags.zero = true;
flags.compat = true;
In some environments such as dealing with protocol options it can get quite old having to individually set options or use multiple parameters to ferry intermediate states to effect a final outcome.
But sometimes setting flag.blah and having the list pop up in your IDE is great, especially if you're like me and can't remember the name of the flag you want to set without constantly referencing the list.
I personally will sometimes shy away from declaring boolean bitfield types, because at some point I'll end up with the mistaken impression that the field I just toggled is independent (think multi-threaded concurrency) of the read/write status of other "seemingly" unrelated fields that happen to share the same 32-bit word.
My vote is that it depends on the context of the situation and in some cases both approaches may work out great.
Either way, bitfields have been used in GNU software for decades and it hasn't done them any harm. I like them as parameters to functions.
I would argue that plain bit flags and masks are the conventional approach, as opposed to structs with bitfields. Everyone knows how to AND the values to turn various options off, and the compiler boils this down to very efficient bitwise operations on the CPU.
Providing you use the masks and tests in the correct way, the abstractions the compiler provide should make it robust, simple, readable and clean.
When I need a set of on/off switches, I'm going to continue using them in C.
In C++, just use std::bitset<N>.
It is error-prone, yes. I've seen lots of errors in this kind of code, mainly because some people feel that they should mess with it and the business logic in a totally disorganized way, creating maintenance nightmares. They think "real" programmers can write value |= mask; , value &= ~mask; or even worse things at any place, and that's just ok. Even better if there's some increment operator around, a couple of memcpy's, pointer casts and whatever obscure and error-prone syntax happens to come to their mind at that time. Of course there's no need to be consistent and you can flip bits in two or three different ways, distributed randomly.
My advice would be:
Encapsulate this ---- in a class, with methods such as SetBit(...) and ClearBit(...). (If you don't have classes in C, in a module.) While you're at it, you can document all their behaviour.
Unit test that class or module.
Your first method is preferable, IMHO. Why obfuscate the issue? Bit fiddling is a really basic thing. C did it right. Endianness doesn't matter. The only thing the union solution does is name things. 11 might be mysterious, but #defined to a meaningful name or enum'ed, it should suffice.
Programmers who can't handle fundamentals like "|&^~" are probably in the wrong line of work.
I nearly always use the logical operations with a bit mask, either directly or as a macro. e.g.
#define ASSERT_GPS_RESET() { P1OUT &= ~GPS_RESET ; }
Incidentally, your union definition in the original question would not work on my processor/compiler combination: the int type is only 16 bits wide, while the bitfield definitions add up to 32 bits. To make it slightly more portable, you would have to define a new 32-bit type that you could then map to the required base type on each target architecture as part of the porting exercise. In my case:
typedef unsigned long int uint32_t;
and in the original example
typedef unsigned int uint32_t;
typedef union {
    struct {
        boolean_t user:1;
        boolean_t zero:1;
        boolean_t force:1;
        int :28;            /* unused */
        boolean_t compat:1; /* bit 31 */
    };
    uint32_t raw;
} flags_t;
The overlaid int should also be made unsigned.
Well, I suppose that's one way of doing it, but I would always prefer to keep it simple.
Once you're used to it, using masks is straightforward, unambiguous and portable.
Bitfields are straightforward, but they are not portable without having to do additional work.
If you ever have to write MISRA-compliant code, the MISRA guidelines frown on bitfields, unions, and many, many other aspects of C, in order to avoid undefined or implementation-dependent behaviour.
When I google for "C operators", the first three pages are:
Operators in C and C++
http://h30097.www3.hp.com/docs/base_doc/DOCUMENTATION/V40F_HTML/AQTLTBTE/DOCU_059.HTM
http://www.cs.mun.ca/~michael/c/op.html
...so I think that argument about people new to the language is a little silly.
Generally, the one that is easier to read and understand is the one that is also easier to maintain. If you have co-workers that are new to C, the "safer" approach will probably be the easier one for them to understand.
Bitfields are great, except that the bit manipulation operations are not atomic, and can thus lead to problems in multi-threaded application.
For example, one could assume that the macro
#define SET_BIT(val, bitIndex) ((val) |= (1u << (bitIndex)))
defines an atomic operation, since |= is one statement. But the ordinary code generated by a compiler will not try to make |= atomic.
So if multiple threads execute different set-bit operations, one of them could be lost, since both threads will execute:
thread 1         thread 2
LOAD field       LOAD field
OR mask1         OR mask2
STORE field      STORE field
The result can be field' = field OR mask1 OR mask2 (intended), or it can be field' = field OR mask1 (not intended), or it can be field' = field OR mask2 (not intended).
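If the flags really are shared between threads, one option (assuming a C11 compiler with <stdatomic.h>) is to let the hardware perform the read-modify-write atomically; a lock around the plain |= would work as well. A minimal sketch:
#include <stdatomic.h>

static atomic_uint field;  /* zero-initialised */

void set_bit_atomically(unsigned bitIndex)
{
    atomic_fetch_or(&field, 1u << bitIndex);    /* LOAD/OR/STORE as one atomic step */
}

void clear_bit_atomically(unsigned bitIndex)
{
    atomic_fetch_and(&field, ~(1u << bitIndex));
}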
I'm not adding much to what's already been said, except to emphasize two points:
The compiler is free to arrange bits within a bitfield any way it wants. This means that if you're trying to manipulate bits in a microcontroller register, or if you want to send the bits to another processor (or even to the same processor with a different compiler), you MUST use bitmasks.
On the other hand, if you're trying to create a compact representation of bits and small integers for use within a single processor, bitfields are easier to maintain and thus less error prone, and -- with most compilers -- are at least as efficient as manually masking and shifting.

Resources