What is the point behind unions in C?

I'm going through O'Reilly's Practical C Programming book, and having read the K&R book on the C programming language, I'm still having trouble grasping the concept behind unions.
They take the size of the largest data type that makes them up...and the most recently assigned member overwrites the rest...but why not just allocate and free memory as needed?
The book mentions that it's used in communication, where you need to set flags of the same size; and on a googled website, that it can eliminate odd-sized memory chunks...but is it of any use in a modern, non-embedded memory space?
Is there something crafty you can do with it and CPU registers? Is it simply a holdover from an earlier era of programming? Or does it, like the infamous goto, still have some powerful use (possibly in tight memory spaces) that makes it worth keeping around?

Well, you almost answered your question: Memory.
Back in the day, memory was scarce, and even saving a few kilobytes was useful.
But even today there are scenarios where unions are useful. For example, if you'd like to implement some kind of variant data type, a union is the natural way to do it.
This doesn't sound like much, but let's assume you want a variable that stores either a 4-character string (like an ID) or a 4-byte number (which could be some hash or indeed just a number).
If you use a classic struct, this would be 8 bytes long (at least; if you're unlucky there are padding bytes as well). Using a union it's only 4 bytes. So you're saving 50% of the memory, which isn't a lot for one instance, but imagine having a million of these.
While you can achieve similar things by casting or subclassing, a union is still the easiest way to do this.
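A minimal sketch of that ID-or-number idea, assuming uint32_t is the 4-byte number; the union and member names here are invented for illustration, and only the member written last holds a meaningful value:
#include <stdint.h>
#include <string.h>

union IdOrHash {
    char     id[4];   /* a 4-character identifier (not NUL-terminated) */
    uint32_t hash;    /* or a 4-byte number */
};                    /* sizeof(union IdOrHash) is 4, not 8 */

int main(void)
{
    union IdOrHash v;
    memcpy(v.id, "ABCD", 4);   /* store it as characters...           */
    v.hash = 0xDEADBEEFu;      /* ...or overwrite it with a number;   */
                               /* whichever was written last is valid */
    return 0;
}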

One use of unions is having two variables occupy the same space, with a second variable in the struct deciding which data type you want to read it as.
For example, you could have a boolean 'isDouble' and a union 'doubleOrLong' which holds both a double and a long. If isDouble == true, interpret the union as a double; otherwise interpret it as a long.
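A hedged sketch of that pattern (the struct, field, and function names here are invented, not taken from any particular codebase):
#include <stdbool.h>
#include <stdio.h>

struct Number {
    bool isDouble;              /* the tag: says which union member is currently valid */
    union {
        double d;
        long   l;
    } doubleOrLong;
};

static void print_number(const struct Number *n)
{
    if (n->isDouble)
        printf("%f\n", n->doubleOrLong.d);   /* interpret the union as a double */
    else
        printf("%ld\n", n->doubleOrLong.l);  /* ...otherwise as a long */
}

int main(void)
{
    struct Number n = { .isDouble = true, .doubleOrLong = { .d = 3.14 } };
    print_number(&n);
    return 0;
}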
Another use of unions is accessing data types in different representations. For instance, if you know how a double is laid out in memory, you could put a double in a union, access it as a different data type like a long, directly access its bits, its mantissa, its sign, its exponent, whatever, and do some direct manipulation with it.
You don't really need this nowadays since memory is so cheap, but in embedded systems it has its uses.
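As a rough illustration of that kind of bit inspection, here is a sketch that assumes an IEEE 754 double and a 64-bit uint64_t (typical, but not guaranteed by the C standard); the union name is made up:
#include <stdint.h>
#include <stdio.h>

union DoubleBits {
    double   d;
    uint64_t u;
};

int main(void)
{
    union DoubleBits b = { .d = -1.5 };
    uint64_t sign     = b.u >> 63;                  /* 1 sign bit       */
    uint64_t exponent = (b.u >> 52) & 0x7FF;        /* 11 exponent bits */
    uint64_t mantissa = b.u & 0xFFFFFFFFFFFFFULL;   /* 52 mantissa bits */
    printf("sign=%llu exponent=%llu mantissa=0x%llx\n",
           (unsigned long long)sign, (unsigned long long)exponent,
           (unsigned long long)mantissa);
    return 0;
}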

The Windows API makes use of unions quite a lot. LARGE_INTEGER is an example of such a usage. Basically, if the compiler supports 64-bit integers, use the QuadPart member; otherwise, set the low DWORD and the high DWORD manually.
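For reference, here is a simplified, self-contained imitation of that layout (the real LARGE_INTEGER in winnt.h uses DWORD/LONG/LONGLONG and also has a named inner struct u; the type name below is invented, and the anonymous struct member needs C11 or a common compiler extension):
#include <stdint.h>

typedef union {
    struct {
        uint32_t LowPart;
        int32_t  HighPart;
    };
    int64_t QuadPart;
} LargeIntegerLike;

int main(void)
{
    LargeIntegerLike li;
    li.QuadPart = 0x0000000100000002LL;  /* if 64-bit integers are supported, set it in one go */
    li.LowPart  = 0x00000002;            /* otherwise, set the low and high halves manually    */
    li.HighPart = 0x00000001;
    return 0;
}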

It's not really a holdover; the C language was created in 1972, when memory was a real concern.
You could make the argument that in a modern, non-embedded space, you might not want to use C as a programming language to begin with. If you've chosen C as your implementation language, you're looking to harness the benefits of C: it's efficient and close to the metal, which results in tight, fast binaries.
As such, when choosing to use C, you'd still want to take advantage of its benefits, which include memory-space efficiency. Here the union works very well, allowing you some degree of type safety while enforcing the smallest memory footprint available.

One place where I have seen it used is in the Doom 3/idTech 4 Fast Inverse Square Root implementation.
For those unfamiliar with this algorithm, it essentially requires treating a floating point number as an integer. The old Quake (and earlier) version of the code does this by the following:
float y = 2.0f;
// treat the bits of y as an integer
long i = * ( long * ) &y;
// do some stuff with i
// treat the bits of i as a float
y = * ( float * ) &i;
original source on GitHub
This code takes the address of a floating point number y, casts it to a pointer to a long (i.e., a 32-bit integer in Quake days), and dereferences that into i. Then it does some incredibly bizarre bit-twiddling stuff, and the reverse.
There are two disadvantages of doing it this way. One is that the convoluted address-of, cast, dereference process forces the value of y to be read from memory, rather than from a register1, and ditto on the way back. On Quake-era computers, however, floating point and integer registers were completely separate so you pretty much had to push to memory and back to deal with this restriction.
The second is that, at least in C++, doing such casting is deeply frowned upon, even when doing what amounts to voodoo such as this function does. I'm sure there are more compelling arguments, however I'm not sure what they are :)
So, in Doom 3, id included the following bit in their new implementation (which uses a different set of bit twiddling, but a similar idea):
union _flint {
    dword i;
    float f;
};
...
union _flint seed;
seed.i = /* look up some tables to get this */;
double r = seed.f; // <- access the bits of seed.i as a floating point number
original source on GitHub
Theoretically, on an SSE2 machine, this can be accessed through a single register; I'm not sure in practice whether any compiler would do this. It's still somewhat cleaner code in my opinion than the casting games in the earlier Quake version.
1 - ignoring "sufficiently advanced compiler" arguments

Related

Comparison uint8_t vs uint16_t while declaring a counter

Assuming we have a counter that counts from 0 to 100, is there an advantage to declaring the counter variable as uint16_t instead of uint8_t?
Obviously if I use uint8_t I could save some space. On a processor with a natural word size of 16 bits, access times would be the same for both, I guess. I can't think of a reason why I would use uint16_t if uint8_t can cover the range.
Using a wider type than necessary can allow the compiler to avoid having to mask the higher bits.
Suppose you were working on a 16-bit architecture; then using uint16_t could be more efficient. However, if you used uint16_t instead of uint8_t on a 32-bit architecture, you would still have the mask instructions, just masking a different number of bits.
The most efficient type to use in a cross-platform portable way is just plain int or unsigned int, which will always be the correct type to avoid the need for masking instructions, and will always be able to hold numbers up to 100.
If you are in a MISRA or similar regulated environment that forbids the use of native types, then the correct standard-compliant type to use is uint_fast8_t. This guarantees to be the fastest unsigned integer type that has at least 8 bits.
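A minimal sketch of that suggestion (the function is invented for illustration); uint_fast8_t lets the implementation pick whatever unsigned type of at least 8 bits it considers fastest:
#include <stdint.h>

void fill(int *array, int value)
{
    for (uint_fast8_t i = 0; i < 100; i++)   /* counts 0..99, no manual width tuning needed */
        array[i] = value;
}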
However, all of this is nonsense really. Your primary goal in writing code should be to make it readable, not to make it as fast as possible. Penny-pinching instructions like this makes code convoluted and more likely to have bugs. Also because it is harder to read, the bugs are less likely to be found during code review.
You should only try to optimize like this once the code is finished and you have tested it and found the particular part which is the bottleneck. Masking a loop counter is very unlikely to be the bottleneck in any real code.
Obviously if I use uint8_t I could save some space.
Actually, that's not necessarily obvious! A loop index variable is likely to end up in a register, and if it does there's no memory to be saved. Also, since the definition of the C language says that much arithmetic takes place using type int, it's possible that using a variable smaller than int might actually end up costing you space in terms of extra code emitted by the compiler to convert back and forth between int and your smaller variable. So while it could save you some space, it's not at all guaranteed that it will — and, in any case, the actual savings are going to be almost imperceptibly small in the grand scheme of things.
If you have an array of some number of integers in the range 0-100, using uint8_t is a fine idea if you want to save space. For an individual variable, on the other hand, the arguments are pretty different.
In general, I'd say that there are two reasons not to use type uint8_t (or, equivalently, char or unsigned char) as a loop index:
It's not going to save much data space (if at all), and it might cost code size and/or speed.
If the loop runs over exactly 256 elements (yours didn't, but I'm speaking more generally here), you may have introduced a bug, which you'll discover soon enough: your loop may run forever (see the sketch after this answer).
The interviewer was probably expecting #1 as an answer. It's not a guaranteed answer — under plenty of circumstances, using the smaller type won't cost you anything, and evidently there are microprocessors where it can actually save something — but as a general rule, I agree that using an 8-bit type as a loop index is, well, silly. And whether or not you agree, it's certainly an issue to be aware of, so I think it's a fair interview question.
See also this question, which discusses the same sorts of issues.
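To illustrate point 2 above, here is a minimal, deliberately broken sketch (the function name is made up):
#include <stdint.h>

void broken_fill(int array[256], int value)
{
    /* i wraps from 255 back to 0 before it can ever reach 256, so the
     * condition i < 256 is always true and the loop never terminates. */
    for (uint8_t i = 0; i < 256; i++)
        array[i] = value;
}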
The interview question doesn't make much sense from a platform-generic point of view. If we look at code such as this:
for (uint8_t i = 0; i < n; i++)
    array[i] = x;
Then the expression i<n will get carried out on type int or larger because of implicit promotion. Though the compiler may optimize it to use a smaller type if it doesn't affect the result.
As for array[i], the compiler is likely to use a type corresponding to whatever address size the system is using.
What the interviewer was probably fishing for is that uint32_t on a 32-bit machine tends to generate faster code in some situations. For those cases you can use uint_fast8_t, but more likely the compiler will perform such optimizations regardless.
The only optimization uint8_t blocks the compiler from doing is allocating a larger variable than 8 bits on the stack. It doesn't, however, block the compiler from optimizing out the variable entirely and using a register instead, such as storing it in an index register with the same width as the address bus.
Example with gcc x86_64: https://godbolt.org/z/vYscf3KW9. The disassembly is pretty painful to read, but the compiler just picked CPU registers to store everything regardless of the type of i, giving identical machine code for uint8_t and uint16_t. I would have been surprised if it didn't.
On a processor with a natural word size of 16 bits, access times would be the same for both, I guess.
Yes, this is true for all mainstream 16-bitters. Some might even manage faster code if given 8 bits instead of 16. Some exotic systems like DSPs exist, but in the case of, say, a DSP where 1 byte = 16 bits, the compiler doesn't even provide you with uint8_t to begin with; it is an optional type. One generally doesn't bother with portability to wildly exotic systems, since doing so is a waste of everyone's time and money.
The correct answer: it is senseless to do manual optimization without a specific system in mind. uint8_t is perfectly fine to use for generic, portable code.

What is the most efficient way to represent small values in a struct?

Often I find myself having to represent a structure that consists of very small values. For example, Foo has 4 values, a, b, c, d, that range from 0 to 3. Usually I don't care, but sometimes those structures are
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
In that case, I find myself having trouble deciding how to represent Foo efficiently. I have basically 4 options:
struct Foo {
    int a;
    int b;
    int c;
    int d;
};

struct Foo {
    char a;
    char b;
    char c;
    char d;
};

struct Foo {
    char abcd;
};

struct FourFoos {
    int abcd_abcd_abcd_abcd;
};
They use 128, 32, 8, and 8 bits per Foo respectively, ranging from sparse to densely packed. The first example is probably the most natural one, but using it would essentially increase the size of the program's data by 16 times, which doesn't sound quite right. Moreover, most of the memory would be filled with zeroes and not used at all, which makes me wonder if this isn't a waste. On the other hand, packing them densely brings additional overhead for reading them.
What is the computationally 'fastest' method for representing small values in a struct?
For dense packing that doesn't incur a large overhead of reading, I'd recommend a struct with bitfields. In your example where you have four values ranging from 0 to 3, you'd define the struct as follows:
struct Foo {
    unsigned char a : 2;
    unsigned char b : 2;
    unsigned char c : 2;
    unsigned char d : 2;
};
This has a size of 1 byte, and the fields can be accessed simply, i.e. foo.a, foo.b, etc.
By making your struct more densely packed, that should help with cache efficiency.
Edit:
To summarize the comments:
There's still bit fiddling happening with a bitfield, however it's done by the compiler and will most likely be more efficient than what you would write by hand (not to mention it makes your source code more concise and less prone to introducing bugs). And given the large amount of structs you'll be dealing with, the reduction of cache misses gained by using a packed struct such as this will likely make up for the overhead of bit manipulation the struct imposes.
Pack them only if space is a consideration - for example, an array of 1,000,000 structs. Otherwise, the code needed to do shifting and masking is greater than the savings in space for the data. Hence you are more likely to have a cache miss on the I-cache than the D-cache.
There is no definitive answer, and you haven't given enough information to allow a "right" choice to be made. There are trade-offs.
Your statement that your "primary goal is time efficiency" is insufficient, since you haven't specified whether I/O time (e.g. to read data from file) is more of a concern than computational efficiency (e.g. how long some set of computations take after a user hits a "Go" button).
So it might be appropriate to write the data as a single char (to reduce time to read or write) but unpack it into an array of four int (so subsequent calculations go faster).
Also, there is no guarantee that an int is 32 bits (which you have assumed in your statement that the first packing uses 128 bits). An int can be 16 bits.
Foo has 4 values, a, b, c, d, that range from 0 to 3. Usually I don't
care, but sometimes those structures are ...
There is another option: since the values 0 ... 3 likely indicate some sort of state, you could consider using "flags"
enum {
    A_1 = 1 << 0,
    A_2 = 1 << 1,
    A_3 = A_1 | A_2,
    B_1 = 1 << 2,
    B_2 = 1 << 3,
    B_3 = B_1 | B_2,
    C_1 = 1 << 4,
    C_2 = 1 << 5,
    C_3 = C_1 | C_2,
    D_1 = 1 << 6,
    D_2 = 1 << 7,
    D_3 = D_1 | D_2,
    // you could continue to ... D7_3 for 32/64 bits if it makes sense
};
This isn't much different than using bitfields for most situations, but can drastically reduce your conditional logic.
if (a < 2 && b < 2 && c < 2 && d < 2) // ... (4 comparisons)
// vs.
if ((abcd & (A_2 | B_2 | C_2 | D_2)) == 0) // (one bitop with a constant and a 0-compare)
Depending on what kinds of operations you will be doing on the data, it may make sense to use either 4 or 8 sets of abcd and pad out the end with 0s as needed. That could allow up to 32 comparisons to be replaced with a bitop and 0-compare.
For instance, if you wanted to set the "1 bit" in all 8 sets of 4 in a 64-bit variable, you can do uint64_t abcd8 = 0x5555555555555555ULL;. Then, to set all the "2 bits", you could do abcd8 |= 0xAAAAAAAAAAAAAAAAULL;, making every value 3.
Addendum:
On further consideration, you could use a union as your type: either a union of a char with #dbush's bitfields (those flag operations would still work on the unsigned char), or char types for each of a, b, c, d unioned with an unsigned int. This allows both a compact representation and efficient operations depending on which union member you use.
union Foo {
    char abcd; // Note: you can use flags and bitops on this too
    struct {
        unsigned char a : 2;
        unsigned char b : 2;
        unsigned char c : 2;
        unsigned char d : 2;
    };
};
Or even extended further
union Foo {
    uint64_t abcd8; // Note: you can use flags and bitops on these too
    uint32_t abcd4[2];
    uint16_t abcd2[4];
    uint8_t  abcd[8];
    struct {
        unsigned char a : 2;
        unsigned char b : 2;
        unsigned char c : 2;
        unsigned char d : 2;
    } _[8];
};

union Foo myfoo = { 0xFFFFFFFFFFFFFFFFULL };
// assert(myfoo._[0].a == 3 && myfoo.abcd[0] == 0xFF);
This method does introduce some endianness differences, which would also be a problem if you use a union to cover any other combination of your other methods.
union Foo {
    uint32_t abcd;
    uint32_t dcba; // only here for endian purposes
    struct {       // anonymous struct
        char a;
        char b;
        char c;
        char d;
    };
};
You could experiment and measure with different union types and algorithms to see which parts of the unions are worth keeping, then discard the ones that are not useful. You may find that operating on several char/short/int types simultaneously gets automatically optimized to some combination of AVX/simd instructions whereas using bitfields does not unless you manually unroll them... there is no way to know until you test and measure them.
Fitting your data set in cache is critical. Smaller is always better, because hyperthreading competitively shares the per-core caches between the hardware threads (on Intel CPUs). Comments on this answer include some numbers for costs of cache misses.
On x86, loading 8bit values with sign or zero-extension into 32 or 64bit registers (movzx or movsx) is literally just as fast as plain mov of a byte or 32bit dword. Storing the low byte of a 32bit register also has no overhead. (See Agner Fog's instruction tables and C / asm optimization guides here).
Still x86-specific: [u]int8_t temporaries are ok, too, but avoid [u]int16_t temporaries. (load/store from/to [u]int16_t in memory is fine, but working with 16bit values in registers has big penalties from the operand-size prefix decoding slowly on Intel CPUs.) 32bit temporaries will be faster if you want to use them as an array index. (Using 8bit registers doesn't zero the high 24/56 bits, so it takes an extra instruction to zero or sign-extend before using an 8bit register as an array index, or in an expression with a wider type, like adding it to an int.)
I'm unsure what ARM or other architectures can do as far as efficient zero/sign extension from single-byte loads, or for single-byte stores.
Given this, my recommendation is pack for storage, use int for temporaries. (Or long, but that will increase code size slightly on x86-64, because a REX prefix is needed to specify a 64bit operand size.) e.g.
int a_i = foo[i].a;
int b_i = foo[i].b;
...;
foo[i].a = a_i + b_i;
bitfields
Packing into bitfields will have more overhead, but can still be worth it. Testing a compile-time-constant-bit-position (or multiple bits) in a byte or 32/64bit chunk of memory is fast. If you actually need to unpack some bitfields into ints and pass them to a non-inline function call or something, that will take a couple extra instructions to shift and mask. If this gives even a small reduction in cache misses, this can be worth it.
Testing, setting (to 1) or clearing (to 0) a bit or group of bits can be done efficiently with OR or AND, but assigning an unknown boolean value to a bitfield takes more instructions to merge the new bits with the bits for other fields. This can significantly bloat code if you assign a variable to a bitfield very often. So using int foo:6 and things like that in your structs, because you know foo doesn't need the top two bits, is not likely to be helpful. If you're not saving many bits compared to putting each thing in its own byte/short/int, then the reduction in cache misses won't outweigh the extra instructions (which can add up into I-cache / uop-cache misses, as well as the direct extra latency and work of the instructions).
The x86 BMI1 / BMI2 (Bit-Manipulation) instruction-set extensions will make copying data from a register into some destination bits (without clobbering the surrounding bits) more efficient. BMI1: Haswell, Piledriver. BMI2: Haswell, Excavator (unreleased). Note that like SSE/AVX, this will mean you'd need BMI versions of your functions, and fallback non-BMI versions for CPUs that don't support those instructions. AFAIK, compilers don't have options to recognize patterns for these instructions and use them automatically. They're only usable via intrinsics (or asm).
As dbush's answer suggests, packing into bitfields is probably a good choice, depending on how you use your fields. Your fourth option (packing four separate abcd values into one struct) is probably a mistake, unless you can do something useful with four sequential abcd values (vector-style).
code generically, try both ways
For a data structure your code uses extensively, it makes sense to set things up so you can flip from one implementation to another, and benchmark. Nir Friedman's answer, with getters/setters is a good way to go. However, just using int temporaries and working with the fields as separate members of the struct should work fine. It's up to the compiler to generate code to test the right bits of a byte, for packed bitfields.
prepare for SIMD, if warranted
If you have any code that checks just one or a couple fields of each struct, esp. looping over sequential struct values, then the struct-of-arrays answer given by cmaster will be useful. x86 vector instructions have a single byte as the smallest granularity, so a struct-of-arrays with each value in a separate byte would let you quickly scan for the first element where a == something, using PCMPEQB / PTEST.
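For the sequential-scan case just described, here is a hedged sketch of scanning a plain byte array for the first matching element with SSE2 intrinsics; the function name find_first_equal and the simplifying assumption that n is a multiple of 16 are mine, and __builtin_ctz is a GCC/Clang builtin:
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

static size_t find_first_equal(const unsigned char *a, size_t n, unsigned char value)
{
    __m128i needle = _mm_set1_epi8((char)value);            /* broadcast the byte to all 16 lanes */
    for (size_t i = 0; i < n; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(a + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0)                                       /* some lane compared equal */
            return i + (size_t)__builtin_ctz((unsigned)mask);
    }
    return n;   /* not found */
}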
First, precisely define what you mean by "most efficient". Best memory utilization? Best performance?
Then implement your algorithm both ways and actually profile it on the actual hardware you intend to run it on under the actual conditions you intend to run it under once it's delivered.
Pick the one that better meets your original definition of "most efficient".
Anything else is just a guess. Whatever you choose will probably work fine, but without actually measuring the difference under the exact conditions you'd use the software, you'll never know which implementation would be "more efficient".
I think the only real answer can be to write your code generically, and then profile the full program with all of them. I don't think this will take that much time, though it may look a little more awkward. Basically, I'd do something like this:
template <bool is_packed> class Foo;
using interface_int = char;

// Unpacked: one char per field.
template <>
class Foo<false> {
    char m_a, m_b, m_c, m_d;
public:
    void setA(interface_int a) { m_a = a; }
    interface_int getA() { return m_a; }
    ...
};

// Packed: all four 2-bit fields share one byte.
template <>
class Foo<true> {
    char m_data;
public:
    void setA(interface_int a) { m_data = (m_data & ~0x03) | (a & 0x03); } // bit magic changes m_data
    interface_int getA() { return m_data & 0x03; }                         // bit magic gets a from m_data
};
If you just write your code like this instead of exposing the raw data, it will be easy to switch implementations and profile. The function calls will get inlined and will not impact performance. Note that I just wrote setA and getA instead of a function that returns a reference; that would be more complicated to implement.
Code it with ints
Treat the fields as ints.
blah.x in all your code, except the declaration, will be all you will be doing. Integral promotion will take care of most cases.
When you are all done, have 3 equivalent include files: one using ints, one using chars, and one using bitfields.
Then profile. Don't worry about it at this stage, because it's premature optimization, and nothing but your chosen include file will change.
Massive Arrays and Out of Memory Errors
the whole program consists of a big array of billions of Foos;
First things first, for #2, you might find yourself or your users (if others run the software) often being unable to allocate this array successfully if it spans gigabytes. A common mistake here is to think that out-of-memory errors mean "no more memory available", when they often instead mean that the OS could not find a contiguous set of unused pages matching the requested memory size. It's for this reason that people often get confused when they request to allocate a one-gigabyte block only to have it fail even though they have 30 gigabytes of physical memory free. Once you start allocating memory in sizes that span more than, say, 1% of the typical amount of memory available, it's often time to consider avoiding one giant array to represent the whole thing.
So perhaps the first thing you need to do is rethink the data structure. Instead of allocating a single array of billions of elements, often you'll significantly reduce the odds of running into problems by allocating in smaller chunks (smaller arrays aggregated together). For example, if your access pattern is solely sequential in nature, you can use an unrolled list (arrays linked together). If random access is needed, you might use something like an array of pointers to arrays which each span 4 kilobytes. This requires a bit more work to index an element, but with this kind of scale of billions of elements, it's often a necessity.
Access Patterns
One of the things unspecified in the question are the memory access patterns. This part is critical for guiding your decisions.
For example, is the data structure solely traversed sequentially, or is random access needed? Are all of these fields: a, b, c, d, needed together all the time, or can they be accessed one or two or three at a time?
Let's try to cover all the possibilities. At the scale we're talking about, this:
struct Foo {
    int a1;
    int b1;
    int c1;
    int d1;
};
... is unlikely to be helpful. At this kind of input scale, and accessed in tight loops, your times are generally going to be dominated by the upper levels of memory hierarchy (paging and CPU cache). It no longer becomes quite as critical to focus on the lowest level of the hierarchy (registers and associated instructions). To put it another way, at billions of elements to process, the last thing you should be worrying about is the cost of moving this memory from L1 cache lines to registers and the cost of bitwise instructions, e.g. (not saying it's not a concern at all, just saying it's a much lower priority).
At a small enough scale where the entirety of the hot data fits into the CPU cache and a need for random access, this kind of straightforward representation can show a performance improvement due to the improvements at the lowest level of the hierarchy (registers and instructions), yet it would require a drastically smaller-scale input than what we're talking about.
So even this is likely to be a considerable improvement:
struct Foo {
    char a1;
    char b1;
    char c1;
    char d1;
};
... and this even more:
// Each field packs 4 values with 2 bits each.
struct Foo {
    char a4;
    char b4;
    char c4;
    char d4;
};
* Note that you could use bitfields for the above, but bitfields tend to have caveats associated with them depending on the compiler being used. I've often been careful to avoid them due to the portability issues commonly described, though this may be unnecessary in your case. However, as we adventure into SoA and hot/cold field-splitting territories below, we'll reach a point where bitfields can't be used anyway.
This code also places a focus on horizontal logic which can start to make it easier to explore some further optimization paths (ex: transforming the code to use SIMD), as it's already in a miniature SoA form.
Data "Consumption"
Especially at this kind of scale, and even more so when your memory access is sequential in nature, it helps to think in terms of data "consumption" (how quickly the machine can load data, do the necessary arithmetic, and store the results). A simple mental image I find useful is to imagine the computer as having a "big mouth". It goes faster if we feed it large enough spoonfuls of data at once, not little teeny teaspoons, and with more relevant data packed tightly into a contiguous spoonful.
Hot/Cold Field Splitting
The above code so far is making the assumption that all of these fields are equally hot (accessed frequently), and accessed together. You may have some cold fields or fields that are only accessed in critical code paths in pairs. Let's say that you rarely access c and d, or that your code has one critical loop that accesses a and b, and another that accesses c and d. In that case, it can be helpful to split it off into two structures:
struct Foo1 {
    char a4;
    char b4;
};

struct Foo2 {
    char c4;
    char d4;
};
Again if we're "feeding" the computer data, and our code is only interested in a and b fields at the moment, we can pack more into spoonfuls of a and b fields if we have contiguous blocks that only contain a and b fields, and not c and d fields. In such a case, c and d fields would be data the computer can't digest at the moment, yet it would be mixed into the memory regions in between a and b fields. If we want the computer to consume data as quickly as possible, we should only be feeding it the relevant data of interest at the moment, so it's worth splitting the structures in these scenarios.
SIMD SoA for Sequential Access
Moving towards vectorization, and assuming sequential access, the fastest rate at which the computer can consume data will often be in parallel using SIMD. In such a case, we might end up with a representation like this:
struct Foo1 {
    char* a4n;
    char* b4n;
};
... with careful attention to alignment and padding (the size/alignment should be a multiple of 16 or 32 bytes for AVX or even 64 for futuristic AVX-512) necessary to use faster aligned moves into XMM/YMM registers (and possibly with AVX instructions in the future).
AoSoA for Random/Multi-Field Access
Unfortunately the above representation can start to lose a lot of the potential benefits if a and b are accessed frequently together, especially with a random access pattern. In such a case, a more optimal representation can start looking like this:
struct Foo1 {
    char a4x32[32];
    char b4x32[32];
};
... where we're now aggregating this structure. This makes it so the a and b fields are no longer so spread apart, allowing a group of 32 bytes of a fields and 32 bytes of b fields (128 two-bit values of each) to fit into a single 64-byte cache line and be accessed together quickly. We can also now fit 64 of these packed a or b values into an XMM register, or 128 into a YMM register.
Profiling
Normally I try to avoid general wisdom advice in performance questions, but I noticed this one seems to be missing the details that someone with a profiler in hand would typically mention. So I apologize if this comes off a bit as patronizing or if a profiler is already being actively used, but I think the question warrants this section.
As an anecdote, I've often done a better job (I shouldn't!) at optimizing production code written by people who have far superior knowledge than me about computer architecture (I worked with a lot of people who came from the punch card era and can understand assembly code at a glance), and would often get called in to optimize their code (which felt really odd). It's for one simple reason: I "cheated" and used a profiler (VTune). My peers often didn't (they had an allergy to it and thought they understood hotspots just as well as a profiler and saw profiling as a waste of time).
Of course the ideal is to find someone who has both the computer architecture expertise and a profiler in hand, but lacking one or the other, the profiler can give the bigger edge. Optimization still rewards a productivity mindset which hinges on the most effective prioritization, and the most effective prioritization is to optimize the parts that truly matter the most. The profiler gives us detailed breakdowns of exactly how much time is spent and where, along with useful metrics like cache misses and branch mispredictions, which even the most advanced humans typically can't predict anywhere near as accurately as a profiler can reveal. Furthermore, profiling is often the key to discovering how the computer architecture works at a more rapid pace, by chasing down hotspots and researching why they exist. For me, profiling was the ultimate entry point into better understanding how the computer architecture actually works rather than how I imagined it to work. It was only then that the writings of someone as experienced in this regard as Mysticial started to make more and more sense.
Interface Design
One of the things that might start to become apparent here is that there are many optimization possibilities. The answers to this kind of question are going to be about strategies rather than absolute approaches. A lot still has to be discovered in hindsight after you try something, as you keep iterating towards more and more optimal solutions as you need them.
One of the difficulties here in a complex codebase is leaving enough breathing room in the interfaces to experiment and try different optimization techniques, to iterate and iterate towards faster solutions. If the interface leaves room to seek these kinds of optimizations, then we can optimize all day long and often get some marvelous results if we're measuring things properly even with a trial and error mindset.
Leaving enough breathing room in an implementation to even experiment and explore faster techniques often requires interface designs that accept data in bulk. This is especially true if the interfaces involve indirect function calls (e.g. through a dylib or a function pointer) where inlining is no longer an effective possibility. In such scenarios, leaving room to optimize without cascading interface breakages often means designing away from the mindset of receiving simple scalar parameters in favor of passing pointers to whole chunks of data (possibly with a stride if there are various interleaving possibilities). So while this is straying into pretty broad territory, a lot of the top priorities in optimizing here boil down to leaving enough breathing room to optimize implementations without cascading changes throughout your codebase, and having a profiler in hand to guide you the right way.
TL;DR
Anyway, some of these strategies should help guide you the right way. There are no absolutes here, only guides and things to try out, and always best done with a profiler in hand. Yet when processing data of this enormous scale, it's always worth remembering the image of the hungry monster, and how to most effectively feed it these appropriately-sized and packed spoonfuls of relevant data.
Let's say you have a memory bus that's a little bit older and can deliver 10 GB/s. Now take a CPU at 2.5 GHz, and you see that you would need to handle at least four bytes per cycle to saturate the memory bus. As such, when you use the definition of
struct Foo {
    char a;
    char b;
    char c;
    char d;
};
and use all four variables in each pass through the data, your code will be CPU bound. You can't gain any speed by a denser packing.
Now, this is different when each pass only performs a trivial operation on one of the four values. In that case, you are better off with a struct of arrays:
struct Foo {
    size_t count;
    char* a; // a[count]
    char* b; // b[count]
    char* c; // c[count]
    char* d; // d[count]
};
You've stated the common and ambiguous C/C++ tag.
Assuming C++, make the data private and add getters/setters.
No, that will not cause a performance hit, provided the optimizer is turned on.
You can then change the implementation to use the alternatives without any change to your calling code - and therefore more easily finesse the implementation based on the results of the bench tests.
For the record, I'd expect the struct with bit fields as per #dbush to be most likely the fastest given your description.
Note all this is around keeping the data in cache - you may also want to see if the design of the calling algorithm can help with that.
Getting back to the question asked :
used in a tight loop;
their values are read a billion times/s, and that is the bottleneck of the program;
the whole program consists of a big array of billions of Foos;
This is a classic example of when you should write platform specific high performance code that takes time to design for each implementation platform, but the benefits outweigh that cost.
As it's the bottleneck of the entire program you don't look for a general solution, but recognize that this needs to have multiple approaches tested and timed against real data, as the best solution will be platform specific.
It is also possible, as it is a large array of billions of Foos, that the OP should consider using OpenCL or OpenMP as potential solutions so as to maximize the exploitation of available resources on the runtime hardware. This depends a little on what you need from the data, but it's probably the most important aspect of this type of problem: how to exploit the available parallelism.
But there is no single right answer to this question, IMO.
The most efficient choice, in terms of performance and execution, is to use the processor's word size. Don't make the processor perform the extra work of packing or unpacking.
Some processors have more than one efficient size. Many ARM processors can operate in 8/32 bit mode. This means that the processor is optimized for handling 8 bit quantities or 32-bit quantities. For a processor like this, I recommend using 8-bit data types.
Your algorithm has a lot to do with the efficiency. If you are moving data or copying data you may want to consider moving data 32-bits at a time (4 8-bit quantities). The idea here is to reduce the number of fetches by the processor.
For performance, write your code to make use of registers, such as using more local variables. Fetching from memory into registers is more costly than using registers directly.
Best of all, check out your compiler optimization settings. Set your compile for the highest performance (speed) settings. Next, generate assembly language listings of your functions. Review the listing to see how the compiler generated code. Adjust your code to improve the compiler's optimization capabilities.
If what you're after is efficiency of space, then you should consider avoiding structs altogether. The compiler will insert padding into your struct representation as necessary to make its size a multiple of its alignment requirement, which might be as much as 16 bytes (but is more likely to be 4 or 8 bytes, and could after all be as little as 1 byte).
If you use a struct anyway, then which to use depends on your implementation. If #dbush's bitfield approach yields one-byte structures then it's hard to beat that. If your implementation is going to pad the representation to at least four bytes no matter what, however, then this is probably the one to use:
struct Foo {
    char a;
    char b;
    char c;
    char d;
};
Or I guess I would probably use this variant:
struct Foo {
    uint8_t a;
    uint8_t b;
    uint8_t c;
    uint8_t d;
};
Since we're supposing that your struct is taking up a minimum of four bytes, there is no point in packing the data into smaller space. That would be counter-productive, in fact, because it would also make the processor do the extra work packing and unpacking the values within.
For handling large amounts of data, making efficient use of the CPU cache provides a far greater win than avoiding a few integer operations. If your data usage pattern is at least somewhat systematic (e.g. if after accessing one element of your erstwhile struct array, you are likely to access a nearby one next) then you are likely to get a boost in both space efficiency and speed by packing the data as tightly as you can. Depending on your C implementation (or if you want to avoid implementation dependency), you might need to achieve that differently -- for instance, via an array of integers. For your particular example of four fields, each requiring two bits, I would consider representing each "struct" as a uint8_t instead, for a total of 1 byte each.
Maybe something like this:
#include <stdint.h>

#define NUMBER_OF_FOOS 1000000000
#define A 0
#define B 2
#define C 4
#define D 6
#define SET_FOO_FIELD(foos, index, field, value) \
    ((foos)[index] = (((foos)[index] & ~(3 << (field))) | (((value) & 3) << (field))))
#define GET_FOO_FIELD(foos, index, field) (((foos)[index] >> (field)) & 3)

typedef uint8_t foo;
foo all_the_foos[NUMBER_OF_FOOS];
The field name macros and access macros provide a more legible -- and adjustable -- way to access the individual fields than would direct manipulation of the array (but be aware that these particular macros evaluate some of their arguments more than once). Every bit is used, giving you about as good cache usage as it is possible to achieve through choice of data structure alone.
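For example, element access with those macros might look like this (the index and value are arbitrary):
void example(void)
{
    SET_FOO_FIELD(all_the_foos, 42, B, 3);        /* set field b of element 42 to 3 */
    int b = GET_FOO_FIELD(all_the_foos, 42, B);   /* read it back: b == 3 */
    (void)b;
}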
I did video decompression for a while. The fastest thing to do is something like this:
short ABCD; //use a 16 bit data type for your example
and set up some macros. Maybe:
#define GETA ((ABCD >> 12) & 0x000F)
#define GETB ((ABCD >> 8) & 0x000F)
#define GETC ((ABCD >> 4) & 0x000F)
#define GETD (ABCD & 0x000F) // no need to shift D
In practice you should try to be moving 32-bit longs or 64-bit long longs, because that's the native move size on most modern processors.
Using a struct will always create the overhead in your compiled code of extra instructions to get from the base address of your struct to the field. So get away from that if you really want to tighten your loop.
Edit:
The above example gives you 4-bit values. If you really just need values of 0..3, you can do the same thing to pull out your 2-bit numbers, so GETA might look like this:
#define GETA ((ABCD >> 14) & 0x0003)
And if you are really moving billions of things, and I don't doubt it, just fill up a 32-bit variable and shift and mask your way through it.
Hope this helps.

Why aren't the C-supplied integer types good enough for basically any project?

I'm much more of a sysadmin than a programmer. But I do spend an inordinate amount of time grovelling through programmers' code trying to figure out what went wrong. And a disturbing amount of that time is spent dealing with problems when the programmer expected one definition of __u_ll_int32_t or whatever (yes, I know that's not real), but either expected the file defining that type to be somewhere other than it is, or (and this is far worse but thankfully rare) expected the semantics of that definition to be something other than it is.
As I understand C, it deliberately doesn't make width definitions for integer types (and that this is a Good Thing), but instead gives the programmer char, short, int, long, and long long, in all their signed and unsigned glory, with defined minima which the implementation (hopefully) meets. Furthermore, it gives the programmer various macros that the implementation must provide to tell you things like the width of a char, the largest unsigned long, etc. And yet the first thing any non-trivial C project seems to do is either import or invent another set of types that give them explicitly 8, 16, 32, and 64 bit integers. This means that as the sysadmin, I have to have those definition files in a place the programmer expects (that is, after all, my job), but then not all of the semantics of all those definitions are the same (this wheel has been re-invented many times) and there's no non-ad-hoc way that I know of to satisfy all of my users' needs here. (I've resorted at times to making a <bits/types_for_ralph.h>, which I know makes puppies cry every time I do it.)
What does trying to define the bit-width of numbers explicitly (in a language that specifically doesn't want to do that) gain the programmer that makes it worth all this configuration management headache? Why isn't knowing the defined minima and the platform-provided MAX/MIN macros enough to do what C programmers want to do? Why would you want to take a language whose main virtue is that it's portable across arbitrarily-bitted platforms and then typedef yourself into specific bit widths?
When a C or C++ programmer (hereinafter addressed in second-person) is choosing the size of an integer variable, it's usually in one of the following circumstances:
You know (at least roughly) the valid range for the variable, based on the real-world value it represents. For example,
numPassengersOnPlane in an airline reservation system should accommodate the largest supported airplane, so needs at least 10 bits. (Round up to 16.)
numPeopleInState in a US Census tabulating program needs to accommodate the most populous state (currently about 38 million), so needs at least 26 bits. (Round up to 32.)
In this case, you want the semantics of int_leastN_t from <stdint.h>. It's common for programmers to use the exact-width intN_t here, when technically they shouldn't; however, 8/16/32/64-bit machines are so overwhelmingly dominant today that the distinction is merely academic.
You could use the standard types and rely on constraints like “int must be at least 16 bits”, but a drawback of this is that there's no standard maximum size for the integer types. If int happens to be 32 bits when you only really needed 16, then you've unnecessarily doubled the size of your data. In many cases (see below), this isn't a problem, but if you have an array of millions of numbers, then you'll get lots of page faults.
Your numbers don't need to be that big, but for efficiency reasons, you want a fast, “native” data type instead of a small one that may require time wasted on bitmasking or zero/sign-extension.
This is the int_fastN_t types in <stdint.h>. However, it's common to just use the built-in int here, which in the 16/32-bit days had the semantics of int_fast16_t. It's not the native type on 64-bit systems, but it's usually good enough.
The variable is an amount of memory, array index, or casted pointer, and thus needs a size that depends on the amount of addressable memory.
This corresponds to the typedefs size_t, ptrdiff_t, intptr_t, etc. You have to use typedefs here because there is no built-in type that's guaranteed to be memory-sized.
The variable is part of a structure that's serialized to a file using fread/fwrite, or called from a non-C language (Java, COBOL, etc.) that has its own fixed-width data types.
In these cases, you truly do need an exact-width type.
You just haven't thought about the appropriate type, and use int out of habit.
Often, this works well enough.
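As a rough sketch of the first few categories above (the variable and struct names are invented for illustration):
#include <stddef.h>
#include <stdint.h>

int_least16_t numPassengersOnPlane;   /* known real-world range: needs at least 10 bits   */
int_fast32_t  loopCounter;            /* "just make it fast", at least 32 bits wide        */
size_t        bufferLength;           /* an amount of memory / array index                 */

struct OnDiskRecord {                 /* serialized with fwrite: exact widths matter here  */
    uint32_t id;
    uint16_t flags;
    uint16_t version;
};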
So, in summary, all of the typedefs from <stdint.h> have their use cases. However, the usefulness of the built-in types is limited due to:
Lack of maximum sizes for these types.
Lack of a native memsize type.
The arbitrary choice between LP64 (on Unix-like systems) and LLP64 (on Windows) data models on 64-bit systems.
As for why there are so many redundant typedefs of fixed-width (WORD, DWORD, __int64, gint64, FINT64, etc.) and memsize (INT_PTR, LPARAM, VPTRDIFF, etc.) integer types, it's mainly because <stdint.h> came late in C's development, and people are still using older compilers that don't support it, so libraries need to define their own. Same reason why C++ has so many string classes.
Sometimes it is important. For example, most image file formats require an exact number of bits/bytes be used (or at least specified).
If you only wanted to share a file created by the same compiler on the same computer architecture, you would be correct (or at least things would work). But, in real life things like file specifications and network packets are created by a variety of computer architectures and compilers, so we have to care about the details in these case (at least).
The main reason the fundamental types can't be fixed is that a few machines don't use 8-bit bytes. Enough programmers don't care, or actively want not to be bothered with support for such beasts, that the majority of well-written code demands a specific number of bits wherever overflow would be a concern.
It's better to specify a required range than to use int or long directly, because asking for "relatively big" or "relatively small" is fairly meaningless. The point is to know what inputs the program can work with.
By the way, usually there's a compiler flag that will adjust the built-in types. See INT_TYPE_SIZE for GCC. It might be cleaner to stick that into the makefile, than to specialize the whole system environment with new headers.
If you want portable code, you want the code you write to function identically on all platforms. If you have
int i = 32767;
you can't say for certain what i+1 will give you on all platforms.
This is not portable. Some compilers (on the same CPU architecture!) will give you -32768 and some will give you 32768. Some perverted ones will give you 0. That's a pretty big difference. Granted if it overflows, this is Undefined Behavior, but you don't know it is UB unless you know exactly what the size of int is.
If you use the fixed-width integer definitions (from <stdint.h>, part of the ISO/IEC 9899:1999 standard), then you know exactly what answer +1 will give.
int16_t i = 32767;
i = i + 1; // the addition itself is done as int and yields 32768, but storing it back
           // into i wraps; on most compilers i will appear to be -32768

uint16_t j = 32767;
j = j + 1; // j becomes 32768

int8_t k = 32767;  // should be a warning but maybe not; most compilers will set k to -1
k = k + 1;         // k becomes 0 (in this case, the addition didn't overflow)

uint8_t m = 32767; // should be a warning but maybe not; most compilers will set m to 255
m = m + 1;         // m becomes 0

int32_t p = 32767;
p = p + 1;         // p becomes 32768

uint32_t q = 32767;
q = q + 1;         // q becomes 32768
There are two opposing forces at play here:
The need for C to adapt to any CPU architecture in a natural way.
The need for data transferred to/from a program (network, disk, file, etc.) so that a program running on any architecture can correctly interpret it.
The "CPU matching" need has to do with inherent efficiency. There is CPU quantity which is most easily handled as a single unit which all arithmetic operations easily and efficiently are performed on, and which results in the need for the fewest bits of instruction encoding. That type is int. It could be 16 bits, 18 bits*, 32 bits, 36 bits*, 64 bits, or even 128 bits on some machines. (* These were some not-well-known machines from the 1960s and 1970s which may have never had a C compiler.)
Data transfer needs when transferring binary data require that record fields are the same size and alignment. For this it is quite important to have control of data sizes. There is also endianness and maybe binary data representations, like floating point representations.
A program which forces all integer operations to be 32 bit in the interests of size compatibility will work well on some CPU architectures, but not others (especially 16 bit, but also perhaps some 64-bit).
Using the CPU's native register size is preferable if all data interchange is done in a non-binary format, like XML or SQL (or any other ASCII encoding).

What type should I use for the fastest calculation speed?

I am making a 2D shooter game, and thus I have to stuff lots of bullets into an array, including their positions and where they are going.
So I have two issues. One is memory use, especially writing arrays that don't lay things out misaligned and result in lots of padding, which makes the speed of calculations suck.
The second is speed of calculation.
First, this means choosing between integers and floats... For now I am going with integers (if someone thinks floating point is better, please say so).
Then, this also means choosing a variant of that type (8 bits? 16 bits? C's confusing default? The CPU word size? Single precision? Double precision?)
Thus the question is: What type in C is fastest on modern processors (i.e. common x86, ARM and other popular processors; don't worry about Z80 or 36-bit processors), and what type is more reasonable when taking speed AND memory use into account?
Also, do signed and unsigned have differences in speed?
EDIT because of close votes: Yes, it might be premature optimization, but I am asking not only about CPU use but also about memory use (which might vary significantly). Also, I am doing the project to exercise my C skills; it has been some years since I coded in C, and I thought I'd have some fun, find the limits and stretch them, and also learn the new standards (last time I used C it was still C89).
Finally, the major motivation for asking this question was just hacker curiosity when I found out that some interesting new types (like int_fast*_t) exist in newer standards.
But if you still think this is not worth asking, then I can delete the question and go peruse the standards and some books, learning by myself. Then if others have the same curiosity one day, it's not my problem.
I would say an int should be the most comfortable for your CPU. But the C standard does have:
The typedef name int_fastN_t designates the fastest signed integer type with a width of at least N. The typedef name uint_fastN_t designates the fastest unsigned integer type with a width of at least N.
So in theory you could say things like: "I need it to be at least 16 bits so I shall use int_fast16_t". In practice that might translate to a plain int.
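A small sketch of that, just to see what the implementation picks (the loop is arbitrary):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int_fast16_t sum = 0;   /* "at least 16 bits, whichever is fastest here" */
    for (int_fast16_t i = 0; i < 1000; i++)
        sum += i;
    printf("sizeof(int_fast16_t) = %zu, sum = %jd\n",
           sizeof(int_fast16_t), (intmax_t)sum);
    return 0;
}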
I suspect it is premature to think about these before you actually hit a performance issue that you can try to work around. I think it is better to solve problems when they occur than to try to think of an elusive super-solution that could solve all future possible issues.
Single precision floating point add and multiply is as fast as 32-bit integer arithmetic on all modern processors (x86, ARM, MIPS), i.e. one result per clock cycle. Calculating positions and velocities in space is a lot easier with floating point arithmetic, so use floats. Single precision floats are 32 bits, the same size as the most efficient integer type on 32-bit CPUs.

Reasons to use (or not) stdint

I already know that stdint is used when you need specific variable sizes for portability between platforms. I don't really have such an issue for now, but what are the cons and pros of using it besides the already-mentioned fact above?
Looking for this on stackoverflow and others sites, I found 2 links that treats about the theme:
codealias.info - this one talks about the portability of the stdint.
stackoverflow - this one is more specific about uint8_t.
These two links are great, especially if one is looking to know more about the main reason for this header: portability. But for me, what I like most about it is that I think uint8_t is cleaner than unsigned char (for storing an RGB channel value, for example), int32_t looks more meaningful than simply int, etc.
So, my question is, exactly what are the cons and pros of using stdint besides the portability? Should I use it just in some specific parts of my code, or everywhere? If everywhere, how can I use functions like atoi(), strtok(), etc. with it?
Thanks!
Pros
Using well-defined types makes the code far easier and safer to port, as you won't get any surprises when for example one machine interprets int as 16-bit and another as 32-bit. With stdint.h, what you type is what you get.
Using int etc also makes it hard to detect dangerous type promotions.
Another advantage is that by using int8_t instead of char, you know that you always get a signed 8 bit variable. char can be signed or unsigned, it is implementation-defined behavior and varies between compilers. Therefore, the default char is plain dangerous to use in code that should be portable.
If you want to give the compiler a hint that a variable should be optimized, you can use the uint_fastx_t types, which tell the compiler to use the fastest possible integer type at least as large as 'x'. Most of the time this doesn't matter; the compiler is smart enough to make optimizations on type sizes no matter what you have typed in. Between sequence points, the compiler can implicitly change the type to another one than specified, as long as it doesn't affect the result.
Cons
None.
Reference: MISRA-C:2004 rule 6.3: "typedefs that indicate size and signedness shall be used in place of the basic types".
EDIT : Removed incorrect example.
The only reason to use uint8_t rather than unsigned char (aside from aesthetic preference) is if you want to document that your program requires char to be exactly 8 bits. uint8_t exists if and only if CHAR_BIT==8, per the requirements of the C standard.
The rest of the intX_t and uintX_t types are useful in the following situations:
reading/writing disk/network (but then you also have to use endian conversion functions)
when you want unsigned wraparound behavior at an exact cutoff (but this can be done more portably with the & operator).
when you're controlling the exact layout of a struct because you need to ensure no padding exists (e.g. for memcmp or hashing purposes).
On the other hand, the uint_least8_t, etc. types are useful anywhere that you want to avoid using wastefully large or slow types but need to ensure that you can store values of a certain magnitude. For example, while long long is at least 64 bits, it might be 128-bit on some machines, and using it when what you need is just a type that can store 64 bit numbers would be very wasteful on such machines. int_least64_t solves the problem.
I would avoid using the [u]int_fastX_t types entirely since they've sometimes changed on a given machine (breaking the ABI) and since the definitions are usually wrong. For instance, on x86_64, the 64-bit integer type is considered the "fast" one for 16-, 32-, and 64-bit values, but while addition, subtraction, and multiplication are exactly the same speed whether you use 32-bit or 64-bit values, division is almost surely slower with larger-than-necessary types, and even if they were the same speed, you're using twice the memory for no benefit.
Finally, note that the arguments some answers have made about the inefficiency of using int32_t for a counter when it's not the native integer size are technically mostly correct, but it's irrelevant to correct code. Unless you're counting some small number of things where the maximum count is under your control, or some external (not in your program's memory) thing where the count might be astronomical, the correct type for a count is almost always size_t. This is why all the standard C functions use size_t for counts. Don't consider using anything else unless you have a very good reason.
cons
The primary reason the C language does not specify the size of int or long, etc. is computational efficiency. Each architecture has a natural, most-efficient size, and the designers specifically empowered and intended the compiler implementor to use the natural native data size for speed and code-size efficiency.
In years past, communication with other machines was not a primary concern—most programs were local to the machine—so the predictability of each data type's size was of little concern.
Insisting that a particular architecture use a particular size int to count with is a really bad idea, even though it would seem to make other things easier.
In a way, thanks to XML and its brethren, data type size again is no longer much of a concern. Shipping machine-specific binary structures from machine to machine is again the exception rather than the rule.
I use stdint types for one reason only, when the data I hold in memory shall go on disk/network/descriptor in binary form. You only have to fight the little-endian/big-endian issue but that's relatively easy to overcome.
The obvious reason not to use stdint is when the code is size-independent, in maths terms everything that works over the rational integers. It would produce ugly code duplicates if you provided a uint*_t version of, say, qsort() for every expansion of *.
I use my own types in that case, derived from size_t when I'm lazy or the largest supported unsigned integer on the platform when I'm not.
Edit, because I ran into this issue earlier:
I think it's noteworthy that at least uint8_t, uint32_t and uint64_t are broken in Solaris 2.5.1.
So for maximum portability I still suggest avoiding stdint.h (at least for the next few years).
