Faster to use Integers as Booleans? - c

From a memory access standpoint... is it worth attempting an optimization like this?
int boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
Instead of
unsigned char boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
The unsigned char of course takes up only 1 byte as opposed to the integer's 4 (assuming a 32-bit platform here), but my understanding is that it would be faster for a processor to read the integer value from memory.

It may or may not be faster, and the speed depends on so many things that a generic answer is impossible. For example: hardware architecture, compiler, compiler options, amount of data (does it fit into L1 cache?), other things competing for the CPU, etc.
The correct answer, therefore, is: try both ways and measure for your particular case.
If measurement does not indicate that one method is significantly faster than the other, opt for the one that is clearer.
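If it helps, a minimal timing sketch along those lines might look like the following. The iteration count is arbitrary and the volatile qualifiers are only there to keep the compiler from optimizing the loops away; treat it as an illustration, not a rigorous benchmark.
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000UL

int main(void)
{
    volatile int flag_int = 1;            /* volatile so the test isn't folded away */
    volatile unsigned char flag_char = 1;
    unsigned long hits = 0;

    clock_t t0 = clock();
    for (unsigned long i = 0; i < ITERATIONS; i++)
        if (flag_int) hits++;
    clock_t t1 = clock();
    for (unsigned long i = 0; i < ITERATIONS; i++)
        if (flag_char) hits++;
    clock_t t2 = clock();

    printf("int:  %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("char: %f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    printf("hits = %lu\n", hits);          /* printed so the work stays observable */
    return 0;
}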

From a memory access standpoint... is it worth attempting an optimization like this?
Probably not. In almost all modern processors, memory will get fetched based on the word size of the processor. In your case, even to read a single byte, your processor probably fetches an entire 32-bit word or more, depending on that processor's caching. Your architecture may vary, so you will want to understand how your CPU works to gauge the impact.
But as others have said, it doesn't hurt to try it and measure it.

This is almost never a good idea. Many systems can only read word-sized chunks from memory at once, so reading a byte then masking or shifting will actually take more code space and just as much (data) memory. If you're using an obscure tiny system, measure, but in general this will actually slow down and bloat your code.

Asking how much memory unsigned char takes versus int is only meaningful when it's in an array (or possibly a structure, if you're careful to order the elements to take care of alignment). As a lone variable, it's very unlikely that you save any memory at all, and the compiler is likely to generate larger code to truncate the upper bits of registers.
As a general policy, never use smaller-than-int types except in arrays unless you have a really good reason other than trying to save space.
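To illustrate the point about arrays versus lone variables, consider the declarations below; the array sizes are made up purely for the comparison.
/* As lone variables, both of these almost certainly end up in a register
   or a word-sized stack slot, so the smaller type saves nothing: */
unsigned char lone_flag;
int           lone_count;

/* In an array, the element size is what you actually pay for, so the
   smaller type genuinely shrinks the data: */
unsigned char byte_flags[10000];  /* about 10 KB */
int           int_flags[10000];   /* about 40 KB with a 4-byte int */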

Follow the standard rules of optimization. First, don't optimize. Then test if your code needs it at some point. Then optimize that point. This link provides an excellent intro to the topic of optimization.
http://www.catb.org/~esr/writings/taoup/html/optimizationchapter.html

Related

Comparison uint8_t vs uint16_t while declaring a counter

Assuming I have a counter which counts from 0 to 100, is there an advantage to declaring the counter variable as uint16_t instead of uint8_t?
Obviously if I use uint8_t I could save some space. On a processor with a natural word size of 16 bits, access times would be the same for both, I guess. I couldn't think of a reason why I would use a uint16_t if uint8_t can cover the range.
Using a wider type than necessary can allow the compiler to avoid having to mask the higher bits.
Suppose you were working on a 16-bit architecture; then using uint16_t could be more efficient. However, if you used uint16_t instead of uint8_t on a 32-bit architecture, you would still have the mask instructions, just masking a different number of bits.
The most efficient type to use in a cross-platform portable way is just plain int or unsigned int, which will always be the correct type to avoid the need for masking instructions, and will always be able to hold numbers up to 100.
If you are in a MISRA or similar regulated environment that forbids the use of native types, then the correct standard-compliant type to use is uint_fast8_t. This guarantees to be the fastest unsigned integer type that has at least 8 bits.
However, all of this is nonsense really. Your primary goal in writing code should be to make it readable, not to make it as fast as possible. Penny-pinching instructions like this make code convoluted and more likely to have bugs. Also, because it is harder to read, bugs are less likely to be found during code review.
You should only try to optimize like this once the code is finished and you have tested it and found the particular part which is the bottleneck. Masking a loop counter is very unlikely to be the bottleneck in any real code.
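For reference, a small sketch of the uint_fast8_t counter mentioned above; the function and its parameters are hypothetical.
#include <stdint.h>

void fill(uint8_t *array, uint_fast8_t n, uint8_t x)
{
    /* uint_fast8_t is whatever unsigned type of at least 8 bits the
       implementation considers fastest; on a 32-bit target it is often
       32 bits wide, so the counter needs no masking. */
    for (uint_fast8_t i = 0; i < n; i++)
        array[i] = x;
}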
Obviously if I use uint8_t I could save some space.
Actually, that's not necessarily obvious! A loop index variable is likely to end up in a register, and if it does there's no memory to be saved. Also, since the definition of the C language says that much arithmetic takes place using type int, it's possible that using a variable smaller than int might actually end up costing you space in terms of extra code emitted by the compiler to convert back and forth between int and your smaller variable. So while it could save you some space, it's not at all guaranteed that it will — and, in any case, the actual savings are going to be almost imperceptibly small in the grand scheme of things.
If you have an array of some number of integers in the range 0-100, using uint8_t is a fine idea if you want to save space. For an individual variable, on the other hand, the arguments are pretty different.
In general, I'd say that there are two reasons not to use type uint8_t (or, equivalently, char or unsigned char) as a loop index:
It's not going to save much data space (if at all), and it might cost code size and/or speed.
If the loop runs over exactly 256 elements (yours didn't, but I'm speaking more generally here), you may have introduced a bug (which you'll discover soon enough): your loop may run forever.
The interviewer was probably expecting #1 as an answer. It's not a guaranteed answer — under plenty of circumstances, using the smaller type won't cost you anything, and evidently there are microprocessors where it can actually save something — but as a general rule, I agree that using an 8-bit type as a loop index is, well, silly. And whether or not you agree, it's certainly an issue to be aware of, so I think it's a fair interview question.
See also this question, which discusses the same sorts of issues.
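To make reason #2 concrete, here is a sketch of the wrap-around bug with an 8-bit index; 256 is the problematic element count.
#include <stdint.h>

uint8_t buf[256];

void clear_buf(void)
{
    /* i can never reach 256: after i == 255, i++ wraps back to 0,
       so i < 256 is always true and the loop never terminates. */
    for (uint8_t i = 0; i < 256; i++)
        buf[i] = 0;
}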
The interview question doesn't make much sense from a platform-generic point of view. If we look at code such as this:
for (uint8_t i = 0; i < n; i++)
    array[i] = x;
Then the expression i<n will get carried out on type int or larger because of implicit promotion. Though the compiler may optimize it to use a smaller type if it doesn't affect the result.
As for array[i], the compiler is likely to use a type corresponding to whatever address size the system is using.
What the interviewer was likely fishing for is that uint32_t on a 32-bitter tends to generate faster code in some situations. For those cases you can use uint_fast8_t, but more likely the compiler will perform such optimizations regardless.
The only optimization uint8_t blocks the compiler from doing is allocating a variable larger than 8 bits on the stack. It doesn't, however, block the compiler from optimizing out the variable entirely and using a register instead, such as storing it in an index register with the same width as the address bus.
Example with gcc x86_64: https://godbolt.org/z/vYscf3KW9. The disassembly is pretty painful to read, but the compiler just picked CPU registers to store everything regardless of the type of i, giving identical machine code for uint8_t and uint16_t. I would have been surprised if it didn't.
On a processor with natural wordsize of 16 bits access times would be the same for both I guess.
Yes, this is true for all mainstream 16-bitters. Some might even manage faster code if given 8 bits instead of 16. Some exotic systems like DSPs exist, but in the case of, say, a DSP where 1 byte = 16 bits, the compiler doesn't even provide you with uint8_t to begin with - it is an optional type. One generally doesn't bother with portability to wildly exotic systems, since doing so is a waste of everyone's time and money.
The correct answer: it is senseless to do manual optimization without a specific system in mind. uint8_t is perfectly fine to use for generic, portable code.

Can bit-level operations ever be "fast" in software?

Let me clarify the soft-sounding title straight away. This is actually something that has been nagging me for quite a while now, despite feeling like a pretty basic question.
Many languages give a faulty impression of efficiency by letting the developer play with bits, such as the stdbool.h C header which, as I understand it, is essentially just an int with a wrapper around it. Essentially, the byte seems to be the absolute lowest atomic unit of computation in C - bool x = 0 is not faster or more memory efficient than int x = 0.
What I'm wondering is then, what do we do when we want to implement an algorithm that is inherently tied to loading and manipulating single bits, such as decoding binary codes, unweighted graph connectivity problems and many others? In other words, is the atomicity of the byte an inherent property of modern CPUs or could we theoretically rival the efficiency of an ASIC just by using machine code?
EDIT: Pretty surprised by the downvotes, but I suppose people just didn't understand what I was asking. I think a really good, canonical example is traversing a binary tree (or any other sequential list of yes/no questions really). What I was wondering is if modern cpu architectures are fundamentally poorly equipped to do this (as compared to an ASIC/FPGA, that is), or if this is an artifact of some abstraction layer (language/kernel/etc). Mark's answer was good though (although I'd love a reference to the mentioned architecture extension)
No you can't rival the efficiency of an ASIC. An ASIC means you can replicate parallel bit streams as much as you have budget for on the chip. You just cut and paste your HDL until you fill your die space. A CPU only has a limited number of cores.
I'm guessing that you think that bit operations like z = (x | (1 << y)) >> 4 are slow, and yes, all that bit shifting is extra overhead. But that is just accessing the bits. The bit operations themselves (OR, AND, etc.) are as fast as you can get on a modern CPU, i.e. 1 cycle throughput.
The 8051 architecture has a way of accessing individual bits directly, without using byte registers, but if you are worried about speed, you wouldn't consider an 8051.
By convention, a byte is the smallest addressable piece of memory in a computer. The number of bits that a byte has can differ from one system to another.
In the case of x86, there are instructions to move bytes from memory to a register and back, and instructions to manipulate values in registers. I can't speak to other architectures, but they most likely work in a similar way.
So anytime you need to manipulate some number of bits you need to do so a byte (or word, i.e. multiple bytes) at a time.
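For illustration, the usual mask-and-shift pattern this implies might look like the following; treating bit 0 as the least significant bit is an assumption.
#include <stdint.h>

/* Test, set and clear bit n of a 32-bit word.  The CPU still loads and
   stores at least a whole byte or word; the bit itself is isolated with
   shifts and masks. */
static inline int      test_bit (uint32_t w, unsigned n) { return (w >> n) & 1u; }
static inline uint32_t set_bit  (uint32_t w, unsigned n) { return w |  (UINT32_C(1) << n); }
static inline uint32_t clear_bit(uint32_t w, unsigned n) { return w & ~(UINT32_C(1) << n); }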
I also don't know why this question got so many downvotes, the question:
In other words, is the atomicity of the byte an inherent property of modern CPUs or could we theoretically rival the efficiency of an ASIC just by using machine code?
seems reasonable to me. It's certainly not bad compared to many questions on Stack Overflow.
The answer is: no, CPUs can't match the efficiency of an ASIC.
However, the reason is not because CPUs are manipulating bytes instead of bits. Instead it's because most of the work that CPUs do to process an instruction is involved with loading it from memory, decoding it, tracking dependencies, etc., rather than performing the actual arithmetic operations on bits or bytes that the instruction directs the CPU to perform.
A good explanation of this is shown in the following presentation from the 2014 LLVM developers meeting. The presentation shows how OpenCL can be used to generate custom FPGA hardware. Slides 12 to 28 show a nice pictorial example of overhead associated with a CPU algorithm and how custom hardware can remove much of this overhead.

Fastest use of a dataset of just over 64 bytes?

Structure: I have 8 64-bit integers (512 bits = 64 bytes, the assumed cache line width) that I would like to compare to another, single 64-bit integer, in turn, without cache misses. The data set is, unfortunately, absolutely inflexible -- it's already as small as possible.
Access pattern: Each uint64_t is in fact an array of 4x4x4 bits, each bit representing the presence or absence of a voxel. This means sometimes I will be using half of one chunk and half of another, or even corners of 8 different 64-bit chunks.... I guess what this means is there is a high likelihood of a lack of alignment.
How can I do this as fast as possible i.e. without thrashing the cache?
P.S. The idea is that this code will ultimately run on a fairly wide range of architectures with at least a 64-byte cache line width, so I'd prefer this were absolutely as fast as possible. This also means I can't rely on MOVNTDQA, which anyway may incur a performance hit of its own in spite of loading the 9th element directly to the CPU.
P.P.S. My knowledge of this area is fairly limited so please take it easy on me. But please spare me the premature optimisation comments; be sure that this is the 3% of this application that really counts.
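Purely to illustrate the access pattern described above, one way a single voxel bit might be tested is shown below; the x-then-y-then-z bit ordering is an assumption, not something given in the question.
#include <stdbool.h>
#include <stdint.h>

/* chunk packs a 4x4x4 block of voxels, one bit per voxel. */
static inline bool voxel_set(uint64_t chunk, unsigned x, unsigned y, unsigned z)
{
    unsigned bit = x + 4u * y + 16u * z;    /* assumed ordering, 0..63 */
    return (chunk >> bit) & 1u;
}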
I wouldn't worry about it. If your dataset is really only 9 integers, most of it will likely be stored in registers anyway. Also, there isn't really any way to optimize cache usage without specifying an architecture, since cache structure is architecture dependent. If you can list several target architectures you may be able to find some commonalities that you can optimize toward, but without knowing those architectures, I don't think there's much we can do for you.
Lastly, this seems like a good example of optimizing too early. I would suggest you take the following steps:
Decide what your maximum acceptable run time is
Finish your program in C
Compile for all of your target architectures
For those platforms that don't meet your speed spec, hand-optimize the intermediate assembly files and recompile until you meet your spec.
Are you sure you get cache misses?
Even if the value you compare against is not in a register, I think your uint64 array should fit in one cache line (or whatever it is called) and your other data in another.
Your cache surely has some n-way associativity, which prevents your data from being evicted from the cache just by accessing your compare value.
Do not waste your time on micro-optimizations. Improve your algorithms and data structures.

Use int or char in arrays?

Suppose I have an array in C with 5 elements which hold integers in the range [0..255]; would it generally be better to use unsigned char, unsigned int or int, regarding performance? Because a char would be only one byte, but an int is easier for the processor to handle, as far as I know. Or does it mostly depend on how the elements are accessed?
EDIT: Measuring is quite difficult, because the code belongs to a library, and the array is accessed externally.
Also, I encounter this problem not only in this particular case, so I'm asking for a more general answer.
While the answer really depends on the CPU and how it handles loading and storing small integers, you can assume that a byte array will be faster on most modern systems:
A char only takes 1/4 of the space that an int takes (on most systems), which means that working on a char array takes only a quarter of the memory bandwidth. And most code is memory bound on modern hardware.
This one is quite impossible to answer, because it depends on the code, compiler, and processor.
However, one suggestion is to use uint8_t instead of unsigned char. (The code will be the same, but the explicit version conveys the meaning much better.)
Then there is one more thing to consider. You might be best off by packing four 8-bit integers into one 32-bit integer. Most arithmetic and bitwise logical operations work fine as long as there are no overflows (division being the notable exception).
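A minimal sketch of that packing idea; the byte order within the word is an arbitrary choice here.
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit word and pull one back out. */
static inline uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a | ((uint32_t)b << 8) | ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

static inline uint8_t unpack(uint32_t w, unsigned i)    /* i = 0..3 */
{
    return (uint8_t)(w >> (8u * i));
}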
The golden rule: measure it. If you can't be bothered measuring it, then it isn't worth optimising. So measure it one way, change it, measure it the other way, and take whichever is faster. Be aware that when you switch to a different compiler, or a different processor (say, one introduced in 2015 or 2016), the other code might now be faster.
Alternatively, don't measure it, but write the most readable and maintainable code.
And consider using a single 64 bit integer and shift operations :-)

Is it faster to use an array or bit access for multiple boolean values?

1) On a 32-bit CPU, is it faster to access an array of 32 boolean values or to access the 32 bits within one word? (Assume we want to check the value of the Nth element and can use either a bit-mask (Nth bit is set) or the integer N as an array index.)
It seems to me that the array would be faster because all common computer architectures natively work at the word level (32 bits, 64 bits, etc., processed in parallel) and accessing the sub-word bits takes extra work.
I know different compilers will represent things differently, but it seems that the underlying hardware architecture would dictate the answer. Or does the answer depend on the language and compiler?
And,
2) Is the speed answer reversed if this array represents a state that I pass between client and server?
This question came to mind when reading question "How use bit/bit-operator to control object state?"
P.S. Yes, I could write code to test this myself, but then the SO community wouldn't get to play along!
Bear in mind that a theoretically faster solution that doesn't fit into a cache line might be slower than a theoretically slower one that does, depending on a whole host of things. If this is actually something that needs to be fast, as determined by profiling, test both ways and see. If it doesn't, do whatever looks like cleaner code, which is probably the array.
It depends on the compiler and the access patterns and the platform. Raymond Chen has an excellent cost-benefit analysis: http://blogs.msdn.com/oldnewthing/archive/2008/11/26/9143050.aspx
Even on non x86 platforms the use of bits can be prohibitive as at least one PPC platform out there uses microcoded instructions to perform a variable shift which can do nasty things with other hardware threads.
So it can be a win, but you need to understand the context in which it will be good and bad. (Which is a general thing anyway.)
For question #1: Yes, on most 32-bit platforms, an array of boolean values should be faster, because you will just be loading each 32-bit-aligned value in the array and testing it against 0. If you use a single word, you will have all that work plus the overhead of bit-fiddling.
For question #2: Again, yes, since sending data over a network is significantly slower than operating on data in the CPU and main memory, the overhead of sending even one word will strongly outweigh any performance gain or loss you get by aligning words or bit fiddling.
This is the code generated by 0 != (value & (1 << index)) to test a bit:
00401000 mov eax,1
00401005 shl eax,cl
00401007 and eax,1
And this by values[index] to test a bool[]:
00401000 movzx eax,byte ptr [ecx+eax]
I can't figure out how to put a loop around it that doesn't get optimized away, so I'll vote for bool[].
If you are going to check more than one value at a time, doing it in parallel will obviously be faster. If you're only checking one value, it's probably the same.
If you need a better answer than that, write some tests and get back to us.
I think a byte array is probably better than a full-word array for simple random access.
It will give better cache locality than using the full word size, and I don't think byte access is any slower on most/all common architectures.
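For reference, the two access styles being compared look roughly like this in C; the variable names are placeholders.
#include <stdbool.h>
#include <stdint.h>

bool     flags_array[32];   /* one byte (typically) per flag */
uint32_t flags_bits;        /* one bit per flag */

bool get_from_array(unsigned n) { return flags_array[n]; }
bool get_from_bits (unsigned n) { return (flags_bits >> n) & 1u; }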

Resources