Why is int16+int16 faster than int16+int8? - c

I decided to make a benchmark to test how fast each C99 type is, out of curiosity.
My benchmark creates an array of the following struct:
typedef struct
{
    int x;
    int y;
    short spdx;
    short spdy;
    unsigned char type;
} defaultTypes;
Then I do this operation on the entire struct, multiple times, to simulate a game update loop:
while (counter < ARRAY_MAX)
{
    fastStruct[counter].x += fastStruct[counter].spdx;
    fastStruct[counter].y += fastStruct[counter].spdy;
    ++counter;
}
I tried several types, like int_fast8_t and double.
Later I decided to test what would happen if I made the "spdx" variable bigger too, so I made a version where both the position (x, y) and the speed (spdx, spdy) variables are int16_t.
To my surprise, it is SLIGHTLY, but only slightly, faster than the int16_t + int8_t version; it was 11% faster to be more exact (compared, for example, to doubles, which run at a quarter of the int16_t+int16_t version's speed).
For most other speed differences (floats being slower, bigger variables being slower, and so on) I think I know the reasons, but I don't know why a bigger structure (16, 16, 16, 16, 8) is FASTER than a smaller one (even accounting for padding: 16, 16, 8, 8, 8).
Thus, why is int16_t += int16_t 11% faster than int16_t += int8_t? Someone suggested it had to do with integer promotion, but I am not sure about that.
Important note (this seemingly affects the results): I compiled this with MinGW, targeting 32-bit, running on 64-bit x86 (I ran a test targeting 64-bit only once, so I am not confident of its results, but the performance gap there is seemingly 2% instead of 11%).

I think the answer has to do with the fact that your variables are signed. For two (signed) variables of the same type, the CPU can add them with a single native instruction. However, when the types are different, even just in size, it has to convert one to the other first. This may or may not be supported in hardware, but it is still a necessary step. Consider if your int16 was 256 and your int8 was -1: you would expect to get back an int16 value of 255. If the CPU just added the bit patterns together, you'd get something very different, because the 8-bit representation of -1 is not the same as the 16-bit representation of -1. First it has to convert the 8-bit number to a 16-bit number, then add them together.
This is what is meant by "integer promotion": the compiler recognizes that the variables are of different types, and emits the appropriate assembly code to do the conversion. It "promotes" the int8 to an int16 before doing the add operation.
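For illustration (my own sketch, not the OP's benchmark), this is roughly the extra step the compiler has to perform in the mixed-type case; on x86 the widening typically shows up as a sign-extending load (movsx):
#include <stdint.h>

int16_t add_same(int16_t x, int16_t spd)
{
    return x + spd;            /* both operands already have the same width */
}

int16_t add_mixed(int16_t x, int8_t spd)
{
    /* spd must first be widened (sign-extended) to the wider type, then added */
    return x + (int16_t)spd;
}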
I'll bet if you changed them to unsigned, the difference in speed would disappear.

Related

Operating Rightmost/Leftmost n-Bits, Not All the Bits of an Integer Type Data Variable

In a programming task, I have to add a smaller integer in variable B (data type int)
to a larger integer (a 20-digit decimal integer) in variable A (data type long long int),
then compare A with variable C, which is also as large an integer (data type long long int) as A.
What I realized is that since I add a smaller B to A,
I don't need to check all the digits of A when I compare it with C; in other words, we don't need to check all the bits of A and C.
Given that I know how many bits from the right I need to check, say n bits,
is there a way/technique to check only those specific n bits from the right (not all the bits of A and C) to make the program faster in the C programming language?
Because comparing all the bits takes more time, and since I am working with large numbers, the program becomes slower.
Every time I search on Google, bit masking comes up, which uses all the bits of A and C; that doesn't do what I am asking for, so probably I am not using the correct terminology. Please help.
Addition:
Initial comments on this post made me think there was no way, but I found the following:
Bit Manipulation by University of Colorado Boulder
(#cuboulder, after 7:45)
...the bit band region is accessed via a bit band alias; each bit in a
supported bit band region has its own unique address and we can access
that bit using a pointer to its bit band alias location; the least
significant bit in an alias location can be set or cleared and that
will be mapped to the bit in the corresponding data or peripheral
memory; unfortunately this will not help you if you need to write to
multiple bit locations in memory, dependent operations only allow a
single bit to be cleared or set...
Is the above what I am asking for? If yes, then
where can I find the details as a beginner?
Updated question:
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C), in C or any other language, that makes the program faster?
Your assumption that comparing fewer bits is faster might be true in some cases but is probably not true in most cases.
I'm only familiar with x86 CPUs. An x86-64 processor has 64-bit wide registers. These can be accessed as 64-bit registers, but the lower bits can also be accessed as 32-, 16- and 8-bit registers. There are processor instructions which work with the 64-, 32-, 16- or 8-bit part of a register. Comparing 8 bits is one instruction, but so is comparing 64 bits.
If using the 32-bit comparison were faster than the 64-bit comparison, you could gain some speed. But there seems to be no speed difference on current processor generations. (Check out the "cmp" instruction with the link to uops.info from #harold.)
If your long long data type is actually bigger than the word size of your processor, then it's a different story. E.g. if your long long is 64 bits but you are on a 32-bit processor, the comparison cannot be handled in one register and you would need multiple instructions. So if you know that comparing only the lower 32 bits would be enough, this could save some time.
Also note that comparing only e.g. 20 bits would actually take more time than comparing 32 bits: you would have to mask off the 12 highest bits first and then compare 32 bits, so you would need a bitwise AND instruction in addition to the comparison.
As you see, this is very processor specific, and you are down at the level of the processor's opcodes. As #RawkFist wrote in his comment, you could try to get the C compiler to create such instructions, but that does not automatically mean that this is even faster.
All of this is only relevant if these operations are executed a lot. I'm not sure what you are doing, but if, e.g., you add many values B to A and compare the result to C each time, it might be faster to start with C, subtract the B values from it, and compare with 0, because the compare operation works internally like a subtraction. So instead of an add and a compare instruction, a single subtraction would be enough within the loop. But modern CPUs and compilers are very smart and optimize a lot, so maybe the compiler already performs such or similar optimizations.
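A minimal sketch of that last idea (the names A, B, C and sums_to_c are mine, not from the question):
#include <stdbool.h>
#include <stddef.h>

/* Instead of re-adding each b[i] to A and comparing against C,
   keep "remaining = C - A" and subtract; the test against 0 can
   reuse the flags the subtraction already set. */
static bool sums_to_c(long long A, long long C, const int *b, size_t n)
{
    long long remaining = C - A;
    for (size_t i = 0; i < n; i++) {
        remaining -= b[i];
        if (remaining == 0)
            return true;   /* A + b[0] + ... + b[i] == C */
    }
    return false;
}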
Try this question.
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C), in C or any other language, that makes the program faster?
Yes - when A + B != C. We can short-cut the comparison once a difference is found: from least to most significant.
No - when A + B == C. All bits need comparison.
Now back to OP's original question
Is there a way/technique to check only those specific n bits from the right (not all the bits of A and C), in C or any other language, that makes the program faster?
No. In order to do so, we would need to out-think the compiler. A compiler with optimizations enabled will itself notice any "tricks" available for long long + int == long long and emit efficient code.
Yet what about really long compares? How about a custom uint1000000 for A and C?
For long compares of a custom type, a quick compare can be had.
First, select a fast working type. unsigned is a prime candidate.
typedef unsigned ufast;
Now define the wide integer.
#include <limits.h>
#include <stdbool.h>

#define UINT1000000_N (1000000 / (sizeof(ufast) * CHAR_BIT))

typedef struct {
    // Least significant first
    ufast digit[UINT1000000_N];
} uint1000000;
Perform the addition and compare one "digit" at a time.
bool uint1000000_fast_offset_compare(const uint1000000 *A, unsigned B,
        const uint1000000 *C) {
    ufast carry = B;
    for (unsigned i = 0; i < UINT1000000_N; i++) {
        ufast sum = A->digit[i] + carry;
        if (sum != C->digit[i]) {
            return false;
        }
        carry = sum < A->digit[i]; // carry out of this "digit" (wrap-around detected)
    }
    return true;
}
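A hypothetical usage sketch, assuming the definitions above (the values are made up):
#include <stdio.h>

int main(void) {
    static uint1000000 A = {{0}}, C = {{0}};   /* static: keeps the large objects off the stack */
    A.digit[0] = 40;
    C.digit[0] = 42;
    /* Prints 1: 40 + 2 == 42 in the least significant "digit", all others 0. */
    printf("%d\n", uint1000000_fast_offset_compare(&A, 2, &C));
    return 0;
}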

C: Is using char faster than using int?

Since char is only 1 byte long, is it better to use char while dealing with 8-bit unsigned ints?
Example:
I was trying to create a struct for storing rgb values of a color.
struct color
{
    unsigned int r: 8;
    unsigned int g: 8;
    unsigned int b: 8;
};
Now since it is int, it allocates a memory of 4 bytes in my case. But if I replace them with unsigned char, they will be taking 3 bytes of memory as intended (in my platform).
No. Maybe a tiny bit.
First, this is a very platform dependent question.
However the <stdint.h> header was introduced to help with this.
Some hardware platforms are optimised for a particular size of operand and have an overhead to using smaller (in bit-size) operands even though logically the smaller operand requires less computation.
You should find that uint_fast8_t is the fastest unsigned integer type with at least 8 bits (#include <stdint.h> to use it).
That may be the same as unsigned char or unsigned int, depending on whether the answer to your question is 'yes' or 'no' respectively(*).
So the idea would be that if you're speed focused you'd use uint_fast8_t and the compiler will pick the fastest type fitting your purpose.
There are a couple of downsides to this scheme.
One is that if you create vast quantities of data, performance can be impaired (and memory limits reached) because you're using an 'oversized' type for the purpose.
On a platform where a 4-byte int is faster than a 1-byte char, you're using about 4 times as much memory as you need.
If your platform is small or your scale is large, that can be a significant overhead.
Also, you need to be careful that if the underlying type isn't the minimum size you expect, some calculations may be confounded.
Unsigned arithmetic wraps around ('clocks') neatly, but obviously at a different point if uint_fast8_t isn't in fact 8 bits.
It's platform dependent what the following returns:
#include <stdint.h>
//...
int foo() {
    uint_fast8_t x = 255;
    ++x;
    if (x == 0) {
        return 1;
    }
    return 0;
}
The overhead of dealing with potentially outsized types can claw back your gains.
I tend to agree with Knuth that "premature optimisation is the root of all evil" and would say you should only get into this sort of cleverness if you need it.
Do a typedef like typedef uint8_t color_comp; for now and get the application working before trying to shave fractions of a second off performance later!
I don't know what your application does, but it may be that it's not compute-intensive in the RGB channels and the bottleneck (if any) is elsewhere. Maybe you'll later find some high-load calculation where it's worth dealing with uint_fast8_t conversions and issues.
The wisdom of Knuth is that you probably don't know where that is until later anyway.
(*) It could be unsigned long or indeed some other type. The only constraint is that it is an unsigned type and at least 8 bits.
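To make that earlier suggestion concrete, a minimal sketch (the name color_comp is just an example, not from the question):
#include <stdint.h>
#include <stdio.h>

typedef uint8_t color_comp;   /* swap for uint_fast8_t later if profiling justifies it */

struct color {
    color_comp r, g, b;
};

int main(void) {
    /* Typically prints 3; a bit-field or uint_fast8_t version may be larger. */
    printf("%zu\n", sizeof(struct color));
    return 0;
}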

Should I use the stdint.h integer types on 32/64 bit machines?

One thing that bugs me about the regular C integer declarations is that their names are strange, "long long" being the worst. I am only building for 32- and 64-bit machines, so I do not necessarily need the portability that the library offers; however, I like that the name of each type is a single word of similar length with no ambiguity in size.
// multiple word types are hard to read
// long integers can be 32 or 64 bits depending on the machine
unsigned long int foo = 64;
long int bar = -64;
// easy to read
// no ambiguity
uint64_t foo = 64;
int64_t bar = -64;
On 32 and 64 bit machines:
1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
for (int8_t i = 0; i < 10; i++) {
}
3) Whenever I use an integer that I know will never be negative, is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
// instead of the one above
for (uint8_t i = 0; i < 10; i++) {
}
4) Is it safe to use a typedef for the types included from stdint.h
typedef int32_t signed_32_int;
typedef uint32_t unsigned_32_int;
edit: both answers were equally good and I couldn't really lean towards one so I just picked the answerer with lower rep
1) Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes, it can be slower. Use int_fast16_t instead. Profile the code as needed. Performance is very implementation dependent. A prime benefit of int16_t is its small, well-defined size (it must also be 2's complement), which matters for structures and arrays, not so much for speed.
The typedef name int_fastN_t designates the fastest signed integer type with a width of at least N. C11 §7.20.1.3 2
2) If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Yes, but the savings in code size and speed are questionable. Suggest int instead. Emitted code tends to be optimal in speed/size with the native int size.
3) Whenever I use an integer that I know will never be negative is it OK to prefer using the unsigned version even if I do not need the extra range it provides?
Using some unsigned type is preferred when the math is strictly unsigned (such as array indexing with size_t), yet code needs to watch for careless application like
for (unsigned i = 10 ; i >= 0; i--) // infinite loop
4) Is it safe to use a typedef for the types included from stdint.h
Almost always. Types like int16_t are optional. Maximum portability uses the required types uint_least16_t and uint_fast16_t for code to run on rare platforms that use bit widths like 9, 18, etc.
Can using a smaller integer such as int16_t be slower than something higher such as int32_t?
Yes. Some CPUs do not have dedicated 16-bit arithmetic instructions; arithmetic on 16-bit integers must be emulated with an instruction sequence along the lines of:
r1 = r2 + r3
r1 = r1 & 0xffff
The same principle applies to 8-bit types.
Use the "fast" integer types in <stdint.h> to avoid this -- for instance, int_fast16_t will give you an integer that is at least 16 bits wide, but may be wider if 16-bit types are nonoptimal.
If I needed a for loop to run just 10 times, is it ok to use the smallest integer that can handle it instead of the typical 32 bit integer?
Don't bother; just use int. Using a narrower type doesn't actually save any space, and may cause you issues down the line if you decide to increase the number of iterations to over 127 and forget that the loop variable is using a narrow type.
Whenever I use an integer that I know will never be negative is it ok to prefer using the unsigned version even if I do not need the extra range it provides?
Best avoided. Certain C idioms do not work properly on unsigned integers; for instance, you cannot write a loop of the form:
for (i = 100; i >= 0; i--) { … }
if i is an unsigned type, because i >= 0 will always be true!
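One common way around that (my sketch, not part of the original answer) is to decrement inside the condition, so the test happens while the value is still nonzero:
#include <stdio.h>

int main(void) {
    /* Counts 100 down to 0 even though i is unsigned. */
    for (unsigned i = 101; i-- > 0; ) {
        printf("%u\n", i);
    }
    return 0;
}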
Is it safe to use a typedef for the types included from stdint.h
Safe from a technical perspective, but it'll annoy other developers who have to work with your code.
Get used to the <stdint.h> names. They're standardized and reasonably easy to type.
Absolutely possible, yes. On my laptop (Intel Haswell), in a microbenchmark that counts up and down between 0 and 65535 on two registers 2 billion times, this takes
1.313660150s - ax dx (16-bit)
1.312484805s - eax edx (32-bit)
1.312270238s - rax rdx (64-bit)
Minuscule but repeatable differences in timing. (I wrote the benchmark in assembly, because C compilers may optimize it to a different register size.)
It will work, but you'll have to keep it up to date if you change the bounds and the C compiler will probably optimize it to the same assembly code anyway.
As long as it's correct C, that's totally fine. Keep in mind that unsigned overflow is defined and signed overflow is undefined, and compilers do take advantage of that for optimization. For example,
void foo(int start, int count) {
    for (int i = start; i < start + count; i++) {
        // With unsigned arithmetic, this will execute 0 times if
        // "start + count" overflows to a number smaller than "start".
        // With signed arithmetic, that may happen, or the compiler
        // may assume this loop always runs "count" times.
        // For defined behavior, avoid signed overflow.
    }
}
Yes. Also, <inttypes.h> (standard C, and also required by POSIX) extends stdint.h with some useful format macros and functions.
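For reference, a minimal example of the format macros it provides (my sketch):
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    uint64_t foo = 64;
    int64_t bar = -64;
    /* PRIu64 / PRId64 expand to the right printf conversion for this platform. */
    printf("%" PRIu64 " %" PRId64 "\n", foo, bar);
    return 0;
}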

Precisly convert float 32 to unsigned short or unsigned char

First of all, sorry if this is a duplicate; I couldn't find any question answering mine.
I'm coding a little program that will be used to convert 32-bit floating-point values to short int (16-bit) and unsigned char (8-bit) values. This is for HDR imaging purposes.
From here I could get the following function (without clamping):
static inline uint8_t u8fromfloat(float x)
{
    return (int)(x * 255.0f);
}
I suppose that in the same way we could get a short int by multiplying by (pow(2, 16) - 1).
But then I ended up thinking about ordered dithering, and especially Bayer dithering. To convert to uint8_t I suppose I could use a 4x4 matrix, and an 8x8 matrix for unsigned short.
I also thought of a look-up table to speed up the process, this way:
uint16_t LUT[0x10000]; // 2¹⁶ values contained
and store 2^16 unsigned short values corresponding to a float.
This same table could then be used for uint8_t as well, because of the implicit conversion between unsigned short ↔ unsigned int.
But wouldn't a look-up table like this be huge in memory? Also, how would one fill a table like this?!
Now I'm confused; what would be best according to you?
EDIT after uwind's answer: Let's say now that I also want to do basic color space conversion at the same time, that is, before converting to U8/U16, do a color space conversion (in float), and then shrink it to U8/U16. Wouldn't using a LUT be more efficient in that case? And yeah, I would still have the problem of how to index the LUT.
The way I see it, the look-up table won't help since in order to index into it, you need to convert the float into some integer type. Catch 22.
The table would require 0x10000 * sizeof (uint16_t) bytes, which is 128 KB. Not a lot by modern standards, but on the other hand cache is precious. But, as I said, the table doesn't add much to the solution since you need to convert float to integer in order to index.
You could do a table indexed by the raw bits of the float re-interpreted as integer, but that would have to be 32 bits which becomes very large (8 GB or so).
Go for the straight-forward runtime conversion you outlined.
Just stay with the multiplication - it'll work fine.
Practically all modern CPUs have vector instructions (SSE, AVX, ...) suited to this stuff, so you might look at programming for that, or use a compiler that automatically vectorizes your code, if possible (Intel C, also GCC). Even in cases where a table lookup is a possible solution, this can often be faster because you don't suffer from memory latency.
First, it should be noted that float has 24 bits of precision, which can in no way fit into a 16-bit int or even 8 bits. Second, floats have a much larger range, which can't be stored in any int or long long int.
So your question title is actually incorrect: there is no way to precisely convert any float to short or char. What you want is to map a float value between 0 and 1 to an 8-bit or 16-bit int range.
The code you use above will work fine. However, the value 255 is extremely unlikely to be returned, because it needs exactly 1.0 as input; otherwise values such as 254.99999 will end up being truncated to 254. You should round the value instead:
return (int)(x * 255.0f + .5f);
or better, use the code provided in your link for more balanced distribution
static inline uint8_t u8fromfloat_trick(float x)
{
    union { float f; uint32_t i; } u;
    u.f = 32768.0f + x * (255.0f / 256.0f);
    return (uint8_t)u.i;
}
Using a LUT wouldn't be any faster, because a table for 16-bit values is too large to fit in cache and may in fact reduce your performance greatly. The snippet above needs only 2 floating-point instructions, or only 1 with FMA. And SIMD will improve performance a further 4-32x (or more), so the LUT method would easily be outperformed, as table look-ups are much harder to parallelize.
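By analogy, a 16-bit version with the same rounding fix might look like this (my own sketch; clamping is still omitted, as in the question's code):
#include <stdint.h>

static inline uint16_t u16fromfloat(float x)
{
    /* Maps x in [0, 1] to 0..65535; assumes x is already clamped to that range. */
    return (uint16_t)(x * 65535.0f + 0.5f);
}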

On embedded platforms, is it more efficient to use unsigned int instead of (implicitly signed) int?

I've got into this habit of always using unsigned integers where possible in my code, because the processor can do divides by powers of two on unsigned types, which it can't with signed types. Speed is critical for this project. The processor operates at up to 40 MIPS.
My processor has an 18 cycle divide, but it takes longer than the single cycle barrel shifter. So is it worth using unsigned integers here to speed things up or do they bring other disadvantages? I'm using a dsPIC33FJ128GP802 - a member of the dsPIC33F series by Microchip. It has single cycle multiply for both signed and unsigned ints. It also has sign and zero extend instructions.
For example, it produces this code when mixing signed and unsigned integers.
026E4 97E80F mov.b [w15-24],w0
026E6 FB0000 se w0,w0
026E8 97E11F mov.b [w15-31],w2
026EA FB8102 ze w2,w2
026EC B98002 mul.ss w0,w2,w0
026EE 400600 add.w w0,w0,w12
026F0 FB8003 ze w3,w0
026F2 100770 subr.w w0,#16,w14
I'm using C (GCC for dsPIC.)
I think we all need to know a lot more about the peculiarities of your processor to answer this question. Why can't it do divides by powers of two on signed integers? As far as I remember, the operation is the same for both, i.e.
10/2 = 00001010 goes to 00000101
-10/2 = 11110110 goes to 11111011
Maybe you should write some simple code doing an unsigned divide and a signed divide and compare the compiled output.
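For example, a pair of functions to compile and inspect (my sketch; the behavior described in the comments is typical, not guaranteed for every target):
#include <stdint.h>

int16_t div_signed(int16_t x)
{
    return x / 8;   /* usually needs a sign correction before the arithmetic shift */
}

uint16_t div_unsigned(uint16_t x)
{
    return x / 8;   /* usually compiles to a plain logical right shift */
}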
Also, benchmarking is a good idea. It doesn't need to be precise. Just have an array of a few thousand numbers, start a timer, and divide them a few million times, timing how long it takes. Maybe do it a few billion times if your processor is fast. E.g.
int s_numbers[] = { etc. etc. };
int s_array_size = sizeof(s_numbers) / sizeof(s_numbers[0]);
unsigned int u_numbers[] = { etc. etc. };
unsigned int u_array_size = sizeof(u_numbers) / sizeof(u_numbers[0]);
int i;
int s_result;
unsigned int u_result;
/* Start timer. */
for (i = 0; i < 100000000; i++)
{
    s_result = s_numbers[i % s_array_size] / s_numbers[(i + 1) % s_array_size];
}
/* Stop timer and print difference. */
/* Repeat for unsigned integers. */
Written in a hurry to show the principle, please forgive any errors.
It won't give precise benchmarking but should give a general idea of which is faster.
I don't know much about the instruction set available on your processor, but a quick look makes me think that it has instructions that may be used for both arithmetic and logical shifts. That should mean that shifting a signed value costs about the same as shifting an unsigned value, and dividing by powers of 2 via shifts should also cost the same for each. (My knowledge of this comes from a quick glance at some intrinsic functions for a C compiler that targets your processor family.)
That being said, if you are working with values which are to be interpreted as unsigned then you might as well declare them as unsigned. For the last few years I've been using the types from stdint.h more and more, and usually I end up using the unsigned versions because my values are either inherently unsigned or I'm just using them as bit arrays.
Generate assembly both ways and count cycles.
I'm going to guess the unsigned divide of powers of two are faster because it can simply do a right shift as needed without needing to worry about sign extension.
As for disadvantages: detecting arithmetic overflows, overflowing a signed type without realizing it because you are mixing it with unsigned, etc. Nothing blocking, just different things to watch out for.
