curious gcc compiler code for x |= 128 when x is uint8

curious gcc compiler code for x |= 128 when x is uint8 - c

I've recently stumbled upon an interesting compiler code which I don't understand.
Take the following code:
unsigned char x;
...
x |= 127;
x |= 128;
For the first statement, the compiler generates:
or eax, 0x7f.
However, for the second statement, it becomes:
or eax, 0xffffff80
It seems that for values less than 127, one byte values are used whereas after 128 dword's are preferred.
Does anybody have any idea why this happens?
I reproduced this gcc 6.2 (latest I think).
I tried to post on the gcc mailing lists (gcc-bugs#gcc.gnu.org or gcc-help#gcc.gnu.org ) but I only got delivery failures.

Both instructions are 3 bytes wide as is apparent from the disassembly output:
83 c8 7f or $0x7f,%eax
83 c8 80 or $0xffffff80,%eax
The 83 / 1 is 32-bit register / memory with 8-bit sign-extended immediate value:
83 /1 ib OR r/m32,imm8 r/m32 OR imm8 (sign-extended).
Thus in effect it does change the non-visible part of the 32-bit register, but it doesn't matter. It is not less efficient than any other method. There is also no instruction that would not sign-extend the 8-bit immediate value, except those that operate with 8-bit register halves/quarters. But using this instruction makes it work the same way with other registers that are addressable with r/m32 but which cannot be accessed as individual bytes (edi, esi for example).

Related

Fastest way to find 16bit match in a 4 element short array?

I may confirm by using nanobench. Today I don't feel clever and can't think of an easy way
I have a array, short arr[]={0x1234, 0x5432, 0x9090, 0xFEED};. I know I can use SIMD to compare all elements at once, using movemask+tzcnt to find the index of a match. However since it's only 64 bits I was wondering if there's a faster way?
First I thought maybe I can build a 64-bit int by writing target|(target<<16)|(target<<32)|(target<<48) but then realized both an AND and SUB isn't the same as a compare since the low 16 can affect the higher 16. Then I thought instead of a plain loop I can write index=tzcnt((target==arr[0]?1:0)... | target==arr[3]?8:0
Can anyone think of something more clever? I suspect using the ternary method would give me best results since it's branchless?

For SWAR compare-for-equality, the operation you want is XOR, which like SUB produces all-zero on equal inputs, but unlike SUB doesn't propagate carry sideways.
But then you need to detect a contiguous 16 0 bits. Unlike pcmpeqw, you'll have some zero bits in the other elements.
So it's probably about the same as https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord but with wider mask patterns to operate on 16-bit instead of 8-bit chunks.
There is yet a faster method — use hasless(v, 1), which is defined below; it works in 4 operations and requires no subsquent verification. It simplifies to
#define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.
This bithack was originally by Alan Mycroft in 1987.
So it could look like this (untested):
#include <stdint.h>
#include <string.h>
// returns 0 / non-zero status.
uint64_t hasmatch_16in64(uint16_t needle, const uint16_t haystack[4])
{
uint64_t vneedle = 0x0001000100010001ULL * needle; // broadcast
uint64_t vbuf;
memcpy(&vbuf, haystack, sizeof(vbuf)); // aliasing-safe unaligned load
//static_assert(sizeof(vbuf) == 4*sizeof(haystack[0]));
uint64_t match = vbuf ^ vneedle;
uint64_t any_zeros = (match - 0x0001000100010001ULL) & ~match & 0x8000800080008000ULL;
return any_zeros;
// unsigned matchpos = _tzcnt_u32(any_zeros) >> 4; // I think.
}
Godbolt with GCC and clang, also including a SIMD intrinsics version.
# gcc12.2 -O3 -march=x86-64-v3 -mtune=znver1
# x86-64-v3 is the Haswell/Zen1 baseline: AVX2+FMA+BMI2, but with tune=generic
# without tune=haswell or whatever, GCC uses shl/add /shl/add instead of imul, despite still needing the same constant
hasmatch_16in64:
movabs rax, 281479271743489 # 0x1000100010001
movzx edi, di # zero-extend to 64-bit
imul rdi, rax # vneedle
xor rdi, QWORD PTR [rsi] # match
# then the bithack
mov rdx, rdi
sub rdx, rax
andn rax, rdi, rdx # BMI1
movabs rdx, -9223231297218904064 # 0x8000800080008000
and rax, rdx
ret
Clang unfortunately adds 0xFFFEFFFEFFFEFFFF instead of reusing the multiplier constant, so it has three 64-bit immediate constants.
AArch64 can do repeating-pattern constants like this as immediates for bitwise ops, and doesn't have as convenient SIMD movemask, so this might be more of a win there, especially if you can guarantee alignment of your array of shorts.
Match position
If you need to know where the match is, I think that bithack has a 1 in the high bit of each zero byte or u16, and nowhere else. (The lowest-precendence / last operations are bitwise AND involving 0x80008000...).
So maybe tzcnt(any_zeros) >> 4 to go from bit-index to u16-index, rounding down. e.g. if the second one is zero, the tzcnt result will be 31. 31 >> 4 = 1.
If that doesn't work, then yeah AVX2 or AVX-512 vpbroadcastw xmm0, edi / vmovq / vpcmeqw / vpmovmskb / tzcnt will work well, too, with smaller code-size and fewer uops, but maybe higher latency. Or maybe less. (To get a byte offset, right shift if you need an index of which short.)
Actually just SSE2 pshuflw can broadcast a word to the low qword of an XMM register. Same for MMX, which would actually allow a memory-source pcmpeqw mm0, [rsi] since it has no alignment requirement and is only 64-bit, not 128.
If you can use SIMD intrinsics, especially if you have efficient word broadcast from AVX2, definitely have a look at it.
#include <immintrin.h>
// note the unsigned function arg, not uint16_t;
// we only use the low 16, but GCC doesn't realize that and wastes an instruction in the non-AVX2 version
int hasmatch_SIMD(unsigned needle, const uint16_t haystack[4])
{
#ifdef __AVX2__ // or higher
__m128i vneedle = _mm_set1_epi16(needle);
#else
__m128i vneedle = _mm_cvtsi32_si128(needle); // movd
vneedle = _mm_shufflelo_epi16(vneedle, 0); // broadcast to low half
#endif
__m128i vbuf = _mm_loadl_epi64((void*)haystack); // alignment and aliasing safe
unsigned mask = _mm_movemask_epi8(_mm_cmpeq_epi16(vneedle, vbuf));
//return _tzcnt_u32(mask) >> 1;
return mask;
}
# clang expects narrow integer args to already be zero- or sign-extended to 32
hasmatch_SIMD:
movd xmm0, edi
pshuflw xmm0, xmm0, 0 # xmm0 = xmm0[0,0,0,0,4,5,6,7]
movq xmm1, qword ptr [rsi] # xmm1 = mem[0],zero
pcmpeqw xmm1, xmm0
pmovmskb eax, xmm1
ret
AXV-512 gives us vpbroadcastw xmm0, edi, replacing vmovd + vpbroadcastw xmm,xmm or movd + pshuflw, saving a shuffle uop.
With AVX2, this is 5 single-uop instructions, vs. 7 (or 9 counting the constants) for the SWAR bithack. Or 6 or 8 not counting the zero-extension of the "needle". So SIMD is better for front-end throughput. (https://agner.org/optimize/ / https://uops.info/)
There are limits to which ports some of these instructions can run on (vs. the bithack instructions mostly being any integer ALU port), but presumably you're not doing this in a loop over many such 4-element arrays. Or else SIMD is an obvious win; checking two 4-element arrays at once in the low and high halves of a __m128i. So probably we do need to consider the front-end costs of setting up those constants.
I didn't add up the latencies; it's probably a bit higher even on Intel CPUs which generally have good latency between integer and SIMD units.
GCC unfortunately fails to optimize away the movzx edi, di from the SIMD version if compiled without AVX2; only clang realizes the upper 16 of _mm_cvtsi32_si128(needle) is discarded by the later shuffle. Maybe better to make the function arg unsigned, not explicitly a narrow 16-bit type.

Clang with -O2 or -O3 and GCC with -O3 compile a simple search loop into branchless instructions:
int indexOf(short target, short* arr) {
int index = -1;
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
index = i;
}
}
return index;
}
Demo
I doubt you can get much better without SIMD. In other words, write simple and understandable code to help the compiler produce efficient code.
Side note: for some reason, neither Clang nor GCC use conditional moves on this very similar code:
int indexOf(short target, short* arr) {
for (int i = 0; i < 4; ++i) {
if (target == arr[i]) {
return i;
}
}
return -1;
}

Is subtraction less performant than negation?

I wonder if it's faster for the processor to negate a number or to do a subtraction. For example:
Is
int a = -3;
more efficient than
int a = 0 - 3;
In other words, does a negation is equivalent to subtracting from 0? Or is there a special CPU instruction that negate faster that a subtraction?
I suppose that the compiler does not optimize anything.

From the C language point of view, 0 - 3 is an integer constant expression and those are always calculated at compile-time.
Formal definition from C11 6.6/6:
An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer
constants, _Alignof expressions, and floating constants that are the
immediate operands of casts.
Knowing that these are calculated at compile time is important when writing readable code. For example if you want to declare a char array to hold 5 characters and the null terminator, you can write char str[5+1]; rather than 6, to get self-documenting code telling the reader that you have considered null termination.
Similarly when writing macros, you can make use of integer constant expressions to perform parts of the calculation at compile time.

(This answer is about negating a runtime variable, like -x or 0-x where constant-propagation doesn't result in a compile-time constant value for x. A constant like 0-3 has no runtime cost.)
I suppose that the compiler does not optimize anything.
That's not a good assumption if you're writing in C. Both are equivalent for any non-terrible compiler because of how integers work, and it would be a missed-optimization bug if one compiled to more efficient code than the other.
If you actually want to ask about asm, then how to negate efficiently depends on the ISA.
But yes, most ISAs can negate with a single instruction, usually by subtracting from an immediate or implicit zero, or from an architectural zero register.
e.g. 32-bit ARM has an rsb (reverse-subtract) instruction that can take an immediate operand. rsb rdst, rsrc, #123 does dst = 123-src. With an immediate of zero, this is just negation.
x86 has a neg instruction: neg eax is exactly equivalent to eax = 0-eax, setting flags the same way.
3-operand architectures with a zero register (hard-wired to zero) can just do something like MIPS subu $t0, $zero, $t0 to do t0 = 0 - t0. It has no need for a special instruction because the $zero register always reads as zero. Similarly AArch64 removed RSB but has a xzr / wzr 64/32-bit zero register. (Although it also has a pseudo-instruction called neg which subtracts from the zero register).
You could see most of this by using a compiler. https://godbolt.org/z/B7N8SK
But you'd have to actually compile to machine code and disassemble because gcc/clang tend to use the neg pseudo-instruction on AArch64 and RISC-V. Still, you can see ARM32's rsb r0,r0,#0 for int negate(int x){return -x;}

Both are compile time constants, and will generate the same constant initialisation in any reasonable compiler regardless of optimisation.
For example at https://godbolt.org/z/JEMWvS the following code:
void test( void )
{
int a = -3;
}
void test2( void )
{
int a = 0-3;
}
Compiled with gcc 9.2 x86-64 -std=c99 -O0 generates:
test:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], -3
nop
pop rbp
ret
test2:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], -3
nop
pop rbp
ret
Using -Os, the code:
void test( void )
{
volatile int a = -3;
}
void test2( void )
{
volatile int a = 0-3;
}
generates:
test:
mov DWORD PTR [rsp-4], -3
ret
test2:
mov DWORD PTR [rsp-4], -3
ret
The volatile being necessary to prevent the compiler removing the unused variables.
As static data it is even simpler:
int a = -3;
int b = 0-3;
outside of a function generates no executable code, just initialised data objects (initialisation is different from assignment):
a:
.long -3
b:
.long -3
Assignment of the above statics:
a = -4 ;
b = 0-4 ;
is still a compiler evaluated constant:
mov DWORD PTR a[rip], -4
mov DWORD PTR b[rip], -4
The take-home here is:
If you are interested, try it and see (with your own compiler or Godbolt set for your compiler and/or architecture),
don't sweat the small stuff, let the compiler do its job,
constant expressions are evaluated at compile time and have no run-time impact,
writing weird code in the belief you can better the compiler is almost always pointless. Compilers work better with idiomatic code the optimiser can recognise.

It's hard to tell if you ask asking if subtraction is fast then negation in a general sense, or in this specific case of implementing negation via subtraction from zero. I'll try to answer both.
General Case
For the general case, on most modern CPUs these operations are both very fast: usually each only taking a single cycle to execute, and often having a throughput of more than one per cycle (because CPUs are superscalar). On all recent AMD and Intel CPUs that I checked, both sub and neg execute at the same speed for register and immediate arguments.
Implementing -x
As regards to your specific question of implementing the -x operation, it would usually be slightly faster to implement this with a dedicated neg operation than with a sub, because with neg you don't have to prepare the zero registers. For example, a negation function int neg(int x) { return -x; }; would look something like this with the neg instruction:
neg:
mov eax, edi
neg eax
... while implementing it terms of subtraction would look something like:
neg:
xor eax, eax
sub eax, edi
Well ... sub didn't come out looking at worse there, but that's mostly a quirk of the calling convention and the fact that x86 uses a 1 argument destructive neg: the result needs to be in eax, so in the neg case 1 instruction is spent just moving the result to the right register, and one doing the negation. The sub version takes two instructions to perform the negation itself: one to zero a register, and one to do the subtraction. It so happens that this lets you avoid the ABI shuffling because you get to choose the zero register as the result register.
Still, this ABI related inefficiency wouldn't persist after inlining, so we can say in some fundamental sense that neg is slightly more efficient.
Now many ISAs may not have a neg instruction at all, so the question is more or less moot. They may have a hardcoded zero register, so you'd implement negation via subtraction from this register and there is no cost to set up the zero.

Encoding 3 base-6 digits in 8 bits for unpacking performance

I'm looking for an efficient-to-unpack (in terms of small number of basic ALU ops in the generated code) way of encoding 3 base-6 digits (i.e. 3 numbers in the range [0,5]) in 8 bits. Only one is needed at a time, so approaches that need to decode all three in order to access one are probably not good unless the cost of decoding all three is very low.
The obvious method is of course:
x = b%6; // 8 insns
y = b/6%6; // 13 insns
z = b/36; // 5 insns
The instruction counts are measured on x86_64 with gcc>=4.8 which knows how to avoid divs.
Another method (using a different encoding) is:
b *= 6
x = b>>8;
b &= 255;
b *= 6
y = b>>8;
b &= 255;
b *= 6
z = b>>8;
This encoding has more than one representation for many tuples (it uses the whole 8bit range rather than just [0,215]) and appears more efficient if you want all 3 outputs, but wasteful if you only want one.
Are there better approaches?
Target language is C but I've tagged this assembly as well since answering requires some consideration of the instructions that would be generated.

As discussed in comments, a LUT would be excellent if it stays hot in cache. uint8_t LUT[3][256] would need the selector scaled by 256, which takes an extra instruction if it's not a compile-time constant. Scaling by 216 to pack the LUT better is only 1 or 2 instructions more expensive. struct3 LUT[216] is nice, where the struct has a 3-byte array member. On x86, this compiles extremely well in position-dependent code where the LUT base can be a 32-bit absolute as part of the addressing mode (if the table is static):
struct { uint8_t vals[3]; } LUT[216];
unsigned decode_LUT(uint8_t b, unsigned selector) {
return LUT[b].vals[selector];
}
gcc7 -O3 on Godbolt for x86-64 and AArch64
movzx edi, dil
mov esi, esi # zero-extension to 64-bit: goes away when inlining.
lea rax, LUT[rdi+rdi*2] # multiply by 3 and add the base
movzx eax, BYTE PTR [rax+rsi] # then index by selector
ret
Silly gcc used a 3-component LEA (3 cycle latency and runs on fewer ports) instead of using LUT as a disp32 for the actual load (no extra latency for an indexed addressing mode, I think).
This layout has the added advantage of locality if you ever need to decode multiple components of the same byte.
In PIC / PIE code, this costs 2 extra instructions, unfortunately:
movzx edi, dil
lea rax, LUT[rip] # RIP-relative LEA instead of absolute as part of another addressing mode
mov esi, esi
lea rdx, [rdi+rdi*2]
add rax, rdx
movzx eax, BYTE PTR [rax+rsi]
ret
But that's still cheap, and all the ALU instructions are single-cycle latency.
Your 2nd ALU unpacking strategy is promising. I thought at first we could use a single 64-bit multiply to get b*6, b*6*6, and b*6*6*6 in different positions of the same 64-bit integer. (b * ((6ULL*6*6<<32) + (36<<16) + 6)
But the upper byte of each multiply result does depend on masking back to 8-bit after each multiply by 6. (If you can think of a way to not require that, one multiple and shift would be very cheap, especially on 64-bit ISAs where the entire 64-bit multiply result is in one register).
Still, x86 and ARM can multiply by 6 and mask in 3 cycles of latency, the same or better latency than a multiply, or less on Intel CPUs with zero-latency movzx r32, r8, if the compiler avoids using parts of the same register for movzx.
add eax, eax ; *2
lea eax, [rax + rax*2] ; *3
movzx ecx, al ; 0 cycle latency on Intel
.. repeat for next steps
ARM / AArch64 is similarly good, with add r0, r0, r0 lsl #1 for multiply by 3.
As a branchless way to select one of the three, you could consider storing (from ah / ch / ... to get the shift for free) to an array, then loading with the selector as the index. This costs store/reload latency (~5 cycles), but is cheap for throughput and avoids branch misses. (Possibly a 16-bit store and then a byte reload would be good, scaling the selector in the load address and adding 1 to get the high byte, saving an extract instruction before each store on ARM).
This is in fact what gcc emits if you write it this way:
unsigned decode_ALU(uint8_t b, unsigned selector) {
uint8_t decoded[3];
uint32_t tmp = b * 6;
decoded[0] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[1] = tmp >> 8;
tmp = 6 * (uint8_t)tmp;
decoded[2] = tmp >> 8;
return decoded[selector];
}
movzx edi, dil
mov esi, esi
lea eax, [rdi+rdi*2]
add eax, eax
mov BYTE PTR -3[rsp], ah # store high half of mul-by-6
movzx eax, al # costs 1 cycle: gcc doesn't know about zero-latency movzx?
lea eax, [rax+rax*2]
add eax, eax
mov BYTE PTR -2[rsp], ah
movzx eax, al
lea eax, [rax+rax*2]
shr eax, 7
mov BYTE PTR -1[rsp], al
movzx eax, BYTE PTR -3[rsp+rsi]
ret
The first store's data is ready 4 cycles after the input to the first movzx, or 5 if you include the extra 1c of latency for reading ah when it's not renamed separately on Intel HSW/SKL. The next 2 stores are 3 cycles apart.
So the total latency is ~10 cycles from b input to result output, if selector=0. Otherwise 13 or 16 cycles.

Measuring a number of different approaches in-place in the function that needs to do this, the practical answer is really boring: it doesn't matter. They're all running at about 50ns per call, and other work is dominating. So for my purposes, the approach that pollutes the cache and branch predictors the least is probably the best. That seems to be:
(b * (int[]){2048,342,57}[i] >> 11) % 6;
where b is the byte containing the packed values and i is the index of the value wanted. The magic constants 342 and 57 are just the multiplicative constants GCC generates for division by 6 and 36, respectively, scaled to a common shift of 11. The final %6 is spurious in the /36 case (i==2) but branching to avoid it does not seem worthwhile.
On the other hand, if doing this same work in a context where there wasn't an interface constraint to have the surrounding function call overhead per lookup, I think an approach like Peter's would be preferable.

Using Int (32 bits) over char (8 bits) to 'help' processor

In C, often we use char for small number representations. However Processor always uses Int( or 32 bit) values for read from(or fetch from) registers. So every time we need to use a char or 8 bits in our program processor need to fetch 32 bits from regsiter and 'parse' 8 bits out of it.
Hence does it sense to use Int more often in place of char if memory is not the limitation?
Will it 'help' processor?

There's the compiler part and the cpu part.
If you tell the compiler you're using a char instead of an int, during static analysis it will know the bounds of the variable is between 0-255 instead of 0-(2^32-1). This will allow it to optimize your program better.
On the cpu side, your assumption isn't always correct. Take x86 as an example, it has registers eax and al for 32 bit and 8 bit register access. If you want to use chars only, using al is sufficient. There is no performance loss.
I did some simple benchmarks in response to below comments:
al:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc al
add al, al
dec al
dec al
loopd short loop_start
ret
eax:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc eax
add eax, eax
dec eax
dec eax
loopd short loop_start
ret
times:
$ time ./test_al.exe
./test_al.exe 0.01s user 0.00s system 0% cpu 7.102 total
$ time ./test_eax.exe
./test_eax.exe 0.01s user 0.01s system 0% cpu 7.120 total
So in this case, al is slightly faster, but sometimes eax came out faster. The difference is really negligible. But cpus aren't so simple, there might be code alignment issues, caches, and other things going on, so it's best to benchmark your own code to see if there's any performance improvement. But imo, if your code is not super tight, it's best to trust the compiler to optimize things.

I'd stick to int if I were you as that is probably the most native integral type for your platform. Internally you could expect shorter types to be converted to int so actually degrading performance.
You should never use char and expect it to be consistent across platforms. Although the C standard defines sizeof(char) to be 1, char itself could be signed or unsigned. The choice is down to the compiler.
If you believe that you can squeeze some performance gain in using an 8 bit type then be explicit and use signed char or unsigned char.

From ARM system developers guide
"most ARM data processing operations are 32-bit only. For this reason, you should use
a 32-bit datatype, int or long, for local variables wherever possible. Avoid using char and
short as local variable types, even if you are manipulating an 8- or 16-bit value"
an example code from the book to prove the point. note the wrap around handling for char as opposed to unsigned int.
int checksum_v1(int *data)
{
char i;
int sum = 0;
for (i = 0; i < 64; i++)
{
sum += data[i];
}
return sum;
}
ARM7 assembly when using i as a char
checksum_v1
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v1_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
AND r1,r1,#0xff ; i = (char)r1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v1_loop ; if (i<64) loop
MOV pc,r14 ; return sum
ARM7 assembly when i is an unsigned int.
checksum_v2
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum

If your program is simple enough, the optimizer can do the right thing without you having to worry about it. In this case, plain int would be the simplest (and forward-proof) solution.
However, if you want really much to combine specific bit width and speed, you can use 7.18.1.3 Fastest minimum-width integer types from the C99 standard (requires C99-compliant compiler).
For example:
int_fast8_t x;
uint_fast8_t y;
are the signed and unsigned types that are guaranteed to be able to store at least 8 bits of data and use the usually faster underlying type. Of course, it all depends on what you are doing with the data afterwards.
For example, on all systems I have tested (see: standard type sizes in C++) the fast types were 8-bit long.

C programming and error_code variable efficiency

Most code I have ever read uses a int for standard error handling (return values from functions and such). But I am wondering if there is any benefit to be had from using a uint_8 will a compiler -- read: most C compilers on most architectures -- produce instructions using the immediate address mode -- i.e., embed the 1-byte integer into the instruction ? The key instruction I'm thinking about is the compare after a function, using uint_8 as its return type, returns.
I could be thinking about things incorrectly, as introducing a 1 byte type just causes alignment issues -- there is probably a perfectly sane reason why compiles like to pack things in 4-bytes and this is possibly the reason everyone just uses ints -- and since this is stack related issue rather than the heap there is no real overhead.
Doing the right thing is what I'm thinking about. But lets say say for the sake of argument this is a popular cheap microprocessor for a intelligent watch and that it is configured with 1k of memory but does have different addressing modes in its instruction set :D
Another question to slightly specialize the discussion (x86) would be: is the literal in:
uint_32 x=func(); x==1;
and
uint_8 x=func(); x==1;
the same type ? or will the compiler generate a 8-byte literal in the second case. If so it may use it to generate a compare instruction which has the literal as an immediate value and the returned int as a register reference. See CMP instruction types..
Another Refference for the x86 Instruction Set.

Here's what one particular compiler will do for the following code:
extern int foo(void) ;
void bar(void)
{
if(foo() == 31) { //error code 31
do_something();
} else {
do_somehing_else();
}
}
0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 83 ec 08 sub $0x8,%esp
6: e8 fc ff ff ff call 7 <bar+0x7>
b: 83 f8 1f cmp $0x1f,%eax
e: 74 08 je 18 <bar+0x18>
10: c9 leave
11: e9 fc ff ff ff jmp 12 <bar+0x12>
16: 89 f6 mov %esi,%esi
18: c9 leave
19: e9 fc ff ff ff jmp 1a <bar+0x1a>
a 3 byte instruction for the cmp. if foo() returns a char , we get
b: 3c 1f cmp $0x1f,%al
If you're looking for efficiency though. Don't assume comparing stuff in %a1 is faster than comparing with %eax

There may be very small speed differences between the different integral types on a particular architecture. But you can't rely on it, it may change if you move to different hardware, and it may even run slower if you upgrade to newer hardware.
And if you talk about x86 in the example you are giving, you make a false assumption: An immediate needs to be of type uint8_t.
Actually 8-bit immediates embedded into the instruction are of type int8_t and can be used with bytes, words, dwords and qwords, in C notation: char, short, int and long long.
So on this architecture there would be no benefit at all, neither code size nor execution speed.

You should use int or unsigned int types for your calculations. Using smaller types only for compounds (structs/arrays). The reason for that is that int is normally defined to be the "most natural" integral type for the processor, all other derived type may necessitate processing to work correctly. We had in our project compiled with gcc on Solaris for SPARC the case that accesses to 8 and 16 bit variable added an instruction to the code. When loading a smaller type from memory it had to make sure the upper part of the register was properly set (sign extension for signed type or 0 for unsigned). This made the code longer and increased pressure on the registers, which deteriorated the other optimisations.
I've got a concrete example:
I declared two variable of a struct as uint8_t and got that code in Sparc Asm:
if(p->BQ > p->AQ)
was translated in
ldub [%l1+165], %o5 ! <variable>.BQ,
ldub [%l1+166], %g5 ! <variable>.AQ,
and %o5, 0xff, %g4 ! <variable>.BQ, <variable>.BQ
and %g5, 0xff, %l0 ! <variable>.AQ, <variable>.AQ
cmp %g4, %l0 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL586 !
And here what I got when I declared the two variables as uint_t
lduw [%l1+168], %g1 ! <variable>.BQ,
lduw [%l1+172], %g4 ! <variable>.AQ,
cmp %g1, %g4 ! <variable>.BQ, <variable>.AQ
bleu,a,pt %icc, .LL587 !
Two arithmetic operations less and 2 registers more for other stuff

Processors typically likes to work with their natural register sizes, which in C is 'int'.
Although there are exceptions, you're thinking too much on a problem that does not exist.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight