In x86 assembly, the overflow flag is set when an add or sub operation on a signed integer overflows, and the carry flag is set when an operation on an unsigned integer overflows.
However, when it comes to the inc and dec instructions, the situation seems to be somewhat different. According to this website, the inc instruction does not affect the carry flag at all.
But I can't find any information about how inc and dec affect the overflow flag, if at all.
Do inc or dec set the overflow flag when an integer overflow occurs? And is this behavior the same for both signed and unsigned integers?
Okay, so essentially the consensus here is that INC and DEC should behave the same as ADD and SUB, in terms of setting flags, with the exception of the carry flag. This is also what it says in the Intel manual.
The problem is I can't actually reproduce this behavior in practice, when it comes to unsigned integers.
Consider the following assembly code (using GCC inline assembly to make it easier to print out results.)
int8_t ovf = 0;
__asm__ (
    "movb $-128, %%bh;"
    "decb %%bh;"
    "seto %b0;"
    : "=g"(ovf)
    :
    : "%bh"
);
printf("Overflow flag: %d\n", ovf);
Here we decrement a signed 8-bit value of -128. Since -128 is the smallest possible value, an overflow is inevitable. As expected, this prints out: Overflow flag: 1
But when we do the same with an unsigned value, the behavior isn't as I expect:
int8_t ovf = 0;
__asm__ (
    "movb $255, %%bh;"
    "incb %%bh;"
    "seto %b0;"
    : "=g"(ovf)
    :
    : "%bh"
);
printf("Overflow flag: %d\n", ovf);
Here I increment an unsigned 8-bit value of 255. Since 255 is the largest possible value, an overflow is inevitable. However, this prints out: Overflow flag: 0.
Huh? Why didn't it set the overflow flag in this case?

The overflow flag is set when an operation would cause a sign change. Your code is very close. I was able to set the OF flag with the following (VC++) code:
char ovf = 0;
_asm {
    mov bh, 127
    inc bh
    seto ovf
}
cout << "ovf: " << int(ovf) << endl;
When BH is incremented the MSB changes from a 0 to a 1, causing the OF to be set.
This also sets the OF:
char ovf = 0;
_asm {
    mov bh, 128
    dec bh
    seto ovf
}
cout << "ovf: " << int(ovf) << endl;
Keep in mind that the processor does not distinguish between signed and unsigned numbers. When you use 2's complement arithmetic, you can have one set of instructions that handle both. If you want to test for unsigned overflow, you need to use the carry flag. Since INC/DEC don't affect the carry flag, you need to use ADD/SUB for that case.

Intel® 64 and IA-32 Architectures Software Developer's Manuals
Look at the appropriate manual Instruction Set Reference, A-M. Every instruction is precisely documented.
Here is the INC section on affected flags:
The CF flag is not affected. The OF, SF, ZF, AF, and PF flags are set according to the result.

try changing your test to pass in the number rather than hard code it, then have a loop that tries all 256 numbers to find the one if any that affects the flag. Or have the asm perform the loop and exit out when it hits the flag and or when it wraps around to the number it started with (start with something other than 0x00, 0x7f, 0x80, or 0xFF).
.globl inc
inc:
    mov $33, %eax
top:
    inc %al
    jo done
    jmp top
done:
    ret
.globl dec
dec:
    mov $33, %eax
topx:
    dec %al
    jo donex
    jmp topx
donex:
    ret
inc overflows when it goes from 0x7F to 0x80, and dec overflows when it goes from 0x80 to 0x7F. I suspect the problem is in the way you are using inline assembler.
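The same brute-force search can be modelled in portable C, without inline assembly. This is only a sketch of the idea: the helper names (inc8_sets_of, dec8_sets_of, scan_of) are made up for illustration, and the code models OF as "the signed 8-bit result does not fit in -128..127" rather than reading the real flag.

```c
#include <stdint.h>

/* Hypothetical model: would `inc` on this byte set OF?
   The byte is read as a signed two's-complement value. */
int inc8_sets_of(uint8_t v)
{
    int s = (v < 128) ? (int)v : (int)v - 256;
    return s + 1 > 127;
}

/* Same model for `dec`. */
int dec8_sets_of(uint8_t v)
{
    int s = (v < 128) ? (int)v : (int)v - 256;
    return s - 1 < -128;
}

/* Try all 256 inputs; return how many set OF and store the last hit. */
int scan_of(int (*sets_of)(uint8_t), uint8_t *found)
{
    int count = 0;
    for (int v = 0; v < 256; v++) {
        if (sets_of((uint8_t)v)) {
            *found = (uint8_t)v;
            count++;
        }
    }
    return count;
}
```

Under this model exactly one input per instruction trips the flag: 0x7F for inc and 0x80 for dec. In particular 0xFF is not one of them.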

As many of the other answers have pointed out, INC and DEC do not affect the CF, whereas ADD and SUB do.
What has not been said yet, however, is that this might make a performance difference. Not that you'd usually be bothered by that unless you are trying to optimise the hell out of a routine, but essentially not setting the CF means that INC/DEC only write to part of the flags register, which can cause a partial flag register stall, see Intel 64 and IA-32 Architectures Optimization Reference Manual or Agner Fog's optimisation manuals.

Except for the carry flag, inc sets the flags the same way as add operand, 1 would.
The fact that inc does not affect the carry flag is very important.

The CPU/ALU is only capable of handling unsigned binary numbers, and then it uses OF, CF, AF, SF, ZF, etc., to allow you to decide whether to use it as a signed number (OF), an unsigned number (CF) or a BCD number (AF).
About your problem, remember to consider the binary numbers themselves, as unsigned.
Also, the overflow and the OF require 3 numbers: The input number, a second number to use in the arithmetic, and the result number.
Overflow is activated only if the first and second numbers have the same value for the sign bit (the most significant bit) and the result has a different sign. As in, adding 2 negative numbers resulted in a positive number, or adding 2 positive numbers resulted in a negative number:
if( (Sign_Num1==Sign_Num2) && (Sign_Result!=Sign_Num1) ) OF=1;
else OF=0;
For your first problem, you are using -128 as the first number. The second number is implicitly -1, used by the DEC instruction. So we really have the binary numbers 0x80 and 0xFF. Both of them have the sign bit set to 1. The result is 0x7F, which has the sign bit set to 0. We have 2 inputs with the same sign and a result with a different sign, so we indicate an overflow. -128-1 resulted in 127, and thus the overflow flag is set to indicate a wrong signed result.
For your second problem, you are using 255 as the first number. The second number is implicitly 1, used by the INC instruction. So we really have the binary numbers 0xFF and 0x01. They have different sign bits, so an overflow is not possible (signed overflow can only happen when adding 2 numbers of the same sign; 2 numbers of different signs can never produce a sum outside the signed range). The result is 0x00, and the overflow flag is not set because 255+1, or more exactly -1+1, gives 0, which is obviously correct for signed arithmetic.
Remember that for the overflow flag to be set, the 2 numbers being added/subtracted need to have the sign bit with the same value, and then the result must have a sign bit with a value different from them.
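The sign-bit rule above can be written out as a small C model for 8-bit addition. This is a sketch, not a description of the hardware: add8_overflow and add8_carry are illustrative names, and DEC is modelled as adding 0xFF, as in the explanation above.

```c
#include <stdint.h>

/* OF for a + b on 8 bits: both inputs share a sign bit and the
   result's sign bit differs from it. */
int add8_overflow(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a + b);
    return ((~(a ^ b)) & (a ^ r) & 0x80) != 0;
}

/* CF for a + b on 8 bits: carry out of bit 7 (unsigned overflow).
   Real INC/DEC leave CF alone; this is what ADD would report. */
int add8_carry(uint8_t a, uint8_t b)
{
    return ((unsigned)a + (unsigned)b) > 0xFF;
}
```

With these, add8_overflow(0x80, 0xFF) is 1 (the -128 - 1 case), add8_overflow(0xFF, 0x01) is 0 (the 255 + 1 case), and add8_carry(0xFF, 0x01) is 1, which is the flag an unsigned test would need.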

What the processor does is set the appropriate flags for the results of these instructions (add, adc, dec, inc, sbb, sub) for both the signed and unsigned cases, i.e. two different flag results for every op. The alternative would be having two sets of instructions where one sets signed-related flags and the other the unsigned-related. If the issuing compiler is using unsigned variables in the operation it will test carry and zero (jc, jnc, jb, jbe etc), if signed it tests overflow, sign and zero (jo, jno, jg, jng, jl, jle etc).


Efficient Assembly multiplication

Started to practice assembly, not too long ago.
I want to implement efficient multiplication using the assembly instructions lea and shift.
I want to write a C program that calls an assembly procedure chosen to fit a constant argument received from the user, and that multiplies another user-supplied argument by that constant.
How can I make this code effective?
What numbers can I group (if any) to fit the same procedure?
for example I think that I can group 2,4,8,... to the same procedure as they are just a left shift by 1,2,3 for example.
But I'm having trouble finding other groups like this one with other numbers and what about negatives...
The interesting part of this exercise is finding ways to use 1 or 2 LEA, SHL, and/or ADD/SUB instructions to implement multiplies by various constants.
Actually dispatching on the fly for a single multiply isn't very interesting, and would mean either actual JIT compiling or that you have every possible sequence already present in a giant table of tiny blocks of code. (Like switch statements.)
Instead I'd suggest writing a C or Python or whatever function that takes 1 integer arg, and as output produces the asm source text that implements x * n where n is the integer arg. i.e. a function like you might find in a compiler that optimizes a multiply-by-constant.
You might want to cook up an automated way to test this, e.g. by comparing against a pure C x * n for a couple different x values.
If you can't get the job done in 2 instructions (or 3 with one of them being mov), it's not worth it. Modern x86 has ridiculously efficient multiply in hardware. imul reg, r/m, imm is 1 uop, 3 cycle latency, fully pipelined. (AMD since Zen, Intel since Core2 or Nehalem or so.) That's your fallback for anything that you can't get done with a critical path length of 1 or 2 cycles (assuming zero-latency mov if you want, like IvyBridge+ and Zen.)
Or you could set a higher threshold before fallback if you want to explore more complicated sequences, e.g. aim for 64-bit multiply on Bulldozer-family (6 cycle latency). https://agner.org/optimize/. Or even P5 Pentium where imul takes 9 cycles (not pairable).
Patterns to look for
Integer multiply boils down to adding up shifted copies of 1 operand where the other operand has 1 bits. (See the algorithm for implementing multiply by runtime-variable values, by shift and add checking each bit one at a time.)
The easiest pattern is of course only a single set bit, i.e. a power of 2; then it's just a left shift. This is easy to check for: (n & (n-1)) == 0, when n != 0. (The parentheses matter in C, where == binds tighter than &.)
Anything with exactly 2 set bits is at most 2 shifts and an add. (GNU C __builtin_popcount(n) counts set bits. In x86 asm, SSE4.2 popcnt).
GNU C __builtin_ctz finds the bit-index of the lowest set bit. Using it on a number you know is non-zero will give you the shift count for the low bit. In x86 asm, bsf / tzcnt.
To clear that lowest set bit and "expose" the next-lowest, you can do n &= n-1;. In x86 asm, BMI1 blsr or LEA / AND.
Another interesting pattern to look for is 2^n +- 1. The +1 case is already covered by the 2-set-bits case, but the shift count for the low bit is 0; no shift needed. With shift counts up to 3, you can do it in one LEA.
You can detect 2^n - 1 by checking if n+1 is a power of 2 (has only 1 bit set). Somewhat more complex, (2^n - 1) * 2^m can be done with this trick plus another shift. So you could try right-shifting to bring the lowest set bit to the bottom then looking for tricks.
GCC does this the 2^n - 1 way:
mul15: # gcc -O3 -mtune=bdver2
mov eax, edi
sal eax, 4
sub eax, edi
clang is more efficient (for Intel CPUs where scaled-index is still only 1 cycle latency):
mul15: # clang -O3 -mtune=bdver2
lea eax, [rdi + 4*rdi]
lea eax, [rax + 2*rax]
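The pattern checks described in this section can be sketched in portable C. The names below (popcount_u, ctz_u, classify, the mul_pattern enum) are made up for illustration, and the loops are plain stand-ins for __builtin_popcount and __builtin_ctz. The classifier only knows the simple patterns listed above, so MUL_OTHER does not mean "use imul": 11, for example, lands there even though it has a 2-LEA sequence.

```c
#include <assert.h>

/* Exactly one set bit. */
int is_pow2(unsigned n) { return n != 0 && (n & (n - 1)) == 0; }

/* n == 2^k - 1 for some k >= 1 (all low bits set, nothing above). */
int is_pow2_minus_1(unsigned n) { return n != 0 && (n & (n + 1)) == 0; }

/* Count set bits by clearing the lowest one each round (n &= n-1). */
int popcount_u(unsigned n)
{
    int c = 0;
    while (n) { n &= n - 1; c++; }
    return c;
}

/* Index of the lowest set bit; n must be non-zero. */
int ctz_u(unsigned n)
{
    assert(n != 0);
    int i = 0;
    while (!(n & 1u)) { n >>= 1; i++; }
    return i;
}

enum mul_pattern {
    MUL_ZERO,              /* x*0 */
    MUL_SHIFT,             /* power of 2: one shift */
    MUL_TWO_BITS,          /* two set bits: shifts plus add, or one LEA */
    MUL_SHIFT_SUB,         /* 2^k - 1: shift then sub */
    MUL_SHIFTED_SHIFT_SUB, /* (2^k - 1) * 2^m */
    MUL_OTHER              /* not matched by these simple patterns */
};

enum mul_pattern classify(unsigned n)
{
    if (n == 0) return MUL_ZERO;
    if (is_pow2(n)) return MUL_SHIFT;
    if (popcount_u(n) == 2) return MUL_TWO_BITS;
    if (is_pow2_minus_1(n)) return MUL_SHIFT_SUB;
    if (is_pow2_minus_1(n >> ctz_u(n))) return MUL_SHIFTED_SHIFT_SUB;
    return MUL_OTHER;
}
```

A generator could switch on the result of classify(n) to decide which instruction sequence to print.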
Combining these patterns
Maybe factorize your number into its prime factors and look for ways to use your building blocks to do combinations of those factors.
But this isn't the only approach. You can do x*11 as x*5*2 + x, like GCC and Clang do this (which is a lot like How to multiply a register by 37 using only 2 consecutive leal instructions in x86?)
lea eax, [rdi + 4*rdi]
lea eax, [rdi + 2*rax]
There are 2 approaches for x*17 as well. GCC and Clang do it this way:
mov eax, edi
sal eax, 4
add eax, edi
But another way which they fail to use even with -march=sandybridge (no mov-elimination, 1-cycle LEA [reg + reg*scale]) is:
lea eax, [rdi + 8*rdi] ; x*9
lea eax, [rax + 8*rdi] ; x*9 + x*8 = x*17
So instead of multiplying factors, we're adding different multipliers to make the total multiplier.
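To test sequences like these automatically, one option is to model LEA in C and compare against a plain multiply for a handful of inputs. This is a sketch under assumptions: lea() is a hypothetical helper that models only the [base + scale*index] form on 32-bit values, and check() is an illustrative test harness, not part of any toolchain.

```c
#include <stdint.h>

/* Model of `lea dst, [base + scale*index]`; scale is 1, 2, 4 or 8. */
uint32_t lea(uint32_t base, uint32_t index, uint32_t scale)
{
    return base + scale * index;
}

/* x*15 as (x*5)*3, the clang sequence. */
uint32_t mul15(uint32_t x) { uint32_t t = lea(x, x, 4); return lea(t, t, 2); }

/* x*11 as x + 2*(x*5). */
uint32_t mul11(uint32_t x) { uint32_t t = lea(x, x, 4); return lea(x, t, 2); }

/* x*17 as x*9 + 8*x. */
uint32_t mul17(uint32_t x) { uint32_t t = lea(x, x, 8); return lea(t, x, 8); }

/* Returns 1 if f(x) == x*n (mod 2^32) for every test input. */
int check(uint32_t (*f)(uint32_t), uint32_t n)
{
    static const uint32_t xs[] = { 0, 1, 2, 3, 1000, 0x7FFFFFFFu, 0xFFFFFFFFu };
    for (unsigned i = 0; i < sizeof xs / sizeof xs[0]; i++)
        if (f(xs[i]) != xs[i] * n)
            return 0;
    return 1;
}
```

Because everything is unsigned, wraparound in the model matches what the instructions do, so large inputs are valid test cases too.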
I don't have any great suggestions how to programmatically search for these sequences beyond the simple ones like 2 set bits, or 2^n +- 1. If you're curious, have a look in GCC or LLVM source code for the functions that do these optimizations; they find a lot of tricky ones.
The work might be split between target-neutral optimization passes for powers of 2 vs. x86-specific target code for using LEA, and for deciding on a threshold of how many instructions is worth it before falling back to imul-immediate.
Negative numbers
x * -8 could be done with x - x*9. I think that might be safe even if x*9 overflows but you'd have to double-check on that.
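One way to double-check: in wrapping (unsigned) arithmetic, which is what the instructions implement, x - 9x equals -8x modulo 2^32 whether or not the intermediate x*9 wraps. A minimal sketch, with mul_neg8 as an illustrative name:

```c
#include <stdint.h>

/* x * -8 computed as x - x*9, all modulo 2^32. */
uint32_t mul_neg8(uint32_t x)
{
    uint32_t x9 = x + 8u * x;   /* like lea [x + 8*x]; may wrap */
    return x - x9;              /* x - 9x = -8x (mod 2^32) */
}
```

This says nothing about signed overflow rules in C source; it only models the register-level result.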
Look at compiler output
#define MULFUN(c) int mul##c(int x) { return x*c; }
I put that on the Godbolt compiler explorer for the x86-64 System V ABI (first arg in RDI, like the above examples). With gcc and clang -O3. I used -mtune=bdver2 (Piledriver) because it has somewhat slower multiply than Intel or Zen. This encourages GCC and Clang to avoid imul slightly more aggressively.
I didn't try if long / uint64_t would change that (6 cycle instead of 4 cycle latency, and half the throughput.) Or if an older uarch like -mtune=nocona (Pentium 4) would make a difference. -mtune=bdver2 did make a difference vs. the default tune=generic for GCC at least.
If you use -m32, you can use even older uarches like -mtune=pentium (in-order P5). I'd recommend -mregparm=3 for that so args are still passed in registers, not the stack.

Creating a mask with N least significant bits set

I would like to create a macro or function[1] mask(n) which given a number n returns an unsigned integer with its n least significant bits set. Although this seems like it should be a basic primitive with heavily discussed implementations which compile efficiently - this doesn't seem to be the case.
Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking about returning a uint64_t specifically, although of course an acceptable solution would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.
Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
The most important criterion is correctness: only correct solutions which don't rely on undefined behavior are interesting.
The second most important criterion is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.
A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.
[1] The most general case is a function, but ideally it would also work as a macro, without re-evaluating any of its arguments more than once.
#include <assert.h>
unsigned long long mask(const unsigned n)
{
  assert(n <= 64);
  return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
                     (1ULL << n) - 1ULL;
}
There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.
Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.
The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.
If you need a macro (because you need something that works as a compile-time constant), that might be:
#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)
As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.
You could replace 64U with CHAR_BIT*sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
You could similarly generate this from an unsigned right shift, but you would still need to check n == 64 as a special case, since right-shifting by the width of the type is undefined behavior.
The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.
Another solution without branching
unsigned long long mask(unsigned n)
{
    return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}
n & 0x3F keeps the shift amount to maximum 63 in order to avoid UB. In fact most modern architectures will just grab the lower bits of the shift amount, so no and instruction is needed for this.
The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
The output from Clang looks better than gcc. As it happens gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output
The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above
return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;
gcc compiles the latter to
mov eax, 1
shlx rax, rax, rdi
shr edi, 6
dec rax
sub rax, rdi
Some more alternatives
return ~((~0ULL << (n - (n == 64))) << (n == 64));
return ((1ULL << (n & 0x3F)) - 1) | -((uint64_t)n >> 6);
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
A similar question for 32 bits: Set last `n` bits in unsigned int
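Since the two edge cases are easy to get wrong, it may be worth checking any branchless variant against the straightforward conditional version for every n in [0, 64]. The sketch below does that for three of the expressions from this answer; the function names are illustrative, and the check is exhaustive only over the range the question asks about.

```c
#include <stdint.h>

/* Reference: the obvious conditional version. */
uint64_t mask_ref(unsigned n)
{
    return n == 64 ? UINT64_MAX : (UINT64_C(1) << n) - 1;
}

/* AND with an all-ones/zero mask, then subtract 1. */
uint64_t mask_and(unsigned n)
{
    return ((UINT64_C(1) << (n & 0x3F)) & -(uint64_t)(n != 64)) - 1;
}

/* Subtract the (n == 64) condition instead of masking. */
uint64_t mask_cmp(unsigned n)
{
    return (UINT64_C(1) << (n & 0x3F)) - (n == 64) - 1;
}

/* Same, with the condition written as n >> 6. */
uint64_t mask_shr6(unsigned n)
{
    return (UINT64_C(1) << (n & 0x3F)) - (n >> 6) - 1;
}

/* Returns 1 if every variant matches the reference on [0, 64]. */
int all_variants_agree(void)
{
    for (unsigned n = 0; n <= 64; n++) {
        uint64_t r = mask_ref(n);
        if (mask_and(n) != r || mask_cmp(n) != r || mask_shr6(n) != r)
            return 0;
    }
    return 1;
}
```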
Here's one that is portable and conditional-free:
unsigned long long mask(unsigned n)
{
    assert (n <= sizeof(unsigned long long) * CHAR_BIT);
    return (1ULL << (n/2) << (n-(n/2))) - 1;
}
This is not an answer to the exact question. It only works if 0 isn't a required output, but is more efficient.
2^(n+1) - 1 computed without overflow. i.e. an integer with the low n+1 bits set, for n = 0 .. all_bits
Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.
// defined for n=0 .. sizeof(unsigned long long)*CHAR_BIT - 1
unsigned long long setbits_upto(unsigned n) {
    unsigned long long pow2 = 1ULL << n;
    return pow2*2 - 1;                  // one more shift, and subtract 1.
}
Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.
unsigned long long setbits_upto2(unsigned n) {
    unsigned long long pow2 = 2ULL << n;      // bake in the extra shift count
    return pow2 - 1;
}
The table of inputs / outputs for a 32-bit version of this function is:
n -> 1<<n -> *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
You could slap a cmov after it, or other way of handling an input that has to produce zero.
On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).
xor eax, eax
bts rax, rdi ; rax = 1<<(n&63)
lea rax, [rax + rax - 1] ; one more left shift, and subtract
(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)
In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family
C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).
e.g. gcc and clang both do this (with dec or add -1), on Godbolt
# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
mov ecx, edi
mov eax, 2 ; bake in the extra shift by 1.
sal rax, cl
dec rax
MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:
# ICC19
setbits_upto(unsigned int):
mov eax, 1 #3.21
mov ecx, edi #2.39
shl rax, cl #2.39
lea rax, QWORD PTR [-1+rax+rax] #3.21
ret #3.21
With BMI2 enabled (-march=haswell), we get optimal-for-AMD code from gcc/clang:
mov eax, 2
shlx rax, rax, rdi
add rax, -1
ICC still uses a 3-component LEA, so if you target MSVC or ICC use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. And this avoids the worst of both worlds; slow-LEA and a variable-count shift instead of BTS.
On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.
e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.
setbits_upto(unsigned int):
mov x1, 2
lsl x0, x1, x0
sub x0, x0, #1
Basically the same on PowerPC, RISC-V, etc.
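Coming back to the full problem from the question (where an input of 0 must produce 0), one way to reuse the 2ULL << n trick is to shift by n - 1 and special-case zero. This is a sketch, assuming a 64-bit unsigned long long for the examples; mask_via_setbits is an illustrative name, and whether the ternary becomes a cmov or a branch is up to the compiler.

```c
#include <assert.h>
#include <limits.h>

/* Low n bits set, for n in [0, width of unsigned long long]. */
unsigned long long mask_via_setbits(unsigned n)
{
    assert(n <= sizeof(unsigned long long) * CHAR_BIT);
    /* For n >= 1 the shift count n-1 is at most width-1, so the shift
       is well defined; at n == width the 2 is shifted out, leaving
       0 - 1 = all bits set. */
    return n ? (2ULL << (n - 1)) - 1 : 0;
}
```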
#include <stdint.h>
uint64_t mask_n_bits(const unsigned n){
  uint64_t ret = n < 64;
  ret <<= n&63; //the &63 is typically optimized away
  ret -= 1;
  return ret;
}
xor eax, eax
cmp edi, 63
setbe al
shlx rax, rax, rdi
dec rax
Returns expected results and if passed a constant value it will be optimized to a constant mask in clang and gcc as well as icc at -O2 (but not -Os) .
The &63 gets optimized away, but ensures the shift count is <= 63.
For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent pow(2,n)) and subtracting 1 from a power of 2 sets all bits less than that.
By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >=64 because left shifting a 0 will always yield 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (because of 2s complement representation for -1).
1s complement systems must die - requires special casing if you have one
some compilers may not optimize the &63 away
When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.
When N=0, we can make the constant zero before shifting:
uint64_t mask(unsigned N)
{
    return -uint64_t(N != 0) >> (64-N & 63);
}
This compiles to five instructions in x64 clang:
neg sets the carry flag to N != 0.
sbb turns the carry flag into 0 or -1.
shr rax,N already has an implicit N & 63, so 64-N & 63 was optimized to -N.
mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret
With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):
neg edi
sbb rax,rax
shrx rax,rax,rdi
ret
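For anyone who wants the right-shift version in plain C rather than C++, here is a sketch with a C-style cast and explicit parentheses around 64 - n. The names mask_shr and mask_shr_matches are illustrative; the second function just compares against the conditional definition over [0, 64].

```c
#include <stdint.h>

/* All-ones (or zero when n == 0) shifted right by 64 - n, with the
   count reduced mod 64 so it always stays in [0, 63]. */
uint64_t mask_shr(unsigned n)
{
    return -(uint64_t)(n != 0) >> ((64 - n) & 63);
}

/* Returns 1 if mask_shr matches the conditional version on [0, 64]. */
int mask_shr_matches(void)
{
    for (unsigned n = 0; n <= 64; n++) {
        uint64_t ref = (n == 64) ? UINT64_MAX : (UINT64_C(1) << n) - 1;
        if (mask_shr(n) != ref)
            return 0;
    }
    return 1;
}
```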

GCC compiles leading zero count poorly unless Haswell specified

GCC supports the __builtin_clz(int x) builtin, which counts the number of leading zeros (consecutive most-significant zeros) in the argument.
Among other things[0], this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down[1]:
/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
return 31U - (unsigned)__builtin_clz(x);
This works in the straightforward way - in particular consider the case x == 1 and clz(x) == 31 - then x == 2^0 so lg(x) == 0 and 31 - 31 == 0 and we get the correct result. Higher values of x work similarly.
Assuming the builtin is efficiently implemented, this ends up much better than the alternate pure C solutions.
Now as it happens, the count leading zeros operation is essentially the dual of the bsr instruction in x86. That returns the index of the most-significant 1-bit[2] in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x), and so bsr in fact directly implements our desired lg() function, without the superfluous 31U - ... part.
In fact, you can read between the lines and see that the __builtin_clz function was implemented with bsr in mind: it is defined as undefined behavior if the argument is zero, when of course the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int is) with a zero argument. So __builtin_clz was certainly implemented with the idea of being efficiently mapped to a bsr instruction on x86.
However, looking at what GCC actually does, at -O3 with otherwise default options: it adds a ton of extra junk:
lg(unsigned int):
bsr edi, edi
mov eax, 31
xor edi, 31
sub eax, edi
The xor edi,31 line is effectively a not edi for the bottom 5 bits that actually matter, that's off-by-one[3] from neg edi which turns the result of bsr into clz. Then the 31 - clz(x) is carried out.
However with -march=haswell, the code simplifies into exactly the expected single bsr instruction:
lg(unsigned int):
bsr eax, edi
Why that is the case is very unclear to me. The bsr instruction has been around for a couple decades before Haswell, and the behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr + a bunch of extra instructions isn't going to be faster than a plain bsr and furthermore using -mtune=haswell still results in the slower code.
The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx in the critical path which seems to do nothing since the result from clz will never be negative. Again, the -march=haswell variant is optimal with a single bsr instruction.
Finally, let's check the big players in the non-Windows compiler space, icc and clang. icc just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg eax; add eax, 31; - wtf? clang does a good job regardless of -march.
[0] Such as scanning a bitmap for the first set bit.
[1] The logarithm of 0 is undefined, and so calling our function with a 0 argument is undefined behavior.
[2] Here, the LSB is the 0th bit and the MSB is the 31st.
[3] Recall that -x == ~x + 1 in twos-complement.
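The relationship between bsr, clz and the xor with 31 can be checked with a small portable model. This is a sketch: bsr32 is a loop-based stand-in for the instruction (not the builtin), and all names are illustrative. It shows that for values in 0..31, v ^ 31 equals 31 - v, which is why xor edi, 31 turns a bsr result into a clz result and why the following 31 - clz(x) just undoes it.

```c
#include <assert.h>
#include <stdint.h>

/* Model of bsr: index of the highest set bit; x must be non-zero. */
unsigned bsr32(uint32_t x)
{
    assert(x != 0);
    unsigned i = 0;
    while (x >>= 1)
        i++;
    return i;
}

/* clz the way the default GCC output computes it: bsr, then xor 31. */
unsigned clz32(uint32_t x) { return bsr32(x) ^ 31u; }

/* lg as written in the question: 31 - clz(x). */
unsigned lg32(uint32_t x) { return 31u - clz32(x); }

/* Returns 1 if v ^ 31 == 31 - v for every v in 0..31. */
int xor31_is_31_minus(void)
{
    for (unsigned v = 0; v < 32; v++)
        if ((v ^ 31u) != 31u - v)
            return 0;
    return 1;
}
```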
This looks like a known issue with gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168

Replacing array access variables with the right integer type

I've had a habit of using int to access arrays (especially in for loops); however I recently discovered that I may have been "doing-it-all-wrong" and my x86 system kept hiding the truth from me. It turns out that int is fine when sizeof(size_t) == sizeof(int) but when used on a system where sizeof(size_t) > sizeof(int), it causes an additional mov instruction. size_t and ptrdiff_t seem to be the optimal way on the systems I've tested, requiring no additional mov.
Here is a shortened example
int vector_get(int *v,int i){ return v[i]; }
> movslq %esi, %rsi
> movl (%rdi,%rsi,4), %eax
> ret
int vector_get(int *v,size_t i){ return v[i]; }
> movl (%rdi,%rsi,4), %eax
> ret
OK, I've fixed myself (using size_t and ptrdiff_t now), now how do I (hopefully not manually) find these instances in my code so I can fix them?
Recently I've noticed several patches including changes from int to size_t coming across the wire mentioning Clang.
I put together a table of the extra instructions that get inserted on each instance to show the results of "doing-it-all-wrong".
Index type       Extra instruction
char             movsbq %sil, %rsi
short            movswq %si, %rsi
int              movslq %esi, %rsi
unsigned char    movzbl %sil, %esi
unsigned short   movzwl %si, %esi
unsigned int     movl %esi, %esi
Table of unwanted move operations when accessing vectors with the "wrong" type.
Note: long, long long, unsigned long, unsigned long long, size_t and ptrdiff_t require no additional mov* operation (basically anything >= largest object size, or 8 bytes on the 64 bit reference system )
I think I may have a workable stub for patching gcc, but I don't know my way around its source to complete the stub and add proper -Wflag bits, and as usual the hardest part of programming is naming stuff. -Wunaligned-index?
gcc/c/c-typeck.c _______________________________________________
if (!swapped)
warn_array_subscript_with_type_char (index);
> if ( sizeof(index) < sizeof(size_t) )
> warning_at (loc, OPT_Wunaligned_index,
> "array index is smaller than size_t");
/* Apply default promotions *after* noticing character types. */
index = default_conversion (index);
gcc/c-family/c.opt _____________________________________________
C ObjC C++ ObjC++
-trigraphs Support ISO C trigraphs
> Wunaligned-index
> C ObjC C++ ObjC++
> Warn about array indices smaller than size_t
C ObjC C++ ObjC++ Var(flag_undef)
Do not predefine system-specific and GCC-specific macros
gcc/c-family/c-opts.c __________________________________________
case OPT_Wtrigraphs:
cpp_opts->warn_trigraphs = value;
> case OPT_Wunaligned_index:
> cpp_opts->warn_unaligned_index = value;
case OPT_Wundef:
cpp_opts->warn_undef = value;
clang and gcc have -Wchar-subscripts, but that'll only help detect char subscript types.
You might consider modifying clang or gcc (whichever is easier to build on your infrastructure) to broaden the types detected by the -Wchar-subscripts warning. If this is a one-pass fix effort, this might be the most straightforward way to go about it.
Otherwise you'll need to find a linter that complains about non-size_t/ptrdiff_t subscripting; I'm not aware of any that have that option.
The movslq instruction sign-extends a long (aka 4-byte quantity) to a quad (aka 8-byte quantity). This is because int is signed, so an offset of e.g. -1 is 0xffffffff as a long. If you were to just zero-extend that (i.e. not have movslq), this would be 0x00000000ffffffff, aka 4294967295, which is probably not what you want. So, the compiler instead sign-extends the index to yield 0xffff..., aka -1.
The reason the other types don't require the additional operation is because, despite some of them being signed, they're still the same size of 8 bytes. And, thanks to two's complement, 0xffff... can be interpreted as either -1 or 18446744073709551615, and the 64-bit sum will still be the same.
Now, normally, if you were to instead use unsigned int, the compiler would normally have to insert a zero-extend instead, just to make sure the upper-half of the register doesn't contain garbage. However, on the x64 platform, this is done implicitly; an instruction such as mov %eax,%esi will move whatever 4-byte quantity is in eax into the lower 4 bytes of rsi and clear the upper 4, effectively zero-extending the quantity. But, given your postings, the compiler seems to insert mov %esi,%esi instruction anyway, "just to be sure".
Note, however, that this "automatic zero-extending" is not the case for 1- and 2-byte quantities - those must be manually zero-extended.
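The difference between the two extensions is easy to see in C. The sketch below uses illustrative function names and assumes the usual two's-complement conversions; it widens a 32-bit index of -1 both ways.

```c
#include <stdint.h>

/* What movslq does: sign-extend the 32-bit index to 64 bits. */
uint64_t widen_signed(int32_t i)
{
    return (uint64_t)(int64_t)i;
}

/* What a plain 32-bit mov does: zero-extend to 64 bits. */
uint64_t widen_unsigned(int32_t i)
{
    return (uint64_t)(uint32_t)i;
}
```

widen_signed(-1) gives 0xffffffffffffffff, which still acts as -1 in a 64-bit address calculation, while widen_unsigned(-1) gives 4294967295.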

Is there separate intrinsic for bitset64 in Visual Studio C compiler?

I need to set the nth bit of a 64-bit integer to 1.
There is an intrinsic (documented here http://msdn.microsoft.com/en-us/library/z56sc6y4(v=vs.90).aspx) :
unsigned char _bittestandset64(
   __int64 *a,
   __int64 b
);
which does the job.
My question is if there is a way to just do bit set (without testing) and if there is any performance hit for using bittestandset64 ignoring return value for the purpose.
I am also interested if there is a way to do that in assembly to use in GCC (for Intel Core2 to i7).
The point of intrinsics is to take advantage of specific processor instructions while still letting the compiler optimize the code; in this case, that instruction is BTS (Bit Test and Set). There is no dedicated instruction for "bit set". The code generator will pay attention to whether you use the result value. If you don't use it, it also won't generate the code to convert the carry flag to a bit value.
So a simple set, like:
_bittestandset64(&bits, 1);
000007F6ED6812CE bts qword ptr [rax],1
While using the result value, like:
unsigned char value = _bittestandset64(&bits, 1);
000007F7394E14A3 bts qword ptr [rax],1
000007F7394E14A8 setb al
000007F7394E14AB mov byte ptr [value],al
You can't do better than a single cpu instruction, this is already as good as it gets.
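For the GCC part of the question, a portable alternative to the intrinsic is to write the operation in plain C and let the compiler choose the instruction. This is a sketch with made-up names (set_bit64, test_and_set_bit64); whether a given GCC version emits bts, or an or with a shifted 1, is an assumption to verify in the generated code.

```c
#include <stdint.h>

/* Set bit b of *a; the count is reduced mod 64 to keep the shift defined. */
void set_bit64(uint64_t *a, unsigned b)
{
    *a |= UINT64_C(1) << (b & 63);
}

/* Set bit b of *a and return its previous value (0 or 1). */
unsigned char test_and_set_bit64(uint64_t *a, unsigned b)
{
    unsigned char old = (unsigned char)((*a >> (b & 63)) & 1);
    *a |= UINT64_C(1) << (b & 63);
    return old;
}
```

If the return value of test_and_set_bit64 is ignored, an optimizing compiler is free to drop the extraction of the old bit, much like the intrinsic case described above.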
