Why SIGFPE when INT_MIN is divided by -1 [duplicate] - c

I have an assignment of explaining some seemingly strange behaviors of C code (running on x86). I can easily complete everything else, but this one has really confused me.
Code snippet 1 outputs -2147483648
int a = 0x80000000;
int b = a / -1;
printf("%d\n", b);
Code snippet 2 outputs nothing, and gives a Floating point exception
int a = 0x80000000;
int b = -1;
int c = a / b;
printf("%d\n", c);
I know the reason for the result of Code Snippet 1 (1 + ~INT_MIN == INT_MIN), but I can't quite understand how integer division by -1 can generate an FPE, nor can I reproduce it on my Android phone (AArch64, GCC 7.2.0): Code 2 just outputs the same as Code 1, without any exception. Is it a hidden bug or feature of the x86 processor?
The assignment didn't say anything else (including the CPU architecture), but since the whole course is based on a desktop Linux distro, you can safely assume it's a modern x86.
Edit: I contacted my friend and he tested the code on Ubuntu 16.04 (Intel Kaby Lake, GCC 6.3.0). The result was consistent with the assignment (Code 1 printed -2147483648 and Code 2 crashed with an FPE).

There are four things going on here:
gcc -O0 behaviour explains the difference between your two versions: idiv vs. neg. (While clang -O0 happens to compile them both with idiv). And why you get this even with compile-time-constant operands.
x86 idiv faulting behaviour vs. behaviour of the division instruction on ARM
If integer math results in a signal being delivered, POSIX requires it to be SIGFPE: On which platforms does integer divide by zero trigger a floating point exception? But POSIX doesn't require trapping for any particular integer operation. (This is why it's allowed for x86 and ARM to be different).
The Single Unix Specification defines SIGFPE as "Erroneous arithmetic operation". It's confusingly named after floating point, but in a normal system with the FPU in its default state, only integer math will raise it. On x86, only integer division. On MIPS, a compiler could use add instead of addu for signed math, so you could get traps on signed add overflow. (gcc uses addu even for signed, but an undefined-behaviour detector might use add.)
C Undefined Behaviour rules (signed overflow, and division specifically) which let gcc emit code which can trap in that case.
gcc with no options is the same as gcc -O0.
-O0
Reduce compilation time and make debugging produce the expected results. This is the default.
This explains the difference between your two versions:
Not only does gcc -O0 not try to optimize, it actively de-optimizes to make asm that independently implements each C statement within a function. This allows gdb's jump command to work safely, letting you jump to a different line within the function and act like you're really jumping around in the C source. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? explains more about how and why -O0 compiles the way it does.
It also can't assume anything about variable values between statements, because you can change variables with set b = 4. This is obviously catastrophically bad for performance, which is why -O0 code runs several times slower than normal code, and why optimizing for -O0 specifically is total nonsense. It also makes -O0 asm output really noisy and hard for a human to read, because of all the storing/reloading, and lack of even the most obvious optimizations.
int a = 0x80000000;
int b = -1;
// debugger can stop here on a breakpoint and modify b.
int c = a / b; // a and b have to be treated as runtime variables, not constants.
printf("%d\n", c);
I put your code inside functions on the Godbolt compiler explorer to get the asm for those statements.
To evaluate a/b, gcc -O0 has to emit code to reload a and b from memory, and not make any assumptions about their value.
But with int c = a / -1;, you can't change the -1 with a debugger, so gcc can and does implement that statement the same way it would implement int c = -a;, with an x86 neg eax or AArch64 neg w0, w0 instruction, surrounded by a load(a)/store(c). On ARM32, it's a rsb r3, r3, #0 (reverse-subtract: r3 = 0 - r3).
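For reference, here's a sketch of what the constant-divisor wrapper could look like (the function name is mine; the variable-divisor twin, x86_fault(), appears further down):
int x86_noFault(void) {
    int a = 0x80000000;   // implementation-defined conversion; INT_MIN on typical x86 compilers
    int c = a / -1;       // constant divisor: gcc -O0 emits neg, so no idiv and no #DE
    return c;
}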
However, clang5.0 -O0 doesn't do that optimization. It still uses idiv for a / -1, so both versions will fault on x86 with clang. Why does gcc "optimize" at all? See Disable all optimization options in GCC. gcc always transforms through an internal representation, and -O0 is just the minimum amount of work needed to produce a binary. It doesn't have a "dumb and literal" mode that tries to make the asm as much like the source as possible.
x86 idiv vs. AArch64 sdiv:
x86-64:
# int c = a / b from x86_fault()
mov eax, DWORD PTR [rbp-4]
cdq # dividend sign-extended into edx:eax
idiv DWORD PTR [rbp-8] # divisor from memory
mov DWORD PTR [rbp-12], eax # store quotient
Unlike imul r32,r32, there's no 2-operand idiv that doesn't have a dividend upper-half input. Anyway, not that it matters; gcc is only using it with edx = copies of the sign bit in eax, so it's really doing a 32b / 32b => 32b quotient + remainder. As documented in Intel's manual, idiv raises #DE on:
divisor = 0
The signed result (quotient) is too large for the destination.
Overflow can easily happen if you use the full range of divisors, e.g. for int result = long long / int with a single 64b / 32b => 32b division. But gcc can't do that optimization, because it's not allowed to make code that would fault instead of following the C integer promotion rules (doing a 64-bit division and then truncating to int). It also doesn't optimize even in cases where the divisor is known to be large enough that it couldn't #DE.
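As a hedged illustration of that missed optimization (my own example, not from the answer):
int narrow_div(long long x, int d) {
    // gcc does a full 64-bit division here (64b/64b idiv on x86-64, or a
    // __divdi3 library call on 32-bit x86) even when the quotient would fit
    // in 32 bits, because a 64b/32b idiv could #DE where C says it must not.
    return (int)(x / d);
}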
When doing 32b / 32b division (with cdq), the only input that can overflow is INT_MIN / -1. The "correct" quotient is a 33-bit signed integer, i.e. positive 0x80000000 with a leading-zero sign bit to make it a positive 2's complement signed integer. Since this doesn't fit in eax, idiv raises a #DE exception. The kernel then delivers SIGFPE.
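If you want to catch the signal rather than just die, a minimal sketch of a SIGFPE handler (assuming Linux; the names and message strings are mine):
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

static void fpe_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    // On Linux/x86 the #DE from division by zero and, as far as I know,
    // from INT_MIN / -1 as well is reported with si_code == FPE_INTDIV.
    if (info->si_code == FPE_INTDIV)
        write(STDERR_FILENO, "FPE_INTDIV\n", 11);
    else
        write(STDERR_FILENO, "SIGFPE (other si_code)\n", 23);
    _exit(1);   // returning from a hardware-fault handler would re-run the idiv and fault again
}

int main(void) {
    struct sigaction sa = {0};
    sa.sa_sigaction = fpe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, NULL);

    volatile int a = 0x80000000, b = -1;   // volatile keeps the UB invisible at compile time
    volatile int c = a / b;                // faults on x86, just wraps on AArch64
    (void)c;
    return 0;
}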
AArch64:
# int c = a / b from x86_fault() (which doesn't fault on AArch64)
ldr w1, [sp, 12]
ldr w0, [sp, 8] # 32-bit loads into 32-bit registers
sdiv w0, w1, w0 # 32 / 32 => 32 bit signed division
str w0, [sp, 4]
ARM hardware division instructions don't raise exceptions for divide by zero or for INT_MIN/-1 overflow. Nate Eldredge commented:
The full ARM architecture reference manual states that UDIV or SDIV, when dividing by zero, simply return zero as the result, "without any indication that the division by zero occurred" (C3.4.8 in the Armv8-A version). No exceptions and no flags - if you want to catch divide by zero, you have to write an explicit test. Likewise, signed divide of INT_MIN by -1 returns INT_MIN with no indication of the overflow.
AArch64 sdiv documentation doesn't mention any exceptions.
However, software implementations of integer division may raise an exception: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka4061.html. (gcc uses a library call for division on ARM32 by default, unless you set a -mcpu that has HW division.)
C Undefined Behaviour.
As PSkocik explains, INT_MIN / -1 is undefined behaviour in C, like all signed integer overflow. This allows compilers to use hardware division instructions on machines like x86 without checking for that special case. If the operation were required not to fault, unknown inputs would need run-time compare-and-branch checks, and nobody wants C to require that.
More about the consequences of UB:
With optimization enabled, the compiler can assume that a and b still have their set values when a/b runs. It can then see the program has undefined behaviour, and thus can do whatever it wants. gcc chooses to produce INT_MIN like it would from -INT_MIN.
On a 2's complement system, the most-negative number is its own negative. This is a nasty corner-case for 2's complement, because it means abs(x) can still be negative.
https://en.wikipedia.org/wiki/Two%27s_complement#Most_negative_number
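You can demonstrate that without invoking UB by doing the negation in unsigned arithmetic (a trivial sketch of my own):
#include <limits.h>
#include <stdio.h>

int main(void) {
    unsigned u = (unsigned)INT_MIN;   // 0x80000000
    unsigned neg = ~u + 1u;           // two's complement negation, well defined on unsigned
    printf("%x %x\n", u, neg);        // prints "80000000 80000000": INT_MIN is its own negative
    return 0;
}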
int x86_fault() {
int a = 0x80000000;
int b = -1;
int c = a / b;
return c;
}
compiles to this with gcc6.3 -O3 for x86-64:
x86_fault:
mov eax, -2147483648
ret
but clang5.0 -O3 compiles to (with no warning even with -Wall -Wextra):
x86_fault:
ret
Undefined Behaviour really is totally undefined. Compilers can do whatever they feel like, including returning whatever garbage was in eax on function entry, or storing through a NULL pointer and executing an illegal instruction. e.g. with gcc6.3 -O3 for x86-64:
int *local_address(int a) {
return &a;
}
local_address:
xor eax, eax # return 0
ret
void foo() {
int *p = local_address(4);
*p = 2;
}
foo:
mov DWORD PTR ds:0, 0 # store immediate 0 into absolute address 0
ud2 # illegal instruction
Your case with -O0 didn't let the compilers see the UB at compile time, so you got the "expected" asm output.
See also What Every C Programmer Should Know About Undefined Behavior (the same LLVM blog post that Basile linked).

Signed int division in two's complement is undefined if:
the divisor is zero, OR
the dividend is INT_MIN (== 0x80000000 if int is int32_t) and the divisor is -1 (in two's complement, -INT_MIN > INT_MAX, which causes integer overflow, which is undefined behavior in C)
(https://www.securecoding.cert.org recommends wrapping integer operations in functions that check for such edge cases)
Since you're invoking undefined behavior by breaking rule 2, anything can happen, and as it happens, this particular anything on your platform happens to be an FPE signal being generated by your processor.
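A checked wrapper in the spirit of that CERT advice could look like this (a sketch of my own, not the CERT code):
#include <limits.h>
#include <stdbool.h>

// Returns false instead of invoking UB when the division would trap/overflow;
// otherwise stores the quotient through *result and returns true.
bool safe_div(int a, int b, int *result) {
    if (b == 0 || (a == INT_MIN && b == -1))
        return false;
    *result = a / b;
    return true;
}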

With undefined behavior very bad things could happen, and sometimes they do happen.
Your question makes no sense in terms of C (read Lattner on UB). But you could look at the assembler code (e.g. produced by gcc -O -fverbose-asm -S) and reason about the machine-code behavior.
On x86-64 with Linux, overflow in integer division (and also integer division by zero) gives a SIGFPE signal. See signal(7)
BTW, on PowerPC integer division by zero is rumored to give -1 at the machine level (but some C compilers generate extra code to test that case).
The code in your question is undefined behavior in C. The generated assembler code has some defined behavior (depends upon the ISA and processor).
(the assignment is meant to make you read more about UB, notably Lattner's blog, which you should absolutely read)

On x86, if you divide by actually executing the idiv instruction (which is not really necessary for constant arguments, nor even for variables known to be constant, but it happened anyway), INT_MIN / -1 is one of the cases that results in #DE (divide error). It's really a special case of the quotient being out of range; in general that is possible because idiv divides an extra-wide dividend by the divisor, so many combinations cause overflow - but INT_MIN / -1 is the only non-divide-by-zero case you can normally reach from higher-level languages, since they typically don't expose the extra-wide-dividend capability.
Linux annoyingly maps the #DE to SIGFPE, which has probably confused everyone who dealt with it the first time.

Both cases are weird: the first divides -2147483648 by -1 and mathematically should give 2147483648, not the result you are getting. Division by -1 (like multiplication by -1) should flip the sign of the dividend and produce a positive number, but there is no such positive number in int (this is what triggers the UB).
The value 2147483648 (0x80000000) is not representable in a 32-bit int on a two's complement machine. If you negate INT_MIN, you get INT_MIN back, as it has no positive counterpart around zero.
Arithmetic with signed integers works well for addition and subtraction (with care, since it is quite easy to overflow when you add the largest value to some int), but you cannot safely rely on it for every multiplication or division. So in this case you are invoking Undefined Behaviour. Signed integer overflow is always undefined behaviour (or implementation-defined behaviour in related cases, which is similar but not the same), as implementations vary widely in how they handle it.
I'll try to explain what may be happening (without any guarantee), as the compiler is free to do anything, or nothing at all.
Concretely, 0x80000000 represented in two's complement is
1000_0000_0000_0000_0000_0000_0000_0000
if we negate this number (first complement all bits, then add one), we get
0111_1111_1111_1111_1111_1111_1111_1111 + 1 =>
1000_0000_0000_0000_0000_0000_0000_0000
surprisingly, the same original number. There was an overflow (this value has no positive counterpart, so negation wraps back onto it). If the sign bit were then masked off,
1000_0000_0000_0000_0000_0000_0000_0000 &
0111_1111_1111_1111_1111_1111_1111_1111 =>
0000_0000_0000_0000_0000_0000_0000_0000
that would leave zero as the dividend.
But as I said before, this is only what might be happening on your system; the standard says this is undefined behaviour, so you can get completely different behaviour from your computer/compiler.
The different results you are getting are probably because the first operation is folded by the compiler at compile time, while the second is carried out at run time by the program itself. In the first case you assign 0x8000_0000 to the variable; in the second you calculate the value in the program. Both cases are undefined behaviour, and you are seeing it happen in front of your eyes.
NOTE 1:
As far as the compiler is concerned, the standard doesn't require the value 0x8000...0000 to be part of the usable int range (it is the one two's complement value with no positive counterpart). Since it has the largest absolute value of its type, one could argue that dividing any other number by it should simply give 0. But hardware normally doesn't treat that value specially (many implementations don't even implement signed division directly; they extract the signs and use unsigned division), and handling it would require a check before every division. Because the standard says the behaviour is undefined, implementations are free to skip such a check and effectively treat the usable range as 0x8000...0001 through 0x7FFF...FFFF, leaving 0x8000...0000 as a problem value.

Related

-2147483648 % -1 causes floating point exception [duplicate]

How can I instruct the MSVC compiler to use a 64bit/32bit division instead of the slower 128bit/64bit division?

How can I tell the MSVC compiler to use the 64bit/32bit division operation to compute the result of the following function for the x86-64 target:
#include <stdint.h>
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
if (a > b)
return ((uint64_t)b<<32) / a; //Yes, this must be cast, because otherwise b<<32 is undefined
else
return uint32_t(-1);
}
I would like the code, when the if statement is true, to compile to use the 64bit/32bit division operation, e.g. something like this:
; Assume arguments on entry are: Dividend in EDX, Divisor in ECX
mov edx, edx ;A dummy instruction to indicate that the dividend is already where it is supposed to be
xor eax,eax
div ecx ; EAX = EDX:EAX / ECX
...however the x64 MSVC compiler insists on using the 128bit/64bit div instruction, such as:
mov eax, edx
xor edx, edx
shl rax, 32 ; Scale up the dividend
mov ecx, ecx
div rcx ;RAX = RDX:RAX / RCX
See: https://www.godbolt.org/z/VBK4R71
According to the answer to this question, the 128bit/64bit div instruction is not faster than the 64bit/32bit div instruction.
This is a problem because it unnecessarily slows down my DSP algorithm which makes millions of these scaled divisions.
I tested this optimization by patching the executable to use the 64bit/32bit div instruction: The performance increased 28% according to the two timestamps yielded by the rdtsc instructions.
(Editor's note: presumably on some recent Intel CPU. AMD CPUs don't need this micro-optimization, as explained in the linked Q&A.)
No current compilers (gcc/clang/ICC/MSVC) will do this optimization from portable ISO C source, even if you let them prove that b < a so the quotient will fit in 32 bits. (For example with GNU C if(b>=a) __builtin_unreachable(); on Godbolt). This is a missed optimization; until that's fixed, you have to work around it with intrinsics or inline asm.
(Or use a GPU or SIMD instead; if you have the same divisor for many elements see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly.)
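For completeness, the portable attempt described two paragraphs up looks roughly like this (GNU C only; the function name is mine, and as noted, current compilers still don't narrow the division):
#include <stdint.h>

uint32_t ScaledDiv_hint(uint32_t a, uint32_t b)
{
    if (b >= a) __builtin_unreachable();   // promise b < a, so the quotient fits in 32 bits
    return ((uint64_t)b << 32) / a;        // still compiled with the 64-bit div (RDX:RAX / r64), not the faster 32-bit form
}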
_udiv64 is available starting in Visual Studio 2019 RTM.
In C mode (-TC) it's apparently always defined. In C++ mode you need to #include <immintrin.h> (or intrin.h), as per the Microsoft docs.
https://godbolt.org/z/vVZ25L (Or on Godbolt.ms, because recent MSVC on the main Godbolt site is not working; see footnote 1.)
#include <stdint.h>
#include <immintrin.h> // defines the prototype
// pre-condition: a > b else 64/32-bit division overflows
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
uint32_t remainder;
uint64_t d = ((uint64_t) b) << 32;
return _udiv64(d, a, &remainder);
}
int main() {
uint32_t c = ScaledDiv(5, 4);
return c;
}
_udiv64 will produce 64/32 div. The two shifts left and right are a missed optimization.
;; MSVC 19.20 -O2 -TC
a$ = 8
b$ = 16
ScaledDiv PROC ; COMDAT
mov edx, edx
shl rdx, 32 ; 00000020H
mov rax, rdx
shr rdx, 32 ; 00000020H
div ecx
ret 0
ScaledDiv ENDP
main PROC ; COMDAT
xor eax, eax
mov edx, 4
mov ecx, 5
div ecx
ret 0
main ENDP
So we can see that MSVC doesn't do constant-propagation through _udiv64, even though in this case it doesn't overflow and it could have compiled main to just mov eax, 0ccccccccH / ret.
UPDATE #2 https://godbolt.org/z/n3Dyp-
Added a solution with Intel C++ Compiler, but this is less efficient and will defeat constant-propagation because it's inline asm.
#include <stdio.h>
#include <stdint.h>
__declspec(regcall, naked) uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
__asm mov edx, eax
__asm xor eax, eax
__asm div ecx
__asm ret
// implicit return of EAX is supported by MSVC, and hopefully ICC
// even when inlining + optimizing
}
int main()
{
uint32_t a = 3 , b = 4, c = ScaledDiv(a, b);
printf( "(%u << 32) / %u = %u\n", a, b, c);
uint32_t d = ((uint64_t)a << 32) / b;
printf( "(%u << 32) / %u = %u\n", a, b, d);
return c != d;
}
Footnote 1: Matt Godbolt's main site's non-WINE MSVC compilers are temporarily(?) gone. Microsoft runs https://www.godbolt.ms/ to host the recent MSVC compilers on real Windows, and normally the main Godbolt.org site relays to that for MSVC.
It seems godbolt.ms will generate short links, but not expand them again! Full links are better anyway for their resistance to link-rot.
@Alex Lopatin's answer shows how to use _udiv64 to get non-terrible scalar code (despite MSVC's stupid missed optimization of shifting left and then right).
For compilers that support GNU C inline asm (including ICC), you can use that instead of the inefficient MSVC inline asm syntax that has a lot of overhead for wrapping a single instruction. See What is the difference between 'asm', '__asm' and '__asm__'? for an example wrapping 64-bit / 32-bit => 32-bit idiv. (Use it for div by just changing the mnemonic and the types to unsigned.) GNU C doesn't have an intrinsic for 64 / 32 or 128 / 64 division; it's supposed to optimize pure C. But unfortunately GCC / Clang / ICC have missed optimizations for this case even using if(a<=b) __builtin_unreachable(); to promise that a>b.
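A sketch of that approach for unsigned 64b/32b => 32b division (adapted by me, not copied from the linked answer):
#include <stdint.h>

// Caller must guarantee (dividend >> 32) < divisor, i.e. the quotient fits
// in 32 bits, otherwise the div instruction raises #DE.
static inline uint32_t udiv64_32(uint64_t dividend, uint32_t divisor, uint32_t *rem)
{
    uint32_t quot, r;
    __asm__ ("divl %4"
             : "=a"(quot), "=d"(r)
             : "0"((uint32_t)dividend), "1"((uint32_t)(dividend >> 32)), "rm"(divisor)
             : "cc");
    *rem = r;
    return quot;
}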
But that's still scalar division, with pretty poor throughput.
Perhaps you can use a GPU for your DSP task? If you have a large enough batch of work (and the rest of your algorithm is GPU-friendly) then it's probably worth the overhead of the communication round trip to the GPU.
If you're using the CPU, then anything we can suggest will benefit from parallelizing over multiple cores, so do that for more throughput.
x86 SIMD (SSE4/AVX2/AVX512*) doesn't have SIMD integer division in hardware. The Intel SVML functions _mm_div_epu64 and _mm256_div_epu64 are not intrinsics for a real instruction; they're slow functions that maybe unpack to scalar or compute multiplicative inverses. Or whatever other trick they use; possibly the 32-bit division functions convert to SIMD vectors of double, especially if AVX512 is available. (Intel still calls them "intrinsics", maybe because they're like built-in functions that the compiler understands and can do constant-propagation through. They're probably as efficient as they can be, but that's "not very", and they need to handle the general case, not just your special case with the low half of one divisor being all zero and the quotient fitting in 32 bits.)
If you have the same divisor for many elements, see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly. (You should adapt that technique to bake in the shifting of the dividend without actually doing it, leaving the all-zero low half implicit.)
If your divisor is always varying, and this isn't a middle step in some larger SIMD-friendly algorithm, scalar division may well be your best bet if you need exact results.
You could get big speedups from using SIMD float if 24-bit mantissa precision is sufficient:
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
return ((1ULL<<32) * (float)b) / a;
}
(float)(1ULL<<32) is a compile-time constant 4294967296.0f.
This does auto-vectorize over an array, with gcc and clang even without -ffast-math (but not MSVC). See it on Godbolt. You could port gcc or clang's asm back to intrinsics for MSVC; they use some FP tricks for packed-conversion of unsigned integers to/from float without AVX512. Non-vectorized scalar FP will probably be slower than plain integer on MSVC, as well as less accurate.
For example, Skylake's div r32 throughput is 1 per 6 cycles. But its AVX vdivps ymm throughput is one instruction (of 8 floats) per 5 cycles. Or for 128-bit SSE2, divps xmm has one per 3 cycle throughput. So you get about 10x the division throughput from AVX on Skylake (8 * 6/5 = 9.6). Older microarchitectures have much slower SIMD FP division, but also somewhat slower integer division. In general the ratio is smaller because older CPUs don't have as wide SIMD dividers, so 256-bit vdivps has to run the 128-bit halves through separately. But there's still plenty of gain to be had, like better than a factor of 4 on Haswell. And Ryzen has vdivps ymm throughput of 6c, but div r32 throughput of 14-30 cycles. So that's an even bigger speedup than on Skylake.
If the rest of your DSP task can benefit from SIMD, the overall speedup should be very good. float operations have higher latency, so out-of-order execution has to work harder to hide that latency and overlap execution of independent loop iterations. So IDK whether it would be better for you to just convert to float and back for this one operation, or to change your algorithm to work with float everywhere. It depends what else you need to do with your numbers.
If your unsigned numbers actually fit into signed 32-bit integers, you can use direct hardware support for packed SIMD int32 -> float conversion. Otherwise you need AVX512F for packed uint32 -> float with a single instruction, but that can be emulated with some loss of efficiency. That's what gcc/clang do when auto-vectorizing with AVX2, and why MSVC doesn't auto-vectorize.
MSVC does auto-vectorize with int32_t instead of uint32_t (and gcc/clang can make more efficient code), so prefer that if the highest bit of your integer inputs and/or outputs can't be set. (i.e. the 2's complement interpretation of their bit-patterns will be non-negative.)
With AVX especially, vdivps is slow enough to mostly hide the throughput costs of converting from integer and back, unless there's other useful work that could have overlapped instead.
Floating point precision:
A float stores numbers as significand * 2^exp where the significand is in the range [1.0, 2.0). (Or [0, 1.0) for subnormals). A single-precision float has 24-bits of significand precision, including the 1 implicit bit.
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
So the 24 most-significant bits of an integer can be represented, the rest lost to rounding error. An integer like (uint64_t)b << 32 is no problem for float; that just means a larger exponent. The low bits are all zero.
For example, b = 123105810 gives us 528735427897589760 for b64 << 32. Converting that to float directly from 64-bit integer gives us 528735419307655168, a rounding error of 0.0000016%, or about 2^-25.8. That's unsurprising: the max rounding error is 0.5ulp (units in the last place), or 2^-25, and this number was even so it had 1 trailing zero anyway. That's the same relative error we'd get from converting 123105810; the resulting float is also the same except for its exponent field (which is higher by 32).
(I used https://www.h-schmidt.net/FloatConverter/IEEE754.html to check this.)
float's max exponent is large enough to hold integers outside the INT64_MIN to INT64_MAX range. The low bits of the large integers that float can represent are all zero, but that's exactly what you have with b<<32. So you're only losing the low 9 bits of b in the worst case where it's full-range and odd.
If the important part of your result is the most-significant bits, and having the low ~9 integer bits = rounding error is ok after converting back to integer, then float is perfect for you.
If float doesn't work, double may be an option.
divpd is about twice as slow as divps on many CPUs, and only does half as much work (2 double elements instead of 4 float). So you lose a factor of 4 throughput this way.
But every 32-bit integer can be represented exactly as a double. And by converting back with truncation towards zero, I think you get exact integer division for all pairs of inputs, unless double-rounding is a problem (first to nearest double, then truncation). You can test it with
// exactly correct for most inputs at least, maybe all.
uint32_t quotient = ((1ULL<<32) * (double)b) / a;
The unsigned long long constant (1ULL<<32) is converted to double, so you have 2x u32 -> double conversions (of a and b), a double multiply, a double divide, and a double -> u32 conversion. x86-64 can do all of these efficiently with scalar conversions (by zero extending uint32_t into int64_t, or ignoring the high bits of a double->int64_t conversion), but it will probably still be slower than div r32.
Converting u32 -> double and back (without AVX512) is maybe even more expensive than converting u32 -> float, but clang does auto-vectorize it.
(Just change float to double in the godbolt link above). Again it would help a lot if your inputs were all <= INT32_MAX so they could be treated as signed integers for FP conversion.
If double-rounding is a problem, you could maybe set the FP rounding mode to truncation instead of the default round-to-nearest, if you don't use FP for anything else in the thread where your DSP code is running.
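A sketch of that rounding-mode switch with standard <fenv.h> (whether the compiler honors FENV_ACCESS and keeps the FP conversions ordered around it is an assumption you would need to verify):
#include <fenv.h>
// Strictly you also need "#pragma STDC FENV_ACCESS ON", which gcc and clang
// have historically not fully implemented, so don't let the optimizer move
// conversions across this call.  (On glibc, fesetround may need -lm.)

void use_truncating_fp_rounding(void) {
    fesetround(FE_TOWARDZERO);   // this thread's FP ops now round toward zero (truncate)
}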

Why does integer division by -1 (negative one) result in FPE?

Why does using mod with an int64_t operand make this function 150% slower?

The max_rem function computes the maximum remainder that (a+1)^n + (a-1)^n leaves when divided by a² for n = 1, 2, 3, .... main calls max_rem on every a from 3 to 999. Complete code:
#include <inttypes.h>
#include <stdio.h>
int max_rem(int a) {
int max_r = 0;
int m = a * a; // <-------- offending line
int r1 = a+1, r2 = a-1;
for(int n = 1; n <= a*a; n++) {
r1 = (r1 * (a + 1)) % m;
r2 = (r2 * (a - 1)) % m;
int r = (r1 + r2) % m;
if(max_r < r)
max_r = r;
}
return max_r;
}
int main() {
int64_t sum = 0;
for(int a = 3; a < 1000; a++)
sum += max_rem(a);
printf("%ld\n", sum);
}
If I change line 6 from:
int m = a * a;
to
int64_t m = a * a;
the whole computation becomes about 150% slower. I tried both with gcc 5.3 and clang 3.6.
With int:
$ gcc -std=c99 -O3 -Wall -o 120 120.c
$ time(./120)
real 0m3.823s
user 0m3.816s
sys 0m0.000s
with int64_t:
$ time(./120)
real 0m9.861s
user 0m9.836s
sys 0m0.000s
and yes, I'm on a 64-bit system. Why does this happen?
I've always assumed that using int64_t is safer and more portable and "the modern way to write C"®, and that it wouldn't harm performance on 64-bit systems for numeric code. Is this assumption erroneous?
EDIT: just to be clear: the slowdown persists even if you change every variable to int64_t. So this is not a problem with mixing int and int64_t.
I've always assumed that using int64_t is safer and more portable and "the modern way to write C"®, and that it wouldn't harm performance on 64-bit systems for numeric code. Is this assumption erroneous?
It seems so to me. You can find the instruction timings in Intel's Software Optimization Reference manual (appendix C, table C-17 General Purpose Instructions on page 645):
IDIV r64 Throughput 85-100 cycles per instruction
IDIV r32 Throughput 20-26 cycles per instruction
TL;DR: You see different performance with the change of types because you are measuring different computations -- one with all 32-bit data, the other with partially or all 64-bit data.
I've always assumed that using int64_t is safer and more portable and "the modern way to write C"®
int64_t is the safest and most portable (among conforming C99 and C11 compilers) way to refer to a 64-bit signed integer type with no padding bits and a two's complement representation, if the implementation in fact provides such a type. Whether using this type actually makes your code more portable depends on whether the code depends on any of those specific characteristics of integer representation, and on whether you are concerned with portability to environments that do not provide such a type.
and that it wouldn't harm performance on 64-bit systems for numeric code. Is this assumption erroneous?
int64_t is specified to be a typedef. On any given system, using int64_t is semantically identical to directly using the type that underlies the typedef on that system. You will see no performance difference between those alternatives.
However, your line of reasoning and your question seem to rest on an assumption: either that on the system where you ran your tests the basic type underlying int64_t is int, or that 64-bit arithmetic performs identically to 32-bit arithmetic on that system. Neither of those assumptions is justified. It is by no means guaranteed that C implementations for 64-bit systems make int a 64-bit type, and in particular neither GCC nor Clang for x86_64 does so. Moreover, C has nothing whatsoever to say about the relative performance of arithmetic on different types, and as others have pointed out, native x86_64 integer-division instructions are in fact slower for 64-bit operands than for 32-bit operands. Other platforms might exhibit other differences.
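You can check this directly on your own system; the sizes in the comments below are what GCC and Clang report under the LP64 model used on x86_64 Linux (other platforms may differ):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    printf("sizeof(int)     = %zu\n", sizeof(int));     /* 4 under LP64 */
    printf("sizeof(long)    = %zu\n", sizeof(long));    /* 8 under LP64 */
    printf("sizeof(int64_t) = %zu\n", sizeof(int64_t)); /* 8 by definition */
    return 0;
}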
Integer division / modulo is extremely slow compared to any other operation, and unlike most operations on modern hardware its cost depends on the operand size (see the end of this answer).
For repeated use of the same modulus, you will get much better performance from finding the multiplicative inverse for your integer divisor. Compilers do this for you for compile-time constants, but it's moderately expensive in time and code-size to do it at run-time, so with current compilers you have to decide for yourself when it's worth doing.
It takes some CPU cycles up front, but they're amortized over 3 divisions per iteration.
The reference paper for this idea is Granlund and Montgomery's 1994 paper, back when divide was only 4x as expensive as multiply on P5 Pentium hardware. That paper talks about implementing the idea in gcc 2.6, as well as the mathematical proof that it works.
Compiler output shows the kind of code that division by a small constant turns into:
## clang 3.8 -O3 -mtune=haswell for x86-64 SysV ABI: first arg in rdi
int mod13 (int a) { return a%13; }
movsxd rax, edi # sign-extend 32bit a into 64bit rax
imul rcx, rax, 1321528399 # gcc uses one-operand 32bit imul (32x32 => 64b), which is faster on Atom but slower on almost everything else. I'm showing clang's output because it's simpler
mov rdx, rcx
shr rdx, 63 # 0 or 1: extract the sign bit with a logical right shift
sar rcx, 34 # only use the high half of the 32x32 => 64b multiply
add ecx, edx # ecx = a/13. # adding the sign bit accounts for the rounding semantics of C integer division with negative numbers
imul ecx, ecx, 13 # do the remainder as a - (a/13)*13
sub eax, ecx
ret
And yes, all this is cheaper than a div instruction, for throughput and latency.
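In C terms, that asm computes something like the sketch below. This mirrors clang's output rather than being a general recipe; note that right-shifting a negative value is implementation-defined in ISO C, but it is an arithmetic shift on gcc/clang, which is what the sar relies on.
#include <stdint.h>

/* Sketch of the multiplicative-inverse trick for the constant divisor 13.
 * 1321528399 is ceil(2^34 / 13), the magic constant from the asm above. */
static int div13(int a) {
    int64_t p = (int64_t)a * 1321528399LL;  /* widening 32x32 => 64 multiply */
    int q = (int)(p >> 34);                 /* sar rcx, 34 */
    q += (int)((uint64_t)p >> 63);          /* add the sign bit: rounds toward zero for negative a */
    return q;                               /* == a / 13 with C's truncating division */
}

static int mod13(int a) {
    return a - div13(a) * 13;               /* remainder = a - (a/13)*13 */
}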
I tried to google for simpler descriptions or calculators, and found stuff like this page.
On modern Intel CPUs, 32- and 64-bit multiply has one-per-clock throughput and 3-cycle latency (i.e. it's fully pipelined).
Division is only partially pipelined (the div unit can't accept one input per clock), and unlike most instructions, has data-dependent performance:
From Agner Fog's insn tables (see also the x86 tag wiki):
Intel Core2: idiv r32: one per 12-36c throughput (18-42c latency, 4 uops).
idiv r64: one per 28-40c throughput (39-72c latency, 56 uops). (unsigned div is significantly faster: 32 uops, one per 18-37c throughput)
Intel Haswell: div/idiv r32: one per 8-11c throughput (22-29c latency, 9 uops).
idiv r64: one per 24-81c throughput (39-103c latency, 59 uops). (unsigned div: one per 21-74c throughput, 36 uops)
Skylake: div/idiv r32: one per 6c throughput (26c latency, 10 uops).
64b: one per 24-90c throughput (42-95c latency, 57 uops). (unsigned div: one per 21-83c throughput, 36 uops)
So on Intel hardware, unsigned division is cheaper for 64-bit operands and about the same for 32-bit operands.
The throughput difference between 32-bit and 64-bit idiv can easily account for the 150% slowdown. Your code is completely throughput bound, since you have plenty of independent operations, especially between loop iterations. The loop-carried dependency is just a cmov for the max operation.
The answer to this question can only come from looking at the assembly. I'd run it on my box out of curiosity, but it's 3000 miles away :( so I'll have to guess, and you can look and post your findings here...
Just add -S to your compiler command line.
I believe that with int64 the compilers are doing something different than with int32. That is, they cannot use some optimization that is available to them with int32.
Maybe gcc replaces the division with multiplication only with int32? There should be an 'if( x < 0 )' branch. Maybe gcc can eliminate it with int32?
I somehow don't believe the performance can be so different if they both do a plain 'idiv'.

MUL(Assembler) in C

In assembler I can use the MUL instruction and get a 64-bit result in EDX:EAX;
how can I do the same in C? http://siyobik.info/index.php?module=x86&id=210
My approach of using a uint64_t and shifting the result doesn't work^^
Thank you for your help (=
Me
Any decent compiler will just do it when asked.
For example using VC++ 2010, the following code:
unsigned long long result ;
unsigned long a = 0x12345678 ;
unsigned long b = 0x87654321 ;
result = (unsigned long long)a * b ;
generates the following assembler:
mov eax,dword ptr [b]
mov ecx,dword ptr [a]
mul eax,ecx
mov dword ptr [result],eax
mov dword ptr [a],edx
Post some code. This works for me:
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    uint32_t x, y;
    uint64_t z;

    x = 0x10203040;
    y = 0x3000;
    z = (uint64_t)x * y;
    printf("%016" PRIX64 "\n", z);
    return 0;
}
See if you can get the equivalent of __emul or __emulu for your compiler (or just use these if you've got an MS compiler). Though a 64-bit multiply should work automatically, unless you're sitting behind some restriction or other funny problem (like _aulmul).
You mean to multiply two 32 bit quantities to obtain a 64 bit result?
C by itself does not provide for this directly: either you have two 32-bit operands such as uint32_t, and then the result has the same width, or you cast one operand to uint64_t beforehand, but then you lose the advantage of that special (and fast) multiply.
The only way I see is to use inline assembler extensions. gcc is quite good at this; you can produce quite optimal code. But it isn't portable between different compilers. (Many freely available compilers adopt gcc's syntax, though, I think.)
#include <stdint.h>

/* The name says it all: multiply two 32 bit unsigned ints and get
 * one 64 bit unsigned int.
 */
uint64_t mul_U32xU32_u64(uint32_t a, uint32_t b) {
    return a * (uint64_t)b; /* Note about the cast below. */
}
This produces:
mul_U32xU32_u64:
movl 8(%esp), %eax
mull 4(%esp)
popl %ebp
ret
When compiled with:
gcc -m32 -O3 -fomit-frame-pointer -S mul.c
Which uses the mul instruction (called mull here for multiply long, which is how the gnu assembler for x86 likes it) in the way that you want.
In this case one of the parameters was pulled directly from the stack rather than placed in a register (the 4(%esp) thing means 4 bytes above the stack pointer, and the 4 bytes being skipped over are the return address) because the numbers were passed into the function and would have been pushed onto the stack (as per the x86 ABI (application binary interface) ).
If you inlined the function, or just did the math directly in your code, it would most likely still use the mul instruction in many cases, though optimizing compilers may also replace some multiplications with simpler code if they can tell that it would work (for instance, turning this into a shift, or even a constant, if one or more of the arguments is known).
In the C code, at least one of the arguments had to be cast to a 64-bit value so that the compiler would produce a 64-bit result. Even when the compiler uses an instruction that produces a 64-bit result for a 32-bit multiply, it is allowed to ignore the top half, because by the rules of C (the usual arithmetic conversions) the operands are converted to a common type and the result has that same type, so a 32-bit * 32-bit multiply yields a 32-bit result.
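A minimal contrast of the two cases (assuming a 32-bit int and the <stdint.h> types, as on x86):
uint32_t x = 0x10203040, y = 0x3000;
uint64_t truncated = x * y;            /* multiply is done in 32 bits; the high half is already gone */
uint64_t full      = (uint64_t)x * y;  /* one operand widened first, so the full 64-bit product survives */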
You cannot do exactly that in C, i.e. you cannot multiply two N-bit values and obtain a 2N-bit value as the result. The semantics of C multiplication are different from those of your machine's multiplication. In C the multiplication operator is always applied to two values of the same type T (the so-called usual arithmetic conversions take care of that) and produces a result of that same type T.
If you run into overflow on multiplication, you have to use a bigger type for the operands. If there's no bigger type, you are out of luck (i.e. you have no other choice but to use library-level implementation of large multiplication).
For example, if the largest integer type of your platform is a 64-bit type, then at the assembly level your machine has a mul operation producing the full 128-bit result, but at the language level you have no access to such a multiplication.
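(As an aside, and going beyond standard C: GCC and Clang on 64-bit targets expose a non-standard __int128 type, so a widening 64x64 => 128 multiply can be written as in the sketch below; the function name is just an illustration.)
#include <stdint.h>

/* Non-standard: __int128 is a GCC/Clang extension on 64-bit targets.
 * These compilers lower this to a single widening MUL on x86-64. */
static inline unsigned __int128 mul_u64xu64_u128(uint64_t a, uint64_t b) {
    return (unsigned __int128)a * b;
}
/* Split the product back into halves if needed:
 *   uint64_t lo = (uint64_t)p;
 *   uint64_t hi = (uint64_t)(p >> 64);
 */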

Resources