Fastest method of vectorized integer division by non-constant divisor - c

Based on the answers/comments of this question I wrote a performance test with gcc 4.9.2 (MinGW64) to estimate which way of doing multiple integer divisions is faster, as follows:
#include <emmintrin.h> // SSE2
static unsigned short x[8] = {0, 55, 2, 62003, 786, 5555, 123, 32111}; // Dividend
__attribute__((noinline)) static void test_div_x86(unsigned i){
    for(; i; --i)
        x[0] /= i,
        x[1] /= i,
        x[2] /= i,
        x[3] /= i,
        x[4] /= i,
        x[5] /= i,
        x[6] /= i,
        x[7] /= i;
}
__attribute__((noinline)) static void test_div_sse(unsigned i){
    for(; i; --i){
        __m128i xmm0 = _mm_loadu_si128((const __m128i*)x);
        __m128 xmm1 = _mm_set1_ps(i);
        _mm_storeu_si128(
            (__m128i*)x,
            _mm_packs_epi32(
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpacklo_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                ),
                _mm_cvtps_epi32(
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpackhi_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                )
            )
        );
    }
}
int main(){
    const unsigned runs = 40000000; // Choose a big number, so the compiler doesn't dare to unroll loops and optimize with constants
    test_div_x86(runs),
    test_div_sse(runs);
    return 0;
}
The results, measured with GNU gprof (tool invocations and parameters shown below):
/*
gcc -O? -msse2 -pg -o test.o -c test.c
g++ -o test test.o -pg
test
gprof test.exe gmon.out
-----------------------------------
test_div_sse(unsigned int) test_div_x86(unsigned int)
-O0 2.26s 1.10s
-O1 1.41s 1.07s
-O2 0.95s 1.09s
-O3 0.77s 1.07s
*/
Now I'm confused why the x86 test barely gets optimized while the SSE test becomes faster despite the expensive conversion to and from floating point. Furthermore, I'd like to know how much these results depend on compilers and architectures.
To summarize: which is faster in the end, dividing one-by-one or taking the floating-point detour?

Dividing all elements of a vector by the same scalar can be done with integer multiply and shift. libdivide (C/C++, zlib license) provides some inline functions to do this for scalars (e.g. int), and for dividing vectors by scalars. Also see SSE integer division? (as you mention in your question) for a similar technique giving approximate results. It's more efficient if the same scalar will be applied to lots of vectors. libdivide doesn't say anything about the results being inexact, but I haven't investigated.
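To make the "generate once, apply many times" idea concrete, here is a minimal scalar sketch assuming libdivide's documented C API (libdivide_u32_gen / libdivide_u32_do; check the header you're using for the exact names):

#include <stdint.h>
#include <stddef.h>
#include "libdivide.h"   // single-header library from libdivide.com

// Divide many dividends by the same runtime divisor: generate the magic
// multiplier/shift once, then every division is just a multiply + shift.
void divide_all(uint32_t *data, size_t n, uint32_t divisor)
{
    struct libdivide_u32_t fast_d = libdivide_u32_gen(divisor);  // the expensive setup, done once
    for (size_t i = 0; i < n; i++)
        data[i] = libdivide_u32_do(data[i], &fast_d);            // no div instruction here
}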
re: your code:
You have to be careful about checking what the compiler actually produces, when giving it a trivial loop like that. e.g. is it actually loading/storing back to RAM every iteration? Or is it keeping variables live in registers, and only storing at the end?
Your benchmark is skewed in favour of the integer-division loop, because the vector divider isn't kept 100% occupied in the vector loop, but the integer divider is kept 100% occupied in the int loop. (These paragraphs were added after the discussion in comments. The previous answer didn't explain as much about keeping the dividers fed, and dependency chains.)
You only have a single dependency chain in your vector loop, so the vector divider sits idle for several cycles every iteration after producing the 2nd result, while the chain of convert fp->si, pack, unpack, convert si->fp happens. You've set things up so your throughput is limited by the length of the entire loop-carried dependency chain, rather than by the throughput of the FP dividers. If the data each iteration was independent (or there were at least several independent values, like how you have 8 array elements for the int loop), then the unpack/convert and convert/pack of one set of values would overlap with the divps execution time for another vector. The vector divider is only partially pipelined, but everything else is fully pipelined.
This is the difference between throughput and latency, and why it matters for a pipelined out-of-order execution CPU.
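To make that concrete, here is a rough sketch (my own refactoring and naming, not the question's code) of a benchmark loop with two independent dependency chains, so the unpack/convert/pack work for one vector can overlap the divps latency of the other:

#include <emmintrin.h>

static unsigned short x1[8] = {0, 55, 2, 62003, 786, 5555, 123, 32111};
static unsigned short x2[8] = {1, 2, 3, 4, 5, 6, 7, 8};   // a second, independent array

// Same operations as the question's loop body: divide 8 unsigned shorts by a
// float-vector divisor, going through float and back.
static __m128i div_u16x8_by(__m128i v, __m128 divisor)
{
    __m128i zero = _mm_setzero_si128();
    __m128 lo = _mm_cvtepi32_ps(_mm_unpacklo_epi16(v, zero));
    __m128 hi = _mm_cvtepi32_ps(_mm_unpackhi_epi16(v, zero));
    return _mm_packs_epi32(_mm_cvtps_epi32(_mm_div_ps(lo, divisor)),
                           _mm_cvtps_epi32(_mm_div_ps(hi, divisor)));
}

__attribute__((noinline)) static void test_div_sse_2chains(unsigned i)
{
    for (; i; --i) {
        __m128 vi = _mm_set1_ps((float)i);
        // Two independent load/divide/store chains per iteration: the shuffle and
        // conversion work for one array can overlap the divps of the other.
        _mm_storeu_si128((__m128i*)x1, div_u16x8_by(_mm_loadu_si128((const __m128i*)x1), vi));
        _mm_storeu_si128((__m128i*)x2, div_u16x8_by(_mm_loadu_si128((const __m128i*)x2), vi));
    }
}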
Other stuff in your code:
You have __m128 xmm1 = _mm_set1_ps(i); in the inner loop. _set1 with an arg that isn't a compile-time constant is usually at least 2 instructions: movd and pshufd. And in this case, an int-to-float conversion, too. Keeping a float-vector version of your loop counter, which you increment by adding a vector of 1.0, would be better. (Although this probably isn't throwing off your speed test any further, because this excess computation can overlap with other stuff.)
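A rough sketch of that suggestion (reusing the div_u16x8_by helper and x1 array from the sketch above; again my naming, not the question's code):

__attribute__((noinline)) static void test_div_sse_fcounter(unsigned i)
{
    __m128 vdiv = _mm_set1_ps((float)i);        // int->float conversion and broadcast done once
    const __m128 vone = _mm_set1_ps(1.0f);
    for (; i; --i) {
        _mm_storeu_si128((__m128i*)x1, div_u16x8_by(_mm_loadu_si128((const __m128i*)x1), vdiv));
        vdiv = _mm_sub_ps(vdiv, vone);          // next iteration's divisor: a single subps
    }
}

Caveat: this only tracks the integer counter exactly while it fits in float's 24-bit significand (up to 2^24); for larger trip counts you'd keep an integer counter as well.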
Unpacking with zero works fine. SSE4.1 __m128i _mm_cvtepi16_epi32 (__m128i a) is another way. pmovsxwd is the same speed, but doesn't need a zeroed register.
If you're going to convert to FP for divide, have you considered just keeping your data as FP for a while? Depends on your algorithm how you need rounding to happen.
performance on recent Intel CPUs
divps (packed single float) is 10-13 cycle latency, with a throughput of one per 7 cycles, on recent Intel designs. div / idiv r16 ((unsigned) integer divide in GP reg) is 23-26 cycle latency, with one per 9 or 8 cycle throughput. div is 11 uops, so it even gets in the way of other things issuing / executing for some of the time it's going through the pipeline. (divps is a single uop.) So, Intel CPUs are not really designed to be fast at integer division, but make an effort for FP division.
So just for the division alone, a single integer division is slower than a vector FP division. You're going to come out ahead even with the conversion to/from float, and the unpack/pack.
If you can do the other integer ops in vector regs, that would be ideal. Otherwise you have to get the integers into / out of vector regs. If the ints are in RAM, a vector load is fine. If you're generating them one at a time, PINSRW is an option, but it's possible that just storing to memory to set up for a vector load would be a faster way to load a full vector. Similar for getting the data back out, with PEXTRW or by storing to RAM. If you want the values in GP registers, skip the pack after converting back to int, and just MOVD / PEXTRD from whichever of the two vector regs your value is in. insert/extract instructions take two uops on Intel, which means they take up two "slots", compared to most instructions taking only one fused-domain uop.
Your timing results, showing that the scalar code doesn't improve with compiler optimizations, come about because the CPU can overlap the verbose non-optimized load/store instructions for other elements while the divide unit is the bottleneck. The vector loop, on the other hand, only has one or two dependency chains, with every iteration dependent on the previous one, so extra instructions adding latency can't be overlapped with anything. Testing with -O0 is pretty much never useful.

Related

How can I instruct the MSVC compiler to use a 64bit/32bit division instead of the slower 128bit/64bit division?

How can I tell the MSVC compiler to use the 64bit/32bit division operation to compute the result of the following function for the x86-64 target:
#include <stdint.h>
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
    if (a > b)
        return ((uint64_t)b<<32) / a; //Yes, this must be casted because the result of b<<32 is undefined
    else
        return uint32_t(-1);
}
I would like the code, when the if statement is true, to compile to use the 64bit/32bit division operation, e.g. something like this:
; Assume arguments on entry are: Dividend in EDX, Divisor in ECX
mov edx, edx ;A dummy instruction to indicate that the dividend is already where it is supposed to be
xor eax,eax
div ecx ; EAX = EDX:EAX / ECX
...however the x64 MSVC compiler insists on using the 128bit/64bit div instruction, such as:
mov eax, edx
xor edx, edx
shl rax, 32 ; Scale up the dividend
mov ecx, ecx
div rcx ;RAX = RDX:RAX / RCX
See: https://www.godbolt.org/z/VBK4R71
According to the answer to this question, the 128bit/64bit div instruction is not faster than the 64bit/32bit div instruction.
This is a problem because it unnecessarily slows down my DSP algorithm which makes millions of these scaled divisions.
I tested this optimization by patching the executable to use the 64bit/32bit div instruction: The performance increased 28% according to the two timestamps yielded by the rdtsc instructions.
(Editor's note: presumably on some recent Intel CPU. AMD CPUs don't need this micro-optimization, as explained in the linked Q&A.)
No current compilers (gcc/clang/ICC/MSVC) will do this optimization from portable ISO C source, even if you let them prove that b < a so the quotient will fit in 32 bits. (For example with GNU C if(b>=a) __builtin_unreachable(); on Godbolt). This is a missed optimization; until that's fixed, you have to work around it with intrinsics or inline asm.
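For reference, the portable GNU C attempt with that hint is just a few lines (a sketch of what "letting the compiler prove b < a" looks like; per the above, it still compiles to a 64-bit div with current compilers):

#include <stdint.h>

uint32_t ScaledDiv_hint(uint32_t a, uint32_t b)
{
    if (b >= a) __builtin_unreachable();   // promise the compiler the quotient fits in 32 bits
    return ((uint64_t)b << 32) / a;        // still emitted as a 64-bit div today
}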
(Or use a GPU or SIMD instead; if you have the same divisor for many elements see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly.)
_udiv64 is available starting in Visual Studio 2019 RTM.
In C mode (-TC) it's apparently always defined. In C++ mode, you need to #include <immintrin.h> (or intrin.h), as per the Microsoft docs.
https://godbolt.org/z/vVZ25L (Or on Godbolt.ms, because recent MSVC on the main Godbolt site is not working; see footnote 1.)
#include <stdint.h>
#include <immintrin.h> // defines the prototype
// pre-condition: a > b else 64/32-bit division overflows
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
    uint32_t remainder;
    uint64_t d = ((uint64_t) b) << 32;
    return _udiv64(d, a, &remainder);
}

int main() {
    uint32_t c = ScaledDiv(5, 4);
    return c;
}
_udiv64 will produce 64/32 div. The two shifts left and right are a missed optimization.
;; MSVC 19.20 -O2 -TC
a$ = 8
b$ = 16
ScaledDiv PROC ; COMDAT
mov edx, edx
shl rdx, 32 ; 00000020H
mov rax, rdx
shr rdx, 32 ; 00000020H
div ecx
ret 0
ScaledDiv ENDP
main PROC ; COMDAT
xor eax, eax
mov edx, 4
mov ecx, 5
div ecx
ret 0
main ENDP
So we can see that MSVC doesn't do constant-propagation through _udiv64, even though in this case it doesn't overflow and it could have compiled main to just mov eax, 0ccccccccH / ret.
UPDATE #2 https://godbolt.org/z/n3Dyp-
Added a solution with Intel C++ Compiler, but this is less efficient and will defeat constant-propagation because it's inline asm.
#include <stdio.h>
#include <stdint.h>
__declspec(regcall, naked) uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
    __asm mov edx, eax
    __asm xor eax, eax
    __asm div ecx
    __asm ret
    // implicit return of EAX is supported by MSVC, and hopefully ICC
    // even when inlining + optimizing
}
int main()
{
    uint32_t a = 3 , b = 4, c = ScaledDiv(a, b);
    printf( "(%u << 32) / %u = %u\n", a, b, c);
    uint32_t d = ((uint64_t)a << 32) / b;
    printf( "(%u << 32) / %u = %u\n", a, b, d);
    return c != d;
}
Footnote 1: Matt Godbolt's main site's non-WINE MSVC compilers are temporarily(?) gone. Microsoft runs https://www.godbolt.ms/ to host the recent MSVC compilers on real Windows, and normally the main Godbolt.org site relays to that for MSVC.
It seems godbolt.ms will generate short links, but not expand them again! Full links are better anyway for their resistance to link-rot.
@Alex Lopatin's answer shows how to use _udiv64 to get non-terrible scalar code (despite MSVC's stupid missed optimization shifting left/right).
For compilers that support GNU C inline asm (including ICC), you can use that instead of the inefficient MSVC inline asm syntax that has a lot of overhead for wrapping a single instruction. See What is the difference between 'asm', '__asm' and '__asm__'? for an example wrapping 64-bit / 32-bit => 32-bit idiv. (Use it for div by just changing the mnemonic and the types to unsigned.) GNU C doesn't have an intrinsic for 64 / 32 or 128 / 64 division; it's supposed to optimize pure C. But unfortunately GCC / Clang / ICC have missed optimizations for this case even using if(a<=b) __builtin_unreachable(); to promise that a>b.
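As a rough sketch of such a wrapper (my naming and constraint choices; the linked answer shows the signed idiv version), an unsigned 64-bit / 32-bit => 32-bit division in GNU C inline asm could look like this:

#include <stdint.h>

// Caller must guarantee the quotient fits in 32 bits, otherwise div faults (#DE).
static inline uint32_t udiv64_32(uint64_t dividend, uint32_t divisor)
{
    uint32_t quotient, remainder;
    __asm__("divl %[v]"
            : "=a"(quotient), "=d"(remainder)
            : [v] "rm"(divisor),
              "a"((uint32_t)dividend),           // low half of the dividend in EAX
              "d"((uint32_t)(dividend >> 32))    // high half in EDX
            : "cc");
    return quotient;
}

With that, the a > b branch of ScaledDiv becomes return udiv64_32((uint64_t)b << 32, a);, and the compiler can place b straight into EDX without any actual shifting.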
But that's still scalar division, with pretty poor throughput.
Perhaps you can use a GPU for your DSP task? If you have a large enough batch of work (and the rest of your algorithm is GPU-friendly) then it's probably worth the overhead of the communication round trip to the GPU.
If you're using the CPU, then anything we can suggest will benefit from parallelizing over multiple cores, so do that for more throughput.
x86 SIMD (SSE4/AVX2/AVX512*) doesn't have SIMD integer division in hardware. The Intel SVML functions _mm_div_epu64 and _mm256_div_epu64 are not intrinsics for a real instruction, they're slow functions that maybe unpack to scalar or compute multiplicative inverses. Or whatever other trick they use; possibly the 32-bit division functions convert to SIMD vectors of double, especially if AVX512 is available. (Intel still calls them "intrinsics" maybe because they're like built-in functions that it understands and can do constant-propagation through. They're probably as efficient as they can be, but that's "not very", and they need to handle the general case, not just your special case with the low half of one dividend being all zero and the quotient fitting in 32 bits.)
If you have the same divisor for many elements, see https://libdivide.com/ for SIMD to compute a multiplicative inverse once and apply it repeatedly. (You should adapt that technique to bake in the shifting of the dividend without actually doing it, leaving the all-zero low half implicit.)
If your divisor is always varying, and this isn't a middle step in some larger SIMD-friendly algorithm, scalar division may well be your best bet if you need exact results.
You could get big speedups from using SIMD float if 24-bit mantissa precision is sufficient:
uint32_t ScaledDiv(uint32_t a, uint32_t b)
{
    return ((1ULL<<32) * (float)b) / a;
}
(float)(1ULL<<32) is a compile-time constant 4294967296.0f.
This does auto-vectorize over an array, with gcc and clang even without -ffast-math (but not MSVC). See it on Godbolt. You could port gcc or clang's asm back to intrinsics for MSVC; they use some FP tricks for packed-conversion of unsigned integers to/from float without AVX512. Non-vectorized scalar FP will probably be slower than plain integer on MSVC, as well as less accurate.
For example, Skylake's div r32 throughput is 1 per 6 cycles. But its AVX vdivps ymm throughput is one instruction (of 8 floats) per 5 cycles. Or for 128-bit SSE2, divps xmm has one per 3 cycle throughput. So you get about 10x the division throughput from AVX on Skylake. (8 * 6/5 = 9.6) Older microarchitectures have much slower SIMD FP division, but also somewhat slower integer division. In general the ratio is smaller because older CPUs don't have as wide SIMD dividers, so 256-bit vdivps has to run the 128-bit halves through separately. But there's still plenty of gain to be had, like better than a factor of 4 on Haswell. And Ryzen has vdivps ymm throughput of 6c, but div 32 throughput of 14-30 cycles. So that's an even bigger speedup than Skylake.
If the rest of your DSP task can benefit from SIMD, the overall speedup should be very good. float operations have higher latency, so out-of-order execution has to work harder to hide that latency and overlap execution of independent loop iterations. So IDK whether it would be better for you to just convert to float and back for this one operation, or to change your algorithm to work with float everywhere. It depends what else you need to do with your numbers.
If your unsigned numbers actually fit into signed 32-bit integers, you can use direct hardware support for packed SIMD int32 -> float conversion. Otherwise you need AVX512F for packed uint32 -> float with a single instruction, but that can be emulated with some loss of efficiency. That's what gcc/clang do when auto-vectorizing with AVX2, and why MSVC doesn't auto-vectorize.
MSVC does auto-vectorize with int32_t instead of uint32_t (and gcc/clang can make more efficient code), so prefer that if the highest bit of your integer inputs and/or outputs can't be set. (i.e. the 2's complement interpretation of their bit-patterns will be non-negative.)
With AVX especially, vdivps is slow enough to mostly hide the throughput costs of converting from integer and back, unless there's other useful work that could have overlapped instead.
Floating point precision:
A float stores numbers as significand * 2^exp where the significand is in the range [1.0, 2.0). (Or [0, 1.0) for subnormals). A single-precision float has 24-bits of significand precision, including the 1 implicit bit.
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
So the 24 most-significant bits of an integer can be represented, the rest lost to rounding error. An integer like (uint64_t)b << 32 is no problem for float; that just means a larger exponent. The low bits are all zero.
For example, b = 123105810 gives us 528735427897589760 for b64 << 32. Converting that to float directly from 64-bit integer gives us 528735419307655168, a rounding error of 0.0000016%, or about 2^-25.8. That's unsurprising: the max rounding error is 0.5ulp (units in the last place), or 2^-25, and this number was even so it had 1 trailing zero anyway. That's the same relative error we'd get from converting 123105810; the resulting float is also the same except for its exponent field (which is higher by 32).
(I used https://www.h-schmidt.net/FloatConverter/IEEE754.html to check this.)
float's max exponent is large enough to hold integers outside the INT64_MIN to INT64_MAX range. The low bits of the large integers that float can represent are all zero, but that's exactly what you have with b<<32. So you're only losing the low 9 bits of b in the worst case where it's full-range and odd.
If the important part of your result is the most-significant bits, and having the low ~9 integer bits = rounding error is ok after converting back to integer, then float is perfect for you.
If float doesn't work, double may be an option.
divpd is about twice as slow as divps on many CPUs, and only does half as much work (2 double elements instead of 4 float). So you lose a factor of 4 throughput this way.
But every 32-bit integer can be represented exactly as a double. And by converting back with truncation towards zero, I think you get exact integer division for all pairs of inputs, unless double-rounding is a problem (first to nearest double, then truncation). You can test it with
// exactly correct for most inputs at least, maybe all.
uint32_t quotient = ((1ULL<<32) * (double)b) / a;
The unsigned long long constant (1ULL<<32) is converted to double, so you have 2x u32 -> double conversions (of a and b), a double multiply, a double divide, and a double -> u32 conversion. x86-64 can do all of these efficiently with scalar conversions (by zero extending uint32_t into int64_t, or ignoring the high bits of a double->int64_t conversion), but it will probably still be slower than div r32.
Converting u32 -> double and back (without AVX512) is maybe even more expensive than converting u32 -> float, but clang does auto-vectorize it.
(Just change float to double in the godbolt link above). Again it would help a lot if your inputs were all <= INT32_MAX so they could be treated as signed integers for FP conversion.
If double-rounding is a problem, you could maybe set the FP rounding mode to truncation instead of the default round-to-nearest, if you don't use FP for anything else in the thread where your DSP code is running.
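A minimal sketch of that, using the standard <fenv.h> interface (assuming nothing else running in the thread depends on the default round-to-nearest mode):

#include <fenv.h>
#pragma STDC FENV_ACCESS ON   // tell the compiler the FP environment is accessed; not all compilers honor this pragma

void use_truncating_fp_rounding(void)
{
    fesetround(FE_TOWARDZERO);   // round toward zero (truncation) for subsequent FP ops in this thread
}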

Summing 8-bit integers in __m512i with AVX intrinsics

AVX512 provides us with intrinsics to sum all cells in an __m512 vector. However, some of their counterparts are missing: there is no _mm512_reduce_add_epi8, yet.
_mm512_reduce_add_ps //horizontal sum of 16 floats
_mm512_reduce_add_pd //horizontal sum of 8 doubles
_mm512_reduce_add_epi32 //horizontal sum of 16 32-bit integers
_mm512_reduce_add_epi64 //horizontal sum of 8 64-bit integers
Basically, I need to implement MAGIC in the following snippet.
__m512i all_ones = _mm512_set1_epi16(1);
short sum_of_ones = MAGIC(all_ones);
/* now sum_of_ones contains 32, the sum of 32 ones. */
The most obvious way would be using _mm512_storeu_epi8 and summing the elements of the array together, but that would be slow, plus it might invalidate the cache. I suppose there exists a faster approach.
Bonus points for implementing _mm512_reduce_add_epi16 as well.
First of all, _mm512_reduce_add_epi64 does not correspond to a single AVX512 instruction, but it generates a sequence of shuffles and additions.
To reduce 64 epu8 values to 8 epi64 values one usually uses the vpsadbw instruction (SAD=Sum of Absolute Differences) against a zero vector, which then can be reduced further:
long reduce_add_epu8(__m512i a)
{
    return _mm512_reduce_add_epi64(_mm512_sad_epu8(a, _mm512_setzero_si512()));
}
Try it on godbolt: https://godbolt.org/z/1rMiPH. Unfortunately, neither GCC nor Clang seem to be able to optimize away the function if it is used with _mm512_set1_epi16(1).
For epi8 instead of epu8 you need to first add 128 to each element (or xor with 0x80), then reduce it using vpsadbw and at the end subtract 64*128 (or 8*128 on each intermediate 64bit result). [Note this was wrong in a previous version of this answer]
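A sketch of that recipe (my naming, built on the same _mm512_reduce_add_epi64 helper as above):

long reduce_add_epi8(__m512i a)
{
    // XOR with 0x80 flips each byte's sign bit, mapping every epi8 value v to
    // the epu8 value v + 128, which vpsadbw can then sum against a zero vector.
    __m512i biased = _mm512_xor_si512(a, _mm512_set1_epi8((char)0x80));
    long sum = _mm512_reduce_add_epi64(
                   _mm512_sad_epu8(biased, _mm512_setzero_si512()));
    return sum - 64 * 128;   // remove the 64 per-element offsets of 128
}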
For epi16 I suggest having a look at what instructions _mm512_reduce_add_epi32 and _mm512_reduce_add_epi64 generate and derive from there what to do.
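For the bonus _mm512_reduce_add_epi16, one possible shortcut (a different route than deriving it from the epi32/epi64 shuffle sequences): widen adjacent pairs with vpmaddwd against a vector of ones, then reuse the 32-bit reduction:

int reduce_add_epi16(__m512i a)
{
    // vpmaddwd multiplies pairs of signed 16-bit elements and adds each pair
    // into a 32-bit lane; multiplying by 1 makes it a pairwise widening add.
    __m512i pair_sums = _mm512_madd_epi16(a, _mm512_set1_epi16(1));
    return _mm512_reduce_add_epi32(pair_sums);
}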
Overall, as @Mysticial suggested, it depends on your context what the best approach to reducing is. E.g., if you have a very large array of int64 and want a sum as int64, you should just add them together packet-wise and only at the very end reduce one packet to a single int64, as in the sketch below.
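For instance, a minimal sketch of that pattern for int64 (my naming; assumes the length is a multiple of 8 to keep the example short):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

int64_t sum_int64_array(const int64_t *p, size_t n)
{
    __m512i acc = _mm512_setzero_si512();
    for (size_t i = 0; i < n; i += 8)       // add one 512-bit packet per step
        acc = _mm512_add_epi64(acc, _mm512_loadu_si512((const void *)(p + i)));
    return _mm512_reduce_add_epi64(acc);    // a single horizontal reduction at the very end
}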

Why does using mod with an int64_t operand make this function 150% slower?

The max_rem function computes the maximum remainder that (a+1)^n + (a-1)^n leaves when divided by a² for n = 1, 2, 3, .... main calls max_rem on every a from 3 to 999. Complete code:
#include <inttypes.h>
#include <stdio.h>

int max_rem(int a) {
    int max_r = 0;
    int m = a * a; // <-------- offending line
    int r1 = a+1, r2 = a-1;
    for(int n = 1; n <= a*a; n++) {
        r1 = (r1 * (a + 1)) % m;
        r2 = (r2 * (a - 1)) % m;
        int r = (r1 + r2) % m;
        if(max_r < r)
            max_r = r;
    }
    return max_r;
}

int main() {
    int64_t sum = 0;
    for(int a = 3; a < 1000; a++)
        sum += max_rem(a);
    printf("%ld\n", sum);
}
If I change line 6 from:
int m = a * a;
to
int64_t m = a * a;
the whole computation becomes about 150% slower. I tried both with gcc 5.3 and clang 3.6.
With int:
$ gcc -std=c99 -O3 -Wall -o 120 120.c
$ time(./120)
real 0m3.823s
user 0m3.816s
sys 0m0.000s
with int64_t:
$ time(./120)
real 0m9.861s
user 0m9.836s
sys 0m0.000s
and yes, I'm on a 64-bit system. Why does this happen?
I've always assumed that using int64_t is safer and more portable and "the modern way to write C"® and wouldn't harm performance on 64-bit systems for numeric code. Is this assumption erroneous?
EDIT: just to be clear: the slowdown persists even if you change every variable to int64_t. So this is not a problem with mixing int and int64_t.
I've always assumed that using int64_t is safer and more portable and "the modern way to write C"® and wouldn't harm performance on 64-bit systems for numeric code. Is this assumption erroneous?
It seems so to me. You can find the instruction timings in Intel's Software Optimization Reference manual (appendix C, table C-17 General Purpose Instructions on page 645):
IDIV r64 Throughput 85-100 cycles per instruction
IDIV r32 Throughput 20-26 cycles per instruction
TL;DR: You see different performance with the change of types because you are measuring different computations -- one with all 32-bit data, the other with partially or all 64-bit data.
I've always assumed that using int64_t is safer and more portable and "the modern way to write C"®
int64_t is the safest and most portable (among conforming C99 and C11 compilers) way to refer to a 64-bit signed integer type with no padding bits and a two's complement representation, if the implementation in fact provides such a type. Whether using this type actually makes your code more portable depends on whether the code depends on any of those specific characteristics of integer representation, and on whether you are concerned with portability to environments that do not provide such a type.
and wouldn't harm performance on 64-bit systems for numeric code. Is this assumption erroneous?
int64_t is specified to be a typedef. On any given system, using int64_t is semantically identical to directly using the type that underlies the typedef on that system. You will see no performance difference between those alternatives.
However, your line of reasoning and question seem to rest on an assumption: either that on the system where you perform your tests, the basic type underlying int64_t is int, or that 64-bit arithmetic will perform identically to 32-bit arithmetic on that system. Neither of those assumptions is justified. It is by no means guaranteed that C implementations for 64-bit systems will make int a 64-bit type, and in particular, neither GCC nor Clang for x86_64 does so. Moreover, C has nothing whatever to say about the relative performance of arithmetic on different types, and as others have pointed out, native x86_64 integer division instructions are in fact slower for 64-bit operands than for 32-bit operands. Other platforms might exhibit other differences.
Integer division / modulo is extremely slow compared to any other operation. (And is dependent on data size, unlike most operations on modern hardware, see the end of this answer)
For repeated use of the same modulus, you will get much better performance from finding the multiplicative inverse for your integer divisor. Compilers do this for you for compile-time constants, but it's moderately expensive in time and code-size to do it at run-time, so with current compilers you have to decide for yourself when it's worth doing.
It takes some CPU cycles up front, but they're amortized over 3 divisions per iteration.
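As a rough sketch of what that could look like for this loop, using libdivide (mentioned elsewhere on this page) and assuming its C API names libdivide_s32_gen / libdivide_s32_do, with each % m rewritten as t - (t/m)*m:

#include "libdivide.h"

int max_rem_fastdiv(int a) {
    int m = a * a;
    struct libdivide_s32_t fast_m = libdivide_s32_gen(m);   // magic multiplier computed once per a
    int max_r = 0, r1 = a + 1, r2 = a - 1;
    for (int n = 1; n <= m; n++) {
        int t1 = r1 * (a + 1);
        r1 = t1 - libdivide_s32_do(t1, &fast_m) * m;        // t1 % m via multiply + shift
        int t2 = r2 * (a - 1);
        r2 = t2 - libdivide_s32_do(t2, &fast_m) * m;        // t2 % m
        int r = r1 + r2;
        if (r >= m) r -= m;                                 // (r1 + r2) % m, since 0 <= r1, r2 < m
        if (max_r < r) max_r = r;
    }
    return max_r;
}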
The reference paper for this idea is Granlund and Montgomery's 1994 paper, back when divide was only 4x as expensive as multiply on P5 Pentium hardware. That paper talks about implementing the idea in gcc 2.6, as well as the mathematical proof that it works.
Compiler output shows the kind of code that division by a small constant turns into:
## clang 3.8 -O3 -mtune=haswell for x86-64 SysV ABI: first arg in rdi
int mod13 (int a) { return a%13; }
movsxd rax, edi # sign-extend 32bit a into 64bit rax
imul rcx, rax, 1321528399 # gcc uses one-operand 32bit imul (32x32 => 64b), which is faster on Atom but slower on almost everything else. I'm showing clang's output because it's simpler
mov rdx, rcx
shr rdx, 63 # 0 or 1: extract the sign bit with a logical right shift
sar rcx, 34 # only use the high half of the 32x32 => 64b multiply
add ecx, edx # ecx = a/13. # adding the sign bit accounts for the rounding semantics of C integer division with negative numbers
imul ecx, ecx, 13 # do the remainder as a - (a/13)*13
sub eax, ecx
ret
And yes, all this is cheaper than a div instruction, for throughput and latency.
I tried to google for simpler descriptions or calculators, and found stuff like this page.
On modern Intel CPUs, 32 and 64b multiply has one per cycle throughput, and 3 cycle latency. (i.e. it's fully pipelined).
Division is only partially pipelined (the div unit can't accept one input per clock), and unlike most instructions, has data-dependent performance:
From Agner Fog's insn tables (see also the x86 tag wiki):
Intel Core2: idiv r32: one per 12-36c throughput (18-42c latency, 4 uops).
idiv r64: one per 28-40c throughput (39-72c latency, 56 uops). (unsigned div is significantly faster: 32 uops, one per 18-37c throughput)
Intel Haswell: div/idiv r32: one per 8-11c throughput (22-29c latency, 9 uops).
idiv r64: one per 24-81c throughput (39-103c latency, 59 uops). (unsigned div: one per 21-74c throughput, 36 uops)
Skylake: div/idiv r32: one per 6c throughput (26c latency, 10 uops).
64b: one per 24-90c throughput (42-95c latency, 57 uops). (unsigned div: one per 21-83c throughput, 36 uops)
So on Intel hardware, unsigned division is cheaper for 64bit operands, the same for 32b operands.
The throughput differences between 32b and 64b idiv can easily account for the 150% difference. Your code is completely throughput bound, since you have plenty of independent operations, especially between loop iterations. The loop-carried dependency is just a cmov for the max operation.
The answer to this question can come only from looking at the assembly. I'd run it on my box out of curiosity, but it's 3000 miles away :( so I'll have to guess, and you can look and post your findings here...
Just add -S to your compiler command line.
I believe that with int64 the compilers are doing something different than with int32. That is, they cannot use some optimization that is available to them with int32.
Maybe gcc replaces the division with multiplication only with int32? There should be an 'if( x < 0 )' branch. Maybe gcc can eliminate it with int32?
I somehow don't believe the performance can be so different if they both do a plain 'idiv'.

Which of these C multiplication algorithms is easier on the CPU and has lower overhead?

I want to know which of these functions is easier for the CPU to calculate/run. I was told that direct multiplication (e.g. 4x3) is more difficult for the CPU to calculate than a series of summations (e.g. 4+4+4). Well, the first one uses direct multiplication, while the second one uses a for loop.
Algorithm 1
The first one is like x*y:
int multi_1(int x, int y)
{
    return x * y;
}
Algorithm 2
The second one is like x+x+x+...+x (repeated y times):
int multi_2(int num1, int num2)
{
    int sum=0;
    for(int i=0; i<num2; i++)
    {
        sum += num1;
    }
    return sum;
}
Please don't respond with "Don't try to do micro-optimization" or something similar. How can I evaluate which of these pieces of code runs better/faster? Does the C language automatically convert direct multiplication to summation?
You can generally expect the multiplication operator * to be implemented as efficiently as possible. Beating it with a custom multiplication algorithm is highly unlikely. If for any reason multi_2 is faster than multi_1 for all but some edge cases, consider writing a bug report against your compiler vendor.
On modern (i.e. made in this century) machines, multiplications by arbitrary integers are extremely fast and take four cycles at most, which is faster than initializing the loop in multi_2.
The more "high level" your code is, the more optimization paths your compiler will be able to use. So, I'd say that code #1 will have the most chances to produce a fast and optimized code.
In fact, for a simple CPU architecture that doesn't support direct multiplication operations, but does support addition and shifts, the second algorithm won't be used at all. The usual procedure is something similar to the following code:
unsigned int mult_3 (unsigned int x, unsigned int y)
{
    unsigned int res = 0;
    while (x)
    {
        res += (x&1)? y : 0;
        x>>=1;
        y<<=1;
    }
    return res;
}
Typical modern CPUs can do multiplication in hardware, often at the same speed as addition. So clearly #1 is better.
Even if multiplication is not available and you are stuck with addition there are algorithms much faster than #2.
You were misinformed. Multiplication is not "more difficult" than repeated addition. Multipliers are built into the ALU (Arithmetic & Logical Unit) of modern CPUs, and they work in constant time. On the opposite, repeated additions take time proportional to the value of one of the operands, which could be as large as one billion!
Actually, multiplies are rarely performed by straight additions; when you have to implement them in software, you do it by repeated shifts, using a method similar to duplation, known to the Ancient Egyptians.
This depends on the architecture you run it on, as well as the compiler and the values for x and y.
If x and y are small, the second version might be faster. However, when x and y are very large numbers, the second version will certainly be much slower.
The only way to really find out is to measure the running time of your code, for example like this: https://stackoverflow.com/a/9085330/369009
Since you're dealing with int values, the multiplication operator (*) will be far more efficient. C will compile into the CPU-specific assembly language, which will have a multiplication instruction (e.g., x86's mul/imul). Virtually all modern CPUs can multiply integers within a few clock cycles. It doesn't get much faster than that. Years ago (and on some relatively uncommon embedded CPUs) it used to be the case that multiplication took more clock cycles than addition, but even then, the additional jump instructions to loop would result in more cycles being consumed, even if only looping once or twice.
The C language does not require that multiplications by integers be converted into series of additions. It permits implementations to do that, I suppose, but I would be surprised to find an implementation that did so, at least in the general context you present.
Note also that in your case #2 you have replaced one multiplication operation with not just num2 addition operations, but with at least 2 * num2 additions, num2 comparisons, and 2 * num2 storage operations. The storage operations probably end up being approximately free, as the values likely end up living in CPU registers, but they don't have to be.
Overall, I would expect alternative #1 to be much faster, but it is always best to answer performance questions by testing. You will see the largest difference for large values of num2. For instance, try with num1 == 1 and num2 == INT_MAX.

Execution time of different operators

I was reading Knuth's The Art of Computer Programming and I noticed that he indicates that the DIV command takes 6 times longer than the ADD command in his MIX assembly language.
To test the relevancy to modern architecture, I wrote the following code snippet:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    clock_t start;
    unsigned int ia=0,ib=0,ic=0;
    int i;
    float fa=0.0,fb=0.0,fc=0.0;
    int sample_size=100000;
    if (argc > 1)
        sample_size = atoi(argv[1]);
#define TEST(OP) \
    start = clock();\
    for (i = 0; i < sample_size; ++i)\
        ic += (ia++) OP ((ib--)+1);\
    printf("%d,", (int)(clock() - start))
    TEST(+);
    TEST(*);
    TEST(/);
    TEST(%);
    TEST(>>);
    TEST(<<);
    TEST(&);
    TEST(|);
    TEST(^);
#undef TEST
    //TEST must be redefined for floating point types
#define TEST(OP) \
    start = clock();\
    for (i = 0; i < sample_size; ++i)\
        fc += (fa+=0.5) OP ((fb-=0.5)+1);\
    printf("%d,", (int)(clock() - start))
    TEST(+);
    TEST(*);
    TEST(/);
#undef TEST
    printf("\n");
    return ic+fc;//to prevent optimization!
}
I then generated 4000 test samples (each containing a sample size of 100000 operations of each type) using this command line:
for i in {1..4000}; do ./test >> output.csv; done
Finally, I opened the results with Excel and graphed the averages. What I found was rather surprising.
The actual averages (clock ticks, in the order tested) were:

int:   +  463.36475    *  437.38475    /  806.59725    %  821.70975
       >> 419.56525    << 417.85725    &  426.35975    |  425.9445     ^  423.792
float: +  549.91975    *  544.11825    /  543.11425
Overall this is what I expected (division and modulo are slow, as are floating point results).
My question is: why do both integer and floating-point multiplication execute faster than their addition counterparts? It is a small factor, but it is consistent across numerous tests. In TAOCP Knuth lists ADD as taking 2 units of time while MUL takes 10. Did something change in CPU architecture since then?
Different instructions take different amounts of time on the same CPU; and the same instructions can take different amounts of time on different CPUs. For example, for Intel's original Pentium 4 shifting was relatively expensive and addition was quite fast, so adding a register to itself was faster than shifting a register left by 1; and for Intel's recent CPUs shifting and addition are roughly the same speed (shifting is faster than it was on the original Pentium 4 and addition slower, in terms of "cycles").
To complicate things more, different CPUs may be able to do more or less at the same time, and have other differences that affect performance.
In theory (and not necessarily in practice):
Shifting and boolean operations (AND, OR, XOR) should be fastest (each bit can be done in parallel). Addition and subtraction should be next (relatively simple, but all bits of the result can't be done in parallel because of the carry from one pair of bits to the next).
Multiplication should be much slower as it involves many additions, but some of those additions can be done in parallel. For a simple example (using decimal digits not binary) something like 12 * 34 (with multiple digits) can be broken down into "single digit" form and becomes 2*4 + 2*3 * 10 + 1*4 * 10 + 1*3 * 100; where all "single digit" multiplications can be done in parallel, then 2 additions can be done in parallel, then the last addition can be done.
Division is mostly "compare and subtract if larger, repeated". It's the slowest because it can't be done in parallel (the results of the subtraction are needed for the next comparison). Modulo is the remainder of a division and essentially identical to division (and for most CPUs it's actually the same instruction - e.g. a DIV instruction gives you a quotient and a remainder).
For floating point; each number has 2 parts (significand and exponent), so things get a little more complicated. Floating point shifting is actually adding to or subtracting from the exponent (and should cost roughly the same as integer addition/subtraction). For floating point addition, subtraction and boolean operations you need to equalise the exponents, and after that you do the operation on the significands alone (and the "equalising" and "doing the operation" can't be done in parallel). Multiplication is multiplying the significands and adding the exponents (and adjusting the bias), where both parts can be done in parallel so the total cost is whichever is slowest (multiplying the significands); so it's as fast as integer multiplication. Division is dividing the significands and subtracting the exponents (and adjusting the bias), where both parts can be done in parallel and total cost is whichever is slowest (dividing the significands); so it's as fast as integer division.
Note: I've simplified in various places to make it much easier to understand.
To test the execution time, look at the instructions produced in the assembly listing, look at the documentation for the processor for those instructions, and note whether the FPU is performing the operation or whether it is performed directly in the code.
Then, add up the execution time for each instruction.
However, if the CPU is pipelined or multithreaded, the operation could take MUCH less time than calculated.
It is true that division and modulo (a division operation) are slower than addition. The reason behind this is the design of the ALU (Arithmetic Logic Unit). The ALU is a combination of parallel adders and logic circuits. Division is performed by repeated subtraction, and therefore needs more levels of subtraction logic, making division slower than addition. The propagation delays of the gates involved in division add the cherry on top.

Resources