
Which operation takes more CPU clocks, modulo or comparison?
Will this code take more time:
for (j = i; j <= 10; j++)
{
    if (j == 10) printf("0");
    else printf("%d", j);
}
or this
for (j = i; j <= 10; j++)
    printf("%d", j % 10);
and why?

If measured in CPU cycles, the modulo operation probably takes more cycles; this may depend on the CPU. However, CPU cycles aren't a great way to measure performance on modern processors, which run more than one instruction at once (pipelining), have multiple layers of cache, etc. In this case, putting an additional test in will mean an additional branch, which may be more significant in terms of timing (i.e. it affects the instruction pipeline). The only way to know for sure is to compile it optimised and time it.
I know your example is meant to be just that, an example, but this also illustrates premature optimisation. The call to printf will take orders of magnitude more time than the modulo or compare. If you want to optimise your example, you would write something like:
printf ("1234567890");

Comparison is a simple operation and is usually faster (the CPU can use logical operators on bits).
If you perform a modulo by a number that is not a power of two, the CPU has to perform a division, which can be quite an expensive operation (of course, it depends on the size of the numbers you are using).
Speaking of CPU clocks, a comparison can be done in parallel, since you can just use an XOR operation, so x == 10 and x == 200000 take the same small number of CPU clocks. With a division this is not possible, and a bigger number will require more time.
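To illustrate the XOR remark: an equality test can be expressed as a bitwise operation followed by a zero test, which is why the size of the constant doesn't matter. A trivial sketch, not how you would normally write it:
/* x == k is equivalent to checking that x XOR k has no bits set;
 * the cost is the same whether k is 10 or 200000. */
int equals(unsigned x, unsigned k)
{
    return (x ^ k) == 0;   /* same result as x == k */
}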

In terms of assembly, a modulo operation implies a division, or at best a "never so easy" multiplication by a precomputed reciprocal; see the standard division algorithms.
A branch operation is actually the second fastest instruction (an unconditional jump is the first), as it only takes at most one subtraction to do the comparison.

Related

Which of these C multiplication algorithms is easier on the CPU and has lower overhead?

I want to know which of these functions is easier for the CPU to calculate/run. I was told that direct multiplication (e.g. 4x3) is more difficult for the CPU to calculate than a series of additions (e.g. 4+4+4). Well, the first one uses direct multiplication, but the second one has a for loop.
Algorithm 1
The first one is like x*y:
int multi_1(int x, int y)
{
    return x * y;
}
Algorithm 2
The second one is like x+x+x+...+x (as much as y):
int multi_2(int num1, int num2)
{
    int sum = 0;
    for (int i = 0; i < num2; i++)
    {
        sum += num1;
    }
    return sum;
}
Please don't respond with "Don't try to do micro-optimization" or something similar. How can I evaluate which of these codes runs better/faster? Does the C language automatically convert direct multiplication to summation?
You can generally expect the multiplication operator * to be implemented as efficiently as possible. Beating it with a custom multiplication algorithm is highly unlikely. If for any reason multi_2 is faster than multi_1 for all but some edge cases, consider writing a bug report against your compiler vendor.
On modern (i.e. made in this century) machines, multiplications by arbitrary integers are extremely fast and take four cycles at most, which is faster than initializing the loop in multi_2.
The more "high level" your code is, the more optimization paths your compiler will be able to use. So, I'd say that code #1 has the best chance of producing fast, optimized code.
In fact, for a simple CPU architecture that doesn't support direct multiplication operations, but does support addition and shifts, the second algorithm won't be used at all. The usual procedure is something similar to the following code:
unsigned int mult_3 (unsigned int x, unsigned int y)
{
    unsigned int res = 0;
    while (x)
    {
        res += (x & 1) ? y : 0;   /* add y if the current bit of x is set */
        x >>= 1;                  /* move on to the next bit of x */
        y <<= 1;                  /* double y to match that bit's weight */
    }
    return res;
}
Typical modern CPUs can do multiplication in hardware, often at the same speed as addition. So clearly #1 is better.
Even if multiplication is not available and you are stuck with addition there are algorithms much faster than #2.
You were misinformed. Multiplication is not "more difficult" than repeated addition. Multipliers are built into the ALU (Arithmetic and Logic Unit) of modern CPUs, and they work in constant time. By contrast, repeated addition takes time proportional to the value of one of the operands, which could be as large as one billion!
Actually, multiplies are rarely performed by straight additions; when you have to implement them in software, you do it by repeated shifts, using a method similar to duplation (doubling), known to the ancient Egyptians.
This depends on the architecture you run it on, as well as the compiler and the values for x and y.
If x and y are small, the second version might be faster. However, when x and y are very large numbers, the second version will certainly be much slower.
The only way to really find out is to measure the running time of your code, for example like this: https://stackoverflow.com/a/9085330/369009
Since you're dealing with int values, the multiplication operator (*) will be far more efficient. C will compile into the CPU-specific assembly language, which will have a multiplication instruction (e.g., x86's mul/imul). Virtually all modern CPUs can multiply integers within a few clock cycles. It doesn't get much faster than that. Years ago (and on some relatively uncommon embedded CPUs) it used to be the case that multiplication took more clock cycles than addition, but even then, the additional jump instructions to loop would result in more cycles being consumed, even if only looping once or twice.
The C language does not require that multiplications by integers be converted into series of additions. It permits implementations to do that, I suppose, but I would be surprised to find an implementation that did so, at least in the general context you present.
Note also that in your case #2 you have replaced one multiplication operation with not just num2 addition operations, but with at least 2 * num2 additions, num2 comparisons, and 2 * num2 storage operations. The storage operations probably end up being approximately free, as the values likely end up living in CPU registers, but they don't have to be.
Overall, I would expect alternative #1 to be much faster, but it is always best to answer performance questions by testing. You will see the largest difference for large values of num2. For instance, try with num1 == 1 and num2 == INT_MAX.
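A rough harness for such a test might look like this. It is only a sketch; note that an optimising compiler is allowed to turn multi_2's loop into a single multiply, which is itself an answer to the question:
#include <limits.h>
#include <stdio.h>
#include <time.h>

static int multi_1(int x, int y) { return x * y; }

static int multi_2(int num1, int num2)
{
    int sum = 0;
    for (int i = 0; i < num2; i++)
        sum += num1;
    return sum;
}

int main(void)
{
    volatile unsigned sink = 0;   /* volatile so the calls can't be optimised away entirely */
    clock_t t;

    t = clock();
    sink += (unsigned)multi_1(1, INT_MAX);
    printf("multi_1: %.6f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    sink += (unsigned)multi_2(1, INT_MAX);
    printf("multi_2: %.6f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    return (int)(sink & 1u);
}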

How to count numbers of executed x86-64 instructions between two points in program (or function call)?

I've made a few algorithm implementations with various micro-optimizations. I need to count the number of instructions executed by a call, or between two points (before and after the call).
The algorithm uses a few loops and conditional jumps, and it's data-sensitive. So I can't just take a calculated number of instructions per loop iteration and multiply it by the iteration count.
Disclaimer: I know that the number of executed instructions isn't very relevant, because performance for the same instructions varies between CPUs, but it's for demonstration purposes only.
On x86 (both 32- and 64-bit) you are probably looking for the RDTSC instruction.
Given how complex modern CPUs are, any form of simulation or static analysis certainly isn't a realistic alternative.
Your compiler may or may not have an intrinsic for it; if not, do something like this (GCC syntax for the inline asm):
#include <stdint.h>
uint64_t GetTSC(void)
{
    uint64_t h = 0, l = 0;
    __asm__ __volatile__("rdtsc" : "=a"(l), "=d"(h));   /* low 32 bits in EAX, high 32 bits in EDX */
    h <<= 32;
    h |= l;
    return h;
}
With the caveats described in https://en.wikipedia.org/wiki/Time_Stamp_Counter
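Typical usage might look like this (a sketch only; work() is a hypothetical workload, GetTSC is the function above, and this counts time-stamp-counter ticks rather than retired instructions):
#include <stdint.h>
#include <stdio.h>

/* GetTSC() as defined above */

static volatile uint64_t dump;

static void work(void)                 /* hypothetical workload to measure */
{
    for (int i = 0; i < 1000000; i++)
        dump += i;
}

int main(void)
{
    uint64_t before = GetTSC();
    work();
    uint64_t after = GetTSC();
    printf("%llu TSC ticks\n", (unsigned long long)(after - before));
    return 0;
}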

Fastest method of vectorized integer division by non-constant divisor

Based on the answers/comments of this question I wrote a performance test with gcc 4.9.2 (MinGW64) to estimate which way of doing multiple integer divisions is faster, as follows:
#include <emmintrin.h> // SSE2
static unsigned short x[8] = {0, 55, 2, 62003, 786, 5555, 123, 32111}; // Dividend
__attribute__((noinline)) static void test_div_x86(unsigned i){
    for(; i; --i)
        x[0] /= i,
        x[1] /= i,
        x[2] /= i,
        x[3] /= i,
        x[4] /= i,
        x[5] /= i,
        x[6] /= i,
        x[7] /= i;
}
__attribute__((noinline)) static void test_div_sse(unsigned i){
    for(; i; --i){
        __m128i xmm0 = _mm_loadu_si128((const __m128i*)x);   // load the 8 dividends
        __m128 xmm1 = _mm_set1_ps(i);                         // broadcast the divisor as float
        _mm_storeu_si128(
            (__m128i*)x,
            _mm_packs_epi32(                                  // pack both int32 halves back to int16
                _mm_cvtps_epi32(                              // round the low half back to int32
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpacklo_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                ),
                _mm_cvtps_epi32(                              // round the high half back to int32
                    _mm_div_ps(
                        _mm_cvtepi32_ps(_mm_unpackhi_epi16(xmm0, _mm_setzero_si128())),
                        xmm1
                    )
                )
            )
        );
    }
}
int main(){
    const unsigned runs = 40000000; // Choose a big number, so the compiler doesn't dare to unroll loops and optimize with constants
    test_div_x86(runs),
    test_div_sse(runs);
    return 0;
}
The results from GNU gprof, and the tool parameters:
/*
gcc -O? -msse2 -pg -o test.o -c test.c
g++ -o test test.o -pg
test
gprof test.exe gmon.out
-----------------------------------
        test_div_sse(unsigned int)   test_div_x86(unsigned int)
-O0     2.26s                        1.10s
-O1     1.41s                        1.07s
-O2     0.95s                        1.09s
-O3     0.77s                        1.07s
*/
Now I'm confused why the x86 test barely gets optimized while the SSE test becomes faster despite the expensive conversion to and from floating point. Furthermore, I'd like to know how much the results depend on compilers and architectures.
To summarize: which is faster in the end, dividing one by one or taking the floating-point detour?
Dividing all elements of a vector by the same scalar can be done with integer multiply and shift. libdivide (C/C++, zlib license) provides some inline functions to do this for scalars (e.g. int), and for dividing vectors by scalars. Also see SSE integer division? (as you mention in your question) for a similar technique giving approximate results. It's more efficient if the same scalar will be applied to lots of vectors. libdivide doesn't say anything about the results being inexact, but I haven't investigated.
re: your code:
You have to be careful about checking what the compiler actually produces, when giving it a trivial loop like that. e.g. is it actually loading/storing back to RAM every iteration? Or is it keeping variables live in registers, and only storing at the end?
Your benchmark is skewed in favour of the integer-division loop, because the vector divider isn't kept 100% occupied in the vector loop, but the integer divider is kept 100% occupied in the int loop. (These paragraphs were added after the discussion in comments. The previous answer didn't explain as much about keeping the dividers fed, and dependency chains.)
You only have a single dependency chain in your vector loop, so the vector divider sits idle for several cycles every iteration after producing the 2nd result, while the chain of convert fp->si, pack, unpack, convert si->fp happens. You've set things up so your throughput is limited by the length of the entire loop-carried dependency chain, rather than the throughput of the FP dividers. If the data each iteration was independent (or there were at least several independent values, like how you have 8 array elements for the int loop), then the unpack/convert and convert/pack of one set of values would overlap with the divps execution time for another vector. The vector divider is only partially pipelined, but everything else is fully pipelined.
This is the difference between throughput and latency, and why it matters for a pipelined out-of-order execution CPU.
Other stuff in your code:
You have __m128 xmm1 = _mm_set1_ps(i); in the inner loop. _set1 with an arg that isn't a compile-time constant is usually at least 2 instructions: movd and pshufd. And in this case, an int-to-float conversion, too. Keeping a float-vector version of your loop counter, which you increment by adding a vector of 1.0, would be better. (Although this probably isn't throwing off your speed test any further, because this excess computation can overlap with other stuff.)
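For example, the divisor could be kept as a float vector across iterations instead of being rebuilt with _mm_set1_ps every time. This is a sketch of the idea only, slotted into the question's test_div_sse loop; note that once the counter exceeds 2^24 the float decrement is no longer exact, so this changes the numerical results and is shown purely to illustrate the point:
__m128 vdiv = _mm_set1_ps((float)i);        // built once, before the loop
const __m128 vone = _mm_set1_ps(1.0f);
for(; i; --i){
    // ... use vdiv wherever xmm1 was used ...
    vdiv = _mm_sub_ps(vdiv, vone);          // track i-1 for the next iteration
}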
Unpacking with zero works fine. SSE4.1 __m128i _mm_cvtepi16_epi32 (__m128i a) is another way. pmovsxwd is the same speed, but doesn't need a zeroed register.
If you're going to convert to FP for divide, have you considered just keeping your data as FP for a while? Depends on your algorithm how you need rounding to happen.
performance on recent Intel CPUs
divps (packed single float) is 10-13 cycle latency, with a throughput of one per 7 cycles, on recent Intel designs. div / idiv r16 ((unsigned) integer divide in GP reg) is 23-26 cycle latency, with one per 9 or 8 cycle throughput. div is 11 uops, so it even gets in the way of other things issuing / executing for some of the time it's going through the pipeline. (divps is a single uop.) So, Intel CPUs are not really designed to be fast at integer division, but make an effort for FP division.
So just for the division alone, a single integer division is slower than a vector FP division. You're going to come out ahead even with the conversion to/from float, and the unpack/pack.
If you can do the other integer ops in vector regs, that would be ideal. Otherwise you have to get the integers into / out of vector regs. If the ints are in RAM, a vector load is fine. If you're generating them one at a time, PINSRW is an option, but it's possible that just storing to memory to set up for a vector load would be a faster way to load a full vector. Similar for getting the data back out, with PEXTRW or by storing to RAM. If you want the values in GP registers, skip the pack after converting back to int, and just MOVD / PEXTRD from whichever of the two vector regs your value is in. insert/extract instructions take two uops on Intel, which means they take up two "slots", compared to most instructions taking only one fused-domain uop.
Your timing results, showing that the scalar code doesn't improve with compiler optimizations, come about because the CPU can overlap the verbose non-optimized load/store instructions for other elements while the divide unit is the bottleneck. The vector loop, on the other hand, has only one or two dependency chains, with every iteration dependent on the previous one, so extra instructions adding latency can't be overlapped with anything. Testing with -O0 is pretty much never useful.

Execution time of different operators

I was reading Knuth's The Art of Computer Programming and I noticed that he indicates that the DIV command takes 6 times longer than the ADD command in his MIX assembly language.
To test the relevancy to modern architecture, I wrote the following code snippet:
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
    clock_t start;
    unsigned int ia = 0, ib = 0, ic = 0;
    int i;
    float fa = 0.0, fb = 0.0, fc = 0.0;
    int sample_size = 100000;
    if (argc > 1)
        sample_size = atoi(argv[1]);
#define TEST(OP) \
    start = clock();\
    for (i = 0; i < sample_size; ++i)\
        ic += (ia++) OP ((ib--)+1);\
    printf("%d,", (int)(clock() - start))
    TEST(+);
    TEST(*);
    TEST(/);
    TEST(%);
    TEST(>>);
    TEST(<<);
    TEST(&);
    TEST(|);
    TEST(^);
#undef TEST
//TEST must be redefined for floating point types
#define TEST(OP) \
    start = clock();\
    for (i = 0; i < sample_size; ++i)\
        fc += (fa+=0.5) OP ((fb-=0.5)+1);\
    printf("%d,", (int)(clock() - start))
    TEST(+);
    TEST(*);
    TEST(/);
#undef TEST
    printf("\n");
    return ic + fc; //to prevent optimization!
}
I then generated 4000 test samples (each containing a sample size of 100000 operations of each type) using this command line:
for i in {1..4000}; do ./test >> output.csv; done
Finally, I opened the results with Excel and graphed the averages. What I found was rather surprising. (Graph omitted; the averages are listed below.)
The actual averages were (from left-to-right): 463.36475,437.38475,806.59725,821.70975,419.56525,417.85725,426.35975,425.9445,423.792,549.91975,544.11825,543.11425
Overall this is what I expected (division and modulo are slow, as are floating point results).
My question is: why do both integer and floating-point multiplication execute faster than their addition counterparts? It is a small factor, but it is consistent across numerous tests. In TAOCP Knuth lists ADD as taking 2 units of time while MUL takes 10. Did something change in CPU architecture since then?
Different instructions take different amounts of time on the same CPU; and the same instructions can take different amounts of time on different CPUs. For example, for Intel's original Pentium 4 shifting was relatively expensive and addition was quite fast, so adding a register to itself was faster than shifting a register left by 1; and for Intel's recent CPUs shifting and addition are roughly the same speed (shifting is faster than it was on the original Pentium 4 and addition slower, in terms of "cycles").
To complicate things more, different CPUs may be able to do more or less at the same time, and have other differences that affect performance.
In theory (and not necessarily in practice):
Shifting and boolean operations (AND, OR, XOR) should be fastest (each bit can be done in parallel). Addition and subtraction should be next (relatively simple, but all bits of the result can't be done in parallel because of the carry from one pair of bits to the next).
Multiplication should be much slower as it involves many additions, but some of those additions can be done in parallel. For a simple example (using decimal digits not binary) something like 12 * 34 (with multiple digits) can be broken down into "single digit" form and becomes 2*4 + 2*3 * 10 + 1*4 * 10 + 1*3 * 100; where all "single digit" multiplications can be done in parallel, then 2 additions can be done in parallel, then the last addition can be done.
Division is mostly "compare and subtract if larger, repeated". It's the slowest because it can't be done in parallel (the results of the subtraction are needed for the next comparison). Modulo is the remainder of a division and essentially identical to division (and for most CPUs it's actually the same instruction - e.g. a DIV instruction gives you a quotient and a remainder).
For floating point; each number has 2 parts (significand and exponent), so things get a little more complicated. Floating point shifting is actually adding to or subtracting from the exponent (and should cost roughly the same as integer addition/subtraction). For floating point addition, subtraction and boolean operations you need to equalise the exponents, and after that you do the operation on the significands alone (and the "equalising" and "doing the operation" can't be done in parallel). Multiplication is multiplying the significands and adding the exponents (and adjusting the bias), where both parts can be done in parallel so the total cost is whichever is slowest (multiplying the significands); so it's as fast as integer multiplication. Division is dividing the significands and subtracting the exponents (and adjusting the bias), where both parts can be done in parallel and total cost is whichever is slowest (dividing the significands); so it's as fast as integer division.
Note: I've simplified in various places to make it much easier to understand.
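To make the "compare and subtract if larger, repeated" description above concrete, here is a simplified shift-and-subtract (restoring) division in C that produces the quotient and the remainder together, much like a hardware DIV does. This is only a sketch of the principle (it assumes d is non-zero and is not a model of any particular CPU's divider):
#include <stdint.h>
#include <stdio.h>

/* One quotient bit is decided per iteration: shift the remainder left, bring in
 * the next dividend bit, then compare and subtract if it is large enough. */
static void udivmod32(uint32_t n, uint32_t d, uint32_t *q, uint32_t *r)
{
    uint32_t quot = 0;
    uint64_t rem  = 0;                          /* 64-bit so the shift can't overflow */
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((n >> bit) & 1u);   /* bring in the next bit of n */
        if (rem >= d) {                         /* compare ...                */
            rem -= d;                           /* ... and subtract if larger */
            quot |= 1u << bit;                  /* record a 1 in the quotient */
        }
    }
    *q = quot;              /* n / d */
    *r = (uint32_t)rem;     /* n % d */
}

int main(void)
{
    uint32_t q, r;
    udivmod32(100, 7, &q, &r);
    printf("%u %u\n", q, r);    /* 14 2 */
    return 0;
}
Each of the 32 iterations depends on the previous remainder, which is exactly why division doesn't parallelise the way addition or multiplication does.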
To test the execution time, look at the instructions produced in the assembly listing and look at the documentation for the processor for those instructions, noting whether the FPU is performing the operation or whether it is performed directly in the code.
Then, add up the execution time for each instruction.
However, if the CPU is pipelined or multithreaded, the operation could take MUCH less time than calculated.
It is true that division and modulo (which is a division operation) are slower than addition. The reason behind this is the design of the ALU (Arithmetic Logic Unit). The ALU is a combination of parallel adders and logic circuits. Division is performed by repeated subtraction and therefore needs more levels of subtraction logic, making division slower than addition. The propagation delays of the gates involved in division add to this.

How does C perform the % operation internally

I am curious to understand the logic behind the mod operation since I understand that bit-shifting operations can be performed to do different things such as bit shifting to multiply.
One way I can see it being done is by a recursive algorithm that keeps dividing until you cannot divide anymore, but this does not seem efficient.
Any ideas will be helpful. Thanks in advance!
The quick version is: it depends on the hardware, the optimizer, whether it's division by a constant or not, whether there are exceptions to be checked for (e.g. modulo by 0), and if and how negative numbers are handled (this is a scary question for C++), etc.
R gave a nice, concise answer for unsigned integers, but it's difficult to understand unless you're well versed with C.
The crux of the technique illuminated by R is to strip away multiples of q until there are no more multiples of q left. We could naively do this with a simple loop:
while (p >= q) p -= q; // One liner, woohoo!
The code may be short, but for large values of p and small values of q this might take a very long time.
Better than stripping away one q at a time would be to strip away many q's at a time. Note that we actually want to strip away as many q's as possible -- that is, floor(p/q) many q's... And indeed, that's a valid technique. For unsigned integers, one would expect that p % q == p - (p / q) * q. (Note that unsigned integer division rounds down.)
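For instance, a trivial check of that identity:
#include <assert.h>

int main(void)
{
    unsigned p = 100, q = 7;
    assert(p % q == p - (p / q) * q);   /* 100 % 7 == 100 - 14*7 == 2 */
    return 0;
}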
But this almost feels like cheating because division and remainder operations are so intimately related. (In fact, often if hardware natively supports division, it supports a divide-and-compute-remainder operation because they're so strongly related.)
Assuming we've no access to division, how shall we find a multiple of q greater than 1 to strip away? In hardware, fixed shift operations are cheap (if not practically free) and conceptually represent multiplication by a non-negative power of two. For example, shifting a bit string left by 3 is equivalent to multiplying by 8 (that is, 2^3), e.g. 5 decimal is equivalent to '101' binary. Shift '101' in binary by adding three zeroes on the right (giving '101000') and the result is 40 in decimal -- five times eight.
Likewise, shift operations are very cheap as software operations, and you'll struggle to find a controller that doesn't support them, and support them quickly. (Some architectures such as ARM can even combine shifts with other instructions to make them 'free' a good deal of the time.)
ARMed (couldn't resist) with these shift operations, we can proceed as follows:
Find out the largest power of two we can multiply q by and still be less than p.
Working from the largest power of two to the smallest, multiply q by each power of two, and if it's no greater than what's left of p, subtract it from what's left of p.
Whatever you've got left is the remainder.
Why does this work? Because in the end you'll find that all the subtracted powers of two actually sum to floor(p / q)! Don't take my word for it, similar knowledge has been known for a very long time.
Breaking apart R's answer:
#define HI (-1U-(-1U/2))
This effectively gives you an unsigned integer with only the highest value bit set.
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
This line finds the highest power of two that q can be multiplied by before overflowing an unsigned integer (i.e. before the top bit would be shifted out). This isn't strictly necessary, but it doesn't change the results other than increasing the amount of execution time required.
In case you're not familiar with the C-isms in this line:
(q<<i) is a left bit shift by i. Recall this is equivalent to multiplying by 2^i.
HI & (q<<i) performs a bitwise-AND. Since HI only has its top bit populated this will only result in a non-zero value when (q<<i) is large enough to cause the top bit to be non-zero. One more shift over to the left and there'd be an integer overflow.
!(HI & (q<<i)) is 'true' when (HI & (q<<i)) is zero and 'false' otherwise.
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
This is a simple decreasing loop do { .... } while (i--);. Note that post-decrementing is used on i so the loop executes, then it checks to see if i is not zero, then it subtracts one from i, and then if its earlier check resulted in true it continues. This has the property that the loop executes its last time when i is 0. This is important because we may need to strip away an unmultiplied copy of q.
if (p >= (q<<i)) checks if the 2^i * q is less than or equal to p. If it is, p -= (q<<i) strips it away.
The remainder is left.
While most C implementations run on hardware that has a division instruction, the remainder operation can be performed roughly like this, for computing p%q, assuming unsigned values:
#define HI (-1U-(-1U/2))
unsigned i;
for (i=0; !(HI & (q<<i)); i++);
do { if (p >= (q<<i)) p -= (q<<i); } while (i--);
The resulting remainder is in p.
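Wrapped into a complete function with a quick check (umod is just an illustrative name; like the snippet above it assumes q is non-zero):
#include <stdio.h>

#define HI (-1U-(-1U/2))   /* only the top bit set */

static unsigned umod(unsigned p, unsigned q)
{
    unsigned i;
    for (i = 0; !(HI & (q << i)); i++);           /* largest shift before the top bit is reached */
    do { if (p >= (q << i)) p -= (q << i); } while (i--);
    return p;
}

int main(void)
{
    /* e.g. 100 % 7: strips 7<<3 = 56, then 28, then 14, leaving 2 */
    printf("%u\n", umod(100, 7));   /* prints 2 */
    printf("%u\n", 100u % 7u);      /* prints 2 as well */
    return 0;
}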
In addition to a hardware instruction and implementation using shifts, as R.. suggests, there's also reciprocal multiplication.
This technique can be used when the right-hand side of % is a constant, known at compile time.
Reciprocal multiplication is used to implement division, but using it for % is easy, based on the formula a%b == a-(a/b)*b.
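As an illustration of the idea for a constant divisor of 10: the constant 0xCCCCCCCD with a total shift of 35 is the usual 32-bit unsigned reciprocal for 10, and compilers derive such constants automatically, so the sketch below only shows the shape of the transformation:
#include <assert.h>
#include <stdint.h>

/* n / 10 computed as a multiply by a precomputed reciprocal plus a shift,
 * then n % 10 recovered via a%b == a - (a/b)*b.  Valid for all 32-bit n. */
static uint32_t mod10(uint32_t n)
{
    uint32_t q = (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);   /* n / 10 */
    return n - q * 10u;                                           /* n % 10 */
}

int main(void)
{
    for (uint32_t n = 0; n < 1000000u; n++)
        assert(mod10(n) == n % 10u);
    assert(mod10(0xFFFFFFFFu) == 0xFFFFFFFFu % 10u);
    return 0;
}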
Depending on the smarts of the optimizer, there is a shortcut for modulo by a power of 2. For example, a % 32 can be implemented as a & 31. In general, a % (2^N) == a & (2^N - 1). This is lightning fast compared to division. Most dividers (even hardware ones) require at least one cycle for each bit of the result, while a logical AND is just a few-cycle operation (in the pipeline).
EDIT: this only works if a is unsigned!
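A small demonstration of why the masking trick only works for unsigned values (C99 semantics for signed % assumed; the expected output is in the comments):
#include <stdio.h>

int main(void)
{
    unsigned u = 37;
    int      s = -37;

    printf("%u %u\n", u % 32, u & 31);   /* 5 5   - identical for unsigned */
    printf("%d %d\n", s % 32, s & 31);   /* -5 27 - different for negative signed values */
    return 0;
}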
