Say I am doing a simple task-- a lot. For this post I will use the example of reducing mod a power of 2-- but this can be any task.
There are many approaches, and it is difficult to determine which one is better. For example, to reduce a 64-bit unsigned integer a modulo 2^b, we can either do:
a - (a >> b << b)
a << (64 - b) >> (64 - b)
a & ((1ULL << b) - 1)
a % (1ULL << b)
a & array[b], where array[i] would contain the value ((1ULL << i) - 1) for various i's. I do not wish to assume that array[i] will be in the L1 cache.
Perhaps there are others. For each of these methods, it is fairly straightforward to determine its cost by looking at the assembly code, e.g. two shifts and a subtraction; or two shifts, a single-byte subtraction, and another subtraction that the compiler partially optimizes away.
However, it is difficult to determine which of these is actually faster on a given architecture. I have tried the following, but failed:
-- Read documentation on the architecture. When I can find it, it gives a clear answer as to how many cycles a shift or an & takes; however, it is still difficult to tell how many cycles a cached subtraction would take, or how much of a performance penalty I am incurring by pushing useful data out of the cache in order to load the array[b] data.
-- Run the same code many millions of times in a row. However, this keeps the array in the cache, and therefore gives misleadingly fast results.
-- Run the same code many millions of times, and between the runs run some code that puts / reads other data into the cache. However, the cache-thrashing code itself introduces a lot of variability in the running time, and the standard deviation is too big for the results to be reliable.
-- Thanks to a suggestion by @klutt, I pasted these methods into the real code, and the first three methods seem to have equal performance. What is probably happening is that all three can finish executing while the program is waiting for another value to be pulled from the cache later on in the program. However, if the program changes later (e.g. fewer cache lookups), one of these methods might become preferable to the others.
If I do not wish to revisit my reduction mod 2^* code every time the program changes, is there another, better method to measure which of these is faster?
A lot of these different ways to implement a task (especially if it's a really simple task) will be optimized by the compiler to the exact same machine code instructions. Have you tried exploring http://godbolt.org/ and seeing what machine code instructions each of the different methods compiles to? This might give you a better idea of which ones to prefer.
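For example, something like the following sketch (the function names and the table-init helper are mine, and it assumes 0 < b < 64 so none of the shifts are undefined) is a convenient form to paste into godbolt, one function per method:

    #include <stdint.h>

    static uint64_t mask_table[64];              /* mask_table[i] = (1ULL << i) - 1 */

    void init_masks(void) {                      /* call once at startup before using mod_table */
        for (unsigned i = 0; i < 64; i++)
            mask_table[i] = (1ULL << i) - 1;
    }

    uint64_t mod_sub_shift (uint64_t a, unsigned b) { return a - (a >> b << b); }
    uint64_t mod_two_shifts(uint64_t a, unsigned b) { return a << (64 - b) >> (64 - b); }
    uint64_t mod_mask      (uint64_t a, unsigned b) { return a & ((1ULL << b) - 1); }
    uint64_t mod_modulo    (uint64_t a, unsigned b) { return a % (1ULL << b); }
    uint64_t mod_table     (uint64_t a, unsigned b) { return a & mask_table[b]; }

Leaving the five functions non-static makes the compiler emit code for each of them even though nothing in the file calls them.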
Related
I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
    doSomething();
    drawSomething();
    doSomething2();
    sendSomething();
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown: how much time each procedure takes.
My concern is that if I print the time interval at every entrance and exit of each procedure, it would incur huge performance overhead.
I am curious what is an idiomatic way of measuring the performance.
Is printing or logging good enough?
Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous one, or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
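For the loop in the question, that could look roughly like this (the section names, buffer size, and the POSIX clock_gettime timer are my choices; doSomething() etc. are the question's functions, defined elsewhere):

    #include <stdio.h>
    #include <time.h>

    void doSomething(void);  void drawSomething(void);   /* the question's procedures */
    void doSomething2(void); void sendSomething(void);

    #define FRAMES 1024                                  /* record this many iterations, then print */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static double t_do[FRAMES], t_draw[FRAMES], t_do2[FRAMES], t_send[FRAMES];

    void main_loop(void) {
        for (int f = 0; f < FRAMES; f++) {
            double t0 = now_sec(); doSomething();
            double t1 = now_sec(); drawSomething();
            double t2 = now_sec(); doSomething2();
            double t3 = now_sec(); sendSomething();
            double t4 = now_sec();
            t_do[f] = t1 - t0;  t_draw[f] = t2 - t1;
            t_do2[f] = t3 - t2; t_send[f] = t4 - t3;
        }
        for (int f = 0; f < FRAMES; f++)                 /* print outside the timed region */
            printf("%f %f %f %f\n", t_do[f], t_draw[f], t_do2[f], t_send[f]);
    }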
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
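A small sketch of that fix (buffer size handling and the function name are mine): allocate, then write non-zero data before the timed region so every page is really backed by its own physical page.

    #include <stdlib.h>
    #include <string.h>

    /* Allocate and dirty a buffer before timing, so reads hit real per-page memory
     * instead of the shared copy-on-write zero page. */
    double *alloc_warm(size_t n) {
        double *buf = malloc(n * sizeof *buf);
        if (buf)
            memset(buf, 1, n * sizeof *buf);   /* any non-zero fill forces page allocation */
        return buf;
    }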
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
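One crude workaround for the ramp-up issue (the spin count is an arbitrary assumption; tune it to your machine) is to burn a few hundred milliseconds of CPU before the first timed measurement:

    /* Spin briefly so the core has clocked up to max turbo before timing starts. */
    static void warm_up_cpu(void) {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 200000000UL; i++)   /* roughly a few hundred ms; adjust */
            x += i;
        (void)x;
    }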
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but its throughput when run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
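As a concrete (and hedged) illustration of the latency-vs-throughput point, compare these two ways of summing an array of doubles: the first is one long dependency chain bound by FP-add latency, the second keeps four independent accumulators so the adds can overlap. The example and the unrolling factor are mine, and the exact speedup depends on the CPU and compiler flags.

    #include <stddef.h>

    /* Latency-bound: every add must wait for the previous one to finish. */
    double sum_dependent(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Closer to throughput-bound: four independent chains can be in flight at once. */
    double sum_independent(const double *a, size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i]; s1 += a[i+1]; s2 += a[i+2]; s3 += a[i+3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }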
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
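For instance (my illustration, not code from the question), both functions below contain a * in the source, but a compiler will typically strength-reduce the first to repeated addition and turn the second into a shift, so neither one measures a general multiply instruction:

    /* Multiply by the loop counter: usually strength-reduced to tmp += foo each iteration. */
    void scale_by_index(long *out, long foo, int n) {
        for (int i = 0; i < n; i++)
            out[i] = foo * i;
    }

    /* Multiply by a constant power of 2: usually compiled to a shift, not a multiply. */
    long times_eight(long x) {
        return x * 8;
    }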
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
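A minimal sketch of both tricks together, using the mask method from the question at the top of the page as the code under test (the loop count is arbitrary, and the empty asm statement is a GNU/Clang extension):

    #include <stdint.h>
    #include <stdio.h>

    static inline uint64_t work(uint64_t a, unsigned b) {    /* the code under test */
        return a & ((1ULL << b) - 1);
    }

    int main(int argc, char **argv) {
        (void)argv;
        uint64_t x = (uint64_t)argc;          /* runtime input: not a compile-time constant */
        volatile uint64_t sink = 0;           /* forces the result to actually be produced */

        for (long i = 0; i < 100000000L; i++) {
            asm volatile("" : "+r"(x));       /* make the compiler forget what it knows about x */
            sink = work(x, 13);
        }
        printf("%llu\n", (unsigned long long)sink);
        return 0;
    }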
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
I am writing some C code and I would like to know if making a simple operation like multiplication more CPU-friendly makes any difference and makes the code faster. For example, replacing this line of code:
y = x * 15;
with
y = x << 4;
y -= x;
Does the compiler already do that? Should I use the -O2 option in order to make it happen?
There are two parts to the answer:
No, unless you are writing a very specialized function (e.g. a signal processing function that must execute in 20 clocks) you should not optimize; leave that to the compiler. In general your job is to write readable code, and the compiler will optimize it to the best of its ability. Note that the optimization will be different for different processors, as their hardware (computational capability) can be very different. For example a shift-by-N instruction (like that in your code) may take N clocks on a processor with a regular shifter, but it will take a single clock (or less) on processors with a hardware barrel shifter.
Yes, most modern optimizing compilers will optimize (for example replace multiplication by shifting where appropriate) without explicit optimization options.
Summarized: optimize only in the rare situations where you already know that the compiler didn't do a good job, it is a problem that must be addressed, you know well how to do better than the compiler, and the resulting increase in maintenance cost is worth it.
Optimizing code by hand is almost always an exercise in futility these days, especially in higher level languages. While C is almost assembly, modern compilers have a lot more tricks built into them than most people are aware of.
In addition, unless the code you're optimizing is going to be used a lot, i.e., millions of times in close succession, the work of optimizing the code will cost more time than the savings you achieve.
With that said, the only way to see if your code is measurably faster would be to test it: Put each version in a tight loop and execute it a million (or more) times, and see if there's a noticeable difference.
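A rough version of such a test (the timer, iteration count, and the volatile/asm tricks to stop the compiler from hoisting or deleting the work are my choices; and, per the litmus tests elsewhere on this page, swap the order of the two loops to check the comparison is fair):

    #include <stdio.h>
    #include <time.h>

    int main(int argc, char **argv) {
        (void)argv;
        volatile unsigned sink = 0;
        unsigned x = (unsigned)argc;                 /* unknown to the compiler at compile time */
        const long iters = 100000000L;

        clock_t t0 = clock();
        for (long i = 0; i < iters; i++) {
            asm volatile("" : "+r"(x));              /* keep the work inside the loop */
            sink = x * 15;
        }
        clock_t t1 = clock();
        for (long i = 0; i < iters; i++) {
            asm volatile("" : "+r"(x));
            unsigned y = x << 4;                     /* x*16 ... */
            y -= x;                                  /* ... minus x = x*15 */
            sink = y;
        }
        clock_t t2 = clock();

        printf("mul: %f s   shift+sub: %f s\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        return (int)sink;
    }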
Note that your optimization is for a specific multiplier - any other operand you use it for is going to yield different results. Because it can't be generalized, there's little likelihood this optimization will be done by any compiler in all cases - and just looking at the code, not knowing what processor architecture it will run on, I can't say whether it would be faster or not.
While I was reading C tips, I saw this tip here: http://www.cprogramming.com/tips/tip/multiply-rather-than-divide
but I am not sure. I was told that both multiplication and division are slow, time-consuming, and require many cycles.
I have also seen people often use i << 2 instead of i * 4, since shifting is faster.
Is it a good tip to use * 0.5 instead of / 2? Or do modern compilers already optimize this in a better way?
It's true that some (if not most) processors can multiply faster than performing a division operation, but it's like the myth of ++i being faster than i++ in a for loop. Yes, it once was, but nowadays compilers are smart enough to optimize all those things for you, so you should not care about this anymore.
And about bit-shifting, it once was faster to shift << 2 than to multiply by 4, but those days are over as most processors can multiply in one clock cycle, just like a shift operation.
A great example of this was the calculation of the pixel address in VGA 320x240 mode. They all did this:
address = x + (y << 8) + (y << 6)
to multiply y by 320. On modern processors, this can be slower than just doing:
address = x + y * 320;
So, just write what you think and the compiler will do the rest :)
I find that this service is invaluable for testing this sort of stuff:
http://gcc.godbolt.org/
Just look at the final assembly. 99% of the time, you will see that the compiler optimises it all to the same code anyway. Don't waste the brain power!
In some cases, it is better to write it explicitly. For example, 2^n (where n is a positive integer) could be written as (int) pow( 2.0, n ) but it is obviously better to use 1<<n (and the compiler won't make that optimisation for you). So it can be worth keeping these things in the back of your mind. As with anything though, don't optimise prematurely.
"multiply by 0.5 rather than divide by 2" (2.0) is faster on fewer environments these days than before, primarily due to improved compilers that will optimize the code.
"use i << 2 instead of i x 4" is faster in fewer environments for similar reasons.
In select cases, the programmer still needs to attend to such issues, but it is increasingly rare. Code maintenance continues to grow as a dominate issue. So use what makes the most sense for that code snippet: x*0.5, x/2.0, half(x), etc.
Compilers readily optimize code. Recommend you code with high level issues in mind. E. g. Is the algorithm O(n) or O(n*n)?
The important thought to pass on is that best code design practices evolve and variations occur amongst environments. Be adaptable. What is best today may shift (or multiply) in the future.
Many CPUs can perform multiplication in 1 or 2 clock cycles but division always takes longer (although FP division is sometimes faster than integer division).
If you look at this answer How can I compare the performance of log() and fp division in C++? you will see that division can exceed 24 cycles.
Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.
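The programmer-level analogue of that trick (my example, not from the linked answer) is to hoist one real division out of a loop and multiply by the reciprocal instead; note the results are not always bit-identical to true division:

    #include <stddef.h>

    /* n divisions: each one is the slow, iterative operation. */
    void scale_div(double *a, size_t n, double d) {
        for (size_t i = 0; i < n; i++)
            a[i] /= d;
    }

    /* 1 division + n multiplications; may differ from scale_div in the last bit. */
    void scale_recip(double *a, size_t n, double d) {
        double inv = 1.0 / d;
        for (size_t i = 0; i < n; i++)
            a[i] *= inv;
    }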
If you are working with integers and you expect an integer result, it's better to use / 2; this way you avoid unnecessary conversions to/from floating point.
I am going to analyse and optimize some C code, and therefore I first have to check whether the functions I want to optimize are memory-bound or CPU-bound. In general I know how to do this, but I have some questions about counting floating point operations and analysing the size of the data that is used. Look at the following for-loop, which I want to analyse. The values of the array are doubles (that means 8 bytes each):
for (int j = 0; j < N; j++) {
    for (int i = 1; i < Nt; i++) {
        matrix[j*Nt + i] = matrix[j*Nt + i-1] * mu + matrix[j*Nt + i] * sigma;
    }
}
1) How many floating point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations within the array indices too (j*Nt+i, which would be 2 more operations per access)?
2) How much data is transferred? 2*((Nt-1)*N)*8 bytes or 3*((Nt-1)*N)*8 bytes? I mean, every entry of the matrix has to be loaded. After the calculation, the new value is saved to that index of the array (so that is 1 load and 1 store). But this value is used for the next calculation. Is another load operation therefore needed, or is this value (matrix[j*Nt+i-1]) already available without a load operation?
With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code is actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like you ask about 2*... or 3*...) are completely moot because they vary in practice depending on lots of details.
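So, if you want numbers at all, get them by timing: run the loop on realistically sized, initialized data and convert the measured time into rates using your own operation count. A rough sketch (the sizes are placeholders, the timer is POSIX clock_gettime, and the 3-FLOPs / 3-accesses-per-element factors are just the counts discussed in the question):

    #include <stdio.h>
    #include <time.h>

    enum { N = 1000, Nt = 1000 };                       /* placeholder sizes */
    static double matrix[N * Nt];

    static void kernel(double mu, double sigma) {       /* the loop from the question */
        for (int j = 0; j < N; j++)
            for (int i = 1; i < Nt; i++)
                matrix[j*Nt + i] = matrix[j*Nt + i-1] * mu + matrix[j*Nt + i] * sigma;
    }

    int main(void) {
        for (long k = 0; k < (long)N * Nt; k++)         /* initialize so pages are really allocated */
            matrix[k] = 1.0 + k * 1e-9;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        kernel(0.5, 0.5);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double flops = 3.0 * (Nt - 1) * N;                          /* 2 mul + 1 add per element */
        double bytes = 3.0 * (Nt - 1) * N * sizeof(double);         /* upper bound: 2 loads + 1 store */
        printf("%.3f ms   %.2f GFLOP/s   %.2f GB/s\n",
               secs * 1e3, flops / secs * 1e-9, bytes / secs * 1e-9);
        return 0;
    }

For steadier numbers you would run the kernel several times and take the minimum, per the warm-up caveats earlier on this page.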
From a long time ago I have a memory, which has stuck with me, that says comparisons against zero are faster than comparisons against any other value (ahem, Z80).
In some C code I'm writing I want to skip values which have all their bits set. Currently the type of these values is char but may change. I have two different alternatives to perform the test:
if (!~b)
/* skip */
and
if (b == 0xff)
/* skip */
Apart from the latter making the assumption that b is an 8-bit char whereas the former does not, would the former ever be faster due to the old compare-to-zero optimization trick, or are the CPUs of today way beyond this kind of thing?
If it is faster, the compiler will substitute it for you.
In general, you can't write C better than the compiler can optimize it. And it is architecture specific anyway.
In short, don't worry about it unless that sub-micro-nano-second is ultra important
From what I recall in my architecture classes, I believe they should be equally fast. Both have 2 instructions.
First example
1. Negate b into a temp register
2. Compare temp register equal 0
Second example
1. Subtract 0xff from b into a temp register
2. Compare temp register equal to 0
These are basically identical, and besides, even if your particular architecture requires more or less than this, is it really worth the fraction of a nanosecond? Several minutes have been spent just answering this question.
I would say it's not so much that the CPUs are beyond these kinds of tricks as that the compilers are.
The CPUs of today are, however, beyond simple tricks which pull an extra clock-tick or two of speed. Even if you do this 100,000 times a second, we are still only talking about an increase in speed of 0.00003 seconds on a single-core 3 GHz computer - it is simply not worth your time to worry about things like this.
Go with the one that will be easier for the person who is maintaining your code to understand. If you have a successful product, most of the expense in software is in maintenance. If you write cryptic code you add to that expense. If you don't have a successful product, it doesn't matter because no one will have to maintain it. I have been in situations where I had to save every byte I could, and had to resort to tricks like the one you gave, but I only do it as the very very very last resort.