While reading C tips, I came across this one: http://www.cprogramming.com/tips/tip/multiply-rather-than-divide
but I am not sure about it. I was told that both multiplication and division are slow and take many cycles,
and I have seen people often use i << 2 instead of i * 4, since shifting is faster.
Is it a good tip to use * 0.5 instead of / 2, or do modern compilers optimize this better anyway?
It's true that some (if not most) processors can multiply faster than they can divide, but it's like the myth of ++i being faster than i++ in a for loop. Yes, it once was, but nowadays compilers are smart enough to optimize all those things for you, so you should not care about this anymore.
And about bit-shifting: it once was faster to shift << 2 than to multiply by 4, but those days are over, as most processors can multiply in one clock cycle, just like a shift operation.
A great example of this was the calculation of the pixel address in VGA 320x240 mode. They all did this:
address = x + (y << 8) + (y << 6)
to multiply y by 320 (since y*256 + y*64 = y*320). On modern processors, this can be slower than just doing:
address = x + y * 320;
So, just write what you think and the compiler will do the rest :)
I find that this service is invaluable for testing this sort of stuff:
http://gcc.godbolt.org/
Just look at the final assembly. 99% of the time, you will see that the compiler optimises it all to the same code anyway. Don't waste the brain power!
In some cases, it is better to write it explicitly. For example, 2^n (where n is a positive integer) could be written as (int) pow( 2.0, n ) but it is obviously better to use 1<<n (and the compiler won't make that optimisation for you). So it can be worth keeping these things in the back of your mind. As with anything though, don't optimise prematurely.
"multiply by 0.5 rather than divide by 2" (2.0) is faster on fewer environments these days than before, primarily due to improved compilers that will optimize the code.
"use i << 2 instead of i x 4" is faster in fewer environments for similar reasons.
In select cases, the programmer still needs to attend to such issues, but it is increasingly rare. Code maintenance continues to grow as a dominate issue. So use what makes the most sense for that code snippet: x*0.5, x/2.0, half(x), etc.
Compilers readily optimize code. Recommend you code with high level issues in mind. E. g. Is the algorithm O(n) or O(n*n)?
The important thought to pass on is that best code design practices evolve and variations occur amongst environments. Be adaptable. What is best today may shift (or multiply) in the future.
Many CPUs can perform multiplication in 1 or 2 clock cycles but division always takes longer (although FP division is sometimes faster than integer division).
If you look at this answer, "How can I compare the performance of log() and fp division in C++?", you will see that division can exceed 24 cycles.
Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.
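As a small illustration of that reciprocal idea in C (the function names here are just for this sketch): multiplying by a precomputed reciprocal is exact when the divisor is a power of two, which is why compilers apply it freely there; for other constants the product may round differently than a true division, so compilers are more conservative unless you allow relaxed FP math.
float half(float x)  { return x * 0.5f; }           /* exact: 0.5 is a power of two */
float tenth(float x) { return x * (1.0f / 10.0f); } /* may differ from x / 10.0f in the last bit */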
If you are working with integers and you expect an integer result, it's better to use / 2; this way you avoid unnecessary conversions to/from floating point.
In my program there is heavy use of the operation n % 10. I know that the modulo operation can be done much faster when the divisor m is a power of 2, since n % m can then be replaced by n & (m - 1). Is there any faster way to calculate the modulus when the divisor is 10?
In my case n is a uint8_t in some cases and a uint32_t in others.
Because most modern processors can do multiplication much, much faster than division, it is often possible to speed up division and modulus operations where the divisor is a known small constant by replacing the division with one or two multiplications and a few other fast operations (such as shifts and additions).
To do so requires computing at compile time some magic numbers dependent on the divisor; fortunately, most modern compilers know how to do this, so you don't need to do anything to take advantage of it. Just let your compiler do the heavy lifting for you, as @chux suggests in an excellent answer.
You can help the compiler by using unsigned types; for some divisors, signed division and modulus are harder to replace.
The basic outline of the optimisation of modulus looks like this:
If you had exact arithmetic, you could replace x % p with p * ((x * (1/p)) % 1). For constant p, 1/p can be precomputed at compile time. The % 1 operation simply consists of discarding the integer part and keeping only the fraction bits (a mask in fixed point). So that replaces a division with two multiplies, and if p only has a few bits set, the multiply by p might be further optimised into a few left-shifts and additions.
We can do that computation with fixed-point arithmetic, taking advantage of the fact that most processors produce a double-sized result for integer multiplication. Since we don't care about the integer part of the inner multiplication and we know that the result of the outer multiplication must be less than p, we only need to reserve ceil(log2 p) bits for the integer part of the computation, leaving the rest of the bits for the fraction. And that might give us enough precision to correctly handle the possible range of values of x, particularly if x has a limited range (e.g. uint8_t or even uint16_t). The key is finding a position of the fixed point which minimises the error in representation of 1/p.
For many small values of p, that works. For others, there is an alternative (but slower) solution which involves estimating q = x/p using multiplication by the inverse, and then computing x - q * p. If the estimate of q can be guaranteed to be either correct or off by one in a known direction, we only need to correct the final computation by conditionally adding or subtracting p; that can be accomplished without a branch on many modern CPUs. (The direction of the error is known because it will depend only on whether the approximation we chose for the inverse of the divisor was too small or too big.)
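As a hedged, concrete sketch of that second approach in C, specialised to p = 10 for 8-bit x: the constant 205/2048 approximates 1/10 closely enough that the estimated quotient is exact for every value 0..255, so no correction step is even needed here.
#include <stdint.h>

/* Sketch: q = x * (205/2048) equals x / 10 exactly for all 0 <= x <= 255,
   so x % 10 is just x - 10*q. Two multiplies and a shift, no divide. */
static inline uint8_t mod10_u8(uint8_t x)
{
    uint8_t q = (uint8_t)(((uint16_t)x * 205) >> 11); /* q == x / 10 */
    return (uint8_t)(x - q * 10);
}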
In the very specific case of x % 10 where x is a uint_8, you might be able to do better than the above using a 256-byte lookup table. That would only be worthwhile if you were doing the modulus operation in a tight loop over a large number of values, and even then you'd want to profile carefully to verify that it is an improvement.
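A sketch of such a table-based version, should profiling justify it (the names are illustrative):
#include <stdint.h>

static uint8_t mod10_table[256];   /* 256-byte lookup table for x % 10 */

void init_mod10_table(void)
{
    for (int i = 0; i < 256; i++)
        mod10_table[i] = (uint8_t)(i % 10);
}

static inline uint8_t mod10_u8_lut(uint8_t x)
{
    return mod10_table[x];
}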
I doubt whether that's the best expenditure of your time; there are probably much more fruitful optimisation opportunities in your application.
Is there any faster way to calculate the modulus when the divisor is 10?
With a good compiler, no. The compiler would have already emitted good code. You can explore different optimization settings with the compiler.
OTOH, if you know of some restrictions that the compiler cannot assume about n % 10, like values always being positive or within a sub-range, you might be able to out-optimize the compiler.
Such micro-optimisation is usually not an efficient use of a programmer's time.
I am trying to learn some basic benchmarking. I have a loop in my Java program like this:
float a = 6.5f;
int b = 3;
float var = 0f;
for (long j = 0; j < 999999999; j++) {
    var = a * b + (a / b);
} // end of for
My processor takes around 0.431635 seconds to process this. How would I calculate the processor speed in terms of FLOPS (floating-point operations per second) and IOPS (integer operations per second)? Can you provide an explanation with some steps?
You have a single loop with 999999999 iterations: let's call this 1e9 (one billion) for simplicity. The integers will get promoted to floats in the calculations that involve both, so the loop contains 3 floating-point operations: one mult, one add, and one div, so there are 3e9. This takes 0.432s, so you're apparently getting about 6.94 GFLOP/s (3e9/0.432). Similarly, you are doing 1 integer op (j++) per loop iteration, so you are getting 1e9/0.432, or about 2.31 GIOP/s.
However, the calculation a*b+(a/b) is loop-invariant, so it would be pretty surprising if this didn't get optimized away. I don't know much about Java, but any C compiler will evaluate this at compile-time, remove the a and b variables and the loop, and (effectively) replace the whole lot with var=21.667;. This is a very basic optimization, so I'd be surprised if javac didn't do it too.
I have no idea what's going on under the hood in Java, but I'd be suspicious of getting 7 GFLOPs. Modern Intel CPUs (I'm assuming that's what you've got) are, in principle, capable of two vector arithmetic ops per clock cycle with the right instruction mix (one add and one mult per cycle), so for a 3 GHz 4-core CPU, it's even possible to get 3e9*4*8 = 96 single-precision GFLOPs under ideal conditions. The various mul and add instructions have a reciprocal throughput of 1 cycle, but the div takes more than ten times as long, so I'd be very suspicious of getting more than about CLK/12 FLOPs (scalar division on a single core) once division is involved: if the compiler is smart enough to vectorize and/or parallelize the code to get more than that, which it would have to do, it would surely be smart enough to optimize away the whole loop.
In summary, I suspect that the loop is being optimized away completely and the 0.432 seconds you're seeing is just overhead. You have not given any indication of how you're timing the above loop, so I can't be sure. You can check this for yourself by replacing the ~1e9 loop iterations with 1e10: if it doesn't take about 10x as long, you're not timing what you think you're timing.
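To illustrate the fix in C (where the same constant-folding pitfall exists), here is a sketch of a loop the optimiser cannot simply delete: the body depends on the loop counter, and the result is actually used afterwards.
#include <stdio.h>

int main(void)
{
    float a = 6.5f;
    int b = 3;
    float sum = 0.0f;

    /* Each iteration depends on j, so the computation cannot be
       hoisted out of the loop as a compile-time constant. */
    for (long j = 0; j < 999999999L; j++)
        sum += (a + j) * b + (a / b);

    /* Printing the result keeps the whole loop from being removed
       as dead code. */
    printf("%f\n", sum);
    return 0;
}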
There's a lot more to say about benchmarking and profiling, but I'll leave it at that.
I know this is very late, but I hope it helps someone.
Emmet.
Suppose I have a very small float a (for instance a=0.5) that enters the following expression:
6000.f * a * a;
Does the order of the operands make any difference? Is it better to write
6000.f * (a*a);
Or even
float result = a*a;
result *= 6000.f;
I've checked the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic but couldn't find anything.
Is there an optimal way to order operands in a floating point operation?
It really depends on the values and your goals.
For instance, if a is very small, a*a might be zero, whereas 6000.0*a*a (which means (6000.0*a)*a) could still be nonzero. For avoiding overflow and underflow, the general rule is to apply the associative law to first perform multiplications where the operands' logs have opposite sign, which means squaring first is generally the worst strategy. On the other hand, for performance reasons, squaring first might be a very good strategy if you can reuse the value of the square.
You may encounter yet another issue, which could matter more for correctness than overflow/underflow issues if your numbers will never be very close to zero or infinity: certain multiplications may be guaranteed to have exact answers, while others involve rounding. In general, you'll get the most accurate results by minimizing the number of rounding steps that happen.
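To make the underflow point concrete, here is a small C sketch; 1e-23f is chosen so that a*a falls below the smallest subnormal float while (6000.f*a)*a does not:
#include <stdio.h>

int main(void)
{
    float a = 1e-23f;
    printf("%g\n", (a * a) * 6000.f);  /* 0: a*a underflows to zero */
    printf("%g\n", (6000.f * a) * a);  /* ~6e-43: each step stays representable */
    return 0;
}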
The optimal way depends on the purpose, really.
First of all, multiplication is faster than division.
So if you have to write a = a / 2;, it is better to write a = a * 0.5f;.
Your compiler is usually smart enough to replace division with multiplication on constants if the result is the same, but it will not do that with variables, of course.
Sometimes, you can optimize a bit by replacing divisions with multiplications, but there may be problems with precision.
Some other operations may be faster but less precise.
Let's take an example.
float f = (a * 100000) / (b * 10);
float g = (a / b) * (100000 / 10);
These are mathematically equivalent, but the results can be slightly different.
The first uses two multiplications and one division; the second uses one division and one multiplication. In both cases there may be a loss of precision; it depends on the magnitudes of a and b: if they are small values the first works better, if they are large values the second works better.
Then... if you have several constants and you want speed, group the constants together.
a = 6.3f * a * 2.0f * 3.1f;
Just write
a = a * (6.3f * 2.0f * 3.1f);
Some compilers optimize well, others less, but in either case there is no risk in keeping all the constants together.
After saying all this, we could talk for hours about how processors work.
Even within the same family, like Intel, things work differently between generations!
Some compilers use SSE instructions, others don't.
Some processors support SSE2, some SSE, some only MMX... and some systems don't have an FPU at all!
Each system does some calculations better than others; finding a common rule is hard.
You should just write readable code, clean and simple, without worrying too much about these unpredictable, very low-level optimizations.
If your expression looks complicated, do some algebra and/or go to the wolframalpha search engine and ask it to simplify it for you :)
That said, you don't really need to declare one variable and replace its contents over and over; the compiler can usually optimize less in this situation.
a = 5 + b;
a /= 2 * c;
a += 2 - c;
a *= 7;
just write your expression avoiding this mess :)
a = ((5 + b) / (2 * c) + 2 - c) * 7;
About your specific example, 6000.f * a * a: just write it as you wrote it, no need to change it; it is fine as it is.
Not typically, no.
That being said, if you're doing multiple operations with large values, it may make sense to order them in a way that avoids overflows or reduces precision errors, based on their precedence and associativity, if the algorithm provides a way to make that obvious. This would, however, require advance knowledge of the values involved, and not just be based on the syntax.
There are indeed algorithms to minimize cumulative error in a sequence of floating-point operations. One such is the Kahan summation algorithm: http://en.wikipedia.org/wiki/Kahan_summation_algorithm. Others exist for other operations: http://www.cs.cmu.edu/~quake-papers/related/Priest.ps.
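For reference, here is a minimal C sketch of Kahan (compensated) summation; note that aggressive flags such as -ffast-math allow the compiler to reassociate the arithmetic, which defeats the compensation:
#include <stddef.h>

double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c = 0.0;                 /* running compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;        /* apply the correction to the next term */
        double t = sum + y;         /* low-order bits of y may be lost here */
        c = (t - sum) - y;          /* algebraically zero; captures the rounding error */
        sum = t;
    }
    return sum;
}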
Floating point calculation is neither associative nor distributive on processors. So,
(a + b) + c is not equal to a + (b + c)
and a * (b + c) is not equal to a * b + a * c
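For example, this quick C demonstration shows the effect (values chosen so the rounding is visible in double precision):
#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;
    printf("%g\n", (a + b) + c);  /* 1: the large terms cancel first */
    printf("%g\n", a + (b + c));  /* 0: c is absorbed when added to b */
    return 0;
}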
Is there any way to perform deterministic floating point calculations that do not give different results? They would be deterministic on a uniprocessor, of course, but they would not be deterministic in multithreaded programs if threads add to a sum, for example, as there might be different interleavings of the threads.
So my question is, how can one achieve deterministic results for floating point calculations in multithreaded programs?
Floating-point is deterministic. The same floating-point operations, run on the same hardware, always produce the same result. There is no black magic, noise, randomness, fuzzing, or any of the other things that people commonly attribute to floating-point. The tooth fairy does not show up, take the low bits of your result, and leave a quarter under your pillow.
Now, that said, certain blocked algorithms that are commonly used for large-scale parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
What can you do about it?
First, make sure that you actually can't live with the situation. Many things that you might try to enforce ordering in a parallel computation will hurt performance. That's just how it is.
I would also note that although blocked algorithms may introduce some amount of non-determinism, they frequently deliver results with smaller rounding errors than do naive unblocked serial algorithms (surprising but true!). If you can live with the errors produced by a naive serial algorithm, you can probably live with the errors of a parallel blocked algorithm.
Now, if you really, truly, need exact reproducibility across runs, here are a few suggestions that tend not to adversely affect performance too much:
Don't use multithreaded algorithms that can reorder floating-point computations. Problem solved. This doesn't mean you can't use multithreaded algorithms at all, merely that you need to ensure that each individual result is only touched by a single thread between synchronization points. Note that this can actually improve performance on some architectures if done properly, by reducing D$ contention between cores.
In reduction operations, you can have each thread store its result to an indexed location in an array, wait for all threads to finish, then accumulate the elements of the array in order (a sketch of this follows after these suggestions). This adds a small amount of memory overhead, but is generally pretty tolerable, especially when the number of threads is "small".
Find ways to hoist the parallelism. Instead of computing 24 matrix multiplications, each one of which uses parallel algorithms, compute 24 matrix products in parallel, each one of which uses a serial algorithm. This, too, can be beneficial for performance (sometimes enormously so).
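Here is a hedged sketch of the array-of-partial-sums pattern mentioned above, using POSIX threads; NTHREADS, the strided slicing, and the input values are all illustrative:
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double partial[NTHREADS];   /* one slot per thread: no locks, no races */

static void *worker(void *arg)
{
    long t = (long)arg;
    double s = 0.0;
    /* Each thread sums a fixed, disjoint slice in a fixed order. */
    for (long i = t; i < N; i += NTHREADS)
        s += data[i];
    partial[t] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0 / (double)(i + 1);

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(&tid[t], NULL);

    /* Accumulate the per-thread results in a fixed order, so the
       sequence of roundings is identical on every run. */
    double sum = 0.0;
    for (long t = 0; t < NTHREADS; t++)
        sum += partial[t];
    printf("%.17g\n", sum);
    return 0;
}
For a fixed NTHREADS this gives bit-identical results across runs, because every floating-point addition happens in the same order every time.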
There are lots of other ways to handle this. They all require thought and care. Parallel programming usually does.
Edit: I've removed my old answer since I seem to have misunderstood OP's question. If you want to see it you can read the edit history.
I think the ideal solution would be to switch to having a separate accumulator for each thread. This avoids all locking, which should make a drastic difference to performance. You can simply sum the accumulators at the end of the whole operation.
Alternatively, if you insist on using a single accumulator, one solution is to use "fixed-point" rather than floating point. This can be done with floating-point types by including a giant "bias" term in your accumulator to lock the exponent at a fixed value. For example, if you know the accumulator will never exceed 2^32, you can start the accumulator at 0x1p32. This will lock you at 32 bits of precision to the left of the radix point, and 20 bits of fractional precision (assuming double). If that's not enough precision, you could use a smaller bias (assuming the accumulator will not grow too large) or switch to long double. If long double is 80-bit extended format, a bias of 2^32 would give 31 bits of fractional precision.
Then, whenever you want to actually "use" the value of the accumulator, simply subtract out the bias term.
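A tiny C sketch of the bias trick; as above, it assumes the true sum never approaches 2^32:
#include <stdio.h>

int main(void)
{
    double acc = 0x1p32;   /* bias term: pins the accumulator's exponent */

    /* Every addition now rounds at the same fixed position (2^-20),
       regardless of the order in which the terms arrive. */
    acc += 1.5;
    acc += 3.25;
    acc += 0.125;

    printf("%g\n", acc - 0x1p32);  /* subtract the bias to read out the sum: 4.875 */
    return 0;
}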
Even using a high-precision fixed-point datatype would not solve the problem of making the results for said equations deterministic (except in certain cases). As Keith Thompson pointed out in a comment, 1/3 is a trivial counter-example of a value that cannot be stored correctly in either a standard base-10 or base-2 floating point representation (regardless of precision or memory used).
One solution that, depending upon particular needs, may address this issue (it still has limits) is to use a Rational number data-type (one that stores both a numerator and denominator). Keith suggested GMP as one such library:
GMP is a free library for arbitrary precision arithmetic, operating on signed integers, rational numbers, and floating point numbers. There is no practical limit to the precision...
Whether it is suitable (or adequate) for this task is another story...
Happy coding.
Use a decimal type or library supporting such a type.
Try storing each intermediate result in a volatile object:
volatile double a_plus_b = a + b;
volatile double a_plus_b_plus_c = a_plus_b + c;
This is likely to have nasty effects on performance. I suggest measuring both versions.
EDIT: The purpose of volatile is to inhibit optimizations that might affect the results even in a single-threaded environment, such as changing the order of operations or storing intermediate results in wider registers. It doesn't address multi-threading issues.
EDIT2: Something else to consider is that
A floating expression may be contracted, that is, evaluated as though
it were an atomic operation, thereby omitting rounding errors implied
by the source code and the expression evaluation method.
This can be inhibited by using
#include <math.h>
...
#pragma STDC FP_CONTRACT OFF
Reference: C99 standard (large PDF), sections 7.12.2 and 6.5 paragraph 8. This is C99-specific; some compilers might not support it.
Use packed decimal.
I'm taking a Computer Systems class as a prerequisite for my Masters, and I came across something I found fascinating but hard to see a practical use for: "faking subtraction", i.e. the fact that there doesn't need to be a subtraction instruction.
Something like:
x - y
Can be written as:
x + (~y + 1)
Now, that's all well and good, but it seems overly complicated for a simple subtraction, especially when you could just as easily write "x - y". Are there situations where it would be necessary to do this, or is it just something that CAN be done but isn't?
This is often how it's done at the hardware level (i.e. inside the ALU).
At the software level, it's generally useless, as it can never be more efficient than the straightforward subtraction (unless you have a truly bizarre compiler/platform combination).
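If you want to convince yourself of the identity, here is a quick check in C using unsigned arithmetic (where wraparound is well defined):
#include <stdio.h>

int main(void)
{
    unsigned x = 42, y = 17;
    /* x - y and x + (~y + 1) agree modulo the word size. */
    printf("%u %u\n", x - y, x + (~y + 1));  /* prints: 25 25 */
    return 0;
}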
The two's complement implementation is done in hardware, so you do not need to implement it like that for built-in datatypes.
If you are making an n-bit integer arithmetic library, then you need to emulate the integer addition, subtraction, multiplication and division etc. operations, in which case such a technique might be used to implement subtraction in terms of addition, but using the carry flag to do so is a better implementation in my opinion.
It should be obvious that that is how subtraction is done internally, so I'm not sure what you mean by "being used in the real world". This is why two's complement was chosen in the first place: subtraction is just overflowing negative addition.
I do not see any reason to do it in your C code. Doing it in software is no faster than subtracting with the minus operator, and it is a lot less clear.
However, that is the way processors execute subtraction. I bet you have seen this code as an example of what the hardware does, since it is easier to see how x + (~y + 1) becomes a logic circuit.
So... no, you will not use this code in the real world, but this operation is executed many times inside your processor.
I can't see the point of doing this. It is not any more efficient; in fact, if it's not optimised out by the compiler, it ends up generating more opcodes.
Stuff like this was more common back before CPUs had billions of transistors to play with. A particular CPU might not implement a specific subtract opcode, and so a compiler (or assembly programmer) targeting it would have to know that trick.
These manipulations can also help you understand the internal implementation of CPUs. For example, a CPU's division operation is sometimes accomplished by taking the reciprocal of the divisor and multiplying it by the dividend; the reciprocal is the only actual "division" being performed.