I want to write a function in C that takes seconds and nanoseconds as input, converts them into microseconds, and returns the total in microseconds.
unsigned long long get_microseconds(int seconds, unsigned long long nSeconds);
Now the conversion is pretty trivial. I can use the following formula:
uSeconds = Seconds*1000000 + nSeconds/1000 (loss of precision in the nanosecond conversion is alright; my timer's minimum resolution is 100 microseconds anyway)
What would be the fastest way of implementing this equation without using the multiplication and division operators, to get the best accuracy and the fewest CPU cycles?
EDIT: I am running on a custom DSP with a GNU-based but custom-designed toolchain. I have not really tested the performance of the arithmetic operations; I am simply curious to know whether they would affect performance and whether there is a way to improve it.
return Seconds*1000000ULL + nSeconds/1000;
If there's any worthwhile bit-shifting or other bit manipulation worth doing, your compiler will probably take care of it.
The compiler will almost certainly optimize the multiplication as far as it can. What it will not do is "accept a small loss" when dividing by 1000, so you will perhaps find it somewhat faster to write
return Seconds*1000000ULL + nSeconds/1024; /* Explicitly show the error */
...keeping in mind that nSeconds can't grow too much, or the error may become unacceptable.
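To put a rough number on that (my own illustration, using the largest nanosecond value a split time can hold): dividing by 1024 instead of 1000 scales every result by 1000/1024, so it comes out about 2.3% too small.

/* Illustration of the size of the approximation (values are my arithmetic) */
unsigned long long exact  = 999999999ULL / 1000; /* 999999 us */
unsigned long long approx = 999999999ULL / 1024; /* 976562 us, ~23 ms short */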
But whatever you do, test the results, both speed and accuracy, over real inputs. Also explore converting the function to a macro to save the call altogether. Frankly, for so simple a calculation there's precious little chance to do better than an optimizing compiler.
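If you do try the macro route, a minimal sketch (the macro name is illustrative; on a GNU-based toolchain a static inline function usually gives the same effect):

/* Illustrative macro version of the same formula; the ULL forces 64-bit
 * arithmetic so the seconds term cannot overflow a 32-bit int. */
#define GET_MICROSECONDS(sec, nsec) \
    ((unsigned long long)(sec) * 1000000ULL + (nsec) / 1000ULL)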
Also, consider the weight of this optimization in the scope of the global algorithm. Is this function really called with such a frequency that its savings are worth the hassle?
If nSeconds never gets above 2^32 (it shouldn't if you are working with "split time" as from timespec - it should be below 10^9), you should probably use a 32 bit integer for it.
On a 64 bit machine it's not a problem to use 64 bit integers for everything (the division is optimized to a multiply by inverse+shift), but on a 32 bit one the compiler gets tricked into using a full 64 bit division routine, which is quite heavyweight. So, I would do:
unsigned long long get_microseconds(int seconds, unsigned long nSeconds) {
    return seconds * 1000000ULL + nSeconds / 1000;
}
This, at least on x86, doesn't call external routines and manages to keep the 64 bit overhead to a minimum.
Of course, these are tests done on x86 (which has a 32x32=>64 multiply instruction even in 32 bit mode), given that you are working on a DSP you would need to check the actual code produced by your compiler.
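For reference, here is a sketch of what that multiply-by-inverse-and-shift transformation looks like when written out by hand; the constant is ceil(2^38/1000), the identity holds for any 32-bit input, and the helper name is illustrative. Check what your DSP compiler actually emits before assuming it does the same thing.

#include <stdint.h>

/* Hand-written version of the strength reduction a compiler applies for
 * n / 1000: 274877907 == ceil(2^38 / 1000), and
 * (n * 274877907) >> 38 == n / 1000 for every 32-bit n. */
static uint32_t div_by_1000(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 274877907ULL) >> 38);
}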
Related
So I'm wondering about the cost of division on an atmega2560, as well as in general:
Let's say I have something like this
unsigned long long a = some_large_value;
unsigned long long b = some_other_large_value;
unsigned long result = (a - b) / A_CONSTANT;
// A_CONSTANT is e.g. 16
How long does it actually take? Are we speaking about hundreds or thousands of cycles? And does it make a difference if I change the division to a multiplication, e.g. like so
unsigned long result = (a - b) * 1 / A_CONSTANT;
I want to use that in a time-critical application for calculating a time span which is used for determining when to execute another part of the program. Assuming the division takes too much time, what other options do I have?
This really depends on your A_CONSTANT and how good the compiler is IMO.
I've looked up the chip and it's obviously an 8-bit processor running at 8 or 16 MHz.
As such, I'd consider those unsigned long long integers to be the biggest hurdle, if your division is trivial.
For that, the constant would have to be a power of two (like 2, 4, 8, 16, etc.). The optimization that happens then replaces the whole division with a simple right shift, which completes in far fewer cycles.
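A minimal sketch, assuming A_CONSTANT really is 16 and using an illustrative function name:

#define A_CONSTANT 16

/* With unsigned operands and a power-of-two constant, the division below
 * reduces to a right shift by 4 instead of a call to a 64-bit divide routine. */
unsigned long time_span(unsigned long long a, unsigned long long b)
{
    return (unsigned long)((a - b) / A_CONSTANT);
}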
Switching to a multiplication won't net you anything good. You'll at least suffer precision issues, and if the reciprocal is written as a separate integer expression, 1/A_CONSTANT evaluates to 0 for every A_CONSTANT greater than 1 (it's an integer division, where the result is rounded down), so the whole result becomes 0.
So what to do or whether to consider this something for optimization heavily depends on the actual value of A_CONSTANT.
Probably the easiest way of solving this (or of comparing solutions) is to compare the resulting assembly code, because that is what actually gets executed. Optimizing this purely on theory is rather complicated and might even give you wrong or misleading results.
The AVR instruction set doesn't have a divide instruction of its own, so, as mentioned in the comments, it all comes down to how the compiler you are using implements this operation.
You might want to have a look at the generated machine instructions to see what's actually produced and think about possible optimisations.
There is a lot of information available on Google about different implementations of integer division, like for example this
Also a very good source of information.
While I was reading C tips, I saw this tip here: http://www.cprogramming.com/tips/tip/multiply-rather-than-divide
but I am not sure. I was told that both multiply and divide are slow, time-consuming operations that require many cycles.
And I have seen people often use i << 2 instead of i * 4, since shifting is faster.
Is using x * 0.5 instead of x / 2 a good tip, or do modern compilers optimize it better anyway?
It's true that some (if not most) processors can multiply faster than they can perform a division operation, but it's like the myth of ++i being faster than i++ in a for loop. Yes, it once was, but nowadays compilers are smart enough to optimize all those things for you, so you should not care about this anymore.
And as for bit-shifting, it once was faster to shift << 2 than to multiply by 4, but those days are over, as most processors can multiply in one clock cycle, just like a shift operation.
A great example of this was the calculation of the pixel address in VGA 320x240 mode. They all did this:
address = x + (y << 8) + (y << 6);
to multiply y by 320. On modern processors, this can be slower than just doing:
address = x + y * 320;
So, just write what you think and the compiler will do the rest :)
I find that this service is invaluable for testing this sort of stuff:
http://gcc.godbolt.org/
Just look at the final assembly. 99% of the time, you will see that the compiler optimises it all to the same code anyway. Don't waste the brain power!
In some cases, it is better to write it explicitly. For example, 2^n (where n is a positive integer) could be written as (int) pow( 2.0, n ) but it is obviously better to use 1<<n (and the compiler won't make that optimisation for you). So it can be worth keeping these things in the back of your mind. As with anything though, don't optimise prematurely.
"multiply by 0.5 rather than divide by 2" (2.0) is faster on fewer environments these days than before, primarily due to improved compilers that will optimize the code.
"use i << 2 instead of i x 4" is faster in fewer environments for similar reasons.
In select cases, the programmer still needs to attend to such issues, but it is increasingly rare. Code maintenance continues to grow as a dominant issue. So use what makes the most sense for that code snippet: x*0.5, x/2.0, half(x), etc.
Compilers readily optimize code. I recommend you code with high-level issues in mind, e.g., is the algorithm O(n) or O(n*n)?
The important thought to pass on is that best code design practices evolve and variations occur amongst environments. Be adaptable. What is best today may shift (or multiply) in the future.
Many CPUs can perform multiplication in 1 or 2 clock cycles but division always takes longer (although FP division is sometimes faster than integer division).
If you look at this answer How can I compare the performance of log() and fp division in C++? you will see that division can exceed 24 cycles.
Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.
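The same idea can be applied by hand when many values share one divisor: compute the reciprocal once and multiply. A sketch (the function is illustrative, and the results are not guaranteed to be bit-identical to true division):

/* Sketch: hoist one reciprocal out of the loop and replace n divisions
 * with n multiplies. The last bit of each result may differ from x[i]/d. */
void scale_all(float *x, int n, float d)
{
    float inv = 1.0f / d;      /* one division */
    for (int i = 0; i < n; i++)
        x[i] *= inv;           /* only multiplies from here on */
}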
If you are working with integers and you expect to get an integer as the result, it's better to use / 2; that way you avoid unnecessary conversions to/from floating point.
Or are floating point and integer operations the same speed? And if not, how much faster is the integer version?
You can find information about Instruction-specific scheduling for Advanced SIMD instructions for the Cortex-A8 (they don't publish it for newer cores, since the timing business has gotten quite complicated since then).
See Advanced SIMD integer ALU instructions versus Advanced SIMD floating-point instructions:
You may need to read the explanation of how to read those tables.
To give a complete answer: in general, floating point instructions take two cycles, while instructions that execute on the ALU take one cycle. On the other hand, multiplication of a long long (8-byte integer) takes four cycles (from the same source), while multiplication of a double takes two cycles.
In general, it seems you shouldn't care about float versus integer; carefully choosing the data type (float vs double, int vs long long) is more important.
It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.
I would refer to auselen's answer for great links to all of the references, however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply two-by-two 32 bit numbers together, and get a 64 bit result. This takes two cycles and requires an extra register to store the wide result.
Shift the 64 bit numbers back to 32 bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back (5-6 cycle) delay from the multiply.
For floating point:
Multiply two-by-two 32 bit floats together. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
If you are dealing with 16 bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. To achieve good performance if you are using Q15, then you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than SP float.
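A minimal sketch of that Q15 path using the corresponding NEON intrinsic (the wrapper name is illustrative):

#include <arm_neon.h>

/* Saturating Q15 multiply of eight lanes at once: vqdmulh doubles the
 * 32-bit product and keeps the high half, which is exactly Q15 x Q15 -> Q15. */
static int16x8_t q15_mul(int16x8_t a, int16x8_t b)
{
    return vqdmulhq_s16(a, b);
}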
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.
I am trying to learn some basic benchmarking. I have a loop in my Java program like this:
float a = 6.5f;
int b = 3;
float var = 0;
for (long j = 0; j < 999999999; j++) {
    var = a * b + (a / b);
} // end of for
My processor takes around 0.431635 second to process this. How would I calculate processor speed in terms of Flops(Floating point Operations Per Second) and Iops(Integer Operations Per Second)? Can you provide explanations with some steps?
You have a single loop with 999999999 iterations: let's call this 1e9 (one billion) for simplicity. The integers will get promoted to floats in the calculations that involve both, so the loop contains 3 floating-point operations: one mult, one add, and one div, so there are 3e9 in total. This takes 0.432s, so you're apparently getting about 6.94 GFLOP/s (3e9/0.432). Similarly, you are doing 1 integer op (j++) per loop iteration, so you are getting 1e9/0.432, or about 2.32 GIOP/s.
However, the calculation a*b+(a/b) is loop-invariant, so it would be pretty surprising if this didn't get optimized away. I don't know much about Java, but any C compiler will evaluate this at compile-time, remove the a and b variables and the loop, and (effectively) replace the whole lot with var=21.667;. This is a very basic optimization, so I'd be surprised if javac didn't do it too.
I have no idea what's going on under the hood in Java, but I'd be suspicious of getting 7 GFLOPs. Modern Intel CPUs (I'm assuming that's what you've got) are, in principle, capable of two vector arithmetic ops per clock cycle with the right instruction mix (one add and one mult per cycle), so for a 3 GHz 4-core CPU, it's even possible to get 3e9*4*8 = 96 single-precision GFLOPs under ideal conditions. The various mul and add instructions have a reciprocal throughput of 1 cycle, but the div takes more than ten times as long, so I'd be very suspicious of getting more than about CLK/12 FLOPs (scalar division on a single core) once division is involved: if the compiler is smart enough to vectorize and/or parallelize the code to get more than that, which it would have to do, it would surely be smart enough to optimize away the whole loop.
In summary, I suspect that the loop is being optimized away completely and the 0.432 seconds you're seeing is just overhead. You have not given any indication how you're timing the above loop, so I can't be sure. You can check this out for yourself by replacing the ~1e9 loop iterations with 1e10. If it doesn't take about 10x as long, you're not timing what you think you're timing.
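Another way to check is to write the measurement so the compiler cannot legally fold the work away; here is a sketch in C rather than Java (the same structure carries over): make the operands depend on the loop counter and consume the result.

#include <stdio.h>

/* Benchmark skeleton whose work cannot be constant-folded: the operand
 * varies with j and the accumulated result is printed afterwards. */
int main(void)
{
    double sum = 0.0;
    int b = 3;
    for (long j = 0; j < 999999999L; j++) {
        float a = 6.5f + (float)(j & 7);   /* operand changes every iteration */
        sum += a * b + (a / b);
    }
    printf("%f\n", sum);                   /* result is observed, so it is kept */
    return 0;
}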
There's a lot more to say about benchmarking and profiling, but I'll leave it at that.
I know this is very late, but I hope it helps someone.
Emmet.
I'm working on the OKI 431 microcontroller, an 8-bit microcontroller. We don't want any floating point operations in our project, so we've eliminated all of them and converted them into integer operations in some way. But we cannot eliminate one floating point operation, because optimizing that calculation for integer operation requires a 64-bit integer, which the micro doesn't natively support. Its C compiler supports up to 32-bit integer operations. The calculation takes too long, which is noticeable to the user.
I'm wondering if there is any 64-bit integer library that can easily be used in C for microcontroller coding. Or what is the easiest way to write such a thing efficiently? Here, efficiently means minimizing the amount of time required.
Thanks in advance.
Since this is a micro-controller you will probably want to use a simple assembly library. The fewer operations it has to support the simpler and smaller it can be. You may also find that you can get away with smaller than 64 bit numbers (48 bit, perhaps) and reduce the run time and register requirements.
You may have to go into assembly to do this. The obvious things you need are:
addition
2s complement (invert and increment)
left and right arithmetic shift by 1
From those you can build subtraction, multiplication, long division, and longer shifts. Keep in mind that multiplying two 64-bit numbers gives you a 128-bit number, and long division may need to be able to take a 128-bit dividend.
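As an illustration, 64-bit addition can be sketched in C on top of two 32-bit words (the type and names are made up for this example, and assume unsigned long is 32 bits on your compiler); the same carry trick underlies the other operations:

/* Illustrative 64-bit addition built from 32-bit words; the carry out of
 * the low word is the comparison (r.lo < a.lo). */
typedef struct { unsigned long lo, hi; } u64;   /* unsigned long assumed 32-bit */

static u64 u64_add(u64 a, u64 b)
{
    u64 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);         /* add the carry */
    return r;
}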
It will seem painfully slow, but the assumption in such a machine is that you need a small footprint, not speed. I assume you are doing these calculations at the lowest frequency you can.
An open-source library may have a slightly faster way to do it, but it could also be even slower.
Whenever speed is a problem with floating point math in small embedded systems, and integer math is not enough, fixed point math is a fast replacement; a minimal sketch follows the links below.
http://forum.e-lab.de/topic.php?t=2387
http://en.wikipedia.org/wiki/Fixed-point_arithmetic
http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
http://en.wikibooks.org/wiki/Embedded_Systems/Floating_Point_Unit
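As a minimal illustration of the fixed-point idea behind those links, here is a Q16.16 multiply; note that on a 32-bit-only compiler the wide intermediate product would itself need the multi-word helpers discussed above.

/* Q16.16 fixed point: 16 integer bits, 16 fractional bits.
 * Q16_ONE is 1.0 in this format; 1.5 is represented as 98304. */
typedef long q16_16;
#define Q16_ONE (1L << 16)

static q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((long long)a * b) >> 16);  /* wide product, then rescale */
}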