So I'm wondering about the cost of division on an ATmega2560, as well as in general:
Let's say I have something like this:
unsigned long long a=some-large-value;
unsigned long long b=some-other-large-value;
unsigned long result=(a-b)/A_CONSTANT;
// A_CONSTANT, e.g. 16
How long does it actually take? Are we speaking about hundreds or thousands of cycles? And does it make a difference if I change the division to a multiplication, e.g. like so:
unsigned long result=(a-b)*1/A_CONSTANT;
I want to use that in a time-critical application for calculating a time span which is used for determining when to execute another part of the program. Assuming the division takes too much time, what other options do I have?
This really depends on your A_CONSTANT and how good the compiler is IMO.
I've looked up the chip, and it's obviously an 8-bit processor running at 8 or 16 MHz.
As such, I'd consider those unsigned long long integers to be the biggest hurdle, if your division is trivial.
For that, it would have to be a power of two (like 2, 4, 8, 16, etc.). The compiler can then optimize the whole division into a simple right shift, which completes in far fewer cycles.
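For illustration, a minimal sketch (with made-up placeholder values) showing that, for unsigned operands, dividing by 16 and shifting right by 4 give the same result, which is exactly the substitution the compiler can make:

#include <stdio.h>

int main(void)
{
    unsigned long long a = 1000000ULL;  /* placeholder "large" value */
    unsigned long long b = 123456ULL;   /* placeholder "other" value */

    /* With a power-of-two constant, the division can be compiled
       down to a right shift: /16 equals >>4 for unsigned values. */
    unsigned long by_div   = (unsigned long)((a - b) / 16);
    unsigned long by_shift = (unsigned long)((a - b) >> 4);

    printf("%lu %lu\n", by_div, by_shift);  /* prints the same value twice */
    return 0;
}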
Switching to a multiplication won't net you anything good. At best you'll suffer precision issues, and if you mean (a-b)*(1/A_CONSTANT), the result would be 0 all the time unless A_CONSTANT is 1, since 1/A_CONSTANT is an integer division whose result is rounded down.
So what to do, or whether to consider this worth optimizing at all, depends heavily on the actual value of A_CONSTANT.
Probably the easiest way to settle this (or to compare solutions) is to compare the resulting assembly code, because that is what actually gets executed. Optimizing purely on theory is rather complicated and might even give you wrong or misleading results.
The AVR instruction set doesn't have a divide instruction of its own, so as mentioned in the comments, it all comes down to how the compiler you are using implements this operation.
You might want to have a look at the generated machine instructions to see what's actually emitted and think about possible optimisations.
There is a lot of information available on Google about different implementations of integer division, like for example this.
Also a very good source of information.
Related
I am working on some code to be run on a very heterogeneous cluster. The program performs interval arithmetic using 3, 4, or 5 32 bit words (unsigned ints) to represent high precision boundaries for the intervals. It seems to me that representing some words in floating point in some situations may produce a speedup. So, my question is two parts:
1) Are there any guarantees in the C11 standard as to what range of integers will be represented exactly, and what range of input pairs would have their products represented exactly? One multiplication error could entirely change the results.
2) Is this even a reasonable approach? It seems that the separation of floating point and integer processing within the processor would allow data to be running through both pipelines simultaneously, improving throughput. I don't know much about hardware though, so I'm not sure that the pipelines for integers and floating points actually are all that separate, or, if they are, if they can be used simultaneously.
I understand that the effectiveness of this sort of thing is platform dependent, but right now I am concerned about the reliability of the approach. If it is reliable, I can benchmark it and see, but I am having trouble proving reliability. Secondly, perhaps this sort of approach shows little promise, and if so I would like to know so I can focus elsewhere.
Thanks!
I don't know about the Standard, but it seems that you can assume all your processors are using the normal IEEE floating point format. In this case, it's pretty easy to determine whether your calculations are correct. The first integer not representable by the 32-bit float format is 16777217 (2^24 + 1), so if all your intermediate results are less than that (in absolute value), float will be fine.
The reverse is also true: if any intermediate result is greater than 2^24 (in absolute value) and odd, float representation will alter it, which is unacceptable for you.
If you are worried specifically about multiplications, look at how the multiplicands are limited. If one is limited by 2^11, and the other by 2^13, you will be fine (just barely). If, for example, both are limited by 2^16, there almost certainly is a problem. To prove it, find a test case that causes their product to exceed 2^24 and be odd.
Everything you need to know about the limits up to which you retain integer precision should be available through the macros defined in <float.h>. There you have the exact description of the floating point types: FLT_RADIX for the radix, FLT_MANT_DIG for the number of digits, etc.
As you say, whether or not such an approach is efficient will depend on the platform. Be aware that this depends a lot on the particular processor you have, not just the processor family. From one Intel or AMD variant to another there can already be noticeable differences. So you'd basically have to benchmark all possibilities and have code that decides at program startup which variant to use.
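A quick sketch that prints those macros and the exact-integer limit they imply (assuming an IEEE-style float, this prints a radix of 2 and 24 mantissa digits, so integers up to 2^24 are exact):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* For IEEE single precision: FLT_RADIX == 2, FLT_MANT_DIG == 24,
       so every integer of magnitude up to 2^24 is representable exactly. */
    printf("FLT_RADIX    = %d\n", FLT_RADIX);
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);
    printf("largest safe consecutive integer = %ld\n", 1L << FLT_MANT_DIG);

    /* 2^24 + 1 is the first integer that float cannot hold exactly. */
    printf("(float)16777217 = %.1f\n", (float)16777217);  /* prints 16777216.0 */
    return 0;
}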
I am making a 2D shooter game, and thus I have to stuff lots of bullets into an array, including their position and where they are going.
So I have two issues. One is memory use, especially avoiding array layouts that put things out of alignment and result in lots of padding, which makes the speed of calculations suck.
The second is speed of calculation.
First this means choosing between integers and floats... For now I am going with integers (if someone thinks floating point is better, please say so).
Then, this also means choosing a variant of that type (8 bits? 16 bits? C's confusing default? The CPU word size? Single precision? Double precision?)
Thus the question is: What type in C is fastest on modern processors (i.e. common x86, ARM and other popular processors; don't worry about Z80 or 36-bit processors), and what type is more reasonable when taking both speed AND memory use into account?
Also, do signed and unsigned differ in speed?
EDIT because of close votes: Yes, it might be premature optimization, but I am asking not only about CPU use but also about memory use (which might vary significantly). Also, I am doing the project to exercise my C skills; it has been some years since I coded in C, and I thought I would have some fun, find limits and stretch them, and also learn new standards (last time I used C it was still C89).
Finally, the major motivation for asking this question was just hacker curiosity when I found out that some new, interesting types (like int_fast*_t) existed in newer standards.
But if you still think this is not worth asking, then I can delete the question, go peruse the standards and some books, and learn by myself. Then if others one day have the same curiosity, it is not my problem.
I would say an int should be the most comfortable for your CPU. But the C standard does have:
The typedef name int_fastN_t designates the fastest signed integer type with a width of at least N. The typedef name uint_fastN_t designates the fastest unsigned integer type with a width of at least N.
So in theory you could say things like: "I need it to be at least 16 bits so I shall use int_fast16_t". In practice that might translate to a plain int.
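For example, a small sketch using the fast types (the printed size will vary by platform; on a typical desktop int_fast16_t is often just int):

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* "At least 16 bits, whatever is fastest on this target." */
    int_fast16_t sum = 0;
    for (int_fast16_t i = 0; i < 100; i++)
        sum += i;

    printf("sizeof(int_fast16_t) = %zu, sum = %" PRIdFAST16 "\n",
           sizeof(int_fast16_t), sum);
    return 0;
}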
I suspect it is premature to think about these before you actually hit a performance issue that you can try to work around. I think it is better to solve problems when they occur than to try to think of an elusive super-solution that could solve all future possible issues.
Single precision floating point add and multiply is as fast as 32-bit integer arithmetic in all modern processors (x86, ARM, MIPS), i.e. one result per clock cycle. Calculating positions and velocities in space is a lot easier with floating point arithmetic, so use floats. Single precision floats are 32 bits, the same size as the most efficient integer type on 32-bit CPUs.
Or are floating point and integer operations the same speed? And if not, how much faster is the integer version?
You can find information about Instruction-specific scheduling for Advanced SIMD instructions for the Cortex-A8 (they don't publish it for newer cores, since the timing business has become quite complicated).
See Advanced SIMD integer ALU instructions versus Advanced SIMD floating-point instructions:
You may need to read the explanation of how to read those tables.
To give a complete answer: in general, floating point instructions take two cycles, while instructions executed on the ALU take one cycle. On the other hand, multiplication of long long (an 8-byte integer) takes four cycles (from the same source), while multiplication of double takes two cycles.
In general it seems you shouldn't worry about float versus integer; carefully choosing the data type (float vs double, int vs long long) is more important.
It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.
I would refer to auselen's answer for great links to all of the references, however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply two-by-two 32 bit numbers together, and get a 64 bit result. This takes two cycles and requires an extra register to store the wide result.
Shift the 64-bit numbers back to 32-bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back (5-6 cycle) delay of the multiply.
For floating point:
Multiply two-by-two 32 bit floats together. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
If you are dealing with 16-bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. If you are using Q15, you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than SP float.
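As a point of reference, here is a minimal plain-C sketch of the Q15 multiply-and-shift pattern being discussed (the real VQDMULH also saturates and works on whole vectors, which this scalar sketch ignores):

#include <stdint.h>
#include <stdio.h>

/* Q15 multiply: 16-bit operands, 32-bit intermediate, shifted back to Q15. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;  /* Q30 intermediate */
    return (int16_t)(wide >> 15);            /* back to Q15 */
}

int main(void)
{
    int16_t half    = 0x4000;  /* 0.5 in Q15 */
    int16_t quarter = 0x2000;  /* 0.25 in Q15 */
    printf("0.5 * 0.25 in Q15 = %d (0.125 is 4096)\n", q15_mul(half, quarter));
    return 0;
}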
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.
I'm working on the OKI 431 microcontroller, an 8-bit microcontroller. We don't want any floating point operations in our project, so we've eliminated all of them and converted them into integer operations in some way. But we cannot eliminate one floating point operation, because optimizing that calculation for integers requires a 64-bit integer, which the micro doesn't natively support. Its C compiler supports up to 32-bit integer operations. The calculation takes too long, in a way that is noticeable to the user.
I'm wondering if there is any 64-bit integer library that can easily be used in C for microcontroller coding. Or what is the easiest way to write such a thing efficiently? Here, efficiently means minimizing the amount of time required.
Thanks in advance.
Since this is a micro-controller you will probably want to use a simple assembly library. The fewer operations it has to support the simpler and smaller it can be. You may also find that you can get away with smaller than 64 bit numbers (48 bit, perhaps) and reduce the run time and register requirements.
You may have to go into assembly to do this. The obvious things you need are:
addition
2s complement (invert and increment)
left and right arithmetic shift by 1
From those you can build subtraction, multiplication, long division, and longer shifts. Keep in mind that multiplying two 64-bit numbers gives you a 128-bit number, and long division may need to be able to take a 128-bit dividend.
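As a starting point, a minimal C sketch of the first primitives (addition with a manual carry and a one-bit left shift), assuming the 64-bit value is kept as two 32-bit halves since the compiler tops out at 32 bits; on the actual part you would likely drop to assembly for speed:

#include <stdint.h>

/* Hypothetical 64-bit unsigned type built from two 32-bit halves. */
typedef struct {
    uint32_t hi;
    uint32_t lo;
} u64_t;

/* 64-bit addition from 32-bit adds plus a manual carry. */
static u64_t u64_add(u64_t a, u64_t b)
{
    u64_t r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low word */
    return r;
}

/* Left shift by one bit, another of the primitives listed above. */
static u64_t u64_shl1(u64_t a)
{
    u64_t r;
    r.hi = (a.hi << 1) | (a.lo >> 31);
    r.lo = a.lo << 1;
    return r;
}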
It will seem painfully slow, but the assumption in such a machine is that you need a small footprint, not speed. I assume you are doing these calculations at the lowest frequency you can.
An open-source library may have a slightly faster way to do it, but it could also be even slower.
Whenever speed is a problem with floating point math in small embedded systems, and when integer math is not enough, fixed point math is a fast replacement.
http://forum.e-lab.de/topic.php?t=2387
http://en.wikipedia.org/wiki/Fixed-point_arithmetic
http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
http://en.wikibooks.org/wiki/Embedded_Systems/Floating_Point_Unit
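To make the fixed-point suggestion concrete, here is a minimal Q8.8 sketch that stays within 16x16 -> 32-bit multiplies, i.e. within what a compiler limited to 32-bit integers can do (the format and names are just illustrative):

#include <stdint.h>
#include <stdio.h>

/* Q8.8 fixed point: 8 integer bits, 8 fractional bits, held in 16 bits. */
typedef int16_t q8_8;

static q8_8 q8_mul(q8_8 a, q8_8 b)
{
    return (q8_8)(((int32_t)a * (int32_t)b) >> 8);  /* 16x16 -> 32, shift back */
}

int main(void)
{
    q8_8 pi   = 804;                         /* ~3.141 in Q8.8 (3.14159 * 256) */
    q8_8 r    = 2 << 8;                      /* 2.0 */
    q8_8 area = q8_mul(pi, q8_mul(r, r));    /* pi * r * r */
    printf("area ~= %f\n", area / 256.0);    /* about 12.56 */
    return 0;
}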
I have a big number (integer, unsigned) stored in 2 variables (the high and low parts of the number):
unsigned long long int high;
unsigned long long int low;
I know how to add or subtract another number of that kind.
But I need to divide numbers of that kind. How do I do it? I know I can subtract N times, but maybe there are better solutions. ;-)
Language: C
Yes. It will involve shifts, and I don't recommend doing that in C. This is one of those rare examples where assembler can still prove its value, easily making things run hundreds of times faster (and I don't think I'm exaggerating).
I don't claim total correctness, but the following should get you going:
(1) Initialize result to zero.
(2) Shift divisor as many bits as possible to the left, without letting it become greater than the dividend.
(3) Subtract shifted divisor from dividend and add one to result.
(4) Now shift the divisor to the right until, once again, it is less than the remaining dividend, and for each right shift, left-shift the result by one bit. Go back to (3) unless the stopping condition is satisfied. (The stopping condition must be something like "the divisor has become zero", but I'm not certain about that.)
It really feels great to get back to some REAL programming problems :-)
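A minimal C sketch of that shift-and-subtract idea, assuming the 128-bit dividend is held in the high/low pair from the question and the divisor fits in 64 bits (written as a straightforward bit-at-a-time loop rather than the shifted-divisor formulation above; the names are illustrative):

#include <stdint.h>

/* 128-bit value as in the question: high and low 64-bit halves. */
typedef struct {
    unsigned long long high;
    unsigned long long low;
} u128;

/* Bit-at-a-time shift-and-subtract division of a 128-bit dividend by a
   64-bit divisor. Returns the quotient and stores the remainder in *rem.
   Sketch only: no check for d == 0. */
static u128 div128_by_64(u128 n, unsigned long long d, unsigned long long *rem)
{
    u128 q = { 0, 0 };
    unsigned long long r = 0;
    int i;

    for (i = 127; i >= 0; i--) {
        /* Bring down the next bit of the dividend, high word first. */
        unsigned long long bit = (i >= 64) ? (n.high >> (i - 64)) & 1ULL
                                           : (n.low  >> i) & 1ULL;
        int carry = (int)(r >> 63);     /* bit shifted out of the remainder */
        r = (r << 1) | bit;

        if (carry || r >= d) {          /* remainder (with carry) >= divisor */
            r -= d;                     /* wraps to the right value when carry is set */
            if (i >= 64)
                q.high |= 1ULL << (i - 64);
            else
                q.low  |= 1ULL << i;
        }
    }
    *rem = r;
    return q;
}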
Have you looked at any large-number libraries, such as GNU MP BigNum?
I know I can subtract N times, but maybe there are better solutions.
Subtracting N times may be slow when N is large.
Better (i.e. more complicated but faster) would be shift-and-subtract, using the algorithm you learned to do long division of decimal numbers in elementary school.
[There may also be 3rd-party library and/or compiler-specific support for such numbers.]
Hmm. I suppose if you have some headroom in "high", you could shift it all up one digit, divide high by the number, then add the remainder to the top remaining digit in low and divide low by the number, then shift everything back.
Here's another library doing 128 bit arithmetic. GnuCash: Math128.
Per my commenters below, my previous answer was stupid.
Quickly, my new answer would be that when I've tried to do this in the past, it almost always involved shifting, because it's the only operation that can be applied across multiple "words", if you will, and have it look the same as if it were one large word (with the exception of having to track carryover bits).
There are a couple different approaches to it, but I don't know of any better general direction than using shifts, unless your hardware has some special operations.
You could implement a "BigInt" type algorithm that does divisions on string arrays. Create 1 string array for each high,low pair and do the division. Store the result in another string array, then convert back to high,low integer pair.
Since the language is C, the array would probably be a character array. Consider it analogous to the "string array" I was mentioning above.
You can do addition and subtraction of arbitrarily large binary objects using the assembler looping and "add/subtract with carry (adc/sbb)" instructions. You can implement the other operations using them. I've never investigated doing anything beyond those two personally.
If your processor (or your C library) has a fast 64-bit divide, you can break the 128-bit divide into pieces (the same way you'd do a 32-bit divide on processors that had 16-bit divisions).
By the way, there are all sorts of tricks you can use if you know what typical values the dividend and divisor will have. What is the source of these numbers? If a lot of your cases can be solved quickly, it might be OK if the occasional case takes a long time.
Also, if you can find cases where an approximate answer is OK, that opens the door to a lot of speedy approximations.