Is NEON of ARM faster for integers than floating points? - c

Or do floating-point and integer operations run at the same speed? If not, how much faster is the integer version?

You can find instruction-specific scheduling information for Advanced SIMD instructions on the Cortex-A8 (ARM doesn't publish it for newer cores, since timing has become considerably more complicated).
See Advanced SIMD integer ALU instructions versus Advanced SIMD floating-point instructions:
You may need to read the explanation of how to read those tables.
To give a complete answer: in general, floating-point instructions take two cycles, while instructions that execute on the ALU take one cycle. On the other hand, multiplication of long long (8-byte integers) takes four cycles (from the same source), while multiplication of double takes two cycles.
In general, it seems you shouldn't worry about float versus integer; carefully choosing the data type (float vs double, int vs long long) matters more.

It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.

I would refer to auselen's answer for great links to all of the references, however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply two-by-two 32 bit numbers together, and get a 64 bit result. This takes two cycles and requires an extra register to store the wide result.
Shift the 64 bit numbers back to 32 bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back (5-6 cycle) delay from the multiply.
For floating point:
Multiply two-by-two 32 bit floats together. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
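The two fixed-point steps above, next to the single floating-point step, can be sketched in scalar C. This is only an illustration of the arithmetic, not NEON code: the Q8.24 layout, FRAC_BITS and the function names are assumptions made for the example.

```c
#include <stdint.h>

#define FRAC_BITS 24   /* assumed Q8.24 layout: 24 fractional bits */

/* Fixed point: widen to 64 bits (step 1), then shift back to 32 bits (step 2).
   The right shift of a negative value is assumed to be arithmetic, which is
   what common compilers do but is implementation-defined in C. */
int32_t q24_mul(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;   /* 32 x 32 -> 64-bit product */
    return (int32_t)(wide >> FRAC_BITS);      /* back to Q8.24 */
}

/* Floating point: one multiply, no widening and no shift. */
float float_mul(float a, float b)
{
    return a * b;
}
```

For instance, 1.5 is 25165824 in this Q8.24 layout and 2.0 is 33554432; their product comes back as 50331648, which is 3.0.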
If you are dealing with 16 bit data, then the tradeoffs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. If you are using Q15, you can use the VQDMULH instruction on s16 data and achieve much higher performance, with fewer registers, than SP float.
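As a rough scalar model of what one s16 lane of VQDMULH computes (multiply, double, keep the high half, saturate), something like the following may help. The function name is made up for this sketch, and the right shift of a negative value is assumed to be arithmetic.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of a saturating doubling multiply
   returning the high half: saturate((2 * a * b) >> 16). */
int16_t q15_mul_model(int16_t a, int16_t b)
{
    /* The only input pair whose doubled product overflows is -1.0 * -1.0. */
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;

    int32_t product = (int32_t)a * (int32_t)b;   /* fits easily in 32 bits */
    return (int16_t)((product * 2) >> 16);       /* doubled, high half kept */
}
```

With 0.5 represented as 16384 in Q15, q15_mul_model(16384, 16384) gives 8192, which is 0.25; no separate shift instruction is needed to get back to Q15.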
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.

Related

Standard guarantees for using floating point arithmetic to represent integer operations

I am working on some code to be run on a very heterogeneous cluster. The program performs interval arithmetic using 3, 4, or 5 32 bit words (unsigned ints) to represent high precision boundaries for the intervals. It seems to me that representing some words in floating point in some situations may produce a speedup. So, my question is two parts:
1) Are there any guarantees in the C11 standard as to what range of integers will be represented exactly, and what range of input pairs would have their products represented exactly? One multiplication error could entirely change the results.
2) Is this even a reasonable approach? It seems that the separation of floating point and integer processing within the processor would allow data to be running through both pipelines simultaneously, improving throughput. I don't know much about hardware though, so I'm not sure that the pipelines for integers and floating points actually are all that separate, or, if they are, if they can be used simultaneously.
I understand that the effectiveness of this sort of thing is platform dependent, but right now I am concerned about the reliability of the approach. If it is reliable, I can benchmark it and see, but I am having trouble proving reliability. Secondly, perhaps this sort of approach shows little promise, and if so I would like to know so I can focus elsewhere.
Thanks!
I don't know about the Standard, but it seems that you can assume all your processors are using the normal IEEE floating point format. In this case, it's pretty easy to determine whether your calculations are correct. The first integer not representable by the 32-bit float format is 16777217 (2^24+1), so if all your intermediate results are less than that (in absolute value), float will be fine.
The reverse is also true: if any intermediate result is greater than 2^24 (in absolute value) and odd, float representation will alter it, which is unacceptable for you.
If you are worried specifically about multiplications, look at how the multiplicands are limited. If one is limited by 2^11, and the other by 2^13, you will be fine (just barely). If, for example, both are limited by 2^16, there almost certainly is a problem. To prove it, find a test case that causes their product to exceed 2^24 and be odd.
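A small check along these lines can be used to hunt for such a test case. It is a sketch that assumes IEEE single precision and inputs below 2^24 (so the inputs themselves are exact in float and the true product is exact in double); the function name is invented for the example.

```c
#include <stdint.h>

/* Returns 1 if a * b computed in float equals the exact integer product.
   Assumes a and b are both below 2^24. */
int float_mul_is_exact(uint32_t a, uint32_t b)
{
    /* volatile forces the product to be rounded to float precision */
    volatile float product = (float)a * (float)b;
    uint64_t exact = (uint64_t)a * (uint64_t)b;
    return (double)product == (double)exact;
}
```

With bounds of 2^11 and 2^13 the worst case 2048 * 8192 = 2^24 is still exact, while 4097 * 4097 = 16785409 is odd and above 2^24, so the float product differs from the integer one.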
Everything you need to know about how far you can go and still have integer precision should be available through the macros defined in <float.h>. There you have the exact description of the floating point types: FLT_RADIX for the radix, FLT_MANT_DIG for the number of digits, etc.
As you say, whether or not such an approach is efficient will depend on the platform. You should be aware that this depends heavily on the particular processor you have, not only on the processor family. From one Intel or AMD processor variant to another there can already be significant differences. So you'd basically benchmark all possibilities and have code that decides at program startup which variant to use.
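As a sketch of how those macros can be used, the following computes FLT_RADIX^FLT_MANT_DIG, the magnitude up to which every integer is exactly representable in a float (and the same for double). The function names are made up for the example.

```c
#include <float.h>

/* Every integer with magnitude up to radix^mantissa_digits is exact. */
double float_exact_int_limit(void)
{
    double limit = 1.0;
    for (int i = 0; i < FLT_MANT_DIG; ++i)
        limit *= FLT_RADIX;
    return limit;
}

double double_exact_int_limit(void)
{
    double limit = 1.0;
    for (int i = 0; i < DBL_MANT_DIG; ++i)
        limit *= FLT_RADIX;
    return limit;
}
```

On a platform with IEEE single precision (FLT_RADIX 2, FLT_MANT_DIG 24) the first function returns 16777216, matching the 2^24 bound discussed in the other answer.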

C atmega2560 Division of large integers

So I'm wondering about the cost of division on an ATmega2560, as well as in general:
Let's say I got something like this
unsigned long long a=some-large-value;
unsigned long long b=some-other-large-value;
unsigned long result=(a-b)/A_CONSTANT;
// A_CONSTANT is e.g. 16
How long does it actually take? Are we talking about hundreds or thousands of cycles? And does it make a difference if I change the division to a multiplication, like so:
unsigned long result=(a-b)*1/A_CONSTANT;
I want to use that in a time-critical application for calculating a time span which is used for determining when to execute another part of the program. Assuming the division takes too much time, what other options do I have?
This really depends on your A_CONSTANT and how good the compiler is IMO.
I've looked up the chip and it's obviously an 8-bit processor running at 8 or 16 MHz.
As such, I'd consider those unsigned long long integers to be the biggest hurdle, provided your division is trivial.
For the division to be trivial, A_CONSTANT would have to be a power of two (like 2, 4, 8, 16, etc.). The compiler could then optimize the whole division into a simple right shift, which completes in far fewer cycles.
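For the example value of 16, the equivalence can be written out by hand. This is a sketch only: the function names and A_SHIFT are made up, and fixed-width types stand in for unsigned long long and unsigned long.

```c
#include <stdint.h>

#define A_CONSTANT 16u   /* example value from the question */
#define A_SHIFT    4     /* log2(A_CONSTANT); only valid for powers of two */

/* The original form: a 64-bit unsigned division. */
uint32_t span_by_division(uint64_t a, uint64_t b)
{
    return (uint32_t)((a - b) / A_CONSTANT);
}

/* The same result for unsigned operands using a right shift. */
uint32_t span_by_shift(uint64_t a, uint64_t b)
{
    return (uint32_t)((a - b) >> A_SHIFT);
}
```

A decent compiler will usually perform this substitution itself when A_CONSTANT is a compile-time constant, so it is worth checking the generated assembly before rewriting anything by hand.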
Switching to a multiplication won't net you anything good. As written, (a-b)*1/A_CONSTANT is evaluated left to right as ((a-b)*1)/A_CONSTANT, so it is still the same division. If you meant (a-b)*(1/A_CONSTANT) instead, you'd suffer precision issues: the result would be 0 all the time, unless A_CONSTANT is 1 (since you're obviously doing an integer division, where the result is rounded down).
So what to do or whether to consider this something for optimization heavily depends on the actual value of A_CONSTANT.
Probably the easiest way solving this (or comparing solutions) would be comparing the resulting assembly code, because it will be the final result that's actually processed. Optimizing this purely on theory is rather complicated and might even get you wrong or misleading results.
The AVR instruction set doesn't have a divide instruction of its own so, as mentioned in the comments, it all comes down to how the compiler you are using implements this operation.
You might want to have a look at the generated machine instructions to see what's actually produced and think about possible optimisations.
There is a lot of information available on Google about different implementations of integer division, like for example this
Also very good source of information.
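One well-known family of such implementations replaces division by a constant that is not a power of two with a multiplication by a scaled reciprocal followed by a shift. The sketch below does this for a 16-bit value divided by 10; the constant 0xCCCD with a shift of 19 is the usual choice for this case, and the function names are made up. Whether it actually beats the compiler's library division on an AVR would have to be measured.

```c
#include <stdint.h>

/* x / 10 for 16-bit x, using multiply-and-shift instead of division.
   0xCCCD is ceil(2^19 / 10). */
uint16_t div10_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0xCCCDu) >> 19);
}

/* Exhaustively checks the trick against real division; returns 1 on success. */
int div10_matches_everywhere(void)
{
    for (uint32_t x = 0; x <= 0xFFFFu; ++x) {
        if (div10_u16((uint16_t)x) != (uint16_t)(x / 10u))
            return 0;
    }
    return 1;
}
```

The exhaustive check is cheap for 16-bit inputs and is a reasonable habit whenever a magic constant like this is introduced.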

SSE ints vs. floats practice

When dealing with both ints and floats in SSE (AVX) is it a good practice to convert all ints to floats and work only with floats?
Because we need only a few SIMD instructions after that, and all we need to use is addition and compare instructions (<, <=, ==) which this conversion, I hope, should retain completely.
Expanding my comments into an answer.
Basically, you are weighing the following trade-off:
Stick with integer:
Integer SSE is low-latency, high throughput. (dual issue on Sandy Bridge)
Limited to 128-bit SIMD width.
Convert to floating-point:
Benefit from 256-bit AVX.
Higher latencies, and only single-issue addition/subtraction (on Sandy Bridge)
Incurs initial conversion overhead.
Restricts input to those that fit into a float without precision loss.
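The last restriction matters for the compare instructions in particular. The scalar sketch below (plain C rather than SSE, with made-up function names, assuming IEEE single precision with round-to-nearest) shows two distinct ints that become equal once converted to float. The conversion never reverses an ordering, but it can turn a strict "less than" into "equal".

```c
#include <stdint.h>

int less_as_int(int32_t a, int32_t b)  { return a < b; }
int equal_as_int(int32_t a, int32_t b) { return a == b; }

/* Same comparisons after converting both operands to float.
   volatile forces the values to be rounded to float precision. */
int less_as_float(int32_t a, int32_t b)
{
    volatile float fa = (float)a, fb = (float)b;
    return fa < fb;
}

int equal_as_float(int32_t a, int32_t b)
{
    volatile float fa = (float)a, fb = (float)b;
    return fa == fb;
}
```

For 16777216 and 16777217 the integer comparison says "less", while the float comparison says "equal", because 16777217 is not representable and rounds to 16777216. Inputs kept within 2^24 in magnitude do not have this problem.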
I'd say stick with integer for now. If you don't want to duplicate code with the float versions, then that's your call.
The only times I've seen where emulating integers with floating-point becomes faster are when you have to do divisions.
Note that I've made no mention of readability as diving into manual vectorization probably implies that performance is more important.

Do I get a performance penalty when mixing SSE integer/float SIMD instructions

I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but in theory should perform equally for both. For example, both float and double vectors have instructions to load higher 64bits of a 128-bit vector from an address (movhps, movhpd), but there's no such instruction for integer vectors.
My question:
Are there any reasons to expect a performance hit when using floating point instructions on integer vectors, e.g. using movhps to load data to an integer vector?
I wrote several tests to check that, but I suppose their results are not credible. It's really hard to write a correct test that explores all corner cases for such things, especially when the instruction scheduling is most probably involved here.
Related question:
Other trivially similar things also have several instructions that do basically the same. For example I can do bitwise OR with por, orps or orpd. Can anyone explain what's the purpose of these additional instructions? I guess this might be related to different scheduling algorithms applied to each instruction.
From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:
There is a penalty for using the wrong type of instructions on some processors. This is
because the processor may have different data buses or different execution units for integer
and floating point data. Moving data between the integer and floating point units can take
one or more clock cycles depending on the processor, as listed in table 13.2.
Processor                       Bypass delay, clock cycles
Intel Core 2 and earlier        1
Intel Nehalem                   2
Intel Sandy Bridge and later    0-1
Intel Atom                      0
AMD                             2
VIA Nano                        2-3
Table 13.2. Data bypass delays between integer and floating point execution units
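If the bypass delay is a concern, the high-half load from the question can be kept in the integer domain with two SSE2 integer intrinsics, _mm_loadl_epi64 and _mm_unpacklo_epi64, instead of movhps. The wrapper below is only a sketch: load_high64 is a made-up name, the plain C fallback exists just so the example builds on non-x86 targets, and whether this is actually faster than movhps on a given CPU has to be benchmarked.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Replaces the upper 64 bits of the 128-bit value in vec[0..1] with *src,
   leaving the lower 64 bits untouched. */
void load_high64(uint64_t vec[2], const uint64_t *src)
{
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)vec);
    __m128i hi = _mm_loadl_epi64((const __m128i *)src);  /* 64-bit load into the low half */
    v = _mm_unpacklo_epi64(v, hi);                       /* low(v) stays low, low(hi) becomes high */
    _mm_storeu_si128((__m128i *)vec, v);
#else
    vec[1] = *src;   /* portable fallback with the same effect */
#endif
}
```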

64-bit integer implementation for 8-bit microcontroller

I'm working on an OKI 431 microcontroller. This is an 8-bit microcontroller. We don't want any floating point operations in our project, so we've eliminated all of them and converted them into integer operations in some way. But there is one floating point operation we cannot eliminate, because optimizing the calculation for integer operations requires a 64-bit integer, which the micro doesn't natively support. It has a C compiler that supports up to 32-bit integer operations. The calculation takes so long that it is noticeable to the user.
I'm wondering if there is any 64-bit integer library that can be easily used in C for microcontroller coding. Or what is the easiest way to write such a thing efficiently? Here, efficiently means minimizing the amount of time required.
Thanks in advance.
Since this is a micro-controller you will probably want to use a simple assembly library. The fewer operations it has to support the simpler and smaller it can be. You may also find that you can get away with smaller than 64 bit numbers (48 bit, perhaps) and reduce the run time and register requirements.
You may have to go into assembly to do this. The obvious things you need are:
addition
2s complement (invert and increment)
left and right arithmetic shift by 1
From those you can build subtraction, multiplication, long division, and longer shifts. Keep in mind that multiplying two 64-bit numbers gives you a 128-bit number, and long division may need to be able to take a 128-bit dividend.
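To make the building blocks concrete, here is a sketch in C rather than assembly, using only 32-bit operations. The wide64 type and all function names are invented for the example, and the multiply keeps only the low 64 bits of the product rather than the full 128.

```c
#include <stdint.h>

typedef struct { uint32_t lo, hi; } wide64;   /* hypothetical 64-bit type */

wide64 wide64_add(wide64 a, wide64 b)
{
    wide64 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low word */
    return r;
}

wide64 wide64_neg(wide64 a)               /* 2s complement: invert and increment */
{
    wide64 one = {1u, 0u};
    a.lo = ~a.lo;
    a.hi = ~a.hi;
    return wide64_add(a, one);
}

wide64 wide64_shl1(wide64 a)              /* left shift by 1 */
{
    a.hi = (a.hi << 1) | (a.lo >> 31);
    a.lo <<= 1;
    return a;
}

wide64 wide64_sub(wide64 a, wide64 b)     /* subtraction built from add and negate */
{
    return wide64_add(a, wide64_neg(b));
}

wide64 wide64_mul(wide64 a, wide64 b)     /* shift-and-add, low 64 bits only */
{
    wide64 r = {0u, 0u};
    for (int i = 0; i < 64; ++i) {
        if (b.lo & 1u)
            r = wide64_add(r, a);
        a = wide64_shl1(a);
        b.lo = (b.lo >> 1) | (b.hi << 31);   /* logical right shift of b by 1 */
        b.hi >>= 1;
    }
    return r;
}
```

The bit-at-a-time multiply is the simplest form, not the fastest; a version built from 16 x 16 partial products would need far fewer iterations if the compiler or hardware offers a usable multiply.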
It will seem painfully slow, but the assumption in such a machine is that you need a small footprint, not speed. I assume you are doing these calculations at the lowest frequency you can.
An open-source library may have a slightly faster way to do it, but it could also be even slower.
Whenever speed is a problem with floating point math in small embedded systems, and when integer math is not enough, fixed point math is a fast replacement.
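As a minimal sketch of the idea, assuming a Q8.8 format (8 integer bits, 8 fractional bits) stored in a 16-bit integer, so that a multiply needs only a 32-bit intermediate. The type and function names are made up for the example.

```c
#include <stdint.h>

typedef int16_t q8_8;      /* hypothetical Q8.8 fixed-point type */
#define Q8_8_ONE 256       /* 1.0 in Q8.8 */

q8_8 q8_8_from_int(int x)
{
    return (q8_8)(x * Q8_8_ONE);
}

/* Addition works directly on the raw values. */
q8_8 q8_8_add(q8_8 a, q8_8 b)
{
    return (q8_8)(a + b);
}

/* Multiplication needs a wider intermediate and a rescale. */
q8_8 q8_8_mul(q8_8 a, q8_8 b)
{
    int32_t wide = (int32_t)a * (int32_t)b;   /* Q16.16 intermediate */
    return (q8_8)(wide / Q8_8_ONE);           /* back to Q8.8, truncating */
}
```

Here 1.5 is stored as 384 and 2.0 as 512; q8_8_mul(384, 512) gives 768, which is 3.0. Range and precision are both limited, so the format has to be chosen to fit the values the calculation actually produces.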
http://forum.e-lab.de/topic.php?t=2387
http://en.wikipedia.org/wiki/Fixed-point_arithmetic
http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
http://en.wikibooks.org/wiki/Embedded_Systems/Floating_Point_Unit

Resources