SSE ints vs. floats practice

When dealing with both ints and floats in SSE (AVX), is it good practice to convert all ints to floats and work only with floats?
We only need a few SIMD instructions after that, and all we need are addition and comparison instructions (<, <=, ==), which I hope this conversion preserves completely.

Expanding my comments into an answer:
Basically you're weighing the following trade-off:
Stick with integer:
Integer SSE is low-latency, high throughput. (dual issue on Sandy Bridge)
Limited to 128-bit SIMD width.
Convert to floating-point:
Benefit from 256-bit AVX.
Higher latencies, and only single-issue addition/subtraction (on Sandy Bridge)
Incurs initial conversion overhead.
Restricts inputs to values that fit into a float without precision loss.
I'd say stick with integer for now. If you don't want to duplicate code with the float versions, then that's your call.
The only times I've seen where emulating integers with floating-point becomes faster are when you have to do divisions.
Note that I've made no mention of readability as diving into manual vectorization probably implies that performance is more important.
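For illustration, here is a minimal sketch of the "stick with integer" path using SSE2 intrinsics, assuming 32-bit signed elements in plain arrays (the function and array names are made up for the example). It shows the addition plus an <= comparison built from the compare instructions SSE2 actually provides:

    #include <emmintrin.h>

    /* sum[i] = a[i] + b[i]; le_mask[i] = all-ones if a[i] <= b[i], else 0 */
    void add_and_compare(const int *a, const int *b, int *sum, int *le_mask, int n)
    {
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(sum + i), _mm_add_epi32(va, vb));
            /* a <= b is the complement of a > b */
            __m128i gt = _mm_cmpgt_epi32(va, vb);
            _mm_storeu_si128((__m128i *)(le_mask + i),
                             _mm_andnot_si128(gt, _mm_set1_epi32(-1)));
        }
        /* leftover elements (n not a multiple of 4) omitted for brevity */
    }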

Related

Did old DOS C compilers implement double as 32-bit?

I'm reading Michael Abrash's Graphics Programming Black Book, which is all about 3D graphics performance, so I was surprised to find that a lot of the C code there uses double instead of float. We're talking about early-90s computers (286, 386, Pentium) and MS-DOS C compilers, so what's the reason for using double in that era? Didn't float exist, or did double and float have different precision than they do today?
In short, why was double used in performance-critical code in that era?
As far as I know there was no C compiler targeting MS-DOS that used a 32-bit wide double; instead they all used a 64-bit wide double. This was certainly the case by the early 90s. Based on a quick read of the "Floating Point for Real-Time 3D" chapter of the book, it appears that Michael Abrash thought floating-point math of any precision was too slow on anything less than a Pentium CPU. Either the floating-point code you're looking at was intended for Pentium CPUs, or it was used on a non-critical path where performance doesn't matter. For performance-critical code meant for earlier CPUs, Abrash implies that he would've used fixed-point arithmetic instead.
In a lot of cases using float instead of double wouldn't have actually made much difference. There are a few reasons. First, if you don't have an x87 FPU (floating-point unit) installed (a separate chip before the '486), using less precision wouldn't improve performance enough to make software-emulated floating-point arithmetic fast enough to be useful for a game. The second is that the performance of most x87 FPU operations wasn't actually affected by precision. On a Pentium CPU only division was faster if performed at a narrower precision. For earlier x87 FPUs I'm not sure whether precision affected division, though it could affect the performance of multiplication on the 80387. On all x87 FPUs addition would've been the same speed regardless of precision.
The third is that the specific C data type used, whether the 32-bit float, the 64-bit double, or even the 80-bit long double that many compilers supported, didn't actually affect the precision the FPU used during calculations. This is because the FPU didn't have different instructions (or encodings) for the three different precisions it supported. There was no way to tell it to perform a float addition or a double divide. Instead it performed all arithmetic at a precision that was set in the FPU's control register. (Or more accurately, it performed arithmetic as if using infinite precision and then rounded the result to the set precision.) While it would've been possible to change this register every time a floating-point instruction was used, this would have caused a massive decrease in performance, so compilers never did it. Instead they just set it to either 80-bit or 64-bit precision at program startup and left it that way.
Now, it was actually a common technique for 3D games to set the FPU to single precision. This meant floating-point arithmetic, whether using double or float types, would be performed using single-precision arithmetic. While this would end up only affecting the performance of floating-point divides, 3D graphics programming tends to do a lot of divisions in critical code (e.g. perspective divides), so this could yield a significant performance improvement.
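As a rough sketch of that technique, assuming a GCC-style compiler targeting x86 (the function name is made up): the precision-control field is bits 8-9 of the x87 control word, and clearing it selects 24-bit (single) precision:

    #include <stdint.h>

    /* Force the x87 FPU into single-precision mode, as 3D games of the era did. */
    static void set_x87_single_precision(void)
    {
        uint16_t cw;
        __asm__ volatile ("fnstcw %0" : "=m"(cw));  /* read the current control word */
        cw &= ~(uint16_t)0x0300;                    /* precision control = 00b (single) */
        __asm__ volatile ("fldcw %0" : : "m"(cw));  /* write it back */
    }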
There is, however, one way that using float instead of double could improve performance, and that's simply because a float takes up half the space of a double. If you have a lot of floating-point values, then having to read and write half as much memory can make a significant difference in performance. However, on Pentium or earlier PCs this wouldn't result in the huge performance difference it would today. The gap between CPU speed and RAM speed wasn't as wide back then, and floating-point performance was a fair bit slower. Still, it would be a worthwhile optimization if the extra precision isn't needed, as is usually the case in games.
Note that modern x86 C compilers don't normally use x87 FPU instructions for floating-point arithmetic; instead they use scalar SSE instructions, which unlike the x87 instructions do come in single- and double-precision versions. (But no 80-bit wide extended-precision versions.) Except for division, this doesn't make any performance difference, but it does mean that results are always truncated to float or double precision after every operation. When doing math on the x87 FPU this truncation would only happen when the result was written to memory. This means SSE floating-point code now has predictable results, while x87 FPU code had unpredictable results, because it was in general hard to predict when the compiler would need to spill a floating-point register into memory to make room for something else.
So basically using float instead of double wouldn't have made a big performance difference except when storing floating-point values in a big array or other large data structure in memory.

How does GCC's sqrt() work once compiled? Which root-finding method is used? Newton-Raphson?

Just curious about how the standard sqrt() from math.h works when compiled with GCC. I coded my own sqrt() using Newton-Raphson to do it!
Yeah, I know about fsqrt. But how does the CPU do it? I can't debug hardware.
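For reference, a minimal sketch of the Newton-Raphson approach the question mentions, iterating x = (x + a/x)/2 until it stops improving. This is just the software method; it is not how the hardware fsqrt/sqrtsd instructions work:

    #include <math.h>

    double nr_sqrt(double a)
    {
        if (a < 0.0)  return NAN;
        if (a == 0.0) return 0.0;
        double x = (a > 1.0) ? a : 1.0;     /* start above sqrt(a) so iterates decrease */
        for (;;) {
            double next = 0.5 * (x + a / x);
            if (next >= x)                  /* no further progress: converged */
                return x;
            x = next;
        }
    }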
Typical div/sqrt hardware in modern CPUs uses a power of 2 radix to calculate multiple result bits at once. e.g. http://www.imm.dtu.dk/~alna/pubs/ARITH20.pdf presents details of a design for a Radix-16 div/sqrt ALU, and compares it against the design in Penryn. (They claim lower latency and less power.) I looked at the pictures; looks like the general idea is to do something and feed a result back through a multiplier and adder iteratively, basically like long division. And I think similar to how you'd do bit-at-a-time division in software.
Intel Broadwell introduced a Radix-1024 div/sqrt unit. This discussion on RWT asks about changes between Penryn (Radix-16) and Broadwell. e.g. widening the SIMD vector dividers so 256-bit division was less slow vs. 128-bit, as well as increasing radix.
Maybe also see
The integer division algorithm of Intel's x86 processors - Merom's Radix-2 and Radix-4 dividers were replaced by Penryn's Radix-16. (Core2 65nm vs. 45nm)
https://electronics.stackexchange.com/questions/280673/why-does-hardware-division-take-much-longer-than-multiplication
https://scicomp.stackexchange.com/questions/187/why-is-division-so-much-more-complex-than-other-arithmetic-operations
But however the hardware works, IEEE requires sqrt (and mul/div/add/sub) to give a correctly rounded result, i.e. error <= 0.5 ulp, so you don't need to know how it works, just the performance. These operations are special, other functions like log and sin do not have this requirement, and real library implementations usually aren't that accurate. (And x87 fsin is definitely not that accurate for inputs near Pi/2 where catastrophic cancellation in range-reduction leads to potentially huge relative errors.)
See https://agner.org/optimize/ for x86 instruction tables including throughput and latency for scalar and SIMD sqrtsd / sqrtss and their wider versions. I collected up the results in Floating point division vs floating point multiplication
For non-x86 hardware sqrt, you'd have to look at data published by other vendors, or results from people who have tested it.
Unlike most instructions, sqrt performance is typically data-dependent. (Usually more significant bits or larger magnitude of the result takes longer).
sqrt is defined by C, so most likely you have to look in glibc.
You did not specify which architecture you are asking for, so I think it's safe to assume x86-64. If that's the case, they are defined in:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrt.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtf.c
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/fpu/e_sqrtl.c
tl;dr they are simply implemented by calling the x86-64 square root instructions sqrts{sd}:
https://www.felixcloutier.com/x86/sqrtss
https://www.felixcloutier.com/x86/sqrtsd
Furthermore, and just for the sake of discussion, if you enable fast-math (something you probably should not do if you care about result precision), you will see that most compilers will actually inline the call and directly emit the sqrts{sd} instructions:
https://godbolt.org/z/Wb4unC
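A tiny example of that inlining: compiled with optimization and -ffast-math (or even just -fno-math-errno), GCC and Clang reduce this to a bare sqrtsd instruction with no call into libm:

    #include <math.h>

    double root(double x)
    {
        return sqrt(x);   /* becomes a single sqrtsd under fast-math */
    }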

Is ARM NEON faster for integers than for floating point?

Or are floating-point and integer operations the same speed? And if not, how much faster is the integer version?
You can find information about Instruction-specific scheduling for Advanced SIMD instructions for Cortex-A8 (they don't publish it for newer cores, since the timing business has become quite complicated).
See Advanced SIMD integer ALU instructions versus Advanced SIMD floating-point instructions:
You may need to read the explanation of how to read those tables.
To give a complete answer: in general, floating-point instructions take two cycles, while instructions executed on the ALU take one cycle. On the other hand, multiplication of long long (8-byte integers) takes four cycles (from the same source), while multiplication of double takes two cycles.
In general it seems you shouldn't worry so much about float versus integer; carefully choosing the data type (float vs. double, int vs. long long) is more important.
It depends on which model you have, but the tendency has been for integer to have more opportunities to use the 128-bit wide data paths. This is no longer true on newer CPUs.
Of course, integer arithmetic also gives you the opportunity to increase the parallelism by using 16-bit or 8-bit operations.
As with all integer-versus-floating-point arguments, it depends on the specific problem and how much time you're willing to invest in tuning, because they can rarely run exactly the same code.
I would refer to auselen's answer for great links to all of the references; however, I found the actual cycle counts a little misleading. It is true that it can "go either way" depending on the precision that you need, but let's say that you have some parallelism in your routine and can efficiently operate on two words (SP float) at a time. Let's assume that you need the amount of precision for which floating point may be a good idea... 24 bits.
In particular when analyzing NEON performance, remember that there is a write-back delay (pipeline delay) so that you have to wait for a result to become ready if that result is required as the input to another instruction.
For fixed point you will need 32 bit ints to represent at least 24 bits of precision:
Multiply the 32-bit numbers together two at a time, getting 64-bit results. This takes two cycles and requires extra registers to store the wide results.
Shift the 64-bit results back to 32-bit numbers of the desired precision. This takes one cycle, and you have to wait for the write-back (5-6 cycle) delay from the multiply.
For floating point:
Multiply the 32-bit floats together two at a time. This takes one cycle.
So for this scenario, there is no way in heck that you would ever choose integer over floating point.
If you are dealing with 16-bit data, then the trade-offs are much closer, although you may still need an extra instruction to shift the result of the multiply back to the desired precision. To achieve good performance, if you are using Q15 you can use the VQDMULH instruction on s16 data and achieve much higher performance with fewer registers than SP float.
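For illustration, a rough sketch of that Q15 path using NEON intrinsics (the function and array names are made up); vqdmulhq_s16 is the intrinsic for VQDMULH and gives a Q15 x Q15 -> Q15 product directly:

    #include <arm_neon.h>
    #include <stdint.h>

    /* out[i] = a[i] * b[i] in Q15, eight lanes per iteration */
    void q15_mul(const int16_t *a, const int16_t *b, int16_t *out, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8) {
            int16x8_t va = vld1q_s16(a + i);
            int16x8_t vb = vld1q_s16(b + i);
            vst1q_s16(out + i, vqdmulhq_s16(va, vb));  /* saturating doubling multiply, high half */
        }
        /* leftover elements omitted for brevity */
    }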
Also, as auselen mentions, newer cores have different micro-architectures, and things always change. We are lucky that ARM actually makes their info public. For vendors that modify the microarchitecture like Apple, Qualcomm and Samsung (probably others...) the only way to know is to try it, which can be a lot of work if you are writing assembly. Still, I think the official ARM instruction timing website is probably quite useful. And I actually do think that they publish the numbers for A9, and these are mostly identical.

Does doing pointer arithmetic incur the cost of a divide

I'm working on an embedded processor where the cost of doing a divide is high. When tracking down divide calls in the assembler output I was surprised to see pointer arithmetic generating a call to the divide function.
I can't see how compilers can avoid the divide unless the size of the struct is a power of 2. Anyone know if cleverer compilers like gcc manage to avoid this somehow?
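To illustrate where the divide typically comes from (the struct and function here are hypothetical): subtracting two pointers to a struct yields an element count, so the compiler has to divide the byte difference by the struct's size:

    #include <stddef.h>

    struct sample { char payload[12]; };          /* 12 bytes: not a power of 2 */

    ptrdiff_t count_between(struct sample *first, struct sample *last)
    {
        return last - first;                      /* (byte difference) / 12 */
    }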
Division by a constant can usually be optimized into a wide multiplication followed by a shift. This may still be too slow for you, I don't know. But this only happens for pointer subtraction, which can probably be avoided, depending on how you're using it.
On certain processors, when full optimisations are on, compilers can do strength reduction to turn a divide into a multiply. So for instance instead of dividing by 10 they will multiply by 3435973837 and take the upper 32 bits, which is equivalent to multiplying by 0.8, and then divide by 8 using a shift.
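As a concrete sketch of that transformation, written out in C rather than in the compiler's internal form: dividing an unsigned 32-bit value by 10 with a widening multiply and shifts, and no divide instruction:

    #include <stdint.h>

    uint32_t div10(uint32_t x)
    {
        /* 0xCCCCCCCD / 2^32 is roughly 0.8, so (x * ~0.8) / 8 == x / 10 */
        uint32_t hi = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 32);
        return hi >> 3;
    }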

64-bit integer implementation for 8-bit microcontroller

I'm working on the OKI 431 microcontroller, which is an 8-bit microcontroller. We don't want any floating-point operations to be performed in our project, so we've eliminated all floating-point operations and converted them into integer operations in some way. But we cannot eliminate one floating-point operation, because converting that calculation to integer arithmetic requires 64-bit integers, which the micro doesn't natively support. Its C compiler supports up to 32-bit integer operations. The calculation takes so long that it is noticeable to the user.
I'm wondering if there is any 64-bit integer library that can easily be used in C for microcontroller coding, or what the easiest way is to write such a thing efficiently. Here, efficiently means minimizing the amount of time required.
Thanks in advance.
Since this is a micro-controller you will probably want to use a simple assembly library. The fewer operations it has to support the simpler and smaller it can be. You may also find that you can get away with smaller than 64 bit numbers (48 bit, perhaps) and reduce the run time and register requirements.
You may have to go into assembly to do this. The obvious things you need are:
addition
2s complement (invert and increment)
left and right arithmetic shift by 1
From those you can build subtraction, multiplication, long division, and longer shifts. Keep in mind that multiplying two 64-bit numbers gives you a 128-bit number, and long division may need to be able to take a 128-bit dividend.
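As a sketch of the addition building block (the type and function names are just illustrative), a 64-bit value can be held as two 32-bit halves and added using only 32-bit operations, recovering the carry from an unsigned overflow check:

    #include <stdint.h>

    typedef struct { uint32_t lo, hi; } u64_t;

    u64_t u64_add(u64_t a, u64_t b)
    {
        u64_t r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry out of the low word */
        return r;
    }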
It will seem painfully slow, but the assumption in such a machine is that you need a small footprint, not speed. I assume you are doing these calculations at the lowest frequency you can.
An open-source library may have a slightly faster way to do it, but it could also be even slower.
Whenever speed is a problem with floating point math in small embedded systems, and when integer math is not enough, fixed point math is a fast replacement.
http://forum.e-lab.de/topic.php?t=2387
http://en.wikipedia.org/wiki/Fixed-point_arithmetic
http://www.eetimes.com/discussion/other/4024639/Fixed-point-math-in-C
http://en.wikibooks.org/wiki/Embedded_Systems/Floating_Point_Unit
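A minimal fixed-point sketch along those lines, assuming Q8.8 values stored in 16-bit integers so that a 32-bit intermediate is enough for multiplication (the type and function names are made up):

    #include <stdint.h>

    typedef int16_t q8_8;                          /* value = raw / 256 */

    static q8_8 q8_8_from_int(int v)   { return (q8_8)(v << 8); }

    static q8_8 q8_8_mul(q8_8 a, q8_8 b)
    {
        return (q8_8)(((int32_t)a * b) >> 8);      /* 32-bit product, rescale by 2^-8 */
    }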

Resources