Can we store a floating point in a regular register? - c

As I understand it, floating-point values are stored in XMM registers and not in general-purpose registers such as eax, so I did an experiment:
float a = 5;
In this case, a is stored as 1084227584 in an XMM register.
Here is an assembly version:
.text
.global _start
.LCO:
.long 1084227584
_start:
mov .LCO, %eax
movss .LCO, %xmm0
Executing the above assembly and debugging it using gdb shows that the value in eax will be 1084227584; however, the value in ymm0 is 5.
Here are my questions:
1- What's so special about the XMM registers? Besides the SIMD instructions, are they the only type of registers that can store floating points? Why can't I set the same bits in a regular register?
2- Are float and double values always stored as floating point?
Can we never store them as fixed point in C or assembly?

however the value in ymm0 is 5.
The bit-pattern in ymm0 is 1084227584. The float interpretation of that number is 5.0.
But you can use print /x $xmm0.v4_int32 in GDB to see a hex representation of the bits in xmm0.
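For example, you can see the same bit pattern from C by copying the object representation of a float into an integer (a minimal sketch; 1084227584 is just 0x40A00000, the IEEE-754 binary32 encoding of 5.0f):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 5.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);       // reinterpret the float's 4 bytes as an integer
    printf("%u = 0x%08X\n", bits, bits);  // prints 1084227584 = 0x40A00000
    return 0;
}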
What's so special about the XMM registers? Besides the SIMD instructions, are they the only type of registers that can store floating points?
No, in asm everything is just bytes.
Some compilers will use an integer register to copy a float or double from one memory location to another, if not doing any computation on it. (Integer instructions are often smaller.) For example, clang will do this: https://godbolt.org/z/76EWMY
void copy(float *d, float *s) { *d = *s; }
# clang8.0 -O3 targeting x86-64 System V
copy: # #copy
mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
ret
XMM/YMM/ZMM registers are special because they're the only registers that FP ALU instructions exist for (ignoring x87, which is only used for 80-bit long double in x86-64).
addsd xmm0, xmm1 (add scalar double) has no equivalent for integer registers.
Usually FP and integer data don't mingle very much, so providing a whole separate set of architectural registers allows more space for more data to be in registers. (Given the same instruction-encoding constraints, it's a choice between 16 FP + 16 GP integer vs. 16 unified registers, not vs. 32 unified registers).
Plus, a major microarchitectural benefit of a separate register file is that it can be physically close to the FP ALUs, while the integer register file can be physically close to the integer ALUs. For more, see Is there any architecture that uses the same register space for scalar integer and floating point operations?
Are float and double values always stored as floating point? Can we never store them as fixed point in C or assembly?
x86 compilers use float = IEEE754 binary32 https://en.wikipedia.org/wiki/Single-precision_floating-point_format. (And double = IEEE754 binary64). This is specified as part of the ABI.
Internally the as-if rule allows the compiler to do whatever it wants, as long as the final result is identical. (Or with -ffast-math, to pretend that FP math is associative, and assume NaN/Inf aren't possible.)
Compilers can't just randomly choose a different object representation for some float that other separately-compiled functions might look at.
There might be rare cases for locals that are never visible to other functions where a "human compiler" (hand-writing asm to implement C) could prove that fixed-point was safe. Or more likely, that the float values were exact integers small enough that double wouldn't round them, so your fixed-point could degenerate to integer (except maybe for a final step).
But it would be rare to know this much about possible values without just being able to do constant propagation and optimize everything away. That's why I say a human would have to be involved, to prove things the compiler wouldn't know to look for.
I think in theory you could have a C implementation that did use a fixed-point float or double. ISO C puts very little restrictions on what float and double actually are.
But float.h constants like FLT_RADIX and DBL_MAX_EXP have interactions that might not make sense for a fixed-point format, which has a constant distance between representable values, instead of them being much closer together near 0 and much farther apart for large numbers. (A rounding error of 0.5 ulp is relative to the magnitude, instead of absolute.)
Still, most programs don't actually do things that would break if the "mantissa" and exponent limits didn't correspond to what you'd expect for DBL_MIN and DBL_MAX.
Another interesting possibility is to make float and double based on the Posit format (similar to traditional floating-point, but with a variable-length exponent encoding. https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit-number/ https://posithub.org/index).
Modern hardware, especially Intel CPUs, has very good support for IEEE float/double, so fixed-point is often not a win. There are some nice SIMD instructions for 16-bit fixed-point, though, like high-half-only multiply, and even pmulhrsw which does fixed-point rounding.
But general 32-bit integer multiply has worse throughput than packed-float multiply. (Because the SIMD ALUs optimized for float/double only need 24x24-bit significand multipliers per 32 bits of vector element. Modern Intel CPUs run integer multiply and shift on the FMA execution units, with 2 uops per clock throughput.)
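As a small illustration of the 16-bit fixed-point support mentioned above (my own sketch, not part of the original answer): pmulhrsw, exposed as the _mm_mulhrs_epi16 intrinsic (SSSE3), does a rounding Q15 multiply, computing (a*b + 0x4000) >> 15 per 16-bit lane:
#include <immintrin.h>

// Rounding Q15 fixed-point multiply: each lane holds a value in [-1, 1) scaled by 2^15.
static inline __m128i q15_mul(__m128i a, __m128i b)
{
    return _mm_mulhrs_epi16(a, b);
}

// e.g. scale every lane by 0.75 (0.75 * 2^15 = 24576)
static inline __m128i scale_by_three_quarters(__m128i a)
{
    return q15_mul(a, _mm_set1_epi16(24576));
}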

Are they the only type of registers that can store floating points?
No. There are also the 80-bit floating-point registers (st(0)-st(7)) of the 8087-compatible x87 FPU, which is still present in modern x86 CPUs.
Most 32-bit programs use these registers.
Can we store a floating point in a regular [integer] register?
Yes. 30 years ago many PCs contained a CPU without an 80x87 FPU, so there were no x87 floating-point registers. CPUs with XMM registers came even later.
We find a similar situation in mobile devices today.
What's so special about the XMM registers?
Using the 80x87 FPU seems to be more complicated than using XMM registers. Furthermore, I'm not sure if using the 80x87 is allowed in 64-bit programs in every operating system.
If you store a floating-point value in an integer register (such as eax), you don't have any instructions performing arithmetic: On x86 CPUs, there is no instruction for doing a multiplication or addition of floating-point values that are stored in integer registers.
In the case of CPUs without FPU, you have to do floating-point emulation. This means you have to perform one floating-point operation by doing multiple integer operations - just like you would do it with paper and pencil.
However, if you only want to store a floating-point value, you can of course also use an integer register. The same is true for copying a value or checking if two values are equal and similar operations.
Can we never store them as a fixed point in C or assembly?
Fixed point is used a lot when using CPUs that do not have an FPU.
For example when using 8- or 16-bit CPUs which are still used in automotive industry, consumer devices or PC peripheral devices.
However, I doubt that there are C compilers that automatically translate the keyword "float" to fixed point.
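For illustration, hand-written fixed point in plain C might look like this (a sketch using a hypothetical Q16.16 format, i.e. 16 integer bits and 16 fraction bits; this is not something a compiler will generate from float for you):
#include <stdint.h>

typedef int32_t q16_16;                          // 16 integer bits, 16 fraction bits

#define Q16_ONE          (1 << 16)               // 1.0 in Q16.16
#define DOUBLE_TO_Q16(x) ((q16_16)((x) * Q16_ONE))

static inline q16_16 q16_add(q16_16 a, q16_16 b) { return a + b; }  // same as integer addition
static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) >> 16);     // widen to 64 bits, then rescale
}

/* Example: 2.5 * 1.25 + 0.75 == 3.875
   q16_16 r = q16_add(q16_mul(DOUBLE_TO_Q16(2.5), DOUBLE_TO_Q16(1.25)), DOUBLE_TO_Q16(0.75)); */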

Related

Dividing packed 16-bit integer with mask using AVX512 or SVML intrinsics

I am looking for a solution for dividing packed 16-bit integers with a mask (__mmask16 for example). The _mm512_mask_div_epi32 intrinsic seems to be good; however it only supports packed 32-bit integers, which unnecessarily forces me to widen my packed 16-bit integers to packed 32-bit before use.
_mm512_mask_div_epi32 isn't a real intrinsic; it's an Intel SVML function. x86 doesn't have SIMD integer division, only SIMD FP double and float.
If your divisor vectors are compile-time constants (or reused for multiple dividends), see https://libdivide.com/ for exact division using a multiplicative inverse.
Otherwise probably your best bet is to convert to single-precision FP which can exactly represent every 16-bit integer. If _mm512_mask_div_epi32 does any extra work to deal with the fact that FP32 can't exactly represent every possible int32_t, that's wasted for your use case.
(Some future CPUs may have support for some kind of 16-bit FP in the IA cores, not just the GPU, but for now the best way to take advantage of the high-throughput hardware div/sqrt SIMD execution unit is via conversion to float. Like one __m256 per 5 clock cycles for Skylake vdivps ymm with a single uop, or one per 10 clock cycles for __m512 with a 3-uop vdivps zmm)
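A sketch of that conversion approach with AVX-512F intrinsics (the function name and widening strategy are my own, not from the question): widen the 16-bit elements to 32-bit, convert to float, divide, and truncate back. Every 16-bit integer is exactly representable as a float, so the truncated quotient matches C integer division:
#include <immintrin.h>

// Divide sixteen signed 16-bit elements by sixteen others via float division.
static inline __m256i div_epi16_via_float(__m256i a, __m256i b)
{
    __m512 af = _mm512_cvtepi32_ps(_mm512_cvtepi16_epi32(a));  // sign-extend to i32, convert to float
    __m512 bf = _mm512_cvtepi32_ps(_mm512_cvtepi16_epi32(b));
    __m512i q = _mm512_cvttps_epi32(_mm512_div_ps(af, bf));    // truncate toward zero, like C '/'
    return _mm512_cvtepi32_epi16(q);                           // narrow back to 16-bit lanes
}
A mask could be applied at the division step with _mm512_mask_div_ps (merge-masking) or _mm512_maskz_div_ps (zero-masking).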

Did old DOS C compilers implement double as 32-bit?

I'm reading Michael Abrash's Graphics Programming Black Book which is all about 3D graphics performance, so I was surprised to find that a lot of the C code there uses double instead of float. We're talking about early 90s computers (286, 386, Pentium) and MS-DOS C compilers, so what's the reason for using double in that era? Didn't float exist or was double's and float's precision different than today?
In short, why double was used in performance-critical code in that era?
As far as I know there was no C compiler targeting MS-DOS that used a 32-bit wide double; instead they all used a 64-bit wide double. This was certainly the case by the early 90s. Based on a quick read of the "Floating Point for Real-Time 3D" chapter of the book, it appears that Michael Abrash thought that floating-point math of any precision was too slow on anything less than a Pentium CPU. Either the floating-point code you're looking at was intended for Pentium CPUs or it was used on a non-critical path where performance doesn't matter. For performance-critical code meant for earlier CPUs, Abrash implies that he would've used fixed-point arithmetic instead.
In a lot of cases using float instead of double wouldn't have actually made much difference. There are a few reasons. First, if you don't have an x87 FPU (floating-point unit) installed (a separate chip before the '486), using less precision wouldn't improve performance enough to make software-emulated floating-point arithmetic fast enough to be useful for a game. The second is that the performance of most x87 FPU operations wasn't actually affected by precision. On a Pentium CPU only division was faster if performed at a narrower precision. For earlier x87 FPUs I'm not sure precision affected division, though it could affect the performance of multiplication on the 80387. On all x87 FPUs addition would've been the same speed regardless of precision.
The third is that the specific C data type used, whether the 32-bit float, the 64-bit double, or even the 80-bit long double that many compilers supported, didn't actually affect the precision the FPU used during calculations. This is because the FPU didn't have different instructions (or encodings) for the three different precisions it supported. There was no way to tell it to perform a float addition or a double divide. Instead it performed all arithmetic at a given precision that was set in the FPU's control register. (Or more accurately stated, it performed arithmetic as if using infinite precision and then rounded the result to the set precision.) While it would've been possible to change this register every time a floating-point instruction is used, this would cause a massive decrease in performance, so compilers never did this. Instead they just set it to either 80-bit or 64-bit precision at program startup and left it that way.
Now it was actually a common technique for 3D games to set the FPU to single precision. This meant floating-point arithmetic, whether using double or float types, would be performed using single-precision arithmetic. While this would end up only affecting the performance of floating-point divides, 3D graphics programming tends to do a lot of divisions in critical code (e.g. perspective divides), so this could bring a significant performance improvement.
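For reference, setting the x87 precision-control field from C looks roughly like this (a sketch using GCC inline assembly; on Microsoft compilers of that era you would use _controlfp instead):
#include <stdint.h>

/* Put the x87 FPU into 24-bit (single) precision by clearing the PC field
   (bits 8-9) of the control word. Divides get faster; doubles lose precision. */
static void set_x87_single_precision(void)
{
    uint16_t cw;
    __asm__ volatile ("fnstcw %0" : "=m"(cw));   // read the current control word
    cw &= ~0x0300;                               // PC = 00b -> 24-bit significand
    __asm__ volatile ("fldcw %0" : : "m"(cw));   // write it back
}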
There is however one way that using float instead of double could improve performance, and that's simply because a float takes up half the space of a double. If you have a lot of floating-point values then having to read and write half as much memory can make a significant difference in performance. However, on Pentium or earlier PCs this wouldn't result in the huge performance difference it would today. The gap between CPU speed and RAM speed wasn't as wide back then, and floating-point performance was a fair bit slower. Still, it would be a worthwhile optimization if the extra precision isn't needed, as is usually the case in games.
Note that modern x86 C compilers don't normally use x87 FPU instructions for floating-point arithmetic; instead they use scalar SSE instructions, which, unlike the x87 instructions, do come in single- and double-precision versions. (But no 80-bit wide extended-precision versions.) Except for division, this doesn't make any performance difference, but it does mean that results are always rounded to float or double precision after every operation. When doing math on the x87 FPU this rounding would only happen when the result was written to memory. This means SSE floating-point code now has predictable results, while x87 FPU code had unpredictable results, because it was in general hard to predict when the compiler would need to spill a floating-point register into memory to make room for something else.
So basically using float instead of double wouldn't have made a big performance difference except when storing floating-point values in a big array or other large data structure in memory.

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

I am somehow confused by the MOVSD assembly instruction. I wrote some numerical code computing some matrix multiplication, simply using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for compilation. But when I check the assembler output, I see that:
1) 128-bit vector registers XMM are used;
2) SSE2 instruction MOVSD is invoked.
I understand that MOVSD essentially operates on a single double-precision floating-point value. It only uses the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things:
1) I never gave the compiler any hint to use SSE2. Plus, I am using GCC, not the Intel compiler. As far as I know, the Intel compiler will automatically seek opportunities for vectorization, but GCC will not. So how does GCC know to use MOVSD? Or, has this x86 instruction been around since long before the SSE instruction set, and is the _mm_load_sd() intrinsic in SSE2 just there to provide backward compatibility for using XMM registers for scalar computation?
2) Why doesn't the compiler use other floating-point registers, either the 80-bit floating-point stack or 64-bit floating-point registers? Why must it take the toll of using an XMM register (by setting the upper 64 bits to 0 and essentially wasting that storage)? Do XMM registers provide faster access?
By the way, I have another question regarding SSE2. I just can't see the difference between _mm_store_sd() and _mm_storel_sd(). Both store the lower 64-bit value to an address. What is the difference? A performance difference? An alignment difference?
Thank you.
Update 1:
Okay, obviously when I first asked this question, I lacked some basic knowledge of how a CPU manages floating-point operations, so experts tended to think my question was nonsense. Since I did not include even the shortest sample C code, people might think this question is vague as well. Here I provide a review as an answer, which hopefully will be useful to anyone unclear about floating-point operations on modern CPUs.
A review of floating point scalar/vector processing on modern CPUs
The idea of vector processing dates back to old-time vector processors, but these have been superseded by modern architectures with cache systems. So we focus on modern CPUs, especially x86 and x86-64. These architectures are the mainstream in high-performance scientific computing.
Starting with the 8087 coprocessor (long before the i386), Intel provided a floating-point register stack where floating-point numbers up to 80 bits wide can be held. This stack is commonly known as the x87 or 387 floating-point "registers", with a set of x87 FPU instructions. The x87 registers are not real, directly addressable registers like general-purpose registers, as they are organized as a stack. Access to register st(i) is by offsetting the stack-top register %st(0), or simply %st. With the help of the FXCH instruction, which swaps the contents of the current stack top %st and some offset register %st(i), random access can be achieved. But FXCH can impose some performance penalty, though minimized. The x87 stack provides high-precision computation by calculating intermediate results with 80 bits of precision by default, to minimise roundoff error in numerically unstable algorithms. However, x87 instructions are completely scalar.
The first effort at vectorization was the MMX instruction set, which implemented integer vector operations. The vector registers under MMX are the 64-bit wide registers MMX0, MMX1, ..., MMX7. Each can be used to hold either a 64-bit integer, or multiple smaller integers in a "packed" format. A single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. So now there are the legacy general-purpose registers for scalar integer operations, as well as the new MMX registers for integer vector operations, with no shared execution resources. But MMX shared execution resources with the scalar x87 FPU: each MMX register corresponded to the lower 64 bits of an x87 register, and the upper 16 bits of the x87 registers are unused. These MMX registers were each directly addressable. But the aliasing made it difficult to work with floating-point and integer vector operations in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.
Later, SSE created a separate set of 128-bit wide registers XMM0–XMM7 alongside the x87 stack. SSE instructions focused exclusively on single-precision floating-point operations (32-bit); integer vector operations were still performed using the MMX registers and MMX instruction set. But now both kinds of operations can proceed at the same time, as they share no execution resources. It is important to know that SSE does not only floating-point vector operations, but also floating-point scalar operations. Essentially it provides a new place where floating-point operations take place, and the x87 stack is no longer the preferred place to carry out floating-point operations. Using XMM registers for scalar floating-point operations is faster than using the x87 stack, as all XMM registers are easy to access, while the x87 stack can't be randomly accessed without FXCH. When I posted my question, I was clearly unaware of this fact. The other concept I was not clear about is that general-purpose registers are integer/address registers. Even though they are 64-bit on x86-64, they cannot be used for 64-bit floating-point arithmetic, because the execution units associated with general-purpose registers are integer ALUs (arithmetic & logic units), which do not do floating-point computation.
SSE2 is a major step forward, as it extends the vector data types, so SSE2 instructions, either scalar or vector, can work with all C standard data types. Such extension in fact makes MMX obsolete. Also, the x87 stack is no longer as important as it once was. Since there are two alternative places where floating-point operations can take place, you can specify your choice to the compiler. For example for GCC, compilation with the flag
-mfpmath=387
will schedule floating-point operations on the legacy x87 stack. Note that this seems to be the default for 32-bit x86, even if SSE is available. For example, I have an Intel Core 2 Duo laptop made in 2007, already equipped with SSE releases up to SSE4, yet GCC still uses the x87 stack by default, which makes scientific computations unnecessarily slow. In this case, we need to compile with the flag
-mfpmath=sse
and GCC will schedule floating-point operations on XMM registers. x86-64 users need not worry about this configuration, as it is the default on x86-64. This option only affects scalar floating-point operations. If we have written code using vector instructions and compile the code with the flag
-msse2
then XMM registers will be the only place where computation can take place. In other words, this flag turns on -mfpmath=sse. For more information see GCC's configuration of x86, x86-64. For examples of writing SSE2 C code, see my other post How to ask GCC to completely unroll this loop (i.e., peel this loop)?.
The SSE set of instructions, though very useful, is not the latest vector extension. AVX (Advanced Vector Extensions) enhances SSE by providing 3-operand and 4-operand instructions. See number of operands in instruction set if you are unclear about what this means. 3-operand instructions optimize the commonly seen fused multiply-add (FMA) operation in scientific computing by 1) using one fewer register; 2) reducing the explicit amount of data movement between registers; 3) speeding up FMA computations themselves. For an example of using AVX, see Nominal Animal's answer to my post.
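For a minimal FMA sketch (my own example, not Nominal Animal's code), the three-operand fused multiply-add is exposed directly as an intrinsic; compile with something like gcc -O2 -mfma:
#include <immintrin.h>

// d = a*b + c for four doubles per vector, fused into a single instruction.
__m256d fma4d(__m256d a, __m256d b, __m256d c)
{
    return _mm256_fmadd_pd(a, b, c);
}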

How floating point conversion was handled before the invention of FPU and SSE?

I am trying to understand how floating point conversion is handled at the low level. So based on my understanding, this is implemented in hardware. So, for example, SSE provides the instruction cvttss2si which converts a float to an int.
But my question is: was the floating point conversion always handled this way? What about before the invention of FPU and SSE, was the calculation done manually using Assembly code?
It depends on the processor, and there have been a huge number of different processors over the years.
FPU stands for "floating-point unit". It's a more or less generic term that can refer to a floating-point hardware unit for any computer system. Some systems might have floating-point operations built into the CPU. Others might have a separate chip. Yet others might not have hardware floating-point support at all. If you specify a floating-point conversion in your code, the compiler will generate whatever CPU instructions are needed to perform the necessary computation. On some systems, that might be a call to a subroutine that does whatever bit manipulations are needed.
SSE stands for "Streaming SIMD Extensions", and is specific to the x86 family of CPUs. For non-x86 CPUs, there's no "before" or "after" SSE; SSE simply doesn't apply.
The conversion from floating-point to integer is considered a basic enough operation that the 387 instruction set already had such an instruction, FIST—although not useful for compiling the (int)f construct of C programs, as that instruction used the current rounding mode.
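To see the distinction in C (a small sketch): the (int)f cast truncates toward zero, while lrintf converts using the current rounding mode, which is what a FIST-style conversion gives you. With the default round-to-nearest-even mode, halfway cases differ:
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%d %ld\n", (int)2.5f, lrintf(2.5f));  // 2 2  (truncation vs. round-to-nearest-even)
    printf("%d %ld\n", (int)3.5f, lrintf(3.5f));  // 3 4
    return 0;
}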
Some RISC instruction sets have always considered that a dedicated conversion instruction from floating-point to integer was an unnecessary luxury, and that this could be done with several instructions accessing the IEEE 754 floating-point representation. One basic scheme might look like this blog post, although the blog post is about rounding a float to a float representing the nearest integer.
Prior to the standardization of IEEE 754 arithmetic, there were many competing vendor-specific ways of doing floating-point arithmetic. These had different ranges, precision, and different behavior with respect to overflow, underflow, signed zeroes, and undefined results such as 0/0 or sqrt(-1).
However, you can divide floating point implementations into two basic groups: hardware and software. In hardware, you would typically see an opcode which performs the conversion, although coprocessor FPUs can complicate things. In software, the conversion would be done by a function.
Today, there are still soft FPUs around, mostly on embedded systems. Not too long ago, this was common for mobile devices, but soft FPUs are still the norm on smaller systems.
Indeed, floating-point operations are a challenge for hardware engineers, as they require a lot of hardware (leading to a higher cost for the final product) and consume a lot of power. There are some architectures that do not contain a floating-point unit. There are also architectures that do not provide instructions even for basic operations like integer division; the ARM architecture is an example of this, where you have to implement division in software. Also, the floating-point unit comes as an optional coprocessor in this architecture. It is worth thinking about this, considering the fact that ARM is the main architecture used in embedded systems.
IEEE 754 (the floating-point standard used today in most applications) is not the only way of representing real numbers. You can also represent them using a fixed-point format. For example, on a 32-bit machine you can assume there is a binary point between bits 15 and 16 and perform operations keeping this in mind. This is a simple way of representing fractional numbers, and it can be handled in software easily.
It depends on the implementation of the compiler. You can implement floating point math in just about any language (an example in C: http://www.jhauser.us/arithmetic/SoftFloat.html), and so usually the compiler's runtime library will include a software implementation of things like floating point math (or possibly the target hardware has always supported native instructions for this - again, depends on the hardware) and instructions which target the FPU or use SSE are offered as an optimization.
"Before floating-point units" doesn't really apply, since some of the earliest computers made back in the 1940s supported floating-point numbers: wiki - first electromechanical computers.
On processors without floating-point hardware, the floating-point operations are implemented in software, or on some computers, in microcode as opposed to being fully hardware implemented: wiki - microcode. Or the operations could be handled by separate hardware components such as the Intel x87 series: wiki - x87.
But my question is: was the floating point conversion always handled this way?
No. There's no x87 or SSE on architectures other than x86, so no cvttss2si either.
Everything you can do with software, you can also do in hardware and vice versa.
The same goes for float conversion. If you don't have hardware support, just do some bit hacking. There's nothing low-level here, so you can do it in C or any other language easily. There are already a lot of solutions on SO:
Converting Int to Float/Float to Int using Bitwise
Casting float to int (bitwise) in C
Converting float to an int (float2int) using only bitwise manipulation
...
Yes. The exponent was changed to 0 by shifting the mantissa, denormalizing the number. If the result was too large for an int, an exception was generated. Otherwise the denormalized number (minus the fractional part and optionally rounded) is the integer equivalent.
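A rough C sketch of that kind of bit manipulation for IEEE-754 binary32 (truncation toward zero; NaN and overflow handling omitted):
#include <stdint.h>
#include <string.h>

int32_t float_to_int_trunc(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                         // grab the raw bit pattern

    uint32_t sign = bits >> 31;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127;   // unbiased exponent
    uint32_t mant = (bits & 0x7FFFFF) | 0x800000;           // restore the implicit leading 1

    if (exp < 0)
        return 0;                                           // |f| < 1 truncates to 0
    int32_t mag = (exp <= 23) ? (int32_t)(mant >> (23 - exp))   // drop the fractional bits
                              : (int32_t)(mant << (exp - 23));  // overflows for exp > 30
    return sign ? -mag : mag;
}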

Integer SIMD Instruction AVX in C

I am trying to run SIMD instructions over the data types int, float and double.
I need multiply, add and load operation.
For float and double I successfully managed to make those instructions work:
_mm256_add_ps, _mm256_mul_ps and _mm256_load_ps (ending in _pd for double).
(Direct FMADD operation isn't supported)
But for integer I couldn't find a working instruction. All of those shown in the Intel AVX manual give a similar error with GCC 4.7, like "‘_mm256_mul_epu32’ was not declared in this scope".
For loading integer I use _mm256_set_epi32 and that's fine for GCC. I don't know why those other instructions aren't defined. Do I need to update something?
I am including all of these: <pmmintrin.h>, <immintrin.h>, <x86intrin.h>.
My processor is an Intel core i5 3570k (Ivy Bridge).
256-bit integer operations were only added with AVX2, so you'll have to use 128-bit __m128i vectors for integer intrinsics if you only have AVX1.
AVX1 does have integer loads/stores, and intrinsics like _mm256_set_epi32 can be implemented with FP shuffles or a simple load of a compile-time constant.
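A sketch of that approach (my own example, assuming an AVX1-only CPU such as the Ivy Bridge i5 in the question): keep data in __m256i, but do the integer math on the two 128-bit halves with SSE2/SSE4.1 intrinsics, all of which have VEX encodings under AVX1:
#include <immintrin.h>

// 32-bit integer multiply-add (a*b + c per lane) on a 256-bit vector, split into halves.
static inline __m256i mul_add_epi32_avx1(__m256i a, __m256i b, __m256i c)
{
    __m128i alo = _mm256_castsi256_si128(a), ahi = _mm256_extractf128_si256(a, 1);
    __m128i blo = _mm256_castsi256_si128(b), bhi = _mm256_extractf128_si256(b, 1);
    __m128i clo = _mm256_castsi256_si128(c), chi = _mm256_extractf128_si256(c, 1);

    __m128i rlo = _mm_add_epi32(_mm_mullo_epi32(alo, blo), clo);   // _mm_mullo_epi32 is SSE4.1
    __m128i rhi = _mm_add_epi32(_mm_mullo_epi32(ahi, bhi), chi);

    return _mm256_insertf128_si256(_mm256_castsi128_si256(rlo), rhi, 1);
}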
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Advanced_Vector_Extensions_2
Advanced Vector Extensions 2 (AVX2), also known as Haswell New Instructions, is an expansion of the AVX instruction set introduced in Intel's Haswell microarchitecture. AVX2 makes the following additions:
expansion of most vector integer SSE and AVX instructions to 256 bits
three-operand general-purpose bit manipulation and multiply
three-operand fused multiply-accumulate support (FMA3)
Gather support, enabling vector elements to be loaded from non-contiguous memory locations
DWORD- and QWORD-granularity any-to-any permutes
vector shifts.
FMA3 is actually a separate feature; AMD Piledriver/Steamroller have it but not AVX2.
Nevertheless, if the int value range fits in 24 bits, then you can use float instead. However, note that if you need the exact result or the low bits of the result then you'll have to convert float to double, because a 24x24 multiplication will produce a 48-bit result which can only be stored exactly in a double. At that point you still only have 4 elements per vector, and might have been better off with XMM vectors of int32. (But note that FMA throughput is typically better than integer multiply throughput.)
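For example (a sketch assuming every input and every product stays below 2^24 in magnitude), AVX1 can do the multiply in the float domain and convert back:
#include <immintrin.h>

// Multiply eight int32 lanes via float; exact only while products fit in 24 bits.
static inline __m256i mul_small_epi32_avx1(__m256i a, __m256i b)
{
    __m256 af = _mm256_cvtepi32_ps(a);                  // vcvtdq2ps is AVX1
    __m256 bf = _mm256_cvtepi32_ps(b);
    return _mm256_cvttps_epi32(_mm256_mul_ps(af, bf));  // truncate back to int32
}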
AVX1 has VEX encodings of 128-bit integer operations so you can use them in the same function as 256-bit FP intrinsics without causing SSE-AVX transition stalls. (In C you generally don't have to worry about that; your compiler will take care of using vzeroupper where needed.)
You could try to simulate an integer addition with AVX bitwise instructions like VANDPS and VXORPS, but without a bitwise left shift for ymm vectors it won't work.
If you're sure FTZ / DAZ are not set, you can use small integers as denormal / subnormal float values, where the bits outside the mantissa are all zero. Then FP addition and integer addition are the same bitwise operation. (And VADDPS doesn't need a microcode assist on Intel hardware when the inputs and result are both denormal.)
