Xilinx MicroBlaze Floating Point Compatibility


I have C code targeted at a MicroBlaze CPU.
When I debug the code as a C program in Eclipse + GCC or Visual Studio I get the results I want.
Yet when I run it on the target the results are different.
It happens only on floating point operations (Multiplication and Division).
How can I make it work with full floating point precision?
Are there special GCC flags?
P.S.
The configuration of the MicroBlaze is with all the hardware of floating point operations enabled.

I'm not very experienced with MicroBlaze, but the Wikipedia page states:
Also, key processor instructions which are rarely used but more expensive to implement in hardware can be selectively added/removed (i.e. multiply, divide, and floating-point ops.)
Emphasis mine.
So, make sure that your particular MicroBlaze actually has the floating point operations supported, otherwise I imagine your results will be very random.
Also make sure your compiler toolchain generates the proper instructions; toolchains for embedded development sometimes default to software-emulated floating point. This should be easy to check by disassembling the final code and seeing how the floating-point operations are implemented.

MicroBlaze hardware floating-point supports IEEE 754 with some exceptions that are listed in the MicroBlaze reference guide.
Floating-point is not 100% identical on all machines.
It depends on the actual precision used when executing the operations (hardware can use extended precision when executing single-precision operations). It also depends on the configured rounding mode (IEEE defines four different rounding modes).
MicroBlaze does not support denormalized floating-point values (they are considered to be zero).
However normal coding should avoid denormalized values since they have a reduced accuracy.
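To answer "what kind of difference do you see?" precisely, it can help to compare raw bit patterns from the host and the target instead of printed decimals. The following is a minimal sketch, not taken from the MicroBlaze documentation: `float_bits` and `flush_subnormal` are hypothetical helper names, and keeping the sign when flushing a denormal to zero is an assumption about what "considered to be zero" means.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw IEEE 754 binary32 bit pattern of a float, so results
   from the host and the target can be compared bit for bit. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical helper: emulate "denormals are treated as zero" on the
   host. Assumption: the sign bit is kept and everything else cleared. */
float flush_subnormal(float f) {
    uint32_t u = float_bits(f);
    uint32_t exponent = (u >> 23) & 0xFFu;   /* biased exponent field */
    uint32_t fraction = u & 0x7FFFFFu;       /* 23 fraction bits */
    if (exponent == 0 && fraction != 0) {    /* subnormal encoding */
        u &= 0x80000000u;                    /* keep only the sign */
        memcpy(&f, &u, sizeof f);
    }
    return f;
}
```

Printing `float_bits(result)` as hex on both sides shows whether the results differ only in the last bit (rounding) or by much more (for example, a flushed denormal).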
What kind of difference do you see?
Göran Bilski

Related

How floating point conversion was handled before the invention of FPU and SSE?

I am trying to understand how floating point conversion is handled at the low level. So based on my understanding, this is implemented in hardware. So, for example, SSE provides the instruction cvttss2si which converts a float to an int.
But my question is: was the floating point conversion always handled this way? What about before the invention of FPU and SSE, was the calculation done manually using Assembly code?

It depends on the processor, and there have been a huge number of different processors over the years.
FPU stands for "floating-point unit". It's a more or less generic term that can refer to a floating-point hardware unit for any computer system. Some systems might have floating-point operations built into the CPU. Others might have a separate chip. Yet others might not have hardware floating-point support at all. If you specify a floating-point conversion in your code, the compiler will generate whatever CPU instructions are needed to perform the necessary computation. On some systems, that might be a call to a subroutine that does whatever bit manipulations are needed.
SSE stands for "Streaming SIMD Extensions", and is specific to the x86 family of CPUs. For non-x86 CPUs, there's no "before" or "after" SSE; SSE simply doesn't apply.

The conversion from floating-point to integer is considered a basic enough operation that the 387 instruction set already had such an instruction, FIST—although not useful for compiling the (int)f construct of C programs, as that instruction used the current rounding mode.
Some RISC instruction sets have always considered that a dedicated conversion instruction from floating-point to integer was an unnecessary luxury, and that this could be done with several instructions accessing the IEEE 754 floating-point representation. One basic scheme might look like this blog post, although the blog post is about rounding a float to a float representing the nearest integer.

Prior to the standardization of IEEE 754 arithmetic, there were many competing vendor-specific ways of doing floating-point arithmetic. These had different ranges, precision, and different behavior with respect to overflow, underflow, signed zeroes, and undefined results such as 0/0 or sqrt(-1).
However, you can divide floating point implementations into two basic groups: hardware and software. In hardware, you would typically see an opcode which performs the conversion, although coprocessor FPUs can complicate things. In software, the conversion would be done by a function.
Today, there are still soft FPUs around, mostly on embedded systems. Not too long ago, this was common for mobile devices, but soft FPUs are still the norm on smaller systems.

Indeed, the floating point operations are a challenge for hardware engineers, as they require much hardware (leading to higher costs of the final product) and consume much power. There are some architectures that do not contain a floating point unit. There are also architectures that do not provide instructions even for basic operations like integer division. The ARM architecture is an example of this, where you have to implement division in software. Also, the floating point unit comes as an optional coprocessor in this architecture. It is worth thinking about this, considering the fact that ARM is the main architecture used in embedded systems.
IEEE 754 (the floating point standard used today in most applications) is not the only way of representing real numbers. You can also represent them using a fixed point format. For example, if you have a 32 bit machine, you can assume there is a binary point between bit 15 and bit 16 and perform operations keeping this in mind. This is a simple way of representing fractional numbers and it can be handled in software easily.
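As a rough illustration of the fixed point idea, here is a minimal sketch of a 32-bit format with 16 integer bits and 16 fractional bits. The type name `q16_16` and the helper names are made up for this example, and overflow is not checked.

```c
#include <stdint.h>

/* Q16.16 fixed point: value = raw / 65536. Names are illustrative. */
typedef int32_t q16_16;
#define Q_ONE ((q16_16)65536)   /* the number 1.0 in Q16.16 */

/* Convert a small integer to fixed point (no overflow check). */
q16_16 q_from_int(int n) { return (q16_16)n * Q_ONE; }

/* Addition works directly on the raw values. */
q16_16 q_add(q16_16 a, q16_16 b) { return a + b; }

/* The raw product carries 32 fractional bits, so scale back by 2^16.
   A 64-bit intermediate avoids losing the high bits. */
q16_16 q_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * b) / Q_ONE);
}

/* Pre-scale the dividend by 2^16 so the quotient keeps 16 fractional bits. */
q16_16 q_div(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * Q_ONE) / b);
}
```

For example, `q_mul(q_from_int(3), Q_ONE / 2)` yields the raw value for 1.5. Only integer instructions are used, which is why this approach suits CPUs without an FPU.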

It depends on the implementation of the compiler. You can implement floating point math in just about any language (an example in C: http://www.jhauser.us/arithmetic/SoftFloat.html), and so usually the compiler's runtime library will include a software implementation of things like floating point math (or possibly the target hardware has always supported native instructions for this - again, depends on the hardware) and instructions which target the FPU or use SSE are offered as an optimization.

Before Floating Point Units doesn't really apply, since some of the earliest computers made back in the 1940's supported floating point numbers: wiki - first electro mechanical computers.
On processors without floating point hardware, the floating point operations are implemented in software, or on some computers, in microcode as opposed to being fully hardware implemented: wiki - microcode , or the operations could be handled by separate hardware components such as the Intel x87 series: wiki - x87 .

But my question is: was the floating point conversion always handled this way?
No, there's no x87 or SSE on architectures other than x86, so there is no cvttss2si either.
Everything you can do with software, you can also do in hardware and vice versa.
The same goes for float conversion. If you don't have hardware support, just do some bit hacking. There's nothing low-level here, so you can do it easily in C or any other language. There are already many solutions on SO:
Converting Int to Float/Float to Int using Bitwise
Casting float to int (bitwise) in C
Converting float to an int (float2int) using only bitwise manipulation
...

Yes. The exponent was changed to 0 by shifting the mantissa, denormalizing the number. If the result was too large for an int, an exception was generated. Otherwise the denormalized number (minus the fractional part and optionally rounded) is the integer equivalent.
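The shift-the-mantissa approach can be sketched in C for IEEE 754 single precision as below. This is an illustration under assumptions, not any particular runtime library's routine: the function name is made up, the fraction is truncated rather than rounded, and out-of-range inputs trip an `assert` in place of a hardware exception.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert a float to int32_t using only integer operations,
   truncating toward zero like a C cast. */
int32_t float_to_int_trunc(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);

    int negative = (int)(u >> 31);
    int exponent = (int)((u >> 23) & 0xFFu) - 127;    /* remove the bias */
    uint32_t mant = (u & 0x7FFFFFu) | 0x800000u;      /* 1.fraction * 2^23 */

    if (exponent < 0)
        return 0;              /* |f| < 1, including zero and subnormals */
    assert(exponent < 31);     /* too large for int32_t, infinity, or NaN */

    /* Shift so that the binary point ends up right after bit 0. */
    uint32_t magnitude = (exponent >= 23) ? (mant << (exponent - 23))
                                          : (mant >> (23 - exponent));
    return negative ? -(int32_t)magnitude : (int32_t)magnitude;
}
```

As a simplification, the value -2^31 is rejected even though it would fit in an `int32_t`.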

Does any floating point-intensive code produce bit-exact results in any x86-based architecture?

I would like to know if any code in C or C++ using floating point arithmetic would produce bit exact results in any x86 based architecture, regardless of the complexity of the code.
To my knowledge, any x86 architecture since the Intel 8087 uses an FPU prepared to handle IEEE-754 floating point numbers, and I cannot see any reason why the result would be different in different architectures. However, if they were different (namely due to a different compiler or a different optimization level), would there be some way to produce bit-exact results by just configuring the compiler?

Table of contents:
C/C++
asm
Creating real-life software that achieves this.
In C or C++:
No, a fully ISO C11 and IEEE-conforming C implementation does not guarantee bit-identical results to other C implementations, even other implementations on the same hardware.
(And first of all, I'm going to assume we're talking about normal C implementations where double is the IEEE-754 binary64 format, etc., even though it would be legal for a C implementation on x86 to use some other format for double and implement FP math with software emulation, and define the limits in float.h. That might have been plausible when not all x86 CPUs included an FPU, but in 2016 that's Deathstation 9000 territory.)
related: Bruce Dawson's Floating-Point Determinism blog post is an answer to this question. His opening paragraph is amusing (and is followed by a lot of interesting stuff):
Is IEEE floating-point math deterministic? Will you always get the same results from the same inputs? The answer is an unequivocal “yes”. Unfortunately the answer is also an unequivocal “no”. I’m afraid you will need to clarify your question.
If you're pondering this question, then you will definitely want to have a look at the index to Bruce's series of articles about floating point math, as implemented by C compilers on x86, and also asm, and IEEE FP in general.
First problem: Only "basic operations": + - * / and sqrt are required to return "correctly rounded" results, i.e. <= 0.5ulp of error, correctly-rounded out to the last bit of the mantissa, so the result is the closest representable value to the exact result.
Other math library functions like pow(), log(), and sin() allow implementers to make a tradeoff between speed and accuracy. For example, glibc generally favours accuracy, and is slower than Apple's OS X math libraries for some functions, IIRC. See also glibc's documentation of the error bounds for every libm function across different architectures.
But wait, it gets worse. Even code that only uses the correctly-rounded basic operations doesn't guarantee the same results.
C rules also allow some flexibility in keeping higher precision temporaries. The implementation defines FLT_EVAL_METHOD so code can detect how it works, but you don't get a choice if you don't like what the implementation does. You do get a choice (with #pragma STDC FP_CONTRACT OFF) to forbid the compiler from e.g. turning a*b + c into an FMA with no rounding of the a*b temporary before the add.
On x86, compilers targeting 32-bit non-SSE code (i.e. using obsolete x87 instructions) typically keep FP temporaries in x87 registers between operations. This produces the FLT_EVAL_METHOD = 2 behaviour of 80-bit precision. (The standard specifies that rounding still happens on every assignment, but real compilers like gcc don't actually do extra store/reloads for rounding unless you use -ffloat-store. See https://gcc.gnu.org/wiki/FloatingPointMath. That part of the standard seems to have been written assuming non-optimizing compilers, or hardware that efficiently provides rounding to the type width like non-x86, or like x87 with precision set to round to 64-bit double instead of 80-bit long double. Storing after every statement is exactly what gcc -O0 and most other compilers do, and the standard allows extra precision within evaluation of one expression.)
So when targeting x87, the compiler is allowed to evaluate the sum of three floats with two x87 FADD instructions, without rounding off the sum of the first two to a 32-bit float. In that case, the temporary has 80-bit precision... Or does it? Not always, because the C implementation's startup code (or a Direct3D library!!!) may have changed the precision setting in the x87 control word, so values in x87 registers are rounded to 53 or 24 bit mantissa. (This makes FDIV and FSQRT run a bit faster.) All of this is from Bruce Dawson's article about intermediate FP precision.
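To make the effect of a wider temporary concrete, here is a small sketch that reproduces both outcomes portably. It does not depend on x87: `sum3_strict` forces a round to float after each addition, while `sum3_wide` deliberately keeps the temporary in double to imitate an implementation with excess precision. The function names are made up for this illustration; `FLT_EVAL_METHOD` is the standard macro from `<float.h>`.

```c
#include <float.h>

/* Sum three floats, rounding to float after every addition.
   The volatile stores discard any excess precision. */
float sum3_strict(float a, float b, float c) {
    volatile float t = a + b;
    volatile float r = t + c;
    return r;
}

/* Same sum, but the temporary is kept in double and the result is
   rounded to float only once, at the end. */
float sum3_wide(float a, float b, float c) {
    double t = (double)a + (double)b;
    return (float)(t + (double)c);
}

/* Report how this implementation evaluates float expressions:
   0 = in the nominal type, 1 = float as double, 2 = as long double,
   negative = indeterminable. */
int eval_method(void) {
    return (int)FLT_EVAL_METHOD;
}
```

With a = 1.0f and b = c = 0x1p-24f (half an ulp of 1.0f), `sum3_strict` returns exactly 1.0f because each addition is a tie that rounds to even, while `sum3_wide` returns 1.0f + 0x1p-23f. The inputs are the same, yet the results differ in the last bit depending only on where rounding happens.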
In assembly:
With rounding mode and precision set the same, I think every x86 CPU should give bit-identical results for the same inputs, even for complex x87 instructions like FSIN.
Intel's manuals don't define exactly what those results are for every case, but I think Intel aims for bit-exact backwards compatibility. I doubt they'll ever add extended-precision range-reduction for FSIN, for example. It uses the 80-bit pi constant you get with fldpi (correctly-rounded 64-bit mantissa, actually 66-bit because the next 2 bits of the exact value are zero). Intel's documentation of the worst-case-error was off by a factor of 1.3 quintillion until they updated it after Bruce Dawson noticed how bad the worst-case actually was. But this can only be fixed with extended-precision range reduction, so it wouldn't be cheap in hardware.
I don't know if AMD implements their FSIN and other micro-coded instructions to always give bit-identical results to Intel, but I wouldn't be surprised. Some software does rely on it, I think.
Since SSE only provides instructions for add/sub/mul/div/sqrt, there's nothing too interesting to say. They implement the IEEE operation exactly, so there's no chance that any x86 implementation will ever give you anything different (unless the rounding mode is set differently, or denormals-are-zero and/or flush-to-zero are different and you have any denormals).
SSE rsqrt (fast approximate reciprocal square root) is not exactly specified, and I think it's possible you might get a different result even after a Newton iteration, but other than that SSE/SSE2 is always bit exact in asm, assuming the MXCSR isn't set weird. So the only question is getting the compiler to generate the same code, or just using the same binaries.
In real life:
So, if you statically link a libm that uses SSE/SSE2 and distribute those binaries, they will run the same everywhere. Unless that library uses run-time CPU detection to choose alternate implementations...
As @Yan Zhou points out, you pretty much need to control every bit of the implementation down to the asm to get bit-exact results.
However, some games really do depend on this for multi-player, but often with detection/correction for clients that get out of sync. Instead of sending the entire game state over the network every frame, every client computes what happens next. If the game engine is carefully implemented to be deterministic, they stay in sync.
In the Spring RTS, clients checksum their gamestate to detect desync. I haven't played it for a while, but I do remember reading something at least 5 years ago about them trying to achieve sync by making sure all their x86 builds used SSE math, even the 32-bit builds.
One possible reason for some games not allowing multi-player between PC and non-x86 console systems is that the engine gives the same results on all PCs, but different results on the different-architecture console with a different compiler.
Further reading: GAFFER ON GAMES: Floating Point Determinism. Some techniques that real game engines use to get deterministic results. e.g. wrap sin/cos/tan in non-optimized function calls to force the compiler to leave them at single-precision.

If the compiler and architecture are compliant with IEEE standards, yes.
For instance, gcc is IEEE compliant if configured properly. If you use the -ffast-math flag, it will not be IEEE compliant.
See http://www.validlab.com/goldberg/paper.pdf page 25.
If you want to know exactly what exactness you can rely on when using an IEEE 754-1985 hardware/compiler pair, you need to purchase the standard from the IEEE site. Unfortunately, it is not publicly available.

Is C floating-point non-deterministic?

I have read somewhere that there is a source of non-determinism in C double-precision floating point as follows:
1. The C standard says that 64-bit floats (doubles) are required to produce only about 64-bit accuracy.
2. Hardware may do floating point in 80-bit registers.
3. Because of (1), the C compiler is not required to clear the low-order bits of floating-point registers before stuffing a double into the high-order bits.
This means YMMV, i.e. small differences in results can happen.
Is there any now-common combination of hardware and software where this really happens? I see in other threads that .NET has this problem, but are C doubles via gcc OK? (e.g. I am testing for convergence of successive approximations based on exact equality)

The behavior on implementations with excess precision, which seems to be the issue you're concerned about, is specified strictly by the standard in most if not all cases. Combined with IEEE 754 (assuming your C implementation follows Annex F) this does not leave room for the kinds of non-determinism you seem to be asking about. In particular, things like x == x (which Mehrdad mentioned in a comment) failing are forbidden since there are rules for when excess precision is kept in an expression and when it is discarded. Explicit casts and assignment to an object are among the operations that drop excess precision and ensure that you're working with the nominal type.
Note however that there are still a lot of broken compilers out there that don't conform to the standards. GCC intentionally disregards them unless you use -std=c99 or -std=c11 (i.e. the "gnu99" and "gnu11" options are intentionally broken in this regard). And prior to GCC 4.5, correct handling of excess precision was not even supported.

This may happen on Intel x86 code that uses the x87 floating-point unit (except probably 3., which seems bogus. LSB bits will be set to zero.). So the hardware platform is very common, but on the software side use of x87 is dying out in favor of SSE.
Basically, whether a number is represented in 80 or 64 bits is at the whim of the compiler and may change at any point in the code. One consequence, for example, is that a number which just tested non-zero is now zero.
See "The pitfalls of verifying floating-point computations", page 8ff.

Testing for exact convergence (or equality) in floating point is usually a bad idea, even in a totally deterministic environment. FP is an approximate representation to begin with. It is much safer to test for convergence to within a specified epsilon.
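A minimal sketch of an epsilon-based convergence test, using Newton's iteration for a square root as a stand-in for the successive approximations in the question. The function names and the choice of a combined relative and absolute tolerance are assumptions made for this example.

```c
#include <assert.h>

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* True if a and b agree to within abs_tol, or to within rel_tol
   relative to the larger magnitude of the two. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol) {
    double diff = abs_d(a - b);
    double scale = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    return diff <= abs_tol || diff <= rel_tol * scale;
}

/* Newton iteration for sqrt(x). It stops when two successive
   approximations agree to rel_tol, or after max_iter steps, instead of
   waiting for exact equality. */
double newton_sqrt(double x, double rel_tol, int max_iter) {
    assert(x >= 0.0);
    if (x == 0.0)
        return 0.0;
    double guess = x > 1.0 ? x : 1.0;
    for (int i = 0; i < max_iter; i++) {
        double next = 0.5 * (guess + x / guess);
        if (nearly_equal(next, guess, rel_tol, 0.0))
            return next;
        guess = next;
    }
    return guess;
}
```

The iteration cap matters as much as the tolerance: if excess precision or rounding makes successive values oscillate in the last bit, the loop still terminates.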

How floats are computed on a machine without an FPU

C language has a data-type float. Some machines have a floating point processor that carries out all the floating point computations. My question is: Could there be some machines without a floating point processor? How do such machines use floating point?

Many small controllers do not have floating point units. In that case, there is a floating point software library.
In the mid-1980s, we considered ourselves blessed if our system had an 8087, the FPU for the 8086 and 8088. Unfortunately our software had to work correctly if an 8087 was present or not. That meant trapping and emulating 8087 instructions if it was missing.

The C standard provides floating-point types.
It is the compiler's responsibility to translate them to the specific hardware architecture.
If the hardware instruction set supports floating point [and most modern machines do], then the compiler will most likely use it.
Otherwise, it has to generate native code that simulates floating-point behavior on its own. How is it done? You can read more about floating point on the Wikipedia page and in this more detailed article about floating point arithmetic.

Up to and including the 386, and again on the 486SX, x86 CPUs had no built-in FPU.
As for microcontrollers, most of them do not have an FPU.

You'll find that nearly all modern desktop computers and servers include an FPU.
High end mobile devices have begun to include FPUs, but not all of them have them. And if we're talking about mobile devices other than at the high end, you won't find many devices that have FPUs.
In many applications, it's possible to do arithmetic on fractional numbers using "fixed point arithmetic"--that doesn't require an FPU.
In other cases, you can do the same math that an FPU does, but it takes longer when you have to build it yourself out of other arithmetic primitives rather than having a complex chip take care of it for you.
My favorite example of floating point simulation on fixed point processors is provided in Donald Knuth's MMIXware, a complete processor simulation in very portable C.

Emulating floating point is a bit slow, but theoretically fairly simple. It's just about like most people learned in high school or so: you have a number with an exponent. To add or subtract, you have to adjust the numbers so they have the same exponents, then add/subtract the mantissas. To multiply or divide, you multiply/divide the mantissas and add/subtract the exponents.
When you've finished that, you normalize the result. In high school we used decimal, and normally required exactly one digit before the decimal point, so (for example) 10001 would be written as 1.0001 x 10^4. On the computer, the details are a bit different (e.g., we're dealing in binary instead of decimal) but the basic idea is pretty much the same.
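The multiply case can be sketched for IEEE 754 single precision as follows. This is a simplified illustration, not a complete soft-float routine: the function names are made up, and it assumes both inputs and the result are normal finite numbers (no zeros, subnormals, infinities, NaNs, overflow or underflow). It rounds to nearest, ties to even.

```c
#include <stdint.h>
#include <string.h>

static uint32_t to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float from_bits(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Multiply two normal floats using only integer arithmetic. */
float soft_mul(float a, float b) {
    uint32_t ua = to_bits(a), ub = to_bits(b);
    uint32_t sign = (ua ^ ub) & 0x80000000u;       /* signs combine by XOR */
    int32_t exp = (int32_t)((ua >> 23) & 0xFFu)
                + (int32_t)((ub >> 23) & 0xFFu) - 127;  /* add exponents */
    uint64_t ma = (ua & 0x7FFFFFu) | 0x800000u;    /* restore implicit 1 */
    uint64_t mb = (ub & 0x7FFFFFu) | 0x800000u;

    uint64_t prod = ma * mb;                       /* in [2^46, 2^48) */

    /* Normalize so that the leading 1 sits at bit 47. */
    if (prod & ((uint64_t)1 << 47))
        exp += 1;                                  /* product was in [2, 4) */
    else
        prod <<= 1;                                /* product was in [1, 2) */

    /* Keep the top 24 bits and round using the 24 discarded bits. */
    uint32_t mant = (uint32_t)(prod >> 24);
    uint32_t rest = (uint32_t)(prod & 0xFFFFFFu);
    if (rest > 0x800000u || (rest == 0x800000u && (mant & 1u))) {
        mant += 1;
        if (mant == 0x1000000u) {                  /* rounding carried out */
            mant >>= 1;
            exp += 1;
        }
    }
    return from_bits(sign | ((uint32_t)exp << 23) | (mant & 0x7FFFFFu));
}
```

Within those restrictions the result should match a hardware IEEE 754 multiply in the default rounding mode. A real library also has to handle every special case left out here, which is a large part of why emulation is slower and bigger.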

power function without the use of math library

I'm working on a microcontroller that has access to floating-point operations.
I need to make use of a power function. The problem is, there isn't enough memory to support the pow and sqrt function. This is because the microcontroller doesn't support FP operations natively, and produces a large number of instructions to use them. I can still multiply and divide floating point numbers.
Architecture: Freescale HCS12 (16-bit)

If you mentioned the architecture, you might get a more specific answer.
The Linux kernel still has the old x87 IEEE-754 math emulation library for i386 and i486 processors without a hardware floating point unit, under: arch/x86/math-emu/
There are a lot of resources online for floating point routines implemented for PIC micros, and AVR libc has a floating point library - though it's in AVR assembly.
glibc has implementations for pow functions in sysdeps/ieee754. Obviously, the compiler must handle the elementary floating point ops using hardware instructions or emulation / function calls.

Make your own function that multiplies repeatedly in a loop.
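A minimal sketch of such a loop, assuming the exponent is an integer (fractional exponents would need a different approach). The name `pow_int` is made up for this example. It uses only float multiply and divide, which the question says are available.

```c
/* Raise base to an integer power by repeated multiplication.
   Negative exponents are handled as 1 / base^|exponent|. */
float pow_int(float base, int exponent) {
    /* Take the magnitude in unsigned arithmetic so that the most
       negative int does not overflow. */
    unsigned int n = exponent < 0 ? 0u - (unsigned int)exponent
                                  : (unsigned int)exponent;
    float result = 1.0f;
    while (n-- > 0)
        result *= base;
    return exponent < 0 ? 1.0f / result : result;
}
```

For example, `pow_int(2.0f, 10)` gives 1024.0f. The loop runs once per unit of the exponent, so it is slow for large exponents but very small in code size, which is the constraint here.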
