STM32H753 (ARM Cortex-M7) rounding mode

The Cortex-M7 provides, in the FPSCR register, the RMode bits for setting the rounding mode of the FPU.
I don't understand what this means exactly.
I guess it refers to the rounding of a floating-point result, since it usually cannot be exact? But then what do the different rounding modes mean? I understand "round to nearest", but what do "round to infinity" or "round to zero" mean?

Rounding toward positive infinity means that when rounding is needed, the result is rounded upward, so it is never smaller than the exact value. Rounding toward negative infinity is the mirror image: results are never larger than the exact value.
Rounding to zero: values are rounded toward zero, so 3.1, 3.2, and 3.9 round to 3, while -3.9 rounds to -3.
For round to nearest (the default), the GCC documentation says: "It should be used unless there is a specific need for one of the others. In this mode results are rounded to the nearest representable value. If the result is midway between two representable values, the even representable is chosen. Even here means the lowest-order bit is zero."
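
On a hosted C implementation the same modes can be selected through <fenv.h>, which on the Cortex-M7 maps to the FPSCR RMode bits. A minimal sketch, assuming the implementation defines all four FE_* macros and honors fesetround (GCC may additionally need -frounding-math):

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    const int modes[] = { FE_TONEAREST, FE_UPWARD, FE_DOWNWARD, FE_TOWARDZERO };
    const char *names[] = { "to nearest", "upward", "downward", "toward zero" };
    for (int i = 0; i < 4; i++) {
        fesetround(modes[i]);
        volatile double num = 1.0, den = 3.0; /* volatile keeps the division at run time */
        printf("%-12s %.20f\n", names[i], num / den);
    }
    fesetround(FE_TONEAREST); /* restore the default */
    return 0;
}

The "upward" line prints a value slightly above 1/3, while the other three print a value slightly below it, since the exact quotient is not representable.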

Related

Why does adding 1.0/3.0 three times work just as mathematically expected?

I am aware that real numbers cannot be represented exactly in binary in most cases (even with so-called double precision). For example, 1.0/3.0 is approximated by 0x3fd5555555555555, which actually represents 0.33333333333333331483.... If we perform (1.0/3.0)+(1.0/3.0), then we obtain 0x3fe5555555555555 (so 0.66666666666666662965...), just as expected in the sense of computer arithmetic.
However, when I tried to perform (1.0/3.0)+(1.0/3.0)+(1.0/3.0) by writing the following code
#include<stdio.h>
#include<stdint.h>
#include<string.h>
int main(void){
double result=1.0/3.0;
result+=1.0/3.0;
result+=1.0/3.0;
uint64_t bits;
memcpy(&bits,&result,sizeof bits); /* passing a double directly to %llx is undefined behavior */
printf("%016llx\n",(unsigned long long)bits);
}
and compiling it with the standard GNU C compiler, the resulting program printed 0x3ff0000000000000 (which represents exactly 1). This result confused me, because I initially expected 0x3fefffffffffffff (I did not expect the rounding errors to cancel each other out, because both (1.0/3.0) and ((1.0/3.0)+(1.0/3.0)) are smaller than the actual values when represented in binary), and I still have not figured out what happened.
I would be grateful if you could let me know possible reasons for this result.
There is no need to consider 80-bit representation: the results are the same in Java, which requires, except for some irrelevant edge cases, the same behavior as IEEE 754 64-bit binary arithmetic for its doubles.
The exact value of 1.0/3.0 is 0.333333333333333314829616256247390992939472198486328125
As long as all numbers involved are in the normal range, multiplying or dividing by a power of two is exact. It only changes the exponent, not the significand. In particular, adding 1.0/3.0 to itself is exact, so the result of the first addition is 0.66666666666666662965923251249478198587894439697265625
The second addition does involve rounding. The exact sum is 0.99999999999999988897769753748434595763683319091796875, which is bracketed by representable numbers 0.999999999999999944488848768742172978818416595458984375 and 1.0. The exact value is half way between the bracketing numbers. A single bit has to be dropped. The least significant bit of 1.0 is a zero, so that is the rounded result of the addition.
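
The tie can be observed directly. A minimal sketch of this answer's arithmetic, using nextafter from <math.h> to name the representable number just below 1.0:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double third      = 1.0 / 3.0;
    double two_thirds = third + third;       /* exact: only the exponent changes */
    double below_one  = nextafter(1.0, 0.0); /* 0x3fefffffffffffff */
    /* The exact sum two_thirds + third lies halfway between below_one and 1.0;
       ties-to-even picks 1.0, whose lowest-order significand bit is zero. */
    printf("%a\n%a\n%a\n", below_one, 1.0, two_thirds + third);
    return 0;
}

The last line prints 0x1p+0, i.e. exactly 1.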
That is a good rounding question. If I remember correctly, the x87 arithmetic coprocessor uses 80 bits: 64 precision bits and 15 for the exponent (ref.). That means that internally the operation uses more bits than you can display, and in the end the coprocessor rounds its internal (more accurate) representation to a 64-bit value. As the first bit dropped is 1 and not 0, the result is rounded upward, giving 1.
But I must admit I am just guessing here...
But if you do the operation by hand, you immediately see that the addition sets all precision bits to 1 (adding 5555...5 and 555...5 shifted by 1), plus the first bit to drop, which is also 1. So by hand a normal human being would also round upward, giving 1, so it is no surprise that the arithmetic unit is able to do the correct rounding.

ARM NEON convert f32 to s32 with round toward even

Is there any function that controls the rounding mode of the vcvt_s32_f32 intrinsic? I want to use round toward even instead of round toward negative infinity.
Thanks.
No, you can't change the rounding mode.
NEON is designed for performance rather than precision, and thus is restricted compared to VFP. Unlike VFP, it's not a full IEEE 754 implementation, and is hardwired to certain settings - quoting from the ARM ARM:
denormalized numbers are flushed to zero
only default NaNs are supported
the Round to Nearest* rounding mode selected
untrapped exception handling selected for all floating-point exceptions
The specific case of floating-point to integer conversion is slightly different in that the behaviour of the VCVT instruction in this case (for both VFP and NEON) is to ignore the selected rounding mode and always round towards zero. The VCVTR instruction which does use the selected rounding mode is only available in VFP.
The ARMv8 architecture introduced a whole bunch of rounding and conversion instructions for using specific rounding modes, but I suspect that's not much help in this particular case. If you want to do conversions under a different rounding mode on ARMv7 and earlier, you'll either have to use VFP (if available) or some bit-hacking to implement it manually (one such hack is sketched after the footnote).
* The ARM ARM uses IEEE 754-1985 terminology, so more precisely this is round to nearest, ties to even
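
The bit-hacking route can be sketched in plain C with the classic magic-number trick: adding and then subtracting 1.5 * 2^23 forces a float to be rounded at the units place using the current rounding mode. The helper name below is made up; the sketch assumes |x| < 2^22, the default round-to-nearest mode (which NEON is hardwired to, as quoted above), and strict FP semantics (no -ffast-math):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: float -> int32 with round-to-nearest, ties-to-even. */
static int32_t round_even_f32(float x)
{
    const float magic = 12582912.0f; /* 1.5 * 2^23 */
    float r = (x + magic) - magic;   /* the add rounds x at the units place, ties to even */
    return (int32_t)r;               /* r is already integral, so the cast is exact */
}

int main(void)
{
    printf("%d %d %d %d\n",
           round_even_f32(0.5f),   /* 0: tie rounds to even */
           round_even_f32(1.5f),   /* 2 */
           round_even_f32(2.5f),   /* 2 */
           round_even_f32(-0.5f)); /* 0 */
    return 0;
}

On NEON the same idea vectorizes with vaddq_f32/vsubq_f32 against a constant vector, followed by the truncating vcvt, which is then exact because the value is already integral.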

Why dividing a float by a power of 10 is less accurate than typing the number directly?

When I run
printf("%.8f\n", 971090899.9008999);
printf("%.8f\n", 9710908999008999.0 / 10000000.0);
I get
971090899.90089989
971090899.90089977
I know why neither is exact, but what I don't understand is why doesn't the second match the first?
I thought basic arithmetic operations (+ - * /) were always as accurate as possible...
Isn't the first number a more accurate result of the division than the second?
Judging from the numbers you're using, and based on the IEEE 754 floating-point standard, it seems the left-hand side of the division is too large to be completely encompassed in the mantissa (significand) of a 64-bit double.
You've got 53 bits of significand (52 stored plus an implicit leading 1) worth of pure integer representation before you start bleeding precision. 9710908999008999 needs 54 bits in its representation, so it does not fit exactly -- thus, the truncation and approximation begin and your end numbers get finagled.
EDIT: As was pointed out, the first number that has no mathematical operations done on it doesn't fit either. But, since you're doing extra math on the second one, you're introducing extra rounding errors not present with the first number. So you'll have to take that into consideration too!
Evaluating the expression 971090899.9008999 involves one operation, a conversion from decimal to the floating-point format.
Evaluating the expression 9710908999008999.0 / 10000000.0 involves three operations:
Converting 9710908999008999.0 from decimal to the floating-point format.
Converting 10000000.0 from decimal to the floating-point format.
Dividing the results of the above operations.
The second of those should be exact in any good C implementation, because the result is exactly representable. However, the other two add rounding errors.
C does not require implementations to convert decimal to floating-point as accurately as possible; it allows some slack. However, a good implementation does convert accurately, using extra precision if necessary. Thus, the single operation on 971090899.9008999 produces a more accurate result than the multiple operations.
Additionally, as we learn from a comment, the C implementation used by the OP converts 9710908999008999.0 to 9710908999008998. This is incorrect by the rules of IEEE-754 for the common round-to-nearest mode. The correct result is 9710908999009000. Both of these candidates are representable in IEEE-754 64-bit binary, and both are equidistant from the source value, 9710908999008999. The usual rounding mode is round-to-nearest, ties-to-even, meaning the candidate with the even low bit should be selected, which is 9710908999009000 (with significand 0x1.1400298aa8174), not 9710908999008998 (with significand 0x1.1400298aa8173). (IEEE 754 defines another round-to-nearest mode: ties-to-away, which selects the candidate with the larger magnitude, which is again 9710908999009000.)
The C standard permits some slack in conversions; either of these two candidates conforms to the C standard, but good implementations also conform to IEEE 754.
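
The one-ulp gap is easy to see with the %a hex-float conversion, which prints the significand exactly. A small sketch using the question's numbers:

#include <stdio.h>

int main(void)
{
    double direct  = 971090899.9008999;
    double divided = 9710908999008999.0 / 10000000.0;
    /* On the asker's implementation these differ in the last significand bit (one ulp). */
    printf("%.8f  %a\n", direct,  direct);
    printf("%.8f  %a\n", divided, divided);
    return 0;
}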

Rounding to specified absolute decimal precision in C90/99

I am working on software that, among other things, converts measured numbers between text and internal (double) representation. A necessary part of the process is to produce text representations with the correct decimal precision based on the statistical uncertainty of the measurement. The needed precision varies with the number, and the least-significant digit in it can be anywhere, including left of the (decimal) units place.
Correct rounding is essential for this process, where "correct" means according to the floating-point rounding mode in effect at the time, or at least in a well-defined rounding mode. As such, I need to be careful about (read: avoid) performing intermediate arithmetic on the numbers being handled, because rounding can be sensitive even to the least-significant bit in the internal representation of a number.
I think I can do almost all the needed formatting reasonably well with the printf family of functions if I first compute the number of significant digits in the required representation:
sprintf(buffer, "%.*e", num_sig_figs - 1, number);
There is one class of corner cases that has so far defeated me, however: the one where the most significant (decimal) digit in the measured number is one place right of the least significant digit of the desired-precision representation. In that case, rounding should yield the least (and only) significant digit in the desired result as either 0 or 1, but I haven't been able to devise a way to perform the rounding in a portable(*) way without risk of changing the result. This is similar to what the MPFR function mpfr_prec_round() could do, except that it works in binary precision, whereas I need to use decimal precision.
For example, in the default rounding mode (round-to-nearest with ties rounded to even):
0.5 expressed to unit (10^0) precision should be "0" or "0e+00"
654 expressed to thousands (10^3) precision should be "1e+03"
0.03125 expressed to tenths (10^-1) precision should be "0" or "0e-01" or even "0e+00"
(*) "Portable" here means that the code accurately expresses the computation in standard, portable C99 (or better, C90). It is understood that the actual result may depend on machine details, and it should depend (and be consistent with) the floating-point rounding mode in effect.
What options do I have?
One simple (albeit fairly inefficient) approach that will always work is to print the full exact decimal value as a string, then do your rounding in decimal manually. This can be achieved by something like
snprintf(buf, sizeof buf, "%.*f", DBL_MANT_DIG-DBL_MIN_EXP, x);
I hope I got that precision right. The idea is that each additional mantissa bit, and each additional negative power of two, takes up one extra decimal place.
You avoid the issue of double rounding by the fact that the decimal value obtained is exact.
Note that double rounding only matters in the default rounding mode (nearest). In other modes, double rounding obtains the same result that would be obtained by a single rounding step, so you can take lots of shortcuts if you like.
There are probably better solutions which I'll post later if I think of them. Note that the above solution will only work on high-quality implementations where the printf family of functions is capable of printing exact decimals. It will fail horribly, for example, on MSVCRT and other low-quality implementations, even some conforming ones.
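
For illustration, a sketch of that first step, assuming an implementation (such as glibc) whose printf family prints exact decimals:

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* DBL_MANT_DIG - DBL_MIN_EXP (1074 for IEEE 754 doubles) decimal places
       suffice to print any finite double exactly. */
    char buf[DBL_MANT_DIG - DBL_MIN_EXP + 16];
    snprintf(buf, sizeof buf, "%.*f", DBL_MANT_DIG - DBL_MIN_EXP, 0.5);
    printf("%.22s...\n", buf); /* 0.50000000000000000000... */
    return 0;
}

From the exact decimal string, rounding to unit precision under ties-to-even gives "0", matching the first corner case in the question.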

How are floating point numbers stored in memory?

I've read that they're stored in the form of mantissa and exponent.
I've read this document, but I could not understand anything.
To understand how they are stored, you must first understand what they are and what kind of values they are intended to handle.
Unlike integers, a floating-point value is intended to represent extremely small values as well as extremely large. For normal 32-bit floating-point values, this corresponds to values in the range from 1.175494351 * 10^-38 to 3.40282347 * 10^+38.
Clearly, using only 32 bits, it's not possible to store every digit in such numbers.
When it comes to the representation, you can see all normal floating-point numbers as a value in the range 1.0 to (almost) 2.0, scaled with a power of two. So:
1.0 is simply 1.0 * 2^0,
2.0 is 1.0 * 2^1, and
-5.0 is -1.25 * 2^2.
So, what is needed to encode this, as efficiently as possible? What do we really need?
The sign of the value.
The exponent
The value in the range 1.0 to (almost) 2.0. This is known as the "mantissa" or the significand.
This is encoded as follows, according to the IEEE-754 floating-point standard.
The sign is a single bit.
The exponent is stored as an unsigned integer; for 32-bit floating-point values, this field is 8 bits. 1 represents the smallest exponent and "all ones minus 1" the largest. (0 and "all ones" are used to encode special values; see below.) The value in the middle (127, in the 32-bit case) represents an exponent of zero; this offset is also known as the bias.
When looking at the mantissa (the value between 1.0 and (almost) 2.0), one sees that all possible values start with a "1" (both in the decimal and the binary representation). This means that there's no point in storing it. The rest of the binary digits are stored in an integer field; in the 32-bit case this field is 23 bits.
In addition to the normal floating-point values, there are a number of special values:
Zero is encoded with both exponent and mantissa as zero. The sign bit is used to represent "plus zero" and "minus zero". A minus zero is useful when the result of an operation is extremely small, but it's still important to know from which direction the operation came.
plus and minus infinity -- represented using an "all ones" exponent and a zero mantissa field.
Not a Number (NaN) -- represented using an "all ones" exponent and a non-zero mantissa.
Denormalized numbers -- numbers smaller than the smallest normal number. They are represented using a zero exponent field and a non-zero mantissa. The special thing about these numbers is that the precision (i.e. the number of digits a value can contain) drops as the value becomes smaller, simply because there is no room for it in the mantissa.
Finally, the following is a handful of concrete examples (all values are in hex):
1.0 : 3f800000
-1234.0 : c49a4000
100000000000000000000000.0: 65a96816
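
These decompositions can be checked with a short C program. A sketch, assuming 32-bit IEEE 754 floats, that prints the three fields of each example:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float values[] = { 1.0f, -1234.0f, 100000000000000000000000.0f };
    for (int i = 0; i < 3; i++) {
        uint32_t bits;
        memcpy(&bits, &values[i], sizeof bits); /* reinterpret the float's bytes */
        printf("%08x : sign=%u biased-exponent=%u mantissa=0x%06x\n",
               bits, bits >> 31, (bits >> 23) & 0xffu, bits & 0x7fffffu);
    }
    return 0;
}

For 1.0 this prints 3f800000 : sign=0 biased-exponent=127 mantissa=0x000000, i.e. +1.0 * 2^(127-127).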
In layman's terms, it's essentially scientific notation in binary. The formal standard (with details) is IEEE 754.
typedef struct {
    unsigned int mantissa_low  : 32;
    unsigned int mantissa_high : 20;
    unsigned int exponent      : 11;
    unsigned int sign          : 1;
} tDoubleStruct;

double a = 1.2;
tDoubleStruct* b = reinterpret_cast<tDoubleStruct*>(&a);
This is an example of how memory is set up when the compiler uses IEEE 754 double precision, which is the default for a C double on little-endian systems (e.g. Intel x86). It shows the layout in C bit-field form; read the Wikipedia article on double precision to understand the encoding.
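
A self-contained C variant of the same idea, using memcpy instead of the C++ reinterpret_cast (pointer-casting a double violates strict aliasing). It assumes the bit-field layout above matches the target, which holds for typical compilers on little-endian machines:

#include <stdio.h>
#include <string.h>

typedef struct {
    unsigned int mantissa_low  : 32;
    unsigned int mantissa_high : 20;
    unsigned int exponent      : 11;
    unsigned int sign          : 1;
} tDoubleStruct;

int main(void)
{
    double a = 1.2; /* stored as 0x3ff3333333333333 */
    tDoubleStruct b;
    memcpy(&b, &a, sizeof a); /* well-defined, unlike the pointer cast */
    printf("sign=%u exponent=%u mantissa=0x%05x%08x\n",
           b.sign, b.exponent, b.mantissa_high, b.mantissa_low);
    return 0;
}

This prints sign=0 exponent=1023 mantissa=0x3333333333333; the biased exponent 1023 means a scale factor of 2^0.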
There are a number of different floating-point formats. Most of them share a few common characteristics: a sign bit, some bits dedicated to storing an exponent, and some bits dedicated to storing the significand (also called the mantissa).
The IEEE floating-point standard attempts to define a single format (or rather set of formats of a few sizes) that can be implemented on a variety of systems. It also defines the available operations and their semantics. It's caught on quite well, and most systems you're likely to encounter probably use IEEE floating-point. But other formats are still in use, as well as not-quite-complete IEEE implementations. The C standard provides optional support for IEEE, but doesn't mandate it.
The mantissa represents the most significant bits of the number.
The exponent represents how many shifts are to be performed on the mantissa in order to get the actual value of the number.
Encoding specifies how the sign of the mantissa and the sign of the exponent are represented (basically, whether shifting is to the left or to the right).
The document you refer to specifies IEEE encoding, the most widely used.
I have found the article you referenced quite illegible (and I DO know a little about how IEEE floats work). I suggest you try the Wiki version of the explanation instead. It's quite clear and has various examples:
http://en.wikipedia.org/wiki/Single_precision and http://en.wikipedia.org/wiki/Double_precision
It is implementation defined, although IEEE-754 is the most common by far.
To be sure that IEEE-754 is used:
in C, use #ifdef __STDC_IEC_559__
in C++, use the std::numeric_limits<float>::is_iec559 constant
I've written some guides on IEEE-754 at:
In Java, what does NaN mean?
What is a subnormal floating point number?
