Is IEEE-754 representation used in C?

I have to encode the electron charge, which is -1.602 * 10^-19 C, using IEEE-754. I did it manually and verified my result using this site. So I know my representation is good. My problem is that, if I try to build a C program showing my number in scientific notation, I get the wrong number.
Here is my code:
#include <stdio.h>

int main(int argc, char const *argv[])
{
    float q = 0xa03d217b;
    printf("q = %e\n", q);
    return 0;
}
Here is the result:
$ ./test.exe
q = 2.688361e+09
My question: Is there another representation that my CPU might be using internally for floating point other than IEEE-754?

The line float q = 0xa03d217b; converts the integer (hex) literal into a float value representing that number (or an approximation thereof): 0xa03d217b equates to the decimal value 2,688,360,827, which rounds to the nearest representable float, 2,688,360,704. That is the 2.688361e+09 you observed.
If you must initialize a float variable with its internal IEEE-754 (HEX) representation, then your best option is to use type punning via the members of a union (legal in C but not in C++):
#include <stdio.h>

typedef union {
    float f;
    unsigned int h;
} hexfloat;

int main()
{
    hexfloat hf;
    hf.h = 0xa03d217b;
    float q = hf.f;
    printf("%lg\n", q);
    return 0;
}
There are also some 'quick tricks' using pointer casting, like:
unsigned iee = 0xa03d217b;
float q = *(float*)(&iee);
But, be aware, there are numerous issues with such approaches, like potential endianness conflicts and the fact that you're breaking strict aliasing requirements.
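A standards-clean alternative is memcpy, which copies the bytes rather than converting the value; mainstream compilers optimize the copy away entirely. Here is a minimal sketch, assuming unsigned int and float are both 32 bits on your platform:

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned int h = 0xa03d217b;
    float q;
    memcpy(&q, &h, sizeof q); // copy the bit pattern, not the numeric value
    printf("q = %e\n", q);    // should print -1.602000e-19
    return 0;
}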

Hence, q does not contain the value you expect. The hex value is converted to a float with (approximately) the same numeric value, not with the same bit representation.
When compiled with clang and the option -Wall, there is a warning:
warning: implicit conversion from 'unsigned int' to 'float' changes value from 2688360827 to 2688360704 [-Wimplicit-const-int-float-conversion]
Can be tested on Compiler Explorer.
This warning is apparently not supported by gcc. Instead, you can use the option -Wfloat-conversion (which is not part of -Wall -Wextra):
warning: conversion from 'unsigned int' to 'float' changes value from '2688360827' to '2.6883607e+9f' [-Wfloat-conversion]
Again on Compiler Explorer.

My problem is that, if I try to build a C program showing my number in scientific notation, I get the wrong number.
What if your target machine might or might not use IEEE754 encoding? Copying the bit pattern may fail.
If starting with a binary32 constant 0xa03d217b, code could examine it and then build up the best float available for that implementation.
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

#define BINARY32_MASK_SIGN    0x80000000
#define BINARY32_MASK_EXPO    0x7F800000
#define BINARY32_MASK_SNCD    0x007FFFFF
#define BINARY32_IMPLIED_BIT  0x00800000
#define BINARY32_SHIFT_EXPO   23
float binary32_to_float(uint32_t x) {
    // Break the encoding up into its 3 fields
    bool sign = x & BINARY32_MASK_SIGN;
    int biased_expo = (x & BINARY32_MASK_EXPO) >> BINARY32_SHIFT_EXPO;
    int32_t significand = x & BINARY32_MASK_SNCD;
    float y;
    if (biased_expo == 0xFF) {
        y = significand ? NAN : INFINITY; // For simplicity, NaN payload not copied
    } else {
        int expo;
        if (biased_expo > 0) {
            // Normal number: restore the implied leading 1 bit
            significand |= BINARY32_IMPLIED_BIT;
            expo = biased_expo - 127;
        } else {
            // Subnormal number: no implied bit, minimum exponent
            expo = -126;
        }
        y = ldexpf((float)significand, expo - BINARY32_SHIFT_EXPO);
    }
    if (sign) {
        y = -y;
    }
    return y;
}
Sample usage and output:

#include <float.h>
#include <stdio.h>

int main() {
    float e = -1.602e-19;
    printf("%.*e\n", FLT_DECIMAL_DIG, e);
    uint32_t e_as_binary32 = 0xa03d217b;
    printf("%.*e\n", FLT_DECIMAL_DIG, binary32_to_float(e_as_binary32));
}
-1.602000046e-19
-1.602000046e-19

Note that C supports hexadecimal floating-point literals (since C99). See https://en.cppreference.com/w/cpp/language/floating_literal for details. This notation is useful for writing the number in a portable way, without any concern for the rounding issues you would face writing it in regular decimal/scientific notation. Here's the number you're interested in:
#include <stdio.h>

int main(void) {
    float f = -0x1.7a42f6p-63;
    printf("%e\n", f);
    return 0;
}
When I run this program, I get:
$ make a
cc a.c -o a
$ ./a
-1.602000e-19
As long as your compiler supports this notation, you need not worry about how the underlying machine represents floats, provided this particular number fits into its float representation.
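As a side note, printf's %a conversion (also from C99) prints a value back in this same hexadecimal notation, which is a convenient way to check the round trip. A small sketch:

#include <stdio.h>

int main(void) {
    float f = -0x1.7a42f6p-63f;
    printf("%a\n", f); // prints the hexadecimal form, e.g. -0x1.7a42f6p-63
    printf("%e\n", f); // -1.602000e-19
    return 0;
}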

Float inputs for which sinf and sin return different results?

I'm trying to understand something about sin and sinf from math.h.
I understand that their types differ: the former takes and returns doubles, and the latter takes and returns floats.
However, GCC still compiles my code if I call sin with float arguments:
#include <stdio.h>
#include <math.h>

#define PI 3.14159265

int main ()
{
    float x, result;
    x = 135 / 180 * PI;
    result = sin (x);
    printf ("The sin of (x=%f) is %f\n", x, result);
    return 0;
}
By default, it all compiles just fine (even with -Wall, -std=c99 and -Wpedantic; I need to work with C99), and GCC won't complain about me passing floats to sin. If I enable -Wconversion then GCC tells me:
warning: conversion to ‘float’ from ‘double’ may alter its value [-Wfloat-conversion]
result = sin (x);
^~~
So my question is: is there a float input for which using sin, like above, and (implicitly) casting the result back to float, will result in a value that is different from that obtained using sinf?
This program finds three examples on my machine:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
int i;
float f, f1, f2;
for(i = 0; i < 10000; i++) {
f = (float)rand() / RAND_MAX;
float f1 = sinf(f);
float f2 = sin(f);
if(f1 != f2) printf("jackpot: %.8f %.8f %.8f\n", f, f1, f2);
}
}
I got:
jackpot: 0.98704159 0.83439910 0.83439904
jackpot: 0.78605396 0.70757037 0.70757031
jackpot: 0.78636044 0.70778692 0.70778686
This will find all the float input values in the range 0.0 to 2 * M_PI where (float)sin(input) != sinf(input):
#include <stdio.h>
#include <math.h>
#include <float.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void)
{
    for (float in = 0.0; in < 2 * M_PI; in = nextafterf(in, FLT_MAX)) {
        float sin_result = (float)sin(in);
        float sinf_result = sinf(in);
        if (sin_result != sinf_result) {
            printf("sin(%.*g) = %.*g, sinf(%.*g) = %.*g\n",
                   FLT_DECIMAL_DIG, in, FLT_DECIMAL_DIG, sin_result,
                   FLT_DECIMAL_DIG, in, FLT_DECIMAL_DIG, sinf_result);
        }
    }
    return 0;
}
There are 1020963 such inputs on my amd64 Linux system with glibc 2.32.
float precision is approximately 6 significant decimal figures, while double is good for about 15. (The figures are approximate because these are binary floating-point values, not decimal.)
So, for example, a double value 1.23456789 will become 1.23456xxx as a float, where xxx are unlikely to be 789 in this case.
Clearly not all (in fact very few) double values are exactly representable as float, so values will change when down-converted.
So for:
double a = 1.23456789 ;
float b = a ;
printf( "double: %.10f\n", a ) ;
printf( "float: %.10f\n", b ) ;
The result in my test was:
double: 1.2345678900
float: 1.2345678806
As you can see the float in fact retained 9 significant figures in this case, but it is by no means guaranteed for all possible values.
In your test you have limited the number of instances of mismatch because of the limited and finite range of rand() and also because f itself is float. Consider:
#include <math.h>
#include <stdio.h>

int main()
{
    unsigned mismatch_count = 0;
    unsigned iterations = 0;
    for (double f = 0; f < 6.28318530718; f += 0.000001)
    {
        float f1 = sinf(f);
        float f2 = sin(f);
        iterations++;
        if (f1 != f2)
        {
            mismatch_count++;
        }
    }
    printf("%f%%\n", (double)mismatch_count / iterations * 100.0);
}
In my test about 55% of comparisons mismatched. Changing f to float reduced the mismatches to 1.3%.
So in your test, you see few mismatches because of the constraints of your method of generating f and its type. In the general case the issue is much more obvious.
In some cases you might see no mismatches - an implementation may simply implement sinf() using sin() with explicit casts. The compiler warning is for the general case of implicitly casting a double to a float without reference to any operations performed prior to the conversion.
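For illustration, such an implementation could be as trivial as the following sketch (a hypothetical my_sinf, not any particular libm's actual source); with this definition, (float)sin(x) and my_sinf(x) can never disagree:

#include <math.h>

// Hypothetical sinf: defer to the double-precision routine,
// then round the result back to float.
float my_sinf(float x)
{
    return (float)sin((double)x);
}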
However, GCC still compiles my code if I call sin with float arguments:
Yes; this is because the float argument is implicitly converted to double on entry (because sin() takes a double) and the double result is converted back to float on return (because sin() returns a double). See below for why it is better to use sinf() in this case, instead of making do with the double version.
You have included math.h which has prototypes for both function calls:
double sin(double);
float sinf(float);
And so, the compiler knows that using sin() requires a conversion from float to double, so it compiles a conversion before the call, and also compiles a conversion from double back to float for the result returned by sin().
In case you had not written #include <math.h> and ignored the compiler warning about calling sin() with no prototype, the compiler would still convert the float to double first (with unspecified argument types, that is how the default argument promotions proceed) and pass the double to the function, which would then be assumed to return an int; that mismatch provokes serious undefined behaviour.
In case you used the sinf() function (with the proper prototype) and passed a float, then no conversion is compiled in: the float is passed as such, and the returned value is assigned to a float variable, also with no conversion. Everything goes fine with no conversions, which makes for the fastest code.
In case you used the sinf() function with no prototype and passed a float, that float would be promoted to double and passed as such to sinf(), resulting in undefined behaviour. If sinf() somehow returned properly, an int result (which might or might not have something to do with the calculation, as per UB) would be converted into float type (should that be possible) and assigned to the result variable.
In the cases mentioned above, when you are operating on floats it is better to use sinf(): it takes less time to execute (it needs fewer iterations, as less precision is required of them), and the two conversions (from float to double and back from double to float) do not have to be compiled into the binary code output by the compiler.
There are some systems where computations on float are an order of magnitude faster than computations on double. The primary purpose of sinf is to allow trigonometric calculations to be performed efficiently on such systems in cases where the lower precision of float would be adequate to satisfy application needs. Converting a value to double, calling sin, and converting the result to float would always yield a value that either matches that of sinf or is more accurate(*), and on some implementations that is in fact the most efficient way of implementing sinf. On some other systems, however, such an approach would be more than an order of magnitude slower than using a purpose-designed function to evaluate the sine of a float.
(*) Note that for arguments outside the range +/- π/2, the most mathematically accurate way of computing sin(x) for the exact specified value of x might not be the most accurate way of computing what the calling code wants to know. If an application computes sinf(angle * (2.0f * 3.14159265f)) when angle is 0.5, returning the value (double)3.1415926535897932385 - (float)3.14159265f (the true sine of the rounded float argument, since sin(x) is approximately π - x for x near π) may be more "mathematically accurate" than returning zero, but the latter would more accurately represent the sine of the angle the code was actually interested in.

How to implement wrapping signed int addition in C

This is a complete rewrite of the question. Hopefully it is clearer now.
I want to implement in C a function that performs addition of signed ints with wrapping in case of overflow.
I want to target mainly the x86-64 architecture, but of course the more portable the implementation is the better. I'm also concerned mostly about producing decent assembly code through gcc, clang, icc, and whatever is used on Windows.
The goal is twofold:
write correct C code that doesn't fall into the undefined behavior blackhole;
write code that gets compiled to decent machine code.
By decent machine code I mean a single leal or a single addl instruction on machines which natively support the operation.
I'm able to satisfy either of the two requisites, but not both.
Attempt 1
The first implementation that comes to mind is
int add_wrap(int x, int y) {
    return (unsigned) x + (unsigned) y;
}
This seems to work with gcc, clang and icc. However, as far as I know, the C standard doesn't fully specify the conversion from unsigned int to signed int: when the value is out of range, the result is left to the implementation (see also here).
Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
I believe most (all?) major compilers do the expected conversion from unsigned to int, meaning that they take the correct representative modulo 2^N, where N is the number of bits, but it's not mandated by the standard so it cannot be relied upon (stupid C standard hits again). Also, while this is the simplest thing to do on two's complement machines, it is impossible on ones' complement machines, because one residue class is not representable: that of 2^(N-1).
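For reference, a variant that avoids the implementation-defined conversion entirely, at the cost of an explicit range check, could look like this sketch (a hypothetical add_wrap_portable; whether the branch optimizes away is up to the compiler):

#include <limits.h>

int add_wrap_portable(int x, int y) {
    unsigned int s = (unsigned int)x + (unsigned int)y; // well-defined, wraps mod 2^N
    if (s <= (unsigned int)INT_MAX)
        return (int)s;                                  // value fits in int directly
    // Map s into [INT_MIN, -1]: subtract 2^N in two representable steps.
    return (int)(s - (unsigned int)INT_MAX - 1u) + INT_MIN;
}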
Attempt 2
According to the clang docs, one can use __builtin_add_overflow like this
int add_wrap(int x, int y) {
    int res;
    __builtin_add_overflow(x, y, &res);
    return res;
}
and this should do the trick with clang, because the docs clearly say
If possible, the result will be equal to mathematically-correct result and the builtin will return 0. Otherwise, the builtin will return 1 and the result will be equal to the unique value that is equivalent to the mathematically-correct result modulo two raised to the k power, where k is the number of bits in the result type.
The problem is that in the GCC docs they say
These built-in functions promote the first two operands into infinite precision signed type and perform addition on those promoted operands. The result is then cast to the type the third pointer argument points to and stored there.
As far as I know, converting from a wider signed type to int is implementation-defined, so I don't see any guarantee that this will result in the wrapping behavior.
As you can see on Compiler Explorer, GCC will also generate the expected code, but I wanted to be sure that this is not by chance and is indeed part of the specification of __builtin_add_overflow.
icc also seems to produce something reasonable.
This produces decent assembly, but relies on intrinsics, so it's not really standard compliant C.
Attempt 3
Follow the suggestions of those pedantic guys from SEI CERT C Coding Standard.
In their CERT INT32-C recommendation they explain how to check in advance for potential overflow. Here is what comes out following their advice:
#include <limits.h>

int add_wrap(int x, int y) {
    if ((x > 0) && (y > INT_MAX - x))
        return (x + INT_MIN) + (y + INT_MIN);
    else if ((x < 0) && (y < INT_MIN - x))
        return (x - INT_MIN) + (y - INT_MIN);
    else
        return x + y;
}
The code performs the correct checks and compiles to leal with gcc, but not with clang or icc.
The whole CERT INT32-C recommendation is complete garbage, because it tries to transform C into a "safe" language by forcing the programmers to perform checks that should be part of the definition of the language in the first place. And in doing so it forces also the programmer to write code which the compiler can no longer optimize, so what is the reason to use C anymore?!
Edit
The tension is between portability and the decency of the generated assembly.
For instance, with both gcc and clang the two following functions which are supposed to do the same get compiled to different assembly.
f is bad in both cases, g is good in both cases (addl+jo or addl+cmovnol). I don't know if jo is better than cmovnol, but the function g is consistently better than f.
#include <limits.h>

signed int f(signed int si_a, signed int si_b) {
    if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
        ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
        return 0;
    } else {
        return si_a + si_b;
    }
}

signed int g(signed int si_a, signed int si_b) {
    signed int sum;
    if (__builtin_add_overflow(si_a, si_b, &sum)) {
        return 0;
    } else {
        return sum;
    }
}
A bit like @Andrew's answer, but without the memcpy().
Use a union to avoid the need for memcpy(). As of C2x (C23), we are sure that int is two's complement.
int add_wrap(int x, int y) {
    union {
        unsigned un;
        int in;
    } u = {.un = (unsigned) x + (unsigned) y};
    return u.in;
}
For those who like 1-liners, use a compound literal.
int add_wrap2(int x, int y) {
return ( union { unsigned un; int in; }) {.un = (unsigned) x + (unsigned) y}.in;
}
I'm not so sure because of the rules for casting from unsigned to signed
You quoted the rules exactly. If you convert from an unsigned value to a signed one and the value does not fit, then the result is implementation-defined or a signal is raised. In simple words, what will happen is described by your compiler's documentation.
For example, the gcc 9.2.0 documentation says the following about the implementation-defined behavior of integers:
The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).
For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
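For instance, under a compiler that documents the reduction this way, the conversion behaves like this (a small sketch):

#include <stdio.h>

int main(void) {
    unsigned int u = 3000000000u; // not representable in a 32-bit int
    int i = (int)u;               // gcc: reduced modulo 2^32
    printf("%d\n", i);            // prints -1294967296 under gcc's documented rule
    return 0;
}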
I had to do something similar; however, I was working with fixed-width types from stdint.h and needed to handle wrapping 32-bit signed integer operations. The implementation below works because the stdint.h exact-width types are required to be two's complement. I was trying to emulate the behaviour of Java, so I had some Java code generate a bunch of test cases, and I have tested it on clang, gcc and MSVC.
#include <stdint.h>

inline int32_t add_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t sum = a_widened + b_widened;
    return (int32_t)(sum & INT64_C(0xFFFFFFFF));
}

inline int32_t sub_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t difference = a_widened - b_widened;
    return (int32_t)(difference & INT64_C(0xFFFFFFFF));
}

inline int32_t mul_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t product = a_widened * b_widened;
    return (int32_t)(product & INT64_C(0xFFFFFFFF));
}
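A quick sanity check of the wrapping behaviour at the extremes (a sketch; it assumes the helpers above are visible in the same file and made static inline or given an external definition, since a bare C99 inline on its own does not emit one):

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    // INT32_MAX + 1 wraps around to INT32_MIN
    printf("%" PRId32 "\n", add_wrap_i32(INT32_MAX, 1)); // -2147483648
    // INT32_MIN - 1 wraps around to INT32_MAX
    printf("%" PRId32 "\n", sub_wrap_i32(INT32_MIN, 1)); // 2147483647
    return 0;
}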
It seems ridiculous, but I think that the recommended method is to use memcpy. Apparently all modern compilers optimize the memcpy away and it ends up doing just what you're hoping in the first place -- preserving the bit pattern from the unsigned addition.
#include <string.h>

int add_wrap(int a, int b)
{
    unsigned u = (unsigned)a + b;
    int result;
    memcpy(&result, &u, sizeof(result));
    return result;
}
On x86 clang with optimization, this is a single instruction if the destination is a register.

My floating value doesn't match my value in C

I'm trying to interface a board with a Raspberry Pi.
I have to read/write values on the board via Modbus, but I can't write floating-point values the way the board does.
I'm using C, and the Eclipse debug perspective to see the variables' values directly.
The board sends me 0x46C35000, which should be 25,000 decimal, but Eclipse shows me 1.18720512e+009...
When I try on this website http://www.binaryconvert.com/convert_float.html?hexadecimal=46C35000 I obtain 25,000.
What's the problem?
For testing purposes I'm using this:
#include <stdio.h>

int main(){
    while(1){ // To view the value easily in the debug perspective
        float test = 0x46C35000;
        printf("%f\n",test);
    }
    return 0;
}
Thanks!
When you do this:
float test = 0x46C35000;
You're setting the value to 0x46C35000 (decimal 1187205120), not the representation.
You can do what you want as follows:
union {
    uint32_t i;
    float f;
} u = { 0x46C35000 };

printf("f=%f\n", u.f);
This safely allows an unsigned 32-bit value to be interpreted as a float.
You’re confusing logical value and internal representation. Your assignment sets the value, which is thereafter 0x46C35000, i.e. 1187205120.
To set the internal representation of the floating point number you need to make a few assumptions about how floating point numbers are represented in memory. The assumptions on the website you’re using (IEEE 754, 32 bit) are fair on a general purpose computer though.
To change the internal representation, use memcpy to copy the raw bytes into the float:
// Ensure our assumptions are correct:
#if !defined(__STDC_IEC_559__) && !defined(__GCC_IEC_559)
# error Floating points might not be in IEEE 754/IEC 559 format!
#endif
_Static_assert(sizeof(float) == sizeof(uint32_t), "Floats are not 32 bit numbers");
float f;
uint32_t rep = 0x46C35000;
memcpy(&f, &rep, sizeof f);
printf("%f\n", f);
Output: 25000.000000.
(This requires the header stdint.h for uint32_t, and string.h for memcpy.)
The constant 0x46C35000 being assigned to a float will implicitly convert the int value 1187205120 into a float, rather than directly overlaying the bits into the IEEE-754 floating-point format.
I normally use a union for this sort of thing:
#include <stdio.h>
typedef union
{
float f;
uint32_t i;
} FU;
int main()
{
FU foo;
foo.f = 25000.0;
printf("%.8X\n", foo.i);
foo.i = 0x46C35000;
printf("%f\n", foo.f);
return 0;
}
Output:
46C35000
25000.000000
You can understand how data are represented in memory when you access them through their address:
#include <stdio.h>

int main()
{
    float f25000;             // totally unused, has exactly same size as `int'
    int i = 0x46C35000;       // put binary value of 0x46C35000 into `int' (4-byte representation of an integer)
    float *faddr;             // pointer (address) to float
    faddr = (float*)&i;       // put address of `i' into `faddr' so `faddr' points to `i' in memory
    printf("f=%f\n", *faddr); // print value pointed to by `faddr'
    return 0;
}
and the result:
$ gcc -of25000 f25000.c; ./f25000
f=25000.000000
What it does is:
put 0x46C35000 into the int i
copy the address of i into faddr, so faddr points at that memory, but as data of float type
print the value pointed to by faddr, treating it as float type
and you get your 25000.0. (Note that this pointer cast breaks the strict aliasing rule mentioned earlier; the union or memcpy approaches are the safer way to do the same thing.)

how to tell the value of a float pointer when it has been referenced from an integer? ex: float *f= (float *)someInteger

I cannot figure out how to work out the value of a dereferenced float pointer when it comes from casting the address of an integer to a float pointer. I'm sorry if I'm wording this incorrectly. Here is an example of what I mean:
#include <stdio.h>

int main() {
    int i;
    float *f;
    i = 1092616192;
    f = (float *)&i;
    printf("i is %d and f is %f\n", i, *f);
}
the output for f is 10. How did I get that result?
The value 1092616192 in hexadecimal is 0x41200000.
Interpreted as an IEEE-754 single-precision float, that breaks down to:
sign = positive (0b)
exponent = 130, giving 2^3 (10000010b)
significand = 2097152, giving 1.25 (01000000000000000000000b)
2^3 * 1.25 = 8 * 1.25 = 10
To explain the exponent part: it uses an offset (biased) encoding, so you have to subtract 127 from it to get the real value: 130 - 127 = 3. And since this is a binary encoding, we use 2 as the base: 2^3 = 8.
To explain the significand part: you start with an invisible 'whole' value of 1. The uppermost (leftmost) bit of the significand represents half of that, 0.5; the next bit is half of 0.5, 0.25; and so on. Because only the 0.25 bit is set (plus the implicit 1 bit), the significand represents 1 + 0.25 = 1.25.
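This arithmetic can be reproduced mechanically; here is a small sketch that pulls the three fields out of the integer and rebuilds the value (normal numbers only, for brevity):

#include <math.h>
#include <stdio.h>

int main(void) {
    unsigned int bits = 1092616192;          // 0x41200000
    unsigned int sign = bits >> 31;          // 0 -> positive
    unsigned int expo = (bits >> 23) & 0xFF; // 130
    unsigned int frac = bits & 0x7FFFFF;     // 0x200000

    // value = (-1)^sign * (1 + frac/2^23) * 2^(expo-127)
    double value = ldexp(1.0 + frac / 8388608.0, (int)expo - 127);
    if (sign) value = -value;
    printf("%f\n", value);                   // 10.000000
    return 0;
}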
What you are trying to do is called type-punning. It should be done via a union, or using memcpy() and is only meaningful on an architecture where sizeof(int) == sizeof(float) without padding bits. The result is highly dependent on the architecture: byte ordering and floating point representation will affect the reinterpreted value. The presence of padding bits would invoke undefined behavior as the representation of float 15.0 could be a trap value for type int.
Here is how you get the number corresponding to 15.0:
#include <stdio.h>

int main(void) {
    union {
        float f;
        int i;
        unsigned int u;
    } u;
    u.f = 15;
    printf("re-interpreting the bits of float %.1f as int gives %d (%#x in hex)\n",
           u.f, u.i, u.u);
    return 0;
}
output on an Intel PC:
re-interpreting the bits of float 15.0 as int gives 1097859072 (0x41700000 in hex)
You are trying to predict the consequence of undefined behaviour - it depends on a lot of random things, and on the hardware and OS you are using.
Basically, what you are doing is throwing a glass against the wall and getting a certain shard. Now you are asking how to get a differently formed shard. well, you need to throw the glass differently against the wall...

Converting int32_t to float

I've seen a lot of questions about converting from float to int32_t, but none that address converting an int32_t to float.
The data that I am working with is in centimeters. So I just want to confirm that there won't be any type conversion issues if I try to convert them to floats.
The reason that I am interested in this conversion, is that a function I am using works only with floating point numbers. So if I pass the function int32_t's and it is expecting floats, will it automatically typecast my arguments?
If you pass an int32_t to a function that takes a float parameter by value, then there will be an implicit conversion. One caveat though is that an IEEE-754 single-precision float has less precision than a 32-bit int (it has 24 bits of significand, versus 32 bits for an int32_t), so you may lose accuracy if you're using large values.
Example:
#include <stdio.h>
#include <stdint.h>

int32_t sqr(int32_t x)
{
    return x * x;
}

float sqrf(float x)
{
    return x * x;
}

int main(void)
{
    int32_t x = 9999;
    printf("sqr(%d) = %d, sqrf(%d) = %f\n", x, sqr(x), x, sqrf(x));
    return 0;
}
Compile and run:
$ gcc -Wall int_float_prec.c && ./a.out
sqr(9999) = 99980001, sqrf(9999) = 99980000.000000
Cast to float before passing to the function:

int32_t i = 32;
func((float) i);
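To see the 24-bit precision limit from the answer above concretely: 16777217 (2^24 + 1) is the first integer that a float cannot represent exactly (a small sketch):

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    int32_t big = 16777217;                  // 2^24 + 1
    float f = (float)big;                    // rounds to the nearest float, 16777216
    printf("%" PRId32 " -> %.1f\n", big, f); // 16777217 -> 16777216.0
    return 0;
}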
