C: convert a real number to 64 bit floating point binary

I'm trying to write code that converts a real number to a 64 bit floating point binary. The user inputs a real number (for example, 547.4242) and the program must output its 64 bit floating point binary representation.
My ideas:
The sign part is easy.
The program converts the integer part (547 for the previous example) and stores the result in an int variable. Then, the program converts the fractional part (.4242 for the previous example) and stores the result into an array (each position of the array stores '1' or '0').
This is where I'm stuck. Summarizing, I have: "Integer part = 1000100011" (type int) and "Fractional part = 0110110010011000010111110000011011110110100101000100" (array).
How can I proceed?

The following code prints the internal representation of a floating point number in IEEE 754 format. It was originally written for the Turbo C++ IDE; this version uses only standard C. It prints the 32 bits of a float and assumes a little-endian machine; for the 64-bit case, change float to double and "%f" to "%lf".
#include <stdio.h>
void decimal_to_binary(unsigned char);
union u
{
    float f;
    unsigned char c[sizeof(float)];
};
int main(void)
{
    int i;
    union u a;
    printf("ENTER THE FLOATING POINT NUMBER : \n");
    if (scanf("%f", &a.f) != 1)
        return 1;
    /* On a little-endian machine the last byte is the most significant one,
       so walk the bytes backwards. */
    for (i = (int)sizeof(float) - 1; i >= 0; i--)
        decimal_to_binary(a.c[i]);
    printf("\n");
    return 0;
}
/* Print one byte as eight binary digits, most significant bit first. */
void decimal_to_binary(unsigned char n)
{
    int arr[8];
    int i;
    for (i = 7; i >= 0; i--)
    {
        arr[i] = n % 2;
        n /= 2;
    }
    for (i = 0; i < 8; i++)
        printf("%d", arr[i]);
    printf(" ");
}
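Since the question asks for the 64-bit format, here is a minimal sketch of the same byte-inspection idea applied to a double. It is not taken from the answer above: the helper names double_to_bits and print_double_fields are made up for illustration, and the code assumes that double is an IEEE 754 binary64 value stored with the same byte order as uint64_t.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the 64 bits of d into out as '0'/'1'
   characters, most significant bit first, followed by '\0'.
   Assumes double is a 64-bit IEEE 754 value. */
void double_to_bits(double d, char out[65])
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);            /* copy the object representation */
    for (int i = 0; i < 64; i++)
        out[i] = ((u >> (63 - i)) & 1u) ? '1' : '0';
    out[64] = '\0';
}

/* Hypothetical helper: prints the three fields of the format separately:
   1 sign bit, 11 exponent bits, 52 fraction bits. */
void print_double_fields(double d)
{
    char bits[65];
    double_to_bits(d, bits);
    printf("sign=%.1s exponent=%.11s fraction=%.52s\n",
           bits, bits + 1, bits + 12);
}
```

Calling print_double_fields(1.0) shows a sign of 0, the exponent field 01111111111 (the bias 1023) and an all-zero fraction. Using memcpy rather than a pointer cast keeps the code free of aliasing problems.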

In order to correctly round all possible decimal representations to the nearest double, you need big integers. Using only the basic integer types from C will leave you to re-implement big integer arithmetic. Each of these two approaches is possible; more information about each follows:
For the first approach, you need a big integer library: GMP is a good one. Armed with such a big integer library, you tackle an input such as the example 123.456E78 as the integer 123456 * 10^75 and start wondering what values M in [2^53 … 2^54) and P in [-1022 … 1023] make (M / 2^53) * 2^P closest to this number. This question can be answered with big integer operations, following the steps described in this blog post (summary: first determine P, then use a division to compute M). A complete implementation must take care of subnormal numbers and infinities (inf is the correct result to return for any decimal representation of a number that would have an exponent larger than +1023).
The second approach, if you do not want to include or implement a full general-purpose big integer library, still requires a few basic operations to be implemented on arrays of C integers representing large numbers. The function decfloat() in this implementation represents large numbers in base 10^9 because that simplifies the conversion from the initial decimal representation to the internal representation as an array x of uint32_t.

Following is a basic conversion. Enough to get OP started.
OP's "integer part of real number" --> int is far too limiting. Better to simply convert the entire string to a large integer like uintmax_t. Note the decimal point '.' and account for overflow while scanning.
This code does not handle exponents or negative numbers. It may be off in the last bit or so due to the limited integer ui or the final num = ui * pow(10, expo). It handles most overflow cases.
#include <ctype.h>
#include <inttypes.h>
#include <math.h>
#include <stddef.h>
double my_atof(const char *src) {
    uintmax_t ui = 0;
    int dp = '.';
    size_t dpi = 0;
    size_t i = 0;
    size_t toobig = 0;
    int ch;
    for (i = 0; (ch = (unsigned char) src[i]) != '\0'; i++) {
        if (ch == dp) {
            dp = '\0'; // only get 1 dp
            dpi = i;
            continue;
        }
        if (!isdigit(ch)) {
            break; // illegal character
        }
        ch -= '0';
        // detect overflow: once ui cannot take another digit, only count the digits
        if (toobig ||
            (ui >= UINTMAX_MAX / 10 &&
             (ui > UINTMAX_MAX / 10 || (uintmax_t) ch > UINTMAX_MAX % 10))) {
            toobig++;
            continue;
        }
        ui = ui * 10 + ch;
    }
    // decimal exponent: dropped digits scale up, fractional digits scale down
    intmax_t expo = (intmax_t) toobig;
    if (dp == '\0') {
        expo -= (intmax_t) (i - dpi - 1);
    }
    double num;
    if (expo < 0) {
        // slightly more precise than: num = ui * pow(10, expo);
        num = ui / pow(10.0, (double) -expo);
    } else {
        num = ui * pow(10.0, (double) expo);
    }
    return num;
}

The trick is to treat the value as an integer, so read your 547.4242 as an unsigned long long (ie 64-bits or more), ie 5474242, counting the number of digits after the '.', in this case 4. Now you have a value which is 10^4 bigger than it should be. So you float the 5474242 (as a double, or long double) and divide by 10^4.
Decimal to binary conversion is deceptively simple. When you have more bits than the float will hold, then it will have to round. More fun occurs when you have more digits than a 64-bit integer will hold -- noting that trailing zeros are special -- and you have to decide whether to round or not (and what rounding occurs when you float). Then there's dealing with an E+/-99. Then when you do the eventual division (or multiplication) by 10^n, you have (a) another potential rounding, and (b) the issue that large 10^n are not exactly represented in your floating point -- which is another source of error. (And for E+/-99 forms, you may need up to and a little beyond 10^300 for the final step.)
Enjoy !

Related

How to convert float number to string without losing user-entered precision in C?

Here's what I'm trying to do:
I need to print the fractional part of a floating number which has to be input as a float during user input.
The fractional part should be like: if float is 43.3423, the output should be 3423; and if number is 45.3400 output should be 3400.
This can be done easily with a string input but I need a way to make this work with float without losing the extra zeros or without appending zeros to user's original input.
Here's what I already tried :-
Take the fractional part by frac = num - (int)num and then multiplying frac until we get zero as the remainder. But this fails for cases like 34.3400 — the last two zeros won't get included with this method.
Convert the float number to a string by
char string[20];
sprintf(string, "%f", float_number);
The sprintf function puts the float number as a string but here also it doesn't automatically detect the user entered precision and fills the string with extra zeros at the end (6 total precision). So here also the information about the user's original entered precision is not obtained.
So, is there a way to get this done? The number must be taken as float number from user. Is there any way to get info about what's the user's entered precision? If it's not possible, an explanation would be very helpful.
I think I understand where you're coming from. E.g. in physics, it's a difference whether you write 42.5 or 42.500, the number of significant digits is implicitly given. 42.5 stands for any number x: 42.45 <= x < 42.55 and 42.500 for any x: 42.4995 <= x < 42.5005.
For larger numbers, you would use scientific notation: 1.0e6 would mean a number x with x: 950000 <= x < 1050000.
A floating point number uses this same format, but with binary digits (sometimes called bits ;)) instead of decimal digits. But there are two important differences:
The number of digits (bits) used depends only on the data type of the floating point number. If your data type has e.g. 20 bits for the mantissa, every number stored in it will have these 20 bits. The mantissa is always stored without a part after the "decimal" (binary?) point, so you won't know how many significant bits there are.
There's no direct mapping between bits and decimal digits. You will need roughly 3.3 bits (log2(10) is about 3.32) to represent a decimal digit. So even if you knew a number of significant bits, you still wouldn't know how many significant decimal digits that would make.
To address your problem, you could store the number of significant digits yourself in something like this:
struct myNumber
{
double value;
int nsignificant;
};
Of course, you have to parse the input yourself to find out what to place in nsignificant. Also, use at least double here for the value, the very limited precision of float won't get you far. With this, you could use nsignificant to determine a proper format string for printing the number with the amount of digits you want.
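As a rough illustration of "parse the input yourself", the sketch below counts the digits typed after the decimal point and reuses that count as the precision of a "%.*f" format. The helper names count_fraction_digits and format_like_input are hypothetical, and the code assumes plain decimal input such as "45.3400" in the default "C" locale.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns how many digits follow the decimal point in
   the number at the start of text, or -1 if text does not start with a
   number. */
int count_fraction_digits(const char *text)
{
    char *end;
    (void)strtod(text, &end);           /* find where the number ends */
    if (end == text)
        return -1;
    const char *p = text;
    while (p < end && *p != '.')
        p++;
    if (p == end)
        return 0;                       /* no decimal point was typed */
    int n = 0;
    for (p++; p < end && isdigit((unsigned char)*p); p++)
        n++;
    return n;
}

/* Hypothetical helper: formats the parsed value with exactly as many
   fractional digits as the user typed. Returns snprintf's result,
   or -1 if text is not a number. */
int format_like_input(char *buf, size_t size, const char *text)
{
    int n = count_fraction_digits(text);
    if (n < 0)
        return -1;
    return snprintf(buf, size, "%.*f", n, strtod(text, NULL));
}
```

The count would be the value to store next to the double (for instance in nsignificant, if that field is taken to mean fractional digits). Exponent or hexadecimal input needs extra handling that this sketch leaves out.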
This still has the problem mentioned above: you can't directly map decimal digits to bits, so there's never a guarantee your number can be stored with the precision you intend. In cases where an exact decimal representation is important, you'll want to use a different data type for that. C# provides one, but C doesn't. You'd have to implement it yourself. You could start with something like this:
struct myDecimal
{
    long mantissa;
    short exponent;
    short nsignificant;
};
In this struct, you could e.g. place 1.0e6 like this:
struct myDecimal x = {
    .mantissa = 1,
    .exponent = 6,
    .nsignificant = 2,
};
Of course, this would require you to write quite a lot of own code for parsing and formatting these numbers.
"... which has to be input as a float during user input."
"So, is there a way to get this done?"
Almost. The "trick" is to note the textual length of user input. The below will remember the offset of the first non-whitespace character and the offset after the numeric input.
scanf(" %n%f%n", &n1, &input, &n2);
n2 - n1 gives code the length of user input to represent the float. This method can get fooled if user input is in exponential notation, hexadecimal FP notation, infinity, Not-a-number, excessive leading zeros, etc. Yet works well with straight decimal input.
The idea is to print the number to a buffer with at least n2 - n1 precision and then determine how much of the fractional portion to print.
Recall that float typically carries about 6-7 significant decimal digits, so input text like "123456789.0" results in a float with the exact value 123456792.0, and the output will be based on that value.
#include <float.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int scan_print_float(void) {
float input;
int n1, n2;
int cnt = scanf(" %n%f%n", &n1, &input, &n2);
if (cnt == 1) {
int len = n2 - n1;
char buf[len * 2 + 1];
snprintf(buf, sizeof buf, "%.*f", len, input);
char dp = '.';
char *p = strchr(buf, dp);
if (p) {
int front_to_dp = p + 1 - buf;
int prec = len - front_to_dp;
if (prec >= 0) {
return printf("<%.*s>\n", prec, p+1);
}
}
}
puts(".");
return 0;
}
int main(void) {
while (scan_print_float()) {
fflush(stdout);
}
return EXIT_SUCCESS;
}
Input/Output
43.3423
<3423>
45.3400
<3400>
-45.3400
<3400>
0.00
<00>
1234.500000
<500000>
.
.
To robustly handle this and the various edge cases, code should read user input as text and not as a float.
Note: float can typically represent about 2^32 numbers exactly.
43.3423 is usually not one of them. Instead it has an exact value of 43.3423004150390625
43.3400 is usually not one of them. Instead it has an exact value of 43.340000152587890625
The only way is to create a struct with the original string value and/ or required precision for rounding
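A minimal sketch of such a struct follows. The type userFloat and the functions read_user_float and typed_fraction are hypothetical names chosen for this example; the 32-character buffer is an arbitrary assumption about the longest input worth accepting.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: keeps the typed characters next to the parsed value. */
struct userFloat
{
    char text[32];
    float value;
};

/* Copies the first whitespace-delimited token of line into out->text and
   parses it. Returns 0 on success, -1 if the token is empty, too long,
   or not entirely a number. */
int read_user_float(const char *line, struct userFloat *out)
{
    char *end;
    size_t len = strcspn(line, " \t\r\n");
    if (len == 0 || len >= sizeof out->text)
        return -1;
    memcpy(out->text, line, len);
    out->text[len] = '\0';
    out->value = strtof(out->text, &end);
    return (*end == '\0') ? 0 : -1;
}

/* Returns the digits typed after the decimal point, or "" if none. */
const char *typed_fraction(const struct userFloat *n)
{
    const char *dot = strchr(n->text, '.');
    return dot ? dot + 1 : "";
}
```

With this layout the arithmetic uses value, while anything that must echo the user's precision reads from text, so trailing zeros such as those in "45.3400" are never lost.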

How to find out in C whether a double number has any digits after decimal point

I stumbled on one issue while I was implementing in C the given algorithm:
int getNumberOfAllFactors(int number) {
int counter = 0;
double sqrt_num = sqrt(number);
for (int i = 1; i <= sqrt_num; i++) {
if ( number % i == 0) {
counter = counter + 2;
}
}
if (number == sqrt_num * sqrt_num)
counter--;
return counter;
}
– the reason for second condition – is to make a correction for perfect squares (i.e. 36 = 6 * 6), however it does not avoid situations (false positives) like this one:
sqrt(325) = 18.027756377319946
18.027756377319946 * 18.027756377319946 = 325.0
So my questions are: how to avoid it and what is the best way in C language to figure out whether a double number has any digits after decimal point? Should I cast square root values from double to integers?
In your case, you could test it like this:
if (sqrt_num == (int)sqrt_num)
You should probably use the modf() family of functions:
#include <math.h>
double modf(double value, double *iptr);
The modf functions break the argument value into integral and fractional parts, each of
which has the same type and sign as the argument. They store the integral part (in
floating-point format) in the object pointed to by iptr.
This is more reliable than trying to use direct conversions to int because an int is typically a 32-bit number and a double can usually store far larger integer values (up to 53 bits worth) so you can run into errors unnecessarily. If you decide you must use a conversion to int and are working with double values, at least use long long for the conversion rather than int.
(The other members of the family are modff() which handles float and modfl() which handles long double.)
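For the factor-counting function itself, one option is to avoid the floating-point comparison altogether. The sketch below is an alternative to the modf() approach, not an application of it: it keeps the perfect-square correction in integer arithmetic, and the name count_factors is made up for this example.

```c
/* Hypothetical rewrite of the factor counter that never touches double.
   Each divisor i with i * i < number pairs with number / i, so it counts
   twice; a divisor with i * i == number is its own partner and counts once.
   Assumes number >= 1. */
int count_factors(int number)
{
    int counter = 0;
    for (int i = 1; (long long)i * i <= number; i++) {
        if (number % i == 0)
            counter += ((long long)i * i == number) ? 1 : 2;
    }
    return counter;
}
```

The cast to long long keeps i * i from overflowing when number is close to INT_MAX. Because the square test compares integers, a non-square can never be mistaken for a perfect square.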

C, split a floating point number into individual digits

I am working on a small electronics project at home using a PIC microcontroller 18F which I am programming with HiTech C18 that is going to be used for digital control of a bench power supply.
I have run into a problem: I have a floating point number in a variable, say 12.34, and need to split it into 4 variables, each holding one digit, so that I get Char1 = 1, Char2 = 2, etc. for display on a 4-way seven segment LED display. The number will always be rounded to 2 decimal places, so there shouldn't be a need to track the location of the decimal point.
I am trying to avoid any rounding where possible above 2 decimal places as the displays are giving measurements of voltage/current and this would affect the accuracy of the readouts.
Any advice on how to get this split would be greatly appreciated.
Thanks
Use sprintf to put the value into a character array. And then pick out the digits from there.
You could convert the floating point value directly to text. Or you could multiply by 100, truncate or round to int, and then convert that to text.
Convert to int and then to a string.
float x;
int i = x*100;
// or i = x*100.0f + 0.5f if rounding to nearest is desired.
if ((i < 0) || (i > 9999)) Handle_RangeProblem();
char buf[5];
sprintf(buf, "%04d", i);
In embedded applications, many compilers use the fixed format string to determine which parts of the large printf() code will be needed. If code is already using "%f" elsewhere, then a direct sprintf("%f") here is not an issue. Otherwise using "%04d" could result in significant space savings.
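If even the "%04d" path of sprintf() is too costly on a small PIC, the digits can be peeled off with integer division instead. This is a sketch under the assumption that readings lie between 0.00 and 99.99; split_reading is a hypothetical name, not something from the answers here.

```c
/* Hypothetical helper: splits a reading in the range 0.00 to 99.99 into four
   decimal digits, d[0] = tens, d[1] = units, d[2] = tenths,
   d[3] = hundredths. Returns 0 on success, -1 if the value is out of range. */
int split_reading(float x, unsigned char d[4])
{
    if (!(x >= 0.0f && x < 100.0f))      /* also rejects NaN */
        return -1;
    long v = (long)(x * 100.0f + 0.5f);  /* round to the nearest hundredth */
    if (v > 9999)                        /* e.g. 99.999 rounds up to 100.00 */
        return -1;
    for (int i = 3; i >= 0; i--) {
        d[i] = (unsigned char)(v % 10);
        v /= 10;
    }
    return 0;
}
```

Each d[i] is a value from 0 to 9 that can index a seven-segment lookup table directly, with no string conversion in between.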
Floating point numbers are stored in binary format comprised of a sign bit, mantissa, and exponent. A floating point number may not exactly match a given decimal representation (because of the different base-10 for decimal from the base-2 storage of floating point). Conversion of a floating point number to a decimal representation is a problem often assigned in beginning programming courses.
Since you are only interested in two decimal places, and a limited range of values, you could use a fixed point representation of your value. This would reduce the problem from conversion of a floating point to decimal into conversion of integer to decimal.
long
longround( float f )
{
    long x;
    x = (long)((f*100)+.5); //round least significant digit
    return(x);
}
char*
long2char( char ca[], long x )
{
    int pos=0;
    char sign = '+';
    long v = x;
    if( v<0 ) {
        sign = '-';
        v = -v;
    }
    //emit digits least significant first; do-while so that 0 yields "0"
    do
    {
        ca[pos++] = (v%10)+'0';
        v = v/10;
    } while( v>0 );
    ca[pos++] = sign;
    ca[pos] = '\0'; //null-terminate char array
    //reverse string - left as exercise for OP
    return(ca);
}
If you have a problem where the largest value could exceed the range of values supported by long integer on your system, then you would need to modify the above solution.
Given the stated stability of your decimal point: simply sprintf() float into a buffer with appropriate format specifier, then you have your 4 values in a string easily extracted into what ever type you need them to be in...
Example
#include <stdio.h>
#include <stdlib.h>
float num = 12.1234456;
char buf[6];
int main(void)
{
char a[2], b[2], c[2], d[2];
int x, y, z, w;
sprintf(buf, "%0.2f", num);//capture numeric into string
//split string into individual values (null terminate)
a[0] = buf[0]; a[1]=0;
b[0] = buf[1]; b[1]=0;
//skip decimal point
c[0] = buf[3]; c[1]=0;
d[0] = buf[4]; d[1]=0;
//convert back into numeric discretes if necessary
x = atoi(a);
y = atoi(b);
z = atoi(c);
w = atoi(d);
}
There are certainly more elegant ways, but this will work...

How to store 14.5 trillion and some change in a C floating point type variable

I'm creating a change counter program for my C++ class. The numbers we are using are in the 10s of trillions and I was wondering if there was an easy way to store that into a floating point type variable then cast that into an integer type. It isn't an integer literal, it's accepted as an input and I expect possible change.
Don't use floats. Keep it as an integer and use 64-bit longs. Use "long long" or "int64_t" as the type for storing these integers. The latter can be used by #include <stdint.h>
#include <stdio.h>
int main()
{
    long long x = 14500000000000LL; /* 14.5 trillion */
    printf("x == %lld\n", x);
    return 0;
}
Uhm. No :D
You can however use matrices and write functions for the mathematical operations you need to use. If you're doing a lot of arithmetic with very large numbers, have a look at http://gmplib.org/
If you use floating point math to represent your change counter you'll get in serious trouble. Why? You become a victim of accuracy problems that make it impossible to represent values differing in the 1s, 10s and 100s and so on up to (IIRC) 10^6 of the values (assuming you are referring to the 10^12 version of the term 'trillion'). See H. Schmidt's IEEE 754 Converter page and the Wikipedia article about this if you want deeper insight.
So if you need a precision that goes higher than several million (and I assume you do), you'll really get in hot water if you use such a beast as floating point. You really need something like the multiple precision library from GNU in order to be able to calculate the numbers. Of course you are free to implement the same functionality yourself.
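The size of the gap is easy to check. Near 1.45e13 a 32-bit float has a spacing of 2^20 = 1048576 between neighbouring values, which matches the "up to 10^6" estimate, whereas a 64-bit double is exact for integers up to 2^53 (about 9.0e15). The sketch below, with hypothetical helper names and assuming IEEE 754 float and double, tests whether two amounts stay distinguishable after conversion.

```c
/* Hypothetical helpers: return 1 if a and b are still different after being
   converted to the given floating point type, 0 if they collapse to the
   same value. The volatile stores force rounding to the named type. */
int distinct_as_float(long long a, long long b)
{
    volatile float fa = (float)a;
    volatile float fb = (float)b;
    return fa != fb;
}

int distinct_as_double(long long a, long long b)
{
    volatile double da = (double)a;
    volatile double db = (double)b;
    return da != db;
}
```

For 14500000000000 and 14500000000001, distinct_as_float() reports 0 and distinct_as_double() reports 1. An integer type still avoids the question entirely.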
In your case maybe a 64-bit integer could do it. (Note that long long is not always 64 bit and is nonstandard in C89.) Just parse the user input yourself by doing something like this (untested, just to illustrate the idea):
const char input[] = "14.5";
uint64_t result = 0;
uint64_t multiplier = 1000000000000;
unsigned int i = 0;
/* First convert the integer part of the number of your input value.
Could also be done by a library function like strtol or something
like that */
while ((input[i] != '.')
&& (input[i] != '\0'))
{
/* shift the current value by 1 decimal magnitude and add the new 10^0 */
result = (result * 10) + (input[i] - '0');
i++;
}
/* Skip the decimal point */
if (input[i] == '.')
{
i++;
}
/* Add the sub trillions */
while (input[i] != '\0')
{
/* shift the current value by 1 decimal magnitude and add the new 10^0 */
result = (result * 10) + (input[i] - '0');
multiplier /= 10; // as this is just another fraction we have added,
// we reduce the multiplier...
i++;
}
result = result * multiplier;
Of course there are several exceptions that need to be handled separately, like overflows of the result or handling non-numeric characters properly, but as I noted above, the code is only to illustrate the idea.
P.S: In case of signed integers you have to handle the negative sign too of course.

What's the first double that deviates from its corresponding long by delta?

I want to know the first double from 0d upwards that deviates by the long of the "same value" by some delta, say 1e-8. I'm failing here though. I'm trying to do this in C although I usually use managed languages, just in case. Please help.
#include <stdio.h>
#include <limits.h>
#define DELTA 1e-8
int main() {
double d = 0; // checked, the literal is fine
long i;
for (i = 0L; i < LONG_MAX; i++) {
d=i; // gcc does the cast right, i checked
if (d-i > DELTA || d-i < -DELTA) {
printf("%f", d);
break;
}
}
}
I'm guessing that the issue is that d-i casts i to double and therefore d==i and then the difference is always 0. How else can I detect this properly -- I'd prefer fun C casting over comparing strings, which would take forever.
ANSWER: is exactly as we expected. 2^53+1 = 9007199254740993 is the first point of difference according to standard C/UNIX/POSIX tools. Thanks much to pax for his program. And I guess mathematics wins again.
Doubles in IEEE 754 have a precision of 52 bits which means they can store numbers accurately up to (at least) 2^51.
If your longs are 32-bit, they will only have the (positive) range 0 to 2^31 so there is no 32-bit long that cannot be represented exactly as a double. For a 64-bit long, it will be (roughly) 2^52 so I'd be starting around there, not at zero.
You can use the following program to detect where the failures start to occur. An earlier version I had relied on the fact that the last digit in a number that continuously doubles follows the sequence {2,4,8,6}. However, I opted eventually to use a known trusted tool (bc) for checking the whole number, not just the last digit.
Keep in mind that this may be affected by the actions of sprintf() rather than the real accuracy of doubles (I don't think so personally since it had no troubles with certain numbers up to 2^143).
This is the program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
FILE *fin;
double d = 1.0; // 2^n-1 to avoid exact powers of 2.
int i = 1;
char ds[1000];
char tst[1000];
// Loop forever, rely on break to finish.
while (1) {
// Get C version of the double.
sprintf (ds, "%.0f", d);
// Get bc version of the double.
sprintf (tst, "echo '2^%d - 1' | bc >tmpfile", i);
system(tst);
fin = fopen ("tmpfile", "r");
fgets (tst, sizeof (tst), fin);
fclose (fin);
tst[strlen (tst) - 1] = '\0';
// Check them.
if (strcmp (ds, tst) != 0) {
printf( "2^%d - 1 <-- bc failure\n", i);
printf( " got [%s]\n", ds);
printf( " expected [%s]\n", tst);
break;
}
// Output for status then move to next.
printf( "2^%d - 1 = %s\n", i, ds);
d = (d + 1) * 2 - 1; // Again, 2^n - 1.
i++;
}
}
This keeps going until:
2^51 - 1 = 2251799813685247
2^52 - 1 = 4503599627370495
2^53 - 1 = 9007199254740991
2^54 - 1 <-- bc failure
got [18014398509481984]
expected [18014398509481983]
which is about where I expected it to fail.
As an aside, I originally used numbers of the form 2^n but that got me up to:
2^136 = 87112285931760246646623899502532662132736
2^137 = 174224571863520493293247799005065324265472
2^138 = 348449143727040986586495598010130648530944
2^139 = 696898287454081973172991196020261297061888
2^140 = 1393796574908163946345982392040522594123776
2^141 = 2787593149816327892691964784081045188247552
2^142 = 5575186299632655785383929568162090376495104
2^143 <-- bc failure
got [11150372599265311570767859136324180752990210]
expected [11150372599265311570767859136324180752990208]
with the size of a double being 8 bytes (checked with sizeof). It turned out these numbers were of the binary form "1000..." which can be represented for far longer with doubles. That's when I switched to using 2^n - 1 to get a better bit pattern: all one bits.
The first long to be 'wrong' when cast to a double will not be off by 1e-8, it will be off by 1. As long as the double can fit the long in its significand, it will represent it accurately.
I forget exactly how many bits a double has for precision vs offset, but that would tell you the max size it could represent. The first long to be wrong should have the binary form 10000..., so you can find it much quicker by starting at 1 and left-shifting.
Wikipedia says 52 bits in the significand, not counting the implicit starting 1. That should mean the first long to be cast to a different value is 2^53 + 1 (2^53 itself is still exactly representable).
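A round trip through double is enough to detect the loss without comparing strings: convert the integer to double, convert back, and compare in the integer domain so that the comparison itself does not promote to double. The sketch below combines that with the left-shifting search suggested above; the function names are hypothetical and a 64-bit IEEE 754 double is assumed.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if v is unchanged by a round trip through
   double, 0 otherwise. The volatile store forces rounding to double. */
int survives_double(int64_t v)
{
    volatile double d = (double)v;
    return (int64_t)d == v;
}

/* Hypothetical search: tries 2^k + 1 for increasing k and returns the first
   value that does not survive, or 0 if none is found. */
int64_t first_lossy_value(void)
{
    for (int k = 1; k <= 62; k++) {
        int64_t v = ((int64_t)1 << k) + 1;
        if (!survives_double(v))
            return v;
    }
    return 0;
}
```

On such a platform first_lossy_value() returns 9007199254740993, which is 2^53 + 1. The loop needs only a few dozen iterations instead of counting up from zero.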
Although I'm hesitant to mention Fortran 95 and successors in this discussion, I'll mention that Fortran since the 1990 standard has offered a SPACING intrinsic function which tells you what the difference between representable REALs are about a given REAL. You could do a binary search on this, stopping when SPACING(X) > DELTA. For compilers that use the same floating point model as the one you are interested in (likely to be the IEEE754 standard), you should get the same results.
Off hand, I thought that doubles could represent all integers (within their bounds) exactly.
If that is not the case, then you're going to want to cast both i and d to something with MORE precision than either of them. Perhaps a long double will work.
