how to write floating value accurately to a bin file - c

I am trying to dump floating-point values from my program to a binary file. Since I can't use any stdlib functions, I am thinking of writing each value byte by byte into a big char array, which my test application then dumps to a file.
It's like
float a=3132.000001;
I will be dumping this to a char array in 4 bytes.
Code example would be:-
if((a < 1.0) && (a > 1.0) || (a > -1.0 && a < 0.0))
a = a*1000000 // 6 bit fraction part.
Can you please help me write this in a better way?

Assuming you plan to read it back into the same program on the same architecture (no endianness issues), just write the number out directly:
fwrite(&a, sizeof(a), 1, f);
or copy it with memcpy to your intermediate buffer:
memcpy(bufp, &a, sizeof(a));
bufp += sizeof(a);
If you have to deal with endianness issues, you could be sneaky. Copy the float's bits into a 32-bit integer and use htonl:
assert(sizeof(float) == sizeof(uint32_t)); // Just to be sure
uint32_t n;
memcpy(&n, &a, sizeof(n)); // reinterpret the bits without pointer-cast aliasing problems
n = htonl(n);
memcpy(bufp, &n, sizeof(n));
bufp += sizeof(n);
Reading it back in:
assert(sizeof(float) == sizeof(uint32_t)); // Just to be sure
uint32_t n;
memcpy(&n, bufp, sizeof(n));
n = ntohl(n);
memcpy(&a, &n, sizeof(a));
bufp += sizeof(n);
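If htonl is not available (it comes from the sockets API, not from standard C), the byte order can also be fixed with plain shifts. The following is a minimal sketch, assuming float is a 32-bit IEEE 754 value and that floats and integers share the same byte order on the machine; the function names float_to_be_bytes and float_from_be_bytes are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store f as 4 bytes, most significant byte first (hypothetical helper). */
void float_to_be_bytes(float f, unsigned char out[4])
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);    /* copy the bit pattern into an integer */
    out[0] = (unsigned char)(bits >> 24);
    out[1] = (unsigned char)(bits >> 16);
    out[2] = (unsigned char)(bits >> 8);
    out[3] = (unsigned char)bits;
}

/* Rebuild the float from 4 big-endian bytes (hypothetical helper). */
float float_from_be_bytes(const unsigned char in[4])
{
    uint32_t bits = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
                  | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the shifts work on the integer's value rather than on its memory layout, the same bytes come out on little-endian and big-endian hosts.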

Use frexp.
int exponent; // frexp takes an int *
int32_t mantissa;
mantissa = frexp( a, &exponent ) / ( FLT_EPSILON / 2 ); // scale by 2^24 so all 24 significand bits survive
The sign is captured in the mantissa. This should handle denormals correctly, but not infinity or NaN.
Writing exponent and mantissa will necessarily take more than 4 bytes, since the implicit mantissa bit was made explicit. If you want to write the float as raw data, the question is not about floats at all but rather handling raw data and endianness.
On the other end, use ldexp.
If you could use the standard library, printf has a format specifier just for this: %a. But maybe you consider frexp to be standard library too. Not clear.
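The frexp/ldexp round trip could look roughly like the sketch below. It assumes a 24-bit float significand (checked with an assert) and finite inputs; float_split and float_join are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stdint.h>

/* Split a finite float into a signed integer mantissa and a binary exponent. */
void float_split(float a, int32_t *mantissa, int32_t *exponent)
{
    int e;
    float frac;
    assert(FLT_MANT_DIG == 24);             /* assumption: IEEE 754 single precision */
    frac = frexpf(a, &e);                   /* |frac| is in [0.5, 1), or frac is 0 */
    *mantissa = (int32_t)ldexpf(frac, 24);  /* frac * 2^24 is an exact integer */
    *exponent = e;
}

/* Inverse of float_split: mantissa * 2^(exponent - 24). */
float float_join(int32_t mantissa, int32_t exponent)
{
    return ldexpf((float)mantissa, exponent - 24);
}
```

The two integers can then be written with whatever byte order the file format fixes. Infinity and NaN still need separate handling, as noted above.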

If you aren't worried about platform differences between the reader and the writer:
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
...
union float_bytes {
float val;
uint8_t a[sizeof(float)]; // This type can be unsigned char if you don't have stdint.h
};
size_t float_write(FILE * outf, float f) {
union float_bytes fb = { .val = f };
return fwrite(fb.a, 1, sizeof(float), outf); // returns the number of bytes written
}
There are shorter ways to turn a float into a byte array, but they involve more typecasting and are more difficult to read. Other methods of doing this probably do not make faster or smaller compiled code (though the union would make the debug code bigger).
If you are trying to store floats in a platform-independent way, the easiest approach is to store them as strings with enough significant digits to round-trip (9 for an IEEE single, 17 for an IEEE double). More difficult is to choose a floating-point bit layout and convert all of your floats to and from that format as you read and write them. Probably just choose IEEE floating point at a certain width and a certain endianness and stick with that.
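For the string route, a sketch along these lines should round-trip a float, assuming the standard library is available after all and that float is an IEEE 754 single. The helper names float_to_text and float_from_text are invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Format f with 9 significant digits, enough to recover an IEEE single exactly.
   Returns what snprintf returns: the length of the text, or a negative value on error. */
int float_to_text(float f, char *buf, size_t bufsz)
{
    assert(buf != NULL && bufsz > 0);
    return snprintf(buf, bufsz, "%.9g", f);
}

/* Parse the text back into a float. */
float float_from_text(const char *s)
{
    return strtof(s, NULL);
}
```

The text form costs more bytes than the 4-byte binary form, but it does not depend on the byte order or the float layout of the machine that reads it.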

Related

C - erroneous output after multiplication of large numbers

I'm implementing my own decrease-and-conquer method for computing a^n.
Here's the program:
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <time.h>
double dncpow(int a, int n)
{
double p = 1.0;
if(n != 0)
{
p = dncpow(a, n / 2);
p = p * p;
if(n % 2)
{
p = p * (double)a;
}
}
return p;
}
int main()
{
int a;
int n;
int a_upper = 10;
int n_upper = 50;
int times = 5;
time_t t;
srand(time(&t));
for(int i = 0; i < times; ++i)
{
a = rand() % a_upper;
n = rand() % n_upper;
printf("a = %d, n = %d\n", a, n);
printf("pow = %.0f\ndnc = %.0f\n\n", pow(a, n), dncpow(a, n));
}
return 0;
}
My code works for small values of a and n, but a mismatch in the output of pow() and dncpow() is observed for inputs such as:
a = 7, n = 39
pow = 909543680129861204865300750663680
dnc = 909543680129861348980488826519552
I'm pretty sure that the algorithm is correct, but dncpow() is giving me wrong answers.
Can someone please help me rectify this? Thanks in advance!
Simple as that, these numbers are too large for what your computer can represent exactly in a single variable. With a floating point type, there's an exponent stored separately and therefore it's still possible to represent a number near the real number, dropping the lowest bits of the mantissa.
Regarding this comment:
I'm getting similar outputs upon replacing 'double' with 'long long'. The latter is supposed to be stored exactly, isn't it?
If you call a function taking double, it won't magically operate on long long instead. Your value is simply converted to double and you'll just get the same result.
Even with a function handling long long (which has 64 bits on today's typical platforms), you can't deal with such large numbers: 64 bits aren't enough to store them. With an unsigned integer type, the result just wraps around (modulo 2^64) on overflow. With a signed integer type, the behavior on overflow is undefined (but still somewhat likely to be a wrap-around). So you'll get some number that has absolutely nothing to do with your expected result. That's arguably worse than the result with a floating point type, which is merely imprecise.
For exact calculations on large numbers, the only way is to store them in an array (typically of unsigned integers like uintmax_t) and implement all the arithmetic yourself. That's a nice exercise, and a lot of work, especially when performance is of interest (the "naive" arithmetic algorithms are typically very inefficient).
In a real-life program, you wouldn't reinvent the wheel here, as there are libraries for handling large numbers. Arguably the best known is libgmp. Read the manual and use it.
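To give an idea of the array approach, here is a small sketch that computes a^n exactly by storing the number as base-10^9 "limbs" in an array. It is an illustration only: it uses n plain multiplications rather than the halving scheme from the question, the capacity is fixed at 72 decimal digits, and the names LIMBS, BASE, big_mul_small and exact_pow are made up for this example.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LIMBS 8                  /* 8 limbs of 9 digits: up to 72 decimal digits */
#define BASE  UINT32_C(1000000000)

/* Multiply the little-endian base-10^9 number x by m in place.
   Returns the carry out of the top limb (non-zero means overflow). */
static uint32_t big_mul_small(uint32_t x[LIMBS], uint32_t m)
{
    uint64_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t cur = (uint64_t)x[i] * m + carry;
        x[i] = (uint32_t)(cur % BASE);
        carry = cur / BASE;
    }
    return (uint32_t)carry;
}

/* Write a^n as a decimal string into out. Returns 0 on success, -1 on overflow. */
int exact_pow(uint32_t a, unsigned n, char *out, size_t outsz)
{
    uint32_t x[LIMBS] = { 1 };   /* start from 1; remaining limbs are zero */
    int top, len;
    assert(out != NULL && outsz >= 9 * LIMBS + 1);
    for (unsigned i = 0; i < n; i++)
        if (big_mul_small(x, a) != 0)
            return -1;
    top = LIMBS - 1;
    while (top > 0 && x[top] == 0)
        top--;                   /* skip leading zero limbs */
    len = snprintf(out, outsz, "%" PRIu32, x[top]);
    for (int i = top - 1; i >= 0; i--)   /* lower limbs are zero-padded to 9 digits */
        len += snprintf(out + len, outsz - (size_t)len, "%09" PRIu32, x[i]);
    return 0;
}
```

Called as exact_pow(7, 39, buf, sizeof buf), this yields 909543680129861140820205019889143. The pow() output quoted in the question is the double nearest to that value, and the dncpow() output is one representable step (2^57) above it.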

C: convert a real number to 64 bit floating point binary

I'm trying to write a code that converts a real number to a 64 bit floating point binary. In order to do this, the user inputs a real number (for example, 547.4242) and the program must output a 64 bit floating point binary.
My ideas:
The sign part is easy.
The program converts the integer part (547 for the previous example) and stores the result in an int variable. Then, the program converts the fractional part (.4242 for the previous example) and stores the result into an array (each position of the array stores '1' or '0').
This is where I'm stuck. Summarizing, I have: "Integer part = 1000100011" (type int) and "Fractional part = 0110110010011000010111110000011011110110100101000100" (array).
How can I proceed?
The following code prints the internal representation of a floating-point number according to IEEE 754. It was originally written for the Turbo C++ IDE; this version uses only standard C. Note that it prints a 32-bit float, and that it assumes a little-endian machine.
#include <stdio.h>
void decimal_to_binary(unsigned char);
union u
{
    float f;
    unsigned char c[sizeof(float)];
};
int main(void)
{
    int i;
    union u a;
    printf("ENTER THE FLOATING POINT NUMBER : \n");
    if (scanf("%f", &a.f) != 1)
        return 1;
    /* Walk from the highest address down: on a little-endian
       machine this prints the sign and exponent bits first. */
    for (i = (int)sizeof(float) - 1; i >= 0; i--)
    {
        decimal_to_binary(a.c[i]);
    }
    printf("\n");
    return 0;
}
/* Print one byte as 8 binary digits, most significant bit first. */
void decimal_to_binary(unsigned char n)
{
    int arr[8];
    int i;
    for (i = 7; i >= 0; i--)
    {
        arr[i] = n % 2;
        n /= 2;
    }
    for (i = 0; i < 8; i++)
        printf("%d", arr[i]);
    printf(" ");
}
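Since the question asks for 64 bits, the same idea can be applied to a double. The sketch below assumes double is an IEEE 754 binary64 value and that doubles and 64-bit integers share the same byte order; double_to_bits is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill out with 64 characters: 1 sign bit, 11 exponent bits, 52 fraction bits. */
void double_to_bits(double d, char out[65])
{
    uint64_t bits;
    assert(sizeof bits == sizeof d);   /* assumption: double is 64 bits wide */
    memcpy(&bits, &d, sizeof bits);    /* copy the bit pattern into an integer */
    for (int i = 0; i < 64; i++)
        out[i] = ((bits >> (63 - i)) & 1u) ? '1' : '0';
    out[64] = '\0';
}
```

Working on the integer value instead of on individual bytes keeps the output order the same regardless of the machine's endianness. This lets the compiler and library do the decimal-to-binary conversion; it does not replace doing that conversion by hand if that is the actual exercise.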
In order to correctly round all possible decimal representations to the nearest double, you need big integers. Using only the basic integer types from C will leave you to re-implement big integer arithmetic. Each of these two approaches is possible; more information about each follows:
For the first approach, you need a big integer library: GMP is a good one. Armed with such a big integer library, you tackle an input such as the example 123.456E78 as the integer 123456 * 10^75 and start wondering what values M in [2^53 … 2^54) and P in [-1022 … 1023] make (M / 2^53) * 2^P closest to this number. This question can be answered with big integer operations, following the steps described in this blog post (summary: first determine P. Then use a division to compute M). A complete implementation must take care of subnormal numbers and infinities (inf is the correct result to return for any decimal representation of a number that would have an exponent larger than +1023).
The second approach, if you do not want to include or implement a full general-purpose big integer library, still requires a few basic operations to be implemented on arrays of C integers representing large numbers. The function decfloat() in this implementation represents large numbers in base 10^9 because that simplifies the conversion from the initial decimal representation to the internal representation as an array x of uint32_t.
Following is a basic conversion. Enough to get OP started.
OP's "integer part of real number" --> int is far too limiting. Better to simply convert the entire string to a large integer like uintmax_t. Note the decimal point '.' and account for overflow while scanning.
This code does not handle exponents or negative numbers. It may be off in the last bit or so due to the limited integer ui or the final num = ui * pow(10, expo). It handles most overflow cases.
#include <ctype.h>
#include <inttypes.h>
#include <math.h>
#include <stddef.h>
double my_atof(const char *src) {
    uintmax_t ui = 0;
    int dp = '.';
    size_t dpi = 0;
    size_t i = 0;
    size_t toobig = 0;
    int ch;
    for (i = 0; (ch = (unsigned char) src[i]) != '\0'; i++) {
        if (ch == dp) {
            dp = '\0'; // only get 1 dp
            dpi = i;
            continue;
        }
        if (!isdigit(ch)) {
            break; // illegal character
        }
        ch -= '0';
        // detect overflow; digits that no longer fit only bump the exponent
        if (toobig ||
            (ui >= UINTMAX_MAX / 10 &&
            (ui > UINTMAX_MAX / 10 || (uintmax_t) ch > UINTMAX_MAX % 10))) {
            toobig++;
            continue;
        }
        ui = ui * 10 + ch;
    }
    intmax_t expo = (intmax_t) toobig;
    if (dp == '\0') {
        expo -= (intmax_t) (i - dpi - 1); // digits seen after the decimal point
    }
    double num;
    if (expo < 0) {
        // slightly more precise than: num = ui * pow(10, expo);
        num = ui / pow(10, -expo);
    } else {
        num = ui * pow(10, expo);
    }
    return num;
}
The trick is to treat the value as an integer: read your 547.4242 as an unsigned long long (i.e. 64 bits or more), giving 5474242, while counting the number of digits after the '.', in this case 4. Now you have a value which is 10^4 times bigger than it should be. So you convert the 5474242 to floating point (as a double, or long double) and divide by 10^4.
Decimal to binary conversion is deceptively simple. When you have more bits than the float will hold, then it will have to round. More fun occurs when you have more digits than a 64-bit integer will hold -- noting that trailing zeros are special -- and you have to decide whether to round or not (and what rounding occurs when you convert to floating point). Then there's dealing with an E+/-99. Then when you do the eventual division (or multiplication) by 10^n, you have (a) another potential rounding, and (b) the issue that large 10^n are not exactly represented in your floating point -- which is another source of error. (And for E+/-99 forms, you may need up to and a little beyond 10^300 for the final step.)
Enjoy !

Decimal to Binary conversion - I need Help [duplicate]

I need to convert a 20-digit decimal number to binary in C. How do I go about it? Really, I am finding it hard. What buffer should I create? It will be very large; even the calculator can't convert 20 digits to binary.
I need suggestions, links and possibly sample codes.
Thanks.
Do you need to convert a decimal string to a binary string or to a value?
Rule of thumb: 10^3 ~= 2^10, therefore 10^20 ~= 2^67, which is more than 64 bits.
==> A 64-bit integer will not be enough. You can use a structure with two 64-bit integers (long long in C), or even an 8-bit byte for the upper part and 64 bits for the lower part.
Make sure the lower part is unsigned.
You will need to write code that checks for overflow on lower part and increases upper part when this happens. You will also need to use the long division algorithm once you cross the 64bit line.
What about using a library for extended-precision arithmetic? Take a look at http://gmplib.org/
I don't know if you are trying to convert a string of numerical characters into a really big int, or a really big int into a string of 1s and 0s... but in general, you'll be doing something like this:
for (i = 0; i < digits; i++)
{
bit[i] = (big_number>>i) & 1;
// or, for the other way around
// big_number |= (bit[i] << i);
}
The main problem is that there is no built-in type that can store "big_number". So you'll probably be doing it more like this...
uint8_t big_number[10]; // the big number is stored in 10 bytes.
// (uint8_t is just "unsigned char")
for (block = 0; block < 10; block++)
{
for (i = 0; i < 8; i++)
{
bit[block*8 + i] = (big_number[block]>>i) & 1;
}
}
[edit]
To read a string of numerical characters into an int (without using scanf, atoi, etc.), I would do something like this:
// Supposing I have something like char token[]="23563344324467434533";
size_t n = strlen(token); // number of digits.
big_number = 0;
for (size_t i = 0; i < n; i++)
{
    // Horner's rule: shift the digits read so far one decimal place and add the next one.
    big_number = big_number * 10 + (token[i] - '0');
}
That will work for reading the number, but again, the problem with this is that there is no built-in type to store big_number. You could use a float or a double, and that would get the magnitude of the number correct, but the last few decimal places would be rounded off. If you need perfect precision, you will have to use an arbitrary-precision integer. A google search turns up a few possible libraries to use for that; but I don't have much personal experience with those libraries, so I won't make a recommendation.
Really though, the data type you use depends on what you want to do with the data afterwards. Maybe you need an arbitrary-precision integer, or maybe a double would be exactly what you need; or maybe you can write your own basic data type using the technique I outlined with the blocks of uint8_t, or maybe you're better off just leaving it as a string!
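If the goal is just a decimal string in and a binary string out, one way to avoid a big-integer type altogether is to halve the decimal string repeatedly and collect the remainders. A minimal sketch of that idea follows; the function names and the 63-digit input limit are choices made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Divide the n-digit decimal string by 2 in place; return the remainder (0 or 1). */
static int halve_decimal(char *digits, size_t n)
{
    int rem = 0;
    for (size_t i = 0; i < n; i++) {
        int cur = rem * 10 + (digits[i] - '0');
        digits[i] = (char)('0' + cur / 2);
        rem = cur % 2;
    }
    return rem;
}

/* Convert a decimal digit string to a binary digit string.
   Returns the number of bits written, or 0 on bad input or too small a buffer. */
size_t decimal_to_binary_string(const char *dec, char *out, size_t outsz)
{
    char work[64];
    size_t n = strlen(dec), nbits = 0;
    if (n == 0 || n >= sizeof work)
        return 0;
    for (size_t i = 0; i < n; i++)
        if (dec[i] < '0' || dec[i] > '9')
            return 0;
    memcpy(work, dec, n + 1);
    do {                                    /* each remainder is the next bit, lowest first */
        if (nbits + 1 >= outsz)
            return 0;
        out[nbits++] = (char)('0' + halve_decimal(work, n));
    } while (strspn(work, "0") != n);       /* stop once the quotient is all zeros */
    for (size_t i = 0, j = nbits - 1; i < j; i++, j--) {
        char t = out[i];                    /* reverse: most significant bit first */
        out[i] = out[j];
        out[j] = t;
    }
    out[nbits] = '\0';
    return nbits;
}
```

A 20-digit input needs at most 67 output bits, so a 128-byte output buffer is more than enough for the case in the question.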

how do we print a number that's greater than 2^32-1 with int and float? (is it even possible?)

how do we print a number that's greater than 2^32-1 with int and float? (is it even possible?)
How does your variable contain a number that is greater than 2^32 - 1? Short answer: It'll probably be a specific data-structure and assorted functions (oh, a class?) that deal with this.
Given this data structure, how do we print it? With BigInteger_Print(BigInteger*) of course :)
Really though, there is no correct answer to this, as printing a number larger than 2^32-1 depends entirely upon how you're storing that number.
More theoretically: suppose you have a very very very large number stored somewhere somehow; if so, I suppose that you are somehow able to do math on that number, otherwise it would be quite pointless storing it.
If you can do math on it, just divide the bignumber by ten (10); store the remainder somewhere. Repeat until the result is smaller than 10. When it's smaller than ten, print it, then print the remainders, from the last to the first. Finish.
You can speed things up by dividing by the largest power of 10 that you are able to print without effort (on 32-bit, 1,000,000,000).
Edit: pseudo code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <math_with_very_very_big_num.h> /* hypothetical bignum library */
int main(int argc, char **argv) {
    very_very_big_num bignum = someveryverybignum;
    very_very_big_num quot;
    int size = (int) floor(vvbn_log10(bignum)) + 1; /* number of decimal digits */
    char *result = calloc(size, sizeof(char));
    int i = 0;
    /* Peel off the least significant digit while more than one digit remains. */
    while (vvbn_greater(bignum, 9)) {
        quot = vvbn_divide(bignum, 10);
        result[i++] = (char) vvbn_remainder(bignum, 10) + '0';
        bignum = quot;
    }
    result[i] = (char) vvbn_to_i(bignum) + '0'; /* most significant digit */
    while (i >= 0)
        printf("%c", result[i--]);
    printf("\n");
    free(result);
    return 0;
}
(I wrote this using long, then translated it to the veryverybignum stuff; it worked with long, but unfortunately I cannot try this version, so please forgive me if I made translation errors...)
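The same divide-by-ten loop can be tried out on a type that does exist. Here is a sketch using uint64_t as a stand-in for the big number; u64_to_decimal is a made-up name, and the output buffer is assumed to hold at least 21 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write the decimal digits of v into out (at least 21 bytes); returns the length. */
size_t u64_to_decimal(uint64_t v, char out[21])
{
    char tmp[20];                          /* 2^64 - 1 has 20 decimal digits */
    size_t n = 0;
    assert(out != NULL);
    do {
        tmp[n++] = (char)('0' + v % 10);   /* remainders come out lowest digit first */
        v /= 10;
    } while (v != 0);
    for (size_t i = 0; i < n; i++)         /* emit them from the last to the first */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}
```

For a real big-number type, only the `% 10` and `/= 10` steps need to be replaced by the corresponding big-number operations.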
If you are talking about int64 types, you can try %I64u, %I64d, %I64x (Microsoft's C runtime) or %llu, %lld (standard C99).
On common hardware, the largest float is (2^128 - 2^104), so if it's smaller than that, you just use %f (or %g or %a) with printf( ).
For int64 types, JustJeff's answer is spot on.
The range of double (%f) extends to nearly 2^1024, which is really quite huge; on Intel hardware, when the long double (%Lf) type corresponds to 80-bit float, the range of that type goes up to 2^16384.
If you need larger numbers than that, you need to use a library (which will likely have its own print routines) or roll your own representation and provide your own printing support.

What's the first double that deviates from its corresponding long by delta?

I want to know the first double from 0d upwards that deviates from the long of the "same value" by some delta, say 1e-8. I'm failing here though. I'm trying to do this in C although I usually use managed languages, just in case. Please help.
#include <stdio.h>
#include <limits.h>
#define DELTA 1e-8
int main() {
double d = 0; // checked, the literal is fine
long i;
for (i = 0L; i < LONG_MAX; i++) {
d=i; // gcc does the cast right, i checked
if (d-i > DELTA || d-i < -DELTA) {
printf("%f", d);
break;
}
}
}
I'm guessing that the issue is that d-i casts i to double and therefore d==i and then the difference is always 0. How else can I detect this properly -- I'd prefer fun C casting over comparing strings, which would take forever.
ANSWER: is exactly as we expected. 2^53+1 = 9007199254740993 is the first point of difference according to standard C/UNIX/POSIX tools. Thanks much to pax for his program. And I guess mathematics wins again.
Doubles in IEEE 754 have a precision of 52 bits, which means they can store numbers accurately up to (at least) 2^51.
If your longs are 32-bit, they will only have the (positive) range 0 to 2^31, so there is no 32-bit long that cannot be represented exactly as a double. For a 64-bit long, it will be (roughly) 2^52, so I'd be starting around there, not at zero.
You can use the following program to detect where the failures start to occur. An earlier version I had relied on the fact that the last digit in a number that continuously doubles follows the sequence {2,4,8,6}. However, I opted eventually to use a known trusted tool (bc) for checking the whole number, not just the last digit.
Keep in mind that this may be affected by the actions of sprintf() rather than the real accuracy of doubles (I don't think so personally since it had no troubles with certain numbers up to 2^143).
This is the program:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
FILE *fin;
double d = 1.0; // 2^n-1 to avoid exact powers of 2.
int i = 1;
char ds[1000];
char tst[1000];
// Loop forever, rely on break to finish.
while (1) {
// Get C version of the double.
sprintf (ds, "%.0f", d);
// Get bc version of the double.
sprintf (tst, "echo '2^%d - 1' | bc >tmpfile", i);
system(tst);
fin = fopen ("tmpfile", "r");
fgets (tst, sizeof (tst), fin);
fclose (fin);
tst[strlen (tst) - 1] = '\0';
// Check them.
if (strcmp (ds, tst) != 0) {
printf( "2^%d - 1 <-- bc failure\n", i);
printf( " got [%s]\n", ds);
printf( " expected [%s]\n", tst);
break;
}
// Output for status then move to next.
printf( "2^%d - 1 = %s\n", i, ds);
d = (d + 1) * 2 - 1; // Again, 2^n - 1.
i++;
}
}
This keeps going until:
2^51 - 1 = 2251799813685247
2^52 - 1 = 4503599627370495
2^53 - 1 = 9007199254740991
2^54 - 1 <-- bc failure
got [18014398509481984]
expected [18014398509481983]
which is about where I expected it to fail.
As an aside, I originally used numbers of the form 2^n but that got me up to:
2^136 = 87112285931760246646623899502532662132736
2^137 = 174224571863520493293247799005065324265472
2^138 = 348449143727040986586495598010130648530944
2^139 = 696898287454081973172991196020261297061888
2^140 = 1393796574908163946345982392040522594123776
2^141 = 2787593149816327892691964784081045188247552
2^142 = 5575186299632655785383929568162090376495104
2^143 <-- bc failure
got [11150372599265311570767859136324180752990210]
expected [11150372599265311570767859136324180752990208]
with the size of a double being 8 bytes (checked with sizeof). It turned out these numbers were of the binary form "1000..." which can be represented for far longer with doubles. That's when I switched to using 2^n - 1 to get a better bit pattern: all one bits.
The first long to be 'wrong' when cast to a double will not be off by 1e-8, it will be off by 1. As long as the double can fit the long in its significand, it will represent it accurately.
I forget exactly how many bits a double has for precision vs offset, but that would tell you the max size it could represent. The first long to be wrong should have the binary form 10000..., so you can find it much quicker by starting at 1 and left-shifting.
Wikipedia says 52 bits in the significand, not counting the implicit starting 1. That should mean the first long to be cast to a different value is 2^53 + 1 (2^53 itself is a power of two and is still exact).
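This can be checked without looping over every long by comparing in the integer domain, which avoids the problem from the question where d-i silently converts i to double. The sketch below tests only values of the form 2^p + 1, on the reasoning that this is the smallest integer needing p + 1 significant bits. It assumes a binary floating-point format and a 64-bit integer type; first_inexact_integer is a made-up name.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>

/* Returns the smallest positive integer that changes when converted to double
   and back, or 0 if none is found below 2^62. */
int64_t first_inexact_integer(void)
{
    assert(FLT_RADIX == 2);                  /* the bit-counting argument assumes binary */
    for (int p = 1; p < 62; p++) {
        int64_t i = (INT64_C(1) << p) + 1;   /* smallest integer needing p + 1 bits */
        volatile double d = (double)i;       /* volatile: force rounding to double */
        if ((int64_t)d != i)                 /* compare as integers, not as doubles */
            return i;
    }
    return 0;
}
```

With IEEE 754 doubles this should return 9007199254740993, which is 2^53 + 1, matching the result quoted in the question.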
Although I'm hesitant to mention Fortran 95 and its successors in this discussion, I'll mention that Fortran since the 1990 standard has offered a SPACING intrinsic function, which tells you the difference between adjacent representable REALs near a given REAL. You could do a binary search on this, stopping when SPACING(X) > DELTA. For compilers that use the same floating point model as the one you are interested in (likely to be the IEEE 754 standard), you should get the same results.
Off hand, I thought that doubles could represent all integers (within their bounds) exactly.
If that is not the case, then you're going to want to cast both i and d to something with MORE precision than either of them. Perhaps a long double will work.

Resources