I recently wrote a block of code that takes as an input an 8 digit hexadecimal number from the user, transforms it into an integer and then converts it into a float. To go from integer to float I use the following:
int myInt;
float myFloat;
myFloat = *(float *)&myInt;
printf("%g", myFloat);
It works perfectly for small numbers. But when the user inputs hexadecimal numbers such as:
0x0000ffff
0x7eeeeeef
I get that myInt = -2147483648 and that myFloat = -0. I know that the number I get for myInt is the smallest possible number that can be stored in an int variable in C.
Because of this problem, the input range of my program is extremely limited. Does anyone have any advice on how I could expand the range of my code so that it could handle a number as big as:
0xffffffff
Thank you so much for any help you may give me!
The correct way to get the value transferred as accurately as float will allow is:
float myFloat = myInt;
If you want better accuracy, use double instead of float.
What you're doing is trying to reinterpret the bit pattern of the int as if it were a float, which is usually not a good idea. (There are hexadecimal floating-point constants and conversions available in C99 and later.) However, if reinterpretation really is what you're after, your code in the question does it correctly; your problem appears to be in converting the hex string to an integer.
If you get -2147483648 from 0x0000FFFF (or from 0x7EEEEEEF), there is a bug in your conversion code. Fix that before doing anything else. How are you doing the hex-to-integer conversion? strtol() is probably a good way (sscanf() and friends are also possible), but be careful about overflow.
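For reference, here is a sketch of how the hex-to-integer step might be done with strtoul(), assuming the input arrives as a line of text; the variable names are illustrative, not from the question:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char line[64];
    if (fgets(line, sizeof line, stdin) == NULL)
        return 1;

    errno = 0;
    char *end;
    unsigned long value = strtoul(line, &end, 16);   /* base 16 also accepts a leading "0x" */
    if (end == line || errno == ERANGE) {
        fprintf(stderr, "bad or out-of-range hexadecimal input\n");
        return 1;
    }

    float  f = value;    /* value conversion, not bit reinterpretation */
    double d = value;    /* double keeps more of the precision */
    printf("%g %g\n", f, d);
    return 0;
}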
Does anyone have any advice on how I could expand the range of my code so that it could
handle a number as big as 0xffffffff
You can't store 0xffffffff in a 32-bit int: the largest positive value a 32-bit int can hold is 0x7FFFFFFF, i.e. 2^31 - 1 or 2147483647, and the most negative value is -2^31 or -2147483648.
These limits follow from the number of bits available and the 2's complement representation.
Use an unsigned int if you want 0xffffffff.
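As a small illustration, assuming a 32-bit unsigned int and an IEEE-754 float (not from the question):
#include <stdio.h>

int main(void)
{
    unsigned int big = 0xffffffffu;   /* 4294967295 fits in a 32-bit unsigned int */
    float  f = big;                   /* rounds: float has only a 24-bit significand */
    double d = big;                   /* exact: double has a 53-bit significand */
    printf("%u -> %f (float), %f (double)\n", big, f, d);
    return 0;
}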
Related
I'm trying to implement signed unsigned multiplication in C using fixed point arithmetic, but I get a incorrect result. I can't imagine how to solve this problem. I think there is some problem in the bit extension.
Here is the piece of code:
int16_t audio_sample=0x1FF; //format signed Q1.8 -> Value represented=-0.00390625
uint8_t gain=0xA; //format unsigned Q5.2 -> Value represented = 2.5
int16_t result = (int16_t)((int32_t)audio_sample * (int32_t)gain);
printf("%x",result);
The result from printf is 0x13F6, which is of course just 0x1FF * 0xA, but fixed-point arithmetic says the correct result should be 0x3FF6, considering the proper sign extension. 0x3FF6 in Q6.10 format represents -0.009765625 = -0.00390625 * 2.5.
Please help me find my mistake.
Thanks in advance.
You should use unsigned types here. The representation is in your head (or the comments), not in the data types in the code.
2's complement means the 1 on the left is theoretically continued forever. e.g. 0x1FF in Q1.8 is the same as 0xFFFF in Q8.8 (-1 / 256).
If you have a 16-bit integer, you cannot have Q1.8; it will always be Q8.8, because the machine will not ignore the other bits. So 0x1FF in Q1.8 should be 0xFFFF in Q8.8. The 0xA in Q5.2 does not change when it is widened to Q6.2.
0xFFFF * 0xA = 0x9FFF6; cut away the overflow (hence the unsigned types) and you have 0xFFF6 in Q6.10, which is -10 / 1024, your expected result.
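A compilable sketch of that unsigned approach (the names mirror the question; the last two lines merely reinterpret the 16-bit pattern to show the real value):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t audio_sample = 0xFFFF;   /* -1/256 as Q8.8: 0x1FF sign-extended to 16 bits */
    uint16_t gain         = 0x000A;   /* 2.5 as Q5.2: value unchanged when widened      */

    uint32_t wide   = (uint32_t)audio_sample * gain;   /* 0x9FFF6 */
    uint16_t result = (uint16_t)wide;                  /* keep the low 16 bits: 0xFFF6 (Q6.10) */

    /* Reinterpret the 16-bit pattern as signed Q6.10 to recover the real value. */
    int32_t signed_result = (result >= 0x8000) ? (int32_t)result - 0x10000 : (int32_t)result;
    printf("0x%04X -> %g\n", (unsigned)result, signed_result / 1024.0);   /* 0xFFF6 -> -0.00976562... */
    return 0;
}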
It is best to think of fixed-point as a matter of scaling, and to express your calculation simply and clearly in terms of numbers rather than bits.
A Q1.8 or Q5.2 number (in ARM Q notation) is a real number scaled by a factor of 2^8 or 2^2 respectively.
But C doesn't have 9 or 7-bit number types. Your int16_t and uint8_t variables have enough range to store such numbers. But for arithmetic operations, it is unwise to use unsigned integers, or to mix signed and unsigned types. int has enough range and avoids some efficiency pitfalls.
int audio_sample = -0.00390625*256; // Q1.8
int gain = 2.5*4; // Q5.2
The product of numbers scaled by 2^8 and 2^2 has a scale of 2^10.
int result = audio_sample * gain; // Q6.10
To convert back to the real value, divide by the scale factor.
printf("%lg * %lg = %lg\n",
(double)audio_sample/256,
(double)gain/4,
(double)result/1024);
Please help me find my mistake.
The mistake was in assigning 0x1FF to audio_sample, instead of -1. 0x1FF is the unsigned truncation of the 9-bit two's-complement value -1. But audio_sample is wider and would require more leading 1 bits. It would have been clearer and safer to express your intent by assigning -0.00390625*256 to audio_sample.
fixed-point arithmetic says the correct result should be 0x3FF6, considering the proper sign extension
0x3FF6 is the unsigned 14-bit truncation of the correct two's complement answer. But the result requires 16 bits, so you're probably looking for the value 0xFFF6.
printf("unsigned Q6.10: 0x%x\n", (unsigned)result & 0xFFFF);
Is a conversion from an int to a float always possible in C without the float becoming one of the special values like +Inf or -Inf?
AFAIK there is no upper limit on the range of int.
I think a 128-bit int would cause an issue for a platform with an IEEE 754 float, as that has an upper value of around the 127th power of 2.
Short answer to your question: no, it is not always possible.
But it is worthwhile to go a little bit more into details. The following paragraph shows what the standard says about integer to floating-point conversions (online C11 standard draft):
6.3.1.4 Real floating and integer
2) When a value of integer type is converted to a real floating type,
if the value being converted can be represented exactly in the new
type, it is unchanged. If the value being converted is in the range of
values that can be represented but cannot be represented exactly, the
result is either the nearest higher or nearest lower representable
value, chosen in an implementation-defined manner. If the value being
converted is outside the range of values that can be represented, the
behavior is undefined. ...
So many integer values can be converted exactly. Other integer values may lose precision, yet a conversion is still possible. For some values, however, the behaviour is undefined (if, for example, an integer value cannot be represented even with the maximum exponent of the float type), although I cannot think of a case where this would actually happen.
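To illustrate the "nearest representable value" clause, assuming the common IEEE-754 binary32 float with a 24-bit significand:
#include <stdio.h>

int main(void)
{
    int exact   = 16777216;   /* 2^24: representable exactly in a binary32 float */
    int inexact = 16777217;   /* 2^24 + 1: falls between two representable floats */

    float f1 = exact;
    float f2 = inexact;       /* rounded to a neighbouring value (here 16777216) */

    printf("%d -> %.1f\n", exact, f1);
    printf("%d -> %.1f\n", inexact, f2);
    return 0;
}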
Is it always possible to convert an int to a float?
Reasonably - yes. An int will always convert to a finite float. The conversion may lose some precision for large int values.
Yet for the pedantic, an odd compiler could have trouble.
C allows int to be excessively wide, not just 16, 32 or 64 bits, and float may have a limited range, as small as 1e37.
It is not the upper end of int, INT_MAX, that should be of concern. It is the lower end: INT_MIN, which often has a magnitude one greater than INT_MAX.
A 124-bit int would have a minimum value of about -1.06e37, so that does exceed the minimal float range.
With the common binary32 float, an int would need to be more than 128 bits to cause a float infinity.
So what test is needed to detect this rare situation?
Form an exact power-of-2 limit and perform careful math to avoid overflow or imprecision.
#if -INT_MAX == INT_MIN
// rare non 2's complement machine
#define INT_MAX_P1_HALF (INT_MAX/2 + 1)
_Static_assert(FLT_MAX/2 >= INT_MAX_P1_HALF, "non-2's comp.`int` range exceeds `float`");
#else
_Static_assert(-FLT_MAX <= INT_MIN, "2's complement `int` range exceeds `float`");
#endif
The standard only requires floating point representations to include a finite number as large as 10^37 (§5.2.4.2.2/12) and does not put any limit on the maximum size of an integer. So if your implementation has 128-bit integers (or even 124-bit integers), it is possible for an integer-to-float conversion to exceed the range of finite representable floating point numbers.
No, it is not always possible to convert an int to a float, due to how floats work. 32-bit floats greater than 16777216 (or less than -16777216) need to be even, those greater than 33554432 (or less than -33554432) need to be evenly divisible by 4, those greater than 67108864 (or less than -67108864) need to be evenly divisible by 8, etc. The IEEE-754 float standard defines round to nearest even as the default mode, but other modes exist depending upon the implementation.
Also, the largest 128-bit int, 2^128 - 1, is greater than the largest 32-bit float, which is 2^127 x 1.11111111111111111111111 (binary) = 2^127 x (2 - 2^-23) = 2^128 - 2^104.
Hi I have two questions:
uint64_t vs double, which has a higher range limit for covering positive numbers?
How to convert double into uint64_t if only the whole number part of double is needed.
Direct casting apparently doesn't work due to how double is defined.
Sorry for any confusion, I'm talking about the 64bit double in C on a 32bit machine.
As for an example:
//operation for conversion I used:
uint64_t sampleRate = (
    (union { double i; uint64_t sampleRate; })
    { .i = r23u.outputSampleRate }
).sampleRate;
//the following are printouts on command line
// double uint64_t
//printed by %.16llx %.16llx
outputSampleRate 0x41886a0000000000 0x41886a0000000000 sampleRate
//printed by %f %llu
outputSampleRate 51200000.000000 4722140757530509312 sampleRate
So the two objects hold the same bit pattern, but when printed as a decimal the uint64_t value is totally wrong.
Thank you.
uint64_t vs double, which has a higher range limit for covering positive numbers?
uint64_t, where supported, has 64 value bits, no padding bits, and no sign bit. It can represent all integers between 0 and 2^64 - 1, inclusive.
Substantially all modern C implementations represent double in IEEE-754 64-bit binary format, but C does not require nor even endorse that format. It is so common, however, that it is fairly safe to assume that format, and maybe to just put in some compile-time checks against the macros defining FP characteristics. I will assume for the balance of this answer that the C implementation indeed does use that representation.
IEEE-754 binary double precision provides 53 bits of mantissa, therefore it can represent all integers between 0 and 2^53 - 1. It is a floating-point format, however, with an 11-bit binary exponent. The largest finite value it can represent is (2^53 - 1) x 2^971, a little less than 2^1024. In this sense, double has a much greater range than uint64_t, but the vast majority of integers between 0 and its maximum value cannot be represented exactly as doubles, including almost all of the numbers that can be represented exactly by uint64_t.
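As an illustration of that trade-off (assuming IEEE-754 binary64, as above):
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Range: a double can hold values far beyond UINT64_MAX ... */
    double huge = 1e300;
    printf("huge double: %g\n", huge);

    /* ... but it cannot represent most large 64-bit integers exactly. */
    uint64_t u = UINT64_MAX;   /* 18446744073709551615 */
    double   d = (double)u;    /* rounds to 2^64 = 18446744073709551616 */
    printf("UINT64_MAX         = %" PRIu64 "\n", u);
    printf("(double)UINT64_MAX = %.1f\n", d);
    return 0;
}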
How to convert double into uint64_t if only the whole number part of double is needed
You can simply assign (conversion is implicit), or you can explicitly cast if you want to make it clear that a conversion takes place:
double my_double = 1.2345678e48;
uint64_t my_uint;
uint64_t my_other_uint;
my_uint = my_double;
my_other_uint = (uint64_t) my_double;
Any fractional part of the double's value will be truncated. The integer part will be preserved exactly if it is representable as a uint64_t; otherwise, the behavior is undefined.
The code you presented uses a union to overlay storage of a double and a uint64_t. That's not inherently wrong, but it's not a useful technique for converting between the two types. Casts are C's mechanism for all non-implicit value conversions.
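If you need to guard against that undefined behavior, a range check before the conversion is one option. A minimal sketch (the helper name is made up for illustration):
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: convert only when the truncated value is guaranteed
   to fit in a uint64_t; reject negatives, NaN, and anything >= 2^64. */
static bool double_to_u64(double x, uint64_t *out)
{
    if (!(x >= 0.0) || x >= 18446744073709551616.0)   /* 2^64 as a double */
        return false;
    *out = (uint64_t)x;   /* fractional part is discarded */
    return true;
}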
double can hold substantially larger numbers than uint64_t, as the value range of an 8-byte IEEE 754 double is roughly 4.94065645841246544e-324 to 1.79769313486231570e+308 (positive or negative). However, if you do additions of small values in that range, you will be in for a surprise, because at some point the precision will no longer be able to represent, e.g., an addition of 1, and the result will round back down to the same value, essentially making a loop steadily incremented by 1 non-terminating.
This code for example:
#include <stdio.h>

int main()
{
    for (double i = 100000000000000000000000000000000.0; i < 1000000000000000000000000000000000000000000000000.0; ++i)
        printf("%lf\n", i);
    return 0;
}
gives me a constant output of 100000000000000005366162204393472.000000. That's also why we have nextafter and nexttoward functions in math.h. You can also find ceil and floor functions there, which, in theory, will allow you to solve your second problem: removing the fraction part.
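A small sketch of what nextafter reveals at that magnitude (assuming IEEE-754 doubles):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x    = 1e32;
    double next = nextafter(x, INFINITY);      /* the very next representable double */
    printf("gap at 1e32: %.0f\n", next - x);   /* roughly 1.8e16, so x + 1.0 == x     */
    return 0;
}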
However, if you really need to hold large numbers you should look at bigint implementations instead, e.g. GMP. Bigints were designed to do operations on very large integers, and operations like an addition of one will truly increment the number even for very large values.
I have a question about signed numbers and hexadecimal numbers and their use in arithmetic in C. From what I understand, signed numbers generally are able to store smaller numbers than their unsigned counterparts.
For example, a signed integer that is 32 bits in length has a maximum value of 2,147,483,647; whereas, an unsigned 32-bit integer has a range of up to 4,294,967,295.
It appears that the value of these numbers overflows when performing addition on the highest possible values:
printf("My integer: %i\n", 2147483647 + 1);
The output that I get is:
My integer: -2147483648
However, despite this overflow, the hexadecimal number string representation appears to be well-formed and correct for the operation of addition.
printf("My hexadecimal: %#X\n", 0x7FFFFFFF + 0x1); // 2147483647 + 1 in Hex
The output that I get is:
My hexadecimal: 0X80000000
My question is, would there be any situation where performing this type of addition and then looking at the hexadecimal representation is beneficial?
At first glance, it appears that this method would give us access to the entire range of the 32-bit number for the operation of addition. Any thoughts or comments are appreciated. Cheers
Overflow on signed integers is undefined behavior. Which means you can't reliably predict what will happen when you do so.
That being said, what you're seeing here is an illustration that your system is using 2's complement for representing negative values in signed integers.
While 2's complement is very common, it's not universal; some systems use sign-and-magnitude or ones' complement instead. So for maximum portability, you shouldn't depend on this behavior.
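If you need the sum to be well defined, one common approach is to test for overflow before adding. A minimal sketch:
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int a = INT_MAX, b = 1;

    /* Check before adding instead of relying on wrap-around behavior. */
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        printf("a + b would overflow\n");
    else
        printf("a + b = %d\n", a + b);
    return 0;
}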
Please explain the following paragraph.
"The next question is whether we can assign a certain value to a variable without losing precision. It is not sufficient if we just check for overflow during the addition or subtraction, because someone might add 1 to -5 and assign the result to an unsigned int. Then the actual addition does not overflow, but the result still does not fit."
When I am adding 1 to -5 I don't see any reason to worry; the answer is, as it should be, -4.
So what is the problem with the result not fitting?
You can find the full article I was going through here:
http://www.fefe.de/intof.html
The binary representation of -4, in a 32-bit word, is as follows (hex notation)
0xfffffffc
When interpreted as an unsigned integer, this bit pattern represents the number 2^32 - 4, or 4,294,967,292. I'm not sure I would call this phenomenon "overflow", but it is a common mistake to assign a small negative integer to a variable of unsigned type and wind up with a really big positive integer.
This trick is actually exploited for bounds checking: if you have a signed integer i and want to know if it is in the range 0 <= i < n, you can test
if ((unsigned)i < n) { ... }
which gives you the answer using one comparison instead of two. The cast to unsigned has no run-time cost; it just tells the compiler to generate an unsigned comparison instead of a signed comparison.
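A small self-contained sketch of that trick (the helper name is just for illustration):
#include <stdio.h>

/* Same idea as above, wrapped in a function for clarity. */
static int in_range(int i, unsigned n)
{
    return (unsigned)i < n;   /* a negative i becomes a huge unsigned value and fails the test */
}

int main(void)
{
    printf("%d %d %d\n", in_range(-1, 10), in_range(5, 10), in_range(10, 10));   /* 0 1 0 */
    return 0;
}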
Try assigning it to an unsigned int, not an int.
The term unsigned int is the key: by default, an int datatype will hold both negative and positive numbers, whereas unsigned ints are never negative. The language provides this option because unsigned ints can technically hold greater positive values than regular signed ints, since they do not need to use a bit to keep track of whether the value is negative or positive.
Please see:
Signed versus Unsigned Integers
The problem is that you're storing -4 in an unsigned int. Unsigned ints can only contain zero and positive values. If you assign -4 to one, you'll actually end up getting a very large positive number (the actual value depends on how wide an int you're using).
The problem is that a storage type such as unsigned int can only hold so much. With 1 and -5 it does not matter, but with 1 and -500000000 you might end up with a confusing result. Also, unsigned storage will interpret anything stored in it as non-negative, so you cannot meaningfully put a negative value in an unsigned variable.
Two big things to watch out for:
1. Overflow in the operation itself: 1 + -500000000
2. Issues in casting: (unsigned int)(1 + -500)
Unsigned variables, like unsigned int, cannot hold negative values. So assigning 1 - 5 to an unsigned int won't give you -4; the value is reduced modulo UINT_MAX + 1, so with a 32-bit unsigned int you get 4294967292.
Some code:
signed int first, second;
unsigned int result;
first = obtain(); // happens to be 1
second = obtain(); // happens to be -5
result = first + second; // unexpected result here - very large number - and it's too late to check that there's a problem
Say you obtained those values from keyboard. You need to check before addition that the result can be represented in unsigned int. That's what the article talks about.
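A sketch of the kind of pre-check the article is talking about (assuming long long is wider than int, which is the usual case):
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int first  = 1;    /* pretend these came from the keyboard */
    int second = -5;

    long long sum = (long long)first + second;   /* wide enough for any int + int here */
    if (sum < 0 || sum > (long long)UINT_MAX)
        printf("%lld does not fit in an unsigned int\n", sum);
    else
        printf("result = %u\n", (unsigned int)sum);
    return 0;
}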
By definition the number -4 cannot be represented in an unsigned int. -4 is a signed integer. The same goes for any negative number.
When you assign a negative integer to an unsigned int the actual bits of the number do not change, but they are merely represented differently. You'll get some ridiculously-large number due to the way integers are represented in binary (two's complement).
In two's complement, -4 is represented as 0xfffffffc. When 0xfffffffc is represented as an unsigned int you'll get the number 4,294,967,292.
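For example, assuming a 32-bit unsigned int:
#include <stdio.h>

int main(void)
{
    int i = -4;
    unsigned int u = i;   /* conversion adds UINT_MAX + 1, giving 4294967292 */
    printf("%d -> %u (0x%x)\n", i, u, u);
    return 0;
}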
You have to remember that fundamentally you're working with bits. So you can assign a value of -4 to an unsigned integer and this will place a series of bits into that memory location. Those bits can be interpreted as -4 in certain circumstances. One such circumstance is the obvious one: you've told the compiler/system that the bits in that memory location should be interpreted as a two's complement signed number. So if you do printf("%d", i), printf does its magic and converts the two's complement number to a magnitude and sign. The magnitude will be 4 and the sign will be negative, so it displays '-4'.
However, if you tell the compiler that the data at that memory location is not signed, then the bits don't change but their interpretation does. So when you do your addition, store the result in an unsigned memory location and then call printf on the result, it doesn't bother looking for a sign because by definition the value is always non-negative. It calculates the magnitude and prints it. The magnitude will be off because the sign information is still encoded in the bits but is treated as magnitude information.