Repercussions of storing a 5 digit integer as a 16bit float - c

I am working with some data that can have large number values and the data itself is important.
The highest number seen is "89,482". So originally I was going to use unsigned int.
However using these numbers in usigned int format is causing some headaches, namely manipulating them in OpenGL shaders.
Basically things would be a lot simpler if I could use float instead.
However I don't fully understand the repercussions of storing a number like this as floating point. Especially as in the OpenGL case I don't have the choice of storing a single channel 32 bit floating point texture, only 16bit.
For 16bit float Wikipedia states:
Precision limitations on integer values
Integers between 0 and 2048 can be exactly represented
Integers between 2049 and 4096 round to a multiple of 2 (even number)
Integers between 4097 and 8192 round to a multiple of 4
Integers between 8193 and 16384 round to a multiple of 8
Integers between 16385 and 32768 round to a multiple of 16
Integers between 32769 and 65519 round to a multiple of 32
Integers equal to or above 65520 are rounded to "infinity".
So does this quite simply mean that in the case of the number "89,482", trying to store it in an 16bit OpenGL texture it will be rounded to infinity? If so what are my options? Stick with unsigned int? What about when I need to normalize it, can I cast to float?
--EDIT---
I need the value to be un-normalised in the shader

Related

Understanding casts from integer to float

Could someone explain this weird looking output on a 32 bit machine?
#include <stdio.h>
int main() {
printf("16777217 as float is %.1f\n",(float)16777217);
printf("16777219 as float is %.1f\n",(float)16777219);
return 0;
}
Output
16777217 as float is 16777216.0
16777219 as float is 16777220.0
The weird thing is that 16777217 casts to a lower value and 16777219 casts to a higher value...
In the IEEE-754 basic 32-bit binary floating-point format, all integers from −16,777,216 to +16,777,216 are representable. From 16,777,216 to 33,554,432, only even integers are representable. Then, from 33,554,432 to 67,108,864, only multiples of four are representable. (Since the question does not necessitate discussion of which numbers are representable, I will omit explanation and just take this for granted.)
The most common default rounding mode is to round the exact mathematical result to the nearest representable value and, in case of a tie, to round to the representable value which has zero in the low bit of its significand.
16,777,217 is equidistant between the two representable values 16,777,216 and 16,777,218. These values are represented as 1000000000000000000000002•21 and 1000000000000000000000012•21. The former has 0 in the low bit of its significand, so it is chosen as the result.
16,777,219 is equidistant between the two representable values 16,777,218 and 16,777,220. These values are represented as 1000000000000000000000012•21 and 1000000000000000000000102•21. The latter has 0 in the low bit of its significand, so it is chosen as the result.
You may have heard of the concept of "precision", as in "this fractional representation has 3 digits of precision".
This is very easy to think about in a fixed-point representation. If I have, say, three digits of precision past the decimal, then I can exactly represent 1/2 = 0.5, and I can exactly represent 1/4 = 0.25, and I can exactly represent 1/8 = 0.125, but if I try to represent 1/16, I can not get 0.0625; I will either have to settle for 0.062 or 0.063.
But that's for fixed-point. The computer you're using uses floating-point, which is a lot like scientific notation. You get a certain number of significant digits total, not just digits to the right of the decimal point. For example, if you have 3 decimal digits worth of precision in a floating-point format, you can represent 0.123 but not 0.1234, and you can represent 0.0123 and 0.00123, but not 0.01234 or 0.001234. And if you have digits to the left of the decimal point, those take away away from the number you can use to the right of the decimal point. You can use 1.23 but not 1.234, and 12.3 but not 12.34, and 123.0 but not 123.4 or 123.anythingelse.
And -- you can probably see the pattern by now -- if you're using a floating-point format with only three significant digits, you can't represent all numbers greater than 999 perfectly accurately at all, even though they don't have a fractional part. You can represent 1230 but not 1234, and 12300 but not 12340.
So that's decimal floating-point formats. Your computer, on the other hand, uses a binary floating-point format, which ends up being somewhat trickier to think about. We don't have an exact number of decimal digits' worth of precision, and the numbers that can't be exactly represented don't end up being nice even multiples of 10 or 100.
In particular, type float on most machines has 24 binary bits worth of precision, which works out to 6-7 decimal digits' worth of precision. That's obviously not enough for numbers like 16777217.
So where did the numbers 16777216 and 16777220 come from? As Eric Postpischil has already explained, it ends up being because they're multiples of 2. If we look at the binary representations of nearby numbers, the pattern becomes clear:
16777208 111111111111111111111000
16777209 111111111111111111111001
16777210 111111111111111111111010
16777211 111111111111111111111011
16777212 111111111111111111111100
16777213 111111111111111111111101
16777214 111111111111111111111110
16777215 111111111111111111111111
16777216 1000000000000000000000000
16777218 1000000000000000000000010
16777220 1000000000000000000000100
16777215 is the biggest number that can be represented exactly in 24 bits. After that, you can represent only even numbers, because the low-order bit is the 25th, and essentially has to be 0.
Type float cannot hold that much significance. The significand can only hold 24 bits. Of those 23 are stored and the 24th is 1 and not stored, because the significand is normalised.
Please read this which says "Integers in [ − 16777216 , 16777216 ] can be exactly represented", but yours are out of that range.
Floating representation follows a method similar to what we use in everyday life and we call exponential representation. This is a number using a number of digits that we decide will suffice to realistically represent the value, we call it mantissa, or significant, that we will multiply to a base, or radix, value elevated to a power that we call exponent. In plain words:
num*base^exp
We generally use 10 as base, because we have 10 finger in our hands, so we are habit to numbers like 1e2, which is 100=1*10^2.
Of course we regret to use exponential representation for so small numbers, but we prefer to use it when acting on very large numbers, or, better, when our number has a number of digits that we consider enough to represent the entity we are valorizing.
The correct number of digits could be how many we can handle by mind, or what are required for an engineering application. When we decided how many digits we need we will not care anymore for how adherent to the real value will be the numeric representation we are going to handle. I.e. for a number like 123456.789e5 it is understood that adding up 99 unit we can tolerate the rounded representation and consider it acceptable anyway, if not we should change the representation and use a different one with appropriate number of digits as in 12345678900.
On a computer when you have to handle very large numbers, that couldn't fit in a standard integer, or when the you have to represent a real number (with decimal part) the right choice is a floating or double floating point representation. It uses the same layout we discussed above, but the base is 2 instead of 10. This because a computer can have only 2 fingers, the states 0 or 1. Se the formula we used before, to represent 100, become:
100100*2^0
That's still isn't the real floating point representation, but gives the idea. Now consider that in a computer the floating point format is standardized and for a standard float, as per IEE-754, it uses, as memory layout (we will see after why it is assumed 1 more bit for the mantissa), 23bits for the mantissa, 1bit for the sign and 8bits for the exponent biased by -127 (that simply means that it will range between -126 and +127 without the need for a sign bit, and the values 0x00 and 0xff reserved for special meaning).
Now consider using 0 as exponent, this means that the value 2^exponent=2^0=1 multiplied by mantissa give the same behavior of a 23bits integer. This imply that incrementing a count as in:
float f = 0;
while(1)
{
f +=1;
printf ("%f\n", f);
}
You will see that the printed value linearly increase by one until it saturates the 23bits and the exponent will become to grow.
If the base, or radix, of our floating point number would have been 10, we would see an increase each 10 loops for the first 100 (10^2) values, than an increase of 100 for the next 1000 (10^3) values and so on. You see that this corresponds to the *truncation** we have to make due to the limited number of available digits.
The same phenomenon will be observed when using the binary base, only the changes happens on powers of 2 interval.
What we discussed up to now is called the denormalized form of a floating point, what is normally used is the counterpart normalized. The latter simply means that there is a 24th bit, not stored, that is always 1. In plane words we wouldn't use an exponent of 0 for number less that 2^24, but we shift it (multiply by 2) up to the MSbit==1 reach the 24th bit, than the exponent is adjusted to such a negative value that force the conversion to shift back the number to its original value.
Remember the reserved value of the exponent we talked above? Well an exponent==0x00 means that we have a denormalized number. exponent==0xff indicate a nan (not-a-number) or +/-infinity if mantissa==0.
It should be clear now that when the number we express is beyond the 24bits of the significant (mantissa), we should expect approximation of the real value depending on how much far we are from 2^24.
Now the number you are using are just on the edge of 2^24=16,277,216 :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| = 16,277,215
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\______ _______/\_____________________ _______________________/
i v v
g exponent mantissa
n
Now increasing by 1 we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0| = 16,277,216
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Note that we have triggered to 1 the 24th bit, but from now on we are above the 24 bit representation, and each possible further representation is in steps of 2^1=2. Simply advance by 2 or can represent only even numbers (multiples of 2^1=2). I.e. setting to 1 the Less Significant bit we have:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1| = 16,277,218
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
Increasing again:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|1|0|0|1|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0| = 16,277,220
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
s\__ exponent __/\_________________ mantissa __________________/
As you can see we cannot exactly represent 16,277,219. In your code:
// This will print 16777216, because 1 increment isn't enough to
// increase the significant that can express only intervals
// that are > 2^1
printf("16777217 as float is %.1f\n",(float)16777217);
// This will print 16777220, because an increment of 3 on
// the base 16777216=2^24 will trigger an exponent increase rounded
// to the closer exact representation
printf("16777219 as float is %.1f\n",(float)16777219);
As said above the choice of the numeric format must be appropriate for the usage, a floating point is only an approximate representation of a real number, and is definitively our duty to carefully use the right type.
In the case if we need more precision we could use a double, or an integer long long int.
Just for sake of completeness I would add few words on the approximate representation for irriducible numbers. This numbers are not divisible by a fraction of 2, so the representation in float format will always be not exact, and need to be rounded to the correct value during conversion to decimal representation.
For more details see:
https://en.wikipedia.org/wiki/IEEE_754
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
Online demo applets:
https://babbage.cs.qc.cuny.edu/IEEE-754/
https://evanw.github.io/float-toy/
https://www.h-schmidt.net/FloatConverter/IEEE754.html

GMP most significant digits

I'm performing some calculations on arbitrary precision integers using GNU Multiple Precision (GMP) library. Then I need the decimal digits of the result. But not all of them: just, let's say, a hundred of most significant digits (that is, the digits the number starts with) or a selected range of digits from the middle of the number (e.g. digits 100..200 from a 1000-digit number).
Is there any way to do it in GMP?
I couldn't find any functions in the documentation to extract a range of decimal digits as a string. The conversion functions which convert mpz_t to character strings always convert the entire number. One can only specify the radix, but not the starting/ending digit.
Is there any better way to do it other than converting the entire number into a humongous string only to take a small piece of it and throw out the rest?
Edit: What I need is not to control the precision of my numbers or limit it to a particular fixed amount of digits, but selecting a subset of digits from the digit string of the number of arbitrary precision.
Here's an example of what I need:
71316831 = 19821203202357042996...2076482743
The actual number has 1112852 digits, which I contracted into the ....
Now, I need only an arbitrarily chosen substring of this humongous string of digits. For example, the ten most significant digits (1982120320 in this case). Or the digits from 1112841th to 1112849th (21203202 in this case). Or just a single digit at the 1112841th position (2 in this case).
If I were to first convert my GMP number to a string of decimal digits with mpz_get_str, I would have to allocate a tremendous amount of memory for these digits only to use a tiny fraction of them and throw out the rest. (Not to mention that the original mpz_t number in binary representation already eats up quite a lot.)
If you know the number of decimal digits of x = 7^1316831 in advance, e.g., 1112852. Then you get your lower, say, 10 digits with:
x % (10^10), and the upper 20 digits with:
x / (10^(1112852 - 20)).
Note, I get 19821203202357042995 for the latter; 5 at final, not 6.
I don't think you can do that in GMP. However you can use Boost Multiprecision Library
Depending upon the number type, precision may be arbitrarily large (limited only by available memory), fixed at compile time (for example 50 or 100 decimal digits), or a variable controlled at run-time by member functions. The types are expression-template-enabled for better performance than naive user-defined types.
Emphasis mine
Another alternative is ttmath with the type ttmath::Big<e,m> that you can control the needed precision. Any fixed-precision types will work, provided that you only need the most significant digits, as they all drop the low significant digits like how float and double work. Those digits don't affect the high digits of the result, hence can be omitted safely. For instance if you need the high 20 digits then use a type that can store 20 digits and a little more, in order to provide enough data for correct rounding later
For demonstration let's take a simple example of 77 = 823543 and you only need the top 2 digits. Using a 4-digit type for calculation you'll get this
75 = 16807 => round to 1681×10¹ and store
75×7 = 1681×101×7 = 11767*10¹ ≈ 1177×102
75×7×7 = 1177×102×7 = 8232×102
As you can see the top digits are the same even without needing to get the full exact result. Calculating the full precision using GMP not only wastes a lot of time but also memory. Think about the amount of memory you need to store the result of another operation on 2 bigints to get the digits you want. By fixing the precision instead of leaving it at infinite you'll decrease the CPU and memory usage significantly.
If you need the 100th to 200th high order digits then use a type that has enough room for 201 digits and more, and extract those 101 digits after calculation. But this will be more wasteful so you may need to change to an arbitrary-precision (or fixed-precision) type that uses a base that's a power of 10 for its limbs (I'm using GMP notation here). For example if the type uses base 109 then each limb represents 9 digits in the decimal output and you can get arbitrary digit in decimal directly without any conversion from binary to decimal. That means zero waste for the string. I'm not sure which library uses base 10n but you can look at Mini-Pi's implementation which uses base 109, or write it yourself. This way it also work for efficiently getting the high digits
See
How are extremely large floating-point numbers represented in memory?
What is the simplest way of implementing bigint in C?

Efficient way to store a fixed range float

I'm heaving an (big) array of floats, each float takes 4 bytes.
Is there a way, given the fact that my floats are ranged between 0 and 255, to store each float in less than 4 bytes?
I can do any amount of computation on the whole array.
I'm using C.
How much precision do you need?
You can store each float in 2 bytes by representing it as an unsigned short (ranges from 0 to 65,535) and dividing all values by 2^8 when you need the actual value. This is essentially the same as using a fixed point format instead of floating point.
Your precision is limited to 1.0 / (2^8) = 0.00390625 when you do this, however.
The absolute range of your data doesn't really matter that much, it's the amount of precision you need. If you can get away with e.g. 6 digits of precision, then you only need as much storage as would be required to store the integers from 1-1000000, and that's 20 bits. So, supposing this, what you can do is:
1) Shift your data so that the smallest element has value 0. I.e. subtract a single value from every element. Record this shift.
2) Scale (multiply) your data by a number just large enough so that after truncation to an integer, you will not lose any precision you need.
3) Now this might be tricky unless you can pack your data into convenient 8- or 16-bit units--pack the data into successive unsigned integers. Each one of your data values needs 20 bits in this example, so value 1 takes up the first 20 bits of integer 1, value 2 takes up the remaining 12 bits of integer 1 and the first 8 bits of integer 2, and so on. In this hypothetical case you end up saving ~40%.
4) Now, 'decrypting'. Unpack the values (you have saved the # of bits in each one), un-scale, and un-shift.
So, this will do it, and might be faster and more compact than standard compression algorithms, as they aren't allowed to make assumptions about how much precision you need, but you are.
For example you could store integers (floats with .0) on one byte, but the other float need more bytes.
You could also use fixed-point if you don't worry about precision...

What is the approximate resolution of a single precision floating point number when its around zero

I am storing many longitudes and latitudes as doubles, I am wondering if I can get away with storing them as floats.
To answer this question, I need to know the approximate resolution of a single precision floating point number when the stored values are longitudes / latitudes (-180 to +180).
Your question may have several interpretations.
If it is just for angles and for storage on a disk or on a device i would suggest you to store your values using a totally different technique: store as 32 bit integer.
int encodedAngle = (int)(value * (0x7FFFFFFF / 180.0));
To recover it, do the contrary.
double angle = (encodedAngle / (0x7FFFFFFF / 180.0));
In this way you have full 31 bit resolution for 180 degrees and 1 bit for the sign.
You can use this way also to keep your values in ram, the cost of this coversion is higher compared to work directly with doubles, but if you want to keep your memory low but resolution high this can work quite well.
The cost is not so high, just a conversion to/from integer from/to double and a multiplication, modern processors will do it in a very little amount of time, and since the accessed memory is less, if the list contains a lot of values, your code will be more friendly with processor cache.
Your resolution will be 180 / ((2^31) - 1) = 8.38190318 × 10^-8 degrees, not bad :)
The resolution you can count on with single-precision floats is about 360 / (2 ^ 23) or 4 * 10 ^ -5.
More precisely, the largest single-precision float strictly inferior to 360. (which is representable exactly) is about 359.999969. For the whole range -360. .. 360, you will be able to represent differences at least as small as the difference between these two numbers.
Depends, but rather not.
32-bit float stores 7 significant digits. That is normally too little for storing the proper resolution of longitude/latitude. For example, openstreetmap.org uses six digits after the decimal point, so minimum eight, maximum total of ten digits.
In short, use float64.
Usually floats are 4 bytes (32 bits) while doubles are double that. However, the exact precision if you're doing calculations is implementation (and hardware) specific. On some systems all floats will be stored as doubles, just to add to the confusion.

Why does the addition of two float numbers is incorrect in C?

I have a problem with the addition of two float numbers.
Code below:
float a = 30000.0f;
float b = 4499722832.0f;
printf("%f\n", a+b);
Why the output result is 450002816.000000? (The correct one should be 450002832.)
Float are not represented exactly in C - see http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers and http://en.wikipedia.org/wiki/Single_precision, so calculations with float can only give an approximate result.
This is especially apparent for larger values, since the possible difference can be represented as a percentage of the value. In case of adding/subtracting two values, you get the worse precision of both (and of the result).
Floating-point values cannot represent all integer values.
Remember that single-precision floating-point numbers only have 24 (or 23, depending on how you count) bits of precision (i.e. significant figures). So as values get larger, you begin to lose low-end precision, which is why the result of your calculation isn't quite "correct".
From wikipedia
Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
So your number doesn't actually fit in float. You can use double instead.

Resources