Represent float using integer - c

The size limit of a BLE packet is 20 bytes. I need to transfer the following data over it:
struct Data {
uint16_t top;
uint16_t bottom;
float accX;
float accY;
float accZ;
float qx;
float qy;
float qz;
float qw;
};
Size of Data is 32 bytes. The precision of floats can not be sacrificed, since they represent accelerometers and quaternions, and would create a huge drift error if not represented precisely (data would be integrated over time).
I don't want to send 2 packets either, as it's really important that the whole data set is sampled at the same time.
I'm planning to take advantage of the range instead.
Accelerometer values are IEEE floats in the range of [-10, 10]
Quaternions are IEEE floats in the range of [-1, 1]. We could remove w, as x^2 + y^2 + z^2 + w^2 = 1
Top and bottom are 10 bits each.
Knowing this information, how can I serialize Data using at most 20 bytes?

Assuming binary32, code is using 2*16 + 7*32 bits (256 bits) and OP wants to limit to 20*8 bits (160).
Some savings:
1) 10 bit uint16_t,
2) reduced exponent range saves a bit or 2 per float - would save a few more bits if OP stated the minimum exponent as well (estimate 4 bits total),
3) Not coding w.
That makes for 2*10 + 6*(32-4) = 188 bits, still not down to 160.
OP says "The precision of floats can not be sacrificed", implying the 24-bit (23 bits explicitly coded) significand is needed. 7 floats * 23 bits is 161 bits, and that is not counting the sign and exponent bits nor the 2 uint16_t.
So unless some pattern or redundant information can be eliminated, OP is outta luck.
Suggest taking many samples of data and trying to compress them using LZ or other compression techniques. If OP ends up with significantly less than 20 bytes per sample on average, then the answer is yes - in theory; else you are SOL.

Related

Convert integer in a new floating point format

This code is intended to convert a signed 16-bit integer to a new floating point format (similar to the normal IEEE 754 floating point format). I understand the regular IEEE 754 floating point format, but I don't understand how this code works or what this floating point format looks like. I would be grateful for some insight into the idea of the code, specifically how many bits are used for representing the significand and how many bits are used for representing the exponent in this new format.
#include <stdint.h>

uint32_t short2fp (int16_t inp)
{
    int x, f, i;

    if (inp == 0)
    {
        return 0;
    }
    else if (inp < 0)
    {
        i = -inp;
        x = 191;
    }
    else
    {
        i = inp;
        x = 63;
    }

    for (f = i; f > 1; f >>= 1)
        x++;

    for (f = i; f < 0x8000; f <<= 1);

    return (x * 0x8000 + f - 0x8000);
}
This couple of tricks should help you recognize the parameters (exponent's size and mantissa's size) of a custom floating-point format:
First of all, how many bits is this float number long?
We know that the sign bit is the highest bit set in any negative float number. If we calculate short2fp(-1) we obtain 0b10111111000000000000000, that is a 23-bit number. Therefore, this custom float format is a 23-bit float.
If we want to know the exponent's and mantissa's sizes, we can convert the number 3, because this will set both the highest exponent's bit and the highest mantissa's bit. If we do short2fp(3), we obtain 0b01000000100000000000000, and if we split this number we get 0 1000000 100000000000000: the first bit is the sign, then we have 7 bits of exponent, and finally 15 bits of mantissa.
Conclusion:
Float format size: 23 bits
Exponent size: 7 bits
Mantissa size: 15 bits
NOTE: this conclusion may be wrong for a different number of reasons (e.g.: float format particularly different from IEEE754 ones, short2fp() function not working properly, too much coffee this morning, etc.), but in general this works for every binary floating-point format defined by IEEE754 (binary16, binary32, binary64, etc.) so I'm confident this works for your custom float format too.
P.S.: the short2fp() function is written very poorly; you may try to improve its clarity if you want to investigate the inner workings of the function.
The two statements x = 191; and x = 63; set x to either 1*128 + 63 or 0*128 + 63, according to whether the number is negative or positive. Therefore 128 (2^7) has the sign bit at this point. As x is later multiplied by 0x8000 (2^15), the sign bit is 2^22 in the result.
These statements also initialize the exponent to 0, which is encoded as 63 due to a bias of 63. This follows the IEEE-754 pattern of using a bias of 2^(n-1) - 1 for an exponent field of n bits. (The "single" format has eight exponent bits and a bias of 2^7 - 1 = 127, and the "double" format has 11 exponent bits and a bias of 2^10 - 1 = 1023.) Thus we expect an exponent field of 7 bits, with bias 2^6 - 1 = 63.
This loop:
for (f = i; f > 1; f >>= 1)
x++;
detects the magnitude of i (the absolute value of the input), adding one to the exponent for each power of two that f is detected to exceed. For example, if the input is 4, 5, 6, or 7, the loop executes two times, adding two to x and reducing f to 1, at which point the loop stops. This confirms the exponent bias; if i is 1, x is left as is, so the initial value of 63 corresponds to an exponent of 0 and a represented value of 2^0 = 1.
The loop for (f = i; f < 0x8000; f <<= 1); scales f in the opposite direction, moving its leading bit to be in the 0x8000 position.
In return (x * 0x8000 + f - 0x8000);, x * 0x8000 moves the sign bit and exponent field from their initial positions (bit 7 and bits 6 to 0) to their final positions (bit 22 and bits 21 to 15). f - 0x8000 removes the leading bit from f, giving the trailing bits of the significand. This is then added to the final value, forming the primary encoding of the significand in bits 14 to 0.
Thus the format has the sign bit in bit 22, exponent bits in bits 21 to 15 with a bias of 63, and the trailing significand bits in bits 14 to 0.
The format could encode subnormal numbers, infinities, and NaNs in the usual way, but this is not discernible from the code shown, as it encodes only integers in the normal range.
As a comment suggested, I would use a small number of strategically selected test cases to reverse engineer the format. The following assumes an IEEE-754-like binary floating-point format using sign-magnitude encoding with a sign bit, exponent bits, and significand (mantissa) bits.
short2fp (1) = 001f8000 while short2fp (-1) = 005f8000. The exclusive OR of these is 0x00400000 which means the sign bit is in bit 22 and this floating-point format comprises 23 bits.
short2fp (1) = 001f8000, short2fp (2) = 00200000, and short2fp (4) = 00208000. The difference between consecutive values is 0x00008000 so the least significant bit of the exponent field is bit 15, the exponent field comprises 7 bits, and the exponent is biased by (0x001f8000 >> 15) = 0x3F = 63.
This leaves the least significant 15 bits for the significand. We can see from short2fp (2) = 00200000 that the integer bit of the significand (mantissa) is not stored, that is, it is implicit as in IEEE-754 formats like binary32 or binary64.

Flipping bytes, doing arithmetic and flipping them back again

I have a programming/math related question regarding converting between big endian and little endian and doing arithmetic.
Assume we have two integers in little endian mode:
int a = 5;
int b = 6;
//a+b = 11
Let's flip the bytes and add them again:
int a = 1280;
int b = 1536;
//a+b = 2816
Now if we flip the byte order of 2816 we get 11. So essentially we can do arithmetic between little-endian and big-endian representations, and once converted back they represent the same number?
Does this have a theory/name behind it in the computer science world?
It doesn't work if the addition involves carrying since carrying propagates right-to-left. Swapping digits doesn't mean carrying switches direction, so any bytes that overflow into the next byte will be different.
Let's look at an example in hex, pretending that endianness means each 4-bit nibble is swapped:
int a = 0x68;
int b = 0x0B;
//a+b: 0x73
int a = 0x86;
int b = 0xB0;
//a+b: 0x136
0x8 + 0xB is 0x13. That 1 is carried and adds on to the 6 in the first sum. But in the second sum it's not carried right and added to the 6; it's carried left and overflows into the third hex digit.
First, it should be noted that your assumption that int in C has 16 bits is wrong. In most modern systems int is a 32-bit type, so if we reverse (not flip, which typically means taking the complement) the bytes of 5 we'll get 83886080 (0x05000000), not 1280 (0x0500).
Is the size of C "int" 2 bytes or 4 bytes?
What does the C++ standard state the size of int, long type to be?
Also note that you should write in hex to make it easier to understand because computers don't work in decimal:
int16_t a = 0x0005;
int16_t b = 0x0006;
// a+b = 0x000B
int16_t a = 0x0500; // 1280
int16_t b = 0x0600; // 1536
//a+b = 0x0B00
OK, now as others said, ntohl(htonl(5) + htonl(6)) happens to be the same as 5 + 6 just because you have small numbers whose reverses' sum doesn't overflow. Choose larger numbers and you'll see the difference right away.
However that property does hold in ones' complement for systems where values are stored in 2 smaller parts like this case
In ones' complement, one does arithmetic with end-around carry by propagating the carry out back to the carry in. That makes ones' complement arithmetic endian-independent if there is only one internal "carry break" (i.e. the stored value is broken into two separate chunks), because of the "circular carry".
Suppose we have xxyy and zztt then xxyy + zztt is done like this
carry
xx yy
+ zz <───── tt
──────────────
carry aa bb
│ ↑
└─────────────┘
When we reverse the chunks, yyxx + ttzz is carried the same way. Because xx, yy, zz, tt are chunks of bits of any length, it works for PDP's mixed endian, or when you store a 32-bit number in two 16-bit parts, a 64-bit number in two 32-bit parts...
For example:
0x7896 + 0x6987 = 0xE21D
0x9678 + 0x8769 = 0x11DE1 → 0x1DE1 + 1 = 0x1DE2
0x2345 + 0x9ABC = 0xBE01
0x4523 + 0xBC9A = 0x101BD → 0x01BD + 1 = 0x01BE
0xABCD + 0xBCDE = 0x168AB → 0x68AB + 1 = 0x68AC
0xCDAB + 0xDEBC = 0x1AC67 → 0xAC67 + 1 = 0xAC68
Or John Kugelman's example above: 0x68 + 0x0B = 0x73; 0x86 + 0xB0 = 0x136 → 0x36 + 1 = 0x37
The end-around carry is one of the reasons why ones' complement was chosen for the TCP checksum, because you can calculate the sum in higher precision easily. 16-bit CPUs can work in 16-bit units like normal, but 32- and 64-bit CPUs can add 32- and 64-bit chunks in parallel without worrying about the carry, much like the SWAR technique when SIMD isn't available.
This only appears to work because you happened to pick numbers that are small enough so that they as well as their sum fit into one byte. As long as everything going on in your number stays within its respective byte, you can obviously shuffle and deshuffle your bytes however you want, it won't make a difference. If you pick larger numbers, e.g., 1234 and 4321, you will notice that it won't work anymore. In fact, you will most likely end up invoking undefined behavior because your int will overflow…
Apart from all that, you will almost certainly want to read this: https://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html

MATLAB: difference between double and logical data allocation

I need to create a large binary matrix that is over the array size limit for MATLAB.
By default, MATLAB creates integer arrays as double precision arrays. But since my matrix is binary, I am hoping that there is a way to create an array of bits instead of doubles and consume far less memory.
I created a random binary matrix A and converted it to a logical array B:
A = randi([0 1], 1000, 1000);
B=logical(A);
I saved both as .mat files. They take up about the same space on my computer so I don't think MATLAB is using a more compact data type for logicals, which seems very wasteful. Any ideas?
Are you sure that the variables take the same amount of space? logical matrices / arrays inherently use 1 byte per number, whereas randi produces double precision, which is 8 bytes per number. A simple call to whos will show you how much memory each variable takes:
>> A = randi([0 1], 1000, 1000);
>> B = logical(A);
>> whos
Name Size Bytes Class Attributes
A 1000x1000 8000000 double
B 1000x1000 1000000 logical
As you can see, A takes 8 x 1000 x 1000 = 8M bytes, whereas B takes up 1 x 1000 x 1000 = 1M bytes. There is most certainly a memory savings between them.
The drawback with logicals is that they take 1 byte per number, and you're looking for 1 bit instead. The best thing I can think of is to use an unsigned integer type and pack chunks of N bits, where N is the bit width of the data type (uint8, uint16, uint32, etc.), into a single array. As such, with uint32, 32 binary digits can be packed per number, and you can save this final matrix.
Going off on a tangent - Images
In fact, this is how Java packs colour pixels when reading images in using its BufferedImage class. Each pixel in an RGB image is 24 bits, where there are 8 bits per colour channel - red, green and blue. Each pixel is represented as a proportion of red, green and blue, and the trio of 8-bit values is concatenated into a single 24-bit integer. Usually, integers are represented as 32 bits, and so you may think that 8 extra bits are being wasted. There is in fact an alpha channel that represents the transparency of each colour pixel, and that takes the other 8 bits. If you don't use transparency, these are assumed to be all 1s, and so the collection of these 4 sets of 8 bits constitutes 32 bits per pixel. There are, however, compression algorithms that reduce the size of each pixel on average to significantly less than 32 bits per pixel, but that's outside the scope of what I'm talking about.
Going back to our discussion, one way to represent this binary matrix in bit form would be perhaps in a for loop like so:
Abin = zeros(1, ceil(numel(A)/32), 'uint32');
for ii = 1 : numel(Abin)
    val = A((ii-1)*32 + 1:ii*32);
    dec = bin2dec(sprintf('%d', val));
    Abin(ii) = dec;
end
Bear in mind that this will only work for matrices where the total number of elements is divisible by 32. I won't go into how to handle the case where it isn't because I solely want to illustrate the point that you can do what you ask, but it requires a bit of manipulation. Your case of 1000 x 1000 = 1M is certainly divisible by 32 (you get 1M / 32 = 31250), and so this will work.
This is probably not the most optimized code, but it gets the point across. Basically, we take chunks of 32 numbers (0/1), going column-wise from left to right, and determine the 32-bit unsigned integer representation of each chunk. We then store this in a single location in the matrix Abin. What you will get in the end, given your 1000 x 1000 matrix, is 31250 32-bit unsigned integers, which corresponds to 1000 x 1000 bits, or 1M bits = 125,000 bytes.
Try looking at the size of each variable now:
>> whos
Name Size Bytes Class Attributes
A 1000x1000 8000000 double
Abin 1x31250 125000 uint32
B 1000x1000 1000000 logical
To perform a reconstruction, try:
Arec = zeros(size(A));
for ii = 1 : numel(Abin)
    val = dec2bin(Abin(ii), 32) - '0';
    Arec((ii-1)*32 + 1:ii*32) = val(:);
end
Also not the most optimized, but it gets the point across. Given the "compressed" matrix Abin that we calculated before, for each element we reconstruct what the original 32-bit number was, then assign these numbers in 32-bit chunks to Arec.
You can verify that Arec is indeed equal to the original matrix A:
>> isequal(A, Arec)
ans =
1
Also, check out the workspace with whos:
>> whos
Name Size Bytes Class Attributes
A 1000x1000 8000000 double
Abin 1x31250 125000 uint32
Arec 1000x1000 8000000 double
B 1000x1000 1000000 logical
You are storing your data in a compressed file format. For .mat files, versions 7.0 and 7.3 use gzip compression. The uncompressed arrays have different sizes, but after compression both come out at roughly the same size. That happens because both arrays contain only 0s and 1s, which compress efficiently.

Casting and data loss in C

I am trying to understand when casting causes data loss and how it works.
So for the following examples, I try to understand if there is data loss and, if yes, why:
(i - int (4 bytes), f - float (4 bytes), d - double (8 bytes))
i == (int)(float) i; // sizeof(int)==sizeof(float) <- no loss
i == (int)(double) i; // sizeof(int)!=sizeof(double) <- possible loss
f == (float)(double) f;// sizeof(double)!=sizeof(float) <- possible loss
d == (float) d;// sizeof(double)!=sizeof(float) <- possible loss
Is it sufficient to base the answer only on the type sizes (plus rounding)?
Assuming 32 bit ints and normal 4 and 8 byte IEEE-754 floats/doubles it would be:
i == (int)(float) i; // possible loss (32 -> 23 -> 32 bits)
i == (int)(double) i; // no loss (32 -> 52 -> 32 bits)
f == (float)(double) f; // no loss (23 -> 52 -> 23 bits)
d == (float) d; // possible loss (52 -> 23 -> 52 bits)
Note that int has 32 bits of precision, float has 23 bits, double has 52.
The memory allocated to store a variable of a type is not the only factor to consider for data loss. Generally, the rounding behavior and how the CPU processes numeric data in case of overflow are other aspects you might want to look into.
Just because sizeof reports the same size in memory does not mean there is no data loss.
Consider 0.5.
Can store that in a float but cannot store it in an integer.
Therefore data loss.
I.e. I want 0.5 of that cake. Cannot represent that as an integer. Either get nothing or lots of cakes. Yum
Why integer? Because you may need only integer numbers, for example an ID number.
Why float? Because you may need to work on real numbers, for example percentage calculations.
Why double? When you have real numbers that cannot fit into a float.

Fixed Point Arithmetic in C Programming

I am trying to create an application that stores stock prices with high precision. Currently I am using a double to do so. To save up on memory can I use any other data type? I know this has something to do with fixed point arithmetic, but I can't figure it out.
The idea behind fixed-point arithmetic is that you store the values multiplied by a certain amount, use the multiplied values for all calculus, and divide it by the same amount when you want the result. The purpose of this technique is to use integer arithmetic (int, long...) while being able to represent fractions.
The usual and most efficient way of doing this in C is by using the bit shifting operators (<< and >>). Shifting bits is a quite simple and fast operation for the ALU, and doing this has the property of multiplying (<<) and dividing (>>) the integer value by 2 on each shift (besides, many shifts can be done for exactly the same price as a single one). Of course, the drawback is that the multiplier must be a power of 2 (which is usually not a problem by itself, as we don't really care about the exact multiplier value).
Now let's say we want to use 32-bit integers for storing our values. We must choose a power of 2 multiplier. Let's divide the cake in two, so say 65536 (this is the most common case, but you can really use any power of 2 depending on your needs in precision). This is 2^16, and the 16 here means that we will use the 16 least significant bits (LSB) for the fractional part. The rest (32 - 16 = 16) is for the most significant bits (MSB), the integer part.
integer (MSB) fraction (LSB)
v v
0000000000000000.0000000000000000
Let's put this in code:
#define SHIFT_AMOUNT 16 // 2^16 = 65536
#define SHIFT_MASK ((1 << SHIFT_AMOUNT) - 1) // 65535 (all LSB set, all MSB clear)
int price = 500 << SHIFT_AMOUNT;
This is the value you must put in store (structure, database, whatever). Note that int is not necessarily 32 bits in C even though it is mostly the case nowadays. Also without further declaration, it is signed by default. You can add unsigned to the declaration to be sure. Better than that, you can use uint32_t or uint_least32_t (declared in stdint.h) if your code highly depends on the integer bit size (you may introduce some hacks about it). In doubt, use a typedef for your fixed-point type and you're safer.
When you want to do calculus on this value, you can use the 4 basic operators: +, -, * and /. You have to keep in mind that when adding and subtracting a value (+ and -), that value must also be shifted. Let's say we want to add 10 to our 500 price:
price += 10 << SHIFT_AMOUNT;
But for multiplication and division (* and /), the multiplier/divisor must NOT be shifted. Let's say we want to multiply by 3:
price *= 3;
Now let's make things more interesting by dividing the price by 4 so we make up for a non-zero fractional part:
price /= 4; // now our price is ((500 + 10) * 3) / 4 = 382.5
That's all about the rules. When you want to retrieve the real price at any point, you must right-shift:
printf("price integer is %d\n", price >> SHIFT_AMOUNT);
If you need the fractional part, you must mask it out:
printf ("price fraction is %d\n", price & SHIFT_MASK);
Of course, this value is not what we can call a decimal fraction, in fact it is an integer in the range [0 - 65535]. But it maps exactly with the decimal fraction range [0 - 0.9999...]. In other words, mapping looks like: 0 => 0, 32768 => 0.5, 65535 => 0.9999...
An easy way to see it as a decimal fraction is to resort to C built-in float arithmetic at this point:
printf("price fraction in decimal is %f\n", ((double)(price & SHIFT_MASK) / (1 << SHIFT_AMOUNT)));
But if you don't have FPU support (either hardware or software), you can use your new skills like this for complete price:
printf("price is roughly %d.%lld\n", price >> SHIFT_AMOUNT, (long long)(price & SHIFT_MASK) * 100000 / (1 << SHIFT_AMOUNT));
The number of 0's in the expression is roughly the number of digits you want after the decimal point. Don't overestimate the number of 0's given your fraction precision (no real trap here, that's quite obvious). Don't use simple long, as sizeof(long) can be equal to sizeof(int). Use long long in case int is 32 bits, as long long is guaranteed to be 64 bits minimum (or use int64_t, int_least64_t and such, declared in stdint.h). In other words, use a type twice the size of your fixed-point type, that's fair enough. Finally, if you don't have access to >= 64-bit types, maybe it's time to exercise emulating them, at least for your output.
These are the basic ideas behind fixed-point arithmetic.
Be careful with negative values. It can become tricky sometimes, especially when it's time to show the final value. Besides, C is implementation-defined about signed integers (even though platforms where this is a problem are very uncommon nowadays). You should always make minimal tests in your environment to make sure everything goes as expected. If not, you can hack around it if you know what you do (I won't develop on this, but it has something to do with arithmetic shift vs logical shift and 2's complement representation). With unsigned integers, however, you're mostly safe whatever you do, as behaviors are well defined anyway.
Also take note that while a 32-bit integer cannot represent values bigger than 2^32 - 1, using fixed-point arithmetic with a 2^16 scale limits your integer range to 2^16 - 1! (Halve all of this with signed integers, which in our example would leave us with an available range of 2^15 - 1.) The goal is then to choose a SHIFT_AMOUNT suitable to the situation. This is a tradeoff between integer part magnitude and fractional part precision.
Now for the real warnings: this technique is definitely not suitable in areas where precision is a top priority (financial, science, military...). Usual floating point (float/double) is also often not precise enough, even though it has better properties than fixed-point overall. Fixed-point has the same precision whatever the value (this can be an advantage in some cases), whereas float precision is inversely proportional to the value magnitude (i.e. the lower the magnitude, the more precision you get... well, it is more complex than that, but you get the point). Also floats have a much greater magnitude than equivalent (in number of bits) integers (fixed-point or not), at the cost of a loss of precision with high values (you can even reach a point of magnitude where adding 1 or even greater values will have no effect at all, something that cannot happen with integers).
If you work in those sensitive areas, you're better off using libraries dedicated to arbitrary precision (go take a look at gmplib, it's free). In computer science, essentially, gaining precision is about the number of bits you use to store your values. You want more precision? Use more bits. That's all.
I see two options for you. If you are working in the financial services industry, there are probably standards that your code should comply with for precision and accuracy, so you'll just have to go along with that, regardless of memory cost. I understand that that business is generally well funded, so paying for more memory shouldn't be a problem. :)
If this is for personal use, then for maximum precision I recommend you use integers and multiply all prices by a fixed factor before storage. For example, if you want things accurate to the penny (probably not good enough), multiply all prices by 100 so that your unit is effectively cents instead of dollars and go from there. If you want more precision, multiply by more. For example, to be accurate to the hundredth of a cent (a standard that I have heard is commonly applied), multiply prices by 10000 (100 * 100).
Now with 32-bit integers, multiplying by 10000 leaves little room for large numbers of dollars. A practical 32-bit limit of about 2 billion means that only prices as high as $200000 can be expressed: 2000000000 / 10000 = 200000. This gets worse if you multiply that price by something, as there may be no room to hold the result. For this reason, I recommend using 64-bit integers (long long). Even if you multiply all prices by 10000, there is still plenty of headroom to hold large values, even across multiplications.
The trick with fixed-point is that whenever you do a calculation you need to remember that each value is really an underlying value multiplied by a constant. Before you add or subtract, you need to multiply values with a smaller constant to match those with a bigger constant. After you multiply, you need to divide by something to get the result back to being multiplied by the desired constant. If you use a non-power of two as your constant, you'll have to do an integer divide, which is expensive, time-wise. Many people use powers of two as their constants, so they can shift instead of divide.
If all this seems complicated, it is. I think the easiest option is to use doubles and buy more RAM if you need it. They have 53 bits of precision, which is roughly 9 quadrillion, or almost 16 decimal digits. Yes, you still might lose pennies when you are working with billions, but if you care about that, you're not being a billionaire the right way. :)
@Alex gave a fantastic answer here. However, I wanted to add some improvements to what he's done by, for example, demonstrating how to do emulated-float (using integers to act like floats) rounding to any desired decimal place. I demonstrate that in my code below. I went a lot farther, though, and ended up writing a whole code tutorial to teach myself fixed-point math. Here it is:
my fixed_point_math tutorial: A tutorial-like practice code to learn how to do fixed-point math, manual "float"-like prints using integers only,
"float"-like integer rounding, and fractional fixed-point math on large integers.
If you really want to learn fixed-point math, I think this is valuable code to carefully go through, but it took me an entire weekend to write, so expect it to take you perhaps a couple hours to thoroughly go through it all. The basics of the rounding stuff can be found right at the top section, however, and learned in just a few minutes.
My full code on GitHub: https://github.com/ElectricRCAircraftGuy/fixed_point_math.
Or, below (truncated, because Stack Overflow won't allow that many characters):
/*
fixed_point_math tutorial
- A tutorial-like practice code to learn how to do fixed-point math, manual "float"-like prints using integers only,
"float"-like integer rounding, and fractional fixed-point math on large integers.
By Gabriel Staples
www.ElectricRCAircraftGuy.com
- email available via the Contact Me link at the top of my website.
Started: 22 Dec. 2018
Updated: 25 Dec. 2018
References:
- https://stackoverflow.com/questions/10067510/fixed-point-arithmetic-in-c-programming
Commands to Compile & Run:
As a C program (the file must NOT have a C++ file extension or it will be automatically compiled as C++, so we will
make a copy of it and change the file extension to .c first):
See here: https://stackoverflow.com/a/3206195/4561887.
cp fixed_point_math.cpp fixed_point_math_copy.c && gcc -Wall -std=c99 -o ./bin/fixed_point_math_c fixed_point_math_copy.c && ./bin/fixed_point_math_c
As a C++ program:
g++ -Wall -o ./bin/fixed_point_math_cpp fixed_point_math.cpp && ./bin/fixed_point_math_cpp
*/
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>
// Define our fixed point type.
typedef uint32_t fixed_point_t;
#define BITS_PER_BYTE 8
#define FRACTION_BITS 16 // 1 << 16 = 2^16 = 65536
#define FRACTION_DIVISOR (1 << FRACTION_BITS)
#define FRACTION_MASK (FRACTION_DIVISOR - 1) // 65535 (all LSB set, all MSB clear)
// // Conversions [NEVERMIND, LET'S DO THIS MANUALLY INSTEAD OF USING THESE MACROS TO HELP ENGRAIN IT IN US BETTER]:
// #define INT_2_FIXED_PT_NUM(num) (num << FRACTION_BITS) // Regular integer number to fixed point number
// #define FIXED_PT_NUM_2_INT(fp_num) (fp_num >> FRACTION_BITS) // Fixed point number back to regular integer number
// Private function prototypes:
static void print_if_error_introduced(uint8_t num_digits_after_decimal);
int main(int argc, char * argv[])
{
    printf("Begin.\n");

    // We know how many bits we will use for the fraction, but how many bits are remaining for the whole number,
    // and what's the whole number's max range? Let's calculate it.
    const uint8_t WHOLE_NUM_BITS = sizeof(fixed_point_t)*BITS_PER_BYTE - FRACTION_BITS;
    const fixed_point_t MAX_WHOLE_NUM = (1 << WHOLE_NUM_BITS) - 1;
    printf("fraction bits = %u.\n", FRACTION_BITS);
    printf("whole number bits = %u.\n", WHOLE_NUM_BITS);
    printf("max whole number = %u.\n\n", MAX_WHOLE_NUM);

    // Create a variable called `price`, and let's do some fixed point math on it.
    const fixed_point_t PRICE_ORIGINAL = 503;
    fixed_point_t price = PRICE_ORIGINAL << FRACTION_BITS;
    price += 10 << FRACTION_BITS;
    price *= 3;
    price /= 7; // now our price is ((503 + 10)*3/7) = 219.857142857.

    printf("price as a true double is %3.9f.\n", ((double)PRICE_ORIGINAL + 10)*3/7);
    printf("price as integer is %u.\n", price >> FRACTION_BITS);
    printf("price fractional part is %u (of %u).\n", price & FRACTION_MASK, FRACTION_DIVISOR);
    printf("price fractional part as decimal is %f (%u/%u).\n", (double)(price & FRACTION_MASK) / FRACTION_DIVISOR,
           price & FRACTION_MASK, FRACTION_DIVISOR);

    // Now, if you don't have float support (neither in hardware via a Floating Point Unit [FPU], nor in software
    // via built-in floating point math libraries as part of your processor's C implementation), then you may have
    // to manually print the whole number and fractional number parts separately as follows. Look for the patterns.
    // Be sure to make note of the following 2 points:
    // - 1) the digits after the decimal are determined by the multiplier:
    //     0 digits: * 10^0 ==> * 1      <== 0 zeros
    //     1 digit : * 10^1 ==> * 10     <== 1 zero
    //     2 digits: * 10^2 ==> * 100    <== 2 zeros
    //     3 digits: * 10^3 ==> * 1000   <== 3 zeros
    //     4 digits: * 10^4 ==> * 10000  <== 4 zeros
    //     5 digits: * 10^5 ==> * 100000 <== 5 zeros
    // - 2) Be sure to use the proper printf format statement to enforce the proper number of leading zeros in front of
    //   the fractional part of the number. ie: refer to the "%01", "%02", "%03", etc. below.

    // Manual "floats":
    // 0 digits after the decimal
    printf("price (manual float, 0 digits after decimal) is %u.",
           price >> FRACTION_BITS); print_if_error_introduced(0);
    // 1 digit after the decimal
    printf("price (manual float, 1 digit after decimal) is %u.%01lu.",
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 10 / FRACTION_DIVISOR);
    print_if_error_introduced(1);
    // 2 digits after decimal
    printf("price (manual float, 2 digits after decimal) is %u.%02lu.",
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 100 / FRACTION_DIVISOR);
    print_if_error_introduced(2);
    // 3 digits after decimal
    printf("price (manual float, 3 digits after decimal) is %u.%03lu.",
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 1000 / FRACTION_DIVISOR);
    print_if_error_introduced(3);
    // 4 digits after decimal
    printf("price (manual float, 4 digits after decimal) is %u.%04lu.",
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 10000 / FRACTION_DIVISOR);
    print_if_error_introduced(4);
    // 5 digits after decimal
    printf("price (manual float, 5 digits after decimal) is %u.%05lu.",
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 100000 / FRACTION_DIVISOR);
    print_if_error_introduced(5);
    // 6 digits after decimal
    printf("price (manual float, 6 digits after decimal) is %u.%06lu.",
           price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 1000000 / FRACTION_DIVISOR);
    print_if_error_introduced(6);
    printf("\n");

    // Manual "floats" ***with rounding now***:
    // - To do rounding with integers, the concept is best understood by examples:
    // BASE 10 CONCEPT:
    // 1. To round to the nearest whole number:
    //    Add 1/2 to the number, then let it be truncated since it is an integer.
    //    Examples:
    //      1.5 + 1/2 = 1.5 + 0.5 = 2.0. Truncate it to 2. Good!
    //      1.99 + 0.5 = 2.49. Truncate it to 2. Good!
    //      1.49 + 0.5 = 1.99. Truncate it to 1. Good!
    // 2. To round to the nearest tenth place:
// Multiply by 10 (this is equivalent to doing a single base-10 left-shift), then add 1/2, then let
// it be truncated since it is an integer, then divide by 10 (this is a base-10 right-shift).
// Example:
// 1.57 x 10 + 1/2 = 15.7 + 0.5 = 16.2. Truncate to 16. Divide by 10 --> 1.6. Good.
// 3. To round to the nearest hundredth place:
// Multiply by 100 (base-10 left-shift 2 places), add 1/2, truncate, divide by 100 (base-10
// right-shift 2 places).
// Example:
// 1.579 x 100 + 1/2 = 157.9 + 0.5 = 158.4. Truncate to 158. Divide by 100 --> 1.58. Good.
//
// BASE 2 CONCEPT:
// - We are dealing with fractional numbers stored in base-2 binary bits, however, and we have already
// left-shifted by FRACTION_BITS (num << FRACTION_BITS) when we converted our numbers to fixed-point
// numbers. Therefore, *all we have to do* is add the proper value, and we get the same effect when we
// right-shift by FRACTION_BITS (num >> FRACTION_BITS) in our conversion back from fixed-point to regular
// numbers. Here's what that looks like for us:
// - Note: "addend" = "a number that is added to another".
// (see https://www.google.com/search?q=addend&oq=addend&aqs=chrome.0.0l6.1290j0j7&sourceid=chrome&ie=UTF-8).
// - Rounding to 0 digits means simply rounding to the nearest whole number.
// Round to: Addends:
// 0 digits: add 5/10 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/2
// 1 digits: add 5/100 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/20
// 2 digits: add 5/1000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/200
// 3 digits: add 5/10000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/2000
// 4 digits: add 5/100000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/20000
// 5 digits: add 5/1000000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/200000
// 6 digits: add 5/10000000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/2000000
// etc.
printf("WITH MANUAL INTEGER-BASED ROUNDING:\n");
// Calculate addends used for rounding (see definition of "addend" above).
fixed_point_t addend0 = FRACTION_DIVISOR/2;
fixed_point_t addend1 = FRACTION_DIVISOR/20;
fixed_point_t addend2 = FRACTION_DIVISOR/200;
fixed_point_t addend3 = FRACTION_DIVISOR/2000;
fixed_point_t addend4 = FRACTION_DIVISOR/20000;
fixed_point_t addend5 = FRACTION_DIVISOR/200000;
// Print addends used for rounding.
printf("addend0 = %u.\n", addend0);
printf("addend1 = %u.\n", addend1);
printf("addend2 = %u.\n", addend2);
printf("addend3 = %u.\n", addend3);
printf("addend4 = %u.\n", addend4);
printf("addend5 = %u.\n", addend5);
// Calculate rounded prices
fixed_point_t price_rounded0 = price + addend0; // round to 0 decimal digits
fixed_point_t price_rounded1 = price + addend1; // round to 1 decimal digits
fixed_point_t price_rounded2 = price + addend2; // round to 2 decimal digits
fixed_point_t price_rounded3 = price + addend3; // round to 3 decimal digits
fixed_point_t price_rounded4 = price + addend4; // round to 4 decimal digits
fixed_point_t price_rounded5 = price + addend5; // round to 5 decimal digits
// Print manually rounded prices of manually-printed fixed point integers as though they were "floats".
printf("rounded price (manual float, rounded to 0 digits after decimal) is %u.\n",
price_rounded0 >> FRACTION_BITS);
printf("rounded price (manual float, rounded to 1 digit after decimal) is %u.%01lu.\n",
price_rounded1 >> FRACTION_BITS, (uint64_t)(price_rounded1 & FRACTION_MASK) * 10 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 2 digits after decimal) is %u.%02lu.\n",
price_rounded2 >> FRACTION_BITS, (uint64_t)(price_rounded2 & FRACTION_MASK) * 100 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 3 digits after decimal) is %u.%03lu.\n",
price_rounded3 >> FRACTION_BITS, (uint64_t)(price_rounded3 & FRACTION_MASK) * 1000 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 4 digits after decimal) is %u.%04lu.\n",
price_rounded4 >> FRACTION_BITS, (uint64_t)(price_rounded4 & FRACTION_MASK) * 10000 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 5 digits after decimal) is %u.%05lu.\n",
price_rounded5 >> FRACTION_BITS, (uint64_t)(price_rounded5 & FRACTION_MASK) * 100000 / FRACTION_DIVISOR);
// =================================================================================================================
printf("\nRELATED CONCEPT: DOING LARGE-INTEGER MATH WITH SMALL INTEGER TYPES:\n");
// RELATED CONCEPTS:
// Now let's practice handling (doing math on) large integers (ie: large relative to their integer type),
// withOUT resorting to using larger integer types (because they may not exist for our target processor),
// and withOUT using floating point math, since that might also either not exist for our processor, or be too
// slow or program-space-intensive for our application.
// - These concepts are especially useful when you hit the limits of your architecture's integer types: ex:
// if you have a uint64_t nanosecond timestamp that is really large, and you need to multiply it by a fraction
// to convert it, but you don't have uint128_t types available to you to multiply by the numerator before
// dividing by the denominator. What do you do?
// - We can use fixed-point math to achieve desired results. Let's look at various approaches.
// - Let's say my goal is to multiply a number by a fraction < 1 withOUT it ever growing into a larger type.
// - Essentially we want to multiply some really large number (near its range limit for its integer type)
// by some_number/some_larger_number (ie: a fraction < 1). The problem is that if we multiply by the numerator
// first, it will overflow, and if we divide by the denominator first we will lose resolution via bits
// right-shifting out.
// Here are various examples and approaches.
// -----------------------------------------------------
// EXAMPLE 1
// Goal: Use only 16-bit values & math to find 65401 * 16/127.
// Result: Great! All 3 approaches work, with the 3rd being the best. To learn the techniques required for the
// absolute best approach of all, take a look at the 8th approach in Example 2 below.
// -----------------------------------------------------
uint16_t num16 = 65401; // 1111 1111 0111 1001
uint16_t times = 16;
uint16_t divide = 127;
printf("\nEXAMPLE 1\n");
// Find the true answer.
// First, let's cheat to know the right answer by letting it grow into a larger type.
// Multiply *first* (before doing the divide) to avoid losing resolution.
printf("%u * %u/%u = %u. <== true answer\n", num16, times, divide, (uint32_t)num16*times/divide);
// 1st approach: just divide first to prevent overflow, and lose precision right from the start.
uint16_t num16_result = num16/divide * times;
printf("1st approach (divide then multiply):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the initial divide.\n", num16_result);
// 2nd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers,
// placing all 8 bits of each sub-number to the ***far right***, with 8 bits on the left to grow
// into when multiplying. Then, multiply and divide each part separately.
// - The problem, however, is that you'll lose meaningful resolution on the upper-8-bit number when you
// do the division, since there's no bits to the right for the right-shifted bits during division to
// be retained in.
// Re-sum both sub-numbers at the end to get the final result.
// - NOTE THAT 257 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,0000,1111,1111 = 65536/255 = 257.00392.
// Therefore, any *times* value larger than this will cause overflow.
uint16_t num16_upper8 = num16 >> 8; // 1111 1111
uint16_t num16_lower8 = num16 & 0xFF; // 0111 1001
num16_upper8 *= times;
num16_lower8 *= times;
num16_upper8 /= divide;
num16_lower8 /= divide;
num16_result = (num16_upper8 << 8) + num16_lower8;
printf("2nd approach (split into 2 8-bit sub-numbers with bits at far right):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the divide.\n", num16_result);
// 3rd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers,
// placing all 8 bits of each sub-number ***in the center***, with 4 bits on the left to grow when
// multiplying and 4 bits on the right to not lose as many bits when dividing.
// This will help stop the loss of resolution when we divide, at the cost of overflowing more easily when we
// multiply.
// - NOTE THAT 16 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,1111,1111,0000 = 65536/4080 = 16.0627.
// Therefore, any *times* value larger than this will cause overflow.
num16_upper8 = (num16 >> 4) & 0x0FF0;
num16_lower8 = (num16 << 4) & 0x0FF0;
num16_upper8 *= times;
num16_lower8 *= times;
num16_upper8 /= divide;
num16_lower8 /= divide;
num16_result = (num16_upper8 << 4) + (num16_lower8 >> 4);
printf("3rd approach (split into 2 8-bit sub-numbers with bits centered):\n");
printf(" num16_result = %u. <== Perfect! Retains the bits that right-shift during the divide.\n", num16_result);
// -----------------------------------------------------
// EXAMPLE 2
// Goal: Use only 16-bit values & math to find 65401 * 99/127.
// Result: Many approaches work, so long as enough bits exist to the left to not allow overflow during the
// multiply. The best approach is the 8th one, however, which 1) right-shifts the minimum possible before the
// multiply, in order to retain as much resolution as possible, and 2) does integer rounding during the divide
// in order to be as accurate as possible. This is the best approach to use.
// -----------------------------------------------------
num16 = 65401; // 1111 1111 0111 1001
times = 99;
divide = 127;
printf("\nEXAMPLE 2\n");
// Find the true answer by letting it grow into a larger type.
printf("%u * %u/%u = %u. <== true answer\n", num16, times, divide, (uint32_t)num16*times/divide);
// 1st approach: just divide first to prevent overflow, and lose precision right from the start.
num16_result = num16/divide * times;
printf("1st approach (divide then multiply):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the initial divide.\n", num16_result);
// 2nd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers,
// placing all 8 bits of each sub-number to the ***far right***, with 8 bits on the left to grow
// into when multiplying. Then, multiply and divide each part separately.
// - The problem, however, is that you'll lose meaningful resolution on the upper-8-bit number when you
// do the division, since there's no bits to the right for the right-shifted bits during division to
// be retained in.
// Re-sum both sub-numbers at the end to get the final result.
// - NOTE THAT 257 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,0000,1111,1111 = 65536/255 = 257.00392.
// Therefore, any *times* value larger than this will cause overflow.
num16_upper8 = num16 >> 8; // 1111 1111
num16_lower8 = num16 & 0xFF; // 0111 1001
num16_upper8 *= times;
num16_lower8 *= times;
num16_upper8 /= divide;
num16_lower8 /= divide;
num16_result = (num16_upper8 << 8) + num16_lower8;
printf("2nd approach (split into 2 8-bit sub-numbers with bits at far right):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the divide.\n", num16_result);
/////////////////////////////////////////////////////////////////////////////////////////////////
// TRUNCATED BECAUSE STACK OVERFLOW WON'T ALLOW THIS MANY CHARACTERS.
// See the rest of the code on github: https://github.com/ElectricRCAircraftGuy/fixed_point_math
/////////////////////////////////////////////////////////////////////////////////////////////////
return 0;
} // main
// PRIVATE FUNCTION DEFINITIONS:
/// \brief A function to help identify at which decimal digit error is introduced, based on how many bits you are using
/// to represent the fractional portion of the number in your fixed-point number system.
/// \details Note: this function relies on an internal static bool to keep track of whether it has already
/// identified at which decimal digit error is introduced, so once it prints this fact once, it will never
/// print again. This is by design just to simplify usage in this demo.
/// \param[in] num_digits_after_decimal The number of decimal digits we are printing after the decimal
/// (0, 1, 2, 3, etc)
/// \return None
static void print_if_error_introduced(uint8_t num_digits_after_decimal)
{
static bool already_found = false;
// Array of power base 10 values, where the value = 10^index:
const uint32_t POW_BASE_10[] =
{
1, // index 0 (10^0)
10,
100,
1000,
10000,
100000,
1000000,
10000000,
100000000,
1000000000, // index 9 (10^9); 1 Billion: the max power of 10 that can be stored in a uint32_t
};
if (already_found == true)
{
goto done;
}
if (POW_BASE_10[num_digits_after_decimal] > FRACTION_DIVISOR)
{
already_found = true;
printf(" <== Fixed-point math decimal error first\n"
" starts to get introduced here since the fixed point resolution (1/%u) now has lower resolution\n"
" than the base-10 resolution (which is 1/%u) at this decimal place. Decimal error may not show\n"
" up at this decimal location, per se, but definitely will for all decimal places hereafter.",
FRACTION_DIVISOR, POW_BASE_10[num_digits_after_decimal]);
}
done:
printf("\n");
}
Output:
gabriel$ cp fixed_point_math.cpp fixed_point_math_copy.c && gcc -Wall -std=c99 -o ./bin/fixed_point_math_c fixed_point_math_copy.c && ./bin/fixed_point_math_c
Begin.
fraction bits = 16.
whole number bits = 16.
max whole number = 65535.
price as a true double is 219.857142857.
price as integer is 219.
price fractional part is 56173 (of 65536).
price fractional part as decimal is 0.857132 (56173/65536).
price (manual float, 0 digits after decimal) is 219.
price (manual float, 1 digit after decimal) is 219.8.
price (manual float, 2 digits after decimal) is 219.85.
price (manual float, 3 digits after decimal) is 219.857.
price (manual float, 4 digits after decimal) is 219.8571.
price (manual float, 5 digits after decimal) is 219.85713. <== Fixed-point math decimal error first
starts to get introduced here since the fixed point resolution (1/65536) now has lower resolution
than the base-10 resolution (which is 1/100000) at this decimal place. Decimal error may not show
up at this decimal location, per se, but definitely will for all decimal places hereafter.
price (manual float, 6 digits after decimal) is 219.857131.
WITH MANUAL INTEGER-BASED ROUNDING:
addend0 = 32768.
addend1 = 3276.
addend2 = 327.
addend3 = 32.
addend4 = 3.
addend5 = 0.
rounded price (manual float, rounded to 0 digits after decimal) is 220.
rounded price (manual float, rounded to 1 digit after decimal) is 219.9.
rounded price (manual float, rounded to 2 digits after decimal) is 219.86.
rounded price (manual float, rounded to 3 digits after decimal) is 219.857.
rounded price (manual float, rounded to 4 digits after decimal) is 219.8571.
rounded price (manual float, rounded to 5 digits after decimal) is 219.85713.
RELATED CONCEPT: DOING LARGE-INTEGER MATH WITH SMALL INTEGER TYPES:
EXAMPLE 1
65401 * 16/127 = 8239. <== true answer
1st approach (divide then multiply):
num16_result = 8224. <== Loses bits that right-shift out during the initial divide.
2nd approach (split into 2 8-bit sub-numbers with bits at far right):
num16_result = 8207. <== Loses bits that right-shift out during the divide.
3rd approach (split into 2 8-bit sub-numbers with bits centered):
num16_result = 8239. <== Perfect! Retains the bits that right-shift during the divide.
EXAMPLE 2
65401 * 99/127 = 50981. <== true answer
1st approach (divide then multiply):
num16_result = 50886. <== Loses bits that right-shift out during the initial divide.
2nd approach (split into 2 8-bit sub-numbers with bits at far right):
num16_result = 50782. <== Loses bits that right-shift out during the divide.
3rd approach (split into 2 8-bit sub-numbers with bits centered):
num16_result = 1373. <== Completely wrong due to overflow during the multiply.
4th approach (split into 4 4-bit sub-numbers with bits centered):
num16_result = 15870. <== Completely wrong due to overflow during the multiply.
5th approach (split into 8 2-bit sub-numbers with bits centered):
num16_result = 50922. <== Loses a few bits that right-shift out during the divide.
6th approach (split into 16 1-bit sub-numbers with bits skewed left):
num16_result = 50963. <== Loses the fewest possible bits that right-shift out during the divide.
7th approach (split into 16 1-bit sub-numbers with bits skewed left):
num16_result = 50963. <== [same as 6th approach] Loses the fewest possible bits that right-shift out during the divide.
[BEST APPROACH OF ALL] 8th approach (split into 16 1-bit sub-numbers with bits skewed left, w/integer rounding during division):
num16_result = 50967. <== Loses the fewest possible bits that right-shift out during the divide,
& has better accuracy due to rounding during the divide.
References:
[my repo] https://github.com/ElectricRCAircraftGuy/eRCaGuy_analogReadXXbit/blob/master/eRCaGuy_analogReadXXbit.cpp - see "Integer math rounding notes" at bottom.
I would not recommend doing this if your only purpose is to save memory. The error in the price calculation can accumulate, and it will eventually bite you.
If you really want to implement something similar, why not just take the minimum interval of the price and then manipulate the number directly with plain ints and integer operations? You only need to convert it to a floating point number for display, which makes your life easier.
