Fixed Point Arithmetic in C Programming - c
I am trying to create an application that stores stock prices with high precision. Currently I am using a double to do so. To save up on memory can I use any other data type? I know this has something to do with fixed point arithmetic, but I can't figure it out.
The idea behind fixed-point arithmetic is that you store the values multiplied by a certain amount, use the multiplied values for all calculus, and divide it by the same amount when you want the result. The purpose of this technique is to use integer arithmetic (int, long...) while being able to represent fractions.
The usual and most efficient way of doing this in C is by using the bits shifting operators (<< and >>). Shifting bits is a quite simple and fast operation for the ALU and doing this have the property to multiply (<<) and divide (>>) the integer value by 2 on each shift (besides, many shifts can be done for exactly the same price of a single one). Of course, the drawback is that the multiplier must be a power of 2 (which is usually not a problem by itself as we don't really care about that exact multiplier value).
Now let's say we want to use 32 bits integers for storing our values. We must choose a power of 2 multiplier. Let's divide the cake in two, so say 65536 (this is the most common case, but you can really use any power of 2 depending on your needs in precision). This is 216 and the 16 here means that we will use the 16 least significant bits (LSB) for the fractional part. The rest (32 - 16 = 16) is for the most significant bits (MSB), the integer part.
integer (MSB) fraction (LSB)
v v
0000000000000000.0000000000000000
Let's put this in code:
#define SHIFT_AMOUNT 16 // 2^16 = 65536
#define SHIFT_MASK ((1 << SHIFT_AMOUNT) - 1) // 65535 (all LSB set, all MSB clear)
int price = 500 << SHIFT_AMOUNT;
This is the value you must put in store (structure, database, whatever). Note that int is not necessarily 32 bits in C even though it is mostly the case nowadays. Also without further declaration, it is signed by default. You can add unsigned to the declaration to be sure. Better than that, you can use uint32_t or uint_least32_t (declared in stdint.h) if your code highly depends on the integer bit size (you may introduce some hacks about it). In doubt, use a typedef for your fixed-point type and you're safer.
When you want to do calculus on this value, you can use the 4 basic operators: +, -, * and /. You have to keep in mind that when adding and subtracting a value (+ and -), that value must also be shifted. Let's say we want to add 10 to our 500 price:
price += 10 << SHIFT_AMOUNT;
But for multiplication and division (* and /), the multiplier/divisor must NOT be shifted. Let's say we want to multiply by 3:
price *= 3;
Now let's make things more interesting by dividing the price by 4 so we make up for a non-zero fractional part:
price /= 4; // now our price is ((500 + 10) * 3) / 4 = 382.5
That's all about the rules. When you want to retrieve the real price at any point, you must right-shift:
printf("price integer is %d\n", price >> SHIFT_AMOUNT);
If you need the fractional part, you must mask it out:
printf ("price fraction is %d\n", price & SHIFT_MASK);
Of course, this value is not what we can call a decimal fraction, in fact it is an integer in the range [0 - 65535]. But it maps exactly with the decimal fraction range [0 - 0.9999...]. In other words, mapping looks like: 0 => 0, 32768 => 0.5, 65535 => 0.9999...
An easy way to see it as a decimal fraction is to resort to C built-in float arithmetic at this point:
printf("price fraction in decimal is %f\n", ((double)(price & SHIFT_MASK) / (1 << SHIFT_AMOUNT)));
But if you don't have FPU support (either hardware or software), you can use your new skills like this for complete price:
printf("price is roughly %d.%lld\n", price >> SHIFT_AMOUNT, (long long)(price & SHIFT_MASK) * 100000 / (1 << SHIFT_AMOUNT));
The number of 0's in the expression is roughly the number of digits you want after the decimal point. Don't overestimate the number of 0's given your fraction precision (no real trap here, that's quite obvious). Don't use simple long as sizeof(long) can be equal to sizeof(int). Use long long in case int is 32 bits as long long is guaranted to be 64 bits minimum (or use int64_t, int_least64_t and such, declared in stdint.h). In other words, use a type twice the size of your fixed-point type, that's fair enough. Finally, if you don't have access to >= 64 bits types, maybe it's time to exercice emulating them, at least for your output.
These are the basic ideas behind fixed-point arithmetics.
Be careful with negative values. It can becomes tricky sometimes, especially when it's time to show the final value. Besides, C is implementation-defined about signed integers (even though platforms where this is a problem are very uncommon nowadays). You should always make minimal tests in your environment to make sure everything goes as expected. If not, you can hack around it if you know what you do (I won't develop on this, but this has something to do with arithmetic shift vs logical shift and 2's complement representation). With unsigned integers however, you're mostly safe whatever you do as behaviors are well defined anyway.
Also take note that if a 32 bits integer can not represent values bigger than 232 - 1, using fixed-point arithmetic with 216 limits your range to 216 - 1! (and divide all of this by 2 with signed integers, which in our example would leave us with an available range of 215 - 1). The goal is then to choose a SHIFT_AMOUNT suitable to the situation. This is a tradeoff between integer part magnitude and fractional part precision.
Now for the real warnings: this technique is definitely not suitable in areas where precision is a top priority (financial, science, military...). Usual floating point (float/double) are also often not precise enough, even though they have better properties than fixed-point overall. Fixed-point has the same precision whatever the value (this can be an advantage in some cases), where floats precision is inversely proportional to the value magnitude (ie. the lower the magnitude, the more precision you get... well, this is more complex than that but you get the point). Also floats have a much greater magnitude than the equivalent (in number of bits) integers (fixed-point or not), to the cost of a loss of precision with high values (you can even reach a point of magnitude where adding 1 or even greater values will have no effect at all, something that cannot happen with integers).
If you work in those sensible areas, you're better off using libraries dedicated to the purpose of arbitrary precision (go take a look at gmplib, it's free). In computing science, essentially, gaining precision is about the number of bits you use to store your values. You want high precision? Use bits. That's all.
I see two options for you. If you are working in the financial services industry, there are probably standards that your code should comply with for precision and accuracy, so you'll just have to go along with that, regardless of memory cost. I understand that that business is generally well funded, so paying for more memory shouldn't be a problem. :)
If this is for personal use, then for maximum precision I recommend you use integers and multiply all prices by a fixed factor before storage. For example, if you want things accurate to the penny (probably not good enough), multiply all prices by 100 so that your unit is effectively cents instead of dollars and go from there. If you want more precision, multiply by more. For example, to be accurate to the hundredth of a cent (a standard that I have heard is commonly applied), multiply prices by 10000 (100 * 100).
Now with 32-bit integers, multiplying by 10000 leaves little room for large numbers of dollars. A practical 32-bit limit of 2 billion means that only prices as high as $20000 can be expressed: 2000000000 / 10000 = 20000. This gets worse if you multiply that 20000 by something, as there may be no room to hold the result. For this reason, I recommend using 64-bit integers (long long). Even if you multiply all prices by 10000, there is still plenty of headroom to hold large values, even across multiplications.
The trick with fixed-point is that whenever you do a calculation you need to remember that each value is really an underlying value multiplied by a constant. Before you add or subtract, you need to multiply values with a smaller constant to match those with a bigger constant. After you multiply, you need to divide by something to get the result back to being multiplied by the desired constant. If you use a non-power of two as your constant, you'll have to do an integer divide, which is expensive, time-wise. Many people use powers of two as their constants, so they can shift instead of divide.
If all this seems complicated, it is. I think the easiest option is to use doubles and buy more RAM if you need it. They have 53 bits of precision, which is roughly 9 quadrillion, or almost 16 decimal digits. Yes, you still might lose pennies when you are working with billions, but if you care about that, you're not being a billionaire the right way. :)
#Alex gave a fantastic answer here. However, I wanted to add some improvements to what he's done, by, for example, demonstrating how to do emulated-float (using integers to act like floats) rounding to any desired decimal place. I demonstrate that in my code below. I went a lot farther, though, and ended up writing a whole code tutorial to teach myself fixed-point math. Here it is:
my fixed_point_math tutorial: A tutorial-like practice code to learn how to do fixed-point math, manual "float"-like prints using integers only,
"float"-like integer rounding, and fractional fixed-point math on large integers.
If you really want to learn fixed-point math, I think this is valuable code to carefully go through, but it took me an entire weekend to write, so expect it to take you perhaps a couple hours to thoroughly go through it all. The basics of the rounding stuff can be found right at the top section, however, and learned in just a few minutes.
My full code on GitHub: https://github.com/ElectricRCAircraftGuy/fixed_point_math.
Or, below (truncated, because Stack Overflow won't allow that many characters):
/*
fixed_point_math tutorial
- A tutorial-like practice code to learn how to do fixed-point math, manual "float"-like prints using integers only,
"float"-like integer rounding, and fractional fixed-point math on large integers.
By Gabriel Staples
www.ElectricRCAircraftGuy.com
- email available via the Contact Me link at the top of my website.
Started: 22 Dec. 2018
Updated: 25 Dec. 2018
References:
- https://stackoverflow.com/questions/10067510/fixed-point-arithmetic-in-c-programming
Commands to Compile & Run:
As a C program (the file must NOT have a C++ file extension or it will be automatically compiled as C++, so we will
make a copy of it and change the file extension to .c first):
See here: https://stackoverflow.com/a/3206195/4561887.
cp fixed_point_math.cpp fixed_point_math_copy.c && gcc -Wall -std=c99 -o ./bin/fixed_point_math_c fixed_point_math_copy.c && ./bin/fixed_point_math_c
As a C++ program:
g++ -Wall -o ./bin/fixed_point_math_cpp fixed_point_math.cpp && ./bin/fixed_point_math_cpp
*/
#include <stdbool.h>
#include <stdio.h>
#include <stdint.h>
// Define our fixed point type.
typedef uint32_t fixed_point_t;
#define BITS_PER_BYTE 8
#define FRACTION_BITS 16 // 1 << 16 = 2^16 = 65536
#define FRACTION_DIVISOR (1 << FRACTION_BITS)
#define FRACTION_MASK (FRACTION_DIVISOR - 1) // 65535 (all LSB set, all MSB clear)
// // Conversions [NEVERMIND, LET'S DO THIS MANUALLY INSTEAD OF USING THESE MACROS TO HELP ENGRAIN IT IN US BETTER]:
// #define INT_2_FIXED_PT_NUM(num) (num << FRACTION_BITS) // Regular integer number to fixed point number
// #define FIXED_PT_NUM_2_INT(fp_num) (fp_num >> FRACTION_BITS) // Fixed point number back to regular integer number
// Private function prototypes:
static void print_if_error_introduced(uint8_t num_digits_after_decimal);
int main(int argc, char * argv[])
{
printf("Begin.\n");
// We know how many bits we will use for the fraction, but how many bits are remaining for the whole number,
// and what's the whole number's max range? Let's calculate it.
const uint8_t WHOLE_NUM_BITS = sizeof(fixed_point_t)*BITS_PER_BYTE - FRACTION_BITS;
const fixed_point_t MAX_WHOLE_NUM = (1 << WHOLE_NUM_BITS) - 1;
printf("fraction bits = %u.\n", FRACTION_BITS);
printf("whole number bits = %u.\n", WHOLE_NUM_BITS);
printf("max whole number = %u.\n\n", MAX_WHOLE_NUM);
// Create a variable called `price`, and let's do some fixed point math on it.
const fixed_point_t PRICE_ORIGINAL = 503;
fixed_point_t price = PRICE_ORIGINAL << FRACTION_BITS;
price += 10 << FRACTION_BITS;
price *= 3;
price /= 7; // now our price is ((503 + 10)*3/7) = 219.857142857.
printf("price as a true double is %3.9f.\n", ((double)PRICE_ORIGINAL + 10)*3/7);
printf("price as integer is %u.\n", price >> FRACTION_BITS);
printf("price fractional part is %u (of %u).\n", price & FRACTION_MASK, FRACTION_DIVISOR);
printf("price fractional part as decimal is %f (%u/%u).\n", (double)(price & FRACTION_MASK) / FRACTION_DIVISOR,
price & FRACTION_MASK, FRACTION_DIVISOR);
// Now, if you don't have float support (neither in hardware via a Floating Point Unit [FPU], nor in software
// via built-in floating point math libraries as part of your processor's C implementation), then you may have
// to manually print the whole number and fractional number parts separately as follows. Look for the patterns.
// Be sure to make note of the following 2 points:
// - 1) the digits after the decimal are determined by the multiplier:
// 0 digits: * 10^0 ==> * 1 <== 0 zeros
// 1 digit : * 10^1 ==> * 10 <== 1 zero
// 2 digits: * 10^2 ==> * 100 <== 2 zeros
// 3 digits: * 10^3 ==> * 1000 <== 3 zeros
// 4 digits: * 10^4 ==> * 10000 <== 4 zeros
// 5 digits: * 10^5 ==> * 100000 <== 5 zeros
// - 2) Be sure to use the proper printf format statement to enforce the proper number of leading zeros in front of
// the fractional part of the number. ie: refer to the "%01", "%02", "%03", etc. below.
// Manual "floats":
// 0 digits after the decimal
printf("price (manual float, 0 digits after decimal) is %u.",
price >> FRACTION_BITS); print_if_error_introduced(0);
// 1 digit after the decimal
printf("price (manual float, 1 digit after decimal) is %u.%01lu.",
price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 10 / FRACTION_DIVISOR);
print_if_error_introduced(1);
// 2 digits after decimal
printf("price (manual float, 2 digits after decimal) is %u.%02lu.",
price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 100 / FRACTION_DIVISOR);
print_if_error_introduced(2);
// 3 digits after decimal
printf("price (manual float, 3 digits after decimal) is %u.%03lu.",
price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 1000 / FRACTION_DIVISOR);
print_if_error_introduced(3);
// 4 digits after decimal
printf("price (manual float, 4 digits after decimal) is %u.%04lu.",
price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 10000 / FRACTION_DIVISOR);
print_if_error_introduced(4);
// 5 digits after decimal
printf("price (manual float, 5 digits after decimal) is %u.%05lu.",
price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 100000 / FRACTION_DIVISOR);
print_if_error_introduced(5);
// 6 digits after decimal
printf("price (manual float, 6 digits after decimal) is %u.%06lu.",
price >> FRACTION_BITS, (uint64_t)(price & FRACTION_MASK) * 1000000 / FRACTION_DIVISOR);
print_if_error_introduced(6);
printf("\n");
// Manual "floats" ***with rounding now***:
// - To do rounding with integers, the concept is best understood by examples:
// BASE 10 CONCEPT:
// 1. To round to the nearest whole number:
// Add 1/2 to the number, then let it be truncated since it is an integer.
// Examples:
// 1.5 + 1/2 = 1.5 + 0.5 = 2.0. Truncate it to 2. Good!
// 1.99 + 0.5 = 2.49. Truncate it to 2. Good!
// 1.49 + 0.5 = 1.99. Truncate it to 1. Good!
// 2. To round to the nearest tenth place:
// Multiply by 10 (this is equivalent to doing a single base-10 left-shift), then add 1/2, then let
// it be truncated since it is an integer, then divide by 10 (this is a base-10 right-shift).
// Example:
// 1.57 x 10 + 1/2 = 15.7 + 0.5 = 16.2. Truncate to 16. Divide by 10 --> 1.6. Good.
// 3. To round to the nearest hundredth place:
// Multiply by 100 (base-10 left-shift 2 places), add 1/2, truncate, divide by 100 (base-10
// right-shift 2 places).
// Example:
// 1.579 x 100 + 1/2 = 157.9 + 0.5 = 158.4. Truncate to 158. Divide by 100 --> 1.58. Good.
//
// BASE 2 CONCEPT:
// - We are dealing with fractional numbers stored in base-2 binary bits, however, and we have already
// left-shifted by FRACTION_BITS (num << FRACTION_BITS) when we converted our numbers to fixed-point
// numbers. Therefore, *all we have to do* is add the proper value, and we get the same effect when we
// right-shift by FRACTION_BITS (num >> FRACTION_BITS) in our conversion back from fixed-point to regular
// numbers. Here's what that looks like for us:
// - Note: "addend" = "a number that is added to another".
// (see https://www.google.com/search?q=addend&oq=addend&aqs=chrome.0.0l6.1290j0j7&sourceid=chrome&ie=UTF-8).
// - Rounding to 0 digits means simply rounding to the nearest whole number.
// Round to: Addends:
// 0 digits: add 5/10 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/2
// 1 digits: add 5/100 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/20
// 2 digits: add 5/1000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/200
// 3 digits: add 5/10000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/2000
// 4 digits: add 5/100000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/20000
// 5 digits: add 5/1000000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/200000
// 6 digits: add 5/10000000 * FRACTION_DIVISOR ==> + FRACTION_DIVISOR/2000000
// etc.
printf("WITH MANUAL INTEGER-BASED ROUNDING:\n");
// Calculate addends used for rounding (see definition of "addend" above).
fixed_point_t addend0 = FRACTION_DIVISOR/2;
fixed_point_t addend1 = FRACTION_DIVISOR/20;
fixed_point_t addend2 = FRACTION_DIVISOR/200;
fixed_point_t addend3 = FRACTION_DIVISOR/2000;
fixed_point_t addend4 = FRACTION_DIVISOR/20000;
fixed_point_t addend5 = FRACTION_DIVISOR/200000;
// Print addends used for rounding.
printf("addend0 = %u.\n", addend0);
printf("addend1 = %u.\n", addend1);
printf("addend2 = %u.\n", addend2);
printf("addend3 = %u.\n", addend3);
printf("addend4 = %u.\n", addend4);
printf("addend5 = %u.\n", addend5);
// Calculate rounded prices
fixed_point_t price_rounded0 = price + addend0; // round to 0 decimal digits
fixed_point_t price_rounded1 = price + addend1; // round to 1 decimal digits
fixed_point_t price_rounded2 = price + addend2; // round to 2 decimal digits
fixed_point_t price_rounded3 = price + addend3; // round to 3 decimal digits
fixed_point_t price_rounded4 = price + addend4; // round to 4 decimal digits
fixed_point_t price_rounded5 = price + addend5; // round to 5 decimal digits
// Print manually rounded prices of manually-printed fixed point integers as though they were "floats".
printf("rounded price (manual float, rounded to 0 digits after decimal) is %u.\n",
price_rounded0 >> FRACTION_BITS);
printf("rounded price (manual float, rounded to 1 digit after decimal) is %u.%01lu.\n",
price_rounded1 >> FRACTION_BITS, (uint64_t)(price_rounded1 & FRACTION_MASK) * 10 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 2 digits after decimal) is %u.%02lu.\n",
price_rounded2 >> FRACTION_BITS, (uint64_t)(price_rounded2 & FRACTION_MASK) * 100 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 3 digits after decimal) is %u.%03lu.\n",
price_rounded3 >> FRACTION_BITS, (uint64_t)(price_rounded3 & FRACTION_MASK) * 1000 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 4 digits after decimal) is %u.%04lu.\n",
price_rounded4 >> FRACTION_BITS, (uint64_t)(price_rounded4 & FRACTION_MASK) * 10000 / FRACTION_DIVISOR);
printf("rounded price (manual float, rounded to 5 digits after decimal) is %u.%05lu.\n",
price_rounded5 >> FRACTION_BITS, (uint64_t)(price_rounded5 & FRACTION_MASK) * 100000 / FRACTION_DIVISOR);
// =================================================================================================================
printf("\nRELATED CONCEPT: DOING LARGE-INTEGER MATH WITH SMALL INTEGER TYPES:\n");
// RELATED CONCEPTS:
// Now let's practice handling (doing math on) large integers (ie: large relative to their integer type),
// withOUT resorting to using larger integer types (because they may not exist for our target processor),
// and withOUT using floating point math, since that might also either not exist for our processor, or be too
// slow or program-space-intensive for our application.
// - These concepts are especially useful when you hit the limits of your architecture's integer types: ex:
// if you have a uint64_t nanosecond timestamp that is really large, and you need to multiply it by a fraction
// to convert it, but you don't have uint128_t types available to you to multiply by the numerator before
// dividing by the denominator. What do you do?
// - We can use fixed-point math to achieve desired results. Let's look at various approaches.
// - Let's say my goal is to multiply a number by a fraction < 1 withOUT it ever growing into a larger type.
// - Essentially we want to multiply some really large number (near its range limit for its integer type)
// by some_number/some_larger_number (ie: a fraction < 1). The problem is that if we multiply by the numerator
// first, it will overflow, and if we divide by the denominator first we will lose resolution via bits
// right-shifting out.
// Here are various examples and approaches.
// -----------------------------------------------------
// EXAMPLE 1
// Goal: Use only 16-bit values & math to find 65401 * 16/127.
// Result: Great! All 3 approaches work, with the 3rd being the best. To learn the techniques required for the
// absolute best approach of all, take a look at the 8th approach in Example 2 below.
// -----------------------------------------------------
uint16_t num16 = 65401; // 1111 1111 0111 1001
uint16_t times = 16;
uint16_t divide = 127;
printf("\nEXAMPLE 1\n");
// Find the true answer.
// First, let's cheat to know the right answer by letting it grow into a larger type.
// Multiply *first* (before doing the divide) to avoid losing resolution.
printf("%u * %u/%u = %u. <== true answer\n", num16, times, divide, (uint32_t)num16*times/divide);
// 1st approach: just divide first to prevent overflow, and lose precision right from the start.
uint16_t num16_result = num16/divide * times;
printf("1st approach (divide then multiply):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the initial divide.\n", num16_result);
// 2nd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers,
// placing all 8 bits of each sub-number to the ***far right***, with 8 bits on the left to grow
// into when multiplying. Then, multiply and divide each part separately.
// - The problem, however, is that you'll lose meaningful resolution on the upper-8-bit number when you
// do the division, since there's no bits to the right for the right-shifted bits during division to
// be retained in.
// Re-sum both sub-numbers at the end to get the final result.
// - NOTE THAT 257 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,0000,1111,1111 = 65536/255 = 257.00392.
// Therefore, any *times* value larger than this will cause overflow.
uint16_t num16_upper8 = num16 >> 8; // 1111 1111
uint16_t num16_lower8 = num16 & 0xFF; // 0111 1001
num16_upper8 *= times;
num16_lower8 *= times;
num16_upper8 /= divide;
num16_lower8 /= divide;
num16_result = (num16_upper8 << 8) + num16_lower8;
printf("2nd approach (split into 2 8-bit sub-numbers with bits at far right):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the divide.\n", num16_result);
// 3rd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers,
// placing all 8 bits of each sub-number ***in the center***, with 4 bits on the left to grow when
// multiplying and 4 bits on the right to not lose as many bits when dividing.
// This will help stop the loss of resolution when we divide, at the cost of overflowing more easily when we
// multiply.
// - NOTE THAT 16 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,1111,1111,0000 = 65536/4080 = 16.0627.
// Therefore, any *times* value larger than this will cause overflow.
num16_upper8 = (num16 >> 4) & 0x0FF0;
num16_lower8 = (num16 << 4) & 0x0FF0;
num16_upper8 *= times;
num16_lower8 *= times;
num16_upper8 /= divide;
num16_lower8 /= divide;
num16_result = (num16_upper8 << 4) + (num16_lower8 >> 4);
printf("3rd approach (split into 2 8-bit sub-numbers with bits centered):\n");
printf(" num16_result = %u. <== Perfect! Retains the bits that right-shift during the divide.\n", num16_result);
// -----------------------------------------------------
// EXAMPLE 2
// Goal: Use only 16-bit values & math to find 65401 * 99/127.
// Result: Many approaches work, so long as enough bits exist to the left to not allow overflow during the
// multiply. The best approach is the 8th one, however, which 1) right-shifts the minimum possible before the
// multiply, in order to retain as much resolution as possible, and 2) does integer rounding during the divide
// in order to be as accurate as possible. This is the best approach to use.
// -----------------------------------------------------
num16 = 65401; // 1111 1111 0111 1001
times = 99;
divide = 127;
printf("\nEXAMPLE 2\n");
// Find the true answer by letting it grow into a larger type.
printf("%u * %u/%u = %u. <== true answer\n", num16, times, divide, (uint32_t)num16*times/divide);
// 1st approach: just divide first to prevent overflow, and lose precision right from the start.
num16_result = num16/divide * times;
printf("1st approach (divide then multiply):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the initial divide.\n", num16_result);
// 2nd approach: split the 16-bit number into 2 8-bit numbers stored in 16-bit numbers,
// placing all 8 bits of each sub-number to the ***far right***, with 8 bits on the left to grow
// into when multiplying. Then, multiply and divide each part separately.
// - The problem, however, is that you'll lose meaningful resolution on the upper-8-bit number when you
// do the division, since there's no bits to the right for the right-shifted bits during division to
// be retained in.
// Re-sum both sub-numbers at the end to get the final result.
// - NOTE THAT 257 IS THE HIGHEST *TIMES* VALUE I CAN USE SINCE 2^16/0b0000,0000,1111,1111 = 65536/255 = 257.00392.
// Therefore, any *times* value larger than this will cause overflow.
num16_upper8 = num16 >> 8; // 1111 1111
num16_lower8 = num16 & 0xFF; // 0111 1001
num16_upper8 *= times;
num16_lower8 *= times;
num16_upper8 /= divide;
num16_lower8 /= divide;
num16_result = (num16_upper8 << 8) + num16_lower8;
printf("2nd approach (split into 2 8-bit sub-numbers with bits at far right):\n");
printf(" num16_result = %u. <== Loses bits that right-shift out during the divide.\n", num16_result);
/////////////////////////////////////////////////////////////////////////////////////////////////
// TRUNCATED BECAUSE STACK OVERFLOW WON'T ALLOW THIS MANY CHARACTERS.
// See the rest of the code on github: https://github.com/ElectricRCAircraftGuy/fixed_point_math
/////////////////////////////////////////////////////////////////////////////////////////////////
return 0;
} // main
// PRIVATE FUNCTION DEFINITIONS:
/// #brief A function to help identify at what decimal digit error is introduced, based on how many bits you are using
/// to represent the fractional portion of the number in your fixed-point number system.
/// #details Note: this function relies on an internal static bool to keep track of if it has already
/// identified at what decimal digit error is introduced, so once it prints this fact once, it will never
/// print again. This is by design just to simplify usage in this demo.
/// #param[in] num_digits_after_decimal The number of decimal digits we are printing after the decimal
/// (0, 1, 2, 3, etc)
/// #return None
static void print_if_error_introduced(uint8_t num_digits_after_decimal)
{
static bool already_found = false;
// Array of power base 10 values, where the value = 10^index:
const uint32_t POW_BASE_10[] =
{
1, // index 0 (10^0)
10,
100,
1000,
10000,
100000,
1000000,
10000000,
100000000,
1000000000, // index 9 (10^9); 1 Billion: the max power of 10 that can be stored in a uint32_t
};
if (already_found == true)
{
goto done;
}
if (POW_BASE_10[num_digits_after_decimal] > FRACTION_DIVISOR)
{
already_found = true;
printf(" <== Fixed-point math decimal error first\n"
" starts to get introduced here since the fixed point resolution (1/%u) now has lower resolution\n"
" than the base-10 resolution (which is 1/%u) at this decimal place. Decimal error may not show\n"
" up at this decimal location, per say, but definitely will for all decimal places hereafter.",
FRACTION_DIVISOR, POW_BASE_10[num_digits_after_decimal]);
}
done:
printf("\n");
}
Output:
gabriel$ cp fixed_point_math.cpp fixed_point_math_copy.c && gcc -Wall -std=c99 -o ./bin/fixed_point_math_c > fixed_point_math_copy.c && ./bin/fixed_point_math_c
Begin.
fraction bits = 16.
whole number bits = 16.
max whole number = 65535.
price as a true double is 219.857142857.
price as integer is 219.
price fractional part is 56173 (of 65536).
price fractional part as decimal is 0.857132 (56173/65536).
price (manual float, 0 digits after decimal) is 219.
price (manual float, 1 digit after decimal) is 219.8.
price (manual float, 2 digits after decimal) is 219.85.
price (manual float, 3 digits after decimal) is 219.857.
price (manual float, 4 digits after decimal) is 219.8571.
price (manual float, 5 digits after decimal) is 219.85713. <== Fixed-point math decimal error first
starts to get introduced here since the fixed point resolution (1/65536) now has lower resolution
than the base-10 resolution (which is 1/100000) at this decimal place. Decimal error may not show
up at this decimal location, per say, but definitely will for all decimal places hereafter.
price (manual float, 6 digits after decimal) is 219.857131.
WITH MANUAL INTEGER-BASED ROUNDING:
addend0 = 32768.
addend1 = 3276.
addend2 = 327.
addend3 = 32.
addend4 = 3.
addend5 = 0.
rounded price (manual float, rounded to 0 digits after decimal) is 220.
rounded price (manual float, rounded to 1 digit after decimal) is 219.9.
rounded price (manual float, rounded to 2 digits after decimal) is 219.86.
rounded price (manual float, rounded to 3 digits after decimal) is 219.857.
rounded price (manual float, rounded to 4 digits after decimal) is 219.8571.
rounded price (manual float, rounded to 5 digits after decimal) is 219.85713.
RELATED CONCEPT: DOING LARGE-INTEGER MATH WITH SMALL INTEGER TYPES:
EXAMPLE 1
65401 * 16/127 = 8239. <== true answer
1st approach (divide then multiply):
num16_result = 8224. <== Loses bits that right-shift out during the initial divide.
2nd approach (split into 2 8-bit sub-numbers with bits at far right):
num16_result = 8207. <== Loses bits that right-shift out during the divide.
3rd approach (split into 2 8-bit sub-numbers with bits centered):
num16_result = 8239. <== Perfect! Retains the bits that right-shift during the divide.
EXAMPLE 2
65401 * 99/127 = 50981. <== true answer
1st approach (divide then multiply):
num16_result = 50886. <== Loses bits that right-shift out during the initial divide.
2nd approach (split into 2 8-bit sub-numbers with bits at far right):
num16_result = 50782. <== Loses bits that right-shift out during the divide.
3rd approach (split into 2 8-bit sub-numbers with bits centered):
num16_result = 1373. <== Completely wrong due to overflow during the multiply.
4th approach (split into 4 4-bit sub-numbers with bits centered):
num16_result = 15870. <== Completely wrong due to overflow during the multiply.
5th approach (split into 8 2-bit sub-numbers with bits centered):
num16_result = 50922. <== Loses a few bits that right-shift out during the divide.
6th approach (split into 16 1-bit sub-numbers with bits skewed left):
num16_result = 50963. <== Loses the fewest possible bits that right-shift out during the divide.
7th approach (split into 16 1-bit sub-numbers with bits skewed left):
num16_result = 50963. <== [same as 6th approach] Loses the fewest possible bits that right-shift out during the divide.
[BEST APPROACH OF ALL] 8th approach (split into 16 1-bit sub-numbers with bits skewed left, w/integer rounding during division):
num16_result = 50967. <== Loses the fewest possible bits that right-shift out during the divide,
& has better accuracy due to rounding during the divide.
References:
[my repo] https://github.com/ElectricRCAircraftGuy/eRCaGuy_analogReadXXbit/blob/master/eRCaGuy_analogReadXXbit.cpp - see "Integer math rounding notes" at bottom.
I would not recommend you to do so, if your only purpose is to save memory. The error in the calculation of price can be accumulated and you are going to screw up on it.
If you really want to implement similar stuff, can you just take the minimum interval of the price and then directly use int and integer operation to manipulate your number? You only need to convert it to the floating point number when display, which make your life easier.
Related
How is float to int type conversion done in C? [duplicate]
I was wondering if you could help explain the process on converting an integer to float, or a float to an integer. For my class, we are to do this using only bitwise operators, but I think a firm understanding on the casting from type to type will help me more in this stage. From what I know so far, for int to float, you will have to convert the integer into binary, normalize the value of the integer by finding the significand, exponent, and fraction, and then output the value in float from there? As for float to int, you will have to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get an int value? I tried to follow the instructions from this question: Casting float to int (bitwise) in C. But I was not really able to understand it. Also, could someone explain why rounding will be necessary for values greater than 23 bits when converting int to float?
First, a paper you should consider reading, if you want to understand floating point foibles better: "What Every Computer Scientist Should Know About Floating Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf And now to some meat. The following code is bare bones, and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 224. That's the format you're most likely to encounter on modern hardware, and it's the format you seem to reference in your original question. IEEE-754 single-precision floats are divided into three fields: A single sign bit, 8 bits of exponent, and 23 bits of significand (sometimes called a mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits total. The bits are packed left to right, with the sign bit in bit 31, exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates: The exponent has a bias of 127, meaning that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 therefore would be encoded as 127. (Note: The full Wikipedia article may be interesting to you. Ref: http://en.wikipedia.org/wiki/Single_precision_floating-point_format ) Therefore, the IEEE-754 number 0x40000000 is interpreted as follows: Bit 31 = 0: Positive value Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (aka. 21) Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note I restored the hidden 1). So the value is 1.0 x 21 = 2.0. To convert an unsigned int in the limited range given above, then, to something in IEEE-754 format, you might use a function like the one below. It takes the following steps: Aligns the leading 1 of the integer to the position of the hidden 1 in the floating point representation. While aligning the integer, records the total number of shifts made. Masks away the hidden 1. Using the number of shifts made, computes the exponent and appends it to the number. Using reinterpret_cast, converts the resulting bit-pattern to a float. This part is an ugly hack, because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly. There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient: float uint_to_float(unsigned int significand) { // Only support 0 < significand < 1 << 24. if (significand == 0 || significand >= 1 << 24) return -1.0; // or abort(); or whatever you'd like here. int shifts = 0; // Align the leading 1 of the significand to the hidden-1 // position. Count the number of shifts required. while ((significand & (1 << 23)) == 0) { significand <<= 1; shifts++; } // The number 1.0 has an exponent of 0, and would need to be // shifted left 23 times. The number 2.0, however, has an // exponent of 1 and needs to be shifted left only 22 times. // Therefore, the exponent should be (23 - shifts). IEEE-754 // format requires a bias of 127, though, so the exponent field // is given by the following expression: unsigned int exponent = 127 + 23 - shifts; // Now merge significand and exponent. Be sure to strip away // the hidden 1 in the significand. unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF); // Reinterpret as a float and return. This is an evil hack. return *reinterpret_cast< float* >( &merged ); } You can make this process more efficient using functions that detect the leading 1 in a number. (These sometimes go by names like clz for "count leading zeros", or norm for "normalize".) You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number. For integers >= 224, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to "round": You lose LSBs in order to make the value fit. Thus, multiple integers will end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round toward nearest even). But the fact of the matter is you can't shove 24 bits into fewer than 24 bits without some loss. You can see this in terms of the code above. It works by aligning the leading 1 to the hidden 1 position. If a value was >= 224, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits shifted away.
Have you checked the IEEE 754 floating-point representation? In 32-bit normalized form, it has (mantissa's) sign bit, 8-bit exponent (excess-127, I think) and 23-bit mantissa in "decimal" except that the "0." is dropped (always in that form) and the radix is 2, not 10. That is: the MSB value is 1/2, the next bit 1/4 and so on.
Joe Z's answer is elegant but range of input values is highly limited. 32 bit float can store all integer values from the following range: [-224...+224] = [-16777216...+16777216] and some other values outside this range. The whole range would be covered by this: float int2float(int value) { // handles all values from [-2^24...2^24] // outside this range only some integers may be represented exactly // this method will use truncation 'rounding mode' during conversion // we can safely reinterpret it as 0.0 if (value == 0) return 0.0; if (value == (1U<<31)) // ie -2^31 { // -(-2^31) = -2^31 so we'll not be able to handle it below - use const // value = 0xCF000000; return (float)INT_MIN; // *((float*)&value); is undefined behaviour } int sign = 0; // handle negative values if (value < 0) { sign = 1U << 31; value = -value; } // although right shift of signed is undefined - all compilers (that I know) do // arithmetic shift (copies sign into MSB) is what I prefer here // hence using unsigned abs_value_copy for shift unsigned int abs_value_copy = value; // find leading one int bit_num = 31; int shift_count = 0; for(; bit_num > 0; bit_num--) { if (abs_value_copy & (1U<<bit_num)) { if (bit_num >= 23) { // need to shift right shift_count = bit_num - 23; abs_value_copy >>= shift_count; } else { // need to shift left shift_count = 23 - bit_num; abs_value_copy <<= shift_count; } break; } } // exponent is biased by 127 int exp = bit_num + 127; // clear leading 1 (bit #23) (it will implicitly be there but not stored) int coeff = abs_value_copy & ~(1<<23); // move exp to the right place exp <<= 23; union { int rint; float rfloat; }ret = { sign | exp | coeff }; return ret.rfloat; } Of course there are other means to find abs value of int (branchless). Similarly couting leading zeros can also be done without a branch so treat this example as example ;-).
how to convert negative hexadecimal to decimal
hi I want to know how it is possible to convert a hexadecimal negative value (to complement encoding) to decimal, easily without converting hexadecimal to binary and then multiplying each bit in by a power of 2 and sums all the value to get the result, it takes too much time : example of number (32 bits) : 0xFFFFFE58 so how can I do it?
without using a computer you can calculate it like this: 0xFFFF FE58 = - 0x1A8 = -(1 * 16² + 10 * 16 + 8) = -(256 + 160 + 8) = -424 0xFFFF FE58 is a negative number in 2's complement. To get the absolute value you have to invert all bits and add 1 in binary. You also can subtract this number from the first number out of range (0x1 0000 0000) 0x100000000 -0x0FFFFFE58 = 0x0000001A8 now we know that your number is -0x1A8. now you have to add up the digits multiplied with their place value. 8 * 16^0 + A (which is 10) * 16^1 + 1 * 16^2 = 424. So the decimal value of your number is -424.
do a calculation on the positive number, then convert to the negative with Two's complement. see explanation here for positive conversion from hexa to decimal: http://www.permadi.com/tutorial/numHexToDec/ basically: Get the right most digit of the hex number, call this digit the currentDigit. Make a variable, let's call it power. Set the value to 0. Multiply the current digit with (16^power), store the result. Increment power by 1. Set the the currentDigit to the previous digit of the hex number. Repeat from step 3 until all digits have been multiplied. Sum the result of step 3 to get the answer number.
Hex remove leading digits
When you do something like 0x01AE1 - 0x01AEA = fffffff7. I only want the last 3 digits. So I used the modulus trick to remove the extra digits. The displacement gets filled with hex values. int extra_crap = 0; int extra_crap1 = 0; int displacement = 0; int val1 = 0; int val2 = 0; displacement val1 - val2; extra_crap = displacement % 0x100; extra_crap1 = displacement % 256; printf(" extra_crap is %x \n", extra_crap); printf(" extra_crap1 is %x \n", extra_crap1); Unfortunately this is having no effect at all. Is there another way to remove all but the last 3 digits?
'Unfortunately this is having no effect at all.' That's probably because you do your calculations on signed int. Try casting the value to unsigned, or simply forget the remainder operator % and use bitwise masking: displacement & 0xFF; displacement & 255; for two hex digits or displacement & 0xFFF; displacement & 4095; for three digits. EDIT – some explanation A detailed answer would be quite long... You need to learn about data types used in C (esp. int and unsigned int, which are two of most used Integral types), the range of values that can be represented in those types and their internal representation in Two's complement code. Also about Integer overflow and Hexadecimal system. Then you will easily get what happened to your data: subtracting 0x01AE1 - 0x01AEA, that is 6881 - 6890, gave the result of -9, which in 32-bit signed integer encoded with 2's complement and printed in hexadecimal is FFFFFFF7. That MINUS NINE divided by 256 gave a quotient ZERO and Remainder MINUS NINE, so the remainder operator % gave you a precise and correct result. What you call 'no effect at all' is just a result of your lack of understanding what you were actually doing. My answer above (variant 1) is not any kind of magic, but just a way to enforce calculation on positive numbers. Casting values to unsigned type makes the program to interpret 0xFFFFFFF7 as 4294967287, which divided by 265 (0x100 in hex) results in quotient 16777215 (0xFFFFFF) and remainder 247 (0xF7). Variant 2 does no division at all and just 'masks' those necessary bits: numbers 255 and 4095 contain 8 and 12 low-order bits equal 1 (in hexadecimal 0xFF and 0xFFF, respectively), so bitwise AND does exactly what you want: removes the higher part of the value, leaving just the required two or three low-order hex dgits.
16bit Float Multiplication in C
I'm working on a small project, where I need float multiplication with 16bit floats (half precision). Unhappily, I'm facing some problems with the algorithm: Example Output 1 * 5 = 5 2 * 5 = 10 3 * 5 = 14.5 4 * 5 = 20 5 * 5 = 24.5 100 * 4 = 100 100 * 5 = 482 The Source Code const int bits = 16; const int exponent_length = 5; const int fraction_length = 10; const int bias = pow(2, exponent_length - 1) - 1; const int exponent_mask = ((1 << 5) - 1) << fraction_length; const int fraction_mask = (1 << fraction_length) - 1; const int hidden_bit = (1 << 10); // Was 1 << 11 before update 1 int float_mul(int f1, int f2) { int res_exp = 0; int res_frac = 0; int result = 0; int exp1 = (f1 & exponent_mask) >> fraction_length; int exp2 = (f2 & exponent_mask) >> fraction_length; int frac1 = (f1 & fraction_mask) | hidden_bit; int frac2 = (f2 & fraction_mask) | hidden_bit; // Add exponents res_exp = exp1 + exp2 - bias; // Remove double bias // Multiply significants res_frac = frac1 * frac2; // 11 bit * 11 bit → 22 bit! // Shift 22bit int right to fit into 10 bit if (highest_bit_pos(res_mant) == 21) { res_mant >>= 11; res_exp += 1; } else { res_mant >>= 10; } res_frac &= ~hidden_bit; // Remove hidden bit // Construct float return (res_exp << bits - exponent_length - 1) | res_frac; } By the way: I'm storing the floats in ints, because I'll try to port this code to some kind of Assembler w/o float point operations later. The Question Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident? Disclaimer: I'm not a CompSci student, it's a leisure project ;) Update #1 Thanks to the comment by Eric Postpischil I noticed one problem with the code: the hidden_bit flag was off by one (should be 1 << 10). With that change, I don't get decimal places any more, but still some calculations are off (e.g. 3•3=20). I assume, it's the res_frac shift as descibred in the answers. Update #2 The second problem with the code was indeed the res_frac shifting. After update #1 I got wrong results when having 22 bit results of frac1 * frac2. I've updated the code above with a the corrected shift statement. Thanks to all for every comment and answer! :)
From a cursory look: No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each their high bit set, may produce a 21- or 22-bit number. (Example with two-bit numbers: 102•102 is 1002, three bits, but 112•112 is 10012, four bits.) The result is truncated instead of rounded. Signs are ignored. Subnormal numbers are not handled, on input or output. 11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding. In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places. From a more detailed look by chux: res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).
This is more a suggestion for how to make it easier to get your code right, rather than analysis of what is wrong with the existing code. There are a number of steps that are common to some or all of the floating point arithmetic operations. I suggest extracting each into a function that can be written with focus on one issue, and tested separately. Then when you come to write e.g. multiplication, you only have to deal with the specifics of that operation. All the operations will be easier working with a structure that has the actual signed exponent, and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit. Here are some sample operations that could be separate functions, at least until you get it working: unpack: Take a 16 bit float and extract the exponent and significand into a struct. pack: Undo unpack - deal with dropping the hidden bit, applying the bias the expoent, and combining them into a float. normalize: Shift the significand and adjust the exponent to bring the most significant 1-bit to a specified bit position. round: Apply your rounding rules to drop low significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard digit that is the most significant bit that will be dropped, and an additional bit indicating if there are any one bits of lower significance than the guard bit.
One problem is that you are truncating instead of rounding: res_frac >>= 11; // Shift 22bit int right to fit into 10 bit You should compute res_frac & 0x7ff first, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate. If it is above, round away from zero. If it is equal to 0x400, round to the even alternative.
Homework - C bit puzzle - Perform % using C bit operations (no looping, conditionals, function calls, etc)
I'm completely stuck on how to do this homework problem and looking for a hint or two to keep me going. I'm limited to 20 operations (= doesn't count in this 20). I'm supposed to fill in a function that looks like this: /* Supposed to do x%(2^n). For example: for x = 15 and n = 2, the result would be 3. Additionally, if positive overflow occurs, the result should be the maximum positive number, and if negative overflow occurs, the result should be the most negative number. */ int remainder_power_of_2(int x, int n){ int twoToN = 1 << n; /* Magic...? How can I do this without looping? We are assuming it is a 32 bit machine, and we can't use constants bigger than 8 bits (0xFF is valid for example). However, I can make a 32 bit number by ORing together a bunch of stuff. Valid operations are: << >> + ~ ! | & ^ */ return theAnswer; } I was thinking maybe I could shift the twoToN over left... until I somehow check (without if/else) that it is bigger than x, and then shift back to the right once... then xor it with x... and repeat? But I only have 20 operations!
Hint: In decadic system to do a modulo by power of 10, you just leave the last few digits and null the other. E.g. 12345 % 100 = 00045 = 45. Well, in computer numbers are binary. So you have to null the binary digits (bits). So look at various bit manipulation operators (&, |, ^) to do so.
Since binary is base 2, remainders mod 2^N are exactly represented by the rightmost bits of a value. For example, consider the following 32 bit integer: 00000000001101001101000110010101 This has the two's compliment value of 3461525. The remainder mod 2 is exactly the last bit (1). The remainder mod 4 (2^2) is exactly the last 2 bits (01). The remainder mod 8 (2^3) is exactly the last 3 bits (101). Generally, the remainder mod 2^N is exactly the last N bits. In short, you need to be able to take your input number, and mask it somehow to get only the last few bits. A tip: say you're using mod 64. The value of 64 in binary is: 00000000000000000000000001000000 The modulus you're interested in is the last 6 bits. I'll provide you a sequence of operations that can transform that number into a mask (but I'm not going to tell you what they are, you can figure them out yourself :D) 00000000000000000000000001000000 // starting value 11111111111111111111111110111111 // ??? 11111111111111111111111111000000 // ??? 00000000000000000000000000111111 // the mask you need Each of those steps equates to exactly one operation that can be performed on an int type. Can you figure them out? Can you see how to simplify my steps? :D Another hint: 00000000000000000000000001000000 // 64 11111111111111111111111111000000 // -64
Since your divisor is always power of two, it's easy. uint32_t remainder(uint32_t number, uint32_t power) { power = 1 << power; return (number & (power - 1)); } Suppose you input number as 5 and divisor as 2 `00000000000000000000000000000101` number AND `00000000000000000000000000000001` divisor - 1 = `00000000000000000000000000000001` remainder (what we expected) Suppose you input number as 7 and divisor as 4 `00000000000000000000000000000111` number AND `00000000000000000000000000000011` divisor - 1 = `00000000000000000000000000000011` remainder (what we expected) This only works as long as divisor is a power of two (Except for divisor = 1), so use it carefully.