In C I need to scale a uint8_t from 0 - 255 to 0 - 31
What is the best way to do this evenly?
If you're trying to scale from 8 bits to 5 bits, you can do a 3-bit right shift:
uint8_t scaled = (uint8_t)(original >> 3);
This drops the lower 3 bits.
You can use some simple multiplication and division:
uint8_t scaled = (uint8_t)(((uint32_t)original * 32U) / 256U);
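The two answers above compute the same thing: both are floor(original / 8), so every output value from 0 to 31 is produced by exactly 8 consecutive inputs. A minimal sketch that puts them side by side (the function names are hypothetical, chosen only for illustration):

```c
#include <stdint.h>

/* Hypothetical helper: drop the low 3 bits (8-bit -> 5-bit). */
static uint8_t scale_shift(uint8_t original)
{
    return (uint8_t)(original >> 3);
}

/* Hypothetical helper: the same mapping written as multiply/divide. */
static uint8_t scale_muldiv(uint8_t original)
{
    return (uint8_t)(((uint32_t)original * 32U) / 256U);
}
```

Since the results are identical, the shift is usually the simpler choice; a compiler will typically reduce the multiply/divide form to a shift anyway.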
How can I multiply a pair of uint64 values safely in order to get the result as a pair of LSB and MSB of the same type?
typedef struct uint128 {
uint64 lsb;
uint64 msb;
} uint128;
uint128 mul(uint64 x, uint64 y)
{
uint128 z = {0, 0};
z.lsb = x * y;
if (z.lsb / x != y)
{
z.msb = ?
}
return z;
}
Am I computing the LSB correctly?
How can I compute the MSB correctly?
As said in the comments, the best solution would probably be to use a library that does this for you. But I will explain how you can do it without a library, because I think you asked in order to learn something. It is probably not a very efficient way, but it works.
When we were in school and had to multiply 2 numbers without a calculator, we multiplied 2 digits, got a result with 1-2 digits, wrote it down, and in the end added everything up. We split the multiplication up so that we only had to calculate one single-digit multiplication at a time. A similar thing is possible with larger numbers on a CPU. But there we do not use decimal digits; we use half of the register size as a digit. That way, we can multiply 2 digits and get a 2-digit result in one register. In decimal, 13*42 can be calculated as:
3* 2 = 0 6
10* 2 = 2 0
3*40 = 1 2 0
10*40 = 0 4 0 0
--------
0 5 4 6
A similar thing can be done with integers. To keep it simple, I multiply two 8-bit numbers into a 16-bit number on an 8-bit CPU; for that I only multiply 4 bits by 4 bits at a time. Let's multiply 0x73 by 0x4F.
0x03*0x0F = 0x002D
0x70*0x0F = 0x0690
0x03*0x40 = 0x00C0
0x70*0x40 = 0x1C00
-------
0x237D
You basically create an array with 4 elements, in your case each of type uint32_t, and store or add the result of each single multiplication into the right element(s) of the array. If the result of a single multiplication is too large for a single element, store the higher bits in the next higher element. If an addition overflows, carry 1 to the next element. In the end you can combine pairs of elements of the array, in your case into two uint64_t values.
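The schoolbook method described here can be sketched in C for the 64-bit case, using 32-bit halves as the "digits". This is only one possible way to write it: the type `u128_parts` and the function `mul_64x64` are hypothetical names, and `uint64_t` from `<stdint.h>` stands in for the question's `uint64`. Instead of an explicit 4-element array, the sketch keeps the partial products in local variables and collects the carries in a middle column.

```c
#include <stdint.h>

/* Hypothetical result type: low and high 64 bits of the product. */
typedef struct {
    uint64_t lsb;
    uint64_t msb;
} u128_parts;

static u128_parts mul_64x64(uint64_t x, uint64_t y)
{
    /* Split each operand into two 32-bit "digits". */
    uint64_t x_lo = x & 0xFFFFFFFFu, x_hi = x >> 32;
    uint64_t y_lo = y & 0xFFFFFFFFu, y_hi = y >> 32;

    /* Four partial products; each one fits in 64 bits. */
    uint64_t p0 = x_lo * y_lo;
    uint64_t p1 = x_lo * y_hi;
    uint64_t p2 = x_hi * y_lo;
    uint64_t p3 = x_hi * y_hi;

    /* Middle column (bits 32..63 plus carries).
       Three terms below 2^32 each, so the sum cannot overflow. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    u128_parts z;
    z.lsb = (mid << 32) | (p0 & 0xFFFFFFFFu);
    z.msb = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return z;
}
```

Some compilers offer a 128-bit integer extension that makes this unnecessary, but the version above relies only on standard fixed-width types.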
I'm trying to scale down a 32-bit value to a signed 16-bit number (uint32_t -> int16_t), or in other words trying to scale my uint32_t result to the range 0 to 32767 (INT16_MAX). My code looks like this. In this snippet, my input range happens to be 0 to 90000, so an input of 90000 should correspond to 32767, and so on:
uint32_t input = 85000;
uint32_t max_base = 90000;
uint16_t new_base = INT16_MAX; // 32767
uint16_t scaled_estimate = (input * new_base) / max_base;
if(scaled_estimate > new_base) scaled_estimate = new_base; // clamp
Is there a better way to achieve this scaling on embedded platforms or should I trust the compiler to do the right thing?
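One possible variant, offered as a sketch rather than a definitive answer: the multiplication in the snippet above is done in 32 bits, which happens to fit for inputs up to 90000 but would wrap for inputs above roughly 131000, and the clamp runs only after the value has already been narrowed to 16 bits. The hypothetical helper `scale_to_int16` below clamps first, uses a 64-bit intermediate, and rounds to nearest instead of truncating (the rounding is an added assumption, not something the question asks for).

```c
#include <stdint.h>

/* Hypothetical helper: map input in [0, max_base] onto [0, INT16_MAX]. */
static int16_t scale_to_int16(uint32_t input, uint32_t max_base)
{
    if (max_base == 0)
        return 0;                     /* avoid division by zero */
    if (input >= max_base)
        return INT16_MAX;             /* clamp before multiplying */

    /* 64-bit intermediate so input * 32767 cannot wrap;
       adding max_base / 2 rounds to nearest. */
    uint64_t scaled = ((uint64_t)input * INT16_MAX + max_base / 2) / max_base;
    return (int16_t)scaled;
}
```

On small embedded targets a 64-bit division can be costly; if the input range is known to be small enough, a 32-bit intermediate as in the question is sufficient.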
I'm working on a small project where I need float multiplication with 16-bit floats (half precision). Unfortunately, I'm facing some problems with the algorithm:
Example Output
1 * 5 = 5
2 * 5 = 10
3 * 5 = 14.5
4 * 5 = 20
5 * 5 = 24.5
100 * 4 = 100
100 * 5 = 482
The Source Code
const int bits = 16;
const int exponent_length = 5;
const int fraction_length = 10;
const int bias = pow(2, exponent_length - 1) - 1;
const int exponent_mask = ((1 << 5) - 1) << fraction_length;
const int fraction_mask = (1 << fraction_length) - 1;
const int hidden_bit = (1 << 10); // Was 1 << 11 before update 1
int float_mul(int f1, int f2) {
int res_exp = 0;
int res_frac = 0;
int result = 0;
int exp1 = (f1 & exponent_mask) >> fraction_length;
int exp2 = (f2 & exponent_mask) >> fraction_length;
int frac1 = (f1 & fraction_mask) | hidden_bit;
int frac2 = (f2 & fraction_mask) | hidden_bit;
// Add exponents
res_exp = exp1 + exp2 - bias; // Remove double bias
// Multiply significants
res_frac = frac1 * frac2; // 11 bit * 11 bit → 22 bit!
// Shift 22bit int right to fit into 10 bit
// highest_bit_pos() (not shown) returns the index of the highest set bit
if (highest_bit_pos(res_frac) == 21) {
res_frac >>= 11;
res_exp += 1;
} else {
res_frac >>= 10;
}
res_frac &= ~hidden_bit; // Remove hidden bit
// Construct float
return (res_exp << bits - exponent_length - 1) | res_frac;
}
By the way: I'm storing the floats in ints, because I'll try to port this code to some kind of Assembler w/o float point operations later.
The Question
Why does the code work for some values only? Did I forget some normalization or similar? Or does it work only by accident?
Disclaimer: I'm not a CompSci student, it's a leisure project ;)
Update #1
Thanks to the comment by Eric Postpischil I noticed one problem with the code: the hidden_bit flag was off by one (should be 1 << 10). With that change, I don't get decimal places any more, but some calculations are still off (e.g. 3•3=20). I assume it's the res_frac shift, as described in the answers.
Update #2
The second problem with the code was indeed the res_frac shifting. After update #1 I got wrong results whenever frac1 * frac2 produced a 22-bit result. I've updated the code above with the corrected shift statement. Thanks to all for every comment and answer! :)
From a cursory look:
No attempt is made to determine the location of the high bit in the product. Two 11-bit numbers, each with its high bit set, may produce a 21- or 22-bit number. (Example with two-bit numbers: 10₂•10₂ is 100₂, three bits, but 11₂•11₂ is 1001₂, four bits.)
The result is truncated instead of rounded.
Signs are ignored.
Subnormal numbers are not handled, on input or output.
11 is hardcoded as a shift amount in one place. This is likely incorrect; the correct amount will depend on how the significand is handled for normalization and rounding.
In decoding, the exponent field is shifted right by fraction_length. In encoding, it is shifted left by bits - exponent_length - 1. To avoid bugs, the same expression should be used in both places.
From a more detailed look by chux:
res_frac = frac1 * frac2 fails if int is less than 23 bits (22 for the product and one for the sign).
This is more a suggestion for how to make it easier to get your code right, rather than analysis of what is wrong with the existing code.
There are a number of steps that are common to some or all of the floating point arithmetic operations. I suggest extracting each into a function that can be written with focus on one issue, and tested separately. Then when you come to write e.g. multiplication, you only have to deal with the specifics of that operation.
All the operations will be easier working with a structure that has the actual signed exponent, and the full significand in a wider unsigned integer field. If you were dealing with signed numbers, it would also have a boolean for the sign bit.
Here are some sample operations that could be separate functions, at least until you get it working:
unpack: Take a 16 bit float and extract the exponent and significand into a struct.
pack: Undo unpack - deal with dropping the hidden bit, applying the bias to the exponent, and combining them into a float.
normalize: Shift the significand and adjust the exponent to bring the most significant 1-bit to a specified bit position.
round: Apply your rounding rules to drop low significance bits. If you want to do IEEE 754 style round-to-nearest, you need a guard bit, which is the most significant bit that will be dropped, and an additional bit indicating whether there are any one bits of lower significance than the guard bit.
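As a rough illustration of the first two steps in the list above, here is a minimal sketch of unpack and pack for half-precision values. It makes simplifying assumptions that are not in the original: only normal numbers are handled (no zero, subnormals, infinities or NaN), the sign bit is ignored, and all names (`unpacked_half`, `unpack_half`, `pack_half`, the enum constants) are hypothetical.

```c
#include <stdint.h>

enum { FRACTION_BITS = 10, EXPONENT_BITS = 5, BIAS = 15 };

/* Hypothetical unpacked form: unbiased exponent plus a significand
   with the hidden bit made explicit (normal numbers only). */
typedef struct {
    int      exponent;     /* unbiased */
    uint32_t significand;  /* 11 bits, hidden bit at bit 10 */
} unpacked_half;

static unpacked_half unpack_half(uint16_t f)
{
    unpacked_half u;
    u.exponent = (int)((f >> FRACTION_BITS) & ((1u << EXPONENT_BITS) - 1u)) - BIAS;
    u.significand = (f & ((1u << FRACTION_BITS) - 1u)) | (1u << FRACTION_BITS);
    return u;
}

static uint16_t pack_half(unpacked_half u)
{
    /* Drop the hidden bit and re-apply the bias. */
    uint32_t fraction = u.significand & ((1u << FRACTION_BITS) - 1u);
    uint32_t biased   = (uint32_t)(u.exponent + BIAS);
    return (uint16_t)((biased << FRACTION_BITS) | fraction);
}
```

Note that both functions use the same shift amount, `FRACTION_BITS`, for the exponent field, which avoids the decode/encode mismatch mentioned in the other answer.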
One problem is that you are truncating instead of rounding:
res_frac >>= 11; // Shift 22bit int right to fit into 10 bit
You should compute res_frac & 0x7ff first, the part of the 22-bit result that your algorithm is about to discard, and compare it to 0x400. If it is below, truncate. If it is above, round away from zero. If it is equal to 0x400, round to the even alternative.
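The rule described above (round to nearest, ties to even) can be sketched as a small helper. The name `shift_right_round_even` is hypothetical, and the sketch assumes the sign is handled separately, so "away from zero" simply means incrementing the magnitude.

```c
#include <stdint.h>

/* Hypothetical helper: shift `value` right by `shift` bits (1..31),
   rounding to nearest with ties going to the even result. */
static uint32_t shift_right_round_even(uint32_t value, unsigned shift)
{
    uint32_t discarded = value & ((1u << shift) - 1u); /* 0x7ff mask for shift 11 */
    uint32_t half      = 1u << (shift - 1);            /* 0x400 for shift 11 */
    uint32_t result    = value >> shift;

    if (discarded > half || (discarded == half && (result & 1u)))
        result += 1u;  /* may carry out of the significand; caller must renormalize */
    return result;
}
```

If the increment carries into a new high bit, the significand has to be shifted right once more and the exponent incremented.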
I was recently asked in an interview how to set the 513th bit of a char[1024] in C, but I'm unsure how to approach the problem. I saw How do you set, clear, and toggle a single bit?, but how do I choose the bit from such a large array?
int bitToSet = 513;
inArray[bitToSet / 8] |= (1 << (bitToSet % 8));
...making certain assumptions about character size and desired endianness.
EDIT: Okay, fine. You can replace 8 with CHAR_BIT if you want.
#include <limits.h>
int charContaining513thBit = 513 / CHAR_BIT;
int offsetOf513thBitInChar = 513 - charContaining513thBit*CHAR_BIT;
int bit513 = array[charContaining513thBit] >> offsetOf513thBitInChar & 1;
You have to know the width of characters (in bits) on your machine. For pretty much everyone, that's 8. You can use the constant CHAR_BIT from limits.h in a C program. You can then do some fairly simple math to find the offset of the bit (depending on how you count them).
Numbering bits from the left, with the 2⁷ bit in a[0] being bit 0, the 2⁰ bit being bit 7, and the 2⁷ bit in a[1] being bit 8, this gives:
offset = 513 / CHAR_BIT; /* using integer (truncating) math, of course */
bit = 513 % CHAR_BIT;
a[offset] |= (0x80>>bit);
There are many sane ways to number bits, here are two:
          a[0]                       a[1]
 0  1  2  3  4  5  6  7    8  9 10 11 12 13 14 15    This is the above
 7  6  5  4  3  2  1  0   15 14 13 12 11 10  9  8    This is |= (1<<bit)
You could also number from the other end of the array (treating it as one very large big-endian number).
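The two numbering schemes shown above can be written as a pair of small helpers. This is a sketch under a few assumptions: the function names are hypothetical, the array is treated as `unsigned char` to sidestep questions about signed `char`, and the bit index is zero-based.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: bit 0 is the least significant bit of a[0]. */
static void set_bit_lsb_first(unsigned char *a, size_t bit)
{
    a[bit / CHAR_BIT] |= (unsigned char)(1u << (bit % CHAR_BIT));
}

/* Hypothetical helper: bit 0 is the most significant bit of a[0]. */
static void set_bit_msb_first(unsigned char *a, size_t bit)
{
    a[bit / CHAR_BIT] |= (unsigned char)(1u << (CHAR_BIT - 1 - bit % CHAR_BIT));
}
```

With 8-bit chars, both helpers touch the same byte for a given index and differ only in which bit within that byte they set.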
Small optimization:
The / and % operators are rather slow, even on many modern CPUs, with modulus being slightly slower. I would replace them with the equivalent operations using bit shifting (and subtraction), which only works when the second operand is a power of two, obviously.
x / 8 becomes x >> 3
x % 8 becomes x-((x>>3)<<3)
For this second operation, just reuse the result of the initial division.
Depending on the desired bit order (left to right versus right to left), the details might change. But the general idea, assuming 8 bits per byte, would be to choose the byte as follows. This is expanded into several lines of code to hopefully show the intended steps more clearly (or perhaps it just obfuscates the intention):
int bitNum = 513;
int bytePos = bitNum / 8;
Then the bit position would be computed as:
int bitInByte = bitNum % 8;
Then set the bit (assuming the goal is to set it to 1 as opposed to clear or toggle it):
charArray[bytePos] |= ( 1 << bitInByte );
When you say 513th are you using index 0 or 1 for the 1st bit? If it's the former your post refers to the bit at index 512. I think the question is valid since everywhere else in C the first index is always 0.
BTW
static char chr[1024];
...
chr[512>>3] |= 1<<(512&0x7);
Using only adding, subtracting, and bitshifting, how can I multiply an integer by a given number?
For example, I want to multiply an integer by 17.
I know that shifting left multiplies by a power of 2 and shifting right divides by a power of 2, but I don’t know how to generalize that.
What about negative numbers? Convert to two's complement and do the same procedure?
(EDIT: OK, I got this, never mind. You convert to two's complement and then do your shifting according to the number from left to right instead of right to left.)
Now the tricky part comes in. We can only use 3 operators.
For example, multiplying by 60 I can accomplish by using this:
(x << 5) + (x << 4) + (x << 3) + (x << 2)
Where x is the number I am multiplying. But that is 7 operators - how can I condense this to use only 3?
It's called shift-and-add. Wikipedia has a good explanation of this:
http://en.wikipedia.org/wiki/Multiplication_algorithm#Shift_and_add
EDIT:
To answer your other question: yes, converting to two's complement will work. But you need to sign-extend it so that it is wide enough to hold the entire product (assuming that's what you want).
EDIT2:
If both operands are negative, just take the two's complement of both of them (i.e. negate them) from the start and you won't have to worry about this.
Here's an example of multiplying by 3:
unsigned int y = (x << 1) + (x << 0);
(where I'm assuming that x is also unsigned).
Hopefully you should be able to generalise this.
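One way to generalise the example is the shift-and-add loop: for every set bit in the multiplier, add the correspondingly shifted copy of x. The sketch below assumes unsigned 32-bit operands (so the result wraps modulo 2^32), and it uses `&` to test bits, which goes slightly beyond the add/subtract/shift restriction in the question. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: multiply using shifts, adds and bit tests. */
static uint32_t shift_add_mul(uint32_t x, uint32_t multiplier)
{
    uint32_t result = 0;
    while (multiplier != 0) {
        if (multiplier & 1u)   /* this bit of the multiplier is set */
            result += x;       /* add x shifted to this bit's position */
        x <<= 1;
        multiplier >>= 1;
    }
    return result;
}
```

For a constant multiplier the loop can be unrolled by hand into one shift per set bit, which is exactly what the multiply-by-3 line above does.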
As far as I know, there is no easy way to multiply in general using just 3 operators.
Multiplying with 60 is possible, since 60 = 64 - 4: (x << 6) - (x << 2)
17 = 16 + 1 = (2^4) + (2^0). Therefore, shift your number left 4 bits (to multiply by 2^4 = 16), and add the original number to it.
Another way to look at it is: 17 is 10001 in binary (base 2), so you need a shift operation for each of the bits set in the multiplier (i.e. bits 4 and 0, as above).
I don't know C, so I won't embarrass myself by offering code.
The set of numbers that work with only 3 operators (a shift, a plus or minus, and another shift) is limited, but much larger than just the 3, 17 and 60 mentioned above. If a number can be represented as (2^x) +/- (2^y), it can be done with only 3 operators.
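As a short sketch of the (2^x) +/- (2^y) idea for the two multipliers from the question, assuming unsigned 32-bit operands (the function names are hypothetical):

```c
#include <stdint.h>

/* 17 = 2^4 + 2^0: one shift and one add. */
static uint32_t mul17(uint32_t x)
{
    return (x << 4) + x;
}

/* 60 = 2^6 - 2^2: shift, subtract, shift (3 operators). */
static uint32_t mul60(uint32_t x)
{
    return (x << 6) - (x << 2);
}
```

Multiplying by 17 needs only two operators because the second term is 2^0, so its shift can be dropped.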