Dividing with/without using floats in C

Below is the main function I wrote in C (for a PIC18F8722 microcontroller), which attempts to drive two multiplexed 7-segment displays at a frequency set by the unsigned int function get_ADC_value(). The displays also show the current multiplexing frequency. The frequency is bounded by the #define constants LAB_Fmin and LAB_Fmax and must scale as get_ADC_value() increases or decreases between 0 and 255.
However, this code does not work; I think the problem is the implicit conversion from int to float in the freq = line.
The challenge is to fix this error using floats, and also to find an alternative that uses only integer types (int, char...).
while (1) {
unsigned int x, y, z;
float freq, delay;
x = get_ADC_value();
y = x & 0b00001111;
z = (x & 0b11110000) >> 4 ;
freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin))/ 255)*x ;
delay = 1/(freq*1000); // convert hZ to ms delay accurately
LATF = int_to_SSD(y);
LATH = 0b11111110; //enable 7seg U1
for (unsigned int i = 0; i<(delay) ; i++){
Delay10TCYx(250); //1ms delay
}
LATF = int_to_SSD(z);
LATH = 0b11111101; //enable 7seg U2
for (unsigned int j = 0; j<(delay) ; j++){
Delay10TCYx(250); //1ms delay
}
}

C is defined to divide ints using integer division, and only when there is a float does it "promote" other ints to floats first. Note that this even happens if it will be assigned to a float - if the right-hand side is all ints, then the division will all be integer, and only for the final assignment will C convert the int result to float.
So, with your line:
freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin))/ 255)*x ;
it all depends on what LAB_Fmax and LAB_Fmin are. It doesn't matter what freq or x are, because the "damage" will already have been done due to the parentheses forcing the division to be first.
If those LAB_F variables are ints, the easiest way to use floating point division is to simply tell C that you want that by making the constant 255 a floating point number rather than an integer, by using a decimal point: 255. (or 255.0 to be less subtle).
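As a minimal sketch of the difference, the two helpers below take the span (LAB_Fmax - LAB_Fmin) as a plain int parameter; the function names are made up for illustration, and 255.0f is used instead of 255.0 so that the arithmetic stays in float rather than double.

```c
/* Integer version: span / 255 is evaluated with integer division first,
   so any span below 255 collapses to 0 before the multiplication. */
float scale_int_div(int span, int x)
{
    return (span / 255) * x;
}

/* Float version: the literal 255.0f makes the division floating point,
   so the fractional step size survives. */
float scale_float_div(int span, int x)
{
    return (span / 255.0f) * x;
}
```

With span = 100 and x = 255, the first version returns 0 while the second returns approximately 100.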
If you want to use integer arithmetic only, then the usual suggestion is to do all of your multiplications before any divisions. Of course, this runs the risk of overflowing the intermediate result - to help that, you can use the type long. Define your LAB_F or x variables as long, and do the division last:
freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin)) * x / 255);
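A sketch of the integer-only variant could look like the following. The values of LAB_Fmin and LAB_Fmax are assumptions (the question does not give them), and adc_to_freq is a hypothetical helper name. The product is formed in 32 bits because int is only 16 bits wide on many 8-bit compilers.

```c
#include <stdint.h>

#define LAB_Fmin 10   /* assumed value in Hz, not from the question */
#define LAB_Fmax 500  /* assumed value in Hz, not from the question */

/* Map an 8-bit ADC reading (0..255) linearly onto LAB_Fmin..LAB_Fmax.
   The multiplication is done in 32 bits so it cannot overflow, and the
   single division comes last so only the final result is truncated. */
uint16_t adc_to_freq(uint8_t adc)
{
    uint32_t span = (uint32_t)(LAB_Fmax - LAB_Fmin);
    return (uint16_t)(LAB_Fmin + (span * adc) / 255u);
}
```

With these assumed limits, a reading of 0 gives 10, a reading of 255 gives 500, and the values in between are truncated toward zero.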

Code review:
unsigned int x, y, z; Avoid using raw integer types on embedded systems. Exact-width types from stdint.h should always be used, so you know exactly what size you use. If you don't have access to stdint.h then typedef those types yourself.
float freq, delay; Floating point numbers should generally be avoided on most embedded systems. Particularly on 8 bit MCUs with no FPU! This will result in software-defined floating point numbers that are incredibly slow and memory-consuming. There seems to be no reason for you to use floats in this program; you should be able to write this algorithm with uint16_t or smaller, unless you have extreme accuracy requirements.
x = get_ADC_value(); Since you only seem interested in 8 bits of the ADC read, why not use an 8 bit type?
Please note that binary number literals are not standard C.
((LAB_Fmax) - (LAB_Fmin))/ 255 This looks fishy. First of all, are these integers or floats? What's their size? The answer to your question depends on that. By swapping the literal to 255.0f you can force a conversion to float. But are you sure the division should be by 255? And not 256?
i<(delay). You should always avoid using floating point expressions inside loop conditions, since they make the loop needlessly slow and can lead to floating point inaccuracy bugs. Also, the parentheses serve no purpose.
Overall, your program suffers from "sloppy typing", meaning that the programmer has not given any thought to which types are used in each expression. Note that literals have types too. Implicit conversions might cause a lot of these expressions to be calculated on too large types, which is very bad news for the PIC. I'd recommend reading up on "balancing", aka the usual arithmetic conversions.
This "sloppy typing" will cause your program to get very bloated and slow, for nothing gained. You must keep in mind that PIC is perhaps the least code-efficient MCU still manufactured. When writing C code for any 8-bit MCU, you should avoid types larger than 8 bit. In particular, you should avoid 32 bit integers and floating point numbers like the plague.
Your program re-scales all data to types that ease the thinking for the programmer. This is a common design mistake - instead your program should use types that are easy to use for the processor. For example, instead of milliseconds, you could use timer ticks as the unit.
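As one possible illustration of staying in small integer types, the delay count could be derived without floats from the relation T[ms] = 1000 / f[Hz]. This is only a sketch: period_ms is a hypothetical helper, and whether a whole or half period is wanted per display is left open.

```c
#include <stdint.h>

/* Period in whole milliseconds for a frequency in Hz: T[ms] = 1000 / f[Hz].
   Returns 0 when freq_hz is 0 (no valid period) and also for frequencies
   above 1 kHz, where the period is shorter than 1 ms. */
uint16_t period_ms(uint16_t freq_hz)
{
    if (freq_hz == 0u) {
        return 0u;
    }
    return (uint16_t)(1000u / freq_hz);
}
```

The result can be used directly as an integer loop bound, so no floating point expression is needed in the loop condition.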

You are correct about integer division.
Change the divisor from 255 to 255.0:
freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin)) / 255.0)*x;

freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin))/ 255)*x ;
If LAB_Fmax and LAB_Fmin are integers, this is indeed integer division: the quotient is truncated to an integer, and only the final result is converted to float.
That is because 255 is an integer literal.
Change it to 255.0 to make it a double literal, which forces the division to be done in floating point.
If you want the calculation done in float rather than double, you can use a float literal such as 255.0f, or an explicit cast such as (float)255.
Your code could look like this then:
freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin))/ 255.0)*x ;
Or this:
freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin))/ (float)255)*x ;

Math operations on integers produce an integer result by default,
so you need to either write one of the literals as a double/float
freq = LAB_Fmin + (((LAB_Fmax) - (LAB_Fmin))/ 255.0)*x ;
or cast one of the operands with (float).
As many others have stated, the first option is the most commonly used.

Related

Floating point operations with no library

I am looking for an efficient way to properly do mathematical operations with floating values. As I am working in embedded C, I don't want to use any extra library for the float data type.
As far as I understand, the correct way here would be to treat a floating value as raw binary (sign, exponent, mantissa) and do the operations on that. But I cannot find any examples of how exactly that works.
I am looking for an explanation of how to do the following with no float data type:
Given a variable int x that can have values from 0 to 10000.
y = x * 0.720 + 84.234;
y = y / 2.5;
Thank you for your time internet
Floating point libraries are not required for the example operations you have suggested, and while avoiding floating point code on an embedded system without an FPU is often advisable, doing that by implementing your own floating point encoding will save you nothing and will likely be less efficient, less comprehensible and more error prone than using compiler's built-in FP support.
Instead, you need to avoid floating-point code entirely, and use fixed-point encoding. In many cases that can be done ad-hoc for individual expressions, but if your application is math intensive (involving trig, logs, sqrt, exponentiation for example) you might choose a fixed-point library or implement your own.
Floating-point dependency is trivially eradicated in the examples you have suggested; for example:
// y = x * 0.720 + 84.234
// Where x_x1000 = real value * 1000
int y_x1000 = (x_x1000 * 720) / 1000 + 84234 ;
or more efficiently using binary-fixed-point and a 10 bit fractional part:
// y = x * 0.720 + 84.234
// Where x_q10 = real value * 1024
int32_t y_q10 = ((x_q10 * 737) >> 10) + 86256 ; // parentheses needed: + binds tighter than >>
Although you might consider int64_t for greater numeric range - in which case you might also use more fractional bits for greater precision too.
If you are doing a lot of intensive fixed-point maths, you would do well to consider a library or implement one using CORDIC algorithms. An example of such a library can be found at https://www.justsoftwaresolutions.co.uk/news/optimizing-applications-with-fixed-point-arithmetic.html, although it is C++ - the clear advantage being that by defining a fixed class and extensive operator overloading, existing floating-point code can largely be converted to fixed point by replacing double or float keywords with fixed and compiling as C++ - even if the code is otherwise non-OOP and entirely C-like.
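For the two statements in the question, a Q10 sketch might look like this. It assumes x is a plain integer in the range 0 to 10000 (so x * 737 is already the Q10 product and fits comfortably in 32 bits); compute_y_q10 and q10_to_int are hypothetical helper names.

```c
#include <stdint.h>

/* y = (x * 0.720 + 84.234) / 2.5, computed in Q10 fixed point.
   737   is 0.720  * 1024 rounded to the nearest integer,
   86256 is 84.234 * 1024 rounded to the nearest integer.
   Dividing by 2.5 is done as multiplying by 2 and dividing by 5.
   Assumes 0 <= x <= 10000, which keeps every intermediate in int32_t. */
int32_t compute_y_q10(int32_t x)
{
    int32_t y_q10 = x * 737 + 86256;
    return (y_q10 * 2) / 5;
}

/* Convert a non-negative Q10 value to the nearest integer. */
int32_t q10_to_int(int32_t v_q10)
{
    return (v_q10 + 512) >> 10;
}
```

Because 0.720 is represented as 737/1024 (about 0.71973), the result drifts slightly for large x: at x = 10000 this yields 2913, whereas the exact value is about 2913.69. More fractional bits would reduce that error at the cost of numeric range.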

Is it defined what will happen if you shift a float?

I am following This video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to integer, shifting it back and forth, subtracting something, and then assigning the result to ry (also a float). This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, and then shift that value down by six spaces, and then shift it back up by six spaces (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or these two operations could be deleted by a peephole optimisation step.)
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?
is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float. The cast to an int is UB for float with a whole number value outside the [INT_MIN ... INT_MAX] range.
So code is UB for about 38% of all typical float - the large valued ones, NaNs and infinities.
For typical float, a cast to int128_t is defined for nearly all float.
To get to OP's goal, code could use the below, which I believe to be well defined for all float.
If anything, use the below to assess the correctness of one's crafted code.
// round the y position to the nearest 64th value
float round_to_64th(float x) {
if (isfinite(x)) {
float ipart;
// The modf functions break the argument value into integral and fractional parts
float frac = modff(x, &ipart);
x = ipart + roundf(frac*64)/64;
}
return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate to the nearest multiple of 64 or 2⁶.
It is still UB for many float.
That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6, the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.
The best possible recreation of the shift ops for floating points, in short, without using additional functions are the following:
Left shift:
ShiftFloat(py,6,1);
Right shift:
ShiftFloat(py,6,0);
float ShiftFloat(float x, int count, int ismultiplication)
{
    float value = x;
    // Each step doubles the value (left shift) or halves it (right shift).
    const float factor = ismultiplication ? 2.0f : 0.5f;
    for (int i = 0; i < count; ++i)
    {
        value *= factor;
    }
    return value;
}
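The standard library already provides this scaling: ldexpf(x, n) from <math.h> computes x times 2 to the power n. A thin wrapper (scale_pow2 is a hypothetical name) might be all that is needed:

```c
#include <math.h>

/* Multiply x by 2^count. A positive count behaves like a "left shift"
   of the value, a negative count like a "right shift", except that the
   fractional part is kept instead of being discarded. */
float scale_pow2(float x, int count)
{
    return ldexpf(x, count);
}
```

For instance, scale_pow2(3.0f, 6) gives 192.0f and scale_pow2(192.0f, -6) gives 3.0f.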

How to force compiler to promote variable to float when doing maths

I got question about math in C, quick example below:
uint32_t desired_val;
uint16_t target_us=1500;
uint32_t period_us=20000;
uint32_t tempmod=37500;
desired_val = (((target_us)/period_us) * tempmod);
At the moment (target_us/period_us) results in 0, which makes desired_val 0 as well. I don't want to make these variables float unless I really have to. I don't need anything after the decimal point, as the result will be saved into a 32-bit register.
Is it possible to get correct results from this equation without declaring target_us or period_us as float? I want to make fixed point calculations when it's possible and floating point when it's needed.
Working on cortex-M4 if that helps.
Do the multiplication first.
You should split it into two statements with a temporary variable, to ensure the desired order of operations (parentheses ensure proper grouping, but not order).
uint64_t tempprod = (uint64_t)target_us * tempmod;
desired_val = tempprod / period_us;
I've also used uint64_t for the temporary, in case the product overflows. There's still a problem if the desired value doesn't fit into 32 bits; hopefully the data precludes that.
You'll probably have to do some casting in any case, but there's two different methods. First, stick with integers and do the multiplication first:
desired_val = ((uint64_t)target_us * tempmod) / period_us;
or do the calculations in floating point:
desired_val = (uint32_t)(((double)target_us / period_us) * tempmod);
You can do the computation with double quite easily:
desired_val = (double)target_us * tempmod / period_us;
float would be a mistake, since it has far too little precision to be reliable.
You might want to round that off to the nearest integer rather than letting it be truncated:
#include <math.h>
desired_val = round((double)target_us * tempmod / period_us);
See man round
You could, of course, do the computation using a wider integer type (for example, replacing the double with int64_t or long long). That will make rounding slightly trickier.
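A sketch of the integer version with rounding to nearest, assuming all values are unsigned and period_us is nonzero (scale_rounded is a hypothetical helper name): adding half the divisor before dividing turns truncation into rounding.

```c
#include <stdint.h>

/* Compute target_us * tempmod / period_us, rounded to the nearest integer.
   The product is formed in 64 bits so it cannot overflow.
   Assumes period_us != 0 and that the result fits in 32 bits. */
uint32_t scale_rounded(uint16_t target_us, uint32_t period_us, uint32_t tempmod)
{
    uint64_t prod = (uint64_t)target_us * tempmod;
    return (uint32_t)((prod + period_us / 2u) / period_us);
}
```

With the values from the question (1500, 20000, 37500) the exact quotient is 2812.5, so this returns 2813 where plain truncating division would give 2812.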

In C, How do I calculate the signed difference between two 48-bit unsigned integers?

I've got two values from an unsigned 48bit nanosecond counter, which may wrap.
I need the difference, in nanoseconds, of the two times.
I think I can assume that the readings were taken at roughly the same time, so of the two possible answers I think I'm safe taking the smallest.
They're both stored as uint64_t. Because I don't think I can have 48 bit types.
I'd like to calculate the difference between them, as a signed integer (presumably int64_t), accounting for the wrapping.
so e.g. if I start out with
x=5
y=3
then the result of x-y is 2, and will stay so if I increment both x and y, even as they wrap over the top of the max value 0xffffffffffff
Similarly if x=3, y=5, then x-y is -2, and will stay so whenever x and y are incremented simultaneously.
If I could declare x,y as uint48_t, and the difference as int48_t, then I think
int48_t diff = x - y;
would just work.
How do I simulate this behaviour with the 64-bit arithmetic I've got available?
(I think any computer this is likely to run on will use 2's complement arithmetic)
P.S. I can probably hack this out, but I wonder if there's a nice neat standard way to do this sort of thing, which the next person to read my code will be able to understand.
P.P.S Also, this code is going to end up in the tightest of tight loops, so something that will compile efficiently would be nice, so that if there has to be a choice, speed trumps readability.
You can simulate a 48-bit unsigned integer type by just masking off the top 16 bits of a uint64_t after any arithmetic operation. So, for example, to take the difference between those two times, you could do:
uint64_t diff = (after - before) & 0xffffffffffff;
You will get the right value even if the counter wrapped around during the procedure. If the counter didn't wrap around, the masking is not needed but not harmful either.
Now if you want this difference to be recognized as a signed integer by your compiler, you have to sign extend the 48th bit. That means that if the 48th bit is set, the number is negative, and you want to set the 49th through the 64th bit of your 64-bit integer. I think a simple way to do that is:
int64_t diff_signed = (int64_t)(diff << 16) >> 16;
Warning: You should probably test this to make sure it works, and also beware there is implementation-defined behavior when I cast the uint64_t to an int64_t, and I think there is implementation-defined behavior when I shift a signed negative number to the right. I'm sure a C language lawyer could come up with something more robust.
Update: The OP points out that if you combine the operation of taking the difference and doing the sign extension, there is no need for masking. That would look like this:
int64_t diff = (int64_t)((x - y) << 16) >> 16; // shift left while still unsigned; left-shifting a negative signed value is undefined
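For reference, here is a sketch that avoids shifting signed values entirely. It masks the difference to 48 bits and then sign-extends bit 47 with the XOR-and-subtract trick; diff48 is a hypothetical name, and both inputs are assumed to hold 48-bit counter readings in a uint64_t.

```c
#include <stdint.h>

/* Signed difference x - y of two 48-bit counter readings, modulo 2^48.
   The result lies in the range [-2^47, 2^47 - 1]. */
int64_t diff48(uint64_t x, uint64_t y)
{
    const uint64_t mask = 0xFFFFFFFFFFFFULL;   /* low 48 bits */
    const uint64_t sign = 0x800000000000ULL;   /* bit 47 */
    uint64_t d = (x - y) & mask;               /* unsigned wraparound is well defined */
    /* (d ^ sign) fits in 48 bits, so the cast is value-preserving;
       subtracting 2^47 then maps it onto the signed range. */
    return (int64_t)(d ^ sign) - (int64_t)sign;
}
```

This gives 2 for (5, 3), -2 for (3, 5), and 3 when x has wrapped around to 1 while y is still 0xFFFFFFFFFFFE.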
struct Nanosecond48{
unsigned long long u48 : 48;
// int res : 12; // just for clarity, don't need this one really
};
Here we just set the explicit width of the field to 48 bits, and with that (admittedly somewhat awkward) type you leave it up to your compiler to properly handle different architectures/platforms/whatnot.
Like the following:
Nanosecond48 u1, u2, overflow;
overflow.u48 = -1L;
u1.u48 = 3;
u2.u48 = 5;
const auto diff = (u2.u48 + (overflow.u48 + 1) - u1.u48) & 0x0000FFFFFFFFFFFF;
Of course in the last statement you can just do the remainder operation with % (overflow.u48 + 1) if you prefer.
Do you know which was the earlier reading and which was later? If so:
diff = (earlier <= later) ? later - earlier : WRAPVAL - earlier + later;
where WRAPVAL is (1ULL << 48), is pretty easy to read. (The ULL suffix matters: a plain 1 is an int, and shifting it by 48 bits overflows.)

Why doesn't my code work when replacing 622.08E6 with 622080000?

I recently came across a C code (working by the way) where I found
freq_xtal = ((622.08E6 * vcxo_reg_val->hiv * vcxo_reg_val->n1)/(temp_rfreq));
From my intuition it seems that 622.08E6 should mean 622.08 × 10⁶. From this question this assumption is correct.
So I tried replacing 622.08e6 with
uint32_t default_freq = 622080000;
For some reason this doesn't seem to work
Any thoughts or suggestions appreciated
The problem you are having (and I'm speculating here because I don't have the rest of your code) appears to be that replacing the floating-point literal with an integer caused the multiplication and division to be done in integer arithmetic rather than floating-point arithmetic. As a result, you now compute the wrong value.
Try type casting your uint32_t to a double and see if that clears it up.
The problem is due to overflow!
The original expression (622.08E6 * vcxo_reg_val->hiv * vcxo_reg_val->n1)/temp_rfreq (you have too many unnecessary parentheses though) is done in double precision because 622.08E6 is a double literal. That results in a floating-point value.
However, if you replace the literal with 622080000, the whole expression will be done in integer math if all the variables are integers. More importantly, integer math will overflow (or at least much sooner than floating-point math).
Notice that UINT32_MAX / 622080000.0 ≈ 6.9. That means multiplying the constant by just 7 already overflows 32 bits. In the code, however, you multiply 622080000 by two other values whose product may well be above 6. You should add the ULL suffix to do the math in unsigned long long
freq_xtal = (622080000ULL * vcxo_reg_val->hiv * vcxo_reg_val->n1)/temp_rfreq;
or change the variable to uint64_t default_freq = 622080000ULL;
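To see the effect, the sketch below compares the two versions. The parameter names and their uint32_t types are assumptions, since the question does not show how hiv, n1 and temp_rfreq are declared; it also assumes the usual platforms where uint32_t arithmetic wraps modulo 2^32.

```c
#include <stdint.h>

/* All-32-bit version: the product wraps modulo 2^32 before the division. */
uint32_t freq_xtal_32(uint32_t hiv, uint32_t n1, uint32_t rfreq)
{
    uint32_t default_freq = 622080000u;
    return (default_freq * hiv * n1) / rfreq;
}

/* 64-bit version: the ULL literal makes the whole product 64 bits wide. */
uint64_t freq_xtal_64(uint32_t hiv, uint32_t n1, uint32_t rfreq)
{
    return (622080000ULL * hiv * n1) / rfreq;
}
```

With hiv = 7, n1 = 1 and rfreq = 1, the 64-bit version returns 4354560000, while the 32-bit version returns 59592704, which is the same value reduced modulo 2^32.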
