Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
I have the following code :
void func(double val) {
    // I am trying with the following values. both of which are in the range as per
    // IEEE 754 std.
    // val = 1.847474
    int temp = (1 << 23); // some_val
    double temp2 = val * temp;
    printf("the product = %15f\n", temp2);
}
The value in temp2 shows a loss of precision.
However, if I substitute the value of val directly into the multiplication, I get the correct result.
What can be done to avoid precision loss in such a scenario?
Precision is the (relative) difference from one floating point number to the next.
Accuracy is the (relative) difference between the numerical result and the exact result.
Multiplication by a power of 2 changes the exponent part of the floating-point number and leaves the mantissa bits unchanged (see also the earlier comment by David Hammen). Thus there should be neither a loss in relative precision (still the same floating-point type) nor in relative accuracy, except in cases where you are very close to numerical overflow or underflow.
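To check this claim on a particular value, one option is to split the number into significand and exponent with frexp before and after the multiplication. The following is a minimal sketch; the helper name scaling_is_exact is hypothetical, and it assumes a finite, nonzero, normal double far from overflow.

```c
#include <math.h>

/* Hypothetical helper: returns 1 when val * 2^23 has the same significand
 * as val and an exponent that is larger by exactly 23, i.e. the scaling
 * introduced no rounding. Assumes val is finite, nonzero and normal. */
int scaling_is_exact(double val)
{
    int exp_before, exp_after;
    double sig_before = frexp(val, &exp_before);             /* val = sig * 2^exp */
    double sig_after  = frexp(val * (1 << 23), &exp_after);  /* scaled value      */

    return sig_before == sig_after && exp_after == exp_before + 23;
}
```

Calling scaling_is_exact(1.847474) should return 1, which suggests that any difference seen in the printed output comes from somewhere other than the multiplication itself.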
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 12 months ago.
I have a problem: I need to concatenate 3 double numbers into a single double number. For example, I have:
a = 40.000000;
b = 56.000000;
c = 10.236330;
I need the following number: 40.5610236330. The integer part is given by the first two digits of a, the first two decimal digits are the integer part of b, and the remaining decimal digits are all the digits of c. I've tried with:
k = a+(b/100)+(c/1000);
But due to approximation error, the result is 40.570236. Could you help me? Thank you so much!
Floating-point calculations generally lose some precision.
But 40.570236 instead of 40.5610236330 is too much off.
The big error you see is because of a simple bug in your code.
You need k = a+(b/100)+(c/10000); (i.e. c is to be divided by 10000)
Maybe it would be more clear if you did k = a+(b/100)+(c/100/100);
But never expect floating-point calculations to be 100% precise. It's not even certain that the number 40.5610236330 can be represented exactly in float/double.
Furthermore, the input values themselves may be imprecise:
double c = 10.236330;
printf("%.20f\n", c);
Output:
10.23633000000000059515
There is enough precision in a double variable to store a number of 12 significant digits (though your question does not really state how many digits c has).
double k= a + b * 0.01 + c * 0.0001;
will work. But when you display it, be sure to use a format with 10 digits after the decimal point (%.10f) so that rounding restores the correct decimals.
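Putting both answers together, here is a small sketch of the corrected formula combined with the suggested %.10f format. The helper name format_packed is made up for illustration, and it assumes c has two integer digits, as in the example.

```c
#include <stdio.h>

/* Hypothetical helper: combines a, b and c as a + b/100 + c/10000 and
 * writes the result into buf with 10 digits after the decimal point.
 * Returns the value returned by snprintf. */
int format_packed(char *buf, size_t size, double a, double b, double c)
{
    double k = a + b / 100 + c / 10000;   /* c is divided by 10000, not 1000 */
    return snprintf(buf, size, "%.10f", k);
}
```

With a = 40.0, b = 56.0 and c = 10.236330, the buffer ends up holding 40.5610236330. The stored double is still only the nearest representable value; the rounding done by the format is what restores the expected decimals.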
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
I want to write a fmod() function
double fmod(double x, double y) {
double mod = x;
while(mod >= y)
{
mod -= y;
}
return mod;
}
But fmod(1.2, 0.05) returns 0.05
Although the title asks about incorrect comparison, the comparison in the program shown is correct. It is the only correct floating-point operation in the program; all the others have errors (compared to real-number arithmetic).
In fmod(1.2, 0.05), neither 1.2 nor 0.05 is representable in the double format used in your C implementation. These numerals in source code are rounded to the nearest representable values, 1.1999999999999999555910790149937383830547332763671875 and 0.05000000000000000277555756156289135105907917022705078125.
Then, in the subtraction in mod -= y;, the exact real arithmetic result, 1.14999999999999995281552145343084703199565410614013671875 is not representable, so it is rounded to 1.149999999999999911182158029987476766109466552734375.
Similar errors continue during the calculations, until eventually 0.0499999999999994615418330567990778945386409759521484375 is produced. At each point, the comparison mod >= y correctly evaluates whether mod is greater than or equal to y. When mod is less than y, the loop stops.
However, due to intervening errors, the result produced, 0.0499999999999994615418330567990778945386409759521484375, is not equal to the residue of 1.1999999999999999555910790149937383830547332763671875 divided by 0.05000000000000000277555756156289135105907917022705078125. The correct result can be calculated with the standard fmod function, which returns 0.04999999999999989175325509904723730869591236114501953125.
Note that, when you define a function named fmod, the C standard does not define the behavior because this conflicts with the standard library function of that name. You ought to give it a different name, such as fmodAlternate.
Inside the fmod routine, errors can be avoided. It is possible to implement fmod so that it returns an exact result for the arguments it is given. (This is possible because the result is always in a region of the floating-point range that is fine enough [has a low enough exponent] to represent the real arithmetic result exactly.) However, the errors in providing the arguments cannot be corrected: It is not possible to represent 1.2 or 0.05 in the double format your C implementation uses. The source code fmod(1.2, .05) will always calculate fmod(1.1999999999999999555910790149937383830547332763671875, 0.05000000000000000277555756156289135105907917022705078125), which is 0.04999999999999989175325509904723730869591236114501953125.
An alternative is to represent the numbers differently. For example, you could scale these numbers by a factor of 100, and fmod(120, 5) will return 0. What solution is appropriate depends on the circumstances of the problem you are trying to solve.
Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 4 years ago.
A C program to find if the input number is palindrome or not
The problem as I see it is that the even numbered powers come out strange. Can anyone please tell me what the problem could be?
The pow function is returning incorrect values, right? Instead of using pow in the code, multiply the running sum by 10 to get the output:
remainder = n%10;
reversed = reversed*10 + remainder;
n /= 10;
The best way to get accurate results from the pow function is to keep the values in doubles as much as you can. As with most floating-point functions, converting its results to integers tends to leave you with inaccurate values.
The implementation of pow you are using returns incorrect values. For integer powers of 10, it ought to return exactly 1, 10, 100, 1000, et cetera, but it returns values slightly different. Furthermore, when it returns a value slightly under an integer and that value is converted from floating-point to int, it is truncated, so the result of int x = pow(10, 3) may be 999 rather than 1000.
Do not use this pow for exponentiating ten. You can write a simple integer replacement for pow (with another name, of course), or you can rewrite your code to avoid relying on exponentiating ten. (Working iteratively, with 1, 10, 100, 1000, and so on, is often better—simply multiplying by ten at each step instead of exponentiating to calculate the power.)
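Both suggestions can be sketched in a few lines of integer-only C. The names ipow10 and is_palindrome are illustrative and not taken from the original program, which is not shown in the question.

```c
/* Integer replacement for pow(10, n): repeated multiplication, so the
 * result is exactly 1, 10, 100, ... as long as it fits in a long. */
long ipow10(unsigned n)
{
    long result = 1;
    while (n--) {
        result *= 10;
    }
    return result;
}

/* Palindrome test that never needs a power of ten: the number is
 * reversed digit by digit, multiplying by ten at each step. */
int is_palindrome(int n)
{
    if (n < 0) {
        return 0;                      /* treat negatives as non-palindromes */
    }
    int original = n;
    long long reversed = 0;            /* wider type avoids overflow */
    while (n > 0) {
        reversed = reversed * 10 + n % 10;
        n /= 10;
    }
    return reversed == original;
}
```

For example, ipow10(3) is exactly 1000, is_palindrome(12321) returns 1, and is_palindrome(1234) returns 0.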
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 6 years ago.
I am trying to do a physics problem and need to store a value around 5 * 10^-11.
After trying float, long double and a few others, none of them seem to be long enough. Is there a data type that will allow me to do so?
Thanks
long double I = 0;
I = 0.01902*pow(0.00318,3)/12;
printf("%Lf\n",I);
Output is 0.000000
long double I = 0;
I = 0.01902*pow(0.00318,3)/12;
At this moment, I's value is approximately 5.096953e-11. Then...
printf("%Lf\n", I);
The sole format specifier in this printf() call is %Lf. This indicates that the argument is a long double (L), and that it should be printed as a floating-point number (f). Finally, as the precision (number of digits printed after the period) is not explicitly given, it is assumed to be 6. This means that exactly 6 digits will be printed after the period, so a value around 5e-11 is displayed as 0.000000.
There are several ways to fix this. Two of them would be...
printf("%.15Lf\n", I);
This will set the precision to be 15. As such, 15 digits will be printed after the period. And...
printf("%Le\n", I);
This will print the number in scientific notation, that is, 5.096953e-11. It too can be configured to print more digits if you want them.
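For comparison, the three formats can be tried side by side. This is only a sketch: the function names second_moment and print_small_value are hypothetical wrappers around the expression from the question.

```c
#include <math.h>
#include <stdio.h>

/* The expression from the question, wrapped in a hypothetical helper. */
long double second_moment(void)
{
    return 0.01902 * pow(0.00318, 3) / 12;
}

/* Prints the same value with the default precision, with 15 digits
 * after the period, and in scientific notation. */
void print_small_value(long double v)
{
    printf("%Lf\n", v);      /* default precision of 6: shows 0.000000 */
    printf("%.15Lf\n", v);   /* 15 digits after the period             */
    printf("%Le\n", v);      /* scientific notation                    */
}
```

Calling print_small_value(second_moment()) prints 0.000000, then 0.000000000050970, then 5.096953e-11. The stored value is the same in all three cases; only the formatting differs.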
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 7 years ago.
The microcontroller I have to implement my digital filter does not support floating point operations.
Given an analog input signal (which can take on values from -1.65 V to 1.65 V) sampled at a given rate of 100 Hz, I can only perform fixed-point operations. So I'm guessing I have to convert my input to fixed point first. It is also stated that the output of the ADC is quantized into unsigned 10-bit values.
My problem is:
I know that there is a Qm.n format for fixed-point numbers, which includes a sign bit, but none of the references online cover conversion from a signed floating-point input to unsigned fixed point.
And I found this code:
int fixedValue = (int)Math.Round(floatValue*Scale);
double floatValue = (double)fixedValue/Scale;
Questions:
1. How can I choose my scaling factor?
2. Is it dependent on the range of my input values and the number of bits used for the fixed-point representation?
3. The Qm.n format uses a sign bit. Can fixed-point representations be unsigned?
It all boils down to choosing the scaling factor and mapping from signed input to unsigned 10 bit fixed point (which will be used for further calculations in solving a difference equation then converting it back to double at the output)
Thanks in advance.
Use a simple 2-point interpolation.
#define Value_MAX 1.65
#define Value_MIN (-1.65)
#define value10bit_MAX 1023
#define value10bit_MIN 0
#define slope ((value10bit_MAX - value10bit_MIN)/(Value_MAX - Value_MIN))
int value10bit = (int)Math.Round((floatValue - Value_MIN)*slope + value10bit_MIN);
OP reports "microcontroller that only support fixed-point operations." yet appears to be using (or wants to use) int fixedValue = (int)Math.Round(floatValue*Scale);, so maybe this works for OP.
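If floating point is truly unavailable on the target, the same two-point mapping can be run in the other direction with integers only. The sketch below is one possible approach, not something stated in the question: the function names are hypothetical, and both the millivolt unit and the 15 fractional bits are assumed choices.

```c
#include <stdint.h>

#define ADC_MAX     1023   /* largest unsigned 10-bit ADC code            */
#define SPAN_MV     3300   /* assumed input span: -1650 mV to +1650 mV    */
#define Q_FRAC_BITS 15     /* assumed number of fractional bits (scale)   */

/* Maps an unsigned 10-bit ADC code to signed millivolts using integer
 * arithmetic only; adding ADC_MAX / 2 rounds to the nearest millivolt. */
int32_t adc_to_millivolts(uint16_t code)
{
    return ((int32_t)code * SPAN_MV + ADC_MAX / 2) / ADC_MAX - SPAN_MV / 2;
}

/* Converts signed millivolts to a signed fixed-point value in volts with
 * Q_FRAC_BITS fractional bits, i.e. a scaling factor of 2^15 per volt. */
int32_t millivolts_to_fixed(int32_t millivolts)
{
    return millivolts * (int32_t)(1 << Q_FRAC_BITS) / 1000;
}
```

With these assumptions, code 0 maps to -1650 mV and code 1023 to +1650 mV, and 1000 mV becomes 32768 in the fixed-point format. This shows one way the scaling factor can depend on both the input range and the number of bits available: the largest magnitude, 1650 mV, scales to about 54 million, which still fits in a 32-bit integer.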