As part of a program that I am writing for an assignment, I need to compare two numbers. Essentially, the program computes the eccentricity of an ellipse given its two axes and it has to compare the value of the calculated eccentricity to the (given) eccentricity of the Moon's orbit around the Earth, and Earth's orbit around the Sun. If the calculated eccentricity is greater than the given eccentricity, then this needs to be represented by a value of 1, otherwise, a value of 0. All of these values are floating-point, specifically, long double.
The constraints of the assignment do not allow me to use comparison operators (like >) or any sort of logic (!x or if-else). However, I am allowed to use the pow and sqrt functions from the math.h library. I am also allowed to use arithmetic operations as well as the modulo operation.
I know that I can take advantage of integer division to truncate the decimal if the denominator is greater than the numerator, i.e.:
int x = eccentricity / MOON_ORBIT_ECCENTRICITY;
... will be 0 if MOON_ORBIT_ECCENTRICITY is greater than eccentricity. However, if this relationship is inverted, then the value of x could be any non-zero integer. In such a case, the desired result is 1.
The first and most intuitive (and naïve) solution was:
int y = (x / x);
This will return 1 if x is non-zero. However, if x is 0, then my program crashes due to division by zero. In fact, I keep running into the problem of dividing by zero. This also happens in the case of:
int y = (x + 1) % x;
Does anyone have an idea of how to solve this? This seems so frustratingly easy.
@lurker's comment above is a good approach to handling eccentricity under the OP's restrictions.
So as not to copy that, consider the following not-so-serious alternative:
#include <stdio.h>
#include <string.h>
// Return e1 > e2
int Eccentricity_Compare(long double e1, long double e2) {
char buf[20];
// print a number beginning with
// if e2 >= e1: `+`
// else `-`
sprintf(buf, "%+Le", e2 - e1); // reverse subtraction for = case
const char *pm = "+-";
char *p = strchr(pm, buf[0]);
return (int) (p - pm);
}
Wink, wink: OP said nothing about <stdio.h> functions.
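The division-by-zero crash described in the question can also be avoided with arithmetic alone by dividing by x + 1 instead of x. The following is only a sketch under stated assumptions: the helper name at_least is made up, both inputs are assumed non-negative with a positive reference value, and the quotient is assumed to fit in an int. It also yields 1 when the two values are exactly equal, which differs slightly from a strict "greater than".

```c
/* Hypothetical helper: returns 1 if e >= ref, else 0, using only
   arithmetic. Assumes e >= 0, ref > 0, and that e / ref fits in an int. */
int at_least(long double e, long double ref)
{
    int q = e / ref;          /* truncates: 0 when e < ref, otherwise >= 1 */
    return 1 - 1 / (q + 1);   /* q + 1 is never zero for q >= 0 */
}
```

With a reference of 0.0549L, at_least(0.0167L, 0.0549L) gives 0 and at_least(0.2056L, 0.0549L) gives 1.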
This is what I've found so far online,
#include <stdio.h>
int main(void)
{
long a = 12345;
int b = 10;
int remain = a - (a / b) * b;
printf("%i\n", remain);
}
First, I wonder how the formula works. Maybe I can't do math, but the order of operations here seems a bit odd. If I run this code, the expected answer of 5 is printed. But I don't get how (a / b) * b doesn't cancel out to 'a', leading to a - a = 0.
Now, this only works for int and long; as soon as doubles are involved it doesn't work anymore. Can anyone tell me why? Is there an alternative to modulo that works for double?
Also, I'm not sure I understand up to what value a long can go. I found online that the upper limit was 2147483647, but when I input bigger numbers, such as the one in 'a', the code runs without any issue up to a certain point...
Thanks for your help. I'm new to coding and trying to learn!
Given two finite double numbers x and y, with y not equal to zero, fmod(x, y) produces the remainder of x when divided by y. Specifically, it returns x − ny, where n is an integer chosen so that x − ny has the same sign as x and is smaller in magnitude than y. (So, if x is positive, 0 ≤ fmod(x, y) < |y|, and, if x is negative, −|y| < fmod(x, y) ≤ 0.)
fmod is declared in <math.h>.
A properly implemented fmod returns an exact result; there is no floating-point error, since the specified result is always representable.
The C standard also specifies remquo, which returns the remainder along with some low bits (at least three) of the quotient n, and remainder, which uses a variation on the definition of the remainder. It also specifies variants of these functions for float and long double.
Naive implementation. Limited range. Adds additional floating point imprecisions (as it does some arithmetic)
double naivemod(double x)
{
return x - (long long)x;
}
int main(void)
{
printf("%.50f\n", naivemod(345345.567567756));
printf("%.50f\n", naivemod(.0));
printf("%.50f\n", naivemod(10.5));
printf("%.50f\n", naivemod(-10.0/3));
}
I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>
double feval(double x) {
/* Insert your code here */
if (x > 1e299)
{
return x-1;
}
if (x < 1e-6)
{
long double g;
g = x;
printf("x = %Lf\n", g);
long double a;
a = pow(x,2);
printf("x squared = %Lf\n", a);
return sqrt(g*g+1.)- 1.;
}
else
{
printf("x = %f\n", x);
printf("Used third \n");
return sqrt(pow(x,2)+1.)-1;
}
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}
For small inputs, you're getting rounding error when you compute 1+x^2. If x=1e-7f, x*x will happily fit into a 32-bit floating-point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating-point representation), but x*x will be so much smaller than 1 that floating-point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
There are two issues here when implementing this in the naive way: overflow or underflow in intermediate computation when computing x * x, and subtractive cancellation during the final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1)
/(sqrt(x*x+1)+1)
= (x*x+1-1) / (sqrt(x*x+1)+1)
= x*x/(sqrt(x*x+1)+1)
The last formula can be used as an implementation. For very small x, sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be exactly 2), but we don't lose precision in evaluating it.
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.
I'm trying to compare double-precision values using an epsilon. However, I have a problem: initially I thought that the difference should be equal to the epsilon, but it isn't. Additionally, when I tried to check the binary representation using successive multiplication, something strange happened and I feel confused. I would therefore appreciate an explanation of the problem and comments on my way of thinking.
#include <stdio.h>
#define EPSILON 1e-10
void double_equal(double a, double b) {
printf("a: %.12f, b: %.12f, a - b = %.12f\n", a, b, a - b);
printf("a: %.12f, b: %.12f, b - a = %.12f\n", a, b, b - a);
if (a - b < EPSILON) printf("a - b < EPSILON\n");
if (a - b == EPSILON) printf("a - b == EPSILON\n");
if (a - b <= EPSILON) printf("a - b <= EPSILON\n");
if (b - a <= EPSILON) printf("b - a <= EPSILON\n");
}
int main(void) {
double wit1 = 1.0000000001;
double wit2 = 1.0;
double_equal(wit1, wit2);
return 0;
}
The output is:
a: 1.000000000100, b: 1.000000000000, a - b = 0.000000000100
a: 1.000000000100, b: 1.000000000000, b - a = -0.000000000100
b - a <= EPSILON
Numeric constants in C are declared as doubles if we don't provide "F"/"f" sign right after the number (#define EPSILON 1e-10F), therefore I can't see here the problem of conversion as in this question. Therefore, I have created really simple program for THESE SPECIFIC examples (I know it should include handling converting integral parts to binary numbers).
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
char* convert(double a) {
char* res = malloc(200);
int count = 0;
double integral;
a = modf(a, &integral);
if (integral == 1) {
res[count++] = integral + '0';
res[count++] = '.';
} else {
res[count++] = '0';
res[count++] = '.';
}
while(a != 0 && count < 200) {
printf("%.100f\n", a);
a *= 2;
a = modf(a, &integral);
if (integral == 1) res[count++] = integral + '0';
else res[count++] = '0';
}
res[count] = '\0';
return res;
}
int main(void) {
double wit1 = 1.0000000001;
double diff = 0.0000000001;
char* res = convert(wit1);
char* di = convert(diff);
printf("this: %s\n", res);
printf("diff: %s\n", di);
return 0;
}
Direct output:
this: 1.0000000000000000000000000000000001101101111100111
diff: 0.00000000000000000000000000000000011011011111001101111111011001110101111011110110111011
First question: why there are so many ending zero-ones in the difference? Why do the results after the binary point differ?
However, if we look at the process of calculation and the fractional part, printed out (I'm presenting only the first few lines):
1.0000000001:
0.0000000001000000082740370999090373516082763671875000000000000000000000000000000000000000000000000000
0.0000000002000000165480741998180747032165527343750000000000000000000000000000000000000000000000000000
0.0000000004000000330961483996361494064331054687500000000000000000000000000000000000000000000000000000
0.0000000001:
0.0000000001000000000000000036432197315497741579165547065599639608990401029586791992187500000000000000
0.0000000002000000000000000072864394630995483158331094131199279217980802059173583984375000000000000000
0.0000000004000000000000000145728789261990966316662188262398558435961604118347167968750000000000000000
Second question: why there are so many strange ending numbers? Is this a result of the incapability of the floating-point arithmetic of precisely representing decimal values?
Analyzing the subtraction, I can see, why the result is bigger than the epsilon. I follow the procedure:
Prepare a complement sequence of zero-ones for the sequence to subtract
"Add" the sequences
Subtract the one in the beginning, add it to the rightmost bit
Therefore:
1.0000000000000000000000000000000001101101111100111
- 1.0000000000000000000000000000000000000000000000000
|
\/
1.0000000000000000000000000000000001101101111100111
"+"0.1111111111111111111111111111111111111111111111111
--------------------------------------------------------
10.0000000000000000000000000000000001101101111100110
|
\/
0.0000000000000000000000000000000001101101111100111
Comparing with the calculated value of epsilon:
0.000000000000000000000000000000000110110111110011 0 1111111011001110101111011110110111011
0.000000000000000000000000000000000110110111110011 1
Spaces indicate the difference.
Third question: do I have to worry if I can't compare the value equal to the epsilon? I think that this situation indicates what the interval of tolerance with epsilon has been made for. However, is there anything I should change?
Why do the results after the binary point differ?
Because that is the difference.
Expecting something else comes from assuming that 1.0000000001 and 0.0000000001, stored as double, have exactly those two values. They do not. Their difference is not 1.0. They have values near those two, each with up to 53 binary digits of significance. Their difference is close to the unit in the last place of 1.0000000001.
why there are so many strange ending numbers? Is this a result of the incapability of the floating-point arithmetic of precisely representing decimal values?
Somewhat.
double can encode about 2^64 different numbers. 1.0000000001 and 0.0000000001 are not in that set. Instead, nearby ones are used, and those look like strange ending numbers.
do I have to worry if I can't compare the value equal to the epsilon? I think that this situation indicates what the interval of tolerance with epsilon has been made for. However, is there anything I should change?
Yes, change the use of epsilon. An epsilon is useful for a relative difference, not an absolute one. Very large consecutive double values are far more than epsilon apart. About 45% of all double values (the small ones) are less than epsilon in magnitude. Either if (a - b <= EPSILON) printf("a - b <= EPSILON\n"); or if (b - a <= EPSILON) printf("b - a <= EPSILON\n"); will be true for small a, b even though they differ in magnitude by a factor of trillions.
Oversimplification:
if (fabs(a-b) < EPSILON*fabs(a + b)) {
return values_a_b_are_near_each_other;
}
This answer assumes your C implementation uses IEEE-754 binary64, also known as the “double” format for its double type. This is common.
If the C implementation rounds correctly, then double wit1 = 1.0000000001; initializes wit1 to 1.0000000001000000082740370999090373516082763671875. This is because the two representable values nearest 1.0000000001 are 1.000000000099999786229432174877729266881942749023437500000000000000000000000000000000000000000000000 and 1.0000000001000000082740370999090373516082763671875. The latter is chosen since it is closer.
If correctly rounded, the 1e-10 used for EPSILON will produce 0.000000000100000000000000003643219731549774157916554706559963960899040102958679199218750000000000000.
Clearly wit1 - 1 exceeds EPSILON, so the test a - b < EPSILON in double_equal evaluates as false.
First question: why there are so many ending zero-ones in the difference?
Count the number of bits from the first 1 to the last 1. In each number, there are 53. That is because there are 53 bits in the significand of a double. It is a bit of a coincidence your numbers happened to end in a 1 bit. About half the time, the trailing bit is 0, and a quarter of the time, the last two bits are zeros, and so on. However, since there are 53 bits in the significand of a double, there will be exactly 53 bits from the first 1 bit to the last bit that is part of the represented value.
Since your first number starts with a 1 in the integer position, it has at most 52 bits after that. At that point, the number must be rounded to the nearest representable value.
Since your second number is between 2^−34 and 2^−33, its first 1 bit is in the 2^−34 position, and it can go to the 2^−86 position before it has to be rounded.
Third question: do I have to worry if I can't compare the value equal to the epsilon?
Why do you want to compare to the epsilon? There is no general solution for comparing floating-point numbers that contain errors from previous operations. Whether or not an “epsilon comparison” can or should be used is dependent on the application and the operations and numbers involved.
I'm trying to multiply two 3-digit numbers.
I used two nested for loops, multiplied each digit of num1 with each digit of num2,
and shifted each result to the appropriate place using pow().
The problem is that 3 * pow(10,2) is coming out as 299 instead of 300.
I haven't tried much, but I used printf to see what is actually happening at runtime, and this is what I found.
the values of tempR after shift should be
5,40,300,100,800,6000,1500,12000,90000
but are coming as
5,40,299,100,799,6000,1500,12000,89999
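#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
int result = 0; // final result (must start at zero before accumulating)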
int tempR; // temporary for each iteration
char a[] = "345"; // number 1
char b[] = "321"; // number 2
for(int i = 2;i>= 0 ; i --)
{
for(int j = 2;j >= 0 ; j --)
{
int shift = abs(i-2 + j -2);
printf("%d\n",shift); //used to see the values of shift.
//and it is coming as expected
tempR = (int)(b[i] - '0') * (int)(a[j] - '0');
printf("%d \n",tempR); // value to tempR is perfect
tempR = tempR*pow(10,shift);
printf("%d \n",tempR); // here the problem starts
result += tempR;
}
}
printf("%d",result);
}
Although IEEE754 (ubiquitous on desktop systems) is required to return the best possible floating point value for certain operators such as addition, multiplication, division, and subtraction, and certain functions such as sqrt, this does not apply to pow.
pow(x, y) can be, and often is, implemented as exp(y * ln(x)). Hopefully you can see that this can cause the result to "go off" spectacularly when pow is used with seemingly trivial integral arguments and the result is truncated to int.
There are C implementations out there that have more accurate implementations of pow than the one you have, particularly for integral arguments. If such accuracy is required, then you could move your toolset to such an implementation. Borrowing an implementation of pow from a respected mathematics library is also an option, else roll your own. Using round is also a technique, if a little kludgy if you get my meaning.
Avoid floating-point functions for integer calculations. The result of pow is not guaranteed to be exact. In this case it is slightly below the exact value, so the product is slightly below 300 and the conversion to integer makes it 299.
The pow function operates on doubles. Doubles use finite precision. Conversion back to integer chops rather than rounding.
Finite precision is like representing 1/3 as 0.333333. If you do 9 * 1/3 and chop to an integer, you'll get 2 instead of 3 because 9 * 1/3 will give 2.999997 which chops to two.
This same kind of rounding and chopping is causing you to be off by one. You could also round by adding 0.5 before chopping to an integer, but I wouldn't suggest it.
Don't pass integers through doubles and back if you expect exact answers.
Others have mentioned that pow does not yield exact results, and if you convert the result to an integer there's a high risk of loss of precision, especially since assigning a floating-point value to an integer type truncates the result rather than rounding it. Read more here: Is floating math broken?
The most convenient solution is to write your own integer variant of pow. It can look like this:
int int_pow(int num, int e)
{
int ret = 1;
while(e-- > 0)
ret *= num;
return ret;
}
Note that it will not work if e is negative or if both num and e are 0. It also has no protection against overflow. It just shows the idea.
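If overflow protection is wanted, one possible variant is sketched below. The name checked_int_pow and its out-parameter interface are assumptions, and the sketch relies on long long being wider than int, which is typical but not guaranteed.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical variant: stores base^exp in *out and returns true, or
   returns false if exp is negative or the result does not fit in int.
   Assumes long long is wider than int. */
bool checked_int_pow(int base, int exp, int *out)
{
    if (exp < 0)
        return false;
    long long acc = 1;
    while (exp-- > 0) {
        acc *= base;                       /* acc and base both fit in int */
        if (acc > INT_MAX || acc < INT_MIN)
            return false;                  /* would overflow int */
    }
    *out = (int)acc;
    return true;
}
```

The caller checks the return value before using the output, for example treating checked_int_pow(10, 10, &r) returning false as an error on platforms with a 32-bit int.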
In your particular case, you could write a very specialized variant based on 10:
unsigned int pow10(unsigned int e)
{
unsigned int ret = 1;
while(e-- > 0)
ret *= 10;
return ret;
}
Take this simple function for example.
int checkIfTriangleIsValid(double a, double b, double c) {
//fix the precision problem
int c1, c2, c3;
c1 = a+b>c ? 0 : 1;
c2 = b+c>a ? 0 : 1;
c3 = c+a>b ? 0 : 1;
if(c1 == 0 && c2 == 0 && c3 == 0)
return 0;
else {
printf("%d, %d, %d\n",c1, c2, c3);
return 1;
}
}
I pass a = 1.923, b = 59.240, c = 61.163.
Now for some reason when I check for the condition in c1 it should give me 1, but instead, it gives me 0. I tried a printf with %.30f and found that the values change later on.
How can I fix this problem?
EDIT: I checked the other questions that are similar to mine but they don't even have a double.
Likely your C implementation uses the IEEE-754 basic 64-bit binary floating-point format for double. When 1.923, 59.240, and 61.163 are properly converted to the nearest values representable in double, the results are exactly:
1.9230000000000000426325641456060111522674560546875,
59.24000000000000198951966012828052043914794921875, and
61.1629999999999967030817060731351375579833984375.
As you can see, the first two of these sum to more than the third. This means that, by the time you assign these values to double objects, they have already been altered in a way that changes their relationship. No subsequent calculations can repair this, because the original information is gone.
Since no solution after conversion to double can work, you need a solution that operates before or instead of conversion to double. If you want to compute exactly, or more precisely, with the values 1.923, 59.240, and 61.163, you may need to write your own decimal arithmetic code or find some other code that supports decimal arithmetic. If you only want to work with numbers with three decimal places, then a possible solution is to write some code that reads input such as “59.240” and returns it in an integer object scaled by 1000, so that 59240 is returned. The resulting values could then easily be tested for the triangle inequality.
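The scaled-integer idea could be sketched roughly as below. Both function names are made up, the parser assumes non-negative input with at most three digits after the decimal point, and it does not guard against overflow for very long inputs. Note that is_valid_triangle_milli returns 1 for a valid triangle, which is the opposite of the convention in the question's function.

```c
#include <ctype.h>

/* Hypothetical parser: "59.240" -> 59240. Returns -1 on malformed input
   or when more than three fractional digits are present. */
long parse_milli(const char *s)
{
    long whole = 0, frac = 0;
    int digits = 0;
    if (!isdigit((unsigned char)*s))
        return -1;
    while (isdigit((unsigned char)*s))
        whole = whole * 10 + (*s++ - '0');
    if (*s == '.') {
        s++;
        while (isdigit((unsigned char)*s) && digits < 3) {
            frac = frac * 10 + (*s++ - '0');
            digits++;
        }
    }
    if (*s != '\0')
        return -1;
    while (digits++ < 3)   /* pad to exactly three fractional digits */
        frac *= 10;
    return whole * 1000 + frac;
}

/* Strict triangle inequality on exact scaled integers. */
int is_valid_triangle_milli(long a, long b, long c)
{
    return a + b > c && b + c > a && c + a > b;
}
```

With the values from the question, 1923 + 59240 equals 61163 exactly, so the strict test reports the degenerate triangle as invalid.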
when I check for the condition in c1 it should give me 1, but instead it gives me 0
How can I fix this problem?
Change your expectations.
A typical double can represent exactly about 2^64 different values. 1.923, 59.240, 61.163 are typically not in that set, as double is usually encoded in a binary format, e.g. binary64.
When a, b, c are assigned 1.923, 59.240, 61.163, they get values more like the ones below, which are the closest double values.
a 1.923000000000000042632564145606...
b 59.240000000000001989519660128281...
c 61.162999999999996703081706073135...
In my case, the a, and b both received a slightly higher value than the decimal code form, while c received a slightly lower one.
When adding a+b, the sum was rounded up, further away from c.
printf("a+b %35.30f\n", a+b);
a+b 61.163000000000003808509063674137
a + b > c was true, as were the other comparisons, so OP's checkIfTriangleIsValid(1.923, 59.240, 61.163) should return valid (0), as it is really more like checkIfTriangleIsValid(1.9230000000000000426..., 59.24000000000000198..., 61.16299999999999670...).
Adding a+b is further complicated in that the addition may occur using double or long double math. Research FLT_EVAL_METHOD for details. Rounding mode also can affect the final sum.
#include <float.h>
printf("FLT_EVAL_METHOD %d\n", FLT_EVAL_METHOD);
As to an alternative triangle check, subtract the largest 2 values and then compare against the smallest.
a > (c-b) can preserve significantly more precision than (a+b) > c.
// Assume a,b,c >= 0
int checkIfTriangleIsValid_2(double a, double b, double c) {
// Sort so `c` is largest, then b, a.
if (c < b) {
double t = b; b = c; c = t;
}
if (c < a) {
double t = a; a = c; c = t;
}
if (a > b) {
double t = b; b = a; a = t;
}
// So far, no loss of precision is expected due to compares/swaps.
// Only now need to check a + b > c for a valid triangle (zero area counts as invalid).
// To preserve precision, subtract from `c` the value closest to it (`b`).
return a > (c-b);
}
I will review more later as time permits. This approach significantly helps toward a precise answer, yet more edge cases need to be assessed. It reports a valid triangle for checkIfTriangleIsValid_2(1.923, 59.240, 61.163).
FLT_EVAL_METHOD, rounding mode and double encoding can result in different answers on other platforms.
Notes:
It appears a checkIfTriangleIsValid() returning 0 means valid triangle.
It also appears when the triangle has 0 area, the expected result is 1 or invalid.