I'm using avr-gcc, where sizeof(double) and sizeof(float) are both 4, and I'm having trouble getting the correct integer result from double arithmetic:
// x is some value between 8.0 and 9.6103
double x = 9.6103;
uint32_t r = pow(x,2) * 8813377.768984962;
The correct value of r, rounded down, should be 813984763, but the actual result is 813984768.
How can I get the correct integer result?
I've tried to split the calculation like this:
uint32_t r1 = pow(x,2) * 8813377;
double d1 = pow(x,2) * .768984962;
uint32_t r = r1 + d1;
But this still suffers from precision issues, i.e. I can't seem to get 813984763 exactly, and I only need the integer part of the result to be correct. Any ideas?
A float cannot represent the precision you need for this value (813984763), much less for the intermediate calculation, and as you've noted, avr-gcc non-conformingly defines double to be the same 32-bit type as float.
The closest representable values in float are:
Below: 813984704
Above: 813984768 (closer)
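A quick way to check those neighbors (my addition, not part of the original answer) is nextafterf from math.h:
#include <stdio.h>
#include <math.h>

int main(void)
{
    float f = 813984763.0f;             /* the literal rounds to the nearest float */
    printf("%.1f\n", f);                /* 813984768.0 (above, closer) */
    printf("%.1f\n", nextafterf(f, 0)); /* 813984704.0 (below) */
    return 0;
}
At this magnitude consecutive floats are 64 apart (the exponent is 29, the mantissa has 23 bits), so no float can land on 813984763 exactly.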
You could scale it up and use 128-bit integers to do the arithmetic. 128 bits is so much headroom that you can simply convert every factor to a scaled integer and multiply it all exactly.
double x = 9.6103;
uint128_t y = x * 10000 + 0.5; // = 96103, x scaled by 10^4 (+0.5 rounds to nearest)
uint128_t c = 8813377768984962; // = 8813377.768984962 * 1000000000
uint32_t r = y * y * c / 10000 / 10000 / 1000000000;
// max y * y * c = 96103 * 96103 * 8813377768984962 =
// = 81398476378849607561973858
// UINT128_MAX = 340282366920938463463374607431768211456
// ^^ is way more, so it will not overflow.
Your platform most probably does not support the __uint128_t GCC extension, so you could write your own library for that. There are endless 128-bit libraries in C++ on GitHub; port one to C (or find one written in C) and use it.
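If you end up writing your own, the core primitive is a wide multiply built from narrower limbs. Here is a minimal sketch of a 64x64 -> 128-bit multiply using 32-bit halves (my illustration, not code from the library mentioned below):
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

typedef struct { uint64_t hi, lo; } u128;

static u128 mul_64x64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;  /* low  x low  */
    uint64_t p1 = a_lo * b_hi;  /* low  x high */
    uint64_t p2 = a_hi * b_lo;  /* high x low  */
    uint64_t p3 = a_hi * b_hi;  /* high x high */
    /* combine the two middle 32-bit columns, keeping the carry */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    u128 r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}

int main(void)
{
    /* y*y and c from the answer above; the product needs ~87 bits */
    u128 r = mul_64x64(96103ULL * 96103ULL, 8813377768984962ULL);
    printf("hi = %" PRIu64 ", lo = %" PRIu64 "\n", r.hi, r.lo);
    return 0;
}
A full library then builds 128x128 multiplication out of shifted additions of such partial products, plus shift-and-subtract division.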
Well, I got some free time, and I've always wanted a C uint128 library, so I took the library https://github.com/calccrypto/uint128_t, ported it to C, and wrote an executable that does the same computation as presented above. I compiled it for atmega128 with avr-gcc -Os and ran avr-nm -td --sort-size over the result. These are the five biggest symbols, and the whole program has ~12KB of .text, so a fair bit of space is needed for this solution to work:
00000642 T how_to_split_double_multiplication
00000706 T kuint128_rshift
00000762 T kuint128_lshift
00003104 T kuint128_mul
00004594 T kuint128_divmod
Related
I am writing a function in C that returns the radius of an ellipse with a given length and width at a given angle; basically, writing this calculation in C: r(theta) = a*b / sqrt(a^2*sin^2(theta) + b^2*cos^2(theta))
Unfortunately, the platform does not support math.h; however, there are built-in sin and cos functions that I can use.
How do I write this calculation in C and store it in an int?
I have tried:
int theta = 90;
int a = 164;
int b = 144;
float aa = (((a^2) * ((sin_lookup(DEG_TO_TRIGANGLE(theta)))^2)) +
((b^2) * ((cos_lookup(DEG_TO_TRIGANGLE(theta)))^2))) /
(TRIG_MAX_ANGLE^2);
float result = (a * b) / (my_sqrt(aa));
int value = (int)result;
Easy enough
int getRadius(double a, double b, double theta)
{
double s = sin(theta),
c = cos(theta);
return (a * b) / sqrt((a*a)*(s*s) + (b*b)*(c*c));
}
Though I'm not sure why you want to return an int. You'll lose a lot of precision.
The ^ operator is not the way to do powers. It's actually a bitwise XOR. This is a common mistake new (C) programmers make. math.h has a pow() function for calculating powers, but you said you can't use math.h. These values are only raised to the second power, so it's pretty easy to just multiply manually, as the short demo below shows.
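To see why ^ gives garbage here, a tiny demo (my addition):
#include <stdio.h>

int main(void)
{
    int a = 164;
    printf("a ^ 2 = %d\n", a ^ 2); /* bitwise XOR: prints 166, not a square */
    printf("a * a = %d\n", a * a); /* prints 26896, the intended value */
    return 0;
}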
I'm trying to represent the following mathematical expression in C:
P(n) = (n!)(6^n)
The program should compute the answer to the expression when n = 156; I have attempted to write it in C, but it fails to produce an answer. The answer is approximately 10^397. The program utilises two logarithmic identities and Stirling's approximation to calculate the large factorial.
How can I make it produce the correct answer and do you have any suggestions as to how I could improve the code? (I'm fairly new to programming):
#include <math.h>
typedef unsigned int uint;
int main()
{
uint n=156; // Declare variables
double F,pi=3.14159265359,L,e=exp(1),P;
F = sqrt(2*pi*n) * pow((n/e),n); // Stirling's Approximation Formula
L = log(F) + n*log(6); // Transform P(n) using logarithms - log(xy) = log(x) + log(y) and log(y^n) = n*log(y)
P = pow(e,L); // Transform the resultant logarithm back to a normal numbers
}
Thank you! :)
Neither integer nor floating point variables in most C implementations can support numbers of that magnitude. Typical 64-bit doubles go up to something like 10^308, with substantial loss of precision at that magnitude.
You'll need what's called a 'bignum library' to compute this, which is not part of standard C.
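If all you need is the magnitude rather than an exact integer, you can sidestep overflow entirely by working in log10. A sketch of mine (not from the question) using C99's lgamma, which returns ln(n!) as ln gamma(n+1):
#include <math.h>
#include <stdio.h>

int main(void)
{
    unsigned n = 156;
    /* log10 P(n) = log10(n!) + n*log10(6), with log10(n!) = lgamma(n+1)/ln(10) */
    double log10P = lgamma(n + 1.0) / log(10.0) + n * log10(6.0);
    double expo = floor(log10P);
    double mant = pow(10.0, log10P - expo);
    printf("P(%u) ~= %.5fe+%.0f\n", n, mant, expo);
    return 0;
}
This should print roughly 1.8407e+397, slightly above the Stirling-based figure, since Stirling's formula underestimates n! by a factor of about exp(1/(12n)).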
One idea is to use the long double type. Its precision isn't guaranteed, so it may or may not be big enough for your needs, depending on which compiler you're using.
Replace double with long double, and add an 'l' (lower-case L) suffix to all the math functions (expl, logl, powl, sqrtl). Compile with C99 enabled, since the long double math functions were added in C99. It worked for me using GCC 4.8.1.
#include <math.h>
#include <stdio.h>
typedef unsigned int uint;
int main()
{
uint n=156; // Declare variables
long double F,pi=3.14159265359,L,e=expl(1),P;
F = sqrtl(2*pi*n) * powl((n/e),n); // Stirling's Approximation Formula
L = logl(F) + n*logl(6); // Transform P(n) using logarithms - log(xy) = log(x) + log(y) and log(y^n) = n*log(y)
P = powl(e,L); // Transform the resultant logarithm back to a normal numbers
printf("%Lg\n", P);
}
I get 1.83969e+397.
Loosely speaking, in C a double is represented as a base number (the significand) scaled by a power. As already mentioned, the maximum is roughly 1E308, but as you get to larger and larger numbers (or smaller and smaller), you lose precision, because the significand has a finite number of digits and cannot always represent the value exactly.
See http://en.wikipedia.org/wiki/Double-precision_floating-point_format for more information. For reference, the following prints your platform's maximum base-10 exponent for long double (4932 for the x87 80-bit format, comfortably above the 397 needed here):
#include <math.h>
#include <float.h>
#include <stdio.h>
typedef unsigned int uint;
int main()
{
uint n=156; // Declare variables
long double F,pi=3.14159265359,L,e=expl(1),P;
F = sqrtl(2*pi*n) * powl((n/e),n); // Stirling's Approximation Formula
L = logl(F) + n*logl(6); // Transform P(n) using logarithms - log(xy) = log(x) + log(y) and log(y^n) = n*log(y)
P = powl(e,L); // Transform the resultant logarithm back to a normal numbers
printf("%d\n", LDBL_MAX_10_EXP);
}
Sorry for the wordy title. My code is targeting a microcontroller (msp430) with no floating point unit, but this should apply to any similar MCU.
If I am multiplying a large runtime variable with what would normally be considered a floating point decimal number (1.8), is this still treated like floating point math by the MCU or compiler?
My simplified code is:
int multip = 0xf; // Can be from 0-15, not available at compile time
int holder = multip * 625; // 0 - 9375
holder = holder * 1.8; // 0 - 16875
Since the result will always be a positive whole number, is it still floating-point math as far as the MCU or compiler is concerned, or is it fixed point?
(I realize I could just multiply by 18, but that would require declaring a 32-bit long instead of a 16-bit int, then dividing and downcasting for the array it will be put in; I'm trying to skimp on memory here.)
The result is not an integer; it rounds to an integer.
9375 * 1.8000000000000000444089209850062616169452667236328125
yields
16875.0000000000004163336342344337026588618755340576171875
which rounds (in double precision floating point) to 16875.
If you write a floating-point multiply, I know of no compiler that will determine that there's a way to do that in fixed-point instead. (That does not mean they do not exist, but it ... seems unlikely.)
I assume you simplified away something important, because it seems like you could just do:
result = multip * 1125;
and get the final result directly.
I'd go for chux's formula if there's some reason you can't just multiply by 1125.
I am confident FP code will be created for
holder = holder * 1.8
To avoid FP and 32-bit math, given the OP values of
int multip = 0xf; // Max 15
unsigned holder = multip * 625; // Max 9375
// holder = holder * 1.8;
// alpha depends on rounding desired, e.g. 2 for round to nearest.
holder += (holder*4u + alpha)/5;
If int x is non-negative, you can compute x *= 1.8 rounded to nearest using only int arithmetic, without overflow unless the final result overflows, with:
x - (x+2)/5 + x
For truncation instead of round-to-nearest, use:
x - (x+4)/5 + x
If x may be negative, some additional work is needed.
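A quick exhaustive check of both identities over the question's 0-9375 range (my test harness, not part of the answer):
#include <stdio.h>

int main(void)
{
    int x;
    for (x = 0; x <= 9375; x++) {
        int nearest = x - (x + 2) / 5 + x; /* x*1.8 rounded to nearest */
        int trunc   = x - (x + 4) / 5 + x; /* x*1.8 truncated */
        if (nearest != (int)(x * 1.8 + 0.5) || trunc != (int)(x * 1.8)) {
            printf("mismatch at x = %d\n", x);
            return 1;
        }
    }
    printf("all 9376 inputs match\n");
    return 0;
}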
I'm looking for a way to truncate a float into an int in a fast and portable (IEEE 754) way. The reason is because in this function 50% of the time is spent in the cast:
#include <math.h>

/* assumed definitions (the question doesn't show them): single-precision pi and 1/pi */
#define F_PI   3.14159265358979323846f
#define F_1_PI 0.31830988618379067154f

float fm_sinf(float x) {
const float a = 0.00735246819687011731341356165096815f;
const float b = -0.16528911397014738207016302002888890f;
const float c = 0.99969198629596757779830113868360584f;
float r, x2;
int k;
/* bring x in range */
k = (int) (F_1_PI * x + copysignf(0.5f, x)); /* <-- 50% of time is spent in cast */
x -= k * F_PI;
/* if x is in an odd pi count we must flip */
r = 1 - 2 * (k & 1); /* trick for r = (k % 2) == 0 ? 1 : -1; */
x2 = x * x;
return r * x*(c + x2*(b + a*x2));
}
The slowness of float->int casts mainly occurs when using x87 FPU instructions on x86. To do the truncation, the rounding mode in the FPU control word needs to be changed to round-to-zero and back, which tends to be very slow.
When using SSE instead of x87 instructions, a truncation is available without control word changes. You can do this using compiler options (like -mfpmath=sse -msse -msse2 in GCC) or by compiling the code as 64-bit.
The SSE3 instruction set has the FISTTP instruction to convert to integer with truncation without changing the control word. A compiler may generate this instruction if instructed to assume SSE3.
Alternatively, the C99 lrint() function will convert to integer with the current rounding mode (round-to-nearest unless you changed it). You can use this if you remove the copysignf term. Unfortunately, this function is still not ubiquitous after more than ten years.
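In this function, that change is a one-liner (a sketch on my part; it assumes the default round-to-nearest mode is in effect):
/* instead of: k = (int) (F_1_PI * x + copysignf(0.5f, x)); */
k = (int) lrintf(F_1_PI * x); /* lrintf itself rounds to nearest */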
I found a fast truncate method by Sree Kotay which provides exactly the optimization that I needed.
To be portable you would have to add some preprocessor directives and learn a couple of assembler dialects, but you could theoretically use some inline assembly to move portions of the floating-point register into eax/rax and ebx/rbx and convert what you need by hand. The floating-point format is a pain to decode manually, but I am pretty certain that doing it in assembly would be much faster, as your needs are very specific and the system method is more generic and less efficient for your purpose.
You could skip the conversion to int altogether by using frexpf to get the mantissa and exponent, and inspect the raw mantissa (use a union) at the appropriate bit position (calculated using the exponent) to determine (the quadrant dependent) r.
I have a program implemented in matlab and the same program in c, and the results differ.
I am a bit puzzled that the cos function does not return the exact same result.
I use the same computer, Intel Core 2 Duo, and 8 bytes double data type in both cases.
Why does the result differ?
Here is the test:
c:
double a = 2.89308776595231886830;
double b = cos(a);
printf("a = %.50f\n", a);
printf("b = %.50f\n", b);
printf("sizeof(a): %ld\n", sizeof(a));
printf("sizeof(b): %ld\n", sizeof(b));
a = 2.89308776595231886830106304842047393321990966796875
b = -0.96928123535654842068964853751822374761104583740234
sizeof(a): 8
sizeof(b): 8
matlab:
a = 2.89308776595231886830
b = cos(a);
fprintf('a = %.50f\n', a);
fprintf('b = %.50f\n', b);
whos('a')
whos('b')
a = 2.89308776595231886830106304842047393321990966796875
b = -0.96928123535654830966734607500256970524787902832031
Name Size Bytes Class Attributes
a 1x1 8 double
Name Size Bytes Class Attributes
b 1x1 8 double
So, b differs slightly (very slightly, but enough to make my debugging task difficult):
b = -0.96928123535654842068964853751822374761104583740234 c
b = -0.96928123535654830966734607500256970524787902832031 matlab
Since I use the same computer and the same 8-byte double data type in both cases, why does the result differ? Does MATLAB not use the cos function built into the Intel hardware?
Is there a simple way to use the same cos function in MATLAB and C (with identical results), even if a bit slower, so that I can safely compare the results of my two programs?
Update:
thanks a lot for your answers!
So, as you have pointed out, the cos function for matlab and c differ.
That's amazing! I thought they were using the cos function built-in in the Intel microprocessor.
The cos version of MATLAB is equal (at least for this test) to the Java one.
you can try from matlab also: b=java.lang.Math.cos(a)
Then I wrote a small MEX function to use the C cos version from within MATLAB, and it works fine. This allows me to debug my program (the same one implemented in MATLAB and C) and see at what point they differ, which was the purpose of this post.
The only problem is that calling the MEX C cos version from MATLAB is way too slow.
I am now trying to call the Java cos function from C (as it is the same as MATLAB's) to see if that goes faster.
Floating point numbers are stored in binary, not decimal. A double-precision float has 52 bits of mantissa (53 counting the implicit leading bit), which translates to roughly 15 to 17 significant decimal digits: 15 digits are guaranteed to survive a round trip through a double, and 17 are enough to uniquely determine which double was printed.
As a dyadic rational, a double has an exact representation in decimal, which takes many more decimal places than that to write out (in your case, 52 or 53 places, I believe). However, the standards for printf and similar functions do not require the digits past DECIMAL_DIG (typically 17) to be correct; they could be complete nonsense. I suspect one of the two environments is printing the exact value, and the other is printing a poor approximation, and that in reality both correspond to the exact same binary double value.
Using the script at http://www.mathworks.com/matlabcentral/fileexchange/1777-from-double-to-string
the difference between the two numbers is only in the last bit:
octave:1> bc = -0.96928123535654842068964853751822374761104583740234;
octave:2> bm = -0.96928123535654830966734607500256970524787902832031;
octave:3> num2bin(bc)
ans = -.11111000001000101101000010100110011110111001110001011*2^+0
octave:4> num2bin(bm)
ans = -.11111000001000101101000010100110011110111001110001010*2^+0
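You can confirm in C that the two printed values parse back to doubles exactly one ULP apart (my check, not part of the original answer):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    double bc = strtod("-0.96928123535654842068964853751822374761104583740234", NULL);
    double bm = strtod("-0.96928123535654830966734607500256970524787902832031", NULL);
    /* stepping bm one representable value toward -1 lands exactly on bc */
    printf("%d\n", bc == nextafter(bm, -1.0)); /* prints 1 */
    return 0;
}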
One of them must be closer to the "correct" answer, assuming the value given for a is exact.
>> be = vpa('cos(2.89308776595231886830)',50)
be =
-.96928123535654836529707365425580405084360377470583
>> bc = -0.96928123535654842068964853751822374761104583740234;
>> bm = -0.96928123535654830966734607500256970524787902832031;
>> abs(bc-be)
ans =
.5539257488326242e-16
>> abs(bm-be)
ans =
.5562972757925323e-16
So, the C library result is more accurate.
For the purposes of your question, however, you should not expect to get the same answer in matlab and whichever C library you linked with.
The result is the same up to 15 decimal places; I suspect that is sufficient for almost all applications. If you require more than that, you should probably implement your own version of cosine anyway, so that you are in control of the specifics and your code is portable across different C compilers.
They will differ because they undoubtedly use different methods to calculate the approximation, or iterate a different number of times. As cosine is defined by an infinite series of terms, an approximation must be used for its software implementation. The CORDIC algorithm is one common implementation.
Unfortunately, I don't know the specifics of either implementation; indeed, the C one will depend on which C standard library implementation you are using. A sketch of the series approach follows.
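For example, here is a minimal portable cosine built from the Taylor series (a sketch of mine; certainly not the algorithm MATLAB or your libm actually uses):
#include <stdio.h>

/* cos(x) = sum over k of (-1)^k * x^(2k) / (2k)!, with each term
   derived from the previous one; adequate for |x| up to about pi */
static double my_cos(double x)
{
    double term = 1.0, sum = 1.0;
    int k;
    for (k = 1; k <= 20; k++) {
        term *= -x * x / ((2.0 * k - 1) * (2.0 * k));
        sum += term;
    }
    return sum;
}

int main(void)
{
    printf("%.17f\n", my_cos(2.89308776595231886830));
    return 0;
}
Compiled with the same floating-point settings on both sides (e.g. called from MATLAB through a MEX wrapper), a function like this gives you one consistent reference to compare against.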
As others have explained, when you enter that number directly in your source code, not all the fraction digits will be used, as you only get 15/16 decimal digits of precision; the literal gets converted to the nearest double value in binary (anything beyond that fixed limit of digits is dropped).
To make things worse, as @R pointed out, IEEE 754 does not require transcendental functions like cosine to be correctly rounded, so an error in the last bit is tolerated. I actually ran into this when using different compilers.
To illustrate, I tested with the following MEX file, once compiled with the default LCC compiler, and then using VS2010 (I am on WinXP 32-bit).
In one function we directly call the C functions (mexPrintf is simply a macro #defined as printf). In the other, we call mexEvalString to evaluate the same statements in the MATLAB engine (equivalent to typing them at the MATLAB command prompt).
prec.c
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include "mex.h"
void c_test()
{
double a = 2.89308776595231886830L;
double b = cos(a);
mexPrintf("[C] a = %.25Lf (%16Lx)\n", a, a);
mexPrintf("[C] b = %.25Lf (%16Lx)\n", b, b);
}
void matlab_test()
{
mexEvalString("a = 2.89308776595231886830;");
mexEvalString("b = cos(a);");
mexEvalString("fprintf('[M] a = %.25f (%bx)\\n', a, a)");
mexEvalString("fprintf('[M] b = %.25f (%bx)\\n', b, b)");
}
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
matlab_test();
c_test();
}
compiled with LCC
>> prec
[M] a = 2.8930877659523189000000000 (4007250b32d9c886)
[M] b = -0.9692812353565483100000000 (bfef045a14cf738a)
[C] a = 2.8930877659523189000000000 ( 32d9c886)
[C] b = -0.9692812353565484200000000 ( 14cf738b) <---
compiled with VS2010
>> prec
[M] a = 2.8930877659523189000000000 (4007250b32d9c886)
[M] b = -0.9692812353565483100000000 (bfef045a14cf738a)
[C] a = 2.8930877659523189000000000 ( 32d9c886)
[C] b = -0.9692812353565483100000000 ( 14cf738a) <---
I compile the above using: mex -v -largeArrayDims prec.c, and switch between the backend compilers using: mex -setup
Note that I also tried to print the hexadecimal representation of the numbers, but I only managed to show the lower half of the binary double numbers in C (the UPDATE below shows a correct way to do it).
Finally, if you need more precision in your calculations, consider using a library for variable-precision arithmetic. In MATLAB, if you have access to the Symbolic Math Toolbox, try:
>> a = sym('2.89308776595231886830');
>> b = cos(a);
>> vpa(b,25)
ans =
-0.9692812353565483652970737
So you can see that the actual value is somewhere between the two different approximations I got above, and in fact they are all equal up to the 15th decimal place:
-0.96928123535654831.. # 0xbfef045a14cf738a
-0.96928123535654836.. # <--- actual value (cannot be represented in 64-bit)
-0.96928123535654842.. # 0xbfef045a14cf738b
^
15th digit --/
UPDATE:
If you want to correctly display the hexadecimal representation of floating point numbers in C, use this helper function instead (similar to NUM2HEX function in MATLAB):
#include <stdio.h>

/* you need to adjust for double/float datatypes, big/little endianness */
void num2hex(double x)
{
unsigned char *p = (unsigned char *) &x;
int i;
for(i=sizeof(double)-1; i>=0; i--) {
printf("%02x", p[i]);
}
}
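For example, appending a hypothetical driver to the same file (the expected output matches the MATLAB bit pattern shown earlier):
int main(void)
{
    num2hex(-0.96928123535654830966734607500256970524787902832031);
    printf("\n"); /* prints bfef045a14cf738a */
    return 0;
}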