I have a program implemented in MATLAB and the same program in C, and the results differ.
I am a bit puzzled that the cos function does not return exactly the same result.
I use the same computer, an Intel Core 2 Duo, and an 8-byte double data type in both cases.
Why does the result differ?
Here is the test:
c:
double a = 2.89308776595231886830;
double b = cos(a);
printf("a = %.50f\n", a);
printf("b = %.50f\n", b);
printf("sizeof(a): %ld\n", sizeof(a));
printf("sizeof(b): %ld\n", sizeof(b));
a = 2.89308776595231886830106304842047393321990966796875
b = -0.96928123535654842068964853751822374761104583740234
sizeof(a): 8
sizeof(b): 8
matlab:
a = 2.89308776595231886830
b = cos(a);
fprintf('a = %.50f\n', a);
fprintf('b = %.50f\n', b);
whos('a')
whos('b')
a = 2.89308776595231886830106304842047393321990966796875
b = -0.96928123535654830966734607500256970524787902832031
Name Size Bytes Class Attributes
a 1x1 8 double
Name Size Bytes Class Attributes
b 1x1 8 double
So, b differs a bit (very slightly, but enough to make my debugging task difficult):
b = -0.96928123535654842068964853751822374761104583740234 c
b = -0.96928123535654830966734607500256970524787902832031 matlab
Does MATLAB not use the hardware cos instruction built into the Intel processor?
Is there a simple way to use the same cos function in MATLAB and C (with exactly the same results), even if a bit slower, so that I can safely compare the results of my MATLAB and C programs?
Update:
Thanks a lot for your answers!
So, as you have pointed out, the cos functions of MATLAB and C differ.
That's amazing! I thought they were both using the cos instruction built into the Intel microprocessor.
The cos version of MATLAB is equal (at least for this test) to the Java one.
You can try it from MATLAB too: b = java.lang.Math.cos(a)
Then I wrote a small MEX function to use the C cos version from within MATLAB, and it works fine; this allows me to debug my program (the same one implemented in MATLAB and C) and see at what point they differ, which was the purpose of this post.
The only problem is that calling the MEX C cos version from MATLAB is way too slow.
I am now trying to call the Java cos function from C (as it is the same as MATLAB's) to see if that goes faster.
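For reference, the small MEX wrapper mentioned above can be very short; here is a minimal sketch (the wrapper name and file layout are hypothetical, not my exact code):

#include "mex.h"
#include <math.h>

/* b = c_cos(a): returns the C library's cos(a) from within MATLAB,
   so the MATLAB and C results can be compared bit for bit */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double a = mxGetScalar(prhs[0]);        /* first input as a double */
    plhs[0] = mxCreateDoubleScalar(cos(a)); /* C library cosine */
}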
Floating point numbers are stored in binary, not decimal. A double precision float has 52 explicitly stored bits of significand (53 counting the implicit leading bit), which translates to roughly 15-16 significant decimal places; 17 significant decimal digits are always enough to uniquely determine which double was printed.
As a dyadic rational, a double also has an exact representation in decimal, which takes many more decimal places than that to write out (in your case, around 50 places). However, the standards for printf and similar functions do not require the digits beyond those needed to identify the value to be correct; they could be complete nonsense. I suspect one of the two environments is printing the exact value, and the other is printing a poor approximation, and that in reality both correspond to the exact same binary double value.
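One way to check whether two printed values really are the same double is to compare their bit patterns directly; a minimal sketch in C (the value is the one from the question):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    double b = -0.96928123535654842068964853751822374761104583740234;
    uint64_t bits;
    memcpy(&bits, &b, sizeof bits);   /* well-defined way to view the raw bits */
    printf("%.17g = 0x%016llx\n", b, (unsigned long long)bits);
    return 0;
}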
Using the script at http://www.mathworks.com/matlabcentral/fileexchange/1777-from-double-to-string
the difference between the two numbers is only in the last bit:
octave:1> bc = -0.96928123535654842068964853751822374761104583740234;
octave:2> bm = -0.96928123535654830966734607500256970524787902832031;
octave:3> num2bin(bc)
ans = -.11111000001000101101000010100110011110111001110001011*2^+0
octave:4> num2bin(bm)
ans = -.11111000001000101101000010100110011110111001110001010*2^+0
One of them must be closer to the "correct" answer, assuming the value given for a is exact.
>> be = vpa('cos(2.89308776595231886830)',50)
be =
-.96928123535654836529707365425580405084360377470583
>> bc = -0.96928123535654842068964853751822374761104583740234;
>> bm = -0.96928123535654830966734607500256970524787902832031;
>> abs(bc-be)
ans =
.5539257488326242e-16
>> abs(bm-be)
ans =
.5562972757925323e-16
So, the C library result is more accurate.
For the purposes of your question, however, you should not expect to get the same answer in matlab and whichever C library you linked with.
The results are the same up to 15 decimal places; I suspect that is sufficient for almost all applications. If you require more, you should probably implement your own version of cosine, so that you are in control of the specifics and your code is portable across different C compilers.
They will differ because they undoubtedly use different methods to calculate the approximation to the result, or iterate a different number of times. As cosine is defined by an infinite series, a finite approximation must be used in any software implementation, such as a truncated Taylor series (sketched below) or the CORDIC algorithm.
Unfortunately, I don't know the specifics of the implementation in either case; indeed, the C one will depend on which C standard library implementation you are using.
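To illustrate the idea, here is a sketch of a truncated Taylor series cosine in C; this is not the method either library actually uses, and the term count and crude range reduction are arbitrary choices:

#include <math.h>   /* for fmod and M_PI (assumed available) */

/* a sketch: cosine from the first few Taylor terms after reducing
   x to [-pi, pi]; accuracy depends on how many terms you keep */
double taylor_cos(double x)
{
    x = fmod(x, 2.0 * M_PI);             /* crude range reduction */
    if (x > M_PI)  x -= 2.0 * M_PI;
    if (x < -M_PI) x += 2.0 * M_PI;

    double term = 1.0, sum = 1.0;
    for (int n = 1; n <= 10; n++) {      /* sum of (-1)^n x^(2n) / (2n)! */
        term *= -x * x / ((2.0 * n - 1.0) * (2.0 * n));
        sum += term;
    }
    return sum;
}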
As others have explained, when you enter that number directly in your source code, not all the fraction digits will be used, as a double only gives you 15-16 significant decimal digits of precision. The literal is converted to the nearest representable double value in binary (anything beyond that precision is effectively dropped).
To make things worse, and as @R pointed out, IEEE 754 tolerates an error in the last bit of functions such as cosine (implementations are not required to round them correctly). I actually ran into this when using different compilers.
To illustrate, I tested with the following MEX file, compiled once with the default LCC compiler and then with VS2010 (I am on WinXP 32-bit).
In one function we directly call the C functions (mexPrintf is simply a macro #defined as printf). In the other, we call mexEvalString to evaluate commands in the MATLAB engine (equivalent to typing them at the MATLAB command prompt).
prec.c
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include "mex.h"
void c_test()
{
double a = 2.89308776595231886830L;
double b = cos(a);
mexPrintf("[C] a = %.25Lf (%16Lx)\n", a, a);
mexPrintf("[C] b = %.25Lf (%16Lx)\n", b, b);
}
void matlab_test()
{
mexEvalString("a = 2.89308776595231886830;");
mexEvalString("b = cos(a);");
mexEvalString("fprintf('[M] a = %.25f (%bx)\\n', a, a)");
mexEvalString("fprintf('[M] b = %.25f (%bx)\\n', b, b)");
}
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
matlab_test();
c_test();
}
compiled with LCC
>> prec
[M] a = 2.8930877659523189000000000 (4007250b32d9c886)
[M] b = -0.9692812353565483100000000 (bfef045a14cf738a)
[C] a = 2.8930877659523189000000000 ( 32d9c886)
[C] b = -0.9692812353565484200000000 ( 14cf738b) <---
compiled with VS2010
>> prec
[M] a = 2.8930877659523189000000000 (4007250b32d9c886)
[M] b = -0.9692812353565483100000000 (bfef045a14cf738a)
[C] a = 2.8930877659523189000000000 ( 32d9c886)
[C] b = -0.9692812353565483100000000 ( 14cf738a) <---
I compile the above using: mex -v -largeArrayDims prec.c, and switch between the backend compilers using: mex -setup
Note that I also tried to print the hexadecimal representation of the numbers. With the format string above I only managed to show the lower half of the double values in C (see the UPDATE below for a correct way to print all the bits).
Finally, if you need more precision in your calculations, consider using a library for variable precision arithmetic. In MATLAB, if you have access to the Symbolic Math Toolbox, try:
>> a = sym('2.89308776595231886830');
>> b = cos(a);
>> vpa(b,25)
ans =
-0.9692812353565483652970737
So you can see that the actual value is somewhere between the two different approximations I got above, and in fact they are all equal up to the 15th decimal place:
-0.96928123535654831.. # 0xbfef045a14cf738a
-0.96928123535654836.. # <--- actual value (cannot be represented in 64-bit)
-0.96928123535654842.. # 0xbfef045a14cf738b
^
15th digit --/
UPDATE:
If you want to correctly display the hexadecimal representation of floating point numbers in C, use this helper function instead (similar to NUM2HEX function in MATLAB):
/* you need to adjust for double/float datatypes, big/little endianness */
void num2hex(double x)
{
unsigned char *p = (unsigned char *) &x;
int i;
for(i=sizeof(double)-1; i>=0; i--) {
printf("%02x", p[i]);
}
}
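A hypothetical usage sketch, with the helper above assumed in scope:

#include <stdio.h>
#include <math.h>

void num2hex(double x);   /* the helper defined above */

int main(void)
{
    double b = cos(2.89308776595231886830);
    printf("b = %.25f\nhex: ", b);
    num2hex(b);           /* prints all 8 bytes, most significant first */
    printf("\n");
    return 0;
}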
Related
I'm writing some code for an embedded system (MSP430) without hardware floating point support. Unfortunately, I will need to work with fractions in my code as I'm doing ranging, and a short-range sensor with a precision of 1 m isn't a very good sensor.
I can do as much of the math as I need in ints, but by the end there are two values that I will definitely need fractions for: range and speed. Range will be a value between 2 and 500 (cm), while speed will be between -10 and 10 (m/s). I am unsure how to represent them without floating point values, if that is possible. A simple way of rounding the fractions up or down would be best.
Some sample code I have:
voltage_difference_new = ((memval3_new - memval4_new)*3.3/4096);
where memval3_new and memval4_new are ints, but voltage_difference_new is a float.
Please let me know if more information is needed. Or if there is a blindingly easy fix.
You have rather answered your own question with the statement:
Range will be a value between 2-500 (cm),
Work in centimetre (or even millimetre) rather than metre units.
That said, you don't need floating-point hardware to do floating-point math; the compiler will support "soft" floating point and generate code to perform floating-point operations. It will be slower than hardware floating point or pure integer operations, but that may not be an issue in your application.
Nonetheless there are many reasons to avoid floating-point even with hardware support and it does not sound like your case for FP is particularly compelling, but it is hard to tell without seeing your code and a specific example. In 32 years of embedded systems development I have seldom resorted to FP even for trig, log, sqrt and digital signal processing.
A general method is to use a fixed point representation. My earlier suggestion of using centimetres is an example of decimal fixed point, but for greater efficiency you should use binary fixed point. For example, you might represent distance in 1/1024 metre units (giving better than 1 mm precision). Because the fixed point is binary, all the necessary rescaling can be done with shifts rather than more expensive multiply/divide operations.
For example, say you have an 8 bit sensor generating linear output 0 to 255 corresponding to a real distance 0 to 0.5 metre.
#define Q10_SHIFT 10                 // 10 bits fractional (1/1024)
typedef int q10_t ;
#define ONE_METRE (1 << Q10_SHIFT)
#define SENSOR_MAX 255
#define RANGE_MAX (ONE_METRE / 2)
q10_t distance = read_sensor() * RANGE_MAX / SENSOR_MAX ;
distance is in Q10 fixed-point representation. Addition and subtraction of such values is normal integer arithmetic; multiply and divide require rescaling:
q10_t q10_add( q10_t a, q10_t b )
{
    return a + b ;
}
q10_t q10_sub( q10_t a, q10_t b )
{
    return a - b ;
}
q10_t q10_mul( q10_t a, q10_t b )
{
    // widen the intermediate product so it does not overflow an int
    return (q10_t)(((long)a * b) >> Q10_SHIFT) ;
}
q10_t q10_div( q10_t a, q10_t b )
{
    return (q10_t)(((long)a << Q10_SHIFT) / b) ;
}
Of course you may want to be able to mix types and say multiply a q10_t by an int - providing a comprehensive library for fixed-point can get complex. Personally for that I use C++ where you have classes, function overloading and operator overloading to support more natural code. But unless your code has a great deal of general fixed point math, it may be simpler to code specific fixed point operations ad-hoc.
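For instance, a couple of ad-hoc mixed-type helpers might look like this (a sketch; the names are illustrative only):

// q10 x int needs no rescaling: only one operand carries the Q10 factor
q10_t q10_mul_int( q10_t a, int b )
{
    return a * b ;
}

// convert a plain integer to Q10
q10_t q10_from_int( int a )
{
    return (q10_t)a << Q10_SHIFT ;
}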
To take the one example you have provided:
double voltage_difference_new = ((memval3_new - memval4_new)*3.3/4096);
The floating-point there is trivially removed using millivolts:
long voltage_difference_new_mv = ((long)(memval3_new - memval4_new) * 3300) / 4096 ; // widen: the product overflows a 16-bit int
The issue then perhaps becomes one of presentation, for example if you have to display or report the value in volts to a user. In that case:
long volt_fract = labs( voltage_difference_new_mv % 1000 ) ;
long volt_whole = voltage_difference_new_mv / 1000 ;
printf( "%ld.%03ld", volt_whole, volt_fract ) ;
I was trying to write a program to calculate the value of x^n using a while loop:
#include <stdio.h>
#include <math.h>
int main()
{
float x = 3, power = 1, copyx;
int n = 22, copyn;
copyx = x;
copyn = n;
while (n)
{
if ((n % 2) == 1)
{
power = power * x;
}
n = n / 2;
x *= x;
}
printf("%g^%d = %f\n", copyx, copyn, power);
printf("%g^%d = %f\n", copyx, copyn, pow(copyx, copyn));
return 0;
}
Up until n = 15, my function and the pow function (from math.h) give the same value; but when n exceeds 15, they start giving different answers.
I cannot understand why there is a difference. Have I written the function incorrectly, or is it something else?
You are mixing up two different types of floating-point data. The pow function uses the double type but your loop uses the float type (which has less precision).
You can make the results coincide by either using the double type for your x, power and copyx variables, or by calling the powf function (which uses the float type) instead of pow.
The latter adjustment (using powf) gives the following output (clang-cl compiler, Windows 10, 64-bit):
3^22 = 31381059584.000000
3^22 = 31381059584.000000
And, changing the first line of your main to double x = 3, power = 1, copyx; gives the following:
3^22 = 31381059609.000000
3^22 = 31381059609.000000
Note that, with larger and larger values of n, you are increasingly likely to get divergence between the results of your loop and the value calculated using the pow or powf library functions. On my platform, the double version gives the same results, right up to the point where the value overflows the range and becomes Infinity. However, the float version starts to diverge around n = 55:
3^55 = 174449198498104595772866560.000000
3^55 = 174449216944848669482418176.000000
When I run your code I get this:
3^22 = 31381059584.000000
3^22 = 31381059609.000000
This would be because pow returns a double but your code uses float. When I changed to powf I got identical results:
3^22 = 31381059584.000000
3^22 = 31381059584.000000
So simply use double everywhere if you need high resolution results.
Floating point math is imprecise (and float is worse than double, having even fewer bits to store the data in; using double might delay the imprecision longer). The pow function (usually) uses an exponentiation algorithm that minimizes precision loss, and/or delegates to a chip-level instruction that may do stuff more efficiently, more precisely, or both. There could be more than one implementation of pow too, depending on whether you tell the compiler to use strictly conformant floating point math, the fastest possible, the hardware instruction, etc.
Your code is fine (though using double would get more precise results), but matching the improved precision of math.h's pow is non-trivial; by the time you've done so, you'll have reinvented it. That's why you use the library function.
That said, for logically integer math as you're using here, precision loss from your algorithm likely doesn't matter, it's purely the float vs. double issue where you lose precision from the type itself. As a rule, default to using double, and only switch to float if you're 100% sure you don't need the precision and can't afford the extra memory/computation cost of double.
Precision
float x = 3, power = 1; ... power = power * x forms a float product.
pow(x, y) forms a double result and good implementations internally use even wider math.
OP's loop method incurs rounded results after the 15th iteration; these roundings compound, slowly increasing the inaccuracy of the final result.
3^16 is a 26-bit odd number.
float encodes all odd numbers exactly only up to typically 2^24; larger representable values are all even, with only 24 significant binary digits.
double encodes all odd numbers exactly up to typically 2^53.
To do a fair comparison, use:
double objects and pow() or
float objects and powf().
For large powers, the pow(f)() function is certain to provide better answers than a loop, as such functions often use extended precision internally, with well-managed rounding, versus the loop approach.
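A quick check of the 2^24 cutoff above; a sketch, with the expected output noted in comments (assumes round-to-nearest):

#include <stdio.h>

int main(void)
{
    /* 3^16 = 43046721 is a 26-bit odd number: it rounds when stored
       in a float (24-bit significand) but is exact in a double */
    float  f = 43046721.0f;
    double d = 43046721.0;
    printf("float : %.1f\n", f);   /* prints 43046720.0 */
    printf("double: %.1f\n", d);   /* prints 43046721.0 */
    return 0;
}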
I'm using avr-gcc where sizeof(double) and sizeof(float) are both 4 and I'm having an issue with double arithmetic to get the correct integer result:
// x is some value between 8.0 and 9.6103
double x = 9.6103;
uint32_t r = pow(x,2) * 8813377.768984962;
The correct value of r should be 813984763 rounded down but the actual result is 813984768.
How can I get the correct integer result?
I've tried to split the calculation like this:
uint32_t r1 = pow(x,2) * 8813;
double d1 = pow(x,2) * .77768984962;
uint32_t r = r1 + d1;
But this still suffers from precision issues, i.e. I can't seem to get exactly 813984763, and I'm only interested in getting the integer part of the result correct. Any ideas?
A float cannot represent the value you need (813984763) exactly, much less carry enough precision through the calculation, and as you've noted avr-gcc has (wrongly) redefined double to be the same as float.
The closest representable values in float are:
Below: 813984704
Above: 813984768 (closer)
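A one-liner makes the rounding visible (a sketch; the comment assumes round-to-nearest):

#include <stdio.h>

int main(void)
{
    /* the target integer falls between two adjacent floats
       and rounds to the nearer neighbour */
    float f = 813984763.0f;
    printf("%.1f\n", f);   /* prints 813984768.0 */
    return 0;
}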
You could scale everything up and use 128-bit integers to do the arithmetic. 128 bits is so much room that you can simply keep the whole computation in integers.
double x = 9.6103;
uint128_t y = x * 10000;            // y = 96103 (x scaled by 10^4)
uint128_t c = 8813377768984962;     // 8813377.768984962 scaled by 10^9
uint32_t r = y * y * c / 10000 / 10000 / 1000000000;
// max y * y * c = 96103 * 96103 * 8813377768984962 =
// = 81398476378849607561973858
// UINT128_MAX = 340282366920938463463374607431768211456
// ^^ is way more, so it will not overflow.
Your platform most probably does not support GCC's __uint128_t extension, so you would have to use a library for it. There are plenty of 128-bit libraries in C++ on GitHub; port one to C (or find one in C) and use it.
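On a 64-bit host with GCC or Clang, the same computation can be checked directly with the __uint128_t extension (a sketch; this will not compile for the AVR target itself):

#include <stdio.h>

int main(void)
{
    __uint128_t y = 96103;                   /* 9.6103 scaled by 10^4 */
    __uint128_t c = 8813377768984962ULL;     /* 8813377.768984962 scaled by 10^9 */
    __uint128_t r = y * y * c / 10000 / 10000 / 1000000000;
    printf("%llu\n", (unsigned long long)r); /* expect 813984763 */
    return 0;
}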
Well, I got some free time, and I always wanted to have a C uint128 library, so I took the library https://github.com/calccrypto/uint128_t, ported it to C, wrote an executable that does the same computations as presented above, compiled it for an atmega128 with avr-gcc -Os, and ran avr-nm -td --sort-size over the result. These are the five biggest symbols in the result, and the whole program has ~12 KB of .text. So, a fair bit of space is needed for this solution to work.
00000642 T how_to_split_double_multiplication
00000706 T kuint128_rshift
00000762 T kuint128_lshift
00003104 T kuint128_mul
00004594 T kuint128_divmod
I just need a little help converting these lines of Octave to OpenCL. I guess I'm confused by the translation of ^ to pow. Here are the Octave and OpenCL versions, but the results are not the same.
v1=0.3;
E1=(207*(10^9));
E2=(3*(10^6));
v2=0.49;
octave
Ers=1/(1/pi*((1-v1^2)/E1+(1-v2^2)/E2)); % MPa. What does this look like in C?
DEF=(6*n*U*B^2)/Er/(sigma^3);
my opencl attempt at translation
Ers=(1.00/(1.00/pi*((1.00-pow(v1 ,2))/E1+(1.00-pow(v2, 2))/E2))); //MPa
DEF=6*n*U*pow(B,2)/Er/pow(sigma,3);
octave results
//ers 1.2402e+07
//DEF 30.962
opencl results
Ers 139.336666
DEF -0.000003
Not sure what I've done wrong, but if someone sees it, please help.
Thanks
OpenCL is derived from C99, so you can check your conversion is correct with a trivial vanilla C program.
If I take your C version of the Ers calculation and run it with C, the result I get matches your Octave result. I suspect your issue lies in the definition of your E1 and E2 constant values. If I paste those into the C program unchanged, I get an answer very similar to your OpenCL result. The issue is that although 10^9 is syntactically valid in C, it doesn't mean the same as in Octave (^ is the bitwise XOR operator in C). Instead, you should use scientific E notation, such as 1e9.
So, here's a complete C program that computes the Ers value using your C code copied almost verbatim, with just the constant values corrected. This produces the output Ers = 1.24024e+07 on my system.
#include <math.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
double v1 = 0.3;
double E1 = (207*(1e9));
double E2 = (3*(1e6));
double v2 = 0.49;
double pi = M_PI;
double Ers = (1.00/(1.00/pi*((1.00-pow(v1,2))/E1+(1.00-pow(v2, 2))/E2)));
printf("Ers = %g\n", Ers);
return 0;
}
I'm looking for a way to truncate a float into an int in a fast and portable (IEEE 754) way. The reason is because in this function 50% of the time is spent in the cast:
#include <math.h>   /* for copysignf */

/* constants assumed by the function (values not shown in the original post): */
#define F_PI   3.14159265358979323846f   /* pi */
#define F_1_PI 0.31830988618379067154f   /* 1/pi */

float fm_sinf(float x) {
const float a = 0.00735246819687011731341356165096815f;
const float b = -0.16528911397014738207016302002888890f;
const float c = 0.99969198629596757779830113868360584f;
float r, x2;
int k;
/* bring x in range */
k = (int) (F_1_PI * x + copysignf(0.5f, x)); /* <-- 50% of time is spent in cast */
x -= k * F_PI;
/* if x is in an odd pi count we must flip */
r = 1 - 2 * (k & 1); /* trick for r = (k % 2) == 0 ? 1 : -1; */
x2 = x * x;
return r * x*(c + x2*(b + a*x2));
}
The slowness of float->int casts mainly occurs when using x87 FPU instructions on x86. To do the truncation, the rounding mode in the FPU control word needs to be changed to round-to-zero and back, which tends to be very slow.
When using SSE instead of x87 instructions, a truncation is available without control word changes. You can do this using compiler options (like -mfpmath=sse -msse -msse2 in GCC) or by compiling the code as 64-bit.
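On compilers that provide the SSE intrinsics, the truncating conversion can also be requested explicitly; a minimal sketch (assumes an SSE-capable x86 target):

#include <xmmintrin.h>

/* truncating float->int via SSE's cvttss2si; no x87 control-word
   change is involved */
static inline int trunc_sse(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}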
The SSE3 instruction set has the FISTTP instruction to convert to integer with truncation without changing the control word. A compiler may generate this instruction if instructed to assume SSE3.
Alternatively, the C99 lrint() function will convert to integer with the current rounding mode (round-to-nearest unless you changed it). You can use this if you remove the copysignf term. Unfortunately, this function is still not ubiquitous after more than ten years.
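Applied here, the cast-plus-copysignf idiom collapses to one call; a sketch (assumes the default round-to-nearest mode is in effect):

#include <math.h>

/* replaces k = (int)(F_1_PI * x + copysignf(0.5f, x)) with
   k = round_nearest(F_1_PI * x) */
static inline int round_nearest(float v)
{
    return (int) lrintf(v);   /* rounds to nearest under the default mode */
}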
I found a fast truncate method by Sree Kotay which provides exactly the optimization that I needed.
To be portable you would have to add some preprocessor directives and learn a couple of assembly languages, but you could in theory use inline assembly to move portions of the floating-point register into eax/rax or ebx/rbx and do the conversion by hand. The floating-point format is a pain to work with, but I am fairly certain that doing it in assembly would be faster, as your needs are very specific and the library routine is more generic and therefore less efficient for your purpose.
You could skip the conversion to int altogether by using frexpf to get the mantissa and exponent, and then inspect the raw mantissa bits (via a union) at the appropriate position (calculated from the exponent) to determine the (quadrant-dependent) r.