The floating-point error (C)

#include <stdio.h>
int main()
{
int n;
while ( scanf( "%d", &n ) == 1 ) { // stop on EOF or non-numeric input
double sum = 0,k;
if( n > 5000000 || n<=0 ) // range check on the input
break;
for ( int i = 1; i <= n; i++ ) {
k = (double) 1 / i;
sum += k;
}
/*
for ( int i = n; i > 0; i-- ) {
k = 1 / (double)i;
sum += k;
}
*/
printf("%.12lf\n", sum);
}
return 0;
}
Why do the two loops give different answers? Is this a floating-point error? When I input 5000000, the sum is 16.002164235299, but when I use the other for loop (the commented-out part) I get 16.002164235300.

Because floating point math is not associative:
i.e. (a + b) + c is not necessarily equal to a + (b + c)
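To see the effect in isolation, here is a minimal sketch. The function names `harmonic_forward`, `harmonic_reverse`, and `print_harmonic_orders` are made up for illustration. It computes the same harmonic sum in both orders. The reverse order adds the smallest terms first, which usually loses less to rounding, though neither result is exact.

```c
#include <stdio.h>

/* Sum 1/1 + 1/2 + ... + 1/n, largest terms first. */
double harmonic_forward(int n) {
    double sum = 0.0;
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;
    return sum;
}

/* Same terms, smallest first. */
double harmonic_reverse(int n) {
    double sum = 0.0;
    for (int i = n; i >= 1; i--)
        sum += 1.0 / i;
    return sum;
}

/* Print both sums and their difference for a given n. */
void print_harmonic_orders(int n) {
    double f = harmonic_forward(n);
    double r = harmonic_reverse(n);
    printf("forward: %.15f\nreverse: %.15f\ndiff:    %.3e\n", f, r, f - r);
}
```

Calling `print_harmonic_orders(5000000)` should show two values that agree to roughly eleven decimal places and differ only in the last few digits, which is the discrepancy described in the question.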

I also ran into the a + b + c issue, and I totally agree with ArjunShankar.
// Here A != B in the general case
float A = (a + b) + c;
float B = (a + c) + b;
Most floating-point operations lose data in the mantissa, even when each operand fits well in it on its own (numbers like 0.5 or 0.25).
In fact, I was quite happy to find the cause of a bug in my application. I have written a short reminder article with a detailed explanation:
http://stepan.dyatkovskiy.com/2018/04/machine-fp-partial-invariance-issue.html
Below is the C example. Good luck!
example.c
#include <stdio.h>
// Helpers declaration, for implementation scroll down
float getAllOnes(unsigned bits);
unsigned getMantissaBits();
int main() {
// Determine mantissa size in bits
unsigned mantissaBits = getMantissaBits();
// Considering mantissa has only 3 bits, we would then get:
// a = 0b10 m=1, e=1
// b = 0b110 m=11, e=1
// c = 0b1000 m=1, e=3
// a + b = 0b1000, m=100, e=1
// a + c = 0b1010, truncated to 0b1000, m=100, e=1
// a + b + c result: 0b1000 + 0b1000 = 0b10000, m=100, e=2
// a + c + b result: 0b1000 + 0b110 = 0b1110, m=111, e=1
float a = 2,
b = getAllOnes(mantissaBits) - 1,
c = b + 1;
float ab = a + b;
float ac = a + c;
float abc = a + b + c;
float acb = a + c + b;
printf("\n"
"FP partial invariance issue demo:\n"
"\n"
"Mantissa size = %i bits\n"
"\n"
"a = %.1f\n"
"b = %.1f\n"
"c = %.1f\n"
"(a+b) result: %.1f\n"
"(a+c) result: %.1f\n"
"(a + b + c) result: %.1f\n"
"(a + c + b) result: %.1f\n"
"---------------------------------\n"
"diff(a + b + c, a + c + b) = %.1f\n\n",
mantissaBits,
a, b, c,
ab, ac,
abc, acb,
abc - acb);
return 0;
}
// Helpers
float getAllOnes(unsigned bits) {
return (unsigned)((1 << bits) - 1);
}
unsigned getMantissaBits() {
unsigned sz = 1;
unsigned unbeleivableHugeSize = 1024;
float allOnes = 1;
for (;sz != unbeleivableHugeSize &&
allOnes + 1 != allOnes;
allOnes = getAllOnes(++sz)
) {}
return sz-1;
}

Related

How can I compute a * b / c when both a and b are smaller than c, but a * b overflows?

Assuming that uint is the largest integral type on my fixed-point platform, I have:
uint func(uint a, uint b, uint c);
Which needs to return a good approximation of a * b / c.
The value of c is greater than both the value of a and the value of b.
So we know for sure that the value of a * b / c would fit in a uint.
However, the value of a * b itself overflows the size of a uint.
So one way to compute the value of a * b / c would be:
return a / c * b;
Or even:
if (a > b)
return a / c * b;
return b / c * a;
However, the value of c is greater than both the value of a and the value of b.
So the suggestion above would simply return zero.
I need to reduce a * b and c proportionally, but again - the problem is that a * b overflows.
Ideally, I would be able to:
Replace a * b with uint(-1)
Replace c with uint(-1) / a / b * c.
But no matter how I order the expression uint(-1) / a / b * c, I encounter a problem:
uint(-1) / a / b * c is truncated to zero because of uint(-1) / a / b
uint(-1) / a * c / b overflows because of uint(-1) / a * c
uint(-1) * c / a / b overflows because of uint(-1) * c
How can I tackle this scenario in order to find a good approximation of a * b / c?
Edit 1
I do not have things such as _umul128 on my platform, when the largest integral type is uint64. My largest type is uint, and I have no support for anything larger than that (neither on the HW level, nor in some pre-existing standard library).
My largest type is uint.
Edit 2
In response to numerous duplicate suggestions and comments:
I do not have some "larger type" at hand, which I can use for solving this problem. That is why the opening statement of the question is:
Assuming that uint is the largest integral type on my fixed-point platform
I am assuming that no other type exists, neither on the SW layer (via some built-in standard library) nor on the HW layer.

needs to return a good approximation of a * b / c
My largest type is uint
both a and b are smaller than c
Variation on this 32-bit problem:
Algorithm: Scale a, b to not overflow
SQRT_MAX_P1 as a compile time constant of sqrt(uint_MAX + 1)
sh = 0;
if (c >= SQRT_MAX_P1) {
while (|a| >= SQRT_MAX_P1) a/=2, sh++
while (|b| >= SQRT_MAX_P1) b/=2, sh++
while (|c| >= SQRT_MAX_P1) c/=2, sh--
}
result = a*b/c
shift result by sh.
With an n-bit uint, I expect the result to be correct to at least about n/2 significant bits.
Could improve things by taking advantage of the smaller of a,b being less than SQRT_MAX_P1. More on that later if interested.
Example
#include <inttypes.h>
#define IMAX_BITS(m) ((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12))
// https://stackoverflow.com/a/4589384/2410359
#define UINTMAX_WIDTH (IMAX_BITS(UINTMAX_MAX))
#define SQRT_UINTMAX_P1 (((uintmax_t)1ull) << (UINTMAX_WIDTH/2))
uintmax_t muldiv_about(uintmax_t a, uintmax_t b, uintmax_t c) {
int shift = 0;
if (c > SQRT_UINTMAX_P1) {
while (a >= SQRT_UINTMAX_P1) {
a /= 2; shift++;
}
while (b >= SQRT_UINTMAX_P1) {
b /= 2; shift++;
}
while (c >= SQRT_UINTMAX_P1) {
c /= 2; shift--;
}
}
uintmax_t r = a * b / c;
if (shift > 0) r <<= shift;
if (shift < 0) r >>= -shift; // a shift count must not be negative
return r;
}
#include <stdio.h>
int main() {
uintmax_t a = 12345678;
uintmax_t b = 4235266395;
uintmax_t c = 4235266396;
uintmax_t r = muldiv_about(a,b,c);
printf("%ju\n", r);
}
Output with 32-bit math (Precise answer is 12345677)
12345600
Output with 64-bit math
12345677

Here is another approach that uses recursion and minimal approximation to achieve high precision.
First the code and below an explanation.
Code:
uint32_t bp(uint32_t a) {
uint32_t b = 0;
while (a!=0)
{
++b;
a >>= 1;
};
return b;
}
int mul_no_ovf(uint32_t a, uint32_t b)
{
return ((bp(a) + bp(b)) <= 32);
}
uint32_t f(uint32_t a, uint32_t b, uint32_t c)
{
if (mul_no_ovf(a, b))
{
return (a*b) / c;
}
uint32_t m = c / b;
++m;
uint32_t x = m*b - c;
// So m * b == c + x where 0 < x <= b and m >= 2
uint32_t n = a/m;
uint32_t r = a % m;
// So a*b == n * (c + x) + r*b == n*c + n*x + r*b where r*b < c
// Approximation: get rid of the r*b part
uint32_t res = n;
if (r*b > c/2) ++res;
return res + f(n, x, c);
}
Explanation:
The multiplication a * b can be written as a sum of b
a * b = b + b + .... + b
Since b < c we can take a number m of these b so that (m-1)*b <= c < m*b, like
(b + b + ... + b) + (b + b + ... + b) + .... + b + b + b
\---------------/ \---------------/ + \-------/
m*b + m*b + .... + r*b
\-------------------------------------/
n times m*b
so we have
a*b = n*m*b + r*b
where r*b < c and m*b > c. Consequently, m*b is equal to c + x, so we have
a*b = n*(c + x) + r*b = n*c + n*x + r*b
Divide by c :
a*b/c = (n*c + n*x + r*b)/c = n + n*x/c + r*b/c
The values m, n, x, r can all be calculated from a, b and c without any loss of
precision using integer division (/) and remainder (%).
The approximation is to look at r*b (which is less than c) and "add zero" when r*b<=c/2
and "add one" when r*b>c/2.
So now there are two possibilities:
1) a*b/c ≈ n + n*x/c
2) a*b/c ≈ (n + 1) + n*x/c
So the problem (i.e. calculating a*b/c) has been changed to the form
MULDIV(a1,b1,c) = NUMBER + MULDIV(a2,b2,c)
where a2,b2 are less than a1,b1. Consequently, recursion can be used until
a2*b2 no longer overflows (and the calculation can be done directly).
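The "sum of b" idea above can also be carried out bit by bit, which gives an exact result instead of an approximation. The sketch below is a different technique from the recursive code in this answer: a double-and-add loop that keeps a running quotient and remainder. The name `muldiv_exact` is hypothetical, and the code assumes c > 0 and b < c, as stated in the question.

```c
#include <stdint.h>

/* Exact floor(a * b / c) using only 32-bit arithmetic.
 * Assumes c > 0 and b < c; if a < c as well, the quotient fits in 32 bits.
 * Invariant after each bit: q * c + r == (a >> bit) * b, with 0 <= r < c. */
uint32_t muldiv_exact(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t q = 0; /* quotient so far */
    uint32_t r = 0; /* remainder so far, always < c */
    for (int bit = 31; bit >= 0; bit--) {
        /* Double (q, r); compare r with c - r so that 2 * r is never formed
         * when it could overflow. */
        q <<= 1;
        if (r >= c - r) { r -= c - r; q++; }
        else            { r += r; }
        /* Add b when the current bit of a is set, again without overflow. */
        if ((a >> bit) & 1u) {
            if (r >= c - b) { r -= c - b; q++; }
            else            { r += b; }
        }
    }
    return q;
}
```

For the inputs used earlier on this page (a = 12345678, b = 4235266395, c = 4235266396) this returns 12345677. The cost is a fixed 32 iterations, each using only comparisons, additions, and subtractions.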

I've established a solution that works in O(1) complexity (no loops):
typedef unsigned long long uint;
typedef struct
{
uint n;
uint d;
}
fraction;
uint func(uint a, uint b, uint c);
fraction reducedRatio(uint n, uint d, uint max);
fraction normalizedRatio(uint a, uint b, uint scale);
fraction accurateRatio(uint a, uint b, uint scale);
fraction toFraction(uint n, uint d);
uint roundDiv(uint n, uint d);
uint func(uint a, uint b, uint c)
{
uint hi = a > b ? a : b;
uint lo = a < b ? a : b;
fraction f = reducedRatio(hi, c, (uint)(-1) / lo);
return f.n * lo / f.d;
}
fraction reducedRatio(uint n, uint d, uint max)
{
fraction f = toFraction(n, d);
if (n > max || d > max)
f = normalizedRatio(n, d, max);
if (f.n != f.d)
return f;
return toFraction(1, 1);
}
fraction normalizedRatio(uint a, uint b, uint scale)
{
if (a <= b)
return accurateRatio(a, b, scale);
fraction f = accurateRatio(b, a, scale);
return toFraction(f.d, f.n);
}
fraction accurateRatio(uint a, uint b, uint scale)
{
uint maxVal = (uint)(-1) / scale;
if (a > maxVal)
{
uint c = a / (maxVal + 1) + 1;
a /= c; // we can now safely compute `a * scale`
b /= c;
}
if (a != b)
{
uint n = a * scale;
uint d = a + b; // can overflow
if (d >= a) // no overflow in `a + b`
{
uint x = roundDiv(n, d); // we can now safely compute `scale - x`
uint y = scale - x;
return toFraction(x, y);
}
if (n < b - (b - a) / 2)
{
return toFraction(0, scale); // `a * scale < (a + b) / 2 < MAXUINT256 < a + b`
}
return toFraction(1, scale - 1); // `(a + b) / 2 < a * scale < MAXUINT256 < a + b`
}
return toFraction(scale / 2, scale / 2); // allow reduction to `(1, 1)` in the calling function
}
fraction toFraction(uint n, uint d)
{
fraction f = {n, d};
return f;
}
uint roundDiv(uint n, uint d)
{
return n / d + n % d / (d - d / 2);
}
Here is my test:
#include <stdio.h>
int main()
{
uint a = (uint)(-1) / 3; // 0x5555555555555555
uint b = (uint)(-1) / 2; // 0x7fffffffffffffff
uint c = (uint)(-1) / 1; // 0xffffffffffffffff
printf("0x%llx", func(a, b, c)); // 0x2aaaaaaaaaaaaaaa
return 0;
}

You can cancel prime factors as follows:
uint gcd(uint a, uint b)
{
uint c;
while (b)
{
a %= b;
c = a;
a = b;
b = c;
}
return a;
}
uint func(uint a, uint b, uint c)
{
uint temp = gcd(a, c);
a = a/temp;
c = c/temp;
temp = gcd(b, c);
b = b/temp;
c = c/temp;
// Cancelling common factors makes overflow less likely, but it does not rule
// it out: if the reduced a * b still exceeds the range of uint, this wraps.
return a * b / c;
}

How can I add two different data types char and int in c language?

This is the code. In it, a + c prints a number. Why? The output is below. How is the character converted into a number? And why is 125 + 'c' = 212? Thanks in advance!
#include <stdio.h>
int main()
{
int a = 125, b = 12345;
long ax = 1234567890;
short s = 4043;
float x = 2.13459;
double dx = 1.1415927;
char c = 'W';
unsigned long ux = 2541567890;
printf("a + c = %d\n", a + c);
printf("x + c = %f\n", x + c);
printf("dx + x = %f\n", dx + x);
printf("((int) dx) + ax = %ld\n", ((int) dx) + ax);
printf("a + x = %f\n", a + x);
printf("s + b = %d\n", s + b);
printf("ax + b = %ld\n", ax + b);
printf("s + c = %hd\n", s + c);
printf("ax + c = %ld\n", ax + c);
printf("ax + ux = %lu\n", ax + ux);
return 0;
}
Sample output:
a + c = 212
x + c = 89.134590
dx + x = 3.276183
((int) dx) + ax = 1234567891
a + x = 127.134590
s + b = 16388
ax + b = 1234580235
s + c = 4130
ax + c = 1234567977
ax + ux = 3776135780

In this code, a + c prints a number. Why?
You get this output because 1) you asked printf to print an integer (%d), and 2) when you add a char to an int, the char is promoted to int and the result is an int.
How is the character converted into a number?
Each character is stored as a small integer code ('A' = 65, ' ' = 32, ...; see https://en.wikipedia.org/wiki/ASCII).
And why is 125 + 'c' = 212?
The variable c holds 'W', which has the ASCII value 87, and 125 + 87 == 212.
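A minimal sketch of the promotion described above, assuming an ASCII execution character set (the helper names `add_int_char` and `show_promotion` are hypothetical):

```c
#include <stdio.h>

/* In a + c the char operand is promoted to int before the addition,
 * so the result has type int. */
int add_int_char(int a, char c) {
    return a + c;
}

/* Print the character code and the sum from the question. */
void show_promotion(void) {
    char c = 'W';
    printf("'%c' as int: %d\n", c, c); /* 87 on ASCII systems */
    printf("125 + '%c' = %d\n", c, add_int_char(125, c));
}
```

On an ASCII system, `show_promotion()` prints 87 for the character and 212 for the sum, matching the first line of the sample output.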

Optimizing a program for solving ax+by=c with positive integers

I am writing a program that for any given positive integers a < b < c will output YES if there is a solution to ax+by=c where x and y are also positive integers (x,y > 0), or NO if there isn't a solution. Keep in mind that I need to work with big numbers.
My approach is to repeatedly subtract b from c and check whether the result is divisible by a.
Here's my code:
#include <stdio.h>
#include <stdlib.h>
int main(){
unsigned long long int a, b, c;
scanf("%I64u %I64u %I64u", &a, &b, &c);
while(c>=a+b){ //if c becomes less than a+b, then there's no solution
c-=b;
if(c%a==0){
printf("YES");
return 0;
}
}
printf("NO");
return 0;
}
Is there a more optimised way to find whether ax+by=c has positive solutions? I tried reading about linear Diophantine equations, but all I found was a way to find integer solutions (not necessarily positive ones).

My approach so far:
Use the Euclidean algorithm to find GCD(a, b).
There are solutions (in integers) to ax + by = c if and only if GCD(a, b) divides c. No integer solutions means no positive solutions.
Use the extended Euclidean algorithm to solve the Diophantine equation, and return NO if it gives non-positive solutions.
For comparison, it is hard to find examples where either version takes longer than a second, but when deciding thousands of random equations the performance difference is noticeable. This Lecture has a solution for finding the number of positive solutions to a linear Diophantine equation.
typedef unsigned long long int BigInt;
int pos_solvable(BigInt a, BigInt b, BigInt c) {
/* returns 1 if there exists x, y > 0 s.t. ax + by = c
* where 0 < a < b < c
* returns 0, otherwise
*/
BigInt gcd = a, bb = b, temp;
while (bb) { /* Euclidean Algorithm */
temp = bb;
bb = gcd % bb;
gcd = temp;
}
if (c % gcd) { /* no integer (or positive) solution */
return 0;
} else {
/* Extended Euclidean Algorithm */
BigInt s = 0, old_s = 1;
BigInt t = 1, old_t = 0;
BigInt r = b / gcd, old_r = a / gcd;
while (r > 0) {
BigInt quotient = old_r / r;
BigInt ds = quotient * s;
BigInt dt = quotient * t;
if (ds > old_s || dt > old_t)
return 0; /* will give non-positive solution */
temp = s;
s = old_s - ds;
old_s = temp;
temp = t;
t = old_t - dt;
old_t = temp;
temp = r;
r = old_r - quotient * r;
old_r = temp;
}
return 1;
}
}
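As a cross-check for the steps listed above, here is an alternative sketch that decides the question through a modular inverse rather than by tracking signs inside the Euclidean loop. It is not the code above: the names `ext_gcd_x` and `pos_solvable32` are hypothetical, and inputs are restricted to 32 bits (an assumption made here) so that every product fits in 64 bits. The idea: after dividing by g = GCD(a, b), take the smallest positive x with a'x ≡ c' (mod b'); a positive y exists exactly when a'x < c'.

```c
#include <stdint.h>

/* Extended Euclid: returns g = gcd(a, b) and sets *x so that
 * a * (*x) is congruent to g modulo b. */
static int64_t ext_gcd_x(int64_t a, int64_t b, int64_t *x) {
    int64_t x0 = 1, x1 = 0;
    while (b != 0) {
        int64_t q = a / b;
        int64_t t = a - q * b; a = b; b = t;
        t = x0 - q * x1; x0 = x1; x1 = t;
    }
    *x = x0;
    return a;
}

/* Returns 1 if a*x + b*y == c has a solution with x, y > 0, else 0. */
int pos_solvable32(uint32_t a, uint32_t b, uint32_t c) {
    if (a == 0 || b == 0) return 0;
    int64_t x;
    int64_t g = ext_gcd_x(a, b, &x);
    if (c % g != 0) return 0;              /* no integer solution at all */
    uint64_t a1 = (uint64_t)(a / g);
    uint64_t b1 = (uint64_t)(b / g);
    uint64_t c1 = (uint64_t)(c / g);
    /* inv is the inverse of a1 modulo b1, normalised into [0, b1). */
    uint64_t inv = (uint64_t)(((x % (int64_t)b1) + (int64_t)b1) % (int64_t)b1);
    /* Smallest positive x with a1 * x congruent to c1 modulo b1. */
    uint64_t xs = ((c1 % b1) * inv) % b1;
    if (xs == 0) xs = b1;
    /* y = (c1 - a1 * xs) / b1 is largest for this x; it must be positive. */
    return a1 * xs < c1;
}
```

For example, `pos_solvable32(3, 5, 8)` returns 1 (x = 1, y = 1), while `pos_solvable32(3, 5, 7)` returns 0. The running time is logarithmic in the inputs, since only the Euclidean loop iterates.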

The following is a comment but too big for the comment section.
This is posted to help others dig into this problem a little deeper.
OP: Incorporate any of this in your post if you like.
What is still needed are some challenging a,b,c.
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
//#define LLF "%I64u"
#define LLF "%llu"
int main(void) {
unsigned long long int a, b, c, x, y, sum, c0;
// scanf(LLF LLF LLF, &a, &b, &c);
c = c0 = ULLONG_MAX;
b = 10000223;
a = 10000169;
y = 0;
sum = a + b;
time_t t0 = time(NULL);
while (c >= sum) { //if c becomes less than a+b, then there's no solution
c -= b;
if (c % a == 0) {
break;
}
}
if (c % a == 0) {
y = (c0 - c) / b;
x = c / a;
printf("YES " LLF "*" LLF " + " LLF "*" LLF " = " LLF "\n", a, x, b, y, c0); // c0 is the original c
} else {
printf("NO\n");
}
time_t t1 = time(NULL);
printf("time :" LLF "\n", (unsigned long long) (t1 - t0));
return 0;
}
Output
YES 10000169*1844638544065 + 10000223*4688810 = 18446744073709551615
time :0

64 bit mathematical operations without any loss of data or precision

I believe there isn't any portable standard data type for 128 bits of data. So, my question is about how efficiently 64 bit operations can be carried out without loss of data using existing standard data-types.
For example : I have following two uint64_t type variables:
uint64_t x = -1;
uint64_t y = -1;
Now, how can the results of mathematical operations such as x+y, x-y, x*y and x/y be stored, retrieved, and printed?
For the variables above, x+y wraps around to 0xFFFFFFFFFFFFFFFEULL with a carry of 1.
void add (uint64_t a, uint64_t b, uint64_t *result_high, uint64_t *result_low)
{
*result_low = a + b; // wraps modulo 2^64
*result_high = (*result_low < a); // 1 if the addition carried, else 0
}
How can the other operations be performed in the same way as add, so that they give the proper final output?
I'd appreciate it if someone could share a generic algorithm that takes care of the overflow, underflow, and so on that can come into the picture with such operations.
Any standard, tested algorithms that might help are welcome.

There are a lot of BigInteger libraries out there for manipulating big numbers:
GMP Library
C++ Big Integer Library
If you want to avoid integrating a library and your requirements are small, here is the basic BigInteger snippet that I generally use for problems with basic requirements. You can add new methods or overload more operators as needed. This snippet is widely tested and bug free.
Source
#include <algorithm> // reverse
#include <iostream>
#include <string>
using namespace std;
class BigInt {
public:
// default constructor
BigInt() : sign(1) {} // start non-negative so sign is never read uninitialized
// ~BigInt() {} // avoid overloading default destructor. member-wise destruction is okay
BigInt( string b ) {
(*this) = b; // constructor for string
}
// some helpful methods
size_t size() const { // returns number of digits
return a.length();
}
BigInt inverseSign() { // changes the sign
sign *= -1;
return (*this);
}
BigInt normalize( int newSign ) { // removes leading 0, fixes sign
for( int i = a.size() - 1; i > 0 && a[i] == '0'; i-- )
a.erase(a.begin() + i);
sign = ( a.size() == 1 && a[0] == '0' ) ? 1 : newSign;
return (*this);
}
// assignment operator
void operator = ( string b ) { // assigns a string to BigInt
a = b[0] == '-' ? b.substr(1) : b;
reverse( a.begin(), a.end() );
this->normalize( b[0] == '-' ? -1 : 1 );
}
// conditional operators
bool operator < (BigInt const& b) const { // less than operator
if( sign != b.sign ) return sign < b.sign;
if( a.size() != b.a.size() )
return sign == 1 ? a.size() < b.a.size() : a.size() > b.a.size();
for( int i = a.size() - 1; i >= 0; i-- ) if( a[i] != b.a[i] )
return sign == 1 ? a[i] < b.a[i] : a[i] > b.a[i];
return false;
}
bool operator == ( const BigInt &b ) const { // operator for equality
return a == b.a && sign == b.sign;
}
// mathematical operators
BigInt operator + ( BigInt b ) { // addition operator overloading
if( sign != b.sign ) return (*this) - b.inverseSign();
BigInt c;
for(int i = 0, carry = 0; i<a.size() || i<b.size() || carry; i++ ) {
carry+=(i<a.size() ? a[i]-48 : 0)+(i<b.a.size() ? b.a[i]-48 : 0);
c.a += (carry % 10 + 48);
carry /= 10;
}
return c.normalize(sign);
}
BigInt operator - ( BigInt b ) { // subtraction operator overloading
if( sign != b.sign ) return (*this) + b.inverseSign();
int s = sign;
sign = b.sign = 1;
if( (*this) < b ) return ((b - (*this)).inverseSign()).normalize(-s);
BigInt c;
for( int i = 0, borrow = 0; i < a.size(); i++ ) {
borrow = a[i] - borrow - (i < b.size() ? b.a[i] : 48);
c.a += borrow >= 0 ? borrow + 48 : borrow + 58;
borrow = borrow >= 0 ? 0 : 1;
}
return c.normalize(s);
}
BigInt operator * ( BigInt b ) { // multiplication operator overloading
BigInt c("0");
for( int i = 0, k = a[i] - 48; i < a.size(); i++, k = a[i] - 48 ) {
while(k--) c = c + b; // ith digit is k, so, we add k times
b.a.insert(b.a.begin(), '0'); // multiplied by 10
}
return c.normalize(sign * b.sign);
}
BigInt operator / ( BigInt b ) { // division operator overloading
if( b.size() == 1 && b.a[0] == '0' ) b.a[0] /= ( b.a[0] - 48 );
BigInt c("0"), d;
for( int j = 0; j < a.size(); j++ ) d.a += "0";
int dSign = sign * b.sign;
b.sign = 1;
for( int i = a.size() - 1; i >= 0; i-- ) {
c.a.insert( c.a.begin(), '0');
c = c + a.substr( i, 1 );
while( !( c < b ) ) c = c - b, d.a[i]++;
}
return d.normalize(dSign);
}
BigInt operator % ( BigInt b ) { // modulo operator overloading
if( b.size() == 1 && b.a[0] == '0' ) b.a[0] /= ( b.a[0] - 48 );
BigInt c("0");
b.sign = 1;
for( int i = a.size() - 1; i >= 0; i-- ) {
c.a.insert( c.a.begin(), '0');
c = c + a.substr( i, 1 );
while( !( c < b ) ) c = c - b;
}
return c.normalize(sign);
}
// << operator overloading
friend ostream& operator << (ostream&, BigInt const&);
private:
// representations and structures
string a; // to store the digits
int sign; // sign = -1 for negative numbers, sign = 1 otherwise
};
ostream& operator << (ostream& os, BigInt const& obj) {
if( obj.sign == -1 ) os << "-";
for( int i = obj.a.size() - 1; i >= 0; i--) {
os << obj.a[i];
}
return os;
}
Usage
BigInt a, b, c;
a = BigInt("1233423523546745312464532");
b = BigInt("45624565434216345i657652454352");
c = a + b;
// c = a * b;
// c = b / a;
// c = b - a;
// c = b % a;
cout << c << endl;
// dynamic memory allocation
BigInt *obj = new BigInt("123");
delete obj;

You can emulate uint128_t if you don't have it:
typedef struct uint128_t { uint64_t lo, hi; } uint128_t;
...
uint128_t add (uint64_t a, uint64_t b) {
uint128_t r; r.lo = a + b; r.hi = + (r.lo < a); return r; }
uint128_t sub (uint64_t a, uint64_t b) {
uint128_t r; r.lo = a - b; r.hi = - (r.lo > a); return r; }
Multiplication without inbuilt compiler or assembler support is a bit more difficult to get right. Essentially, you need to split both multiplicands into hi:lo unsigned 32-bit, and perform 'long multiplication' taking care of carries and 'columns' between the partial 64-bit products.
Divide and modulo return 64 bit results given 64 bit arguments - so that's not an issue as you have defined the problem. Dividing 128 bit by 64 or 128 bit operands is a much more complicated operation, requiring normalization, etc.
longlong.h routines umul_ppmm and udiv_qrnnd in GMP give the 'elementary' steps for multiple-precision/limb operations.
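The "long multiplication" described above can be sketched as follows. This is a minimal illustration, not GMP's implementation; the struct name `u128` and the function name `mul_64x64` are hypothetical. Each operand is split into 32-bit halves, the four partial products are formed in 64 bits, and the carries out of the middle column are propagated into the high word.

```c
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128;

/* Full 64 x 64 -> 128-bit unsigned multiplication using only 64-bit math. */
u128 mul_64x64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo; /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi; /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo; /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi; /* contributes to bits  64..127 */

    /* Middle column: at most 3 * (2^32 - 1), so it cannot overflow 64 bits. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    u128 r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

With x = y = 0xFFFFFFFFFFFFFFFF from the question, the product is 2^128 - 2^65 + 1, so `hi` is 0xFFFFFFFFFFFFFFFE and `lo` is 1.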

Most modern GCC versions support the __int128 type on 64-bit targets; it can hold a 128-bit integer. It is a compiler extension, not part of standard C.
Example:
__int128 add(__int128 a, __int128 b){
return a + b;
}

FFT returning conjugate of true answer

I have an odd problem. Following (re: copying) from here, I've been trying to implement the Cooley–Tukey FFT algorithm for arrays with a power-of-2 size, but the answers returned from this implementation are the conjugate of the true answers.
int fft_pow2(int dir,int m,float complex *a)
{
long nn,i,i1,j,k,i2,l,l1,l2;
float c1,c2,tx,ty,t1,t2,u1,u2,z;
float complex t;
/* Calculate the number of points */
nn = 1;
for (i=0;i<m;i++)
nn *= 2;
/* Do the bit reversal */
i2 = nn >> 1;
j = 0;
for (i=0;i<nn-1;i++) {
if (i < j) {
t = a[i];
a[i] = a[j];
a[j] = t;
}
k = i2;
while (k <= j) {
j -= k;
k >>= 1;
}
j += k;
}
/* Compute the FFT */
c1 = -1.0;
c2 = 0.0;
l2 = 1;
for (l=0;l<m;l++) {
l1 = l2;
l2 <<= 1;
u1 = 1.0;
u2 = 0.0;
for (j=0;j<l1;j++) {
for (i=j;i<nn;i+=l2) {
i1 = i + l1;
t = u1 * crealf(a[i1]) - u2 * cimagf(a[i1])
+ I * (u1 * cimagf(a[i1]) + u2 * crealf(a[i1]));
a[i1] = a[i] - t;
a[i] += t;
}
z = u1 * c1 - u2 * c2;
u2 = u1 * c2 + u2 * c1;
u1 = z;
}
c2 = sqrt((1.0 - c1) / 2.0);
if (dir == 1)
c2 = -c2;
c1 = sqrt((1.0 + c1) / 2.0);
}
/* Scaling for forward transform */
if (dir == 1) {
for (i=0;i<nn;i++) {
a[i] /= (float)nn;
}
}
return 1;
}
int main(int argc, char **argv) {
int n = 4; /* number of points; must be a power of 2 */
float complex arr[4] = { 1.0, 2.0, 3.0, 4.0 };
fft_pow2(0, (int)log2(n), arr);
for (int i = 0; i < n; i++) {
printf("%f %f\n", crealf(arr[i]), cimagf(arr[i]));
}
}
The results:
10.000000 0.000000
-2.000000 -2.000000
-2.000000 0.000000
-2.000000 2.000000
whereas the true answer is the conjugate.
Any ideas?

The FFT is often defined with $H_k = \sum_{j=0}^{N-1} e^{-2\pi i jk/N} h_j$. Note the minus sign in the exponent. The FFT can also be defined with a plus sign instead of the minus sign. In large part, the definitions are equivalent, because +i and -i are completely symmetric.
The code you show is written for the definition with the negative sign, and it is also written so that the first parameter, dir, is 1 for a forward transform and something else for a reverse transform. We can determine the intended direction because of the comment about scaling for the forward transform: It scales if dir is 1.
So, where your code in main calls fft_pow2 with 0 for dir, it is requesting a reverse transform. Your code has performed a reverse transform using the FFT definition with a negative sign. The reverse of the transform with a negative sign is a transform with a positive sign. For [1, 2, 3, 4], the result is:
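$1^0 \cdot 1 + 1^1 \cdot 2 + 1^2 \cdot 3 + 1^3 \cdot 4 = 1 + 2 + 3 + 4 = 10$.
$i^0 \cdot 1 + i^1 \cdot 2 + i^2 \cdot 3 + i^3 \cdot 4 = 1 + 2i - 3 - 4i = -2 - 2i$.
$(-1)^0 \cdot 1 + (-1)^1 \cdot 2 + (-1)^2 \cdot 3 + (-1)^3 \cdot 4 = 1 - 2 + 3 - 4 = -2$.
$(-i)^0 \cdot 1 + (-i)^1 \cdot 2 + (-i)^2 \cdot 3 + (-i)^3 \cdot 4 = 1 - 2i - 3 + 4i = -2 + 2i$.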
And that is the result you obtained.
