FFTW guru interface - c

I don't understand the guru interface of FFTW. Let me explain how I thought it worked, based on the manual and the question "How to use fftw Guru interface", and maybe someone can clear up my misunderstanding.
fftw_plan fftw_plan_guru64_dft(
int rank, const fftw_iodim64 *dims,
int howmany_rank, const fftw_iodim64 *howmany_dims,
fftw_complex *in, fftw_complex *out,
int sign, unsigned flags);
Suppose we want to calculate the DFT of interleaved multidimensional arrays, such as the six 2x2 arrays (each with a different colour) in this picture.
[Figure: interleaved DFTs]
Because the DFTs have stride 3 in the vertical direction and stride 2 in the horizontal direction, I thought we would need rank = 2 and dims = {(2, 3, 3), (2, 2, 2)}. The starting points form a 3 x 2 subarray, so I thought howmany_rank = 2, howmany_dims = {(3, 1, 1), (2, 1, 1)}.
However, this is not actually what FFTW does. I made a smaller example that is easy to calculate by hand, consisting of 4 DFTs of size 2x1 (indicated by colours). Each DFT has an input of the form (+-1, 0), which has as output (+-1, +-1), but that is not what FFTW calculates.
[Figure: small example]
Here is the code I used to calculate the DFT.
#include <stdio.h>
#include <stdlib.h>
#include <complex.h>
#include <math.h>
#include <fftw3.h>

int main(void)
{
    fftw_complex* X = fftw_malloc(8 * sizeof(fftw_complex));
    fftw_iodim* sizes = malloc(2 * sizeof(fftw_iodim));
    fftw_iodim* startingPoints = malloc(2 * sizeof(fftw_iodim));

    /* transform dimensions: length, input stride, output stride */
    sizes[0].n = 2; sizes[0].is = 2; sizes[0].os = 2;
    sizes[1].n = 1; sizes[1].is = 2; sizes[1].os = 2;
    /* loop ("howmany") dimensions over the starting points */
    startingPoints[0].n = 2; startingPoints[0].is = 1; startingPoints[0].os = 1;
    startingPoints[1].n = 2; startingPoints[1].is = 1; startingPoints[1].os = 1;

    fftw_plan plan = fftw_plan_guru_dft(2, sizes, 2, startingPoints, X, X, FFTW_FORWARD, FFTW_ESTIMATE);

    X[0] = 1.0; X[1] = -1.0;
    X[2] = 1.0; X[3] = -1.0;
    X[4] = 0.0; X[5] = 0.0;
    X[6] = 0.0; X[7] = 0.0;

    fftw_execute(plan); /* run the planned transforms in place */

    printf("\nOutput in row-major order:\n");
    for (int i = 0; i < 8; i++) {
        printf("%lf + %lfi, ", creal(X[i]), cimag(X[i]));
    }
    printf("\n");

    fftw_destroy_plan(plan);
    free(sizes);
    free(startingPoints);
    fftw_free(X);
    return 0;
}

Strides, even for the major axes, are given in "units", i.e. numbers of doubles or fftw_complex elements, not numbers of rows: https://www.fftw.org/fftw3_doc/Guru-vector-and-transform-sizes.html#Guru-vector-and-transform-sizes
My guess is that the strides along a major axis have to be multiplied by the distance between consecutive rows, also measured in units. So for the arrays in the question, the is and os strides of the vertical iodim should be 4*3 == 12.
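To make the stride arithmetic concrete, here is a minimal sketch in plain C. It assumes the first picture shows a row-major array with 6 rows of 4 elements (an inference from the 2x2 DFT size and the strides 3 and 2, not something the question states). It uses a local struct, iodim_sketch, as a stand-in for fftw_iodim so that it compiles without FFTW; the helper names make_dim and interleaved_layout are hypothetical.

```c
#include <assert.h>

/* Local stand-in for fftw_iodim (fields n, is, os) so that the sketch
   compiles without the FFTW headers. */
typedef struct {
    int n;   /* number of elements along this dimension */
    int is;  /* input stride, counted in array elements */
    int os;  /* output stride, counted in array elements */
} iodim_sketch;

/* Dimension of length n whose consecutive elements lie step_rows rows and
   step_cols columns apart in a row-major array with row_len elements per
   row. The same stride is used for input and output (in-place layout). */
iodim_sketch make_dim(int n, int step_rows, int step_cols, int row_len)
{
    assert(n > 0 && row_len > 0);
    int stride = step_rows * row_len + step_cols;
    iodim_sketch d = { n, stride, stride };
    return d;
}

/* Fill the four dimensions for the six interleaved 2x2 DFTs, assuming a
   6x4 row-major array (row_len = 4 is an assumption, see the lead-in). */
void interleaved_layout(iodim_sketch dims[2], iodim_sketch howmany[2])
{
    const int row_len = 4;
    dims[0] = make_dim(2, 3, 0, row_len);     /* 3 rows apart: 3*4 = 12 */
    dims[1] = make_dim(2, 0, 2, row_len);     /* 2 columns apart: 2 */
    howmany[0] = make_dim(3, 1, 0, row_len);  /* next starting row: 4 */
    howmany[1] = make_dim(2, 0, 1, row_len);  /* next starting column: 1 */
}
```

Under these assumptions the transform dimensions come out as (2, 12, 12) and (2, 2, 2), and the loop dimensions as (3, 4, 4) and (2, 1, 1). By the same reasoning, if the small example is a 4x2 row-major array, its length-2 transform would need stride make_dim(2, 2, 0, 2).is == 4 rather than 2. Both should be checked against the hand-computed result.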


Sparse matrix multiplication in Eigen giving wrong result?

I am using Eigen in a project of mine, and I am running into a strange issue. I have complex sparse matrices A and B (1500x1500 or larger), and am multiplying them together with coefficients.
When A = B, and taking vector x of ones, I expect that
(A-B)*x = 0, (A*B-B*A)*x = 0,
(A*A*B*B - B*B*A*A)*x = 0,
etc. and I do get this result for all these cases. (A.isApprox(B) evaluates to 1 and (A-B).norm() = 0).
However, when I multiply the matrices by doubles, as in
(c1*A*c2*A*d1*B*d2*B - d1*B*d2*B*c1*A*c2*A)*x,
I get a nonzero result, which doesn't make sense to me, as scalars should commute with the matrices. In fact, if I do,
(c1*c2*d1*d2*A*A*B*B - d1*d2*c1*c2*B*B*A*A)*x
I get zero. Any time the coefficients are interspersed in the matrix manipulation, I get a nonzero result.
I am not using any compiler optimizations, etc.
What am I doing wrong here?
I have worked up a simple example. Maybe I'm missing something dumb, but here it is. This gives me an error of 10^20.
#include <iostream>
#include <cmath>
#include <vector>
#include <Eigen/Sparse>
#include <complex>
typedef std::complex<double> Scalar;
typedef Eigen::SparseMatrix<Scalar, Eigen::RowMajor> SpMat;
typedef Eigen::Triplet<Scalar> trip;
int main(int argc, const char * argv[]) {
    double k0 = M_PI;
    double dz = 0.01;
    int nz = 1500; // number of grid points (used as a size and loop bound)
    std::vector<double> rhos(nz), atten(nz), cp(nz);
    for(int i = 0; i < nz; ++i){
        if(i < 750){
            rhos[i] = 1.5;
            cp[i] = 2500;
            atten[i] = 0.5;
        } else {
            rhos[i] = 1;
            cp[i] = 1500;
            atten[i] = 0;
        }
    }
    Scalar ci, eta, n, rho, drhodz;
    Scalar t1, t2, t3, t4;
    ci = Scalar(0,1);
    eta = 1.0/(40.0*M_PI*std::log10(std::exp(1.0)));
    int Mp = 6;
    std::vector<std::vector<trip> > mat_entries_N(Mp), mat_entries_D(Mp);
    for(int i = 0; i < nz; ++i){
        n = 1500./cp[i] * (1.+ ci * eta * atten[i]);
        rho = rhos[i];
        // density gradient: central difference inside, one-sided at the ends
        if(i > 0 && i < nz-1){
            drhodz = (rhos[i+1]-rhos[i-1])/(2*dz);
        }
        else if(i == 0){
            drhodz = (rhos[i+1]-rhos[i])/(dz);
        }
        else if(i == nz-1){
            drhodz = (rhos[i]-rhos[i-1])/(dz);
        }
        t1 = (n*n - 1.);
        t2 = 1./(k0*k0)*(-2./(dz * dz));
        t3 = 1./(k0*k0)*(drhodz/rho*2.*dz);
        t4 = 1./(k0*k0)*(1/(dz * dz));
        double c,d;
        for(int mp = 0; mp < Mp; ++mp){
            c = std::pow(std::sin((mp+1)*M_PI/(2*Mp+1)),2);
            d = std::pow(std::cos((mp+1)*M_PI/(2*Mp+1)),2);
            mat_entries_N[mp].push_back(trip(i,i,(c*(t1 + t2))));
            mat_entries_D[mp].push_back(trip(i,i,(d*(t1 + t2))));
            if(i < nz - 1){
                mat_entries_N[mp].push_back(trip(i,i+1,(c*(-t3 + t4))));
                mat_entries_D[mp].push_back(trip(i,i+1,(d*(-t3 + t4))));
            }
            if(i > 0){
                mat_entries_N[mp].push_back(trip(i,i-1,(c*(t3 + t4))));
                mat_entries_D[mp].push_back(trip(i,i-1,(d*(t3 + t4))));
            }
        }
    }
    SpMat N(nz,nz), D(nz,nz);
    SpMat identity(nz, nz);
    std::vector<trip> idcoeffs;
    for(int i = 0; i < nz; ++i){
        idcoeffs.push_back(trip(i,i,Scalar(1.0))); // unit diagonal
    }
    identity.setFromTriplets(idcoeffs.begin(), idcoeffs.end());
    SpMat temp(nz,nz);
    N = identity;
    D = identity;
    // N and D are products of Mp tridiagonal factors each
    for(int mp = 0; mp < Mp; ++mp){
        temp.setFromTriplets(mat_entries_N[mp].begin(), mat_entries_N[mp].end());
        N = (temp*N).eval();
        temp.setFromTriplets(mat_entries_D[mp].begin(), mat_entries_D[mp].end());
        D = (temp*D).eval();
    }
    std::cout << (N*D - D*N).norm() << std::endl;
    return 0;
}
The problem is that, without a reference value defining the expected order of magnitude of a nonzero result, it is impossible to tell whether 1e20 is huge or tiny.
In your case, the norms of the matrices N and D are about 1e20 and 1e18 respectively, and the norm of N*D is about 1e38. Given that the relative precision of double is about 1e-16, an error of 1e20 can be considered as 0 compared to 1e38.
To summarize, looking at the absolute error is meaningless most of the time. Instead, you have to look at the relative error:
std::cout << (N*D - D*N).norm()/(N*D).norm() << std::endl;
which gives you about 1e-17. This is indeed smaller than the numerical precision of double.
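The same check can be written down independently of Eigen. The following is a minimal sketch in C that takes two already computed norms and compares their ratio with the double rounding level DBL_EPSILON. The function names and the slack factor are assumptions made for illustration, not part of any library.

```c
#include <assert.h>
#include <float.h>

/* Error measured relative to the magnitude of the reference quantity. */
double relative_error(double abs_err, double ref_norm)
{
    assert(ref_norm > 0.0);
    return abs_err / ref_norm;
}

/* Nonzero if the error is at the level of double rounding noise.
   slack is a hypothetical safety factor (1.0 is the strictest choice)
   that leaves room for rounding errors accumulated over many operations. */
int is_rounding_noise(double abs_err, double ref_norm, double slack)
{
    return relative_error(abs_err, ref_norm) <= slack * DBL_EPSILON;
}
```

With the magnitudes quoted above, is_rounding_noise(1e20, 1e38, 1.0) is true, whereas the same absolute error of 1e20 against a reference norm of 1e20 would be a real discrepancy.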

Initial value problem for a system of ODEs solver C program

So I wanted to implement the path of the Moon around the Earth with a C program.
My problem is that you know the Moon's velocity and position at Apogee and Perigee.
So I started to solve it from Apogee, but I cannot figure out how I could add the second velocity and position as "initial value" for it. I tried it with an if but I don't see any difference between the results. Any help is appreciated!
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
typedef void (*ode)(double* p, double t, double* k, double* dk);
void euler(ode f, double *p, double t, double* k, double h, int n, int N)
{
    double kn[N];
    double dk[N];
    double Rp = - 3.633 * pow(10,8); // x position at Perigee
    for(int i = 0; i < n; i++)
    {
        f(p, 0, k, dk);
        for (int j = 0; j < N; j++)
        {
            if (k[0] == Rp) // this is the "if" I mentioned in my comment
            {
                // x coordinate at Perigee
                k[1] = 0; // y coordinate at Perigee
                k[2] = 0; // x velocity component at Perigee
                k[3] = 1076; // y velocity component at Perigee
            }
            kn[j] = k[j] + h * dk[j];
            printf("%f ", kn[j]);
            k[j] = kn[j];
        }
    }
}

void gravity_equation(double* p, double t, double* k, double* dk)
{
    // Earth is at the (0, 0)
    double G = p[0]; // Gravitational constant
    double m = p[1]; // Earth mass
    double x = k[0]; // x coordinate at Apogee
    double y = k[1]; // y coordinate at Apogee
    double Vx = k[2]; // x velocity component at Apogee
    double Vy = k[3]; // y velocity component at Apogee
    dk[0] = Vx;
    dk[1] = Vy;
    dk[2] = (- G * m * x) / pow(sqrt((x * x)+(y * y)),3);
    dk[3] = (- G * m * y) / pow(sqrt((x * x)+(y * y)),3);
}

void run_gravity_equation()
{
    int N = 4; // how many equations there are
    double initial_values[N];
    initial_values[0] = 4.055*pow(10,8); // x position at Apogee
    initial_values[1] = 0; // y position at Apogee
    initial_values[2] = 0; // x velocity component at Apogee
    initial_values[3] = (-1) * 964; // y velocity component at Apogee
    int p = 2; // how many parameters there are
    double parameters[p];
    parameters[0] = 6.67384 * pow(10, -11); // Gravitational constant
    parameters[1] = 5.9736 * pow(10, 24); // Earth mass
    double h = 3600; // step size
    int n = 3000; // the number of steps
    euler(&gravity_equation, parameters, 0, initial_values, h, n, N);
}

int main()
{
    run_gravity_equation();
    return 0;
}
Your interface is
euler(odefun, params, t0, y0, h, n, N)
N = dimension of state space
n = number of steps to perform
h = step size
t0, y0 = initial time and value
The intended function of this procedure seems to be that the updated values are returned inside the array y0. There is no reason to insert a hack that forces the state onto other conditions partway through: the initial condition is passed as an argument, as you already do in run_gravity_equation(). The integration routine should remain agnostic of the details of the physical model.
It is extremely improbable that the integration will hit exactly the value Rp, so the test k[0] == Rp will practically never be true. What you can do is check for sign changes in Vx, that is, k[2], to find points or segments of extremal x coordinate.
Reading your description more closely, what you want is to solve a boundary value problem where x(0)=4.055e8, x'(0)=0, y'(0)=-964 and x(T)=-3.633e8, x'(T)=0. This combines two advanced tasks: solving a boundary value problem with single or multiple shooting, and handling an upper boundary T that is itself unknown.
You might want to use the Kepler laws to get further insight into the parameters of this problem, so that you can solve it with just a forward integration. The Kepler ellipse of the first Kepler law has the formula (scaled for Apogee at phi=0, Perigee at phi=pi)
r = R/(1-E*cos(phi))
so that
R/(1-E)=4.055e8 and R/(1+E)=3.633e8,
which gives
==> E = (4.055-3.633)/(4.055+3.633) = 0.054891,
R = 3.633e8*(1+0.05489) = 3.8324e8
Further, the angular velocity is given by the second Kepler law
phi'*r^2 = const. = sqrt(R*G*m)
which gives tangential velocities at Apogee (r=R/(1-E))
y'(0)=phi'*r = sqrt(R*G*m)*(1-E)/R = 963.9438
and Perigee (r=R/(1+E))
-y'(T)=phi'*r = sqrt(R*G*m)*(1+E)/R = 1075.9130
which indeed reproduces the constants you used in your code.
The area of the Kepler ellipse is pi/4 times the product of smallest and largest diameter. The smallest diameter can be found at cos(phi)=E, the largest is the sum of apogee and perigee radius, so that the area is
pi*R/sqrt(1-E^2)*(R/(1+E)+R/(1-E))/2= pi*R^2/(1-E^2)^1.5
At the same time it is the integral of 0.5*phi'*r^2 over the full period 2*T, thus equal to
0.5*sqrt(R*G*m)*2*T = sqrt(R*G*m)*T
which is the third Kepler law. This allows the half-period to be computed as
T = pi/sqrt(G*m)*(R/(1-E^2))^1.5 = 1185821
With h = 3600 the half point should be reached between n=329 and n=330 (n=329.395). Integration with scipy.integrate.odeint vs. Euler steps gives the following table for h=3600:
n [ x[n], y[n] ] for odeint/lsode for Euler
328 [ -4.05469444e+08, 4.83941626e+06] [ -4.28090166e+08, 3.81898023e+07]
329 [ -4.05497554e+08, 1.36933874e+06] [ -4.28507841e+08, 3.48454695e+07]
330 [ -4.05494242e+08, -2.10084488e+06] [ -4.28897657e+08, 3.14986514e+07]
The same for h=36, n=32939..32940
n [ x[n], y[n] ] for odeint/lsode for Euler
32938 [ -4.05499997e+08 5.06668940e+04] [ -4.05754415e+08 3.93845978e+05]
32939 [ -4.05500000e+08 1.59649309e+04] [ -4.05754462e+08 3.59155385e+05]
32940 [ -4.05500000e+08 -1.87370323e+04] [ -4.05754505e+08 3.24464789e+05]
32941 [ -4.05499996e+08 -5.34389954e+04] [ -4.05754545e+08 2.89774191e+05]
which is a little closer for the Euler method, but not much better.

Implementing equations with very small numbers in C - Planck's Law generating blackbody

I have a problem that, after much head scratching, I think is to do with very small numbers in a long-double.
I am trying to implement Planck's law equation to generate a normalised blackbody curve at 1nm intervals between a given wavelength range and for a given temperature. Ultimately this will be a function accepting inputs, for now it is main() with the variables fixed and outputting by printf().
I see examples in matlab and python, and they are implementing the same equation as me in a similar loop with no trouble at all.
This is the equation:
B(lambda, T) = (2*h*c^2 / lambda^5) * 1 / (exp(h*c / (lambda*k*T)) - 1)
My code generates an incorrect blackbody curve:
I have tested key parts of the code independently. After trying to test the equation by breaking it into blocks in Excel, I noticed that it does result in very small numbers, and I wonder if my implementation of large numbers could be causing the issue. Does anyone have any insight into using C to implement equations? This is a new area to me and I have found the maths much harder to implement and debug than normal code.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
//global variables
const double H = 6.626070040e-34; //Planck's constant (Joule-seconds)
const double C = 299800000; //Speed of light in vacuum (meters per second)
const double K = 1.3806488e-23; //Boltzmann's constant (Joules per Kelvin)
const double nm_to_m = 1e-6; //conversion between nm and m
const int interval = 1; //wavelength interval to calculate at (nm)
//typedef structure to hold results
typedef struct {
    int *wavelength;
    long double *radiance;
    long double *normalised;
} results;
int main() {
    int min = 100 , max = 3000; //wavelength bounds to calculate between, later to be swapped to function inputs
    double temprature = 200; //temperature in kelvin, later to be swapped to function input
    double new_valu, old_valu = 0;
    static results SPD_data, *SPD; //setup a static results structure and a pointer to point to it
    SPD = &SPD_data;
    SPD->wavelength = malloc(sizeof(int) * (max - min)); //allocate memory based on wavelength bounds
    SPD->radiance = malloc(sizeof(long double) * (max - min));
    SPD->normalised = malloc(sizeof(long double) * (max - min));
    for (int i = 0; i <= (max - min); i++) {
        //Fill wavelength vector
        SPD->wavelength[i] = min + (interval * i);
        //Computes radiance for every wavelength of blackbody of given temperature
        SPD->radiance[i] = ((2 * H * pow(C, 2)) / (pow((SPD->wavelength[i] / nm_to_m), 5))) * (1 / (exp((H * C) / ((SPD->wavelength[i] / nm_to_m) * K * temprature))-1));
        //Copy SPD->radiance to SPD->normalised
        SPD->normalised[i] = SPD->radiance[i];
        //Find largest value
        if (i <= 0) {
            old_valu = SPD->normalised[0];
        } else if (i > 0){
            new_valu = SPD->normalised[i];
            if (new_valu > old_valu) {
                old_valu = new_valu;
            }
        }
    }
    //for debug purposes
    printf("wavelength(nm) radiance(Watts per steradian per meter squared) normalised radiance\n");
    for (int i = 0; i <= (max - min); i++) {
        //Normalise SPD
        SPD->normalised[i] = SPD->normalised[i] / old_valu;
        //for debug purposes
        printf("%d %Le %Lf\n", SPD->wavelength[i], SPD->radiance[i], SPD->normalised[i]);
    }
    return 0; //later to be swapped to 'return SPD';
}
/*********************UPDATE Friday 24th Mar 2017 23:42*************************/
Thank you for the suggestions so far, lots of useful pointers especially understanding the way numbers are stored in C (IEEE 754) but I don't think that is the issue here as it only applies to significant digits. I implemented most of the suggestions but still no progress on the problem. I suspect Alexander in the comments is probably right, changing the units and order of operations is likely what I need to do to make the equation work like the matlab or python examples, but my knowledge of maths is not good enough to do this. I broke the equation down into chunks to take a closer look at what it was doing.
//global variables
const double H = 6.6260700e-34; //Planck's constant (Joule-seconds) 6.626070040e-34
const double C = 299792458; //Speed of light in vacuum (meters per second)
const double K = 1.3806488e-23; //Boltzmann's constant (Joules per Kelvin) 1.3806488e-23
const double nm_to_m = 1e-9; //conversion between nm and m
const int interval = 1; //wavelength interval to calculate at (nm)
const int min = 100, max = 3000; //max and min wavelengths to calculate between (nm)
const double temprature = 200; //temperature (K)
//typedef structure to hold results
typedef struct {
    int *wavelength;
    long double *radiance;
    long double *normalised;
} results;
//main program
int main()
{
    //setup a static results structure and a pointer to point to it
    static results SPD_data, *SPD;
    SPD = &SPD_data;
    //allocate memory based on wavelength bounds
    SPD->wavelength = malloc(sizeof(int) * (max - min));
    SPD->radiance = malloc(sizeof(long double) * (max - min));
    SPD->normalised = malloc(sizeof(long double) * (max - min));
    //break equation into visible parts for debugging
    long double aa, bb, cc, dd, ee, ff, gg, hh, ii, jj, kk, ll, mm, nn, oo;
    for (int i = 0; i < (max - min); i++) {
        //Computes radiance at every wavelength interval for blackbody of given temperature
        SPD->wavelength[i] = min + (interval * i);
        aa = 2 * H;
        bb = pow(C, 2);
        cc = aa * bb;
        dd = pow((SPD->wavelength[i] / nm_to_m), 5);
        ee = cc / dd;
        ff = 1;
        gg = H * C;
        hh = SPD->wavelength[i] / nm_to_m;
        ii = K * temprature;
        jj = hh * ii;
        kk = gg / jj;
        ll = exp(kk);
        mm = ll - 1;
        nn = ff / mm;
        oo = ee * nn;
        SPD->radiance[i] = oo;
    }
    //for debug purposes
    printf("wavelength(nm) | radiance(Watts per steradian per meter squared)\n");
    for (int i = 0; i < (max - min); i++) {
        printf("%d %Le\n", SPD->wavelength[i], SPD->radiance[i]);
    }
    return 0;
}
Equation variable values during runtime in xcode:
I notice a couple of things that are wrong and/or suspicious about the current state of your program:
You have defined nm_to_m as 10^-9, yet you divide by it. If your wavelength is measured in nanometers, you should multiply it by 10^-9 to get it in meters. To wit, if hh is supposed to be your wavelength in meters, it is on the order of several light-hours.
The same is obviously true for dd as well.
mm, being the exponential expression minus 1, is zero, which gives you infinity in the results deriving from it. This is apparently because you don't have enough digits in a double to represent the significant part of the exponential. Instead of using exp(...) - 1 here, try the expm1() function, which implements a well-defined algorithm for calculating exponentials minus 1 without cancellation errors.
Since interval is 1, it doesn't currently matter, but you can probably see that your results wouldn't match the meaning of the code if you set interval to something else.
Unless you plan to change something about this in the future, there shouldn't be a need for this program to "save" the values of all calculations. You could just print them out as you compute them.
On the other hand, you don't seem to be in any danger of underflow or overflow. The largest and smallest numbers you use don't seem to be far from 10^±60, which is well within what ordinary doubles can deal with, let alone long doubles. That being said, it might not hurt to use more normalized units, but at the magnitudes you currently display, I wouldn't worry about it.
Thanks for all the pointers in the comments. For anyone else running into a similar problem with implementing equations in C, I had a few silly errors in the code:
writing a 6 not a 9
dividing when I should be multiplying
an off by one error with the size of my array vs the iterations of for() loop
200 when I meant 2000 in the temperature variable
As a result of the last one particularly I was not getting the results I expected (my wavelength range was not right for plotting the temperature I was calculating) and this was leading me to the assumption that something was wrong in the implementation of the equation, specifically I was thinking about big/small numbers in C because I did not understand them. This was not the case.
In summary, I should have made sure I knew exactly what my equation should be outputting for given test conditions before implementing it in code. I will work on getting more comfortable with maths, particularly algebra and dimensional analysis.
Below is the working code, implemented as a function, feel free to use it for anything but obviously no warranty of any kind etc.
// Computes radiance for every wavelength of blackbody of given temperature
// INPUTS: int min wavelength to begin calculation from (nm), int max wavelength to end calculation at (nm), int temperature (kelvin)
// OUTPUTS: pointer to structure containing:
// - spectral radiance (Watts per steradian per meter squared per wavelength at 1nm intervals)
// - normalised radiance
//include & define
#include "blackbody.h"
//global variables
const double H = 6.626070040e-34; //Planck's constant (Joule-seconds) 6.626070040e-34
const double C = 299792458; //Speed of light in vacuum (meters per second)
const double K = 1.3806488e-23; //Boltzmann's constant (Joules per Kelvin) 1.3806488e-23
const double nm_to_m = 1e-9; //conversion between nm and m
const int interval = 1; //wavelength interval to calculate at (nm), to change this line 45 also need to be changed
bbresults* blackbody(int min, int max, double temperature) {
    double new_valu, old_valu = 0; //variables for normalising result
    bbresults *SPD;
    SPD = malloc(sizeof(bbresults));
    //allocate memory based on wavelength bounds
    SPD->wavelength = malloc(sizeof(int) * (max - min));
    SPD->radiance = malloc(sizeof(long double) * (max - min));
    SPD->normalised = malloc(sizeof(long double) * (max - min));
    for (int i = 0; i < (max - min); i++) {
        //Computes radiance for every wavelength of blackbody of given temperature
        SPD->wavelength[i] = min + (interval * i);
        SPD->radiance[i] = ((2 * H * pow(C, 2)) / (pow((SPD->wavelength[i] * nm_to_m), 5))) * (1 / (expm1((H * C) / ((SPD->wavelength[i] * nm_to_m) * K * temperature))));
        //Copy SPD->radiance to SPD->normalised
        SPD->normalised[i] = SPD->radiance[i];
        //Find largest value
        if (i <= 0) {
            old_valu = SPD->normalised[0];
        } else if (i > 0){
            new_valu = SPD->normalised[i];
            if (new_valu > old_valu) {
                old_valu = new_valu;
            }
        }
    }
    for (int i = 0; i < (max - min); i++) {
        //Normalise SPD
        SPD->normalised[i] = SPD->normalised[i] / old_valu;
    }
    return SPD;
}

#ifndef blackbody_h
#define blackbody_h
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
//typedef structure to hold results
typedef struct {
    int *wavelength;
    long double *radiance;
    long double *normalised;
} bbresults;
//function declarations
bbresults* blackbody(int, int, double);
#endif /* blackbody_h */

#include <stdio.h>
#include "blackbody.h"
int main() {
    bbresults *TEST;
    int min = 100, max = 3000, temp = 5000;
    TEST = blackbody(min, max, temp);
    printf("wavelength | normalised radiance | radiance |\n");
    printf(" (nm) | - | (W per meter squr per steradian) |\n");
    for (int i = 0; i < (max - min); i++) {
        printf("%4d %Lf %Le\n", TEST->wavelength[i], TEST->normalised[i], TEST->radiance[i]);
    }
    return 0;
}
Plot of output:

How to implement nested loops in cuda thrust

I currently have to run a nested loop as follows:
for(int i = 0; i < N; i++){
    for(int j = i+1; j <= N; j++){
        compute(...)//some calculation here
    }
}
I've tried leaving the first loop on the CPU and doing the second loop on the GPU. The result is too many memory accesses. Is there any other way to do it? For example with thrust::reduce_by_key?
The whole program is here:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/random.h>
#include <cmath>
#include <ctime>
#include <iostream>
#include <iomanip>
#define N 1000000
// define a 2d point pair
typedef thrust::tuple<float, float> Point;
// return a random Point in [0,1)^2
Point make_point(void)
{
    static thrust::default_random_engine rng(12345);
    static thrust::uniform_real_distribution<float> dist(0.0f, 1.0f);
    float x = dist(rng);
    float y = dist(rng);
    return Point(x,y);
}
struct sqrt_dis: public thrust::unary_function<Point, double>
{
    float x, y;
    double tmp;
    sqrt_dis(float _x, float _y): x(_x), y(_y){}
    __host__ __device__
    float operator()(Point a)
    {
        tmp =(thrust::get<0>(a)-x)*(thrust::get<0>(a)-x)+\
             (thrust::get<1>(a)-y)*(thrust::get<1>(a)-y);
        tmp = -1.0*(sqrt(tmp));
        return (1.0/tmp);
    }
};
int main(void) {
    clock_t t1, t2;
    double result = 0;
    t1 = clock();
    // allocate some random points in the unit square on the host
    thrust::host_vector<Point> h_points(N);
    thrust::generate(h_points.begin(), h_points.end(), make_point);
    // transfer to device
    thrust::device_vector<Point> points = h_points;
    thrust::plus<double> binary_op;
    float init = 0;
    for(int i = 0; i < N; i++){
        Point tmp_i = points[i];
        float x = thrust::get<0>(tmp_i);
        float y = thrust::get<1>(tmp_i);
        result += thrust::transform_reduce(points.begin()+i,\
                                           points.end(), sqrt_dis(x, y), init, binary_op);
        std::cout<<"result"<<i<<": "<<result<<std::endl;
    }
    t2 = clock()-t1;
    std::cout<<"result: ";
    std::cout<< result <<std::endl;
    std::cout<<"run time: "<<t2/CLOCKS_PER_SEC<<"s"<<std::endl;
    return 0;
}
EDIT: Now that you have posted an example, here is how you could solve it:
You have n 2D points stored in a linear array like this (here n=4)
points = [p0 p1 p2 p3]
Based on your code I assume you want to calculate:
result = f(p0, p1) + f(p0, p2) + f(p0, p3) +
f(p1, p2) + f(p1, p3) +
f(p2, p3)
Where f() is your distance function which needs to be executed m times in total:
m = (n-1)*n/2
in this example: m=6
You can look at this problem as a triangular matrix:
[ p0 p1 p2 p3 ]
[    p1 p2 p3 ]
[       p2 p3 ]
[          p3 ]
Transforming this matrix into a linear vector with m elements while leaving out the diagonal elements results in:
[p1 p2 p3 p2 p3 p3]
The index k of an element in this vector lies in [0, m-1].
Index k can be remapped to columns and rows of the triangular matrix to k -> (i,j):
i = n - 2 - floor(sqrt(-8*k + 4*n*(n-1)-7)/2.0 - 0.5)
j = k + i + 1 - n*(n-1)/2 + (n-i)*((n-i)-1)/2
i is the row and j is the column.
In our example:
0 -> (0, 1)
1 -> (0, 2)
2 -> (0, 3)
3 -> (1, 2)
4 -> (1, 3)
5 -> (2, 3)
Now you can put all this together and execute a modified distance functor m times which applies the aforementioned mapping to get the corresponding pairs based on the index and then sum up everything.
I modified your code accordingly:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/generate.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/random.h>
#include <math.h>
#include <iostream>
#include <stdio.h>
#include <stdint.h>
typedef float Float;
// define a 2d point pair
typedef thrust::tuple<Float, Float> Point;
// return a random Point in [0,1)^2
Point make_point(void)
{
    static thrust::default_random_engine rng(12345);
    static thrust::uniform_real_distribution<Float> dist(0.0, 1.0);
    Float x = dist(rng);
    Float y = dist(rng);
    return Point(x,y);
}
struct sqrt_dis_new
{
    typedef thrust::device_ptr<Point> DevPtr;
    DevPtr points;
    const uint64_t n;
    sqrt_dis_new(uint64_t n, DevPtr p) : points(p), n(n)
    {
    }
    __host__ __device__
    Float operator()(uint64_t k) const
    {
        // calculate indices in triangular matrix
        const uint64_t i = n - 2 - floor(sqrt((double)(-8*k + 4*n*(n-1)-7))/2.0 - 0.5);
        const uint64_t j = k + i + 1 - n*(n-1)/2 + (n-i)*((n-i)-1)/2;
        printf("%llu -> (%llu, %llu)\n", (unsigned long long)k, (unsigned long long)i, (unsigned long long)j);
        const Point& p1 = *(points.get()+j);
        const Point& p2 = *(points.get()+i);
        const Float xm = thrust::get<0>(p1)-thrust::get<0>(p2);
        const Float ym = thrust::get<1>(p1)-thrust::get<1>(p2);
        return 1.0/(-1.0 * sqrt(xm*xm + ym*ym));
    }
};
int main()
{
    const uint64_t N = 4;
    // allocate some random points in the unit square on the host
    thrust::host_vector<Point> h_points(N);
    thrust::generate(h_points.begin(), h_points.end(), make_point);
    // transfer to device
    thrust::device_vector<Point> d_points = h_points;
    const uint64_t count = (N-1)*N/2;
    std::cout << count << std::endl;
    thrust::plus<Float> binary_op;
    const Float init = 0.0;
    // one functor call per pair index k in [0, count)
    Float result = thrust::transform_reduce(thrust::make_counting_iterator((uint64_t)0),
                                            thrust::make_counting_iterator(count),
                                            sqrt_dis_new(N, d_points.data()),
                                            init, binary_op);
    std::cout<<"result: " << result << std::endl;
    return 0;
}
It depends on your compute function which you do not specify.
Usually you unroll the loops and launch the kernel in a 2D manner for every combination of i and j if the computations are independent.
Have a look at the Thrust examples and identify similar use cases to your problem.

Line fit from an array of 2d vectors

I have a problem in some C code; I assume it belongs here rather than on the Mathematics Stack Exchange.
I have an array of changes in x and y position generated by a user dragging a mouse. How can I determine whether a straight line was drawn or not?
I am currently using linear regression. Is there a better (more efficient) way to do this?
Hough transformation attempt:
#define abSIZE 100
#define ARRAYSIZE 10
int A[abSIZE][abSIZE]; //points in the a-b plane
int dX[10] = {0, 10, 13, 8, 20, 18, 19, 22, 12, 23};
int dY[10] = {0, 2, 3, 1, -1, -2, 0, 0, 3, 1};
int absX[10]; //absolute positions
int absY[10];
int error = 0;
int sumx = 0, sumy = 0, i;
//Convert deltas to absolute positions
for (i = 0; i<10; i++) {
    absX[i] = sumx+=dX[i];
    absY[i] = sumy+=dY[i];
}
//initialise array to zero
int a, b, x, y;
for(a = -abSIZE/2; a < abSIZE/2; a++) {
    for(b = -abSIZE/2; b< abSIZE/2; b++) {
        A[a+abSIZE/2][b+abSIZE/2] = 0;
    }
}
//Hough transform
int aMax = 0;
int bMax = 0;
int highest = 0;
for(i=0; i<10; i++) {
    x = absX[i];
    y = absY[i]; //y must come from absY, not absX
    for(a = -abSIZE/2; a < abSIZE/2; a++) {
        for(b = -abSIZE/2; b< abSIZE/2; b++) {
            if (a*x + b == y) {
                A[a+abSIZE/2][b+abSIZE/2] += 1;
                if (A[a+abSIZE/2][b+abSIZE/2] > highest) {
                    highest++; //highest = A[a+abSIZE/2][b+abSIZE/2]
                    aMax = a;
                    bMax = b;
                }
            }
        }
    }
}
printf("Line is Y = %d*X + %d\n",aMax,bMax);
//Calculate MSE
int e;
for (i = 0; i < ARRAYSIZE; i++) {
    e = absY[i] - (aMax * absX[i] + bMax);
    e = (int) pow((double)e, 2);
    error += e;
}
printf("error is: %d\n", error);
Though linear regression sounds like a perfectly reasonable way to solve the task, here's another suggestion: Hough transform, which might be somewhat more robust against outliers. Here is a very rough sketch of how this can be applied:
initialize a large matrix A with zeros
transform your deltas to some absolute coordinates (x, y) in a x-y-plane (e.g. start with (0,0))
for each point
there are non-unique parameters a and b such that a*x + b = y. All such points (a,b) define a straight line in the a-b-plane
draw this "line" in the a-b-plane by adding ones to the corresponding cells in A, which represents the quantized plane
now you can find a maximum in the a-b-plane-matrix A, which will correspond to the parameters (a, b) of the straight line in the x-y-plane that has most support by the original points
finally, calculate MSE to the original points and decide with some threshold if the move was a straight line
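The steps above can be sketched in C as follows. This is only an illustration under simplifying assumptions: slope a and intercept b are quantized to integers in [-50, 49], as in the attempt in the question, and all names are hypothetical. For each point and each slope there is exactly one intercept b = y - a*x, so the accumulator cell can be computed directly instead of being searched for.

```c
#include <assert.h>
#include <stddef.h>

#define AB_SIZE 100  /* a and b are integers in [-50, 49] */

typedef struct {
    int a;      /* slope with the most votes */
    int b;      /* intercept with the most votes */
    int votes;  /* number of points lying exactly on y = a*x + b */
} hough_line;

/* Vote in the quantized a-b plane and return the best supported line.
   The accumulator is static, so the function is not reentrant. */
hough_line hough_fit(const int *x, const int *y, size_t count)
{
    static int acc[AB_SIZE][AB_SIZE];
    const int half = AB_SIZE / 2;
    hough_line best = { 0, 0, 0 };

    assert(x != NULL && y != NULL);
    for (int a = 0; a < AB_SIZE; a++) {
        for (int b = 0; b < AB_SIZE; b++) {
            acc[a][b] = 0;
        }
    }
    for (size_t i = 0; i < count; i++) {
        for (int a = -half; a < half; a++) {
            int b = y[i] - a * x[i];  /* the only b with a*x + b == y */
            if (b < -half || b >= half) {
                continue;             /* outside the quantized plane */
            }
            int v = ++acc[a + half][b + half];
            if (v > best.votes) {
                best.votes = v;
                best.a = a;
                best.b = b;
            }
        }
    }
    return best;
}

/* Mean squared vertical distance of the points from the line. */
double mean_squared_error(hough_line line, const int *x, const int *y,
                          size_t count)
{
    double sum = 0.0;
    assert(count > 0);
    for (size_t i = 0; i < count; i++) {
        double e = y[i] - (line.a * x[i] + line.b);
        sum += e * e;
    }
    return sum / (double)count;
}

/* Hypothetical demo: five points lying exactly on y = 2*x + 1. */
hough_line demo_fit(void)
{
    const int x[] = { 0, 1, 2, 3, 4 };
    const int y[] = { 1, 3, 5, 7, 9 };
    return hough_fit(x, y, 5);
}
```

Integer slopes are far too coarse for real mouse data. A practical version would quantize a more finely or switch to the (r, theta) parametrization, and then compare the mean squared error against a threshold chosen for the application.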
More details e.g. here:
Edit: here's a quote from Wikipedia that explains why it's better to use a different parametrization to deal with vertical lines (where a would become infinite in ax+b=y):
However, vertical lines pose a problem. They are more naturally described as x = a and would give rise to unbounded values of the slope parameter m. Thus, for computational reasons, Duda and Hart proposed the use of a different pair of parameters, denoted r and theta, for the lines in the Hough transform. These two values, taken in conjunction, define a polar coordinate.
Thanks to Zaw Lin for pointing this out.
