Converting a simple C code into a CUDA code


I'm trying to convert a simple numerical analysis code (trapezium rule numerical integration) into something that will run on my CUDA-enabled GPU. There is a lot of literature out there, but it all seems far more complex than what is required here! My current code is:
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#define N 1000
double function(double);
int main(void)
{
int i;
double lower_bound, upper_bound, h, ans;
printf("Please enter the lower and upper bounds: ");
if (scanf(" %lf %lf", &lower_bound, &upper_bound) != 2) {
fprintf(stderr, "Invalid input\n");
return EXIT_FAILURE;
}
h = (upper_bound - lower_bound) / N;
ans = (function(lower_bound) + function(upper_bound)) / 2.0;
for (i = 1; i < N; ++i) {
ans += function(lower_bound + i * h); /* interior points start at lower_bound, not 0 */
}
printf("The integral is: %.20lf\n", h * ans);
return 0;
}
double function(double x)
{
return sin(x);
}
This runs well until N becomes very large. I've made an implementation with OpenMP, which is faster, but I think it will be handy to know a little about CUDA too. Has anyone got any suggestions about where to start, or is there a painless way to convert this code? Many thanks, Jack.

It's the loop that would have to be distributed to parallel threads. You can calculate a unique index for each thread (idx = 0...N-1). Each thread merely calculates its individual part of the integral and stores the answer in its position in a common array (intgrl[idx]). You then sum everything up using a procedure called a parallel reduction. There are examples in the NVIDIA CUDA samples. The easiest way would be to use the Thrust library: you simply tell it "add up these values" (e.g. with thrust::reduce) and it takes care of doing the sum efficiently on the GPU.

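You could get rid of the multiplication :D (start from lower_bound rather than 0, and note that repeated addition accumulates rounding error; it also makes each iteration depend on the previous one, which is exactly what makes a loop harder to split across parallel threads):
double nomul = lower_bound + h;
for (i = 1; i < N; ++i) {
ans += function(nomul);
nomul += h;
}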

First, go ahead and install the CUDA Toolkit on your computer. After that, try to run some of the examples that come with the SDK. They may look a little complicated at first sight, but don't worry, there are tons of CUDA "Hello World" examples on the web.
If you're looking for something fancier, you could try compiling this project (you'll need to install OpenCV), which converts an image to its grayscale representation. It has files to compile on Windows/Linux/Mac OS X, so it's worth taking a look if you need help compiling your projects.

Related

Why is the use of unrelated printf statement causing changes in my program output?

I'm stuck with a program where just having a printf statement is causing changes in the output.
I have an array of n elements. For every d consecutive elements, if the next element (the (d+1)th) is greater than or equal to twice their median, I increment the value of notifications. The complete problem statement can be found here.
This is my program:
#include <math.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#define RANGE 200
float find_median(int *freq, int *ar, int i, int d) {
int *count = (int *)calloc(sizeof(int), RANGE + 1);
for (int j = 0; j <= RANGE; j++) {
count[j] = freq[j];
}
for (int j = 1; j <= RANGE; j++) {
count[j] += count[j - 1];
}
int *arr = (int *)malloc(sizeof(int) * d);
float median;
for (int j = i; j < i + d; j++) {
int index = count[ar[j]] - 1;
arr[index] = ar[j];
count[ar[j]]--;
if (index == d / 2) {
if (d % 2 == 0) {
median = (float)(arr[index] + arr[index - 1]) / 2;
} else {
median = arr[index];
}
break;
}
}
free(count);
free(arr);
return median;
}
int main() {
int n, d;
scanf("%d %d", &n, &d);
int *arr = malloc(sizeof(int) * n);
for (int i = 0; i < n; i++) {
scanf("%i", &arr[i]);
}
int *freq = (int *)calloc(sizeof(int), RANGE + 1);
int notifications = 0;
if (d < n) {
for (int i = 0; i < d; i++)
freq[arr[i]]++;
for (int i = 0; i < n - d; i++) {
float median = find_median(freq, arr, i, d); /* Count sorts the arr elements in the range i to i+d-1 and returns the median */
if (arr[i + d] >= 2 * median) { /* If the (i+d)th element is greater or equals to twice the median, increments notifications*/
printf("X");
notifications++;
}
freq[arr[i]]--;
freq[arr[i + d]]++;
}
}
printf("%d", notifications);
return 0;
}
For large inputs like this one, the program outputs 936 as the value of notifications, whereas when I just remove the statement printf("X") the program outputs 1027.
I really can't understand what is causing this behavior in my program, or what I'm missing/overlooking.

Your program has undefined behavior here:
for (int j = 0; j <= RANGE; j++) {
count[j] += count[j - 1];
}
You should start the loop at j = 1. As coded, you access memory before the beginning of the array count, which could cause a crash or produce an unpredictable value. Changing anything in the running environment can lead to a different behavior. As a matter of fact, even changing nothing could.
The rest of the code is more difficult to follow at a quick glance, but given the computations on index values, there may be more problems there too.
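Here is one concrete problem of that kind in the code as posted. When d is even, find_median reads arr[index - 1] as soon as arr[index] has been filled. However, the counting-sort loop places elements in input order, not sorted order. That slot of the malloc'd (uninitialized) buffer may therefore not have been written yet. Reading it is undefined behavior. Its contents can plausibly depend on earlier heap activity, such as a stdio buffer allocated by printf.

A sketch that avoids the scratch array altogether walks the frequency table the program already maintains. It assumes the same 0..RANGE bound on values as the question, and kth_smallest and twice_median are hypothetical names:

```c
#define RANGE 200   /* as in the question: every value lies in 0..RANGE */

/* k-th smallest value (k counted from 0) of the window whose counts are
   in freq[0..RANGE]; returns -1 if the window has k or fewer elements. */
static int kth_smallest(const int *freq, int k)
{
    int seen = 0;
    for (int v = 0; v <= RANGE; v++) {
        seen += freq[v];
        if (seen > k)
            return v;
    }
    return -1;
}

/* Twice the median of a window of d >= 1 values, as an int, so that the
   check arr[i + d] >= 2 * median needs no floating point. */
static int twice_median(const int *freq, int d)
{
    if (d % 2 == 1)
        return 2 * kth_smallest(freq, d / 2);
    return kth_smallest(freq, d / 2 - 1) + kth_smallest(freq, d / 2);
}
```

With this, the test in main would become arr[i + d] >= twice_median(freq, d), and find_median with its index arithmetic is no longer needed.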
For starters, you should add some consistency checks:
verify the return value of scanf() to ensure proper conversions.
verify that the values read into arr are in the range 0..RANGE.
verify that int index = count[ar[j]] - 1; never produces a negative number.
do the same for count[ar[j]]--;
verify that median = (float)(arr[index] + arr[index - 1]) / 2; is never evaluated with index == 0.
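The first two checks could look like the following sketch. It reads the values through fscanf so that it works on any FILE *. The function name read_checked_ints and its interface are hypothetical; for the question's data, the bounds would be 0 and RANGE.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read n integers from fp into a newly allocated array, rejecting any
   value outside lo..hi. Returns NULL (after printing a message) on failure. */
static int *read_checked_ints(FILE *fp, int n, int lo, int hi)
{
    if (n <= 0)
        return NULL;
    int *a = malloc(sizeof *a * (size_t)n);
    if (a == NULL) {
        perror("malloc");
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        if (fscanf(fp, "%d", &a[i]) != 1) {   /* expect exactly one conversion */
            fprintf(stderr, "input error at element %d\n", i);
            free(a);
            return NULL;
        }
        if (a[i] < lo || a[i] > hi) {
            fprintf(stderr, "element %d = %d is outside %d..%d\n", i, a[i], lo, hi);
            free(a);
            return NULL;
        }
    }
    return a;
}
```

In the question's main, this would replace the reading loop with something like int *arr = read_checked_ints(stdin, n, 0, RANGE); if (arr == NULL) return EXIT_FAILURE;.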

Your program has undefined behavior (at several occasions). You really should be scared (and you are not scared enough).
I'm really not able to understand what is causing this behavior in my program
With UB, that question is pointless. You need to dive into implementation details (e.g. study the generated machine code of your program, and the code of your C compiler and standard library) to understand anything more. You probably don't want to do that (it could take years of work).
Please read, as soon as possible, Chris Lattner's blog series What Every C Programmer Should Know About Undefined Behavior.
what I'm missing/overseeing.
You don't understand UB well enough. Be aware that a programming language is a specification (and you code against it), not a piece of software (e.g. your compiler). Program semantics matter.
As I said in comments:
compile with all warnings and debug info (gcc -Wall -Wextra -g with GCC)
improve your code to get no warnings; perhaps try also another compiler like Clang and work to also get no warnings from it (since different compilers give different warnings).
consider using some version control system like git to keep various variants of your code, and some build automation tool.
think more about your program and invariants inside it.
use the debugger (gdb), in particular with watchpoints, to understand the internal state of your process; and have several test cases to run under the debugger and without it.
use instrumentation facilities such as GCC's AddressSanitizer (-fsanitize=address) and tools like valgrind.
use rubber duck debugging methodology
sometimes consider static source code analysis tools (e.g. Frama-C). They require expertise to use, and/or give many false positives.
read more about programming (e.g. SICP) and about the C Programming Language. Download and study the C11 programming language specification n1570 (and be very careful about every mention of UB in it). Read carefully the documentation of every standard or external function you are using. Study also the documentation of your compiler and of other tools. Handle error and failure cases (e.g. calloc and scanf can fail).
Debugging is difficult (e.g. because of the Halting Problem, of Heisenbugs, etc.), but sometimes fun and challenging. You can spend weeks finding one single bug. And you often cannot understand the behavior of a buggy program without diving into implementation details (studying the machine code generated by the compiler, studying the code of the compiler).
PS. Your question shows a wrong mindset, which you should improve, and a misunderstanding of UB.

How to generate a very large non-singular matrix A in Ax = b?

I am solving the system of linear algebraic equations Ax = b using the Jacobi method, but with manually entered inputs. I want to analyze the performance of the solver for large systems. Is there a method to generate a matrix A that is non-singular?
I am attaching my code here.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#define TOL = 0.0001
void main()
{
int size,i,j,k = 0;
printf("\n enter the number of equations: ");
scanf("%d",&size);
double reci = 0.0;
double *x = (double *)malloc(size*sizeof(double));
double *x_old = (double *)malloc(size*sizeof(double));
double *b = (double *)malloc(size*sizeof(double));
double *coeffMat = (double *)malloc(size*size*sizeof(double));
printf("\n Enter the coefficient matrix: \n");
for(i = 0; i < size; i++)
{
for(j = 0; j < size; j++)
{
printf(" coeffMat[%d][%d] = ",i,j);
scanf("%lf",&coeffMat[i*size+j]);
printf("\n");
//coeffMat[i*size+j] = 1.0;
}
}
printf("\n Enter the b vector: \n");
for(i = 0; i < size; i++)
{
x[i] = 0.0;
printf(" b[%d] = ",i);
scanf("%lf",&b[i]);
}
double sum = 0.0;
while(k < size)
{
for(i = 0; i < size; i++)
{
x_old[i] = x[i];
}
for(i = 0; i < size; i++)
{
sum = 0.0;
for(j = 0; j < size; j++)
{
if(i != j)
{
sum += (coeffMat[i * size + j] * x_old[j] );
}
}
x[i] = (b[i] -sum) / coeffMat[i * size + i];
}
k = k+1;
}
printf("\n Solution is: ");
for(i = 0; i < size; i++)
{
printf(" x[%d] = %lf \n ",i,x[i]);
}
}

This is all a bit Heath Robinson, but here's what I've used. I have no idea how 'random' such matrices are; in particular, I don't know what distribution they follow.
The idea is to generate the SVD of the matrix (called A below, and assumed to be n x n).
Initialise A to all 0s.
Then generate n positive numbers and put them, with random signs, on the diagonal of A. I've found it useful to be able to control the ratio of the largest of these positive numbers to the smallest. This ratio will be the condition number of the matrix.
Then repeat n times: generate a random n-vector f, and multiply A on the left by the Householder reflector I - 2*f*f' / (f'*f). Note that this can be done more efficiently than by forming the reflector matrix and doing a normal multiplication; indeed, it's easy to write a routine that, given f and A, will update A in place.
Repeat the above but multiplying on the right.
As for generating test data, a simple way is to pick an x0 and then generate b = A * x0. Don't expect to get exactly x0 back from your solver; even if it is remarkably well behaved, you'll find that the errors get bigger as the condition number gets bigger.
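Building b from a chosen x0, and measuring how far the solver's answer lands from x0, only takes two small helpers. Here is a minimal sketch with the hypothetical names matvec and max_abs_error, again for a row-major matrix:

```c
#include <math.h>

/* b <- A x for a row-major n x n matrix A. */
static void matvec(const double *A, const double *x, double *b, int n)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i * n + j] * x[j];
        b[i] = s;
    }
}

/* Largest componentwise difference |x[i] - x0[i]| between the solver's
   answer x and the x0 that was used to build b. */
static double max_abs_error(const double *x, const double *x0, int n)
{
    double e = 0.0;
    for (int i = 0; i < n; i++) {
        double d = fabs(x[i] - x0[i]);
        if (d > e)
            e = d;
    }
    return e;
}
```

A typical run is matvec(A, x0, b, n), then solving for x, then printing max_abs_error(x, x0, n) for matrices of increasing condition number.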

Talonmies' comment mentions http://www.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-658.pdf which is probably the right approach (at least in principle, and in full generality).
However, you are probably not handling "very large" matrices (e.g. because your program uses naive algorithms, and because you don't run it on a large supercomputer with a lot of RAM). So the naive approach of generating a matrix with random coefficients and testing afterwards that it is non-singular is probably enough.
Very large matrices would have many billions of coefficients, and you would need a powerful supercomputer with e.g. terabytes of RAM. You probably don't have one. If you did, your program would probably run too long (it has no parallelism) and might give very wrong results (read http://floating-point-gui.de/ for more), so that case is not worth worrying about here.
A matrix of a million coefficients (e.g. 1024*1024) is considered small by current hardware standards. It is more than enough to test your code on current laptops or desktops, and even to test some parallel implementations. Randomly generating a few of them (and computing their determinant to check that they are not singular) is enough, and easily doable. You might even generate them and/or check their regularity with an external tool, e.g. scilab, R, octave, etc. Once your program has computed a solution x0, you could use some tool (or write another program) to compute Ax0 - b and check that it is very close to the 0 vector. In some cases you will be disappointed or surprised, since round-off errors matter.
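For the "generate randomly, then check" approach, a determinant can be computed with Gaussian elimination and partial pivoting, as in the sketch below. The helper determinant is a hypothetical name, not a library function. For large n the determinant easily overflows or underflows a double. A tiny value does not by itself mean the matrix is nearly singular either (0.1 times the identity is perfectly conditioned), so treat an exact 0.0 as the only hard signal here.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Determinant of a row-major n x n matrix via Gaussian elimination with
   partial pivoting. Works on a copy, so A is left unchanged.
   Returns NAN if the copy cannot be allocated. */
static double determinant(const double *A, int n)
{
    double *M = malloc(sizeof *M * (size_t)n * n);
    if (M == NULL)
        return NAN;
    memcpy(M, A, sizeof *M * (size_t)n * n);
    double det = 1.0;
    for (int k = 0; k < n; k++) {
        int p = k;                      /* pivot: largest |M[i][k]|, i >= k */
        for (int i = k + 1; i < n; i++)
            if (fabs(M[i * n + k]) > fabs(M[p * n + k]))
                p = i;
        if (M[p * n + k] == 0.0) {      /* exactly singular */
            free(M);
            return 0.0;
        }
        if (p != k) {                   /* a row swap flips the sign */
            for (int j = 0; j < n; j++) {
                double t = M[k * n + j];
                M[k * n + j] = M[p * n + j];
                M[p * n + j] = t;
            }
            det = -det;
        }
        det *= M[k * n + k];
        for (int i = k + 1; i < n; i++) {
            double m = M[i * n + k] / M[k * n + k];
            for (int j = k; j < n; j++)
                M[i * n + j] -= m * M[k * n + j];
        }
    }
    free(M);
    return det;
}
```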
You'll need a good enough pseudo-random number generator, perhaps as simple as drand48(3), which is considered nearly obsolete (you should find and use something better). You could seed it from some random source (e.g. /dev/urandom on Linux).
BTW, compile your code with all warnings & debug info (e.g. gcc -Wall -Wextra -g). Your #define TOL = 0.0001 is probably wrong (it should be #define TOL 0.0001 or const double tol = 0.0001;). Use the debugger (gdb) & valgrind. Add optimizations (e.g. -O2 -march=native) when benchmarking. Read the documentation of every function you use, notably those from <stdio.h>. Check the result count from scanf... In C you don't need to (and should not) cast the result of malloc, but you forgot to test it for failure, so code:
double *b = malloc(size * sizeof(double));
if (!b) { perror("malloc b"); exit(EXIT_FAILURE); }
You should end, rather than start, your printf format strings with \n, because stdout is often (not always!) line-buffered. See also fflush.
You should probably also read a basic linear algebra textbook...
Notice that actually writing robust and efficient programs to invert matrices or to solve linear systems is a difficult art (which I don't master at all). It involves programming issues, algorithmic issues, and mathematical issues; read some numerical analysis book. You can still get a PhD and spend your whole life working on that. Please understand that you need ten years to learn programming (or many other things).

Multiply each element of an array by a number in C

I'm trying to optimize some of my code in C, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
void main()
{
int i;
float data[] = {1.,2.,3.,4.,5.};
//this fails
data *= 5.0;
//this works
for(i = 0; i < 5; i++) data[i] *= 5.0;
}

There is no shortcut; you have to step through each element of the array.
Note however that in your example, you may achieve a speedup by using int rather than float for both your data and multiplier.

If you want to, you can do this through BLAS (Basic Linear Algebra Subprograms), which is optimised. It is not part of the C standard; it is a package which you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
int main(void) {
int limit = 10;
float *a = calloc(limit, sizeof(float));
if (a == NULL) {
perror("calloc");
return EXIT_FAILURE;
}
for (int i = 0; i < limit; i++) {
a[i] = i;
}
/* multiply every element by 0.5f (stride 1) */
cblas_sscal(limit, 0.5f, a, 1);
for (int i = 0; i < limit; i++) {
printf("%3f, ", a[i]);
}
printf("\n");
free(a);
return 0;
}
The names of the functions are not obvious, but after reading the guidelines you might start to guess what BLAS functions do. sscal() can be split into s for single precision and scal for scale, which means that this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector by a constant and add it to another, BLAS has a function for that too:
saxpy()
s a x p y
single precision: a*x plus y
y[i] += a*x[i]
As you might guess there is a daxpy() too which works on doubles.
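What the two routines compute can be written as plain loops. The sketch below shows them only as a reference for their semantics. The names sscal_ref and saxpy_ref are hypothetical: the real CBLAS entry points are cblas_sscal and cblas_saxpy, and an optimized BLAS will be much faster. The sketch only handles positive strides.

```c
/* sscal semantics: x[i*incx] = alpha * x[i*incx] for i = 0..n-1 */
static void sscal_ref(int n, float alpha, float *x, int incx)
{
    for (int i = 0; i < n; i++)
        x[i * incx] *= alpha;
}

/* saxpy semantics: y[i*incy] += alpha * x[i*incx] for i = 0..n-1 */
static void saxpy_ref(int n, float alpha, const float *x, int incx,
                      float *y, int incy)
{
    for (int i = 0; i < n; i++)
        y[i * incy] += alpha * x[i * incx];
}
```

The stride argument is why the BLAS call in the example above passes a trailing 1: with a stride of 2, only every second element would be touched.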

I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for so many more "shortcuts"; however, in C, you have to access each element and then manipulate those values.
Using the for-loop would be the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply each value by 5. Check out loop tiling, for example.
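If the goal is mainly a one-line call like Python's data *= 5.0, a small helper function wrapping the loop gets close. scale_floats below is a hypothetical name. With optimization enabled, compilers can often vectorize a simple loop like this one automatically.

```c
#include <stddef.h>

/* Multiply each of the n elements of a by k, in place. */
static void scale_floats(float *a, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}
```

With the array from the question: scale_floats(data, sizeof data / sizeof data[0], 5.0f);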

data *= 5.0;
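Here data is an array, which is not a modifiable lvalue, so it cannot be the target of *= (the operation is not defined for an array or pointer operand anyway).
If you want to multiply only the first value in that array, use the * operator as below.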
*data *= 5.0;

Converting from log to linear in C and taking average

I don't know C well at all, and I'm trying to edit someone's code, but I'm having issues when trying to convert values from the log to linear domains.
For example, let's say we have an array A that is full of log values equal to -100 dB, i.e.
float A[100];
int i;
for( i=0; i<100; i++ )
A[i] = -100;
What I want to do is find the average of all the values (which clearly is -100), but by taking the average in the linear and not log domain, i.e.
float tmp_avg = 0.0;
float avg;
int count = 0;
for( i=0; i<100; i++ ) {
tmp_avg += pow(10.0, A[i]/10.0);
count++;
}
avg = 10*log10(tmp_avg / count);
However, the result I'm getting is all 0's. Now the code I'm working on is much more complex than this, but I was wondering if there's anything obvious that I'm missing as to why this won't work.
One thought I had is that 10^(-100/10) is a very small value (1e-10), and perhaps too small to be accurately defined as a float. I've tried making it a double instead, but I still get a result of all 0's.
Thanks!

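Just figured out what the problem was: I needed to include the math.h header at the top of the program:
#include <math.h>
Without it, I believe there was no declaration for pow and log10, so the compiler fell back to an old C89-style implicit declaration returning int. The double results were then misinterpreted, which caused the result to be all 0's. Compiling with warnings enabled (e.g. gcc -Wall) reports such implicit declarations. Now I include math.h and everything seems to work fine.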

how to numerically integrate a variable that is being calculated in the program as a pointer (using e.g. trapezoidal rule) in c language

I have some code that I did not write.
In this complex code, many rules are applied to calculate a quantity, d(x). The code uses a pointer to calculate it.
I want to calculate an integral over this, like:
$W = \int_0^L d(x)\,dx$?
I am doing this:
#define DX 0.003
void WORK(double *d, double *W)
{
double INTE5=0.0;
int N_X_POINTS=333;
double h=((d[N_X_POINTS]-d[0])/N_X_POINTS);
W[0]=W[0]+((h/2)*(d[1]+2.0*d[0]+d[N_X_POINTS-1])); /*BC*/
for (i=1;i<N_X_POINTS-1;i++)
{
W[i]=W[i]+((h/2)*(d[0]+2*d[i]+d[N_X_POINTS]))*DX;
INTE5+=W[i];
}
W[N_X_POINTS-1]=W[N_X_POINTS-1]+((h/2)*(d[0]+2.0*d[N_X_POINTS-1]+d[N_X_POINTS-2])); /*BC*/
}
And I am getting a "Segmentation fault". Am I right to calculate W as a pointer, or should I declare it as a plain double? I guess the segmentation fault comes from this.
Another point: am I using the trapezoidal rule correctly?
Any help or tips would be very much appreciated.
Luiz

I don't know where that code comes from, but it is quite ugly and has some hard-coded limits (333 points and a step of 0.003). To use it you need to "sample" your function properly and generate pairs (x, f(x))...
A possibly clearer solution to your problem is here.
Let us consider your function and suppose it works (I believe it doesn't; it's really obscure code. For example, when you integrate a function you expect a number as the result: where is that number? Maybe INTE5? It is not returned... and if so, why the final update of the W array? It's useless, or maybe there is something meaningful in W?). How would you use it?
The prototype
void WORK(double *d, double *W);
means that WORK expects two pointers. What these pointers must point to depends on the code; a look at it suggests that you need two arrays. The code reads from and writes into W, and only reads from d. N_X_POINTS is 333, and W is used at indices 0 to N_X_POINTS - 1, but d is also read at index N_X_POINTS, so d needs at least 334 doubles and W at least 333:
double d[334]; /* d[N_X_POINTS] is read, so one extra element */
double W[333] = {0}; /* WORK adds to W, so it must start initialized */
Then you have to fill d properly. I first thought you need to fill the arrays with (x, f(x)) pairs, sampling the function with a suitable step, but that does not make much sense either. As already said, the code is obscure (and I don't want to try to reverse-engineer the coder's intention...).
Anyway, if you call it as WORK(d, W) with arrays like these, you shouldn't get a segfault, since the arrays are big enough. The result will be wrong, but that is harder to track down (again, sorry, no "reverse engineering" for it). Note also that i is not declared inside WORK, so the code as posted would not even compile.
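Final note (from comments too): if you have double a[N], then a has type double[N], but it decays to a double * (a pointer to its first element) when passed to a function, so it can be passed where a double * is expected.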

A segmentation fault often happens in C when you try to access some part of memory that you shouldn't be accessing. I suspect that the expression d[N_X_POINTS] is the culprit (arrays in C are zero-indexed, so an array of N_X_POINTS elements ends at d[N_X_POINTS - 1]), but without seeing the definition of d I can't be sure.
Try putting informative printf debugging statements before/after each line of code in your function so you can narrow down the possible sources of the problem.

Here's a simple program that integrates $f(x) = x^2$ over the range [0..10]. It should send you in the right direction.
#include <stdio.h>
#include <stdlib.h>
double int_trapezium(double f[], double dX, int n)
{
int i;
double sum;
sum = (f[0] + f[n-1])/2.0;
for (i = 1; i < n-1; i++)
sum += f[i];
return dX*sum;
}
#define N 1000
int main()
{
int i;
double x;
double from = 0.0;
double to = 10.0;
double dX = (to-from)/(N-1);
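double *f = malloc(N*sizeof(*f));
if (f == NULL)
{
perror("malloc");
return EXIT_FAILURE;
}
for (i=0; i<N; i++)
{
x = from + i*dX; /* dX already includes (to-from): don't multiply by it again */
f[i] = x*x;
}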
printf("%f\n", int_trapezium(f, dX, N));
free(f);
return 0;
}
