I'm still fairly new to C and I can't seem to understand this small bit of code.
void daxpy(int N, double alpha, double *x, double *y)
y=alpha*x+y
for (i=0, i<N, i++)
y[1]=alpha*x[1]+y[1];
I don't seem to know what the daxpy function is doing, or even its purpose. I know it's probably nothing very difficult; any help will be much appreciated. This was in my notes and I was just curious about what it was. I know the obvious things, like that daxpy is a function call, but I just need a short explanation of it.
I would think the actual code is like this:
void daxpy(int N, double alpha, double *x, double *y)
{
for (int i = 0; i < N; i++)
y[i]= alpha * x[i] + y[i];
}
This is because, looking at your code, the line y = alpha * x + y does not make sense on its own: x and y are pointers (arrays), so C will not multiply and add them element-wise like that.
Furthermore, the loop that follows is what actually implements the statement y = alpha * x + y. The index should be i rather than 1, because the loop runs from 0 to N-1; it does not make sense to put 1 there.
So the function basically adds, to every element of the array y, its corresponding value in x multiplied by a constant alpha.
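For example, here is a minimal sketch of how it might be used (the numbers are just made up for illustration):

    #include <stdio.h>

    void daxpy(int N, double alpha, double *x, double *y)
    {
        for (int i = 0; i < N; i++)
            y[i] = alpha * x[i] + y[i];
    }

    int main(void)
    {
        double x[] = {1.0, 2.0, 3.0};
        double y[] = {10.0, 20.0, 30.0};

        daxpy(3, 2.0, x, y);              /* y becomes {12, 24, 36} */

        for (int i = 0; i < 3; i++)
            printf("%f\n", y[i]);
        return 0;
    }

The name itself comes from BLAS: d for double precision and axpy for "a times x plus y".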
Issue
I am interested in parallelizing a problem using CUDA. The C code in question follows this simplified form:
int A, B, C; // 100 < A,B,C,D < 1,000
float *v1, *v2, *v3;
//v1,v2, v3 will have respective size A,B,C
//and will not be empty
float ***t1, ***t2, ***t3;
//t1,t2,t3 will eventually have the size (ci,cj,ck)
//and will not be empty
int i, j , k, l;
float xi, xj, xk;
for (i = 0; i < A; ++i){
    xi = ci - v1[i];
    for (j = 0; j < B; ++j){
        xj = (j*cj)*cos(j*M_PI/180);
        for (k = 0; k < C; ++k){
            xk = xj - v3[k];
            if (xk < xi){
                call_1(t1[i], v1, t2[i], &t3[i][j][k]);
            }
            else t3[i][j][k] = some_number;
        }
    }
}
Here call_1 is
void call_1 (float **w, float *x, float **y, float *z){
    int k, max = some_value;
    float *v; //initialize to have size max
    for (k = 0; k < max; ++k)
        call_2(x[k], y[k], max, &v[k]);
    call_2(y, v, max, z);
}
Here call_2 is
void call_2 (float *w, float *x, int y, double *z)
which simply contains operations such as bit shifting, multiplication, subtraction, and addition inside a single while loop.
Ideas attempted
So far, my idea is that the function call_1 may be transformed into kernel code, __global__ void call_1, and that call_2 may be turned into device code without modifying its contents. In particular, I can probably make __global__ void call_1 look like this:
double* v; //initialize to have size max
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int k=index; k<max; k += stride)
call_2 (x[k], y[k], max, &v[k]);
__syncthreads();
call_2 (y, v, max, z);
free (v);
I'm partly aware that the for loops can be removed by using a combination of threadIdx, blockIdx, and gridDim, but I'm specifically unsure how to do that here, especially since the problem contains a function call that itself makes another function call.
Well, there are two possible answers to that, and while I don't have the courage to research all of it for you, I'll still make this an answer since you seem to have been blatantly ignored. :/
First.
Recent CUDA APIs and NVIDIA architectures support function calls and even recursion in CUDA. I'm not exactly sure how it works, as I never used it myself, but you might want to research that. (Or do some Vulkan, since it looks like so much fun and also supports it.)
Might help you: https://devtalk.nvidia.com/default/topic/493567/cuda-programming-and-performance/calling-external-kernel-from-cuda/
And other stuff with related keywords. : D
On the other hand..
When resolving simple issues, especially if, like me, you would rather spend your time programming than doing research and learning some random API by heart, you can always go with more primitive solutions that use only the basics of the language you are working in.
In your case, I would simply inline the calls to the function to make a single CUDA kernel, since it seems pretty easy to do.
Yeah, it might involve some copy-pasting if there are multiple calls to the function... which doesn't really matter if it lets you quickly and efficiently solve a simple issue and move on to something more productive.
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int k=index; k<max; k += stride)
call_2 (x[k], y[k], max, &v[k]); // Insert call_2 code here instead.
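For instance, a minimal sketch of what such a single kernel might look like; the parameter list is adapted from the question, the per-element arithmetic is only a placeholder for call_2's real body, and y is assumed to already be a device array of device pointers:

    __global__ void call_1_kernel(float *x, float **y, int max, float *v)
    {
        int index  = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int k = index; k < max; k += stride)
        {
            // body of call_2 pasted (inlined) here; placeholder arithmetic only
            v[k] = x[k] * y[k][0];
        }
    }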
Another way to go about it, when you are confident your data is big enough that you will still see a good performance increase despite the cost of moving code and data between CPU/RAM and GPU, is to simply run multiple "waves" of CUDA kernel calls.
You let the first wave process while preparing the second one, which is then launched once the first wave has finished.
It's basically equivalent to other, smarter constructs offered by recent CUDA implementations, so you would probably find better approaches with a bit of research, but then again... it depends on your priorities.
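A rough sketch of that idea, assuming a hypothetical kernel process_wave and with the "preparation" step reduced to a trivial fill loop: kernel launches return immediately, so the CPU can fill the input for wave w+1 while the GPU is still busy with wave w, and the device-to-host copy at the end of each iteration acts as the synchronization point.

    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void process_wave(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];                        // placeholder work
    }

    void run_in_waves(int num_waves, int n)
    {
        size_t bytes = n * sizeof(float);
        float *h_in  = (float *)malloc(bytes);
        float *h_out = (float *)malloc(bytes);
        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);

        for (int i = 0; i < n; ++i) h_in[i] = (float)i;   // "prepare" wave 0

        for (int w = 0; w < num_waves; ++w)
        {
            cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);        // blocking copy
            process_wave<<<(n + 255) / 256, 256>>>(d_in, d_out, n);       // asynchronous launch
            if (w + 1 < num_waves)                                        // prepare the NEXT wave on the
                for (int i = 0; i < n; ++i) h_in[i] = (float)(w + 1 + i); // CPU while the GPU works
            cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);      // waits for the kernel, then copies
        }

        cudaFree(d_in); cudaFree(d_out);
        free(h_in); free(h_out);
    }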
But yeah, manually inlining functions is great*. : D
*mostly never, but it can be pretty handy
int fun (int n)
{
    int x = 1, k;
    if (n == 1) return x;
    for (k = 1; k < n; ++k)
        x = x + fun(k) * fun(n - k);
    return x;
}
The recurrence relation in the solution is given as $f(n) = 1 + \sum_{k=1}^{n-1} f(k) \cdot f(n-k)$, with $f(1) = 1$.
Solving this is easy; I got the answer to be 51.
My doubt is: why is the '1' outside the summation?
I understand that if this were asymptotic analysis we would treat it as constant time, because we only need the nature of the growth, but here we need an exact answer.
Since the line x = x + fun(k) * fun(n - k) is inside the loop, x also "loops", so why is it outside the summation?
Most running sums begin at 0. However, this one begins at 1. That constant value is the '1' you are asking about. It comes from the line
int x=1, k;
For each k in the for loop, x is replaced by an expression involving the previous iteration's value of x. So the variable x is not added again on each iteration; it is simply substituted.
Consider f(3). Initially x = 1.
For k = 1:
x = x + f(1)*f(2)        ... (1)
The loop runs again, with k = 2:
x = x + f(2)*f(1)        ... (2)
Substituting x from (1) into (2):
x = x + f(2)*f(1) + f(1)*f(2)
Carry out the same for every iteration and substitute x = 1 at the end, and you will get the generalized equation you were given.
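If you want to sanity-check it, a small program like the following (just a verification sketch) reproduces the recurrence; assuming the call in question was fun(5), it prints 51:

    #include <stdio.h>

    int fun(int n)
    {
        int x = 1, k;
        if (n == 1) return x;
        for (k = 1; k < n; ++k)
            x = x + fun(k) * fun(n - k);
        return x;
    }

    int main(void)
    {
        /* f(1)=1, f(2)=2, f(3)=5, f(4)=15, f(5)=51 */
        for (int n = 1; n <= 5; ++n)
            printf("fun(%d) = %d\n", n, fun(n));
        return 0;
    }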
I am implementing the Guo-Hall thinning algorithm for a microcontroller. The problem is that, due to its architecture, I cannot use OpenCV. I have the algorithm working fine except for one problem: in the following code, a struct is passed through the thinning iteration; the struct contains both the 2D array and a boolean recording whether or not the array was changed.
int* thinning(int* it, int x, int y)
{
    for (int i = 0; i < x*y; ++i)
        it[i] /= 255;

    struct IterRet base;
    base.i = it;
    base.b = false;

    do
    {
        base = thinningIteration(base, x, y, 0);
        base = thinningIteration(base, x, y, 1);
    }
    while (base.b);

    for (int i = 0; i < x*y; ++i)
        base.i[i] *= 255;

    return base.i;
}
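For context, the struct being passed around is essentially the following (a sketch; its exact definition is not shown here):

    #include <stdbool.h>

    struct IterRet {
        int *i;   /* flattened x*y image buffer                  */
        bool b;   /* true if the iteration changed any pixel     */
    };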
When I change the while condition to while(0), a single iteration passes and the matrix is properly returned.
When I leave the while loop as is, it goes on indefinitely.
I have narrowed the problem down to the fact that base is reset after each run of the do-while loop.
What would cause this? I can give more code if this is too narrow a view of it.
I ran your code as it is; it did not go on indefinitely, but ran through once and stopped. However, there are two places where I made a suggested change. It is really just a readability/style thing, not something that will change the behavior of your code in this case.
See commented and replacement lines below.
In thinningIteration()
struct IterRet thinningIteration(struct IterRet it, int x, int y, int iter)
{
//int* marker = malloc(x*y* sizeof *marker);
int* marker = malloc(x*y* sizeof(int));
In main()
//int* src = malloc( sizeof *src * x * y);
int* src = malloc( sizeof (int) * x * y);
Unfortunately, these edits do not address the main issue you asked about, but again, running the code did not exhibit the behavior you described.
If you can add more about the nature of the issue you observed, please leave a comment and, if I can, I will attempt to help.
I'm trying to optimize some of my code in C, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
void main()
{
    int i;
    float data[] = {1., 2., 3., 4., 5.};

    //this fails
    data *= 5.0;

    //this works
    for (i = 0; i < 5; i++) data[i] *= 5.0;
}
There is no shortcut; you have to step through each element of the array.
Note, however, that in your example you may achieve a speedup by using int rather than float for both your data and the multiplier.
If you want, you can do this through BLAS (Basic Linear Algebra Subprograms), which is optimized. BLAS is not part of the C standard; it is a package you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
int main () {
    int limit = 10;
    float *a = calloc(limit, sizeof(float));
    for (int i = 0; i < limit; i++){
        a[i] = i;
    }

    cblas_sscal(limit, 0.5f, a, 1);

    for (int i = 0; i < limit; i++){
        printf("%3f, ", a[i]);
    }
    printf("\n");
}
The names of the functions are not obvious, but by reading the naming guidelines you can start to guess what the BLAS functions do. sscal() can be split into s for single precision and scal for scale, which means this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector by a constant and add it to another, BLAS has a function for that too:
saxpy()
  s    -> single precision (float)
  axpy -> a*x plus y
i.e. y[i] += a * x[i]
As you might guess, there is a daxpy() too, which works on doubles.
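For example, a small sketch using the CBLAS interface (you have to compile and link against your BLAS installation):

    #include <stdio.h>
    #include <cblas.h>

    int main(void)
    {
        float x[] = {1.0f, 2.0f, 3.0f};
        float y[] = {10.0f, 20.0f, 30.0f};

        /* y = 2*x + y, i.e. y[i] += 2.0f * x[i] */
        cblas_saxpy(3, 2.0f, x, 1, y, 1);

        for (int i = 0; i < 3; i++)
            printf("%f, ", y[i]);     /* 12, 24, 36 */
        printf("\n");
        return 0;
    }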
I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for so many more "shortcuts"; however, in C, you have to access each element and then manipulate those values.
Using the for-loop would be the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply each value by 5. Check out loop tiling, for example.
data *= 5.0;
Here data is the name of the array; it decays to the array's address, which is not something you can assign to or multiply.
If you want to multiply the first value in that array, then use the * operator as below.
*data *= 5.0;
I have some code that was not written by me.
In this complex code, many rules are applied to calculate a quantity d(x); a pointer is used in the code to compute it.
I want to calculate an integral over it, like:
W = $\int_0^L d(x) dx$
I am doing this:
#define DX 0.003

void WORK(double *d, double *W)
{
    double INTE5 = 0.0;
    int N_X_POINTS = 333;
    double h = ((d[N_X_POINTS]-d[0])/N_X_POINTS);

    W[0] = W[0] + ((h/2)*(d[1]+2.0*d[0]+d[N_X_POINTS-1])); /*BC*/
    for (i = 1; i < N_X_POINTS-1; i++)
    {
        W[i] = W[i] + ((h/2)*(d[0]+2*d[i]+d[N_X_POINTS]))*DX;
        INTE5 += W[i];
    }
    W[N_X_POINTS-1] = W[N_X_POINTS-1] + ((h/2)*(d[0]+2.0*d[N_X_POINTS-1]+d[N_X_POINTS-2])); /*BC*/
}
And I am getting a segmentation fault. I was wondering whether I am right to compute W as a pointer (an array), or whether I should declare it as a simple double? I guess the segmentation fault comes from this.
Another point: am I using the trapezoidal rule correctly?
Any help or tip will be very much appreciated.
Luiz
I don't know where that code comes from, but it is quite ugly and has some limits hard-coded (333 points and an increment of 0.003). To use it you need to "sample" your function properly and generate pairs (x, f(x))...
A possible clearer solution to your problem is here.
Let us consider your function and suppose it works (I believe it doesn't; it is really obscure code. For example, when you integrate a function you expect a number as the result; where is this number? Maybe INTE5? It is not given back... and if it is, why the final update of the W array? It is useless, or maybe W holds something meaningful). How would you use it?
The prototype
void WORK(double *d, double *W);
means that WORK wants two pointers. What these pointers must point to depends on the code; a look at it suggests that you indeed need two arrays, with N_X_POINTS elements each. The code reads from and writes into the array W, and only reads from d. The int N_X_POINTS is 333, so you need to pass the function arrays of at least 333 doubles:
double d[333];
double W[333];
Then you have to fill them properly. I thought you needed to fill them with (x, f(x)) pairs, sampling the function with a proper step, but of course that does not make too much sense either. As already said, the code is obscure (and I don't want to try to reverse engineer the coder's intention...).
Anyway, if you call it as WORK(d, W), you won't get a segfault, since the arrays are big enough. The result will be wrong, but that is harder to track down (again, sorry, no "reverse engineering" of it).
Final note (also from the comments): if you have double a[N], then a decays to a double * when you pass it to a function.
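A tiny illustration of that note, with a hypothetical function takes_pointer() just to show the decay:

    #include <stdio.h>

    void takes_pointer(double *p)                   /* receives only a pointer        */
    {
        printf("inside callee: %zu\n", sizeof p);   /* sizeof(double *)               */
    }

    int main(void)
    {
        double a[333];
        printf("in caller: %zu\n", sizeof a);       /* 333 * sizeof(double)           */
        takes_pointer(a);                           /* a decays to double * here      */
        return 0;
    }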
A segmentation fault often happens in C when you try to access memory that you shouldn't be accessing. I suspect that the expression d[N_X_POINTS] is the culprit (arrays in C are zero-indexed, so the valid indices are 0 through N_X_POINTS-1), but without seeing the definition of d I can't be sure.
Try putting informative printf debugging statements before/after each line of code in your function so you can narrow down the possible sources of the problem.
Here's a simple program that integrates $f(x) = x^2$ over the range [0..10]. It should send you in the right direction.
#include <stdio.h>
#include <stdlib.h>
double int_trapezium(double f[], double dX, int n)
{
    int i;
    double sum;

    sum = (f[0] + f[n-1])/2.0;
    for (i = 1; i < n-1; i++)
        sum += f[i];
    return dX*sum;
}
#define N 1000
int main()
{
    int i;
    double x;
    double from = 0.0;
    double to = 10.0;
    double dX = (to-from)/(N-1);
    double *f = malloc(N*sizeof(*f));

    for (i = 0; i < N; i++)
    {
        x = from + i*dX;      /* i-th sample point in [from, to] */
        f[i] = x*x;
    }
    printf("%f\n", int_trapezium(f, dX, N));
    free(f);
    return 0;
}