How to build a matrix polynomial in Chapel - sparse-matrix

A wise man once began, I have this matrix A... (known here as "W")
use LinearAlgebra,
LayoutCS,
LinearAlgebra.Sparse;
var nv: int = 8,
D = {1..nv, 1..nv},
SD: sparse subdomain(D) dmapped CS(),
W: [SD] real;
SD += (1,2); W[1,2] = 1.0;
SD += (1,3); W[1,3] = 1.0;
SD += (1,4); W[1,4] = 1.0;
SD += (2,2); W[2,2] = 1.0;
SD += (2,4); W[2,4] = 1.0;
SD += (3,4); W[3,4] = 1.0;
SD += (4,5); W[4,5] = 1.0;
SD += (5,6); W[5,6] = 1.0;
SD += (6,7); W[6,7] = 1.0;
SD += (6,8); W[6,8] = 1.0;
SD += (7,8); W[7,8] = 1.0;
const a: real = 0.5;
I would like to build the polynomial
const P = aW + a^2W^2 + .. + a^kW^k
However, it appears that the function .dot can't be chained. Is there a clear way to do this without building the intermediate elements?

Here's one way to achieve this, but it could be improved by splitting the polynomial computation up such that each matrix power is only computed once:
use LinearAlgebra,
LayoutCS,
LinearAlgebra.Sparse;
var nv: int = 8,
D = {1..nv, 1..nv},
SD: sparse subdomain(D) dmapped CS(),
W: [SD] real;
SD += (1,2); W[1,2] = 1.0;
SD += (1,3); W[1,3] = 1.0;
SD += (1,4); W[1,4] = 1.0;
SD += (2,2); W[2,2] = 1.0;
SD += (2,4); W[2,4] = 1.0;
SD += (3,4); W[3,4] = 1.0;
SD += (4,5); W[4,5] = 1.0;
SD += (5,6); W[5,6] = 1.0;
SD += (6,7); W[6,7] = 1.0;
SD += (6,8); W[6,8] = 1.0;
SD += (7,8); W[7,8] = 1.0;
const a: real = 0.5;
const polynomial = dot(a, W).plus(dot(a**2, W.dot(W))).plus(dot(a**3, W.dot(W).dot(W)));
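The improvement mentioned above, computing each matrix power only once, can be sketched as a running product. The following is a minimal illustration in plain C on small dense row-major matrices, not Chapel and not the LinearAlgebra.Sparse API; the function names matmul and matrix_polynomial are assumptions made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* C = A * B for dense n x n row-major matrices. C must not alias A or B. */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int m = 0; m < n; m++)
                s += A[i * n + m] * B[m * n + j];
            C[i * n + j] = s;
        }
}

/* P = a*W + a^2*W^2 + ... + a^k*W^k for a dense n x n matrix W.
   Keeps a running term a^p * W^p, so each power costs one multiplication.
   Returns 0 on success, -1 if the scratch buffers cannot be allocated. */
int matrix_polynomial(int n, const double *W, double a, int k, double *P)
{
    assert(n > 0 && k >= 1);
    size_t count = (size_t)n * (size_t)n;
    double *term = malloc(count * sizeof *term);
    double *next = malloc(count * sizeof *next);
    if (term == NULL || next == NULL) {
        free(term);
        free(next);
        return -1;
    }

    for (size_t i = 0; i < count; i++) {
        term[i] = a * W[i];              /* term = a * W */
        P[i] = term[i];
    }
    for (int p = 2; p <= k; p++) {
        matmul(n, term, W, next);        /* next = a^(p-1) * W^p */
        for (size_t i = 0; i < count; i++) {
            term[i] = a * next[i];       /* term = a^p * W^p */
            P[i] += term[i];
        }
    }
    free(term);
    free(next);
    return 0;
}
```

The same loop structure should carry over to the Chapel version by keeping one running matrix and combining it with the dot and plus calls used in the answer, though that translation is not shown here.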


Create an array in main() instead in the function in a C program

I have the following code from Numerical Recipes in C, which calculates the incomplete beta function using a continued fraction and Lentz's method.
float betacf(float m1, float m2, float theta){
void nrerror(char error_text[]);
int k, k2, MAXIT;
float aa, c, d, del, t, qab, qam, qap;
qab = m1 + m2;
qap = m1 + 1.0;
qam = m1 - 1.0;
c = 1.0;
d = 1.0 - (qab * theta)/qap;
if (fabs(d) < FPMIN) d = FPMIN;
d = 1.0/d;
t = d;
for (k = 1; k <= MAXIT; k++) {
k2 = 2 * k;
aa = k * (m2 - k) * theta/((qam + k2) * (m1 + k2));
d = 1.0 + aa * d;
if (fabs(d) < FPMIN) d = FPMIN;
c = 1.0 + aa/c;
if (fabs(c) < FPMIN) c = FPMIN;
d = 1.0/d;
t *= d * c;
aa = -(m1 + k) * (qab + k) * theta/((m1 + k2) * (qap + k2));
d = 1.0 + aa * d;
if (fabs(d) < FPMIN) d = FPMIN;
c = 1.0 + aa/c;
if (fabs(c) < FPMIN) c=FPMIN;
d = 1.0/d;
del = d * c;
t *= del;
if (fabs(del - 1.0) < EPS) break;
}
if (k > MAXIT) nrerror("m1 or m2 too big, or MAXIT too small in betacf");
return t;
}
/* Returns the incomplete beta function Ix(a, b) */
float betai(float m1, float m2, float theta){
void nrerror(char error_text[]);
float bt;
if (theta < 0.0 || theta > 1.0){
nrerror("Bad x in routine betai");
}
if (theta == 0.0 || theta == 1.0){
bt = 0.0;
}
else {
bt = exp(gammaln(m1+m2)-gammaln(m1)-gammaln(m2)+m1*log(theta)+m2*log(1.0-theta));
}
if (theta < (m1 + 1.0)/(m1 + m2 + 2.0))
{
return (bt * betacf(m1, m2, theta)/m1);
}
else {
return (1.0 - bt * betacf(m2, m1, 1.0 - theta)/m2);
}
}
Then I wrote a main function that takes theta as input and prints the value of the incomplete beta function.
Now I need to obtain a distribution for theta in [0,1]. Is there a way to do this without changing anything in the code above, that is, by just adding a for loop over theta in my main function and collecting the output of the incomplete beta function? I tried this, but the compiler reports "Incompatible types, expected 'double' but argument is of type 'double *'". I understand the error arises because I try to get the output as an array, while my function is defined to work on a single value. Is there a workaround that does not require declaring theta as an array inside my function?
Failing main function:
int main() {
float *theta, *result;
.....
.....
printf("Enter number of points required to describe the PDF profile:", N);
scanf("%d", &N);
theta = (float *)malloc(N*sizeof(float));
for (j = 1; j < N; j++)
theta[j] = (float)(j)/ ((float)(N) - 1.0);
result[j] = betai(m1, m2, theta);
printf("%f %f", theta[j], result[j]);
}
}
Thank you
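One possible way to keep betai unchanged, sketched under assumptions: the reported message is consistent with the line result[j] = betai(m1, m2, theta); passing the whole array (a pointer) where one value is expected, so passing theta[j] inside a braced loop should be enough. In the sketch below, demo_fn is a hypothetical stand-in for betai so the example compiles on its own, and evaluate_on_unit_interval is a made-up helper name. It also allocates result and starts at j = 0, which the posted main does not do.

```c
#include <assert.h>
#include <stdlib.h>

/* Same shape as betai in the question: three scalars in, one scalar out. */
typedef float (*scalar_fn)(float m1, float m2, float theta);

/* Hypothetical stand-in so the sketch is self-contained;
   in the real program pass betai instead. */
float demo_fn(float m1, float m2, float theta)
{
    return m1 * theta + m2;
}

/* Evaluates f at n evenly spaced points on [0, 1].
   Returns a malloc'ed array of n results (the caller frees it), or NULL
   if allocation fails. If theta_out is not NULL it receives the n points. */
float *evaluate_on_unit_interval(scalar_fn f, float m1, float m2,
                                 int n, float *theta_out)
{
    assert(f != NULL && n >= 2);
    float *result = malloc((size_t)n * sizeof *result);
    if (result == NULL)
        return NULL;
    for (int j = 0; j < n; j++) {
        float theta = (float)j / (float)(n - 1);
        if (theta_out != NULL)
            theta_out[j] = theta;
        result[j] = f(m1, m2, theta);   /* one value per call, not the array */
    }
    return result;
}
```

In main, the call would then look like evaluate_on_unit_interval(betai, m1, m2, N, theta), followed by a loop that prints theta[j] and result[j].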

Math Equation not outputting result I want

The purpose of the program is to calculate the volume at each depth. The inputs are the radius and the length; in this test case they are 2.1 and 5.6 respectively. I keep getting 0, 1, 2, 3, and 4 for my volume, which is not the right volume. The depth/height is correct, so perhaps someone can shed light on what's wrong with my equation below?
This is the function that calculates the volume
int getVolume(double arrplotptr[][col], double *arr2ptr, char *nameptr)
{
double vol, h, diam, ctr, rad, len, x;
int i, j;
rad = arr2ptr[radius];
len = arr2ptr[length];
diam = (rad * 2);
ctr = diam / 100;
h = 0;
for (j = 0; j < 100; j++) {
h = h + ctr;
arrplotptr[0][j] = h;
}
h = 0;
for (i = 0; i < 100; i++) {
h = h + ctr;
x = (rad - h) / rad;
vol = ((rad * rad) * acos(x) - (rad - h) * (sqrt((2 * rad * h) - (h * h)))) * len;
arrplotptr[1][i] = vol;
}
}
I see several issues in your code:
Why do you use ctr = diam / 100; instead of ctr = rad / 100;?
You do not return a value from getVolume. If the calling function relies on the return value, you invoke undefined behavior.
You store the volume of each slice but do not compute the total volume. You did not post the code that does that, maybe there are problems there too.
As written by chqrlie, I think you should replace
ctr = diam / 100;
with
ctr = rad / 100;
And, as written by EOF, the function is declared as "int" but returns no value; you should either redefine it as "void" or return an integer value.
I would add that two separate loops do not seem necessary: in each iteration you can calculate "h", "x" and "vol" and store both values in "arrplotptr".
I propose to simplify the function as follows:
void getVolume (double arrplotptr[][col], double arr2ptr[])
{
double const rad = arr2ptr[radius];
double const len = arr2ptr[length];
double const ctr = rad / 100;
int i;
double h;
for ( i = 0, h = ctr ; i < 100 ; ++i, h+=ctr )
{
arrplotptr[0][i] = h;
arrplotptr[1][i] = ((rad * rad) * acos((rad - h) / rad)
- (rad - h) * (sqrt((2 * rad * h) - (h * h)))) * len;
}
}

Large 2d array in C

I am trying to solve a linear optimization problem with 4 variables and 600000 constraints.
I need to generate a large input, so I need A[600000][4] for the constraint coefficients and b[600000] for the right-hand side. Here is the code that generates the 600000 constraints.
int i, j;
int numberOfInequalities = 600000;
double c[4];
double result[4];
double A[numberOfInequalities][4], b[numberOfInequalities];
printf("\nPreparing test: 4 variables, 600000 inequalities\n");
A[0][0] = 1.0; A[0][1] = 2.0; A[0][2] = 1.0; A[0][3] = 0.0; b[0] = 10000.0;
A[1][0] = 0.0; A[1][1] = 1.0; A[1][2] = 2.0; A[1][3] = 1.0; b[1] = 10000.0;
A[2][0] = 1.0; A[2][1] = 0.0; A[2][2] = 1.0; A[2][3] = 3.0; b[2] = 10000.0;
A[3][0] = 4.0; A[3][1] = 0.0; A[3][2] = 1.0; A[3][3] = 1.0; b[3] = 10000.0;
c[0]=1.0; c[1]=1.0; c[2]=1.0; c[3]=1.0;
for( i=4; i< 100000; i++ )
{
A[i][0] = (12123*i)%104729;
A[i][1] = (47*i)%104729;
A[i][2] = (2011*i)%104729;
A[i][3] = (7919*i)%104729;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1 + (i%137);
}
A[100000][0] = 0.0; A[100000][1] = 6.0; A[100000][2] = 1.0;
A[100000][3] = 1.0; b[100000] = 19.0;
for( i=100001; i< 200000; i++ )
{
A[i][0] = (2323*i)%101111;
A[i][1] = (74*i)%101111;
A[i][2] = (2017*i)%101111;
A[i][3] = (7915*i)%101111;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%89);
}
A[200000][0] = 5.0; A[200000][1] = 2.0; A[200000][2] = 0.0;
A[200000][3] = 1.0; b[200000] = 13.0;
for( i=200001; i< 300000; i++ )
{
A[i][0] = (23123*i)%100003;
A[i][1] = (47*i)%100003;
A[i][2] = (2011*i)%100003;
A[i][3] = (7919*i)%100003;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%57);
}
A[300000][0] = 1.0; A[300000][1] = 2.0; A[300000][2] = 1.0;
A[300000][3] = 3.0; b[300000] = 20.0;
A[300001][0] = 1.0; A[300001][1] = 0.0; A[300001][2] = 5.0;
A[300001][3] = 4.0; b[300001] = 32.0;
A[300002][0] = 7.0; A[300002][1] = 1.0; A[300002][2] = 1.0;
A[300002][3] = 7.0; b[300002] = 40.0;
for( i=300003; i< 400000; i++ )
{
A[i][0] = (13*i)%103087;
A[i][1] = (99*i)%103087;
A[i][2] = (2012*i)%103087;
A[i][3] = (666*i)%103087;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1;
}
for( i=400000; i< 500000; i++ )
{
A[i][0] = 1;
A[i][1] = (17*i)%999983;
A[i][2] = (1967*i)%444443;
A[i][3] = 2;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + (1000000.0/(double)i);
}
for( i=500000; i< 600000; i++ )
{
A[i][0] = (3*i)%111121;
A[i][1] = (2*i)%999199;
A[i][2] = (2*i)%444443;
A[i][3] = i;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1.3;
}
The problem is that it can't create such a large array: the program just terminates at run time, BUT it works fine if I create no more than 200000 constraints.
I've tried to increase stack size to unlimited value, but it didn't help.
I've tried to use pointers like **A, but I get incorrect result in output.
P.S.
I use Ubuntu.
Any ideas?
If numberOfInequalities is a compile-time constant, you could make it a #define and define A and b as global variables or static local variables:
#define numberOfInequalities 600000
static double A[numberOfInequalities][4], b[numberOfInequalities];
This will move these arrays from the 'stack' to the 'bss' segment.
A better solution is to allocate these arrays with malloc:
double (*A)[4] = malloc(numberOfInequalities * 4 * sizeof(double));
double *b = malloc(numberOfInequalities * sizeof(double));
This will cause these arrays to be allocated from the 'heap' memory.
Don't forget to free them before returning to the caller.
See http://www.geeksforgeeks.org/memory-layout-of-c-program/ for a brief explanation how memory is arranged in a typical C program
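The malloc approach above can be wrapped with error checking as follows. This is a minimal sketch; the helper name alloc_constraints and the macro NUM_COLS are assumptions introduced for the example.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_COLS 4

/* Allocates rows x 4 coefficients and rows right-hand sides on the heap.
   Returns 0 on success. On failure returns -1 and sets both outputs to NULL. */
int alloc_constraints(size_t rows, double (**A_out)[NUM_COLS], double **b_out)
{
    assert(A_out != NULL && b_out != NULL);
    /* sizeof *A is the size of one row, that is 4 * sizeof(double) */
    double (*A)[NUM_COLS] = malloc(rows * sizeof *A);
    double *b = malloc(rows * sizeof *b);
    if (A == NULL || b == NULL) {
        free(A);
        free(b);
        *A_out = NULL;
        *b_out = NULL;
        return -1;
    }
    *A_out = A;
    *b_out = b;
    return 0;
}
```

Because A is a pointer to rows of 4 doubles, the existing A[i][j] and b[i] indexing in the generator code keeps working unchanged. Both pointers are released with free(A) and free(b).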

How can I optimize my AVX implementation of dot product?

I've tried to implement the dot product of these two arrays using AVX, following https://stackoverflow.com/a/10459028, but my code is very slow.
A and xb are arrays of doubles, and n is an even number. Can you help me?
const int mask = 0x31;
int sum =0;
for (int i = 0; i < n; i++)
{
int ind = i;
if (i + 8 > n) // padding
{
sum += A[ind] * xb[i].x;
i++;
ind = n * j + i;
sum += A[ind] * xb[i].x;
continue;
}
__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d x = _mm256_loadu_pd(&A[ind]);
__m256d y = _mm256_load_pd(ar);
i+=4; ind = n * j + i;
__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d z = _mm256_loadu_pd(&A[ind]);
__m256d w = _mm256_load_pd(arr);
__m256d xy = _mm256_mul_pd(x, y);
__m256d zw = _mm256_mul_pd(z, w);
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
i += 3;
}
There are two big inefficiencies in your loop that are immediately apparent:
(1) these two chunks of scalar code:
__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d y = _mm256_load_pd(ar);
and
__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
...
__m256d w = _mm256_load_pd(arr);
should be implemented using SIMD loads and shuffles (or at the very least use _mm256_set_pd and give the compiler a chance to do a half-reasonable job of generating code for a gathered load).
(2) the horizontal summation at the end of the loop:
for (int i = 0; i < n; i++)
{
...
__m256d xy = _mm256_mul_pd(x, y);
__m256d zw = _mm256_mul_pd(z, w);
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
i += 3;
}
should be moved out of the loop:
__m256d xy = _mm256_setzero_pd();
__m256d zw = _mm256_setzero_pd();
...
for (int i = 0; i < n; i++)
{
...
xy = _mm256_add_pd(xy, _mm256_mul_pd(x, y));
zw = _mm256_add_pd(zw, _mm256_mul_pd(z, w));
i += 3;
}
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
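The structure of point (2), accumulate inside the loop and reduce once afterwards, can also be shown without intrinsics. The sketch below is a plain C analogue, not the AVX code itself: it keeps four independent partial sums, which plays the role of the vector accumulators xy and zw. It assumes both inputs are contiguous arrays of double (unlike the xb[i].x field access in the question), and the name dot_product is made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent partial sums that are combined once,
   after the main loop, instead of doing a reduction in every iteration. */
double dot_product(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {          /* main loop: only multiply and add */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    double sum = (s0 + s1) + (s2 + s3);   /* single reduction after the loop */

    for (; i < n; i++)                    /* leftover 0 to 3 elements */
        sum += a[i] * b[i];
    return sum;
}
```

Note that the accumulator is a double here. The question declares sum as int, which would truncate every partial result added to it.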

C - Segmentation fault in professor's test code

My professor sent out test code to run on our program. However, the test code itself crashes with a segmentation fault when I run it. The error happens on the first printf; if that line is commented out, it just occurs on the next line. It sounds like the code works fine for him, so I'm trying to figure out why it's failing for me. I know he's using C while I'm using C++, but even when I compile the test code with gcc instead of g++ it still fails. Does anyone know why I might be having problems? Thanks! The code is below.
#include <stdio.h>
main()
{ double A[400000][4], b[400000], c[4] ;
double result[4];
int i, j; double s, t;
printf("Preparing test: 4 variables, 400000 inequalities\n");
A[0][0] = 1.0; A[0][1] = 2.0; A[0][2] = 1.0; A[0][3] = 0.0; b[0] = 10000.0;
A[1][0] = 0.0; A[1][1] = 1.0; A[1][2] = 2.0; A[1][3] = 1.0; b[0] = 10000.0;
A[2][0] = 1.0; A[2][1] = 0.0; A[2][2] = 1.0; A[2][3] = 3.0; b[0] = 10000.0;
A[3][0] = 4.0; A[3][1] = 0.0; A[3][2] = 1.0; A[3][3] = 1.0; b[0] = 10000.0;
c[0]=1.0; c[1]=1.0; c[2]=1.0; c[3]=1.0;
for( i=4; i< 100000; i++ )
{ A[i][0] = (12123*i)%104729;
A[i][1] = (47*i)%104729;
A[i][2] = (2011*i)%104729;
A[i][3] = (7919*i)%104729;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1 + (i%137);
}
A[100000][0] = 0.0; A[100000][1] = 6.0; A[100000][2] = 1.0;
A[100000][3] = 1.0; b[100000] = 19.0;
for( i=100001; i< 200000; i++ )
{ A[i][0] = (2323*i)%101111;
A[i][1] = (74*i)%101111;
A[i][2] = (2017*i)%101111;
A[i][3] = (7915*i)%101111;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%89);
}
A[200000][0] = 5.0; A[200000][1] = 2.0; A[200000][2] = 0.0;
A[200000][3] = 1.0; b[200000] = 11.0;
for( i=200001; i< 300000; i++ )
{ A[i][0] = (23123*i)%100003;
A[i][1] = (47*i)%100003;
A[i][2] = (2011*i)%100003;
A[i][3] = (7919*i)%100003;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 2 + (i%57);
}
A[300000][0] = 1.0; A[300000][1] = 2.0; A[300000][2] = 1.0;
A[300000][3] = 3.0; b[300000] = 20.0;
A[300001][0] = 1.0; A[300001][1] = 0.0; A[300001][2] = 5.0;
A[300001][3] = 4.0; b[300001] = 32.0;
A[300002][0] = 7.0; A[300002][1] = 1.0; A[300002][2] = 1.0;
A[300002][3] = 7.0; b[300002] = 40.0;
for( i=300003; i< 400000; i++ )
{ A[i][0] = (13*i)%103087;
A[i][1] = (99*i)%103087;
A[i][2] = (2012*i)%103087;
A[i][3] = (666*i)%103087;
b[i] = A[i][0] + 2*A[i][1] + 3*A[i][2] + 4* A[i][3] + 1;
}
printf("Running test: 400000 inequalities, 4 variables\n");
//j = rand_lp(40, &(A[0][0]), &(b[0]), &(c[0]), &(result[0]));
printf("Test: extremal point (%f, %f, %f, %f) after %d recomputation steps\n",
result[0], result[1], result[2], result[3], j);
printf("Answer should be (1,2,3,4)\n End Test\n");
}
Try to change:
double A[400000][4], b[400000], c[4] ;
to
static double A[400000][4], b[400000], c[4] ;
Your A array has automatic storage duration, which on your system probably means it is stored on the stack. The stack limit for your process is likely smaller than the array, so you encountered a stack overflow.
On Linux, you can run the ulimit command:
$ ulimit -s
8192
$
to see the stack size in kB allocated for a process. For example, 8192 kB on my machine.
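To compare that limit with what the test program asks for, the array sizes can be added up. This is a minimal sketch; the function names test_arrays_bytes and fits_in_stack are assumptions, and the limit is passed in by hand rather than queried from the system.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for double A[rows][4], b[rows], c[4] and result[4]. */
size_t test_arrays_bytes(size_t rows)
{
    return rows * 4 * sizeof(double)   /* A */
         + rows * sizeof(double)       /* b */
         + 4 * sizeof(double)          /* c */
         + 4 * sizeof(double);         /* result */
}

/* Returns 1 if the arrays fit within a stack limit given in kB
   (the unit printed by ulimit -s), 0 otherwise. Other stack usage
   such as call frames is ignored, so this is an optimistic check. */
int fits_in_stack(size_t rows, size_t stack_limit_kb)
{
    assert(stack_limit_kb > 0);
    return test_arrays_bytes(rows) <= stack_limit_kb * 1024;
}
```

With 8-byte doubles, the 400000-row arrays need about 16 million bytes, roughly twice the 8192 kB limit shown above, so fits_in_stack(400000, 8192) returns 0.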
You have overflowed the limits of the stack. Your prof declares 15MB of data in main's stack frame. That's just too big.
Since the lifetime of an object declared at the top of main is essentially the entire program, just declare the objects as static. That way they'll be in the (relatively limitless) data segment, and have nearly the same lifetime.
Try changing this line:
double A[400000][4], b[400000], c[4] ;
to this:
static double A[400000][4], b[400000], c[4] ;
