I'm generally an R user, but I am trying to use C for some lower-level cumulative sums and multiplications.
I am trying to generate a cumulative sum of eta and store the result in tmp0. However, when I output tmp0 it gives me Inf, NaN, or some arbitrarily large number. I double-checked the same cumulative sum in R and it works fine; I am not sure why C is not handling it. Below is the code that I am using:
int i,j;
const int p = ncov, n = nin;
double accNum0[n]; //accumulate first part of likelihood sum eta_i
double accNum1[n]; //accumulate the backwards numerator
double accNum2[n]; //acumulate the forward numerator (weighted)
double tmp0 = 0;
double eta[n]; //calculate linear predictor in this step (X %*% beta)
for(i = 0; i < n; i++) {
for (j = 0; j < p; j++)
eta[i] += b[j] * x[n * j + i];
}
for (i = 0; i < n; ++i) {
tmp0 += eta[i];
}
return (tmp0);
Again, I am fairly new to C so I may be making some rookie mistakes and would greatly appreciate any (and all) suggestions!
There might be errors with how you are initializing b or x. However, one definite error is that eta is being used uninitialized. This means eta[i] may begin with some arbitrary value instead of 0 as you are likely expecting.
Add an initialization before accumulating into it.
for(i = 0; i < n; i++) {
eta[i] = 0;
for (j = 0; j < p; j++)
eta[i] += b[j] * x[n * j + i];
}
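As a minimal sketch of the same fix in a self-contained form: the function below accumulates each eta value in a local variable that starts at zero, so no uninitialized memory is ever read. The function name sum_linear_predictor and the column-major layout of x (as in R matrices) are assumptions for illustration, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the linear predictor eta = X %*% beta over all n observations.
 * x is assumed column-major (as R stores matrices): element (i, j) is x[n * j + i]. */
double sum_linear_predictor(const double *x, const double *b, size_t n, size_t p)
{
    assert(x != NULL && b != NULL);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double eta_i = 0.0;            /* each accumulator explicitly starts at zero */
        for (size_t j = 0; j < p; j++)
            eta_i += b[j] * x[n * j + i];
        total += eta_i;
    }
    return total;
}
```

Because the accumulator is a local scalar, this version also avoids the variable-length array eta[n] on the stack.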
I am trying to perform a simple operation. I have a matrix of size A x B. I have a list of indices of length C, and I want to make a C x B matrix by collecting rows from the first matrix according to the indices, i.e. index i tells me which row of the first matrix goes into row i of the second matrix.
I presorted the indices so the algorithm is input stationary: I load in the row from the A x B matrix and write that row to all the rows in the C x B matrix.
The code looks something like this:
for(int i = 0;i < A; i ++)
{
for(int k = offsets[i]; k < offsets[i+1]; k ++)
{
int dest = index1[k];
for(int j = 0;j < C/ 8; j++)
{
__m256 a = _mm256_load_ps(&input[i * C + j * 8]);
_mm256_store_ps(&output[dest * C + j * 8] ,a);
}
}
}
The code is entirely bottlenecked by write to memory.
This code is efficient when C is small. However, it scales very poorly as C increases, which I surmise is due to cache behavior. (It takes 10x as long with C = 1024 as with C = 256.)
I tried blocking in the C dimension:
for(int c = 0; c < C; c+= K){
for(int i = 0;i < A; i ++)
{
for(int k = offsets[i]; k < offsets[i+1]; k ++)
{
int dest = index1[k];
for(int j = 0;j < C/ 8 / K; j++)
{
__m256 a = _mm256_load_ps(&input[i * C + c + j * 8]);
_mm256_store_ps(&output[dest * C + c + j * 8] ,a);
}
}
}
}
This actually slows down the code more.
Any suggestions?
The inner loop appears to be a plain streaming copy, so cache behavior shouldn't matter much in this case. Try a simple memcpy() instead; hopefully the compiler can then produce better code for the copy.
//for(int j = 0;j < C/ 8; j++)
//{
// __m256 a = _mm256_load_ps(&input[i * C + j * 8]);
// _mm256_store_ps(&output[dest * C + j * 8] ,a);
//}
memcpy(&output[dest * C], &input[i * C], C * sizeof(float));
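For reference, the whole gather loop with memcpy() could be sketched as a standalone function like the one below. The function name gather_rows and the parameter names n_src_rows and width are hypothetical; offsets and index1 are assumed to have the same meaning as in the question (source row i is copied to every destination row index1[k] for k from offsets[i] up to, but not including, offsets[i+1]).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy each source row i to every destination row listed in
 * index1[offsets[i] .. offsets[i+1]-1]. Each row holds `width` floats. */
void gather_rows(float *output, const float *input,
                 const int *offsets, const int *index1,
                 size_t n_src_rows, size_t width)
{
    assert(output != NULL && input != NULL && offsets != NULL && index1 != NULL);
    for (size_t i = 0; i < n_src_rows; i++) {
        const float *src = &input[i * width];
        for (int k = offsets[i]; k < offsets[i + 1]; k++) {
            size_t dest = (size_t)index1[k];
            memcpy(&output[dest * width], src, width * sizeof(float));
        }
    }
}
```

Unlike the intrinsics version, this sketch does not require the row width to be a multiple of 8 or the buffers to be 32-byte aligned.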
Appendix
If that still does not give satisfactory results, as a last resort switch to C++ and replace the outer loop with parallel_for(). That may make the cache (or perhaps the pipeline) work a little better.
parallel_for(0, A, [&](const int i) {
for(int k = offsets[i]; k < offsets[i+1]; k++)
{
int dest = index1[k];
memcpy(&output[dest * C], &input[i * C], C * sizeof(float));
}
});
I need to write a function to get a curve fit of a dataset. The code below is what I have. It attempts to use gradient descent to find polynomial coefficients which best fit the data.
//solves for y using the form y = a + bx + cx^2 ...
double calc_polynomial(int degree, double x, double* coeffs) {
double y = 0;
for (int i = 0; i <= degree; i++)
y += coeffs[i] * pow(x, i);
return y;
}
//find polynomial fit
//returns an array of coefficients degree + 1 long
double* poly_fit(double* x, double* y, int count, int degree, double learningRate, int iterations) {
double* coeffs = malloc(sizeof(double) * (degree + 1));
double* sums = malloc(sizeof(double) * (degree + 1));
for (int i = 0; i <= degree; i++)
coeffs[i] = 0;
for (int i = 0; i < iterations; i++) {
//reset sums each iteration
for (int j = 0; j <= degree; j++)
sums[j] = 0;
//update weights
for (int j = 0; j < count; j++) {
double error = calc_polynomial(degree, x[j], coeffs) - y[j];
//update sums
for (int k = 0; k <= degree; k++)
sums[k] += error * pow(x[j], k);
}
//subtract sums
for (int j = 0; j <= degree; j++)
coeffs[j] -= sums[j] * learningRate;
}
free(sums);
return coeffs;
}
And my testing code:
double x[] = { 0, 1, 2, 3, 4 };
double y[] = { 5, 3, 2, 3, 5 };
int size = sizeof(x) / sizeof(*x);
int degree = 1;
double* coeffs = poly_fit(x, y, size, degree, 0.01, 1000);
for (int i = 0; i <= degree; i++)
printf("%lf\n", coeffs[i]);
The code above works when degree = 1, but anything higher causes the coefficients to come back as nan.
I've also tried replacing
coeffs[j] -= sums[j] * learningRate;
with
coeffs[j] -= (1/count) * sums[j] * learningRate;
but then I get back 0s instead of nan.
Anyone know what I'm doing wrong?
I tried degree = 2 with iterations = 10 and got results other than nan (values of a few thousand). After that, each additional iteration seems to make the magnitude of the results about 3 times larger.
From this observation, I guessed that the results are being multiplied by count.
In the expression
coeffs[j] -= (1/count) * sums[j] * learningRate;
Both 1 and count are integers, so 1/count is evaluated with integer division and becomes zero whenever count is larger than 1.
Instead, you can divide the result of the multiplication by count.
coeffs[j] -= sums[j] * learningRate / count;
Another way is using 1.0 (double value) instead of 1.
coeffs[j] -= (1.0/count) * sums[j] * learningRate;
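The difference between the two forms can be seen in isolation with a small sketch. The helper names scaled_step_int and scaled_step_double are made up for this illustration and do not appear in the original code.

```c
#include <assert.h>

/* Step size scaled by the sample count, written with integer division. */
double scaled_step_int(int count, double learning_rate)
{
    assert(count > 0);
    return (1 / count) * learning_rate;   /* 1 / count is 0 for any count > 1 */
}

/* Same intent, but the division happens in double precision. */
double scaled_step_double(int count, double learning_rate)
{
    assert(count > 0);
    return learning_rate / count;         /* count is converted to double first */
}
```

With count = 5, the first function returns 0, which is why the coefficients never moved away from their initial zeros, while the second returns a small positive step.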
Aside:
A candidate NaN source is adding oppositely signed values that have both overflowed to infinity (inf + -inf yields NaN). Given that the OP is using pow(x, k), which grows rapidly, other techniques can help.
Consider a chained multiplication rather than pow(). The result is usually more numerically stable. calc_polynomial() for example:
double calc_polynomial(int degree, double x, double* coeffs) {
double y = 0;
// for (int i = 0; i <= degree; i++)
for (int i = degree; i >= 0; i--) {
//y += coeffs[i] * pow(x, i);
y = y*x + coeffs[i];
}
return y;
}
Similar code could be used for the main() body.
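One possible reading of that remark, combined with the division by count discussed earlier, is sketched below: the powers of x[j] in the fitting loop are built with a running product instead of pow(). The names poly_fit_avg and eval_poly are hypothetical, and this is only a sketch of the idea, not a tuned implementation; convergence still depends on the learning rate and the scale of the data.

```c
#include <assert.h>
#include <stdlib.h>

/* Horner evaluation of coeffs[0] + coeffs[1]*x + ... + coeffs[degree]*x^degree. */
static double eval_poly(int degree, double x, const double *coeffs)
{
    double y = 0.0;
    for (int i = degree; i >= 0; i--)
        y = y * x + coeffs[i];
    return y;
}

/* Gradient-descent fit. Returns degree + 1 malloc'd coefficients (caller frees),
 * or NULL if allocation fails. */
double *poly_fit_avg(const double *x, const double *y, int count, int degree,
                     double learning_rate, int iterations)
{
    assert(count > 0 && degree >= 0);
    double *coeffs = malloc(((size_t)degree + 1) * sizeof *coeffs);
    double *sums = malloc(((size_t)degree + 1) * sizeof *sums);
    if (coeffs == NULL || sums == NULL) {
        free(coeffs);
        free(sums);
        return NULL;
    }
    for (int k = 0; k <= degree; k++)
        coeffs[k] = 0.0;
    for (int it = 0; it < iterations; it++) {
        for (int k = 0; k <= degree; k++)
            sums[k] = 0.0;
        for (int j = 0; j < count; j++) {
            double error = eval_poly(degree, x[j], coeffs) - y[j];
            double xk = 1.0;                 /* running power x[j]^k */
            for (int k = 0; k <= degree; k++) {
                sums[k] += error * xk;
                xk *= x[j];
            }
        }
        for (int k = 0; k <= degree; k++)
            coeffs[k] -= sums[k] * learning_rate / count;  /* averaged gradient */
    }
    free(sums);
    return coeffs;
}
```

Averaging the gradient keeps the effective step size independent of the number of samples, which is what the (1/count) factor in the question was meant to achieve.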
I'm trying to multiply two dynamic matrices by passing them through a function. I'm getting a segmentation fault during the multiplication.
The matrices are being passed through a function. The items in the arguments are correct because I had to use them for a different operation in this project. I have a feeling that I messed up with the pointers, but I'm pretty new to C and I'm not sure where I messed up.
double** multiplyMatrices(
double** a,
const uint32_t a_rows,
const uint32_t a_cols,
double** b,
const uint32_t b_cols){
uint32_t i = 0;
uint32_t j = 0;
uint32_t k = 0;
double** c;
//allocate memory to matrix c
c = (double **)malloc(sizeof(double *) * a_rows);
for (i = 0; i < a_rows; i++) {
*(c +i) = (double *)malloc(sizeof(double) * b_cols);
}
//clear matrix c
for(i = 0; i < a_rows; i++){
for(j = 0; j < a_cols; j++){
*c[j] = 0;
}
}
i = 0;
//multiplication
while(j = 0, i < a_rows ){
while(k = 0, j < b_cols){
while(k < a_cols){
//following line is where i'm getting the segmentation fault
*(*(c+(i*b_cols))+j) += (*(*(a+(i*a_cols))+k)) * (*(*(b+(k*b_cols))+j));
k++;
}
j++;
}
i++;
}
return c;
}
The obvious mistake is that you dereference c + i * b_cols while c is an array of pointers of size a_rows. So likely c + i * b_cols is outside of the area that you previously allocated with malloc().
I would suggest to simplify the matrix representation using a single array of double with the size equal to the total number of elements, i.e. rows * cols.
For example:
double *c;
c = malloc(sizeof(double) * a_rows * b_cols);
This not only has better overall performance, but simplifies the code. You would then have to "linearise" the offset inside your unidimensional array to convert from bi-dimensional matrix coordinates. For example:
c[i * b_cols + j] = ...
Of course, the other two matrices need to be allocated, filled and accessed in a similar manner.
For code clarity, I would also replace the while statements by for statements with the actual variable that they loop on. For example:
for (i = 0; i < a_rows; i++)
for (j = 0; j < b_cols; j++)
for (k = 0; k < a_cols; k++)
You can (ab)use the C language in many ways, but the trick is to make it more clear for you in the first place.
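Putting the two suggestions together (flat storage and plain for loops), the multiplication might look like the sketch below. It assumes all three matrices are stored row-major in single arrays; the function name multiply_flat is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* c = a * b with row-major flat storage.
 * a is a_rows x a_cols, b is a_cols x b_cols.
 * Returns a malloc'd a_rows x b_cols array (caller frees), or NULL on failure. */
double *multiply_flat(const double *a, uint32_t a_rows, uint32_t a_cols,
                      const double *b, uint32_t b_cols)
{
    assert(a != NULL && b != NULL);
    double *c = malloc(sizeof *c * a_rows * b_cols);
    if (c == NULL)
        return NULL;
    for (uint32_t i = 0; i < a_rows; i++) {
        for (uint32_t j = 0; j < b_cols; j++) {
            double sum = 0.0;              /* no separate clearing pass needed */
            for (uint32_t k = 0; k < a_cols; k++)
                sum += a[(size_t)i * a_cols + k] * b[(size_t)k * b_cols + j];
            c[(size_t)i * b_cols + j] = sum;
        }
    }
    return c;
}
```

Accumulating into a local sum also removes the need for the "clear matrix c" loop, which in the original indexed c with the wrong bounds.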
I am trying to optimise the divide operation from the Jacobi relaxation formula.
Also doing profiling using perf.
Here is my code
for (int l = 0; l < iter; l++) {
for (i = 1; i < height; i++) {
for (j = 1; j < width; j++) {
for (k = 1; k < length; k++) {
float val = 0.0f;
// Do the Jacobi additions here
// From profiling, fastest is to fetch k+/-1,j,i
// Slowest is to fetch k,j,i+/-1
// Scale with dimensions of the array
val -= dim * array[k][j][i];
// Want to optimise this
val /= 6.0; // profiling shows this as the slowest op
// Some code here to put the result into the output array
}
}
}
}
The size of the 3D array can be from 100x100x100 up to 1000x1000x1000.
I've tried multiplying by 1.0f/6.0f instead, but this does not seem to make a difference. The array is a 3D array of floats.
I am using R code that implements a permutation test for the distributional comparison between two populations of functions. We have p univariate p-values.
The bottleneck is the construction of a matrix that contains all the possible contiguous p-values.
The last row of the matrix of p-values contains all the univariate p-values.
The penultimate row contains all the bivariate p-values in this order:
p_val_c(1,2), p_val_c(2,3), ..., p_val_c(p, 1)
...
The elements of the first row all coincide; their common value is the p-value of the global test p_val_c(1,...,p) = p_val_c(2,...,p,1) = ... = p_val_c(p,1,...,p-1).
For computational reasons, I have decided to implement this component in C and call it from R with .C.
Here is the code. The only important part is the definition of the function Build_pval_asymm_matrix.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix);
// Function used for the sorting of vector T_temp with qsort
int cmp(const void *x, const void *y);
int main() {
int B = 1000; // number Conditional Monte Carlo (CMC) runs
int p = 100; // number univariate tests
// Generate fictitiously data univariate p-values pval and matrix L.
// The j-th column of L is the empirical survival
// function of the statistics test associated to the j-th coefficient
// of the basis expansion. The dimension of L is B * p.
// Generate pval
double pval[p];
memset(pval, 0, sizeof(pval)); // initialize all elements to 0
for (int i = 0; i < p; i++) {
pval[i] = (double)rand() / (double)RAND_MAX;
}
// Construct L
double L[B * p];
// Initialize the elements of L to 0
memset(L, 0, sizeof(L));
// Array used to construct the columns of L
double temp_array[B];
memset(temp_array, 0, sizeof(temp_array));
for(int i = 0; i < B; i++) {
temp_array[i] = (double) (i + 1) / (double) B;
}
for (int iter_coeff=0; iter_coeff < p; iter_coeff++) {
// Shuffle temp_array
if (B > 1) {
for (int k = 0; k < B - 1; k++)
{
int j = rand() % B;
double t = temp_array[j];
temp_array[j] = temp_array[k];
temp_array[k] = t;
}
}
for (int i=0; i<B; i++) {
L[iter_coeff + p * i] = temp_array[i];
}
}
double pval_asymm_matrix[p * p];
memset(pval_asymm_matrix, 0, sizeof(pval_asymm_matrix));
// Construct the asymmetric matrix of p-values
clock_t start, end;
double cpu_time_used;
start = clock();
Build_pval_asymm_matrix(&p, &B, pval, L, pval_asymm_matrix);
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("TOTAL CPU time used: %f\n", cpu_time_used);
return 0;
}
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix) {
int nbasis = *p, iter_CMC = *B;
// Scalar output fisher combining function applied on univariate
// p-values
double T0_temp = 0;
// Vector output fisher combining function applied on a set of
//columns of L
double T_temp[iter_CMC];
memset(T_temp, 0, sizeof(T_temp));
// Counter for elements of T_temp greater than or equal to T0_temp
int count = 0;
// Indexes for columns of L
int inf = 0, sup = 0;
// The last row of pval_asymm_matrix contains the univariate p-values
for(int i = 0; i < nbasis; i++) {
pval_asymm_matrix[i + nbasis * (nbasis - 1)] = pval[i];
}
// Construct the rows from bottom to up
for (int row = nbasis - 2; row >= 0; row--) {
for (int col = 0; col <= row; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = (nbasis - row) + col - 1;
// Combining function Fisher applied on
// p-values pval[inf:sup]
for (int k = inf; k <= sup; k++) {
T0_temp += log(pval[k]);
}
T0_temp *= -2;
// Combining function Fisher applied
// on columns inf:sup of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = inf; l <= sup; l++) {
T_temp[k] += log(L[l + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
}
// auxiliary variable for columns of L inf:nbasis-1 and 1:sup
int aux_first = 0, aux_second = 0;
int num_col_needed = 0;
for (int col = row + 1; col < nbasis; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = ((nbasis - row) + col) % nbasis - 1;
// Useful indexes
num_col_needed = nbasis - inf + sup + 1;
int index_needed[num_col_needed];
memset(index_needed, -1, num_col_needed * sizeof(int));
aux_first = inf;
for (int i = 0; i < nbasis - inf; i++) {
index_needed[i] = aux_first;
aux_first++;
}
aux_second = 0;
for (int j = 0; j < sup + 1; j++) {
index_needed[j + nbasis - inf] = aux_second;
aux_second++;
}
// Combining function Fisher applied on p-values
// pval[inf:p-1] and pval[0:sup]
for (int k = 0; k < num_col_needed; k++) {
T0_temp += log(pval[index_needed[k]]);
}
T0_temp *= -2;
// Combining function Fisher applied on columns inf:p-1 and 0:sup-1
// of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = 0; l < num_col_needed; l++) {
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
} // end for over col from row + 1 to nbasis - 1
} // end for over rows of asymm p-values matrix except the last row
}
int cmp(const void *x, const void *y)
{
double xx = *(double*)x, yy = *(double*)y;
if (xx < yy) return -1;
if (xx > yy) return 1;
return 0;
}
Here are the execution times in seconds, measured in R:
time_original_function
user system elapsed
79.726 1.980 112.817
time_function_double_for
user system elapsed
79.013 1.666 89.411
time_c_function
user system elapsed
47.920 0.024 56.096
The first measure was obtained using an equivalent R function with duplication of the vector pval and matrix L.
I would like some suggestions for decreasing the execution time of the C function, which I use for simulations. The last time I used C was five years ago, so there is surely room for improvement. For instance, I sort the vector T_temp with qsort so that a while loop can then count, in linear time, the number of elements of T_temp greater than or equal to T0_temp. Maybe this task could be done more efficiently. Thanks in advance!
I reduced the input size p to 50 to avoid waiting on it (I don't have such a fast machine); keeping p as is and reducing B to 100 has a similar effect. Profiling showed that ~7.5 of the ~8 seconds used to compute this were spent in the log function.
qsort doesn't even show up as a real hotspot. This test seems to headbutt the machine more in terms of micro-efficiency than anything else.
So unless your compiler has a vastly faster implementation of log than I do, my first suggestion is to find a fast log implementation if you can afford some accuracy loss (there are ones out there that can compute log over an order of magnitude faster with precision loss in the range of ~3% or so).
If you cannot have precision loss and accuracy is critical, then I'd suggest trying to memoize the values you use for log if you can and store them into a lookup table.
Update
I tried the latter approach.
// Create a memoized table of log values.
double log_cache[B * p];
for (int j=0, num=B*p; j < num; ++j)
log_cache[j] = log(L[j]);
Using malloc might be better here, as we're pushing rather large data to the stack and could risk overflows.
Then pass it into Build_pval_asymm_matrix.
Replace these:
T_temp[k] += log(L[l + nbasis * k]);
...
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
With these:
T_temp[k] += log_cache[l + nbasis * k];
...
T_temp[k] += log_cache[index_needed[l] + nbasis * k];
This improved the times for me from ~8 seconds to ~5.3 seconds, but we've exchanged the computational overhead of log for memory overhead which isn't that much better (in fact, it rarely is but calling log for double-precision floats is apparently quite expensive, enough to make this exchange worthwhile). The next iteration, if you want more speed, and it is very possible, involves looking into cache efficiency.
For this kind of huge matrix stuff, focusing on memory layouts and access patterns can work wonders.