Float size, matrix multiplication, OpenCL, sockets. Weird - c

I'm generating two matrices using the following function (note some code is omitted):
srand(2007);
randomInit(h_A_data, size_A);
void randomInit(float* data, int size)
{
int i;
for (i = 0; i < size; ++i){
data[i] = rand() / (float)RAND_MAX;
}
}
This is called for matrices A and B and populates them with values of the form 0.something, e.g. 0.748667. I then perform a matrix multiplication on the CPU and compare the result to a GPU implementation via OpenCL. The resulting matrix has values in the range 20.something, e.g. 23.472757. Both the CPU and the GPU give the same result. The CPU implementation is taken from the CUDA toolkit distributed by NVIDIA:
void computeGold(float* C, const float* A, const float* B, unsigned int hA, unsigned int wA, unsigned int wB)
{
unsigned int i;
unsigned int j;
unsigned int k;
for (i = 0; i < hA; ++i)
for (j = 0; j < wB; ++j) {
double sum = 0;
for (k = 0; k < wA; ++k) {
double a = A[i * wA + k];
double b = B[k * wB + j];
sum += a * b;
}
C[i * wB + j] = (float)sum;
}
}
The weird thing is, all three matrices in memory are of the same size, i.e. sizeof(float)*size_A, or *size_B for matrix B etc. When I dump them to the disk, the file for the result stored in matrix C (the multiplied matrix) is bigger than matrix A and B.
Even more critically, for my application I'm transferring these over a network via a socket. In terms of the raw number of bytes, all matrices are the same, and yet it takes longer to transfer matrix C over the network. The problem gets worse for large matrix sizes. Why is this?
UPDATE/EDIT:
fprintf(matrix_c_file,"\n\nMatrix C\n");
for(i = 0; i < size_C; i++)
{
fprintf(matrix_c_file,"%f ", h_C_data[i]);
}
fprintf(matrix_c_file,"\n");
When matrices A and B contain only zeros, all three (matrix A, B and C) are the same size on disk.

I think that lijie has the correct (albeit terse) answer in the comments. The %f format specifier can result in a string with variable width. Consider the following C code:
printf("%f\n", 0.0);
printf("%f\n", 3.1415926535897932384626433);
printf("%f\n", 20.53);
printf("%f\n", 20.5e38);
which produces:
0.000000
3.141593
20.530000
2050000000000000019963732141023730597888.000000
All of the output has the same number of digits after the decimal point (6 by default), but a variable number to the left of the decimal point. If you need the textual representation of your matrix to be a consistent size and you don't mind sacrificing some precision, you can use the %e format specifier instead to force an exponential representation like 2.345e12.

Related

Arithmetic exception in C

Arithmetic Exception
// Compute y = A*x + b using the m*n matrix A
void fc(int m, int n, const float *x, const float *A, const float *b, float *y){
int i, j;
for (i = 0; i < m; i++){
y[i] = b[i];
for (j = 0; j < n; j++){
y[i] += A[i * n + j] * x[j];
}
}
}
This code computes y = A*x + b for a matrix A and vectors x and b.
But as shown in the photo, an arithmetic exception occurs. Why is this happening, even though the code only multiplies and adds and nothing is divided by 0?
How can I solve this error?
Sorry that I cannot add the values; otherwise I would have to add the whole file here. They are the parameters of a neural network, so I would have to attach the .dat files and then also the code that loads them. I also do not know how to extract only the numbers from the .dat files, since they are encoded in an unusual way.
I will provide any other information that is needed, so please don't close this question. I really want to know why this happens and how to solve it.
This is another example of the exception.
Example
What I want to know is how this can happen when nothing is divided by 0 in this example, and how I should interpret this situation.
According to your image, your matrix is 100x50 (m, n), which means it has 5000 elements. But the expression being evaluated is A[j*m+i] with j equal to 55 and i equal to 0. That accesses element 5500 of the array, which is out of bounds.
#include <stdio.h>
void fc(int m, int n, const float *x, const float *A, const float *b, float *y){
int i, j;
for (i = 0; i < m; i++){
y[i] = b[i];
for (j = 0; j < n; j++){
y[i] += A[i * n + j] * x[j];
}
}
}
int main()
{
const float x[3]={1,1,1};
const float *xp=x;
const float A[3][3]={{1,1,1},{1,1,1},{1,1,1}};
const float b[3]={1,1,1};
const float *bp=b;
float y[3]; float *yp=y;
fc (3,3,xp,*A,bp,yp);
printf("%f %f %f ",y[0],y[1],y[2]);
return 0;
}
I tested the program with a value of 1 for every element, using a 3x3 matrix and 3x1 vectors. The result was correct, with no error:
4.000000 4.000000 4.000000
So the problem does not arise from the structure of your code. It definitely comes from a specific arithmetic problem.

Segmentation Fault 11 in C caused by larger operation numbers

I know that segmentation fault 11 means the program has attempted to access an area of memory that it is not allowed to access.
Here I am trying to calculate a Fourier transform using the following code.
It works well when nPoints = 2^15 (and of course with fewer points), but it crashes when I increase the number of points to 2^16. Is that caused by using too much memory? I did not notice much memory usage while it ran, and although the code uses recursion, it transforms in place, so I thought it would not need much memory. Where is the problem, then?
Thanks in advance
PS: One thing I forgot to say is that the result above was on macOS (8 GB memory).
When I run the code on Windows (16 GB memory), it crashes at nPoints = 2^14. That leaves me unsure whether memory allocation is the cause, since the Windows PC has more memory (though it is hard to say, because the two operating systems use different memory strategies).
#include <stdio.h>
#include <tgmath.h>
#include <string.h>
// in place FFT with O(n) memory usage
long double PI;
typedef long double complex cplx;
void _fft(cplx buf[], cplx out[], int n, int step)
{
if (step < n) {
_fft(out, buf, n, step * 2);
_fft(out + step, buf + step, n, step * 2);
for (int i = 0; i < n; i += 2 * step) {
cplx t = exp(-I * PI * i / n) * out[i + step];
buf[i / 2] = out[i] + t;
buf[(i + n)/2] = out[i] - t;
}
}
}
void fft(cplx buf[], int n)
{
cplx out[n];
for (int i = 0; i < n; i++) out[i] = buf[i];
_fft(buf, out, n, 1);
}
int main()
{
const int nPoints = pow(2, 15);
PI = atan2(1.0l, 1) * 4;
double tau = 0.1;
double tSpan = 12.5;
long double dt = tSpan / (nPoints-1);
long double T[nPoints];
cplx At[nPoints];
for (int i = 0; i < nPoints; ++i)
{
T[i] = dt * (i - nPoints / 2);
At[i] = exp( - T[i]*T[i] / (2*tau*tau));
}
fft(At, nPoints);
return 0;
}
You cannot allocate very large arrays on the stack. The default stack size on macOS is 8 MiB. The size of your cplx type is 32 bytes, so an array of 2^16 cplx elements is 2 MiB, and you have two of them (one in main and one in fft), so that is 4 MiB. That fits on the stack, and, at that size, the program runs to completion when I try it. At 2^17, it fails, which makes sense because then the program has two arrays taking 8 MiB on the stack. The proper way to allocate such large arrays is to include <stdlib.h> and use cplx *At = malloc(nPoints * sizeof *At); followed by if (!At) { /* Print some error message about being unable to allocate memory and terminate the program. */ }. You should do that for At, T, and out. Also, when you are done with each array, you should free it, as with free(At);.
To calculate an integer power of two, use the integer operation 1 << power, not the floating-point operation pow(2, 16). We have designed pow well on macOS, but, on other systems, it may return approximations even when exact results are possible. An approximate result may be slightly less than the exact integer value, so converting it to an integer truncates to the wrong result. If it may be a power of two larger than suitable for an int, then use (type) 1 << power, where type is a suitably large integer type.
The following instrumented code clearly shows that the OP's code repeatedly updates the same locations in the out[] array and does not update most of the locations in that array.
#include <stdio.h>
#include <tgmath.h>
#include <assert.h>
// in place FFT with O(n) memory usage
#define N_POINTS (1<<15)
double T[N_POINTS];
double At[N_POINTS];
double PI;
// prototypes
void _fft(double buf[], double out[], int step);
void fft( void );
int main( void )
{
PI = 3.14159;
double tau = 0.1;
double tSpan = 12.5;
double dt = tSpan / (N_POINTS-1);
for (int i = 0; i < N_POINTS; ++i)
{
T[i] = dt * (i - (N_POINTS / 2));
At[i] = exp( - T[i]*T[i] / (2*tau*tau));
}
fft();
return 0;
}
void fft()
{
double out[ N_POINTS ];
for (int i = 0; i < N_POINTS; i++)
out[i] = At[i];
_fft(At, out, 1);
}
void _fft(double buf[], double out[], int step)
{
printf( "step: %d\n", step );
if (step < N_POINTS)
{
_fft(out, buf, step * 2);
_fft(out + step, buf + step, step * 2);
for (int i = 0; i < N_POINTS; i += 2 * step)
{
double t = exp(-I * PI * i / N_POINTS) * out[i + step];
buf[i / 2] = out[i] + t;
buf[(i + N_POINTS)/2] = out[i] - t;
printf( "index: %d buf update: %d, %d\n", i, i/2, (i+N_POINTS)/2 );
}
}
}
I suggest running it as follows on Linux (where untitled1 is the name of the executable):
./untitled1 > out.txt
less out.txt
The out.txt file is 8630880 bytes.
An examination of that file shows the lack of coverage and shows that any one entry is NOT the sum of the prior two entries, so I suspect this is not a valid Fourier transform.

How can I code exp(x) in c?

There is a series for the exp function which looks like this:
exp(x) = (x^0)/0! + (x^1)/1! + (x^2)/2! + (x^3)/3! + ···. I'm trying to compute it for different values of x, checking my results with a calculator, and I found that for big values, 20 for example, my results stop increasing and get stuck at a value that is almost the real one. I get 485165184.00 and the real value is 485165195.4.
I must write this code with a for loop or a recursive function, since it is a homework assignment.
My code looks as follows:
#include <stdio.h>
#define N 13
#define xi 3
double fun7(int n, int m){
int i;
double res=1, aux=0;
for(i=1, aux=1; i<(n+1); i++){
res += aux;
aux *= m;
aux /= i;
}
return res-1;
}
int main() {
int a, b, pot, x[xi];
float R[N][xi];
x[0] = 5;
x[1] = 10;
x[2] = 20;
for(b=0; b<xi; b++){
for (a=0, pot=1; a<N; a++){
R[a][b] = fun7(pot, x[b]);
pot *= 2;
}
}
for(b=0; b<xi; b++){
for (a=0, pot=1; a<N; a++){
printf("%d\t%f\n", pot, R[a][b]);
pot *= 2;
}
printf("\n");
}
return 0;
}
The float data type can normally represent numbers with a tad more than 7 decimal digits of precision.
485165184 has 9 decimal digits. The last two digits are just meaningless noise as far as float goes. You really should be showing 4.851652e8, which is the correct value for exp(20) with the given level of precision.
If you want to increase precision, try using double or long double data types.

Numerical Integral from 0 to infinity

My aim is to calculate the numerical integral of a probability distribution function (PDF) of the distance of an electron from the nucleus of the hydrogen atom in the C programming language. I have written some sample code, but it fails to find the correct numerical value because, in my opinion, I cannot increase the upper limit as far as necessary. I have also included the library but I cannot use the values stated in the following post as integral boundaries: min and max value of data type in C . What is the remedy in this case? Should I switch to another programming language, maybe? Any help and suggestions are appreciated, thanks in advance.
Edit: Beyond a certain value of N I get a segmentation fault. I have checked with Wolframalpha that the actual result of the integral is 0.0372193. In addition, if I increment r[k] in smaller amounts I get zero as a result, which is why I defined r[k]=k; I know the step should be smaller for better precision.
#include <stdio.h>
#include <math.h>
#include <limits.h>
#define a0 0.53
int N = 200000;
// This value of N is the highest possible number in long double
// data format. Change its value to adjust the precision of integration
// and computation time.
// The discrete integral may be defined as follows:
long double trapezoid(long double x[], long double f[]) {
int i;
long double dx = x[1]-x[0];
long double sum = 0.5*(f[0]+f[N]);
for (i = 1; i < N; i++)
sum+=f[i];
return sum*dx;
}
main() {
long double P[N], r[N], a;
// Declare and initialize the loop variable
int k = 0;
for (k = 0; k < N; k++)
{
r[k] = k ;
P[k] = r[k] * r[k] * exp( -2*r[k] / a0);
//printf("%.20Lf \n", r[k]);
//printf("%.20Lf \n", P[k]);
}
a = trapezoid(r, P);
printf("%.20Lf \n", a);
}
Last Code:
#include <stdio.h>
#include <math.h>
#include <limits.h>
#include <stdlib.h>
#define a0 0.53
#define N LLONG_MAX
// This value of N is the highest possible number in long double
// data format. Change its value to adjust the precision of integration
// and computation time.
// The discrete integral may be defined as follows:
long double trapezoid(long double x[],long double f[]) {
int i;
long double dx = x[1]-x[0];
long double sum = 0.5*(f[0]+f[N]);
for (i = 1; i < N; i++)
sum+=f[i];
return sum*dx;
}
main() {
printf("%Ld", LLONG_MAX);
long double * P = malloc(N * sizeof(long double));
long double * r = malloc(N * sizeof(long double));
// Declare and initialize the loop variable
int k = 0;
long double integral;
for (k = 1; k < N; k++)
{
P[k] = r[k] * r[k] * expl( -2*r[k] / a0);
}
integral = trapezoid(r, P);
printf("%Lf", integral);
}
Edit last code working:
#include <stdio.h>
#include <math.h>
#include <limits.h>
#include <stdlib.h>
#define a0 0.53
#define N LONG_MAX/100
// This value of N is the highest possible number in long double
// data format. Change its value to adjust the precision of integration
// and computation time.
// The discrete integral may be defined as follows:
long double trapezoid(long double x[],long double f[]) {
int i;
long double dx = x[1]-x[0];
long double sum = 0.5*(f[0]+f[N]);
for (i = 1; i < N; i++)
sum+=f[i];
return sum*dx;
}
main() {
printf("%Ld \n", LLONG_MAX);
long double * P = malloc(N * sizeof(long double));
long double * r = malloc(N * sizeof(long double));
// Declare and initialize the loop variable
int k = 0;
long double integral;
for (k = 1; k < N; k++)
{
r[k] = k / 100000.0;
P[k] = r[k] * r[k] * expl( -2*r[k] / a0);
}
integral = trapezoid(r, P);
printf("%.15Lf \n", integral);
free((void *)P);
free((void *)r);
}
In particular, I changed the definition of r[k] to use a floating-point number in the division so that the result is not truncated by integer division. Also, as I stated in my last comment, I cannot use values of N larger than LONG_MAX/100, and I think I should investigate the code and malloc further to find the cause. I found the exact value analytically by taking the limits, and I confirmed the result with a TI-89 Titanium and Wolframalpha (both numerically and analytically) apart from doing it myself. The trapezoid rule worked out pretty well once the interval size was decreased. Many thanks to all the posters here for their ideas. By the way, a LONG_MAX of 2147483647 is not as large as I expected; should the limit not be around ten to the power 308?
Numerical point of view
The usual trapezoid method doesn't work with improper integrals. As such, Gaussian quadrature rules are much better, since they not only provide 2n-1 exactness (that is, for a polynomial of degree 2n-1 they will return the correct solution), but also manage improper integrals by using the right weight function.
If your integral is improper on both sides, you should try the Gauss-Hermite quadrature; otherwise use the Gauss-Laguerre quadrature.
The "overflow" error
long double P[N], r[N], a;
P has a size of roughly 3MB, and so does r. That's too much memory. Allocate the memory instead:
long double * P = malloc(N * sizeof(long double));
long double * r = malloc(N * sizeof(long double));
Don't forget to include <stdlib.h> and use free on both P and r if you don't need them any longer. Also, you may not access the N-th entry, so f[N] is wrong.
Using Gauss-Laguerre quadrature
Gauss-Laguerre uses exp(-x) as its weight function. If you're not familiar with Gaussian quadrature: the result E(f) approximates the integral of w * f, where w is the weight function.
Your f looks like this:
f x = x^2 * exp (-2 * x / a)
Wait a minute. f already contains exp(-term), so we can substitute t = 2 * x / a, that is x = t * a/2 and dx = a/2 dt, and get
f' t = (t * a/2)^2 * exp(-t) * a/2
Since exp(-t) is already part of the weight function, your function now fits perfectly into the Gauss-Laguerre quadrature: only the polynomial factor (t * a/2)^2 * a/2 has to be evaluated at the nodes. The resulting code is
#include <stdio.h>
#include <math.h>
/* x[] and a[] taken from
* https://de.wikipedia.org/wiki/Gau%C3%9F-Quadratur#Gau.C3.9F-Laguerre-Integration
* Calculating them by hand is a little bit cumbersome
*/
const int gauss_rule_length = 3;
const double gauss_x[] = {0.415774556783, 2.29428036028, 6.28994508294};
const double gauss_a[] = {0.711093009929, 0.278517733569, 0.0103892565016};
double f(double x){
return x *.53/2 * x *.53/2 * .53/2;
}
int main(){
int i;
double sum = 0;
for(i = 0; i < gauss_rule_length; ++i){
sum += gauss_a[i] * f(gauss_x[i]);
}
printf("%.10lf\n",sum); /* 0.0372192500 */
return 0;
}

Matrix and vector multiplication optimization algorithm

Assume that the dimensions are very large (up to 1 billion elements in a matrix). How would I implement a cache-oblivious algorithm for the matrix-vector product? Based on Wikipedia, I would need to divide and conquer recursively, but I feel like there would be a lot of overhead. Would it be efficient to do so?
Follow up question and answer: OpenMP with matrices and vectors
So the answer to the question, "how do I make this basic linear algebra operation fast", is always and everywhere to find and link to a tuned BLAS library for your platform, e.g. GotoBLAS (whose work is being continued in OpenBLAS), the slower autotuned ATLAS, or commercial packages like Intel's MKL. Linear algebra is so fundamental to so many other operations that enormous amounts of effort go into optimizing these packages for various platforms, and there's just no chance you're going to come up with something in a few afternoons' work that will compete. The subroutine calls you're looking for for general dense matrix-vector multiplication are SGEMV/DGEMV/CGEMV/ZGEMV.
Cache-oblivious algorithms, or autotuning, are for when you can't be bothered tuning for the specific cache architecture of your system, which might be fine normally. But since people are willing to do that tuning for BLAS routines and then make the tuned results available, you're best off just using those routines.
The memory access pattern for GEMV is straightforward enough that you don't really need divide and conquer (the same goes for the standard case of matrix transpose); you just find the cache blocking size and use it. In GEMV (y = Ax), you still have to scan through the entire matrix once, so there's nothing to be done for reuse (and thus effective cache use) there, but you can try to reuse x as much as possible so you load it once instead of (number of rows) times, and you still want access to A to be cache friendly. So the obvious cache blocking thing to do is to break along blocks:
A x -> [ A11 | A12 ] | x1 | = | A11 x1 + A12 x2 |
       [ A21 | A22 ] | x2 |   | A21 x1 + A22 x2 |
And you can certainly do that recursively. But doing a naive implementation, it's slower than the simple double-loop, and way slower than a proper SGEMV library call:
$ ./gemv
Testing for N=4096
Double Loop: time = 0.024995, error = 0.000000
Divide and conquer: time = 0.299945, error = 0.000000
SGEMV: time = 0.013998, error = 0.000000
The code follows:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include "mkl.h"
float **alloc2d(int n, int m) {
float *data = malloc(n*m*sizeof(float));
float **array = malloc(n*sizeof(float *));
for (int i=0; i<n; i++)
array[i] = &(data[i*m]);
return array;
}
void tick(struct timeval *t) {
gettimeofday(t, NULL);
}
/* returns time in seconds from now to time described by t */
double tock(struct timeval *t) {
struct timeval now;
gettimeofday(&now, NULL);
return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);
}
float checkans(float *y, int n) {
float err = 0.;
for (int i=0; i<n; i++)
err += (y[i] - 1.*i)*(y[i] - 1.*i);
return err;
}
/* assume square matrix */
void divConquerGEMV(float **a, float *x, float *y, int n,
int startr, int endr, int startc, int endc) {
int nr = endr - startr + 1;
int nc = endc - startc + 1;
if (nr == 1 && nc == 1) {
/* row index selects the entry of y, column index selects the entry of x */
y[startr] += a[startr][startc] * x[startc];
} else {
int midr = (endr + startr+1)/2;
int midc = (endc + startc+1)/2;
divConquerGEMV(a, x, y, n, startr, midr-1, startc, midc-1);
divConquerGEMV(a, x, y, n, midr, endr, startc, midc-1);
divConquerGEMV(a, x, y, n, startr, midr-1, midc, endc);
divConquerGEMV(a, x, y, n, midr, endr, midc, endc);
}
}
int main(int argc, char **argv) {
const int n=4096;
float **a = alloc2d(n,n);
float *x = malloc(n*sizeof(float));
float *y = malloc(n*sizeof(float));
struct timeval clock;
double eltime;
printf("Testing for N=%d\n", n);
for (int i=0; i<n; i++) {
x[i] = 1.*i;
for (int j=0; j<n; j++)
a[i][j] = 0.;
a[i][i] = 1.;
}
/* naive double loop */
tick(&clock);
for (int i=0; i<n; i++) {
y[i] = 0.;
for (int j=0; j<n; j++) {
y[i] += a[i][j]*x[j];
}
}
eltime = tock(&clock);
printf("Double Loop: time = %lf, error = %f\n", eltime, checkans(y,n));
for (int i=0; i<n; i++) y[i] = 0.;
/* naive divide and conquer */
tick(&clock);
divConquerGEMV(a, x, y, n, 0, n-1, 0, n-1);
eltime = tock(&clock);
printf("Divide and conquer: time = %lf, error = %f\n", eltime, checkans(y,n));
/* decent GEMV implementation */
tick(&clock);
float alpha = 1.;
float beta = 0.;
int incrx=1;
int incry=1;
char trans='N';
sgemv(&trans,&n,&n,&alpha,&(a[0][0]),&n,x,&incrx,&beta,y,&incry);
eltime = tock(&clock);
printf("SGEMV: time = %lf, error = %f\n", eltime, checkans(y,n));
return 0;
}
