I am reimplementing a piece of MATLAB code as a MEX function (in C). So far my C version is about twice as fast as the MATLAB code. Now I have three questions, all related to the code below:
How can I speed up this code more?
Do you see any problems with this code? I ask because I don't know MEX very well and I am also not a C guru ;-) ... I am aware that the code should contain more checks (for example, whether realloc actually succeeded), but I left those out for the sake of simplicity for the moment.
Is it possible that MATLAB optimizes so well that I really can't get much more than twice the speed in C?
The code should be more or less platform independent (Windows, Linux, Unix, Mac, different hardware), so I don't want to use assembler or specific linear algebra libraries. That's why I programmed this stuff myself.
#include <mex.h>
#include <math.h>
#include <matrix.h>
void mexFunction(
int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double epsilon = ((double)(mxGetScalar(prhs[0])));
int strengthDim = ((int)(mxGetScalar(prhs[1])));
int lenPartMat = ((int)(mxGetScalar(prhs[2])));
int numParts = ((int)(mxGetScalar(prhs[3])));
double *partMat = mxGetPr(prhs[4]);
const mxArray* verletListCells = prhs[5];
mxArray *verletList;
double *pseSum = (double *) malloc(numParts * sizeof(double));
for(int i = 0; i < numParts; i++) pseSum[i] = 0.0;
float *tempVar = NULL;
for(int i = 0; i < numParts; i++)
{
verletList = mxGetCell(verletListCells,i);
int numberVerlet = mxGetM(verletList);
tempVar = (float *) realloc(tempVar, numberVerlet * sizeof(float) * 2);
for(int a = 0; a < numberVerlet; a++)
{
tempVar[a*2] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1] - partMat[i];
tempVar[a*2 + 1] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + lenPartMat] - partMat[i + lenPartMat];
tempVar[a*2] = pow(tempVar[a*2],2);
tempVar[a*2 + 1] = pow(tempVar[a*2 + 1],2);
tempVar[a*2] = tempVar[a*2] + tempVar[a*2 + 1];
tempVar[a*2] = sqrt(tempVar[a*2]);
tempVar[a*2] = 4.0/(pow(epsilon,2) * M_PI) * exp(-(pow((tempVar[a*2]/epsilon),2)));
pseSum[i] = pseSum[i] + ((partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + 2*lenPartMat] - partMat[i + (2 * lenPartMat)]) * tempVar[a*2]);
}
}
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
for(int a = 0; a < numParts; a++)
{
*(mxGetPr(plhs[0]) + a) = pseSum[a];
}
free(tempVar);
free(pseSum);
}
So this is the improved version, which is about 12 times faster than the MATLAB version. The conversion is still eating up a lot of time, but I'm leaving that aside for now because it requires a change on the MATLAB side. So first, focus on the remaining C code. Do you see any more potential in the following code?
#include <mex.h>
#include <math.h>
#include <matrix.h>
void mexFunction(
int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double epsilon = ((double)(mxGetScalar(prhs[0])));
int strengthDim = ((int)(mxGetScalar(prhs[1])));
int lenPartMat = ((int)(mxGetScalar(prhs[2])));
double *partMat = mxGetPr(prhs[3]);
const mxArray* verletListCells = prhs[4];
int numParts = mxGetM(verletListCells);
mxArray *verletList;
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
double *pseSum = mxGetPr(plhs[0]);
double epsilonSquared = epsilon*epsilon;
double preConst = 4.0/((epsilonSquared) * M_PI);
int numberVerlet = 0;
double tempVar[2];
for(int i = 0; i < numParts; i++)
{
verletList = mxGetCell(verletListCells,i);
double *verletListPtr = mxGetPr(verletList);
numberVerlet = mxGetM(verletList);
for(int a = 0; a < numberVerlet; a++)
{
int address = ((int) verletListPtr[a]) - 1;  /* MATLAB indices are 1-based */
tempVar[0] = partMat[address] - partMat[i];
tempVar[1] = partMat[address + lenPartMat] - partMat[i + lenPartMat];
tempVar[0] = tempVar[0]*tempVar[0] + tempVar[1]*tempVar[1];
tempVar[0] = preConst * exp(-(tempVar[0]/epsilonSquared));
pseSum[i] += (partMat[address + 2*lenPartMat] - partMat[i + 2*lenPartMat]) * tempVar[0];
}
}
}
You do not need to allocate pseSum locally and later copy the data to the output. You can simply allocate a MATLAB object and get the pointer to its memory:
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
pseSum = mxGetPr(plhs[0]);
Thus you will not have to initialize pseSum to 0, because MATLAB already does it in mxCreateDoubleMatrix.
Remove all the mxGetPr calls from the inner loop and assign their results to variables beforehand.
Instead of casting doubles to ints, consider using int32 or uint32 arrays in MATLAB. Casting double to int is expensive. The inner-loop computation would then look like
tempVar[a*2] = partMat[somevar[a] - 1] - partMat[i];
You use constructs like this in your code:
((int) (*(mxGetPr(verletList) + a)))
You do this because verletList is a double array (the default in MATLAB) that holds integer values. Instead, you should use an integer array. Before you call your MEX file, type in MATLAB:
verletList = int32(verletList);
Then you will not need the type cast to int above. You will simply write
((int*)mxGetData(verletList))[a]
or better yet, assign earlier
somevar = (int*)mxGetData(verletList);
and later write
somevar[a]
Precompute 4.0/(pow(epsilon,2) * M_PI) before all the loops! That is one expensive constant.
pow((tempVar[a*2]/epsilon),2) is simply tempVar[a*2]^2/epsilon^2. And you calculate sqrt(tempVar[a*2]) just before, so the square root and the subsequent squaring cancel out; you can skip both.
Generally, do not use pow(x, 2); just write x*x.
I would add some sanity checks on the parameters, especially if you demand integers. Either use MATLAB's int32/uint32 types, or check that what you get actually is an integer.
Edit, regarding the new code:
Compute -1/epsilonSquared before the loops and then compute exp(minvepssq*tempVar[0]). Note that the result might differ slightly, since this changes the order of floating-point operations. It depends on what you need, but if you don't care about the exact order of operations, do it.
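For clarity, the hoisting looks like this (using the minvepssq name from above):
double minvepssq = -1.0 / epsilonSquared;   /* hoisted out of both loops */
/* ... inside the inner loop, a multiply instead of a divide: */
tempVar[0] = preConst * exp(minvepssq * tempVar[0]);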
Define a local accumulator variable pseSum_r and use it to sum the results in the inner loop. After the loop, assign it to pseSum[i]. If you want more fun, you can write the result to memory using an SSE streaming store (the _mm_stream_pd compiler intrinsic).
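A sketch of the accumulator version of the inner loop (pseSum_r is just a local name):
double pseSum_r = 0.0;
for (int a = 0; a < numberVerlet; a++)
{
    /* ... same body as above, but accumulate locally ... */
    pseSum_r += (partMat[address + 2*lenPartMat] - partMat[i + 2*lenPartMat]) * tempVar[0];
}
pseSum[i] = pseSum_r;   /* one store per particle instead of one per neighbour */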
Do remove the double-to-int cast.
Most likely irrelevant, but try changing tempVar[0] and tempVar[1] to plain scalar variables. Probably irrelevant because the compiler should do that for you, but again, an array is not needed here.
Parallelise the outer loop with OpenMP. This is trivial (at least in the simplest version, without thinking about data layout for NUMA architectures) since there is no dependence between the iterations; a sketch follows.
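A minimal OpenMP sketch, with one caveat that is an assumption on my part: the mx* API is not documented as thread-safe, so the cell pointers and lengths are gathered serially first, and only plain C data is touched inside the parallel region. Compile the MEX file with -fopenmp (gcc) or /openmp (MSVC).
#include <omp.h>

double **listPtrs = (double **) malloc(numParts * sizeof(double *));
int     *listLens = (int *)     malloc(numParts * sizeof(int));
for (int i = 0; i < numParts; i++)
{
    mxArray *vl = mxGetCell(verletListCells, i);   /* serial: mx* calls only here */
    listPtrs[i] = mxGetPr(vl);
    listLens[i] = (int) mxGetM(vl);
}

#pragma omp parallel for
for (int i = 0; i < numParts; i++)
{
    double sum = 0.0;
    for (int a = 0; a < listLens[i]; a++)
    {
        /* ... same inner-loop body as above, reading listPtrs[i][a],
           accumulating into sum ... */
    }
    pseSum[i] = sum;   /* each i is written by exactly one thread */
}

free(listLens);
free(listPtrs);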
Can you estimate ahead of time the maximum size of tempVar and allocate its memory once before the loop, instead of calling realloc in every iteration? Reallocating memory is a time-consuming operation, and if your numParts is large this could have a huge impact; see the sketch below. Take a look at this question.
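One way to do that, assuming an extra pass over the cell array is cheap compared to the main loop:
/* Sketch: one pass to find the longest Verlet list, then a single
   allocation that is reused for every particle. */
size_t maxVerlet = 0;
for (int i = 0; i < numParts; i++)
{
    size_t m = mxGetM(mxGetCell(verletListCells, i));
    if (m > maxVerlet) maxVerlet = m;
}
float *tempVar = (float *) malloc(maxVerlet * 2 * sizeof(float));
/* ... main loop, no realloc needed ... */
free(tempVar);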
I am trying to accelerate RSA encryption using CUDA, but I can't get power-modulo to work properly in the kernel function.
I am compiling with the CUDA compilation tools on AWS, release 9.0, V9.0.176.
#include <cstdio>
#include <math.h>
#include "main.h"
// Kernel function to encrypt the message (m_in) elements into cipher (c_out)
__global__
void enc(int numElements, int e, int n, int *m_in, int *c_out)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
printf("e = %d, n = %d, numElements = %d\n", e, n, numElements);
for (int i = index; i < numElements; i += stride)
{
// POINT OF ERROR //
// c_out[i] = (m_in[i]^e) % n; //**GIVES WRONG RESULTS**
c_out[i] = __pow(m_in[i], e) % n; //**GIVES, error: expression must have integral or enum type**
}
}
// This function is called from main() from other file.
int* cuda_rsa(int numElements, int* data, int public_key, int key_length)
{
int e = public_key;
int n = key_length;
// Allocate Unified Memory – accessible from CPU or GPU
int* message_array;
cudaMallocManaged(&message_array, numElements*sizeof(int));
int* cipher_shared_array; //Array shared by CPU and GPU
cudaMallocManaged(&cipher_shared_array, numElements*sizeof(int));
int* cipher_array = (int*)malloc(numElements * sizeof(int));
//Put message array to be encrypted in a managed array
for(int i=0; i<numElements; i++)
{
message_array[i] = data[i];
}
// Run kernel on 16M elements on the GPU
enc<<<1, 1>>>(numElements, e, n, message_array, cipher_shared_array);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
//Copy into a host array and pass it to main() function for verification.
//Ignored memory leaks.
for(int i=0; i<numElements; i++)
{
cipher_array[i] = cipher_shared_array[i];
}
return (cipher_array);
}
Please help me with this error.
How can I implement power-modulo (as follows) in a CUDA kernel?
(x ^ y) % n;
I would really appreciate any help.
In C or C++, this:
(x^y)
does not raise x to the power of y. It performs a bitwise exclusive-or operation. That is why your first attempt does not give the correct answer.
In C or C++, the modulo arithmetic operator:
%
is only defined for integer arguments. Even though you are passing integers to the __pow() function, the return result of that function is a double (i.e. a floating-point quantity, not an integer quantity).
I don't know the details of the math you need to perform, but if you cast the result of __pow to an int (for example) this compile error will disappear. That may or may not be valid for whatever arithmetic you wish to perform. (For example, you may wish to cast it to a "long" integer quantity.)
After you do that, you will run into another compile error. The easiest approach is to use pow() instead of __pow():
c_out[i] = (int)pow(m_in[i], e) % n;
If you were actually trying to use the CUDA fast-math intrinsic, you should use __powf not __pow:
c_out[i] = (int)__powf(m_in[i], e) % n;
Note that fast-math intrinsics generally have reduced precision.
Since these raise-to-power functions perform floating-point arithmetic (even though you are passing integers), it is possible to get some unexpected results. For example, if you raise 5 to the power of 2, it's possible to get 24.9999999999 instead of 25. If you simply cast this to an integer quantity, you will get truncation to 24. Therefore you may need to round your result to the nearest integer instead of casting. But again, I haven't studied the math you desire to perform.
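Alternatively, since RSA requires exact integer arithmetic, you can avoid floating point entirely with square-and-multiply modular exponentiation. A hedged sketch, not taken from your code; it assumes n fits in 32 bits so the 64-bit intermediate products cannot overflow:
__device__ int modpow(int base, int exp, int n)
{
    unsigned long long result = 1;
    unsigned long long b = (unsigned long long)(base % n);
    while (exp > 0)
    {
        if (exp & 1)
            result = (result * b) % n;  /* fold in the current exponent bit */
        b = (b * b) % n;                /* square for the next bit */
        exp >>= 1;
    }
    return (int)result;  /* result < n, so it fits in an int */
}

/* in the kernel: c_out[i] = modpow(m_in[i], e, n); */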
I run a loop a million times. Within the loop I call a C function to do some math (generating random variables from various distributions, to be exact). As part of that function, I declare a couple of double variables to hold parts of the transformation. An example:
#include <math.h>    /* sqrt, log, cos */
#include <stdlib.h>  /* rand, RAND_MAX */
static const double pi = 3.14159265358979323846; /* pi was presumably defined elsewhere in the original */
void getRandNorm(double *randnorm, double mean, double var, int n)
{
// Declare variables
double u1;
double u2;
int arrptr = 0;
double sigma = sqrt(var); // the standard deviation
while (arrptr < n) {
// Generate two uniform random variables
u1 = rand() / (double)RAND_MAX;
u2 = rand() / (double)RAND_MAX;
// Box-Muller transform
randnorm[arrptr] = sqrt(-2*log(u1))*cos(2*pi*u2)*sigma+mean;
arrptr++;
if (arrptr < n) { // for an odd n, we cannot add off the end
randnorm[arrptr] = sqrt(-2*log(u2))*cos(2*pi*u1)*sigma+mean;
arrptr++;
}
}
}
And the calling loop:
int i, iter = 1000000; // or something
for (i = 0; i < iter; i++) {
// lots of if statements
getRandNorm(sample1, truemean1, truevar1, n);
// some more analysis
}
I am working on speeding up the runtime. It occurs to me that I don't know what is happening with all these double variables that I am declaring. I assume a new 8-byte chunk of memory is allocated for each double on each of the one million iterations. What happens to all those memory locations? They are declared within a C function; do they survive that function? Are they still locked up until the script exits?
The context for this question is wrapping this C program into a python function. If I'm going to execute this function multiple times in parallel from python, I want to be sure that I'm being as thrifty with memory usage as possible.
If you're talking about something like this:
for(int i=0;i<100000;i++){
double d = 5;
// some other stuff here
}
d is only allocated once by the compiler. It's mostly equivalent to declaring it above the for loop, except that the scope doesn't extend as far.
However, if you are doing something like this:
for(int i=0;i<1000000;i++){
double *d = malloc(sizeof(double));
free(d);
}
Then yes, you will allocate a double 1 million times, though the allocator will likely reuse the memory for subsequent allocations. Finally, if you don't free the memory in my second example, you'll leak 16-32MB of memory (8MB of payload plus per-allocation bookkeeping overhead).
The short answer is: NO, it should not matter if you declare these double variables inside the loop in C. By double variable, I assume you mean variables of type double.
The long answer is: Please post your code so people can tell you if you do something wrong and how to fix it to improve correctness and/or performance (a vast subject).
The final answer is: with the code provided, it makes no difference whether you declare u1 and u2 inside the body of the loop or outside. A good compiler will likely generate the same code.
You can improve the code a tiny bit by testing the odd case just once:
void getRandNorm(double *randnorm, double mean, double var, int n, double pi) {
// Declare variables
double u1, u2;
double sigma = sqrt(var); // the standard deviation
int arrptr, odd;
odd = n & 1; // check if n is odd
n -= odd; // make n even
for (arrptr = 0; arrptr < n; arrptr += 2) {
// Generate two uniform random variables
u1 = rand() / (double)RAND_MAX;
u2 = rand() / (double)RAND_MAX;
// Box-Muller transform
randnorm[arrptr + 0] = sqrt(-2*log(u1)) * cos(2*pi*u2) * sigma + mean;
randnorm[arrptr + 1] = sqrt(-2*log(u2)) * cos(2*pi*u1) * sigma + mean;
}
if (odd) {
u1 = rand() / (double)RAND_MAX;
u2 = rand() / (double)RAND_MAX;
randnorm[arrptr++] = sqrt(-2*log(u1)) * cos(2*pi*u2) * sigma + mean;
}
}
Note: arrptr + 0 is here for symmetry, the compiler will not generate any code for this addition.
Regarding your question: if I run a loop a million times, do I have to worry about declaring doubles in each iteration?
The variables are being declared on the stack. So they 'disappear' when the function exits. The next execution of the function 're-creates' the variables, so (in reality) there is only a single instance of the variables and even then, only while the function is being executed.
So it does not matter how many times you call the function.
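A tiny illustration of that reuse (the addresses printed are implementation-dependent, but on typical systems both calls show the same stack slot):
#include <stdio.h>

void f(void)
{
    double x = 0.0;
    printf("%p\n", (void *)&x);  /* address of the local variable */
}

int main(void)
{
    f();
    f();  /* typically prints the same address: the stack slot is reused */
    return 0;
}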
I am trying to create a sparse distributed matrix with SuperLU, but I am running into trouble.
Based on the SuperLU documentation, I am using the following function:
void dCreate_CompRowLoc_Matrix_dist(SuperMatrix *A, int m, int n,
int nnz_loc, int m_loc, int fst_row,
double *nzval, int *colind, int *rowptr,
Stype_t stype, Dtype_t dtype, Mtype_t mtype);
but it seems that whatever I pass in, a segmentation fault happens.
I've tried passing a very simple 2x2 matrix with only 2 nonzeros, running with 1 process (that means something like):
m = 2;
n = 2;
nnz_loc = 2;
m_loc = 2;
fst_row = 0;
nzval[0] = 1.0;
nzval[1] = 2.0;
colind[0] = 0;
colind[1] = 1;
rowptr[0] = 0;
rowptr[1] = 1;
rowptr[2] = 2;
...SLU_NC, SLU_D, SLU_GE
and I keep getting a segmentation fault.
I assume I'm not fully understanding how the function should be used.
Can anyone help me with this? (If more information is needed, please let me know.)
many thanks
Simone
Update/1
Just as further information: I've noticed that the row-local matrices are correctly created (with the correct position for each element/row), but they seem not to be "collapsed" into a global SuperMatrix A.
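For reference, a hedged sketch of how the call is often assembled. Two assumptions here that are worth checking against your SuperLU_DIST version: the storage type for this row-local constructor should be the distributed SLU_NR_loc rather than SLU_NC (which describes a global compressed-column matrix, and passing it is a plausible cause of the crash), and the arrays are allocated with SuperLU's own allocators so Destroy_CompRowLoc_Matrix_dist can free them later.
#include "superlu_ddefs.h"

SuperMatrix A;
int    m = 2, n = 2, nnz_loc = 2, m_loc = 2, fst_row = 0;
double *nzval  = doubleMalloc_dist(nnz_loc);
int    *colind = intMalloc_dist(nnz_loc);
int    *rowptr = intMalloc_dist(m_loc + 1);

nzval[0] = 1.0;  nzval[1] = 2.0;                 /* the two nonzeros */
colind[0] = 0;   colind[1] = 1;                  /* 0-based column indices */
rowptr[0] = 0;   rowptr[1] = 1;  rowptr[2] = 2;  /* m_loc + 1 entries */

dCreate_CompRowLoc_Matrix_dist(&A, m, n, nnz_loc, m_loc, fst_row,
                               nzval, colind, rowptr,
                               SLU_NR_loc, SLU_D, SLU_GE);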
I'm trying to optimize some of my code in C, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
int main(void)
{
    int i;
    float data[] = {1.,2.,3.,4.,5.};
    //this fails to compile: an array is not a modifiable value
    //data *= 5.0;
    //this works
    for(i = 0; i < 5; i++) data[i] *= 5.0;
    return 0;
}
There is no shortcut; you have to step through each element of the array.
Note however that in your example, you may achieve a speedup by using int rather than float for both your data and multiplier.
If you want, you can do this through BLAS (Basic Linear Algebra Subprograms), which is optimised. BLAS is not part of the C standard; it is a package you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
int main () {
int limit =10;
float *a = calloc( limit, sizeof(float));
for ( int i = 0; i < limit ; i++){
a[i] = i;
}
cblas_sscal( limit , 0.5f, a, 1);
for ( int i = 0; i < limit ; i++){
printf("%3f, " , a[i]);
}
printf("\n");
}
The names of the functions are not obvious, but once you know the naming conventions you can start to guess what the BLAS functions do. sscal() splits into s for single precision and scal for scale, which means this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector by a constant and add it to another vector, BLAS has a function for that too: saxpy(). The name decodes the same way: s for single precision, axpy for a*x plus y, i.e. y[i] += a*x[i].
As you might guess, there is a daxpy() too, which works on doubles. A small usage sketch follows.
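A minimal sketch using cblas_saxpy(), with made-up data, assuming the same CBLAS installation as the sscal example above:
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};

    cblas_saxpy(4, 2.0f, x, 1, y, 1);  /* y[i] += 2*x[i] */

    for (int i = 0; i < 4; i++)
        printf("%g ", y[i]);           /* prints: 12 24 36 48 */
    printf("\n");
    return 0;
}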
I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for so many more "shortcuts"; however, in C, you have to access each element and then manipulate those values.
Using the for-loop would be the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply each value by 5. Check out loop tiling, for example.
data *= 5.0;
Here data is the name of the array; it decays to the array's address, which you cannot assign to.
If you want to multiply just the first value in the array, dereference it with the * operator as below:
*data *= 5.0;
I'm still pretty new to using SSE and am trying to implement a modulo of 2*Pi for double-precision inputs of the order 1e8 (the result of which will be fed into some vectorised trig calculations).
My current attempt at the code is based around the idea that mod(x, 2*Pi) = x - floor(x/(2*Pi))*2*Pi and looks like:
#define _PD_CONST(Name, Val) \
static const double _pd_##Name[2] __attribute__((aligned(16))) = { Val, Val }
_PD_CONST(2Pi, 6.283185307179586); /* = 2*pi */
_PD_CONST(recip_2Pi, 0.159154943091895); /* = 1/(2*pi) */
void vec_mod_2pi(const double * vec, int Size, double * modAns)
{
__m128d sse_a, sse_b, sse_c;
int i;
int k = 0;
double t = 0;
unsigned int initial_mode;
initial_mode = _MM_GET_ROUNDING_MODE();
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
for (i = 0; i < Size; i += 2)
{
sse_a = _mm_loadu_pd(vec+i);
sse_b = _mm_mul_pd( _mm_cvtepi32_pd( _mm_cvtpd_epi32( _mm_mul_pd(sse_a, *(__m128d*)_pd_recip_2Pi) ) ), *(__m128d*)_pd_2Pi);
sse_c = _mm_sub_pd(sse_a, sse_b);
_mm_storeu_pd(modAns+i,sse_c);
}
k = i-2;
for (i = 0; i < Size%2; i++)
{
t = (double)((int)(vec[k+i] * 0.159154943091895)) * 6.283185307179586;
modAns[k+i] = vec[k+i] - t;
}
_MM_SET_ROUNDING_MODE(initial_mode);
}
Unfortunately, this is currently returning a lot of NaNs, with a couple of answers of 1.128e119 as well (somewhat outside the range of 0 -> 2*Pi that I was aiming for!). I suspect that where I'm going wrong is in the double-to-int-to-double conversion that I'm trying to use to do the floor.
Can anyone suggest where I've gone wrong and how to improve it?
P.S. sorry about the format of that code, it's the first time I've posted a question on here and can't seem to get it to give me empty lines within the code block to make it readable.
If you want any kind of accuracy, the simple algorithm is terribly bad. For an accurate range-reduction algorithm, see e.g. Ng et al., ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit (now available via the Wayback Machine: 2012-12-24).
For large arguments, the Payne-Hanek algorithm is typically used. However, the Payne-Hanek paper is quite difficult to read, and I suggest having a look at Chapter 11 of the Handbook of Floating-Point Arithmetic for a more accessible explanation.
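If last-bit accuracy is not required, a middle ground between the single-constant subtraction in the question and full Payne-Hanek is a Cody-Waite style split of 2*pi. A scalar sketch of the idea follows; it vectorises exactly like the loop in the question, and production versions choose the split so that k*two_pi_hi is computed exactly:
#include <math.h>

static double mod_2pi(double x)
{
    static const double two_pi_hi    = 6.283185307179586;      /* 2*pi rounded to double */
    static const double two_pi_lo    = 2.4492935982947064e-16; /* 2*pi minus two_pi_hi */
    static const double recip_two_pi = 0.15915494309189535;    /* 1/(2*pi) */

    double k = floor(x * recip_two_pi);
    /* subtracting the low part recovers bits lost to rounding two_pi_hi,
       but for x ~ 1e8 roughly 27 bits are still lost to cancellation */
    return (x - k * two_pi_hi) - k * two_pi_lo;
}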