I am trying to accelerate RSA encryption using CUDA, but I can't get power-modulo to work correctly in the kernel function.
I am using Cuda compilation tools on AWS, release 9.0, V9.0.176 to compile.
#include <cstdio>
#include <math.h>
#include "main.h"
// Kernel function to encrypt the message (m_in) elements into cipher (c_out)
__global__
void enc(int numElements, int e, int n, int *m_in, int *c_out)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    printf("e = %d, n = %d, numElements = %d\n", e, n, numElements);
    for (int i = index; i < numElements; i += stride)
    {
        // POINT OF ERROR //
        // c_out[i] = (m_in[i]^e) % n; //**GIVES WRONG RESULTS**
        c_out[i] = __pow(m_in[i], e) % n; //**GIVES, error: expression must have integral or enum type**
    }
}
// This function is called from main() from other file.
int* cuda_rsa(int numElements, int* data, int public_key, int key_length)
{
    int e = public_key;
    int n = key_length;
    // Allocate Unified Memory – accessible from CPU or GPU
    int* message_array;
    cudaMallocManaged(&message_array, numElements*sizeof(int));
    int* cipher_shared_array; //Array shared by CPU and GPU
    cudaMallocManaged(&cipher_shared_array, numElements*sizeof(int));
    int* cipher_array = (int*)malloc(numElements * sizeof(int));
    //Put message array to be encrypted in a managed array
    for(int i=0; i<numElements; i++)
    {
        message_array[i] = data[i];
    }
    // Run kernel on 16M elements on the GPU
    enc<<<1, 1>>>(numElements, e, n, message_array, cipher_shared_array);
    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();
    //Copy into a host array and pass it to main() function for verification.
    //Ignored memory leaks.
    for(int i=0; i<numElements; i++)
    {
        cipher_array[i] = cipher_shared_array[i];
    }
    return (cipher_array);
}
Please help me with this error.
How can I implement power-modulo (as follows) on CUDA kernel?
(x ^ y) % n;
I would really appreciate any help.
In C or C++, this:
(x^y)
does not raise x to the power of y. It performs a bitwise exclusive-or operation. That is why your first realization does not give the correct answer.
In C or C++, the modulo arithmetic operator:
%
is only defined for integer arguments. Even though you are passing integers to the __pow() function, the return result of that function is a double (i.e. a floating-point quantity, not an integer quantity).
I don't know the details of the math you need to perform, but if you cast the result of __pow to an int (for example) this compile error will disappear. That may or may not be valid for whatever arithmetic you wish to perform. (For example, you may wish to cast it to a "long" integer quantity.)
After you do that, you will run into another compile error. The easiest approach is to use pow() instead of __pow():
c_out[i] = (int)pow(m_in[i], e) % n;
If you were actually trying to use the CUDA fast-math intrinsic, you should use __powf not __pow:
c_out[i] = (int)__powf(m_in[i], e) % n;
Note that fast-math intrinsics generally have reduced precision.
Since these raise-to-power functions are performing floating-point arithmetic (even though you are passing integers), it is possible to get unexpected results. For example, if you raise 5 to the power of 2, it's possible to get 24.9999999999 instead of 25. If you simply cast this to an integer quantity, you will get truncation to 24. Therefore you may need to explore rounding your result to the nearest integer, instead of casting. But again, I haven't studied the math you desire to perform.
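If the end goal is RSA-style modular exponentiation on integers, another option (not what your code does now) is to avoid floating point entirely and use square-and-multiply, reducing modulo n at every step. A minimal sketch of such a device function (the name powmod and the 64-bit intermediates are my choices, and this ignores whether int is actually wide enough for realistic RSA moduli):

__device__ int powmod(int base, int e, int n)
{
    // square-and-multiply: result = base^e mod n, assuming e >= 0 and n > 0
    unsigned long long result = 1;
    unsigned long long b = ((base % n) + n) % n;   // normalize base into [0, n)
    while (e > 0)
    {
        if (e & 1)                                 // low bit set: multiply in this factor
            result = (result * b) % n;
        b = (b * b) % n;                           // square the base for the next bit
        e >>= 1;
    }
    return (int)result;
}

In the kernel that would be used as c_out[i] = powmod(m_in[i], e, n);, with no casting or rounding concerns.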
Related
I made a matrix-vector multiplication program using AVX2 and FMA in C. I compiled it with GCC 7 using -mfma and -mavx.
However, I got the error "incorrect checksum for freed object - object was probably modified after being freed."
I think the error occurs when the matrix dimension isn't a multiple of 4.
I know AVX2 uses ymm registers that can hold 4 double-precision floating-point numbers. Therefore, I can use AVX2 without error when the matrix dimension is a multiple of 4.
But here is my question.
How can I use AVX2 efficiently if the matrix dimension isn't a multiple of 4?
Here is my code.
#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "time.h"
#include "x86intrin.h"
void mv(double *a,double *b,double *c, int m, int n, int l)
{
__m256d va,vb,vc;
int k;
int i;
for (k = 0; k < l; k++) {
vb = _mm256_broadcast_sd(&b[k]);
for (i = 0; i < m; i+=4) {
va = _mm256_loadu_pd(&a[m*k+i]);
vc = _mm256_loadu_pd(&c[i]);
vc = _mm256_fmadd_pd(vc, va, vb);
_mm256_storeu_pd( &c[i], vc );
}
}
}
int main(int argc, char* argv[]) {
// set variables
int m;
double* a;
double* b;
double* c;
int i;
int temp=0;
struct timespec startTime, endTime;
m=9;
// main program
// set vector or matrix
a=(double *)malloc(sizeof(double) * m*m);
b=(double *)malloc(sizeof(double) * m*1);
c=(double *)malloc(sizeof(double) * m*1);
for (i=0;i<m;i++) {
a[i]=1;
b[i]=1;
c[i]=0.0;
}
for (i=m;i<m*m;i++) {
a[i]=1;
}
// check start time
clock_gettime(CLOCK_REALTIME, &startTime);
mv(a, b, c, m, 1, m);
// check end time
clock_gettime(CLOCK_REALTIME, &endTime);
free(a);
free(b);
free(c);
return 0;
}
You load and store vectors of 4 double, but your loop condition only checks that the first vector element is in-bounds, so you can write outside objects by up to 3x8 = 24 bytes when m is not a multiple of 4.
You need something like i < (m-3) in the main loop, and a cleanup strategy for handling the last partial vector of data. Vectorizing with SIMD is very much like unrolling: you have to check that it's ok to do multiple future elements in the loop condition.
A scalar cleanup loop works well, but we can do better. For example, do as many 128-bit vectors as possible after the last full 256-bit vector (i.e. up to 1), before going scalar.
In many cases (e.g. write-only destination) an unaligned vector load that ends at the end of your arrays is very good (when m>=4). It can overlap with your main loop if m%4 != 0, but that's fine because your output array doesn't overlap your inputs, so redoing an element as part of a single cleanup is cheaper than branching to avoid it.
But that doesn't work here, because your logic is c[i+0..3] += ..., so redoing an element would make it wrong.
// cleanup using a 128-bit FMA, then scalar if there's an odd element.
// untested
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    /* the loop below should actually work for m=1..3, but a separate strategy might be good.
    if (m < 4) {
        // maybe check m >= 2 and use __m128 vectors?
        // or vectorize differently?
    }
    */
    for (int k = 0; k < l; k++) {
        __m256d vb = _mm256_broadcast_sd(&b[k]);
        int i;
        for (i = 0; i < (m-3); i+=4) {
            __m256d va = _mm256_loadu_pd(&a[m*k+i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(va, vb, vc);   // c[i..i+3] += a[m*k+i ..] * b[k]
            _mm256_storeu_pd( &c[i], vc );
        }
        if (i<(m-1)) {
            __m128d lasta = _mm_loadu_pd(&a[m*k+i]);
            __m128d lastc = _mm_loadu_pd(&c[i]);
            lastc = _mm_fmadd_pd(lasta, _mm256_castpd256_pd128(vb), lastc);
            _mm_storeu_pd( &c[i], lastc );
            // i+=2; // last element only checks m odd/even, doesn't use i
        }
        // if (i<m)
        if (m&1) {
            // odd number of elements, do the last non-vector one
            c[m-1] += a[m*k + m-1] * _mm256_cvtsd_f64(vb);
        }
    }
}
I haven't looked at exactly how gcc/clang -O3 compile that. Sometimes compilers try to get too smart with cleanup code (e.g. trying to auto-vectorize scalar cleanup loops).
Other strategies could include doing the last up-to-4 elements with an AVX masked store: you need the same mask for the end of every matrix row, so generating it once and then using it at the end of every row could be good. See Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. (To simplify branching, you'd set it up so your main loop only goes to i < (m-4), then you always run the cleanup. In the m%4 == 0 case, the mask is all-ones so you do the final full vector.) If you can't safely read past the end of the matrix, you probably need a masked load as well as masked store.
You could also look at aligning your rows for efficiency, or a row stride that's separate from the logical length of rows. (i.e. pad rows out to 32-byte boundaries). Leaving padding at the end of rows simplifies the cleanup, because you can always do whole vectors that write padding.
Special case m==2: instead of broadcasting one element from b[], you'd like to broadcast 2 elements into two 128-bit lanes of a __m256d, so one 256-bit FMA could do 2 rows at once.
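For the masked-load/store strategy mentioned above, the mask for the final m % 4 elements of a row could be generated roughly like this (a sketch, assuming AVX2 for the 64-bit integer compare and the same x86intrin.h include as the question; tail_mask_pd is a name I'm introducing):

// Build a per-lane mask whose low `remaining` lanes are all-ones (remaining in 0..4).
static inline __m256i tail_mask_pd(int remaining)
{
    __m256i lane = _mm256_set_epi64x(3, 2, 1, 0);
    __m256i rem  = _mm256_set1_epi64x(remaining);
    return _mm256_cmpgt_epi64(rem, lane);          // lane < remaining ? all-ones : zero
}
// usage for the final partial vector of a row:
//   __m256i mask = tail_mask_pd(m - i);
//   __m256d va = _mm256_maskload_pd(&a[m*k+i], mask);
//   __m256d vc = _mm256_maskload_pd(&c[i], mask);
//   vc = _mm256_fmadd_pd(va, vb, vc);
//   _mm256_maskstore_pd(&c[i], mask, vc);

Since the mask only depends on m, it can be computed once before the loops and reused at the end of every row.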
I have a function like this in C (in pseudo-ish code, dropping the unimportant parts):
int func(int s, int x, int* a, int* r) {
int i;
// do some stuff
for (i=0;i<a_really_big_int;++i) {
if (s) r[i] = x ^ i;
else r[i] = x ^ a[i];
// and maybe a couple other ways of computing r
// that are equally fast individually
}
// do some other stuff
}
This code gets called so much that this loop is actually a speed bottleneck in the code. I am wondering a couple things:
Since the switch s is a constant in the function, will good compilers optimize the loop so that the branch isn't slowing things down all the time?
If not, what is a good way to optimize this code?
====
Here is an update with a fuller example:
int func(int s,
int start,int stop,int stride,
double *x,double *b,
int *a,int *flips,int *signs,int i_max,
double *c)
{
int i,k,st;
for (k=start; k<stop; k += stride) {
b[k] = 0;
for (i=0;i<i_max;++i) {
/* this is the code in question */
if (s) st = k^flips[i];
else st = a[k]^flips[i];
/* done with code in question */
b[k] += x[st] * (__builtin_popcount(st & signs[i])%2 ? -c[i] : c[i]);
}
}
}
EDIT 2:
In case anyone is curious, I ended up refactoring the code and hoisting the whole inner for loop (with i_max) outside, making the really_big_int loop much simpler and hopefully easy to vectorize! (and also avoiding a bunch of extra logic a zillion times)
One obvious way to optimize the code is to pull the conditional outside the loop:
if (s)
for (i=0;i<a_really_big_int;++i) {
r[i] = x ^ i;
}
else
for (i=0;i<a_really_big_int;++i) {
r[i] = x ^ a[i];
}
A shrewd compiler might be able to change that into r[] assignments of more than one element at a time.
Micro-optimizations
Usually they are not worth the time - reviewing larger issues is more effective.
Yet to micro-optimize, trying a variety of approaches and then profiling them to find the best can make for modest improvements.
In addition to @wallyk's and @kabanus's fine answers, some simplistic compilers benefit from a loop that counts down to 0.
// for (i=0;i<a_really_big_int;++i) {
for (i = a_really_big_int; i--; ) {
[edit 2nd optimization]
OP added a more complete example. One of the issues is that the compiler cannot assume that the memory pointed to by b and the other pointers does not overlap. This prevents certain optimizations.
Assuming they in fact do not overlap, use restrict on b to allow optimizations. const helps too for weaker compilers that do not deduce it. restrict on the others may also benefit, again, if the referenced data does not overlap.
// int func(int s, int start, int stop, int stride, double *x,
// double *b, int *a, int *flips,
// int *signs, int i_max, double *c) {
int func(int s, int start, int stop, int stride, const double * restrict x,
double * restrict b, const int * restrict a, const int * restrict flips,
const int * restrict signs, int i_max, double *c) {
All the operations in your loop are quick O(1) operations. The if is definitely optimized, and so is your for+if, as long as all your statements are of the form r[i]=somethingquick. The question may boil down, for you, to how small a_really_big_int can be.
A quick int main that just goes from INT_MIN to INT_MAX summing into a long variable takes ~10 seconds for me on the Ubuntu subsystem on Windows. Your statements may multiply this by a few, which quickly gets to a minute. Bottom line: this may not be avoidable if you really are iterating a ton.
If r[i] are calculated independently, this would be a classic usage for threading/multi-processing.
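For example, with OpenMP the independent iterations can be split across threads with a single pragma. A minimal sketch (the function name, n, and the loop body are placeholders mirroring the question; compile with -fopenmp):

// each r[i] depends only on i, x, s and a[i], so iterations are independent
void fill_r(int s, int x, const int *a, int *r, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        r[i] = s ? (x ^ i) : (x ^ a[i]);
}

The ternary on s is still there, but as discussed it is trivially predictable, or it can be hoisted out exactly as in the other answer.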
EDIT:
I think % is optimized anyway by the compiler, but if not, take care that x & 1 is much faster for an odd/even check.
Assuming x86_64, you can ensure that the pointers are aligned to 16 bytes and use intrinsics. If the code will only run on systems with AVX2, you could use the __m256i variants (and similarly for AVX-512).
int func(int s, int x, const __m128i* restrict a, __m128i* restrict r) {
size_t i = 0, max = a_really_big_int / 4;
__m128i xv = _mm_set1_epi32(x);
// do some stuff
if (s) {
__m128i iv = _mm_set_epi32(3,2,1,0); // arguments go from highest element to lowest, so lanes 0..3 hold 0,1,2,3
__m128i four = _mm_set1_epi32(4);
for ( ;i<max; ++i, iv=_mm_add_epi32(iv,four)) {
r[i] = _mm_xor_si128(xv,iv);
}
}else{ /*not (s)*/
for (;i<max;++i){
r[i] = _mm_xor_si128(xv,a[i]);
}
}
// do some other stuff
}
Although the if statement will be optimized away on any decent compiler (unless you asked the compiler not to optimize), I would consider writing the optimization in (just in case you compile without optimizations).
In addition, although the compiler might optimize the "absolute" if statement, I would consider optimizing it manually, either using any available builtin, or using bitwise operations.
i.e.
b[k] += x[st] *
        ( ((__builtin_popcount(st & signs[i]) & 1) *
           ((int)0xFFFFFFFFFFFFFFFF)) ^ c[i] );
This takes the last bit of the popcount (1 == odd, 0 == even), multiplies it by the constant (all bits 1 if odd, all bits 0 if even) and then XORs it with the c[i] value, flipping all of c[i]'s bits when the popcount is odd (i.e. ~c[i]).
This will avoid instruction jumps in cases where the second absolute if statement isn't optimized.
P.S.
I used an 8-byte long value and truncated its length by casting it to an int. This is because I have no idea how long an int might be on your system (it's 4 bytes on mine, which is 0xFFFFFFFF).
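Since c is a double * in the fuller example, the integer XOR sketched above will not compile as written; one way to express the same branchless idea for doubles is to toggle the IEEE-754 sign bit directly (a sketch; flip_sign_if is my name for the helper):

#include <stdint.h>
#include <string.h>

// Returns v with its sign flipped when flip is 1, unchanged when flip is 0.
static inline double flip_sign_if(double v, unsigned flip)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);     // type-pun via memcpy, no aliasing issues
    bits ^= (uint64_t)flip << 63;       // toggle the IEEE-754 sign bit
    memcpy(&v, &bits, sizeof bits);
    return v;
}

Used in the inner loop it would read:

b[k] += x[st] * flip_sign_if(c[i], __builtin_popcount(st & signs[i]) & 1);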
I have the following problem: given two sorted arrays A and B, I have to produce a sorted array C with the elements of A and B.
I found some solution for solving this problem using CUDA: Merge Path, for example http://www.cc.gatech.edu/~bader/papers/GPUMergePath-ICS2012.pdf
However, their problem is given by the size of the original arrays, at least 10k elements. I have a different problem.
The arrays I have to merge are much smaller (1000 elements at most), and the complexity comes from the number of merges to be done: on the order of 10^10, since ~10^5 arrays of size ~1000 have to be merged with each other.
Part of their problem is to split the original arrays into equally sized parts that are processed in parallel. The arrays I have to merge are small enough to entirely fit in the GPU shared memory.
Thrust is not my first choice because the output of my procedure is not the sorted array itself but a calculation with its elements, so I think that a specialized kernel should be faster than first sorting the element indices and then using them for the calculation.
A serial version of the algorithm looks like:
i = 0
j = 0
k = 0
T = 4
while i < N and j < M:
    if A[i] < B[j]:
        start_i = max(0, i-T)
        C[k] = sum(A[start_i:i+1])
        i += 1
    else:
        start_j = max(0, j-T)
        C[k] = sum(B[start_j:j+1])
        j += 1
    k += 1
while i < N:
    start_i = max(0, i-T)
    C[k] = sum(A[start_i:i+1])
    i += 1
    k += 1
while j < M:
    start_j = max(0, j-T)
    C[k] = sum(B[start_j:j+1])
    j += 1
    k += 1
How can I exploit CUDA capabilities to solve this problem?
The two most important optimization goals for any CUDA program should be to:
expose (sufficient) parallelism
make efficient use of memory
There are certainly many other things that can be considered during optimization, but these are the two most important items to address first.
A merge operation (not quite the same as a merge-sort) is, at first glance, an inherently serial operation. We cannot make a proper decision about which item to choose, from either A or B input array, to place next in the output array C, until we have made all previous selections in C. In this respect, the merge algorithm (in this realization) makes it difficult to expose parallelism, and the paper linked in the question is almost entirely focused on that topic.
The goal of the algorithm described in the paper is to decompose the two input vectors A and B into multiple smaller pieces that can be worked on independently, so as to expose parallelism. In particular, the goal is to keep all the SMs in a GPU busy, and keep all the SP's in an SM busy. Once a sufficient decomposition of work is performed, each SP is ultimately performing a sequential merge (as mentioned in the paper):
Merging stage - Each core merges the two sub arrays
that it has been given using the same algorithm as a
simple sequential merge.
However, as you've pointed out, what you want to do is somewhat different. You already have many arrays, and you want to perform independent merge operations on those many arrays. Since your array count is ~100,000, this is enough independent pieces of work to consider mapping each to a GPU SP (ie. thread). This means that we can then, just as in the paper, use a simple sequential merge on each core/SP/thread. So the problem of exposing parallelism is in your case, already done (to perhaps a sufficient degree).
At this point we could consider implementing this as-is. The code I show later offers this as a starting point for comparison. However what we discover is the performance is not very good, and this is due to the fact that a merge algorithm fundamentally has a data-dependent access sequence, and so it is (more) difficult to arrange for coalesced access on the GPU. The authors of the paper propose to mitigate this problem by first reading the data (in a coalesced fashion) into shared memory, and then having the algorithm work on it out of shared memory, where the penalty for disorganized access is less.
I'll propose a different approach:
arrange the sequential merge algorithm so that each element of A and B need only be read once
arrange the storage of A, B, and C in column-major form as opposed to the more "natural" row-major storage that one might consider. This is effectively transposing the storage matrices for A, B, and C vectors. This allows for an improvement in coalesced access, as the GPU threads navigate their way through the merging operation on their individual A and B vectors. It's far from perfect, but the improvement is substantial.
Here's a worked example that implements the above idea, running a simple sequential merge in each thread, and each thread merging one of the A vectors with one of the B vectors:
$ cat t784.cu
#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#define NUM_SETS 100000
#define DSIZE 100
typedef int mytype;
// for ascending sorted data
#define cmp(A,B) ((A)<(B))
#define nTPB 512
#define nBLK 128
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__host__ __device__ void smerge(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, const unsigned len_a, const unsigned len_b, const unsigned stride_a = 1, const unsigned stride_b = 1, const unsigned stride_c = 1){
unsigned len_c = len_a+len_b;
unsigned nc = 0;
unsigned na = 0;
unsigned nb = 0;
unsigned fa = (len_b == 0);
unsigned fb = (len_a == 0);
T nxta = a[0];
T nxtb = b[0];
while (nc < len_c){
if (fa) {c[stride_c*nc++] = nxta; na++; nxta = a[stride_a*na];}
else if (fb) {c[stride_c*nc++] = nxtb; nb++; nxtb = b[stride_b*nb];}
else if (cmp(nxta,nxtb)){
c[stride_c*nc++] = nxta;
na++;
if (na == len_a) fb++;
else nxta = a[stride_a*na];}
else {
c[stride_c*nc++] = nxtb;
nb++;
if (nb == len_b) fa++;
else nxtb = b[stride_b*nb];}}
}
template <typename T>
__global__ void rmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
while (idx < num_arr){
int sel=idx*len;
smerge(a+sel, b+sel, c+(2*sel), len, len);
idx += blockDim.x*gridDim.x;}
}
template <typename T>
__global__ void cmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len, int stride_a, int stride_b, int stride_c){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
while (idx < num_arr){
smerge(a+idx, b+idx, c+idx, len, len, stride_a, stride_b, stride_c);
idx += blockDim.x*gridDim.x;}
}
template <typename T>
int rmvalidate(T *a, T *b, T *c, int num_arr, int len){
T *vc = (T *)malloc(2*len*sizeof(T));
for (int i = 0; i < num_arr; i++){
thrust::merge(a+(i*len), a+((i+1)*len), b+(i*len), b+((i+1)*len), vc);
#ifndef TIMING
for (int j = 0; j < len*2; j++)
if (vc[j] != c[(i*2*len)+j]) {printf("rm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c[(i*2*len)+j], vc[j]); return 0;}
#endif
}
return 1;
}
template <typename T>
int cmvalidate(const T *c1, const T *c2, int num_arr, int len){
for (int i = 0; i < num_arr; i++)
for (int j = 0; j < 2*len; j++)
if (c1[i*(2*len)+j] != c2[j*(num_arr)+i]) {printf("cm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c2[j*(num_arr)+i], c1[i*(2*len)+j]); return 0;}
return 1;
}
int main(){
mytype *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
h_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
h_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
h_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype)*2);
cudaMalloc(&d_a, (DSIZE*NUM_SETS+1)*sizeof(mytype));
cudaMalloc(&d_b, (DSIZE*NUM_SETS+1)*sizeof(mytype));
cudaMalloc(&d_c, DSIZE*NUM_SETS*sizeof(mytype)*2);
// test "row-major" storage
for (int i =0; i<DSIZE*NUM_SETS; i++){
h_a[i] = rand();
h_b[i] = rand();}
thrust::sort(h_a, h_a+DSIZE*NUM_SETS);
thrust::sort(h_b, h_b+DSIZE*NUM_SETS);
cudaMemcpy(d_a, h_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
unsigned long gtime = dtime_usec(0);
rmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE);
cudaDeviceSynchronize();
gtime = dtime_usec(gtime);
cudaMemcpy(h_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
unsigned long ctime = dtime_usec(0);
if (!rmvalidate(h_a, h_b, h_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
ctime = dtime_usec(ctime);
printf("CPU time: %f, GPU RM time: %f\n", ctime/(float)USECPSEC, gtime/(float)USECPSEC);
// test "col-major" storage
mytype *ch_a, *ch_b, *ch_c;
ch_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
for (int i = 0; i < NUM_SETS; i++)
for (int j = 0; j < DSIZE; j++){
ch_a[j*NUM_SETS+i] = h_a[i*DSIZE+j];
ch_b[j*NUM_SETS+i] = h_b[i*DSIZE+j];}
cudaMemcpy(d_a, ch_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, ch_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
gtime = dtime_usec(0);
cmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE, NUM_SETS, NUM_SETS, NUM_SETS );
cudaDeviceSynchronize();
gtime = dtime_usec(gtime);
cudaMemcpy(ch_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
if (!cmvalidate(h_c, ch_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
printf("GPU CM time: %f\n", gtime/(float)USECPSEC);
return 0;
}
$ nvcc -O3 -DTIMING -o t784 t784.cu
$ ./t784
CPU time: 0.030691, GPU RM time: 0.045814
GPU CM time: 0.002784
$
Notes:
The GPU is actually slower than the naive single-threaded CPU code when the memory organization is row major. But for the column-major organization (which tends to improve opportunities for coalesced access) the GPU code is about 10x faster than the CPU code for my test case. This ~10x speedup factor is roughly in the range (~10-20x) of the speedup factors shown in the paper for a GPU MergePath 32-bit integer speedup vs. x86 serial merge.
using int vs. float datatypes makes a significant difference in the CPU timing. int seems to be faster (on the CPU) so I'm showing that version here. (This disparity is mentioned in the paper as well.)
The -DTIMING switch added to the compile command pares down the first validation function so that it just does the CPU merge operation, for timing.
The basic merge code is templated to be able to handle different data types, and parameterized so that it can be used in either the column-major or row-major operation.
I've dispensed with CUDA error checking for brevity of presentation. However, any time you're having trouble with a CUDA code, you should always use proper cuda error checking.
What about using thrust (as I suggested in the comments)? It should be possible to use thrust::merge with a suitable device/sequential execution policy, to more or less mimic what I have done above. However, thrust expects vectors to be contiguous, and so, without additional complexity, it could only be used in the row-major case, which we've seen is severely penalized by bad memory access patterns. It should be possible to create a set of permutation iterators in thrust that would allow the column-major, strided access that improves the memory scenario, but I have not pursued that.
I'm trying to optimize some of my code in C, which is a lot bigger than the snippet below. Coming from Python, I wonder whether you can simply multiply an entire array by a number like I do below.
Evidently, it does not work the way I do it below. Is there any other way that achieves the same thing, or do I have to step through the entire array as in the for loop?
void main()
{
int i;
float data[] = {1.,2.,3.,4.,5.};
//this fails
data *= 5.0;
//this works
for(i = 0; i < 5; i++) data[i] *= 5.0;
}
There is no shortcut; you have to step through each element of the array.
Note however that in your example, you may achieve a speedup by using int rather than float for both your data and multiplier.
If you want to, you can do what you want through BLAS, Basic Linear Algebra Subprograms, which is optimised. This is not in the C standard, it is a package which you have to install yourself.
Sample code to achieve what you want:
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
int main () {
int limit =10;
float *a = calloc( limit, sizeof(float));
for ( int i = 0; i < limit ; i++){
a[i] = i;
}
cblas_sscal( limit , 0.5f, a, 1);
for ( int i = 0; i < limit ; i++){
printf("%3f, " , a[i]);
}
printf("\n");
}
The names of the functions are not obvious, but by reading the guidelines you can start to guess what the BLAS functions do. sscal() can be split into s for single precision and scal for scale, which means that this function works on floats. The same function for double precision is called dscal().
If you need to scale a vector by a constant and add it to another, BLAS has a function for that too:
saxpy()
  s        -> single precision (float)
  a x p y  -> a*x plus y, i.e. y[i] += a*x[i]
As you might guess there is a daxpy() too which works on doubles.
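A call to the single-precision version might look like this (a sketch; n, alpha, x and y are placeholders for your own size, scalar and arrays):

// y[i] += alpha * x[i] for i = 0..n-1, both vectors with unit stride
cblas_saxpy(n, alpha, x, 1, y, 1);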
I'm afraid that, in C, you will have to use for(i = 0; i < 5; i++) data[i] *= 5.0;.
Python allows for so many more "shortcuts"; however, in C, you have to access each element and then manipulate those values.
Using the for-loop would be the shortest way to accomplish what you're trying to do to the array.
EDIT: If you have a large amount of data, there are more efficient (in terms of running time) ways to multiply 5 to each value. Check out loop tiling, for example.
data *= 5.0;
Here data is the address of the array, which is a constant; you cannot assign to it.
If you only want to multiply the first value in that array, dereference it with the * operator as below.
*data *= 5.0;
I am reprogramming a piece of MATLAB code in mex (using C). So far my C version of the MATLAB code is about twice as fast as the MATLAB code. Now I have three questions, all related to the code below:
How can I speed up this code more?
Do you see any problems with this code? I ask this because I don't know mex very well and I am also not a C guru ;-) ... I am aware that there should be some checks in the code (for example, whether realloc actually got the memory it asked for), but I left those out for the sake of simplicity for the moment.
Is it possible that MATLAB is optimizing so well that I really can't get much more than twice as fast code in C...?
The code should be more or less platform independent (Win, Linux, Unix, Mac, different hardware), so I don't want to use assembler or specific linear algebra libraries. That's why I programmed the stuff myself...
#include <mex.h>
#include <math.h>
#include <matrix.h>
void mexFunction(
int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double epsilon = ((double)(mxGetScalar(prhs[0])));
int strengthDim = ((int)(mxGetScalar(prhs[1])));
int lenPartMat = ((int)(mxGetScalar(prhs[2])));
int numParts = ((int)(mxGetScalar(prhs[3])));
double *partMat = mxGetPr(prhs[4]);
const mxArray* verletListCells = prhs[5];
mxArray *verletList;
double *pseSum = (double *) malloc(numParts * sizeof(double));
for(int i = 0; i < numParts; i++) pseSum[i] = 0.0;
float *tempVar = NULL;
for(int i = 0; i < numParts; i++)
{
verletList = mxGetCell(verletListCells,i);
int numberVerlet = mxGetM(verletList);
tempVar = (float *) realloc(tempVar, numberVerlet * sizeof(float) * 2);
for(int a = 0; a < numberVerlet; a++)
{
tempVar[a*2] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1] - partMat[i];
tempVar[a*2 + 1] = partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + lenPartMat] - partMat[i + lenPartMat];
tempVar[a*2] = pow(tempVar[a*2],2);
tempVar[a*2 + 1] = pow(tempVar[a*2 + 1],2);
tempVar[a*2] = tempVar[a*2] + tempVar[a*2 + 1];
tempVar[a*2] = sqrt(tempVar[a*2]);
tempVar[a*2] = 4.0/(pow(epsilon,2) * M_PI) * exp(-(pow((tempVar[a*2]/epsilon),2)));
pseSum[i] = pseSum[i] + ((partMat[((int) (*(mxGetPr(verletList) + a))) - 1 + 2*lenPartMat] - partMat[i + (2 * lenPartMat)]) * tempVar[a*2]);
}
}
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
for(int a = 0; a < numParts; a++)
{
*(mxGetPr(plhs[0]) + a) = pseSum[a];
}
free(tempVar);
free(pseSum);
}
So this is the improved version, which is about 12 times faster than the MATLAB version. The conversion is still eating up a lot of time, but I'm leaving that alone for now, because I would have to change something in MATLAB for it. So first the focus is on the remaining C code. Do you see any more potential in the following code?
#include <mex.h>
#include <math.h>
#include <matrix.h>
void mexFunction(
int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
double epsilon = ((double)(mxGetScalar(prhs[0])));
int strengthDim = ((int)(mxGetScalar(prhs[1])));
int lenPartMat = ((int)(mxGetScalar(prhs[2])));
double *partMat = mxGetPr(prhs[3]);
const mxArray* verletListCells = prhs[4];
int numParts = mxGetM(verletListCells);
mxArray *verletList;
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
double *pseSum = mxGetPr(plhs[0]);
double epsilonSquared = epsilon*epsilon;
double preConst = 4.0/((epsilonSquared) * M_PI);
int numberVerlet = 0;
double tempVar[2];
for(int i = 0; i < numParts; i++)
{
verletList = mxGetCell(verletListCells,i);
double *verletListPtr = mxGetPr(verletList);
numberVerlet = mxGetM(verletList);
for(int a = 0; a < numberVerlet; a++)
{
int adress = ((int) (*(verletListPtr + a))) - 1;
tempVar[0] = partMat[adress] - partMat[i];
tempVar[1] = partMat[adress + lenPartMat] - partMat[i + lenPartMat];
tempVar[0] = tempVar[0]*tempVar[0] + tempVar[1]*tempVar[1];
tempVar[0] = preConst * exp(-(tempVar[0]/epsilonSquared));
pseSum[i] += (partMat[adress + 2*lenPartMat] - partMat[i + (2*lenPartMat)]) * tempVar[0];
}
}
}
You do not need to allocate pseSum for local use and then later copy the data to the output. You can simply allocate a MATLAB object and get the pointer to its memory:
plhs[0] = mxCreateDoubleMatrix(numParts,1,mxREAL);
pseSum = mxGetPr(plhs[0]);
Thus you will not have to initialize pseSum to 0, because MATLAB already does it in mxCreateDoubleMatrix.
Remove all the mxGetPr from the inner loop and assign them to variables before.
Instead of casting doubles to ints consider using int32 or uint32 arrays in MATLAB. Casting double to int is expensive. The internal loop computations would look like
tempVar[a*2] = partMat[somevar[a] - 1] - partMat[i];
You use such constructs in your code
((int) (*(mxGetPr(verletList) + a)))
You do it because verletList is a 'double' array (that is the default in MATLAB), which holds integer values. Instead, you should use an integer array. Before you call your mex file, type in MATLAB:
verletList = int32(verletList);
Then you will not need the type cast to int above. You will simply write
((int*)mxGetData(verletList))[a]
or better yet, assign earlier
somevar = (int*)mxGetData(verletList);
and later write
somevar[a]
precompute 4.0/(pow(epsilon,2) * M_PI) before all loops! That is one expensive constant.
pow((tempVar[a*2]/epsilon),2)) is simply tempVar[a*2]^2/epsilon^2. You calculate sqrt(tempVar[a*2]) just before. Why do you square it now?
Generally do not use pow(x, 2). Just write x*x
I would add some sanity checks on the parameters, especially if you demand integers. Either use MATLAB's int32/uint32 type, or check that what you get actually is an integer.
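A minimal sketch of such a check in the gateway function (the message identifier and wording are arbitrary):

// inside the loop over cells: verify each cell really is int32 before using mxGetData
mxArray *vl = mxGetCell(verletListCells, i);
if (vl == NULL || !mxIsInt32(vl))
    mexErrMsgIdAndTxt("pse:verletList", "verletList{%d} must be an int32 vector.", i + 1);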
Edit in the new code
compute -1/epsilonSquared before the loops and compute exp(minvepssq*tempVar[0]). Note that the result might differ slightly. It depends what you need, but if you don't care about the exact order of operations, do it.
define a register variable pseSum_r and use it to sum the results in the inner loop; after the loop, assign it to pseSum[i] (see the combined sketch after this list). If you want more fun, you can write the result to memory using an SSE streaming store (the _mm_stream_pd compiler intrinsic).
do remove double to int cast
most likely irrelevant, but try to change tempVar[0/1] to normal variables. Irrelevant, because the compiler should do that for you. But again, an array is not needed here.
parallelise the external loop with OpenMP. Trivial (at least the simplest version without thinking about data layout for NUMA architectures) since there is no dependence between the iterations.
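Putting several of these points together, the body of the outer loop might end up looking roughly like this (a sketch only; minvepssq and pseSum_r are names introduced here, it assumes the verlet lists were converted with int32() on the MATLAB side, and the OpenMP pragma is only appropriate if the mx* calls used inside are safe to issue from worker threads, which should be checked against the MEX documentation):

double minvepssq = -1.0 / epsilonSquared;             // hoisted out of all loops
#pragma omp parallel for
for (int i = 0; i < numParts; i++)
{
    const mxArray *vl = mxGetCell(verletListCells, i);
    const int *idx = (const int *) mxGetData(vl);     // int32 data, no double->int casts
    int numberVerlet = (int) mxGetM(vl);
    double pseSum_r = 0.0;                            // local accumulator
    for (int a = 0; a < numberVerlet; a++)
    {
        int adress = idx[a] - 1;
        double dx = partMat[adress] - partMat[i];
        double dy = partMat[adress + lenPartMat] - partMat[i + lenPartMat];
        double w  = preConst * exp(minvepssq * (dx*dx + dy*dy));
        pseSum_r += (partMat[adress + 2*lenPartMat] - partMat[i + 2*lenPartMat]) * w;
    }
    pseSum[i] = pseSum_r;
}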
Can you estimate ahead of time what will be the maximum size of tempVar and allocate memory for it before the loop instead of using realloc? Reallocating memory is a time consuming operation and if your numParts is large, this could have a huge impact. Take a look at this question.