How to optimize the computation of a for loop using SIMD? - arm

I am trying to accelerate a stereo matching algorithm on the ODROID XU4 ARM platform using NEON SIMD. For this purpose I am using OpenMP's SIMD pragma:
void StereoMatch::sadCol(uint8_t* leftRank, uint8_t* rightRank, const int SAD_WIDTH, const int SAD_WIDTH_STEP, const int imgWidth, int j, int d, uint16_t* cost)
{
    uint16_t sum = 0;
    int n = 0;
    int m = 0;
    for (n = 0; n < SAD_WIDTH+1; n++)
    {
        #pragma omp simd
        for (m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth)
        {
            sum += abs(leftRank[j+m+n]-rightRank[j+m+n-d]);
        }
        cost[n] = sum;
        sum = 0;
    }
}
I am fairly new to SIMD and OpenMP. I understood that using the SIMD pragma in the code will direct the compiler to vectorize the subtraction, but when I executed the code I noticed no difference. What should I add to my code in order to vectorize it?

As said in the comments, ARM NEON has an instruction that directly does what you want, i.e., it computes the absolute difference of unsigned bytes and accumulates it into unsigned 16-bit integers.
Assuming SAD_WIDTH+1==8, here is a very simple implementation using intrinsics (based on the simplified version by @nemequ):
#include <arm_neon.h>
#include <stdint.h>

void sadCol(uint8_t* leftRank,
            uint8_t* rightRank,
            int j,
            int d,
            uint16_t* cost) {
    const int SAD_WIDTH = 7;
    const int imgWidth = 320;
    const int SAD_WIDTH_STEP = SAD_WIDTH * imgWidth;
    uint16x8_t cost_8 = {0};
    for (int m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth) {
        cost_8 = vabal_u8(cost_8, vld1_u8(&leftRank[j+m]), vld1_u8(&rightRank[j+m-d]));
    }
    vst1q_u16(cost, cost_8);
}
vld1_u8 loads 8 consecutive bytes, vabal_u8 computes the absolute difference and accumulates it to the first register. Finally, vst1q_u16 stores the register to memory.
You can easily make imgWidth and SAD_WIDTH_STEP function parameters. If SAD_WIDTH+1 is a different multiple of 8, you can write another loop for that.
I have no ARM platform at hand to test it, but "it compiles" (and the assembly looks fine, in my eyes). If you optimize with -O3, gcc will unroll the loop.
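Since the intrinsic version above is untested, a plain scalar reference can help check it on real data. The following is only a sketch: the name sad_col_ref and the parameters rows and width are assumptions made for illustration (rows is the number of image rows summed, width the number of output columns, 8 in the NEON version).

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar reference for the column SAD: for each of `width` output columns n,
 * sum |left - right| over `rows` image rows.
 * `sad_col_ref`, `rows` and `width` are illustrative names, not from the post. */
void sad_col_ref(const uint8_t *leftRank, const uint8_t *rightRank,
                 int imgWidth, int rows, int width,
                 int j, int d, uint16_t *cost)
{
    for (int n = 0; n < width; n++) {
        uint16_t sum = 0;
        for (int r = 0; r < rows; r++) {
            int m = r * imgWidth;  /* offset of row r */
            sum += (uint16_t)abs(leftRank[j + m + n] - rightRank[j + m + n - d]);
        }
        cost[n] = sum;
    }
}
```

Comparing cost[] from this function against the output of the intrinsic version for a few (j, d) pairs should catch indexing mistakes before any timing is done.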


What am I doing with SIMD and pthreads that is slowing my program down?

Please do not post code, as I would like to complete this myself. Rather, if possible, point me in the right direction with general information, or point out mistakes in my thinking or other useful and relevant resources.
I have a method that creates my square npages * npages matrix_hat of doubles for use in my pagerank algorithm.
I have made it with pthreads, with SIMD, and with both pthreads and SIMD. I have used the Xcode Instruments time profiler and found that the pthreads-only version is the fastest, next is the SIMD-only version, and slowest is the version with both SIMD and pthreads.
As it is homework it can be run on multiple different machines; however, we were given the header #include so it is to be assumed we can use up to AVX at least. We are given how many threads the program will use as the argument to the program and store it in a global variable g_nthreads.
I have been testing it on my machine, which is an Ivy Bridge with 4 hardware cores and 8 logical cores, with 4 threads as an argument and with 8 threads as an argument.
331ms - for construct_matrix_hat function
PTHREADS ONLY (8 threads):
70ms - each thread concurrently
SIMD & PTHREADS (8 threads):
110ms - each thread concurrently
What am I doing that is slowing it down more when using both forms of optimisation?
I will post each implementation:
All versions share these macros:
#define BIG_CHUNK (g_n2/g_nthreads)
#define SMALL_CHUNK (g_npages/g_nthreads)
#define MOD (BIG_CHUNK - (BIG_CHUNK % 4))
#define IDX(a, b) (((a) * g_npages) + (b))
Pthreads only:
// struct used for passing arguments
typedef struct {
    double* restrict m;
    double* restrict m_hat;
    int t_id;
    char padding[44];
} t_arg_matrix_hat;

// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
    t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
    // set coordinate limits thread is able to act upon
    size_t start = t_arg->t_id * BIG_CHUNK;
    size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
    // Initialise coordinates with given uniform value
    for (size_t i = start; i < end; i++) {
        t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
    }
    return NULL;
}

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);
    // create structs to send and retrieve matrix and value from threads
    t_arg_matrix_hat t_args[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        t_args[i] = (t_arg_matrix_hat) {
            .m = matrix,
            .m_hat = matrix_hat,
            .t_id = i
        };
    }
    // create threads and send structs with matrix and value to divide the matrix and
    // initialise the coordinates with the given value
    pthread_t threads[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
    }
    // join threads after all coordinates have been initialised
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
    return matrix_hat;
}
SIMD only:
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);
    double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener};
    __m256d b = _mm256_loadu_pd(dampeners);
    // Use simd to subtract values from each other
    for (size_t i = 0; i < g_mod; i += 4) {
        __m256d a = _mm256_loadu_pd(matrix + i);
        __m256d res = _mm256_mul_pd(a, b);
        _mm256_storeu_pd(&matrix_hat[i], res);
    }
    // Subtract values from each other that weren't included in simd
    for (size_t i = g_mod; i < g_n2; i++) {
        matrix_hat[i] = g_dampener * matrix[i];
    }
    double hats[4] = {HAT, HAT, HAT, HAT};
    b = _mm256_loadu_pd(hats);
    // Use simd to raise each value to the power 2
    for (size_t i = 0; i < g_mod; i += 4) {
        __m256d a = _mm256_loadu_pd(matrix_hat + i);
        __m256d res = _mm256_add_pd(a, b);
        _mm256_storeu_pd(&matrix_hat[i], res);
    }
    // Raise each value to the power 2 that wasn't included in simd
    for (size_t i = g_mod; i < g_n2; i++) {
        matrix_hat[i] += HAT;
    }
    return matrix_hat;
}
Pthreads & SIMD:
// struct used for passing arguments
typedef struct {
    double* restrict m;
    double* restrict m_hat;
    int t_id;
    char padding[44];
} t_arg_matrix_hat;

// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
    t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
    // set coordinate limits thread is able to act upon
    size_t start = t_arg->t_id * BIG_CHUNK;
    size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
    size_t leftovers = start + MOD;
    __m256d b1 = _mm256_loadu_pd(dampeners);
    for (size_t i = start; i < leftovers; i += 4) {
        __m256d a1 = _mm256_loadu_pd(t_arg->m + i);
        __m256d r1 = _mm256_mul_pd(a1, b1);
        _mm256_storeu_pd(&t_arg->m_hat[i], r1);
    }
    for (size_t i = leftovers; i < end; i++) {
        t_arg->m_hat[i] = dampeners[0] * t_arg->m[i];
    }
    __m256d b2 = _mm256_loadu_pd(hats);
    for (size_t i = start; i < leftovers; i += 4) {
        __m256d a2 = _mm256_loadu_pd(t_arg->m_hat + i);
        __m256d r2 = _mm256_add_pd(a2, b2);
        _mm256_storeu_pd(&t_arg->m_hat[i], r2);
    }
    for (size_t i = leftovers; i < end; i++) {
        t_arg->m_hat[i] += hats[0];
    }
    return NULL;
}

// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
    double* matrix_hat = malloc(sizeof(double) * g_n2);
    // create structs to send and retrieve matrix and value from threads
    t_arg_matrix_hat t_args[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        t_args[i] = (t_arg_matrix_hat) {
            .m = matrix,
            .m_hat = matrix_hat,
            .t_id = i
        };
    }
    // create threads and send structs with matrix and value to divide the matrix and
    // initialise the coordinates with the given value
    pthread_t threads[g_nthreads];
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
    }
    // join threads after all coordinates have been initialised
    for (size_t i = 0; i < g_nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
    return matrix_hat;
}
I think it's because your SIMD code is horribly inefficient: It loops over the memory twice, instead of doing the add with the multiply, before storing. You didn't test SIMD vs. a scalar baseline, but if you had you'd probably find that your SIMD code wasn't a speedup with a single thread either.
STOP READING HERE if you want to solve the rest of your homework yourself.
If you used gcc -O3 -march=ivybridge, the simple scalar loop in the pthread version probably auto-vectorized into something like what you should have done with intrinsics. You even used restrict, so it might realize that the pointers can't overlap with each other, or with g_dampener.
// this probably autovectorizes well.
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
    t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}
// but this would be even safer to help the compiler's aliasing analysis:
double dampener = g_dampener; // in case the compiler thinks one of the pointers might point at the global
double *restrict hat = t_arg->m_hat;
const double *restrict mat = t_arg->m;
... same loop but using these locals instead of
It's probably not a problem for an FP loop, since double definitely can't alias with double *.
The coding style is also pretty nasty. You should give meaningful names to your __m256d variables whenever possible.
Also, you use malloc, which doesn't guarantee that matrix_hat will be aligned to a 32B boundary. C11's aligned_alloc is probably the nicest way, vs. posix_memalign (clunky interface), _mm_malloc (have to free with _mm_free, not free(3)), or other options.
double* construct_matrix_hat(const double* matrix) {
    // double* matrix_hat = malloc(sizeof(double) * g_n2);
    double* matrix_hat = aligned_alloc(64, sizeof(double) * g_n2);
    // double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener}; // This idiom is terrible, and might actually compile to code that stores it 4 times on the stack and then loads.
    __m256d vdamp = _mm256_set1_pd(g_dampener); // will compile to a broadcast-load (vbroadcastsd)
    __m256d vhat = _mm256_set1_pd(HAT);
    size_t last_full_vector = g_n2 & ~3ULL; // don't load this from a global.
    // it's better for the compiler to see how it's calculated from g_n2
    // ??? Use simd to subtract values from each other // huh? this is a multiply, not a subtract. Also, everyone can see it's using SIMD, that part adds no new information
    // if you really want to manually vectorize this, instead of using an OpenMP pragma or -O3 on the scalar loop, then:
    for (size_t i = 0; i < last_full_vector; i += 4) {
        __m256d vmat = _mm256_loadu_pd(matrix + i);
        __m256d vmul = _mm256_mul_pd(vmat, vdamp);
        __m256d vres = _mm256_add_pd(vmul, vhat);
        _mm256_store_pd(&matrix_hat[i], vres); // aligned store. Doesn't matter for performance.
    }
#if 0
    // Scalar cleanup
    for (size_t i = last_full_vector; i < g_n2; i++) {
        matrix_hat[i] = g_dampener * matrix[i] + HAT;
    }
#else
    // assume that g_n2 >= 4, and do a potentially-overlapping unaligned vector
    if (last_full_vector != g_n2) {
        // Or have this always run, and have the main loop stop one element sooner (so this overlaps by 0..3 instead of by 1..3 with a conditional)
        assert(g_n2 >= 4);
        __m256d vmat = _mm256_loadu_pd(matrix + g_n2 - 4);
        __m256d vmul = _mm256_mul_pd(vmat, vdamp);
        __m256d vres = _mm256_add_pd(vmul, vhat);
        _mm256_storeu_pd(&matrix_hat[g_n2-4], vres);
    }
#endif
    return matrix_hat;
}
This version compiles (after defining a couple globals) to the asm we expect. BTW, normal people pass sizes around as function arguments. This is another way of avoiding optimization-failure due to C aliasing rules.
Anyway, really your best bet is to let OpenMP auto-vectorize it, because then you don't have to write a cleanup loop yourself. There's nothing tricky about the data organization, so it vectorizes trivially. (And it's not a reduction, like in your other question, so there's no loop-carried dependency or order-of-operations concern).
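For reference, the OpenMP route mentioned above might look roughly like this. It is a sketch under assumptions: the function name construct_hat_simd and its parameters are made up for illustration, and the size and constants are passed as arguments rather than read from globals.

```c
#include <stddef.h>

/* Hypothetical helper: one fused pass, multiply and add before storing.
 * Sizes and constants are function arguments instead of globals. */
void construct_hat_simd(const double *restrict mat, double *restrict hat,
                        size_t n, double dampener, double hat_value)
{
    /* With -fopenmp or -fopenmp-simd the pragma asks the compiler to vectorize
     * the loop; without those flags it is ignored and the loop is still correct. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        hat[i] = dampener * mat[i] + hat_value;
    }
}
```

The compiler generates any cleanup code for the last n % 4 elements itself, and each thread of the pthreads version could call such a function on its own sub-range.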

rsa algorithm in openmp

I am trying to parallelize the RSA algorithm in OpenMP using the repeated square-and-multiply method.
The code is as follows:
long long unsigned int mod_exp(int base,int exp,int n)
long long unsigned int i,pow1=1,pow2=1,pow3=1,pow4=1,pow=1,pow5=1;
int exp1=exp/4;
int id;
return pow;
With just #pragma omp for I am unable to get the correct output.
Kindly help.
I guess you could go for something like this:
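long long unsigned int i, pow = 1, exp1 = exp / 4;
int k;
// i is declared outside the loop, so it must be made private to avoid a data race
#pragma omp parallel for private( i ) reduction( * : pow )
for ( k = 0; k < 5; k++ ) {
    for ( i = 0; i < exp1 ; i++ ) {
        pow = ( pow * base ) % n;
    }
}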
That should work, but I doubt it would do you much good: the amount of work is so limited that the parallelisation overhead is very likely to slow down the code instead of speeding it up.
EDIT: Hmm, actually, I can't make sense of the initial code... Why are we doing the same computation 5 times? Did I miss something?
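For comparison, here is a minimal serial sketch of the repeated square-and-multiply method the question mentions. The name mod_exp_sq is made up for this example, and it assumes n fits in 32 bits so the 64-bit products cannot overflow. Each iteration depends on the result of the previous one, which is why a plain #pragma omp for does not apply to this loop directly.

```c
#include <stdint.h>

/* Right-to-left binary (square-and-multiply) modular exponentiation.
 * Assumption: n < 2^32, so (result * base) and (base * base) fit in 64 bits. */
uint64_t mod_exp_sq(uint64_t base, uint64_t exp, uint64_t n)
{
    if (n == 1)
        return 0;                         /* everything is 0 modulo 1 */
    uint64_t result = 1;
    base %= n;
    while (exp > 0) {
        if (exp & 1)                      /* current exponent bit is set */
            result = (result * base) % n;
        base = (base * base) % n;         /* square for the next bit */
        exp >>= 1;
    }
    return result;
}
```

This needs on the order of log2(exp) multiplications instead of exp, so for typical exponents there is very little work left to spread over threads.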

Equivalent SIMD instruction for multiplying specific array elements

I just understood how to get a dot-product of 2 arrays (as in the following code):
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float result = 0;
for (int i = 0; i < 8; i ++) {
    result += A[i] * B[i];
}
is equivalent to (in SIMD):
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float result = 0;
__m128 r1 = {0,0,0,0};
__m128 r2 = {0,0,0,0};
__m128 r3 = {0,0,0,0};
for (int i = 0; i < 8; i += 4) {
    float C[4] = {A[i], A[i+1], A[i+2], A[i+3]};
    float D[4] = {B[i], B[i+1], B[i+2], B[i+3]};
    __m128 a = _mm_loadu_ps(C);
    __m128 b = _mm_loadu_ps(D);
    r1 = _mm_mul_ps(a,b);
    r2 = _mm_hadd_ps(r1, r1);
    r3 = _mm_add_ss(_mm_hadd_ps(r2, r2), r3);
}
_mm_store_ss(&result, r3);
I am curious now how to get the equivalent code in SIMD if I want to multiply elements that aren't consecutive in the array. For example, if I wanted to perform the following, what would be the equivalent in SIMD?
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float result = 0;
for (int i = 0; i < 8; i++) {
    for (int j = 0; j < 8; j++) {
        result += A[foo(i)] * B[foo(j)];
    }
}
foo is just some function that returns an int as some function of the input argument.
If I had to do this task, I would do it as follows:
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float PA[8], PB[8];
for (int i = 0; i < 8; i++) {
    PA[i] = A[foo(i)];
    PB[i] = B[foo(i)];
}
__m128 sums = _mm_set1_ps(0);
for (int i = 0; i < 8; i++) {
    __m128 a = _mm_set1_ps(PA[i]);
    for (int j = 0; j < 8; j += 4) {
        __m128 b = _mm_loadu_ps(PB + j);
        sums = _mm_add_ps(sums, _mm_mul_ps(a, b));
    }
}
float results[4];
_mm_storeu_ps(results, sums);
float result = results[0] + results[1] + results[2] + results[3];
Generally speaking, SIMD does not like such things as random access to individual elements. However, there are still several tricks that can be used.
If the indices provided by foo are known at compile time, you can probably shuffle both vectors to align their elements properly. Just look at the intrinsics in the swizzle category of the Intrinsics Guide. You will most certainly need something like _mm_shuffle_ps and _mm_unpackXX_ps. Various shift/align instructions may also be useful.
With AVX2, you can try to use gather instructions. For float data with 32-bit indices you can use the _mm_i32gather_ps or _mm256_i32gather_ps intrinsics. However, @PaulR writes here that they are no faster than trivial scalar loads.
Another solution may be possible with the _mm_shuffle_epi8 intrinsic from SSSE3. It is a great instruction that performs an in-register gather with the granularity of individual bytes. However, creating the shuffle mask is not a simple task. This paper (read sections 3.1 and 4) shows how to extend this approach to input arrays larger than one XMM register, but it seems that for 64 or more elements it is no longer better than scalar code.
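Separately from the SIMD tricks above, it may be worth noting that the specific double loop in the question factors algebraically: the sum over all (i, j) of A[foo(i)] * B[foo(j)] equals the sum of A[foo(i)] times the sum of B[foo(j)]. The sketch below only illustrates that observation. The foo shown is a made-up stand-in for the question's index function, and with general float data the two versions can differ slightly because the rounding order changes.

```c
/* Hypothetical index function standing in for the question's foo();
 * this particular one is a permutation of 0..7. */
static int foo(int i) { return (i * 3) % 8; }

/* Direct double loop from the question: n * n multiply-adds. */
float nested_sum(const int *A, const int *B, int n)
{
    float result = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            result += (float)(A[foo(i)] * B[foo(j)]);
    return result;
}

/* Same value computed as (sum of A[foo(i)]) * (sum of B[foo(j)]): 2 * n adds. */
float factored_sum(const int *A, const int *B, int n)
{
    float sumA = 0, sumB = 0;
    for (int i = 0; i < n; i++) {
        sumA += (float)A[foo(i)];
        sumB += (float)B[foo(i)];
    }
    return sumA * sumB;
}
```

If the real computation has this separable shape, the factored form removes the inner loop entirely, and the question of how to vectorize the gather matters much less.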

How to implement summation using parallel reduction in OpenCL?

I'm trying to implement a kernel which does parallel reduction. The code below works only on occasion, and I have not been able to pin down why it goes wrong when it does.
__kernel void summation(__global float* input, __global float* partialSum, __local float *localSum){
    int local_id = get_local_id(0);
    int workgroup_size = get_local_size(0);
    localSum[local_id] = input[get_global_id(0)];
    for(int step = workgroup_size/2; step>0; step/=2){
        if(local_id < step){
            localSum[local_id] += localSum[local_id + step];
        }
    }
    if(local_id == 0){
        partialSum[get_group_id(0)] = localSum[0];
    }
}
Essentially I'm summing the values per work group and storing each work group's total into partialSum, the final summation is done on the host. Below is the code which sets up the values for the summation.
size_t global[1];
size_t local[1];
const int DATA_SIZE = 15000;
float *input = NULL;
float *partialSum = NULL;
int count = DATA_SIZE;
local[0] = 2;
global[0] = count;
input = (float *)malloc(count * sizeof(float));
partialSum = (float *)malloc(global[0]/local[0] * sizeof(float));
int i;
for (i = 0; i < count; i++){
    input[i] = (float)i+1;
}
I'm thinking it has something to do when the size of the input is not a power of two? I noticed it begins to go off for numbers around 8000 and beyond. Any assistance is welcome. Thanks.
I'm thinking it has something to do when the size of the input is not a power of two?
Yes. Consider what happens when you try to reduce, say, 9 elements. Suppose you launch 1 work-group of 9 work-items:
for (int step = workgroup_size / 2; step > 0; step /= 2){
    // At iteration 0: step = 9 / 2 = 4
    if (local_id < step) {
        // Branch taken by threads 0 to 3
        // Only 8 numbers added up together!
        localSum[local_id] += localSum[local_id + step];
    }
}
You're never summing the 9th element, hence the reduction is incorrect. An easy solution is to pad the input data with enough zeroes to make the work-group size the immediate next power-of-two.
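The effect is easy to reproduce on the host. The following plain C sketch simulates the reduction loop of one work-group. The names tree_reduce and next_pow2 are made up for this illustration, and the inner loop over id stands in for the work-items that would run in parallel on the device.

```c
#include <stddef.h>

/* Host-side simulation of the kernel's tree reduction over one work-group.
 * `local` plays the role of localSum; `size` is the work-group size. */
float tree_reduce(float *local, size_t size)
{
    for (size_t step = size / 2; step > 0; step /= 2) {
        /* Work-items with local_id < step each add their partner element. */
        for (size_t id = 0; id < step; id++)
            local[id] += local[id + step];
    }
    return local[0];
}

/* Smallest power of two that is >= n (for n >= 1). */
size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p *= 2;
    return p;
}
```

For the values 1..9 the unpadded call tree_reduce(a, 9) drops the 9th element and returns 36, while copying the same values into a zero-filled buffer of next_pow2(9) = 16 floats and reducing that gives the full sum of 45.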

Slow OMP vs serial

I'm trying to optimize a C subroutine called from R that takes up ~60% of the computation time for a problem I'm trying to solve. This is down from 86% when coded purely in R. The vast majority of the execution time in my C code is taking place in a nested for loop and so this seems an obvious candidate to try and parallelize using OpenMP. I've tried doing so with variable results – at best the elapsed time is fractionally worse than not using OMP, at worst the performance scaled inversely to the number of threads. The code for the fastest version is below:
#include <R.h>
#include <Rmath.h>
#include <omp.h>
void gradNegLogLik_c(double *param, double *delta, double *X, double *M, int *nBeta, int *nEpsilon, int *nObs, double *gradient){
    // ========================================================================================
    // param: double[nBeta + nEpsilon] values of parameters at which to evaluate gradient
    // delta: double[nObs] satellite - buoy differences
    // X: double[nObs * (nBeta + nEpsilon)] design matrix for mean components (i.e. beta terms)
    // M: double[nObs * (nBeta + nEpsilon)] design matrix for variance components (i.e. epsilon terms)
    // nBeta: int number of mean terms
    // nEpsilon: int number of variance terms
    // nObs: int number of observations
    // gradient: double[nBeta + nEpsilon] output array of gradients
    // ========================================================================================
    // ========================================================================================
    // local variables
    size_t i, j, ind;
    size_t nterms = *nBeta + *nEpsilon;
    size_t nbeta = *nBeta;
    size_t nepsilon = *nEpsilon;
    size_t nobs = *nObs;
    // allocate local memory and set to zero
    double *sigma2 = calloc( nobs , sizeof(double) );
    double *fittedValues = calloc( nobs , sizeof(double) );
    double *residuals = calloc( nobs , sizeof(double) );
    double *beta = calloc( nbeta , sizeof(double) );
    double *epsilon2 = calloc( nepsilon , sizeof(double) );
    double *residuals2 = calloc( nobs , sizeof(double) );
    double gradBeta, gradEpsilon;
    // extract beta and epsilon terms from param
    // =========================================
    for(i = 0 ; i < nbeta ; i++){
        beta[i] = param[ i ];
        epsilon2[i] = param[ nbeta + i ];
    }
    // Initialise gradient to zero for return value
    // =========================================
    for( i = 0 ; i < nterms ; i++){
        gradient[i] = 0;
    }
    // calculate sigma, fitted values and residuals
    // ============================================
    for( i = 0 ; i < nbeta ; i++){
        for( j = 0 ; j < nobs ; j++){
            ind = i * nobs + j;
            sigma2[j] += M[ind] * epsilon2[i];
            fittedValues[j] += X[ind] * beta[i];
        }
    }
    for( j = 0 ; j < nobs ; j++){
        // calculate reciprocal as this is what we actually use and
        // we only want to do it once.
        sigma2[j] = 1 / sigma2[j];
        residuals[j] = delta[j] - fittedValues[j];
        residuals2[j] = residuals[j]*residuals[j];
    }
    // Loop over all observations and calculate value of (negative) derivative
    // =======================================================================
    #pragma omp parallel for private(i, j, ind, gradBeta, gradEpsilon) \
        shared(gradient, nbeta, nobs, X, M, sigma2, fittedValues, delta, residuals2)
    for( i = 0 ; i < nbeta ; i++){
        gradBeta = 0.0;
        gradEpsilon = 0.0;
        for(j = 0 ; j < nobs ; j++){
            ind = i * nobs + j;
            gradBeta -= -1.0*X[ind] * sigma2[j]*(fittedValues[j] - delta[j]);
            gradEpsilon -= 0.5*M[ind] * sigma2[j]*(residuals2[j] * sigma2[j] - 1);
        }
        gradient[i] = gradBeta;
        gradient[nbeta + i] = gradEpsilon;
    }
    // End of function
    // free local memory
}
nObs is order 10000.
nBeta is in the range 20 – several hundred.
nEpsilon = nBeta and is not currently used.
After searching through this site and an afternoon googling and trying different things I don't seem to be able to make any further improvement. My first thoughts were false sharing – I've tried various things such as unrolling the outer loop to set 8 elements of gradient[] at a time to creating a temporary padded array to store the results in. I've also tried different combinations of shared, private and firstprivate. None of this appears to improve things and my fastest execution time is marginally worse in parallel than in serial. This leads to two questions before I spend any more time on this:
Is my problem (repeating ~9000 of the same set of calculations 20 - 900 times) too small to make it worthwhile using OMP?
Is there something I'm missing or doing wrong?
I suspect it's the latter as I'm relatively inexperienced when using C and OMP. Any help / thoughts would be appreciated.
(For info, I'm running on SLED11 server with 16 cores and 192GB of memory and using GCC 4.7.2 to compile my C code). Other users are using the server but the relative performance of OMP vs serial code seems independent of the other users.
Thanks in advance,
EDIT: For info the compile command I've used is
gcc -I/RHOME/R/3.0.1/lib64/R/include -DNDEBUG -I/usr/local/include -fpic \
-std=c99 -Wall -pedantic -O3 -fopenmp -c src/gradNegLogLik_call.c \
-o src/gradNegLogLik_call.o
Most of the flags are set by the R CMD SHLIB command - I've added the -O3 -fopenmp manually.
It may be useful to give some context to my question above before giving my answer to what I've done to speed up my code (although this has been achieved without using OMP).
My original C function was written to calculate the gradient of a log likelihood function to be used with the R optim() command and the L-BFGS-B method. For each call of optim my log likelihood and gradient functions are each called ~100 times as optim finds the best solution. As a result, these two functions take up the bulk of my execution time, as expected and reported by Rprof, and so were the two targets for converting to C to improve the efficiency of my code.
Converting my two functions to C and optimizing that code has resulted in my calls to optim reducing from an average elapsed time of 1.88s per call to 0.25s per call. This has reduced my processing time from ~1 month to a few days. The change that had the biggest impact (beside calling C) was changing the ordering of the nested loops. The original order was chosen due to the way R stores matrices and chosen to avoid having to transpose my matrices for each call of my C functions. Recognizing that the transpose only needs to be done once for each call to optim(), and not each C call as I had originally coded, this is a small overhead to pay compared to the impact / benefit of changing the order in the C functions.
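To make the loop-order change concrete, here is a small sketch of the two arrangements for the fitted-values step only. The function names and the explicit transpose helper are assumptions for illustration and are not the code I actually ran. It shows the two layouts and loop orders producing the same values, without claiming which part of the change accounts for the speed-up.

```c
#include <stddef.h>

/* Column-major X (as R stores it): element (obs j, term i) is X[i*nobs + j].
 * Terms in the outer loop, as in the original post. */
void fitted_term_outer(const double *X, const double *beta,
                       size_t nobs, size_t nterms, double *fitted)
{
    for (size_t j = 0; j < nobs; j++)
        fitted[j] = 0.0;
    for (size_t i = 0; i < nterms; i++)
        for (size_t j = 0; j < nobs; j++)
            fitted[j] += X[i * nobs + j] * beta[i];
}

/* Transposed copy Xt: element (obs j, term i) is Xt[j*nterms + i].
 * Observations in the outer loop, accumulating in a local variable. */
void fitted_obs_outer(const double *Xt, const double *beta,
                      size_t nobs, size_t nterms, double *fitted)
{
    for (size_t j = 0; j < nobs; j++) {
        double val = 0.0;
        for (size_t i = 0; i < nterms; i++)
            val += Xt[j * nterms + i] * beta[i];
        fitted[j] = val;  /* stored once per observation */
    }
}

/* One-off transpose from the column-major layout to the transposed layout. */
void transpose(const double *X, size_t nobs, size_t nterms, double *Xt)
{
    for (size_t i = 0; i < nterms; i++)
        for (size_t j = 0; j < nobs; j++)
            Xt[j * nterms + i] = X[i * nobs + j];
}
```

Because the design matrices do not change between gradient evaluations, the transpose is paid once per optim() call while the loops run on every evaluation.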
Given this increase in speed it's hard to justify spending any more time on this. The final version of my gradient function (as per my original post) is given below.
Note that whilst I've changed from using .C to .Call in R (hence the change to the function arguments etc) this in itself doesn’t account for the speed increase.
#include <R.h>
#include <Rmath.h>
#include <Rinternals.h>
#include <omp.h>
SEXP gradNegLogLik_call(SEXP param ,SEXP delta, SEXP X, SEXP M, SEXP nBeta, SEXP nEpsilon){
    // local variables
    double *par, *d;
    double *sigma2, *fittedValues, *residuals, *grad, *Xuse, *Muse;
    double val, sig2, gradBeta, gradEpsilon;
    int n, m, ind, nterms, i, j;
    SEXP gradient;
    // get / associate parameters with local pointer
    par = REAL(param);
    Xuse = REAL(X);
    Muse = REAL(M);
    d = REAL(delta);
    n = LENGTH(delta);
    m = INTEGER(nBeta)[0];
    nterms = m + m;
    // allocate memory
    PROTECT( gradient = allocVector(REALSXP, nterms ));
    // set pointer to real portion of gradient
    grad = REAL(gradient);
    // set all gradient terms to zero
    for(i = 0 ; i < nterms ; i++){
        grad[i] = 0.0;
    }
    sigma2 = Calloc(n, double );
    fittedValues = Calloc(n, double );
    residuals = Calloc(n, double );
    // calculate sigma, fitted values and residuals
    for(i = 0 ; i < n ; i++){
        val = 0.0;
        sig2 = 0.0;
        for(j = 0 ; j < m ; j++){
            ind = i*m + j;
            val += Xuse[ind]*par[j];
            sig2 += Muse[ind]*par[j+m];
        }
        // calculate reciprocal of sigma as this is what we actually use
        // and we only want to do it once
        sigma2[i] = 1.0 / sig2;
        fittedValues[i] = val;
        residuals[i] = d[i] - val;
    }
    // now loop over each parameter and calculate derivative
    for(i = 0 ; i < n ; i++){
        gradBeta = -1.0*sigma2[i]*(fittedValues[i] - d[i]);
        gradEpsilon = 0.5*sigma2[i]*(residuals[i]*residuals[i]*sigma2[i] - 1);
        for(j = 0 ; j < m ; j++){
            ind = i*m + j;
            grad[j] -= Xuse[ind]*gradBeta;
            grad[j+m] -= Muse[ind]*gradEpsilon;
        }
    }
    // return array of gradients
    return gradient;
}
