I'm working on a problem where the execution time is critical. I have another C function that produces 3-D grids of values at a series of timestamps. What I want is to find the max_value in each 3-D grid at each timestamp. Additionally I am tracking the average value (sum / ncell) of each grid, and returning a maximum normalised by the average value.
I am not proficient in C, so I wanted to check if there is anything I am missing, either in terms of actual code, or use of OpenMP. I guess my question is:
What is the most efficient way to find the maximum values of an n-dimensional array sliced along the nth dimension?
I understand that the best you can hope for (as the grids are unordered) is O(n). My assessment is that this problem is then O(m x n), where m is the time dimension and n is the size of the grid, and I think my implementation reaches that. Typical values for these dimensions are perhaps m = 5000 to 20000, n = 200*200*60.
Currently, I am timing my Python wrapper function (which includes the initialisation of the numpy.ndarrays that receive the max, normMax, and maxIndex values):
m = 2400
n = 54000
threads = 8
For which I am averaging ~0.33 seconds to find the maximum values.
If it's relevant, this is on my laptop with:
Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (6MB cache)
32GB RAM
Code:
void find_max(double *mapPt, double *maxPt, double *normMaxPt,
              int64_t *indexPt, int32_t nsamp, int32_t ncell,
              int64_t threads)
{
    int32_t timeSample;
    #pragma omp parallel for num_threads(threads)
    for (timeSample = 0; timeSample < nsamp; timeSample++)
    {
        /* declared inside the loop so each thread gets its own copies;
           at function scope these would be shared and cause a data race */
        double maxValue = 0.0;
        double sum = 0.0;
        int32_t maxIndex = 0;
        for (int32_t cell = 0; cell < ncell; cell++)
        {
            double currentValue = mapPt[cell * nsamp + timeSample];
            sum += currentValue;
            if (currentValue > maxValue)
            {
                maxValue = currentValue;
                maxIndex = cell;
            }
        }
        maxPt[timeSample] = maxValue;
        normMaxPt[timeSample] = maxValue * ncell / sum;
        indexPt[timeSample] = maxIndex;
    }
}
I am compiling with gcc 7.4.0, with the important flags probably -Ofast and -lm.
I am completely happy for the answer to be "there's nothing more you can do", just want to know for peace of mind.
One suggestion I could see would be to have double *timesample_mapcells = &mapPt[timeSample]; at the start of every thread.
Then you can just index with cell * nsamp, so one fewer addition per access. But the compiler might have been clever enough to optimize that.
You could also try having two incremented variables in the for loop:
for (cell = 0, map_idx = timeSample; cell < ncell; cell++, map_idx += nsamp)
{
    currentValue = mapPt[map_idx];
    [...]
}
Which might save some cycles by avoiding the timeSample addition and the nsamp multiplication on every access.
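Spelled out, the loop body could look something like this; a minimal sketch reusing the variable names from the original function, with map_idx widened to 64 bits since cell * nsamp can overflow a 32-bit int for the larger grid sizes mentioned:
/* Walk one column with a single running index instead of computing
   cell * nsamp + timeSample on every access. */
int64_t map_idx = timeSample;
for (int32_t cell = 0; cell < ncell; cell++, map_idx += nsamp)
{
    double currentValue = mapPt[map_idx];
    sum += currentValue;
    if (currentValue > maxValue)
    {
        maxValue = currentValue;
        maxIndex = cell;
    }
}
Note that gcc at -Ofast will often perform this strength reduction itself, so it is worth benchmarking against the original.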
Then again, this is just a suggestion for you to try. I don't know whether that will have an observable impact on performance. (But I'm curious to know whether that's the case if you give it a go)
Related
I've got an assignment: compare two matrix multiplications, one done the default way and one done after transposing the second matrix, and show which method is faster. I've written something like the code below, but time and time2 are nearly equal to each other. When I run the multiplication with the same matrix size, in one case the first method is faster and in another the second method is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        sum = 0;
        for (int k = 0; k < size; ++k) {
            sum = sum + (m1[i][k] * m2[k][j]);
        }
        score[i][j] = sum;
    }
}
clock_t end = clock();
double time = (end - start) / (double)CLOCKS_PER_SEC;

for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        int temp = m2[i][j];
        m2[i][j] = m2[j][i];
        m2[j][i] = temp;
    }
}

clock_t start2 = clock();
int sum2;
for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        sum2 = 0;
        for (int k = 0; k < size; ++k) {
            sum2 = sum2 + (m1[k][i] * m2[k][j]);
        }
        score[i][j] = sum2;
    }
}
clock_t end2 = clock();
double time2 = (end2 - start2) / (double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
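As a concrete sketch of that scheme (my own illustration, not from the question): collect the per-repeat durations into an array, sort it, and take the middle element.
#include <stdlib.h>

/* qsort comparator for doubles */
static int cmp_double(const void *a, const void *b)
{
    const double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n timing samples (n > 0); sorts the array in place. */
static double median_seconds(double *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_double);
    return (n % 2) ? samples[n / 2]
                   : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}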
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but the resolution may be coarser (i.e., the clock may advance by more than 1 nanosecond at a time). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>

static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;

void timing_start(void)
{
    clock_gettime(CLOCK_REALTIME, &wall_start);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}

void timing_stop(void)
{
    struct timespec cpu_end, wall_end;
    clock_gettime(CLOCK_REALTIME, &wall_end);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
    wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
                 + (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
                + (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
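For example, a minimal usage sketch (multiply() is a hypothetical stand-in for whatever operation you are measuring):
timing_start();
multiply(m1, m2, score, size);  /* hypothetical operation under test */
timing_stop();
printf("CPU: %.9f s, wall clock: %.9f s\n", cpu_seconds, wall_seconds);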
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems; it prefers to develop its own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication computes each element of the result as

result[r][c] = a[r][0] * b[0][c]
             + a[r][1] * b[1][c]
             + ...
             + a[r][L] * b[L][c]

where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then

result[r][c] = a[r][0] * B[c][0]
             + a[r][1] * B[c][1]
             + ...
             + a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
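In code, the cache-friendly inner product then looks something like this sketch (assuming square size x size matrices as in the question, with B already holding the transpose of m2):
/* Both m1[i][...] and B[j][...] are walked sequentially in memory. */
for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
        int sum = 0;
        for (int k = 0; k < size; ++k) {
            sum += m1[i][k] * B[j][k];
        }
        score[i][j] = sum;
    }
}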
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
    For c in columns:
        temporary = b[r][c]
        b[r][c] = b[c][r]
        b[c][r] = temporary
    End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
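As a sketch, a correct in-place transpose therefore restricts the swap to the upper triangular part (assuming a square size x size matrix, as in the question):
/* Swap each (i, j) pair exactly once by requiring j > i;
   diagonal elements stay where they are. */
for (int i = 0; i < size; ++i) {
    for (int j = i + 1; j < size; ++j) {
        int temporary = m2[i][j];
        m2[i][j] = m2[j][i];
        m2[j][i] = temporary;
    }
}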
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing: you swap each element pair twice, returning them to the way they were. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on low-resolution CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. And in the second test case, with m2 "transposed" (except it isn't), your innermost loop runs over the leftmost array index of m1, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop access items far from each other in memory, it is anti-optimized: it uses the access pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.
Goal: Implement the diagram described below in OpenCL. The main thing needed from the OpenCL kernel is to multiply the coefficient array and the temp array, and then accumulate all those values into one at the end. (That is probably the most time-intensive operation; parallelism would be really helpful here.)
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, the values are passed into an array (the temp array) which is the same size as the coefficient array. Now, every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel, and the values of each index are then accumulated into one single element. This continues until the input array reaches its final element.
What happens with my code?
For 60 elements from the input, it takes over 8000 ms(!), and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a far better solution to do what I am attempting. Here is my code below.
Here are some things that I know are wrong with the code for sure: when I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.
I tried to figure out why it was taking so long to do the FIFO function, so I started commenting lines out. I first started by commenting everything except the first for loop of the FIFO function. As a result this took 50 ms. Then when I uncommented the next loop, it jumped to 8000ms. So the delay would have to do with the transfer of data.
Is there a register shift that I could use in OpenCL? Perhaps some logical shifting method for integer arrays? (I know there is a '>>' operator.)
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];

float fifo(float inputValue, float *coefficients, int sizeOfCoeff)
{
    //take array of 58 elements (or same size as number of coefficients)
    //shift all elements to the right one
    //bring next element into index 0 from input
    //multiply the coefficient array with the array that's the same size as the coefficients and accumulate
    //store into one output value of the output array
    //repeat till input array has reached the end
    int globalId = get_global_id(0);
    float output = 0.0f;

    //Shift everything down from 1 to 57
    //takes about 50 ms here
    for (int i = 1; i < 58; i++) {
        tempArrayForShift[i] = temp[i];
    }

    //Input the new value passed from the main kernel. The rest of the values
    //were shifted over, so this element is written at index 0.
    tempArrayForShift[0] = inputValue;

    //Takes about 8000 ms with this loop included
    //Write values back into temp array
    for (int i = 0; i < 58; i++) {
        temp[i] = tempArrayForShift[i];
    }

    //all 58 elements of the coefficient array and temp array are multiplied
    //at the same time and stored in a new array
    //I am 100% sure this line is crashing the program.
    //multipliedResult[globalId] = coefficients[globalId] * temp[globalId];

    //Sum the temp array elements together. Temp array consists of coefficients * fifo buffer
    for (int i = 0; i < 58; i++) {
        // output = multipliedResult[i] + output;
    }

    //Return summed value of temp array
    return output;
}

__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    //Initialize the temporary array values to 0
    for (int i = 0; i < 58; i++) {
        temp[i] = 0;
        tempArrayForShift[i] = 0;
        multipliedResult[i] = 0;
    }

    //fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
    for (int i = 0; i < 60; i++) {
        Output[i] = fifo(Array[i], coefficients, 58);
    }
}
I have had this problem with OpenCL for a long time: I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about:
In the main cpp file, I was thinking of implementing the FIFO buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would this be the better solution, or would it just be completely inefficient?
To get good performance out of a GPU, you need to parallelize your work across many threads. In your code you are just using a single thread, and a GPU is very slow per thread but can be very fast if many threads are running at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through an array: for every output value a window of 58 values is considered; you can just grab these values from memory, multiply them with the coefficients and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
    int globalId = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < 58; i++)
    {
        float tmp = 0;
        if (globalId + i > 56)
        {
            tmp = Array[i + globalId - 57] * coefficients[57 - i];
        }
        sum += tmp;
    }
    Output[globalId] = sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called a (1D) convolution. NVIDIA has a 2D example called oclConvolutionSeparable in their GPU Computing SDK that shows an optimized version. You can adapt their convolutionRows kernel for a 1D convolution.
Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
The steps (numbers match the comments in the code):
1. init local values to 0
2. copy coefficients to local variable
Then, looping over the output elements to compute:
3. shift existing elements (work items > 0 only)
4. copy new element (work item 0 only)
5. compute dot product
5a. multiplication - one per work item
5b. reduction loop to compute sum
6. copy dot product to output (WI 0 only)
7. final barrier
The code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output,
                      __local float *localArray, __local float *localSums, const int outputSize)
{
    // outputSize was implicit in the original sketch; here it is assumed
    // to be passed in by the host as the number of output elements.
    int globalId = get_global_id(0);
    int localId = get_local_id(0);
    int localSize = get_local_size(0);

    //1 init local values to 0
    localArray[localId] = 0.0f;

    //2 copy coefficients to local
    //don't bother with this if __constant is working for you
    //requires another local to be passed in: localCoeff
    //localCoeff[localId] = coefficients[localId];

    //barrier for both steps 1 and 2
    barrier(CLK_LOCAL_MEM_FENCE);

    float tmp = 0.0f;
    for (int i = 0; i < outputSize; i++)
    {
        //3 shift elements (+barrier)
        if (localId > 0) {
            tmp = localArray[localId - 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
        localArray[localId] = tmp;

        //4 copy new element (work item 0 only, + barrier)
        if (localId == 0) {
            localArray[0] = Array[i];
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        //5 compute dot product
        //5a multiply + barrier
        localSums[localId] = localArray[localId] * coefficients[localId];
        barrier(CLK_LOCAL_MEM_FENCE);

        //5b reduction loop + barrier
        for (int j = 1; j < localSize; j <<= 1) {
            int mask = (j << 1) - 1;
            //bounds check added: localId + j can pass the end of localSums
            //for non-power-of-two work-group sizes such as 65
            if ((localId & mask) == 0 && localId + j < localSize) {
                localSums[localId] += localSums[localId + j];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        //6 copy dot product (WI 0 only)
        if (localId == 0) {
            Output[i] = localSums[0];
        }

        //7 barrier
        //only needed if there is more code after the loop;
        //the barrier in #3 covers this in the case where the loop continues
        //barrier(CLK_LOCAL_MEM_FENCE);
    }
}
What about more work groups?
This is slightly simplified to allow a single 1x65 work group to compute the entire 1.2M-element Output. To allow multiple work groups, you could use outputSize / get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (int i = workAmount * get_group_id(0); i < workAmount * (get_group_id(0) + 1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
//1 init local values
int groupId = get_group_id(0);
if (groupId == 0) {
    localArray[localId] = 0.0f;
} else {
    localArray[localSize - localId] = Array[workAmount - localId];
}
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands, though. Play around with this; sometimes what seems optimal at a high level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: they pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is close to optimal for the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size; ideally, you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel, which makes them all wait fairly regularly.
I'm working on a demo that requires a lot of vector math, and in profiling, I've found that it spends the most time finding the distances between given vectors.
Right now, it loops through an array of X^2 vectors and finds the distance between every pair, meaning it runs the distance function X^4 times, even though (I think) there are only (X^2)/2 unique distances.
It works something like this: (pseudo c)
#define MATRIX_WIDTH 8
typedef float vec2_t[2];
vec2_t matrix[MATRIX_WIDTH * MATRIX_WIDTH];
...
for (int i = 0; i < MATRIX_WIDTH; i++)
{
    for (int j = 0; j < MATRIX_WIDTH; j++)
    {
        float xd, yd;
        float distance;
        for (int k = 0; k < MATRIX_WIDTH; k++)
        {
            for (int l = 0; l < MATRIX_WIDTH; l++)
            {
                int index_a = (i * MATRIX_WIDTH) + j;
                int index_b = (k * MATRIX_WIDTH) + l;
                xd = matrix[index_a][0] - matrix[index_b][0];
                yd = matrix[index_a][1] - matrix[index_b][1];
                distance = sqrtf(powf(xd, 2) + powf(yd, 2));
            }
        }
        // More code that uses the distances between each vector
    }
}
What I'd like to do is create and populate an array of (X^2) / 2 distances without redundancy, then reference that array when I finally need it. However, I'm drawing a blank on how to index this array in a way that would work. A hash table would do it, but I think it's much too complicated and slow for a problem that seems like it could be solved by a clever indexing method.
EDIT: This is for a flocking simulation.
performance ideas:
a) if possible work with the squared distance, to avoid root calculation
b) never use pow for constant, integer powers - instead use xd*xd
I would consider changing your algorithm: O(n^4) is really bad. When dealing with interactions in physics (also O(n^4) for distances in a 2D field), one would implement B-trees etc. and neglect particle interactions with a low impact. But it will depend on what "more code that uses the distances..." really does.
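For (a) and (b) together, the distance computation could become something like the following sketch, where radius is a hypothetical neighbourhood threshold (e.g. for the flocking test), comparisons are done on the squared distance, and sqrtf is deferred until a true distance is actually needed:
float xd = matrix[index_a][0] - matrix[index_b][0];
float yd = matrix[index_a][1] - matrix[index_b][1];
float distance_squared = xd * xd + yd * yd;   /* xd*xd instead of powf(xd, 2) */
if (distance_squared < radius * radius)
{
    /* only pay for the square root here, if the code needs the real distance */
    float distance = sqrtf(distance_squared);
}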
Just did some considerations: the number of unique distances is 0.5*n*(n+1), with n = w*h.
If you write down when unique distances occur, you will see that both inner loops can be reduced by starting them at i and j.
Additionally, if you only need to access those distances via the matrix index, you can set up a 4D distance matrix.
If memory is limited we can save nearly 50%, as mentioned above, with a lookup function that accesses a triangular matrix, as Code-Guru said. We would probably precalculate the line indices to avoid summing up on access:
float distanceArray[(H * W + 1) * H * W / 2];
int lineIndices[H * W];

float searchDistance(int i, int j)
{
    return i < j ? distanceArray[i + lineIndices[j]]
                 : distanceArray[j + lineIndices[i]];
}
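The line indices are just the triangular numbers, so they can be filled in once up front; a minimal sketch, assuming the same H, W, and lineIndices as above:
/* lineIndices[j] = offset where row j of the lower triangle starts,
   i.e. the number of entries in rows 0..j-1, which is j * (j + 1) / 2. */
void initLineIndices(void)
{
    for (int j = 0; j < H * W; j++)
        lineIndices[j] = j * (j + 1) / 2;
}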
I hope someone can share an insight into the problem I am having. A small program takes an array of doubles (100000 elements), adds together a subset of this array, and assigns the result to an element of another array. This seems like a very basic task; however, my GT630 exhibits strange behavior. It works (but rather slowly) for small values of intervalLength, but as soon as intervalLength becomes larger, around 300, the code fails.
What's more interesting is that the problem seems to be not in the addition, but in the assigning of the results back to the output array. If the last line in the code below
output_dev[threadIdx.x] = totalSum;
is changed to
output_dev[threadIdx.x] = input_dev[0];
then the code runs lightning fast, at least 100 times faster, and works for any large value of intervalLength. Also, if a line
totalSum=1;
precedes the assignment, then the code also runs fast and without errors. Some experimentation also showed that if the sum is calculated as a series of statements, as opposed to using the loop, the code also works fine.
I am using a GT630 4GB with 96 CUDA cores, launching 96 threads in one block.
The code:
extern "C" __global__ void TestCompute(double* input_dev, int input_devLen0, int* args_dev, int args_devLen0, double* output_dev, int output_devLen0)
{
int intervalLength = args_dev[0];
double totalSum = input_dev[num];
if (num < input_devLen0)
{
for (int k = 0; k <= input_devLen0; k++)
{
totalSum = 0.0;
for (int i = 0; i < intervalLength; i++)
{
if (input_devLen0 > i)
{
totalSum += input_dev[i];
}
}
if (output_devLen0 > threadIdx.x)
{
// totalSum = 1;
output_dev[threadIdx.x] = totalSum; // input_dev[0];
}
}
}
}
I solved the problem by splitting the processing into smaller chunks and then looping through the list of chunks. Apparently the problem is the limited number of registers: the lower-end GT630 card seems to run out of registers quickly. This is probably not the ideal solution, but it works in my case.
I have the following piece of C code,
double findIntraClustSimFullCoverage(cluster *pCluster)
{
    double sum = 0;
    register int i = 0, j = 0;
    double perElemSimilarity = 0;

    for (i = 0; i < 10000; i++)
    {
        perElemSimilarity = 0;
        for (j = 0; j < 10000; j++)
        {
            perElemSimilarity += arr[i][j];
        }
        perElemSimilarity /= pCluster->size;
        sum += perElemSimilarity;
    }
    return (sum / pCluster->size);
}
NOTE: arr is a matrix of size 10000 X 10000
This is a portion of a GA code, hence this nested for loop runs many times.
This affects the performance of the code, i.e. it takes a very long time to give the results.
I profiled the code using valgrind / kcachegrind.
This indicated that 70 % of the process execution time was spent in running this nested for loop.
The register variables i and j do not seem to be stored in registers (profiling with and without the "register" keyword indicated this).
I simply cannot find a way to optimize this nested for loop portion of the code (as it is very simple and straightforward).
Please help me optimize this portion of code.
I'm assuming that you change the arr matrix frequently, else you could just compute the sum (see Lucian's answer) once and remember it.
You can use a similar approach when you modify the matrix. Instead of completely re-computing the sum after the matrix has (likely) been changed, you can store a 'sum' value somewhere, and have every piece of code that updates the matrix update the stored sum appropriately. For instance, assuming you start with an array of all zeros:
double arr[10000][10000];
< initialize it to all zeros >
double sum = 0;

// you want to set arr[27][53] to 82853
sum -= arr[27][53];
arr[27][53] = 82853;
sum += arr[27][53];

// you want to set arr[27][53] to 473
sum -= arr[27][53];
arr[27][53] = 473;
sum += arr[27][53];
You might want to completely re-calculate the sum from time to time to avoid accumulation of errors.
If you're sure that you have no option for algorithmic optimization, you'll have to rely on very low level optimizations to speed up your code. These are very platform/compiler specific so your mileage may vary.
It is probable that, at some point, the bottleneck of the operation is pulling the values of arr from memory. So make sure that your data is laid out in a linear, cache-friendly way; that is to say that (char *)&arr[i][j+1] - (char *)&arr[i][j] == sizeof(double).
You may also try to unroll your inner loop, in case your compiler does not already do it. Your code :
for (j = 0; j < 10000; j++)
{
    perElemSimilarity += arr[i][j];
}
Would for example become :
for (j = 0; j < 10000; j += 10)
{
    perElemSimilarity += arr[i][j + 0];
    perElemSimilarity += arr[i][j + 1];
    perElemSimilarity += arr[i][j + 2];
    perElemSimilarity += arr[i][j + 3];
    perElemSimilarity += arr[i][j + 4];
    perElemSimilarity += arr[i][j + 5];
    perElemSimilarity += arr[i][j + 6];
    perElemSimilarity += arr[i][j + 7];
    perElemSimilarity += arr[i][j + 8];
    perElemSimilarity += arr[i][j + 9];
}
These are the basic ideas, but it is difficult to say more without knowing your platform and compiler, and without looking at the generated assembly code.
You might want to take a look at this presentation for more complete examples of optimization opportunities.
If you need even more performance, you could take a look at SIMD intrinsics for your platform, or try to use, say, OpenMP to distribute your computation across multiple threads.
Another step would be to try OpenMP, something along the following lines (untested):
#pragma omp parallel for private(perElemSimilarity) reduction(+:sum)
for (i = 0; i < 10000; i++)
{
    perElemSimilarity = 0;
    for (int j = 0; j < 10000; j++)   /* j declared here so it is private to each thread */
    {
        perElemSimilarity += arr[i][j];
    }
    perElemSimilarity /= pCluster->size;
    sum += perElemSimilarity;
}
But note that even if you bring this portion of code to 0% (which is impossible) of your execution time, your GA algorithm will still take hours to run. Your performance bottleneck is elsewhere now that this portion of code takes 'only' 22% of your running time.
I might be wrong here, but isn't the following equivalent:
for (i = 0; i < 10000; i++)
{
    for (j = 0; j < 10000; j++)
    {
        sum += arr[i][j];
    }
}
return (sum / (pCluster->size * pCluster->size));
The register keyword is an optimizer hint; if the optimizer doesn't think the register is well spent there, it won't be.
Is the matrix well packed, i.e. is it a contiguous block of memory?
Is 'j' the minor index (i.e. are you going from one element to the next in memory), or are you jumping from one element to that plus 10000?
Is arr fairly static? Is this called more than once on the same arr? The result of the inner loop only depends on the row/column that j traverses, so calculating it lazily and storing it for future reference would make a big difference.
The way this problem is stated, there isn't much you can do. You are processing 10,000 x 10,000 double input values; that's 800 MB. Whatever you do is limited by the time it takes to read 800 MB of data.
On the other hand, are you also writing 10,000 x 10,000 values each time this is called? If not, you could, for example, store the sum for each row and keep a boolean per row indicating that the row sum needs to be recalculated, set whenever you change an element in that row. Or you could even update the row's sum each time an array element is changed.
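A minimal sketch of that row-sum cache (my own illustration; rowSum, rowDirty, and setElement are hypothetical names):
double rowSum[10000];   /* cached sum of each row of arr */
int rowDirty[10000];    /* nonzero if rowSum[i] must be recomputed */

/* call this instead of writing arr[i][j] directly */
void setElement(int i, int j, double value)
{
    /* alternatively, update in place: rowSum[i] += value - arr[i][j]; */
    arr[i][j] = value;
    rowDirty[i] = 1;
}

double getRowSum(int i)
{
    if (rowDirty[i]) {
        double s = 0;
        for (int j = 0; j < 10000; j++)
            s += arr[i][j];
        rowSum[i] = s;
        rowDirty[i] = 0;
    }
    return rowSum[i];
}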