Cache misses on a two-dimensional array - C

I have a question about how the machine stores a two-dimensional array in memory. I'll present my code in order to be clearer.
I'm defining a two-dimensional array this way, in my main function:
#include <stdlib.h>

int main()
{
    int i;
    internalNode **tab = (internalNode **)malloc(sizeof(internalNode *) * DIM);
    for (i = 0; i < DIM; i++)
        tab[i] = (internalNode *)malloc(sizeof(internalNode) * DIM);
    //CODE
    CalculusOnGrid(tab, DIM);
}
where DIM is a user-defined variable and internalNode is a structure.
In the function CalculusOnGrid I do this computation on the grid (my two-dimensional array):
for (i = 1; i < DIM-1; i++)
    for (j = 1; j < DIM-1; j++)
        tab[i][j].temperature_new = 0.25 * (tab[i+1][j].temperature + tab[i-1][j].temperature
                                          + tab[i][j+1].temperature + tab[i][j-1].temperature);
So I'm going to access the 4 neighbors of my current point (i,j) of the grid.
Here is my question: will I get a cache miss on the elements above and below (that is, tab[i+1][j] and tab[i-1][j]) or on the elements to the right and left (that is, tab[i][j+1] and tab[i][j-1])?
What's your suggestion for speeding up my code and reducing the number of cache misses?
I hope the question is stated clearly. If not, ask me whatever you want!
Thank you!
Alessandro

Cache misses are one of many reasons why you should avoid using pointer-based lookup tables to emulate dynamic 2D arrays.
Instead, use a 2D array:
internalNode (*tab)[DIM] = malloc( sizeof(internalNode[DIM][DIM]) );  /* one contiguous allocation */
/* ... */
free(tab);  /* and a single matching free */
Now the memory will be adjacent and performance should be much better.
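For concreteness, here is a minimal end-to-end sketch (untested; the internalNode fields are assumed from the question, and it relies on C99 variably modified types):

#include <stdlib.h>

/* Hypothetical stand-in; the real internalNode is defined elsewhere. */
typedef struct {
    double temperature;
    double temperature_new;
} internalNode;

/* Row-major layout: the inner loop walks adjacent addresses, so
   tab[i][j-1], tab[i][j] and tab[i][j+1] tend to share cache lines. */
void CalculusOnGrid(int dim, internalNode tab[dim][dim])
{
    int i, j;
    for (i = 1; i < dim - 1; i++)
        for (j = 1; j < dim - 1; j++)
            tab[i][j].temperature_new = 0.25 * (tab[i+1][j].temperature
                                              + tab[i-1][j].temperature
                                              + tab[i][j+1].temperature
                                              + tab[i][j-1].temperature);
}

int main(void)
{
    int dim = 512;                                        /* stands in for DIM */
    internalNode (*tab)[dim] = calloc(dim, sizeof *tab);  /* zeroed, contiguous */
    if (!tab) return 1;
    CalculusOnGrid(dim, tab);
    free(tab);
    return 0;
}

Freeing is also a single call, instead of DIM+1 of them.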

Related

copying 2d array of type (double **2darray) to GPU using cuda [duplicate]

I am looking into how to copy a 2D array of variable width for each row into the GPU.
int rows = 1000;
int cols;
int **host_matrix = malloc(sizeof(int *) * rows);
int **d_array;
int *length;
...
Each host_matrix[i] might have a different length, which I know as length[i], and that is where the problem starts. I would like to avoid copying dummy data. Is there a better way of doing it?
According to this thread, that won't be a clever way of doing it:
cudaMalloc((void **)&d_array, rows*sizeof(int*));
for(int i = 0 ; i < rows ; i++) {
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int));
}
But I cannot think of any other method. Is there a smarter way of doing it?
Can it be improved using cudaMallocPitch and cudaMemcpy2D?
The correct way to allocate an array of pointers for the GPU in CUDA is something like this:
int **hd_array, **d_array;
hd_array = (int **)malloc(nrows*sizeof(int*));
cudaMalloc((void **)&d_array, nrows*sizeof(int*));
for(int i = 0 ; i < nrows ; i++) {
    cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
}
cudaMemcpy(d_array, hd_array, nrows*sizeof(int*), cudaMemcpyHostToDevice);
(disclaimer: written in browser, never compiled, never tested, use at own risk)
The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems - large, unstructured sparse matrices - and adopting one of the sparse matrix formats for your data instead. Using the classic compressed sparse row (CSR) format as a model, you could do something like this:
int * data, * rows, * lengths;
cudaMalloc((void **)&rows, nrows*sizeof(int));
cudaMalloc((void **)&lengths, nrows*sizeof(int));
cudaMalloc((void **)&data, N*sizeof(int));
In this scheme, all the data is stored in a single, linear allocation, data. The ith row of the jagged array starts at data[rows[i]], and each row has a length of lengths[i]. This means you only need three memory allocations and three copy operations to transfer any amount of data to the device, rather than nrows of each in your current scheme, i.e. it reduces the overhead from O(nrows) to O(1).
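For illustration, a host-side sketch of assembling that layout and copying it across (untested, error checking omitted; host_matrix, length and nrows are the names from the question):

#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

/* Build the row-offset table: row i of the flattened data starts at rows_h[i]. */
int *rows_h = (int *)malloc(nrows * sizeof(int));
int total = 0;
for (int i = 0; i < nrows; i++) {
    rows_h[i] = total;
    total += length[i];
}

/* Flatten the jagged host matrix into one linear buffer. */
int *data_h = (int *)malloc(total * sizeof(int));
for (int i = 0; i < nrows; i++)
    memcpy(data_h + rows_h[i], host_matrix[i], length[i] * sizeof(int));

/* Three allocations and three copies, regardless of nrows
   (data, rows and lengths are the device pointers from above). */
int *data, *rows, *lengths;
cudaMalloc((void **)&data, total * sizeof(int));
cudaMalloc((void **)&rows, nrows * sizeof(int));
cudaMalloc((void **)&lengths, nrows * sizeof(int));
cudaMemcpy(data, data_h, total * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(rows, rows_h, nrows * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(lengths, length, nrows * sizeof(int), cudaMemcpyHostToDevice);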
I would put all the data into one array. Then compose another array with the row lengths, so that A[0] is the length of row 0, and so on, i.e. A[i] = length[i].
Then you just need to allocate two arrays on the card and call memcopy twice.
Of course it's a little bit of extra work, but I think performance-wise it will be an improvement (depending, of course, on how you use the data on the card).

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two-dimensional array of which I require n doubles in the first row, n/2 in the second, n/4 in the third, and so on...
Now my difficulty is deciding whether to use a contiguous 2D array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, more memory-efficient, method would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage, but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a plot of the memory requirements in each case (image not reproduced here).
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given, then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions we have to grapple with in computer science. But you can only answer them once you have numbers; otherwise you are just guessing, and a fair amount of the time our intuition about optimizations is not correct.
Build a custom array that follows the rules you have set.
The implementation will use a simple 1D contiguous array. You will need a function that returns the start of a row given its index. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
    int pos = 0 ;
    while( row-- )
    {
        pos += n ;
        n /= 2 ;
    }
    return array + pos ;
}
where n is the same n you described, rounded down on every iteration.
You will have to call this function only once per row.
This function never takes more than O(log n) time, but if you want you can replace it with a single closed-form expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
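For example, if n is a power of two, the partial sum n + n/2 + ... + n/2^(row-1) collapses to 2*(n - n/2^row), so the loop could become (an untested sketch, assuming power-of-two n):

/* Start offset of 'row', assuming n is a power of two:
   n + n/2 + ... + n/2^(row-1) == 2*(n - n/2^row). */
int GetPos( int n , int row )
{
    return 2 * ( n - ( n >> row ) ) ;
}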
You could use a single array and just calculate your offset yourself:
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n /= 2;  /* each row is half the length of the previous one */
    }
    return offset;
}
double *array = calloc(get_offset(n, 64, 0), sizeof(double));
access via
array[get_offset(n, row, column)]

How does an N dimensional array with c-dimensional objects perform differently from an N dimensional array with C objects?

Excerpt from the O'Reilly book (excerpt image not reproduced here):
From the above excerpt, can someone explain, in performance (big-O or other) terms, why there should be a performance difference, and the basis for the formula for finding any element in an n-by-c dimensional array?
Additional: why are different data types used in the three-dimensional example? Why would you even bother to represent this in different ways?
The article seems to point out different ways to represent matrix data structures and the performance gains of a single-array representation, although it doesn't really explain why you get the performance gains.
For example, to represent an NxNxN matrix:
In object form:
Cell {
    int x, y, z;
}
Matrix {
    int size = 10;
    Cell[] cells = new Cell[size];
}
In three-arrays form:
Matrix {
    int size = 10;
    int[][][] data = new int[size][size][size];
}
In a single array:
Matrix {
    int size = 10;
    int[] data = new int[size*size*size];
}
To your question: there is a performance gain from representing an NxN matrix as a single array of length N*N because of caching (assuming you cannot fit the entire matrix in one chunk). A single-array representation guarantees the entire matrix occupies a contiguous chunk of memory. When data is moved from memory into cache (or from disk into memory), it is moved in chunks, and you sometimes grab more data than you need; the extra data is the area surrounding the data you asked for.
Say you are processing the matrix row by row. When fetching new data, the machine may grab N+10 items per chunk. In the NxN case, the extra 10 items may be unrelated data. In the case of the N*N-length array, the extra 10 items are most likely from the matrix.
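As a rough C sketch of that arithmetic (hypothetical names): the single-array form computes one offset per element, and a scan with the last index varying fastest touches consecutive addresses:

/* Flat row-major storage for an n x n x n matrix of ints:
   element (x, y, z) lives at offset (x*n + y)*n + z. */
int get(const int *data, int n, int x, int y, int z)
{
    return data[(x * n + y) * n + z];
}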
This article from SGI seems to give a bit more detail, specifically the Principles of Good Cache Use:
http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch06.html

C++ What is the best data structure for a two dimensional array of object "piles"?

The size of the grid will be known at the start (but will be different each time the program starts). However, the DEPTH of each cell is not a mere value, but rather a population of objects that will vary constantly during runtime.
Q: What is the most recommended (efficient and easy to maintain; less prone to user error) way of implementing this?
Is this some kind of a standard 2D array of vector pointers ?
Is it a 3D Vector array ?
Is it a 2D array of linked lists, or binary trees (I am thinking binary trees will add complexity overhead because of continuous deletion and insertion node-gymnastics)
Is it some other custom data structure ?
Use a 1D array for best cache locality. A vector would be fine for this.
std::vector<int> histdata( width * height );
If you need to index the rows quickly, then make something to point into it:
std::vector<int*> histogram( height );
histogram[0] = &histdata[0];
for( int i = 1; i < height; i++ ) {
    histogram[i] = histogram[i-1] + width;
}
Now you have a 2D histogram stored in a 1D vector. You can access it like this:
histogram[row][col]++;
If you wrap all this up in a simple class, you're less likely to do something silly with the pointers. You can also make a clear() function to set the histogram data to zero (which just rips through the histdata vector and zeros it).

Maintain a sorted array that a separate, iterative function can keep accessing

I'm writing code for a decision tree in C. Right now it gives me the correct result (0% training error, low test error), but it takes a long time to run.
The problem lies in how often I run qsort. My basic algorithm is this:
for every feature
    sort that feature column using qsort
    remove duplicate feature values in that column
    for every unique feature value
        split
        determine entropy given that split
save the best feature to split + split value
for every training_example
    if training_example's value for best feature < best split value, store in Left[]
    else store in Right[]
recursively call this function, using only the Left[] training examples
recursively call this function, using only the Right[] training examples
Because the last two lines are recursive calls, and because the tree can extend for dozens and dozens of branches, the number of calls to qsort is huge (especially for my dataset, which has > 1000 features).
My idea for reducing the runtime is to create a 2D array (in a separate function) where each column is a sorted feature column. Then, as long as I maintain a vector of the row numbers of the training examples in Left[] and Right[] for each recursive call, I can just call this separate function, grab the rows I want from the pre-sorted feature vector, and save the cost of calling qsort each time.
I'm fairly new to C, so I'm not sure how to code this. In MATLAB I can just have a global array that any function can change or access; I'm looking for something like that in C.
Global arrays in C are entirely possible. There are actually two ways of doing it. In the first, the dimensions of the array are fixed for the whole application:
#define NROWS 100
#define NCOLS 100

int array[NROWS][NCOLS];

int main(void)
{
    int i, j;
    for (i = 0; i < NROWS; i++)
        for (j = 0; j < NCOLS; j++)
        {
            array[i][j] = i+j;
        }
    return 0;
}
In the second example the dimensions may depend on values from the input.
#include <stdlib.h>

int **array;

int main(void)
{
    int nrows = 100;
    int ncols = 100;
    int i, j;
    array = malloc(nrows*sizeof(*array));
    for (i = 0; i < nrows; i++)
    {
        array[i] = malloc(ncols*sizeof(*(array[i])));
        for (j = 0; j < ncols; j++)
        {
            array[i][j] = i+j;
        }
    }
    return 0;
}
Although access to the arrays in both examples looks deceptively similar, the implementations are quite different. In the first example the array is located in one piece of memory, and the stride from one row to the next is a whole row. In the second example each row access goes through a pointer to a row, each of which is its own piece of memory; the various rows can be located in different areas of memory. In the second example the rows might also have different lengths, in which case you would need to store the length of each row somewhere too.
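For completeness, a third variant (a sketch of my own, not from either example above): keep one contiguous global block with runtime dimensions and index it manually. This combines the single-piece memory layout of the first example with the flexible dimensions of the second:

#include <stdlib.h>

int *array;   /* one contiguous block */
int g_ncols;  /* column count, stored so any function can compute offsets */

int main(void)
{
    int nrows = 100;
    int ncols = 100;
    int i, j;
    g_ncols = ncols;
    array = malloc(nrows * ncols * sizeof(*array));
    for (i = 0; i < nrows; i++)
        for (j = 0; j < ncols; j++)
        {
            array[i * g_ncols + j] = i + j;  /* element (i, j) */
        }
    free(array);
    return 0;
}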
I don't fully understand what you are trying to achieve, because I'm not familiar with the terminology of decision trees, features, and the standard approaches to training sets. But you may also want to have a look at other data structures for maintaining sorted data:
Red-black tree (http://en.wikipedia.org/wiki/Red-black_tree): maintains a more or less balanced, sorted tree.
AVL tree: a bit slower, but a more strictly balanced, sorted tree.
Trie: a sorted tree over lists of elements.
Hash function: easily maps a complex element to an integral value. Good for finding exact elements, but it imposes no real order on the elements themselves.
P.S.: Coming from MATLAB, you may want to consider moving to a language other than C. C++ has standard library support for the above data structures; Java and Python come to mind, or even Haskell if you are daring. Pointer handling in C can be quite tedious and error prone.
