Benefits of contiguous memory allocation - C

In terms of performance, what are the benefits of allocating a contiguous memory block versus separate memory blocks for a matrix? I.e., instead of writing code like this:
char **matrix = malloc(sizeof(char *) * 50);
for (i = 0; i < 50; i++)
    matrix[i] = malloc(50);
giving me 50 disparate blocks of 50 bytes each and one block of 50 pointers, if I were to instead write:
char **matrix = malloc(sizeof(char *) * 50 + 50 * 50);
char *data = (char *)matrix + sizeof(char *) * 50;   /* data area starts right after the 50 pointers */
for (i = 0; i < 50; i++) {
    matrix[i] = data;
    data += 50;
}
giving me one contiguous block of data, what would the benefits be? Avoiding cache misses is the only thing I can think of, and even that's only for small amounts of data (small enough to fit in the cache), right? I've tested this in a small application, noticed a small speed-up, and was wondering why.

It's complicated - you need to measure.
Using an intermediate pointer instead of calculating addresses in a two-dimensional array is most likely a loss on current processors, and both of your examples do that.
Next, everything fitting into L1 cache is a big win. malloc() most likely rounds allocations up to multiples of 64 bytes. A 180 x 180 matrix is 32,400 bytes and might fit into L1 cache, whereas with individual mallocs each 180-byte row may be rounded up to 192 bytes, giving 180 x 192 = 34,560 bytes, which might not fit, especially once you add another 180 pointers.
One contiguous array means you know how the data maps onto cache lines, and you know you'll need the minimum number of page-table lookups in the hardware. With hundreds of separate mallocs, there is no such guarantee.
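For illustration only (this sketch is not from the answer above): if you drop the pointer table entirely and keep a single flat block, the whole 50 x 50 matrix is contiguous and every access is plain index arithmetic, which is exactly the layout being described.
#include <stdlib.h>

#define ROWS 50
#define COLS 50

int main(void)
{
    /* one allocation, no pointer table: element (i, j) lives at matrix[i * COLS + j] */
    char *matrix = malloc((size_t)ROWS * COLS);
    if (matrix == NULL)
        return 1;

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            matrix[i * COLS + j] = 0;

    free(matrix);
    return 0;
}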

Watch Scott Meyers' "CPU Caches and Why You Care" presentation on YouTube. The performance gains can be entire orders of magnitude.
https://www.youtube.com/watch?v=WDIkqP4JbkE
As for the discussion above, the intermediate-pointer argument died a long time ago; compilers optimize it away. An N-dimensional raw array is allocated as a flat 1D vector, ALWAYS. If you use std::vector<std::vector<T>>, THEN you might get the equivalent of an ordered forward list of vectors, but raw arrays are always allocated as one long, contiguous strip in a flat manner, and multi-dimensional access reduces to pointer arithmetic the same way 1-dimensional access does.
To access array[i][j][k] (assume width, height and depth of A, B and C respectively), you add i*(B*C) + j*C + k to the address of the front of the array. You'd have to do this math manually in a 1-D representation anyway.
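A small sketch of that arithmetic in C (the dimensions and the stored value are made up for the example):
#include <stdlib.h>

enum { A = 4, B = 5, C = 6 };   /* width, height, depth from the description above */

int main(void)
{
    double *arr = malloc(sizeof *arr * A * B * C);   /* one contiguous strip */
    if (arr == NULL)
        return 1;

    int i = 1, j = 2, k = 3;
    arr[i * (B * C) + j * C + k] = 42.0;   /* the same math a compiler emits for arr[i][j][k] */

    free(arr);
    return 0;
}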

Related

Large dynamic 2D array allocation fails - C language

I am trying to define a large, dynamically sized 2D array in C. When the array size is not very big (e.g. 20,000 x 20,000), memory allocation works perfectly. However, when the array size is greater than 50K x 50K, memory allocation fails. Here is the sample code.
long int i = 0;
int **Cf = calloc(30000, sizeof *Cf);
for (i = 0; i < 30000; i++) {
    Cf[i] = calloc(30000, sizeof *(Cf[i]));
}
The post "What is the maximum size of buffers memcpy/memset etc. can handle?" explains why this issue occurs. However, it does not say what to do when you need large dynamic 2D arrays in your program. I am sure there must be an efficient way of doing this, since many scientific applications need this feature. Any suggestions?
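One possible direction (a sketch, not taken from the original thread) is the single-block layout discussed at the top of this page: allocate the whole matrix as one flat array and index it with size_t arithmetic. This assumes a 64-bit build with enough RAM, since 30,000 x 30,000 ints is roughly 3.6 GB:
#include <stdio.h>
#include <stdlib.h>

#define N 30000

int main(void)
{
    /* (size_t)N * N keeps the element count from overflowing int for larger N, e.g. 50,000 */
    int *Cf = calloc((size_t)N * N, sizeof *Cf);
    if (Cf == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    /* element (i, j) is Cf[(size_t)i * N + j] */
    Cf[(size_t)(N / 2) * N + N / 2] = 1;

    free(Cf);
    return 0;
}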

Optimized copying from an OpenCV Mat to a 2D float array

I want to copy an OpenCV Mat into a 2D float array, and I used the code below for this purpose. However, since speed is a very important metric in the code I am developing, this way of copying is not optimized enough. Is there a more optimized approach?
float *ImgSrc_f = (float *)malloc(512 * 512 * sizeof(float));
for (int i = 0; i < 512; i++)
    for (int j = 0; j < 512; j++)
    {
        ImgSrc_f[i * 512 + j] = ImgSrc.at<float>(i, j);
    }
Really,
//first method
Mat img_cropped = ImgSrc(Rect(0, 0, 512, 512)).clone();
float *ImgSrc_f = (float *)img_cropped.data;
should be no more than a couple of percent less efficient than the best method. I suggest this one as long as it doesn't lose more than 1% to the second method.
Also try this, which is very close to the absolute best method, unless you can use some kind of extended CPU instruction set (and one is available). You'll probably see minimal difference between methods 1 and 2.
//method 2
//preallocated memory
//largest possible copies with the minimum number of direct copy calls
//caching pointer arithmetic can speed things up extremely minimally
//removing the Mat header with a malloc can speed things up minimally
//ROI is assumed to be a cv::Rect obtained elsewhere
Mat img_roi = ImgSrc(ROI);
Mat img_copied(ROI.height, ROI.width, CV_32FC1);
for (int y = 0; y < ROI.height; ++y)
{
    unsigned char* rowptr = img_roi.data + y * img_roi.step;        // step is the row stride in bytes
    unsigned char* rowptr2 = img_copied.data + y * img_copied.step;
    memcpy(rowptr2, rowptr, ROI.width * sizeof(float));
}
Basically, if you really, really care about these details of performance, you should stay away from overloaded operators in general. The more levels of abstraction in the code, the higher the penalty. Of course, that makes your code more dangerous, harder to read, and bug-prone.
Can you use a std::vector instead? If so, try:
std::vector<float> container;
container.assign((float*)ImgSrc.datastart, (float*)ImgSrc.dataend);   // copies the Mat's whole data range

Copying a 2D array of type (double **2darray) to the GPU using CUDA [duplicate]

I am looking into how to copy a 2D array, with a variable width for each row, into the GPU.
int rows = 1000;
int cols;
int **host_matrix = malloc(sizeof(*host_matrix) * rows);
int *d_array;
int *length;
...
Each host_matrix[i] might have a different length, which I know as length[i], and that is where the problem starts. I would like to avoid copying dummy data. Is there a better way of doing it?
According to this thread, this would not be a clever way of doing it:
cudaMalloc((void **)&d_array, rows * sizeof(int *));
for(int i = 0 ; i < rows ; i++) {
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int));   // invalid: d_array[i] lives in device memory
}
But I cannot think of any other method. Is there any smarter way of doing it?
Can it be improved using cudaMallocPitch and cudaMemcpy2D?
The correct way to allocate an array of pointers for the GPU in CUDA is something like this:
int **hd_array, **d_array;
hd_array = (int **)malloc(nrows * sizeof(int *));
cudaMalloc((void **)&d_array, nrows * sizeof(int *));
for(int i = 0 ; i < nrows ; i++) {
    cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
}
cudaMemcpy(d_array, hd_array, nrows * sizeof(int *), cudaMemcpyHostToDevice);
(disclaimer: written in browser, never compiled, never tested, use at own risk)
The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems - large, unstructured sparse matrices - and copying one of the sparse matrix formats for your data instead. Using the classic compressed sparse row format as a model, you could do something like this:
int *data, *rows, *lengths;
cudaMalloc((void **)&rows, nrows * sizeof(int));
cudaMalloc((void **)&lengths, nrows * sizeof(int));
cudaMalloc((void **)&data, N * sizeof(int));
In this scheme, you store all the data in a single, linear memory allocation, data. The ith row of the jagged array starts at data[rows[i]] and has a length of lengths[i]. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than nrows operations in your current scheme, i.e. it reduces the overhead from O(N) to O(1).
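For illustration, here is a host-side sketch of packing a jagged array into those flat buffers (the function name pack_jagged and its parameters are invented for this example, and error handling is omitted); once packed, the three buffers can be sent with the allocations above plus three cudaMemcpy calls:
#include <stdlib.h>
#include <string.h>

/* builds the CSR-style row-offset array and the flattened data array */
void pack_jagged(int **host_matrix, const int *lengths, int nrows,
                 int **out_data, int **out_rows, size_t *out_total)
{
    int *row_offsets = malloc((size_t)nrows * sizeof *row_offsets);
    size_t total = 0;
    for (int i = 0; i < nrows; i++) {
        row_offsets[i] = (int)total;              /* row i starts at data[row_offsets[i]] */
        total += (size_t)lengths[i];
    }

    int *data = malloc(total * sizeof *data);
    for (int i = 0; i < nrows; i++)
        memcpy(data + row_offsets[i], host_matrix[i], (size_t)lengths[i] * sizeof *data);

    *out_data = data;
    *out_rows = row_offsets;
    *out_total = total;
}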
I would put all the data into one array. Then compose another array with the row lengths, so that A[0] is the length of row 0 and so on, i.e. A[i] = length[i].
Then you just need to allocate two arrays on the card and call cudaMemcpy twice.
Of course it's a little bit of extra work, but I think performance-wise it will be an improvement (depending, of course, on how you use the data on the card).

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two-dimensional array of which I require n doubles in the first row, n/2 in the second, n/4 in the third, and so on...
Now my difficulty is deciding whether or not to use a contiguous 2D array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, more memory-efficient method would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage, but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:
There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given, then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimizations is not correct.
Build a custom array that will follow the rules you have set.
The implementation will use a simple 1D contiguous array. You will need a function that returns the start of a given row within that array. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
    int pos = 0 ;
    while( row-- )
    {
        pos += n ;
        n /= 2 ;
    }
    return array + pos ;
}
Where n is the same n you described; it is halved (rounding down) on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, but if you want, you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
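For instance, when n is a power of two and row stays within log2(n), the partial geometric sum collapses to a single expression (a sketch, not part of the original answer):
/* offset of the first element of 'row', assuming n is a power of two:
   n + n/2 + ... + n/2^(row-1)  ==  2*n - 2*(n >> row) */
int row_offset(int n, int row)
{
    return 2 * n - 2 * (n >> row);
}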
You could use a single array and just calculate the offset yourself:
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n /= 2;   /* each row holds half as many doubles as the one above it */
    }
    return offset;
}
double *array = calloc(get_offset(n, 64, 0), sizeof(double));   /* total elements in 64 rows */
access via
array[get_offset(n, row, column)]

Do people really pad their arrays for efficient cache access in high-performance computing?

I'm reading a book on parallel processing called Intel Xeon Phi Coprocessor High Performance Programming. The book recommends that high-performance programs pad array data in order to align heavily used memory addresses for efficient cache line access.
For statically defined arrays, this can be done like so:
enum { size = 1024 * 1024 };
float buff[size] __attribute__((aligned(64)));
The book also demonstrates how to align dynamically created arrays. Here's an example for a 2D matrix:
const size_t kPaddingSize = 64;   // cache-line size in bytes; 'real' is the book's floating-point typedef
int height = 10000;
// round the 5900-element row length up to a whole number of cache lines
int width = ((5900 * sizeof(real) + kPaddingSize - 1) / kPaddingSize) * (kPaddingSize / sizeof(real));
size_t size = sizeof(real) * width * height;
real *fin = (real *)_mm_malloc(size, kPaddingSize);
real *fout = (real *)_mm_malloc(size, kPaddingSize);
/// ...use the arrays
_mm_free(fin);
_mm_free(fout);
Aligning the arrays has real performance benefits. However, coming from my background as a user-facing software developer, the approach seems undesirable for several reasons:
The size of the cache line, kPaddingSize, is hard-coded in the program.
Padding arrays adds a large amount of complexity to the program. Simply creating the arrays is hard enough, but accessing their members is even worse.
I have two questions:
How do high-performance application developers mitigate these problems?
Is padding arrays a commonplace practice?
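For comparison only, here is a hedged sketch of the same padded-row idea written against standard C11 aligned_alloc instead of the book's _mm_malloc; the constant kCacheLine, the float element type, and the dimensions are assumptions for the example, not taken from the book:
#include <stdlib.h>

#define kCacheLine 64   /* assumed cache-line size, in bytes */

int main(void)
{
    size_t height = 10000;
    size_t logical_width = 5900;   /* elements actually used per row */

    /* round each row up to a whole number of cache lines, measured in elements */
    size_t padded_width =
        ((logical_width * sizeof(float) + kCacheLine - 1) / kCacheLine)
        * (kCacheLine / sizeof(float));

    /* aligned_alloc wants a size that is a multiple of the alignment,
       which the padded row width already guarantees */
    float *fin = aligned_alloc(kCacheLine, sizeof(float) * padded_width * height);
    if (fin == NULL)
        return 1;

    /* element (row, col) is fin[row * padded_width + col];
       every row starts on a cache-line boundary */
    fin[3 * padded_width + 5] = 1.0f;

    free(fin);
    return 0;
}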
