I want to find out the best representation of an m x n real matrix in the C programming language.
What are the advantages of representing a matrix as a single pointer:
double* A;
With this representation you could allocate memory:
A = (double *)malloc(m * n * sizeof(double));
With this representation, element access requires an extra multiplication (for row-major storage of an m x n matrix, the row stride is n):
aij = A[i * n + j];
And what are the disadvantages of representing the matrix as a double pointer:
double** B;
Memory allocation requires a loop:
double** B = (double **) malloc(m * sizeof(double*));
for (i = 0; i < m; i++)
B[i] = (double *) malloc(n * sizeof(double));
With this representation you get intuitive double indexing, `bij = B[i][j]`, but is there some drawback that would affect performance? I want to know which representation is best in terms of performance.
These matrices should be used in numerical algorithms such as singular value decomposition. I need to define a function:
void svd(Matrix A, Matrix U, Matrix Sigma, Matrix V);
and I am looking for the best way to represent Matrix. If there is any other efficient way to represent a matrix in C, please, let me know.
I have seen that most people use the single-pointer representation. I would like to know whether it has performance benefits over the double-pointer representation.
Look at the memory accesses required.
For the single-pointer case, you have:
read a pointer (the base address), probably from a register
read four integers, probably from registers or hard-coded into the instruction stream. For array[i*n+j], the four values are i, n, j and sizeof(array[0]).
multiply and add
access the memory address
For the double-pointer case, you have:
read a pointer (the base address), probably from a register
read an index, probably from a register
multiply the index by the size of a pointer and add.
fetch the base address from memory (unlikely to be a register, might be in cache with luck).
read another index, probably from a register
multiply by the size of the object and add
access the memory address
The fact that you have to access two memory locations probably makes the double-pointer solution quite a bit slower than the single-pointer solution. Clearly, caching will be critical; that's one reason why it is important to access arrays so that the accesses are cache-friendly (so you access adjacent memory locations as often as possible).
You can nit-pick about details in my outline, and some 'multiplication' operations may be shift operations, etc, but the general concept remains: the double-pointer requires two memory accesses versus one for the single-pointer solution, and that will be slower.
Here are a couple of articles about row major format.
http://en.wikipedia.org/wiki/Row-major_order
http://fgiesen.wordpress.com/2011/05/04/row-major-vs-column-major-and-gl-es/
These are common constructs in CUDA programming; hence my interest.
The following code is from pg. 93 of Parallel and High Performance Computing and is a single contiguous memory allocation for a 2D array:
double **malloc_2D(int nrows, int ncols) {
double **x = (double **)malloc(
nrows*sizeof(double*)
+ nrows*ncols*sizeof(double)); // L1
x[0] = (double *)x + nrows; // L2
for (int j = 1; j < nrows; j++) { // L3
x[j] = x[j-1] + ncols;
}
return x;
}
The book states that this improves memory allocation and cache efficiency. Is there any reason w.r.t efficiency to prefer the first code to something like the below code? It seems like the below code is more readable, and it's also easily usable with MPI (I only mention this because the book also covers MPI later).
double *malloc_2D(int nrows, int ncols) {
    double *M = (double *)malloc(nrows * ncols * sizeof(double));
    return M;
}
I include the below image to make sure that my mental model of the first code is correct. If it is not, please mention that in the answer. The image is the result of calling the first function to create a 5 x 2 matrix. Note that I just write the indices in the boxes in the below image for clarity, of course the values stored at these memory locations will not be 0 through 14. Also note that L# refers to lines in the first code.
The book states that this improves memory allocation and cache efficiency.
The book’s code improves efficiency relative to a too-often seen method of allocating pointers separately, as in:
double **x = malloc(nrows * sizeof *x);
for (size_t i = 0; i < nrows; ++i)
x[i] = malloc(ncols * sizeof *x[i]);
(Note that all methods should test the malloc result and handle allocation failures. This is elided for the discussion here.)
That method allocates each row separately (from other rows and from the pointers). The book’s method has the benefit that only one allocation is done and the memory for the array is contiguous. Also, the relationships between elements in different rows are known, and that may allow programmers to take advantage of the relationships in designing algorithms that work well with cache and memory access.
Is there any reason w.r.t efficiency to prefer the first code to something like the below code?
Not for efficiency, no. Both the book’s method and the method above have the disadvantage that they generally require a pointer lookup for every array access (aside from the base pointer, x). Before the processor can get an element from the memory of a row, it has to get the address of the row from memory.
With the method you show, this additional lookup is unnecessary. Further, the processor and/or the compiler may be able to predict some things about the accesses. For example, with your method, the compiler may be able to see that M[(i+1)*ncols + j] is a different element from M[(i+2)*ncols + j], whereas with x[i+1][j] and x[i+2][j], it generally cannot know the two pointers x[i+1] and x[i+2] are different.
The book’s code is also defective. The number of bytes it allocates is nrows*sizeof(double*) + nrows*ncols*sizeof(double). Let’s say r is nrows, c is ncols, p is sizeof(double*) and d is sizeof(double). Then the code allocates rp + rcd bytes. Then the code sets x[0] to (double *)x + nrows. Because it casts to double *, the addition of nrows is done in units of the pointed-to type, double. So this adds rd bytes to the starting address. And, after that, it expects to have all the elements of the array, which is rcd bytes. So the code is using rd + rcd bytes even though it allocated rp + rcd. If p > d, some elements at the end of the array will be outside of the allocated memory. In current ordinary C implementations, the size of double * is less than or equal to the size of double, but this should not be relied on. Instead of setting x[0] to (double *)x + nrows;, it should calculate x plus the size of nrows elements of type double * plus enough padding to get to the alignment requirement of double, and it should include that padding in the allocation.
If we cannot use variable length arrays, then the array indexing can be provided by a macro, as by defining a macro that replaces x(i, j) with x[i*ncols+j], such as #define x(i, j) x[(i)*ncols + (j)].
I have two scenarios; in both I allocate 78*2*sizeof(int) bytes of memory and initialize it to 0.
Are there any differences regarding performance?
Scenario A:
int ** v = calloc(2 , sizeof(int*));
for (i=0; i<2; ++i)
{
v[i] = calloc(78, sizeof(int));
}
Scenario B:
int ** v = calloc(78 , sizeof(int*));
for (i=0; i<78; ++i)
{
v[i] = calloc(2, sizeof(int));
}
I supposed that, in performance terms, it's better to use calloc if an initialized array is needed; let me know if I'm wrong.
First, discussing optimization abstractly has some difficulties because compilers are becoming increasingly better at optimization. (For some reason, compiler developers will not stop improving them.) We do not always know what machine code given source code will produce, especially when we write source code today and expect it to be used for many years to come. Optimization may consolidate multiple steps into one or may omit unnecessary steps (such as clearing memory with calloc instead of malloc immediately before the memory is completely overwritten in a for loop). There is a growing difference between what source code nominally says (“Do these specific steps in this specific order”) and what it technically says in the language abstraction (“Compute the same results as this source code in some optimized fashion”).
However, we can generally figure that writing source code without unnecessary steps is at least as good as writing source code with unnecessary steps. With that in mind, let’s consider the nominal steps in your scenarios.
In Scenario A, we tell the computer:
Allocate 2 int *, clear them, and put their address in v.
Twice, allocate 78 int, clear them, and put their addresses in the preceding int *.
In Scenario B, we tell the computer:
Allocate 78 int *, clear them, and put their address in v.
78 times, allocate two int, clear them, and put their addresses in the preceding int *.
We can easily see two things:
Both of these scenarios clear the memory for the int * and immediately fill it with other data. That is wasteful; there is no need to set memory to zero before setting it to something else. Just set it to something else. Use malloc for this, not calloc. malloc takes just one parameter for the size instead of two that are multiplied, so replace calloc(2, sizeof (int *)) with malloc(2 * sizeof (int *)). (Also, to tie the allocation to the pointer being assigned, use int **v = malloc(2 * sizeof *v); instead of repeating the type separately.)
At the step where Scenario B does 78 things, Scenario A does two things, but the code is otherwise very similar, so Scenario A has fewer steps. If both would serve some purpose, then A is likely preferable.
However, both scenarios allude to another issue. Presumably, the so-called array will be used later in the program, likely in a form like v[i][j]. Using this as a value means:
Fetch the pointer v.
Calculate i elements beyond that.
Fetch the pointer at that location.
Calculate j elements beyond that.
Fetch the int at that location.
Let’s consider a different way to define v: int (*v)[78] = malloc(2 * sizeof *v);.
This says:
Allocate space for 2 arrays of 78 int and put their address in v.
Immediately we see that involves fewer steps than Scenario A or Scenario B. But also look at what it does to the steps for using v[i][j] as a value. Because v is a pointer to an array instead of a pointer to a pointer, the computer can calculate where the appropriate element is instead of having to load an address from memory:
Fetch the pointer v.
Calculate i•78 elements beyond that.
Calculate j elements beyond that.
Fetch the int at that location.
So this pointer-to-array version is one step fewer than the pointer-to-pointer version.
Further, the pointer-to-pointer version requires an additional fetch from memory for each use of v[i][j]. Fetches from memory can be expensive relative to in-processor operations like multiplying and adding, so it is a good step to eliminate. Having to fetch a pointer can prevent a processor from predicting where the next load from memory might be based on recent patterns of use. Additionally, the pointer-to-array version puts all the elements of the 2×78 array together in memory, which can benefit the cache performance. Processors are also designed for efficient use of consecutive memory. With the pointer-to-pointer version, the separate allocations typically wind up with at least some separation between the rows and may have a lot of separation, which can break the benefits of consecutive memory use.
This is more than one question. I need to deal with an NxN matrix A of integers in C. How can I allocate the memory on the heap? Is this correct?
int **A=malloc(N*sizeof(int*));
for(int i=0;i<N;i++) *(A+i)= malloc(N*sizeof(int));
I am not absolutely sure if the second line of the above code should be there to initialize the memory.
Next, suppose I want to access the element A[i, j] where i and j are the row and column indices starting from zero. Is it possible to do it via dereferencing the pointer **A somehow? For example, something like *(A + N*i + j)? I know I have some conceptual gap here and some help will be appreciated.
not absolutely sure if the second line of the above code should be there to initialize the memory.
It needs to be there, as it actually allocates the space for the N rows, each carrying the N ints you need.
The 1st allocation only allocates the row-indexing pointers.
to access the element A[i, j] where i and j are the row and column indices starting from zero. Is it possible to do it via dereferencing the pointer **
Sure, just do
A[1][1]
to access the 2nd element of the 2nd row.
This is identical to
*(*(A + 1) + 1)
Unrelated to your question:
Although the code you show is correct, a more robust way to code this would be:
int ** A = malloc(N * sizeof *A);
for (size_t i = 0; i < N; i++)
{
A[i] = malloc(N * sizeof *A[i]);
}
size_t is the type of choice for indexing, as it is guaranteed to be large enough to hold any index value possible on the system the code is compiled for.
Also, you want to add error checking to the two calls of malloc(), as either might return NULL on failure to allocate the amount of memory requested.
The declaration is correct, but the matrix won't occupy contiguous memory. It is an array of pointers, where each pointer can point to whatever location was returned by malloc. For that reason, addressing like *(A + N*i + j) does not make sense.
Assuming that the compiler has support for VLAs (which became optional in C11), the idiomatic way to define a contiguous matrix would be:
int (*matrixA)[N] = malloc(N * sizeof *matrixA);
In general, the syntax of matrix with N rows and M columns is as follows:
int (*matrix)[M] = malloc(N * sizeof *matrix);
Notice that neither M nor N has to be given as a constant expression (thanks to VLA pointers). That is, they can be ordinary (e.g. automatic) variables.
Then, to access elements, you can use ordinary index syntax like:
matrixA[0][0] = 100;
Finally, to release the memory for such matrices, use a single free, e.g.:
free(matrixA);
free(matrix);
You need to understand that 2D and higher arrays do not work well in C89. Beginner books usually introduce 2D arrays in a very early chapter, just after 1D arrays, which leads people to assume that the natural way to represent 2-dimensional data is via a 2D array. In fact they have many tricky characteristics and should be considered an advanced feature.
If you don't know array dimensions at compile time, or if the array is large, it's almost always easier to allocate a 1D array and access via the logic
array[y*width+x];
so in your case, just call
int *A;
A = malloc(N * N * sizeof(int));
A[3*N+2] = 123; // set element (3,2) to 123; note you can't write A[3][2] with this representation
It's important to note that the suggestion to use a flat array is just that, a suggestion; not everyone will agree with it, and 2D array handling is better in later versions of C. However, I think you'll find that this method works best.
The example in the nvidia programming guide shows them passing the pitchedPtr to their kernel:
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,int width, int height, int depth)
But instead of that why not just allocate in the same manner, but then call like:
__global__ void MyKernel(float* devPtr,int pitch, int width, int height, int depth)
and then access the elements however you like. I would prefer the latter, but why does the programming guide give the former example (which, while illustrating how to access the elements, also illustrates a design pattern that arguably should not be used with CUDA)?
Edit : meant to say that the float * devPtr is the ptr (void * ptr) member of the cudaPitchedPtr.
Either method is equally valid - it is purely an aesthetic decision on your part.
It is not even clear to me why cudaPitchedPtr has extra members - the only ones that really matter are the base pointer and the pitch.
I assume you're talking about cudaMalloc3D:
From the CUDA reference regarding cudaMalloc3D:
Allocates at least width * height * depth bytes of linear memory on the device and returns a cudaPitchedPtr in which ptr is a pointer to the allocated memory. The function may pad the allocation to ensure hardware alignment requirements are met.
So
cudaMalloc3D(&pitchedDevPtr, make_cudaExtent(w, h, d));
does, roughly:
cudaMalloc(&devPtr, w * h * d);
There is little difference from a call to cudaMalloc, but you get some convenience: you don't have to calculate the size of your array yourself, just pass a cudaExtent struct to the function. Of course you get an array sized in bytes; the size of your data type is not specified in the cudaExtent structure.
Whether you pass your plain pointer or your cudaPitchedPtr to the kernel is a design decision. The cudaPitchedPtr delivers not only the devPtr to your kernel; it also stores the pitch and the sizes of the dimensions. To save memory (and registers) you get only the sizes in the x and y directions; z is just pitch / (x * y).
EDIT: As pointed out, cudaMalloc3D adds padding to ensure coalesced memory access. But since Compute Capability 1.2, a memory access can be coalesced even if the starting address is not properly aligned. On devices with CC >= 1.2 there is no difference between those two allocations regarding performance.
I have written some C code running on OS X 10.6 which happens to be slow, so I am using valgrind to check for memory leaks etc. One of the things I have noticed while doing this:
If I allocate memory to a 2D array like this:
double** matrix = NULL;
allocate2D(matrix, 2, 2);
void allocate2D(double** matrix, int nrows, int ncols) {
matrix = (double**)malloc(nrows*sizeof(double*));
int i;
for(i=0;i<nrows;i++) {
matrix[i] = (double*)malloc(ncols*sizeof(double));
}
}
Then when I check the memory address of matrix, it is 0x0.
However if I do
double** matrix = allocate2D(2,2);
double** allocate2D(int nrows, int ncols) {
double** matrix = (double**)malloc(nrows*sizeof(double*));
int i;
for(i=0;i<nrows;i++) {
matrix[i] = (double*)malloc(ncols*sizeof(double));
}
return matrix;
}
This works fine, i.e. the pointer to the newly created memory is returned.
I also have a free2D function to free the memory, and it doesn't seem to free properly: the pointer still points to the same address as before the call to free, not 0x0 (which I thought might be the default).
void free2D(double** matrix, int nrows) {
int i;
for(i=0;i<nrows;i++) {
free(matrix[i]);
}
free(matrix);
}
My question is: Am I misunderstanding how malloc/free work? If so, can someone suggest what's going on?
Alex
When you free a pointer, the value of the pointer does not change, you will have to explicitly set it to 0 if you want it to be null.
In the first example, you've only stored the pointer returned by malloc in a local variable. It's lost when the function returns.
Usual practice in the C language is to use the function's return value to pass the pointer to an allocated object back to the caller. As Armen pointed out, you can also pass a pointer to where the function should store its output:
void Allocate2D(double*** pMatrix...)
{
*pMatrix = malloc(...)
}
but I think most people would scream as soon as they see ***.
You might also consider that arrays of pointers are not an efficient implementation of matrices. Allocating each row separately contributes to memory fragmentation, malloc overhead (because each allocation involves some bookkeeping, not to mention the extra pointers you have to store), and cache misses. And each access to an element of the matrix involves 2 pointer dereferences rather than just one, which can introduce stalls. Finally, you have a lot more work to do allocating the matrix, since you have to check for failure of each malloc and cleanup everything you've already done if any of them fail.
A better approach is to use a one-dimensional array:
double *matrix;
matrix = malloc(nrows*ncols*sizeof *matrix);
then access element (i,j) as matrix[i*ncols+j]. The potential disadvantages are the multiplication (which is slow on ancient cpus but fast on modern ones) and the syntax.
A still-better approach is not to seek excess generality. Most matrix code on SO is not for advanced numerical mathematics where arbitrary matrix sizes might be needed, but for 3d gaming where 2x2, 3x3, and 4x4 are the only matrix sizes of any practical use. If that's the case, try something like
double (*matrix)[4] = malloc(4*sizeof *matrix);
and then you can access element (i,j) as matrix[i][j] with a single dereference and an extremely fast multiply-by-constant. And if your matrix is only needed at local scope or inside a structure, just declare it as:
double matrix[4][4];
If you're not extremely adept with the C type system and the declarations above, it might be best to just wrap all your matrices in structs anyway:
struct matrix4x4 {
double x[4][4];
};
Then declarations, pointer casts, allocations, etc. become a lot more familiar. The only disadvantage is that you need to do something like matrix.x[i][j] or matrix->x[i][j] (depending on whether matrix is a struct of pointer to struct) instead of matrix[i][j].
Edit: I did think of one useful property of implementing your matrices as arrays of row pointers - it makes permutation of rows a trivial operation. If your algorithms need to perform a lot of row permutation, this may be beneficial. Note that the benefit will not be much for small matrices, though, and that column permutation cannot be optimized this way.
In C++ you should pass the pointer by reference :)
Allocate2D(double**& matrix...)
As to what's going on - you have a pointer that is NULL; you pass a copy of that pointer to the function, which allocates memory and initializes the copy of your pointer with the address of the newly allocated memory, but your original pointer remains NULL. As for free, you don't need to pass by reference, since only the value of the pointer is relevant. HTH
Since there are no references in C, you can pass by pointer, that is:
void Allocate2D(double*** pMatrix...)
{
    *pMatrix = malloc(...);
}
and later call it like:
Allocate2D(&matrix ...)