Why do I get a buffer memory allocation error in OpenCL? (C)

I execute an OpenCL program on an NDRange with a work-group size of 16*16 and a global work size of 1024*1024. The application is matrix multiplication.
When the two input matrices are both small, it works well. But when the input matrices become large, for example larger than 20000*20000, clEnqueueNDRangeKernel reports the error CL_MEM_OBJECT_ALLOCATION_FAILURE.
I am puzzled. I am not familiar with memory allocation. What is the reason?

With clGetDeviceInfo, you can query the device's global memory size with CL_DEVICE_GLOBAL_MEM_SIZE, and the maximum size you can allocate in a single memory object with CL_DEVICE_MAX_MEM_ALLOC_SIZE. A single 20000*20000 matrix of floats is 20000*20000*sizeof(float) = 1.6 GB, and matrix multiplication needs three such buffers (two inputs and one output), so the allocation very likely exceeds these limits.
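For reference, a minimal sketch of that query in C (assuming device is a valid cl_device_id obtained via clGetDeviceIDs):

#include <stdio.h>
#include <CL/cl.h>

/* Print the two device limits relevant to CL_MEM_OBJECT_ALLOCATION_FAILURE. */
void print_mem_limits(cl_device_id device)
{
    cl_ulong global_mem = 0, max_alloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(max_alloc), &max_alloc, NULL);
    printf("Global memory:         %llu bytes\n", (unsigned long long)global_mem);
    printf("Max single allocation: %llu bytes\n", (unsigned long long)max_alloc);
}

Compare these values against the size of each buffer you pass to clCreateBuffer before enqueueing the kernel.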

Related

Given the max size of an array, should I write everything and shrink at the end, or extend each time?

Let's say I know the max size of my array would be 8, but in most cases the logical size would be lower, given the program's logic.
In this scenario, which is more memory- and time-efficient?
A. Begin with an array with a physical size of 1, write to it, and each time extend it by one (dynamically allocating a new array and freeing the old one) before continuing to write.
B. Begin with an array of the given max size (in our case 8), write to every needed slot, and at the end shrink it to the logical size (dynamically allocating a new array and freeing the old one).
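A minimal sketch of the two strategies in C, assuming int elements, a max of 8, and 0 < n <= 8 valid entries (all names are illustrative):

#include <stdlib.h>
#include <string.h>

#define MAX_SIZE 8

/* Strategy A: extend by one element per write (one realloc per element). */
int *grow_one_by_one(const int *src, size_t n)
{
    int *a = NULL;
    for (size_t i = 0; i < n; i++) {
        int *tmp = realloc(a, (i + 1) * sizeof *a);
        if (!tmp) { free(a); return NULL; }
        a = tmp;
        a[i] = src[i];
    }
    return a;
}

/* Strategy B: allocate the max up front, shrink once at the end. */
int *preallocate_then_shrink(const int *src, size_t n)
{
    int *a = malloc(MAX_SIZE * sizeof *a);
    if (!a) return NULL;
    memcpy(a, src, n * sizeof *a);
    int *tmp = realloc(a, n * sizeof *a); /* single shrink */
    return tmp ? tmp : a;
}

B performs at most two allocator calls where A performs up to eight, so B is normally cheaper in time, and its transient extra memory is only the unused slots. With a max as small as 8, though, either approach is negligible in practice.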

Is there a way to use MPI_Reduce with a dynamic send-buffer size?

I would like to use MPI_Reduce with a variable number of elements for each process.
For example, say we have 4 processes, each with a dynamically sized buffer:
P(0) buffer size = 21
P(1) buffer size = 24
P(2) buffer size = 21
P(3) buffer size = 12
I would like to reduce the values of these elements onto the process with rank 0.
My idea is to allocate a receive buffer whose size equals the maximum number of elements held by any process (in this case 24) and use that to retrieve the values from the various processes.
Is there a way to do this without increasing the execution time too much?
I am using Open MPI 2.1.1 in C. Thanks.
There is no reduction variant in MPI that works with different numbers of elements per rank: it wouldn't know what to fill in for the missing operands of the reduction operation. It's pretty straightforward to write yourself, though, just as you suggested:
1. Determine the maximum buffer size.
2. Allocate a max-sized buffer on each rank, copy in the local buffer, and pad with the neutral element of your reduction operation.
3. Run the reduction on the now equal-sized buffers, as in the sketch below.
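A minimal sketch of that recipe, assuming a sum reduction over doubles (so the neutral element is 0.0); local_buf, local_n, and reduce_uneven are illustrative names:

#include <stdlib.h>
#include <string.h>
#include <mpi.h>

/* Reduce per-rank buffers of different sizes onto rank 0. */
void reduce_uneven(const double *local_buf, int local_n, MPI_Comm comm)
{
    int rank, max_n;
    MPI_Comm_rank(comm, &rank);

    /* 1. Determine the maximum buffer size across ranks. */
    MPI_Allreduce(&local_n, &max_n, 1, MPI_INT, MPI_MAX, comm);

    /* 2. Copy into a max-sized buffer; calloc pads with 0.0,
       the neutral element for MPI_SUM. */
    double *padded = calloc(max_n, sizeof *padded);
    memcpy(padded, local_buf, local_n * sizeof *padded);

    /* 3. Reduce the now equal-sized buffers onto rank 0. */
    double *result = (rank == 0) ? malloc(max_n * sizeof *result) : NULL;
    MPI_Reduce(padded, result, max_n, MPI_DOUBLE, MPI_SUM, 0, comm);

    free(padded);
    free(result); /* rank 0 would use result before freeing it */
}

The extra cost over a plain MPI_Reduce is one MPI_Allreduce of a single int plus one local copy, which is usually negligible next to the reduction itself.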

C: 'order of memory allocation' and 'execution speed'

I'm writing a program in C.
I have many arrays, and each array is not small
(more than 10,000 elements per array).
Also, there are sets of arrays that are frequently accessed and computed together.
For example,
a_1[index] = constant * a_2[index];
b_1[index] = constant * b_2[index];
a_1 is computed with a_2, and b_1 is computed with b_2.
Suppose that I have arrays a_1~z_1 and a_2~z_2. In my case,
is there a significant difference in execution speed between the following two ways of allocating memory:
allocating memory in the order a_1~z_1 followed by a_2~z_2, or
allocating a_1, a_2 followed by b_1, b_2, then c_1, c_2, and so on?
1.
MALLOC(a_1);
MALLOC(b_1);
...
MALLOC(z_1);
MALLOC(a_2);
...
MALLOC(z_2);
2.
MALLOC(a_1);
MALLOC(a_2);
MALLOC(b_1);
MALLOC(b_2);
...
MALLOC(z_1);
MALLOC(z_2);
I think the second way of allocating will be faster because of the cache hit rate. Arrays allocated at around the same time tend to end up at nearby addresses, so those arrays would be loaded into the cache (or fetched from RAM) together, and the computer would not need to load them in several separate steps to compute one line of code.
For example, to compute
a_1[index] = constant * a_2[index];
a_1 and a_2 would be loaded together rather than separately.
(Is this correct?)
However, the first way is much easier for me to maintain.
I have arrays AA_a_1~AA_z_1, AA_a_2~AA_z_2, BB_a_1~BB_z_1, CC_a_1~CC_z_1, and others,
and I can allocate a whole group efficiently with a macro, like this:
#define MALLOC_GROUP(GROUP1,GROUP2) \
    MALLOC(GROUP1##_a_##GROUP2);    \
    MALLOC(GROUP1##_b_##GROUP2);    \
    /* ... */                       \
    MALLOC(GROUP1##_z_##GROUP2)
void allocate(){
    MALLOC_GROUP(AA,1);
    MALLOC_GROUP(AA,2);
    MALLOC_GROUP(BB,2);
}
To sum up: does allocating sets of arrays that are computed together at around the same time affect the execution speed of the program?
Thank you.
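If adjacency is what matters, one way to guarantee it, rather than relying on where malloc happens to place separate allocations, is to carve each pair out of a single block; a minimal sketch with illustrative names:

#include <stdlib.h>

#define N 10000

/* Allocate a_1 and a_2 back to back from one block, so the pair is
   contiguous regardless of how the allocator places separate calls.
   Freeing *a_1 releases both halves. */
void allocate_pair(double **a_1, double **a_2)
{
    double *block = malloc(2 * N * sizeof *block);
    *a_1 = block;     /* first half  */
    *a_2 = block + N; /* second half */
}

That said, modern hardware prefetchers can track several independent streams at once, so the difference between the two allocation orders is often small; measuring both variants is the only reliable way to know for your workload.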

numa, mbind, segfault

I have allocated memory using valloc, say an array A of 15*sizeof(double) bytes. Now I divide it into three pieces, and I want to bind each piece (of length 5) to a different NUMA node (say 0, 1, and 2). Currently, I am doing the following:
double* A=(double*)valloc(15*sizeof(double));
piece=5;
nodemask=1;
mbind(&A[0],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=2;
mbind(&A[5],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
nodemask=4;
mbind(&A[10],piece*sizeof(double),MPOL_BIND,&nodemask,64,MPOL_MF_MOVE);
First question: am I doing this right? I.e., are there any problems with page-size alignment, for example? With a size of 15 for array A it runs fine, but if I increase the array size to something like 6156000 with piece = 2052000, so that the three mbind calls start at &A[0], &A[2052000], and &A[4104000], then I get a segmentation fault (and sometimes it just hangs). Why does it run fine for a small size but segfault for a larger one? Thanks.
For this to work, you need to deal with chunks of memory that are at least page-sized and page-aligned; that means 4 KB on most systems. In your case, I suspect some pages get moved twice (possibly three times), because you call mbind() three times over ranges that share pages.
The way NUMA memory is laid out is that CPU socket 0 has a range of 0..X-1 MB, socket 1 has X..2X-1, socket 2 has 2X..3X-1, and so on. Of course, if you put a 4 GB stick of RAM next to socket 0 and a 16 GB stick next to socket 1, the distribution isn't even, but the principle still stands: a large chunk of the physical address space belongs to each socket, according to where the memory is actually attached.
As a consequence of how the memory is laid out, the physical memory you use has to be placed into the linear (virtual) address space by page mapping.
So, for large chunks of memory it is fine to move things around, but for small chunks it won't work quite right: you certainly can't split a single page between two different CPU sockets.
Edit:
To split the array, you first need to find a page-aligned piece size.
page_size = sysconf(_SC_PAGESIZE);
objs_per_page = page_size / sizeof(A[0]);
// There should be a whole number of "objects" per page. This checks that
// no object straddles a page boundary.
ASSERT(page_size % sizeof(A[0]) == 0);
split_three = SIZE / 3;
aligned_size = (split_three / objs_per_page) * objs_per_page;
remnant = SIZE - (aligned_size * 3);
piece = aligned_size;
// Set nodemask to the desired node (1, 2, 4) before each call, as in the question.
mbind(&A[0], piece*sizeof(A[0]), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
mbind(&A[aligned_size], piece*sizeof(A[0]), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
// The last piece starts page-aligned and absorbs the remnant.
mbind(&A[aligned_size*2], (piece + remnant)*sizeof(A[0]), MPOL_BIND, &nodemask, 64, MPOL_MF_MOVE);
Obviously, you will now need to split the work between the three threads in the same way, using the aligned size and the remnant as needed.

Armadillo: efficient matrix allocation on the heap

I'm using Armadillo to manipulate large matrices in C++ that are read from a CSV file.
mat X;
X.load("myfile.csv",csv_ascii);
colvec x1 = X(span::all,0);
colvec x2 = X(span::all,1);
//etc.
So x1,...,xk (for k=20, say) are the columns of X. X will typically have between 2000 and 16000 rows. My question is:
How can I allocate (and subsequently deallocate) X on the heap (the free store)?
This section of the Armadillo docs explains auxiliary-memory allocation for a mat. Is this the same as heap allocation? It requires knowing the matrix dimensions in advance, which I won't know until X has been read from the CSV:
mat(aux_mem*, n_rows, n_cols, copy_aux_mem = true, strict = true)
Any suggestions would be greatly appreciated. (I'm using g++ 4.2.1; my current program runs fine locally on my MacBook Pro, but when I run it on my university's computing cluster (Linux, g++ 4.1.2), I get a segmentation fault. The program is too large to post.)
Edit: I ended up doing this:
arma::u32 Z_rows = 10000;
arma::u32 Z_cols = 20;
double* aux_mem = new double[Z_rows*Z_cols]; // allocate on the heap ourselves
mat Z(aux_mem,Z_rows,Z_cols,false,true); // copy_aux_mem=false: Z uses aux_mem directly; strict=true: size is fixed
Z = randn(Z_rows, Z_cols);
which first allocates memory on the heap and then tells the matrix Z to use it. (Since copy_aux_mem is false, Armadillo will not free aux_mem; it must eventually be released with delete[].)
Looking at the source code shows that Armadillo already allocates large matrices on the heap.
To reduce the amount of memory required, you may want to use fmat instead of mat. This comes with the trade-off of reduced precision:
fmat uses float, while mat uses double; see http://arma.sourceforge.net/docs.html#Mat.
It's also possible that the system administrator of the Linux computing cluster has enabled resource limits (e.g. each user can only allocate up to a certain maximum amount of memory). For example, see http://www.linuxhowtos.org/Tips%20and%20Tricks/ulimit.htm.
