Armadillo: efficient matrix allocation on the heap - heap-memory

I'm using Armadillo to manipulate large matrices in C++ read from a CSV-file.
mat X;
X.load("myfile.csv",csv_ascii);
colvec x1 = X(span::all,0);
colvec x2 = X(span::all,1);
//etc.
So x1,...,xk (for k=20 say) are the columns of X. X will typically have rows ranging from 2000 to 16000. My question is:
How can I allocate (and subsequently deallocate) X onto the heap (free store)?
This section of Armadillo docs explains auxiliary memory allocation of a mat. Is this the same as heap allocation? It requires prior knowledge of matrix dimensions, which I won't know until X is read from csv:
mat(aux_mem*, n_rows, n_cols, copy_aux_mem = true, strict = true)
Any suggestions would be greatly appreciated. (I'm using g++-4.2.1; my current program runs fine locally on my Macbook Pro, but when I run it on my university's computing cluster (Linux g++-4.1.2), I run into a segmentation fault. The program is too large to post).
Edit: I ended up doing this:
arma::u32 Z_rows = 10000;
arma::u32 Z_cols = 20;
double* aux_mem = new double[Z_rows*Z_cols];
mat Z(aux_mem,Z_rows,Z_cols,false,true);
Z = randn(Z_rows, Z_cols);
which first allocates memory on the heap and then tells the matrix Z to use it.

By looking at the source code, Armadillo already allocates large matrices on the heap.
To reduce the amount of memory required, you may want to use fmat instead of mat. This will come with the trade-off of reduced precision.
fmat uses float, while mat uses double: see http://arma.sourceforge.net/docs.html#Mat.
It's also possible that the system administrator of the linux computing cluster has enabled limits on it (eg. each user can allocate only upto a certain amount of maximum memory). For example, see http://www.linuxhowtos.org/Tips%20and%20Tricks/ulimit.htm.

Related

c language 'order of memory allocation' and 'execution speed'

I'm making a program using c.
I have many arrays and size of each array is not so small.
(more than 10,000 elements in each array).
And, there are set of arrays that are accessed and computed together frequently.
For example,
a_1[index] = constant * a_2[index];
b_1[index] = constant * b_2[index];
a_1 is compute with a_2 and b_1 is computed with b_2.
Suppose that I have a~z_1 and a~z_2 arrays, in my case,
is there significant 'execution speed' difference between the following 2 different memory allocation ways.
allocating memory in order of a~z_1 followed by a~z_2
allocating a_1,a_2 followed by b_1,b_2, c_1,c_2 and others?
1.
MALLOC(a_1);
MALLOC(b_1);
...
MALLOC(z_1);
MALLOC(a_2);
...
MALLOC(z_2);
2.
MALLOC(a_1);
MALLOC(a_2);
MALLOC(b_1);
MALLOC(b_2);
...
MALLOC(z_1);
MALLOC(z_2);
I think allocating memory in second way will be faster because of hit rate.
Because arrays allocated in similar times will be in the similar address, those arrays will be uploaded in the cash or RAM at the same time, and therefore computer does not need to upload arrays in several times to compute one line of code.
For example, to compute
a_1[index] = constant * a_2[index];
, upload a_1 and a_2 at the same time not separately.
(Is it correct?)
However, for me, in terms of maintenance, allocating in first way is much easier.
I have AA_a~AA_z_1,AA_a~AA_z_2, BB_a~BB_z_1, CC_a~CC_z~1 and other arrays.
Because I can efficiently use MACRO in the following way to allocate memory.
Like,
#define MALLOC_GROUP(GROUP1,GROUP2)
MALLOC(GROUP1##_a_##GROUP2);
MALLOC(GROUP1##_b_##GROUP2);
...
MALLOC(GROUP1##_z_##GROUP2)
void allocate(){
MALLOC_GROUP(AA,1);
MALLOC_GROUP(AA,2);
MALLOC_GROUP(BB,2);
}
To sum, is allocating sets of arrays, computed together, at the similar time affects to the execution speed of the program?
Thank you.

Out of memory only by a matrix transpose

I have a cell, Data, it contains three double arrays,
Data =
[74003x253 double] [8061x253 double] [7241x253 double]
I'm using a loop to read these arrays and perform some functions,
for ii = 1 : 3
D = Data {ii} ;
m = mean (D') ;
// rest of the code
end
Which gets a warning for mean and says:
consider using different DIMENSION input argument for MEAN
However when I change it to,
for ii = 1 : 3
D = Data {ii}' ;
m = mean (D) ;
// rest of the code
end
I get Out of memory error.
Comparing two codes, can someone explain what happens?
It seems that I get the error only with a Complex conjugate transpose (my data is real valued).
To take the mean for the n:th dimension it is possible use mean(D,n) as already stated. Regarding the memory consumption, I did some tests monitoring with the windows resource manager. The output was kind of expected.
When doing the operation D=Data{ii} only minimum memory is consumed since here matlab does no more than copying a pointer. However, when doing a transpose, matlab needs to allocate more memory to store the matrix D, which means that the memory consumption increases.
However, this solely does not cause a memory overflow, since the transpose is done in both cases.
Case 1
Separately inD = Data{ii}';
Case 2
in
D = Data {ii}; m = mean(D');
The difference is that in case 2 matlab only creates a temporary copy of Data{ii}' which is not stored in the workspace. The memory allocated is the same in both cases, but in case 1 Data{ii}' is stored in D. When the memory later increases this can cause a memory overflow.
The memory consumption of D is not that bad (< 200 Mb), but the guess is that the memory got high already and that this was enough to give memory overflow.
The warning message means that instead of,
m = mean (D') ;
you should do:
m = mean (D,2);
This will take the mean along the second dimension, leaving you with a column vector the length of size(D,1).
I don't know why you only get the out of memory error when you do D = Data {ii}'. Perhaps it's becauase when you have it in side of mean (m = mean (D') ; the JIT manages to optimize somehow and save you wasted memory.
Here are some ways of doing this:
for i = 1 : length(Data)
% as chappjc recommends this is an excellent solution
m = mean(Data{i}, 2);
end
Or if you want the transpose and you know the data is real (not complex)
for i = 1 : length(Data)
m = mean(Data{i}.');
end
Note, the dot before the transpose.
Or, skip the loop all together
m = cellfun(#(d) mean(d, 2), Data, 'uniformoutput', false);
When you do:
D = Data{i}'
Matlab will create a new copy of your data. This will allocate 74003x253 doubles, which is about 150MB. As patrick pointed out, given that you might have other data you can easily exceed the allowed memory allocation usage (especially on a 32-bit machine).
If you are running with memory problems, the computations are not sensitive, you may consider using single precision instead of double, i.e.:
data{i} = single(data{i});
Ideally, you want to do the single precision at point of allocation to avoid unnecessary new allocation and copies.
Good luck.

Increasing luminosity does not produce desired effect

cvCvtColor(img,dst,CV_RGB2YCrCb);
for (int col=0;col<dst->width;col++)
{
for (int row=0;row<dst->height;row++)
{
int idxF = row*dst->widthStep + dst->nChannels*col; // Read the image data
CvPoint pt = {row,col};
temp_ptr2[0] += temp_ptr1[0]* 0.0722 + temp_ptr1[1] * 0.7152 +temp_ptr1[2] *0.2126 ; // channel Y
}
}
But the result is this:
Please assist where am i going wrong?
There is a lot to say about this code sample:
First, you are using the old C-style API (IplImage pointers, cvBlah functions, etc), which is obsolete and more difficult to maintain (in particular, memory leaks are introduced easily), so you should consider using the C++-style structures and functions (cv::Mat structure and cv::blah functions).
Your error is probably coming from the instruction cvCopy(dst,img); at the very beginning. This fills your input image with nothing just before you start your processing, so you should remove this line.
For maximum speed, you should invert the two loops, so that you first iterate over rows then over columns. This is because images in OpenCV are stored row-by-row in memory, hence accessing the images by increasing column is more efficient with respect to the cache usage.
The temporary variable idxF is never used, so you should probably remove the following line too:
int idxF = row*dst->widthStep + dst->nChannels*col;
When you access image data to store the pixels in temp_ptr1 and temp_ptr2, you swapped the positions of the x and y coordinates. You should access the image in the following way:
temp_ptr1 = &((uchar*)(img->imageData + (img->widthStep*pt.y)))[pt.x*3];
You never release the memory allocated for dst, hence introducing a memory leak in your application. Call cvReleaseImage(&dst); at the end of your function.

why buffer memory allocation error in opencl

I execute the OpenCL program on an NDRange with a work-group size of 16*16 and work-global size of 1024*1024. The application is matrix multiplication.
When the two input matrix size are both little, it works well. But when the size of input matrix becomes large, for example, larger than 20000*20000, it reports error "CL_MEM_OBJECT_ALLOCATION_FAILURE" in the function of enqueuendrangekernrl.
I am puzzled. I am not familiar with memory allocation. What's the reason?
With clGetDeviceInfo, you can query the device global memory size with CL_DEVICE_GLOBAL_MEM_SIZE, and the maximum size you can alloc in a single memory object with CL_DEVICE_MAX_MEM_ALLOC_SIZE. Three matrices of 20000*20000*sizeof(float) = 1.6 GB probably exceed these limits.

pthread speedup for matrix add/multiply

I am writing a basic code to add two matrix and note down the time taken for single thread and 2 or more threads. In the approach first i divide the given two matrix (initialized randomly) in THREADS number of segments, and then each of these segments are sent to the addition module, which is started by the pthread_create call. The argument to the parallel addition function is the following.
struct thread_segment
{
matrix_t *matrix1, *matrix2, *matrix3;
int start_row, offset;
};
Pointers to two source and one destination matrix. (Once source and the destination may point to the same matrix). The start_row is the row from which the particular thread should start adding, and the offset tells till how much this thread should add starting from start_row.
The matrix_t is a simple structure defined as below:
typedef struct _matrix_t
{
TYPE **mat;
int r, c;
} matrix_t;
I have compiled it with 2 threads, but there is (almost) no speedup when i ran with 10000 x 10000 matrix. I am recording the running time with time -p program.
The matrix random initialization is also done in parallel like above.
I think this is because all the threads work on the same matrix address area, may be because of that a bottleneck is not making any speedup. Although all the threads will work on different segments of a matrix, they don't overlap.
Previously i implemented a parallel mergesort and a quicksort which also showed similar characteristics, i was able to get speedup when i copied the data segment on which a particular thread is to work to a newly allocated memory.
My question is that is this because of:
memory bottleneck?
Time benchmark is not done in the proper way?
Dataset too small?
Coding error?
Other
In the case, if it is a memory bottleneck, then do every parallel program use exclusive memory area, even when multiple access of the threads on the shared memory can be done without mutex?
EDIT
I can see speedup when i make the matrix segments like
curr = 0;
jump = matrix1->r / THREADS;
for (i=0; i<THREADS; i++)
{
th_seg[i].matrix1 = malloc (sizeof (matrix_t));
th_seg[i].matrix1->mat = &(matrix1->mat[curr]);
th_seg[i].matrix1->c = matrix1->c;
th_seg[i].matrix1->r = jump;
curr += jump;
}
That is before passing, assign the base address of the matrix to be processed by this thread in the structure and store the number of rows. So now the base address of each matrix is different for each thread. But only if i add some small dimention matrix 100 x 100 say, many times. Before calling the parallel add in each iteration, i am re assigning the random values. Is the speedup noticed here true? Or due to some other phenomena chaching effects?
To optimize memory usage, you may want to take a look at loop tiling. That would help cache memory to be updated. In this approach you divide your matrices into smaller chunks so the cache can hold the values for longer time and does not need to update it self frequently.
Also notice, creating to many threads just increases the overhead of switching among them.
In order to get a feeling that how much a proper implementation can affect the run time of a concurrent program, these are the results of a programs to multiply two matrices in naive, cocnurrent and tiling-concurrent :
seconds name
10.72 simpleMul
5.16 mulThread
3.19 tilingMulThread

Resources