Out of memory caused only by a matrix transpose

I have a cell array, Data, containing three double arrays:
Data =
[74003x253 double] [8061x253 double] [7241x253 double]
I'm using a loop to read these arrays and apply some functions:
for ii = 1 : 3
D = Data {ii} ;
m = mean (D') ;
% rest of the code
end
This triggers a warning for mean that says:
consider using different DIMENSION input argument for MEAN
However, when I change it to:
for ii = 1 : 3
D = Data {ii}' ;
m = mean (D) ;
% rest of the code
end
I get an Out of memory error.
Comparing the two versions, can someone explain what happens?
It seems that I get the error only with the complex conjugate transpose (my data is real valued).

To take the mean along the n-th dimension, you can use mean(D,n), as already stated. Regarding memory consumption, I ran some tests while monitoring with the Windows Resource Monitor. The output was more or less as expected.
When doing the operation D = Data{ii}, only minimal memory is consumed, since here MATLAB does no more than copy a pointer. However, when doing a transpose, MATLAB needs to allocate more memory to store the matrix D, which means that the memory consumption increases.
However, this solely does not cause a memory overflow, since the transpose is done in both cases.
Case 1:
D = Data{ii}';
Case 2:
D = Data{ii}; m = mean(D');
The difference is that in case 2 MATLAB only creates a temporary copy of Data{ii}', which is not stored in the workspace. The memory allocated is the same in both cases, but in case 1 Data{ii}' is stored in D and stays in the workspace. When memory use later grows, this can cause a memory overflow.
The memory consumption of D itself is not that bad (< 200 MB), but my guess is that memory use was already high and this was enough to cause the memory overflow.
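As a rough sanity check on those numbers, here is a Python back-of-the-envelope (assuming MATLAB's 8 bytes per double) for the transposed copy of the largest array alone:

```python
# A full transposed copy of the 74003x253 double array needs
# rows * cols * 8 bytes, since MATLAB stores a double in 8 bytes.
rows, cols = 74003, 253
bytes_needed = rows * cols * 8
print(f"{bytes_needed / 1e6:.0f} MB")  # 150 MB
```

That is comfortably under 200 MB on its own, consistent with the guess that the overflow only happens because memory use was already high.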

The warning message means that instead of,
m = mean (D') ;
you should do:
m = mean (D,2);
This will take the mean along the second dimension, leaving you with a column vector the length of size(D,1).
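For intuition, the same dimension-argument idea has a direct NumPy analogue (a hedged sketch; axis=1 plays the role of MATLAB's dimension 2):

```python
import numpy as np

# Small stand-in for one of the Data{ii} arrays.
D = np.arange(12, dtype=float).reshape(3, 4)

# MATLAB's mean(D, 2) averages along the second dimension (across each
# row), giving one value per row; in NumPy that is axis=1.
m = D.mean(axis=1)
print(m)  # [1.5 5.5 9.5]
```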
I don't know why you only get the out of memory error when you do D = Data{ii}'. Perhaps it's because when the transpose is inside the call to mean (m = mean(D');), the JIT manages to optimize it somehow and save you the wasted memory.

Here are some ways of doing this:
for i = 1 : length(Data)
% as chappjc recommends this is an excellent solution
m = mean(Data{i}, 2);
end
Or, if you want the transpose and you know the data is real (not complex):
for i = 1 : length(Data)
m = mean(Data{i}.');
end
Note the dot before the transpose: .' is the non-conjugate transpose.
Or, skip the loop altogether:
m = cellfun(@(d) mean(d, 2), Data, 'UniformOutput', false);
When you do:
D = Data{i}'
MATLAB will create a new copy of your data. This allocates 74003x253 doubles, which is about 150 MB. As patrick pointed out, given that you may have other data, you can easily exceed the allowed memory usage (especially on a 32-bit machine).
If you are running into memory problems and the computations are not sensitive to precision, you may consider using single precision instead of double, i.e.:
data{i} = single(data{i});
Ideally, you want to convert to single precision at the point of allocation, to avoid unnecessary new allocations and copies.
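To see what the single-precision switch buys, here is a NumPy analogue (illustrative only; MATLAB's single has the same 4-byte storage as NumPy's float32):

```python
import numpy as np

rows, cols = 74003, 253
d = np.zeros((rows, cols), dtype=np.float64)  # like a MATLAB double array
s = d.astype(np.float32)                      # like single(Data{i})

# Single precision halves the storage for the same matrix.
print(d.nbytes // 10**6, "MB vs", s.nbytes // 10**6, "MB")  # 149 MB vs 74 MB
```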
Good luck.

Related

C language: 'order of memory allocation' and 'execution speed'

I'm making a program in C.
I have many arrays, and each array is not small
(more than 10,000 elements per array).
Also, there are sets of arrays that are accessed and computed together frequently.
For example,
a_1[index] = constant * a_2[index];
b_1[index] = constant * b_2[index];
a_1 is computed from a_2, and b_1 is computed from b_2.
Suppose that I have a~z_1 and a~z_2 arrays. In my case,
is there a significant 'execution speed' difference between the following two memory allocation orders?
allocating memory in order of a~z_1 followed by a~z_2
allocating a_1,a_2 followed by b_1,b_2, c_1,c_2 and others?
1.
MALLOC(a_1);
MALLOC(b_1);
...
MALLOC(z_1);
MALLOC(a_2);
...
MALLOC(z_2);
2.
MALLOC(a_1);
MALLOC(a_2);
MALLOC(b_1);
MALLOC(b_2);
...
MALLOC(z_1);
MALLOC(z_2);
I think allocating memory the second way will be faster because of the cache hit rate.
Arrays allocated at similar times will sit at similar addresses, so those arrays will be loaded into the cache together, and the computer does not need to fetch the arrays several times to compute one line of code.
For example, to compute
a_1[index] = constant * a_2[index];
the computer should load a_1 and a_2 at the same time, not separately.
(Is this correct?)
However, for me, allocating the first way is much easier in terms of maintenance.
I have AA_a~AA_z_1, AA_a~AA_z_2, BB_a~BB_z_1, CC_a~CC_z_1, and other arrays,
because I can efficiently use a macro to allocate memory in the following way:
#define MALLOC_GROUP(GROUP1,GROUP2) \
MALLOC(GROUP1##_a_##GROUP2); \
MALLOC(GROUP1##_b_##GROUP2); \
... \
MALLOC(GROUP1##_z_##GROUP2)
void allocate(){
MALLOC_GROUP(AA,1);
MALLOC_GROUP(AA,2);
MALLOC_GROUP(BB,2);
}
To sum up: does allocating sets of arrays that are computed together at similar times affect the execution speed of the program?
Thank you.

Matlab - Slicing an Array without using additional memory

I want to slice an 4D-array into n parts along the 5th Dimension in order to use it in parfor:
X(:,:,:,particles)-->X(:,:,:,particles/n,n)
The problem is that X is so big that I run out of memory if I start writing it into a new variable, so I want to basically do:
X = cat(5,X(:,:,:,1:particles/n),X(:,:,:,particles/n+1:2*particles/n),...)
I am doing this with
sliced = 'cat(5'
for i=1:n
sliced = strcat(2,sliced,sprintf(',X(:,:,:,(1+(%i-1)*%i):%i*%i)',i,particles/n,i,particles/n))
end
sliced = strcat(2,sliced,')');
X = eval(sliced);
I get:
Error: The input character is not valid in MATLAB statements or expressions.
If I print out the contents of sliced, comment everything out, and paste the printout manually into eval('...'), it works.
Anyone got a solution for my problem or another way of slicing a 4D array without using additional memory?
Thanks
You can use reshape, which does not use any additional memory -
sz_X = size(X) %// get size
X = reshape(X,sz_X(1),sz_X(2),sz_X(3),sz_X(4)/n,[]); %// reshape and save
%// into same variable and as such must be memory efficient
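The no-copy behaviour has a direct NumPy analogue, which makes it easy to verify (note the split semantics differ between column-major MATLAB and row-major NumPy; this sketch only demonstrates that reshape avoids a copy):

```python
import numpy as np

n = 4
particles = 200
X = np.zeros((6, 6, 6, particles))

# Split the last axis into (particles/n, n); reshape returns a view onto
# the same buffer whenever the memory layout allows, so nothing is copied.
Y = X.reshape(6, 6, 6, particles // n, n)

print(np.shares_memory(X, Y))  # True
```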
OK, I just got things mixed up: cat and strcat are not the same... oops :o
n = 4;
particles = 200;
X = rand(6,6,6,particles);
sliced = sprintf('X = cat(5');
for i = 1:n
sliced = cat(2,sliced,sprintf(',X(:,:,:,(1+(%i-1)*%i):%i*%i)',i,particles/n,i,particles/n));
end
sliced = cat(2,sliced,sprintf(');'));
eval(sliced);
works just fine. If somebody has a better way to slice without extra memory usage, please feel free to post...

Increasing luminosity does not produce desired effect

cvCvtColor(img,dst,CV_RGB2YCrCb);
for (int col=0;col<dst->width;col++)
{
for (int row=0;row<dst->height;row++)
{
int idxF = row*dst->widthStep + dst->nChannels*col; // Read the image data
CvPoint pt = {row,col};
temp_ptr2[0] += temp_ptr1[0]* 0.0722 + temp_ptr1[1] * 0.7152 +temp_ptr1[2] *0.2126 ; // channel Y
}
}
But the result is not what I expect.
Please assist: where am I going wrong?
There is a lot to say about this code sample:
First, you are using the old C-style API (IplImage pointers, cvBlah functions, etc), which is obsolete and more difficult to maintain (in particular, memory leaks are introduced easily), so you should consider using the C++-style structures and functions (cv::Mat structure and cv::blah functions).
Your error is probably coming from the instruction cvCopy(dst,img); at the very beginning. This fills your input image with nothing just before you start your processing, so you should remove this line.
For maximum speed, you should invert the two loops, so that you first iterate over rows then over columns. This is because images in OpenCV are stored row-by-row in memory, hence accessing the images by increasing column is more efficient with respect to the cache usage.
The temporary variable idxF is never used, so you should probably remove the following line too:
int idxF = row*dst->widthStep + dst->nChannels*col;
When you access image data to store the pixels in temp_ptr1 and temp_ptr2, you swapped the positions of the x and y coordinates. You should access the image in the following way:
temp_ptr1 = &((uchar*)(img->imageData + (img->widthStep*pt.y)))[pt.x*3];
You never release the memory allocated for dst, hence introducing a memory leak in your application. Call cvReleaseImage(&dst); at the end of your function.
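The widthStep/nChannels addressing used in that answer can be checked with a tiny Python sketch (a hypothetical 4x3 packed 3-channel image, assuming no row padding):

```python
# A pixel (x, y) in a packed 3-channel 8-bit image starts at byte offset
# y*widthStep + x*nChannels, with widthStep = width*nChannels when there
# is no row padding.
width, height, channels = 4, 3, 3
width_step = width * channels

def pixel_offset(x, y):
    return y * width_step + x * channels

# Pixels along one row sit at consecutive offsets, which is why iterating
# rows in the outer loop and columns in the inner loop is cache friendly.
print([pixel_offset(x, 0) for x in range(width)])  # [0, 3, 6, 9]
```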

Cuda Fortran 4D array

My code is being slowed down by accesses to a 4D array in global memory.
I am using PGI compiler 2010.
The 4D array I am accessing is read-only on the device, and its size is known only at run time.
I wanted to place it in texture memory, but found that my PGI version does not support textures. As the size is known only at run time, it is not possible to use constant memory either.
Only one dimension is known at compile time, like this: MyFourD(100, x, y, z), where x, y, z are user input.
My first idea was pointers, but I am not familiar with pointers in Fortran.
If you have experience with such a situation, I would appreciate your help, because this alone makes my code 5 times slower than expected.
Following is a sample code of what I am trying to do
integer :: i, j, k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(i,j,k)=d_value(i,j,k)
& +myFourdArray(10,i,j,k)*regvalue1
& +myFourdArray(32,i,j,k)*regvalue2
& +myFourdArray(45,i,j,k)*regvalue3
end do
Best regards,
I believe the answer from @Alexander Vogt is on the right track - I would think about re-ordering the array storage. But I would try it like this:
integer :: i, j, k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(i,j,k)=d_value(i,j,k)
& +myFourdArray(i,j,k,10)*regvalue1
& +myFourdArray(i,j,k,32)*regvalue2
& +myFourdArray(i,j,k,45)*regvalue3
end do
Note that the only change is to myFourdArray, there is no need for a change in data ordering in the d_value array.
The crux of this change is that we are allowing adjacent threads to access adjacent elements in myFourdArray and so we are allowing for coalesced access. Your original formulation forced adjacent threads to access elements that were separated by the length of the first dimension, and so did not allow for useful coalescing.
Whether in CUDA C or CUDA Fortran, threads are grouped in X first, then Y and then Z dimensions. So the rapidly varying thread subscript is X first. Therefore, in matrix access, we want this rapidly varying subscript to show up in the index that is also rapidly varying.
In Fortran this index is the first of a multiple-subscripted array.
In C, this index is the last of a multiple-subscripted array.
Your original code followed this convention for d_value by placing the X thread index (i) in the first array subscript position. But it broke this convention for myFourdArray by putting a constant in the first array subscript position. Thus your accesses to myFourdArray are noticeably slower.
When there is a loop in the code, we also don't want to place the loop variable first (for Fortran, or last for C) (i.e. k, in this case, as Alexander Vogt did) because doing that will also break coalescing. For each iteration of the loop, we have multiple threads executing in lockstep, and those threads should all access adjacent elements. This is facilitated by having the X thread indexed subscript (e.g. i) first (for Fortran, or last for C).
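The layout rule ("first index fastest in Fortran, last index fastest in C") can be seen directly from array strides; here is a small NumPy sketch of the idea:

```python
import numpy as np

a_c = np.zeros((4, 5), order="C")  # row-major, like C
a_f = np.zeros((4, 5), order="F")  # column-major, like Fortran

# Strides give the byte step per index; the 8-byte stride (one float64
# element) marks the contiguous, fastest-varying index.
print(a_c.strides)  # (40, 8): the last index is contiguous
print(a_f.strides)  # (8, 32): the first index is contiguous
```

Coalescing wants the thread index i in that contiguous position, which is exactly what moving the constant subscript to the end achieves.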
You could invert the indexing, i.e. let the first dimension change the fastest. Fortran is column-major!
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(k,i,j)=d_value(k,i,j) + &
myFourdArray(k,i,j,10)*regvalue1 + &
myFourdArray(k,i,j,32)*regvalue2 + &
myFourdArray(k,i,j,45)*regvalue3
end do
If the last (in the original case second) dimension is always fixed (and not too large), consider individual arrays instead.
In my experience, pointers do not change much in terms of speed-up when applied to large arrays. What you could try is strip-mining to optimize your loops in terms of cache access, but I do not know the compile option to enable this with the PGI compiler.
Ah, OK, it is a simple directive:
!$acc do vector
do k=...
enddo
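For reference, the strip-mining idea mentioned above is language-independent; a minimal sketch in Python (with an assumed strip size, in practice tuned to the cache):

```python
# Strip-mining: replace one long loop with an outer loop over strips and an
# inner loop over one cache-sized strip, so each strip's data stays hot.
data = list(range(1000))
strip = 64  # assumed strip size

total = 0
for start in range(0, len(data), strip):
    for x in data[start:start + strip]:  # process one strip
        total += x

print(total)  # 499500, identical to the unstripped loop
```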

Armadillo: efficient matrix allocation on the heap

I'm using Armadillo to manipulate large matrices in C++ read from a CSV-file.
mat X;
X.load("myfile.csv",csv_ascii);
colvec x1 = X(span::all,0);
colvec x2 = X(span::all,1);
//etc.
So x1,...,xk (for k=20 say) are the columns of X. X will typically have rows ranging from 2000 to 16000. My question is:
How can I allocate (and subsequently deallocate) X onto the heap (free store)?
This section of the Armadillo docs explains auxiliary memory allocation for a mat. Is this the same as heap allocation? It requires knowing the matrix dimensions in advance, which I won't know until X is read from the CSV:
mat(aux_mem*, n_rows, n_cols, copy_aux_mem = true, strict = true)
Any suggestions would be greatly appreciated. (I'm using g++-4.2.1; my current program runs fine locally on my Macbook Pro, but when I run it on my university's computing cluster (Linux g++-4.1.2), I run into a segmentation fault. The program is too large to post).
Edit: I ended up doing this:
arma::u32 Z_rows = 10000;
arma::u32 Z_cols = 20;
double* aux_mem = new double[Z_rows*Z_cols];
mat Z(aux_mem,Z_rows,Z_cols,false,true);
Z = randn(Z_rows, Z_cols);
which first allocates memory on the heap and then tells the matrix Z to use it.
Looking at the source code, Armadillo already allocates large matrices on the heap.
To reduce the amount of memory required, you may want to use fmat instead of mat. This will come with the trade-off of reduced precision.
fmat uses float, while mat uses double: see http://arma.sourceforge.net/docs.html#Mat.
It's also possible that the system administrator of the Linux computing cluster has enabled limits (e.g. each user can allocate only up to a certain maximum amount of memory). For example, see http://www.linuxhowtos.org/Tips%20and%20Tricks/ulimit.htm.
