c language 'order of memory allocation' and 'execution speed' - c

I'm making a program using c.
I have many arrays and size of each array is not so small.
(more than 10,000 elements in each array).
And, there are set of arrays that are accessed and computed together frequently.
For example,
a_1[index] = constant * a_2[index];
b_1[index] = constant * b_2[index];
a_1 is compute with a_2 and b_1 is computed with b_2.
Suppose that I have a~z_1 and a~z_2 arrays, in my case,
is there significant 'execution speed' difference between the following 2 different memory allocation ways.
allocating memory in order of a~z_1 followed by a~z_2
allocating a_1,a_2 followed by b_1,b_2, c_1,c_2 and others?
1.
MALLOC(a_1);
MALLOC(b_1);
...
MALLOC(z_1);
MALLOC(a_2);
...
MALLOC(z_2);
2.
MALLOC(a_1);
MALLOC(a_2);
MALLOC(b_1);
MALLOC(b_2);
...
MALLOC(z_1);
MALLOC(z_2);
I think allocating memory in second way will be faster because of hit rate.
Because arrays allocated in similar times will be in the similar address, those arrays will be uploaded in the cash or RAM at the same time, and therefore computer does not need to upload arrays in several times to compute one line of code.
For example, to compute
a_1[index] = constant * a_2[index];
, upload a_1 and a_2 at the same time not separately.
(Is it correct?)
However, for me, in terms of maintenance, allocating in first way is much easier.
I have AA_a~AA_z_1,AA_a~AA_z_2, BB_a~BB_z_1, CC_a~CC_z~1 and other arrays.
Because I can efficiently use MACRO in the following way to allocate memory.
Like,
#define MALLOC_GROUP(GROUP1,GROUP2)
MALLOC(GROUP1##_a_##GROUP2);
MALLOC(GROUP1##_b_##GROUP2);
...
MALLOC(GROUP1##_z_##GROUP2)
void allocate(){
MALLOC_GROUP(AA,1);
MALLOC_GROUP(AA,2);
MALLOC_GROUP(BB,2);
}
To sum, is allocating sets of arrays, computed together, at the similar time affects to the execution speed of the program?
Thank you.

Related

Why using repmat() for expanding array?

I want to load a csv file to Matlab using testread(), since the data in it has more than 2 million records, so I should preallocate the array for those data.
Suppose I cannot know the exact length of arrays, the docs of MATLAB v6.5 recommend me to use repmat() for my expanding array. The original words in the doc is below:
"In cases where you cannot preallocate, see if you can increase the
size of your array using the repmat function. repmat tries to get you
a contiguous block of memory for your expanding array".
I really don't know how to use the repmat for expanding?
Does it mean by estimating a rough number of the length for repmat() to preallocating, and then remove the empty elements?
If so, how is that different from preallocating using zeros() or cell()?
The documentation also says:
When you preallocate a block of memory to hold a matrix of some type
other than double, it is more memory efficient and sometimes faster to
use the repmat function for this.
The statement below uses zeros to preallocate a 100-by-100 matrix of
uint8. It does this by first creating a full matrix of doubles, and
then converting the matrix to uint8. This costs time and uses memory
unnecessarily.
A = int8(zeros(100));
Using repmat, you create only one double, thus reducing your memory
needs.
A = repmat(int8(0), 100, 100);
Therefore, the advantage is if you want a datatype other than doubles, you can use repmat to replicate a non-double datatype.
Also see: http://undocumentedmatlab.com/blog/preallocation-performance, which suggests:
data1(1000,3000) = 0
instead of:
data1 = zeros(1000,3000)
to avoid initialisation of other elements.
As for dynamic resizing, repmat can be used to concisely double the size of your array (a common method which results in amortized O(1) appends for each element):
data = [0];
i = 1;
while another element
...
if i > numel(data)
data = repmat(data,1,2); % doubles the size of data
end
data(i) = element
i = i + 1;
end
And yes, after you have gathered all your elements, you can resize the array to remove empty elements at the end.

Out of memory only by a matrix transpose

I have a cell, Data, it contains three double arrays,
Data =
[74003x253 double] [8061x253 double] [7241x253 double]
I'm using a loop to read these arrays and perform some functions,
for ii = 1 : 3
D = Data {ii} ;
m = mean (D') ;
// rest of the code
end
Which gets a warning for mean and says:
consider using different DIMENSION input argument for MEAN
However when I change it to,
for ii = 1 : 3
D = Data {ii}' ;
m = mean (D) ;
// rest of the code
end
I get Out of memory error.
Comparing two codes, can someone explain what happens?
It seems that I get the error only with a Complex conjugate transpose (my data is real valued).
To take the mean for the n:th dimension it is possible use mean(D,n) as already stated. Regarding the memory consumption, I did some tests monitoring with the windows resource manager. The output was kind of expected.
When doing the operation D=Data{ii} only minimum memory is consumed since here matlab does no more than copying a pointer. However, when doing a transpose, matlab needs to allocate more memory to store the matrix D, which means that the memory consumption increases.
However, this solely does not cause a memory overflow, since the transpose is done in both cases.
Case 1
Separately inD = Data{ii}';
Case 2
in
D = Data {ii}; m = mean(D');
The difference is that in case 2 matlab only creates a temporary copy of Data{ii}' which is not stored in the workspace. The memory allocated is the same in both cases, but in case 1 Data{ii}' is stored in D. When the memory later increases this can cause a memory overflow.
The memory consumption of D is not that bad (< 200 Mb), but the guess is that the memory got high already and that this was enough to give memory overflow.
The warning message means that instead of,
m = mean (D') ;
you should do:
m = mean (D,2);
This will take the mean along the second dimension, leaving you with a column vector the length of size(D,1).
I don't know why you only get the out of memory error when you do D = Data {ii}'. Perhaps it's becauase when you have it in side of mean (m = mean (D') ; the JIT manages to optimize somehow and save you wasted memory.
Here are some ways of doing this:
for i = 1 : length(Data)
% as chappjc recommends this is an excellent solution
m = mean(Data{i}, 2);
end
Or if you want the transpose and you know the data is real (not complex)
for i = 1 : length(Data)
m = mean(Data{i}.');
end
Note, the dot before the transpose.
Or, skip the loop all together
m = cellfun(#(d) mean(d, 2), Data, 'uniformoutput', false);
When you do:
D = Data{i}'
Matlab will create a new copy of your data. This will allocate 74003x253 doubles, which is about 150MB. As patrick pointed out, given that you might have other data you can easily exceed the allowed memory allocation usage (especially on a 32-bit machine).
If you are running with memory problems, the computations are not sensitive, you may consider using single precision instead of double, i.e.:
data{i} = single(data{i});
Ideally, you want to do the single precision at point of allocation to avoid unnecessary new allocation and copies.
Good luck.

Cuda Fortran 4D array

My code is being slowed down by a my 4D arrays access in global memory.
I am using PGI compiler 2010.
The 4D array I am accessing is read only from the device and the size is known at run time.
I wanted to allocate to the texture memory and found that my PGI version does not support texture. As the size is known only at run time, it is not possible to use constant memory too.
Only One dimension is known at compile time like this MyFourD(100, x,y,z) where x,y,z are user input.
My first idea is about pointers but not familiar with pointer fortran.
If you have experience how to deal with such a situation, I will appreciate your help. Because only this makes my codes 5times slower than expected
Following is a sample code of what I am trying to do
int i,j,k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(i,j,k)=d_value(i,j,k)
& +myFourdArray(10,i,j,k)*regvalue1
& +myFourdArray(32,i,j,k)*regvalue2
& +myFourdArray(45,i,j,k)*regvalue3
end do
Best regards,
I believe the answer from #Alexander Vogt is on the right track - I would think about re-ordering the array storage. But I would try it like this:
int i,j,k
i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(i,j,k)=d_value(i,j,k)
& +myFourdArray(i,j,k,10)*regvalue1
& +myFourdArray(i,j,k,32)*regvalue2
& +myFourdArray(i,j,k,45)*regvalue3
end do
Note that the only change is to myFourdArray, there is no need for a change in data ordering in the d_value array.
The crux of this change is that we are allowing adjacent threads to access adjacent elements in myFourdArray and so we are allowing for coalesced access. Your original formulation forced adjacent threads to access elements that were separated by the length of the first dimension, and so did not allow for useful coalescing.
Whether in CUDA C or CUDA Fortran, threads are grouped in X first, then Y and then Z dimensions. So the rapidly varying thread subscript is X first. Therefore, in matrix access, we want this rapidly varying subscript to show up in the index that is also rapidly varying.
In Fortran this index is the first of a multiple-subscripted array.
In C, this index is the last of a multiple-subscripted array.
Your original code followed this convention for d_value by placing the X thread index (i) in the first array subscript position. But it broke this convention for myFourdArray by putting a constant in the first array subscript position. Thus your access to myFourdArray are noticeably slower.
When there is a loop in the code, we also don't want to place the loop variable first (for Fortran, or last for C) (i.e. k, in this case, as Alexander Vogt did) because doing that will also break coalescing. For each iteration of the loop, we have multiple threads executing in lockstep, and those threads should all access adjacent elements. This is facilitated by having the X thread indexed subscript (e.g. i) first (for Fortran, or last for C).
You could invert the indexing, i.e. let the first dimension change the Fastest. Fortran is column major!
do k = 0, 100
regvalue1 = somevalue1
regvalue2 = somevalue2
regvalue3 = somevalue3
d_value(k,i,j)=d_value(k,i,j) + &
myFourdArray(k,i,j,10)*regvalue1 + &
myFourdArray(k,i,j,32)*regvalue2 + &
myFourdArray(k,i,j,45)*regvalue3
end do
If the last (in the original case second) dimension is always fixed (and not too large), consider individual arrays instead.
In my experience, pointers do not change much in terms of speed-up when applied to large arrays. What you could try is strip-mining to optimize your loops in terms of cache access, but I do not know the compile option to enable this with the PGI compiler.
Ah, ok it is a simple directive:
!$acc do vector
do k=...
enddo

pthread speedup for matrix add/multiply

I am writing a basic code to add two matrix and note down the time taken for single thread and 2 or more threads. In the approach first i divide the given two matrix (initialized randomly) in THREADS number of segments, and then each of these segments are sent to the addition module, which is started by the pthread_create call. The argument to the parallel addition function is the following.
struct thread_segment
{
matrix_t *matrix1, *matrix2, *matrix3;
int start_row, offset;
};
Pointers to two source and one destination matrix. (Once source and the destination may point to the same matrix). The start_row is the row from which the particular thread should start adding, and the offset tells till how much this thread should add starting from start_row.
The matrix_t is a simple structure defined as below:
typedef struct _matrix_t
{
TYPE **mat;
int r, c;
} matrix_t;
I have compiled it with 2 threads, but there is (almost) no speedup when i ran with 10000 x 10000 matrix. I am recording the running time with time -p program.
The matrix random initialization is also done in parallel like above.
I think this is because all the threads work on the same matrix address area, may be because of that a bottleneck is not making any speedup. Although all the threads will work on different segments of a matrix, they don't overlap.
Previously i implemented a parallel mergesort and a quicksort which also showed similar characteristics, i was able to get speedup when i copied the data segment on which a particular thread is to work to a newly allocated memory.
My question is that is this because of:
memory bottleneck?
Time benchmark is not done in the proper way?
Dataset too small?
Coding error?
Other
In the case, if it is a memory bottleneck, then do every parallel program use exclusive memory area, even when multiple access of the threads on the shared memory can be done without mutex?
EDIT
I can see speedup when i make the matrix segments like
curr = 0;
jump = matrix1->r / THREADS;
for (i=0; i<THREADS; i++)
{
th_seg[i].matrix1 = malloc (sizeof (matrix_t));
th_seg[i].matrix1->mat = &(matrix1->mat[curr]);
th_seg[i].matrix1->c = matrix1->c;
th_seg[i].matrix1->r = jump;
curr += jump;
}
That is before passing, assign the base address of the matrix to be processed by this thread in the structure and store the number of rows. So now the base address of each matrix is different for each thread. But only if i add some small dimention matrix 100 x 100 say, many times. Before calling the parallel add in each iteration, i am re assigning the random values. Is the speedup noticed here true? Or due to some other phenomena chaching effects?
To optimize memory usage, you may want to take a look at loop tiling. That would help cache memory to be updated. In this approach you divide your matrices into smaller chunks so the cache can hold the values for longer time and does not need to update it self frequently.
Also notice, creating to many threads just increases the overhead of switching among them.
In order to get a feeling that how much a proper implementation can affect the run time of a concurrent program, these are the results of a programs to multiply two matrices in naive, cocnurrent and tiling-concurrent :
seconds name
10.72 simpleMul
5.16 mulThread
3.19 tilingMulThread

Which kind of data organization using C arrays makes fastest code and why?

Given following data, what is the best way to organize an array of elements so that the fastest random access will be possible?
Each element has some int number, a name of 3 characters with '\0' at the end, and a floating point value.
I see two possible methods to organize and access such array:
First:
typedef struct { int num; char name[4]; float val; } t_Element;
t_Element array[900000000];
//random access:
num = array[i].num;
name = array[i].name;
val = array[i].val;
//sequential access:
some_cycle:
num = array[i].num
i++;
Second:
#define NUMS 0
#define NAMES 1
#define VALS 2
#define SIZE (VALS+1)
int array[SIZE][900000000];
//random access:
num = array[NUMS][i];
name = (char*) array[NAMES][i];
val = (float) array[VALS][i];
//sequential access:
p_array_nums = &array[NUMS][i];
some_cycle:
num = *p_array_nums;
p_array_nums++;
My question is, what method is faster and why? My first thought was the second method makes fastest code and allows fastest block copy, but I doubt whether it saves any sensitive number of CPU instructions in comparison to the first method?
It depends on the common access patterns. If you plan to iterate over the data, accessing every element as you go, the struct approach is better. If you plan to iterate independently over each component, then parallel arrays are better.
This is not a subtle distinction, either. With main memory typically being around two orders of magnitude slower than L1 cache, using the data structure that is appropriate for the usage pattern can possibly triple performance.
I must say, though, that your approach to implementing parallel arrays leaves much to be desired. You should simply declare three arrays instead of getting "clever" with two-dimensional arrays and casting:
int nums[900000000];
char names[900000000][4];
float vals[900000000];
Impossible to say. As with any performance related test, the answer my vary by any one or more of your OS, your CPU, your memory, your compiler etc.
So you need to test for yourself. Set your performance targets, measure, optimise, repeat.
The first one is probably faster, since memory access latency will be the dominant factor in performance. Ideally you should access memory sequentially and contiguously, to make best use of loaded cache lines and reduce cache misses.
Of course the access pattern is critical in any such discussion, which is why sometimes it's better to use SoA (structure of arrays) and other times AoS (array of structures), at least when performance is critical.
Most of the time of course you shouldn't worry about such things (premature optimisation, and all that).

Resources