I have saved multidimensional arrays in Matlab before (e.g. an array A that has size 100x100x100) using a .mat file and that worked out very nicely.
What is the best way to save such multidimensional arrays in C? The only way I can think of is to store it as a 2D array (e.g. convert a KxNxM array to a KNxM array) and be careful in remembering how it was saved.
What is also desired is to save it in a way that can be opened later in Matlab for post-processing/plotting.
C does 3D arrays just fine:
double data[D0][D1][D2];
...
data[i][j][k] = ...;
although for very large arrays such as in your example, you would want to allocate the arrays dynamically instead of declaring them as auto variables such as above, since the space for auto variables (usually the stack, but not always) may be very limited.
Assuming all your dimensions are known at compile time, you could do something like:
#include <stdlib.h>
...
#define D0 100
#define D1 100
#define D2 100
...
double (*data)[D1][D2] = malloc(sizeof *data * D0);
if (data)
{
...
data[i][j][k] = ...;
...
free(data);
}
This will allocate a D0xD1xD2 array from the heap, and you can access it like any regular 3D array.
If your dimensions are not known until run time, but you're working with a C99 compiler or a C2011 compiler that supports variable-length arrays, you can do something like this:
#include <stdlib.h>
...
size_t d0, d1, d2;
d0 = ...;
d1 = ...;
d2 = ...;
...
double (*data)[d1][d2] = malloc(sizeof *data * d0);
if (data)
{
// same as above
}
If your dimensions are not known until runtime and you're working with a compiler that does not support variable-length arrays (C89 or earlier, or a C2011 compiler without VLA support), you'll need to take a different approach.
If the memory needs to be allocated contiguously, then you'll need to do something like the following:
size_t d0, d1, d2;
d0 = ...;
d1 = ...;
d2 = ...;
...
double *data = malloc(sizeof *data * d0 * d1 * d2);
if (data)
{
    ...
    data[i * d1 * d2 + j * d2 + k] = ...;
    ...
    free(data);
}
Note that you have to map your i, j, and k indices to a single index value.
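If you do this in many places, a small helper macro (a sketch; the name IDX3 is just illustrative) keeps the index arithmetic in one spot:
/* flatten (i, j, k) into a single index for a d0 x d1 x d2 block */
#define IDX3(i, j, k, d1, d2) (((i) * (d1) + (j)) * (d2) + (k))
...
data[IDX3(i, j, k, d1, d2)] = ...;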
If the memory doesn't need to be contiguous, you can do a piecemeal allocation like so:
double ***data;
...
data = malloc(d0 * sizeof *data);
if (data)
{
    size_t i;
    for (i = 0; i < d0; i++)
    {
        data[i] = malloc(d1 * sizeof *data[i]);
        if (data[i])
        {
            size_t j;
            for (j = 0; j < d1; j++)
            {
                data[i][j] = malloc(d2 * sizeof *data[i][j]);
                if (data[i][j])
                {
                    size_t k;
                    for (k = 0; k < d2; k++)
                    {
                        data[i][j][k] = initial_value();
                    }
                }
            }
        }
    }
}
and deallocate it as
for (i = 0; i < d0; i++)
{
    for (j = 0; j < d1; j++)
    {
        free(data[i][j]);
    }
    free(data[i]);
}
free(data);
This is not recommended practice, btw; even though it allows you to index data as though it were a 3D array, the tradeoff is more complicated code, especially if malloc fails midway through the allocation loop (then you have to back out all the allocations you've made so far). It may also incur a performance penalty since the memory isn't guaranteed to be well-localized.
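For illustration, a minimal sketch of the kind of back-out logic needed when one of the inner malloc calls fails (reusing the names from the snippet above):
/* sketch: data[i][j] failed to allocate, so release everything obtained so far */
if (!data[i][j])
{
    for (size_t jj = 0; jj < j; jj++)
        free(data[i][jj]);
    free(data[i]);
    for (size_t ii = 0; ii < i; ii++)
    {
        for (size_t jj = 0; jj < d1; jj++)
            free(data[ii][jj]);
        free(data[ii]);
    }
    free(data);
    data = NULL;
}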
EDIT
As for saving this data in a file, it kind of depends on what you need to do.
The most portable option is to save the data as formatted text, for example:
#include <stdio.h>
FILE *dat = fopen("myfile.dat", "w"); // opens new file for writing
if (dat)
{
for (i = 0; i < D0; i++)
{
for (j = 0; j < D1; j++)
{
for (k = 0; k < D2; k++)
{
fprintf(dat, "%f ", data[i][j][k]);
}
fprintf(dat, "\n");
}
fprintf(dat, "\n");
}
}
This writes the data out as a sequence of floating-point numbers, with a newline at the end of each row, and two newlines at the end of each "page". Reading the data back in is essentially the reverse:
FILE *dat = fopen("myfile.dat", "r"); // opens file for reading
if (dat)
{
for (i = 0; i < D0; i++)
for (j = 0; j < D1; j++)
for (k = 0; k < D2; k++)
fscanf(dat, "%f", &data[i][j][k]);
}
Note that both of these snippets assume that the array has a known, fixed size that does not change from run to run. If that is not the case, you will obviously have to store additional data in the file to determine how big the array needs to be. There's also nothing resembling error handling.
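If the dimensions can vary from run to run, one simple convention (just a sketch, not a standard format) is to write them on the first line and read them back before allocating:
/* writer: record the dimensions before the data */
fprintf(dat, "%zu %zu %zu\n", (size_t)D0, (size_t)D1, (size_t)D2);

/* reader: recover the dimensions, allocate, then run the same nested loops bounded by d0, d1, d2 */
size_t d0, d1, d2;
if (fscanf(dat, "%zu %zu %zu", &d0, &d1, &d2) == 3)
{
    double (*data)[d1][d2] = malloc(sizeof *data * d0);  /* needs VLA support */
    /* ... */
}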
I'm leaving a lot of stuff out, since I'm not sure what your goal is.
Well, you can of course store it as a 3D array in C, too. Not sure why you feel you must convert to 2D:
double data[100][100][100];
This will of course require quite a bit of memory (around 7.6 MB assuming a 64-bit double), but that should be fine on a PC, for instance.
You might want to avoid putting such a variable on the stack, though.
C can handle 3-dimensional arrays, so why not use that?
Writing it to a .mat file is a little bit of work, but it doesn't seem too difficult.
The .mat format is described here.
C handles multidimensional arrays (double array[K][M][N];) just fine, and they are stored contiguously in memory the same as a 1-D array. In fact, it's legal to write double* onedim = &array[0][0][0]; and then use the same exact memory area as both a 3-D and 1-D array.
To get it from C into MATLAB, you can just use fwrite(array, sizeof array[0][0][0], K*M*N, fptr) in C and array = fread(fileID, inf, 'real*8') in MATLAB. You may find that the reshape function is helpful.
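As a rough sketch of that round trip (the file name is just a placeholder, and this writes the machine's native double representation, so the file is not portable across byte orders):
FILE *fptr = fopen("data.bin", "wb");
if (fptr)
{
    /* dump all K*M*N doubles in one call */
    fwrite(array, sizeof array[0][0][0], (size_t)K * M * N, fptr);
    fclose(fptr);
}
On the MATLAB side, fread returns a column vector, so something like reshape(fread(fileID, inf, 'real*8'), [N M K]) recovers the shape; because C varies the last index fastest while MATLAB varies the first index fastest, you may also need a permute to get the indices in the order you expect.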
Triple pointer:
double ***X;
X = (double***) malloc(k * sizeof(double**));
for (int i = 0; i < k; i++)
{
    X[i] = (double**) malloc(n * sizeof(double*));
    for (int j = 0; j < n; j++)
    {
        X[i][j] = (double*) malloc(m * sizeof(double));
    }
}
This way the access to each value is quite intuitive: X[i][j][k].
If you prefer, you can instead use a single array:
double *X;
X = (double*) malloc(k * n * m * sizeof(double));
and you access element (i, j, l) of the k x n x m array this way:
X[i*n*m + j*m + l] = 0.0;
If you use a triple pointer, don't forget to free the memory.
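For completeness, the matching cleanup might look like this (mirroring the allocation loops above):
for (int i = 0; i < k; i++)
{
    for (int j = 0; j < n; j++)
    {
        free(X[i][j]);
    }
    free(X[i]);
}
free(X);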
Related
This is one of those questions where there are so many answers, and yet none do the specific thing.
I tried to look at all of these posts —
1 2 3 4 5 6 7 8 9 — and every time the solution would be either using VLAs, using normal arrays with fixed dimensions, or using pointer to pointer.
What I want is to allocate:
dynamically (using a variable set at runtime)
rectangular ("2d array") (I don't need a jagged one. And I guess it would be impossible to do it anyway.)
contiguous memory (in #8 and some other posts, people say that pointer to pointer is bad because of heap stuff and fragmentation)
no VLAs (I heard they are the devil and to always avoid them and not to talk to people who suggest using them in any scenario).
So please, if there is a post I skipped, or didn't read thoroughly enough, that fulfils these requirements, point me to it.
Otherwise, I would ask of you to educate me about this and tell me if this is possible, and if so, how to do it.
You can dynamically allocate a contiguous 2D array as
int (*arr)[cols] = malloc( rows * sizeof (int [cols]) );
and then access elements as arr[i][j].
If your compiler doesn’t support VLAs, then cols will have to be a constant expression.
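Putting it together, a minimal sketch (the dimensions here are arbitrary runtime values, just for illustration):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t rows = 4, cols = 7;              /* known only at run time */
    int (*arr)[cols] = malloc(rows * sizeof *arr);
    if (!arr)
        return 1;

    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            arr[i][j] = (int)(i * cols + j);

    printf("%d\n", arr[2][3]);              /* prints 17 */
    free(arr);                              /* one free releases the whole block */
    return 0;
}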
Preliminary
no VLAs (I heard they are the devil and to always avoid them and not to talk to people who suggest using them in any scenario).
VLAs are a problem if you want your code to work with Microsoft's stunted C compiler, as MS has steadfastly refused to implement VLA support, even when C99, in which VLA support was mandatory, was the current language standard. Generally speaking, I would suggest avoiding Microsoft's C compiler altogether if you can, but I will stop well short of suggesting the avoidance of people who advise you differently.
VLAs are also a potential problem when you declare an automatic object of VLA type, without managing the maximum dimension. Especially so when the dimension comes from user input. This produces a risk of program crash that is hard to test or mitigate at development time, except by avoiding the situation in the first place.
But it is at best overly dramatic to call VLAs "the devil", and I propose that anyone who actually told you "not to talk to people who suggest using them in any scenario" must not have trusted you to understand the issues involved or to evaluate them for yourself. In particular, pointers to VLAs are a fine way to address all your points besides "no VLAs", and they have no particular technical issues other than lack of support by (mostly) Microsoft. Support for these will be mandatory again in C2X, the next C language specification, though support for some other forms of VLA use will remain optional.
Your requirements
If any of the dimensions of an array type are not given by integer constant expressions, then that type is by definition a variable-length array type. If any dimension but the first of an array type is not given by an integer constant expression, then you cannot express the corresponding pointer type without using a VLA.
Therefore, if you want a contiguously allocated multidimensional array (array of arrays) for which any dimension other than the first is chosen at runtime, then a VLA type must be involved. Allocating such an object dynamically works great and has little or no downside other than lack of support by certain compilers (which is a non-negligible consideration, to be sure). It would look something like this:
void do_something(size_t rows, size_t columns) {
    int (*my_array)[columns]; // pointer to VLA
    my_array = malloc(rows * sizeof(*my_array));
    // ... access elements as my_array[row][col] ...
}
You should have seen similar in some of the Q&As you reference in the question.
If that's not acceptable, then you need to choose which of your other requirements to give up. I would suggest the "multi-dimensional" part. Instead, allocate (effectively) a one-dimensional array, and use it as if it had two dimensions by performing appropriate index computations upon access. This should perform almost as well, because it's pretty close to what the compiler will set up automatically for a multidimensional array. You can make it a bit easier on yourself by creating a macro to assist with the computations. For example,
#define ELEMENT_2D(a, dim2, row, col) ((a)[(row) * (dim2) + (col)])
void do_something(size_t rows, size_t columns) {
    int *my_array;
    my_array = malloc(rows * columns * sizeof(*my_array));
    // ... access elements as ELEMENT_2D(my_array, columns, row, col) ...
}
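Usage then looks something like this (row and col being whatever indices you computed), and a single free releases the whole block:
ELEMENT_2D(my_array, columns, row, col) = 42;
int v = ELEMENT_2D(my_array, columns, row, col);
free(my_array);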
Alternatively, you could give up the contiguous allocation and go with an array of pointers instead. This is what people who don't understand arrays, pointers, and/or dynamic allocation typically do, and although there are some applications for it, especially arrays of pointers to strings, this form has mostly downsides relative to contiguous allocation for the kinds of applications where one wants an object they think of as a 2D array.
Often an array of pointers is allocated and then memory is allocated to each pointer.
This could be inverted. Allocate a large contiguous block of memory. Allocate an array of pointers and assign addresses from within the contiguous block.
#include <stdio.h>
#include <stdlib.h>
int **contiguous ( int rows, int cols, int **memory, int **pointers) {
    int *temp = NULL;
    int **ptrtemp = NULL;

    // allocate a large block of memory
    if ( NULL == ( temp = realloc ( *memory, sizeof **memory * rows * cols))) {
        fprintf ( stderr, "problem memory malloc\n");
        return pointers;
    }
    *memory = temp;

    // allocate pointers
    if ( NULL == ( ptrtemp = realloc ( pointers, sizeof *pointers * rows))) {
        fprintf ( stderr, "problem memory malloc\n");
        return pointers;
    }
    pointers = ptrtemp;

    for ( int rw = 0; rw < rows; ++rw) {
        pointers[rw] = &(*memory)[rw * cols]; // assign addresses to pointers
    }
    // assign some values
    for ( int rw = 0; rw < rows; ++rw) {
        for ( int cl = 0; cl < cols; ++cl) {
            pointers[rw][cl] = rw * cols + cl;
        }
    }
    return pointers;
}
int main ( void) {
    int *memory = NULL;
    int **ptrs = NULL;
    int rows = 20;
    int cols = 17;

    if ( ( ptrs = contiguous ( rows, cols, &memory, ptrs))) {
        for ( int rw = 0; rw < rows; ++rw) {
            for ( int cl = 0; cl < cols; ++cl) {
                printf ( "%3d ", ptrs[rw][cl]);
            }
            printf ( "\n");
        }
        free ( memory);
        free ( ptrs);
    }
    return 0;
}
Suppose you need a 2D array of size W x H containing ints (where H is the number of rows, and W the number of columns).
Then you can do the following:
Allocation:
int * a = malloc(W * H * sizeof(int));
Access element at location (i,j):
int val = a[j * W + i];
a[j * W + i] = val;
The whole array occupies a contiguous block of memory and can be dynamically allocated (without VLAs). Being a contiguous block of memory offers an advantage over an array of pointers due to [potentially] fewer cache misses.
In such an array, the term "stride" refers to the offset between one row to another. If you need to use padding e.g. to make sure all lines start at some aligned address, you can use a stride which is bigger than W.
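As a sketch of the stride idea (padding each row to a multiple of 16 elements is just an arbitrary choice for illustration):
size_t stride = (W + 15) & ~(size_t)15;   /* round W up to a multiple of 16 */
int *a = malloc(stride * H * sizeof *a);

/* element (i,j) -- column i of row j -- now uses the stride rather than W */
int val = a[j * stride + i];
a[j * stride + i] = val;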
I did a benchmark between:
the classic pointer to array of pointers to individually malloc'd memory
one pointer to contiguous memory, accessed with a[x * COLS + y]
a mix of both - a pointer to an array of pointers into sliced-up contiguous malloc'd memory
TL;DR:
the second one appears to be faster by 2-12% compared to the others, which are sort of similar in performance.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define ROWS 100
#define COLS 100
#define LOOPS 100
#define NORMAL 0
#define SINGLE 1
#define HYBRID 2
int **x_normal; /* global vars to make it more equal */
int *y_single;
int *z_hybrid_memory;
int **z_hybrid_pointers;
int copy_array[ROWS][COLS];
void x_normal_write(int magic) { /* magic number to prevent compiler from optimizing it */
int i, ii;
for (i = 0; i < ROWS; i++) {
for (ii = 0; ii < COLS; ii++) {
x_normal[i][ii] = (i * COLS + ii + magic);
}
}
}
void y_single_write(int magic) {
int i, ii;
for (i = 0; i < ROWS; i++) {
for (ii = 0; ii < COLS; ii++) {
y_single[i * COLS + ii] = (i * COLS + ii + magic);
}
}
}
void z_hybrid_write(int magic) {
int i, ii;
for (i = 0; i < ROWS; i++) {
for (ii = 0; ii < COLS; ii++) {
z_hybrid_pointers[i][ii] = (i * COLS + ii + magic);
}
}
}
void x_normal_copy(void) {
int i, ii;
for (i = 0; i < ROWS; i++) {
for (ii = 0; ii < COLS; ii++) {
copy_array[i][ii] = x_normal[i][ii];
}
}
}
void y_single_copy(void) {
int i, ii;
for (i = 0; i < ROWS; i++) {
for (ii = 0; ii < COLS; ii++) {
copy_array[i][ii] = y_single[i * COLS + ii];
}
}
}
void z_hybrid_copy(void) {
int i, ii;
for (i = 0; i < ROWS; i++) {
for (ii = 0; ii < COLS; ii++) {
copy_array[i][ii] = z_hybrid_pointers[i][ii];
}
}
}
int main() {
int i;
clock_t start, end;
double times_read[3][LOOPS];
double times_write[3][LOOPS];
/* MALLOC X_NORMAL 1/2 */
x_normal = malloc(ROWS * sizeof(int*)); /* rows */
for (i = 0; i < ROWS; i+=2) { /* malloc every other row to ensure memory isn't contiguous */
x_normal[i] = malloc(COLS * sizeof(int)); /* columns for each row (1/2) */
}
/* MALLOC Y_SINGLE */
y_single = malloc(ROWS * COLS * sizeof(int)); /* all in one contiguous block of memory */
/* MALLOC Z_HYBRID */
z_hybrid_memory = malloc(ROWS * COLS * sizeof(int)); /* memory part - with a big chunk of contiguous memory */
z_hybrid_pointers = malloc(ROWS * sizeof(int*)); /* pointer part - like in normal */
for (i = 0; i < ROWS; i++) { /* assign addresses to pointers from "memory", spaced out by COLS */
z_hybrid_pointers[i] = &z_hybrid_memory[(i * COLS)];
}
/* MALLOC X_NORMAL 2/2 */
for (i = 1; i < ROWS; i+=2) { /* malloc every other row to ensure memory isn't contiguous */
x_normal[i] = malloc(COLS * sizeof(int)); /* columns for each row (2/2) */
}
/* TEST */
for (i = 0; i < LOOPS; i++) {
/* NORMAL WRITE */
start = clock();
x_normal_write(i);
end = clock();
times_write[NORMAL][i] = (double)(end - start);
/* SINGLE WRITE */
start = clock();
y_single_write(i);
end = clock();
times_write[SINGLE][i] = (double)(end - start);
/* HYBRID WRITE */
start = clock();
z_hybrid_write(i);
end = clock();
times_write[HYBRID][i] = (double)(end - start);
/* NORMAL READ */
start = clock();
x_normal_copy();
end = clock();
times_read[NORMAL][i] = (double)(end - start);
/* SINGLE READ */
start = clock();
y_single_copy();
end = clock();
times_read[SINGLE][i] = (double)(end - start);
/* HYBRID READ */
start = clock();
z_hybrid_copy();
end = clock();
times_read[HYBRID][i] = (double)(end - start);
}
/* REPORT FINDINGS */
printf("CLOCKS NEEDED FOR:\n\nREAD\tNORMAL\tSINGLE\tHYBRID\tWRITE\tNORMAL\tSINGLE\tHYBRID\n\n");
for (i = 0; i < LOOPS; i++) {
printf(
"\t%.1f\t%.1f\t%.1f\t\t%.1f\t%.1f\t%.1f\n",
times_read[NORMAL][i], times_read[SINGLE][i], times_read[HYBRID][i],
times_write[NORMAL][i], times_write[SINGLE][i], times_write[HYBRID][i]
);
/* USE [0] to get totals */
times_read[NORMAL][0] += times_read[NORMAL][i];
times_read[SINGLE][0] += times_read[SINGLE][i];
times_read[HYBRID][0] += times_read[HYBRID][i];
times_write[NORMAL][0] += times_write[NORMAL][i];
times_write[SINGLE][0] += times_write[SINGLE][i];
times_write[HYBRID][0] += times_write[HYBRID][i];
}
printf("TOTAL:\n\t%.1f\t%.1f\t%.1f\t\t%.1f\t%.1f\t%.1f\n",
times_read[NORMAL][0], times_read[SINGLE][0], times_read[HYBRID][0],
times_write[NORMAL][0], times_write[SINGLE][0], times_write[HYBRID][0]
);
printf("AVERAGE:\n\t%.1f\t%.1f\t%.1f\t\t%.1f\t%.1f\t%.1f\n",
(times_read[NORMAL][0] / LOOPS), (times_read[SINGLE][0] / LOOPS), (times_read[HYBRID][0] / LOOPS),
(times_write[NORMAL][0] / LOOPS), (times_write[SINGLE][0] / LOOPS), (times_write[HYBRID][0] / LOOPS)
);
return 0;
}
Though maybe this is not the best approach, since the results can be skewed by incidental factors, such as the proximity of the source arrays to the destination array in the copy functions (though the numbers are consistent for reads and writes; perhaps someone can expand on this).
I am searching for a solution to this problem. We have to write a program that can add/multiply an unknown number of matrices. That means multiplication won't necessarily appear first in the input, but it has to be evaluated first because of operator precedence. My idea is to save all the matrices in an array, but I don't know how to save a matrix (a 2D array) into an array. We are programming in C. Does anyone know a solution, or a better approach? Thanks.
I would probably create a struct representing a matrix, something like the following. (I use int, but it will work with doubles too; you just need to change every int, apart from n, to double.)
typedef struct m{
    // stores the size of the matrix (assuming they are square matrices)
    // otherwise you can just save the same n for everyone if they have the same size
    int n;
    int **map;
} matrix;
and then a pointer serving as an array of those structs, something like the following (note that I omitted the checks that ensure the allocations succeed; you'll need to write them). I use calloc because I like it more, since it initializes all the positions to 0, but malloc will work too:
// if you already know how many matrices you'll have
int number = get_matrix_number();
matrix *matrices = calloc(number, sizeof(matrix));

// otherwise, start with one
int number = 1;
matrix *matrices = calloc(number, sizeof(matrix));
matrices[number - 1].n = get_matrix_dimension();
// then every time you need to add another matrix
number++;
matrices = realloc(matrices, number * sizeof(matrix));
matrices[number - 1].n = get_matrix_dimension();
After that to create the actual matrices you could do:
for (int i = 0; i < number; i++){
    matrices[i].map = calloc(matrices[i].n, sizeof(int *)); // rows of row pointers
    for (int j = 0; j < matrices[i].n; j++){
        matrices[i].map[j] = calloc(matrices[i].n, sizeof(int));
    }
}
After all of that, to access, let's say, position (3,5) in the 4th matrix (index 3) you'll just need to do
int value = matrices[3].map[3][5];
I didn't test it (I just wrote it the way I think I would have), but I think it should work.
As I said, you'll need to add the checks for the allocations and the frees, but I think it's easier to understand than raw triple pointers, especially if you don't have much experience in C (though you have to write more code, since you need to create the struct).
The nice part is that this will work for matrices of different sizes and for non-square matrices (provided you store not just n but both n and m, to remember how many rows and columns each matrix has).
You could also make the allocation faster by allocating more than you need when calling realloc (e.g. number * 2), so that you don't need to realloc every time; you'll then need another variable to track how many free slots you still have, as sketched below. (I've never done it, this is just what I studied as theory, hence I preferred the straightforward version above, since I'm pretty sure it works.)
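A sketch of that growth strategy (capacity is a hypothetical extra variable introduced here to track how many slots are already allocated):
size_t capacity = 1;
int number = 0;
matrix *matrices = calloc(capacity, sizeof(matrix));
// ... appending one more matrix:
if ((size_t)number == capacity)                 /* out of room: double the buffer first */
{
    matrix *tmp = realloc(matrices, capacity * 2 * sizeof(matrix));
    if (tmp == NULL)
    {
        /* handle the failure; the old matrices pointer is still valid */
    }
    else
    {
        matrices = tmp;
        capacity *= 2;
    }
}
number++;
matrices[number - 1].n = get_matrix_dimension();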
P.S. There could be some errors here and there; I wrote it pretty fast without checking too carefully.
Although your question is still confusing, for the part where you want to save a 2D array into a 1D array, the following code can help you:
int n = 10;
int array2D[n][n];
int yourArray[n*n];
int i,j,k=0;
for(i = 0; i<n; i++){
for(j = 0; j<n ;j++){
yourArray[k++] = array2D[i][j];
}
}
You need a 3D array for this. Something like double ***arrayOfMatrices.
I think that the simplest thing to do, if you want to work with matrices of different shapes, is to declare a struct that holds a pointer to double, which should simulate a 1d array, and the number of rows and columns of the matrix. The main code can declare a pointer to a matrix struct, and this simulates an array of matrices.
When you add a matrix, which I take to be a VLA so that different matrix shapes can easily be handled, the VLA is stored in the 1d simulated array.
Storing the matrix information in a struct this way also makes it easy to verify that matrix operations are defined for the matrices in question (e.g., matrix addition is only defined for matrices of the same shape).
Here is a simple example of how this idea could be implemented. The program defines 3 VLAs and adds them to a list of Matrix structs called matrices. Then the matrices are displayed by reading them from matrices.
#include <stdio.h>
#include <stdlib.h>
typedef struct {
double *array;
size_t rows;
size_t cols;
} Matrix;
Matrix * add_matrix(size_t rs, size_t cs, double arr[rs][cs], Matrix *mx, size_t *n);
void show_matrix(Matrix mx);
int main(void)
{
Matrix *matrices = NULL;
size_t num_matrices = 0;
double m1[2][2] = {
{ 1, 2 },
{ 3, 4 }
};
double m2[2][3] = {
{ 1, 0, 0 },
{ 0, 1, 0 }
};
double m3[3][2] = {
{ 5, 1 },
{ 1, 0 },
{ 0, 0 }
};
matrices = add_matrix(2, 2, m1, matrices, &num_matrices);
matrices = add_matrix(2, 3, m2, matrices, &num_matrices);
matrices = add_matrix(3, 2, m3, matrices, &num_matrices);
show_matrix(matrices[0]);
show_matrix(matrices[1]);
show_matrix(matrices[2]);
/* Free allocated memory */
for (size_t i = 0; i < num_matrices; i++) {
free(matrices[i].array);
}
free(matrices);
return 0;
}
Matrix * add_matrix(size_t rs, size_t cs, double arr[rs][cs], Matrix *mx, size_t *n)
{
Matrix *temp = realloc(mx, sizeof(*temp) * (*n + 1));
if (temp == NULL) {
fprintf(stderr, "Allocation error\n");
exit(EXIT_FAILURE);
}
double *arr_1d = malloc(sizeof(*arr_1d) * rs * cs);
if (arr_1d == NULL) {
fprintf(stderr, "Allocation error\n");
exit(EXIT_FAILURE);
}
/* Store 2d VLA in arr_1d */
for (size_t i = 0; i < rs; i++)
for (size_t j = 0; j < cs; j++)
arr_1d[i * cs + j] = arr[i][j];
temp[*n].array = arr_1d;
temp[*n].rows = rs;
temp[*n].cols = cs;
++*n;
return temp;
}
void show_matrix(Matrix mx)
{
size_t rs = mx.rows;
size_t cs = mx.cols;
size_t i, j;
for (i = 0; i < rs; i++) {
for (j = 0; j < cs; j++)
printf("%5.2f", mx.array[i * cs + j]);
putchar('\n');
}
putchar('\n');
}
Output is:
1.00 2.00
3.00 4.00
1.00 0.00 0.00
0.00 1.00 0.00
5.00 1.00
1.00 0.00
0.00 0.00
!!! HOMEWORK - ASSIGNMENT !!!
Please do not post code as I would like to complete myself but rather if possible point me in the right direction with general information or by pointing out mistakes in thought or other possible useful and relevant resources.
I have a method that creates my square npages * npages matrix hat of doubles for use in my PageRank algorithm.
I have made it with pthreads, SIMD and with both pthreads and SIMD. I have used xcode instruments time profiler and found that the pthreads only version is the fastest, next is the SIMD only version and slowest is the version with both SIMD and pthreads.
As it is homework it can be run on multiple different machines; however, we were given the header #include, so it is to be assumed we can use up to AVX at least. We are given how many threads the program will use as the argument to the program and store it in a global variable g_nthreads.
In my tests I have been running it on my machine, which is an Ivy Bridge with 4 hardware cores and 8 logical cores, and I've been testing it with 4 threads and with 8 threads as the argument.
RUNNING TIMES:
SIMD ONLY:
331ms - for the construct_matrix_hat function
PTHREADS ONLY (8 threads):
70ms - each thread concurrently
SIMD & PTHREADS (8 threads):
110ms - each thread concurrently
What am I doing that is slowing it down more when using both forms of optimisation?
I will post each implementation:
All versions share these macros:
#define BIG_CHUNK (g_n2/g_nthreads)
#define SMALL_CHUNK (g_npages/g_nthreads)
#define MOD BIG_CHUNK - (BIG_CHUNK % 4)
#define IDX(a, b) ((a * g_npages) + b)
Pthreads:
// struct used for passing arguments
typedef struct {
double* restrict m;
double* restrict m_hat;
int t_id;
char padding[44];
} t_arg_matrix_hat;
// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
// set coordinate limits thread is able to act upon
size_t start = t_arg->t_id * BIG_CHUNK;
size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}
return NULL;
}
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
// create structs to send and retrieve matrix and value from threads
t_arg_matrix_hat t_args[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
t_args[i] = (t_arg_matrix_hat) {
.m = matrix,
.m_hat = matrix_hat,
.t_id = i
};
}
// create threads and send structs with matrix and value to divide the matrix and
// initialise the coordinates with the given value
pthread_t threads[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
}
// join threads after all coordinates have been intialised
for (size_t i = 0; i < g_nthreads; i++) {
pthread_join(threads[i], NULL);
}
return matrix_hat;
}
SIMD:
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener};
__m256d b = _mm256_loadu_pd(dampeners);
// Use simd to subtract values from each other
for (size_t i = 0; i < g_mod; i += 4) {
__m256d a = _mm256_loadu_pd(matrix + i);
__m256d res = _mm256_mul_pd(a, b);
_mm256_storeu_pd(&matrix_hat[i], res);
}
// Subtract values from each other that weren't included in simd
for (size_t i = g_mod; i < g_n2; i++) {
matrix_hat[i] = g_dampener * matrix[i];
}
double hats[4] = {HAT, HAT, HAT, HAT};
b = _mm256_loadu_pd(hats);
// Use simd to raise each value to the power 2
for (size_t i = 0; i < g_mod; i += 4) {
__m256d a = _mm256_loadu_pd(matrix_hat + i);
__m256d res = _mm256_add_pd(a, b);
_mm256_storeu_pd(&matrix_hat[i], res);
}
// Raise each value to the power 2 that wasn't included in simd
for (size_t i = g_mod; i < g_n2; i++) {
matrix_hat[i] += HAT;
}
return matrix_hat;
}
Pthreads & SIMD:
// struct used for passing arguments
typedef struct {
double* restrict m;
double* restrict m_hat;
int t_id;
char padding[44];
} t_arg_matrix_hat;
// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
// set coordinate limits thread is able to act upon
size_t start = t_arg->t_id * BIG_CHUNK;
size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
size_t leftovers = start + MOD;
__m256d b1 = _mm256_loadu_pd(dampeners);
//
for (size_t i = start; i < leftovers; i += 4) {
__m256d a1 = _mm256_loadu_pd(t_arg->m + i);
__m256d r1 = _mm256_mul_pd(a1, b1);
_mm256_storeu_pd(&t_arg->m_hat[i], r1);
}
//
for (size_t i = leftovers; i < end; i++) {
t_arg->m_hat[i] = dampeners[0] * t_arg->m[i];
}
__m256d b2 = _mm256_loadu_pd(hats);
//
for (size_t i = start; i < leftovers; i += 4) {
__m256d a2 = _mm256_loadu_pd(t_arg->m_hat + i);
__m256d r2 = _mm256_add_pd(a2, b2);
_mm256_storeu_pd(&t_arg->m_hat[i], r2);
}
//
for (size_t i = leftovers; i < end; i++) {
t_arg->m_hat[i] += hats[0];
}
return NULL;
}
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
// create structs to send and retrieve matrix and value from threads
t_arg_matrix_hat t_args[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
t_args[i] = (t_arg_matrix_hat) {
.m = matrix,
.m_hat = matrix_hat,
.t_id = i
};
}
// create threads and send structs with matrix and value to divide the matrix and
// initialise the coordinates with the given value
pthread_t threads[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
}
// join threads after all coordinates have been intialised
for (size_t i = 0; i < g_nthreads; i++) {
pthread_join(threads[i], NULL);
}
return matrix_hat;
}
I think it's because your SIMD code is horribly inefficient: It loops over the memory twice, instead of doing the add with the multiply, before storing. You didn't test SIMD vs. a scalar baseline, but if you had you'd probably find that your SIMD code wasn't a speedup with a single thread either.
STOP READING HERE if you want to solve the rest of your homework yourself.
If you used gcc -O3 -march=ivybridge, the simple scalar loop in the pthread version probably auto-vectorized into something like what you should have done with intrinsics. You even used restrict, so it might realize that the pointers can't overlap with each other, or with g_dampener.
// this probably autovectorizes well.
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}
// but this would be even safer to help the compiler's aliasing analysis:
double dampener = g_dampener; // in case the compiler thinks one of the pointers might point at the global
double *restrict hat = t_arg->m_hat;
const double *restrict mat = t_arg->m;
... same loop but using these locals instead of the struct members
It's probably not a problem for an FP loop, since double definitely can't alias with double *.
The coding style is also pretty nasty. You should give meaningful names to your __m256d variables whenever possible.
Also, you use malloc, which doesn't guarantee that matrix_hat will be aligned to a 32B boundary. C11's aligned_alloc is probably the nicest way, vs. posix_memalign (clunky interface), _mm_malloc (have to free with _mm_free, not free(3)), or other options.
double* construct_matrix_hat(const double* matrix) {
// double* matrix_hat = malloc(sizeof(double) * g_n2);
double* matrix_hat = aligned_alloc(64, sizeof(double) * g_n2);
// double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener}; // This idiom is terrible, and might actually compile to code that stores it 4 times on the stack and then loads.
__m256d vdamp = _mm256_set1_pd(g_dampener); // will compile to a broadcast-load (vbroadcastsd)
__m256d vhat = _mm256_set1_pd(HAT);
size_t last_full_vector = g_n2 & ~3ULL; // don't load this from a global.
// it's better for the compiler to see how it's calculated from g_n2
// ??? Use simd to subtract values from each other // huh? this is a multiply, not a subtract. Also, everyone can see it's using SIMD, that part adds no new information
// if you really want to manually vectorize this, instead of using an OpenMP pragma or -O3 on the scalar loop, then:
for (size_t i = 0; i < last_full_vector; i += 4) {
__m256d vmat = _mm256_loadu_pd(matrix + i);
__m256d vmul = _mm256_mul_pd(vmat, vdamp);
__m256d vres = _mm256_add_pd(vmul, vhat);
_mm256_store_pd(&matrix_hat[i], vres); // aligned store. Doesn't matter for performance.
}
#if 0
// Scalar cleanup
for (size_t i = last_full_vector; i < g_n2; i++) {
matrix_hat[i] = g_dampener * matrix[i] + HAT;
}
#else
// assume that g_n2 >= 4, and do a potentially-overlapping unaligned vector
if (last_full_vector != g_n2) {
// Or have this always run, and have the main loop stop one element sooner (so this overlaps by 0..3 instead of by 1..3 with a conditional)
assert(g_n2 >= 4);
__m256d vmat = _mm256_loadu_pd(matrix + g_n2 - 4);
__m256d vmul = _mm256_mul_pd(vmat, vdamp);
__m256d vres = _mm256_add_pd(vmul, vhat);
_mm256_storeu_pd(&matrix_hat[g_n2-4], vres);
}
#endif
return matrix_hat;
}
This version compiles (after defining a couple globals) to the asm we expect. BTW, normal people pass sizes around as function arguments. This is another way of avoiding optimization-failure due to C aliasing rules.
Anyway, really your best bet is to let OpenMP auto-vectorize it, because then you don't have to write a cleanup loop yourself. There's nothing tricky about the data organization, so it vectorizes trivially. (And it's not a reduction, like in your other question, so there's no loop-carried dependency or order-of-operations concern).
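For reference, a minimal sketch of that approach (my illustration, not the assignment's code; the simd clause needs OpenMP 4.0+, and you compile with something like gcc -O3 -march=ivybridge -fopenmp):
#pragma omp parallel for simd
for (size_t i = 0; i < g_n2; i++) {
    matrix_hat[i] = g_dampener * matrix[i] + HAT;
}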
I am able to declare two matrices A and B just fine.
But when using memcpy (to copy B from A), B gives me arrays of 0s.
What am I doing wrong? Is my code correct for using memcpy?
int r = 10, c = 10, i, j;
int (*MatrixA)[r];
MatrixA=malloc(c * sizeof(*MatrixA));
int (*MatrixB)[r];
MatrixB=malloc(c * sizeof(*MatrixB));
memcpy(MatrixB,MatrixA,c * sizeof(MatrixA));
for(i=1;i<r+1;i++)
{
for (j = 1; j < c+1; j++)
{
MatrixA[i][j]=j;
printf("A[%d][%d]= %d\t",i,j,MatrixA[i][j]);
}
printf("\n");
}
printf("\n");printf("\n");printf("\n");printf("\n");printf("\n");
for(i=1;i<r+1;i++)
{
for (j = 1; j < c+1; j++)
{
printf("B[%d][%d]= %d\t",i,j,MatrixB[i][j]);
}
printf("\n");
}
You copied the contents before initializing MatrixA. You also access indices out of bounds (r+1 evaluates to 11, which is out of bounds), causing UB. Do this instead:
for(i=0;i<r;i++) // i starts from 0
{
for (j =0; j < c; j++) // j from 0
{
MatrixA[i][j]=j;
printf("A[%d][%d]= %d\t",i,j,MatrixA[i][j]);
}
printf("\n");
}
memcpy(MatrixB,MatrixA,c * sizeof(*MatrixA)); // copy after setting MatrixA
for(i=0;i<r;i++) // similarly indexing starts with 0
{
for (j =0; j < c; j++)
{
printf("B[%d][%d]= %d\t",i,j,MatrixB[i][j]);
}
printf("\n");
}
Is my code correct for using memcpy?
No, your code is wrong, but that's less of a memcpy problem. You're simply doing C arrays wrong.
int r = 10, c = 10, i, j;
int (*MatrixA)[r];
MatrixA=malloc(c * sizeof(*MatrixA));
Ok, MatrixA is a pointer to a 10-element array of int, so the declaration itself only reserves space for one pointer; the malloc line then makes it point to a block big enough for ten such arrays, i.e. a full 10x10 block of ints. That allocation is fine, but you memcpy out of it before ever writing anything into it, your loops index from 1 to r and 1 to c (one past the end in both dimensions), and you never free either block, which a code analysis tool will flag as a memory leak.
These mistakes continue throughout your code; you will have to understand the difference between statically allocated C arrays and dynamic allocation using malloc.
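To make that difference concrete, a small sketch (the sizes are illustrative):
int a[10][10];                           /* an actual 2-D array: storage for 100 ints */

int (*b)[10] = malloc(10 * sizeof *b);   /* only a pointer is declared here */
if (b)
{
    b[3][4] = 7;                         /* same indexing syntax once it points at allocated memory */
    free(b);
}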
I want to do some calculations with matrices whose size is 2048*2048, for example. But the simulator stops working and does not simulate the code. I understood that the problem is about the size and type of the variable. For example, I ran the simple code below to check whether I am right or not. It should print 1 after declaring variable A, but it does not work.
Please note that I use Code::Blocks. WFM is a function that writes a float matrix to a text file, and it works properly because I checked it before with other matrices.
int main()
{
float A[2048][2048];
printf("1");
float *AP = &(A[0][0]);
const char *File_Name = "example.txt";
int counter = 0;
for(int i = 0; i < 2048; i++)
for(int j = 0; j < 2048; j++)
{
A[i][j] = counter;
++counter;
}
WFM(AP, 2048, 2048, File_Name , ' ');
return 0;
}
Any help and suggestions for dealing with this problem and larger matrices are appreciated.
Thanks
float A[2048][2048];
which requires approx. 2K * 2K * 4 = 16M of stack memory. But typically the stack size of the process is far less than that. Please allocate it dynamically using the malloc family of functions.
float A[2048][2048];
This may be too large for a local array; you should allocate the memory dynamically with a function such as malloc. For example, you could do this:
float *A = malloc(2048*2048*sizeof(float));
if (A == 0)
{
    perror("malloc");
    exit(1);
}
float *AP = A;
int counter = 0;
for(int i = 0; i < 2048; i++)
    for(int j = 0; j < 2048; j++)
    {
        *(A + 2048*i + j) = counter;
        ++counter;
    }
And when you do not need A anymore, you can free it with free(A);.
Helpful links about efficiency pitfalls of large arrays with power-of-2 size (offered by #LưuVĩnhPhúc):
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Why is my program slow when looping over exactly 8192 elements?
Matrix multiplication: Small difference in matrix size, large difference in timings