I have the following code:
#define FIRST_COUNT 100
#define X_COUNT 250
#define Y_COUNT 310
#define Z_COUNT 40
struct s_tsp {
short abc[FIRST_COUNT][X_COUNT][Y_COUNT][Z_COUNT];
};
struct s_tsp xyz;
I need to run through the data like this:
for (int i = 0; i < FIRST_COUNT; ++i)
for (int j = 0; j < X_COUNT; ++j)
for (int k = 0; k < Y_COUNT; ++k)
for (int n = 0; n < Z_COUNT; ++n)
doSomething(xyz, i, j, k, n);
I've tried to think of a more elegant, less brain-dead approach. (I know that this sort of multidimensional array is inefficient in terms of CPU usage, but that is irrelevant in this case.) Is there a better approach than the way I've structured things here?
If you need a 4D array, then that's what you need. It's possible to 'flatten' it into a single dimensional malloc()ed 'array', however that is not quite as clean:
abc = malloc(sizeof(short)*FIRST_COUNT*X_COUNT*Y_COUNT*Z_COUNT);
Accesses are also more difficult:
*(abc + X_COUNT*Y_COUNT*Z_COUNT*i + Y_COUNT*Z_COUNT*j + Z_COUNT*k + n)
So that's obviously a bit of a pain.
But you do have the advantage that if you need to simply iterate over every single element, you can do:
for (int i = 0; i < FIRST_COUNT*X_COUNT*Y_COUNT*Z_COUNT; i++) {
doWhateverWith(*(abc + i));
}
Clearly this method is ugly for most uses and only a bit neater for one type of access. Compared with a pointer-to-pointer layout, it is also a bit more memory-conservative and requires only one pointer dereference rather than 4.
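As a rough sketch of the index arithmetic for a flattened 4D array, the helper below computes the row-major offset that C itself uses for short abc[d0][d1][d2][d3]. The name flat_index4 and its parameter order are made up for illustration; they do not come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maps (i, j, k, n) to the row-major offset of
   element [i][j][k][n] in an array declared as T a[d0][d1][d2][d3].
   The outermost extent d0 is not needed to compute the offset. */
static size_t flat_index4(size_t d1, size_t d2, size_t d3,
                          size_t i, size_t j, size_t k, size_t n)
{
    assert(j < d1 && k < d2 && n < d3);
    /* Horner form of i*d1*d2*d3 + j*d2*d3 + k*d3 + n */
    return ((i * d1 + j) * d2 + k) * d3 + n;
}
```

With the macros from the question, element (i, j, k, n) of a malloc()ed block would then be abc[flat_index4(X_COUNT, Y_COUNT, Z_COUNT, i, j, k, n)].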
NOTE: The intention of the examples used in this post is just to explain the concepts. So the examples may be incomplete, may lack error handling, etc.
When it comes to using multidimensional arrays in C, the following are the two possible ways.
Flattening of Arrays
In C, arrays are implemented as a contiguous memory block. This information can be used to manipulate the values stored in the array and allows rapid access to a particular array location.
For example,
int arr[10][10];
int *ptr = (int *)arr ;
ptr[11] = 10;
// this is equivalent to arr[1][1] = 10; declare a 2D array
// and then manipulate it as a single-dimensional array.
The technique of exploiting the contiguous nature of arrays is known as flattening of arrays.
Ragged Arrays
Now, consider the following example.
const char *list[3]; /* storage for three string pointers */
list[0] = "United States of America";
list[1] = "India";
list[2] = "United Kingdom";
for(int i=0; i< 3 ;i++)
printf(" %zu ",strlen(list[i])); /* strlen returns size_t, so use %zu */
// prints 24 5 14
This type of implementation is known as a ragged array, and is useful where strings of variable size are used. A popular method is to do a dynamic memory allocation for every dimension.
NOTE: The command-line arguments (char *argv[]) are always passed as a ragged array.
Comparing flattened and ragged arrays
Now, let's consider the following code snippet, which compares flattened and ragged arrays.
/* Note: lacks error handling */
int flattened[30][20][10];
int ***ragged;
int i,j,numElements=0,numPointers=1;
ragged = (int ***) malloc(sizeof(int **) * 30);
numPointers += 30;
for( i=0; i<30; i++) {
ragged[i] = (int **)malloc(sizeof(int*) * 20);
numPointers += 20;
for(j=0; j<20; j++) {
ragged[i][j]=(int*)malloc(sizeof(int) * 10);
numElements += 10;
}
}
printf("Number of elements = %d",numElements);
printf("Number of pointers = %d",numPointers);
// it prints
// Number of elements = 6000
// Number of pointers = 631
From the above example, the ragged array requires 631 pointers; in other words, 631 * sizeof(int *) bytes of extra memory to point to 6000 integers. The flattened array, by contrast, requires only one base pointer: the name of the array is enough to refer to the 6000 contiguous memory locations.
On the other hand, ragged arrays are flexible. When the exact number of memory locations required is not known, you may not have the luxury of allocating memory for the worst possible case. And in some cases the amount of memory required is known only at run time. In such situations ragged arrays come in handy.
Row-major and column-major of Arrays
C follows row-major ordering for multidimensional arrays. Flattening of arrays can be viewed as a consequence of this. The significance of row-major order is that it matches the natural way most array accesses are written. For example, let's look at traversing an N * M 2D matrix:
for(i=0; i<N; i++) {
for(j=0; j<M; j++)
printf("%d ", matrix[i][j]);
printf("\n");
}
Each row in the matrix is accessed one by one, with the column index varying most rapidly. The C array is arranged in memory in this natural way. In contrast, consider the following example:
for(i=0; i<M; i++) {
for(j=0; j<N; j++)
printf("%d ", matrix[j][i]);
printf("\n");
}
Here the row index changes more frequently than the column index, and this makes a real difference in efficiency between the two snippets: the first one is more efficient than the second.
The first one accesses the array in C's natural (row-major) order, so consecutive accesses touch adjacent memory, whereas the second jumps a whole row between accesses. The difference in performance tends to widen as the number of dimensions and the size of the elements increase.
So when working with multidimensional arrays in C, it's good to keep the above details in mind!
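To make the stride difference concrete, here is a small sketch over a flat buffer, where element (i, j) of an n * m matrix lives at a[i * m + j]. Both functions return the same sum; only the order of memory accesses differs. The function names are illustrative, and no timing figures are claimed here.

```c
#include <assert.h>
#include <stddef.h>

/* Visits elements in memory order: the inner loop walks along one row,
   so consecutive accesses are one element apart. */
static long sum_row_order(size_t n, size_t m, const int *a)
{
    assert(a != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            total += a[i * m + j];
    return total;
}

/* Same result, but the inner loop walks down one column, so consecutive
   accesses are m elements apart. */
static long sum_column_order(size_t n, size_t m, const int *a)
{
    assert(a != NULL);
    long total = 0;
    for (size_t j = 0; j < m; j++)
        for (size_t i = 0; i < n; i++)
            total += a[i * m + j];
    return total;
}
```

For small matrices that fit in cache the two behave alike; the gap usually shows up once a row no longer fits in a cache line or the whole matrix no longer fits in cache.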
Related
I'm currently trying to optimize matrix operations with intrinsics and loop unrolling. I get a segmentation fault which I can't figure out. Here is the code I changed:
const int UNROLL = 4;
void outer_product(matrix *vec1, matrix *vec2, matrix *dst) {
assert(vec1->dim.cols == 1 && vec2->dim.cols == 1 && vec1->dim.rows == dst->dim.rows && vec2->dim.rows == dst->dim.cols);
__m256 tmp[4];
for (int x = 0; x < UNROLL; x++) {
tmp[x] = _mm256_setzero_ps();
}
for (int i = 0; i < vec1->dim.rows; i+=UNROLL*8) {
for (int j = 0; j < vec2->dim.rows; j++) {
__m256 row2 = _mm256_broadcast_ss(&vec2->data[j][0]);
for (int x = 0; x<UNROLL; x++) {
tmp[x] = _mm256_mul_ps(_mm256_load_ps(&vec1->data[i+x*8][0]), row2);
_mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
}
}
}
}
void matrix_multiply(matrix *mat1, matrix *mat2, matrix *dst) {
assert (mat1->dim.cols == mat2->dim.rows && dst->dim.rows == mat1->dim.rows && dst->dim.cols == mat2->dim.cols);
for (int i = 0; i < mat1->dim.rows; i+=UNROLL*8) {
for (int j = 0; j < mat2->dim.cols; j++) {
__m256 tmp[4];
for (int x = 0; x < UNROLL; x++) {
tmp[x] = _mm256_setzero_ps();
}
for (int k = 0; k < mat1->dim.cols; k++) {
__m256 mat2_s = _mm256_broadcast_ss(&mat2->data[k][j]);
for (int x = 0; x < UNROLL; x++) {
tmp[x] = _mm256_add_ps(tmp[x], _mm256_mul_ps(_mm256_load_ps(&mat1->data[i+x*8][k]), mat2_s));
}
}
for (int x = 0; x < UNROLL; x++) {
_mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
}
}
}
}
edited:
Here is the matrix struct. I didn't modify it.
typedef struct shape {
int rows;
int cols;
} shape;
typedef struct matrix {
shape dim;
float** data;
} matrix;
edited:
I used gdb to figure out which line caused the segmentation fault, and it looks like it was _mm256_load_ps(). Am I indexing into the matrix in a wrong way such that it cannot load from the correct address? Or is it a problem with memory alignment?
In at least one place, you're doing 32-byte alignment-required loads with a stride of only 4 bytes. I think that's not what you actually meant to do, though:
for (int k = 0; k < mat1->dim.cols; k++) {
for (int x = 0; x < UNROLL; x++) {
...
_mm256_load_ps(&mat1->data[i+x*8][k])
}
}
_mm256_load_ps loads 8 contiguous floats, i.e. it loads data[i+x*8][k] to data[i+x*8][k+7]. I think you want data[i+x][k*8], and loop over k in the inner-most loop.
If you need unaligned loads / stores, use _mm256_loadu_ps / _mm256_storeu_ps. But prefer aligning your data to 32B, and pad the storage layout of your matrix so the row stride is a multiple of 32 bytes. (The actual logical dimensions of the array don't have to match the stride; it's fine to leave padding at the end of each row out to a multiple of 16 or 32 bytes. This makes loops much easier to write.)
You're not even using a 2D array (you're using an array of pointers to arrays of float), but the syntax looks the same as for float A[100][100], even though the meaning in asm is very different. Anyway, in Fortran 2D arrays the indexing goes the other way, where incrementing the left-most index takes you to the next position in memory. But in C, varying the left index by one takes you to a whole new row. (Pointed to by a different element of float **data, or in a proper 2D array, one row stride away.) Of course you're striding by 8 rows because of this mixup combined with using x*8.
Speaking of the asm, you get really bad results for this code especially with gcc, where it reloads 4 things for every vector, I think because it's not sure the vector stores don't alias the pointer data. Assign things to local variables to make sure the compiler can hoist them out of loops. (e.g. float **mat1dat = mat1->data;.) Clang does slightly better, but the access pattern in the source is inherently bad and requires pointer-chasing for each inner-loop iteration to get to a new row, because you loop over x instead of k. I put it up on the Godbolt compiler explorer.
But really you should optimize the memory layout first, before trying to manually vectorize it. It might be worth transposing one of the arrays, so you can loop over contiguous memory for rows of one matrix and columns of the other while doing the dot product of a row and column to calculate one element of the result. Or it could be worth doing c[Arow,Bcol] += a_value_from_A * b[Arow,Bcol] inside an inner loop instead of transposing up front (but that's a lot of memory traffic). But whatever you do, make sure you're not striding through non-contiguous accesses to one of your matrices in the inner loop.
You'll also want to ditch the array-of-pointers thing and do manual 2D indexing (data[row * row_stride + col]) so your data is all in one contiguous block instead of having each row allocated separately. Making this change first, before you spend any time manually vectorizing, seems to make the most sense.
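A minimal sketch of such a layout, assuming C11 aligned_alloc is available: one contiguous block, with the row stride padded to a multiple of 8 floats (32 bytes) so that every row starts on a 32-byte boundary. The type flat_matrix and its helper functions are hypothetical names, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical replacement for the float** matrix: one contiguous block. */
typedef struct flat_matrix {
    size_t rows, cols;
    size_t stride; /* floats per stored row, a multiple of 8 (32 bytes) */
    float *data;   /* element (r, c) is data[r * stride + c] */
} flat_matrix;

/* Returns 0 on success, -1 on allocation failure. Padding is zeroed. */
static int flat_matrix_init(flat_matrix *m, size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    m->rows = rows;
    m->cols = cols;
    m->stride = (cols + 7) / 8 * 8;        /* round up to 8 floats */
    size_t count = rows * m->stride;
    /* The byte size is a multiple of 32, as aligned_alloc requires. */
    m->data = aligned_alloc(32, count * sizeof(float));
    if (m->data == NULL)
        return -1;
    for (size_t i = 0; i < count; i++)
        m->data[i] = 0.0f;
    return 0;
}

/* Pointer to element (r, c); rows are contiguous in memory. */
static float *flat_matrix_at(flat_matrix *m, size_t r, size_t c)
{
    return &m->data[r * m->stride + c];
}

/* Nonzero if row r starts on a 32-byte boundary. */
static int flat_matrix_row_aligned(const flat_matrix *m, size_t r)
{
    return (uintptr_t)&m->data[r * m->stride] % 32 == 0;
}
```

Release the storage with free(m.data). With this layout an aligned 32-byte load is valid at the start of any row and at every column offset that is a multiple of 8.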
gcc or clang with -O3 should do a not-terrible job of auto-vectorizing scalar C, especially if you compile with -ffast-math. (You might remove -ffast-math after you're done manually vectorizing, but use it while tuning with auto-vectorization).
Related:
How does BLAS get such extreme performance?
Also see my comments on Poor maths performance in C vs Python/numpy for another bad-memory-layout problem.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
You might manually vectorize before or after looking at cache-blocking, but when you do, see Matrix Multiplication with blocks.
I am writing a program in C which needs to read a set of values (integers) into a 2D array and then perform certain mathematical operations on it. I have decided to implement a check, as the user is inputting the values, to prevent them from entering values that are already present in the array.
I am however unsure of how to go about this check. I figured out I might need some sort of recursive function to check all the elements previous to the one that's being entered, but I don't know how to implement it.
Please find below a snippet of my code for illustrative purposes:
Row and col are values inputted by the user for the dimension of the array
for (int i=0; i<row;i++){
for (int j=0; j<col; j++){
scanf("%d", &arr[i][j]); //take in elements
}
}
for (int i = 0; i < row; i++)
{
for (int j = 0; i < col; j++)
{
if (arr[i][j] == arr[i][j-1]){
printf("Duplicate.\n");}
else {}
}
}
I know this is probably not correct but it's my attempt.
Any help would be much appreciated.
I would suggest that you store every element you read in a temporary 1D array. Every time you scan a new element, traverse the 1D array to check whether the value already exists. Although this is not optimal, it will at least be less expensive than traversing the 2D array every time.
Example:
int temp[SIZE]; /* SIZE must be at least row * col */
int elements = 0;
for (int i = 0; i < row; i++) {
for (int j = 0; j < col; j++) {
scanf("%d", &arr[i][j]); //take in elements
/* compare against earlier values only, before storing the new one */
for (int k = 0; k < elements; k++) {
if (temp[k] == arr[i][j])
printf("Duplicate.\n"); //or do whatever you wish
}
temp[elements] = arr[i][j];
elements++;
}
}
A balanced tree inserts and searches in O(log N) time.
Since the algorithms are quite simple & standard and were published in the seminal books by Knuth, there are plenty of implementations out there, including a clear and concise one at codereview.SE (which is thus automatically CC-BY-SA 3.0; do apply a bugfix in the answer). Using it (as well as virtually any other one) is simple: start with node* root = NULL;, then insert and search, and finally free_tree.
Asymptotically, the best method is a hash table with O(1) for both, but that is probably overkill (the algorithms are much more complex and the memory footprint is larger) unless you have a lot of numbers. For C++, there's a standard implementation, and there are plenty of third-party ones for C, too.
If your number of input values is small, even the tree may be overkill, and simply looking through previous values would be fast enough. If your 2D array is contiguous in memory, you can access it as 1D with int* arr1d = (int*)&arr2d.
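For the small-input case, the "look through previous values" check could be sketched as a plain linear scan. The helper name seen_before is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if value occurs among the first count entries of values. */
static bool seen_before(const int *values, size_t count, int value)
{
    assert(values != NULL || count == 0);
    for (size_t k = 0; k < count; k++)
        if (values[k] == value)
            return true;
    return false;
}
```

When reading element (i, j) of a contiguous array with COLS columns, the number of values already stored is i * COLS + j, so the call would look like seen_before(arr1d, i * COLS + j, new_value), made before the new value is stored.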
Sorry, I'm relatively new to C and am trying to create two 2-D arrays using malloc. I was told that this method is computationally more efficient than creating a pointer array of arrays through a for loop (for large arrays).
int i, j;
double **PNow, **PNext, *Array2D1, *Array2D2;
//Allocate memory
PNow = (double**)malloc(3 * sizeof(double*));
PNext = (double**)malloc(3 * sizeof(double*));
Array2D1 = (double*)malloc(5 * sizeof(double));
Array2D2 = (double*)malloc(5 * sizeof(double));
//Create 2-Dimensionality
for(i = 0; i < 3; i++)
{
PNow[i] = Array2D1 + i * 5;
PNext[i] = Array2D2 + i * 5;
};
//Define Element Values
for(i = 0; i < 3; i++)
{
for(j = 0; j < 5; j++)
{
PNow[i][j] = 10.*(i + j);
PNext[i][j] = 1000.*(i + j);
};
};
//Output two matrices side-by-side.
for(i = 0; i < 3; i++)
{
for(j = 0; j < 5; j++)
{
printf("%6lg", PNow[i][j]);
if(j == 4)
{
printf("|");
};
};
for(j = 0; j < 5; j++)
{
printf("%6lg", PNext[i][j]);
if(j == 4)
{
printf("\n");
};
};
};
My problem is that the first matrix (PNow) turns out as I would expect, but for some reason half of the values in PNext are those of PNow, and I can't for the life of me figure out why it is doing this? I'm obviously missing something.. Also I am not overly clear on what "Array2D1 + i*5" is doing and how this makes PNow a 2-D array?
Any help would be really appreciated.
Thank you.
P.S. This is the output that I am getting, so you can see what I mean:
0 10 20 30 40| 20 30 40 50 20
10 20 30 40 50| 30 40 50 60 5000
20 30 40 50 60| 2000 3000 4000 5000 6000
In C you don't cast the result of mallocs, so your malloc lines should read
PNow = malloc(3*sizeof(double*));
Your problem is you're not actually allocating enough memory in Array2D1 and Array2D2. When you move past the first "row" in your array you're getting beyond your allocated memory! So you're in undefined behavior territory. In your case, it looks like your two matrices step all over each other (though my test just throws an error). You can solve this in two ways:
Specify the full size of your matrix in the malloc and do as you did:
Array2D1 = malloc(15*sizeof(double));
Array2D2 = malloc(15*sizeof(double));
Or malloc each line in your for loop:
for(i=0; i<3; i++){
PNow[i] = malloc(5*sizeof(double));
PNext[i] = malloc(5*sizeof(double));
}
Edit: On the topic of freeing in each example
For the first example, the freeing is straightforward:
free(PNow);
free(PNext);
free(Array2D1);
free(Array2D2);
For the second, you must iterate through each line and free it individually, then free the two pointer arrays themselves:
for (i = 0; i < 3; i++) {
free(PNow[i]);
free(PNext[i]);
}
free(PNow);
free(PNext);
Edit2: Realistically, if you're going to hardcode your rows and columns in with a pre-processor macro, there's no reason to malloc at all. You can simply do this:
#define ROW 3
#define COL 5
double PNow[ROW][COL], PNext[ROW][COL];
Edit3: As for what Array2D1 + i * 5 is doing, PNow is an array of pointers, and Array2D1 is a pointer. By adding i * 5 you're incrementing the pointer by i * 5 (i.e., saying "give me a pointer to the memory that is i * 5 doubles away from Array2D1"). So, you're filling PNow with pointers to the starts of appropriately sized memory chunks for your rows.
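Putting the two pieces together, the single-block approach can be sketched as below, assuming run-time dimensions: one allocation of rows * cols doubles, plus a table of row pointers into it. The function name alloc_2d is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Builds a rows x cols table of doubles: one contiguous data block plus
   a table of row pointers into it. Returns NULL on allocation failure.
   Release with: free(p[0]); free(p); */
static double **alloc_2d(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double **p = malloc(rows * sizeof *p);
    double *block = malloc(rows * cols * sizeof *block);
    if (p == NULL || block == NULL) {
        free(p);
        free(block);
        return NULL;
    }
    for (size_t i = 0; i < rows; i++)
        p[i] = block + i * cols; /* start of row i inside the block */
    return p;
}
```

Because p[0] points at the start of the data block, two calls to free release everything, however many rows there are.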
Your code does not have 2D arrays, aka matrices. And your pointers cannot point to such an object either.
A proper pointer which can point to a 2D array would be declared like:
#define ROWS 4
#define COLS 5
double (*arr)[COLS];
Allocation is straight-forward:
arr = malloc(sizeof(*arr) * ROWS);
Deallocation is similarly simple:
free(arr);
Indexing is like:
arr[row][col]
Note that only the syntax is identical to the pointer-to-pointer version; the semantics are different.
Nothing more is necessary, and there is no need for hand-crafted pointer arrays.
The code above shows another important rule: Don't use magic values. Use constant-like macros instead. These should be #defined at the beginning or in a configuration section of the code (typically somewhere near the top of the file or in a distinct header file). So if you later change e.g. the length of a dimension, you don't have to edit every place you wrote it explicitly, but only change the macro once.
While the code above uses constants, you can just as well use variables for the dimensions. This is standard C and is called a variable length array (VLA). If you pass such arrays to other functions, you have to pass the dimensions as additional arguments:
void f(size_t rows, size_t cols, double a[rows][cols]);
Remember array-arguments decay to pointers to the first element, so a is actually the same as arr above. The outermost dimension can be omitted, but as you need it anyway it is good for documentation to specify it, too.
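A short sketch of the run-time-dimension variant, assuming a compiler with VLA support (mandatory in C99, optional from C11 on). The functions fill and demo_last_element are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* a decays to double (*)[cols]; the dimensions must precede it in the
   parameter list so that they are in scope for the array declarator. */
static void fill(size_t rows, size_t cols, double a[rows][cols], double scale)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            a[r][c] = scale * (double)(r + c);
}

/* Allocates a rows x cols array in a single malloc, fills it, and returns
   its last element; returns -1.0 if the allocation fails. */
static double demo_last_element(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double (*arr)[cols] = malloc(sizeof *arr * rows);
    if (arr == NULL)
        return -1.0;
    fill(rows, cols, arr, 10.0);
    double last = arr[rows - 1][cols - 1];
    free(arr);
    return last;
}
```

Only the pointer lives on the stack here; the array itself is on the heap, so large dimensions do not risk a stack overflow.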
There is some pseudocode that I want to implement in C, but I am in doubt about how to implement part of it. The pseudocode is:
for every pair of states qi, and qj, i<j, do
D[i,j] := 0
S[i,j] := notzero
end for
i and j in qi and qj are subscripts.
How do I represent D[i,j] or S[i,j]? Which data structure should I use so that it's simple and fast?
You can use something like
int length= 10;
int i =0, j= 0;
int res1[10][10] = {0, }; //index is based on "length" value
int res2[10][10] = {0, }; //index is based on "length" value
and then
for (i =0; i < length; i++)
{
for (j =0; j < length; j++)
{
res1[i][j] = 0;
res2[i][j] = 1;//notzero
}
}
Here D[i,j] and S[i,j] are represented by res1[10][10] and res2[10][10], respectively. These are called two-dimensional arrays.
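If only the pairs with i < j should be touched, as the pseudocode states, the inner loop can start at i + 1. The sketch below assumes a VLA-capable compiler; init_pairs is an invented name, and using 1 for "notzero" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Initialises only the entries with i < j (the upper triangle);
   all other entries of D and S are left untouched. */
static void init_pairs(size_t n, int D[n][n], int S[n][n])
{
    assert(D != NULL && S != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            D[i][j] = 0;
            S[i][j] = 1; /* "notzero": 1 is an arbitrary non-zero marker */
        }
    }
}
```

This visits n * (n - 1) / 2 entries instead of n * n.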
I guess a struct will be your friend here, depending on what you actually want to work with.
A struct would be fine if, say, a pair of states forms some kind of entity.
Otherwise you could use a two-dimensional array.
Added after an answer was accepted.
Depending on coding goals and platform, to get "simple and fast", using a pointer to pointer to a number may be faster than a 2-D array in C.
// 2-D array
double x[MAX_ROW][MAX_COL];
// Code computes the address in `x`, often involving a i*MAX_COL, if not in a loop.
// Slower when multiplication is expensive and random array access occurs.
x[i][j] = f();
// pointer to pointer of double
double **y = calloc(MAX_ROW, sizeof *y);
for (i=0; i<MAX_ROW; i++) y[i] = calloc(MAX_COL, sizeof *(y[i]));
// Code computes the address in `y` by a lookup of y[i]
y[i][j] = f();
Flexibility
The first data type is easy to pass, as in print(x), when the array size is fixed, but becomes challenging otherwise.
The second data type is easy to pass, as in print(y, rows, columns), when the array size is variable, and of course works well with fixed sizes too.
The second data type also allows row swapping simply by swapping pointers.
So if the code uses a fixed array size, use double x[MAX_ROW][MAX_COL]; otherwise I recommend double **y. YMMV.
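The row-swapping point can be illustrated with a few lines; swap_rows is a hypothetical helper and assumes y is a table of row pointers as allocated above.

```c
#include <assert.h>
#include <stddef.h>

/* Exchanges rows a and b by swapping two pointers; no element is copied. */
static void swap_rows(double **y, size_t a, size_t b)
{
    assert(y != NULL);
    double *tmp = y[a];
    y[a] = y[b];
    y[b] = tmp;
}
```

With a true 2-D array the same operation has to copy a whole row of elements through a temporary buffer.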
I have an array of structures that I am trying to shift left by 1 array node. The total size of the array is huge (about 3 gigabytes), so even though I know the exact size of array I need, it is too big to declare on the stack (even though I have 16 gig of ram and am writing a 64bit program), thus complicating things by forcing me to do dynamic memory alloc:
struct s_ptx
{
short streamIndex;
double raw;
char rawDx;
} *Ptx[100];
void allocateMemory(void)
{
ptxTotal = 300;
for (int i = 0; i < 100; ++i)
Ptx[i] = (struct s_ptx*) calloc( ptxTotal, sizeof(struct s_ptx));
}
void shiftDataStructures(void)
{
for (int j = 100 - 1; j > 0; --j)
Ptx[j] = Ptx[j - 1];
}
But I get wrong results, because the shiftDataStructures function is not working. Any ideas of how I need to rewrite this.
You are not shifting structs, only pointers. What are you actually trying to achieve here?
Also, why do you need to shift array indexes at all? Why not use, say, a linked list or a ring buffer? As for what the error itself would be, I have no clue because you provide insufficient data; your loop runs in the correct direction, so it does not overwrite the pointers.
Try swapping the data inside the structures instead of shifting the pointers. The result is a circular array in which Ptx[99] wraps around to Ptx[0].
Sample code:
// Change codes in the following line
for (int j = 100 - 1; j > 0; --j)
//Ptx[j] = Ptx[j - 1];
swap(Ptx[j], Ptx[j - 1]); // swap() is not standard C; you must define it yourself, e.g. as a macro
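One way to get that circular behaviour without copying any struct data is to rotate the pointer array itself. This is only a sketch of one possible reading of the goal: the struct is copied from the question, and rotate_right is an invented name. Note that the plain shift in the question leaves Ptx[0] and Ptx[1] pointing at the same block and drops the pointer that was in the last slot.

```c
#include <assert.h>
#include <stddef.h>

struct s_ptx {
    short streamIndex;
    double raw;
    char rawDx;
};

/* Rotates an array of pointers right by one: p[k] moves to p[k + 1] and
   the last pointer wraps around to p[0]. No pointer is lost or duplicated. */
static void rotate_right(struct s_ptx **p, size_t count)
{
    assert(p != NULL || count == 0);
    if (count < 2)
        return;
    struct s_ptx *last = p[count - 1];
    for (size_t j = count - 1; j > 0; --j)
        p[j] = p[j - 1];
    p[0] = last;
}
```

Called as rotate_right(Ptx, 100), this touches 100 pointers regardless of how large each block is.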