I am trying to perform LoopDependenceAnalysis on arrays in LLVM. For this I have written an LLVM LoopPass. I am able to detect the arrays using GetElementPtr, but I am unable to determine the stride of an array access used in a loop.
For example, I have this C code:
int b[10];
for (int i = 0; i < 10; i++)
{
    b[i] = b[i+2];
}
Now, of the two array accesses, the first one (b[i]) has a stride of 0 while the second one (b[i+2]) has a stride of 2. How is it possible to determine those values?
I'm trying to figure out a suitable way to apply row-wise permutation of a matrix using SIMD intrinsics (mainly AVX/AVX2 and AVX512).
The problem is basically calculating R = PX where P is a permutation matrix (sparse) with only 1 nonzero element per column. This allows one to represent matrix P as a vector p where p[i] is the row index of the nonzero value for column i. The code below shows a simple loop to achieve this:
// R and X are 2D matrices with shape = (m, n), same size
for (size_t i = 0; i < m; ++i) {
    for (size_t j = 0; j < n; ++j) {
        R[p[i]][j] += X[i][j];
    }
}
I assume it all boils down to gather, but before spending a long time trying to implement various approaches, I would love to know what you folks think about this and what the most suitable approach to tackling it would be.
Isn't it strange that none of the compilers use AVX-512 for this?
https://godbolt.org/z/ox9nfjh8d
Why is it that gcc doesn't do register blocking? I see clang does a better job; is this common?
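One possible direction for the row-permutation itself, offered only as a sketch under assumptions not stated in the post (row-major float storage, n a multiple of 8, and a made-up function name): because P permutes whole rows and each row is contiguous, the inner j loop can use plain AVX loads and stores rather than gathers.
#include <immintrin.h>
#include <stddef.h>

// R and X are row-major m x n float matrices; p[i] gives the destination row for source row i.
// Assumes n is a multiple of 8; a scalar tail loop would be needed otherwise.
void permute_add_rows(float *R, const float *X, const size_t *p, size_t m, size_t n) {
    for (size_t i = 0; i < m; ++i) {
        float *r = R + p[i] * n;        // destination row p[i] of R
        const float *x = X + i * n;     // source row i of X
        for (size_t j = 0; j < n; j += 8) {
            __m256 rv = _mm256_loadu_ps(r + j);
            __m256 xv = _mm256_loadu_ps(x + j);
            _mm256_storeu_ps(r + j, _mm256_add_ps(rv, xv));
        }
    }
}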
I'm currently trying to optimize matrix operations with intrinsics and loop unrolling. There was a segmentation fault which I couldn't figure out. Here is the code I changed:
const int UNROLL = 4;
void outer_product(matrix *vec1, matrix *vec2, matrix *dst) {
    assert(vec1->dim.cols == 1 && vec2->dim.cols == 1 && vec1->dim.rows == dst->dim.rows && vec2->dim.rows == dst->dim.cols);
    __m256 tmp[4];
    for (int x = 0; x < UNROLL; x++) {
        tmp[x] = _mm256_setzero_ps();
    }
    for (int i = 0; i < vec1->dim.rows; i += UNROLL*8) {
        for (int j = 0; j < vec2->dim.rows; j++) {
            __m256 row2 = _mm256_broadcast_ss(&vec2->data[j][0]);
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_mul_ps(_mm256_load_ps(&vec1->data[i+x*8][0]), row2);
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            }
        }
    }
}

void matrix_multiply(matrix *mat1, matrix *mat2, matrix *dst) {
    assert(mat1->dim.cols == mat2->dim.rows && dst->dim.rows == mat1->dim.rows && dst->dim.cols == mat2->dim.cols);
    for (int i = 0; i < mat1->dim.rows; i += UNROLL*8) {
        for (int j = 0; j < mat2->dim.cols; j++) {
            __m256 tmp[4];
            for (int x = 0; x < UNROLL; x++) {
                tmp[x] = _mm256_setzero_ps();
            }
            for (int k = 0; k < mat1->dim.cols; k++) {
                __m256 mat2_s = _mm256_broadcast_ss(&mat2->data[k][j]);
                for (int x = 0; x < UNROLL; x++) {
                    tmp[x] = _mm256_add_ps(tmp[x], _mm256_mul_ps(_mm256_load_ps(&mat1->data[i+x*8][k]), mat2_s));
                }
            }
            for (int x = 0; x < UNROLL; x++) {
                _mm256_store_ps(&dst->data[i+x*8][j], tmp[x]);
            }
        }
    }
}
edited:
Here is the matrix struct. I didn't modify it.
typedef struct shape {
    int rows;
    int cols;
} shape;

typedef struct matrix {
    shape dim;
    float** data;
} matrix;
edited:
I used gdb to figure out which line caused the segmentation fault, and it looked like it was _mm256_load_ps(). Am I indexing into the matrix in a wrong way, such that it cannot load from the correct address? Or is it a problem of memory alignment?
In at least one place, you're doing 32-byte alignment-required loads with a stride of only 4 bytes. I think that's not what you actually meant to do, though:
for (int k = 0; k < mat1->dim.cols; k++) {
    for (int x = 0; x < UNROLL; x++) {
        ...
        _mm256_load_ps(&mat1->data[i+x*8][k])
    }
}
_mm256_load_ps loads 8 contiguous floats, i.e. it loads data[i+x*8][k] to data[i+x*8][k+7]. I think you want data[i+x][k*8], and loop over k in the inner-most loop.
If you need unaligned loads / stores, use _mm256_loadu_ps / _mm256_storeu_ps. But prefer aligning your data to 32B, and pad the storage layout of your matrix so the row stride is a multiple of 32 bytes. (The actual logical dimensions of the array don't have to match the stride; it's fine to leave padding at the end of each row out to a multiple of 16 or 32 bytes. This makes loops much easier to write.)
You're not even using a 2D array (you're using an array of pointers to arrays of float), but the syntax looks the same as for float A[100][100], even though the meaning in asm is very different. Anyway, in Fortran 2D arrays the indexing goes the other way, where incrementing the left-most index takes you to the next position in memory. But in C, varying the left index by one takes you to a whole new row. (Pointed to by a different element of float **data, or in a proper 2D array, one row stride away.) Of course you're striding by 8 rows because of this mixup combined with using x*8.
Speaking of the asm, you get really bad results for this code especially with gcc, where it reloads 4 things for every vector, I think because it's not sure the vector stores don't alias the pointer data. Assign things to local variables to make sure the compiler can hoist them out of loops. (e.g. float **mat1dat = mat1->data;) Clang does slightly better, but the access pattern in the source is inherently bad and requires pointer-chasing for each inner-loop iteration to get to a new row, because you loop over x instead of k. I put it up on the Godbolt compiler explorer.
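As a tiny illustration of that hoisting, keeping the question's float** layout (the variable names are only examples):
float **mat1dat = mat1->data;   // read the row-pointer tables once...
float **mat2dat = mat2->data;   // ...so the compiler does not have to reload them
float **dstdat  = dst->data;    // after every vector store that might alias them
// then index mat1dat[row][col], mat2dat[row][col] and dstdat[row][col] inside the loops.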
But really you should optimize the memory layout first, before trying to manually vectorize it. It might be worth transposing one of the arrays, so you can loop over contiguous memory for rows of one matrix and columns of the other while doing the dot product of a row and column to calculate one element of the result. Or it could be worth doing c[Arow,Bcol] += a_value_from_A * b[Arow,Bcol] inside an inner loop instead of transposing up front (but that's a lot of memory traffic). But whatever you do, make sure you're not striding through non-contiguous accesses to one of your matrices in the inner loop.
You'll also want to ditch the array-of-pointers thing and do manual 2D indexing (data[row * row_stride + col]) so your data is all in one contiguous block instead of having each row allocated separately. Making this change first, before you spend any time manually vectorizing, seems to make the most sense.
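A minimal sketch of that contiguous layout, assuming C11 aligned_alloc and float data; the struct and helper names here are made up, not from the question:
#include <stdlib.h>
#include <stddef.h>

typedef struct {
    int rows, cols;
    size_t stride;   // floats per row, padded up to a multiple of 8 (32 bytes)
    float *data;     // one contiguous, 32-byte-aligned block
} flat_matrix;

static flat_matrix flat_matrix_alloc(int rows, int cols) {
    flat_matrix m;
    m.rows = rows;
    m.cols = cols;
    m.stride = ((size_t)cols + 7) & ~(size_t)7;                        // pad the row stride to 8 floats
    m.data = aligned_alloc(32, (size_t)rows * m.stride * sizeof(float));
    return m;
}

static inline float *flat_at(flat_matrix *m, int row, int col) {
    return &m->data[(size_t)row * m->stride + col];                    // data[row * row_stride + col]
}
With this layout, a 32-byte-aligned load such as _mm256_load_ps(flat_at(&m, i, j)) is safe whenever j is a multiple of 8.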
gcc or clang with -O3 should do a not-terrible job of auto-vectorizing scalar C, especially if you compile with -ffast-math. (You might remove -ffast-math after you're done manually vectorizing, but use it while tuning with auto-vectorization).
Related:
How does BLAS get such extreme performance?
Also see my comments on Poor maths performance in C vs Python/numpy for another bad-memory-layout problem.
how to optimize matrix multiplication (matmul) code to run fast on a single processor core
You might manually vectorize before or after looking at cache-blocking, but when you do, see Matrix Multiplication with blocks.
I have written a function for an FIR filter which has an array as input and another array as output. This is my FIR filter function:
float * filter(float PATIENTSIGNAL[],float FILTERCOEF[])
I can use it without any problem, as shown below:
float *FILTEROUT;
float FIROUT[8000];
FILTEROUT = filter(PATIENTSIGNAL, FILTERCOEF);
for (k = 0; k <= 1000; k++){
    FIROUT[k] = 10 + FILTEROUT[k];
}
As you see, I added the number 10 to each element of my output array to check that I can use this array for further computation.
But here is my problem when I want to use a 2D array. This is my function, which returns a 2D array correctly:
float(*Windowing(float SIGNAL[], int WINDOWSIZE));
I have used the Windowing function with this code in the appropriate way:
patientwindow = Windowing(FILTEROUT, WINDOWSIZE);
and all the numbers in the "patientwindow" array are correct, but then I want to perform some simple operation like a summation, as here:
float evaluate[WINDOWSIZE][OVERLAP/4];
for (j = 0; j <= NUMBEROFWINDOWS; j++){
    for (i = 0; i < WINDOWSIZE; i++){
        evaluate[i][j] = 2 + (patientwindow[i][j]);
    }
}
all elements of the "evaluate" array are 0.
Would you please help me?
use
float** patientwindow;
A float* can be treated as a 1D array, whereas a float** can be used as an array of row pointers, i.e. a matrix (for why that is the case, see this answer: https://stackoverflow.com/a/17953693/4996826 ).
If you want to keep using float*, then use the following snippet of code:
evaluate[i][j] = 2 + (patientwindow[(i * NUMBEROFWINDOWS) + j]);
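For the float* route, a small sketch of the caller side, assuming Windowing returns one contiguous row-major block of WINDOWSIZE * NUMBEROFWINDOWS floats (that layout is an assumption, not something stated in the question):
float *patientwindow = Windowing(FILTEROUT, WINDOWSIZE);   // one contiguous block
for (j = 0; j < NUMBEROFWINDOWS; j++){
    for (i = 0; i < WINDOWSIZE; i++){
        evaluate[i][j] = 2 + patientwindow[(i * NUMBEROFWINDOWS) + j];
    }
}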
There is some pseudocode that I want to implement in C, but I am in doubt about how to implement part of it. The pseudocode is:
for every pair of states qi and qj, i < j, do
    D[i,j] := 0
    S[i,j] := notzero
end for
The i and j in qi and qj are subscripts.
How do I represent D[i,j] or S[i,j]? Which data structure should I use so that it's simple and fast?
You can use something like
int length = 10;
int i = 0, j = 0;
int res1[10][10] = {0}; // size is based on the "length" value
int res2[10][10] = {0}; // size is based on the "length" value
and then
for (i = 0; i < length; i++)
{
    for (j = 0; j < length; j++)
    {
        res1[i][j] = 0;
        res2[i][j] = 1; // notzero
    }
}
Here D[i,j] and S[i,j] are represented by res1[10][10] and res2[10][10], respectively. These are called two-dimensional arrays.
I guess a struct will be your friend here, depending on what you actually want to work with.
A struct would be fine if, say, a pair of states forms some kind of entity, as sketched below.
Otherwise you could use a two-dimensional array.
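For example, a minimal sketch of the struct option, assuming the D and S values are integers (all names here are just illustrative):
typedef struct {
    int D;   // D[i,j]
    int S;   // S[i,j], e.g. 0 for zero, 1 for notzero
} pair_state;

pair_state table[10][10];   // table[i][j] holds both values for the pair (qi, qj)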
After the accepted answer:
Depending on coding goals and platform, to get "simple and fast", using a pointer to pointer to a number may be faster than a 2-D array in C.
// 2-D array
double x[MAX_ROW][MAX_COL];
// Code computes the address in `x`, often involving an i*MAX_COL multiplication if not in a loop.
// Slower when multiplication is expensive and random array access occurs.
x[i][j] = f();
// pointer to pointer of double
double **y = calloc(MAX_ROW, sizeof *y);
for (i=0; i<MAX_ROW; i++) y[i] = calloc(MAX_COL, sizeof *(y[i]));
// Code computes the address in `y` by a lookup of y[i]
y[i][j] = f();
Flexibility
The first data type makes print(x) easy when the array size is fixed, but it becomes challenging otherwise.
The 2nd data type makes print(y, rows, columns) easy when the array size is variable, and of course it works well with a fixed size too.
The 2nd data type also allows row swapping simply by swapping pointers, as in the sketch below.
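For example, a quick sketch of that row swap (r1 and r2 are just example row indices):
double *tmp = y[r1];   // swap rows r1 and r2 by exchanging the row pointers,
y[r1] = y[r2];         // without copying any of the MAX_COL elements
y[r2] = tmp;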
So if the code is using a fixed array size, use double x[MAX_ROW][MAX_COL]; otherwise I recommend double **y. YMMV
I have the following code:
#define FIRST_COUNT 100
#define X_COUNT 250
#define Y_COUNT 310
#define Z_COUNT 40
struct s_tsp {
short abc[FIRST_COUNT][X_COUNT][Y_COUNT][Z_COUNT];
};
struct s_tsp xyz;
I need to run through the data like this:
for (int i = 0; i < FIRST_COUNT; ++i)
    for (int j = 0; j < X_COUNT; ++j)
        for (int k = 0; k < Y_COUNT; ++k)
            for (int n = 0; n < Z_COUNT; ++n)
                doSomething(xyz, i, j, k, n);
I've tried to think of a more elegant, less brain-dead approach. (I know that this sort of multidimensional array is inefficient in terms of CPU usage, but that is irrelevant in this case.) Is there a better approach to the way I've structured things here?
If you need a 4D array, then that's what you need. It's possible to 'flatten' it into a single-dimensional malloc()ed 'array'; however, that is not quite as clean:
abc = malloc(sizeof(short)*FIRST_COUNT*X_COUNT*Y_COUNT*Z_COUNT);
Accesses are also more difficult:
*(abc + X_COUNT*Y_COUNT*Z_COUNT*i + Y_COUNT*Z_COUNT*j + Z_COUNT*k + n)
So that's obviously a bit of a pain.
But you do have the advantage that if you need to simply iterate over every single element, you can do:
for (int i = 0; i < FIRST_COUNT*X_COUNT*Y_COUNT*Z_COUNT; i++) {
    doWhateverWith(*(abc + i));
}
Clearly this method is terribly ugly for most uses, and is a bit neater for one type of access. It's also a bit more memory-conservative and only requires one pointer-dereference rather than 4.
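A small sketch of wrapping that index arithmetic in a macro, reusing the question's dimension names (the macro name and the example indices are made up):
#include <stdlib.h>
#include <stddef.h>

#define IDX4(i, j, k, n) \
    ((((size_t)(i) * X_COUNT + (j)) * Y_COUNT + (k)) * Z_COUNT + (n))

short *abc = malloc(sizeof(short) * FIRST_COUNT * X_COUNT * Y_COUNT * Z_COUNT);
abc[IDX4(2, 3, 4, 5)] = 42;   // same element as xyz.abc[2][3][4][5] in the struct version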
NOTE: The intention of the examples used in this post is just to explain the concepts. So the examples may be incomplete, may lack error handling, etc.
When it comes to the usage of multi-dimensional arrays in C, the following are the two possible approaches.
Flattening of Arrays
In C, arrays are implemented as a contiguous memory block. This information can be used to manipulate the values stored in the array and allows rapid access to a particular array location.
For example,
int arr[10][10];
int *ptr = (int *)arr;
ptr[11] = 10;
// this is equivalent to arr[1][1] = 10; the 2D array is
// manipulated here as a single-dimensional array.
The technique of exploiting the contiguous nature of arrays is known as flattening of arrays.
Ragged Arrays
Now, consider the following example.
char **list = malloc(3 * sizeof *list);
list[0] = "United States of America";
list[1] = "India";
list[2] = "United Kingdom";
for (int i = 0; i < 3; i++)
    printf(" %zu ", strlen(list[i]));
// prints 24 5 14
This type of implementation is known as a ragged array, and it is useful in places where strings of variable size are used. A popular method is to perform dynamic memory allocation for every dimension.
NOTE: The command-line arguments (char *argv[]) are passed as a ragged array.
Comparing flattened and ragged arrays
Now, let's consider the following code snippet, which compares flattened and ragged arrays.
/* Note: lacks error handling */
int flattened[30][20][10];
int ***ragged;
int i,j,numElements=0,numPointers=1;
ragged = (int ***) malloc(sizeof(int **) * 30);
numPointers += 30;
for (i = 0; i < 30; i++) {
    ragged[i] = (int **)malloc(sizeof(int *) * 20);
    numPointers += 20;
    for (j = 0; j < 20; j++) {
        ragged[i][j] = (int *)malloc(sizeof(int) * 10);
        numElements += 10;
    }
}
printf("Number of elements = %d\n", numElements);
printf("Number of pointers = %d\n", numPointers);
// it prints
// Number of elements = 6000
// Number of pointers = 631
From the above example, the ragged array requires 631 pointers, in other words 631 * sizeof(int *) bytes of extra memory for pointing to 6000 integers, whereas the flattened array requires only one base pointer: the name of the array is enough to refer to the contiguous block of 6000 elements.
But on the other hand, ragged arrays are flexible. In cases where the exact number of memory locations required is not known in advance, you may not have the luxury of allocating memory for the worst possible case. Likewise, in some cases the exact amount of memory required is known only at run time. In such situations ragged arrays come in handy.
Row-major and column-major ordering of arrays
C follows row-major ordering for multi-dimensional arrays. Flattening of arrays can be viewed as an effect of this aspect of C. The significance of C's row-major order is that it fits the natural way in which most accesses are written in programs. For example, let's look at traversing an N * M 2D matrix:
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++)
        printf("%d ", matrix[i][j]);
    printf("\n");
}
Each row in the matrix is accessed one by one, by varying the column index rapidly. The C array is arranged in memory in this natural order. On the contrary, consider the following example:
for (i = 0; i < M; i++) {
    for (j = 0; j < N; j++)
        printf("%d ", matrix[j][i]);
    printf("\n");
}
This varies the row index (the left index) most frequently, jumping between rows on every access. Because of this, there is a big difference in efficiency between these two code snippets. Yes, the first one is more efficient than the second one!
The first one accesses the array in C's natural (row-major) order, hence it is faster, whereas the second one spends more time jumping around in memory. The difference in performance widens as the number of dimensions and the element size increase.
So when working with multi-dimensional arrays in C, it's good to keep the above details in mind!