I have a matrix multiplication problem. We have an image matrix of variable size. I need to compute C = A*B for every possible nxn window A. C is then added to the output image, as shown in the figure. The center point of A lies in the lower triangle, and B is placed diagonally symmetric to A (mirrored across the main diagonal). The A windows can overlap, so the B windows can overlap too. The figures below show this in more detail:
The blue X points represent all possible midpoints of A. The algorithm should just multiply A by its diagonally mirrored counterpart, B. I have done this with a lot of for loops, and I need to reduce the number of loops I use. Could you help me, please?
What kind of algorithm can be used for this problem? Some points still confuse me.
Could you please help me with your algorithm expertise, or direct me to an expert?
The original question is below:
Thanks.
Update:
#define SIZE_ARRAY 20
#define SIZE_WINDOW 5
#define WINDOW_OFFSET 2
#define INDEX_OFFSET 1
#define START_OFFSET_COLUMN 2
#define START_OFFSET_ROW 3
#define END_OFFSET_COLUMN 3
#define END_OFFSET_ROW 2
#define GET_LOWER_DIAGONAL_INDEX_MIN_ROW (START_OFFSET_ROW)
#define GET_LOWER_DIAGONAL_INDEX_MAX_ROW (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_ROW)
#define GET_LOWER_DIAGONAL_INDEX_MIN_COL (START_OFFSET_COLUMN)
#define GET_LOWER_DIAGONAL_INDEX_MAX_COL (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_COLUMN)
uint32_t lowerDiagonalIndexMinRow = GET_LOWER_DIAGONAL_INDEX_MIN_ROW;
uint32_t lowerDiagonalIndexMaxRow = GET_LOWER_DIAGONAL_INDEX_MAX_ROW;
uint32_t lowerDiagonalIndexMinCol = GET_LOWER_DIAGONAL_INDEX_MIN_COL;
uint32_t lowerDiagonalIndexMaxCol = GET_LOWER_DIAGONAL_INDEX_MAX_COL;
void parallelMultiplication_Stable_Master()
{
startTimeStamp = omp_get_wtime();
#pragma omp parallel for num_threads(8) private(outerIterRow, outerIterCol,rA,cA,rB,cB) shared(inputImage, outputImage)
for(outerIterRow = lowerDiagonalIndexMinRow; outerIterRow < lowerDiagonalIndexMaxRow; outerIterRow++)
{
for(outerIterCol = lowerDiagonalIndexMinCol; outerIterCol < lowerDiagonalIndexMaxCol; outerIterCol++)
{
if(outerIterCol + 1 < outerIterRow)
{
rA = outerIterRow - WINDOW_OFFSET;
cA = outerIterCol - WINDOW_OFFSET;
rB = outerIterCol - WINDOW_OFFSET;
cB = outerIterRow - WINDOW_OFFSET;
for(i= outerIterRow - WINDOW_OFFSET; i <= outerIterRow + WINDOW_OFFSET; i++)
{
for(j= outerIterCol - WINDOW_OFFSET; j <= outerIterCol + WINDOW_OFFSET; j++)
{
for(k=0; k < SIZE_WINDOW; k++)
{
#pragma omp critical
outputImage[i][j] += inputImage[rA][cA+k] * inputImage[rB+k][cB];
}
cB++;
rA++;
}
rB++;
cA++;
printf("Thread Number - %d",omp_get_thread_num());
}
}
}
}
stopTimeStamp = omp_get_wtime();
printArray(outputImage,"Output Image");
printConsoleNotification(100, startTimeStamp, stopTimeStamp);
}
I get a segmentation fault if I set the thread count to more than 1. What is the trick?
I'm not providing a solution, just some thoughts that may help the OP explore a possible approach.
You can evaluate each element of the resulting C matrix directly, from the values of the original matrix in a way similar to a convolution operation.
Consider the following image (sorry if it's confusing):
Instead of computing each matrix product for every A submatrix, you can evaluate the value of each $C_{i,j}$ from the values in the shaded areas.
Note that $C_{i,j}$ depends only on a small subset of row i, and that the elements of the upper right triangular submatrix (where the B submatrices are picked) could be copied and maybe transposed into a more cache-friendly layout.
Alternatively, it may be worth exploring an approach where, for every possible $B_{i,j}$, all the corresponding elements of C are evaluated.
Edit
Note that you can actually save a lot of calculations (and maybe cache misses) by grouping the terms, see e.g. the first two elements of row i in A:
More formally
$$
\begin{aligned}
C_{i,j} ={} & A_{i,j-4}\,\bigl(B_{j-4,i} + B_{j-4,i+1} + B_{j-4,i+2} + B_{j-4,i+3} + B_{j-4,i+4}\bigr) \\
{}+{} & A_{i,j-3}\,\bigl(B_{j-3,i-1} + B_{j-3,i+4} + 2\,(B_{j-3,i} + B_{j-3,i+1} + B_{j-3,i+2} + B_{j-3,i+3})\bigr) \\
{}+{} & A_{i,j-2}\,\bigl(B_{j-2,i-2} + B_{j-2,i+4} + 2\,(B_{j-2,i-1} + B_{j-2,i+3}) + 3\,(B_{j-2,i} + B_{j-2,i+1} + B_{j-2,i+2})\bigr) \\
{}+{} & A_{i,j-1}\,\bigl(B_{j-1,i-3} + B_{j-1,i+4} + 2\,(B_{j-1,i-2} + B_{j-1,i+3}) + 3\,(B_{j-1,i-1} + B_{j-1,i+2}) + 4\,(B_{j-1,i} + B_{j-1,i+1})\bigr) \\
{}+{} & A_{i,j}\,\bigl(B_{j,i-4} + B_{j,i+4} + 2\,(B_{j,i-3} + B_{j,i+3}) + 3\,(B_{j,i-2} + B_{j,i+2}) + 4\,(B_{j,i-1} + B_{j,i+1}) + 5\,B_{j,i}\bigr) \\
{}+{} & A_{i,j+1}\,\bigl(B_{j+1,i-4} + B_{j+1,i+3} + 2\,(B_{j+1,i-3} + B_{j+1,i+2}) + 3\,(B_{j+1,i-2} + B_{j+1,i+1}) + 4\,(B_{j+1,i-1} + B_{j+1,i})\bigr) \\
{}+{} & A_{i,j+2}\,\bigl(B_{j+2,i-4} + B_{j+2,i+2} + 2\,(B_{j+2,i-3} + B_{j+2,i+1}) + 3\,(B_{j+2,i-2} + B_{j+2,i-1} + B_{j+2,i})\bigr) \\
{}+{} & A_{i,j+3}\,\bigl(B_{j+3,i-4} + B_{j+3,i+1} + 2\,(B_{j+3,i-3} + B_{j+3,i-2} + B_{j+3,i-1} + B_{j+3,i})\bigr) \\
{}+{} & A_{i,j+4}\,\bigl(B_{j+4,i-4} + B_{j+4,i-3} + B_{j+4,i-2} + B_{j+4,i-1} + B_{j+4,i}\bigr)
\end{aligned}
$$
If my estimate is correct, this requires roughly 60 additions and 25 (possibly fused) multiplications, compared to 125 operations of the form $C_{i,j} \mathrel{+}= A_{i,k} \cdot B_{k,i}$ scattered all over the place.
I think that cache-locality may have a bigger impact on performance than the mere reduction of operations.
We could also precompute all the values
$S_{j,i} = B_{j,i} + B_{j,i+1} + B_{j,i+2} + B_{j,i+3} + B_{j,i+4}$
Then the previous formulas become
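$$
\begin{aligned}
C_{i,j} ={} & A_{i,j-4}\,S_{j-4,i} \\
{}+{} & A_{i,j-3}\,\bigl(S_{j-3,i-1} + S_{j-3,i}\bigr) \\
{}+{} & A_{i,j-2}\,\bigl(S_{j-2,i-2} + S_{j-2,i-1} + S_{j-2,i}\bigr) \\
{}+{} & A_{i,j-1}\,\bigl(S_{j-1,i-3} + S_{j-1,i-2} + S_{j-1,i-1} + S_{j-1,i}\bigr) \\
{}+{} & A_{i,j}\,\bigl(S_{j,i-4} + S_{j,i-3} + S_{j,i-2} + S_{j,i-1} + S_{j,i}\bigr) \\
{}+{} & A_{i,j+1}\,\bigl(S_{j+1,i-4} + S_{j+1,i-3} + S_{j+1,i-2} + S_{j+1,i-1}\bigr) \\
{}+{} & A_{i,j+2}\,\bigl(S_{j+2,i-4} + S_{j+2,i-3} + S_{j+2,i-2}\bigr) \\
{}+{} & A_{i,j+3}\,\bigl(S_{j+3,i-4} + S_{j+3,i-3}\bigr) \\
{}+{} & A_{i,j+4}\,S_{j+4,i-4}
\end{aligned}
$$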
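The five-term sums S used above are sliding-window sums along one row of the matrix, so each of them can be obtained from its left neighbour with one addition and one subtraction. The following is only a minimal sketch of that idea; the helper name window_sums and its signature are assumptions made for illustration and are not part of the OP's code. Keep in mind that a running sum in floating point accumulates rounding error, which may or may not matter for the image data at hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: out[i] = row[i] + ... + row[i + window - 1]
 * for every i where the window fits inside the row.
 * Returns the number of sums written (0 if the row is too short). */
size_t window_sums(const float *row, size_t n, size_t window, float *out)
{
    assert(row != NULL && out != NULL && window > 0);
    if (n < window)
        return 0;

    /* First window: plain summation. */
    float acc = 0.0f;
    for (size_t k = 0; k < window; ++k)
        acc += row[k];
    out[0] = acc;

    /* Every further window: add the entering element, drop the leaving one. */
    for (size_t i = 1; i + window <= n; ++i) {
        acc += row[i + window - 1] - row[i - 1];
        out[i] = acc;
    }
    return n - window + 1;
}
```

With window set to 5 and row pointing at one row of the upper triangular part, out[i] would hold the sum of the five elements starting at column i.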
Here is my take. I wrote this before OP showed any code, so I'm not following any of their code patterns.
I start with a suitable image struct, just for my own sanity.
struct Image
{
float* values;
int rows, cols;
};
struct Image image_allocate(int rows, int cols)
{
struct Image rtrn;
rtrn.rows = rows;
rtrn.cols = cols;
rtrn.values = malloc(sizeof(float) * rows * cols);
return rtrn;
}
void image_fill(struct Image* img)
{
ptrdiff_t row, col;
for(row = 0; row < img->rows; ++row)
for(col = 0; col < img->cols; ++col)
img->values[row * img->cols + col] = rand() * (1.f / RAND_MAX);
}
void image_print(const struct Image* img)
{
ptrdiff_t row, col;
for(row = 0; row < img->rows; ++row) {
for(col = 0; col < img->cols; ++col)
printf("%.3f ", img->values[row * img->cols + col]);
putchar('\n');
}
putchar('\n');
}
A 5x5 matrix multiplication is too small to reasonably dispatch to BLAS. So I write a simple version myself that can be loop-unrolled and / or inlined. This routine could use a couple of micro-optimizations but let's keep it simple for now.
/** out += left * right for 5x5 sub-matrices */
static void mat_mul_5x5(
float* restrict out, const float* left, const float* right, int cols)
{
ptrdiff_t row, col, inner;
float sum;
for(row = 0; row < 5; ++row) {
for(col = 0; col < 5; ++col) {
sum = out[row * cols + col];
for(inner = 0; inner < 5; ++inner)
sum += left[row * cols + inner] * right[inner * cols + col];
out[row * cols + col] = sum;
}
}
}
Now for the single-threaded implementation of the main algorithm. Again, nothing fancy. We just iterate over the lower triangular matrix, excluding the diagonal. I keep track of the top-left corner instead of the center point, which makes index computation a bit simpler.
void compute_ltr(struct Image* restrict out, const struct Image* in)
{
ptrdiff_t top, left, end;
/* if image is not square, find square subset */
end = out->rows < out->cols ? out->rows : out->cols;
assert(in->rows == out->rows && in->cols == out->cols);
memset(out->values, 0, sizeof(float) * out->rows * out->cols);
for(top = 1; top <= end - 5; ++top)
for(left = 0; left < top; ++left)
mat_mul_5x5(out->values + top * out->cols + left,
in->values + top * in->cols + left,
in->values + left * in->cols + top,
in->cols);
}
The parallelization is a bit tricky because we have to make sure the threads don't overlap in their output matrices. A critical section, atomics or similar stuff would cost too much performance.
A simpler solution is a strided approach: If we always keep the threads 5 rows apart, they cannot interfere. So we simply compute every fifth row, synchronize all threads, then compute the next set of rows, five apart, and so on.
void compute_ltr_parallel(struct Image* restrict out, const struct Image* in)
{
/* if image is not square, find square subset */
const ptrdiff_t end = out->rows < out->cols ? out->rows : out->cols;
assert(in->rows == out->rows && in->cols == out->cols);
memset(out->values, 0, sizeof(float) * out->rows * out->cols);
/*
* Keep the parallel section open for multiple loops to reduce
* overhead
*/
# pragma omp parallel
{
ptrdiff_t top, left, offset;
for(offset = 0; offset < 5; ++offset) {
/* Use dynamic scheduling because the work per row varies */
# pragma omp for schedule(dynamic)
for(top = 1 + offset; top <= end - 5; top += 5)
for(left = 0; left < top; ++left)
mat_mul_5x5(out->values + top * out->cols + left,
in->values + top * in->cols + left,
in->values + left * in->cols + top,
in->cols);
}
}
}
My benchmark with 1000 iterations on a 1000x1000 image shows 7 seconds for the serial version and 1.2 seconds for the parallelized version on my 8 core / 16 thread CPU.
EDIT for completeness: Here are the includes and the main for benchmarking.
#include <assert.h>
#include <stddef.h> /* using ptrdiff_t */
#include <stdlib.h> /* using malloc */
#include <stdio.h>  /* using printf */
#include <string.h> /* using memset */
/* Insert code from above here */
int main()
{
int rows = 1000, cols = 1000, rep = 1000;
struct Image in, out;
in = image_allocate(rows, cols);
out = image_allocate(rows, cols);
image_fill(&in);
# if 1
do
compute_ltr_parallel(&out, &in);
while(--rep);
# else
do
compute_ltr(&out, &in);
while(--rep);
# endif
}
Compile with gcc -O3 -fopenmp.
Regarding the comment, and also your way of using OpenMP: Don't overcomplicate things with unnecessary directives. OpenMP can figure out how many threads are available itself. And private variables can easily be declared within the parallel section (usually).
If you want a specific number of threads, just call with the appropriate environment variable, e.g. on Linux call OMP_NUM_THREADS=8 ./executable
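To illustrate the point about private variables, here is a minimal sketch, not taken from the OP's code. The function row_sums and its parameters are hypothetical. Every variable that must be private is declared inside the parallel loop, so no private(...) clause is needed and no loop counter can accidentally be shared between threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums[row] = sum of one row of a row-major
 * rows x cols matrix. */
void row_sums(const float *values, int rows, int cols, float *sums)
{
    assert(values != NULL && sums != NULL && rows >= 0 && cols >= 0);

    /* row, col and acc are declared inside the loop, so each thread
     * automatically works on its own copies. */
#pragma omp parallel for
    for (int row = 0; row < rows; ++row) {
        float acc = 0.0f;
        for (int col = 0; col < cols; ++col)
            acc += values[(ptrdiff_t)row * cols + col];
        sums[row] = acc; /* each iteration writes a distinct element */
    }
}
```

Compiled without -fopenmp, the pragma is ignored and the same code runs serially, which makes it easy to compare results.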
I want to transform a 2D array into a 1D array. Here is the most important part of my code.
int mask[3][3] = {{0, -1, 0}, {-1, 4, -1}, {0, -1, 0}};
for (i = 1; i < rows - 1; i++) {
for (j = 1; j < cols - 1;j++) {
int s;
s = mask[0][0] * image[i-1][j-1]
+ mask[0][1] * image[i-1][j]
+ mask[0][2] * image[i-1][j+1]
+ mask[1][0] * image[i][j-1]
+ mask[1][1] * image[i][j]
+ mask[1][2] * image[i][j+1]
+ mask[2][0] * image[i+1][j-1]
+ mask[2][1] * image[i+1][j]
+ mask[2][2] * image[i+1][j+1];
}
}
My 1D version:
for (k = rows + 1; k < (cols * rows) / 2; k++) {
int s;
s = 0 * image_in[k-rows-1]
- 1 * image_in[k-rows]
+ 0 * image_in[k-rows+1]
- 1 * image_in[k-1]
+ 4 * image_in[k]
- 1 * image_in[k+1]
+ 0 * image_in[k+rows-1]
- 1 * image_in[k+rows]
+ 0 * image_in[k+rows+1];
}
That should be the same, but I don't know if I am doing the transformation correctly. Can someone tell me if that is OK?
First of all: why do you want to get away from the 2D array? Do you think that 2D array dimensions must be constant? In that case I have good news for you: you are wrong. This code should work perfectly:
int width = ..., height = ...;
//Create a 2D array on the heap with dynamic sizes:
int (*image_in)[width] = malloc(height * sizeof(*image_in));
//initialize the array
for(int i = 0; i < height; i++) {
for(int j = 0; j < width; j++) {
image_in[i][j] = ...;
}
}
You see, apart from the somewhat cryptic declaration of the array pointer, the indexing remains exactly the same as with an automatic 2D array on the stack.
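As a small self-contained sketch of the same idea: the helper fill_and_sum below is hypothetical (it is not from the question) and assumes a compiler with variable length array support (C99; optional since C11). It allocates the image with runtime dimensions, indexes it as image[i][j], and then walks the same memory as a flat buffer to show that the rows are stored back to back.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocates a height x width image on the heap,
 * stores i * width + j in cell (i, j) and returns the sum of all cells,
 * or -1 if the allocation fails. */
long fill_and_sum(int height, int width)
{
    assert(height > 0 && width > 0);

    /* Pointer to rows of `width` ints; sizes are only known at run time. */
    int (*image)[width] = malloc(height * sizeof(*image));
    if (image == NULL)
        return -1;

    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            image[i][j] = i * width + j;

    /* View the same allocation as a flat buffer: cell (i, j) sits at
     * flat index i * width + j. */
    const int *flat = (const int *)image;
    long sum = 0;
    for (int k = 0; k < height * width; k++) {
        assert(flat[k] == k);
        sum += flat[k];
    }

    free(image);
    return sum;
}
```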
Within your given loop, you want to address the cells relative to the center cell. This is easiest done by actually addressing relative to that cell:
for (i = 1; i < rows - 1; i++) {
for (j = 1; j < cols - 1;j++) {
int* center = &image_in[i][j];
int s = mask[0][0] * center[-width - 1]
+ mask[0][1] * center[-width]
+ mask[0][2] * center[-width + 1]
+ mask[1][0] * center[-1]
+ mask[1][1] * center[0]
+ mask[1][2] * center[1]
+ mask[2][0] * center[width - 1]
+ mask[2][1] * center[width]
+ mask[2][2] * center[width + 1];
}
}
This works because the 2D array has the same memory layout as your 1D array (this is guaranteed by the C standard).
The edge handling in a 1D loop is always wrong: It will execute the body of the loop for the first and last cells of each line. This cannot be fixed without introducing some if() statements into the loop which will significantly slow things down.
This may be ignored if the consequences are proven to be irrelevant (you still need to exclude the first and last lines plus a cell). However, the edge handling is much easier if you stick to a 2D array.
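If the image really has to live in a flat buffer, one way to keep the edge handling simple is to keep the two nested loops and only flatten the address computation. The sketch below assumes a row-major buffer with cols elements per row and hard-codes the mask from the question; the function name laplace_flat and the separate output buffer are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: applies the mask {{0,-1,0},{-1,4,-1},{0,-1,0}}
 * to the interior of a row-major rows x cols image stored in a flat
 * buffer. Border cells of `out` are left untouched. */
void laplace_flat(const int *in, int *out, int rows, int cols)
{
    assert(in != NULL && out != NULL && rows >= 0 && cols >= 0);

    for (int i = 1; i < rows - 1; ++i) {
        for (int j = 1; j < cols - 1; ++j) {
            /* Address of cell (i, j); neighbours are at +/-1 and +/-cols. */
            const int *center = in + (ptrdiff_t)i * cols + j;
            out[(ptrdiff_t)i * cols + j] = 4 * center[0]
                                         - center[-cols]
                                         - center[-1]
                                         - center[1]
                                         - center[cols];
        }
    }
}
```

Because the loops still run over i and j, the first and last cell of each line are skipped without any if() inside the loop body.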
If the first part of your code gives you the expected result, then you can do the same with a 1D array this way:
for (i = 1; i < rows - 1; i++) {
for (j = 1; j < cols - 1;j++) {
int s;
s = mask[0][0] * image_in[i-1+rows*(j-1)]
+ mask[0][1] * image_in[i-1+rows*j]
+ mask[0][2] * image_in[i-1+rows*(j+1)]
+ mask[1][0] * image_in[i+rows*(j-1)]
+ mask[1][1] * image_in[i+rows*j]
+ mask[1][2] * image_in[i+rows*(j+1)]
+ mask[2][0] * image_in[i+1+rows*(j-1)]
+ mask[2][1] * image_in[i+1+rows*j]
+ mask[2][2] * image_in[i+1+rows*(j+1)];
}
}
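This way, if you are comfortable with 2D arrays, you can apply the same logic to a 1D array without errors, as if it were 2D. Note that this indexing places element (i, j) at i + rows*j, i.e. column by column; if image_in is a flattened C 2D array (which is stored row by row), element (i, j) is at i*cols + j instead.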
I have implemented the following quicksort algorithm to sort couples of points (in 3D space).
Every couple defines a line: the purpose is to place all lines whose distance is less than or equal to powR next to each other inside the array that contains all the couples.
The array containing the coordinates is one-dimensional: every 6 elements define a couple and every 3 a point.
When I run the algorithm on an array of 3099642 elements, it stops after processing 2799222 elements, while trying to enter the next iteration. If I start the algorithm from element 2799228, it stops at 3066300.
I can't figure out where the problem is. Any suggestions?
void QuickSort(float *array, int from, int to, float powR){
float pivot[6];
float temp[6];
float x1;
float y1;
float z1;
float x2;
float y2;
float z2;
float d12;
int i;
int j;
if(from >= to)
return;
pivot[0] = array[from+0];
pivot[1] = array[from+1];
pivot[2] = array[from+2];
pivot[3] = array[from+3];
pivot[4] = array[from+4];
pivot[5] = array[from+5];
i = from;
for(j = from+6; j <= to; j += 6){
x1 = pivot[0] - array[j+0];
y1 = pivot[1] - array[j+1];
z1 = pivot[2] - array[j+2];
x2 = pivot[3] - array[j+3];
y2 = pivot[4] - array[j+4];
z2 = pivot[5] - array[j+5];
d12 = (x1*x1 + y1*y1 + z1*z1) + (x2*x2 + y2*y2 + z2*z2);
/*the sorting condition i am using is the regular euclidean norm*/
if (d12 <= powR){
i += 6;
temp[0] = array[i+0];
temp[1] = array[i+1];
temp[2] = array[i+2];
temp[3] = array[i+3];
temp[4] = array[i+4];
temp[5] = array[i+5];
array[i+0] = array[j+0];
array[i+1] = array[j+1];
array[i+2] = array[j+2];
array[i+3] = array[j+3];
array[i+4] = array[j+4];
array[i+5] = array[j+5];
array[j+0] = temp[0];
array[j+1] = temp[1];
array[j+2] = temp[2];
array[j+3] = temp[3];
array[j+4] = temp[4];
array[j+5] = temp[5];
}
}
QuickSort(array, i+6, to, powR);
}
The function is called in this way:
float *LORs = (float *) calloc((unsigned)tot, sizeof(float));
LORs is filled with data read from a file, and that part works fine.
QuickSort(LORs, 0, 6000, powR);
free(LORs);
for(j = from+6; j <= to; j += 6) {
array[i+0] = array[j+0];
array[i+1] = array[j+1];
array[i+2] = array[j+2];
array[i+3] = array[j+3];
array[i+4] = array[j+4];
array[i+5] = array[j+5];
}
Your j + constant_number goes out of bounds when you approach the end. That's why it crashes at the end. Note that constant_number is non-negative.
When j gets close to the end of your array (how close depends on the increment step, i.e. +6), the index will certainly go out of bounds.
Take the easy case: the maximum value j can reach, which is the size of your array.
Let's call it N.
Then, when j is equal to N, you are going to enter the loop.
There you access array[j + 0], which is array[N + 0], i.e. array[N].
I am pretty sure you know that indexing in C (a tag you should add to your questions in the future) runs from 0 to N - 1. The same goes for array[j + 1] up to array[j + 5].
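A minimal sketch of a loop bound that cannot run past the end, assuming the array holds n floats in groups of six: the condition j + 6 <= n guarantees that array[j + 5] is still a valid element. The helper count_close_couples and the macro COUPLE_LEN are hypothetical names; the function only counts the couples within powR of a pivot and does not reorder anything.

```c
#include <assert.h>
#include <stddef.h>

#define COUPLE_LEN 6 /* floats per couple: two 3D points */

/* Hypothetical helper: counts the couples whose squared distance to
 * `pivot` (COUPLE_LEN floats) is at most powR. `n` is the number of
 * floats in `array`; a trailing partial couple is ignored. */
size_t count_close_couples(const float *array, size_t n,
                           const float *pivot, float powR)
{
    assert(array != NULL && pivot != NULL);

    size_t count = 0;
    /* j + COUPLE_LEN <= n keeps array[j + COUPLE_LEN - 1] in bounds. */
    for (size_t j = 0; j + COUPLE_LEN <= n; j += COUPLE_LEN) {
        float d = 0.0f;
        for (size_t k = 0; k < COUPLE_LEN; ++k) {
            float diff = pivot[k] - array[j + k];
            d += diff * diff;
        }
        if (d <= powR)
            ++count;
    }
    return count;
}
```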
EDIT: As the comments suggest, this is not a (quick)sort!
I have implemented quickSort here, if you want to get an idea of it. I suggest you start from the explanations and not from the code!
I have an odd problem. Following (re: copying) from here, I've been trying to implement the Cooley–Tukey FFT algorithm for arrays with a power-of-2 size, but the answers returned from this implementation are the conjugate of the true answers.
int fft_pow2(int dir,int m,float complex *a)
{
long nn,i,i1,j,k,i2,l,l1,l2;
float c1,c2,tx,ty,t1,t2,u1,u2,z;
float complex t;
/* Calculate the number of points */
nn = 1;
for (i=0;i<m;i++)
nn *= 2;
/* Do the bit reversal */
i2 = nn >> 1;
j = 0;
for (i=0;i<nn-1;i++) {
if (i < j) {
t = a[i];
a[i] = a[j];
a[j] = t;
}
k = i2;
while (k <= j) {
j -= k;
k >>= 1;
}
j += k;
}
/* Compute the FFT */
c1 = -1.0;
c2 = 0.0;
l2 = 1;
for (l=0;l<m;l++) {
l1 = l2;
l2 <<= 1;
u1 = 1.0;
u2 = 0.0;
for (j=0;j<l1;j++) {
for (i=j;i<nn;i+=l2) {
i1 = i + l1;
t = u1 * crealf(a[i1]) - u2 * cimagf(a[i1])
+ I * (u1 * cimagf(a[i1]) + u2 * crealf(a[i1]));
a[i1] = a[i] - t;
a[i] += t;
}
z = u1 * c1 - u2 * c2;
u2 = u1 * c2 + u2 * c1;
u1 = z;
}
c2 = sqrt((1.0 - c1) / 2.0);
if (dir == 1)
c2 = -c2;
c1 = sqrt((1.0 + c1) / 2.0);
}
/* Scaling for forward transform */
if (dir == 1) {
for (i=0;i<nn;i++) {
a[i] /= (float)nn;
}
}
return 1;
}
int main(int argc, char **argv) {
    int n = 4; /* number of points; must be a power of 2 */
    float complex arr[4] = { 1.0, 2.0, 3.0, 4.0 };
    fft_pow2(0, log2(n), arr);
    for (int i = 0; i < n; i++) {
        printf("%f %f\n", crealf(arr[i]), cimagf(arr[i]));
    }
    return 0;
}
The results:
10.000000 0.000000
-2.000000 -2.000000
-2.000000 0.000000
-2.000000 2.000000
whereas the true answer is the conjugate.
Any ideas?
The FFT is often defined with $H_k = \sum_{j=0}^{N-1} e^{-2\pi i\,jk/N}\, h_j$. Note the minus sign in the exponent. The FFT can be defined with a plus sign instead of the minus sign. In large part, the definitions are equivalent, because +i and -i are completely symmetric.
The code you show is written for the definition with the negative sign, and it is also written so that the first parameter, dir, is 1 for a forward transform and something else for a reverse transform. We can determine the intended direction because of the comment about scaling for the forward transform: It scales if dir is 1.
So, where your code in main calls fft_pow2 with 0 for dir, it is requesting a reverse transform. Your code has performed a reverse transform using the FFT definition with a negative sign. The reverse of the transform with a negative sign is a transform with a positive sign. For [1, 2, 3, 4], the result is:
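$1^0 \cdot 1 + 1^1 \cdot 2 + 1^2 \cdot 3 + 1^3 \cdot 4 = 1 + 2 + 3 + 4 = 10$.
$i^0 \cdot 1 + i^1 \cdot 2 + i^2 \cdot 3 + i^3 \cdot 4 = 1 + 2i - 3 - 4i = -2 - 2i$.
$(-1)^0 \cdot 1 + (-1)^1 \cdot 2 + (-1)^2 \cdot 3 + (-1)^3 \cdot 4 = 1 - 2 + 3 - 4 = -2$.
$(-i)^0 \cdot 1 + (-i)^1 \cdot 2 + (-i)^2 \cdot 3 + (-i)^3 \cdot 4 = 1 - 2i - 3 + 4i = -2 + 2i$.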
And that is the result you obtained.