I am working on the CS50 filter challenge, specifically the blur part.
It compiles fine, but the image I create is rotated 90 degrees and the bottom of the image is corrupted.
Do you know what the problem is?
This is my code:
void blur(int height, int width, RGBTRIPLE image[height][width])
{
    float sum_blue;
    float sum_green;
    float sum_red;
    float counter;
    RGBTRIPLE temp[height][width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            sum_blue = 0;
            sum_green = 0;
            sum_red = 0;
            counter = 0;
            for (int h = -1; h < 2; h++)
            {
                if (j + h < 0 || j + h > width - 1)
                {
                    continue;
                }
                for (int v = -1; v < 2; v++)
                {
                    if (i + v < 0 || i + v > height - 1)
                    {
                        continue;
                    }
                    sum_red += image[j + h][i + v].rgbtRed;
                    sum_blue += image[j + h][i + v].rgbtBlue;
                    sum_green += image[j + h][i + v].rgbtGreen;
                    counter++;
                }
            }
            // summarize all values and save the pixels in a temporary image to transfer later
            temp[i][j].rgbtRed = round(sum_red / counter);
            temp[i][j].rgbtBlue = round(sum_blue / counter);
            temp[i][j].rgbtGreen = round(sum_green / counter);
        }
    }
    // transfer temporary to real image
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            image[i][j].rgbtRed = temp[i][j].rgbtRed;
            image[i][j].rgbtBlue = temp[i][j].rgbtBlue;
            image[i][j].rgbtGreen = temp[i][j].rgbtGreen;
        }
    }
}
You appear to have swapped width and height here:
sum_red += image[j + h][i + v].rgbtRed;
sum_blue += image[j + h][i + v].rgbtBlue;
sum_green += image[j + h][i + v].rgbtGreen;
I would expect:
sum_red += image[i + v][j + h].rgbtRed;
sum_blue += image[i + v][j + h].rgbtBlue;
sum_green += image[i + v][j + h].rgbtGreen;
since the image array is [height][width] and i is the vertical iterator and j the horizontal. Possibly only i and j are swapped, and i + h and j + v are intended.
Possibly the variables are swapped elsewhere in the algorithm. Better variable naming might help, so that the names indicate what they represent. Even x and y would be clearer, since that is the convention for Cartesian coordinates - though a comment indicating a top-left origin with y increasing downward might be advisable.
By getting the indexes in the wrong order, you have rotated and mirrored the image, and processed data from the wrong part of the image, or from outside it entirely.
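To make that concrete, here is a minimal sketch of the corrected function with the row/column roles spelled out (my own names, not the original ones; y is the row index, x the column index, with the origin at the top-left and y increasing downward; assumes CS50's bmp.h for RGBTRIPLE and math.h for round):

#include <math.h>

// Sketch only: the same algorithm as the question's code, with the
// image[row][column] index order fixed and clearer loop names.
void blur(int height, int width, RGBTRIPLE image[height][width])
{
    RGBTRIPLE temp[height][width];
    for (int y = 0; y < height; y++)     // y: row, 0 at the top
    {
        for (int x = 0; x < width; x++)  // x: column, 0 at the left
        {
            float sum_red = 0, sum_green = 0, sum_blue = 0, counter = 0;
            for (int dy = -1; dy <= 1; dy++)
            {
                if (y + dy < 0 || y + dy > height - 1)
                    continue;
                for (int dx = -1; dx <= 1; dx++)
                {
                    if (x + dx < 0 || x + dx > width - 1)
                        continue;
                    // row index first, column index second
                    sum_red += image[y + dy][x + dx].rgbtRed;
                    sum_green += image[y + dy][x + dx].rgbtGreen;
                    sum_blue += image[y + dy][x + dx].rgbtBlue;
                    counter++;
                }
            }
            temp[y][x].rgbtRed = round(sum_red / counter);
            temp[y][x].rgbtGreen = round(sum_green / counter);
            temp[y][x].rgbtBlue = round(sum_blue / counter);
        }
    }
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            image[y][x] = temp[y][x];
}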
So, I have spent 5+ hours trying to figure out what is wrong with my code. I tried debug50 with a 3x3 file I created manually in Paint, and everything seemed to work as intended: each pixel makes a 3x3 sweep around itself and disregards pixels that do not exist, such as those beyond the corners and edges. The final average values for each color were also correct. Somehow, though, check50 still flagged my output as incorrect.
After countless tweaks and much head-scratching, I decided it was probably time to turn to the community for help. Here's my code:
void blur(int height, int width, RGBTRIPLE image[height][width])
{
    for (int h = 0; h < height; h++)
    {
        for (int w = 0; w < width; w++)
        {
            int avgfordiv = 0;
            int neighvalgreen = 0;
            int neighvalblue = 0;
            int neighvalred = 0;
            for (int hh = -1; hh < 2; hh++)
            {
                for (int ww = -1; ww < 2; ww++)
                {
                    if ((h + hh) != height && (w + ww) != width && (h + hh) != -1 && (w + ww) != -1)
                    {
                        // sweep
                        avgfordiv++; // count up for division
                        neighvalgreen += image[h + hh][w + ww].rgbtGreen;
                        neighvalred += image[h + hh][w + ww].rgbtRed;
                        neighvalblue += image[h + hh][w + ww].rgbtBlue;
                    }
                }
            }
            // add values to pixels
            image[h][w].rgbtGreen = (int)(round((float)neighvalgreen / avgfordiv));
            image[h][w].rgbtBlue = (int)(round((float)neighvalblue / avgfordiv));
            image[h][w].rgbtRed = (int)(round((float)neighvalred / avgfordiv));
            // check cap
            if (image[h][w].rgbtGreen > 255)
            {
                image[h][w].rgbtGreen %= 255;
            }
            if (image[h][w].rgbtRed > 255)
            {
                image[h][w].rgbtRed %= 255;
            }
            if (image[h][w].rgbtBlue > 255)
            {
                image[h][w].rgbtBlue %= 255;
            }
        }
    }
    return;
}
Making a copy of the image and using that copy to calculate the totals for red, green, and blue seems to fix it.
RGBTRIPLE copy[height][width];
for (int h = 0; h < height; h++)
{
    for (int w = 0; w < width; w++)
    {
        copy[h][w] = image[h][w];
    }
}
Then change the sums below to read from the copy:
neighvalgreen += copy[h + hh][w + ww].rgbtGreen;
neighvalred += copy[h + hh][w + ww].rgbtRed;
neighvalblue += copy[h + hh][w + ww].rgbtBlue;
Also, you don't need to check whether the values exceed 255: you are calculating an average, and an average of values that are each at most 255 can never exceed 255 (at most 9 * 255 divided by 9).
My strategy in pseudocode is as follows:
Save the entire image as a tmp copy.
Loop through every single pixel.
For every pixel, imagine a 3x3 square with said pixel at the center. Check each pixel of the 3x3 square to see whether it exists. If it does, add it to the sum. Finally, calculate the average.
// Blur image
void blur(int height, int width, RGBTRIPLE image[height][width])
{
    // store entire image in tmp to maintain original values
    RGBTRIPLE tmp[height][width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            tmp[i][j] = image[i][j];
        }
    }
    // loop through each pixel
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            float n = 0, r = 0, b = 0, g = 0;
            // loop over the 3x3 box
            for (int ii = 0; ii < 3; ii++)
            {
                for (int jj = 0; jj < 3; jj++)
                {
                    // accumulate, ONLY if that pixel exists
                    if ((i + ii - 1 >= 0 && i + ii - 1 < height) && (j + jj - 1 >= 0 && j + jj - 1 < width))
                    {
                        b = b + (float)tmp[i + ii - 1][j + jj - 1].rgbtBlue;
                        r = r + (float)tmp[i + ii - 1][j + jj - 1].rgbtRed;
                        g = g + (float)tmp[i + ii - 1][j + jj - 1].rgbtGreen;
                        n++;
                    }
                }
            }
            // calculate the average
            image[i][j].rgbtBlue = (int)round(b / n);
            image[i][j].rgbtRed = (int)round(r / n);
            image[i][j].rgbtGreen = (int)round(g / n);
        }
    }
    return;
}
I wrote completely different code and got the same error with the same numbers.
But someone wrote this code, which works perfectly:
void blur(int height, int width, RGBTRIPLE image[height][width])
{
    RGBTRIPLE ogImage[height][width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            ogImage[i][j] = image[i][j];
        }
    }
    for (int i = 0, red, green, blue, counter; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            red = green = blue = counter = 0;
            if (i >= 0 && j >= 0)
            {
                red += ogImage[i][j].rgbtRed;
                green += ogImage[i][j].rgbtGreen;
                blue += ogImage[i][j].rgbtBlue;
                counter++;
            }
            if (i >= 0 && j - 1 >= 0)
            {
                red += ogImage[i][j-1].rgbtRed;
                green += ogImage[i][j-1].rgbtGreen;
                blue += ogImage[i][j-1].rgbtBlue;
                counter++;
            }
            if ((i >= 0 && j + 1 >= 0) && (i >= 0 && j + 1 < width))
            {
                red += ogImage[i][j+1].rgbtRed;
                green += ogImage[i][j+1].rgbtGreen;
                blue += ogImage[i][j+1].rgbtBlue;
                counter++;
            }
            if (i - 1 >= 0 && j >= 0)
            {
                red += ogImage[i-1][j].rgbtRed;
                green += ogImage[i-1][j].rgbtGreen;
                blue += ogImage[i-1][j].rgbtBlue;
                counter++;
            }
            if (i - 1 >= 0 && j - 1 >= 0)
            {
                red += ogImage[i-1][j-1].rgbtRed;
                green += ogImage[i-1][j-1].rgbtGreen;
                blue += ogImage[i-1][j-1].rgbtBlue;
                counter++;
            }
            if ((i - 1 >= 0 && j + 1 >= 0) && (i - 1 >= 0 && j + 1 < width))
            {
                red += ogImage[i-1][j+1].rgbtRed;
                green += ogImage[i-1][j+1].rgbtGreen;
                blue += ogImage[i-1][j+1].rgbtBlue;
                counter++;
            }
            if ((i + 1 >= 0 && j >= 0) && (i + 1 < height && j >= 0))
            {
                red += ogImage[i+1][j].rgbtRed;
                green += ogImage[i+1][j].rgbtGreen;
                blue += ogImage[i+1][j].rgbtBlue;
                counter++;
            }
            if ((i + 1 >= 0 && j - 1 >= 0) && (i + 1 < height && j - 1 >= 0))
            {
                red += ogImage[i+1][j-1].rgbtRed;
                green += ogImage[i+1][j-1].rgbtGreen;
                blue += ogImage[i+1][j-1].rgbtBlue;
                counter++;
            }
            if ((i + 1 >= 0 && j + 1 >= 0) && (i + 1 < height && j + 1 < width))
            {
                red += ogImage[i+1][j+1].rgbtRed;
                green += ogImage[i+1][j+1].rgbtGreen;
                blue += ogImage[i+1][j+1].rgbtBlue;
                counter++;
            }
            image[i][j].rgbtRed = round(red / (counter * 1.0));
            image[i][j].rgbtGreen = round(green / (counter * 1.0));
            image[i][j].rgbtBlue = round(blue / (counter * 1.0));
        }
    }
    return;
}
I know it's late, but maybe it will be helpful for somebody.
For future people in doubt:
You don't need that many conditions; just think of it as working with one matrix inside another.
Then add a single condition so that the location being processed never goes beyond the matrix limits.
Example:
the current row plus the submatrix row cannot be less than zero, because that would go beyond the limits.
L: matrix row
sL: submatrix row
[L + sL][0] must not be < 0
[0 + (-1)][0] would be out of bounds.
Now think through the other cases the same way.
void blur(int height, int width, RGBTRIPLE image[height][width])
{
    // average of the ORIGINAL value of the pixels around it
    int avgR, avgG, avgB, counter;
    // make a copy of the original image for the calculations
    RGBTRIPLE copy[height][width];
    for (int h = 0; h < height; h++)
    {
        for (int w = 0; w < width; w++)
        {
            copy[h][w] = image[h][w];
        }
    }
    // go across the image
    for (int linha = 0; linha < height; linha++)
    {
        for (int coluna = 0; coluna < width; coluna++)
        {
            // initialize the variables and reset them to 0
            avgR = 0;
            avgG = 0;
            avgB = 0;
            counter = 0;
            // go across the pixels around
            for (int row = -1; row < 2; row++)
            {
                for (int column = -1; column < 2; column++)
                {
                    if (linha + row < 0 || coluna + column < 0 || linha + row >= height || coluna + column >= width)
                    {
                    }
                    else
                    {
                        avgR += copy[linha + row][coluna + column].rgbtRed;
                        avgG += copy[linha + row][coluna + column].rgbtGreen;
                        avgB += copy[linha + row][coluna + column].rgbtBlue;
                        counter++;
                    }
                }
            }
            image[linha][coluna].rgbtRed = round(avgR / (float) counter);
            image[linha][coluna].rgbtGreen = round(avgG / (float) counter);
            image[linha][coluna].rgbtBlue = round(avgB / (float) counter);
        }
    }
    return;
}
Fernando's answer is great (I don't have the rep to comment on it directly), but to handle the edge cases, I turned the pixel for loops' start and end points into variables that get adjusted when you're working on an edge ("i" and "j" are my outer loop counters):
int startRow = -1;
int endRow = 1;
int startColumn = -1;
int endColumn = 1;
// Handle edge cases
if (i + startRow < 0) { startRow = 0; }
if (j + startColumn < 0) { startColumn = 0; }
if (i + endRow >= height) { endRow = 0; }
if (j + endColumn >= width) { endColumn = 0; }
// go across the pixels around
for (int pxRow = startRow; pxRow <= endRow; pxRow++) {
    for (int pxColumn = startColumn; pxColumn <= endColumn; pxColumn++) {
        int row = i + pxRow;
        int column = j + pxColumn;
        avgR += copy[row][column].rgbtRed;
        avgG += copy[row][column].rgbtGreen;
        avgB += copy[row][column].rgbtBlue;
        counter++;
    }
}
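Put together, a minimal sketch of the whole function with this clamping approach might look like the following (assuming CS50's bmp.h for RGBTRIPLE and math.h for round; copy is the same scratch image as in Fernando's answer):

#include <math.h>

void blur(int height, int width, RGBTRIPLE image[height][width])
{
    RGBTRIPLE copy[height][width];
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width; j++)
            copy[i][j] = image[i][j];

    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            int startRow = -1, endRow = 1, startColumn = -1, endColumn = 1;
            // Clamp the 3x3 window at the image borders
            if (i + startRow < 0) { startRow = 0; }
            if (j + startColumn < 0) { startColumn = 0; }
            if (i + endRow >= height) { endRow = 0; }
            if (j + endColumn >= width) { endColumn = 0; }

            int avgR = 0, avgG = 0, avgB = 0, counter = 0;
            for (int pxRow = startRow; pxRow <= endRow; pxRow++)
            {
                for (int pxColumn = startColumn; pxColumn <= endColumn; pxColumn++)
                {
                    avgR += copy[i + pxRow][j + pxColumn].rgbtRed;
                    avgG += copy[i + pxRow][j + pxColumn].rgbtGreen;
                    avgB += copy[i + pxRow][j + pxColumn].rgbtBlue;
                    counter++;
                }
            }
            image[i][j].rgbtRed = round(avgR / (float) counter);
            image[i][j].rgbtGreen = round(avgG / (float) counter);
            image[i][j].rgbtBlue = round(avgB / (float) counter);
        }
    }
}

The inner loops then need no bounds test at all, since the window is already clamped before they start.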
I have the same issue as in my earlier post (Cuda - Multiple sums in each vector element): how do you perform 2D block striding in both the x- and y-directions with varying summation limits? The 2D algorithm can be seen in the CPU function and in the monolithic kernel. I included OpenMP for the CPU so as to get a fairer speedup result; if there is a way to increase the speed of the CPU function as well, I would be happy to find out.
This version of the code takes a 2D array and flattens it to a 1D array. I still use the 2D dim3 thread indexing so I can index the double summations more intuitively.
(P.S. All credit to user Robert Crovella for the 1D striding code.)
The code so far is:
#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <sys/time.h>
typedef double df;
#define USECPSEC 1000000ULL
#define BSX 1<<5
#define BSY 1<<5
#define N 100
#define M 100
const bool sync = true;
const bool nosync = false;
unsigned long long dtime_usec(unsigned long long start, bool use_sync = nosync){
if (use_sync == sync) cudaDeviceSynchronize();
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
int divUp(int a, int b) {return (a + b - 1) / b;}
float cpu_sum(int n, int m, df *a, df *b, df *c) {
df q, r;
#pragma omp parallel for collapse(2)
for (int x = 0; x < n; x++) {
for (int y = 0; y < m; y++) {
q = 0.0f;
for (int i = 0; i <= x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x - i) * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x - i) * n + y + j]
+ a[i * n + y + j] * b[(x - i) * n + j];
}
q += r;
}
for (int i = 1; i < n-x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x + i) * n + y - j]
+ a[(x + i) * n + j] * b[ i * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x + i) * n + y + j]
+ a[(x + i) * n + y + j] * b[(x + i) * n + j]
+a[(x + i) * n + j] * b[i * n + y + j]
+ a[(x + i) * n + y + j] * b[i * n + j];
}
q += r;
}
c[x * N + y] = 0.25f*q;
}
}
return 0;
}
const int P2 = 5;
const int TPB = 1<<P2;
const unsigned row_mask = ~((0xFFFFFFFFU>>P2)<<P2);
__global__ void chebyprod_imp(int n, int m, df *a, df *b, df *c){
__shared__ df sdata[TPB*TPB];
int x = blockIdx.x;
int y = blockIdx.y;
int row_width_x = (((x)>(n-x))?(x):(n-x))+1;
int row_width_y = (((y)>(m-y))?(y):(m-y))+1;
int strides_x = (row_width_x>>P2) + ((row_width_x&row_mask)?1:0);
int strides_y = (row_width_y>>P2) + ((row_width_y&row_mask)?1:0);
int i = threadIdx.x;
df tmp_a;
df sum = 0.0f;
for (int s=0; s < strides_x; s++) { // block-stride x loop
int j = threadIdx.y;
for (int u=0; u < strides_y; u++) { // block-stride y loop
if (i < n && j < m) {tmp_a = a[i * n + j];}
if (i <= x) {
if (j <= y) {sum += tmp_a * b[(x - i) * n + y - j];}
if ((j > 0) && (j < (m-y))) {sum += tmp_a * b[(x - i) * n + y + j]
+ a[i * n + y + j] * b[(x - i) * n + j];}
}
if ((i > 0) && (i < (n-x))) {
if (j <= y) {sum += tmp_a * b[(x + i) * n + y - j]
+ a[(x + i) * n + j] * b[ i * n + y - j];}
if ((j > 0) && (j < (m-y))) {sum += tmp_a * b[(x + i) * n + y + j]
+ a[(x + i) * n + y + j] * b[(x + i) * n + j]
+ a[(x + i) * n + j] * b[i * n + y + j]
+ a[(x + i) * n + y + j] * b[i * n + j];}
}
j += TPB;
}
i += TPB;
}
sdata[threadIdx.x * TPB + threadIdx.y] = sum;
for (int s = TPB>>1; s > 0; s>>=1) { // sweep reduction in x
for (int u = TPB>>1; u > 0; u>>=1) { // sweep reduction in y
__syncthreads();
if (threadIdx.x < s && threadIdx.y < u) {
sdata[threadIdx.x * TPB + threadIdx.y] += sdata[(threadIdx.x + s) * TPB + threadIdx.y + u];
}
}
}
if (!threadIdx.x && !threadIdx.y) c[x * n + y] = 0.25f*sdata[0];
}
__global__ void chebyprod(int n, int m, df *a, df *b, df *c){
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
df q, r;
if (x < n && y < m) {
q = 0.0f;
for (int i = 0; i <= x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x - i) * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x - i) * n + y + j]
+ a[i * n + y + j] * b[(x - i) * n + j];
}
q += r;
}
for (int i = 1; i < n-x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x + i) * n + y - j]
+ a[(x + i) * n + j] * b[ i * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x + i) * n + y + j]
+ a[(x + i) * n + y + j] * b[(x + i) * n + j]
+a[(x + i) * n + j] * b[i * n + y + j]
+ a[(x + i) * n + y + j] * b[i * n + j];
}
q += r;
}
c[x * N + y] = 0.25f*q;
}
}
int main(void){
int size = N*M*sizeof(df);
df *a, *b, *c, *cc, *ci, *d_a, *d_b, *d_c, *d_ci;
a = (df*)malloc(size);
b = (df*)malloc(size);
c = (df*)malloc(size);
cc = (df*)malloc(size);
ci = (df*)malloc(size);
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
cudaMalloc(&d_ci, size);
#pragma omp parallel for collapse (2)
for (int i = 0; i < N; i++) {
for (int j = 0; j < M; j++) {
a[i * M + j] = 0.1f;
b[i * M + j] = 0.2f;
}
}
unsigned long long dt = dtime_usec(0);
// Perform chebyprod on N elements
cpu_sum(N, M, a, b, cc);
dt = dtime_usec(dt,sync);
printf("Time taken 2D CPU: %fs\n", dt/(float)USECPSEC);
df dtc = dt/(float)USECPSEC;
std::cout << "Vector cc: [ ";
for (int k = 0; k < 10; ++k)
std::cout << cc[k] << " ";
std::cout <<"]\n";
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
dim3 dimBlock(BSX, BSY);
dim3 dimGrid(divUp(N, BSX), divUp(M, BSY));
//std::cout << "dimBlock: " << dimBlock << "\n dimGrid: " << dimGrid << "\n";
dt = dtime_usec(0);
// Perform chebyprod on N elements
chebyprod<<< dimBlock, dimGrid >>>(N, M, d_a, d_b, d_c);
dt = dtime_usec(dt,sync);
printf("Time taken 2D monolithic kernel: %fs\n", dt/(float)USECPSEC);
printf("Speedup: %fs\n", dtc/(dt/(float)USECPSEC));
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
std::cout << "Vector c: [ ";
for (int k = 0; k < 10; ++k)
std::cout << c[k] << " ";
std::cout <<"]\n";
dt = dtime_usec(0);
// Perform chebyprod on N elements
chebyprod_imp<<< dimBlock, dimGrid >>>(N, M, d_a, d_b, d_ci);
dt = dtime_usec(dt,sync);
printf("Time taken 2D stride kernel: %fs\n", dt/(float)USECPSEC);
cudaMemcpy(ci, d_ci, size, cudaMemcpyDeviceToHost);
std::cout << "Vector ci: [ ";
for (int k = 0; k < 10; ++k)
std::cout << ci[k] << " ";
std::cout <<"]\n";
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
cudaFree(d_ci);
free(a);
free(b);
free(c);
free(cc);
free(ci);
}
For me, anyway, the results for the CPU code don't match between the cases where I compile with OpenMP support and without, if I omit -O3. I seem to get the correct results with OpenMP compilation if I also specify -O3. I'm not sure why that should matter for correctness, although it obviously has an impact on CPU code performance.
You seem to have gotten your grid and block sizing backwards:
chebyprod<<< dimBlock, dimGrid >>>(....
the first kernel config parameter is the grid dimension, not the block dimension. I'm not sure how this came about since you had it done correctly in your previous question.
As in the previous question, we need to pick a thread strategy and implement it correctly. You seemed to be confused about striding, so hopefully the code below will clarify things. The thread strategy I will use here is one warp per output point. A warp is a group of threads with a dimension of 32 (threads) in the x direction, and 1 in the y direction. Therefore the loop striding will be by an increment of 32 in the x direction, but only 1 in the y direction, to cover the entire space. The choice of thread strategy also affects grid sizing.
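As a toy illustration of how those strides cover the whole space (plain C standing in for the CUDA code; lane plays the role of threadIdx.x, and the 100x100 size is arbitrary):

#include <stdio.h>

// Toy model of the warp-stride pattern: each of a warp's 32 lanes
// covers x = lane, lane + 32, lane + 64, ... while y simply advances
// one element at a time.
static int lane_work(int lane, int width_x, int height_y)
{
    int visited = 0;
    for (int i = lane; i < width_x; i += 32) // warp-stride in x
        for (int j = 0; j < height_y; j++)   // ordinary loop in y
            visited++;                       // one (i, j) position
    return visited;
}

int main(void)
{
    int total = 0;
    for (int lane = 0; lane < 32; lane++)
        total += lane_work(lane, 100, 100);
    printf("covered %d of %d positions\n", total, 100 * 100);
    return 0;
}

Each lane accumulates its own partial sum over the positions it visits, and the warp then reduces the 32 partial sums, as in the kernel below.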
You seem to have jumbled the relationships that I think should exist for the two dimensions. The x direction, N, and n should all be connected. Likewise the y direction, M and m should all be connected (for example, M is the dimension in the y direction).
When it comes to 2D threadblocks, we want to arrange indexing for coalescing on the GPU such that the index that includes threadIdx.x is not multiplied by anything. (A simplified statement of coalescing is that we want adjacent threads in the warp to access adjacent elements in memory. Since threadIdx.x increases by 1 as we go from thread to thread in the warp, we want to use this characteristic to generate adjacent memory indexing. If we multiply threadIdx.x by anything except 1, we break the pattern.) You have this reversed: the index that includes threadIdx.x is typically multiplied by the row dimension (N, or n). This really cannot be correct, and also does not make for good coalesced access.
To solve this, we want to transpose our indexing and also transpose the data storage for a and b (and therefore c). In the code below, I have transposed the indexing for the data setup for a and b, and the relevant indexing has also been transposed in the striding kernel (only). In your non-striding kernel and also your CPU version I have not transposed the indexing; I leave that as an exercise for you, if needed. Numerically it does not matter for the results, because your entire a matrix has the same value at every location, and a similar statement can be made about your b matrix; so for this example code, transposing (or not) has no bearing on the result. But it matters for performance (of the striding kernel, at least). Also note that I believe performing the indexing "transpose" on the "monolithic" kernel should improve its performance as well. I don't know if it would affect the performance of the CPU version.
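As a plain-C sketch of the addressing point (not CUDA code; lane again stands in for threadIdx.x, and n, x, y are an assumed row width and a fixed column and row):

#include <stdio.h>

int main(void)
{
    int n = 100, x = 3, y = 5;           // arbitrary row width, column, row
    for (int lane = 0; lane < 4; lane++) // first few lanes of a warp
    {
        int coalesced = y * n + lane; // adjacent lanes, adjacent words
        int strided = lane * n + x;   // adjacent lanes, words n apart
        printf("lane %d: coalesced address %d, strided address %d\n",
               lane, coalesced, strided);
    }
    return 0;
}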
I've also added back in the const __restrict__ usage that I included in my previous answer. According to my testing, on "smaller" GPUs this provides noticeable performance benefit. It's not strictly necessary for correctness, however. Here's a worked example with the above changes that gives numerically matching results for all 3 test cases:
$ cat t1498.cu
#include <stdio.h>
#include <iostream>
#include <cuda.h>
#include <time.h>
#include <sys/time.h>
typedef double df;
#define USECPSEC 1000000ULL
#define BSX 1<<5
#define BSY 1<<5
#define N 100
#define M 100
const bool sync = true;
const bool nosync = false;
unsigned long long dtime_usec(unsigned long long start, bool use_sync = nosync){
if (use_sync == sync) cudaDeviceSynchronize();
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
int divUp(int a, int b) {return (a + b - 1) / b;}
void cpu_sum(int n, int m, df *a, df *b, df *c) {
df q, r;
#pragma omp parallel for collapse(2)
for (int x = 0; x < n; x++) {
for (int y = 0; y < m; y++) {
q = 0.0f;
for (int i = 0; i <= x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x - i) * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x - i) * n + y + j]
+ a[i * n + y + j] * b[(x - i) * n + j];
}
q += r;
}
for (int i = 1; i < n-x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x + i) * n + y - j]
+ a[(x + i) * n + j] * b[ i * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x + i) * n + y + j]
+ a[(x + i) * n + y + j] * b[(x + i) * n + j]
+a[(x + i) * n + j] * b[i * n + y + j]
+ a[(x + i) * n + y + j] * b[i * n + j];
}
q += r;
}
c[x * N + y] = 0.25f*q;
}
}
}
// choose one warp per output point
const int P2 = 5; // assumes warp size is 32
const unsigned row_mask = ~((0xFFFFFFFFU>>P2)<<P2);
__global__ void chebyprod_imp(int n, int m, const df * __restrict__ a, const df * __restrict__ b, df * __restrict__ c){
int x = blockIdx.x;
int y = threadIdx.y+blockDim.y*blockIdx.y;
int width_x = (((x)>(n-x))?(x):(n-x))+1;
int height_y = (((y)>(m-y))?(y):(m-y))+1;
int strides_x = (width_x>>P2) + ((width_x&row_mask)?1:0);
int strides_y = height_y;
int i = threadIdx.x;
df tmp_a;
df sum = 0.0f;
if ((x < n) && (y < m)){
for (int s=0; s < strides_x; s++) { // warp-stride x loop
for (int j=0; j < strides_y; j++) { // y loop
if (i < n && j < m) {tmp_a = a[j * n + i];}
if (i <= x) {
if (j <= y) {sum += tmp_a * b[(y - j) * n + x - i];}
if ((j > 0) && (j < (m-y))) {sum += tmp_a * b[(y+j) * n + x - i] + a[(y+j)* n + i] * b[j*n+(x - i)];}
}
if ((i > 0) && (i < (n-x))) {
if (j <= y) {sum += tmp_a * b[(y-j) * n + x+i] + a[j*n + (x + i)] * b[(y - j)*n + i];}
if ((j > 0) && (j < (m-y)))
{sum += tmp_a * b[(y+j) * n + x+i]
+ a[(y+j) * n + x + i] * b[j*n+(x + i)]
+ a[j*n + (x + i)] * b[(y+j)*n + i]
+ a[(y+j)*n + x + i] * b[j*n+i];}
}
}
i += 32;
}
// warp-shuffle reduction
for (int offset = warpSize>>1; offset > 0; offset >>= 1)
sum += __shfl_down_sync(0xFFFFFFFFU, sum, offset);
if (!threadIdx.x) c[y*m+x] = 0.25f*sum;}
}
__global__ void chebyprod(int n, int m, df *a, df *b, df *c){
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
df q, r;
if (x < n && y < m) {
q = 0.0f;
for (int i = 0; i <= x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x - i) * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x - i) * n + y + j]
+ a[i * n + y + j] * b[(x - i) * n + j];
}
q += r;
}
for (int i = 1; i < n-x; i++) {
r = 0.0f;
for (int j = 0; j <= y; j++) {
r += a[i * n + j] * b[(x + i) * n + y - j]
+ a[(x + i) * n + j] * b[ i * n + y - j];
}
for (int j = 1; j < m - y; j++) {
r += a[i * n + j] * b[(x + i) * n + y + j]
+ a[(x + i) * n + y + j] * b[(x + i) * n + j]
+a[(x + i) * n + j] * b[i * n + y + j]
+ a[(x + i) * n + y + j] * b[i * n + j];
}
q += r;
}
c[x * N + y] = 0.25f*q;
}
}
int main(void){
int size = N*M*sizeof(df);
df *a, *b, *c, *cc, *ci, *d_a, *d_b, *d_c, *d_ci;
a = (df*)malloc(size);
b = (df*)malloc(size);
c = (df*)malloc(size);
cc = (df*)malloc(size);
ci = (df*)malloc(size);
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
cudaMalloc(&d_ci, size);
#pragma omp parallel for collapse (2)
for (int j = 0; j < M; j++) {
for (int i = 0; i < N; i++) {
a[j * N + i] = 0.1f;
b[j * N + i] = 0.2f;
}
}
unsigned long long dt = dtime_usec(0);
// Perform chebyprod on N elements
cpu_sum(N, M, a, b, cc);
dt = dtime_usec(dt,sync);
printf("Time taken 2D CPU: %fs\n", dt/(float)USECPSEC);
df dtc = dt/(float)USECPSEC;
std::cout << "Vector cc: [ ";
for (int k = 0; k < 10; ++k)
std::cout << cc[k] << " ";
std::cout <<"]\n";
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
dim3 dimBlock(BSX, BSY);
dim3 dimGrid(divUp(N, BSX), divUp(M, BSY));
//std::cout << "dimBlock: " << dimBlock << "\n dimGrid: " << dimGrid << "\n";
dt = dtime_usec(0);
// Perform chebyprod on N elements
chebyprod<<< dimGrid, dimBlock >>>(N, M, d_a, d_b, d_c);
dt = dtime_usec(dt,sync);
printf("Time taken 2D monolithic kernel: %fs\n", dt/(float)USECPSEC);
printf("Speedup: %fs\n", dtc/(dt/(float)USECPSEC));
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
std::cout << "Vector c: [ ";
for (int k = 0; k < 10; ++k)
std::cout << c[k] << " ";
std::cout <<"]\n";
dt = dtime_usec(0);
// Perform chebyprod on N elements
dim3 dimGrid2(N, (M+dimBlock.y-1)/dimBlock.y);
chebyprod_imp<<< dimGrid2, dimBlock >>>(N, M, d_a, d_b, d_ci);
dt = dtime_usec(dt,sync);
printf("Time taken 2D stride kernel: %fs\n", dt/(float)USECPSEC);
printf("Speedup: %fs\n", dtc/(dt/(float)USECPSEC));
cudaMemcpy(ci, d_ci, size, cudaMemcpyDeviceToHost);
std::cout << "Vector ci: [ ";
for (int k = 0; k < 10; ++k)
std::cout << ci[k] << " ";
std::cout <<"]\n";
df max_error = 0;
for (int k = 0; k < N*M; k++)
max_error = fmax(max_error, fabs(c[k] - ci[k]));
std::cout << "Max diff = " << max_error << std::endl;
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
cudaFree(d_ci);
free(a);
free(b);
free(c);
free(cc);
free(ci);
}
$ nvcc -O3 -Xcompiler -fopenmp -arch=sm_52 -o t1498 t1498.cu
$ ./t1498
Time taken 2D CPU: 0.034830s
Vector cc: [ 198.005 197.01 196.015 195.02 194.025 193.03 192.035 191.04 190.045 189.05 ]
Time taken 2D monolithic kernel: 0.033687s
Speedup: 1.033930s
Vector c: [ 198.005 197.01 196.015 195.02 194.025 193.03 192.035 191.04 190.045 189.05 ]
Time taken 2D stride kernel: 0.013526s
Speedup: 2.575041s
Vector ci: [ 198.005 197.01 196.015 195.02 194.025 193.03 192.035 191.04 190.045 189.05 ]
Max diff = 8.52651e-13
$
CUDA 10.1.105, Fedora 29, GTX 960
Note that when we run this same test on a Tesla V100, which can take the most advantage of the "extra" threads available in the striding kernel case, the benefit is more obvious:
$ OMP_NUM_THREADS=32 ./t1498
Time taken 2D CPU: 0.031610s
Vector cc: [ 198.005 197.01 196.015 195.02 194.025 193.03 192.035 191.04 190.045 189.05 ]
Time taken 2D monolithic kernel: 0.018228s
Speedup: 1.734145s
Vector c: [ 198.005 197.01 196.015 195.02 194.025 193.03 192.035 191.04 190.045 189.05 ]
Time taken 2D stride kernel: 0.000731s
Speedup: 43.242137s
Vector ci: [ 198.005 197.01 196.015 195.02 194.025 193.03 192.035 191.04 190.045 189.05 ]
Max diff = 8.52651e-13
If you perform the indexing "transpose" on your monolithic kernel similar to what I have done in the striding kernel, I think you'll end up in a performance situation that is roughly similar to where you ended up in the last question. Little or no performance benefit for the striding kernel over your monolithic kernel on a "small" GPU. ~5x improvement on a "large" GPU.
I wrote a program that has two versions of a median filter implemented with OpenCV in C: one sequential, and one parallelized with OpenMP. My problem is that the OpenMP version runs slower than the sequential one, no matter the chunk size or the number of threads.
Any ideas/advice are very much welcome!
Here is my sequential code:
void medianFilter (const IplImage* img){
IplImage* output = cvCloneImage(img);
int rows, cols, step;
uchar *data;
rows = output->height;
cols = output->width;
step = output->widthStep;
data = (uchar *)output->imageData;
if(!data)
{ return; }
//create a sliding window of size 9
int window[9];
for(int y = 1; y < rows - 1; y++){
for(int x = 1; x < cols - 1; x++){
// Pick up window element
window[0] = data[(y - 1) * step + (x - 1)];
window[1] = data[y * step + (x - 1)];
window[2] = data[(y + 1) * step + (x - 1)];
window[3] = data[(y - 1) * step + x];
window[4] = data[y * step + x];
window[5] = data[(y + 1) * step + x];
window[6] = data[(y - 1) * step + (x + 1)];
window[7] = data[y * step + (x + 1)];
window[8] = data[(y + 1) * step + (x + 1)];
// Sort the window to find median
insertionSort(window);
// Assign the median to centered element of the matrix
data[y * step + x] = window[4];
}
}
cvNamedWindow("Post-filter", CV_WINDOW_AUTOSIZE);
cvShowImage("Post-filter", output);
cvReleaseImage(&output);
}
Here is my parallelized code:
void omp_medianFilter (const IplImage* img){
IplImage* output = cvCloneImage(img);
int rows, cols, step, nthreads;
uchar *data;
rows = output->height;
cols = output->width;
step = output->widthStep;
data = (uchar *)output->imageData;
if(!data)
{ return; }
// Create a sliding window of size 9
int window[9], x, y;
// Set the number of threads to use
omp_set_num_threads(NUM_THREADS);
// Parallel code segment. Window, x and y are private variables for each thread
#pragma omp parallel private(window, x, y)
{
//if(omp_get_thread_num() == 0){
//nthreads = omp_get_num_threads();
//printf("Numer of threads running: %d \n", nthreads);
//}
// Parallel for loop with dynamic scheduling and collapsing nested loops
#pragma omp for schedule(dynamic, CHUNK) collapse(2)
for(y = 1; y < rows - 1; y++){
for(x = 1; x < cols - 1; x++){
// Pick up 3x3 window elements
window[0] = data[(y - 1) * step + (x - 1)];
window[1] = data[y * step + (x - 1)];
window[2] = data[(y + 1) * step + (x - 1)];
window[3] = data[(y - 1) * step + x];
window[4] = data[y * step + x];
window[5] = data[(y + 1) * step + x];
window[6] = data[(y - 1) * step + (x + 1)];
window[7] = data[y * step + (x + 1)];
window[8] = data[(y + 1) * step + (x + 1)];
// Sort the window to find median
insertionSort(window);
// Assign the median to centered element of the matrix
data[y * step + x] = window[4];
}
}
}
cvNamedWindow("Post-filter (OMP)", CV_WINDOW_AUTOSIZE);
cvShowImage("Post-filter (OMP)", output);
cvReleaseImage(&output);
}
Full Code:
#include <stdio.h>
#include <opencv2/imgproc/imgproc_c.h>
#include <opencv2/highgui/highgui_c.h>
#include <opencv2/core/types_c.h>
#include <sys/time.h>
#include <omp.h>
#define NUM_THREADS 8
#define CHUNK 15000
//Function to measure time
double get_walltime() {
struct timeval tp; gettimeofday(&tp, NULL);
return (double) (tp.tv_sec + tp.tv_usec*1e-6);
}
//Sort the window elements using insertion sort
void insertionSort(int window[])
{
int temp, i , j;
for(i = 0; i < 9; i++){
temp = window[i];
for(j = i-1; j >= 0 && temp < window[j]; j--){
window[j+1] = window[j];
}
window[j+1] = temp;
}
}
void medianFilter (const IplImage* img){
IplImage* output = cvCloneImage(img);
int rows, cols, step;
uchar *data;
rows = output->height;
cols = output->width;
step = output->widthStep;
data = (uchar *)output->imageData;
if(!data)
{ return; }
//create a sliding window of size 9
int window[9];
for(int y = 1; y < rows - 1; y++){
for(int x = 1; x < cols - 1; x++){
// Pick up window element
window[0] = data[(y - 1) * step + (x - 1)];
window[1] = data[y * step + (x - 1)];
window[2] = data[(y + 1) * step + (x - 1)];
window[3] = data[(y - 1) * step + x];
window[4] = data[y * step + x];
window[5] = data[(y + 1) * step + x];
window[6] = data[(y - 1) * step + (x + 1)];
window[7] = data[y * step + (x + 1)];
window[8] = data[(y + 1) * step + (x + 1)];
// Sort the window to find median
insertionSort(window);
// Assign the median to centered element of the matrix
data[y * step + x] = window[4];
}
}
cvNamedWindow("Post-filter", CV_WINDOW_AUTOSIZE);
cvShowImage("Post-filter", output);
cvReleaseImage(&output);
}
// Parallelized implementation of median filter
void omp_medianFilter (const IplImage* img){
IplImage* output = cvCloneImage(img);
int rows, cols, step, nthreads;
uchar *data;
rows = output->height;
cols = output->width;
step = output->widthStep;
data = (uchar *)output->imageData;
if(!data)
{ return; }
// Create a sliding window of size 9
int window[9], x, y, j, k, min;
// Set the number of threads to use
omp_set_num_threads(NUM_THREADS);
// Parallel code segment. Window, x and y are private variables for each thread
#pragma omp parallel private(window, x, y, j, k, min)
{
//if(omp_get_thread_num() == 0){
//nthreads = omp_get_num_threads();
//printf("Numer of threads running: %d \n", nthreads);
//}
// Parallel for loop with dynamic scheduling and collapsing nested loops
#pragma omp for schedule(dynamic, CHUNK) collapse(2)
for(y = 1; y < rows - 1; y++){
for(x = 1; x < cols - 1; x++){
// Pick up 3x3 window elements
window[0] = data[(y - 1) * step + (x - 1)];
window[1] = data[y * step + (x - 1)];
window[2] = data[(y + 1) * step + (x - 1)];
window[3] = data[(y - 1) * step + x];
window[4] = data[y * step + x];
window[5] = data[(y + 1) * step + x];
window[6] = data[(y - 1) * step + (x + 1)];
window[7] = data[y * step + (x + 1)];
window[8] = data[(y + 1) * step + (x + 1)];
// Sort the window to find median
//insertionSort(window);
for (int j = 0; j < 5; ++j)
{
// Find position of minimum element
int min = j;
for (int l = j + 1; l < 9; ++l)
if (window[l] < window[min])
min = l;
// Put found minimum element in its place
const int temp = window[j];
window[j] = window[min];
window[min] = temp;
}
// Assign the median to centered element of the matrix
data[y * step + x] = window[4];
}
}
}
cvNamedWindow("Post-filter (OMP)", CV_WINDOW_AUTOSIZE);
cvShowImage("Post-filter (OMP)", output);
cvReleaseImage(&output);
}
int main(int argc, char *argv[])
{
IplImage* src;
double time1, time2;
if(argc<2){
printf("Usage: main <image-file-name>\n\7");
exit(0);
}
// Load a source image
src = cvLoadImage(argv[1], CV_LOAD_IMAGE_GRAYSCALE);
cvNamedWindow("Original", CV_WINDOW_AUTOSIZE);
cvShowImage("Original", src);
/*time1 = get_walltime();
medianFilter(src);
time2 = get_walltime();
printf("Sequential Code Performance: %fs\n", time2 - time1);*/
time1 = get_walltime();
omp_medianFilter(src);
time2 = get_walltime();
printf("Parallel Code Performance: %fs\n", time2 - time1);
cvWaitKey(0);
cvReleaseImage(&src);
return 0;
}
Fixed:
I applied much of the advice given and saw performance improvements, but none of those things turned out to be the problem.
It was something pretty stupid: I was running this on a VM with Ubuntu 16.04 and had accidentally forgotten to increase the number of cores assigned to it, so it was only using one, which might as well mean it wasn't parallelized at all.
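For anyone who hits the same thing, a quick sanity check is to print what OpenMP can actually see before timing anything; a minimal sketch (compile with -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // If either of these prints 1 (for example, a VM with a single
    // core assigned), the "parallel" version cannot beat the
    // sequential one no matter what schedule or chunk size you pick.
    printf("Processors visible to OpenMP: %d\n", omp_get_num_procs());
    printf("Default max threads:          %d\n", omp_get_max_threads());
    return 0;
}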
For an assignment in a course called High Performance Computing, I was required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
    int i, j, k, x, y;
    x = 0;
    y = 0;
    k = 256;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*k);
            if (i > j) {
                y = y + 8*(i-j);
            } else {
                y = y + 8*(j-i);
            }
        }
    }
    return x;
}
Following some recommendations, I managed to optimize the code (or at least I think so), applying techniques such as:
Constant Propagation
Algebraic Simplification
Copy Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Invariant Removal
Bitwise shifts instead of multiplications, as they are less expensive
Here's my code:
int foobar(int a, int b, int N) {
    int i, j, x, y, t;
    x = 0;
    y = 0;
    for (i = 0; i <= N; i++) {
        t = i + 512;
        for (j = i + 1; j <= N; j++) {
            x = x + ((i<<3) + (j<<2))*t;
        }
    }
    return x;
}
According to my instructor, well-optimized code should have fewer or less costly instructions at the assembly level, and should therefore run in less time than the original code; i.e., the calculation is:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command gcc -o code_opt.s -S foobar.c, the generated code has many more lines than the original, despite the optimizations I made, and the run time is lower, but not by much compared to the original code. What am I doing wrong?
I won't paste the assembly code, as both versions are very long. I am calling the function foobar from main and measuring the execution time with the Linux time command:
int main() {
    int a, b, N;
    scanf("%d %d %d", &a, &b, &N);
    printf("%d\n", foobar(a, b, N));
    return 0;
}
Initially:
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
        if (i > j) {
            y = y + 8*(i-j);
        } else {
            y = y + 8*(j-i);
        }
    }
}
Removing y calculations:
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 4*(2*i+j)*(i+2*k);
    }
}
Splitting the inner expression into its j-independent and j-dependent parts:
for (i = 0; i <= N; i++) {
    for (j = i + 1; j <= N; j++) {
        x = x + 8*i*i + 16*i*k ; // multiple of 1 (no j)
        x = x + (4*i + 8*k)*j ;  // multiple of j
    }
}
Hoisting them out of the inner loop (and removing that loop, which runs N-i times):
for (i = 0; i <= N; i++) {
    x = x + (8*i*i + 16*i*k) * (N-i) ;
    x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting (collecting terms by powers of i):
for (i = 0; i <= N; i++) {
    x = x + ( 8*k*(N*N+N)/2 ) ;
    x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
    x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
    x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting again, with the coefficients computed:
for (i = 0; i <= N; i++) {
    x = x + 4*k*(N*N+N) ;                      // multiple of 1
    x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
    x = x + i*i * ( 8*N - 20*k - 2 ) ;         // multiple of i^2
    x = x + i*i*i * ( -10 ) ;                  // multiple of i^3
}
Hoisting once more (and removing the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
Both the above loop removals use the summation formulas:
Sum(1, i = 0..n) = n + 1
Sum(i, i = 0..n) = n(n + 1)/2
Sum(i², i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i³, i = 0..n) = n²(n + 1)²/4
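If you want to convince yourself the transformation is right, a throwaway harness like this one (my own sketch, with k fixed at 256 as in the original) compares the closed form against the original loops for small N:

#include <stdio.h>

int main(void)
{
    const int k = 256;
    for (int N = 0; N <= 50; N++)
    {
        // the original doubly nested loop
        int x_loop = 0;
        for (int i = 0; i <= N; i++)
            for (int j = i + 1; j <= N; j++)
                x_loop += 4 * (2 * i + j) * (i + 2 * k);

        // the hoisted closed form from the final step above
        // (all the divisions are exact, so integer division is safe)
        int x_formula = 0;
        x_formula += (4 * k * (N * N + N)) * (N + 1);
        x_formula += (16 * k * N + 2 * (N * N + N) - 4 * k) * ((N * (N + 1)) / 2);
        x_formula += (8 * N - 20 * k - 2) * ((N * (N + 1) * (2 * N + 1)) / 6);
        x_formula += -10 * ((N * N * (N + 1) * (N + 1)) / 4);

        if (x_loop != x_formula)
            printf("mismatch at N = %d: %d vs %d\n", N, x_loop, x_formula);
    }
    printf("done\n");
    return 0;
}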
y does not affect the final result of the code - removed:
int foobar(int a, int b, int N)
{
    int i, j, k, x, y;
    x = 0;
    //y = 0;
    k = 256;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*k);
            //if (i > j){
            //    y = y + 8*(i-j);
            //}else{
            //    y = y + 8*(j-i);
            //}
        }
    }
    return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
    int i, j, x;
    x = 0;
    for (i = 0; i <= N; i++) {
        for (j = i + 1; j <= N; j++) {
            x = x + 4*(2*i+j)*(i+2*256);
        }
    }
    return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of them to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum of squares and sum of cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise to you. As a result, i and j will also be removed.
a and b should also be removed if possible, since a and b are supplied as arguments but never used in your code.
Sum of squares and sum of cubes formulas:
Sum(x², x = 1..n) = n(n + 1)(2n + 1)/6
Sum(x³, x = 1..n) = n²(n + 1)²/4
This function is equivalent to the following formula, which contains only 4 integer multiplications and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Here is the direct link to the solution. Think before coding. Sometimes your brain can optimize code better than any compiler.
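A quick throwaway check (my sketch, not part of the linked solution) that the formula matches the original loops for small N:

#include <stdio.h>

// The original nested loops, for reference
static int foobar_loop(int N)
{
    int x = 0, k = 256;
    for (int i = 0; i <= N; i++)
        for (int j = i + 1; j <= N; j++)
            x += 4 * (2 * i + j) * (i + 2 * k);
    return x;
}

int main(void)
{
    for (int N = 0; N <= 50; N++)
    {
        int x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
        if (x != foobar_loop(N))
            printf("mismatch at N = %d\n", N);
    }
    printf("done\n");
    return 0;
}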
Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (i.e., i * 8) remains in the inner loop, and can be removed.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration increment j1 by 4 * t (which is a pre-calculated constant). Replace your inner expression with x = x + i1 + j1;.
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
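A sketch of that transformation (i1, j1, and dj are my own names for the hoisted quantities):

int foobar(int a, int b, int N) {
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t = i + 512;
        int i1 = (i << 3) * t;       // hoisted: 8*i*t is constant for the inner loop
        int j1 = ((i + 1) << 2) * t; // 4*j*t for the first j, i.e. j = i + 1
        int dj = 4 * t;              // adding this each iteration keeps j1 == 4*j*t
        for (int j = i + 1; j <= N; j++) {
            x = x + i1 + j1;
            j1 += dj;
        }
    }
    return x;
}

This replaces the shifts and the multiply in the inner loop with nothing but additions.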
A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through, you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
    int i, j;
    a = 0;
    b = 512;
    for (i = 0; i <= N; i++, b++) {
        for (j = i + 1; j <= N; j++) {
            a = a + ((i<<3) + (j<<2))*b;
        }
    }
    return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are each used in only a single multiple - i<<3 and j<<2. We can code this directly into the loop logic, thus:
int foobar(int a, int b, int N) {
    int i, j, iLimit, jLimit;
    a = 0;
    b = 512;
    iLimit = N << 3;
    jLimit = N << 2;
    for (i = 0; i <= iLimit; i += 8) {
        // note: the parentheses are required, since + binds tighter than >>
        for (j = (i >> 1) + 4; j <= jLimit; j += 4) {
            a = a + (i + j)*b;
        }
        b++;
    }
    return a;
}
OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{   // We eliminate unused arguments
    int x = 0, i = 0, i2 = 0, j, k, z;
    // We only iterate up to N on the outer loop, since the
    // last iteration doesn't do anything useful. Also we keep
    // track of '2*i' (which is used throughout the code) with a
    // second variable 'i2' which we increment by two in every
    // iteration, essentially converting multiplication into addition.
    while (i < N)
    {
        // We hoist the calculation '4 * (i+2*k)' out of the loop
        // since k is a literal constant and 'i' is a constant during
        // the inner loop. We could convert the multiplication by 2
        // into a left shift, but hey, let's not go *crazy*!
        //
        // (4 * (i+2*k)) <=>
        // (4 * i) + (4 * 2 * k) <=>
        // (2 * i2) + (8 * k) <=>
        // (2 * i2) + (8 * 256) <=>
        // (2 * i2) + 2048
        k = (2 * i2) + 2048;
        // We have now converted the expression:
        //     x = x + 4*(2*i+j)*(i+2*k);
        //
        // into the expression:
        //     x = x + (i2 + j) * k;
        //
        // Counterintuitively we now *expand* the formula into:
        //     x = x + (i2 * k) + (j * k);
        //
        // Now observe that (i2 * k) is a constant inside the inner
        // loop which we can calculate only once here. Also observe
        // that it is simply added into x a total of (N - i) times, so
        // we take advantage of the abelian nature of addition
        // to hoist it completely out of the loop.
        x = x + (i2 * k) * (N - i);
        // Observe that inside this loop we calculate (j * k) repeatedly,
        // and that j is just an increasing counter. So now instead of
        // doing numerous multiplications, let's break the operation into
        // two parts: a multiplication, which we hoist out of the inner
        // loop, and additions which we continue performing in the inner
        // loop.
        z = i * k;
        for (j = i + 1; j <= N; j++)
        {
            z = z + k;
            x = x + z;
        }
        i++;
        i2 += 2;
    }
    return x;
}
The code, without any of the explanations boils down to this:
int foobar(int N)
{
    int x = 0, i = 0, i2 = 0, j, k, z;
    while (i < N)
    {
        k = (2 * i2) + 2048;
        x = x + (i2 * k) * (N - i);
        z = i * k;
        for (j = i + 1; j <= N; j++)
        {
            z = z + k;
            x = x + z;
        }
        i++;
        i2 += 2;
    }
    return x;
}
I hope this helps.
int foobar(int N) // unused arguments removed
{
    int i, j, x = 0; // unused variables and operations removed, saving stack space and cycles
    for (i = N; i--; )             // skip i = N: its inner loop is empty, so don't test it
        for (j = N + 1; --j > i; )
            x += (((i<<1) + j) * (i + 512) << 2); // shift instead of multiply to save cycles
    return x;
}