Cache utilization in matrix transpose in C

This code transposes a matrix four ways. The first does sequential writes and non-sequential reads. The second does the opposite. The next two are the same, but use cache-bypassing (non-temporal) writes. What seems to happen is that sequential writes are faster, and bypassing the cache is faster. What I don't understand is: if the cache is being bypassed, why are sequential writes still faster?
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:\t%lld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
for (i = 0; i < N; ++i)
tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:\t%lld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
for (j = 0; j < N; ++j)
_mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:\t%lld\n", after.QuadPart - before.QuadPart);
QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
for (i = 0; i < N; ++i)
_mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:\t%lld\n", after.QuadPart - before.QuadPart);
EDIT: The output is
Transpose 1: 47603
Transpose 2: 92449
Transpose 3: 38340
Transpose 4: 69597

The CPU has a write-combining buffer that merges writes to the same cache line so that they go out in one burst. When the cache is bypassed and the writes are sequential, this buffer acts like a one-line cache, which is why the results look so similar to the case where the cache is not bypassed.
To be exact, even when the cache is bypassed, writes still reach memory in bursts.
See write-combining logic behavior here.
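As a rough illustration of why write order still matters with non-temporal stores, the sketch below keeps the destination index in the inner loop, so consecutive streamed stores fall into the same cache line and can be merged in the write-combining buffer before it is flushed. The function name `transpose_stream_rowwise`, the flat row-major layout and the plain-store fallback for targets without SSE2 are assumptions made for this example, not the asker's code.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32; also pulls in _mm_sfence */
#endif

/* Hypothetical helper: dst = transpose(src), both n x n, row-major.
 * The inner loop walks dst sequentially, so successive non-temporal
 * stores hit the same cache line and can be combined into one burst. */
void transpose_stream_rowwise(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
#if defined(__SSE2__)
            _mm_stream_si32(&dst[i * n + j], src[j * n + i]);
#else
            dst[i * n + j] = src[j * n + i];  /* fallback: ordinary store */
#endif
        }
    }
#if defined(__SSE2__)
    _mm_sfence();  /* order the streamed stores before later accesses */
#endif
}
```

Swapping the two loops gives the slower variant from the question: each streamed store then touches a different line, so a partially filled buffer is likely to be flushed before it can be completed.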

You could try a non-linear memory layout for the matrix to improve cache utilization. With 4x4 tiles of 32-bit floats, the transpose needs only a single access to each cache line. As a bonus, each tile can be transposed easily with _MM_TRANSPOSE4_PS.
Transposing a very large matrix is still a very memory-intensive operation. It will still be heavily bandwidth-limited, but at least the cache line usage is near optimal. I don't know whether the performance could be improved further. In my testing, a laptop that is a few years old transposes a 16k*16k matrix (1 GB of memory) in about 300 ms.
I also tried _mm_stream_ps, but it actually makes performance worse for some reason. I don't understand non-temporal memory writes well enough to have a practical guess as to why performance drops with _mm_stream_ps. One possible reason is, of course, that the cache line is already in L1 cache, ready for the write operation.
The really important benefit of a non-linear matrix layout would be the possibility of avoiding the transpose completely and simply running the multiplication in a tile-friendly order. But I only have the transpose code, which I'm using to improve my knowledge of cache management in algorithms.
I haven't yet tested whether prefetching would improve memory bandwidth usage. The current code runs at about 0.5 instructions per cycle (good cache-friendly code runs at around 2 instructions per cycle on this CPU), which leaves a lot of free cycles for prefetch instructions, even allowing quite complex calculations to tune prefetch timing at runtime.
Example code from my transpose benchmark follows.
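#include <algorithm>   /* std::min */
#include <utility>     /* std::swap */
#include <xmmintrin.h> /* __m128, _MM_TRANSPOSE4_PS, _mm_stream_ps */
#define MATSIZE 16384
#define align(val, a) (val + (a - val % a))
#define tilewidth 4
/* element type is float so that it matches the float& accessors and the __m128 tile code below */
typedef float matrix[align(MATSIZE,tilewidth)*MATSIZE] __attribute__((aligned(64)));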
float &index(matrix m, unsigned i, unsigned j)
{
/* tiled address calculation */
/* a single cache line holds one 4x4 sub-matrix (64 bytes = 4*4*sizeof(float)) */
/* tiles are arranged linearly from top to bottom */
/*
* eg: 16x16 matrix tile positions:
* t1 t5 t9 t13
* t2 t6 t10 t14
* t3 t7 t11 t15
* t4 t8 t12 t16
*/
const unsigned tilestride = tilewidth * MATSIZE;
const unsigned comp0 = i % tilewidth; /* i inside tile is least significant part */
const unsigned comp1 = j * tilewidth; /* next part is j multiplied by tile width */
const unsigned comp2 = i / tilewidth * tilestride;
const unsigned add = comp0 + comp1 + comp2;
return m[add];
}
/* Get start of tile reference */
float &tile(matrix m, unsigned i, unsigned j)
{
const unsigned tilestride = tilewidth * MATSIZE;
const unsigned comp1 = j * tilewidth; /* next part is j multiplied by tile width */
const unsigned comp2 = i / tilewidth * tilestride;
return m[comp1 + comp2];
}
template<bool diagonal>
static void doswap(matrix mat, unsigned i, unsigned j)
{
/* special path to swap whole tile at once */
union {
float *fs;
__m128 *mm;
} src, dst;
src.fs = &tile(mat, i, j);
dst.fs = &tile(mat, j, i);
if (!diagonal) {
__m128 srcrow0 = src.mm[0];
__m128 srcrow1 = src.mm[1];
__m128 srcrow2 = src.mm[2];
__m128 srcrow3 = src.mm[3];
__m128 dstrow0 = dst.mm[0];
__m128 dstrow1 = dst.mm[1];
__m128 dstrow2 = dst.mm[2];
__m128 dstrow3 = dst.mm[3];
_MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);
_MM_TRANSPOSE4_PS(dstrow0, dstrow1, dstrow2, dstrow3);
#if STREAMWRITE == 1
_mm_stream_ps(src.fs + 0, dstrow0);
_mm_stream_ps(src.fs + 4, dstrow1);
_mm_stream_ps(src.fs + 8, dstrow2);
_mm_stream_ps(src.fs + 12, dstrow3);
_mm_stream_ps(dst.fs + 0, srcrow0);
_mm_stream_ps(dst.fs + 4, srcrow1);
_mm_stream_ps(dst.fs + 8, srcrow2);
_mm_stream_ps(dst.fs + 12, srcrow3);
#else
src.mm[0] = dstrow0;
src.mm[1] = dstrow1;
src.mm[2] = dstrow2;
src.mm[3] = dstrow3;
dst.mm[0] = srcrow0;
dst.mm[1] = srcrow1;
dst.mm[2] = srcrow2;
dst.mm[3] = srcrow3;
#endif
} else {
__m128 srcrow0 = src.mm[0];
__m128 srcrow1 = src.mm[1];
__m128 srcrow2 = src.mm[2];
__m128 srcrow3 = src.mm[3];
_MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);
#if STREAMWRITE == 1
_mm_stream_ps(src.fs + 0, srcrow0);
_mm_stream_ps(src.fs + 4, srcrow1);
_mm_stream_ps(src.fs + 8, srcrow2);
_mm_stream_ps(src.fs + 12, srcrow3);
#else
src.mm[0] = srcrow0;
src.mm[1] = srcrow1;
src.mm[2] = srcrow2;
src.mm[3] = srcrow3;
#endif
}
}
static void transpose(matrix mat)
{
const unsigned xstep = 256;
const unsigned ystep = 256;
const unsigned istep = 4;
const unsigned jstep = 4;
unsigned x1, y1, i, j;
/* need to increment x check for y limit to allow unrolled inner loop
* access entries close to diagonal axis
*/
for (x1 = 0; x1 < MATSIZE - xstep + 1 && MATSIZE > xstep && xstep; x1 += xstep)
for (y1 = 0; y1 < std::min(MATSIZE - ystep + 1, x1 + 1); y1 += ystep)
for ( i = x1 ; i < x1 + xstep; i += istep ) {
for ( j = y1 ; j < std::min(y1 + ystep, i); j+= jstep )
{
doswap<false>(mat, i, j);
}
if (i == j && j < (y1 + ystep))
doswap<true>(mat, i, j);
}
for ( i = 0 ; i < x1; i += istep ) {
for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j+= jstep )
{
doswap<false>(mat, i, j);
}
if (i == j)
doswap<true>(mat, i, j);
}
for ( i = x1 ; i < MATSIZE - istep + 1; i += istep ) {
for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j+= jstep )
{
doswap<false>(mat, i, j);
}
if (i == j)
doswap<true>(mat, i, j);
}
x1 = MATSIZE - MATSIZE % istep;
y1 = MATSIZE - MATSIZE % jstep;
for ( i = x1 ; i < MATSIZE; i++ )
for ( j = 0 ; j < std::min((unsigned)MATSIZE, i); j++ )
std::swap(index(mat, i, j+0), index(mat, j+0, i));
for ( i = 0; i < x1; i++ )
for ( j = y1 ; j < std::min((unsigned)MATSIZE, i) ; j++ )
std::swap(index(mat, i, j+0), index(mat, j+0, i));
}
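To make the tiled addressing concrete, here is a small plain-C sketch of the same index arithmetic as index() above, with the matrix size passed as a parameter instead of a macro. The names `tiled_offset` and `TILE` are illustrative, and the sketch assumes the size is a multiple of the tile width.

```c
#include <stddef.h>

enum { TILE = 4 };  /* 4x4 floats = 64 bytes, a common cache line size */

/* Offset of element (i, j) in an n x n matrix stored as 4x4 tiles.
 * Assumes n is a multiple of TILE. */
size_t tiled_offset(size_t i, size_t j, size_t n)
{
    size_t in_run    = i % TILE;               /* position inside a 4-element run */
    size_t run       = j * TILE;               /* one 4-element run per j */
    size_t tile_band = (i / TILE) * TILE * n;  /* band of 4 consecutive i values */
    return in_run + run + tile_band;
}
```

For i and j that are multiples of 4, the sixteen elements (i..i+3, j..j+3) map to sixteen consecutive offsets, which is what lets doswap() load a whole tile as four __m128 values.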

Related

Optimizing Matrix multiplication in C with Bit Packing

I'm currently attempting to write an algorithm for optimizing matrix multiplication over GF(2) using bit-packing. Both matrices A and B are provided in column major order so I start by copying A into row-major order and then packing the values into 8-bit integers and using parity checking to speed up operations. I need to be able to test square matrices of up to 2048x2048, however, my current implementation provides the correct answer up to 24x24 and then fails to compute the correct result. Any help would be appreciated.
//Method which packs an array of integers into 8 bits
uint8_t pack(int *toPack) {
int i;
uint8_t A;
A = 0;
for (i = 0; i < 8; i++) {
A = (A << 1) | (uint8_t)toPack[i];
}
return A;
}
//Method for doing matrix multiplication over GF(2)
void matmul_optimized(int n, int *A, int *B, int *C) {
int i, j, k;
//Copying values of A into a row major order matrix.
int *A_COPY = malloc(n * n * sizeof(int));
int copy_index = 0;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
A_COPY[copy_index] = A[i + j * n];
copy_index++;
}
}
//Size of the data data type integers will be packed into
const int portion_size = 8;
int portions = n / portion_size;
//Pointer space reserved to store packed integers in row major order
uint8_t *compressedA = malloc(n * portions * sizeof(uint8_t));
uint8_t *compressedB = malloc(n * portions * sizeof(uint8_t));
int a[portion_size];
int b[portion_size];
for (i = 0; i < n; i++) {
for (j = 0; j < portions; j++) {
for (k = 0; k < portion_size; k++) {
a[k] = A_COPY[i * n + j * portion_size + k];
b[k] = B[i * n + j * portion_size + k];
}
compressedA[i * n + j] = pack(a);
compressedB[i * n + j] = pack(b);
}
}
//Calculating final matrix using parity checking and XOR on A and B
int cij;
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
int cIndex = i + j * n;
cij = C[cIndex];
for (k = 0; k < portions; ++k) {
uint8_t temp = compressedA[k + i * n] & compressedB[k + j * n];
temp ^= temp >> 4;
temp ^= temp >> 2;
temp ^= temp >> 1;
uint8_t parity = temp & (uint8_t)1;
cij = cij ^ parity;
}
C[cIndex] = cij;
}
}
free(compressedA);
free(compressedB);
free(A_COPY);
}
I have two remarks:
you should probably initialize cij to 0 instead of cij = C[cIndex];. It seems incorrect to update the destination matrix instead of storing the result of A * B. Your code might work for small matrices by coincidence because the destination matrix C happens to be all zeroes for this size.
it is risky to compute the allocation size as malloc(n * n * sizeof(int)); because n * n might overflow with int n if int is smaller than size_t. Given the sizes you work with, it is probably not a problem here, but it is a good idea to always use the sizeof as the first operand to force conversion to size_t of the following ones:
int *A_COPY = malloc(sizeof(*A_COPY) * n * n);
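One further thing that may be worth checking in the question's code: compressedA and compressedB are allocated with n * portions bytes, but they are indexed with a row stride of n (i * n + j), which runs past the allocation. The sketch below uses a stride of portions instead. It is an illustration under assumptions, not a drop-in fix: the function names are made up here, both inputs are taken to be column-major matrices of 0/1 int values, and n is assumed to be a multiple of 8.

```c
#include <stdint.h>
#include <stdlib.h>

/* Parity (XOR of all bits) of one byte. */
static uint8_t parity8(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* C = A * B over GF(2); A, B, C are n x n, column-major, entries 0 or 1.
 * Assumes n % 8 == 0. Returns 0 on success, -1 on allocation failure. */
int matmul_gf2_packed(int n, const int *A, const int *B, int *C)
{
    int portions = n / 8;  /* packed bytes per row of A / per column of B */
    uint8_t *pa = malloc(sizeof(*pa) * (size_t)n * (size_t)portions);
    uint8_t *pb = malloc(sizeof(*pb) * (size_t)n * (size_t)portions);
    if (!pa || !pb) { free(pa); free(pb); return -1; }

    for (int r = 0; r < n; r++) {
        for (int p = 0; p < portions; p++) {
            uint8_t a = 0, b = 0;
            for (int k = 0; k < 8; k++) {
                int t = p * 8 + k;
                a = (uint8_t)((a << 1) | (A[r + t * n] & 1)); /* row r of A */
                b = (uint8_t)((b << 1) | (B[t + r * n] & 1)); /* column r of B */
            }
            pa[r * portions + p] = a;  /* stride is portions, not n */
            pb[r * portions + p] = b;
        }
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            uint8_t acc = 0;  /* start from 0 rather than the old C value */
            for (int p = 0; p < portions; p++)
                acc ^= parity8(pa[i * portions + p] & pb[j * portions + p]);
            C[i + j * n] = acc;
        }
    }
    free(pa);
    free(pb);
    return 0;
}
```

Multiplying by the identity matrix is a cheap sanity check: the result should equal the other operand for every size that is a multiple of 8.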

Speed up matrix-matrix multiplication using SSE vector instructions

I am having trouble vectorizing some C code using SSE vector instructions. The code I have to vectorize is
#define N 1000
void matrix_mul(int mat1[N][N], int mat2[N][N], int result[N][N])
{
int i, j, k;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j)
{
for (k = 0; k < N; ++k)
{
result[i][k] += mat1[i][j] * mat2[j][k];
}
}
}
}
Here is what I got so far:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
int i, j, k; int* l;
__m128i v1, v2, v3;
v3 = _mm_setzero_si128();
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; j += 4)
{
for (k = 0; k < N; k += 4)
{
v1 = _mm_set1_epi32(mat1[i][j]);
v2 = _mm_loadu_si128((__m128i*)&mat2[j][k]);
v3 = _mm_add_epi32(v3, _mm_mul_epi32(v1, v2));
_mm_storeu_si128((__m128i*)&result[i][k], v3);
v3 = _mm_setzero_si128();
}
}
}
}
After execution I get the wrong result. I believe the reason is the load from memory into v2. I loop through mat1 in row-major order, so I need to load mat2[0][0], mat2[1][0], mat2[2][0], mat2[3][0], ... but what actually gets loaded is mat2[0][0], mat2[0][1], mat2[0][2], mat2[0][3], ... because mat2 is stored in memory in row-major order. I tried to fix this problem, but without any improvement.
Can anyone help me, please?
Below is a fixed version of your implementation:
void matrix_mul_sse(int mat1[N][N], int mat2[N][N], int result[N][N])
{
int i, j, k;
__m128i v1, v2, v3, v4;
for (i = 0; i < N; ++i)
{
for (j = 0; j < N; ++j) // 'j' must be incremented by 1
{
// read mat1 here because it does not use 'k' index
v1 = _mm_set1_epi32(mat1[i][j]);
for (k = 0; k < N; k += 4)
{
v2 = _mm_loadu_si128((const __m128i*)&mat2[j][k]);
// read what's in the result array first as we will need to add it later to our calculations
v3 = _mm_loadu_si128((const __m128i*)&result[i][k]);
// use _mm_mullo_epi32 here instead _mm_mul_epi32 and add it to the previous result
v4 = _mm_add_epi32(v3, _mm_mullo_epi32(v1, v2));
// store the result
_mm_storeu_si128((__m128i*)&result[i][k], v4);
}
}
}
}
In short _mm_mullo_epi32 (requires SSE4.1) produces 4 x int32 results as opposed to _mm_mul_epi32 which does 2 x int64 results. If you cannot use SSE4.1 then have a look at the answer here for an alternative SSE2 solution.
Full description by Intel Intrinsic Guide:
_mm_mullo_epi32: Multiply the packed 32-bit integers in a and b, producing intermediate 64-bit integers, and store
the low 32 bits of the intermediate integers in dst.
_mm_mul_epi32: Multiply the low 32-bit integers from each packed 64-bit element in a and b, and store the
signed 64-bit results in dst.
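The difference can be seen without any SIMD hardware by modelling both intrinsics in scalar C. The two functions below are hypothetical stand-ins written for this illustration (they are not part of any Intel header); they simply mirror the two descriptions quoted above.

```c
#include <stdint.h>

/* Scalar model of _mm_mullo_epi32: four 32-bit products,
 * each truncated to its low 32 bits. */
void model_mullo_epi32(const int32_t a[4], const int32_t b[4], int32_t dst[4])
{
    for (int lane = 0; lane < 4; lane++) {
        int64_t wide = (int64_t)a[lane] * (int64_t)b[lane];
        dst[lane] = (int32_t)(uint32_t)wide;  /* keep the low 32 bits */
    }
}

/* Scalar model of _mm_mul_epi32: only 32-bit lanes 0 and 2 are read,
 * and each product is kept as a full signed 64-bit value. */
void model_mul_epi32(const int32_t a[4], const int32_t b[4], int64_t dst[2])
{
    dst[0] = (int64_t)a[0] * (int64_t)b[0];
    dst[1] = (int64_t)a[2] * (int64_t)b[2];
}
```

Reinterpreted as four 32-bit lanes, the second result holds the low and high halves of a[0]*b[0] followed by the low and high halves of a[2]*b[2], so lanes 1 and 3 of the inputs never contribute. That is probably one reason the original attempt produced wrong values.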
I changed your code around a bit to make the addressing explicit (it helps in this case).
#define N 100
This is a stub for the vector unit's multiply-and-accumulate operation; you should be able to replace NV with whatever width your vector unit has, and put the relevant opcodes in here.
#define NV 8
int Vmacc(int *A, int *B) {
int i = 0;
int x = 0;
for (i = 0; i < NV; i++) {
x += *A++ * *B++;
}
return x;
}
This multiply has two notable variations from the norm:
1. It caches the columnar vector into a contiguous one.
2. It attempts to push slices of the multiply accumulate into a vector-like func.
Even without using the vector unit, this takes half the time of the naive version, just because of better cache/prefetch utilization.
void mm2(int *A, int *B, int n, int *C) {
int c, r;
int stride = 0;
int cache[N];
for (c = 0; c < n; c++) {
/* cache column c: */
for (r = 0; r < n; r++) {
cache[r] = B[c + r*n];
}
for (r = 0; r < n; r++) {
int k = 0;
int x = 0;
int *Av = A + r*n;
for (k = 0; k+NV-1 < n; k += NV) {
x += Vmacc(Av+k, cache+k);
}
while (k < n) {
x += Av[k] * cache[k];
k++;
}
C[r*n + c] = x;
}
}
}

3-D Loop comparison in 7-pt Stencil

I carry out a 7-point stencil update on two 3-D domains. The first one is 258x130x258 and the second one is 130x258x258. Both of them have the same number of elements being updated. In C they are represented as contiguous arrays: a1[258][130][258] and x1[130][258][258]. Simply stated, their x- and y-dimensions are exchanged, but the z-dimension (the fastest-changing index) is the same.
Loop 1:
for(i = 1; i <= 256 ; i++)
for(j = 1; j <= 128 ; j++)
for(k = 1; k <= 256; k++)
a1[i][j][k] = alpha * b1[i][j][k] + (Omega_6) *(b1[i-1][j][k] + b1[i+1][j][k] +
b1[i][j-1][k] + b1[i][j+1][k] +
b1[i][j][k-1] + b1[i][j][k+1] +
c1[i][j][k] * H);
Loop 2:
for(i = 1; i <= 128 ; i++)
for(j = 1; j <= 256 ; j++)
for(k = 1; k <= 256; k++)
x1[i][j][k] = alpha * y1[i][j][k] + (Omega_6) *(y1[i-1][j][k] + y1[i+1][j][k] +
y1[i][j-1][k] + y1[i][j+1][k] +
y1[i][j][k-1] + y1[i][j][k+1] +
z1[i][j][k] * H);
a1, b1 and c1 all have the same dimensions, and x1, y1 and z1 have the same dimensions. alpha and Omega_6 are constants. Loop 1 runs 0.5 seconds faster than Loop 2. Why does this happen?

Improve performance of a construction of p-values matrix for a permutation test

I used R code that implements a permutation test for the distributional comparison between two populations of functions. We have p univariate p-values.
The bottleneck is the construction of a matrix that contains all the possible contiguous p-values.
The last row of the matrix of p-values contains all the univariate p-values.
The penultimate row contains all the bivariate p-values in this order:
p_val_c(1,2), p_val_c(2,3), ..., p_val_c(p, 1)
...
The elements of the first row all coincide; their common value is the p-value of the global test p_val_c(1,...,p)=p_val_c(2,...,p,1)=...=p_val_c(p,1,...,p-1).
For computational reasons, I have decided to implement this component in C and call it from R with .C.
Here is the code. The only important part is the definition of the function Build_pval_asymm_matrix.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <time.h>
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix);
// Function used for the sorting of vector T_temp with qsort
int cmp(const void *x, const void *y);
int main() {
int B = 1000; // number Conditional Monte Carlo (CMC) runs
int p = 100; // number univariate tests
// Generate fictitious data: univariate p-values pval and matrix L.
// The j-th column of L is the empirical survival
// function of the statistics test associated to the j-th coefficient
// of the basis expansion. The dimension of L is B * p.
// Generate pval
double pval[p];
memset(pval, 0, sizeof(pval)); // initialize all elements to 0
for (int i = 0; i < p; i++) {
pval[i] = (double)rand() / (double)RAND_MAX;
}
// Construct L
double L[B * p];
// Initialize the elements of L to 0
memset(L, 0, sizeof(L));
// Array used to construct the columns of L
double temp_array[B];
memset(temp_array, 0, sizeof(temp_array));
for(int i = 0; i < B; i++) {
temp_array[i] = (double) (i + 1) / (double) B;
}
for (int iter_coeff=0; iter_coeff < p; iter_coeff++) {
// Shuffle temp_array
if (B > 1) {
for (int k = 0; k < B - 1; k++)
{
int j = rand() % B;
double t = temp_array[j];
temp_array[j] = temp_array[k];
temp_array[k] = t;
}
}
for (int i=0; i<B; i++) {
L[iter_coeff + p * i] = temp_array[i];
}
}
double pval_asymm_matrix[p * p];
memset(pval_asymm_matrix, 0, sizeof(pval_asymm_matrix));
// Construct the asymmetric matrix of p-values
clock_t start, end;
double cpu_time_used;
start = clock();
Build_pval_asymm_matrix(&p, &B, pval, L, pval_asymm_matrix);
end = clock();
cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("TOTAL CPU time used: %f\n", cpu_time_used);
return 0;
}
void Build_pval_asymm_matrix(int * p, int * B, double * pval,
double * L,
double * pval_asymm_matrix) {
int nbasis = *p, iter_CMC = *B;
// Scalar output fisher combining function applied on univariate
// p-values
double T0_temp = 0;
// Vector output fisher combining function applied on a set of
//columns of L
double T_temp[iter_CMC];
memset(T_temp, 0, sizeof(T_temp));
// Counter for elements of T_temp greater than or equal to T0_temp
int count = 0;
// Indexes for columns of L
int inf = 0, sup = 0;
// The last row of pval_asymm_matrix contains the univariate p-values
for(int i = 0; i < nbasis; i++) {
pval_asymm_matrix[i + nbasis * (nbasis - 1)] = pval[i];
}
// Construct the rows from bottom to up
for (int row = nbasis - 2; row >= 0; row--) {
for (int col = 0; col <= row; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = (nbasis - row) + col - 1;
// Combining function Fisher applied on
// p-values pval[inf:sup]
for (int k = inf; k <= sup; k++) {
T0_temp += log(pval[k]);
}
T0_temp *= -2;
// Combining function Fisher applied
// on columns inf:sup of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = inf; l <= sup; l++) {
T_temp[k] += log(L[l + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
}
// auxiliary variable for columns of L inf:nbasis-1 and 1:sup
int aux_first = 0, aux_second = 0;
int num_col_needed = 0;
for (int col = row + 1; col < nbasis; col++) {
T0_temp = 0;
memset(T_temp, 0, sizeof(T_temp));
inf = col;
sup = ((nbasis - row) + col) % nbasis - 1;
// Useful indexes
num_col_needed = nbasis - inf + sup + 1;
int index_needed[num_col_needed];
memset(index_needed, -1, num_col_needed * sizeof(int));
aux_first = inf;
for (int i = 0; i < nbasis - inf; i++) {
index_needed[i] = aux_first;
aux_first++;
}
aux_second = 0;
for (int j = 0; j < sup + 1; j++) {
index_needed[j + nbasis - inf] = aux_second;
aux_second++;
}
// Combining function Fisher applied on p-values
// pval[inf:p-1] and pval[0:sup]
for (int k = 0; k < num_col_needed; k++) {
T0_temp += log(pval[index_needed[k]]);
}
T0_temp *= -2;
// Combining function Fisher applied on columns inf:p-1 and 0:sup
// of matrix L
for (int k = 0; k < iter_CMC; k++) {
for (int l = 0; l < num_col_needed; l++) {
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
}
T_temp[k] *= -2;
}
// Sort the vector T_temp
qsort(T_temp, iter_CMC, sizeof(double), cmp);
// Count the number of elements of T_temp less than T0_temp
int h = 0;
while (h < iter_CMC && T_temp[h] < T0_temp) {
h++;
}
// Number of elements of T_temp greater than or equal to T0_temp
count = iter_CMC - h;
pval_asymm_matrix[col + nbasis * row] = (double) count / (double)iter_CMC;
} // end for over col from row + 1 to nbasis - 1
} // end for over rows of asymm p-values matrix except the last row
}
int cmp(const void *x, const void *y)
{
double xx = *(double*)x, yy = *(double*)y;
if (xx < yy) return -1;
if (xx > yy) return 1;
return 0;
}
Here are the execution times in seconds, measured in R:
time_original_function
user system elapsed
79.726 1.980 112.817
time_function_double_for
user system elapsed
79.013 1.666 89.411
time_c_function
user system elapsed
47.920 0.024 56.096
The first measurement was obtained using an equivalent R function that duplicates the vector pval and the matrix L.
What I would like are some suggestions for decreasing the execution time of the C function, for simulation purposes. The last time I used C was five years ago, so there is surely room for improvement. For instance, I sort the vector T_temp with qsort so that a while loop can then count, in linear time, the number of elements of T_temp greater than or equal to T0_temp. Maybe this task could be done in a more efficient way. Thanks in advance!
I reduced the input size p to 50 to avoid waiting on it (I don't have such a fast machine); keeping p as is and reducing B to 100 has a similar effect. Profiling showed that ~7.5 of the ~8 seconds used to compute this was spent in the log function.
qsort doesn't even show up as a real hotspot. This test seems to headbutt the machine more in terms of micro-efficiency than anything else.
So unless your compiler has a vastly faster implementation of log than I do, my first suggestion is to find a fast log implementation if you can afford some accuracy loss (there are ones out there that can compute log over an order of magnitude faster with precision loss in the range of ~3% or so).
If you cannot have precision loss and accuracy is critical, then I'd suggest trying to memoize the values you use for log if you can and store them into a lookup table.
Update
I tried the latter approach.
// Create a memoized table of log values.
double log_cache[B * p];
for (int j=0, num=B*p; j < num; ++j)
log_cache[j] = log(L[j]);
Using malloc might be better here, as we're pushing rather large data to the stack and could risk overflows.
Then pass it into Build_pval_asymm_matrix.
Replace these:
T_temp[k] += log(L[l + nbasis * k]);
...
T_temp[k] += log(L[index_needed[l] + nbasis * k]);
With these:
T_temp[k] += log_cache[l + nbasis * k];
...
T_temp[k] += log_cache[index_needed[l] + nbasis * k];
This improved the times for me from ~8 seconds to ~5.3 seconds, but we've exchanged the computational overhead of log for memory overhead which isn't that much better (in fact, it rarely is but calling log for double-precision floats is apparently quite expensive, enough to make this exchange worthwhile). The next iteration, if you want more speed, and it is very possible, involves looking into cache efficiency.
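Once such a table exists, the two replaced inner loops can share one helper. The sketch below is an illustration only: the name `fisher_from_cache` is invented here, and treating the column range as a cyclic window (start column plus length, wrapping with a modulo) is an assumption that folds the question's two cases into one. It expects log_cache[c + nbasis * k] to hold log(L[c + nbasis * k]), as in the table above.

```c
#include <stddef.h>

/* Fisher statistic for Monte Carlo run k over a cyclic window of `len`
 * columns starting at column `first`, using precomputed logarithms. */
double fisher_from_cache(const double *log_cache, size_t nbasis,
                         size_t k, size_t first, size_t len)
{
    double sum = 0.0;
    for (size_t s = 0; s < len; ++s) {
        size_t col = (first + s) % nbasis;  /* wraps past the last column */
        sum += log_cache[col + nbasis * k];
    }
    return -2.0 * sum;
}
```

With this shape, T_temp[k] would be fisher_from_cache(log_cache, nbasis, k, inf, number_of_columns) for both the contiguous and the wrapped ranges, and no call to log remains in the hot loops.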
For this kind of huge matrix stuff, focusing on memory layouts and access patterns can work wonders.
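On the question's point about qsort: the number of values greater than or equal to T0_temp does not depend on their order, so a single pass gives the same count without sorting. The profile above suggests the gain is small next to the cost of log, so the sketch is only meant to show the equivalence. `count_ge` is a name chosen for this example.

```c
#include <stddef.h>

/* Number of entries in t[0..n-1] that are >= threshold. */
size_t count_ge(const double *t, size_t n, double threshold)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        if (t[i] >= threshold)
            ++count;
    return count;
}
```

In the question's code this would stand in for the qsort call and the while loop, with count = count_ge(T_temp, iter_CMC, T0_temp).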

Optimization of C code

For an assignment in a course called High Performance Computing, I am required to optimize the following code fragment:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
return x;
}
Using some recommendations, I managed to optimize the code (or at least I think so), such as:
Constant Propagation
Algebraic Simplification
Copy Propagation
Common Subexpression Elimination
Dead Code Elimination
Loop Invariant Removal
bitwise shifts instead of multiplication as they are less expensive.
Here's my code:
int foobar(int a, int b, int N) {
int i, j, x, y, t;
x = 0;
y = 0;
for (i = 0; i <= N; i++) {
t = i + 512;
for (j = i + 1; j <= N; j++) {
x = x + ((i<<3) + (j<<2))*t;
}
}
return x;
}
According to my instructor, well-optimized code should have fewer, or less costly, instructions at the assembly language level, and should therefore run in less time than the original code. That is, the calculation is:
execution time = instruction count * cycles per instruction
When I generate assembly code using the command: gcc -o code_opt.s -S foobar.c,
the generated code has many more lines than the original despite the optimizations I made, and the run time is lower, but not by much compared with the original code. What am I doing wrong?
I am not pasting the assembly code, as both listings are very long. I call the function foobar from main and measure the execution time using the time command in Linux:
int main () {
int a,b,N;
scanf ("%d %d %d",&a,&b,&N);
printf ("%d\n",foobar (a,b,N));
return 0;
}
Initially:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
if (i > j){
y = y + 8*(i-j);
}else{
y = y + 8*(j-i);
}
}
}
Removing y calculations:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
}
}
Splitting i, j, k:
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 8*i*i + 16*i*k ; // multiple of 1 (no j)
x = x + (4*i + 8*k)*j ; // multiple of j
}
}
Moving them externally (and removing the loop that runs N-i times):
for (i = 0; i <= N; i++) {
x = x + (8*i*i + 16*i*k) * (N-i) ;
x = x + (4*i + 8*k) * ((N*N+N)/2 - (i*i+i)/2) ;
}
Rewriting (here -1/2 denotes the fraction, not C integer division):
for (i = 0; i <= N; i++) {
x = x + ( 8*k*(N*N+N)/2 ) ;
x = x + i * ( 16*k*N + 4*(N*N+N)/2 + 8*k*(-1/2) ) ;
x = x + i*i * ( 8*N + 16*k*(-1) + 4*(-1/2) + 8*k*(-1/2) );
x = x + i*i*i * ( 8*(-1) + 4*(-1/2) ) ;
}
Rewriting and recalculating:
for (i = 0; i <= N; i++) {
x = x + 4*k*(N*N+N) ; // multiple of 1
x = x + i * ( 16*k*N + 2*(N*N+N) - 4*k ) ; // multiple of i
x = x + i*i * ( 8*N - 20*k - 2 ) ; // multiple of i^2
x = x + i*i*i * ( -10 ) ; // multiple of i^3
}
Another move to external (and removal of the i loop):
x = x + ( 4*k*(N*N+N) ) * (N+1) ;
x = x + ( 16*k*N + 2*(N*N+N) - 4*k ) * ((N*(N+1))/2) ;
x = x + ( 8*N - 20*k - 2 ) * ((N*(N+1)*(2*N+1))/6);
x = x + (-10) * ((N*N*(N+1)*(N+1))/4) ;
Both of the loop removals above use the summation formulas:
Sum(1, i = 0..n) = n + 1
Sum(i, i = 0..n) = n(n + 1)/2
Sum(i^2, i = 0..n) = n(n + 1)(2n + 1)/6
Sum(i^3, i = 0..n) = n^2 (n + 1)^2 / 4
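A quick way to gain confidence in the algebra is to compare the nested loops against the four sums for small N. The sketch below does that with 64-bit arithmetic (the original uses int, which overflows for large N); the function names are invented for this check.

```c
/* Reference: the original double loop, with y removed and k = 256. */
long long foobar_loops(long long N)
{
    const long long k = 256;
    long long x = 0;
    for (long long i = 0; i <= N; i++)
        for (long long j = i + 1; j <= N; j++)
            x += 4 * (2 * i + j) * (i + 2 * k);
    return x;
}

/* The four sums that remain after both loops have been removed. */
long long foobar_sums(long long N)
{
    const long long k = 256;
    long long x = 0;
    x += (4 * k * (N * N + N)) * (N + 1);
    x += (16 * k * N + 2 * (N * N + N) - 4 * k) * ((N * (N + 1)) / 2);
    x += (8 * N - 20 * k - 2) * ((N * (N + 1) * (2 * N + 1)) / 6);
    x += (-10) * ((N * N * (N + 1) * (N + 1)) / 4);
    return x;
}
```

Each of the three divisions is exact for every non-negative integer N, so the integer arithmetic loses nothing here.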
y does not affect the final result of the code - removed:
int foobar(int a, int b, int N)
{
int i, j, k, x, y;
x = 0;
//y = 0;
k = 256;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*k);
//if (i > j){
// y = y + 8*(i-j);
//}else{
// y = y + 8*(j-i);
//}
}
}
return x;
}
k is simply a constant:
int foobar(int a, int b, int N)
{
int i, j, x;
x = 0;
for (i = 0; i <= N; i++) {
for (j = i + 1; j <= N; j++) {
x = x + 4*(2*i+j)*(i+2*256);
}
}
return x;
}
The inner expression can be transformed to: x += 8*i*i + 4096*i + 4*i*j + 2048*j. Use math to push all of them to the outer loop: x += 8*i*i*(N-i) + 4096*i*(N-i) + 2*i*(N-i)*(N+i+1) + 1024*(N-i)*(N+i+1).
You can expand the above expression and apply the sum-of-squares and sum-of-cubes formulas to obtain a closed-form expression, which should run faster than the doubly nested loop. I leave it as an exercise for you. As a result, i and j will also be removed.
a and b should also be removed if possible, since they are supplied as arguments but never used in your code.
Sum-of-squares and sum-of-cubes formulas:
Sum(x^2, x = 1..n) = n(n + 1)(2n + 1)/6
Sum(x^3, x = 1..n) = n^2 (n + 1)^2 / 4
This function is equivalent with the following formula, which contains only 4 integer multiplications, and 1 integer division:
x = N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
To get this, I simply typed the sum calculated by your nested loops into Wolfram Alpha:
sum (sum (8*i*i+4096*i+4*i*j+2048*j), j=i+1..N), i=0..N
Here is the direct link to the solution. Think before coding. Sometimes your brain can optimize code better than any compiler.
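Wrapped in a function, the formula might look like the sketch below. Using long long is a choice made for this example: the product before the division is six times the final value, so with int it would overflow well before the result itself does. The name `foobar_closed` is hypothetical.

```c
/* Closed form of the nested loops, evaluated in 64-bit arithmetic.
 * The product is six times an integer sum, so the division is exact. */
long long foobar_closed(long long N)
{
    return N * (N + 1) * (N * (7 * N + 8187) - 2050) / 6;
}
```

For N = 1 the only loop iteration is i = 0, j = 1, which contributes 4 * 1 * 512 = 2048, and the formula gives 1 * 2 * 6144 / 6 = 2048 as well.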
Briefly scanning the first routine, the first thing you notice is that expressions involving "y" are completely unused and can be eliminated (as you did). This further permits eliminating the if/else (as you did).
What remains is the two for loops and the messy expression. Factoring out the pieces of that expression that do not depend on j is the next step. You removed one such expression, but (i<<3) (ie, i * 8) remains in the inner loop, and can be removed.
Pascal's answer reminded me that you can use a loop stride optimization. First move (i<<3) * t out of the inner loop (call it i1), then calculate, when initializing the loop, a value j1 that equals (i<<2) * t. On each iteration increment j1 by 4 * t (which is a pre-calculated constant). Replace your inner expression with x = x + i1 + j1;.
One suspects that there may be some way to combine the two loops into one, with a stride, but I'm not seeing it offhand.
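The stride idea described above could be sketched as follows. The variable names i1, j1 and step follow the description, but the function name is made up, and j1 is started at the value for j = i + 1 because that is the first j the inner loop visits (an adjustment assumed here so the running sum matches the original expression).

```c
/* Inner-loop multiplication replaced by a running sum (strength reduction). */
int foobar_strided(int N)
{
    int x = 0;
    for (int i = 0; i <= N; i++) {
        int t    = i + 512;
        int i1   = (i << 3) * t;        /* loop-invariant part: 8*i*t */
        int step = 4 * t;               /* j1 grows by 4*t per iteration */
        int j1   = ((i + 1) << 2) * t;  /* 4*j*t for the first j = i + 1 */
        for (int j = i + 1; j <= N; j++) {
            x = x + i1 + j1;
            j1 += step;
        }
    }
    return x;
}
```

The inner loop now contains only additions; whether that is measurably faster than what an optimizing compiler already produces would need to be checked on the target machine.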
A few other things I can see. You don't need y, so you can remove its declaration and initialisation.
Also, the values passed in for a and b aren't actually used, so you could use these as local variables instead of x and t.
Also, rather than adding i to 512 each time through you can note that t starts at 512 and increments by 1 each iteration.
int foobar(int a, int b, int N) {
int i, j;
a = 0;
b = 512;
for (i = 0; i <= N; i++, b++) {
for (j = i + 1; j <= N; j++) {
a = a + ((i<<3) + (j<<2))*b;
}
}
return a;
}
Once you get to this point you can also observe that, aside from initialising j, i and j are each only used in a single multiple: i<<3 and j<<2. We can code this directly in the loop logic, thus:
int foobar(int a, int b, int N) {
int i, j, iLimit, jLimit;
a = 0;
b = 512;
iLimit = N << 3;
jLimit = N << 2;
for (i = 0; i <= iLimit; i+=8) {
for (j = (i >> 1) + 4; j <= jLimit; j+=4) { /* parentheses needed: + binds tighter than >> */
a = a + (i + j)*b;
}
b++;
}
return a;
}
OK... so here is my solution, along with inline comments to explain what I did and how.
int foobar(int N)
{ // We eliminate unused arguments
int x = 0, i = 0, i2 = 0, j, k, z;
// We only iterate up to N on the outer loop, since the
// last iteration doesn't do anything useful. Also we keep
// track of '2*i' (which is used throughout the code) by a
// second variable 'i2' which we increment by two in every
// iteration, essentially converting multiplication into addition.
while(i < N)
{
// We hoist the calculation '4 * (i+2*k)' out of the loop
// since k is a literal constant and 'i' is a constant during
// the inner loop. We could convert the multiplication by 2
// into a left shift, but hey, let's not go *crazy*!
//
// (4 * (i+2*k)) <=>
// (4 * i) + (4 * 2 * k) <=>
// (2 * i2) + (8 * k) <=>
// (2 * i2) + (8 * 256) <=>
// (2 * i2) + 2048
k = (2 * i2) + 2048;
// We have now converted the expression:
// x = x + 4*(2*i+j)*(i+2*k);
//
// into the expression:
// x = x + (i2 + j) * k;
//
// Counterintuitively we now *expand* the formula into:
// x = x + (i2 * k) + (j * k);
//
// Now observe that (i2 * k) is a constant inside the inner
// loop which we can calculate only once here. Also observe
// that it is simply added into x a total of (N - i) times, so
// we take advantage of the abelian nature of addition
// to hoist it completely out of the loop
x = x + (i2 * k) * (N - i);
// Observe that inside this loop we calculate (j * k) repeatedly,
// and that j is just an increasing counter. So now instead of
// doing numerous multiplications, let's break the operation into
// two parts: a multiplication, which we hoist out of the inner
// loop and additions which we continue performing in the inner
// loop.
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
The code, without any of the explanations boils down to this:
int foobar(int N)
{
int x = 0, i = 0, i2 = 0, j, k, z;
while(i < N)
{
k = (2 * i2) + 2048;
x = x + (i2 * k) * (N - i);
z = i * k;
for (j = i + 1; j <= N; j++)
{
z = z + k;
x = x + z;
}
i++;
i2 += 2;
}
return x;
}
I hope this helps.
int foobar(int N) // drop the unused arguments
{
int i, j, x=0; // remove unused variables and operations to save stack space and machine cycles
for (i = N; i--; ) // count down so the loop test needs no separate comparison
for (j = N+1; --j>i; )
x += (((i<<1)+j)*(i+512)<<2); // use shifts instead of multiplications to save machine cycles
return x;
}
