Optimizing Matrix multiplication in C with Bit Packing

I'm currently attempting to write an algorithm for optimizing matrix multiplication over GF(2) using bit-packing. Both matrices A and B are provided in column major order so I start by copying A into row-major order and then packing the values into 8-bit integers and using parity checking to speed up operations. I need to be able to test square matrices of up to 2048x2048, however, my current implementation provides the correct answer up to 24x24 and then fails to compute the correct result. Any help would be appreciated.
//Method which packs an array of integers into 8 bits
uint8_t pack(int *toPack) {
int i;
uint8_t A;
A = 0;
for (i = 0; i < 8; i++) {
A = (A << 1) | (uint8_t)toPack[i];
}
return A;
}
//Method for doing matrix multiplication over GF(2)
void matmul_optimized(int n, int *A, int *B, int *C) {
int i, j, k;
//Copying values of A into a row major order matrix.
int *A_COPY = malloc(n * n * sizeof(int));
int copy_index = 0;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
A_COPY[copy_index] = A[i + j * n];
copy_index++;
}
}
//Size of the data type integers will be packed into
const int portion_size = 8;
int portions = n / portion_size;
//Pointer space reserved to store packed integers in row major order
uint8_t *compressedA = malloc(n * portions * sizeof(uint8_t));
uint8_t *compressedB = malloc(n * portions * sizeof(uint8_t));
int a[portion_size];
int b[portion_size];
for (i = 0; i < n; i++) {
for (j = 0; j < portions; j++) {
for (k = 0; k < portion_size; k++) {
a[k] = A_COPY[i * n + j * portion_size + k];
b[k] = B[i * n + j * portion_size + k];
}
compressedA[i * n + j] = pack(a);
compressedB[i * n + j] = pack(b);
}
}
//Calculating final matrix using parity checking and XOR on A and B
int cij;
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
int cIndex = i + j * n;
cij = C[cIndex];
for (k = 0; k < portions; ++k) {
uint8_t temp = compressedA[k + i * n] & compressedB[k + j * n];
temp ^= temp >> 4;
temp ^= temp >> 2;
temp ^= temp >> 1;
uint8_t parity = temp & (uint8_t)1;
cij = cij ^ parity;
}
C[cIndex] = cij;
}
}
free(compressedA);
free(compressedB);
free(A_COPY);
}

I have two remarks:
you should probably initialize cij to 0 instead of cij = C[cIndex];. It seems incorrect to update the destination matrix instead of storing the result of A * B. Your code might work for small matrices by coincidence because the destination matrix C happens to be all zeroes for this size.
it is risky to compute the allocation size as malloc(n * n * sizeof(int)); because n * n might overflow with int n if int is smaller than size_t. Given the sizes you work with, it is probably not a problem here, but it is a good idea to always use the sizeof as the first operand to force conversion to size_t of the following ones:
int *A_COPY = malloc(sizeof(*A_COPY) * n * n);
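A minimal sketch with both remarks applied. It also makes one further change that is an assumption on my part, not something stated in the remarks above: the packed arrays are indexed with a stride of portions (the number of bytes actually allocated per row) rather than n, since n * portions bytes are allocated but i * n + j can reach well past that. The name matmul_gf2_packed is hypothetical, and n is assumed to be a positive multiple of 8.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch: C = A * B over GF(2). A, B and C are n x n, column-major,
 * with entries 0 or 1. Returns 0 on success, -1 if allocation fails. */
int matmul_gf2_packed(int n, const int *A, const int *B, int *C) {
    assert(n > 0 && n % 8 == 0);
    size_t portions = (size_t)n / 8;                   /* packed bytes per row/column */
    uint8_t *pa = malloc(sizeof(*pa) * n * portions);  /* packed rows of A */
    uint8_t *pb = malloc(sizeof(*pb) * n * portions);  /* packed columns of B */
    if (pa == NULL || pb == NULL) {
        free(pa);
        free(pb);
        return -1;
    }
    for (int i = 0; i < n; i++) {
        for (size_t j = 0; j < portions; j++) {
            uint8_t a = 0, b = 0;
            for (int k = 0; k < 8; k++) {
                size_t t = j * 8 + k;                  /* position along the row/column */
                a = (uint8_t)((a << 1) | (A[i + t * n] & 1));          /* A[i][t] */
                b = (uint8_t)((b << 1) | (B[t + (size_t)i * n] & 1));  /* B[t][i] */
            }
            pa[i * portions + j] = a;                  /* stride is portions, not n */
            pb[i * portions + j] = b;
        }
    }
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            int cij = 0;                               /* start from 0, not from old C */
            for (size_t k = 0; k < portions; k++) {
                uint8_t t = pa[i * portions + k] & pb[j * portions + k];
                t ^= t >> 4;                           /* fold the byte down to its parity */
                t ^= t >> 2;
                t ^= t >> 1;
                cij ^= t & 1;
            }
            C[i + (size_t)j * n] = cij;
        }
    }
    free(pa);
    free(pb);
    return 0;
}
```

Multiplying by an identity matrix is a quick sanity check: the result should equal the other operand regardless of what C held beforehand.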

Related

As a result of processing arrays -nan(ind)

I am writing a program that creates arrays of a given length and manipulates them. No additional libraries may be used.
First, an array M1 of length N is generated, followed by an array M2 of length N/2.
In M1, each element is divided by Pi and then raised to the third power.
Then, in M2, each element is added to the previous one, and the absolute value of the tangent of the sum is taken.
After that, each element of M1 is raised to the power of the element of M2 with the same index, and the resulting array is sorted with gnome sort (dwarf_sort in the code).
Finally, the sum of the sines is computed over those elements of M2 whose quotient by the minimum non-zero element of M2 has an even integer part.
The problem is that the resulting X is -nan(ind). I can't figure out exactly where the error is.
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
const int A = 441;
const double PI = 3.1415926535897931159979635;
inline void dwarf_sort(double* array, int size) {
size_t i = 1;
while (i < size) {
if (i == 0) {
i = 1;
}
if (array[i - 1] <= array[i]) {
++i;
}
else
{
long tmp = array[i];
array[i] = array[i - 1];
array[i - 1] = tmp;
--i;
}
}
}
inline double reduce(double* array, int size) {
size_t i;
double min = RAND_MAX, sum = 0;
for (i = 0; i < size; ++i) {
if (array[i] < min && array[i] != 0) {
min = array[i];
}
}
for (i = 0; i < size; ++i) {
if ((int)(array[i] / min) % 2 == 0) {
sum += sin(array[i]);
}
}
return sum;
}
int main(int argc, char* argv[])
{
int i, N, j;
double* M1 = NULL, * M2 = NULL, * M2_copy = NULL;
double X;
unsigned int seed = 0;
N = atoi(argv[1]); /* N is the first command-line argument */
M1 = malloc(N * sizeof(double));
M2 = malloc(N / 2 * sizeof(double));
M2_copy = malloc(N / 2 * sizeof(double));
for (i = 0; i < 100; i++)
{
seed = i;
srand(i);
/*generate*/
for (j = 0; j < N; ++j) {
M1[j] = (rand_r(&seed) % A) + 1;
}
for (j = 0; j < N / 2; ++j) {
M2[j] = (rand_r(&seed) % (10 * A)) + 1;
}
/*map*/
for (j = 0; j < N; ++j)
{
M1[j] = pow(M1[j] / PI, 3);
}
for (j = 0; j < N / 2; ++j) {
M2_copy[j] = M2[j];
}
M2[0] = fabs(tan(M2_copy[0]));
for (j = 0; j < N / 2; ++j) {
M2[j] = fabs(tan(M2[j] + M2_copy[j]));
}
/*merge*/
for (j = 0; j < N / 2; ++j) {
M2[j] = pow(M1[j], M2[j]);
}
/*sort*/
dwarf_sort(M2, N / 2);
/*sort*/
X = reduce(M2, N / 2);
}
printf("\nN=%d.\n", N);
printf("X=%f\n", X);
return 0;
}
Does anyone see where my mistake is? I suspect I'm using the wrong data types for some variables, but I still can't solve the problem.

Replace the /* merge */ part with this:
/*merge*/
for (j = 0; j < N / 2; ++j) {
printf("%f %f ", M1[j], M2[j]);
M2[j] = pow(M1[j], M2[j]);
printf("%f\n", M2[j]);
}
This prints the operands and the result of each pow operation. You'll see that some of these values are huge, resulting in an overflow of double.
Something like pow(593419.97, 31.80) will not end well.

I am trying to improve the performance speed of my cross-correlation algorithm. What things can I do to make my C code run faster?

I created a cross-correlation algorithm, and I am trying to maximize its performance by reducing the time it takes for it to run. First of all, I reduced the number of function calls within the "crossCorrelationV2" function. Second, I created several macros at the top of the program for constants. Third, I reduced the number of loops that are inside the "crossCorrelationV2" function. The code that you see is the most recent code that I have.
Are there any other methods I can use to try and reduce the processing time of my code?
Let's assume that I am only focused on the functions "crossCorrelationV2" and "createAnalyzingWave".
I would be glad for any advice, whether in general about programming or pertaining to those two specific functions; I am a beginner programmer. Thanks.
#include <stdio.h>
#include <stdlib.h>
#define ARRAYSIZE 4096
#define PULSESNUMBER 16
#define DATAFREQ 1300
// Print the contents of the array onto the console.
void printArray(double array[], int size){
int k;
for (k = 0; k < size; k++){
printf("%lf ", array[k]);
}
printf("\n");
}
// Creates analyzing square wave. This square wave has unity (1) magnitude.
// The number of high values in each period is determined by high values = (analyzingT/2) / time increment
void createAnalyzingWave(double analyzingFreq, double wave[]){
int highValues = (1 / analyzingFreq) * 0.5 / ((PULSESNUMBER * (1 / DATAFREQ) / ARRAYSIZE));
int counter = 0;
int p;
for(p = 1; p <= ARRAYSIZE; p++){
if ((counter % 2) == 0){
wave[p - 1] = 1;
} else{
wave[p - 1] = 0;
}
if (p % highValues == 0){
counter++;
}
}
}
// Creates data square wave (for testing purposes, for the real implementation actual ADC data will be used). This
// square wave has unity magnitude.
// The number of high values in each period is determined by high values = array size / (2 * number of pulses)
void createDataWave(double wave[]){
int highValues = ARRAYSIZE / (2 * PULSESNUMBER);
int counter = 0;
int p;
for(p = 0; p < ARRAYSIZE; p++){
if ((counter % 2) == 0){
wave[p] = 1;
} else{
wave[p] = 0;
}
if ((p + 1) % highValues == 0){
counter++;
}
}
}
// Finds the average of all the values inside an array
double arrayAverage(double array[], int size){
int i;
double sum = 0;
// Same thing as for(i = 0; i < arraySize; i++)
for(i = size; i--; ){
sum = array[i] + sum;
}
return sum / size;
}
// Cross-Correlation algorithm
double crossCorrelationV2(double dataWave[], double analyzingWave[]){
int bigArraySize = (2 * ARRAYSIZE) - 1;
// Expand analyzing array into array of size 2arraySize-1
int lastArrayIndex = ARRAYSIZE - 1;
int lastBigArrayIndex = 2 * ARRAYSIZE - 2; //bigArraySize - 1; //2 * arraySize - 2;
double bigAnalyzingArray[bigArraySize];
int i;
int b;
// Set first few elements of the array equal to analyzingWave
// Set remainder of big analyzing array to 0
for(i = 0; i < ARRAYSIZE; i++){
bigAnalyzingArray[i] = analyzingWave[i];
bigAnalyzingArray[i + ARRAYSIZE] = 0;
}
double maxCorrelationValue = 0;
double currentCorrelationValue;
// "Beginning" of correlation algorithm proper
for(i = 0; i < bigArraySize; i++){
currentCorrelationValue = 0;
for(b = lastBigArrayIndex; b > 0; b--){
if (b >= lastArrayIndex){
currentCorrelationValue = dataWave[b - lastBigArrayIndex / 2] * bigAnalyzingArray[b] + currentCorrelationValue;
}
bigAnalyzingArray[b] = bigAnalyzingArray[b - 1];
}
bigAnalyzingArray[0] = 0;
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
}
return maxCorrelationValue;
}
int main(){
int samplesNumber = 25;
double analyzingFreq = 1300;
double analyzingWave[ARRAYSIZE];
double dataWave[ARRAYSIZE];
createAnalyzingWave(analyzingFreq, analyzingWave);
//createDataWave(arraySize, pulsesNumber, dataWave);
double maximumCorrelationArray[samplesNumber];
int i;
for(i = 0; i < samplesNumber; i++){
createDataWave(dataWave);
maximumCorrelationArray[i] = crossCorrelationV2(dataWave, analyzingWave);
}
printf("Average of the array values: %lf\n", arrayAverage(maximumCorrelationArray, samplesNumber));
return 0;
}

The first point is that you are explicitly shifting the analyzing array; this requires twice as much memory, and moving the items takes about 50% of your time. In a test here, the program takes 4.1 seconds with crossCorrelationV2 and ~2.0 seconds with crossCorrelationV3.
The next thing is that you are spending time multiplying by zero on the padded array. Removing that, removing the padding as well, and simplifying the indices gives crossCorrelationV4, with which the program runs in ~1.0 second.
// Cross-Correlation algorithm
double crossCorrelationV3(double dataWave[], double analyzingWave[]){
int bigArraySize = (2 * ARRAYSIZE) - 1;
// Expand analyzing array into array of size 2arraySize-1
int lastArrayIndex = ARRAYSIZE - 1;
int lastBigArrayIndex = 2 * ARRAYSIZE - 2; //bigArraySize - 1; //2 * arraySize - 2;
double bigAnalyzingArray[bigArraySize];
int i;
int b;
// Set first few elements of the array equal to analyzingWave
// Set remainder of big analyzing array to 0
for(i = 0; i < ARRAYSIZE; i++){
bigAnalyzingArray[i] = analyzingWave[i];
bigAnalyzingArray[i + ARRAYSIZE] = 0;
}
double maxCorrelationValue = 0;
double currentCorrelationValue;
// "Beginning" of correlation algorithm proper
for(i = 0; i < bigArraySize; i++){
currentCorrelationValue = 0;
// Instead of checking if b >= lastArrayIndex inside the loop I use it as
// a stopping condition. The extra b >= i test keeps the index b - i
// non-negative; positions shifted in from the left are zeros and add nothing.
for(b = lastBigArrayIndex; b >= lastArrayIndex && b >= i; b--){
// instead of shifting bigAnalyzingArray[b] = bigAnalyzingArray[b-1] every iteration
// I simply use bigAnalyzingArray[b-i]
currentCorrelationValue = dataWave[b - lastBigArrayIndex / 2] * bigAnalyzingArray[b - i] + currentCorrelationValue;
}
// bigAnalyzingArray is no longer shifted, so element 0 must not be overwritten here
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
}
return maxCorrelationValue;
}
// Cross-Correlation algorithm
double crossCorrelationV4(double dataWave[], double analyzingWave[]){
int bigArraySize = (2 * ARRAYSIZE) - 1;
// Expand analyzing array into array of size 2arraySize-1
int lastArrayIndex = ARRAYSIZE - 1;
int lastBigArrayIndex = 2 * ARRAYSIZE - 2; //bigArraySize - 1; //2 * arraySize - 2;
// I will not allocate the bigAnalyzingArray here
// double bigAnalyzingArray[bigArraySize];
int i;
int b;
// I will not copy the analyzingWave to bigAnalyzingArray
// for(i = 0; i < ARRAYSIZE; i++){
// bigAnalyzingArray[i] = analyzingWave[i];
// bigAnalyzingArray[i + ARRAYSIZE] = 0;
// }
double maxCorrelationValue = 0;
double currentCorrelationValue;
// Compute the correlation by symmetric pairs
// the idea here is to simplify the indices of the inner loops since
// they are computed more times.
for(i = 0; i < lastArrayIndex; i++){
currentCorrelationValue = 0;
for(b = lastArrayIndex - i; b >= 0; b--){
// instead of shifting a padded copy, index analyzingWave directly,
// offset by the current lag i
currentCorrelationValue += dataWave[b] * analyzingWave[b + i];
}
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
if(i != 0){
currentCorrelationValue = 0;
// Correlate shifting to the other side
for(b = lastArrayIndex - i; b >= 0; b--){
// same idea with the roles swapped: dataWave is offset by the lag i
currentCorrelationValue += dataWave[b + i] * analyzingWave[b];
}
}
if (currentCorrelationValue > maxCorrelationValue){
maxCorrelationValue = currentCorrelationValue;
}
}
}
return maxCorrelationValue;
}
If you want more optimization, you can unroll some iterations of the loop and enable compiler optimizations such as vectorization.
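As a rough illustration of unrolling, the inner loops of crossCorrelationV4 are plain dot products, so they could be factored into a helper that keeps four independent accumulators. This is only a sketch: the name dot_unrolled4 is hypothetical, and because the additions happen in a different order, results can differ in the last bits for general floating-point data (for 0/1 square waves the sums are exact).

```c
#include <stddef.h>

/* Sum of a[k] * b[k] for k in [0, n), unrolled by four.
 * Independent accumulators give the compiler room to vectorize. */
double dot_unrolled4(const double *a, const double *b, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        s0 += a[k] * b[k];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }
    for (; k < n; k++) {        /* leftover elements when n is not a multiple of 4 */
        s0 += a[k] * b[k];
    }
    return (s0 + s1) + (s2 + s3);
}
```

With this helper, the first inner loop of crossCorrelationV4 would correspond to dot_unrolled4(dataWave, analyzingWave + i, ARRAYSIZE - i) and the second to dot_unrolled4(dataWave + i, analyzingWave, ARRAYSIZE - i). With GCC or Clang, flags such as -O3 and -march=native let the compiler apply vector instructions; whether that helps should be measured on the target machine.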

problem calculating the inverse of a matrix

I'm trying to calculate the inverse of a square matrix of any size N x N. I'm using a struct to store the values of the matrix, which works well, and I am already able to calculate the determinant. But there must be some issue with the inverse function. This is the code:
struct m{
size_t row;
size_t col;
double *data;
};
void inverse(size_t n, struct m *A) /*Calculate the inverse of A */
{
size_t i,j,i_count,j_count, count=0;
double det = determinant(n, A);
size_t id = 0;
double *d;
struct m C; /*The Adjoint matrix */
C.data = malloc(sizeof(double) * n * n);
C.row = n;
C.col = n;
struct m *minor; /*matrices obtained by removing the i row and j column*/
if (!(minor = malloc(n*n*(n+1)*sizeof *minor))) {
perror ("malloc-minor");
exit(-1);
}
if (det == 0){
printf("The matrix is singular\n");
exit(1);
}
for(id=0; id < n*n; id++){
d = minor[id].data = malloc(sizeof(double) * (n-1) * (n-1));
for(count=0; count < n; count++)
{
//Creating array of Minors
i_count = 0;
for(i = 0; i < n; i++)
{
j_count=0;
for(j = 0; j < n; j++)
{
if(j == count)
continue; // don't copy the minor column element
*d = A->data[i * A->col + j];
d++;
j_count++;
}
i_count++;
}
}
}
for(id=0; id < n*n; id++){
for(i=0; i < n; i++){
for(j=0; j < n; j++)
C.data[i * C.col + j] = determinant(n-1,&minor[id]);//Recursive call
}
}
transpose(&C);
scalar_product(1/det, &C);
*A = C;
}
The determinant is calculated recursively with this algorithm:
double determinant(size_t n, struct m *A)
{
size_t i,j,i_count,j_count, count=0;
double det = 0;
if(n < 1)
{
printf("Error\n");
exit(1);
}
if(n==1) return A->data[0];
else if(n==2) return (A->data[0]* A->data[1 * A->col + 1] - A->data[0 + 1] * A->data[1*A->col + 0]);
else{
struct m C;
C.row = A->row-1;
C.col = A->col-1;
C.data = malloc(sizeof(double) * (A->row-1) * (A->col-1));
for(count=0; count < n; count++)
{
//Creating array of Minors
i_count = 0;
for(i = 1; i < n; i++)
{
j_count=0;
for(j = 0; j < n; j++)
{
if(j == count)
continue; // don't copy the minor column element
C.data[i_count * C.col + j_count] = A->data[i * A->col + j];
j_count++;
}
i_count++;
}
det += pow(-1, count) * A->data[count] * determinant(n-1,&C);//Recursive call
}
free(C.data);
return det;
}
}
You can find the complete code here: https://ideone.com/gQRwVu.

Use some other variable in the loop after:
det += pow(-1, count) * A->data[count] * determinant(n-1, &C);

Your calculation of the inverse doesn't quite correspond to the algorithm described e.g. in "Inverse of a Matrix using Minors, Cofactors and Adjugate", even taking into account that you have for now omitted the adjugate and division step. Compare your outermost for loop in inverse() to this working implementation:
double Rdata[(n-1)*(n-1)]; // remaining data values
struct m R = { n-1, n-1, Rdata }; // matrix structure for them
for (count = 0; count < n*n; count++) // Create n*n Matrix of Minors
{
int row = count/n, col = count%n;
for (i_count = i = 0; i < n; i++)
if (i != row) // don't copy the current row
{
for (j_count = j = 0; j < n; j++)
if (j != col) // don't copy the current column
Rdata[i_count*R.col+j_count++] = A->data[i*A->col+j];
i_count++;
}
// transpose by swapping row and column
C.data[col*C.col+row] = pow(-1, row&1 ^ col&1) * determinant(n-1, &R) / det;
}
It yields for the given input data the correct inverse matrix
1 2 -4.5
0 -1 1.5
0 0 0.5
(already transposed and divided by the determinant of the original matrix).
Minor notes:
The *A = C; at the end of inverse() loses the original data pointer of *A.
The formatting function f() is wrong for negative values, since the fraction is also negative in this case. You could write if (fabs(f)<.00001).
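For the first minor note, one way to avoid losing the old buffer is to release it before the struct is overwritten. This is a sketch under the assumption that A->data was obtained from malloc (or is NULL) and that the caller does not use C afterwards; the helper name matrix_replace is hypothetical.

```c
#include <stdlib.h>

struct m {
    size_t row;
    size_t col;
    double *data;
};

/* Make *A take over C's storage, freeing A's previous buffer first.
 * After the call, A->data and C.data refer to the same allocation,
 * so only A should free it. */
void matrix_replace(struct m *A, struct m C) {
    free(A->data);   /* release the original data instead of leaking it */
    *A = C;
}
```

In inverse(), the final *A = C; would then become matrix_replace(A, C);.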

Image Dividing in C for DCT

Can anyone please tell us how to divide the image into 8X8 blocks?
I can read the image, but not divide it into an 8x8 submatrix for DCT.
int main()
{
FILE *image_raw;
unsigned char **matriz_image;
int i, j;
int rows=1080, colums=1920;
matriz_image = (unsigned char **) malloc (rows*sizeof(unsigned char *));
//i create dinamic colums
for(i=0; i<rows; i++)
{
matriz_image[i] = (unsigned char *) malloc (colums*sizeof(unsigned char ));
}
//i open image raw
image_raw = fopen("imag.dat", "r+b");
//i copy values to matriz_image
for (i = 0; i < rows; ++i)
{
fread(matriz_image[i], sizeof(unsigned char ), colums, image_raw);
}
for(i=0; i<rows; i++)
{
for(j=0; j<colums; j++)
{
// printf("%i ",*(*(matriz_image+i)+j));
printf("%i ",matriz_image[i][j]);
}
printf("\n");
}
return 0;
}

You could do something like this:
void dct(unsigned char **m, int baserow, int basecol)
{
for (int row = baserow, endrow = baserow + 8; row < endrow; ++row)
for (int col = basecol, endcol = basecol + 8; col < endcol; ++col)
; // operate on m[row][col]
}
int do_dcts(unsigned char **m, int num_rows, int num_cols)
{
if (num_rows <= 0 || num_rows % 8 || num_cols <= 0 || num_cols % 8)
return -1;
for (int row = 0; row < num_rows; row += 8)
for (int col = 0; col < num_cols; col += 8)
dct(m, row, col);
return 0;
}
You are wasting space and worsening your memory locality by implementing your 2D array using two levels of pointers. It's better to do one allocation and then offset into the array appropriately like so:
int main()
{
FILE *image_raw;
unsigned char *matriz_image;
int i, j;
int rows=1080, colums=1920;
matriz_image = malloc(rows*colums*sizeof(unsigned char));
...
If you can make rows and colums constants or have VLAs, then you can do:
unsigned char (*m)[colums] = (unsigned char (*)[colums]) matriz_image;
m[5][2] = 2; // double indexed access without extra pointers + allocs
Similarly you can pass m's kind of pointer to your matrix to your functions to operate on it.
If you can't make rows and colums be compile-time constants and you don't have VLAs, then you can write helper functions to do the pointer arithmetic for you:
static inline unsigned char *get_row(unsigned char *m, int num_cols, int row)
{
return &m[row * num_cols];
}
static inline unsigned char *get_elem(unsigned char *m, int num_cols, int row, int col)
{
return &m[row * num_cols + col];
}
...
*get_elem(m, colums, 5, 2) = 2; // double indexing not as nice but good memory usage
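Building on the single-allocation layout above, an 8x8 block can also be copied out into a small local buffer before the DCT is applied to it. This is a minimal sketch; the name copy_block8 is hypothetical, and the caller is assumed to make sure the block lies entirely inside the image.

```c
#include <stddef.h>

/* Copy the 8x8 block whose top-left pixel is (baserow, basecol) out of a
 * row-major image m with num_cols columns into block. */
void copy_block8(const unsigned char *m, size_t num_cols,
                 size_t baserow, size_t basecol,
                 unsigned char block[8][8]) {
    for (size_t r = 0; r < 8; r++) {
        for (size_t c = 0; c < 8; c++) {
            block[r][c] = m[(baserow + r) * num_cols + (basecol + c)];
        }
    }
}
```

Looping baserow and basecol in steps of 8 over the whole image, as do_dcts does, then yields every block in turn.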
If you really need to get fast for these operations, then as you read your image in, you could reorganize it to lay the 8x8 bytes blocks contiguously in memory to have the best possible cache performance:
// organize m like m[rows * colums / 64][8][8]; so first index is an 8x8 block #
for (int k = 0; k < rows / 8; ++k) // read all rows in chunks of 8
for (int i = 0; i < 8; ++i) // read 8 rows
for (int j = 0; j < colums / 8; ++j) // read 1 row in 8 byte chunks
fread(&m[k * 8 * colums + i * 8 + j * 64], 1, 8, image_raw);
...
typedef unsigned char (*block_ptr)[8];
static inline block_ptr get_block(unsigned char *m, int num_cols, int block_num)
{
return (block_ptr) &m[block_num * 64];
}
static inline block_ptr get_block2(unsigned char *m, int num_cols, int row, int col)
{
if (row % 8 || col % 8)
return NULL;
return (block_ptr) &m[row * num_cols + col * 8];
}
...
for (int k = 0; k < rows * colums / 64; ++k)
{
block_ptr block = get_block(m, colums, k);
for (int i = 0; i < 8; ++i)
for (int j = 0; j < 8; ++j)
; // operate on block[i][j];
}

Allocate 3D matrix in one big chunk

I'd like to allocate a 3D matrix in one big chunk. It should be possible to access this matrix in the [i][j][k] fashion, without having to calculate the linearized index every time.
I think it should be something like below, but I'm having trouble filling the ...
double ****matrix = (double ****) malloc(...)
for (int i = 0; i < imax; i++) {
matrix[i] = &matrix[...]
for (int j = 0; j < jmax; j++) {
matrix[i][j] = &matrix[...]
for (int k = 0; k < kmax; k++) {
matrix[i][j][k] = &matrix[...]
}
}
}

For the single allocation to be possible and work, you need to lay out the resulting memory like this:
imax units of double **
imax * jmax units of double *
imax * jmax * kmax units of double
Further, the 'imax units of double **' must be allocated first; you can reorder the other two sections, but it is most sensible to deal with them in the order listed.
You also need to be able to assume that double and double * (and double **, but that's not much of a stretch) are sufficiently well aligned that you can simply allocate the chunks contiguously. That is going to hold OK on most 64-bit systems with type double, but be aware of the possibility that it does not hold on 32-bit systems or for other types than double (basically, the assumption could be problematic when sizeof(double) != sizeof(double *)).
With those caveats made, then this code works cleanly (tested on Mac OS X 10.10.2 with GCC 4.9.1 and Valgrind version valgrind-3.11.0.SVN):
#include <stdio.h>
#include <stdlib.h>
typedef double Element;
static Element ***alloc_3d_matrix(size_t imax, size_t jmax, size_t kmax)
{
size_t i_size = imax * sizeof(Element **);
size_t j_size = imax * jmax * sizeof(Element *);
size_t k_size = imax * jmax * kmax * sizeof(Element);
Element ***matrix = malloc(i_size + j_size + k_size);
if (matrix == 0)
return 0;
printf("i = %zu, j = %zu, k = %zu; sizes: i = %zu, j = %zu, k = %zu; "
"%zu bytes total\n",
imax, jmax, kmax, i_size, j_size, k_size, i_size + j_size + k_size);
printf("matrix = %p .. %p\n", (void *)matrix,
(void *)((char *)matrix + i_size + j_size + k_size));
Element **j_base = (void *)((char *)matrix + imax * sizeof(Element **));
printf("j_base = %p\n", (void *)j_base);
for (size_t i = 0; i < imax; i++)
{
matrix[i] = &j_base[i * jmax];
printf("matrix[%zu] = %p (%p)\n",
i, (void *)matrix[i], (void *)&matrix[i]);
}
Element *k_base = (void *)((char *)j_base + imax * jmax * sizeof(Element *));
printf("k_base = %p\n", (void *)k_base);
for (size_t i = 0; i < imax; i++)
{
for (size_t j = 0; j < jmax; j++)
{
matrix[i][j] = &k_base[(i * jmax + j) * kmax];
printf("matrix[%zu][%zu] = %p (%p)\n",
i, j, (void *)matrix[i][j], (void *)&matrix[i][j]);
}
}
/* Diagnostic only */
for (size_t i = 0; i < imax; i++)
{
for (size_t j = 0; j < jmax; j++)
{
for (size_t k = 0; k < kmax; k++)
printf("matrix[%zu][%zu][%zu] = %p\n",
i, j, k, (void *)&matrix[i][j][k]);
}
}
return matrix;
}
int main(void)
{
size_t i_max = 3;
size_t j_max = 4;
size_t k_max = 5;
Element ***matrix = alloc_3d_matrix(i_max, j_max, k_max);
if (matrix == 0)
{
fprintf(stderr, "Failed to allocate matrix[%zu][%zu][%zu]\n", i_max, j_max, k_max);
return 1;
}
for (size_t i = 0; i < i_max; i++)
{
for (size_t j = 0; j < j_max; j++)
{
for (size_t k = 0; k < k_max; k++)
matrix[i][j][k] = (i + 1) * 100 + (j + 1) * 10 + k + 1;
}
}
for (size_t i = 0; i < i_max; i++)
{
for (size_t j = 0; j < j_max; j++)
{
for (size_t k = k_max; k > 0; k--)
printf("[%zu][%zu][%zu] = %6.0f\n", i, j, k-1, matrix[i][j][k-1]);
}
}
free(matrix);
return 0;
}
Example output (with some boring bits omitted):
i = 3, j = 4, k = 5; sizes: i = 24, j = 96, k = 480; 600 bytes total
matrix = 0x100821630 .. 0x100821888
j_base = 0x100821648
matrix[0] = 0x100821648 (0x100821630)
matrix[1] = 0x100821668 (0x100821638)
matrix[2] = 0x100821688 (0x100821640)
k_base = 0x1008216a8
matrix[0][0] = 0x1008216a8 (0x100821648)
matrix[0][1] = 0x1008216d0 (0x100821650)
matrix[0][2] = 0x1008216f8 (0x100821658)
matrix[0][3] = 0x100821720 (0x100821660)
matrix[1][0] = 0x100821748 (0x100821668)
matrix[1][1] = 0x100821770 (0x100821670)
matrix[1][2] = 0x100821798 (0x100821678)
matrix[1][3] = 0x1008217c0 (0x100821680)
matrix[2][0] = 0x1008217e8 (0x100821688)
matrix[2][1] = 0x100821810 (0x100821690)
matrix[2][2] = 0x100821838 (0x100821698)
matrix[2][3] = 0x100821860 (0x1008216a0)
matrix[0][0][0] = 0x1008216a8
matrix[0][0][1] = 0x1008216b0
matrix[0][0][2] = 0x1008216b8
matrix[0][0][3] = 0x1008216c0
matrix[0][0][4] = 0x1008216c8
matrix[0][1][0] = 0x1008216d0
matrix[0][1][1] = 0x1008216d8
matrix[0][1][2] = 0x1008216e0
matrix[0][1][3] = 0x1008216e8
matrix[0][1][4] = 0x1008216f0
matrix[0][2][0] = 0x1008216f8
…
matrix[2][2][4] = 0x100821858
matrix[2][3][0] = 0x100821860
matrix[2][3][1] = 0x100821868
matrix[2][3][2] = 0x100821870
matrix[2][3][3] = 0x100821878
matrix[2][3][4] = 0x100821880
[0][0][4] = 115
[0][0][3] = 114
[0][0][2] = 113
[0][0][1] = 112
[0][0][0] = 111
[0][1][4] = 125
[0][1][3] = 124
[0][1][2] = 123
[0][1][1] = 122
[0][1][0] = 121
[0][2][4] = 135
…
[2][2][0] = 331
[2][3][4] = 345
[2][3][3] = 344
[2][3][2] = 343
[2][3][1] = 342
[2][3][0] = 341
There is a lot of diagnostic output in the code shown.
This code does not require support for variable-length arrays (VLAs). As written it needs C99 or later, because it declares variables in for loops and uses the %zu format; with the declarations moved outside the loops and the format specifiers changed, it can compile as C89.
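If the alignment assumption mentioned earlier is a concern, the start of each section can be rounded up to the alignment of the type stored there. The sketch below only computes the offsets and is not part of the tested code above; the names align_up, layout3d and layout_3d are hypothetical, and it relies on malloc returning memory that is suitably aligned for any object type, so aligned offsets give aligned addresses.

```c
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t offset, size_t align) {
    return (offset + align - 1) & ~(align - 1);
}

/* Byte offsets of the double* section and the double section inside
 * one allocation, plus the total size to pass to malloc. */
struct layout3d {
    size_t j_offset;
    size_t k_offset;
    size_t total;
};

struct layout3d layout_3d(size_t imax, size_t jmax, size_t kmax) {
    struct layout3d l;
    /* section 1: imax pointers of type double ** start at offset 0 */
    l.j_offset = align_up(imax * sizeof(double **), _Alignof(double *));
    /* section 2: imax * jmax pointers of type double * */
    l.k_offset = align_up(l.j_offset + imax * jmax * sizeof(double *),
                          _Alignof(double));
    /* section 3: imax * jmax * kmax doubles */
    l.total = l.k_offset + imax * jmax * kmax * sizeof(double);
    return l;
}
```

On a typical 64-bit system no padding is inserted, and for a 3 x 4 x 5 matrix the total comes out at the same 600 bytes shown in the example output.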

This can be done with one simple malloc() call in C (not in C++, though, there are no variable length arrays in C++):
void foo(int imax, int jmax, int kmax) {
double (*matrix)[jmax][kmax] = malloc(imax*sizeof(*matrix));
//Allocation done. Now fill the matrix:
for(int i = 0; i < imax; i++) {
for(int j = 0; j < jmax; j++) {
for(int k = 0; k < kmax; k++) {
matrix[i][j][k] = ...
}
}
}
}
Note that C allows jmax and kmax to be dynamic values that are only known at runtime. That is the ability that's missing in C++, which makes C arrays much more powerful than their C++ counterpart.
The only drawback of this approach, as WhozCraig rightly notes, is that you can't return the resulting matrix as the return value of the function without resorting to a void*. However, you can return it by reference like this:
void foo(int imax, int jmax, int kmax, double (**outMatrix)[jmax][kmax]) {
*outMatrix = malloc(imax*sizeof(**outMatrix));
double (*matrix)[jmax][kmax] = *outMatrix; //avoid having to write (*outMatrix)[i][j][k] everywhere
... //as above
}
This function would need to be called like this:
int imax = ..., jmax = ..., kmax = ...;
double (*myMatrix)[jmax][kmax];
foo(imax, jmax, kmax, &myMatrix);
That way you get full type checking on the inner two dimension sizes even though they are runtime values.
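A complete, small instance of the same idea, assuming a compiler that supports variable-length array types (they are optional in C11). The function name vla_fill_and_sum is hypothetical; it allocates the matrix, fills it through [i][j][k] indexing, sums the elements, and frees everything with a single free().

```c
#include <stdlib.h>

/* Allocates an imax x jmax x kmax matrix in one chunk, fills element
 * [i][j][k] with i*100 + j*10 + k, and returns the sum of all elements.
 * Returns -1.0 if the allocation fails. */
double vla_fill_and_sum(int imax, int jmax, int kmax) {
    /* pointer to a jmax x kmax plane; the sizes are runtime values */
    double (*matrix)[jmax][kmax] =
        malloc((size_t)imax * sizeof(double[jmax][kmax]));
    if (matrix == NULL) {
        return -1.0;
    }
    double sum = 0.0;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            for (int k = 0; k < kmax; k++) {
                matrix[i][j][k] = i * 100 + j * 10 + k;
                sum += matrix[i][j][k];
            }
        }
    }
    free(matrix);   /* one allocation, one free */
    return sum;
}
```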

Note: This was intended to be a comment but it got too long, until it turned into a proper answer.
You can't use a single chunk of memory without performing some calculations.
Note that the beginning of each row is marked by the formula
// row_begin is the memory address of the row at index row_idx
row_begin = row_idx * jmax * kmax
And then, each column depends on where the row starts:
// column_begin is the memory address of the column
// at index column_idx of the row starting at row_begin
column_begin = row_begin + column_idx * kmax
Which, using absolute addresses (relative to the matrix pointer, of course) translates to:
column_begin = (row_idx * jmax * kmax) + column_idx * kmax
Finally, the address of an element at index k within that column follows from the same rule (which could be extended in the same way to any number of dimensions):
// element address = row_address + column_address + element_k_index
element_address = column_begin + element_k_idx
Which translates to
element_address = (row_idx * jmax * kmax) + column_idx * kmax + element_k_idx
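The final formula can be wrapped in a small helper so the arithmetic is written only once. This is a sketch; the name idx3 is hypothetical.

```c
#include <stddef.h>

/* Linear offset of element [i][j][k] in a contiguous block laid out as
 * imax planes of jmax rows of kmax elements:
 * (i * jmax * kmax) + j * kmax + k, written in factored form. */
size_t idx3(size_t i, size_t j, size_t k, size_t jmax, size_t kmax) {
    return (i * jmax + j) * kmax + k;
}
```

With a flat allocation double *matrix = malloc(imax * jmax * kmax * sizeof *matrix), an element is then accessed as matrix[idx3(i, j, k, jmax, kmax)].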

This works for me:
void foo(int imax, int jmax, int kmax)
{
// Allocate memory for all the numbers.
// Think of this as (imax*jmax) number of memory chunks,
// with each chunk containing kmax doubles.
double* data_0 = malloc(imax*jmax*kmax*sizeof(double));
// Allocate memory for the previous dimension of pointers.
// Think of this as imax number of memory chunks,
// with each chunk containing jmax double*.
double** data_1 = malloc(imax*jmax*sizeof(double*));
// Allocate memory for the previous dimension of pointers.
double*** data_2 = malloc(imax*sizeof(double**));
for (int i = 0; i < imax; i++)
{
data_2[i] = &data_1[i*jmax];
for (int j = 0; j < jmax; j++)
{
data_1[i*jmax+j] = &data_0[(i*jmax+j)*kmax];
}
}
// That is the matrix.
double ***matrix = data_2;
for (int i = 0; i < imax; i++)
{
for (int j = 0; j < jmax; j++)
{
for (int k = 0; k < kmax; k++)
{
matrix[i][j][k] = i+j+k;
}
}
}
for (int i = 0; i < imax; i++)
{
for (int j = 0; j < jmax; j++)
{
for (int k = 0; k < kmax; k++)
{
printf("%lf ", matrix[i][j][k]);
}
printf("\n");
}
}
// Deallocate memory
free(data_2);
free(data_1);
free(data_0);
}
