This is my very first time working with SSE intrinsics. I am trying to convert a simple piece of code into a faster version using Intel SSE intrinsics (up to SSE4.2), but I keep running into errors.
The scalar version of the code (a simple matrix multiplication) is:
void mm(int n, double *A, double *B, double *C)
{
    int i, j, k;
    double tmp;
    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++) {
            tmp = 0.0;
            for(k = 0; k < n; k++)
                tmp += A[n*i+k] * B[n*k+j];
            C[n*i+j] = tmp;
        }
}
This is my version; I have included #include <ia32intrin.h>:
void mm_sse(int n, double *A, double *B, double *C)
{
    int i, j, k;
    double tmp;
    __m128d a_i, b_i, c_i;
    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++) {
            tmp = 0.0;
            for(k = 0; k < n; k += 4)
                a_i = __mm_load_ps(&A[n*i+k]);
                b_i = __mm_load_ps(&B[n*k+j]);
                c_i = __mm_load_ps(&C[n*i+j]);
            __m128d tmp1 = __mm_mul_ps(a_i, b_i);
            __m128d tmp2 = __mm_hadd_ps(tmp1, tmp1);
            __m128d tmp3 = __mm_add_ps(tmp2, tmp3);
            __mm_store_ps(&C[n*i+j], tmp3);
        }
}
Where am I going wrong with this? I am getting several errors like this:
mm_vec.c(84): error: a value of type "int" cannot be assigned to an entity of type "__m128d"
a_i = __mm_load_ps(&A[n*i+k]);
This is how I am compiling: icc -O2 mm_vec.c -o vec
Can someone please assist me in converting this code correctly? Thanks!
UPDATE:
According to your suggestions, I have made the following changes:
void mm_sse(int n, float *A, float *B, float *C)
{
    int i, j, k;
    float tmp;
    __m128 a_i, b_i, c_i;
    for(i = 0; i < n; i++)
        for(j = 0; j < n; j++) {
            tmp = 0.0;
            for(k = 0; k < n; k += 4)
                a_i = _mm_load_ps(&A[n*i+k]);
                b_i = _mm_load_ps(&B[n*k+j]);
                c_i = _mm_load_ps(&C[n*i+j]);
            __m128 tmp1 = _mm_mul_ps(a_i, b_i);
            __m128 tmp2 = _mm_hadd_ps(tmp1, tmp1);
            __m128 tmp3 = _mm_add_ps(tmp2, tmp3);
            _mm_store_ps(&C[n*i+j], tmp3);
        }
}
But now I seem to be getting a segmentation fault. I suspect this is because I am not accessing the array subscripts properly for arrays A, B and C. I am very new to this and not sure how to proceed.
Please help me determine the correct approach towards handling this code.
The error you're seeing is because you have too many underscores in the function names, e.g.:
__mm_mul_ps
should be:
_mm_mul_ps // Just one underscore up front
so the C compiler is assuming they return int since it hasn't seen a declaration.
Beyond this, though, there are further problems - you seem to be mixing calls to the double and single-precision float variants of the same instruction.
For example you have:
__m128d a_i, b_i, c_i;
but you call:
__mm_load_ps(&A[n*i+k]);
which returns a __m128 not a __m128d - you wanted to call:
_mm_load_pd
instead. Likewise for the other instructions if you want them to work on pairs of doubles.
If you're seeing unexplained segmentation faults in SSE code, I'd be inclined to guess that you've got memory alignment problems - pointers passed to SSE intrinsics (mostly¹) need to be 16-byte aligned. You can check this with a simple assert in your code, or check it in a debugger (you expect the last hex digit of the pointer to be 0 if it's aligned properly).
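For example, a minimal check of that kind might look like this (a sketch assuming the usual 16-byte requirement; the helper name is illustrative):
#include <assert.h>
#include <stdint.h>

/* Fires if p is not 16-byte aligned, as _mm_load_ps/_mm_load_pd require. */
static void assert_aligned16(const void *p) {
    assert(((uintptr_t)p & 15) == 0);
}
Call it as assert_aligned16(&A[n*i+k]); just before each load or store.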
If it isn't aligned right you need to make sure it is. For things not allocated with new/malloc() you can do this with a compiler extension (e.g. with gcc):
float a[16] __attribute__ ((aligned (16)));
This works provided your version of gcc has a maximum alignment large enough to support it, subject to a few other caveats about stack alignment. For dynamically allocated storage you'll want to use a platform-specific extension, e.g. posix_memalign, to allocate suitable storage:
float *a = NULL;
posix_memalign((void**)&a, __alignof__(__m128), sizeof(float)*16);
(I think there might be nicer, portable ways of doing this with C++11 but I'm not 100% sure on that yet).
¹ There are some instructions which allow you to do unaligned loads and stores, but they're terribly slow compared to aligned loads and worth avoiding if at all possible.
You need to make sure that your loads and stores are always accessing 16 byte aligned addresses. Alternatively, if you can't guarantee this for some reason, then use _mm_loadu_ps/_mm_storeu_ps instead of _mm_load_ps/_mm_store_ps - this will be less efficient but will not crash on misaligned addresses.
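For reference, a corrected sketch of the float version might look like this (an illustration combining the fixes above, not code from the question: it uses unaligned loads so alignment doesn't matter, gathers the strided column of B with _mm_set_ps since column j is not contiguous in memory, and assumes n is a multiple of 4):
#include <immintrin.h>

void mm_sse(int n, float *A, float *B, float *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            __m128 acc = _mm_setzero_ps();
            for (int k = 0; k < n; k += 4) {
                __m128 a = _mm_loadu_ps(&A[n*i + k]);   // 4 consecutive elements of row i of A
                __m128 b = _mm_set_ps(B[n*(k+3) + j], B[n*(k+2) + j],
                                      B[n*(k+1) + j], B[n*k + j]);  // column j of B, gathered
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
            }
            acc = _mm_hadd_ps(acc, acc);    // horizontal sum of the 4 partial products
            acc = _mm_hadd_ps(acc, acc);
            C[n*i + j] = _mm_cvtss_f32(acc);
        }
}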
Arithmetic Exception
// Compute y = A*x + b using the m x n matrix A
void fc(int m, int n, const float *x, const float *A, const float *b, float *y){
    int i, j;
    for (i = 0; i < m; i++){
        y[i] = b[i];
        for (j = 0; j < n; j++){
            y[i] += A[i * n + j] * x[j];
        }
    }
}
This is code that computes A*x + b for matrices. But, as shown in the photo, an arithmetic exception occurs. Why is this happening, even though it is only multiplication and nothing is divided by 0?
How can I solve this error?
Sorry that I cannot add the values; otherwise I would have to add the whole file here. These are the parameters of a neural network, so I would have to attach the .dat files, and then I would also need the other code that loads those files. Also, I do not know how to extract only the numbers from the .dat files; they are rather oddly encoded.
I will provide all the other information otherwise, so please don't close this question; I really want to know why this happens and how to solve it.
This is another example of the exception (see the linked example). What I want to know is how this can happen even though nothing is divided by 0 in this example, and how I can interpret this situation.
According to your image, your matrix size is 100x50 (m, n), which means 5000 elements. But you accessed A[j*m+i] with j equal to 55 and i equal to 0, which means accessing element 5500 of the array - that is out of bounds.
#include <stdio.h>

void fc(int m, int n, const float *x, const float *A, const float *b, float *y){
    int i, j;
    for (i = 0; i < m; i++){
        y[i] = b[i];
        for (j = 0; j < n; j++){
            y[i] += A[i * n + j] * x[j];
        }
    }
}
int main()
{
    const float x[3] = {1, 1, 1};
    const float *xp = x;
    const float A[3][3] = {{1,1,1},{1,1,1},{1,1,1}};
    const float b[3] = {1, 1, 1};
    const float *bp = b;
    float y[3];
    float *yp = y;
    fc(3, 3, xp, *A, bp, yp);
    printf("%f %f %f ", y[0], y[1], y[2]);
    return 0;
}
I've tested the program with placeholder values of 1 for all variables and matrix sizes of 3x3 and 3x1. The result was correct, with no error:
4.000000 4.000000 4.000000
So the problem does not arise from the structure of your code; it definitely comes from a specific arithmetic problem.
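A cheap way to catch the out-of-bounds indexing described above during development is a bounds-checked accessor (a hypothetical helper, not part of the code in question):
#include <assert.h>

/* Returns A[i][j] for an m x n row-major matrix, aborting on bad indices. */
static float at(const float *A, int m, int n, int i, int j) {
    assert(i >= 0 && i < m && j >= 0 && j < n);
    return A[i * n + j];
}
With this in place, an access like at(A, 100, 50, 0, 55) fails immediately at the faulty line instead of raising a confusing exception later.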
I have just begun playing around with vectorising my code. My matrix-vector multiplication code is not being auto-vectorised by gcc, and I'd like to know why. This pastebin contains the output from -fopt-info-vec-missed.
I'm having trouble understanding what the output is telling me and seeing how it matches up to what I've written in the code.
For instance, I see a number of lines saying not enough data-refs in basic block, but I can't find much detail online about this with a Google search. I also see that there are issues relating to memory alignment, e.g. Unknown misalignment, naturally aligned and vector alignment may not be reachable. All of my memory allocation was for double types using malloc, which I believed was guaranteed to be suitably aligned for that type.
Environment: compiling with gcc on WSL2
gcc -v: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 4000 // Matrix size will be N x N
#define T 1

//gcc -fopenmp -g vectorisation.c -o main -O3 -march=native -fopt-info-vec-missed=missed.txt

void doParallelComputation(double *A, double *V, double *results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;

    #pragma omp parallel for simd private(j)
    for (i = 0; i < matrixSize; i++)
    {
        // double *AHead = &A[i * matrixSize];
        // double tmp = 0;
        for (j = 0; j < matrixSize; j++)
        {
            results[i] += A[i * matrixSize + j] * V[j];
            // also tried tmp += A[i * matrixSize + j] * V[j];
        }
        // results[i] = tmp;
    }
}
void genRandVector(double *S, unsigned long size)
{
    srand(time(0));
    unsigned long i;
    for (i = 0; i < size; i++)
    {
        double n = rand() % 5;
        S[i] = n;
    }
}

void genRandMatrix(double *A, unsigned long size)
{
    srand(time(0));
    unsigned long i, j;
    for (i = 0; i < size; i++)
    {
        for (j = 0; j < size; j++)
        {
            double n = rand() % 5;
            A[i*size + j] = n;
        }
    }
}
int main(int argc, char *argv[])
{
    double *V = (double *)malloc(N * sizeof(double));     // v in our A*v = parV computation
    double *parV = (double *)malloc(N * sizeof(double));  // Parallel computed vector
    double *A = (double *)malloc(N * N * sizeof(double)); // NxN Matrix to multiply by V
    genRandVector(V, N);
    genRandMatrix(A, N);  // fill A as well, so the multiply reads initialized data
    doParallelComputation(A, V, parV, N, T);
    free(parV);
    free(A);
    free(V);
    return 0;
}
Adding double *restrict results to promise non-overlapping input/output helped, without OpenMP but with -ffast-math. https://godbolt.org/z/qaPh1v
You need to tell OpenMP about reductions specifically, to let it relax FP-math associativity. (-ffast-math doesn't help the OpenMP vectorizer). With that as well, we get what you want:
#pragma omp simd reduction(+:tmp)
With just restrict and no -ffast-math or -fopenmp, you get total garbage: it does a SIMD FP multiply, but then unpacks that for 4x vaddsd into the scalar accumulator, not helping hide FP latency at all.
With restrict and -fopenmp (without fast-math), it just does scalar FMA.
With restrict and -ffast-math (without -fopenmp, or with the #pragma commented out) it auto-vectorizes nicely: vfmadd231pd ymm inside the loop, shuffle / add horizontal sum outside. (But doesn't parallelize.) https://godbolt.org/z/f36oG3
With restrict and -ffast-math (with -fopenmp) it still doesn't auto-vectorize. The OpenMP vectorizer is different, and maybe doesn't take advantage of fast-math, instead needing you to tell it about reductions?
Also note that with your data layout, the loop you want to parallelize (outer) is different from the loop you want to vectorize with SIMD (inner). Both the input "vectors" for the inner dot-product loop are in contiguous memory so it makes the most sense to read those, instead of trying to SIMD shuffle data from 4 different columns into one vector to accumulate 4 result[i+0..3] results in 1 vector.
However, unrolling the outer loop by 4 to use each V[j+0..3] with data from 4 different columns would improve computational intensity (closer to 1 load per FMA, rather than 2)
(As long as V[] and a row of the matrix fits in L1d cache, this is good. If not, it's actually pretty bad and should get cache-blocked. Or actually if you unroll the outer loop, 4 rows of the matrix.)
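A scalar sketch of that outer-loop unrolling idea (an illustration, assuming matrixSize is a multiple of 4; remainder handling omitted):
for (i = 0; i < matrixSize; i += 4) {
    double t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    for (j = 0; j < matrixSize; j++) {
        double v = V[j];                        // each load of V[j] now feeds 4 FMAs
        t0 += A[(i + 0) * matrixSize + j] * v;
        t1 += A[(i + 1) * matrixSize + j] * v;
        t2 += A[(i + 2) * matrixSize + j] * v;
        t3 += A[(i + 3) * matrixSize + j] * v;
    }
    results[i + 0] = t0;
    results[i + 1] = t1;
    results[i + 2] = t2;
    results[i + 3] = t3;
}
The compiler can then vectorize the inner loop so one SIMD load of V[j+0..3] is shared across four rows of A, cutting the loads per FMA from about 2 towards 1.25.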
Also note that double tmp = 0; would be a good idea: your current version adds into result[i], reading it before writing. That would require zero-init before you could use it as a pure output.
Auto-vec auto-par version:
I think this is correct; the asm looks like it auto-parallelized as well as auto-vectorizing the inner loop.
void doParallelComputation(double *restrict A, double *restrict V, double *restrict results, unsigned long matrixSize, int numThreads)
{
    omp_set_num_threads(numThreads);
    unsigned long i, j;

    #pragma omp parallel for private(j)
    for (i = 0; i < matrixSize; i++)
    {
        // double *AHead = &A[i * matrixSize];
        double tmp = 0;
        // TODO: unroll outer loop and cache-block it.
        #pragma omp simd reduction(+:tmp)
        for (j = 0; j < matrixSize; j++)
        {
            //results[i] += A[i * matrixSize + j] * V[j];
            tmp += A[i * matrixSize + j] * V[j];
        }
        results[i] = tmp;   // write-only to results, not adding to old value.
    }
}
Compiles (Godbolt) with a vectorized inner loop inside the OpenMPified helper function doParallelComputation._omp_fn.0:
# gcc7.5 -xc -O3 -fopenmp -march=skylake
.L6:
    add     rdx, 1                                     # loop counter; newer GCC just compares the end-pointer
    vmovupd ymm2, YMMWORD PTR [rcx+rax]                # 32-byte load
    vfmadd231pd ymm0, ymm2, YMMWORD PTR [rsi+rax]      # 32-byte memory-source FMA
    add     rax, 32                                    # pointer increment
    cmp     rdi, rdx
    ja      .L6
Then a horizontal sum of mediocre efficiency after the loop; unfortunately the OpenMP vectorizer isn't as smart as the "normal" -ftree-vectorize vectorizer, but that requires -ffast-math to do anything here.
I am trying to create a simple program that uses Intel's AVX technology to perform vector multiplication and addition, using OpenMP alongside it. But I am getting a segmentation fault at the call to _mm256_store_ps().
I have tried OpenMP constructs like atomic and critical, in case the store needed to be atomic while multiple cores execute at the same time, but it did not help.
#include<stdio.h>
#include<time.h>
#include<stdlib.h>
#include<immintrin.h>
#include<omp.h>

#define N 64

__m256 multiply_and_add_intel(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

void multiply_and_add_intel_total_omp(const float* a, const float* b, const float* c, float* d)
{
    __m256 a_intel, b_intel, c_intel, d_intel;
    #pragma omp parallel for private(a_intel,b_intel,c_intel,d_intel)
    for(long i = 0; i < N; i = i + 8) {
        a_intel = _mm256_loadu_ps(&a[i]);
        b_intel = _mm256_loadu_ps(&b[i]);
        c_intel = _mm256_loadu_ps(&c[i]);
        d_intel = multiply_and_add_intel(a_intel, b_intel, c_intel);
        _mm256_store_ps(&d[i], d_intel);
    }
}
int main()
{
    srand(time(NULL));
    float * a = (float *) malloc(sizeof(float) * N);
    float * b = (float *) malloc(sizeof(float) * N);
    float * c = (float *) malloc(sizeof(float) * N);
    float * d_intel_avx_omp = (float *) malloc(sizeof(float) * N);
    int i;
    for(i = 0; i < N; i++)
    {
        a[i] = (float)(rand() % 10);
        b[i] = (float)(rand() % 10);
        c[i] = (float)(rand() % 10);
    }
    double time_t = omp_get_wtime();
    multiply_and_add_intel_total_omp(a, b, c, d_intel_avx_omp);
    time_t = omp_get_wtime() - time_t;
    printf("\nTime taken to calculate with AVX2 and OMP : %0.5lf\n", time_t);
    free(a);
    free(b);
    free(c);
    free(d_intel_avx_omp);
    return 0;
}
I expect to get d = a * b + c, but it is showing a segmentation fault. I have tried to perform the same task without OpenMP and it works without errors. Please let me know if there is any compatibility issue or if I am missing something.
gcc version 7.3.0
Intel® Core™ i3-3110M Processor
OS Ubuntu 18.04
Open MP 4.5, I have executed the command $ echo |cpp -fopenmp -dM |grep -i open and it showed #define _OPENMP 201511
Command to compile: gcc first_int.c -mavx -fopenmp
** UPDATE **
As per the discussions and suggestions, the new code is,
float * a = (float *) aligned_alloc(N, sizeof(float) * N);
float * b = (float *) aligned_alloc(N, sizeof(float) * N);
float * c = (float *) aligned_alloc(N, sizeof(float) * N);
float * d_intel_avx_omp = (float *)aligned_alloc(N, sizeof(float) * N);
This is now working perfectly.
Just a note: I was trying to compare the plain calculation, the AVX calculation, and the AVX+OpenMP calculation. This is the result I got:
Time taken to calculate without AVX : 0.00037
Time taken to calculate with AVX : 0.00024
Time taken to calculate with AVX and OMP : 0.00019
N = 50000
The documentation for _mm256_store_ps says:
Store 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from a into memory. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated.
You can use _mm256_storeu_ps instead for unaligned stores.
A better option is to align all your arrays on a 32-byte boundary (for 256-bit AVX registers) and use aligned loads and stores for maximum performance, because unaligned loads/stores that cross a cache-line boundary incur a performance penalty.
Use std::aligned_alloc (or C11 aligned_alloc, memalign, posix_memalign, whatever you have available) instead of malloc(size), e.g.:
float* allocate_aligned(size_t n) {
    constexpr size_t alignment = alignof(__m256);
    return static_cast<float*>(aligned_alloc(alignment, sizeof(float) * n));
}
// ...
float* a = allocate_aligned(N);
float* b = allocate_aligned(N);
float* c = allocate_aligned(N);
float* d_intel_avx_omp = allocate_aligned(N);
In C++17, new can allocate with alignment:
float* allocate_aligned(size_t n) {
    constexpr auto alignment = std::align_val_t{alignof(__m256)};
    return new(alignment) float[n];
}
Alternatively, use Vc: portable, zero-overhead C++ types for explicitly data-parallel programming that aligns heap-allocated SIMD vectors for you:
#include <cstdio>
#include <memory>
#include <chrono>
#include <Vc/Vc>

Vc::float_v random_float_v() {
    alignas(Vc::VectorAlignment) float t[Vc::float_v::Size];
    for(unsigned i = 0; i < Vc::float_v::Size; ++i)
        t[i] = std::rand() % 10;
    return Vc::float_v(t, Vc::Aligned);
}

unsigned reverse_crc32(void const* vbegin, void const* vend) {
    unsigned const* begin = reinterpret_cast<unsigned const*>(vbegin);
    unsigned const* end = reinterpret_cast<unsigned const*>(vend);
    unsigned r = 0;
    while(begin != end)
        r = __builtin_ia32_crc32si(r, *--end);
    return r;
}

int main() {
    constexpr size_t N = 65536;
    constexpr size_t M = N / Vc::float_v::Size;

    std::unique_ptr<Vc::float_v[]> a(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> b(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> c(new Vc::float_v[M]);
    std::unique_ptr<Vc::float_v[]> d_intel_avx_omp(new Vc::float_v[M]);

    for(unsigned i = 0; i < M; ++i) {
        a[i] = random_float_v();
        b[i] = random_float_v();
        c[i] = random_float_v();
    }

    auto t0 = std::chrono::high_resolution_clock::now();
    for(unsigned i = 0; i < M; ++i)
        d_intel_avx_omp[i] = a[i] * b[i] + c[i];
    auto t1 = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t1 - t0).count();
    unsigned crc = reverse_crc32(d_intel_avx_omp.get(), d_intel_avx_omp.get() + M); // Make sure d_intel_avx_omp isn't optimized out.
    std::printf("crc: %u, time: %.09f seconds\n", crc, seconds);
}
Parallel version:
#include <tbb/parallel_for.h>
// ...
auto t0 = std::chrono::high_resolution_clock::now();
tbb::parallel_for(size_t{0}, M, [&](unsigned i) {
    d_intel_avx_omp[i] = a[i] * b[i] + c[i];
});
auto t1 = std::chrono::high_resolution_clock::now();
You must use aligned memory for these intrinsics. Change your malloc(...) to aligned_alloc(sizeof(float) * 8, ...) (C11).
This is completely unrelated to atomics. You are working on entirely separate pieces of data (even on different cache lines), so there is no need for any protection.
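In other words, something along these lines (a sketch; note that strict C11 aligned_alloc also requires the size to be a multiple of the alignment):
// 32 bytes = sizeof(float) * 8, the alignment of a __m256.
float *a = aligned_alloc(32, sizeof(float) * N);   // assumes sizeof(float) * N is a multiple of 32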
I have code ported from Matlab, where all matrix operations are done with a couple of symbols. Translating it into C, I ran into the problem that for every size of matrix I have to create a separate function. It's a big code base; I will not place it all here, but I will try to explain how it works.
I also have a big loop where a lot of matrix operations take place. Functions which operate on matrices should take matrices as input and store results in temporary matrices for upcoming operations. In fact I know the sizes of the matrices, but I also want to make the functions as universal as possible, in order to reduce code size and save time.
For example, here are transposition operations for 2x4 and 4x4 matrices:
void A_matrix_transposition(float transposed_matrix[4][2], float matrix[2][4], int rows_in_matrix, int columnes_in_matrix);
void B_matrix_transposition(float transposed_matrix[4][4], float matrix[4][4], int rows_in_matrix, int columnes_in_matrix);

int main() {
    float transposed_matrix_A[4][2];                   // temporary matrices
    float transposed_matrix_B[4][4];
    float input_matrix_A[2][4], input_matrix_B[4][4];  // input matrices with numbers

    A_matrix_transposition(transposed_matrix_A, input_matrix_A, 2, 4);
    B_matrix_transposition(transposed_matrix_B, input_matrix_B, 4, 4);

    // after calling the functions I want to use the temporary matrices again.
    // How do I pass them to other functions if I don't know their size, in general?
}
void A_matrix_transposition(float transposed_matrix[4][2], float matrix[2][4], int rows_in_matrix, int columnes_in_matrix)
{
    static int i, j;
    for(i = 0; i < rows_in_matrix; ++i) {
        for(j = 0; j < columnes_in_matrix; ++j) {
            transposed_matrix[j][i] = matrix[i][j];
        }
    }
}

void B_matrix_transposition(float transposed_matrix[4][4], float matrix[4][4], int rows_in_matrix, int columnes_in_matrix)
{
    static int i, j;
    for(i = 0; i < rows_in_matrix; ++i) {
        for(j = 0; j < columnes_in_matrix; ++j) {
            transposed_matrix[j][i] = matrix[i][j];
        }
    }
}
The operation is simple, but the code is already massive because of the 2 different functions, and it will become a slow disaster if I continue like this.
How do I create one function for transposing that can operate on matrices of different sizes?
I suppose it can be done with pointers, but I don't know how.
I'm looking for a really general answer to understand how to tune up the "communication" between functions and temporary matrices, ideally with an example. Thank you all in advance for the information and help.
There are different ways you can achieve this in C, ranging from not-so-good to good solutions.
If you know the maximum size the matrices can have, you can create a matrix big enough to accommodate that size and work within it. If the actual matrix is smaller, no problem: write the operations to consider only that small sub-matrix rather than the whole one.
Another solution is to create a data structure to hold the matrix. This can go as far as a jagged array, managed using attributes stored in the structure itself: for example, the number of rows and columns is kept in the structure alongside the data. A jagged array gives you the benefit that you can allocate and de-allocate memory as needed, giving you better control over the shape and order of the matrices. This is better in that you can now pass two matrices of different sizes, and the functions all see the structure which contains the actual matrix (wrapped, I would say) and work on it.
By structure I mean something like:
struct matrix {
    int **mat;
    int row;
    int col;
};
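A minimal allocation sketch for that structure might look like the following (error handling omitted for brevity; the helper name is illustrative, not from the answer above):
#include <stdlib.h>

struct matrix make_matrix(int row, int col) {
    struct matrix m = { .row = row, .col = col };
    m.mat = malloc(row * sizeof *m.mat);           /* one pointer per row */
    for (int i = 0; i < row; i++)
        m.mat[i] = calloc(col, sizeof *m.mat[i]);  /* one zeroed buffer per row */
    return m;
}
Freeing walks the same layout in reverse: free each m.mat[i], then m.mat itself.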
If your C implementation supports variable length arrays, then you can accomplish this with:
void matrix_transposition(size_t M, size_t N,
    float Destination[M][N], const float Source[N][M])
{
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n)
            Destination[m][n] = Source[n][m];
}
If your C implementation does not support variable length arrays, but does allow pointers to arrays to be cast to pointers to elements and used to access a two-dimensional array as if it were one-dimensional (this is not standard C but may be supported by a compiler), you can use:
void matrix_transposition(size_t M, size_t N,
    float *Destination, const float *Source)
{
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n)
            Destination[m*N+n] = Source[n*M+m];
}
The above requires the caller to cast the arguments to float *. We can make it more convenient for the caller with:
void matrix_transposition(size_t M, size_t N,
    void *DestinationPointer, const void *SourcePointer)
{
    float *Destination = DestinationPointer;
    const float *Source = SourcePointer;
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n)
            Destination[m*N+n] = Source[n*M+m];
}
(Unfortunately, this prevents the compiler from checking that the argument types match the intended types, but this is a shortcoming of C.)
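For example, a call might then look like this (hypothetical sizes; the arrays convert to void * implicitly, so no casts are needed):
float dst[4][2], src[2][4];
matrix_transposition(4, 2, dst, src);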
If you need a solution strictly in standard C without variable length arrays, then, technically, the proper way is to copy the bytes of the objects:
void matrix_transposition(size_t M, size_t N,
    void *DestinationPointer, const void *SourcePointer)
{
    char *Destination = DestinationPointer;
    const char *Source = SourcePointer;
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n)
        {
            // Calculate locations of elements in memory.
            char *D = Destination + (m*N+n) * sizeof(float);
            const char *S = Source + (n*M+m) * sizeof(float);
            memcpy(D, S, sizeof(float));
        }
}
Notes:
Include <stdlib.h> to declare size_t and, if using the last solution, include <string.h> to declare memcpy.
Variable length arrays were required in C 1999 but made optional in C 2011. Good quality compilers for general purpose systems will support them.
If you are using a C99 compiler, you can make use of variable length arrays (VLAs, made optional in C11). You can write a function like this:
void matrix_transposition(int rows_in_matrix, int columnes_in_matrix, float transposed_matrix[columnes_in_matrix][rows_in_matrix], float matrix[rows_in_matrix][columnes_in_matrix])
{
    int i, j;
    for(i = 0; i < rows_in_matrix; ++i) {
        for(j = 0; j < columnes_in_matrix; ++j) {
            transposed_matrix[j][i] = matrix[i][j];
        }
    }
}
This one function can work for the different number of rows_in_matrix and columnes_in_matrix. Call it like this:
matrix_transposition (2, 4, transposed_matrix_A, input_matrix_A);
matrix_transposition (4, 4, transposed_matrix_B, input_matrix_B);
You probably don't want to be hard-coding array sizes in your program. I suggest a structure that contains a single flat array, which you can then interpret in two dimensions:
typedef struct {
size_t width;
size_t height;
float *elements;
} Matrix;
Initialize it with
int matrix_init(Matrix *m, size_t w, size_t h)
{
    m->elements = malloc((sizeof *m->elements) * w * h);
    if (!m->elements) {
        m->width = m->height = 0;
        return 0; /* failed */
    }
    m->width = w;
    m->height = h;
    return 1; /* success */
}
Then, to find the element at position (x,y), we can use a simple function:
float *matrix_element(Matrix *m, size_t x, size_t y)
{
    /* optional: range checking here */
    return m->elements + x + m->width * y;
}
This has better locality than an array of pointers (and is easier and faster to allocate and deallocate correctly), and is more flexible than an array of arrays (where, as you've found, the inner arrays need a compile-time constant size).
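Usage might then look like this (a sketch reusing the functions above):
Matrix m;
if (matrix_init(&m, 4, 2)) {                  /* 4 wide, 2 high */
    *matrix_element(&m, 3, 1) = 42.0f;        /* write element at (x=3, y=1) */
    printf("%f\n", *matrix_element(&m, 3, 1));
    free(m.elements);                         /* the struct owns one flat buffer */
}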
You might be able to use an array of arrays wrapped in a Matrix struct - it's possible you'll need a stride that is not necessarily the same as width, if the array of arrays has padding on your platform.
I'm generating two matrices using the following function (note some code is omitted):
srand(2007);
randomInit(h_A_data, size_A);

void randomInit(float* data, int size)
{
    int i;
    for (i = 0; i < size; ++i){
        data[i] = rand() / (float)RAND_MAX;
    }
}
This is called for matrices A and B. It populates the matrices with 0.something values, e.g. 0.748667. I then perform a matrix multiplication on the CPU and compare the result to a GPU implementation via OpenCL. The resulting matrix has values in the range of 20.something, e.g. 23.472757. Both the CPU and the GPU give the same result. The CPU implementation is taken from NVIDIA's CUDA toolkit distribution:
void computeGold(float* C, const float* A, const float* B, unsigned int hA, unsigned int wA, unsigned int wB)
{
    unsigned int i;
    unsigned int j;
    unsigned int k;
    for (i = 0; i < hA; ++i)
        for (j = 0; j < wB; ++j) {
            double sum = 0;
            for (k = 0; k < wA; ++k) {
                double a = A[i * wA + k];
                double b = B[k * wB + j];
                sum += a * b;
            }
            C[i * wB + j] = (float)sum;
        }
}
The weird thing is, all three matrices in memory are the same size, i.e. sizeof(float)*size_A (or *size_B for matrix B, etc.). Yet when I dump them to disk, the file for the result stored in matrix C (the multiplied matrix) is bigger than the files for matrices A and B.
Even more critically, for my application I'm transferring these over a network via a socket. In terms of the raw number of bytes, all matrices are the same, and yet it takes longer to transfer matrix C over the network. The problem gets worse for large matrix sizes. Why is this?
UPDATE/EDIT:
fprintf(matrix_c_file, "\n\nMatrix C\n");
for(i = 0; i < size_C; i++)
{
    fprintf(matrix_c_file, "%f ", h_C_data[i]);
}
fprintf(matrix_c_file, "\n");
When matrices A and B contain only zeros, all three files (for A, B and C) are the same size on disk.
I think that lijie has the correct (albeit terse) answer in the comments. The %f format specifier can result in a string with variable width. Consider the following C code:
printf("%f\n", 0.0);
printf("%f\n", 3.1415926535897932384626433);
printf("%f\n", 20.53);
printf("%f\n", 20.5e38);
which produces:
0.000000
3.141593
20.530000
2050000000000000019963732141023730597888.000000
All of the output has the same number of digits after the decimal point (6 by default), but a variable number to the left of the decimal point. If you need the textual representation of your matrix to be a consistent size and you don't mind sacrificing some precision, you can use the %e format specifier instead to force an exponential representation like 2.345e12.
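Applied to the dump loop from the question, that would be (using the same hypothetical file handle and array names):
fprintf(matrix_c_file, "%e ", h_C_data[i]);   // every value prints at a fixed width, e.g. 2.347276e+01
so the files for A, B and C come out the same size regardless of the magnitudes of the values.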