I have a function like this in C (in pseudo-ish code, dropping the unimportant parts):
int func(int s, int x, int* a, int* r) {
    int i;
    // do some stuff
    for (i = 0; i < a_really_big_int; ++i) {
        if (s) r[i] = x ^ i;
        else   r[i] = x ^ a[i];
        // and maybe a couple other ways of computing r
        // that are equally fast individually
    }
    // do some other stuff
}
This code gets called so much that this loop is actually a speed bottleneck in the code. I am wondering a couple things:
Since the switch s is a constant in the function, will good compilers optimize the loop so that the branch isn't slowing things down all the time?
If not, what is a good way to optimize this code?
====
Here is an update with a fuller example:
int func(int s,
         int start, int stop, int stride,
         double *x, double *b,
         int *a, int *flips, int *signs, int i_max,
         double *c)
{
    int i, k, st;
    for (k = start; k < stop; k += stride) {
        b[k] = 0;
        for (i = 0; i < i_max; ++i) {
            /* this is the code in question */
            if (s) st = k ^ flips[i];
            else   st = a[k] ^ flips[i];
            /* done with code in question */
            b[k] += x[st] * (__builtin_popcount(st & signs[i]) % 2 ? -c[i] : c[i]);
        }
    }
}
EDIT 2:
In case anyone is curious, I ended up refactoring the code and hoisting the whole inner for loop (the one over i_max) outside, making the really_big_int loop much simpler and hopefully easy to vectorize! (And also avoiding a bunch of redundant logic a zillion times.)
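Roughly, the interchanged version looks like this (a sketch based on the description above, not the exact production code):

for (k = start; k < stop; k += stride)
    b[k] = 0;
for (i = 0; i < i_max; ++i) {
    /* everything that depends only on i is loaded once */
    const int f = flips[i], sg = signs[i];
    const double ci = c[i];
    if (s) {   /* the branch on s is hoisted out of the hot loop */
        for (k = start; k < stop; k += stride) {
            int st = k ^ f;
            b[k] += x[st] * (__builtin_popcount(st & sg) % 2 ? -ci : ci);
        }
    } else {
        for (k = start; k < stop; k += stride) {
            int st = a[k] ^ f;
            b[k] += x[st] * (__builtin_popcount(st & sg) % 2 ? -ci : ci);
        }
    }
}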
One obvious way to optimize the code is to pull the conditional outside the loop:
if (s)
    for (i = 0; i < a_really_big_int; ++i) {
        r[i] = x ^ i;
    }
else
    for (i = 0; i < a_really_big_int; ++i) {
        r[i] = x ^ a[i];
    }
A shrewd compiler might be able to change that into r[] assignments of more than one element at a time.
Micro-optimizations
Usually they are not worth the time; reviewing larger issues is more effective.
Still, when you do micro-optimize, trying a variety of approaches and then profiling them to find the best can yield modest improvements.
In addition to @wallyk's and @kabanus's fine answers, some simplistic compilers benefit from a loop that counts down to 0.
// for (i=0;i<a_really_big_int;++i) {
for (i = a_really_big_int; i--; ) {  /* counts a_really_big_int-1 down to 0 */
[edit 2nd optimization]
OP added a more complete example. One of the issues is that the compiler cannot assume that the memory pointed to by b and the other pointers does not overlap. This prevents certain optimizations.
Assuming they do not in fact overlap, use restrict on b to allow optimizations. const helps too for weaker compilers that do not deduce it. restrict on the others may also benefit, again, if the referenced data does not overlap.
// int func(int s, int start, int stop, int stride, double *x,
// double *b, int *a, int *flips,
// int *signs, int i_max, double *c) {
int func(int s, int start, int stop, int stride, const double * restrict x,
double * restrict b, const int * restrict a, const int * restrict flips,
const int * restrict signs, int i_max, double *c) {
Every statement in the loop is a quick O(1) operation. The if is definitely optimized away, and so is your for+if, as long as all your statements are of the form r[i] = something_quick. The question may boil down to just how big that "really big int" can be.
A quick int main that just iterates from INT_MIN to INT_MAX, summing into a long variable, takes ~10 seconds for me on the Ubuntu subsystem on Windows. Your statements may multiply this by a few, which quickly gets to a minute. Bottom line: this may not be avoidable if you really are iterating a ton.
If r[i] are calculated independently, this would be a classic usage for threading/multi-processing.
EDIT:
I think % is optimized away by the compiler anyway, but if not, note that x & 1 is much faster for an odd/even check.
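Applied to the inner statement from the question, that check would look like this (a sketch; the compiler has very likely done this rewrite already):

/* odd/even via the low bit instead of %2 */
int odd = __builtin_popcount(st & signs[i]) & 1;
b[k] += x[st] * (odd ? -c[i] : c[i]);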
Assuming x86_64, you can ensure that the pointers are aligned to 16 bytes and use intrinsics. If it is only running on systems with AVX2, you could use the _mm256 variants (and similarly for AVX-512).
int func(int s, int x, const __m128i* restrict a, __m128i* restrict r) {
    size_t i = 0, max = a_really_big_int / 4;
    __m128i xv = _mm_set1_epi32(x);
    // do some stuff
    if (s) {
        // _mm_set_epi32 lists elements high-to-low, so lane 0 holds 0
        __m128i iv = _mm_set_epi32(3,2,1,0);
        __m128i four = _mm_set1_epi32(4);
        for ( ; i < max; ++i, iv = _mm_add_epi32(iv, four)) {
            r[i] = _mm_xor_si128(xv, iv);
        }
    } else { /* not (s) */
        for ( ; i < max; ++i) {
            r[i] = _mm_xor_si128(xv, a[i]);
        }
    }
    // do some other stuff
}
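Note that the division by 4 silently drops any remainder; if a_really_big_int is not a multiple of 4, a scalar tail is needed after the vector loops. A minimal sketch, reusing the names from above:

/* scalar tail for the last a_really_big_int % 4 elements */
int *ri = (int *)r;
const int *ai = (const int *)a;
for (size_t t = max * 4; t < (size_t)a_really_big_int; ++t)
    ri[t] = s ? (x ^ (int)t) : (x ^ ai[t]);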
Although the if statement will be optimized away on any decent compiler (unless you asked the compiler not to optimize), I would consider writing the optimization in (just in case you compile without optimizations).
In addition, although the compiler might optimize the sign-flipping conditional (the ? -c[i] : c[i]), I would consider optimizing it manually, either using any available builtin, or using bitwise operations.
i.e.

uint64_t bits; memcpy(&bits, &c[i], sizeof bits);                 /* double -> bits */
bits ^= (uint64_t)(__builtin_popcount(st & signs[i]) & 1) << 63;  /* flip sign bit if odd */
double ci; memcpy(&ci, &bits, sizeof ci);                         /* bits -> double */
b[k] += x[st] * ci;

This takes the last bit of popcount (1 == odd, 0 == even), shifts it up to the sign-bit position, and XORs it into the bit pattern of c[i], which conditionally computes -c[i] without a branch. The XOR is done on an integer copy of the value because C doesn't allow bitwise operations directly on a double; the memcpy calls just move the bits and compile away to plain register moves.
This will avoid instruction jumps in cases where the sign-flipping conditional isn't optimized.
P.S.
I used a uint64_t because it matches the 8-byte width of a double, whose sign bit is the topmost bit; a plain int (4 bytes on my system) would be too narrow to reach it.
Related
For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:
int mod(int a, int b)
{
// calculate mod to get the correct index with periodic padding
int r = a % b;
return r < 0 ? r + b : r;
}
void convolve3D(const double *image, const double *kernel, const int imageDimX, const int imageDimY, const int imageDimZ, const int kernelDimX, const int kernelDimY, const int kernelDimZ, double *result)
{
    int imageSize = imageDimX * imageDimY * imageDimZ;
    int kernelSize = kernelDimX * kernelDimY * kernelDimZ;
    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift, yShift, zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex = 0;

    // Loop through each voxel
    for (k = 0; k < imageDimZ; k++){
        for (j = 0; j < imageDimY; j++) {
            for (i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++){
                    for (m = 0; m < kernelDimY; m++) {
                        for (l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;
                            outI = mod((i - xShift), imageDimX);
                            outJ = mod((j - yShift), imageDimY);
                            outK = mod((k - zShift), imageDimZ);
                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
                            // calculate and add
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                }
                imageIndex++;
            }
        }
    }
}
By convention, all the matrices (image, kernel, result) are stored in column-major fashion, and that's why I loop through them this way, so neighboring accesses are closer in memory (I heard this would help).
I know the implementation is very naive, but since it's written in C, I was hoping the performance would be good, but instead it's a little disappointing. I tested it with an image of size 100^3 and a kernel of size 10^3 (~1 GFLOP total if we only count the multiplications and additions), and it took ~7 s, which I believe is way below the capability of a typical CPU.
If possible, could you guys help me optimize this routine?
I'm open to anything that could help, with just a few things I'd ask you to consider:
The problem I'm working with can be big (e.g. an image of size 200 by 200 by 200 with a kernel of size 50 by 50 by 50 or even larger). I understand that one way of optimizing this is to convert the problem into a matrix multiplication and use the BLAS GEMM routine, but I'm afraid memory could not hold such a big matrix.
Due to the nature of the problem, I would prefer direct convolution over FFT-based convolution, since my model is developed with direct convolution in mind, and my impression is that FFT convolution gives slightly different results than direct convolution, especially for rapidly changing images, a discrepancy I'm trying to avoid.
That said, I'm in no way an expert in this, so if you have a great implementation based on FFT convolution and/or my impression of it is totally biased, I would really appreciate your help.
The input images are assumed to be periodic, so periodic padding is necessary.
I understand that utilizing BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start. I would really appreciate it if you could point me in the right direction if you have experience with these libraries.
Thanks a lot for your help, and please let me know if you need more info about the nature of the problem.
As a first step, replace your mod ((i - xShift), imageDimX) with something like this:
inline int clamp( int x, int size )
{
if( x < 0 ) return x + size;
if( x >= size ) return x - size;
return x;
}
These branches are very predictable because they yield the same result for long runs of consecutive elements. Integer modulo is relatively slow.
Now, next step (ordered by cost/profit) is going to be parallelizing. If you have any modern C++ compiler, just enable OpenMP somewhere in project settings. After that you need 2 changes.
Decorate your very outer loop with something like this: #pragma omp parallel for schedule(guided)
Move your function-level variables within that loop. This also means you’ll have to compute initial imageIndex from your k, for each iteration.
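For example (a sketch of just those two mechanical changes; with the scatter-style += writes, full correctness also needs the single-writer rework described next):

#pragma omp parallel for schedule(guided)
for (int k = 0; k < imageDimZ; k++) {
    // loop variables now live inside, so each thread gets private copies;
    // imageIndex is recomputed from k instead of carried across iterations
    int imageIndex = k * imageDimX * imageDimY;
    for (int j = 0; j < imageDimY; j++) {
        for (int i = 0; i < imageDimX; i++) {
            /* ... kernel loops as before ... */
            imageIndex++;
        }
    }
}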
Next option: rework your code so you only write each output value once. Compute the final value in your innermost 3 loops, reading from random locations in both image and kernel, and only write the result once at the end. When you have that result[outIndex] += in the inner loop, the CPU stalls waiting for the data from memory. When you accumulate into a variable that's a register rather than memory, there's no access latency.
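A sketch of that rework, reusing the clamp helper from above (untested; index math follows the question's layout, and assumes the kernel is smaller than the image so clamp's single wrap suffices):

for (int k = 0; k < imageDimZ; k++)
    for (int j = 0; j < imageDimY; j++)
        for (int i = 0; i < imageDimX; i++) {
            double acc = 0.0;                 // accumulate in a register
            int kernelIndex = 0;
            for (int n = 0; n < kernelDimZ; n++)
                for (int m = 0; m < kernelDimY; m++)
                    for (int l = 0; l < kernelDimX; l++) {
                        int srcI = clamp(i + (l - kernelCenterX), imageDimX);
                        int srcJ = clamp(j + (m - kernelCenterY), imageDimY);
                        int srcK = clamp(k + (n - kernelCenterZ), imageDimZ);
                        acc += kernel[kernelIndex++] *
                               image[(srcK * imageDimY + srcJ) * imageDimX + srcI];
                    }
            result[(k * imageDimY + j) * imageDimX + i] = acc;  // single write
        }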
SIMD is the most complicated optimization of these. In short, you'll need the maximum FMA width your hardware has (if you have AVX and need double precision, that width is 4), and you'll also need multiple independent accumulators in your 3 innermost loops, to avoid hitting latency instead of saturating throughput. Here's my answer to a much easier problem as an example of what I mean.
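To illustrate the multiple-accumulator part in isolation, here is the general shape on a plain dot product (not wired into the convolution; assumes AVX2+FMA and a length that's a multiple of 16):

#include <immintrin.h>

double dot4(const double *a, const double *b, size_t n)
{
    // four independent accumulators, so consecutive FMAs don't wait on each other
    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i),    _mm256_loadu_pd(b+i),    acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i+4),  _mm256_loadu_pd(b+i+4),  acc1);
        acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i+8),  _mm256_loadu_pd(b+i+8),  acc2);
        acc3 = _mm256_fmadd_pd(_mm256_loadu_pd(a+i+12), _mm256_loadu_pd(b+i+12), acc3);
    }
    // horizontal sum of the four accumulators
    __m256d s = _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
    __m128d lo = _mm_add_pd(_mm256_castpd256_pd128(s), _mm256_extractf128_pd(s, 1));
    return _mm_cvtsd_f64(_mm_add_pd(lo, _mm_unpackhi_pd(lo, lo)));
}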
I have to answer a question about a relatively simple C code snippet. In the function below, what would be the most costly in terms of performance or time complexity? I really have no idea how to answer that, because I feel it depends on the if statement. Also, I don't know whether a comparison is costly; the same goes for a return, a struct access, and a multiplication.
Btw, the info_h is a struct.
RGBPixel* bm_get_pixel_at(
unsigned int x,
unsigned int y,
RGBPixel *pixel_array )
{
int index;
int width_with_padding = info_h.width;
if( x >= info_h.width || y >= info_h.height ){ return &blackPixel; }
index = (info_h.height-1 - y) * width_with_padding + x;
return pixel_array + index;
}
EDIT:
Okay, so the question might have been a bit oddly phrased. I should add that this is just one function out of many in a somewhat more complex C program, on which we have now run an oprofile script thirty times. The script returns the average number of times each procedure was sampled by oprofile. In that results table, this function came in third most sampled. The follow-up question is therefore: which part of this function causes the program to spend the third-largest share of its time inside it? Sorry if that was a bit unclear in the beginning.
Since you've omitted the rest of your program, it all turns into guesswork. That this function shows up a lot in your profiling result is likely due to it being called inside an inner loop. What this function does is pretty obvious: it returns the memory location of a pixel or, if the requested index lies outside the bounds of the pixel array, the memory location of a dummy pixel.
If you run that function inside a loop, then the bounds check is executed for each and every iteration, which of course is quite redundant. This is a really low-hanging fruit regarding optimization: put the bounds checks before the loop, and make sure that the loop itself doesn't venture outside the bounds:
static inline
RGBPixel* bm_get_pixel_at_UNSAFE(
unsigned int x,
unsigned int y,
RGBPixel *pixel_array )
{
size_t const width_with_padding = info_h.width;
size_t const index = ((size_t)info_h.height-1 - y) * width_with_padding + x;
return &pixel_array[index];
}
RGBPixel* bm_get_pixel_at(
unsigned int x,
unsigned int y,
RGBPixel *pixel_array )
{
return ( x < info_h.width && y < info_h.height ) ?
bm_get_pixel_at_UNSAFE(x,y, pixel_array)
: &blackPixel;
}
void foo(RGBPixel *pixel_array)
{
/* iteration stays inside array bounds */
for( unsigned int y = 0; y < info_h.height; ++y )
for( unsigned int x = 0; x < info_h.width; ++x ){
RGBPixel *px = bm_get_pixel_at_UNSAFE(x, y, pixel_array);
/* ... */
}
}
I made a matrix-vector multiplication program using AVX2 and FMA in C. I compiled it with GCC 7, using -mfma and -mavx.
However, I got the error "incorrect checksum for freed object - object was probably modified after being freed."
I think the error occurs when the matrix dimension isn't a multiple of 4.
I know AVX2 uses ymm registers that hold 4 double-precision floating-point numbers, so I can use AVX2 without error when the matrix dimension is a multiple of 4.
But here is my question: how can I use AVX2 efficiently if the matrix dimension isn't a multiple of 4?
Here is my code.
#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "time.h"
#include "x86intrin.h"
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    __m256d va, vb, vc;
    int k;
    int i;
    for (k = 0; k < l; k++) {
        vb = _mm256_broadcast_sd(&b[k]);
        for (i = 0; i < m; i += 4) {
            va = _mm256_loadu_pd(&a[m*k+i]);
            vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(va, vb, vc);   // c += a*b (fmadd args: a*b + c)
            _mm256_storeu_pd(&c[i], vc);
        }
    }
}
int main(int argc, char* argv[]) {
// set variables
int m;
double* a;
double* b;
double* c;
int i;
int temp=0;
struct timespec startTime, endTime;
m=9;
// main program
// set vector or matrix
a=(double *)malloc(sizeof(double) * m*m);
b=(double *)malloc(sizeof(double) * m*1);
c=(double *)malloc(sizeof(double) * m*1);
for (i=0;i<m;i++) {
a[i]=1;
b[i]=1;
c[i]=0.0;
}
for (i=m;i<m*m;i++) {
a[i]=1;
}
// check start time
clock_gettime(CLOCK_REALTIME, &startTime);
mv(a, b, c, m, 1, m);
// check end time
clock_gettime(CLOCK_REALTIME, &endTime);
free(a);
free(b);
free(c);
return 0;
}
You load and store vectors of 4 doubles, but your loop condition only checks that the first vector element is in-bounds, so you can read and write outside objects by up to 3x8 = 24 bytes when m is not a multiple of 4.
You need something like i < (m-3) in the main loop, and a cleanup strategy for handling the last partial vector of data. Vectorizing with SIMD is very much like unrolling: you have to check in the loop condition that it's OK to process multiple future elements.
A scalar cleanup loop works well, but we can do better. For example, do as many 128-bit vectors as possible after the last full 256-bit vector (i.e. at most one), before going scalar.
In many cases (e.g. write-only destination) an unaligned vector load that ends at the end of your arrays is very good (when m>=4). It can overlap with your main loop if m%4 != 0, but that's fine because your output array doesn't overlap your inputs, so redoing an element as part of a single cleanup is cheaper than branching to avoid it.
But that doesn't work here, because your logic is c[i+0..3] += ..., so redoing an element would make it wrong.
// cleanup using a 128-bit FMA, then scalar if there's an odd element.
// untested
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    /* the loop below should actually work for m=1..3, but a separate strategy might be good.
    if (m < 4) {
        // maybe check m >= 2 and use __m128d vectors?
        // or vectorize differently?
    }
    */

    for (int k = 0; k < l; k++) {
        __m256d vb = _mm256_broadcast_sd(&b[k]);
        int i;
        for (i = 0; i < (m-3); i += 4) {
            __m256d va = _mm256_loadu_pd(&a[m*k+i]);
            __m256d vc = _mm256_loadu_pd(&c[i]);
            vc = _mm256_fmadd_pd(va, vb, vc);    // c += a*b, not c*a + b
            _mm256_storeu_pd(&c[i], vc);
        }

        if (i < (m-1)) {
            __m128d lasta = _mm_loadu_pd(&a[m*k+i]);
            __m128d lastc = _mm_loadu_pd(&c[i]);
            lastc = _mm_fmadd_pd(lasta, _mm256_castpd256_pd128(vb), lastc);
            _mm_storeu_pd(&c[i], lastc);
            // i += 2;  // last element only checks m odd/even, doesn't use i
        }

        // if (i < m)
        if (m & 1) {
            // odd number of elements, do the last non-vector one
            c[m-1] += a[m*k + m-1] * _mm256_cvtsd_f64(vb);
        }
    }
}
I haven't looked at exactly how gcc/clang -O3 compile that. Sometimes compilers try to get too smart with cleanup code (e.g. trying to auto-vectorize scalar cleanup loops).
Other strategies could include doing the last up-to-4 elements with an AVX masked store: you need the same mask for the end of every matrix row, so generating it once and then using it at the end of every row could be good. See Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all. (To simplify branching, you'd set it up so your main loop only goes to i < (m-4), then you always run the cleanup. In the m%4 == 0 case, the mask is all-ones so you do the final full vector.) If you can't safely read past the end of the matrix, you probably need a masked load as well as masked store.
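A sketch of the sliding-window mask trick (needs <stdint.h>; assumes the main loop stopped at i < (m-4), so 1..4 elements remain):

/* -1 lanes are active; slide the window so exactly `remaining` lanes are on */
static const int64_t mask_src[8] = { -1, -1, -1, -1, 0, 0, 0, 0 };
int remaining = m - i;                                    /* 1..4 */
__m256i tail = _mm256_loadu_si256((const __m256i *)&mask_src[4 - remaining]);
__m256d va = _mm256_maskload_pd(&a[m*k+i], tail);         /* masked load: no OOB reads */
__m256d vc = _mm256_maskload_pd(&c[i], tail);
vc = _mm256_fmadd_pd(va, vb, vc);
_mm256_maskstore_pd(&c[i], tail, vc);                     /* masked store: no OOB writes */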
You could also look at aligning your rows for efficiency, or a row stride that's separate from the logical length of rows. (i.e. pad rows out to 32-byte boundaries). Leaving padding at the end of rows simplifies the cleanup, because you can always do whole vectors that write padding.
Special case m==2: instead of broadcasting one element from b[], you'd like to broadcast 2 elements into two 128-bit lanes of a __m256d, so one 256-bit FMA could do 2 rows at once.
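That might look something like this (an untested sketch; assumes l is even, and uses an AVX2 permute to duplicate each of the two b elements within a 128-bit lane):

/* m == 2: process two columns of a (k and k+1) per 256-bit FMA */
__m256d acc = _mm256_setzero_pd();
for (int k = 0; k < l; k += 2) {
    __m256d va = _mm256_loadu_pd(&a[2*k]);      /* [a0k, a1k, a0(k+1), a1(k+1)] */
    __m128d b2 = _mm_loadu_pd(&b[k]);           /* [b[k], b[k+1]] */
    __m256d vbk = _mm256_permute4x64_pd(
        _mm256_castpd128_pd256(b2), _MM_SHUFFLE(1,1,0,0)); /* [bk,bk,bk+1,bk+1] */
    acc = _mm256_fmadd_pd(va, vbk, acc);
}
__m128d c2 = _mm_add_pd(_mm256_castpd256_pd128(acc),
                        _mm256_extractf128_pd(acc, 1));    /* fold the two lanes */
_mm_storeu_pd(c, _mm_add_pd(_mm_loadu_pd(c), c2));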
Issue
I am interested in parallelizing a problem using CUDA. The C code in question follows this simplified form:
int A, B, C; // 100 < A,B,C < 1,000
float *v1, *v2, *v3;
//v1, v2, v3 will have respective sizes A, B, C
//and will not be empty
float ***t1, ***t2, ***t3;
//t1, t2, t3 will eventually have the size (ci,cj,ck)
//and will not be empty
int i, j, k, l;
float xi, xj, xk;
for (i=0; i<A; ++i){
xi = ci - v1[i];
for (j=0; j<B; ++j){
xj = (j*cj)*cos(j*M_PI/180);
for (k=0;k<C; ++k){
xk = xj - v3[k];
if (xk < xi){
call_1(t1[i], v1, t2[i], &t3[i][j][k]);
}
else t3[i][j][k] = some_number;
}
}
}
here call_1 is
void call_1 (float **w, float *x, float **y, float *z){
int k, max = some_value;
float *v; //initialize to have size max
for (k=0; k<max; ++k)
call_2(x[k], y[k], max, &v[k]);
call_2(y, v, max, z);
}
here call_2 is
void call_2 (float *w, float*x, int y, double *z)
that simply contains operations such as bit shifting, multiplication, subtraction and addition inside a single while loop.
Ideas attempted
So far, my idea is that the function call_1 may be transformed into kernel code __global__ void call_1, and that call_2 may be transformed into device code without modifying its contents. In particular, I can probably make __global__ void call_1 be
double* v; //initialize to have size max
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int k=index; k<max; k += stride)
call_2 (x[k], y[k], max, &v[k]);
__syncthreads();
call_2 (y, v, max, z);
free (v);
I'm partly aware that the for loops can be removed by using a combination of threadIdx, blockIdx, and gridDim, but I specifically am not sure how, especially since the problem contains a function call that itself makes a function call.
Well there are two possible answers to that, and, while I don't have the courage to search all of it for you, I'll still make it an answer since you seem to have been blatantly ignored. :/
First.
Recent CUDA APIs and NVIDIA architectures have support for function calls and even recursion in CUDA. I'm not exactly sure how it works, as I never used it myself, but you might want to research that. (Or do some Vulkan, since it looks like so much fun and also supports it.)
Might help you: https://devtalk.nvidia.com/default/topic/493567/cuda-programming-and-performance/calling-external-kernel-from-cuda/
And other stuff with related keywords. : D
On the other hand..
When resolving simple issues, especially if, like me, you would rather spend your time programming than doing research and learning some random API by heart, you can always go with more primitive solutions using only the basics of the language you are using.
In your case, I would simply inline the calls to the function to make a single CUDA kernel, since it seems pretty easy to do.
Yeah, right, it might involve some copy-pasting if there are multiple calls to the function... which doesn't really matter if it lets you solve a simple issue quickly and go make something more productive.
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int k=index; k<max; k += stride)
call_2 (x[k], y[k], max, &v[k]); // Insert call_2 code here instead.
Another way around that, when you are confident your data is big enough to see a good increase in performance even with the cost of moving code and data between CPU/RAM and the GPU, is to simply have multiple "waves" of CUDA kernel calls.
You let the first wave process while preparing the second one, which is then launched on top of the finished first wave.
It's basically equivalent to other, smarter constructs offered by recent CUDA implementations, so you would probably find smarter things to do with a bit of research, but then again... it depends on your priorities.
But yeah, manually inlining functions is great. : D
*mostly never, but it can be pretty handy
I'm fairly new to C, not having had much need for anything faster than Python for most of my research. However, it turns out that recent work I've been doing required the computation of fairly large vectors/matrices, and therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker product of selected pairs of these vectors, and then sum these Kronecker products.
The question is, how should I do this efficiently? Is there anything wrong with the following structure of code, using for loops, to obtain the effect?
The function kron described below takes vectors A and B of length vector_size, and computes their Kronecker product, which it stores in C, a vector_size*vector_size matrix.
void kron(int *A, int *B, int *C, int vector_size) {
int i,j;
for(i = 0; i < vector_size; i++) {
for (j = 0; j < vector_size; j++) {
C[i*vector_size+j] = A[i] * B[j];
}
}
return;
}
This seems fine to me, and certainly (if I've not made some silly syntax error) produces the right result, but I have a sneaking suspicion that nested for loops are not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
I thank you for your patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy with this query.
Since your loop bodies are all completely independent, there is certainly a way to accelerate this. The easiest would be to take advantage of several cores before thinking of MPI. OpenMP should do quite fine on this.
#pragma omp parallel for
for(int i = 0; i < vector_size; i++) {
    for (int j = 0; j < vector_size; j++) {
        C[i*vector_size+j] = A[i] * B[j];
    }
}
This is supported by many compilers nowadays.
You could also try to drag some common expressions out of the inner loop, but decent compilers (e.g. gcc, icc, or clang) should do this quite well all by themselves:
#pragma omp parallel for
for(int i = 0; i < vector_size; ++i) {
    int const x = A[i];
    int * vec = &C[i*vector_size];
    for (int j = 0; j < vector_size; ++j) {
        vec[j] = x * B[j];
    }
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.
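For instance, the kron loops with size_t indices might look like this (a sketch of just that change):

#include <stddef.h>

void kron(const int *A, const int *B, int *C, size_t vector_size) {
    for (size_t i = 0; i < vector_size; i++) {
        for (size_t j = 0; j < vector_size; j++) {
            C[i*vector_size + j] = A[i] * B[j];
        }
    }
}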
For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one-at-a-time, since they are all on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding them to some 'running total' vector/matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
    int i, j;
    for(i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size+j] += alpha * A[i] * B[j];
        }
    }
    return;
}
precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is DGER with alpha = 1 and C a freshly zeroed matrix/vector of the correct dimensionality.
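For the curious, a minimal sketch of the GSL route (illustrative values; gsl_matrix_calloc provides the zeroed accumulator):

#include <stdio.h>
#include <gsl/gsl_blas.h>

int main(void) {
    size_t n = 3;
    gsl_vector *a = gsl_vector_alloc(n);
    gsl_vector *b = gsl_vector_alloc(n);
    gsl_matrix *C = gsl_matrix_calloc(n, n);     /* zero-initialized */
    for (size_t i = 0; i < n; i++) {
        gsl_vector_set(a, i, i + 1.0);
        gsl_vector_set(b, i, 1.0);
    }
    gsl_blas_dger(1.0, a, b, C);                 /* C += 1.0 * a b^T */
    printf("C[1][2] = %g\n", gsl_matrix_get(C, 1, 2));
    gsl_vector_free(a); gsl_vector_free(b); gsl_matrix_free(C);
    return 0;
}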
It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!
This is a common enough problem in numerical computing circles that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a python binding to it, so you can get rid of C.
All of the above is (probably) going to be faster than code written strictly in python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.
Another optimisation that would be easy to implement: if you know that the inner dimension of your arrays is divisible by n, then add n assignment statements to the body of the loop, reducing the number of iterations, with corresponding changes to the loop counting.
This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).
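For example, with the inner loop unrolled by four (assuming vector_size % 4 == 0):

for (j = 0; j < vector_size; j += 4) {
    C[i*vector_size+j]   = A[i] * B[j];
    C[i*vector_size+j+1] = A[i] * B[j+1];
    C[i*vector_size+j+2] = A[i] * B[j+2];
    C[i*vector_size+j+3] = A[i] * B[j+3];
}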
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for(i = 0; i < vector_size; i++) {
int d = *A++;
int *e = B;
for (j = 0; j < vector_size; j++) {
*C++ = *e++ * d;
}
}
This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)
To solve your problem, I think you should try Eigen 3: it's a C++ library which provides all these matrix operations!
If you have time, go look at its documentation! =)
Good luck!
/* Kronecker product C = A (x) B with plain loops, row-major storage:
   A's entry (i/lda, i%lda) scales the whole of B, landing in block-row
   i/lda, block-column i%lda of C, at offset (j/ldb, j%ldb). */
uint32_t rA = 3;
uint32_t cA = 5;
uint32_t lda = cA;
uint32_t rB = 5;
uint32_t cB = 3;
uint32_t ldb = cB;
uint32_t rC = rA*rB;
uint32_t cC = cA*cB;
uint32_t ldc = cC;

double *A = (double *)malloc(rA*cA*sizeof(double));
double *B = (double *)malloc(rB*cB*sizeof(double));
double *C = (double *)malloc(rC*cC*sizeof(double));

for (uint32_t i=0, allA=rA*cA; i<allA; i++)
    A[i] = i;
for (uint32_t i=0, allB=rB*cB; i<allB; i++)
    B[i] = i;
for (uint32_t i=0, allC=rC*cC; i<allC; i++)
    C[i] = 0;

/* scatter each product A[i]*B[j] to its slot in C */
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
{
    for (uint32_t j=0, allB=rB*cB; j<allB; j++)
        C[((i/lda)*rB + j/ldb)*ldc
          + (i%lda)*cB + j%ldb] = A[i]*B[j];
}