I want to know how to speed up matrix multiplication with SSE2.
Here is my code:
int mat_mult_simd(double *a, double *b, double *c, int n)
{
    __m128d c1, c2, a1, a2, b1;
    for (int i = 0; i < n/2; i++) {
        for (int j = 0; j < n/2; j++) {
            /* c1/c2 accumulate rows 2j and 2j+1 of c at columns 2i, 2i+1 */
            c1 = _mm_load_pd(c + (2*j*n) + (i*2));
            c2 = _mm_load_pd(c + n + (2*j*n) + (i*2));
            for (int k = 0; k < n; k++) {
                a1 = _mm_load1_pd(a + k + (2*j*n));      /* broadcast a[2j][k]   */
                a2 = _mm_load1_pd(a + n + k + (2*j*n));  /* broadcast a[2j+1][k] */
                b1 = _mm_load_pd(b + (k*n) + (i*2));     /* b[k][2i], b[k][2i+1] */
                c1 = _mm_add_pd(c1, _mm_mul_pd(a1, b1));
                c2 = _mm_add_pd(c2, _mm_mul_pd(a2, b1));
            }
            _mm_store_pd(c + (2*j*n) + (i*2), c1);
            _mm_store_pd(c + n + (2*j*n) + (i*2), c2);
        }
    }
    return 0;
}
Each parameter means:
'a' = vector a (MAT_SIZE*MAT_SIZE)
'b' = vector b (MAT_SIZE*MAT_SIZE)
'c' = vector c (MAT_SIZE*MAT_SIZE)
'n' = MAT_SIZE, a constant (always even and >= 2)
This code gives about a 4x speedup against:
int mat_mult_default(double *a, double *b, double *c, int n)
{
    double t;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            t = 0.0;
            for (int k = 0; k < n; k++)
                t += a[i*n + k] * b[k*n + j];
            c[i*n + j] = t;
        }
    }
    return 0;  /* the function is declared int, so return a value */
}
But I want more speedup. I usually experiment with MAT_SIZE of 1000*1000 or 2000*2000.
How can I speed it up? Is there another way of indexing? I really want to know. Thanks.
You can do a few things. The obvious one is splitting the work into several threads (1 per core). You can use OpenMP (easiest), Intel TBB or other multithreading libs.
This will provide a significant improvement on a multi-core machine.
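For instance, a minimal OpenMP sketch on top of the SIMD kernel above (my own rework, so treat it as an assumption: the __m128d temporaries move inside the parallel loop so each thread gets its own; compile with e.g. gcc -O2 -msse2 -fopenmp):

#include <emmintrin.h>
#include <omp.h>

/* Same kernel as above, outer loop parallelized. Each (i, j) pair
   writes a distinct pair of 2-wide slots of c, so the iterations
   are independent and need no synchronization. */
int mat_mult_simd_omp(double *a, double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n/2; i++) {
        for (int j = 0; j < n/2; j++) {
            __m128d c1 = _mm_load_pd(c + (2*j*n) + (i*2));
            __m128d c2 = _mm_load_pd(c + n + (2*j*n) + (i*2));
            for (int k = 0; k < n; k++) {
                __m128d a1 = _mm_load1_pd(a + k + (2*j*n));
                __m128d a2 = _mm_load1_pd(a + n + k + (2*j*n));
                __m128d b1 = _mm_load_pd(b + (k*n) + (i*2));
                c1 = _mm_add_pd(c1, _mm_mul_pd(a1, b1));
                c2 = _mm_add_pd(c2, _mm_mul_pd(a2, b1));
            }
            _mm_store_pd(c + (2*j*n) + (i*2), c1);
            _mm_store_pd(c + n + (2*j*n) + (i*2), c2);
        }
    }
    return 0;
}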
Another thing is to look at the disassembly (via your favorite debugger) and see how the compiler handles all the multiplications you use for the indexes; some of them can be eliminated.
Your code does 2 computations per loop iteration; try doing 4 or 8 to get better locality. E.g. a1 and a2 could be computed together with their neighbors, which are already in the L1 cache, and you can actually load them with a single load operation.
Make sure the various arrays are 16-byte aligned for SSE and change your code to use aligned reads/writes.
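A minimal sketch of such an allocation (alloc_matrix16 is a name I made up; MSVC's _aligned_malloc or C11's aligned_alloc would work too):

#include <xmmintrin.h>  /* _mm_malloc / _mm_free */
#include <stdlib.h>

/* Returns n*n doubles whose storage is 16-byte aligned, so the
   aligned _mm_load_pd/_mm_store_pd are safe on it.
   Free the result with _mm_free, not free. */
double *alloc_matrix16(int n)
{
    double *p = _mm_malloc((size_t)n * n * sizeof(double), 16);
    if (!p)
        exit(EXIT_FAILURE);
    return p;
}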
I'd leave multithreading to the end as finding bugs is harder.
Just use the right library, like the Intel Math Kernel Library or a similarly highly optimized linear algebra package (OpenBLAS, AMD Core Math Library, ATLAS, ...). They are generally faster than hand-written code, sometimes even with processor-specific optimizations for instruction sets and cache sizes, and they are written by professionals in the field. Unless you plan to publish a paper on your own optimization, go with the library.
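For reference, a hedged sketch of what the whole kernel becomes with a CBLAS-style interface (the header name varies: mkl_cblas.h for MKL, cblas.h for OpenBLAS/ATLAS):

#include <mkl_cblas.h>  /* or <cblas.h> with OpenBLAS/ATLAS */

/* c = 1.0 * a * b + 0.0 * c, where a, b, c are n x n row-major. */
void mat_mult_blas(const double *a, const double *b, double *c, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, a, n,   /* A and its leading dimension */
                     b, n,   /* B and its leading dimension */
                0.0, c, n);  /* C and its leading dimension */
}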
In the latest issue of the German computer magazine c't, they claim the compiler is smart enough to use SSE or AVX by itself; just write the right loops and the auto-vectorizer will bring the best results. This is true for the latest Intel compiler; Microsoft's compilers are too dumb. In some cases, with the right compiler flags, Intel's compiler even detects that you are programming a matrix multiplication and replaces it with the right call. Otherwise check the documentation; it is not that hard to learn to use such a package.
While teaching myself C, I thought it would be good practice to write a function which multiplies two 3x3 matrices and then make it more general. The function seems to calculate the correct result for the first and last columns but not the middle one. In addition, each value down the middle column is out by 3 more than the last.
For example:
[1 2 3]   [23  4  6]
[4 5 6] * [ 2 35  0]
[7 8 9]   [14  2 43]
The answer I receive is:
[ 69 80 135]
[190 273 282]
[303 326 429]
The actual answer should be:
[ 69 83 135]
[190 279 282]
[303 335 429]
Isolating the middle columns for clarity:
Received Expected
[ 80] [ 83]
[273] [279]
[326] [335]
My code is as follows:
#include <stdio.h>

typedef struct mat_3x3
{
    double values [3][3];
} mat_3x3;

void SetMatrix(mat_3x3 * matrix, double vals[3][3])
{
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            (matrix->values)[i][j] = vals[i][j];
        }
    }
    putchar('\n');
}

mat_3x3 MatrixMultiply(mat_3x3 * m1, mat_3x3 * m2)
{
    mat_3x3 result;
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            double temp = 0;
            for (int k = 0; k < 3; k++)
            {
                temp += ((m1->values)[i][k] * (m2->values)[k][j]);
            }
            (result.values)[i][j] = temp;
        }
    }
    return result;
}

void PrintMatrix(mat_3x3 * matrix)
{
    putchar('\n');
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            printf("%lf ", (matrix->values)[i][j]);
        }
        putchar('\n');
    }
    putchar('\n');
}

int main()
{
    mat_3x3 m1;
    mat_3x3 * pm1 = &m1;
    mat_3x3 m2;
    mat_3x3 * pm2 = &m2;
    double vals[3][3] = {
        {1,2,3},
        {4,7,6},
        {7,8,9}
    };
    double vals2[3][3] = {
        {23,4,6},
        {2,35,0},
        {14,2,43}
    };
    SetMatrix(pm1, vals);
    SetMatrix(pm2, vals2);
    printf("\nm1:");
    PrintMatrix(pm1);
    printf("\nm2:");
    PrintMatrix(pm2);
    mat_3x3 m3 = MatrixMultiply(pm1, pm2);
    mat_3x3 * pm3 = &m3;
    printf("\nm3 = m1 * m2");
    PrintMatrix(pm3);
}
Have been working on this for a while now comparing it against other simple examples and can't find the problem, so help would be appreciated!
Also if I've done anything atrocious syntax wise etc, I'm open to any criticism on how it's written as well.
While teaching myself C, I thought it would be good practice to write a function which multiplies two 3x3 matrices and then make it more general. The function seems to calculate the correct result for the first and last columns but not the middle one. In addition, each value down the middle column is out by 3 more than the last.
In practice, when coding in C, you should take care of the following issues:
refer to a good C reference website and read a good C programming book, such as Modern C
floating point numbers are not mathematical real numbers, see floating-point-gui.de for much more. For example, addition is associative in math, but not on a computer using IEEE-754.
we all make bugs (e.g. buffer overflows or undefined behavior). So you need to learn how to use a debugger. I recommend GDB. But you need to learn how to use it and spend a few hours reading documentation. Tools like valgrind are also useful (to hunt memory leaks) as soon as you use C dynamic memory allocation.
recent compilers can be helpful. I recommend GCC. You should invoke it with all warnings and debug info, e.g. gcc -Wall -Wextra -g. Be sure to spend some time in reading the documentation of your compiler. You might later consider using static program analysis tools such as Frama-C or the Clang analyzer or (for precision analysis) Fluctuat or CADNA
consider having a matrix abstract data type like here. You would then easily generalize your code to "arbitrary" N*M matrices.
later, for benchmarking purposes, you will want to use an optimizing compiler. If you use GCC, you could compile your code using gcc -Wall -Wextra -g -O3 but then you could have surprising optimizations, see e.g. this draft report.
in some cases, you could need arbitrary-precision arithmetic. Consider then using specialized libraries such as GMPlib.
Most computers today are multi-core. You could want to use Pthreads or MPI to take advantage of that with concurrent programming.
many open source libraries exist for scientific computations. Look at least for inspiration on github and gitlab, and see also this list. You could be interested in GNU GSL and study its source code, since it is free software (and later improve it).
If you want to do serious scientific computations, you might consider switching (for expressiveness) to functional languages such as OCaml. If you care about doing a lot of iterative computing (like in finite element methods), you might switch to OpenCL or OpenACC.
Be aware that scientific computation is a very difficult field.
Expect to spend a decade in learning it.
I'm open to any criticism on how it's written as well.
mat_3x3 MatrixMultiply(mat_3x3 * m1, mat_3x3 * m2)
is unusual. Why don't you return a pointer (to a fresh memory zone obtained with malloc and correctly initialized)? That is likely to be faster (a pointer is usually 8 bytes, while a 3x3 matrix takes 72 bytes to copy) and it enables you to code things like MatrixMultiply(MatrixMultiply(M1, M2), MatrixAdd(M2, M3)). Of course, garbage collection (read the GC handbook; consider using Boehm GC) then becomes an issue. If you used OCaml, the system GC would be very helpful.
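A minimal sketch of that pointer-returning variant (MatrixMultiplyAlloc is my name; the caller owns the result and must free() it):

#include <stdlib.h>

mat_3x3 *MatrixMultiplyAlloc(const mat_3x3 *m1, const mat_3x3 *m2)
{
    mat_3x3 *result = malloc(sizeof *result);
    if (!result)
        return NULL;                 /* caller must check for failure */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double temp = 0.0;
            for (int k = 0; k < 3; k++)
                temp += m1->values[i][k] * m2->values[k][j];
            result->values[i][j] = temp;
        }
    return result;                   /* caller frees with free() */
}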
I have a vector of 1024*4608 elements (Original_signal) stored in a one-dimensional array.
I enlarged Original_signal into Expand_signal by copying each block of 1024 elements 32 times, giving 1024*32*4608 elements.
Then I use a Com_array of 1024*32 elements to do the element-wise multiplication with Expand_signal, and perform a 1024-point FFT on the resulting array.
The core code is as follows:
struct timeval Bgn_Time, End_Time;  // timing variables used below

// initialize Original_signal
MKL_Complex8 *Original_signal = new MKL_Complex8[1024*4608];
for (int i = 0; i < 4608; i++)
{
    for (int j = 0; j < 1024; j++)
    {
        Original_signal[j + i*1024].real = rand();
        Original_signal[j + i*1024].imag = rand();
    }
}
// Com_array
MKL_Complex8 *Com_array = new MKL_Complex8[32*1024];
for (int i = 0; i < 32; i++)
{
    for (int j = 0; j < 1024; j++)
    {
        // note: '^' is XOR in C/C++ (and would not compile on a float
        // operand), so "j squared" has to be written as j*j
        Com_array[j + i*1024].real = cosf(2*pi*(i-16.0)/10.0*j*j);
        Com_array[j + i*1024].imag = sinf(2*pi*(i-16.0)/10.0*j*j);
    }
}
// element-wise multiplication
MKL_Complex8 *Temp_signal = new MKL_Complex8[1024*32];
MKL_Complex8 *Expand_signal = new MKL_Complex8[1024*32*4608];
gettimeofday(&Bgn_Time, 0);
for (int i = 0; i < 4608; i++)
{
    for (int j = 0; j < 32; j++)
    {
        memcpy(Temp_signal + j*1024, Original_signal + i*1024, 1024*sizeof(MKL_Complex8));
    }
    vmcMul(1024*32, Temp_signal, Com_array, Expand_signal + i*1024*32);
}
gettimeofday(&End_Time, 0);
double time_used = (double)(End_Time.tv_sec - Bgn_Time.tv_sec)*1000000 + (double)(End_Time.tv_usec - Bgn_Time.tv_usec);
printf("element-wise multiplication use time %f us\n", time_used);
// FFT
DFTI_DESCRIPTOR_HANDLE h_FFT = 0;
DftiCreateDescriptor(&h_FFT, DFTI_SINGLE, DFTI_COMPLEX, 1, 1024);
DftiSetValue(h_FFT, DFTI_NUMBER_OF_TRANSFORMS, 32*4608);
DftiSetValue(h_FFT, DFTI_INPUT_DISTANCE, 1024);
DftiCommitDescriptor(h_FFT);
gettimeofday(&Bgn_Time, 0);
DftiComputeForward(h_FFT, Expand_signal);
gettimeofday(&End_Time, 0);
time_used = (double)(End_Time.tv_sec - Bgn_Time.tv_sec)*1000000 + (double)(End_Time.tv_usec - Bgn_Time.tv_usec);
printf("FFT use time %f us\n", time_used);
The element-wise multiplication takes 700 ms (after removing the memcpy cost), and the FFT takes 500 ms.
The FFT needs about (N/2)*log2(N) complex multiplications, while the element-wise multiplication needs N.
In this project N = 1024, so in theory the FFT should be about 5 times slower than the element-wise multiplication. Why is it faster in practice?
Is there any way to speed up the project?
(Note that Com_array is symmetrical.)
In this project N = 1024. In theory the FFT should be about 5 times slower than the element-wise multiplication. Why is it faster in practice?
As was indicated in comments, the time complexity of the FFT gives you a relative measure for various FFT lengths, up to some constant factor. That factor becomes important when trying to compare with other computations. Also, your analysis assumes that performance is limited by floating point operations, whereas in reality the actual performance seems limited by other factors such as special case handling (e.g. NaN, Inf) and memory and cache access.
Any way to speed up the project?
Since your performance bottleneck is around the complex element-wise vector multiplication operation, the following will focus on improving the performance of that operation.
I do not have MKL to perform actual benchmarks, but it is probably fair to assume that the vmcMul implementation is both fairly robust to special cases such as NaN and Inf, and fairly optimized in the circumstances.
If you do not need the robustness against the special cases, run on an SSE3-capable processor, and can guarantee that your array sizes are multiples of 2 and 16-byte aligned, then you may get some performance gains by using a simplified implementation such as the following (based on Sebastien's answer to another post):
#include <pmmintrin.h>
#include <xmmintrin.h>

// Computes an element-by-element multiplication of complex vectors "a" and "b" and
// stores the results in "c".
// Vectors "a", "b" and "c" must be:
// - vectors of even length N
// - 16-bytes aligned
// Special cases such as NaN and Inf are not handled.
//
// based on https://stackoverflow.com/questions/3211346/complex-mul-and-div-using-sse-instructions#4884057
void packed_vec_mult(int N, MKL_Complex8* a, MKL_Complex8* b, MKL_Complex8* c)
{
    int M = N/2;
    __m128* aptr = reinterpret_cast<__m128*>(a);
    __m128* bptr = reinterpret_cast<__m128*>(b);
    __m128* cptr = reinterpret_cast<__m128*>(c);
    for (int i = 0; i < M; i++)
    {
        __m128 t0 = _mm_moveldup_ps(*aptr);        // duplicate real parts of a
        __m128 t1 = *bptr;
        __m128 t2 = _mm_mul_ps(t0, t1);            // re(a)*re(b), re(a)*im(b)
        __m128 t3 = _mm_shuffle_ps(t1, t1, 0xb1);  // swap re/im of b
        __m128 t4 = _mm_movehdup_ps(*aptr);        // duplicate imaginary parts of a
        __m128 t5 = _mm_mul_ps(t4, t3);            // im(a)*im(b), im(a)*re(b)
        *cptr = _mm_addsub_ps(t2, t5);             // (re*re - im*im, re*im + im*re)
        ++aptr;
        ++bptr;
        ++cptr;
    }
}
Once the multiplication is optimized, your implementation could still be improved by getting rid of the extra copies to Temp_signal with memcpy, by directly multiplying Original_signal many times with different portions of Com_array, as shown below:
MKL_Complex8* outptr = Expand_signal;
for (int i = 0; i < 4608; i++)
{
    for (int j = 0; j < 32; j++)
    {
        packed_vec_mult(1024, Original_signal + i*1024, Com_array + j*1024, outptr);
        outptr += 1024;
    }
}
This last step would give you another ~20% performance improvement compared to your implementation with vmcMul replaced by packed_vec_mult.
Finally, since the loop performs operations on independent blocks, you may be able to get significantly higher throughput (but similar latency) by launching parallel computations over multiple threads, so that the CPU is always kept busy instead of waiting for data in transit to/from memory. My tests suggest somewhere around a factor 2 improvement, but results may differ depending on your specific machine.
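For instance, a hedged OpenMP sketch of the block loop above; replacing the running outptr with an explicit per-block offset makes the iterations fully independent (compile with -fopenmp or your compiler's equivalent):

#pragma omp parallel for
for (int i = 0; i < 4608; i++)
{
    for (int j = 0; j < 32; j++)
    {
        /* block (i, j) writes its own disjoint 1024-element slice */
        packed_vec_mult(1024,
                        Original_signal + i*1024,
                        Com_array + j*1024,
                        Expand_signal + (i*32 + j)*1024);
    }
}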
I'm trying to perform operations such as multiplying 7D arrays that have 32 million elements. I have written a MEX file as I am under the impression that these operations should be quicker in C than in Matlab. However, I'm finding that the MEX file is about twice as slow as performing the operations directly in Matlab (2017b).
An example operation I want to perform is:
T8 = rand(1,1e3,2,2,2,2,2);
wsm = rand(1e3,1e3,2,2);
CM = bsxfun(@times,T8,wsm);
On my machine this takes 0.117065 seconds (I call this, and other similar operations, ~1000 times per run of a model and the model is run thousands of times to optimize the parameters - these operations are making optimization prohibitively slow).
Here is the MEX file I wrote, it uses 7 for loops to access the elements of T8 and wsm by linear indexing (maybe I should be accessing the elements in a more efficient manner or avoiding for loops?):
#include "mex.h"
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
mwSize i, j, k, l, m, n, o, I, J, K, L, M, N, O;
mwSize *dims,*dims1;
double *T8, *wsm, *CM;
T8 = mxGetPr(prhs[0]);
wsm = mxGetPr(prhs[1]);
dims = mxGetDimensions(prhs[0]);
dims1 = mxGetDimensions(prhs[1]);
dims[0] = dims1[0];
I = dims[0];
J = dims[1];
K = dims[2];
L = dims[3];
M = dims[4];
N = dims[5];
O = dims[6];
plhs[0] = mxCreateNumericArray(7,dims,mxDOUBLE_CLASS,mxREAL);
CM = mxGetPr(plhs[0]);
for( o=0; o<O; o++ ) {
for( n=0; n<N; n++ ) {
for( m=0; m<M; m++ ) {
for( l=0; l<L; l++ ) {
for( k=0; k<K; k++ ) {
for( j=0; j<J; j++ ) {
for( i=0; i<I; i++ ) {
*CM++ = T8[j + k*J + +l*J*K + m*L*J*K + n*M*L*J*K + o*N*M*L*J*K] * wsm[i + j*I + k*I*J + l*I*J*K];
}
}
}
}
}
}
}
}
When I call the above MEX file
CM = arrayProduct(T8,wsm);
it takes 0.215211 seconds (nearly twice as long).
My code was very loosely based on the code suggested here (https://uk.mathworks.com/matlabcentral/answers/210352-optimize-speed-up-a-big-and-slow-matrix-operation-with-addition-and-bsxfun).
Any suggestions as to what I can do differently to speed up my code would be greatly appreciated!
It is a big mistake to assume you can beat Matlab at trivial matrix math like this. Matlab is optimized from the beginning to perform matrix math.
There are good reasons to write MEX functions sometimes, including for performance reasons, but that's typically in cases where a pure Matlab solution is not feasible to write in an optimal way (e.g. when you would need to write lots of explicit loops).
Two major reasons why your code might be slower than the optimized matrix math already present in Matlab are:
Matlab might use multiple threads to do calculations in parallel. Your code does not, but a truly optimal solution probably would (see the sketch after this list).
You may have made a mistake in the memory access pattern, leading to inferior cache hit rates.
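To illustrate the first point, here is a hedged sketch (my own rework, untested against your data): parallelize the outermost loop and replace the running *CM++ pointer with an explicit offset so the iterations become independent. You would need to compile the MEX file with OpenMP enabled (e.g. pass -fopenmp in CFLAGS and LDFLAGS for gcc).

/* Sketch only: the same loop nest, restructured for OpenMP.
   'out' recomputes the linear index that *CM++ used to track. */
#pragma omp parallel for
for (long o = 0; o < (long)O; o++) {          /* signed index for OpenMP */
    for (mwSize n = 0; n < N; n++)
    for (mwSize m = 0; m < M; m++)
    for (mwSize l = 0; l < L; l++)
    for (mwSize k = 0; k < K; k++)
    for (mwSize j = 0; j < J; j++) {
        double t = T8[j + J*(k + K*(l + L*(m + M*(n + N*(mwSize)o))))];
        const double *w = wsm + I*(j + J*(k + K*l));
        double *out = CM + I*(j + J*(k + K*(l + L*(m + M*(n + N*(mwSize)o)))));
        for (mwSize i = 0; i < I; i++)
            out[i] = t * w[i];                /* T8[...] * wsm[i + ...] */
    }
}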
Another way to look at this is: if Matlab can't be trusted to implement multiplication in an optimal way, would people be using it for serious math on large data sets? There are algorithms Matlab doesn't know, and sometimes those can be sped up using MEX, but multiplication is not one of them.
I am solving the system of linear algebraic equations Ax = b with the Jacobi method, but taking manual inputs. I want to analyze the performance of the solver for large systems. Is there any method to generate a matrix A that is guaranteed to be non-singular?
I am attaching my code here.
#include<stdio.h>
#include<stdlib.h>
#include<math.h>

#define TOL = 0.0001

void main()
{
    int size,i,j,k = 0;
    printf("\n enter the number of equations: ");
    scanf("%d",&size);
    double reci = 0.0;
    double *x = (double *)malloc(size*sizeof(double));
    double *x_old = (double *)malloc(size*sizeof(double));
    double *b = (double *)malloc(size*sizeof(double));
    double *coeffMat = (double *)malloc(size*size*sizeof(double));
    printf("\n Enter the coefficient matrix: \n");
    for(i = 0; i < size; i++)
    {
        for(j = 0; j < size; j++)
        {
            printf(" coeffMat[%d][%d] = ",i,j);
            scanf("%lf",&coeffMat[i*size+j]);
            printf("\n");
            //coeffMat[i*size+j] = 1.0;
        }
    }
    printf("\n Enter the b vector: \n");
    for(i = 0; i < size; i++)
    {
        x[i] = 0.0;
        printf(" b[%d] = ",i);
        scanf("%lf",&b[i]);
    }
    double sum = 0.0;
    while(k < size)
    {
        for(i = 0; i < size; i++)
        {
            x_old[i] = x[i];
        }
        for(i = 0; i < size; i++)
        {
            sum = 0.0;
            for(j = 0; j < size; j++)
            {
                if(i != j)
                {
                    sum += (coeffMat[i * size + j] * x_old[j] );
                }
            }
            x[i] = (b[i] -sum) / coeffMat[i * size + i];
        }
        k = k+1;
    }
    printf("\n Solution is: ");
    for(i = 0; i < size; i++)
    {
        printf(" x[%d] = %lf \n ",i,x[i]);
    }
}
This is all a bit Heath Robinson, but here's what I've used. I have no idea how 'random' such matrices are; in particular, I don't know what distribution they follow.
The idea is to generate the SVD of the matrix. (Called A below, and assumed nxn).
Initialise A to all 0s
Then generate n positive numbers, and put them, with random signs, in the diagonal of A. I've found it useful to be able to control the ratio of the largest of these positive numbers to the smallest. This ratio will be the condition number of the matrix.
Then repeat n times: generate a random n-vector f, and multiply A on the left by the Householder reflector I - 2*f*f' / (f'*f). Note that this can be done more efficiently than by forming the reflector matrix and doing a normal multiplication; indeed it's easy to write a routine that, given f and A, will update A in place.
Repeat the above but multiplying on the right.
As for generating test data, a simple way is to pick an x0 and then generate b = A * x0. Don't expect to get exactly x0 back from your solver; even if it is remarkably well behaved, you'll find that the errors get bigger as the condition number gets bigger.
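A hedged C sketch of that in-place left update, A := (I - 2*f*f'/(f'*f)) * A, for a row-major n x n matrix (the right-hand update is the mirror image):

/* Apply the Householder reflector I - 2*f*f'/(f'*f) to A from the left,
   in place, in O(n^2): column j becomes A[:,j] - s*f*(f' * A[:,j]). */
void house_left(double *A, const double *f, int n)
{
    double ff = 0.0;
    for (int i = 0; i < n; i++)
        ff += f[i] * f[i];
    double s = 2.0 / ff;
    for (int j = 0; j < n; j++) {
        double fa = 0.0;                 /* (f' * A)[j] */
        for (int i = 0; i < n; i++)
            fa += f[i] * A[i*n + j];
        for (int i = 0; i < n; i++)
            A[i*n + j] -= s * f[i] * fa;
    }
}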
Talonmies' comment mentions http://www.eecs.berkeley.edu/Pubs/TechRpts/1991/CSD-91-658.pdf which is probably the right approach (at least in principle, and in full generality).
However, you are probably not handling "very large" matrices (e.g. because your program uses naive algorithms, and because you don't run it on a large supercomputer with a lot of RAM). So the naive approach of generating a matrix with random coefficients and testing afterwards that it is non-singular is probably enough.
Very large matrices would have many billions of coefficients, and you would need a powerful supercomputer with e.g. terabytes of RAM. You probably don't have that; if you did, your program would probably run too long (it has no parallelism) and might give very wrong results (read http://floating-point-gui.de/ for more), so this case hardly matters here.
A matrix of a million coefficients (e.g. 1024*1024) is considered small by current hardware standards (and is more than enough to test your code on current laptops or desktops, and even to test some parallel implementations); generating some of them randomly (and computing their determinant to test that they are not singular) is enough and easily doable. You might even generate them and/or check their regularity with some external tool, e.g. Scilab, R, Octave, etc. Once your program has computed a solution x0, you could use some tool (or write another program) to compute A*x0 - b and check that it is very close to the 0 vector (there are some cases where you would be disappointed or surprised, since round-off errors matter).
You'll need a good enough pseudo-random number generator, perhaps as simple as drand48(3), which is considered nearly obsolete (you should find and use something better); you could seed it with some random source (e.g. /dev/urandom on Linux).
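A minimal sketch along those lines (the diagonal-dominance trick is my addition: a strictly diagonally dominant matrix is guaranteed non-singular, and the Jacobi iteration converges on it, so no determinant check is needed):

#include <stdlib.h>   /* drand48, srand48 */
#include <math.h>     /* fabs */

/* Fill an n x n row-major matrix with random coefficients in [-1, 1],
   then bump each diagonal entry above its row's off-diagonal sum. */
void random_nonsingular(double *A, int n)
{
    for (int i = 0; i < n; i++) {
        double rowsum = 0.0;
        for (int j = 0; j < n; j++) {
            A[i*n + j] = 2.0*drand48() - 1.0;
            if (j != i)
                rowsum += fabs(A[i*n + j]);
        }
        A[i*n + i] = rowsum + 1.0;   /* |a_ii| > sum of |a_ij|, j != i */
    }
}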
BTW, compile your code with all warnings & debug info (e.g. gcc -Wall -Wextra -g). Your #define TOL = 0.0001 is wrong (it should be #define TOL 0.0001 or const double tol = 0.0001;). Use the debugger (gdb) & valgrind. Add optimizations (-O2 -mcpu=native) when benchmarking. Read the documentation of every function you use, notably those from <stdio.h>. Check the result count from scanf... In C99 you should not cast the result of malloc, but you forgot to test it for failure, so code:
double *b = malloc(size*sizeof(double));
if (!b) {perror("malloc b"); exit(EXIT_FAILURE); };
You should end, not start, your printf control strings with \n, because stdout is often (not always!) line buffered. See also fflush.
You probably should also read some basic linear algebra textbook...
Notice that actually writing robust and efficient programs to invert matrices or to solve linear systems is a difficult art (which I don't know at all: it has programming issues, algorithmic issues, and mathematical issues; read some numerical analysis book). You can still get a PhD and spend your whole life working on that. Please understand that you need ten years to learn programming (or many other things).
I'm fairly new to C, not having had much need for anything faster than Python for most of my research. However, it turns out that recent work I've been doing required the computation of fairly large vectors/matrices, and therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker product of selected pairs of these vectors, and then sum these Kronecker products.
The question is, how do I do this efficiently? Is there anything wrong with the following structure of code, using for loops, to obtain the effect?
The function kron described below takes vectors A and B of length vector_size, and computes their Kronecker product, which it stores in C, a vector_size*vector_size matrix:
void kron(int *A, int *B, int *C, int vector_size) {
    int i, j;
    for (i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size + j] = A[i] * B[j];
        }
    }
    return;
}
This seems fine to me, and will certainly (if I've not made some silly syntax error) produce the right result, but I have a sneaking suspicion that nested for loops are not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
I thank you for your patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.
Since your loop bodies are all completely independent, there is certainly a way to accelerate this. The easiest would be to take advantage of several cores before thinking of MPI; OpenMP should do quite fine on this:
#pragma omp parallel for
for (int i = 0; i < vector_size; i++) {
    for (int j = 0; j < vector_size; j++) {
        C[i*vector_size + j] = A[i] * B[j];
    }
}
This is supported by many compilers nowadays.
You could also try to drag some common expressions out of the inner loop, but decent compilers (e.g. gcc, icc or clang) should do this quite well all by themselves:
#pragma omp parallel for
for (int i = 0; i < vector_size; ++i) {
    int const x = A[i];
    int * vec = &C[i*vector_size];
    for (int j = 0; j < vector_size; ++j) {
        vec[j] = x * B[j];
    }
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.
For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) or similar to do the products one at a time, since they all operate on vectors. How many vectors are you multiplying? Remember that adding a bunch of vector outer products (which is what you can treat the Kronecker products as) ends up as a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding them to some 'running total' vector/matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
    int i, j;
    for (i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size + j] += alpha * A[i] * B[j];
        }
    }
    return;
}
precisely corresponds to the BLAS DGER function (accessible as gsl_blas_dger), functionally speaking. The initial kron function is DGER with alpha = 1 and C a zero-initialised matrix/vector of the correct dimensionality.
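A hedged sketch of the GSL calls (note that gsl_blas_dger works on double-precision gsl_vector/gsl_matrix, so the integer data above would need converting):

#include <gsl/gsl_blas.h>

/* Accumulate outer products into C: C := alpha * a * b' + C. */
gsl_vector *a = gsl_vector_alloc(vector_size);
gsl_vector *b = gsl_vector_alloc(vector_size);
gsl_matrix *C = gsl_matrix_calloc(vector_size, vector_size); /* zeroed */
/* ... for each selected pair, fill a and b, then: */
gsl_blas_dger(1.0, a, b, C);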
It turns out, it might well be easier to simply use python bindings for these libraries, in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses, do check them out if you have the same sort of problem to deal with. Thanks everyone!
This is a common enough problem in numerical computing circles that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a Python binding to it, so you can get rid of C.
All of the above is (probably) going to be faster than code written strictly in Python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be if you are using gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran I know that have parallel loop statements. I think there are some C addons around that do the same thing. If you are using a PC (and single-precision) you could also consider using your video card's GPU, which is essentially a really cheap array processor.
Another optimisation that would be easy to implement is this: if you know that the inner dimension of your arrays will be divisible by n, then add n assignment statements to the body of the loop, reducing the number of necessary iterations, with corresponding changes to the loop counting.
This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. This can give quite a big performance win and is compatible with suggestions 1 and 3 for further optimisation/parallelisation. A good compiler may even do something like this for you (aka loop unrolling).
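For instance, the inner loop of kron unrolled by four (assuming vector_size % 4 == 0, per the divisibility condition above):

/* Unrolled-by-4 inner loop: four assignments per iteration,
   a quarter of the loop-counter overhead. */
for (i = 0; i < vector_size; i++) {
    for (j = 0; j < vector_size; j += 4) {
        C[i*vector_size + j]     = A[i] * B[j];
        C[i*vector_size + j + 1] = A[i] * B[j + 1];
        C[i*vector_size + j + 2] = A[i] * B[j + 2];
        C[i*vector_size + j + 3] = A[i] * B[j + 3];
    }
}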
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for (i = 0; i < vector_size; i++) {
    int d = *A++;
    int *e = B;
    for (j = 0; j < vector_size; j++) {
        *C++ = *e++ * d;
    }
}
This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable since it alters the value of the pointers, but would still work with loop unrolling.)
To solve your problem, I think you should try Eigen 3: it's a C++ library which provides all the matrix functionality you need!
If you have time, have a look at its documentation! =)
Good luck!
This computes the Kronecker product C = A (x) B of two row-major matrices directly:
uint32_t rA = 3;
uint32_t cA = 5;
uint32_t lda = cA;
uint32_t rB = 5;
uint32_t cB = 3;
uint32_t ldb = cB;
uint32_t rC = rA*rB;
uint32_t cC = cA*cB;
uint32_t ldc = cC;
double *A = (double *)malloc(rA*cA*sizeof(double));
double *B = (double *)malloc(rB*cB*sizeof(double));
double *C = (double *)malloc(rC*cC*sizeof(double));
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
    A[i] = i;
for (uint32_t i=0, allB=rB*cB; i<allB; i++)
    B[i] = i;
for (uint32_t i=0, allC=rC*cC; i<allC; i++)
    C[i] = 0;
for (uint32_t i=0, allA=rA*cA; i<allA; i++)
{
    for (uint32_t j=0, allB=rB*cB; j<allB; j++)
        C[((i/lda)*rB + j/ldb)*ldc + (i%lda)*cB + j%ldb] = A[i]*B[j];
}