Efficient computation of Kronecker products in C

I'm fairly new to C, not having had much need for anything faster than Python for most of my research. However, recent work has required computing with fairly large vectors/matrices, and therefore a C+MPI solution might be in order.
Mathematically speaking, the task is very simple. I have a lot of vectors of dimensionality ~40k and wish to compute the Kronecker product of selected pairs of these vectors, and then sum these Kronecker products.
The question is, how to do this efficiently? Is there anything wrong with the following structure of code, using nested for loops to obtain the effect?
The function kron below is passed vectors A and B of length vector_size, and computes their Kronecker product, which it stores in C, a vector_size*vector_size matrix.
void kron(int *A, int *B, int *C, int vector_size) {
    int i, j;
    for (i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size + j] = A[i] * B[j];
        }
    }
    return;
}
This seems fine to me, and certainly (if I've not made some silly syntax error) produces the right result, but I have a sneaking suspicion that nested for loops are not optimal. If there's another way I should be going about this, please let me know. Suggestions welcome.
I thank you for your patience and any advice you may have. Once again, I'm very inexperienced with C, but Googling around has brought me little joy for this query.

Since your loop bodies are all completely independent, there is certainly a way to accelerate this. The easiest would be to take advantage of several cores before thinking of MPI. OpenMP should do quite fine for this.
#pragma omp parallel for
for (int i = 0; i < vector_size; i++) {
    for (int j = 0; j < vector_size; j++) {
        C[i*vector_size + j] = A[i] * B[j];
    }
}
This is supported by many compilers nowadays.
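For example, with gcc you enable it with the -fopenmp flag (other compilers have an equivalent option; the file name here is just a placeholder):
gcc -O2 -fopenmp kron.c -o kron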
You could also try to hoist some common expressions out of the inner loop, but decent compilers (e.g. gcc, icc or clang) should do this quite well all by themselves:
#pragma omp parallel for
for (int i = 0; i < vector_size; ++i) {
    int const x = A[i];
    int *vec = &C[i * vector_size];
    for (int j = 0; j < vector_size; ++j) {
        vec[j] = x * B[j];
    }
}
BTW, indexing with int is usually not the right thing to do. size_t is the correct typedef for everything that has to do with indexing and sizes of objects.

For double-precision vectors (single-precision and complex are similar), you can use the BLAS routine DGER (rank-one update) to do the products one at a time, since they are all on vectors. How many vectors are you multiplying? Remember that adding up a bunch of vector outer products (which is what these Kronecker products are) amounts to a matrix-matrix multiplication, which BLAS's DGEMM can handle efficiently. You might need to write your own routines if you truly need integer operations, though.
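To make the DGEMM formulation concrete, here is a minimal sketch using the CBLAS interface; the layout (the m left-hand vectors stored as the rows of X, the m right-hand vectors as the rows of Y) and all names are assumptions for illustration, not code from the question:
#include <cblas.h>

/* Accumulate the sum of m outer products x_k y_k^T into the n x n
   matrix C.  X and Y are m x n row-major arrays whose k-th rows hold
   x_k and y_k, so the whole sum is the single product X^T * Y. */
void sum_outer_products(const double *X, const double *Y, double *C,
                        int m, int n)
{
    /* C := 1.0 * X^T * Y + 1.0 * C */
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                n, n, m, 1.0, X, n, Y, n, 1.0, C, n);
}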

If your compiler supports C99 (and you never pass the same vector as A and B), consider compiling in a C99-supporting mode and changing your function signature to:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size);
The restrict keyword promises the compiler that the arrays pointed to by A, B and C do not alias (overlap). With your code as written, the compiler must re-load A[i] on every execution of the inner loop, because it must be conservative and assume that your stores to C[] can modify values in A[]. Under restrict, the compiler can assume that this will not happen.
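Putting it together, a sketch of the restrict-qualified function; the manual hoist of A[i] just makes explicit what the compiler is now allowed to do on its own:
void kron(int * restrict A, int * restrict B, int * restrict C, int vector_size)
{
    for (int i = 0; i < vector_size; i++) {
        int a = A[i];   /* can safely stay in a register across the inner loop */
        for (int j = 0; j < vector_size; j++) {
            C[i*vector_size + j] = a * B[j];
        }
    }
}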

Solution found (thanks to @Jeremiah Willcock): GSL's BLAS bindings seem to do the trick beautifully. If we're progressively selecting pairs of vectors A and B and adding their outer product to some 'running total' matrix C, the following modified version of the above kron function
void kronadd(int *A, int *B, int *C, int vector_size, int alpha) {
    int i, j;
    for (i = 0; i < vector_size; i++) {
        for (j = 0; j < vector_size; j++) {
            C[i*vector_size + j] += alpha * A[i] * B[j];
        }
    }
    return;
}
precisely corresponds, functionally speaking, to the BLAS DGER operation (accessible as gsl_blas_dger). The initial kron function is then DGER with alpha = 1 and C a zero-initialised matrix of the correct dimensionality.
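For reference, a minimal sketch of that accumulation with GSL, using double-precision vectors as DGER requires; num_pairs and get_pair are hypothetical placeholders for however the pairs get selected:
#include <gsl/gsl_blas.h>
#include <gsl/gsl_matrix.h>

gsl_matrix *C = gsl_matrix_calloc(n, n);  /* zero-initialised running total */
for (size_t p = 0; p < num_pairs; p++) {
    gsl_vector *A, *B;
    get_pair(p, &A, &B);                  /* hypothetical pair selector */
    gsl_blas_dger(1.0, A, B, C);          /* C += 1.0 * A * B^T */
}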
It turns out it might well be easier to simply use Python bindings for these libraries in the end. However, I think I've learned a lot while trying to figure this stuff out. There are some more helpful suggestions in the other responses; do check them out if you have the same sort of problem to deal with. Thanks everyone!

This is a common enough problem in numerical computing circles that really the best thing to do would be to use a well-debugged package like Matlab (or one of its Free Software clones).
You could probably even find a Python binding to it, so you can get rid of C entirely.
All of the above is (probably) going to be faster than code written strictly in Python. If you need more speed than that, I'd suggest a couple of things:
Look into using Fortran instead of C. Fortran compilers tend to be better at optimizing numerical computations (one exception would be gcc, since both its C and Fortran compilers use the same backend).
Consider parallelizing your algorithm. There are variants of Fortran that have parallel loop statements, and there are C extensions that do the same thing. If you are using a PC (and single precision), you could also consider using your video card's GPU, which is essentially a really cheap array processor.

Another optimisation that is easy to implement: if you know that the inner dimension of your arrays will be divisible by n, then add n assignment statements to the body of the loop, reducing the number of iterations by a factor of n (with corresponding changes to the loop counting).
This strategy can be generalised by using a switch statement around the outer loop with cases for array sizes divisible by two, three, four and five, or whatever is most common. It can give quite a big performance win and is compatible with the other suggestions here for further optimisation/parallelisation. A good compiler may even do something like this for you (this is known as loop unrolling).
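For illustration, here is the inner loop unrolled by a factor of four, assuming vector_size is divisible by 4 (a general version would dispatch on the divisor as described above):
int i, j;
for (i = 0; i < vector_size; i++) {
    int x = A[i];
    for (j = 0; j < vector_size; j += 4) {
        /* four assignments per iteration: a quarter of the loop overhead */
        C[i*vector_size + j]     = x * B[j];
        C[i*vector_size + j + 1] = x * B[j + 1];
        C[i*vector_size + j + 2] = x * B[j + 2];
        C[i*vector_size + j + 3] = x * B[j + 3];
    }
}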
Another optimisation would be to make use of pointer arithmetic to avoid the array indexing. Something like this should do the trick:
int i, j;
for (i = 0; i < vector_size; i++) {
    int d = *A++;
    int *e = B;
    for (j = 0; j < vector_size; j++) {
        *C++ = *e++ * d;
    }
}
This also avoids accessing the value of A[i] multiple times by caching it in a local variable, which might give you a minor speed boost. (Note that this version is not parallelisable as written, since it mutates the pointers, but it still works with loop unrolling.)

To solve your problem, I think you should try Eigen 3. It's a C++ library that provides all the matrix operations you'd need!
If you have time, go and read its documentation! =)
Good luck!

#include <stdint.h>
#include <stdlib.h>

/* Kronecker product of a full rA x cA matrix A with an rB x cB matrix B,
   all stored row-major as flat arrays */
uint32_t rA = 3;
uint32_t cA = 5;
uint32_t lda = cA;
uint32_t rB = 5;
uint32_t cB = 3;
uint32_t ldb = cB;
uint32_t rC = rA*rB;
uint32_t cC = cA*cB;
uint32_t ldc = cC;
double *A = (double *)malloc(rA*cA*sizeof(double));
double *B = (double *)malloc(rB*cB*sizeof(double));
double *C = (double *)malloc(rC*cC*sizeof(double));
/* fill A and B with test data, zero C */
for (uint32_t i = 0, allA = rA*cA; i < allA; i++)
    A[i] = i;
for (uint32_t i = 0, allB = rB*cB; i < allB; i++)
    B[i] = i;
for (uint32_t i = 0, allC = rC*cC; i < allC; i++)
    C[i] = 0;
/* element i of A paired with element j of B lands at
   row (i/lda)*rB + j/ldb, column (i%lda)*cB + j%ldb of C */
for (uint32_t i = 0, allA = rA*cA; i < allA; i++)
{
    for (uint32_t j = 0, allB = rB*cB; j < allB; j++)
        C[((i/lda)*rB + j/ldb)*ldc + (i%lda)*cB + j%ldb] = A[i]*B[j];
}

Related

Function to multiply 3x3 matrices gives wrong answer for middle column only

While teaching myself C, I thought it would be good practice to write a function which multiplies two 3x3 matrices, and then to make it more general. The function seems to calculate the correct result for the first and last columns but not the middle one. In addition, each value down the middle column is out by 3 more than the last.
For example:
[1 2 3] [23 4 6]
[4 5 6] * [ 2 35 0]
[7 8 9] [14 2 43]
The answer I receive is:
[ 69 80 135]
[190 273 282]
[303 326 429]
The actual answer should be:
[ 69 83 135]
[190 279 282]
[303 335 429]
Isolating the middle columns for clarity:
Received Expected
[ 80] [ 83]
[273] [279]
[326] [335]
My code is as follows:
#include <stdio.h>

typedef struct mat_3x3
{
    double values[3][3];
} mat_3x3;

void SetMatrix(mat_3x3 *matrix, double vals[3][3])
{
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            (matrix->values)[i][j] = vals[i][j];
        }
    }
    putchar('\n');
}

mat_3x3 MatrixMultiply(mat_3x3 *m1, mat_3x3 *m2)
{
    mat_3x3 result;
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            double temp = 0;
            for (int k = 0; k < 3; k++)
            {
                temp += ((m1->values)[i][k] * (m2->values)[k][j]);
            }
            (result.values)[i][j] = temp;
        }
    }
    return result;
}

void PrintMatrix(mat_3x3 *matrix)
{
    putchar('\n');
    for (int i = 0; i < 3; i++)
    {
        for (int j = 0; j < 3; j++)
        {
            printf("%lf ", (matrix->values)[i][j]);
        }
        putchar('\n');
    }
    putchar('\n');
}

int main()
{
    mat_3x3 m1;
    mat_3x3 *pm1 = &m1;
    mat_3x3 m2;
    mat_3x3 *pm2 = &m2;
    double vals[3][3] = {
        {1,2,3},
        {4,7,6},
        {7,8,9}
    };
    double vals2[3][3] = {
        {23,4,6},
        {2,35,0},
        {14,2,43}
    };
    SetMatrix(pm1, vals);
    SetMatrix(pm2, vals2);
    printf("\nm1:");
    PrintMatrix(pm1);
    printf("\nm2:");
    PrintMatrix(pm2);
    mat_3x3 m3 = MatrixMultiply(pm1, pm2);
    mat_3x3 *pm3 = &m3;
    printf("\nm3 = m1 * m2");
    PrintMatrix(pm3);
}
I have been working on this for a while now, comparing it against other simple examples, and can't find the problem, so help would be appreciated!
Also if I've done anything atrocious syntax wise etc, I'm open to any criticism on how it's written as well.
In practice, when coding in C, you should take care of the following issues:
refer to a good C reference website and read a good C programming book, such as Modern C
floating point numbers are not mathematical real numbers; see floating-point-gui.de for much more. For example, addition is associative in math, but not on a computer using IEEE-754.
we all make bugs (e.g. buffer overflows or undefined behavior), so you need to learn how to use a debugger. I recommend GDB, but you'll need to spend a few hours reading its documentation. Tools like valgrind are also useful (to hunt memory leaks) as soon as you use C dynamic memory allocation.
recent compilers can be helpful. I recommend GCC. You should invoke it with all warnings and debug info, e.g. gcc -Wall -Wextra -g. Be sure to spend some time in reading the documentation of your compiler. You might later consider using static program analysis tools such as Frama-C or the Clang analyzer or (for precision analysis) Fluctuat or CADNA
consider having a matrix abstract data type like here. You would then easily generalize your code to "arbitrary" N*M matrices.
later, for benchmarking purposes, you will want to use an optimizing compiler. If you use GCC, you could compile your code using gcc -Wall -Wextra -g -O3 but then you could have surprising optimizations, see e.g. this draft report.
in some cases, you could need arbitrary-precision arithmetic. Consider then using specialized libraries such as GMPlib.
Most computers today are multi-core. You might want to use Pthreads or MPI to take advantage of that with concurrent programming.
many open source libraries exist for scientific computations. Look at least for inspiration on github and gitlab and see also this list. You could be interested by GNU GSL and study its source code since it is free software (and later improve it).
If you want to make serious scientific computations, you might consider switching (for expressiveness) to functional languages such as OCaml. If you care about doing a lot of iterative computing (like in finite element methods) you might switch to OpenCL or OpenACC.
Be aware that scientific computation is a very difficult field. Expect to spend a decade learning it.
I'm open to any criticism on how it's written as well.
mat_3x3 MatrixMultiply(mat_3x3 * m1, mat_3x3 * m2)
is unusual. Why don't you return a pointer (to a fresh memory zone obtained with malloc and correctly initialized)? That is likely to be faster (a pointer is usually 8 bytes; a 3x3 matrix takes 72 bytes to copy) and it enables you to code things like MatrixMultiply(MatrixMultiply(M1, M2), MatrixAdd(M2, M3)). Of course, garbage collection (read the GC handbook, consider using Boehm GC) then becomes an issue. If you used OCaml, the system GC would be very helpful.
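A minimal sketch of that pointer-returning variant, under the answer's assumptions (the caller checks for NULL and eventually frees the result):
#include <stdlib.h>

mat_3x3 *MatrixMultiplyPtr(const mat_3x3 *m1, const mat_3x3 *m2)
{
    mat_3x3 *result = malloc(sizeof *result);
    if (result == NULL)
        return NULL;                       /* caller must check */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
        {
            double temp = 0;
            for (int k = 0; k < 3; k++)
                temp += m1->values[i][k] * m2->values[k][j];
            result->values[i][j] = temp;
        }
    return result;                         /* caller must free() it */
}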

Optimization of 3D Direct Convolution Implementation in C

For my project, I've written a naive C implementation of direct 3D convolution with periodic padding on the input. Unfortunately, since I'm new to C, the performance isn't so good... here's the code:
int mod(int a, int b)
{
    // calculate mod to get the correct index with periodic padding
    int r = a % b;
    return r < 0 ? r + b : r;
}
void convolve3D(const double *image, const double *kernel,
                const int imageDimX, const int imageDimY, const int imageDimZ,
                const int kernelDimX, const int kernelDimY, const int kernelDimZ,
                double *result)
{
    int i, j, k, l, m, n;
    int kernelCenterX = (kernelDimX - 1) / 2;
    int kernelCenterY = (kernelDimY - 1) / 2;
    int kernelCenterZ = (kernelDimZ - 1) / 2;
    int xShift, yShift, zShift;
    int outIndex, outI, outJ, outK;
    int imageIndex = 0, kernelIndex = 0;
    // Loop through each voxel
    for (k = 0; k < imageDimZ; k++) {
        for (j = 0; j < imageDimY; j++) {
            for (i = 0; i < imageDimX; i++) {
                kernelIndex = 0;
                // for each voxel, loop through each kernel coefficient
                for (n = 0; n < kernelDimZ; n++) {
                    for (m = 0; m < kernelDimY; m++) {
                        for (l = 0; l < kernelDimX; l++) {
                            // find the index of the corresponding voxel in the output image
                            xShift = l - kernelCenterX;
                            yShift = m - kernelCenterY;
                            zShift = n - kernelCenterZ;
                            outI = mod(i - xShift, imageDimX);
                            outJ = mod(j - yShift, imageDimY);
                            outK = mod(k - zShift, imageDimZ);
                            outIndex = outK * imageDimX * imageDimY + outJ * imageDimX + outI;
                            // calculate and add
                            result[outIndex] += kernel[kernelIndex] * image[imageIndex];
                            kernelIndex++;
                        }
                    }
                }
                imageIndex++;
            }
        }
    }
}
By convention, all the arrays (image, kernel, result) are stored in column-major fashion, and that's why I loop through them in this order, so that nearby iterations touch memory that is close together (I heard this would help).
I know the implementation is very naive, but since it's written in C I was hoping the performance would be good, but instead it's a little disappointing. I tested it with an image of size 100^3 and a kernel of size 10^3 (~1 GFLOP in total if one counts only the multiplications and additions), and it took ~7 s, which I believe is way below the capability of a typical CPU.
If possible, could you guys help me optimize this routine?
I'm open to anything that could help, with just a few things if you could consider:
The problem I'm working with can be big (e.g. an image of size 200x200x200 with a kernel of size 50x50x50 or even larger). I understand that one way of optimizing this is to convert the problem into a matrix multiplication and use the BLAS GEMM routine, but I'm afraid memory could not hold such a big matrix.
Due to the nature of the problem, I would prefer direct convolution over FFT-based convolution, since my model was developed with direct convolution in mind, and my impression of FFT convolution is that it gives slightly different results than direct convolution, especially for rapidly changing images, a discrepancy I'm trying to avoid.
That said, I'm in no way an expert. So if you have a great implementation based on FFT convolution, and/or my impression of FFT convolution is totally biased, I would really appreciate your help.
The input images are assumed to be periodic, so periodic padding is necessary.
I understand that utilizing BLAS/SIMD or other lower-level approaches would definitely help a lot here, but since I'm a newbie I don't really know where to start... I would really appreciate it if you could help point me in the right direction if you have experience with these libraries.
Thanks a lot for your help, and please let me know if you need more info about the nature of the problem
As a first step, replace your mod ((i - xShift), imageDimX) with something like this:
inline int clamp(int x, int size)
{
    if (x < 0) return x + size;
    if (x >= size) return x - size;
    return x;
}
These branches are very predictable because they yield the same result for very long runs of consecutive elements, whereas integer modulo is relatively slow.
Now, the next step (ordered by cost/profit) is parallelizing. If you have any modern compiler, just enable OpenMP somewhere in the project settings. After that you need 2 changes.
Decorate your outermost loop with something like this: #pragma omp parallel for schedule(guided)
Move your function-level variables inside that loop. This also means you'll have to compute the initial imageIndex from your k, for each iteration.
Next option: rework your code so you only write each output value once. Compute the final value in your innermost 3 loops, reading from scattered locations in both image and kernel, and write the result only once. When you have that result[outIndex] += in the inner loop, the CPU stalls waiting for the data from memory; when you accumulate into a variable that lives in a register instead, there's no access latency.
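Here is a sketch of that gather-style rework, reusing the names from the question and the branch-based wrapping helper above; it assumes the kernel dimensions do not exceed the image dimensions, so a single wrap suffices:
// each output voxel is computed exactly once, accumulating in a
// register instead of doing read-modify-write on result[]
for (int k = 0; k < imageDimZ; k++) {
    for (int j = 0; j < imageDimY; j++) {
        for (int i = 0; i < imageDimX; i++) {
            double acc = 0.0;
            int kernelIndex = 0;
            for (int n = 0; n < kernelDimZ; n++)
                for (int m = 0; m < kernelDimY; m++)
                    for (int l = 0; l < kernelDimX; l++) {
                        int srcI = clamp(i + (l - kernelCenterX), imageDimX);
                        int srcJ = clamp(j + (m - kernelCenterY), imageDimY);
                        int srcK = clamp(k + (n - kernelCenterZ), imageDimZ);
                        acc += kernel[kernelIndex++] *
                               image[(srcK * imageDimY + srcJ) * imageDimX + srcI];
                    }
            result[(k * imageDimY + j) * imageDimX + i] = acc; // single write
        }
    }
}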
SIMD is the most complicated optimization of the lot. In short, you'll want the maximum FMA width your hardware has (if you have AVX and need double precision, that width is 4), and you'll also need multiple independent accumulators in your 3 innermost loops, to avoid being bound by latency rather than saturating the throughput. Here's my answer to a much easier problem as an example of what I mean.
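To illustrate just the multiple-accumulator idea on a much simpler problem (a plain dot product, not the full convolution), here is a sketch assuming AVX2+FMA, n divisible by 16, and 32-byte-aligned inputs:
#include <immintrin.h>

/* four independent accumulators let consecutive FMAs overlap in the
   pipeline instead of each waiting on the previous one's result */
double dot(const double *a, const double *b, int n)
{
    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm256_fmadd_pd(_mm256_load_pd(a + i),      _mm256_load_pd(b + i),      acc0);
        acc1 = _mm256_fmadd_pd(_mm256_load_pd(a + i + 4),  _mm256_load_pd(b + i + 4),  acc1);
        acc2 = _mm256_fmadd_pd(_mm256_load_pd(a + i + 8),  _mm256_load_pd(b + i + 8),  acc2);
        acc3 = _mm256_fmadd_pd(_mm256_load_pd(a + i + 12), _mm256_load_pd(b + i + 12), acc3);
    }
    /* reduce the four vector accumulators to a scalar */
    __m256d s = _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
    double tmp[4];
    _mm256_storeu_pd(tmp, s);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}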

CUDA: parallelizing multiple nested for-loops containing a function call with nested loops

Issue
I am interested in parallelizing a problem using CUDA. The C code in question follows this simplified form:
int A, B, C; // 100 < A, B, C < 1,000
float *v1, *v2, *v3;
// v1, v2, v3 will have respective sizes A, B, C
// and will not be empty
float ***t1, ***t2, ***t3;
// t1, t2, t3 will eventually have size (ci, cj, ck)
// and will not be empty
// (ci, cj, ck and some_number are assumed defined elsewhere)
int i, j, k, l;
float xi, xj, xk;
for (i = 0; i < A; ++i) {
    xi = ci - v1[i];
    for (j = 0; j < B; ++j) {
        xj = (j*cj)*cos(j*M_PI/180);
        for (k = 0; k < C; ++k) {
            xk = xj - v3[k];
            if (xk < xi) {
                call_1(t1[i], v1, t2[i], &t3[i][j][k]);
            }
            else t3[i][j][k] = some_number;
        }
    }
}
here call_1 is
void call_1(float **w, float *x, float **y, float *z) {
    int k, max = some_value;
    float *v; // initialized to have size max
    for (k = 0; k < max; ++k)
        call_2(x[k], y[k], max, &v[k]);
    call_2(y, v, max, z);
}
here call_2 is
void call_2(float *w, float *x, int y, double *z)
that simply contains operations such as bit shifting, multiplication, subtraction and addition inside a single while loop.
Ideas attempted
So far, my idea is that the function call_1 may be transformed into kernel code, __global__ void call_1, and that call_2 may be transformed into device code without modifying its contents. In particular, I can probably make __global__ void call_1 be
double *v; // initialized to have size max
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int k = index; k < max; k += stride)
    call_2(x[k], y[k], max, &v[k]);
__syncthreads();
call_2(y, v, max, z);
free(v);
I'm partly aware that the for loops can be removed by using a combination of threadIdx, blockIdx, and gridDim, but I specifically am not sure how, especially since the problem contains a function call that itself makes a function call.
Well there are two possible answers to that, and, while I don't have the courage to search all of it for you, I'll still make it an answer since you seem to have been blatantly ignored. :/
First.
Recent CUDA APIs and NVIDIA architectures have support for function calls and even recursion in device code. I'm not exactly sure how it works, as I've never used it myself, but you might want to research that. (Or do some Vulkan, since it looks like so much fun and also supports it.)
Might help you: https://devtalk.nvidia.com/default/topic/493567/cuda-programming-and-performance/calling-external-kernel-from-cuda/
And other stuff with related keywords. : D
On the other hand..
When resolving simple issues, especially if, like me, you would rather spend your time programming than doing research and learning some random API by heart, you can always go with more primitive solutions using only the basics of the language you are using.
In your case, I would simply inline the calls to the function to make a single CUDA kernel, since it seems pretty easy to do.
Yeah, right, it might involve some copy-pasting if there are multiple calls to the function... which doesn't really matter if it lets you take it easy, efficiently solve a simple issue, and go make something more productive.
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int k = index; k < max; k += stride)
    call_2(x[k], y[k], max, &v[k]); // insert call_2's code here instead
Another way to go about it, when you are confident your data is big enough to get a good increase in performance even with the cost of passing code and data between CPU/RAM and GPU, is to simply have multiple "waves" of CUDA kernel calls.
You let the first wave process while preparing the second one, which is then launched once the first wave finishes.
This is basically equivalent to other, smarter constructs offered by recent CUDA implementations, so you would probably find smarter things to do with a bit of research, but then again... it depends on your priorities.
But yeah, manually inlining functions is great*. : D
*mostly never, but it can be pretty handy here

Error in OpenMP, trying to vectorize a Matrix Multiplication for loop

I'm trying to vectorize an old matrix multiplication program I made, specifically this function, using a parallel for call in OpenMP. I keep getting this error:
matrix_multiply.c(26): error: invalid entity for this variable list in omp clause
#pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
Any help would be much appreciated as I've tried looking up the error and can't find any documentation that was helpful. I'm compiling using ICC if that makes a difference.
void matrix_mult(int *matrix_A, int *matrix_B, int n)
{
    #pragma omp parallel for schedule(static) default(shared) private(i,j,k,sum)
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            int sum = 0;
            for (int k = 0; k < n; k++)
            {
                int index_a = i * n + k;
                int index_b = j + k * n;
                sum += matrix_A[index_a] * matrix_A[index_b];
            }
            matrix_B[i * n + j] = sum;
        }
    }
}
There're two things worth mentioning here:
What you're actually doing here isn't vectorizing (although your compiler might be doing it for you); it is parallelizing. Here, you're creating threads to split the work among. Each thread may or may not use the CPU's vector units to speed the computations up even more, but that has nothing to do with the parallelization directives you've put in.
The error the compiler reports only says that it doesn't know the variables you've listed in your private clause. Indeed, if you look closer, none of i, j, k and sum has been declared before the directive line, so for the compiler they don't exist (yet). As a matter of fact, since you only declare them when you need them (which is very good), inside the parallel region, you don't have to declare them private anyway: they already are private to the thread in which they are created. So just removing the private clause should fix your issue.
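A sketch of the corrected function (the body is unchanged; note that, as in the question, both operands are read from matrix_A):
void matrix_mult(int *matrix_A, int *matrix_B, int n)
{
    // i, j, k and sum are declared inside the parallel region,
    // so each thread automatically gets its own private copies
    #pragma omp parallel for schedule(static) default(shared)
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            int sum = 0;
            for (int k = 0; k < n; k++)
            {
                sum += matrix_A[i * n + k] * matrix_A[j + k * n];
            }
            matrix_B[i * n + j] = sum;
        }
    }
}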
Finally, if performance matters to you, then rather than trying to parallelize or vectorize this code, consider replacing it with an effective library call that will do it for you. Unfortunately, since you're dealing with integers, BLAS won't do, but I'm sure there are good options out there for that.

speed up Matrix Multiplication by SSE2

I want to know how to speed up matrix multiplication with SSE2.
Here is my code:
int mat_mult_simd(double *a, double *b, double *c, int n)
{
    __m128d c1, c2, a1, a2, b1;
    for (int i = 0; i < n/2; i++) {
        for (int j = 0; j < n/2; j++) {
            c1 = _mm_load_pd(c + (2*j*n) + (i*2));
            c2 = _mm_load_pd(c + n + (2*j*n) + (i*2));
            for (int k = 0; k < n; k++) {
                a1 = _mm_load1_pd(a + k + (2*j*n));
                a2 = _mm_load1_pd(a + n + k + (2*j*n));
                b1 = _mm_load_pd(b + (k*n) + (i*2));
                c1 = _mm_add_pd(c1, _mm_mul_pd(a1, b1));
                c2 = _mm_add_pd(c2, _mm_mul_pd(a2, b1));
            }
            _mm_store_pd(c + (2*j*n) + (i*2), c1);
            _mm_store_pd(c + n + (2*j*n) + (i*2), c2);
        }
    }
    return 0;
}
The parameters are:
a = vector a (MAT_SIZE*MAT_SIZE)
b = vector b (MAT_SIZE*MAT_SIZE)
c = vector c (MAT_SIZE*MAT_SIZE)
n = MAT_SIZE, a constant (always even and >= 2)
This code gives roughly a 4x speedup over:
int mat_mult_default(double *a, double *b, double *c, int n)
{
    double t;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            t = 0.0;
            for (int k = 0; k < n; k++)
                t += a[i*n + k] * b[k*n + j];
            c[i*n + j] = t;
        }
    }
    return 0;
}
But I want to speed it up further. I usually experiment with MAT_SIZE of 1000x1000 or 2000x2000.
How can I speed this up? Is there a better way of indexing? I really want to know. Thanks.
You can do a few things. The obvious one is splitting the work into several threads (one per core). You can use OpenMP (easiest), Intel TBB or other multithreading libraries.
This will provide a significant improvement on a multi-core machine.
Another thing is to look at the disassembly (via your favorite debugger) and see how the compiler handles all the multiplications you use for the indexes; some of them can be eliminated.
Your code does 2 computations per loop iteration; try doing 4 or 8 to get better locality. E.g. a1 and a2 could be computed together with their neighbors, which are already in the L1 cache, and you can actually load them with a single load operation.
Make sure the various arrays are SSE aligned (to 16 bytes) and change your code to use aligned reads/writes.
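A minimal sketch of the alignment point, using _mm_malloc (available with most x86 compilers; the sizes are placeholders matching the question's layout):
#include <xmmintrin.h>

/* 16-byte-aligned allocations so _mm_load_pd/_mm_store_pd are safe */
double *a = (double *)_mm_malloc(n * n * sizeof(double), 16);
double *b = (double *)_mm_malloc(n * n * sizeof(double), 16);
double *c = (double *)_mm_malloc(n * n * sizeof(double), 16);
/* ... fill and use with the aligned loads/stores from the code above ... */
_mm_free(a);
_mm_free(b);
_mm_free(c);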
I'd leave multithreading to the end as finding bugs is harder.
Just use the right library, like the Intel Math Kernel Library or a similarly highly optimized linear algebra package (OpenBLAS, AMD Core Math Library, ATLAS, ...). They are considered faster than hand-written code. They sometimes even have processor-specific optimizations for instruction sets and cache sizes, and their authors are professionals in this field. Unless you plan to publish a paper on your own optimization, go with a library.
In the latest issue of the German computer magazine c't, they claim the compiler is smart enough to use SSE or AVX by itself. Just write the right loops and the auto-vectorizer will bring the best results. This is true for the latest Intel compiler; Microsoft's compilers are too dumb. In some cases, with the right compiler flags, Intel's compiler even detects that you are programming a matrix multiplication and replaces it by the right call. Or check the documentation; it is not that hard to learn to use such a package.
