Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd() - c

Seemed to have fixed it myself by type casting the cij2 pointer inside the mm256 call
so _mm256_storeu_pd((double *)cij2,vecC);
I have no idea why this changed anything...
I'm writing some code and trying to take advantage of the Intel manual vectorization. But whenever I run the code I get a segmentation fault on trying to use my double *cij2.
if( q == 0)
{
__m256d vecA;
__m256d vecB;
__m256d vecC;
for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
{
double cij = C[i+j*lda];
double *cij2 = (double *)malloc(4*sizeof(double));
for (int k = 0; k < K; k+=4)
{
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
_mm256_storeu_pd(cij2, vecC);
for (int x = 0; x < 4; x++)
{
cij += cij2[x];
}
}
C[i+j*lda] = cij;
}
I've pinpointed the problem to the cij2 pointer. If i comment out the 2 lines that include that pointer the code runs fine, it doesn't work like it should but it'll actually run.
My question is why would i get a segmentation fault here? I know I've allocated the memory correctly and that the memory is a 256 vector of double's with size 64 bits.
After reading the comments I've come to add some clarification.
First thing I did was change the _mm_malloc to just a normal allocation using malloc. Shouldn't affect either way but will give me some more breathing room theoretically.
Second the problem isn't coming from a null return on the allocation, I added a couple loops in to increment through the array and make sure I could modify the memory without it crashing so I'm relatively sure that isn't the problem. The problem seems to stem from the loading of the data from vecC to the array.
Lastly I can not use BLAS calls. This is for a parallelisms class. I know it would be much simpler to call on something way smarter than I but unfortunately I'll get a 0 if I try that.

You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.
In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.
Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.
When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).
Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.
__m256d cij_vec = _mm256_setzero_pd();
for (int k = 0; k < K; k+=4) {
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
cij_vec = _mm256_add_pd(cij_vec, vecC); // TODO: use multiple accumulators to keep multiple VADDPD or VFMAPD instructions in flight.
}
C[i+j*lda] = hsum256_pd(cij_vec); // put the horizontal sum in an inline function
For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
Also, ideally nest your loops in a way that gave you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse). Even if you don't get fancy and transpose small blocks at a time that fit in cache. Even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].

Related

Would it be faster to use a for loop or list out the operations?

I'm working on code to do matrix operations for a satellite my school is making. Would it be faster and less resource intensive to use a for loop or to just write out the operations? All matrices are of a known size
for (i = 0; i < 3; i++)//Row
{
for (j = 0; j < 3; j++)//Column
{
result[i][j] = a[i][j] * b;
}
}
or
result[1][1] = a[1][1] * b;
result[1][2] = a[1][2] * b;
etc...

You are talking about loop unrolling. You're right, it is a common technique to decrease computing time of a program. However, as it has been said in the comments, it is not absolutely sure you will save time, because it depends on many factors (compiler, compiler optimisation level, etc...). It is also possible that the compiler does unroll loops itself if you choose a high optimisation level.
Don't forget it takes more code size, which is also a valuable resource.
Keep in mind there are other ways to optimize code. For example here, you multiply all elements in an array by the same variable. Perhaps you can do this multiplication later in the code when you access again the resultarray ? It will save travelling an array with all the memory accesses it implies.

Which sequence is more effective in Assembly language?

I have 2 C sequences which both multiply two matrices.
Sequence 1:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = 0; i < M; i++)
for (j = 0; j < P; j++)
for (k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
Sequence 2:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = M - 1; i >= 0; i--)
for (j = P - 1; j >= 0; j--)
for (k = N - 1; k >= 0; k--)
C[i][j] += A[i][k] * B[k][j];
My question is: which of them is more efficient when translated in Assembly language?
I'm pretty sure that the second one can be written using the loop instruction, while the first one can be written using inc/jl.

First, you should understand that source code does not dictate what the assembly language is. The C standard allows a compiler to transform a program in any way as long as the resulting observable behavior (defined by the standard) remains the same. (The observable behavior is largely the output to files and devices, interactive input and output, and accesses to special volatile objects.)
Compilers take advantage of this rule to optimize your program. If the results of your loop are the same in either direction, then, in the best compilers, writing the loop in one direction or another has no consequence. The compiler analyzes the source code and sees that the effect of the loop is merely to perform a set of operations whose order does not matter. It represents the loop and the operations within it abstractly and later generates the best assembly code it can.
If the arrays in your example are large, then the time it takes the compiler to execute the loop control instructions is irrelevant. In typical systems, it takes dozens of CPU cycles or more to fetch a value from memory. With large arrays, the bottleneck in your example code will be fetching data from memory. The CPU will be forced to wait for this data, and it will easily complete any loop control or array address arithmetic instructions while it is waiting for data from memory.
Typical systems deal with the slow memory problem by including some fast memory, called cache. Often, there is very fast cache built into the core of the processor itself, plus some fast cache on the chip with the processor, and there are may other levels of cache. Memory in cache is organized into lines, which are segments of consecutive data from memory. Thus, one cache line may contain eight consecutive int objects. When the processor needs data that is not already in cache, an entire cache line is fetched from memory. Because of this, you can avoid the memory delay by using eight consecutive int objects. When you read the first one (or even before—the processor may predict your read and start fetching it ahead of time), all eight will be ready from memory. So your program will only have to wait for the first one. When it goes to use the second through the eight, they will already be in cache, where they are immediately available to the processor.
Unfortunately, array multiplication is notoriously bad for caches. Although your loop traverses the rows of array A (using A[i][k] where k is the fastest-varying index as your code is written), it traverses the columns of B (using B[k][j]). So consecutive iterations of your loop use consecutive elements of A but not consecutive elements of B. If the arrays are large, your program will end up waiting for elements from B to be fetched from memory. And, if you change the code to use consecutive elements from B, then it no longer uses consecutive elements from A.
With array multiplication, a typical way to deal with this problem is to split the array multiplication into smaller blocks, doing only a portion at a time, perhaps 8×8 blocks. This works because the cache can hold multiple lines at a time. If you arrange the work so that one 8×8 block from B (e.g., all the elements with a row number from 16 to 23 and a column number from 32 to 39) is used repeatedly for a while, then it can remain in cache, with all its data immediately available. This sort of rearrangement of work can speed up your program tremendously, making it many times faster. It is a much larger improvement than merely changing the direction of your loops can provide.
Some compilers can see that your loops on i, j, and k can be interchanged, and they may try to reorganize them if there is some benefit. Few compilers can break up the routines into blocks as I describe above. Also, the compiler can rearrange the work in your example only because you show A, B, and C declared as separate arrays. If these were not visible to the compiler but were instead passed as pointers to a function that was performing matrix multiplication, the compiler would not be able to see that A, B, and C point to separate arrays. In this case, it cannot know that the order of the loops does not matter. If the function were passed a C that points to the same array as A, the function would be overwriting some of its input while calculating outputs, and so the loop directions would matter.
There are a variety of matrix multiplication libraries that use the blocking technique and others to perform matrix multiplication efficiently.

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
double J[151][151];
/* Other relevant variables declared */
calcJac(data,J,y);
/* Use J */
}
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
/* The first expensive loop */
int iter, jter;
for (iter=0; iter<151; iter++) {
for (jter = 0; jter<151; jter++) {
J[iter][jter] = 0;
}
}
/* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the the slow component or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
for(jter = 1; jter<151; jter++){
P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
}
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
for (jter=1; jter<151; jter++) {
Jv_scratch += J[iter][jter]*Ith(v,jter);
}
Ith(Jv,iter) = Jv_scratch;
Jv_scratch = 0;
}

1) No they're not you can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well you are using a fairly complex calculation to calculate the position of P and the position of J.
You may well get better performance. by stepping through as pointers:
for (iter = 1; iter<151; iter++)
{
double* pP = (P - 1) + (151 * iter);
double* pJ = data->J + (151 * iter);
for(jter = 1; jter<151; jter++, pP++, pJ++ )
{
*pP = - gamma * *pJ;
}
}
This way you move various of the array index calculation outside of the loop.
3) The best practice is to try and move as many calculations out of the loop as possible. Much like I did on the loop above.

First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
First, variables on the stack are not initialized for you. But there are faster ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
memset((void*)J, 0, sizeof(double) * 151 * 151);
/* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).

Others have already answered some of your questions. On the subject of matrix multiplication; it is difficult to write a fast algorithm for this, unless you know a lot about cache architecture and so on (the slowness will be caused by the order that you access array elements causes thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.

Initialization of an array to zero.
When J is declared to be a double
array are the values of the array
initialized to zero? If not, is there
a fast way to set all the elements to
zero?
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
In the
relatively slow loop below, is
accessing a matrix that is contained
in a structure 'data' the the slow
component or is it something else
about the loop?
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.

Optimizing 'for-loops' over arrays in C99 with different indexing

I want to speed up an array multiplication in C99.
This is the original for loops:
for(int i=0;i<n;i++) {
for(int j=0;j<m;j++) {
total[j]+= w[j][i] * x[i];
}
}
My boss asked my to try this, but it did not improve the speed:
for(int i=0;i<n;i++) {
float value = x[i];
for(int j=0;j<m;j++) {
total[j]+= w[j][i] * value;
}
}
Have you other ideas (except for openmp, which I already use) on how I could speed up these for-loops?
I am using:
gcc -DMNIST=1 -O3 -fno-strict-aliasing -std=c99 -lm -D_GNU_SOURCE -Wall -pedantic -fopenmp
Thanks!

One of the theories is that testing for zero is faster than testing for j<m. So by looping from j=m while j>0, in theory you could save some nanoseconds per loop. However in recent experience this has made not a single difference to me, so I think this doesn't hold for current cpu's.
Another issue is memory layout: if your inner loop accesses a chunk of memory that isn't spread out, but continuous, chances are you have more benefit of the lowest cache available in your CPU.
In your current example, switching the layout of w from w[j][i] to w[i][j] may therefore help. Aligning your values on 4 or 8 bytes boundaries will help as well (but you will find that this is already the case for your arrays)
Another one is loop-unrolling, meaning that you do your inner loop in chunks of, say, 4. So the evaluation if the loop is done, has to be done 4 times less. The optimum value must be determined emperically, and may also depend on the problem at hand (e.g. if you know you're looping a multiple of 5 times, use 5)

Right now, each two consecutive internal operations (i.e. total[j]+= w[j][i] * x[i]) write to different locations and read from distant locations. You can possibly gain some performance by localizing reads and writes (thus, hitting more the internal cache) - for example, by switching the j loop and the i loop, so that the j loop is the external and the i loop is the internal.
This way you'll be localizing both the reads and the writes:
Memory writes will be to the same place for all is.
Memory reads will be sequential for w[j][i] and x[i].
To sum up:
for(int j=0;j<m;j++) {
for(int i=0;i<n;i++) {
total[j]+= w[j][i] * x[i];
}
}

If this really matters:
Link against a tuned CBLAS library. There are lots to choose from, some free and some commercial. Some platforms already have one on the system.
Replace your code with a call to cblas_dgemv.
This is an extraordinarily well understood problem, and many smart people have written highly-tuned libraries for it. Use one of them.

If you know that x, total and w do not alias each other you can get a fairly measurable boost by rearranging the loop indices and avoiding the write to total[j] each time through the loop:
for(int j=0;j<m;j++) {
const float * const w_j = w[j];
float total_j = 0;
for(int i=0;i<n;i++)
total_j += w_j[i] * x[i];
total[j] += total_j;
}
However, BLAS is the right answer, most of the time for this sort of thing. The best solution will depend on n, m, prefetch times, pipeline depths, loop unrolling, the size of your cache lines, etc. You probably don't want to do the level of optimization that it other people have done under the covers.

Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i in [0, Ny] and j in [0, Nx].
Notice that this is different from matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?
EDIT
Results from the solution post of Drew Hall
All the results are from a 1024x1024 for(for) loop in which in each iteration two double values (a complex number) are assigned. The time is the averaged time over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix[2*(i*s_tda + j)] = GSL_REAL(c_value);
matrix[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
}
}
Source code for case 2:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
}
}
Source code for case 3:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
gsl_matrix_complex_set(matrix, i, j, c_value);
}
}

There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to mimimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (MN bound). You'll need to get a raw pointer to the underlying memory block (that is, a double rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).

No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.

If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.

Since the contents of the loop haven't been shown, the bit that calculates the c_value we don't know if the performance of the code is limited by memory bandwidth or limited by CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compiler aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).