Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i in [0, Ny] and j in [0, Nx].
Notice that this is different from matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?
EDIT
Results from the solution post of Drew Hall
All results are for a 1024x1024 nested for loop in which each iteration assigns two double values (a complex number). The times are averaged over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the rows in the inner loop in row-major mode (case 1).
223.22 ms when looping over the columns in the inner loop in row-major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(i*s_tda + j)] = GSL_REAL(c_value);
        matrix->data[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 2:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
        matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
    }
}
Source code for case 3:
for(i=0; i<Nx; i++)
{
    for(j=0; j<Ny; j++)
    {
        /* Operations to obtain c_value (including exponentiation) */
        gsl_matrix_complex_set(matrix, i, j, c_value);
    }
}

There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to minimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
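For example, with plain double buffers stored in row-major order (a sketch with illustrative names, not GSL-specific), the inner loop should run over the second index:
#include <math.h>
#include <stddef.h>

/* A and B hold Ny*Nx doubles in row-major order; j indexes contiguous
 * memory, so it belongs in the inner loop. */
void exp_elements_rowmajor(double *B, const double *A, size_t Ny, size_t Nx)
{
    for (size_t i = 0; i < Ny; i++)         /* outer loop: big stride  */
        for (size_t j = 0; j < Nx; j++)     /* inner loop: unit stride */
            B[i * Nx + j] = exp(A[i * Nx + j]);
}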
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M and N bounds) to a single loop iterating over the underlying data (M*N bound). You'll need to get a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).
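A sketch of that flat, unrolled loop (the function name, the plain double buffers, and the unroll factor of 8 are illustrative assumptions, not taken from the question):
#include <math.h>
#include <stddef.h>

/* Apply exp() element-wise over a flat buffer, 8 elements per iteration,
 * with a fall-through switch to handle the remainder. */
void exp_elements_unrolled(double *dst, const double *src, size_t count)
{
    size_t blocks = count / 8;
    size_t rem    = count % 8;

    while (blocks--) {
        dst[0] = exp(src[0]);  dst[1] = exp(src[1]);
        dst[2] = exp(src[2]);  dst[3] = exp(src[3]);
        dst[4] = exp(src[4]);  dst[5] = exp(src[5]);
        dst[6] = exp(src[6]);  dst[7] = exp(src[7]);
        dst += 8;  src += 8;
    }

    switch (rem) {             /* clean up count % 8 leftover elements */
    case 7: dst[6] = exp(src[6]); /* fall through */
    case 6: dst[5] = exp(src[5]); /* fall through */
    case 5: dst[4] = exp(src[4]); /* fall through */
    case 4: dst[3] = exp(src[3]); /* fall through */
    case 3: dst[2] = exp(src[2]); /* fall through */
    case 2: dst[1] = exp(src[1]); /* fall through */
    case 1: dst[0] = exp(src[0]); /* fall through */
    case 0: break;
    }
}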

No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.

If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.

Since the contents of the loop haven't been shown (the bit that calculates c_value), we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compilers aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).
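A rough sketch of that two-pointer layout (everything here is illustrative: there is no standard SIMD exponential, so exp256_pd below just emulates a vectorized exp with scalar calls; a real implementation would take it from a vector math library such as SLEEF or SVML):
#include <immintrin.h>   /* AVX */
#include <math.h>
#include <stddef.h>

/* Placeholder "vectorized" exponential, emulated with scalar exp() so the
 * sketch stays self-contained. */
static __m256d exp256_pd(__m256d x)
{
    double t[4];
    _mm256_storeu_pd(t, x);
    return _mm256_set_pd(exp(t[3]), exp(t[2]), exp(t[1]), exp(t[0]));
}

/* Two pointers: the first ~4/5 of the array is handled four doubles at a
 * time, the last ~1/5 one double at a time, interleaved in the same loop so
 * each iteration touches five values. */
void exp_inplace(double *a, size_t n)
{
    size_t vec_len = (4 * n / 5) & ~(size_t)3;   /* multiple of 4  */
    double *vp = a;                              /* SIMD pointer   */
    double *sp = a + vec_len;                    /* scalar pointer */
    size_t vi = 0, si = 0, scalar_len = n - vec_len;

    while (vi < vec_len || si < scalar_len) {
        if (vi < vec_len) {
            _mm256_storeu_pd(vp + vi, exp256_pd(_mm256_loadu_pd(vp + vi)));
            vi += 4;
        }
        if (si < scalar_len) {
            sp[si] = exp(sp[si]);
            si += 1;
        }
    }
}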

Related

Convert sequential loop into parallel in C using pthreads

I would like to apply a pretty simple straightforward calculation on a n-by-d-dimensional array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my script? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
double *calcDistance(double *X, int n, int d)
{
    //calculate and return an array[n-1] of all the distances
    //from the last point
    double *distances = calloc(n, sizeof(double));
    for(int i=0 ; i<n-1; i++)
    {
        //distances[i]=0;
        for (int j=0; j< d; j++)
        {
            distances[i] += pow(X[(j+1)*n-1]-X[j*n+i], 2);
        }
        distances[i] = sqrt(distances[i]);
    }
    return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define N 10 //00000
#define D 2
int main()
{
    srand(time(NULL));
    //allocate the proper space for X
    double *X = malloc(D*N*(sizeof(double)));
    //fill X with numbers in space (0,1)
    for(int i = 0 ; i<N ; i++)
    {
        for(int j=0; j<D; j++)
        {
            X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
        }
    }
    X = calcDistance(X, N, D);
    return 0;
}
I have already tried utilizing pthreads asynchronously through the use of a global_index that is protected by a mutex and a local_index. Through the use of a while() loop, a local_index is assigned to each thread on each iteration. The local_index assignment depends on the global_index value at that time (both happening in a mutual exclusion block). The thread executes the computation on the distances[local_index] element.
Unfortunately this implementation has led to a much slower program, with a 10x or 20x longer execution time compared to the sequential one cited above.
Another idea is to predetermine and split the array (say into four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common or efficient approach though.
Your inner loop jumps all over array X with a mixture of strides that varies with
the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc..
How could I significantly reduce the execution time of my script?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
I have already tried utilizing pthreads asynchronously through the use of a global_index that is protected by a mutex and a local_index. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
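A minimal sketch of that static split with pthreads (the names chunk, worker and calcDistance_parallel are illustrative, and X is assumed to have the same column-major layout as in the question); each thread writes a disjoint range of distances, so no mutex is needed, and pow(x, 2) is replaced by a plain multiplication:
#include <math.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    const double *X;
    double *distances;
    int n, d;
    int begin, end;                  /* half-open range of outer iterations */
} chunk;

static void *worker(void *arg)
{
    chunk *c = arg;
    for (int i = c->begin; i < c->end; i++) {
        double acc = 0.0;
        for (int j = 0; j < c->d; j++) {
            double diff = c->X[(j + 1) * c->n - 1] - c->X[j * c->n + i];
            acc += diff * diff;      /* multiply instead of pow(..., 2) */
        }
        c->distances[i] = sqrt(acc);
    }
    return NULL;
}

double *calcDistance_parallel(const double *X, int n, int d, int T)
{
    double *distances = calloc(n, sizeof(double));
    pthread_t tid[T];
    chunk args[T];
    int work = n - 1;                /* iterations 0 .. n-2, as in the original */
    for (int t = 0; t < T; t++) {
        args[t] = (chunk){ X, distances, n, d, t * work / T, (t + 1) * work / T };
        pthread_create(&tid[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < T; t++)
        pthread_join(tid[t], NULL);
    return distances;
}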
Another idea is to predetermine and split the array (say into four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common or efficient approach though.
This sounds like what I described. It is one of the standard scheduling models provided by OMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you parallelize across five cores in a four-core machine, then one will have to wait to run until one of the others has finished -- best theoretical speedup 60%. Parallelizing the same computation across only four cores uses the compute resources more efficiently, for a best theoretical speedup of about 75%.
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.

Which sequence is more effective in Assembly language?

I have 2 C sequences which both multiply two matrices.
Sequence 1:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = 0; i < M; i++)
for (j = 0; j < P; j++)
for (k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
Sequence 2:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = M - 1; i >= 0; i--)
for (j = P - 1; j >= 0; j--)
for (k = N - 1; k >= 0; k--)
C[i][j] += A[i][k] * B[k][j];
My question is: which of them is more efficient when translated in Assembly language?
I'm pretty sure that the second one can be written using the loop instruction, while the first one can be written using inc/jl.
First, you should understand that source code does not dictate what the assembly language is. The C standard allows a compiler to transform a program in any way as long as the resulting observable behavior (defined by the standard) remains the same. (The observable behavior is largely the output to files and devices, interactive input and output, and accesses to special volatile objects.)
Compilers take advantage of this rule to optimize your program. If the results of your loop are the same in either direction, then, in the best compilers, writing the loop in one direction or another has no consequence. The compiler analyzes the source code and sees that the effect of the loop is merely to perform a set of operations whose order does not matter. It represents the loop and the operations within it abstractly and later generates the best assembly code it can.
If the arrays in your example are large, then the time it takes the compiler to execute the loop control instructions is irrelevant. In typical systems, it takes dozens of CPU cycles or more to fetch a value from memory. With large arrays, the bottleneck in your example code will be fetching data from memory. The CPU will be forced to wait for this data, and it will easily complete any loop control or array address arithmetic instructions while it is waiting for data from memory.
Typical systems deal with the slow memory problem by including some fast memory, called cache. Often, there is very fast cache built into the core of the processor itself, plus some fast cache on the chip with the processor, and there may be other levels of cache. Memory in cache is organized into lines, which are segments of consecutive data from memory. Thus, one cache line may contain eight consecutive int objects. When the processor needs data that is not already in cache, an entire cache line is fetched from memory. Because of this, you can avoid the memory delay by using eight consecutive int objects. When you read the first one (or even before, since the processor may predict your read and start fetching it ahead of time), all eight will be read from memory. So your program will only have to wait for the first one. When it goes to use the second through the eighth, they will already be in cache, where they are immediately available to the processor.
Unfortunately, array multiplication is notoriously bad for caches. Although your loop traverses the rows of array A (using A[i][k] where k is the fastest-varying index as your code is written), it traverses the columns of B (using B[k][j]). So consecutive iterations of your loop use consecutive elements of A but not consecutive elements of B. If the arrays are large, your program will end up waiting for elements from B to be fetched from memory. And, if you change the code to use consecutive elements from B, then it no longer uses consecutive elements from A.
With array multiplication, a typical way to deal with this problem is to split the array multiplication into smaller blocks, doing only a portion at a time, perhaps 8×8 blocks. This works because the cache can hold multiple lines at a time. If you arrange the work so that one 8×8 block from B (e.g., all the elements with a row number from 16 to 23 and a column number from 32 to 39) is used repeatedly for a while, then it can remain in cache, with all its data immediately available. This sort of rearrangement of work can speed up your program tremendously, making it many times faster. It is a much larger improvement than merely changing the direction of your loops can provide.
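A minimal sketch of that blocking idea (the 8x8 block size, the function name, and the i-k-j order of the innermost loops are illustrative choices; C is assumed to be zero-initialized):
#define BS 8

/* Multiply in BSxBS tiles so the current block of B stays in cache while it
 * is reused across the rows of the current block of A. */
void matmul_blocked(int M, int N, int P, int A[M][N], int B[N][P], int C[M][P])
{
    for (int ii = 0; ii < M; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < P; jj += BS)
                for (int i = ii; i < M && i < ii + BS; i++)
                    for (int k = kk; k < N && k < kk + BS; k++)
                        for (int j = jj; j < P && j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}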
Some compilers can see that your loops on i, j, and k can be interchanged, and they may try to reorganize them if there is some benefit. Few compilers can break up the routines into blocks as I describe above. Also, the compiler can rearrange the work in your example only because you show A, B, and C declared as separate arrays. If these were not visible to the compiler but were instead passed as pointers to a function that was performing matrix multiplication, the compiler would not be able to see that A, B, and C point to separate arrays. In this case, it cannot know that the order of the loops does not matter. If the function were passed a C that points to the same array as A, the function would be overwriting some of its input while calculating outputs, and so the loop directions would matter.
There are a variety of matrix multiplication libraries that use the blocking technique and others to perform matrix multiplication efficiently.

Assemble eigen3 sparsematrix from smaller sparsematrices

I am assembling the Jacobian of a coupled multi-physics system. The Jacobian consists of a block matrix on the diagonal for each system and off-diagonal blocks for the coupling.
I find it best to assemble the blocks separately and then sum over them with projection matrices to get the complete Jacobian.
Pseudo-code (where J[i] are the diagonal blocks, C[ij] the couplings, and P[i] the projections into the complete matrix):
// diagonal blocks
J.setZero();
for(int i=0;i<N;++i){
    J += P[i]*J[i]*P[i].transpose();
}
// off diagonal elements
for(int i=0;i<N;++i){
    for(int j=i+1;j<N;++j){
        J += P[i]*C[ij]*P[j].transpose();
        J += P[j]*C[ji]*P[i].transpose();
    }
}
This costs a lot of performance, around 20% of the whole program, which is too much for mere assembly. I have to recalculate the Jacobian every time step since the system is highly nonlinear.
Valgrind indicates that the resource-consuming method is Eigen::internal::assign_sparse_to_sparse, and within that method the call to Eigen::SparseMatrix<>::InsertBackByOuterInner.
Is there a more efficient way to assemble such a matrix?
(I also had to use P*(J*P.transpose()) instead of P*J*P.transpose() to make the program compile; maybe there is already something wrong there.)
P.S: NDEBUG and optimizations are turned on
Edit: by storing P.transpose() in an extra matrix, I get a bit better performance, but the summation still accounts for 15% of the program's runtime.
Your code will be much faster by working inplace. First, estimate the number of non-zeros per column in the final matrix and reserve space (if not already done):
int nnz_per_col = ...;
J.reserve(VectorXi::Constant(n_cols, nnz_per_col));
If the number of nnz per column is highly non-uniform, then you can also compute it per column:
VectorXi nnz_per_col(n_cols);
for each j
    nnz_per_col(j) = ...;
J.reserve(nnz_per_col);
Then manually insert elements:
for each block B[k]
    for each element i,j
        J.coeffRef(foo(i),foo(j)) += B[k](i,j);
where foo implements the appropriate mapping of indices.
And for the next iteration, no need to reserve, but you need to set coefficient values to zero while preserving the structure:
J.coeffs().setZero();

Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

Seems I fixed it myself by casting the cij2 pointer inside the _mm256 call, i.e.
_mm256_storeu_pd((double *)cij2, vecC);
I have no idea why this changed anything...
I'm writing some code and trying to take advantage of Intel's intrinsics for manual vectorization, but whenever I run the code I get a segmentation fault when trying to use my double *cij2.
if( q == 0)
{
    __m256d vecA;
    __m256d vecB;
    __m256d vecC;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
        {
            double cij = C[i+j*lda];
            double *cij2 = (double *)malloc(4*sizeof(double));
            for (int k = 0; k < K; k+=4)
            {
                vecA = _mm256_load_pd(&A[i+k*lda]);
                vecB = _mm256_load_pd(&B[k+j*lda]);
                vecC = _mm256_mul_pd(vecA,vecB);
                _mm256_storeu_pd(cij2, vecC);
                for (int x = 0; x < 4; x++)
                {
                    cij += cij2[x];
                }
            }
            C[i+j*lda] = cij;
        }
}
I've pinpointed the problem to the cij2 pointer. If I comment out the two lines that use that pointer the code runs fine; it doesn't work like it should, but it'll actually run.
My question is: why would I get a segmentation fault here? I know I've allocated the memory correctly, and that the memory is a 256-bit vector of doubles, each 64 bits in size.
After reading the comments I've come to add some clarification.
First thing I did was change the _mm_malloc to just a normal allocation using malloc. Shouldn't affect either way but will give me some more breathing room theoretically.
Second the problem isn't coming from a null return on the allocation, I added a couple loops in to increment through the array and make sure I could modify the memory without it crashing so I'm relatively sure that isn't the problem. The problem seems to stem from the loading of the data from vecC to the array.
Lastly I can not use BLAS calls. This is for a parallelisms class. I know it would be much simpler to call on something way smarter than I but unfortunately I'll get a 0 if I try that.
You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.
In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.
Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.
When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).
Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.
__m256d cij_vec = _mm256_setzero_pd();
for (int k = 0; k < K; k+=4) {
    vecA = _mm256_load_pd(&A[i+k*lda]);
    vecB = _mm256_load_pd(&B[k+j*lda]);
    vecC = _mm256_mul_pd(vecA,vecB);
    cij_vec = _mm256_add_pd(cij_vec, vecC); // TODO: use multiple accumulators to keep multiple VADDPD or VFMAPD instructions in flight.
}
C[i+j*lda] = hsum256_pd(cij_vec); // put the horizontal sum in an inline function
For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
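One common pattern for such a horizontal sum looks roughly like this (a sketch; the linked answer discusses tuned variants):
#include <immintrin.h>

/* Sum the four doubles in a __m256d: add the high and low 128-bit halves,
 * then add the two remaining lanes. */
static inline double hsum256_pd(__m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);     /* lower two doubles  */
    __m128d hi = _mm256_extractf128_pd(v, 1);   /* upper two doubles  */
    lo = _mm_add_pd(lo, hi);                    /* pairwise sums      */
    __m128d swapped = _mm_unpackhi_pd(lo, lo);  /* high lane into low */
    return _mm_cvtsd_f64(_mm_add_sd(lo, swapped));
}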
Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
Also, ideally nest your loops in a way that gives you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse). Even if you don't get fancy and transpose small blocks at a time that fit in cache, even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].

Why does copying a 2D array column by column take longer than row by row in C? [duplicate]

This question already has answers here:
Why does the order of the loops affect performance when iterating over a 2D array?
#include <stdio.h>
#include <time.h>
#define N 32768
char a[N][N];
char b[N][N];
int main() {
    int i, j;
    printf("address of a[%d][%d] = %p\n", N, N, &a[N][N]);
    printf("address of b[%5d][%5d] = %p\n", 0, 0, &b[0][0]);

    clock_t start = clock();
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = b[i][j];
    clock_t end = clock();
    float seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);

    start = clock();
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = b[i][j];
    end = clock();
    seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);
    return 0;
}
Output:
address of a[32768][32768] = 0x80609080
address of b[ 0][ 0] = 0x601080
time taken: 18.063229 secs
time taken: 3.079248 secs
Why does column-by-column copying take almost 6 times as long as row-by-row copying? I understand that a 2D array is basically stored as an n*n linear array where A[i][j] = A[i*n + j], but using simple algebra, I calculated that a Turing machine head (on main memory) would have to travel the same total distance in both cases; the distance depends only on n*n, the size of the array, and x, the distance between the last element of the first array and the first element of the second array.
It pretty much comes down to how the CPU cache works (the original answer illustrated this with a diagram of the cache hierarchy, omitted here).
When accessing data, your CPU will not only load a single value, but will also load adjacent data into the CPU's L1 cache. When iterating through your array by row, the items that have automatically been loaded into the cache are actually the ones that are processed next. However, when you are iterating by column, each time an entire "cache line" of data (the size varies per CPU) is loaded, only a single item is used and then the next line has to be loaded, effectively making the cache pointless.
The wikipedia entry and, as a high level overview, this PDF should help you understand how CPU caches work.
Edit: chqrlie in the comments is of course correct. One of the relevant factors here is that only very few of your columns fit into the L1 cache at the same time. If your rows were much smaller (say, the total size of your two dimensional array was only some kilobytes) then you might not see a performance impact from iterating per-column.
While it's normal to draw the array as a rectangle, the addressing of array elements in memory is linear: 0 to one minus the number of bytes available (on nearly all machines).
Memory hierarchies (e.g. registers < L1 cache < L2 cache < RAM < swap space on disk) are optimized for the case where memory accesses are localized: accesses that are successive in time touch addresses that are close together. They are even more highly optimized (e.g. with pre-fetch strategies) for sequential access in linear order of addresses; e.g. 100,101,102...
In C, rectangular arrays are arranged in linear order by concatenating all the rows (other languages like FORTRAN and Common Lisp concatenate columns instead). Therefore the most efficient way to read or write the array is to do all the columns of the first row, then move on to the rest, row by row.
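A tiny self-contained illustration of that layout (the dimensions here are arbitrary):
#include <assert.h>

enum { ROWS = 4, COLS = 3 };
static int a[ROWS][COLS];

int main(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            /* row-major: element (i, j) sits i*COLS + j ints past the start */
            assert((char *)&a[i][j] == (char *)a + (i * COLS + j) * sizeof(int));
    return 0;
}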
If you go down the columns instead, successive touches are N bytes apart, where N is the number of bytes in a row: 100, 10100, 20100, 30100... for the case N = 10000 bytes. Then the second column is 101, 10101, 20101, etc. This is the absolute worst case for most cache schemes.
In the very worst case, you can cause a page fault on each access. These days, even on an average machine, it would take an enormous array to cause that. But if it happened, each touch could cost ~10 ms for a head seek, while sequential access takes a few nanoseconds per element. That's over a factor of a million difference. Computation effectively stops in this case. It has a name: disk thrashing.
In the more normal case where only cache misses are involved, not page faults, you might see a factor of a hundred. Still worth paying attention to.
There are three main aspects that contribute to the timing difference:
The first double loop accesses both arrays for the first time. You are actually reading uninitialized memory, which is bad if you expect any meaningful results (functionally as well as timing-wise), but in terms of timing what plays a part here is the fact that these addresses are cold and reside in main memory (if you're lucky), or aren't even paged in (if you're less lucky). In the latter case, you would have a page fault on each new page and would invoke a system call to allocate a page for the first time. Note that this doesn't have anything to do with the order of traversal, but simply with the fact that the first access is much slower. To avoid that, initialize both arrays to some value.
Cache line locality (as explained in the other answers) - if you access sequential data, you miss once per line, and then enjoy the benefit of having it fetched already. You most likely won't even hit it in the cache but rather in some buffer, since the consecutive requests will be waiting for that line to get fetched. When accessing column-wise, you would fetch the line, cache it, but if the reuse distance is large enough - you would lose it and have to fetch it again.
Prefetching - modern CPUs have HW prefetching mechanisms that can detect sequential accesses and prefetch the data ahead of time, which will eliminate even the first miss of each line. Most CPUs also have stride-based prefetchers that might be able to cover the column stride, but these usually don't work well with matrix structures since you have too many columns and it would be impossible for the HW to track all these stride flows simultaneously.
As a side note, I would recommend that any timing measurement be performed multiple times and averaged; that would eliminate problem #1.
