Run-time efficient transposition of a rectangular matrix of arbitrary size

I am pressed for time to optimize a large piece of C code for speed, and I am looking for an algorithm (ideally a C "snippet") that transposes a rectangular source matrix u[r][c] of arbitrary size (r rows, c columns) into a target matrix v[s][d] (s = c rows, d = r columns) in a "cache-friendly", i.e. data-locality-respecting, way. The typical size of u is around 5000 to 15000 rows by 50 to 500 columns, and it is clear that a row-wise access of elements is very cache-inefficient.
There are many discussions of this topic on the web (including near this thread), but as far as I can see all of them discuss special cases such as square matrices, u[r][r], or a matrix defined as a one-dimensional array, e.g. u[r * c], not the above-mentioned "array of arrays" (of equal length) used in my context of Numerical Recipes (background see here).
I would be very thankful for any hint that spares me the "reinvention of the wheel".
Martin

I do not think that an array of arrays is much harder to transpose than a linear array in general. But if you are going to have only 50 columns in each array, that sounds bad: it may not be enough to hide the overhead of pointer dereferencing.
I think that the overall strategy for a cache-friendly implementation is the same: process your matrix in tiles, and choose the tile size that performs best in experiments.
template<int BLOCK>
void TransposeBlocked(Matrix &dst, const Matrix &src) {
    int r = dst.r, c = dst.c;
    assert(r == src.c && c == src.r);
    for (int i = 0; i < r; i += BLOCK)
        for (int j = 0; j < c; j += BLOCK) {
            if (i + BLOCK <= r && j + BLOCK <= c)
                ProcessFullBlock<BLOCK>(dst.data, src.data, i, j);
            else
                ProcessPartialBlock(dst.data, src.data, r, c, i, j, BLOCK);
        }
}
I have tried to optimize the best case when r = 10000, c = 500 (with float type). On my local machine 128 x 128 tiles give a 2.5x speedup. Also, I have tried to use SSE to accelerate the transposition, but it does not change the timings significantly. I think that's because the problem is memory bound.
Here are full timings (for 100 launches each) of various implementations on Core2 E4700 2.6GHz:
Trivial: 6.111 sec
Blocked(4): 8.370 sec
Blocked(16): 3.934 sec
Blocked(64): 2.604 sec
Blocked(128): 2.441 sec
Blocked(256): 2.266 sec
BlockedSSE(16): 4.158 sec
BlockedSSE(64): 2.604 sec
BlockedSSE(128): 2.245 sec
BlockedSSE(256): 2.036 sec
Here is the full code used.
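For the array-of-arrays layout from the question, the tiling idea can be sketched in plain C roughly as follows. This is only a minimal sketch, not the code that was timed above: the name transpose_tiled, the zero-based row pointers and the tile size of 64 are assumptions for illustration, and the tile size should be chosen by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Tile edge length: a tuning parameter, 64 is only a starting guess. */
enum { TILE = 64 };

/* Transpose src (rows x cols) into dst (cols x rows), one tile at a time.
   Both matrices are arrays of row pointers with zero-based indexing. */
void transpose_tiled(float **dst, float *const *src, size_t rows, size_t cols)
{
    assert(dst != NULL && src != NULL);
    for (size_t i0 = 0; i0 < rows; i0 += TILE) {
        /* Clamp the tile so partial tiles at the edges are handled too. */
        size_t i1 = (i0 + TILE < rows) ? i0 + TILE : rows;
        for (size_t j0 = 0; j0 < cols; j0 += TILE) {
            size_t j1 = (j0 + TILE < cols) ? j0 + TILE : cols;
            for (size_t i = i0; i < i1; i++) {
                const float *srow = src[i];  /* one dereference per row */
                for (size_t j = j0; j < j1; j++)
                    dst[j][i] = srow[j];
            }
        }
    }
}
```

Hoisting the row pointer out of the innermost loop keeps the extra dereference of the array-of-arrays layout off the hot path for the source side.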

So, I'm guessing you have an array of arrays of floats/doubles. This setup is already very bad for cache performance. The reason is that with a 1-dimensional array the compiler can output code that results in a prefetch operation and (in the case of a very new compiler) produce SIMD/vectorized code. With an array of pointers there's a dereference operation on each step, making a prefetch more difficult. Not to mention there aren't any guarantees on memory alignment.
If this is for an assignment and you have no choice but to write the code from scratch, I'd recommend looking at how CBLAS does it (note that you'll still need your array to be "flattened"). Otherwise, you're much better off using a highly optimized BLAS implementation like OpenBLAS. It's been optimized for nearly a decade and will produce the fastest code for your target processor (tuning for things like cache sizes and vector instruction set).
The tl;dr is that using an array of arrays will result in terrible performance no matter what. Flatten your arrays and make your code nice to read by using a #define to access elements of the array.
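A flattened layout with an accessor macro might look like the sketch below. The names MAT and mat_alloc are made up for illustration; row-major storage is assumed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical accessor: element (i, j) of a row-major matrix with
   ncols columns stored in one contiguous block. */
#define MAT(m, ncols, i, j) ((m)[(size_t)(i) * (size_t)(ncols) + (size_t)(j)])

/* Allocate a rows x cols matrix as a single zero-filled block.
   Returns NULL on allocation failure; release with free(). */
double *mat_alloc(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    return calloc(rows * cols, sizeof(double));
}
```

With this, MAT(u, c, i, j) reads almost like u[i][j], while the data stays in one block that the compiler and the hardware prefetcher can walk linearly.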

Related

Compute efficiently Max and Min on data stream

I'm working in C with a stream of data. Basically I receive a column array of 6 elements every n milliseconds. I would like to compute the max value for each row of data.
To make this clear, this is what my data looks like (this is a toy example, actually I'll have thousands of columns acquired):
[6] [-10] [5]
[1] [5] [3]
[5] [30] [10]
[2] [-10] [0]
[-2] [5] [10]
[-5] [0] [1]
So basically (as I said) I receive a column of data every n milliseconds, and I want to compute the max and min value row-wise. So in my previous example my result would be:
max_values=[6,5,30,2,10,1]
min_values=[-10,1,5,-10,-2,-5]
I want to point out that I have no access to the full matrix, I can only work over single columns of 6 elements that I receive every n milliseconds.
This is my simple code algorithm so far (I'm omitting the whole code since it's part of a bigger project):
for(int i=0;i<6;i++){
    if(input[i]>temp_max[i]){
        temp_max[i]=input[i];
    }
    if(input[i]<temp_min[i]){
        temp_min[i]=input[i];
    }
}
Where input, temp_max and temp_min are all float arrays of dimension 6.
Basically my code executes this piece of code every time a new input array is available and updates the maximum and minimum accordingly.
Since I'm interested in performance (this is going to run on an embedded system), is there any way to improve this part of the code? Doing a comparison for every single element of the 2 arrays doesn't seem like the smartest idea.

With random input data (i.e. unordered data), it'll be pretty hard (aka impossible) to find min/max without a comparison per element.
You may get some minor improvement from something like:
/* Assumes temp_max and temp_min were both set from the first column received */
for(int i=0;i<6;i++){
    if(input[i]>temp_max[i]){
        temp_max[i]=input[i];
    }
    else // A value that just became the new max cannot also be a new min
    {
        if(input[i]<temp_min[i]){
            temp_min[i]=input[i];
        }
    }
}
but I doubt this will be a significant improvement.

Branching is slow, especially on embedded systems. So is scalar computation.
Fortunately, your target processor seems to be an ARM-based processor supporting the NEON SIMD instruction set (apparently one based on the 64-bit ARMv8 Cortex-A53). NEON can perform four 32-bit floating-point operations with a single instruction. This should be much faster than the current code (which compilers apparently fail to vectorize).
Here is an example code (untested):
#include <arm_neon.h>

void minmax_optim(float temp_min[6], float temp_max[6], float input[6]) {
    /* Compute the first 4 floats */
    float32x4_t vInput = vld1q_f32(input);
    float32x4_t vMin = vld1q_f32(temp_min);
    float32x4_t vMax = vld1q_f32(temp_max);
    vMin = vminq_f32(vInput, vMin);
    vMax = vmaxq_f32(vInput, vMax);
    vst1q_f32(temp_min, vMin);
    vst1q_f32(temp_max, vMax);
    /* Remainder 2 floats */
    float32x2_t vLastInput = vld1_f32(input+4);
    float32x2_t vLastMin = vld1_f32(temp_min+4);
    float32x2_t vLastMax = vld1_f32(temp_max+4);
    vLastMin = vmin_f32(vLastInput, vLastMin);
    vLastMax = vmax_f32(vLastInput, vLastMax);
    vst1_f32(temp_min+4, vLastMin);
    vst1_f32(temp_max+4, vLastMax);
}
The resulting code should be much faster. One can see on Godbolt that this vectorized implementation uses drastically fewer instructions than the reference implementation and contains no conditional jumps.

You nailed it -- you have to keep temporary max and min arrays. Unfortunately, if we're talking strictly C, it seems to be the only possible and thus the most performant algorithm.
Since you've mentioned it's going to run on an embedded system (but omitted which), please make sure you have hardware floating-point support. If there isn't any, that's going to be a high performance penalty. If you have high-end hardware, you can look for availability of vector instructions, but then that's platform-specific, possibly requiring assembly.

My impression is that the approach as such cannot be substantially improved, as the input is not available as a whole. That being said, the inner comparisons can be compacted. The assignments
if(input[i]>temp_max[i]){
temp_max[i]=input[i];
}
if(input[i]<temp_min[i]){
temp_min[i]=input[i];
}
can be improved to
if(input[i]>temp_max[i]){
temp_max[i]=input[i];
}
else if(input[i]<temp_min[i]){
temp_min[i]=input[i];
}
because if the current value replaces the temporary maximum, it cannot also replace the temporary minimum (assuming some sensible initialization).
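What "sensible initialization" could mean deserves a sketch: the else-if form is only safe if both arrays are seeded from the first column received. Seeding with +/-infinity instead would leave the minimum stuck on a strictly increasing sequence, because every sample would take the first branch. The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ROWS 6

/* Hypothetical state holder: running per-row extrema plus a "seeded" flag. */
typedef struct {
    float min[ROWS];
    float max[ROWS];
    bool seeded;
} minmax_state;

/* Fold one incoming column into the running extrema. */
void minmax_update(minmax_state *s, const float input[ROWS])
{
    assert(s != NULL && input != NULL);
    if (!s->seeded) {
        /* First column: it is both the minimum and the maximum so far. */
        for (size_t i = 0; i < ROWS; i++)
            s->min[i] = s->max[i] = input[i];
        s->seeded = true;
        return;
    }
    for (size_t i = 0; i < ROWS; i++) {
        if (input[i] > s->max[i])
            s->max[i] = input[i];
        else if (input[i] < s->min[i])  /* only reachable if not a new max */
            s->min[i] = input[i];
    }
}
```

A caller would zero-initialize the state once (so that seeded starts out false) and then call minmax_update for every column as it arrives.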

Only for max, but it is easy to extend to min:
#include <stddef.h>
#define MAX(a,b,c) ((a) > (b) ? ((a) > (c) ? (a) : (c)) : ((b) > (c) ? (b) : (c)))
void rowmax(int *a, int *b, int *c, int *result, size_t size)
{
    for(size_t index = 0; index < size; index++)
    {
        result[index] = MAX(a[index], b[index], c[index]);
    }
}

Which sequence is more effective in Assembly language?

I have 2 C sequences which both multiply two matrices.
Sequence 1:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = 0; i < M; i++)
for (j = 0; j < P; j++)
for (k = 0; k < N; k++)
C[i][j] += A[i][k] * B[k][j];
Sequence 2:
int A[M][N], B[N][P], C[M][P], i, j, k;
for (i = M - 1; i >= 0; i--)
for (j = P - 1; j >= 0; j--)
for (k = N - 1; k >= 0; k--)
C[i][j] += A[i][k] * B[k][j];
My question is: which of them is more efficient when translated in Assembly language?
I'm pretty sure that the second one can be written using the loop instruction, while the first one can be written using inc/jl.

First, you should understand that source code does not dictate what the assembly language is. The C standard allows a compiler to transform a program in any way as long as the resulting observable behavior (defined by the standard) remains the same. (The observable behavior is largely the output to files and devices, interactive input and output, and accesses to special volatile objects.)
Compilers take advantage of this rule to optimize your program. If the results of your loop are the same in either direction, then, in the best compilers, writing the loop in one direction or another has no consequence. The compiler analyzes the source code and sees that the effect of the loop is merely to perform a set of operations whose order does not matter. It represents the loop and the operations within it abstractly and later generates the best assembly code it can.
If the arrays in your example are large, then the time it takes the processor to execute the loop control instructions is irrelevant. In typical systems, it takes dozens of CPU cycles or more to fetch a value from memory. With large arrays, the bottleneck in your example code will be fetching data from memory. The CPU will be forced to wait for this data, and it will easily complete any loop control or array address arithmetic instructions while it is waiting for data from memory.
Typical systems deal with the slow memory problem by including some fast memory, called cache. Often, there is very fast cache built into the core of the processor itself, plus some fast cache on the chip with the processor, and there may be other levels of cache. Memory in cache is organized into lines, which are segments of consecutive data from memory. Thus, one cache line may contain eight consecutive int objects. When the processor needs data that is not already in cache, an entire cache line is fetched from memory. Because of this, you can avoid the memory delay by using eight consecutive int objects. When you read the first one (or even before, since the processor may predict your read and start fetching it ahead of time), all eight will be read from memory. So your program will only have to wait for the first one. When it goes to use the second through the eighth, they will already be in cache, where they are immediately available to the processor.
Unfortunately, array multiplication is notoriously bad for caches. Although your loop traverses the rows of array A (using A[i][k] where k is the fastest-varying index as your code is written), it traverses the columns of B (using B[k][j]). So consecutive iterations of your loop use consecutive elements of A but not consecutive elements of B. If the arrays are large, your program will end up waiting for elements from B to be fetched from memory. And, if you change the code to use consecutive elements from B, then it no longer uses consecutive elements from A.
With array multiplication, a typical way to deal with this problem is to split the array multiplication into smaller blocks, doing only a portion at a time, perhaps 8×8 blocks. This works because the cache can hold multiple lines at a time. If you arrange the work so that one 8×8 block from B (e.g., all the elements with a row number from 16 to 23 and a column number from 32 to 39) is used repeatedly for a while, then it can remain in cache, with all its data immediately available. This sort of rearrangement of work can speed up your program tremendously, making it many times faster. It is a much larger improvement than merely changing the direction of your loops can provide.
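The blocking idea can be sketched as below. This is an illustrative version only: it uses flat row-major arrays instead of the declarations from the question, the name matmul_blocked is made up, and the block edge of 8 is just the figure mentioned above, to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

enum { BS = 8 };  /* block edge; an assumption, tune by measurement */

/* C += A * B for row-major flat arrays: A is m x n, B is n x p, C is m x p. */
void matmul_blocked(size_t m, size_t n, size_t p,
                    const int *A, const int *B, int *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t kk = 0; kk < n; kk += BS) {
        size_t kend = (kk + BS < n) ? kk + BS : n;
        for (size_t jj = 0; jj < p; jj += BS) {
            size_t jend = (jj + BS < p) ? jj + BS : p;
            /* The B block [kk,kend) x [jj,jend) is reused for every row i,
               so it can stay in cache while i sweeps over all of A. */
            for (size_t i = 0; i < m; i++)
                for (size_t k = kk; k < kend; k++) {
                    int a = A[i * n + k];
                    for (size_t j = jj; j < jend; j++)
                        C[i * p + j] += a * B[k * p + j];
                }
        }
    }
}
```

Inside a block the innermost loop walks consecutive elements of both B and C, so each fetched cache line is used in full.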
Some compilers can see that your loops on i, j, and k can be interchanged, and they may try to reorganize them if there is some benefit. Few compilers can break up the routines into blocks as I describe above. Also, the compiler can rearrange the work in your example only because you show A, B, and C declared as separate arrays. If these were not visible to the compiler but were instead passed as pointers to a function that was performing matrix multiplication, the compiler would not be able to see that A, B, and C point to separate arrays. In this case, it cannot know that the order of the loops does not matter. If the function were passed a C that points to the same array as A, the function would be overwriting some of its input while calculating outputs, and so the loop directions would matter.
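When the arrays do arrive as pointers, C99's restrict qualifier is one way for the programmer to promise the compiler that they do not overlap. The sketch below assumes flat row-major arrays and a made-up function name; whether a given compiler actually reorders or vectorizes the loops as a result is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* restrict is the caller's promise that c does not overlap a or b for the
   duration of the call; breaking that promise is undefined behavior. */
void matmul_noalias(size_t m, size_t n, size_t p,
                    const int *restrict a, const int *restrict b,
                    int *restrict c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < p; j++) {
            int sum = 0;  /* accumulate locally, store once */
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * p + j];
            c[i * p + j] += sum;
        }
}
```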
There are a variety of matrix multiplication libraries that use the blocking technique and others to perform matrix multiplication efficiently.

Optimising C for performance vs memory optimisation using multidimensional arrays

I am struggling to decide between two optimisations for building a numerical solver for the Poisson equation.
Essentially, I have a two-dimensional array, of which I require n doubles in the first row, n/2 in the second, n/4 in the third, and so on.
Now my difficulty is deciding whether or not to use a contiguous 2d array grid[m][n], which for a large n would have many unused zeroes but would probably reduce the chance of a cache miss. The other, and more memory efficient method, would be to dynamically allocate an array of pointers to arrays of decreasing size. This is considerably more efficient in terms of memory storage but would it potentially hinder performance?
I don't think I clearly understand the trade-offs in this situation. Could anybody help?
For reference, I made a nice plot of the memory requirements in each case:

There is no hard and fast answer to this one. If your algorithm needs more memory than you expect to be given then you need to find one which is possibly slower but fits within your constraints.
Beyond that, the only option is to implement both and then compare their performance. If saving memory results in a 10% slowdown, is that acceptable for your use? If the version using more memory is 50% faster but only runs on the biggest computers, will it be used? These are the questions that we have to grapple with in Computer Science. But you can only look at them once you have numbers. Otherwise you are just guessing, and a fair amount of the time our intuition when it comes to optimizations is not correct.

Build a custom array that will follow the rules you have set.
The implementation will use a simple 1d contiguous array. You will need a function that will return the start of array given the row. Something like this:
int* Get( int* array , int n , int row ) //might contain logical errors
{
int pos = 0 ;
while( row-- )
{
pos += n ;
n /= 2 ;
}
return array + pos ;
}
Where n is the same n you described and is rounded down on every iteration.
You will have to call this function only once per entire row.
This function will never take more than O(log n) time, but if you want you can replace it with a single expression: http://en.wikipedia.org/wiki/Geometric_series#Formula
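As a sketch of that single expression: if n is a power of two, no rounding ever happens and the geometric sum n + n/2 + ... + n/2^(row-1) collapses to 2n - 2n/2^row. The function name row_offset and the power-of-two restriction are assumptions; for other n the loop version is the one that matches the rounded-down row lengths.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Start offset of a row when row k holds n >> k elements.
   Valid only for n a power of two and row <= log2(n) + 1. */
size_t row_offset(size_t n, unsigned row)
{
    assert(n != 0 && (n & (n - 1)) == 0);       /* power of two only */
    assert(row < sizeof(size_t) * CHAR_BIT);    /* keep the shift defined */
    return 2 * n - ((2 * n) >> row);
}
```

For n = 16 this gives the offsets 0, 16, 24, 28, 30 for rows 0 to 4, the same values the loop produces.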

You could use a single array and just calculate your offset yourself
size_t get_offset(int n, int row, int column) {
    size_t offset = column;
    while (row--) {
        offset += n;
        n >>= 1; /* each row is half as long as the previous one */
    }
    return offset;
}
double * array = calloc(get_offset(n, 64, 0), sizeof(double));
access via
array[get_offset(n, row, column)]

Optimizing C loops

I'm new to C from many years of Matlab for numerical programming. I've developed a program to solve a large system of differential equations, but I'm pretty sure I've done something stupid as, after profiling the code, I was surprised to see three loops that were taking ~90% of the computation time, despite the fact they are performing the most trivial steps of the program.
My question is in three parts based on these expensive loops:
Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?
void spam(){
double J[151][151];
/* Other relevant variables declared */
calcJac(data,J,y);
/* Use J */
}
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
/* The first expensive loop */
int iter, jter;
for (iter=0; iter<151; iter++) {
for (jter = 0; jter<151; jter++) {
J[iter][jter] = 0;
}
}
/* More code to populate J from data and y that runs very quickly */
}
During the course of solving I need to solve matrix equations defined by P = I - gamma*J. The construction of P is taking longer than solving the system of equations it defines, so something I'm doing is likely in error. In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?
for (iter = 1; iter<151; iter++) {
for(jter = 1; jter<151; jter++){
P[iter-1][jter-1] = - gamma*(data->J[iter][jter]);
}
}
Is there a best practice for matrix multiplication? In the loop below, Ith(v,iter) is a macro for getting the iter-th component of a vector held in the N_Vector structure 'v' (a data type used by the Sundials solvers). Particularly, is there a best way to get the dot product between v and the rows of J?
Jv_scratch = 0;
int iter, jter;
for (iter=1; iter<151; iter++) {
for (jter=1; jter<151; jter++) {
Jv_scratch += J[iter][jter]*Ith(v,jter);
}
Ith(Jv,iter) = Jv_scratch;
Jv_scratch = 0;
}

1) No, they're not. You can memset the array as follows:
memset( J, 0, sizeof( double ) * 151 * 151 );
or you can use an array initialiser:
double J[151][151] = { 0.0 };
2) Well, you are using a fairly complex calculation to compute the position in P and the position in J.
You may well get better performance by stepping through with pointers:
for (iter = 1; iter<151; iter++)
{
    double* pP = &P[iter - 1][0];    /* start of the destination row */
    double* pJ = &data->J[iter][1];  /* the source row starts at column 1 */
    for(jter = 1; jter<151; jter++, pP++, pJ++ )
    {
        *pP = - gamma * *pJ;
    }
}
This way you move much of the array index calculation out of the inner loop.
3) The best practice is to try to move as many calculations out of the loop as possible, much like I did in the loop above.

First, I'd advise you to split up your question into three separate questions. It's hard to answer all three; I, for example, have not worked much with numerical analysis, so I'll only answer the first one.
Variables on the stack are not initialized for you. But there are fast ways to initialize them. In your case I'd advise using memset:
static void calcJac(UserData data, double J[151][151],N_Vector y)
{
memset((void*)J, 0, sizeof(double) * 151 * 151);
/* More code to populate J from data and y that runs very quickly */
}
memset is a fast library routine to fill a region of memory with a specific pattern of bytes. It just so happens that, on platforms using IEEE 754 floating point, setting all bytes of a double to zero sets the double to zero, so take advantage of your library's fast routines (which will likely be written in assembler to take advantage of things like SSE).

Others have already answered some of your questions. On the subject of matrix multiplication: it is difficult to write a fast algorithm for this unless you know a lot about cache architecture and so on (the slowness is caused by the order in which you access array elements, which causes thousands of cache misses).
You can try Googling for terms like "matrix-multiplication", "cache", "blocking" if you want to learn about the techniques used in fast libraries. But my advice is to just use a pre-existing maths library if performance is key.

"Initialization of an array to zero. When J is declared to be a double array are the values of the array initialized to zero? If not, is there a fast way to set all the elements to zero?"
It depends on where the array is allocated. If it is declared at file scope, or as static, then the C standard guarantees that all elements are set to zero. The same is guaranteed if you set the first element to a value upon initialization, ie:
double J[151][151] = {0}; /* set first element to zero */
By setting the first element to something, the C standard guarantees that all other elements in the array are set to zero, as if the array were statically allocated.
Practically for this specific case, I very much doubt it will be wise to allocate 151*151*sizeof(double) bytes on the stack no matter which system you are using. You will likely have to allocate it dynamically, and then none of the above matters. You must then use memset() to set all bytes to zero.
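A heap-allocated version might look like the following sketch. The helper names are hypothetical; calloc hands back zero-filled memory, which reads as 0.0 on IEEE 754 platforms, and memset is only needed when an existing matrix has to be cleared again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DIM 151

/* Allocate a DIM x DIM matrix on the heap, already zero-filled.
   The result can still be indexed as J[i][j]; release it with free(). */
double (*alloc_jacobian(void))[DIM]
{
    return calloc(DIM, sizeof(double[DIM]));
}

/* Re-zero an existing matrix before it is repopulated. */
void clear_jacobian(double (*J)[DIM])
{
    assert(J != NULL);
    memset(J, 0, sizeof(double[DIM]) * DIM);
}
```

Because the pointer type is double (*)[151], it can be passed unchanged to a function declared with a double J[151][151] parameter.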
"In the relatively slow loop below, is accessing a matrix that is contained in a structure 'data' the slow component or is it something else about the loop?"
You should ensure that the function called from it is inlined. Otherwise there isn't much else you can do to optimize the loop: what is optimal is highly system-dependent (ie how the physical cache memories are built). It is best to leave such optimization to the compiler.
You could of course obfuscate the code with manual optimization things such as counting down towards zero rather than up, or to use ++i rather than i++ etc etc. But the compiler really should be able to handle such things for you.
As for matrix addition, I don't know of the mathematically most efficient way, but I suspect it is of minor relevance to the efficiency of the code. The big time thief here is the double type. Unless you really have need for high accuracy, I'd consider using float or int to speed up the algorithm.

Most efficient way to calculate the exponential of each element of a matrix

I'm migrating from Matlab to C + GSL and I would like to know what's the most efficient way to calculate the matrix B for which:
B[i][j] = exp(A[i][j])
where i in [0, Ny] and j in [0, Nx].
Notice that this is different from matrix exponential:
B = exp(A)
which can be accomplished with some unstable/unsupported code in GSL (linalg.h).
I've just found the brute force solution (couple of 'for' loops), but is there any smarter way to do it?
EDIT
Results from the solution post of Drew Hall
All the results are from a 1024x1024 for(for) loop in which in each iteration two double values (a complex number) are assigned. The time is the averaged time over 100 executions.
Results when taking into account the {Row,Column}-Major mode to store the matrix:
226.56 ms when looping over the row in the inner loop in Row-Major mode (case 1).
223.22 ms when looping over the column in the inner loop in Row-Major mode (case 2).
224.60 ms when using the gsl_matrix_complex_set function provided by GSL (case 3).
Source code for case 1:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix->data[2*(i*s_tda + j)] = GSL_REAL(c_value);
matrix->data[2*(i*s_tda + j)+1] = GSL_IMAG(c_value);
}
}
Source code for case 2:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
matrix->data[2*(j*s_tda + i)] = GSL_REAL(c_value);
matrix->data[2*(j*s_tda + i)+1] = GSL_IMAG(c_value);
}
}
Source code for case 3:
for(i=0; i<Nx; i++)
{
for(j=0; j<Ny; j++)
{
/* Operations to obtain c_value (including exponentiation) */
gsl_matrix_complex_set(matrix, i, j, c_value);
}
}

There's no way to avoid iterating over all the elements and calling exp() or equivalent on each one. But there are faster and slower ways to iterate.
In particular, your goal should be to minimize cache misses. Find out if your data is stored in row-major or column-major order, and be sure to arrange your loops such that the inner loop iterates over elements stored contiguously in memory, and the outer loop takes the big stride to the next row (if row major) or column (if column major). Although this seems trivial, it can make a HUGE difference in performance (depending on the size of your matrix).
Once you've handled the cache, your next goal is to remove loop overhead. The first step (if your matrix API supports it) is to go from nested loops (M & N bounds) to a single loop iterating over the underlying data (M*N bound). You'll need to get a raw pointer to the underlying memory block (that is, a double* rather than a double**) to do this.
Finally, throw in some loop unrolling (that is, do 8 or 16 elements for each iteration of the loop) to further reduce the loop overhead, and that's probably about as quick as you can make it. You'll probably need a final switch statement with fall-through to clean up the remainder elements (for when your array size % block size != 0).

No, unless there's some strange mathematical quirk I haven't heard of, you pretty much just have to loop through the elements with two for loops.

If you just want to apply exp to an array of numbers, there's really no shortcut. You gotta call it (Nx * Ny) times. If some of the matrix elements are simple, like 0, or there are repeated elements, some memoization could help.
However, if what you really want is a matrix exponential (which is very useful), the algorithm we rely on is DGPADM. It's in Fortran, but you can use f2c to convert it to C. Here's the paper on it.

Since the contents of the loop (the bit that calculates c_value) haven't been shown, we don't know whether the performance of the code is limited by memory bandwidth or by the CPU. The only way to know for sure is to use a profiler, and a sophisticated one at that. It needs to be able to measure memory latency, i.e. the amount of time the CPU has been idle waiting for data to arrive from RAM.
If you are limited by memory bandwidth, there's not a lot you can do once you're accessing memory sequentially. The CPU and memory work best when data is fetched sequentially. Random accesses hit the throughput as data is more likely to have to be fetched into cache from RAM. You could always try getting faster RAM.
If you're limited by CPU then there are a few more options available to you. Using SIMD is one option, as is hand coding the floating point code (C/C++ compilers aren't great at FPU code for many reasons). If this were me, and the code in the inner loop allows for it, I'd have two pointers into the array, one at the start and a second 4/5ths of the way through it. Each iteration, a SIMD operation would be performed using the first pointer and scalar FPU operations using the second pointer so that each iteration of the loop does five values. Then, I'd interleave the SIMD instructions with the FPU instructions to mitigate latency costs. This shouldn't affect your caches since (at least on the Pentium) the MMU can stream up to four data streams simultaneously (i.e. prefetch data for you without any prompting or special instructions).