Matrix multiplication in 2 different ways (comparing time) - c

I've got an assignment - compare 2 matrix multiplications - in the default way, and multiplication after transposition of second matrix, we should point the difference which method is faster. I've written something like this below, but time and time2 are nearly equal to each other. In one case the first method is faster, I run the multiplication with the same size of matrix, and in another one the second method is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum = 0;
for(int k=0; k<size; ++k) {
sum = sum + (m1[i][k] * m2[k][j]);
}
score[i][j] = sum;
}
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
int temp = m2[i][j];
m2[i][j] = m2[j][i];
m2[j][i] = temp;
}
}
clock_t start2 = clock();
int sum2;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum2 = 0;
for(int k=0; k<size; ++k) {
sum2 = sum2 + (m1[k][i] * m2[k][j]);
}
score[i][j] = sum2;
}
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;

You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but resolution may be smaller (i.e., the clocks change by more than 1 nanosecond, whenever they change). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>
static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;
void timing_start(void)
{
clock_gettime(CLOCK_REALTIME, &wall_start);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}
void timing_stop(void)
{
struct timespec cpu_end, wall_end;
clock_gettime(CLOCK_REALTIME, &wall_end);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
+ (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
+ (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. It prefers to develop their own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
+ a[r][1] * b[1][c]
: :
+ a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
+ a[r][1] * B[c][1]
: :
+ a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
For c in columns:
temporary = b[r][c]
b[r][c] = b[c][r]
b[c][r] = temporary
End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them back to the way they were), your innermost loop is over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop accesses items far from each other in memory, it is anti-optimized: it uses the pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

Related

Convert sequential loop into parallel in C using pthreads

I would like to apply a pretty simple straightforward calculation on a n-by-d-dimensional array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my script? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
double * calcDistance(double * X ,int n, int d)
{
//calculate and return an array[n-1] of all the distances
//from the last point
double *distances = calloc(n,sizeof(double));
for(int i=0 ; i<n-1; i++)
{
//distances[i]=0;
for (int j=0; j< d; j++)
{
distances[i] += pow(X[(j+1)*n-1]-X[j*n+i], 2);
}
distances[i] = sqrt(distances[i]);
}
return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#define N 10 //00000
#define D 2
int main()
{
srand(time(NULL));
//allocate the proper space for X
double *X = malloc(D*N*(sizeof(double)));
//fill X with numbers in space (0,1)
for(int i = 0 ; i<N ; i++)
{
for(int j=0; j<D; j++)
{
X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
}
}
X = calcDistances(X, N, D);
return 0;
}
I have already tried utilizing pthreads asynchronously through the use of a global_index that is imposed to mutex and a local_index. Through the use of a while() loop, a local_index is assigned to each thread on each iteration. The local_index assignment depends on the global_index value at that time (both happening in a mutual exclusion block). The thread executes the computation on the distances[local_index] element.
Unfortunately this implementation has lead to a much slower program with a x10 or x20 bigger execution time compared to the sequential one that is cited above.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
Your inner loop jumps all over array X with a mixture of strides that varies with
the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc..
How could I significantly reduce the execution time of my script?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
I have already tried utilizing pthreads asynchronously through the use
of a global_index that is imposed to mutex and a local_index. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
This sounds like what I described. It is one of the standard scheduling models provided by OMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you parallelize across five cores in a four-core machine, then one will have to wait to run until one of the others has finished -- best theoretical speedup 60%. Parallelizing the same computation across only four cores uses the compute resources more efficiently, for a best theoretical speedup of about 75%.
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.

Best approach to FIFO implementation in a kernel OpenCL

Goal: Implement the diagram shown below in OpenCL. The main thing needed from the OpenCl kernel is to multiply the coefficient array and temp array and then accumilate all those values into one at the end. (That is probably the most time intensive operation, parallelism would be really helpful here).
I am using a helper function for the kernel that does the multiplication and addition (I am hoping this function will be parallel as well).
Description of the picture:
One at a time, the values are passed into the array (temp array) which is the same size as the coefficient array. Now every time a single value is passed into this array, the temp array is multiplied with the coefficient array in parallel and the values of each index are then concatenated into one single element. This will continue until the input array reaches it's final element.
What happens with my code?
For 60 elements from the input, it takes over 8000 ms!! and I have a total of 1.2 million inputs that still have to be passed in. I know for a fact that there is a way better solution to do what I am attempting. Here is my code below.
Here are some things that I know are wrong with he code for sure. When I try to multiply the coefficient values with the temp array, it crashes. This is because of the global_id. All I want this line to do is simply multiply the two arrays in parallel.
I tried to figure out why it was taking so long to do the FIFO function, so I started commenting lines out. I first started by commenting everything except the first for loop of the FIFO function. As a result this took 50 ms. Then when I uncommented the next loop, it jumped to 8000ms. So the delay would have to do with the transfer of data.
Is there a register shift that I could use in OpenCl? Perhaps use some logical shifting method for integer arrays? (I know there is a '>>' operator).
float constant temp[58];
float constant tempArrayForShift[58];
float constant multipliedResult[58];
float fifo(float inputValue, float *coefficients, int sizeOfCoeff) {
//take array of 58 elements (or same size as number of coefficients)
//shift all elements to the right one
//bring next element into index 0 from input
//multiply the coefficient array with the array thats the same size of coefficients and accumilate
//store into one output value of the output array
//repeat till input array has reached the end
int globalId = get_global_id(0);
float output = 0.0f;
//Shift everything down from 1 to 57
//takes about 50ms here
for(int i=1; i<58; i++){
tempArrayForShift[i] = temp[i];
}
//Input the new value passed from main kernel. Rest of values were shifted over so element is written at index 0.
tempArrayForShift[0] = inputValue;
//Takes about 8000ms with this loop included
//Write values back into temp array
for(int i=0; i<58; i++){
temp[i] = tempArrayForShift[i];
}
//all 58 elements of the coefficient array and temp array are multiplied at the same time and stored in a new array
//I am 100% sure this line is crashing the program.
//multipliedResult[globalId] = coefficients[globalId] * temp[globalId];
//Sum the temp array with each other. Temp array consists of coefficients*fifo buffer
for (int i = 0; i < 58; i ++) {
// output = multipliedResult[i] + output;
}
//Returned summed value of temp array
return output;
}
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output) {
//Initialize the temporary array values to 0
for (int i = 0; i < 58; i ++) {
temp[i] = 0;
tempArrayForShift[i] = 0;
multipliedResult[i] = 0;
}
//fifo adds one element in and calls the fifo function. ALL I NEED TO DO IS SEND ONE VALUE AT A TIME HERE.
for (int i = 0; i < 60; i ++) {
Output[i] = fifo(Array[i], coefficients, 58);
}
}
I have had this problem with OpenCl for a long time. I am not sure how to implement parallel and sequential instructions together.
Another alternative I was thinking about
In the main cpp file, I was thinking of implementing the fifo buffer there and having the kernel do the multiplication and addition. But this would mean I would have to call the kernel 1000+ times in a loop. Would this be the better solution? Or would it just be completely inefficient.
To get good performance out of GPU, you need to parallelize your work to many threads. In your code you are just using a single thread and a GPU is very slow per thread but can be very fast, if many threads are running at the same time. In this case you can use a single thread for each output value. You do not actually need to shift values through a array: For every output value a window of 58 values is considered, you can just grab these values from memory, multiply them with the coefficients and write back the result.
A simple implementation would be (launch with as many threads as output values):
__kernel void lowpass(__global float *Array, __global float *coefficients, __global float *Output)
{
int globalId = get_global_id(0);
float sum=0.0f;
for (int i=0; i< 58; i++)
{
float tmp=0;
if (globalId+i > 56)
{
tmp=Array[i+globalId-57]*coefficient[57-i];
}
sum += tmp;
}
output[globalId]=sum;
}
This is not perfect, as the memory access patterns it generates are not optimal for GPUs. The Cache will likely help a bit, but there is clearly a lot of room for optimization, as the values are reused several times. The operation you are trying to perform is called convolution (1D). NVidia has an 2D example called oclConvolutionSeparable in their GPU Computing SDK, that shows an optimized version. You adapt use their convolutionRows kernel for a 1D convolution.
Here's another kernel you can try out. There are a lot of synchronization points (barriers), but this should perform fairly well. The 65-item work group is not very optimal.
the steps:
init local values to 0
copy coefficients to local variable
looping over the output elements to compute:
shift existing elements (work items > 0 only)
copy new element (work item 0 only)
compute dot product
5a. multiplication - one per work item
5b. reduction loop to compute sum
copy dot product to output (WI 0 only)
final barrier
the code:
__kernel void lowpass(__global float *Array, __constant float *coefficients, __global float *Output, __local float *localArray, __local float *localSums){
int globalId = get_global_id(0);
int localId = get_local_id(0);
int localSize = get_local_size(0);
//1 init local values to 0
localArray[localId] = 0.0f
//2 copy coefficients to local
//don't bother with this id __constant is working for you
//requires another local to be passed in: localCoeff
//localCoeff[localId] = coefficients[localId];
//barrier for both steps 1 and 2
barrier(CLK_LOCAL_MEM_FENCE);
float tmp;
for(int i = 0; i< outputSize; i++)
{
//3 shift elements (+barrier)
if(localId > 0){
tmp = localArray[localId -1]
}
barrier(CLK_LOCAL_MEM_FENCE);
localArray[localId] = tmp
//4 copy new element (work item 0 only, + barrier)
if(localId == 0){
localArray[0] = Array[i];
}
barrier(CLK_LOCAL_MEM_FENCE);
//5 compute dot product
//5a multiply + barrier
localSums[localId] = localArray[localId] * coefficients[localId];
barrier(CLK_LOCAL_MEM_FENCE);
//5b reduction loop + barrier
for(int j = 1; j < localSize; j <<= 1) {
int mask = (j << 1) - 1;
if ((localId & mask) == 0) {
localSums[local_index] += localSums[localId +j]
}
barrier(CLK_LOCAL_MEM_FENCE);
}
//6 copy dot product (WI 0 only)
if(localId == 0){
Output[i] = localSums[0];
}
//7 barrier
//only needed if there is more code after the loop.
//the barrier in #3 covers this in the case where the loop continues
//barrier(CLK_LOCAL_MEM_FENCE);
}
}
What about more work groups?
This is slightly simplified to allow a single 1x65 work group computer the entire 1.2M Output. To allow multiple work groups, you could use / get_num_groups(0) to calculate the amount of work each group should do (workAmount), and adjust the i for-loop:
for (i = workAmount * get_group_id(0); i< (workAmount * (get_group_id(0)+1) -1); i++)
Step #1 must be changed as well to initialize to the correct starting state for localArray, rather than all 0s.
//1 init local values
if(groupId == 0){
localArray[localId] = 0.0f
}else{
localArray[localSize - localId] = Array[workAmount - localId];
}
These two changes should allow you to use a more optimal number of work groups; I suggest some multiple of the number of compute units on the device. Try to keep the amount of work for each group in the thousands though. Play around with this, sometimes what seems optimal on a high-level will be detrimental to the kernel when it's running.
Advantages
At almost every point in this kernel, the work items have something to do. The only time fewer than 100% of the items are working is during the reduction loop in step 5b. Read more here about why that is a good thing.
Disadvantages
The barriers will slow down the kernel just by the nature of what barriers do: the pause a work item until the others reach that point. Maybe there is a way you could implement this with fewer barriers, but I still feel this is optimal because of the problem you are trying to solve.
There isn't room for more work items per group, and 65 is not a very optimal size. Ideally, you should try to use a power of 2, or a multiple of 64. This won't be a huge issue though, because there are a lot of barriers in the kernel which makes them all wait fairly regularly.

Optimized Selection Sort?

I have read sources that say that the time complexities for Selection sort are:
Best-case: O(n^2)
Average-case: O(n^2)
Worst-case: O(n^2)
I was wondering if it is worth it to "optimize" the algorithm by adding a certain line of code to make the algorithm "short-circuit" itself if the remaining part is already sorted.
Here's the code written in C:
I have also added a comment which indicates which lines are part of the "optimization" part.
void printList(int* num, int numElements) {
int i;
for (i = 0; i < numElements; i ++) {
printf("%d ", *(num + i));
}
printf("\n");
}
int main() {
int numElements = 0, i = 0, j = 0, min = 0, swap = 0, numSorted = 0;
printf("Enter number of elements: ");
scanf("%d", &numElements);
int* num = malloc(sizeof(int) * numElements);
for (i = 0; i < numElements; i ++) {
printf("Enter number = ");
scanf(" %d", num + i);
}
for (i = 0; i < numElements-1; i++) {
numSorted = i + 1; // "optimized"
min = i;
for (j = i + 1; j < numElements; j++) {
numSorted += *(num + j - 1) <= *(num + j); // "optimized"
if (*(num + min) > *(num + j))
min = j;
}
if (numSorted == numElements) // "optimized"
break;
if (min != i) {
swap = *(num + i);
*(num + i) = *(num + min);
*(num + min) = swap;
}
printList(num, numElements);
}
printf("Sorted list:\n");
printList(num, numElements);
free(num);
getch();
return 0;
}
Optimizing selection sort is a little silly. It has awful best-case, average, and worst-case time complexity, so if you want a remotely optimized sort you would (almost?) always pick another sort. Even insertion sort tends to be faster and it's hardly much more complicated to implement.
More to the point, checking if the list is sorted increases the time the algorithm takes in the worst case scenarios (the average case too I'm inclined to think). And even a mostly sorted list will not necessarily go any faster this way: consider 1,2,3,4,5,6,7,9,8. Even though the list only needs two elements swapped at the end, the algorithm will not short-circuit as it is not ever sorted until the end.
Just because something can be optimized, doesn't necessarily mean it should. Assuming profiling or "boss-says-so" indicates optimization is warranted there are a few things you can do.
As with any algorithm involving iteration over memory, anything that reduces the number of iterations can help.
keep track of the min AND max values - cut number of iterations in half
keep track of multiple min/max values (4 each will be 1/8th the iterations)
at some point temp values will not fit in registers
the code will get more complex
It can also help to maximize cache locality.
do a backward iteration after the forward iteration
the recently accessed memory should still be cached
going straight to another forward iteration would cause a cache miss
since you are moving backward, the cache predictor may prefetch the rest
this could actually be worse on some architectures (RISC-V)
operate on a cache line at a time where possible
this can allow the next cache line to be prefetched in the mean time
you may need to align the data or specially handle the first and last data
even with increased alignment, the last few elements may need "padding"
Use SIMD instructions and registers where useful and practical
useful for non-branching rank order sort of temps
can hold many data points simultaneously (AVX-512 can do a cache line)
avoids memory access (thus less cache misses)
If you use multiple max/min values, optimize sorting the n values of max and min
see here for techniques to sort a small fixed number of values.
save memory swaps until the end of each iteration and do them once
keep temporaries (or pointers) in registers in the mean time
There are quite a few more optimization methods available, but eventually the resemblance to selection sort starts to get foggy. Each of these is going to increase complexity and therefore maintenance cost to the point where a simpler implementation of a more appropriate algorithm may be a better choice.
The only way I see how this can be answered is if you define the purpose of why you are optimizing it.
Is it worth it in a professional setting: on the job, for code running "in production" - most likely (even almost certainly) not.
Is it worth it as a teaching/learning tool - sometimes yes.
I teach programming to individuals and sometimes I teach them algorithms and datastructures. I consider selection sort to be one of the easiest to explain and teach - it flows so naturally after explaining the algorithm for finding the minimum and swapping two values (swap()). Then, at the end I introduce the concept of optimization where we can implement this counter "already sorted" detection.
Admittedly bubble sort is even better to introduce optimization, because it has at least 3 easy to explain and substantial optimizations.
I was wondering if it is worth it to "optimize" the algorithm by adding a certain line of code to make the algorithm "short-circuit" itself if the remaining part is already sorted.
Clearly this change reduces the best-case complexity from O(n2) to O(n). This will be observed for inputs that are already sorted except for O(1) leading elements. If such inputs are a likely case, then the suggested code change might indeed yield an observable and worthwhile performance improvement.
Note, however, that your change more than doubles the work performed in the innermost loop, and consider that for uniform random inputs, the expected number of outer-loop iterations saved is 1. Consider also that any outer-loop iterations you do manage to trim off will be the ones that otherwise would do the least work. Overall, then, although you do not change the asymptotic complexity, the actual performance in the average and worst cases will be noticeably worse -- runtimes on the order of twice as long.
If you're after better speed then your best bet is to choose a different sorting algorithm. Among the comparison sorts, Insertion Sort will perform about the same as your optimized Selection Sort on the latter's best case, but it has a wider range of best-case scenarios, and will usually outperform (regular) Selection Sort in the average case. How the two compare in the worst case depends on implementation.
If you want better performance still then consider Merge Sort or Quick Sort, both of which are pretty simple to implement. Or if your data are suited to it then Counting Sort is pretty hard to beat.
we can optimize selection sort in best case which will be O(n) instead of O(n^2).
here is my optimization code.
public class SelectionSort {
static void selectionSort(int[]arr){
for(int i=0; i< arr.length; i++){
int maxValue=arr[0];
int maxIndex=0;
int cnt=1;
for (int j=1; j< arr.length-i; j++){
if(maxValue<=arr[j]){
maxValue=arr[j];
maxIndex=j;
cnt++;
}
}
if(cnt==arr.length)break;
arr[maxIndex]=arr[arr.length-1-i];
arr[arr.length-1-i]=maxValue;
}
}
public static void main(String[] args) {
int[]arr={1,-3, 0, 8, -45};
selectionSort(arr);
System.out.println(Arrays.toString(arr));
}
}

Why does copying a 2D array column by column take longer than row by row in C? [duplicate]

This question already has answers here:
Why does the order of the loops affect performance when iterating over a 2D array?
(7 answers)
Closed 7 years ago.
#include <stdio.h>
#include <time.h>
#define N 32768
char a[N][N];
char b[N][N];
int main() {
int i, j;
printf("address of a[%d][%d] = %p\n", N, N, &a[N][N]);
printf("address of b[%5d][%5d] = %p\n", 0, 0, &b[0][0]);
clock_t start = clock();
for (j = 0; j < N; j++)
for (i = 0; i < N; i++)
a[i][j] = b[i][j];
clock_t end = clock();
float seconds = (float)(end - start) / CLOCKS_PER_SEC;
printf("time taken: %f secs\n", seconds);
start = clock();
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
a[i][j] = b[i][j];
end = clock();
seconds = (float)(end - start) / CLOCKS_PER_SEC;
printf("time taken: %f secs\n", seconds);
return 0;
}
Output:
address of a[32768][32768] = 0x80609080
address of b[ 0][ 0] = 0x601080
time taken: 18.063229 secs
time taken: 3.079248 secs
Why does column by column copying take almost 6 times as long as row by row copying? I understand that 2D array is basically an nxn size array where A[i][j] = A[i*n + j], but using simple algebra, I calculated that a Turing machine head (on main memory) would have to travel a distance of in both the cases. Here nxn is the size of the array and x is the distance between last element of first array and first element of second array.
It pretty much comes down to this image (source):
When accessing data, your CPU will not only load a single value, but will also load adjacent data into the CPU's L1 cache. When iterating through your array by row, the items that have automatically been loaded into the cache are actually the ones that are processed next. However, when you are iterating by column, each time an entire "cache line" of data (the size varies per CPU) is loaded, only a single item is used and then the next line has to be loaded, effectively making the cache pointless.
The wikipedia entry and, as a high level overview, this PDF should help you understand how CPU caches work.
Edit: chqrlie in the comments is of course correct. One of the relevant factors here is that only very few of your columns fit into the L1 cache at the same time. If your rows were much smaller (say, the total size of your two dimensional array was only some kilobytes) then you might not see a performance impact from iterating per-column.
While it's normal to draw the array as a rectangle, the addressing of array elements in memory is linear: 0 to one minus the number of bytes available (on nearly all machines).
Memory hierarchies (e.g. registers < L1 cache < L2 cache < RAM < swap space on disk) are optimized for the case where memory accesses are localized: accesses that are successive in time touch addresses that are close together. They are even more highly optimized (e.g. with pre-fetch strategies) for sequential access in linear order of addresses; e.g. 100,101,102...
In C, rectangular arrays are arranged in linear order by concatenating all the rows (other languages like FORTRAN and Common Lisp concatenate columns instead). Therefore the most efficient way to read or write the array is to do all the columns of the first row, then move on to the rest, row by row.
If you go down the columns instead, successive touches are N bytes apart, where N is the number of bytes in a row: 100, 10100, 20100, 30100... for the case N=10000 bytes.Then the second column is 101,10101, 20101, etc. This is the absolute worst case for most cache schemes.
In the very worst case, you can cause a page fault on each access. These days on even on an average machine it would take an enormous array to cause that. But if it happened, each touch could cost ~10ms for a head seek. Sequential access is a few nano-seconds per. That's over a factor of a million difference. Computation effectively stops in this case. It has a name: disk thrashing.
In a more normal case where only cache faults are involved, not page faults, you might see a factor of hundred. Still worth paying attention.
There are 3 main aspects that contribute to the timing different:
The first double loop accesses both arrays for the first time. You are actually reading uninitialized memory which is bad if you expect any meaningful results (functionally as well as timing-wise), but in terms of timing what plays part here is the fact that these addresses are cold, and reside in the main memory (if you're lucky), or aren't even paged (if you're less lucky). In the latter case, you would have a page fault on each new page, and would invoke a system call to allocate a page for the first time. Note that this doesn't have anything to do with the order of traversal, but simply because the first access is much slower. To avoid that, initialize both arrays to some value.
Cache line locality (as explained in the other answers) - if you access sequential data, you miss once per line, and then enjoy the benefit of having it fetched already. You most likely won't even hit it in the cache but rather in some buffer, since the consecutive requests will be waiting for that line to get fetched. When accessing column-wise, you would fetch the line, cache it, but if the reuse distance is large enough - you would lose it and have to fetch it again.
Prefetching - modern CPUs would have HW prefetching mechanisms that can detect sequential accesses and prefetch the data ahead of time, which will eliminate even the first miss of each line. Most CPUs also have stride based prefetches which may be able to cover the column size, but these things don't work well usually with matrix structures since you have too many columns and it would be impossible for HW to track all these stride flows simultaneously.
As a side note, I would recommend that any timing measurement would be performed multiple times and amortized - that would have eliminated problem #1.

Optimized matrix multiplication in C

I'm trying to compare different methods for matrix multiplication.
The first one is normal method:
do
{
for (j = 0; j < i; j++)
{
for (k = 0; k < i; k++)
{
suma = 0;
for (l = 0; l < i; l++)
suma += MatrixA[j][l]*MatrixB[l][k];
MatrixR[j][k] = suma;
}
}
}
c++;
} while (c<iteraciones);
The second one consist of transposing the matrix B first and then do the multiplication by rows:
int f, co;
for (f = 0; f < i; f++) {
for ( co = 0; co < i; co++) {
MatrixB[f][co] = MatrixB[co][f];
}
}
c = 0;
do
{
for (j = 0; j < i; j++)
{
for (k = 0; k < i; k++)
{
suma = 0;
for (l = 0; l < i; l++)
suma += MatrixA[j][l]*MatrixB[k][l];
MatrixR[j][k] = suma;
}
}
}
c++;
} while (c<iteraciones);
The second method supposed to be much faster, because we are accessing contiguous memory slots, but I'm not getting a significant improvement in the performance. Am I doing something wrong?
I can post the complete code, but I think is not needed.
What Every Programmer Should Know About Memory (pdf link) by Ulrich Drepper has a lot of good ideas about memory efficiency, but in particular, he uses matrix multiplication as an example of how knowing about memory and using that knowledge can speed this process. Look at appendix A.1 in his paper, and read through section 6.2.1. Table 6.2 in the paper shows that he could get his running time to be 10% from a naive implementation's time for a 1000x1000 matrix.
Granted, his final code is pretty hairy and uses a lot of system-specific stuff and compile-time tuning, but still, if you really need speed, reading that paper and reading his implementation is definitely worth it.
Getting this right is non-trivial. Using an existing BLAS library is highly recommended.
Should you really be inclined to roll your own matrix multiplication, loop tiling is an optimization that is of particular importance for large matrices. The tiling should be tuned to the cache size to ensure that the cache is not being continually thrashed, which will occur with a naive implementation. I once measured a 12x performance difference tiling a matrix multiply with matrix sizes picked to consume multiples of my cache (circa '97 so the cache was probably small).
Loop tiling algorithms assume that a contiguous linear array of elements is used, as opposed to rows or columns of pointers. With such a storage choice, your indexing scheme determines which dimension changes fastest, and you are free to decide whether row or column access will have the best cache performance.
There's a lot of literature on the subject. The following references, especially the Banerjee books, may be helpful:
[Ban93] Banerjee, Utpal, Loop Transformations for Restructuring Compilers: the Foundations, Kluwer Academic Publishers, Norwell, MA, 1993.
[Ban94] Banerjee, Utpal, Loop Parallelization, Kluwer Academic Publishers, Norwell, MA, 1994.
[BGS93] Bacon, David F., Susan L. Graham, and Oliver Sharp, Compiler Transformations for High-Performance Computing, Computer Science Division, University of California, Berkeley, Calif., Technical Report No UCB/CSD-93-781.
[LRW91] Lam, Monica S., Edward E. Rothberg, and Michael E Wolf. The Cache Performance and Optimizations of Blocked Algorithms, In 4th International Conference on Architectural Support for Programming Languages, held in Santa Clara, Calif., April, 1991, 63-74.
[LW91] Lam, Monica S., and Michael E Wolf. A Loop Transformation Theory and an Algorithm to Maximize Parallelism, In IEEE Transactions on Parallel and Distributed Systems, 1991, 2(4):452-471.
[PW86] Padua, David A., and Michael J. Wolfe, Advanced Compiler Optimizations for Supercomputers, In Communications of the ACM, 29(12):1184-1201, 1986.
[Wolfe89] Wolfe, Michael J. Optimizing Supercompilers for Supercomputers, The MIT Press, Cambridge, MA, 1989.
[Wolfe96] Wolfe, Michael J., High Performance Compilers for Parallel Computing, Addison-Wesley, CA, 1996.
ATTENTION: You have a BUG in your second implementation
for (f = 0; f < i; f++) {
for (co = 0; co < i; co++) {
MatrixB[f][co] = MatrixB[co][f];
}
}
When you do f=0, c=1
MatrixB[0][1] = MatrixB[1][0];
you overwrite MatrixB[0][1] and lose that value! When the loop gets to f=1, c=0
MatrixB[1][0] = MatrixB[0][1];
the value copied is the same that was already there.
You should not write matrix multiplication. You should depend on external libraries. In particular you should use the GEMM routine from the BLAS library. GEMM often provides the following optimizations
Blocking
Efficient Matrix Multiplication relies on blocking your matrix and performing several smaller blocked multiplies. Ideally the size of each block is chosen to fit nicely into cache greatly improving performance.
Tuning
The ideal block size depends on the underlying memory hierarchy (how big is the cache?). As a result libraries should be tuned and compiled for each specific machine. This is done, among others, by the ATLAS implementation of BLAS.
Assembly Level Optimization
Matrix multiplicaiton is so common that developers will optimize it by hand. In particular this is done in GotoBLAS.
Heterogeneous(GPU) Computing
Matrix Multiply is very FLOP/compute intensive, making it an ideal candidate to be run on GPUs. cuBLAS and MAGMA are good candidates for this.
In short, dense linear algebra is a well studied topic. People devote their lives to the improvement of these algorithms. You should use their work; it will make them happy.
If the matrix is not large enough or you don't repeat the operations a high number of times you won't see appreciable differences.
If the matrix is, say, 1,000x1,000 you will begin to see improvements, but I would say that if it is below 100x100 you should not worry about it.
Also, any 'improvement' may be of the order of milliseconds, unless yoy are either working with extremely large matrices or repeating the operation thousands of times.
Finally, if you change the computer you are using for a faster one the differences will be even narrower!
How big improvements you get will depend on:
The size of the cache
The size of a cache line
The degree of associativity of the cache
For small matrix sizes and modern processors it's highly probable that the data fron both MatrixA and MatrixB will be kept nearly entirely in the cache after you touch it the first time.
Just something for you to try (but this would only make a difference for large matrices): seperate out your addition logic from the multiplication logic in the inner loop like so:
for (k = 0; k < i; k++)
{
int sums[i];//I know this size declaration is illegal in C. consider
//this pseudo-code.
for (l = 0; l < i; l++)
sums[l] = MatrixA[j][l]*MatrixB[k][l];
int suma = 0;
for(int s = 0; s < i; s++)
suma += sums[s];
}
This is because you end up stalling your pipeline when you write to suma. Granted, much of this is taken care of in register renaming and the like, but with my limited understanding of hardware, if I wanted to squeeze every ounce of performance out of the code, I would do this because now you don't have to stall the pipeline to wait for a write to suma. Since multiplication is more expensive than addition, you want to let the machine paralleliz it as much as possible, so saving your stalls for the addition means you spend less time waiting in the addition loop than you would in the multiplication loop.
This is just my logic. Others with more knowledge in the area may disagree.
Can you post some data comparing your 2 approaches for a range of matrix sizes ? It may be that your expectations are unrealistic and that your 2nd version is faster but you haven't done the measurements yet.
Don't forget, when measuring execution time, to include the time to transpose matrixB.
Something else you might want to try is comparing the performance of your code with that of the equivalent operation from your BLAS library. This may not answer your question directly, but it will give you a better idea of what you might expect from your code.
The computation complexity of multiplication of two N*N matrix is O(N^3). The performance will be dramatically improved if you use O(N^2.73) algorithm which probably has been adopted by MATLAB. If you installed a MATLAB, try to multiply two 1024*1024 matrix. On my computer, MATLAB complete it in 0.7s, but the C\C++ implementation of the naive algorithm like yours takes 20s. If you really care about the performance, refer to lower-complex algorithms. I heard there exists O(N^2.4) algorithm, however it needs a very large matrix so that other manipulations can be neglected.
not so special but better :
c = 0;
do
{
for (j = 0; j < i; j++)
{
for (k = 0; k < i; k++)
{
sum = 0; sum_ = 0;
for (l = 0; l < i; l++) {
MatrixB[j][k] = MatrixB[k][j];
sum += MatrixA[j][l]*MatrixB[k][l];
l++;
MatrixB[j][k] = MatrixB[k][j];
sum_ += MatrixA[j][l]*MatrixB[k][l];
sum += sum_;
}
MatrixR[j][k] = sum;
}
}
c++;
} while (c<iteraciones);
Generally speaking, transposing B should end up being much faster than the naive implementation, but at the expense of wasting another NxN worth of memory. I just spent a week digging around matrix multiplication optimization, and so far the absolute hands-down winner is this:
for (int i = 0; i < N; i++)
for (int k = 0; k < N; k++)
for (int j = 0; j < N; j++)
if (likely(k)) /* #define likely(x) __builtin_expect(!!(x), 1) */
C[i][j] += A[i][k] * B[k][j];
else
C[i][j] = A[i][k] * B[k][j];
This is even better than Drepper's method mentioned in an earlier comment, as it works optimally regardless of the cache properties of the underlying CPU. The trick lies in reordering the loops so that all three matrices are accessed in row-major order.
If you are working on small numbers, then the improvement you are mentioning is negligible. Also, performance will vary depend on the Hardware on which you are running. But if you are working on numbers in millions, then it will effect.
Coming to the program, can you paste the program you have written.
Very old question, but heres my current implementation for my opengl projects:
typedef float matN[N][N];
inline void matN_mul(matN dest, matN src1, matN src2)
{
unsigned int i;
for(i = 0; i < N^2; i++)
{
unsigned int row = (int) i / 4, col = i % 4;
dest[row][col] = src1[row][0] * src2[0][col] +
src1[row][1] * src2[1][col] +
....
src[row][N-1] * src3[N-1][col];
}
}
Where N is replaced with the size of the matrix. So if you are multiplying 4x4 matrices, then you use:
typedef float mat4[4][4];
inline void mat4_mul(mat4 dest, mat4 src1, mat4 src2)
{
unsigned int i;
for(i = 0; i < 16; i++)
{
unsigned int row = (int) i / 4, col = i % 4;
dest[row][col] = src1[row][0] * src2[0][col] +
src1[row][1] * src2[1][col] +
src1[row][2] * src2[2][col] +
src1[row][3] * src2[3][col];
}
}
This function mainly minimizes loops but the modulus might be taxing... On my computer this function performed roughly 50% faster than a triple for loop multiplication function.
Cons:
Lots of code needed (ex. different functions for mat3 x mat3, mat5 x mat5...)
Tweaks needed for irregular multiplication (ex. mat3 x mat4).....
This is a very old question but I recently wandered down the rabbit hole and developed 9 different matrix multiplication implementations for both contiguous memory and non-contiguous memory (about 18 different functions). The results are interesting:
https://github.com/cubiclesoft/matrix-multiply
Blocking (aka loop tiling) didn't always produce the best results. In fact, I found that blocking may produce worse results than other algorithms depending on matrix size. And blocking really only started doing marginally better than other algorithms somewhere around 1200x1200 and performed worse at around 2000x2000 but got better again past that point. This seems to be a common problem with blocking - certain matrix sizes just don't work well. Also, blocking on contiguous memory performed slightly worse than the non-contiguous version. Contrary to common thinking, non-contiguous memory storage also generally outperformed contiguous memory storage. Blocking on contiguous memory also performed worse than an optimized straight pointer math version. I'm sure someone will point out areas of optimization that I missed/overlooked but the general conclusion is that blocking/loop tiling may: Do slightly better, do slightly worse (smaller matrices), or it may do much worse. Blocking adds a lot of complexity to the code for largely inconsequential gains and a non-smooth/wacky performance curve that's all over the place.
In my opinion, while it isn't the fastest implementation of the nine options I developed and tested, Implementation 6 has the best balance between code length, code readability, and performance:
void MatrixMultiply_NonContiguous_6(double **C, double **A, double **B, size_t A_rows, size_t A_cols, size_t B_cols)
{
double tmpa;
for (size_t i = 0; i < A_rows; i++)
{
tmpa = A[i][0];
for (size_t j = 0; j < B_cols; j++)
{
C[i][j] = tmpa * B[0][j];
}
for (size_t k = 1; k < A_cols; k++)
{
tmpa = A[i][k];
for (size_t j = 0; j < B_cols; j++)
{
C[i][j] += tmpa * B[k][j];
}
}
}
}
void MatrixMultiply_Contiguous_6(double *C, double *A, double *B, size_t A_rows, size_t A_cols, size_t B_cols)
{
double tmpa;
for (size_t i = 0; i < A_rows; i++)
{
tmpa = A[i * A_cols];
for (size_t j = 0; j < B_cols; j++)
{
C[i * B_cols + j] = tmpa * B[j];
}
for (size_t k = 1; k < A_cols; k++)
{
tmpa = A[i * A_cols + k];
for (size_t j = 0; j < B_cols; j++)
{
C[i * B_cols + j] += tmpa * B[k * B_cols + j];
}
}
}
}
Simply swapping j and k (Implementation 3) does a lot all on its own but little adjustments to use a temporary var for A and removing the if conditional notably improves performance over Implementation 3.
Here are the implementations (copied verbatim from the linked repository):
Implementation 1 - The classic naive implementation. Also the slowest. Good for showing the baseline worst case and validating the other implementations. Not so great for actual, real world usage.
Implementation 2 - Uses a temporary variable for matrix C which might end up using a CPU register to do the addition.
Implementation 3 - Swaps the j and k loops from Implementation 1. The result is a bit more CPU cache friendly but adds a comparison per loop and the temporary from Implementation 2 is lost.
Implementation 4 - The temporary variable makes a comeback but this time on one of the operands (matrix A) instead of the assignment.
Implementation 5 - Move the conditional outside the innermost for loop. Now we have two inner for-loops.
Implementation 6 - Remove conditional altogether. This implementation arguably offers the best balance between code length, code readability, and performance. That is, both contiguous and non-contiguous functions are short, easy to understand, and faster than the earlier implementations. It is good enough that the next three Implementations use it as their starting point.
Implementation 7 - Precalculate base row start for the contiguous memory implementation. The non-contiguous version remains identical to Implementation 6. However, the performance gains are negligible over Implementation 6.
Implementation 8 - Sacrifice a LOT of code readability to use simple pointer math. The result completely removes all array access multiplication. Variant of Implementation 6. The contiguous version performs better than Implementation 9. The non-contiguous version performs about the same as Implementation 6.
Implementation 9 - Return to the readable code of Implementation 6 to implement a blocking algorithm. Processing small blocks at a time allows larger arrays to stay in the CPU cache during inner loop processing for a small increase in performance at around 1200x1200 array sizes but also results in a wacky performance graph that shows it can actually perform far worse than other Implementations.

Resources