Row major vs Column Major Matrix Multiplication - c

I am currently working on a C program that computes a matrix multiplication. I have approached this task by looping through each column of the second matrix, as seen below.
I have set size to 1000.
for(i=0;i<size;i++)
{
    for(j=0;j<size;j++)
    {
        for(k=0;k<size;k++)
        {
            matC[i][j]+=matA[i][k]*matB[k][j];
        }
    }
}
I wanted to know what the problematic access pattern is in this implementation. What makes row access more efficient than column access, or the other way round? I am trying to understand this in terms of how caches work. Please help me understand this. Your help is much appreciated :)

If you are asking about cache usage, you might want to look at something called loop tiling. You break the loops into tiles so that the data touched by the inner loops stays in the cache (which is quite large these days). Your loop then turns into something like the following (assuming the matrices are passed into a function as pointers, and bt holds the transpose of B):
#define MIN(x, y) ((x) < (y) ? (x) : (y))

// c must be zeroed before the first tile is processed
for(j=0;j<size;j+=t)
    for(k=0;k<size;k+=t)
        for(i=0;i<size;i+=t)
            for(ii=i;ii<MIN(i+t,size);ii++)
                for(jj=j;jj<MIN(j+t,size);jj++)
                {
                    var=*(c+ii*size+jj); // var is a scalar variable
                    for(kk=k;kk<MIN(k+t,size);kk++)
                    {
                        var = var + *(a+ii*size+kk) * *(bt+jj*size+kk);
                    }
                    *(c+ii*size+jj) = var;
                }
The best value of t depends on the speedup that you measure. It can be t = 64, 128, 256 and so on. There are many other techniques that you can use here; loop tiling is just one technique to use the cache efficiently. Further, you can transpose the B matrix before you pass it to the multiplication function. That way you get linear access to the elements of matrix B. To explain further:
Consider
A -------- and B | | | |
-------- | | | |
-------- | | | |
-------- | | | |
Here, you always multiply a row of A with a column of B. Since you are using C, which stores arrays in row-major order, consecutive elements of a column of B are a full row apart in memory, so the CPU has to do extra work to read in each column of B. To ease this, you can transpose the matrix, read the rows of B' (which are nothing but the columns of B), and use loop tiling to keep as many elements as possible in the cache for the multiplication. Hope this helps.
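The transpose idea on its own (without tiling) can be sketched as below. This is a minimal illustration rather than the answerer's code: the function name matmul_transposed, the flat row-major layout, and the use of double are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: C = A * B for n x n row-major matrices stored as
 * flat arrays. B is transposed into a scratch buffer first so that the
 * inner loop reads both operands with unit stride.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int matmul_transposed(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (n == 0)
        return 0;

    double *bt = malloc(n * n * sizeof *bt);
    if (bt == NULL)
        return -1;

    /* bt[j][k] = b[k][j]: B is walked by column once, here, instead of
     * once per output row inside the multiplication. */
    for (size_t k = 0; k < n; k++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + k] = b[k * n + j];

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k]; /* both contiguous in k */
            c[i * n + j] = sum;
        }
    }

    free(bt);
    return 0;
}
```

Calling matmul_transposed(size, a, b, c) overwrites c, so unlike the += loop in the question it does not require c to be zeroed first. Whether the one-off transpose pays for itself depends on the matrix size and the machine, so it is worth timing both versions.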

Related

Find possible solutions for a matrix with known row/column sums and maximum cell values

I am trying to find solutions to a matrix where I know the row and column sums and the maximum value a cell can have. I want to find possible solutions that are within the constraints. I've already tried various things, like constructing an array of all cell values and picking from each cell in sequence, but whatever I try I always run into the problem where I run out of values for a cell.
I also tried a recursive algorithm, but I only managed to get the first result or it failed to find any solution. I think I have to do this with a backtracking algorithm? Not sure...
Any help or pointers would be appreciated.
Row sums A, B, C, column sums X, Y, Z as well as the maximum value for each ? are known. All values are positive integers.
   | C1 | C2 | C3 |
---------------------
R1 | ?  | ?  | ?  | A
---------------------
R2 | ?  | ?  | ?  | B
---------------------
R3 | ?  | ?  | ?  | C
---------------------
   | X  | Y  | Z  |
If you have heard about linear programming (LP) and its 'cousins' (ILP, MILP), that could be a good approach to solve your problem very efficiently.
A linear program consists of a set of variables (your matrix unknowns), constraints (maximum values, sums of rows and columns), and an objective function (here none) to minimize or maximize.
Let's call x[i][j] the values you are looking for.
With the following data:
NxM the dimensions of your matrix
max_val[i][j] the maximum value for the variable x[i][j]
row_val[i] the sum of the values on the row i
col_val[j] the sum of the values on the column j
Then a possible linear program that could solve your problem is:
// declare variables
int x[N][M] // or possibly float x[N][M]
// declare constraints
for all i in 1 .. N, j in 1 .. M, x[i][j] >= 1 // all values are positive
for all i in 1 .. N, j in 1 .. M, x[i][j] <= max_val[i][j]
for all i in 1 .. N, sum[j in 1 .. M](x[i][j]) == row_val[i]
for all j in 1 .. M, sum[i in 1 .. N](x[i][j]) == col_val[j]
// here the objective function is useless, but you still will need one
// for instance, let's minimize the sum of all variables (which is constant, but as I said, the objective function does not have to be useful)
minimize sum[i in 1 .. N](sum[j in 1 .. M](x[i][j]))
// you could also be more explicit about the uselessness of the objective function
// minimize 0
Solvers such as Gurobi or CPLEX (there are many others) can solve this kind of problem very fast, especially if your solutions do not need to be integers but can be floats (that makes the problem much, much easier). This approach also has the advantage of being not only faster to execute, but faster and simpler to code. The solvers have APIs in several common programming languages to ease their use.
For example, you can reasonably expect to solve this kind of problem in less than a minute, with hundreds of thousands of variables in the integer case, millions in the real variables case.
Edit:
In response to the comment, here is a piece of code in OPL (the modeling language that ships with IBM's CPLEX Optimization Studio) that would solve your problem. We consider a 3x3 case.
// declare your problem input
int row_val[1..3] = [7, 11, 8];
int col_val[1..3] = [14, 6, 6];
int max_val[1..3][1..3] = [[10, 10, 10], [10, 10, 10], [10, 10, 10]];
// declare your decision variables
dvar int x[1..3][1..3];
// objective function
minimize 0;
// constraints
subject to {
  forall(i in 1..3, j in 1..3) x[i][j] >= 1; // values are positive integers
  forall(i in 1..3, j in 1..3) x[i][j] <= max_val[i][j];
  forall(i in 1..3) sum(j in 1..3) x[i][j] == row_val[i];
  forall(j in 1..3) sum(i in 1..3) x[i][j] == col_val[j];
}
The concept of an LP solver is that you only describe the problem you want to solve, then the solver solves it for you. The problem must be described according to a certain set of rules. In the current case (Integer Linear Programming, or ILP), the variables must all be integers, and the constraints and objective function must be linear equalities (or inequalities) with regard to the decision variables.
The solver will then work as a black box. It will analyse the problem, and run algorithms that can solve it, with a ton of optimizations, and output the solution.
Since you wrote in a comment that you want to come up with your own solution, here are some guidelines:
Use a backtracking algorithm to find a solution. Your value space consists of 3*3=9 independent values, each of them between 1 and maxval[i][j]. Your constraints are the row and column sums (all of them must match).
Initialize your space with all 1s, then increment the values until they reach their maxval. Evaluate a condition only once every value it covers has been set (specifically: after 3 values you can evaluate the first row, after 6 the second row, after 7 the first column, after 8 the second column, and after 9 the third row and the third column).
If you reach the 9th value with all conditions passing, you've got a solution. Otherwise try the values from 1 to maxval; if none matches, step back. If the first value has been iterated through completely, there are no (further) solutions.
That's all.
More advanced backtracking:
Your free values are only the top-left 2*2=4 values. The third column and the third row are calculated; the condition is that each calculated value must be between 1 and the maxval for that particular element.
After choosing the free values of a row or column, you calculate its last element from the corresponding sum; the bottom-right element can be calculated from the column sum and validated against the row sum (or vice versa). The same processing rules apply as above: iterate through all possible values, step back if none matches, and check rules only once they can be applied.
It is a much faster method, since you have 5 bound variables (the bottom row and the right column) and only 4 unbound. These optimizations follow from your particular rules. It is a bit more complex to implement, though.
PS: 1 is used because you have positive integers. If you have non-negative integers, you need to start with 0.
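The simple variant described above (fill the cells in order, check a row or column as soon as its last cell is set, step back otherwise) could look roughly like this in C. It is only a sketch: the names count_solutions, fill_cell and MAX_CELLS are made up for the example, the grid is stored as a flat row-major array, and the "advanced" variant with calculated cells is not implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CELLS 64 /* arbitrary cap chosen for this sketch */

/* Try every value for cell `pos` (row-major order) and recurse.
 * row_left / col_left hold what each row / column still has to absorb.
 * The first complete filling is copied to `first` (if not NULL). */
static void fill_cell(int pos, int n, int m, const int *max_val,
                      int *row_left, int *col_left, int *x,
                      int *first, long *count)
{
    if (pos == n * m) { /* every row and column was checked on the way */
        if (*count == 0 && first != NULL)
            memcpy(first, x, (size_t)(n * m) * sizeof *x);
        (*count)++;
        return;
    }
    int i = pos / m, j = pos % m;
    for (int v = 1; v <= max_val[pos]; v++) { /* 1 = smallest positive */
        if (v > row_left[i] || v > col_left[j])
            break; /* larger values only overshoot further */
        if (j == m - 1 && v != row_left[i])
            continue; /* last cell of a row must close the row sum */
        if (i == n - 1 && v != col_left[j])
            continue; /* last cell of a column must close the column sum */
        x[pos] = v;
        row_left[i] -= v;
        col_left[j] -= v;
        fill_cell(pos + 1, n, m, max_val, row_left, col_left, x,
                  first, count);
        row_left[i] += v; /* step back */
        col_left[j] += v;
    }
}

/* Counts all fillings of an n x m grid of positive integers that match
 * the given row sums, column sums and per-cell maxima. */
long count_solutions(int n, int m, const int *row_sum, const int *col_sum,
                     const int *max_val, int *first)
{
    assert(n > 0 && m > 0 && n * m <= MAX_CELLS);
    int row_left[MAX_CELLS], col_left[MAX_CELLS], x[MAX_CELLS];
    long count = 0;
    for (int i = 0; i < n; i++) row_left[i] = row_sum[i];
    for (int j = 0; j < m; j++) col_left[j] = col_sum[j];
    fill_cell(0, n, m, max_val, row_left, col_left, x, first, &count);
    return count;
}
```

For non-negative instead of positive integers, the loop would start at 0, as noted in the PS. The search is exponential in the worst case, so this is meant for small grids like the 3x3 one in the question.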

#VALUES! while using IF and OR together

I have the File as following format
Name Number Position
A 1
B 2
C 3
D 4
Now, in cell A3, I applied =IF(B2=1,"Goal Keeper",OR(IF(B2=2,"Defender",OR(IF(B2=3,"MidField","Striker"))))) but it gives me a #VALUE! error.
I looked it up on Google, and my formula seems correct.
What I basically want is:
1- Goalkeeper 2-Defender 3-Midfield 4-Striker
Yes, the other way is to just filter by the number and copy-paste the text.
But I want to do it using a formula and want to know where I went wrong.
Your immediate problem lies with the expression (for example):
OR(IF(B2=3,"MidField","Striker"))
   |  \__/ \________/ \_______/|
   |  bool   string     string |
   \___________________________/
              string
The OR function expects a series of boolean values (true or false) and you're giving it a string value from the inner IF.
You don't actually need the or bits in this specific case, the if is a full if-else. So you can just use:
=IF(B2=1,"Goal Keeper",IF(B2=2,"Defender",IF(B2=3,"MidField","Striker")))
This means that B2=1 will result in "Goal Keeper", otherwise it will evaluate IF(B2=2,"Defender",IF(B2=3,"MidField","Striker")).
Then that means that, if B2=2, it will result in "Defender", otherwise it will evaluate IF(B2=3,"MidField","Striker").
Finally, that means that B2=3 will result in "MidField", and anything else will give "Striker".
The only situation I can envisage where OR would come in handy here would be when two different numbers were to generate the same string. Let's say both 1 and 4 should give "Goalie", you could use:
=IF(OR(B2=1,B2=4),"Goalie",IF(B2=2,"Defender","MidField"))
Keep in mind that a more general solution would be better implemented with the Excel lookup functions, ones that would search a table (on the spreadsheet somewhere) which mapped the integers to strings. Then, if the mapping needed to change, you would just update the table rather than going back and changing the formula in every single row.
If you are actually tasked with solving the problem by using the IF and OR functions within the same formula, this is the only way I can see how:
=IF(OR(B2=1, B2=2, B2=3, B2=4),IF(B2=1,"Goal Keeper",IF(B2=2,"Defender",IF(B2=3,"MidField","Striker"))))
If B2 does not equal 1-4, the OR function will return FALSE, all of the nested IF statements are bypassed, and the formula itself returns FALSE.

Convention with rows and columns index for most of languages

Following the question on this link, is it a general rule (I mean for the majority of languages) that, for a 2-dimensional array "x(i,j)", the first index i is the row index and the second index j is the column index?
I know that Fortran is column major and C is row major, and for both it seems that the classical convention is i = rows and j = columns, isn't it?
Moreover, could anyone tell me if Matlab is row or column major ?
This is a misunderstanding. There is no relation between how raw data is allocated in memory and the higher-level representation that the raw data is supposed to model.
C does not attach any meaning to the indices in [i][j]; the declaration only specifies how the data is laid out in memory, not how it is presented to a user. i could be rows or it could be columns; this is for the programmer to decide in their application.
However, C does allocate the right-most dimension together in memory, example:
int arr[2][3] = { {1,2,3}, {1,2,3} };
+-------+-------+-------+-------+-------+-------+
| | | | | | |
| 1 | 2 | 3 | 1 | 2 | 3 |
| | | | | | |
+-------+-------+-------+-------+-------+-------+
This means that the preferred way to iterate over this matrix is:
for(size_t i=0; i<2; i++)
    for(size_t j=0; j<3; j++)
        arr[i][j] = x;
Because this order gives the fastest memory access, as far as cache memory is concerned. But the language does not enforce this order: we can iterate with j in the outer loop and the program will work just as well (just slower).
Nor can we tell if this matrix is supposed to be a 2x3 or a 3x2.
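The layout shown above can be checked with a few lines of C. The helper names row_major_offset and check_layout_2x3 are invented for this sketch; the formula itself (offset = i * cols + j) is what C's "right-most dimension together" rule amounts to.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of arr[i][j] from the start of an array declared
 * as T arr[rows][cols]. Note that the row count is not needed. */
size_t row_major_offset(size_t i, size_t j, size_t cols)
{
    assert(j < cols);
    return i * cols + j;
}

/* Returns 1 if the compiler's layout of the 2x3 array from the text
 * agrees with the formula for every element, 0 otherwise. */
int check_layout_2x3(void)
{
    int arr[2][3] = { {1, 2, 3}, {1, 2, 3} };
    for (size_t i = 0; i < 2; i++) {
        for (size_t j = 0; j < 3; j++) {
            /* byte distance from the start of the whole array object */
            ptrdiff_t bytes = (const char *)&arr[i][j] - (const char *)arr;
            if ((size_t)bytes != row_major_offset(i, j, 3) * sizeof(int))
                return 0;
        }
    }
    return 1;
}
```

With j in the inner loop the offset grows by 1 per iteration, which is why that loop order walks memory sequentially; with i in the inner loop each step jumps by cols elements.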
For MATLAB, the first index is the row and the second is the column. But arrays are stored internally in column-major order (very early versions of MATLAB were implemented in FORTRAN; when it was originally commercialised it was mostly converted into C, but kept that convention).
Your question is answered here.
Quote:
In summary:
The "shape" of the memory is exactly the same in C and Fortran even though language differences make it look different due to reversed array indexing.
If you don't iterate through a Fortran array in k,j,i order, you'll access memory out of order and negatively impact your cache performance.
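For comparison with C, a column-major layout such as the one Fortran and MATLAB use can be mimicked in C with a flat array. This is an illustrative sketch with zero-based indices; col_major_offset and sum_col_major are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stddef.h>

/* Offset, in elements, of element (i, j) in a rows x cols matrix stored
 * column by column. Consecutive values of i are adjacent in memory. */
size_t col_major_offset(size_t i, size_t j, size_t rows)
{
    assert(i < rows);
    return i + j * rows;
}

/* Sums a column-major matrix with the first index in the inner loop,
 * so that memory is read front to back. */
double sum_col_major(const double *m, size_t rows, size_t cols)
{
    double total = 0.0;
    for (size_t j = 0; j < cols; j++)         /* slow index outside */
        for (size_t i = 0; i < rows; i++)     /* fast index inside  */
            total += m[col_major_offset(i, j, rows)];
    return total;
}
```

The loop nest is the mirror image of the preferred C order: in a column-major language the first index belongs in the innermost loop.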

Google Sheets concatenate a 2d array into a single column and sort simultaneously

I want to know if it is possible and if so how I can achieve converting an array of 8 columns and 300+ rows into a sorted, concatenated, one-dimensional array where each row is the concatenation of the contents in the 8 columns. I would also like to achieve this using a single formula.
Example:
leg | dog | tom | jon | bar | | | |
foo | bin | git | hub | bet | far | day | bin |
...
would convert into:
bar dog jon leg tom
bet bin bin day far foo git hub
...
I can achieve this for a single row using this:
=arrayformula(CONCATENATE(transpose(sort(transpose(F2:M2),1,1))&" "))
as long as the 8 columns are from F to M
I can then copy this formula down 300+ times which is easy to do but I would like a single formula that populates n number of rows.
Can this be achieved or do I have to copy the formula down?
If I understood correctly, you should be able to do that with a formula like this
=ArrayFormula(transpose(query(transpose(A2:H8),,50000)))
Change the range to suit.
EDIT: An alternative way may be to create a custom formula (sorting included). Add this to the script editor
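function concatenateAndSort(range) {
  return range.map(function (r) {
    // drop empty cells so they do not turn into leading spaces
    return [r.filter(function (c) { return c !== ""; }).sort().join(" ")];
  });
}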
Then in the spreadsheet (where you want the output to appear) enter
=concatenateAndSort(A3:H8)
(Change range to suit).
Something like this should do that:
=transpose(split(" "&join(" ",index(sort(arrayformula({row(A1:A300),len(A1:A300)*0-9E+99
;row(A1:A300),A1:A300
;row(A1:A300),B1:B300
;row(A1:A300),C1:C300
;row(A1:A300),D1:D300
;row(A1:A300),E1:E300
;row(A1:A300),F1:F300
;row(A1:A300),G1:G300
;row(A1:A300),H1:H300
})),0,2))," "&-9E+99&" ",false))
First it creates a two-dimensional array with original row number in the first column and value in the second for each cell (adding a new value -9e99 for each row), then the array is sorted, first column is discarded, all values are joined using a space, then split (by the added value surrounded by spaces), and finally transposed.
=A2&" "&B2&" "&C2&" "&D2&" "&E2&" "&F2&" "&G2&" "&H2
=JOIN(" "; A2:H2)
Sorted row: =JOIN(" ";SORT(TRANSPOSE(A2:H2);1;TRUE)) and then: Ctrl+Shift+Down Arrow ... Ctrl+Enter

Trying to pass MPI derived types between processors (and failing)

I am trying to parallelize a customer's Fortran code with MPI. f is an array of 4-byte reals dimensioned f(dimx,dimy,dimz,dimf). I need the various processes to work on different parts of the array's first dimension. (I would rather have started with the last, but it wasn't up to me.) So I define a derived type mpi_x_interface like so
call mpi_type_vector(dimy*dimz*dimf, 1, dimx, MPI_REAL, &
mpi_x_interface, mpi_err)
call mpi_type_commit(mpi_x_interface, mpi_err)
My intent is that a single mpi_x_interface will contain all of the data in 'f' at some given first index "i". That is, for given i, it should contain f(i,:,:,:). (Note that at this stage of the game, all procs have a complete copy of f. I intend to eventually split f up between the procs, except I want proc 0 to have a full copy for the purpose of gathering.)
ptsinproc is an array containing the number of "i" indices handled by each proc. x_slab_displs is the displacement from the beginning of the array for each proc. For two procs, which is what I am testing on, they are ptsinproc=(/61,60/), x_slab_displs=(/0,61/). myminpt is a simple integer giving the minimum index handled in each proc.
So now I want to gather all of f into proc 0 and I run
if (myrank == 0) then
call mpi_gatherv(MPI_IN_PLACE, ptsinproc(myrank),
+ mpi_x_interface, f(1,1,1,1), ptsinproc,
+ x_slab_displs, mpi_x_interface, 0,
+ mpi_comm_world, mpi_err)
else
call mpi_gatherv(f(myminpt,1,1,1), ptsinproc(myrank),
+ mpi_x_interface, f(1,1,1,1), ptsinproc,
+ x_slab_displs, mpi_x_interface, 0,
+ mpi_comm_world, mpi_err)
endif
I can send at most one "slab" like this. If I try to send the entire 60 "slabs" from proc 1 to proc 0 I get a seg fault due to an "invalid memory reference". BTW, even when I send that single slab, the data winds up in the wrong places.
I've checked all the obvious stuff, like making sure myrank and ptsinproc and x_slab_displs are what they should be on all procs. I've looked into the difference between "size" and "extent" and so on, to no avail. I'm at my wit's end. I just don't see what I am doing wrong. And someone might remember that I asked a similar (but different!) question a few months back. I admit I'm just not getting it. Your patience is appreciated.
First off, I just want to say that the reason you're running into so many problems is because you are trying to split up the first (fastest) axis. This is not recommended at all because as-is packing your mpi_x_interface requires a lot of non-contiguous memory accesses. We're talking a huge loss in performance.
Splitting up the slowest axis across MPI processes is a much better strategy. I would highly recommend transposing your 4D matrix so that the x axis is last if you can.
Now to your actual problem(s)...
Derived datatypes
As you have deduced, one problem is that the size and extent of your derived datatype might be incorrect. Let's simplify your problem a bit so I can draw a picture. Say dimy*dimz*dimf=3, and dimx=4. As-is, your datatype mpi_x_interface describes the following data in memory:
| X | | | | X | | | | X | | | |
That is, every 4th MPI_REAL, and 3 of them total. Seeing as this is what you want, so far so good: the size of your variable is correct. However, if you try and send "the next" mpi_x_interface, you see that your implementation of MPI will start at the next point in memory (which in your case has not been allocated), and throw an "invalid memory access" at you:
tries to access and bombs
vvv
| X | | | | X | | | | X | | | | Y | | | | Y | ...
What you need to tell MPI as part of your datatype is that "the next" mpi_x_interface starts only 1 real into the array. This is accomplished by redefining the "extent" of your derived datatype by calling MPI_Type_create_resized(). In your case, you need to write
integer :: mpi_x_interface, mpi_x_interface_resized
integer, parameter :: SIZEOF_REAL = 4 ! or whatever f actually is
! lower bound and extent must be address-sized integers
integer(kind=MPI_ADDRESS_KIND) :: lb, extent
lb = 0
extent = 1*SIZEOF_REAL
call mpi_type_vector(dimy*dimz*dimf, 1, dimx, MPI_REAL, &
mpi_x_interface, mpi_err)
call mpi_type_create_resized(mpi_x_interface, lb, extent, &
mpi_x_interface_resized, mpi_err)
call mpi_type_commit(mpi_x_interface_resized, mpi_err)
Then, calling "the next" 3 mpi_x_interface_resized will result in:
| X | Y | Z | A | X | Y | Z | A | X | Y | Z | A |
as expected.
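The effect of the extent can be reproduced without MPI by computing which array elements consecutive datatypes would touch. The sketch below is plain C with made-up helper names (strided_offset, vector_extent, fits); it assumes the MPI rule that a vector type's default extent runs from its first to its last element, so for the simplified case (3 blocks, stride 4) the second datatype starts 9 elements in. Whatever the exact starting point, its later blocks fall past the end of the 12-element array.

```c
#include <assert.h>
#include <stddef.h>

/* Element offset touched by block k of the n-th consecutive datatype,
 * for a vector type with block length 1. All units are array elements. */
size_t strided_offset(size_t n, size_t k, size_t stride, size_t extent)
{
    return n * extent + k * stride;
}

/* Default extent of such a vector type: first element to last element. */
size_t vector_extent(size_t count, size_t stride)
{
    assert(count > 0);
    return (count - 1) * stride + 1;
}

/* Returns 1 if `ntypes` consecutive datatypes stay inside an array of
 * `total` elements, 0 if any access would fall outside it. */
int fits(size_t ntypes, size_t count, size_t stride, size_t extent,
         size_t total)
{
    for (size_t n = 0; n < ntypes; n++)
        for (size_t k = 0; k < count; k++)
            if (strided_offset(n, k, stride, extent) >= total)
                return 0;
    return 1;
}
```

With count = 3, stride = 4 and 12 elements in total, fits(2, 3, 4, vector_extent(3, 4), 12) reports an out-of-bounds access, while fits(4, 3, 4, 1, 12) with the resized extent of 1 element stays inside the array and covers the X, Y, Z, A pattern shown above.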
MPI_Gatherv
Note that now you have correctly defined the extent of your datatype, calling mpi_gatherv with an offset in terms of your datatype should now work as expected.
Personally, I wouldn't think there is a need to try some fancy logic with MPI_IN_PLACE for a collective operation. You can simply set myminpt=1 on myrank==0. Then you can call on every rank:
call mpi_gatherv(f(myminpt,1,1,1), ptsinproc(myrank),
+ mpi_x_interface_resized, f, ptsinproc,
+ x_slab_displs, mpi_x_interface_resized, 0,
+ mpi_comm_world, mpi_err)

Resources