Following the question on this link: is it a general rule (I mean, for the majority of languages) that for a 2-dimensional array "x(i,j)", the first index i is the row index and the second index j is the column index?
I know that Fortran is column-major and C is row-major, and for both the usual convention seems to be i = rows and j = columns, doesn't it?
Moreover, could anyone tell me whether MATLAB is row-major or column-major?
This is a misunderstanding. There is no relation between how raw data is allocated in memory and the higher-level representation that the raw data is supposed to model.
C does not attach any meaning to the indices in [i][j]; they just specify how the data is laid out in memory, not how it is presented to a user. i could be rows or it could be columns; this is for the programmer to decide in their application.
However, C does allocate the right-most dimension together in memory, example:
int arr[2][3] = { {1,2,3}, {1,2,3} };
+-------+-------+-------+-------+-------+-------+
| | | | | | |
| 1 | 2 | 3 | 1 | 2 | 3 |
| | | | | | |
+-------+-------+-------+-------+-------+-------+
This means that the preferred way to iterate over this matrix is:
for(size_t i=0; i<2; i++)
    for(size_t j=0; j<3; j++)
        arr[i][j] = x;
This order gives the fastest memory access as far as cache memory is concerned. The language does not enforce it, though: we can iterate with j in the outer loop and the program will work just as well (only slower).
Nor can we tell if this matrix is supposed to be a 2x3 or a 3x2.
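To make the layout concrete, here is a minimal sketch in C. The helper names row_major_offset and layout_is_row_major are made up for this illustration; the sketch only checks that arr[i][j] lives at flat position i*3 + j, and says nothing about which index "means" rows.

```c
#include <stddef.h>

/* Flat offset (in elements) of element (i, j) in a row-major array
 * whose right-most dimension has ncols entries. Hypothetical helper. */
size_t row_major_offset(size_t i, size_t j, size_t ncols)
{
    return i * ncols + j;
}

/* Returns 1 if every arr[i][j] sits at the row-major offset, 0 otherwise. */
int layout_is_row_major(void)
{
    int arr[2][3] = { {1, 2, 3}, {4, 5, 6} };

    for (size_t i = 0; i < 2; i++) {
        for (size_t j = 0; j < 3; j++) {
            /* Byte distance from the start of the whole array. */
            ptrdiff_t bytes = (const char *)&arr[i][j] - (const char *)arr;
            if ((size_t)bytes != row_major_offset(i, j, 3) * sizeof(int))
                return 0;
        }
    }
    return 1;
}
```

With j in the inner loop the offset grows by 1 per step, which is why that loop order walks memory sequentially.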
For MATLAB, the first index is the row and the second is the column. But arrays are stored internally in column-major order (very early versions of MATLAB were implemented in FORTRAN; when it was originally commercialised it was mostly converted into C, but kept that convention).
Your question is answered here.
Quote:
In summary:
The "shape" of the memory is exactly the same in C and Fortran even though language differences make it look different due to reversed array indexing.
If you don't iterate through a Fortran array in k,j,i order, you'll access memory out of order and negatively impact your cache performance.
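The quoted summary can be checked with a small sketch in C. It assumes a Fortran array declared a(ni,nj,nk) with 1-based indices and a C array declared a[nk][nj][ni] with 0-based indices; the function names are invented for this example.

```c
#include <stddef.h>

/* Offset of Fortran element a(i,j,k), 1-based, column-major,
 * for an array declared a(ni,nj,nk). */
size_t fortran_offset(size_t i, size_t j, size_t k, size_t ni, size_t nj)
{
    return (i - 1) + ni * ((j - 1) + nj * (k - 1));
}

/* Offset of C element a[k][j][i], 0-based, row-major,
 * for an array declared a[nk][nj][ni]. */
size_t c_offset(size_t k, size_t j, size_t i, size_t nj, size_t ni)
{
    return (k * nj + j) * ni + i;
}

/* Returns 1 if both formulas agree for every element, 0 otherwise. */
int layouts_match(size_t ni, size_t nj, size_t nk)
{
    for (size_t k = 0; k < nk; k++)
        for (size_t j = 0; j < nj; j++)
            for (size_t i = 0; i < ni; i++)
                if (c_offset(k, j, i, nj, ni) !=
                    fortran_offset(i + 1, j + 1, k + 1, ni, nj))
                    return 0;
    return 1;
}
```

In both languages the index i moves fastest through memory; it is written first in Fortran and last in C.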
I am trying to find solutions to a matrix where I know the row and column sums and the maximum value a cell can have. I want to find possible solutions that are within the constraints. I've already tried various things, like constructing an array of all cell values and picking from each cell in sequence, but whatever I try I always run into the problem where I run out of values for a cell.
I also tried a recursive algorithm, but with that I only managed to get the first result, or it failed to find any solution. I think I have to do this with a backtracking algorithm? Not sure...
Any help or pointers would be appreciated.
Row sums A, B, C, column sums X, Y, Z as well as the maximum value for each ? are known. All values are positive integers.
     | C1 | C2 | C3 |
-----+----+----+----+---
  R1 |  ? |  ? |  ? | A
-----+----+----+----+---
  R2 |  ? |  ? |  ? | B
-----+----+----+----+---
  R3 |  ? |  ? |  ? | C
-----+----+----+----+---
     |  X |  Y |  Z |
If you have heard of linear programming (LP) and its 'cousins' (ILP, MILP), that could be a good approach to solve your problem very efficiently.
A linear program consists of a set of variables (your matrix unknowns), constraints (maximum values, sums of rows and columns), and an objective function (here none) to minimize or maximize.
Let's call x[i][j] the values you are looking for.
With the following data:
NxM the dimensions of your matrix
max_val[i][j] the maximum value for the variable x[i][j]
row_val[i] the sum of the values on the row i
col_val[j] the sum of the values on the column j
Then a possible linear program that could solve your problem is:
// declare variables
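int x[N][M] // or possibly float x[N][M]
// declare constraints
// the values are positive, so each one also needs a lower bound of 1
for all i in 1 .. N, j in 1 .. M, x[i][j] >= 1
for all i in 1 .. N, j in 1 .. M, x[i][j] <= max_val[i][j]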
for all i in 1 .. N, sum[j in 1 .. M](x[i][j]) == row_val[i]
for all j in 1 .. M, sum[i in 1 .. N](x[i][j]) == col_val[j]
// here the objective function is useless, but you still will need one
// for instance, let's minimize the sum of all variables (which is constant, but as I said, the objective function does not have to be useful)
minimize sum[i in 1 .. N](sum[j in 1 .. M](x[i][j]))
// you could also be more explicit about the uselessness of the objective function
// minimize 0
Solvers such as Gurobi or CPLEX (there are many more of them, see here for instance) can solve this kind of problem incredibly fast, especially if your solutions do not need to be integers but can be floats (that makes the problem much, much easier). This approach also has the advantage of being not only faster to execute, but also faster and simpler to code. The solvers have APIs in several common programming languages to ease their use.
For example, you can reasonably expect to solve this kind of problem in less than a minute, with hundreds of thousands of variables in the integer case, millions in the real variables case.
Edit:
In response to the comment, here is a piece of code in OPL (the modeling language that ships with CPLEX) that would solve your problem. We consider a 3x3 case.
// declare your problem input
int row_val[1..3] = [7, 11, 8];
int col_val[1..3] = [14, 6, 6];
int max_val[1..3][1..3] = [[10, 10, 10], [10, 10, 10], [10, 10, 10]];
// declare your decision variables
dvar int x[1..3][1..3];
// objective function
minimize 0;
// constraints
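subject to {
    // positive integers: without a lower bound the solver may return values below 1
    forall(i in 1..3, j in 1..3) x[i][j] >= 1;
    forall(i in 1..3, j in 1..3) x[i][j] <= max_val[i][j];
    forall(i in 1..3) sum(j in 1..3) x[i][j] == row_val[i];
    forall(j in 1..3) sum(i in 1..3) x[i][j] == col_val[j];
}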
The concept of an LP solver is that you only describe the problem you want to solve, then the solver solves it for you. The problem must be described according to a certain set of rules. In the current case (Integer Linear Programming, or ILP), the variables must all be integers, and the constraints and objective function must be linear equalities (or inequalities) with respect to the decision variables.
The solver will then work as a black box. It will analyse the problem, and run algorithms that can solve it, with a ton of optimizations, and output the solution.
Since you wrote in a comment that you want to come up with your own solution, here are some guidelines:
Use a backtracking algorithm to find a solution. Your value space consists of 3*3=9 independent values, each of them between 1 and maxval[i][j]. Your constraints are the row and column sums (all of them must match).
Initialize your space with all 1s, then increment the values until they reach their maxval. Evaluate a condition only once every value it covers has been set (in particular, after 3 values you can evaluate the first row, after 6 the second row, after 7 the first column, after 8 the second column, and after 9 the third row and the third column).
If you reach the 9th value with all conditions passing, you've got a solution. Otherwise try the values from 1 to maxval; if none matches, step back. If the first value has been iterated through completely, then there's no solution.
That's all.
More advanced backtracking:
Your free values are only the top-left 2*2=4 values. The third row and the third column are calculated from the sums; the condition is that each calculated value must be between 1 and the maxval for that particular element.
After defining the [1][1] element (the last of the four free values, if indices are counted from 0), you need to calculate the [2][2] element using the column sum, and validate its value with the row sum (or vice versa). The same processing rules apply as above: iterate through all possible values, step back if none matches, and check rules only when they can be applied.
It is a much faster method, since you have 5 bound variables (the bottom row and the right column), and only 4 unbound ones. These are optimizations that follow from your particular rules. A bit more complex to implement, though.
PS: 1 is used because you have positive integers. If you have non-negative integers, you need to start with 0.
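For what it's worth, here is a minimal sketch in C of the simple backtracking variant (not the advanced one). The Grid struct, MAX_DIM and the function names are assumptions made for this example. Cells are filled in row-major order, a row or column is checked as soon as its last cell is set, and a branch is dropped early when a partial sum already exceeds its target, which is safe because no value is negative.

```c
#define MAX_DIM 8

typedef struct {
    int n, m;                        /* number of rows and columns */
    int row_val[MAX_DIM];            /* required row sums */
    int col_val[MAX_DIM];            /* required column sums */
    int max_val[MAX_DIM][MAX_DIM];   /* per-cell upper bounds */
    int min_val;                     /* 1 for positive, 0 for non-negative */
    int x[MAX_DIM][MAX_DIM];         /* current (or found) assignment */
    int row_sum[MAX_DIM];            /* running sums of the cells set so far */
    int col_sum[MAX_DIM];
} Grid;

/* Fills cell number pos (row-major) and recurses. Returns the number of
 * solutions found; stops after the first one if stop_at_first is set. */
static long search(Grid *g, int pos, int stop_at_first)
{
    if (pos == g->n * g->m)
        return 1;   /* every row and column was checked when it was completed */

    int i = pos / g->m, j = pos % g->m;
    long found = 0;

    for (int v = g->min_val; v <= g->max_val[i][j]; v++) {
        int rs = g->row_sum[i] + v;
        int cs = g->col_sum[j] + v;

        if (rs > g->row_val[i] || cs > g->col_val[j])
            break;                      /* larger v only overshoots more */
        if (j == g->m - 1 && rs != g->row_val[i])
            continue;                   /* row complete but sum is wrong */
        if (i == g->n - 1 && cs != g->col_val[j])
            continue;                   /* column complete but sum is wrong */

        g->x[i][j] = v;
        g->row_sum[i] = rs;
        g->col_sum[j] = cs;
        found += search(g, pos + 1, stop_at_first);
        g->row_sum[i] = rs - v;         /* step back */
        g->col_sum[j] = cs - v;

        if (found && stop_at_first)
            return found;               /* g->x still holds the solution */
    }
    return found;
}

/* Resets the running sums and starts the search at the first cell. */
long solve(Grid *g, int stop_at_first)
{
    for (int i = 0; i < g->n; i++) g->row_sum[i] = 0;
    for (int j = 0; j < g->m; j++) g->col_sum[j] = 0;
    return search(g, 0, stop_at_first);
}
```

Calling solve(&g, 1) leaves the first solution in g.x; calling solve(&g, 0) counts all solutions, which can take long on larger grids.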
I am trying to parallelize a customer's Fortran code with MPI. f is an array of 4-byte reals dimensioned f(dimx,dimy,dimz,dimf). I need the various processes to work on different parts of the array's first dimension. (I would have rather started with the last, but it wasn't up to me.) So I define a derived type mpi_x_interface like so
call mpi_type_vector(dimy*dimz*dimf, 1, dimx, MPI_REAL, &
mpi_x_interface, mpi_err)
call mpi_type_commit(mpi_x_interface, mpi_err)
My intent is that a single mpi_x_interface will contain all of the data in 'f' at some given first index "i". That is, for given i, it should contain f(i,:,:,:). (Note that at this stage of the game, all procs have a complete copy of f. I intend to eventually split f up between the procs, except I want proc 0 to have a full copy for the purpose of gathering.)
ptsinproc is an array containing the number of "i" indices handled by each proc. x_slab_displs is the displacement from the beginning of the array for each proc. For two procs, which is what I am testing on, they are ptsinproc=(/61,60/), x_slab_displs=(/0,61/). myminpt is a simple integer giving the minimum index handled in each proc.
So now I want to gather all of f into proc 0 and I run
if (myrank == 0) then
call mpi_gatherv(MPI_IN_PLACE, ptsinproc(myrank),
+ mpi_x_interface, f(1,1,1,1), ptsinproc,
+ x_slab_displs, mpi_x_interface, 0,
+ mpi_comm_world, mpi_err)
else
call mpi_gatherv(f(myminpt,1,1,1), ptsinproc(myrank),
+ mpi_x_interface, f(1,1,1,1), ptsinproc,
+ x_slab_displs, mpi_x_interface, 0,
+ mpi_comm_world, mpi_err)
endif
I can send at most one "slab" like this. If I try to send the entire 60 "slabs" from proc 1 to proc 0 I get a seg fault due to an "invalid memory reference". BTW, even when I send that single slab, the data winds up in the wrong places.
I've checked all the obvious stuff like making sure myrank and ptsinproc and x_slab_displs are what they should be on all procs. I've looked into the difference between "size" and "extent" and so on, to no avail. I'm at my wit's end. I just don't see what I am doing wrong. And someone might remember that I asked a similar (but different!) question a few months back. I admit I'm just not getting it. Your patience is appreciated.
First off, I just want to say that the reason you're running into so many problems is because you are trying to split up the first (fastest) axis. This is not recommended at all because as-is packing your mpi_x_interface requires a lot of non-contiguous memory accesses. We're talking a huge loss in performance.
Splitting up the slowest axis across MPI processes is a much better strategy. I would highly recommend transposing your 4D matrix so that the x axis is last if you can.
Now to your actual problem(s)...
Derived datatypes
As you have deduced, one problem is that the size and extent of your derived datatype might be incorrect. Let's simplify your problem a bit so I can draw a picture. Say dimy*dimz*dimf=3, and dimx=4. As-is, your datatype mpi_x_interface describes the following data in memory:
| X | | | | X | | | | X | | | |
That is, every 4th MPI_REAL, and 3 of them total. Seeing as this is what you want, so far so good: the size of your variable is correct. However, if you try and send "the next" mpi_x_interface, you see that your implementation of MPI will start at the next point in memory (which in your case has not been allocated), and throw an "invalid memory access" at you:
tries to access and bombs
vvv
| X | | | | X | | | | X | | | | Y | | | | Y | ...
What you need to tell MPI as part of your datatype is that "the next" mpi_x_interface starts only 1 real into the array. This is accomplished by redefining the "extent" of your derived datatype by calling MPI_Type_create_resized(). In your case, you need to write
integer :: mpi_x_interface, mpi_x_interface_resized
integer, parameter :: SIZEOF_REAL = 4 ! or whatever f actually is
! lb and extent must be address-sized integers, not default integers
integer(kind=MPI_ADDRESS_KIND) :: lb, extent
lb = 0
extent = 1*SIZEOF_REAL
call mpi_type_vector(dimy*dimz*dimf, 1, dimx, MPI_REAL, &
                     mpi_x_interface, mpi_err)
call mpi_type_create_resized(mpi_x_interface, lb, extent, &
                             mpi_x_interface_resized, mpi_err)
call mpi_type_commit(mpi_x_interface_resized, mpi_err)
Then, calling "the next" 3 mpi_x_interface_resized will result in:
| X | Y | Z | A | X | Y | Z | A | X | Y | Z | A |
as expected.
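If it helps to see the arithmetic, here is a small model in plain C (no MPI calls, all function names invented for this sketch). It computes which element offsets consecutive instances of the datatype touch, assuming the usual MPI definition in which the default extent of a vector type runs from its first to its last selected element, i.e. (count-1)*stride + 1 elements for a block length of 1.

```c
#include <stddef.h>

/* Offset (in reals) touched by block number `block` of the `item`-th
 * consecutive datatype instance.
 * stride: distance between blocks inside one instance (dimx)
 * extent: distance between the starts of consecutive instances */
size_t touched_offset(size_t item, size_t block, size_t stride, size_t extent)
{
    return item * extent + block * stride;
}

/* Default extent (in reals) of a vector type with `count` blocks of
 * one element each: from the first to the last selected element. */
size_t default_vector_extent(size_t count, size_t stride)
{
    return (count - 1) * stride + 1;
}

/* Returns 1 if `nitems` consecutive instances stay inside an array of
 * `total` reals, 0 if any access would fall outside it. */
int all_items_fit(size_t nitems, size_t count, size_t stride,
                  size_t extent, size_t total)
{
    for (size_t item = 0; item < nitems; item++)
        for (size_t block = 0; block < count; block++)
            if (touched_offset(item, block, stride, extent) >= total)
                return 0;
    return 1;
}
```

For the simplified case above (3 blocks, stride 4, 12 reals in total), two instances with the default extent already run past the end of the array, while four instances with an extent of 1 real cover the 12 elements exactly.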
MPI_Gatherv
Note that now that you have correctly defined the extent of your datatype, calling mpi_gatherv with an offset in terms of your datatype should work as expected.
Personally, I wouldn't think there is a need to try some fancy logic with MPI_IN_PLACE for a collective operation. You can simply set myminpt=1 on myrank==0. Then you can call on every rank:
call mpi_gatherv(f(myminpt,1,1,1), ptsinproc(myrank),
+ mpi_x_interface_resized, f, ptsinproc,
+ x_slab_displs, mpi_x_interface_resized, 0,
+ mpi_comm_world, mpi_err)
MATLAB is well-known for being column-major. Consequently, manipulating entries of an array that are in the same column is faster than manipulating entries that are on the same row.
In that case, why do so many built-in functions, such as linspace and logspace, output row vectors rather than column vectors? This seems to me like a de-optimization...
What, if any, is the rationale behind this design decision?
It is a good question. Here are some ideas...
My first thought was that in terms of performance and contiguous memory, it doesn't make a difference if it's a row or a column -- they are both contiguous in memory. For a multidimensional (>1D) array, it is correct that it is more efficient to index a whole column of the array (e.g. v(:,2)) rather than a row (e.g. v(2,:)) or other dimension because in the row (non-column) case it is not accessing elements that are contiguous in memory. However, for a row vector that is 1-by-N, the elements are contiguous because there is only one row, so it doesn't make a difference.
Second, it is simply easier to display row vectors in the Command Window, especially since it wraps the rows of long arrays. With a long column vector, you will be forced to scroll for much shorter arrays.
More thoughts...
Perhaps row vector output from linspace and logspace is just to be consistent with the fact that colon (essentially a tool for creating linearly spaced elements) makes a row:
>> 0:2:16
ans =
0 2 4 6 8 10 12 14 16
The choice was made at the beginning of time and that was that (maybe?).
Also, the convention for loop variables could be important. A row is necessary to define multiple iterations:
>> for k=1:5, k, end
k =
1
k =
2
k =
3
k =
4
k =
5
A column will be a single iteration with a non-scalar loop variable:
>> for k=(1:5)', k, end
k =
1
2
3
4
5
And maybe the outputs of linspace and logspace are commonly looped over. Maybe? :)
But, why loop over a row vector anyway? Well, as I say in my comments, it's not that a row vector is used for loops, it's that it loops through the columns of the loop expression. Meaning, with for v=M where M is a 2-by-3 matrix, there are 3 iterations, where v is a 2 element column vector in each iteration. This is actually a good design if you consider that this involves slicing the loop expression into columns (i.e. chunks of contiguous memory!).
I'm doing a project in C involving arrays. What I have is an array of 7 chars, I need to populate an array with 4 random elements from the 7. Then I compare an array I fill myself to it. I don't want to allow repeats. I know how to compare each individual element to another to prevent it but obviously this isn't optimal. So if I remove the elements from the array as I randomly pick them I remove any chance of them being duplicated, or so I think. My question is how would I do this?
Example:
char name[2+1] = {'a','b'};
char guess[2+1] = {0};
So it randomly picks a or b and puts it in guess[], but the next time it runs it might pick the same one. Removing it will get rid of that chance.
In bigger arrays this would be faster than doing all the comparing.
Guys it just hit me.
Couldn't I switch the element I took with the last element in the array and shrink it by one?
Then obviously change the rand() % x modulus by 1 each time?
I can give you steps to do what you intend to do. Code it yourself. Before that let's generalize the problem.
You've an array of 'm' elements and you've to fill another 'n' length
array by choosing random elements from first array such that there are
no repetition of number. Let's assume all numbers are unique in first
array.
Steps:
1. Keep a pointer or counter to track the current position in the array. Initialize it to index zero. Let's call it current.
2. Generate a random index i with current <= i < m. Keep generating until you get something in that range.
3. Fill the next free slot of second_array with first_array[i].
4. Swap first_array[i] and first_array[current] and increment current by 1.
5. Repeat from step 2 until 'n' elements have been picked.
Let's say your array is 2, 3, 7, 5, 8, 12, 4. Its length is 7. You've to fill a 5 length array out of it.
Initialize current to zero.
Generate a random index. Let's say 4. Check if it's between current (0) and m (7). It is.
Swap first_array[4] and first_array[0]. array becomes 8, 3, 7, 5, 2, 12, 4
Increment current by 1 and repeat.
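If you get stuck, the steps above could look roughly like this in C. This is only a sketch: the function name pick_without_repeats is made up, it works on chars as in the question, and it computes the index directly inside the valid range with rand() instead of retrying (the modulo introduces a slight bias, which is usually acceptable for a game).

```c
#include <stdlib.h>

/* Picks n distinct elements from src[0..m-1] into dst[0..n-1].
 * src is reordered in the process. Returns 0 on success, -1 if n > m. */
int pick_without_repeats(char *src, size_t m, char *dst, size_t n)
{
    if (n > m)
        return -1;

    for (size_t current = 0; current < n; current++) {
        /* Random index in [current, m): only elements not picked yet. */
        size_t i = current + (size_t)rand() % (m - current);

        dst[current] = src[i];

        /* Move the picked element out of the remaining range. */
        char tmp = src[i];
        src[i] = src[current];
        src[current] = tmp;
    }
    return 0;
}
```

Call srand() once at program start (for example with the current time) so that each run gives a different selection.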
Here are two possible ways of "removing" items from an array in C (there are other possible ways too):
1. Replace the item in the array with another item which indicates that this entry is not valid.
For example, if you have the char array
+---+---+---+---+---+---+
| F | o | o | b | a | r |
+---+---+---+---+---+---+
and you want to "remove" the b, then it could look like
+---+---+---+------+---+---+
| F | o | o | \xff | a | r |
+---+---+---+------+---+---+
2. Shift the remaining content of the array one step towards the beginning.
To use the same example from above, the array after shifting would look like
+---+---+---+---+---+---+
| F | o | o | a | r | |
+---+---+---+---+---+---+
This can be implemented by a simple memmove call.
The important thing to remember for this is that you need to keep track of the size, and decrease it every time you remove a character.
Of course both these methods can be combined: First use number one in a loop, and once done you can permanently remove the unused entries in the array with number two.
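A minimal sketch of the second method in C, assuming you track the length yourself (the function name remove_at is invented for this example):

```c
#include <string.h>

/* Removes arr[index] by shifting the tail one position towards the
 * beginning. Returns the new length (unchanged if index is out of range). */
size_t remove_at(char *arr, size_t len, size_t index)
{
    if (index >= len)
        return len;

    /* memmove is used because source and destination overlap. */
    memmove(&arr[index], &arr[index + 1], len - index - 1);
    return len - 1;
}
```

The last slot of the array still holds its old value afterwards; it is simply no longer counted, which is why the returned length matters.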
Don't forget that in C an array is just a contiguous block of items of the same type, and its name usually behaves like a pointer to the first item. So to "remove" an element, you simply replace the element at the given index with a value that marks it as not a valid entry, e.g. '\0' for a char.
As the entry is a simple value, there is no memory management issue (that I know of), but if it were a pointer to dynamically allocated memory, you would have to free it if it is the last reference you have to it.
So let's keep it simple:
array2[index2] = array1[index1];
array1[index1] = '\0';
The other way is to change the size of the original array so that it contains one less element, as Joachim stated in his answer.
I am currently working on a C program that computes a matrix multiplication. I have approached this task by looping through each column of the second matrix, as seen below.
I have set size to 1000.
for(i=0;i<size;i++)
{
    for(j=0;j<size;j++)
    {
        for(k=0;k<size;k++)
        {
            matC[i][j]+=matA[i][k]*matB[k][j];
        }
    }
}
I wanted to know what the problematic access pattern in this implementation is. What makes row access more efficient than column access (or the other way round)? I am trying to understand this in terms of how caches are used. Please help me understand this. Your help is much appreciated :)
If you are talking about the use of caches, then you might want to do something called loop tiling. You break the loops into tiles such that the data touched by the inner part of the loop fits inside the cache (which is quite large these days). So your loop will turn into something like this (assuming you pass the matrices into a function as pointers, and bt is the transposed B matrix):
for(j=0;j<size;j+=t)
  for(k=0;k<size;k+=t)
    for(i=0;i<size;i+=t)
      for(ii=i;ii<MIN(i+t,size);ii++)
        for(jj=j;jj<MIN(j+t,size);jj++)
        {
          var=*(c+ii * size+jj); //Var is a scalar variable
          for(kk=k;kk<MIN(k+t,size);kk++)
          {
            var = var + *(a+ii *size +kk) * *(bt +jj * size+ kk);
          }
          *(c+ii *size +jj) = var;
        }
The value of t depends on the speedup that you get; it can be t = 64, 128, 256 and so on. There are many other techniques that you can use here. Loop tiling is just one technique to utilize the cache efficiently. Further, you can transpose the B matrix before you send it to the multiplication function. That way you will get linear access to the elements of matrix B. To explain a bit more:
Consider
A -------- and B | | | |
-------- | | | |
-------- | | | |
-------- | | | |
Here, you always multiply a row of A with a column of B. Since you are using C (I believe), which stores arrays row by row, the CPU needs extra effort to read the columns of matrix B one by one from memory, because consecutive elements of a column are far apart. To ease this, you can transpose the matrix, work with the rows of B' (which are nothing but the columns of B), and use loop tiling to keep the maximum number of elements in cache for the multiplication. Hope this helps.
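Here is a minimal sketch of the transpose idea on its own, without tiling. It assumes square n-by-n matrices stored as flat row-major arrays of double; the function names are made up for this example.

```c
#include <stddef.h>

/* Writes the transpose of the n-by-n matrix b into bt. */
void transpose(const double *b, double *bt, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            bt[j * n + i] = b[i * n + j];
}

/* Computes c = a * b, given bt = transpose of b. In the inner loop both
 * a and bt are read with stride 1, so each cache line is fully used. */
void matmul_transposed(const double *a, const double *bt, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
}
```

The transpose costs one extra pass over B, which is normally small next to the n*n*n multiplications, but whether it pays off in your case is something to measure.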