Purpose of LDA argument in BLAS dgemm? - c

The Fortran reference implementation documentation states:
* LDA - INTEGER.
* On entry, LDA specifies the first dimension of A as declared
* in the calling (sub) program. When TRANSA = 'N' or 'n' then
* LDA must be at least max( 1, m ), otherwise LDA must be at
* least max( 1, k ).
* Unchanged on exit.
However, given m and k shouldn't I be able to derive LDA? When is LDA permitted to be bigger than n (or k)?

The LDA parameter in BLAS is effectively the stride of the matrix as it is laid out in linear memory. It is perfectly valid to have an LDA value which is larger than the leading dimension of the matrix which is being operated on. Typical cases where it is either useful or necessary to use a larger LDA value are when you are operating on a sub matrix from a larger dense matrix, and when hardware or algorithms offer performance advantages when storage is padded to round multiples of some optimal size (cache lines or GPU memory transaction size, or load balance in multiprocessor implementations, for example).

The distinction is between the logical size of the first dimensions of the arrays A and B and the physical size. The first is the size of the array that you are using, the second is the value in the declaration, or the physical amount of memory used. Since Fortran is a column major language, the declared sizes of all indices except the last must be known in order to calculate the location of an array element. Notice the FORTRAN 77 style declarations of "A(LDA,*), B(LDB,*), C(LDC,*)". The declared size of the array can be larger than the portion that you are using; of course it can't be smaller.

Another way to look at it: LDA is the y-stride, meaning that in a row-major layout the address of element A[y,x] is computed as x + LDA*y. For a "packed" memory layout without gaps between adjacent lines of x-data, LDA = xSize.
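To make the stride idea concrete, here is a minimal C sketch using column-major indexing as BLAS does; the IDX macro and the 4x4 parent matrix are just for illustration. The sub pointer together with lda = 4 is the kind of (A, LDA) pair you would pass to dgemm to operate on a 2x2 submatrix without copying it.

#include <stdio.h>

/* Column-major indexing with an explicit leading dimension:
   element (i,j) of a matrix stored with leading dimension lda
   lives at a[i + j*lda]. */
#define IDX(i, j, lda) ((i) + (j) * (lda))

int main(void)
{
    /* A dense 4x4 "parent" matrix stored column-major. */
    double parent[4 * 4];
    for (int j = 0; j < 4; j++)
        for (int i = 0; i < 4; i++)
            parent[IDX(i, j, 4)] = 10.0 * i + j;

    /* View the 2x2 submatrix whose top-left corner is at row 1, column 1.
       Its data pointer is shifted, but the distance between consecutive
       columns is still that of the parent, so lda stays 4, not 2. */
    double *sub = &parent[IDX(1, 1, 4)];
    int lda = 4;

    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++)
            printf("%6.1f ", sub[IDX(i, j, lda)]);
        printf("\n");
    }
    return 0;
}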

Related

What is the first vector to be read by the compiler in a vector addition operation in C?

I am analyzing the cache behaviour of some code and this question has arisen:
C[i] = B[i] + A[i];
G[i] = x*F[i];
This is the part of the code where I have the doubt. Context: my cache memory has space for up to 4 of these 5 vectors. It works with an LRU (Least Recently Used) algorithm, so C, B, A and F are stored without any problem, but there is no space for G in the cache, so the vector that has not been used for the longest time is replaced with the values of G. Here are the questions:
Was A read first, or was it B? What principle does the C compiler follow to decide which element is read first? Does it depend on which compiler is used (GCC, ICC...) or do they all generally follow the same discipline?
In C[i] = B[i] + A[i]; the compiler may load A[i] first or B[i] first. The C standard does not impose any requirement on this ordering.
With G[i] = x*F[i]; coming after that, the compiler must load F[i] after storing C[i] unless it can determine that F[i] is not the same object as C[i]. If it can determine that, then it may load A[i], B[i], and F[i] in any order.
Similarly, if it can determine that G[i] does not overlap C[i], it may store C[i] and G[i] in either order.
If this code appears in a loop, this permissive reordering extends to elements between iterations of the loop: If the compiler can determine that it will not affect the defined results, it can load B[i] from a “future” iteration before it loads A[i] for the current iteration, for example. It could load four elements from B, four elements from A, four elements from F, and do all the arithmetic before it stores any elements to C or G. (Often the compiler does not have the necessary information that the arrays do not overlap, so it will not do this reordering unless you give it that information in some way, such as by declaring the pointers with restrict or using special pragmas or compiler built-ins to tell it these things.)
Generally, the C standard is permissive about how actual operations are ordered. It only mandates that the observable behavior it specifies be satisfied. You should not expect that any particular elements are loaded first based on standard C code alone.
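As a hedged illustration of that last point (the function name and element type here are made up), this is roughly how the non-overlap guarantee is expressed in C; with restrict-qualified parameters the compiler may assume the arrays do not alias, so it is free to reorder or batch the loads of A, B and F across iterations, e.g. for vectorization:

#include <stddef.h>

/* restrict tells the compiler these pointers do not alias each other,
   so loads and stores may be reordered across the two statements and
   across loop iterations. */
void kernel(float *restrict C, const float *restrict B,
            const float *restrict A, float *restrict G,
            const float *restrict F, float x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        C[i] = B[i] + A[i];
        G[i] = x * F[i];
    }
}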

How exactly is an array type stored in C?

So I've been reading Brian W. Kernighan and Dennis M. Ritchie's "The C Programming Language" and everything was clear until I got to the array-to-pointer section. The first thing we can read is that, by definition, a[i] is converted by C to *(a+i). Okay, this is clear and logical. The next thing is that when we pass an array as a function parameter, we actually pass a pointer to the first element of that array. Then we find out that we can add integers to such a pointer, and that it is even valid to have a pointer to the position just past the last element of the array. But then it's written that we can subtract pointers only in the same array.
So how does C 'know' if these two pointers point to the same array? Is there some metainformation associated with the array? Or does it just mean that this is undefined behavior and the compiler won't even generate a warning? Is an array stored in memory just as ordinary values of the element type, one after another, or is there something else?
One reason the C standard only defines subtraction for two pointers if they are in the same array is that some (mostly old) C implementations use a form of addressing in which an address consists of a base address plus an offset, and different arrays may have different base addresses.
In some machines, a full memory address may have a base that is a number of segments or other blocks of some sort and an offset that is a number of bytes within that block. This was done because, for example, some early hardware worked with data in 16-bit pieces and was designed to work with 16-bit addresses, but later versions of hardware extending the same architecture had larger addresses while still using 16-bit pieces of data in order to keep some compatibility with previous software. So the newer hardware might have a 22-bit address space. Old software using just 16-bit addresses would still behave the same, but newer software could use an additional piece of data to specify different base addresses and thereby access all memory in the 22-bit address space.
In such a system, the combination of a base b and an offset o might refer to memory address 64•b + o. This gives access to the full 22 bits of address space—with b=65535 and o=63, we have 64•b + o = 64•65535 + 63 = 4,194,303 = 2^22 − 1.
Observe that many memory locations can be accessed through multiple addresses of this form. For example, b=17, o=40 refers to the same location as b=16, o=104 and as b=15, o=168. Although the formula for making a 22-bit address could have been designed to be 65536•b + o, which would have given each memory location a unique address, the overlapping formula was used because it gives a programmer flexibility in choosing their base. Recall that these machines were largely designed around using 16-bit pieces of data. With the non-overlapping address scheme, you would have to calculate both the base and the offset whenever doing address arithmetic. With the overlapping address scheme, you can choose a base for an array you are working with, and then doing any address arithmetic requires calculating only with the offset part.
A C implementation for this architecture can easily support arrays of up to 65,536 bytes by setting one base address for the array and then doing arithmetic only with the offset part. For example, if we have an array A of 1000 int, and it is allocated starting at memory location 78,976 (equal to 1234•64), we can set b to 1234 and index the array with offsets from 0 to 1998 (999•2, since each int is two bytes in this C implementation).
Then, if we have a pointer p pointing to A[125], it is represented with (1234, 250), to point to offset 250 with base 1234. And if q points to A[55], it is represented with (1234, 110). To subtract these pointers, we ignore the base, subtract the offsets, and divide by the size of one element, so the result is (250-110)/2 = 70.
Now, if you have a pointer r pointing to element 13 in some other array B, it is going to have a different base, say 2345. So r would be represented with (2345, 26). Then, to subtract r from p, we need to subtract (2345, 26) from (1234, 250). In this case, you cannot ignore the bases; simply working with the offsets would give (250−26)/2 = 112, but these items are not 112 elements (or 224 bytes) apart.
The compiler could be altered to do the math by subtracting the bases, multiplying by 64, adding that to the difference of the offsets, and then dividing by the element size. But then it is doing math to subtract pointers that is completely unnecessary in the intended uses of pointer arithmetic. So the C standard committee decided a compiler should not be required to support this, and the way to specify that is to say that the behavior is not defined when you subtract pointers to elements in different arrays.
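Just to make that arithmetic concrete, here is a small C model of the hypothetical base+offset scheme; nothing here corresponds to a real ABI, and the numbers simply mirror the example above:

#include <stdio.h>

/* Toy model of a (base, offset) "far" pointer on the hypothetical machine. */
struct far_ptr { unsigned base; unsigned offset; };

/* The full linear address is 64*base + offset. */
static unsigned linear(struct far_ptr p) { return 64u * p.base + p.offset; }

int main(void)
{
    /* p -> A[125] and q -> A[55]: same base, elements are 2 bytes. */
    struct far_ptr p = { 1234, 250 }, q = { 1234, 110 };
    printf("same array: %u elements apart\n", (p.offset - q.offset) / 2);  /* 70 */

    /* r -> B[13], a different array with a different base. */
    struct far_ptr r = { 2345, 26 };
    printf("offset-only subtraction: %u elements (wrong)\n",
           (p.offset - r.offset) / 2);                                     /* 112 */
    printf("actual distance: %u bytes\n", linear(r) - linear(p));
    return 0;
}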
... it's written that we can subtract pointers only in the same array.
So how does C 'know' if these two pointers point to the same array?
C does not know that. It is the programmer's responsibility to make sure the pointers stay within the same array.
int arr[100];
int *p1 = arr + 30;
int *p2 = arr + 50;
//both p1 and p2 point into arr
p2 - p1; //ok
p1 - p2; //ok
int *p3 = &(int){42}; // a C99 compound literal (ignore the syntax if it is unfamiliar)
//p3 does not point into arr
p3 - p1; //nope!

Multidimensional Array (2x2) Initialization in Assembly Language - Y86

I am new to assembly language and I'm using a simpler version called Y86, essentially the same thing. I wonder how to initialize a multidimensional array in such a format, specifically making a 2x2. Later with the 2x2 I will be adding two matrices (or arrays in this case). Thank you!
In machine code you have available (for information storage) CPU registers and memory.
Registers have fixed names and sizes and are used as such; for example, in x86 you can do mov eax, 0x12345678 to load a 32-bit value into register eax.
Memory is like a contiguous block of byte cells, each having its own unique physical address (0, 1, 2, ..., mem_size-1). So it is like a one-dimensional byte array.
Whatever other type you want, in the end it is somehow mapped onto this 1D byte array, so you first have to design how that mapping happens.
Some mappings, such as 32-bit integers, have native support in the instructions, so you can for example read a whole 32-bit int with a single instruction like mov eax,[address], without having to compose it from individual bytes: the CPU will read four bytes from memory at addresses address+0, address+1, address+2 and address+3 and concatenate them into a 32-bit value (on x86 CPUs in little-endian order, so the byte from address+0 ends up in the lowest 8 bits of the final value).
Other mappings, like an "array 2x2", don't have native support, and you have to design the memory layout and write the code accordingly to support it. For two-dimensional arrays the mapping memory_offset = (row * columns_max + column) * single_element_byte_size is often used.
For example, for a 16x16 matrix of 32-bit floats you can calculate the memory offset (from the start of the matrix data, which is at offset 0) like this:
; eax = column 0..15 (x), ebx = row 0..15 (y), ecx = address of matrix
shl ebx, 4 ; y *= 16
add eax, ebx ; index = y * 16 + x
mov edx, [ecx + eax*4] ; read 32 bit element from matrix[y][x]
But you are of course free to devise and implement any kind of mapping you wish...
edit: as Peter Cordes notes, some mappings favour certain tasks. For example, with contiguously stored matrices like the one above, adding two matrices can be implemented by treating them as one-dimensional 256-element (16x16) arrays, because rows and columns have no significance in matrix addition, so you can just add the corresponding elements of both (see the C sketch below). In multiplication you have to traverse the elements in more complex patterns, where rows and columns do matter, so there you have to write more complex code to respect the 2D mapping logic.
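For comparison (the question is about assembly, but the point is easier to see in C), here is a sketch of that flat-array addition; the function name and the 16x16 size are just the example from above:

/* With the row*16 + column mapping, each matrix is a contiguous block of
   256 floats, so addition can ignore the row/column structure entirely. */
void matrix_add(float *dst, const float *a, const float *b)
{
    for (int i = 0; i < 16 * 16; i++)   /* 256 elements, no 2D indexing needed */
        dst[i] = a[i] + b[i];
}

/* Element (row, col) is still available as a[row * 16 + col] when needed. */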
edit 2, to actually answer your question:
I wonder how to initialize a multidimensional array in such a format
Eee... this doesn't make sense from the machine's point of view. You simply need reserved space somewhere in memory to hold the array's data, and you may want to set it to certain initial values, either by simply writing those values into memory (with ordinary memory store instructions, like mov [ebx],eax), or, for example in simple code adding two matrices of fixed values, by defining both of them directly in the .data segment with a directive that defines values, like this in the NASM assembler (for the simple mapping described above):
; 2x2 32bit integer matrix:
; (14 32)
; (-3 4)
matrix1:
dd 14, 32, -3, 4
(check your assembler documentation to see which directives are available to reserve+initialize part of memory)
Which kind of memory area you want to reserve for the data (the load-time initialized .data segment, the stack, memory dynamically allocated from the OS "heap", ...) and how you load it with initial data is up to you, but it is in no way related to the "two-dimensional array" part: allocation/initialization code usually treats every type as a "contiguous block of bytes", without caring about the inner structure of the data; that is left for the other functions, which deal with particular elements of the data.
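For reference, the rough C equivalent of that NASM dd definition is a statically initialized array; the compiler emits the same four 32-bit values into the data section:

/* 2x2 32-bit integer matrix:
   (14 32)
   (-3  4) */
int matrix1[2][2] = {
    { 14, 32 },
    { -3,  4 },
};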

Maximum number of elements in a 2D integer array

Is there any limit as to how many elements a 2D integer array can contain in C?
PS : I was expecting there would be some space limitations in declaring an array but could not find any such reference in the internet.
It depends on your RAM or the memory available to you.
For example, my program used to crash when I declared a global array a[100000][10000], but that declaration is fine on the system I have now.
The size_t type is defined to be large enough to contain the size of any object in the program, including arrays. So the largest possible array size can be described as 2^(8*sizeof(size_t)) bytes.
For convenience, this value can be obtained through the SIZE_MAX constant in stdint.h. It is guaranteed to be at least 65535 but is realistically a much larger value, most likely 2^32 − 1 on 32-bit systems and 2^64 − 1 on 64-bit systems.
Maximum imposed by the C/C++ standard: x * y * z <= SIZE_MAX, where SIZE_MAX is implementation defined, x is one dimension of the array, y is the other dimension, and z is the size of the element in bytes. e.g. element_t A[x][y], z = sizeof(element_t).
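In practice the limit you hit first is usually available memory, but if you want to guard against the SIZE_MAX bound when allocating a 2D array dynamically, a sketch like this works (the function name and dimensions are arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate a rows x cols matrix of int, refusing sizes whose byte count
   would not fit in a size_t. */
int *alloc_matrix(size_t rows, size_t cols)
{
    if (rows != 0 && cols > SIZE_MAX / rows)
        return NULL;                          /* rows * cols would overflow */
    size_t n = rows * cols;
    if (n > SIZE_MAX / sizeof(int))
        return NULL;                          /* n * sizeof(int) would overflow */
    return malloc(n * sizeof(int));
}

int main(void)
{
    int *m = alloc_matrix(100000, 10000);     /* the sizes from the answer above */
    printf("%s\n", m ? "allocated" : "failed");
    free(m);
    return 0;
}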

Floating multiplication performing slower depending of operands in C

I am performing a stencil computation on a matrix I previously read from a file. I use two different kinds of matrices (NonZero type and Zero type). Both types share the value of the boundaries (1000 usually), whilst the rest of the elements are 0 for Zero type and 1 for NonZero type.
The code stores the matrix from the file in two allocated matrices of the same size. Then it performs an operation on every element of one matrix, using its own value and the values of its neighbours (add x 4 and mul x 1), and stores the result in the second matrix. Once the computation is finished, the pointers to the matrices are swapped and the same operation is performed again, for a fixed number of times. Here is the core code:
#define GET(I,J) rMat[(I)*cols + (J)]
#define PUT(I,J) wMat[(I)*cols + (J)]

for (cur_time=0; cur_time<timeSteps; cur_time++) {
    for (i=1; i<rows-1; i++) {
        for (j=1; j<cols-1; j++) {
            PUT(i,j) = 0.2f*(GET(i-1,j) + GET(i,j-1) + GET(i,j) + GET(i,j+1) + GET(i+1,j));
        }
    }
    // Change pointers for next iteration
    auxP = wMat;
    wMat = rMat;
    rMat = auxP;
}
The case I am describing uses a fixed value of 500 timeSteps (outer iterations) and a matrix size of 8192 rows and 8192 columns, but the problem persists when changing the number of timeSteps or the matrix size. Note that I only measure the time of this concrete part of the algorithm, so neither reading the matrix from the file nor anything else affects the time measurement.
What happens is that I get different times depending on which type of matrix I use, with much worse performance when using the Zero type (every other matrix performs the same as the NonZero type; I have already tried generating a matrix full of random values).
I am certain it is the multiplication operation, because if I remove it and leave only the adds, they perform the same. Note that with the Zero matrix type, most of the time the result of the sum will be 0, so the operation will be "0.2*0".
This behaviour is certainly weird to me, as I thought that floating point operations were independent of the values of their operands, which does not seem to be the case here. I have also tried to capture and report SIGFPE exceptions in case that was the problem, but I obtained no results.
In case it helps, I am using an Intel Nehalem processor and gcc 4.4.3.
The problem has already mostly been diagnosed, but I will write up exactly what happens here.
Essentially, the questioner is modeling diffusion; an initial quantity on the boundary diffuses into the entirety of a large grid. At each time step t, the value at the leading edge of the diffusion will be 0.2^t (ignoring effects at the corners).
The smallest normalized single-precision value is 2^-126; when cur_time = 55, the value at the frontier of the diffusion is 0.2^55, which is a bit smaller than 2^-127. From this time step forward, some of the cells in the grid will contain denormal values. On the questioner's Nehalem, operations on denormal data are about 100 times slower than the same operation on normalized floating point data, explaining the slowdown.
When the grid is initially filled with constant data of 1.0, the data never gets too small, and so the denormal stall is avoided.
Note that changing the data type to double would delay, but not alleviate the issue. If double precision is used for the computation, denormal values (now smaller than 2^-1022) will first arise in the 441st iteration.
At the cost of precision at the leading edge of the diffusion, you could fix the slowdown by enabling "Flush to Zero", which causes the processor to produce zero instead of denormal results in arithmetic operations. This is done by toggling a bit in the FPSCR or MXCSR, preferably via the functions defined in the <fenv.h> header in the C library.
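For example, on x86 with SSE (the default floating point path on x86-64), a minimal sketch using the intrinsic wrappers around MXCSR could look like this; the DAZ macro additionally requires SSE3 and lives in pmmintrin.h:

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* Call once, before the stencil loop, in the thread doing the computation.
   Denormal results (FTZ) and denormal inputs (DAZ) are then treated as zero,
   avoiding the slow microcoded path. */
static void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}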
Another (hackier, less good) "fix" would be to fill the matrix initially with very small non-zero values (0x1.0p-126f, the smallest normal number). This would also prevent denormals from arising in the computation.
Maybe your ZeroMatrix uses the typical storage scheme for sparse matrices: store every non-zero value in a linked list. If that is the case, it is quite understandable why it performs worse than a typical array-based storage scheme: it needs to run through the linked list once for every operation you perform. In that case you can maybe speed the process up by using a matrix-multiply algorithm that accounts for having a sparse matrix. If this is not the case, please post minimal but complete code so we can play with it.
Here is one of the possibilities for multiplying sparse matrices efficiently:
http://www.cs.cmu.edu/~scandal/cacm/node9.html
