I am new to assembly language and I'm using Y86, a simplified version that is essentially the same thing. I wonder how to initialize a multidimensional array in this format, specifically a 2x2. Later I will be adding two 2x2 matrices (or arrays in this case). Thank you!
In machine code you have two places available for information storage: CPU registers and memory.
Registers have fixed names and widths and are used as such; for example, on x86 you can do mov eax, 0x12345678 to load a 32-bit value into the register eax.
Memory is like a contiguous block of byte cells, each having its own unique physical address (0, 1, 2, ... mem_size-1). So it is like a one-dimensional byte array.
Whatever other type you want, in the end it is somehow mapped onto this 1D byte array, so you first have to design how that mapping happens.
Some mappings, such as 32-bit integers, have native support in the instructions, so you can read a whole 32-bit int with a single instruction like mov eax,[address] instead of composing it from individual bytes: the CPU reads four bytes from memory at addresses address+0, address+1, address+2 and address+3 and concatenates them into a 32-bit value (on x86 in little-endian order, so the byte at address+0 ends up in the lowest 8 bits of the final value).
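To illustrate what that one instruction does, here is a small C sketch of the byte-level view (memcpy stands in for the hardware doing the 4-byte load; the function name is illustrative):
#include <stdint.h>
#include <string.h>

// bytes in memory {0x78, 0x56, 0x34, 0x12} load as 0x12345678 on little-endian x86
uint32_t load32(const uint8_t *address) {
    uint32_t value;
    memcpy(&value, address, 4); // what `mov eax, [address]` does in one instruction
    return value;
}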
Other mappings like "array 2x2" don't have native support, and you have to design the memory layout and write the code accordingly to support it. For 2 dimensional arrays often mapping memory_offset = (row * columns_max + column) * single_element_byte_size is used.
For example, for a 16x16 matrix of 32-bit floats you can calculate the memory offset (from the start of the matrix data, which is at offset 0) like this:
; eax = column 0..15 (x), ebx = row 0..15 (y), ecx = address of matrix
shl ebx, 4 ; y *= 16
add eax, ebx ; index = y * 16 + x
mov edx, [ecx + eax*4] ; read 32 bit element from matrix[y][x]
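For reference, the same mapping written as a C sketch (names are illustrative):
#include <stddef.h>

// byte offset of element [row][col] under the row-major mapping described above
size_t element_offset(size_t row, size_t col, size_t columns_max, size_t elem_size) {
    return (row * columns_max + col) * elem_size;
}

// for the 16x16 matrix of 32-bit floats: matrix[row][col] is matrix[row * 16 + col]
float load_element(const float *matrix, size_t row, size_t col) {
    return matrix[row * 16 + col];
}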
But you are of course free to devise and implement any kind of mapping you wish...
edit: as Peter Cordes notes, some mappings favour certain tasks. For example, a contiguously laid-out matrix like the one above can, in the task of adding two matrices, be treated in the implementation as a one-dimensional 256-element (16x16) array, because rows and columns have no significance in matrix addition: you just add the corresponding elements of both. In multiplication you have to traverse the elements in more complex patterns, where rows and columns are important, so there you have to write more complex code to respect the 2D mapping logic.
edit 2, to actually answer your question:
I wonder how to initialize a multidimensional array in such a format
Eee... this doesn't make sense from the machine's point of view. You simply need reserved space somewhere in memory that represents the data of the array, and you may want to set it to certain initial values by simply writing those values into memory (with ordinary memory store instructions, like mov [ebx],eax). Or, for example, in a simple program adding two matrices of fixed values, you can define both of them directly in the .data segment with a directive defining values, like this in NASM assembler (for the simple mapping described above):
; 2x2 32bit integer matrix:
; (14 32)
; (-3 4)
matrix1:
dd 14, 32, -3, 4
(check your assembler documentation to see which directives are available to reserve+initialize part of memory)
Which kind of memory area you want to reserve for the data (the load-time-initialized .data segment, the stack, memory dynamically allocated from the OS "heap", ...) and how you load it with initial data is up to you, but it is in no way related to the "two-dimensional array" part. The allocation/initialization code usually works with all types as a "contiguous block of bytes", without caring about the inner structure of the data; that is left to the other functions, which deal with particular elements of the data.
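Putting it together for the 2x2 addition asked about, here is a minimal C sketch of the flat-layout approach (in assembly this becomes a short load/add/store loop over the four elements):
#include <stdint.h>

// add two 2x2 matrices stored as flat 4-element arrays (row-major)
void add_2x2(const int32_t *a, const int32_t *b, int32_t *result) {
    for (int i = 0; i < 4; i++)   // rows/columns don't matter for addition
        result[i] = a[i] + b[i];
}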
So I've been reading Brian W. Kernighan and Dennis M. Ritchie's "The C Programming Language" and everything was clear until I got to the array-to-pointer section. The first thing we read is that, by definition, a[i] is converted by C to *(a+i). Okay, this is clear and logical. The next thing is that when we pass an array as a function parameter, we actually pass a pointer to the first element of that array. Then we find out that we can add integers to such a pointer, and that it is even valid to have a pointer to the first element past the end of the array. But then it's written that we can subtract pointers only within the same array.
So how does C 'know' if these two pointers point to the same array? Is there some meta-information associated with the array? Or does it just mean that this is undefined behavior and the compiler won't even generate a warning? Is an array stored in memory as just ordinary values of its element type, one after another, or is there something else?
One reason the C standard only defines subtraction for two pointers if they are in the same array is that some (mostly old) C implementations use a form of addressing in which an address consists of a base address plus an offset, and different arrays may have different base addresses.
On some machines, a full memory address may consist of a base that is a number of segments or other blocks of some sort and an offset that is a number of bytes within the block. This was done because, for example, some early hardware would work with data in 16-bit pieces and was designed to work with 16-bit addresses, but later versions of the hardware extending the same architecture would have larger addresses while still using 16-bit pieces of data to keep some compatibility with previous software. So the newer hardware might have a 22-bit address space. Old software using just 16-bit addresses would still behave the same, but newer software could use an additional piece of data to specify different base addresses and thereby access all memory in the 22-bit address space.
In such a system, the combination of a base b and an offset o might refer to memory address 64•b + o. This gives access to the full 22 bits of address space: with b=65535 and o=63, we have 64•b + o = 64•65535 + 63 = 4,194,303 = 2^22 − 1.
Observe that many memory locations can be accessed through multiple addresses. For example, b=17, o=40 refers to the same location as b=16, o=104 and as b=15, o=168. Although the formula for making a 22-bit address could have been designed as 65536•b + o, which would have given each memory location a unique address, the overlapping formula was used because it gives a programmer flexibility in choosing their base. Recall that these machines were largely designed around using 16-bit pieces of data. With the non-overlapping address scheme, you would have to calculate both the base and the offset whenever doing address arithmetic. With the overlapping address scheme, you can choose a base for an array you are working with, and then any address arithmetic requires calculating only with the offset part.
A C implementation for this architecture can easily support arrays of up to 65,536 bytes by setting one base address for the array and then doing arithmetic only with the offset part. For example, if we have an array A of 1000 int, and it is allocated starting at memory location 78,976 (equal to 1234•64), we can set b to 1234 and index the array with offsets from 0 to 1998 (999•2, since each int is two bytes in this C implementation).
Then, if we have a pointer p pointing to A[125], it is represented with (1234, 250), to point to offset 250 with base 1234. And if q points to A[55], it is represented with (1234, 110). To subtract these pointers, we ignore the base, subtract the offsets, and divide by the size of one element, so the result is (250-110)/2 = 70.
Now, if you have a pointer r pointing to element 13 in some other array B, it is going to have a different base, say 2345. So r would be represented with (2345, 26). Then, to subtract r from p, we need to subtract (2345, 26) from (1234, 250). In this case, you cannot ignore the bases; simply working with the offsets would give (250−26)/2 = 112, but these items are not 112 elements (or 224 bytes) apart.
The compiler could be altered to do this math correctly by subtracting the bases, multiplying by 64, and adding that to the difference of the offsets. But then it would be doing math to subtract pointers that is completely unnecessary in the intended uses of pointer arithmetic. So the C standard committee decided a compiler should not be required to support this, and the way to specify that is to say that the behavior is not defined when you subtract pointers to elements in different arrays.
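To make the argument concrete, here is a toy C model of such base+offset pointers (purely illustrative; not how any real compiler represents them):
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// a far pointer on the hypothetical machine: address = 64*base + offset
typedef struct { uint16_t base; uint16_t offset; } far_ptr;

// subtraction as the compiler would emit it: offsets only.
// Valid only when both pointers share a base, i.e. point into the same array.
ptrdiff_t far_diff(far_ptr p, far_ptr q, size_t elem_size) {
    assert(p.base == q.base); // different arrays: behavior undefined
    return (p.offset - q.offset) / (ptrdiff_t)elem_size;
}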
... it's written that we can subtract pointers only in the same array.
So how does C 'know' if these two pointers point to the same array?
C does not know that. It is the programmer's responsibility to stay within the limits.
int arr[100];
int *p1 = arr + 30;
int *p2 = arr + 50;
// both p1 and p2 point into arr
p2 - p1; // ok
p1 - p2; // ok
int *p3 = &(int){42}; // (int){42} is a C99 compound literal, a separate object
// p3 does not point into arr
p3 - p1; // nope! undefined behavior
I want to know what exactly the arrangement specifier is in ARM assembly instructions.
I have gone through the ARM TRMs, and I think it is the size of the NEON register that will be used for the computation.
For example:
TBL Vd.Ta, {Vn.16B, Vn+1.16B}, Vm.Ta
This is taken from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/TBL_advsimd_vector.html
They mention Ta to be an arrangement specifier with the value 16B or 8B.
I would like to know what it means (the size of the NEON register... anything).
The arrangement specifier is the number and size of the elements in the vector, for example 8B means that you are looking at 8 elements of one byte (this will be a 64-bit vector), and 16B is 16 elements of 1 byte (128-bit vector).
This is taken from the ARM Reference Manual.
I help myself remember this by thinking:
B = Bytes (8-bit)
H = Halfwords (16-bit)
S = Single words (32-bit)
D = Double words (64-bit)
I don't know if that is official, but it's how I remember it.
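For a concrete illustration, here is a small C sketch using NEON intrinsics, where the vector types encode the arrangement (the function name is illustrative):
#include <arm_neon.h>

// uint8x8_t corresponds to the 8B arrangement (8 one-byte elements, 64-bit vector);
// uint8x16_t corresponds to 16B (16 one-byte elements, 128-bit vector).
uint8x16_t shuffle_bytes(uint8x16_t table, uint8x16_t indices) {
    return vqtbl1q_u8(table, indices); // compiles to: TBL Vd.16B, { Vn.16B }, Vm.16B
}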
Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?
10101010 is the bitmask
the value is 121
it should set positions 0, 2, 4, 6 to the value 121
With AVX512, yes. Masked stores are a first-class operation in AVX512.
Use the bitmask as an AVX512 mask for a vector store to the array, using _mm512_mask_storeu_epi8(void* mem_addr, __mmask64 k, __m512i a), which compiles to vmovdqu8. (That needs AVX512BW; with only AVX512F, you can use only 32- or 64-bit element sizes.)
#include <immintrin.h>
#include <stdint.h>

void set_value_in_selected_elements(char *array, uint64_t bitmask, uint8_t value) {
    __m512i broadcastv = _mm512_set1_epi8(value);
    // integer types are implicitly convertible to/from __mmask types;
    // the compiler emits the KMOV instruction for you.
    _mm512_mask_storeu_epi8(array, bitmask, broadcastv);
}
This compiles (with gcc7.3 -O3 -march=skylake-avx512) to:
vpbroadcastb zmm0, edx
kmovq k1, rsi
vmovdqu8 ZMMWORD PTR [rdi]{k1}, zmm0
vzeroupper
ret
If you want to write zeros in the elements where the bitmap was zero, either use a zero-masking move to create a vector from the mask and store that, or create a 0 / -1 vector using the AVX512BW or DQ intrinsic __m512i _mm512_movm_epi8(__mmask64). Other element sizes are available. But using a masked store makes it possible to use this safely when the array size isn't a multiple of the vector width, because the unmodified elements aren't read or rewritten; they're truly untouched. (The CPU can take a slow microcode assist if any of the untouched elements would have faulted on a real store, though.)
Without AVX512, the picture is different. You asked for "an intrinsic" (singular):
There's pdep, which you can use to expand a bitmap to a byte-map. See my AVX2 left-packing answer for an example of using _pdep_u64(mask, 0x0101010101010101); to unpack each bit in mask to a byte. This gives you 8 bytes in a uint64_t. In C, if you use a union between that and an array, then it gives you an array of 0 / 1 elements. (But of course indexing the array will require the compiler to emit shift instructions, if it hasn't spilled it somewhere first. You probably just want to memcpy the uint64_t into a permanent array.)
But in the more general case (larger bitmaps), or even with 8 elements when you want to blend in new values based on the bitmask, you should use multiple intrinsics to implement the inverse of pmovmskb, and use that to blend. (See the without pdep section below)
In general, if your array fits in 64 bits (e.g. an 8-element char array), you can use pdep. Or if it's an array of 4-bit nibbles, then you can do a 16-bit mask instead of 8.
Otherwise there's no single instruction, and thus no intrinsic. For larger bitmaps, you can process it in 8-bit chunks and store 8-byte chunks into the array.
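A rough sketch of that chunked approach (assumes BMI2 for pdep; names are illustrative):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// expand each bit of an n-byte bitmap into a 0/1 byte in dst, 8 bits at a time
void bitmap_to_bytes(uint8_t *dst, const uint8_t *bitmap, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint64_t chunk = _pdep_u64(bitmap[i], 0x0101010101010101ULL);
        memcpy(dst + i * 8, &chunk, 8); // one 8-byte store per 8 mask bits
    }
}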
If your array elements are wider than 8 bits (and you don't have AVX512), you should probably still expand bits to bytes with pdep, but then use [v]pmovzx to expand from bytes to dwords or whatever in a vector. e.g.
#include <immintrin.h>
#include <stdint.h>

// only the low 8 bits of the input matter
__m256i bits_to_dwords(unsigned bitmap) {
    uint64_t mask_bytes = _pdep_u64(bitmap, 0x0101010101010101); // expand bits to bytes
    __m128i byte_vec = _mm_cvtsi64x_si128(mask_bytes);
    return _mm256_cvtepu8_epi32(byte_vec);
}
If you want to leave elements unmodified instead of setting them to zero where the bitmask had zeros, OR with the previous contents instead of assigning / storing.
This is rather inconvenient to express in C / C++ (compared to asm). To copy 8 bytes from a uint64_t into a char array, you can (and should) just use memcpy (to avoid any undefined behaviour because of pointer aliasing or misaligned uint64_t*). This will compile to a single 8-byte store with modern compilers.
But to OR them in, you'd either have to write a loop over the bytes of the uint64_t, or cast your char array to uint64_t*. The cast usually works fine, because char* can alias anything, so reading the char array later doesn't have any strict-aliasing UB. But a misaligned uint64_t* can cause problems even on x86, if the compiler assumes that it is aligned when auto-vectorizing. See: Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Assigning a value other than 0 / 1
Use a multiply by 0xFF to turn the mask of 0/1 bytes into a 0 / -1 mask, and then AND that with a uint64_t that has your value broadcasted to all byte positions.
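A scalar sketch of that multiply trick (assumes BMI2 for pdep; names are illustrative):
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

void store_value_by_mask8(char *array, uint8_t bitmask, uint8_t value) {
    uint64_t mask_bytes = _pdep_u64(bitmask, 0x0101010101010101ULL); // 0 or 1 per byte
    uint64_t and_mask = mask_bytes * 0xFF;                  // 0x01 -> 0xFF, 0x00 stays 0x00
    uint64_t result = and_mask & (0x0101010101010101ULL * value); // value broadcast to all bytes
    memcpy(array, &result, 8); // stores zeros where the bitmask was zero
}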
If you want to leave the unselected elements unmodified (instead of zeroing them), you should probably use SSE2 / SSE4 or AVX2 even if your array has byte elements: load the old contents and vpblendvb with set1(121), using the byte-mask as a control vector.
vpblendvb only uses the high bit of each byte, so your pdep constant can be 0x8080808080808080 to scatter the input bits to the high bit of each byte, instead of the low bit. (So you don't need to multiply by 0xFF to get an AND mask).
If your elements are dword or larger, you could use _mm256_maskstore_epi32. (Use pmovsx instead of zx to copy the sign bit when expanding the mask from bytes to dwords.) This can be a perf win over a variable-blend + always read / re-write. See: Is it possible to use SIMD instruction for replace?
Without pdep
pdep is very slow on Ryzen, and even on Intel it's maybe not the best choice.
The alternative is to turn your bitmask into a vector mask:
is there an inverse instruction to the movemask instruction in intel avx2? and
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
i.e. broadcast your bitmap to every position of a vector (or shuffle it so the right byte of the bitmap is in the corresponding element), and use a SIMD AND to mask off the appropriate bit for that byte. Then use pcmpeqb/w/d against the AND-mask to find the elements that had their bit set.
You're probably going to want to load / blend / store if you don't want to store zeros where the bitmap was zero.
Use the compare-mask to blend on your value, e.g. with _mm_blendv_epi8 or the 256bit AVX2 version. You can handle bitmaps in 16-bit chunks, producing 16-byte vectors with just a pshufb to send bytes of it to the right elements.
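A sketch of that approach for a 16-bit bitmap and byte elements (SSSE3/SSE4.1 intrinsics; the constants and names are illustrative):
#include <immintrin.h>
#include <stdint.h>

void blend_value_by_mask16(char *array, uint16_t bitmask, char value) {
    // put the low mask byte in elements 0-7 and the high mask byte in 8-15
    __m128i msk = _mm_shuffle_epi8(_mm_set1_epi16((short)bitmask),
                                   _mm_set_epi8(1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0));
    // element i tests bit (i % 8) of its mask byte
    __m128i bits = _mm_set_epi8((char)0x80,0x40,0x20,0x10,8,4,2,1,
                                (char)0x80,0x40,0x20,0x10,8,4,2,1);
    __m128i selected = _mm_cmpeq_epi8(_mm_and_si128(msk, bits), bits);
    __m128i old = _mm_loadu_si128((const __m128i *)array);
    __m128i blended = _mm_blendv_epi8(old, _mm_set1_epi8(value), selected);
    _mm_storeu_si128((__m128i *)array, blended);
}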
It's not safe for multiple threads to do this at the same time on the same array even if their bitmaps don't intersect, unless you use masked stores, though.
X86-64, Linux, Windows.
Consider that I'd want to make some sort of "free lunch for tagged pointers". Basically I want to have two pointers that point to the same actual memory block but whose bits are different. (For example I want one bit to be used by the GC for reachability marking, or for some other reason.)
intptr_t ptr = (intptr_t)malloc(size);
intptr_t ptr2 = map(ptr | GC_FLAG_REACHABLE); // some magic call
int *p = (int *)ptr;
int *p2 = (int *)ptr2;
*p = 10;
*p2 = 20;
assert(*p == 20);
assert(p != p2);
On Linux, mmap() the same file twice. Same thing on Windows really, but it has its own set of functions for that.
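A minimal Linux-only sketch of the double-mapping approach (memfd_create is used here as one convenient way to get a shareable file descriptor; error handling omitted):
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <assert.h>

int main(void) {
    int fd = memfd_create("dual_view", 0); // anonymous in-memory file (Linux-specific)
    ftruncate(fd, 4096);
    // two distinct virtual addresses backed by the same physical page
    int *p  = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int *p2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *p = 10;
    *p2 = 20;
    assert(*p == 20); // writes through one view are visible through the other
    assert(p != p2);
    return 0;
}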
Mapping the same memory (mmap on POSIX as Ignacio mentions, MapViewOfFile on Windows) to multiple virtual addresses may provide you some interesting coherency puzzles (are writes at one address visible when read at another address?). Or maybe not. I'm not sure what all the platform guarantees are.
More commonly, one simply reserves a few bits in the pointer and shifts things around as necessary.
If all your objects are aligned to 8-byte boundaries, it's common to simply store tags in the 3 least-significant bits of a pointer, and mask them off before dereferencing (as thkala mentions). If you choose a higher alignment, such as 16 bytes or 32 bytes, then there are 4 or 5 least-significant bits that can be used for tagging. Equivalently, choose a few most-significant bits for tagging, and shift them off before dereferencing. (Sometimes non-contiguous bits are used, for example when packing pointers into the signalling NaNs of IEEE-754 floats (2^23 values) or doubles (2^51 values).)
Continuing on the high end of the pointer, current implementations of x86-64 use at most 48 bits out of a 64-bit pointer (0x0000000000000000-0x00007fffffffffff + 0xffff800000000000-0xffffffffffffffff) and Linux and Windows only hand out addresses in the first range to userspace, leaving 17 most-significant bits that can be safely masked off. (This is neither portable nor guaranteed to remain true in the future, though.)
Another approach is to stop considering "pointers" and simply use indices into a larger memory array, as the JVM does with -XX:+UseCompressedOops. If you've allocated a 512MB pool and are storing 8-byte aligned objects, there are 2^26 possible object locations, so a 32-bit value has 6 bits to spare in addition to the index. A dereference will require adding the index times the alignment to the base address of the array, saved elsewhere (it's the same for every "pointer"). If you look at things carefully, this is simply a generalization of the previous technique (which always has a base of 0, where things line up with real pointers).
Once upon a time I worked on a Prolog implementation that used the following technique to have spare bits in a pointer:
Allocate a memory area with a known alignment. malloc() usually allocates memory with a 4-byte or 8-byte alignment. If necessary, use posix_memalign() to get areas with a higher alignment size.
Since the resulting pointer is aligned to intervals of multiple bytes, but it represents byte-accurate addresses, you have a few spare bits that will by definition be zero in the memory area pointer. For example a 4-byte alignment gives you two spare bits on the LSB side of the pointer.
You OR (|) your flags with those bits and now have a tagged pointer.
As long as you take care to properly mask the pointer before using it for memory access, you should be perfectly fine.
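A minimal C sketch of that scheme (assuming a 4-byte-aligned allocation, so the two low bits are spare; names are illustrative):
#include <assert.h>
#include <stdint.h>

#define TAG_MASK ((uintptr_t)0x3) // two spare bits from 4-byte alignment

void *tag_ptr(void *p, uintptr_t tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0 && tag <= TAG_MASK);
    return (void *)((uintptr_t)p | tag); // OR the flags into the spare bits
}

uintptr_t get_tag(void *p) { return (uintptr_t)p & TAG_MASK; }

void *strip_tag(void *p) { return (void *)((uintptr_t)p & ~TAG_MASK); } // mask before dereferencing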
The Fortran reference implementation documentation states:
* LDA - INTEGER.
* On entry, LDA specifies the first dimension of A as declared
* in the calling (sub) program. When TRANSA = 'N' or 'n' then
* LDA must be at least max( 1, m ), otherwise LDA must be at
* least max( 1, k ).
* Unchanged on exit.
However, given m and k shouldn't I be able to derive LDA? When is LDA permitted to be bigger than m (or k)?
The LDA parameter in BLAS is effectively the stride of the matrix as it is laid out in linear memory. It is perfectly valid to have an LDA value which is larger than the leading dimension of the matrix which is being operated on. Typical cases where it is either useful or necessary to use a larger LDA value are when you are operating on a sub matrix from a larger dense matrix, and when hardware or algorithms offer performance advantages when storage is padded to round multiples of some optimal size (cache lines or GPU memory transaction size, or load balance in multiprocessor implementations, for example).
The distinction is between the logical size of the first dimensions of the arrays A and B and the physical size. The first is the size of the array that you are using; the second is the value in the declaration, i.e. the physical amount of memory used. Since Fortran is a column-major language, the declared sizes of all indices except the last must be known in order to calculate the location of an array element. Notice the FORTRAN 77 style declarations of A(LDA,*), B(LDB,*), C(LDC,*). The declared size of the array can be larger than the portion that you are using; of course it can't be smaller.
Another way to look at it: LDA is the y-stride, meaning that in a row-major layout the address of element A[y,x] is computed as x + LDA*y. For a "packed" memory layout without gaps between adjacent lines of x-data, LDA = xSize.
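As a small illustration, here is a C sketch of LDA-based addressing (assuming column-major storage as in Fortran BLAS; names are illustrative):
#include <stddef.h>

// element (i, j) of a column-major matrix with leading dimension lda
double get_elem(const double *a, size_t lda, size_t i, size_t j) {
    return a[i + j * lda];
}

// A 3x3 submatrix starting at row 2, column 4 of a 10x10 matrix can be passed
// to a BLAS routine as &a[2 + 4*10] with lda = 10: lda stays 10, not 3.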