What is arrangement specifier(.16b,.8b) in ARM assembly language instructions? - arm

I want to what exactly is arrangement specifier in arm assembly instructions.
I have gone through ARM TRMs and i think if it is size of Neon register that will be used for computation
for e.g.
TBL Vd.Ta, {Vn.16B,Vn+1.16B }, Vm.Ta
This is taken from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/TBL_advsimd_vector.html
they mentioned Ta to be arrangement specifier of value 16B or 8B
I would like to know what it means to.(Size of Neon Register .. anything..)

The arrangement specifier is the number and size of the elements in the vector, for example 8B means that you are looking at 8 elements of one byte (this will be a 64-bit vector), and 16B is 16 elements of 1 byte (128-bit vector).
This is taken from the ARM Reference Manual:

I help myself remember this by thinking:
B = Bytes (8bit)
H = Halfwords (16bit)
S = Single words (32bit)
D = Double words (64bit)
I don't know if that is official, but it's how I remember it

Related

Assigning a 2 byte variable to a 3 byte register?

My Watch dog timer has a default value of 0x0fffff and i want to write a 2 byte variable (u2 compare) in it. What happens when i assign the value simply like this
wdt_register = compare;
What happens to most significant byte of register?
Register definition. It's 3 bytes register containing H, M, L 8bit registers. 4 most significat bits of H are not used and then it's actually a 20 bit register. Datasheet named all of them as WDTCR_20.
My question is what happens when i assign a value to the register using this line (just an example of 2 byte value written to 3 byte register) :
WDTCR_20 = 0x1234;
Your WDT is a so-called special function register. In hardware, it may end up being three bytes, or it could be four bytes, some of which are fixed/read-only/unused. Your compiler's implementation of the write is itself implementation-dependent if the SFR is declared in a particular way that makes the compiler emit SFR-specific write instructions.
This effectively makes the result of the assignment implementation-dependent; the high eight bits might end up being discarded, might set some other microarchitectural flags, or might cause a trap/crash if they aren't set to a specific (likely all-zeros value). It depends on the processor's datasheet (since you didn't mention a processor/toolchain, we don't know exactly).
For example, the AVR-based atmega328p datasheet shows an example of such a register:
In this case, the one-byte register is actually only three bits, effectively (bits 7..3 are fixed to zero on read and ignored on write, and could very well have no physical flip-flop or SRAM cell associated with them).

Multidimensional Array (2x2) Initialization in Assembly Language - Y86

I am new to assembly language and I'm using a simpler version called Y86, essentially the same thing. I wonder how to initialize a multidimensional array in such a format, specifically making a 2x2. Later with the 2x2 I will be adding two matrices (or arrays in this case). Thank you!
In machine code you have available (for information storage) CPU registers and memory.
Registers have fixed names and types and they are used like that, for example in x86 you can do mov eax, 0x12345678 to load 32b value into register eax.
Memory is like continuous block of byte-cells, each having it's own unique physical address (like: 0, 1, 2, ... mem_size-1). So it is like 1 dimensional byte array.
Whatever different type you want, in the end it is somehow mapped to this 1D byte array, so you have to first design how that mapping happens.
Some mappings like for example 32 bit integers have native mappings/support in the instructions, so you can for example read whole 32b int by single instruction like mov eax,[address], not having to compose it from individual bytes, but the CPU will for you read four bytes from memory at addresses: address+0, address+1, address+2 and address+3 and concatenate it into 32 bit value (on x86 CPU in little-endian order, so the byte from address+0 is in the lowest 8 bits of final value).
Other mappings like "array 2x2" don't have native support, and you have to design the memory layout and write the code accordingly to support it. For 2 dimensional arrays often mapping memory_offset = (row * columns_max + column) * single_element_byte_size is used.
Like for 16x16 matrix of 32 bit floats you can calculate memory offset (from the start of matrix data, which is at offset 0):
; eax = column 0..15 (x), ebx = row 0..15 (y), ecx = address of matrix
shl ebx, 4 ; y *= 16
add eax, ebx ; index = y * 16 + x
mov edx, [ecx + eax*4] ; read 32 bit element from matrix[y][x]
But you are of course free to devise and implement any kind of mapping you wish...
edit: as Peter Cordes notes, some mappings favour certain task, like for example continuously designed matrices like the one above, in the task of adding two matrices, can be dealt with in the implementation as one dimensional 256 (16x16) element array, because there's no significance of row/columns in the matrix addition, so you can just add corresponding elements of both. In multiplication you have to traverse the elements in more complex patterns, where row/columns are important, so there you have to write more complex code to respect the 2D mapping logic.
edit 2, to actually add answer to your question:
I wonder how to initialize a multidimensional array in such a format
Eee... this doesn't make sense from machine point of view. You simply need somewhere in memory reserved space, which represents data of the array, and you may want to set those to certain initial values, by simply writing those values into memory (by ordinary memory store instructions, like mov [ebx],eax), or for example in simple code adding two matrices of fixed values, you can define both of them directly in .data segment with some directive defining values, like for example in NASM assembler (for the simple mapping as described above):
; 2x2 32bit integer matrix:
; (14 32)
; (-3 4)
matrix1:
dd 14, 32, -3, 4
(check your assembler documentation to see which directives are available to reserve+initialize part of memory)
Which kind of memory area you want to reserve for the data (the load-time initialized .data, or stack, or dynamically allocated from OS "heap", ...), and how you load it with initial data, is up to you, but in no way related to the "two dimensional array", usually the allocation/initialization code often works with all the types as "continuous block of bytes", without caring about the inner structure of the data, that's left for the other functions, which are dealing with particular elements of the data.

Intrinsic to set value in array based on a BitMask

Is there an intrinsic that will set a single value at all the places in an input array where the corresponding position had a 1 bit in the provided BitMask?
10101010 is bitmask
value is 121
it will set positions 0,2,4,6 with value 121
With AVX512, yes. Masked stores are a first-class operation in AVX512.
Use the bitmask as an AVX512 mask for a vector store to an array, using _mm512_mask_storeu_epi8 (void* mem_addr, __mmask64 k, __m512i a) vmovdqu8. (AVX512BW. With AVX512F, you can only use 32 or 64-bit element size.)
#include <immintrin.h>
#include <stdint.h>
void set_value_in_selected_elements(char *array, uint64_t bitmask, uint8_t value) {
__m512i broadcastv = _mm512_set1_epi8(value);
// integer types are implicitly convertible to/from __mmask types
// the compiler emits the KMOV instruction for you.
_mm512_mask_storeu_epi8 (array, bitmask, broadcastv);
}
This compiles (with gcc7.3 -O3 -march=skylake-avx512) to:
vpbroadcastb zmm0, edx
kmovq k1, rsi
vmovdqu8 ZMMWORD PTR [rdi]{k1}, zmm0
vzeroupper
ret
If you want to write zeros in the elements where the bitmap was zero, either use a zero-masking move to create a constant from the mask and store that, or create a 0 / -1 vector using AVX512BW or DQ __m512i _mm512_movm_epi8(__mmask64 ). Other element sizes are available. But using a masked store makes it possible to safely use it when the array size isn't a multiple of the vector width, because the unmodified elements aren't read / rewritten or anything; they're truly untouched. (The CPU can take a slow microcode assist if any of the untouched elements would have faulted on a real store, though.)
Without AVX512, you still asked for "an intrinsic" (singular).
There's pdep, which you can use to expand a bitmap to a byte-map. See my AVX2 left-packing answer for an example of using _pdep_u64(mask, 0x0101010101010101); to unpack each bit in mask to a byte. This gives you 8 bytes in a uint64_t. In C, if you use a union between that and an array, then it gives you an array of 0 / 1 elements. (But of course indexing the array will require the compiler to emit shift instructions, if it hasn't spilled it somewhere first. You probably just want to memcpy the uint64_t into a permanent array.)
But in the more general case (larger bitmaps), or even with 8 elements when you want to blend in new values based on the bitmask, you should use multiple intrinsics to implement the inverse of pmovmskb, and use that to blend. (See the without pdep section below)
In general, if your array fits in 64 bits (e.g. an 8-element char array), you can use pdep. Or if it's an array of 4-bit nibbles, then you can do a 16-bit mask instead of 8.
Otherwise there's no single instruction, and thus no intrinsic. For larger bitmaps, you can process it in 8-bit chunks and store 8-byte chunks into the array.
If your array elements are wider than 8 bits (and you don't have AVX512), you should probably still expand bits to bytes with pdep, but then use [v]pmovzx to expand from bytes to dwords or whatever in a vector. e.g.
// only the low 8 bits of the input matter
__m256i bits_to_dwords(unsigned bitmap) {
uint64_t mask_bytes = _pdep_u64(bitmap, 0x0101010101010101); // expand bits to bytes
__m128i byte_vec = _mm_cvtsi64x_si128(mask_bytes);
return _mm256_cvtepu8_epi32(byte_vec);
}
If you want to leave elements unmodified instead of setting them to zero where the bitmask had zeros, OR with the previous contents instead of assigning / storing.
This is rather inconvenient to express in C / C++ (compared to asm). To copy 8 bytes from a uint64_t into a char array, you can (and should) just use memcpy (to avoid any undefined behaviour because of pointer aliasing or misaligned uint64_t*). This will compile to a single 8-byte store with modern compilers.
But to OR them in, you'd either have to write a loop over the bytes of the uint64_t, or cast your char array to uint64_t*. This usually works fine, because char* can alias anything so reading the char array later doesn't have any strict-aliasing UB. But a misaligned uint64_t* can cause problems even on x86, if the compiler assumes that it is aligned when auto-vectorizing. Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?
Assigning a value other than 0 / 1
Use a multiply by 0xFF to turn the mask of 0/1 bytes into a 0 / -1 mask, and then AND that with a uint64_t that has your value broadcasted to all byte positions.
If you want to leave element unmodified instead of setting them to zero or value=121, you should probably use SSE2 / SSE4 or AVX2 even if your array has byte elements. Load the old contents, vpblendvb with set1(121), using the byte-mask as a control vector.
vpblendvb only uses the high bit of each byte, so your pdep constant can be 0x8080808080808080 to scatter the input bits to the high bit of each byte, instead of the low bit. (So you don't need to multiply by 0xFF to get an AND mask).
If your elements are dword or larger, you could use _mm256_maskstore_epi32. (Use pmovsx instead of zx to copy the sign bit when expanding the mask from bytes to dwords). This can be a perf win over a variable-blend + always read / re-write. Is it possible to use SIMD instruction for replace?.
Without pdep
pdep is very slow on Ryzen, and even on Intel it's maybe not the best choice.
The alternative is to turn your bitmask into a vector mask:
is there an inverse instruction to the movemask instruction in intel avx2? and
How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?.
i.e. broadcast your bitmap to every position of a vector (or shuffle it so the right bit of the bitmap in in the corresponding byte), and use a SIMD AND to mask off the appropriate bit for that byte. Then use pcmpeqb/w/d against the AND-mask to find the elements that had their bit set.
You're probably going to want to load / blend / store if you don't want to store zeros where the bitmap was zero.
Use the compare-mask to blend on your value, e.g. with _mm_blendv_epi8 or the 256bit AVX2 version. You can handle bitmaps in 16-bit chunks, producing 16-byte vectors with just a pshufb to send bytes of it to the right elements.
It's not safe for multiple threads to do this at the same time on the same array even if their bitmaps don't intersect, unless you use masked stores, though.

B instruction encoding

For school I'm working on writing an ARM simulator. In the ARM ARM (http://www.eecs.umich.edu/~prabal/teaching/eecs373-f11/readings/ARMv7-M_ARM.pdf) the calculate the second and third highest bit for the branch offset as I1 = NOT(J1 EOR S); and I2 = NOT(J2 EOR S); (ARM ARM pg 239). Does anyone know why it's this way? For some reason it seems to be causing me errors.
It's a bit of a puzzler but it appears to be a technique to map the I1 and I2 bits into valid instruction encodings.
Thumb branch instructions are encoded as a pair of 16-bit sub-instructions. Each 16-bit sub-instruction needs its own distinct instruction encoding.
The 'S' bit is a sign bit so we can see there's no way to distinguish Encoding T3 and Encoding T4 from the first 16-bit sub-instruction.
In the second sub-instruction bit 12 distinguishes Encoding T3 and T4. However, using I1 and I2 directly would clash with existing instructions so they're munged into one of four encodings, each determining the range of the branch.

Alternative of FP_SEG and FP_OFF for converting pointer to linear address

On 16 bit dos machine there is options like FP_SEG and FP_OFF for converting a pointer to linear address but since these method no more exist on 32 bit compiler what are other function that can do same on 32 bit machine??
They're luckily not needed as 32-bit mode is unsegmented and hence the addresses are always linear (a simplification, but let's keep it simple).
EDIT: The first version was confusing let's try again.
In 16-bit segmented mode (I'm exclusively referring to legacy DOS programs here, it'll probably be similar for other 16-bit x86 OSes) addresses are given in a 32-bit format consisting of a16-bit segment and a 16-bit offset. These are combined to form a 20-bit linear address (This is where the infamous 640K barrier comes from, 2**20 = 1MB and 384K are reserved for the system and bios leaving ~640K for user programs) by multiply the segment by 16 = 0x10 (equivalent to shifting left by 4) and adding the offset. I.e.: linear = segment*0x10 + offset.
This means that 2**12 segment:address type pointers will refer to the same linear address, so in general there is no way to obtain the 32-bit value used to form the linear address.
In old DOS programs that used far - segmented - pointers (as opposed to near pointers, which only contained an offset and implicitly used ds segment register) they were usually treated as 32-bit unsigned integer values where the 16 most significant bits were the segment and the 16 least significant bits the offset. This gives the following macro definitions for FP_SEG and FP_OFF (using the types from stdint.h):
#define FP_SEG(x) (uint16_t)((uint32_t)(x) >> 16) /* grab 16 most significant bits */
#define FP_OFF(x) (uint16_t)((uint32_t)(x)) /* grab 16 least significant bits */
To convert a 20-bit linear address to a segmented address you have many options (2**12). One way could be:
#define LIN_SEG(x) (uint16_t)(((uint32_t)(x)&0xf0000)>>4)
#define LIN_OFF(x) (uint16_t)((uint32_t)(x))
Finally a quick example of how it all works together:
Segmented address: a = 0xA000:0x0123
As 32-bit far pointer b = 0xA0000123
20-bit linear address: c = 0xA0123
FP_SEG(b) == 0xA000
FP_OFF(b) == 0x0123
LIN_SEG(c) = 0xA000
LIN_OFF(c) = 0x0123

Resources