B instruction encoding - arm

For school I'm working on writing an ARM simulator. In the ARM ARM (http://www.eecs.umich.edu/~prabal/teaching/eecs373-f11/readings/ARMv7-M_ARM.pdf) they calculate the second and third highest bits of the branch offset as I1 = NOT(J1 EOR S); and I2 = NOT(J2 EOR S); (ARM ARM pg. 239). Does anyone know why it's done this way? For some reason it seems to be causing me errors.

It's a bit of a puzzler but it appears to be a technique to map the I1 and I2 bits into valid instruction encodings.
Thumb branch instructions are encoded as a pair of 16-bit sub-instructions. Each 16-bit sub-instruction needs its own distinct instruction encoding.
The 'S' bit is just the sign bit of the offset, so there's no way to distinguish Encoding T3 from Encoding T4 by the first 16-bit sub-instruction alone.
In the second sub-instruction, bit 12 distinguishes Encoding T3 from T4. However, storing I1 and I2 directly would clash with existing instruction encodings, so they're munged (XORed with S and inverted) into one of four valid encodings, each determining the range of the branch.
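Concretely, here's a minimal C sketch of decoding the Encoding T4 offset from the two halfwords (field positions follow the manual; the function name is mine, and the sign extension assumes an arithmetic right shift):

#include <stdint.h>

// imm32 = SignExtend(S:I1:I2:imm10:imm11:'0', 32)  -- ARMv7-M ARM, Encoding T4
int32_t t4_branch_offset(uint16_t hw1, uint16_t hw2)
{
    uint32_t S     = (hw1 >> 10) & 1;     // sign bit
    uint32_t imm10 =  hw1        & 0x3FF;
    uint32_t J1    = (hw2 >> 13) & 1;
    uint32_t J2    = (hw2 >> 11) & 1;
    uint32_t imm11 =  hw2        & 0x7FF;

    uint32_t I1 = !(J1 ^ S);              // I1 = NOT(J1 EOR S)
    uint32_t I2 = !(J2 ^ S);              // I2 = NOT(J2 EOR S)

    uint32_t imm25 = (S << 24) | (I1 << 23) | (I2 << 22)
                   | (imm10 << 12) | (imm11 << 1);
    return ((int32_t)(imm25 << 7)) >> 7;  // sign-extend 25 bits to 32
}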


What is the most efficient way to flip all the bits from the least significant bit up to the most significant set bit?

Say, for example, I have a uint8_t that can be of any value, and I want to flip all the bits from the least significant bit up to the most significant set bit. How would I do that in the most efficient way? Is there a solution that avoids using a loop?
Here are some cases (the left side is the original bits; the right side is after the flips):
00011101 -> 00000010
00000000 -> 00000000
11111111 -> 00000000
11110111 -> 00001000
01000000 -> 00111111
[EDIT]
The type could also be larger than uint8_t: it could be uint32_t, uint64_t or __uint128_t. I just used uint8_t because it's the easiest size to show in the example cases.
In general I expect that most solutions will have roughly this form:
Compute the mask of bits that need to be flipped
XOR by that mask
As mentioned in the comments, x64 is a target of interest, and on x64 you can do step 1 like this:
Find the 1-based position p of the most significant 1, by counting leading zeroes (_lzcnt_u64) and subtracting that count from 64 (or 32, whichever is appropriate).
Create a mask with p consecutive set bits starting from the least significant bit, probably using _bzhi_u64.
There are some variations, such as using BitScanReverse to find the most significant 1 (but it has an ugly case for zero), or using a shift instead of bzhi (but it has an ugly case for 64). lzcnt and bzhi is a good combination with no ugly cases. bzhi requires BMI2 (Intel Haswell or newer, AMD Zen or newer).
Putting it together:
x ^ _bzhi_u64(~(uint64_t)0, 64 - _lzcnt_u64(x))
Which could be further simplified to
_bzhi_u64(~x, 64 - _lzcnt_u64(x))
as shown by Peter. This doesn't follow the original 2-step plan; rather, all bits are flipped, and then the bits that were originally leading zeroes are reset.
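Here's the same thing as a compilable sketch (assuming a compiler and CPU with BMI2 and LZCNT; the function name is mine):

#include <stdint.h>
#include <immintrin.h>

uint64_t flip_below_top(uint64_t x)
{
    // ~x flips every bit; bzhi then clears the bits that were leading
    // zeroes in x. lzcnt(0) == 64 makes the x == 0 case fall out naturally.
    return _bzhi_u64(~x, 64 - _lzcnt_u64(x));
}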
Since those original leading zeroes form a contiguous sequence of leading ones in ~x, an alternative to bzhi could be to add the appropriate power of two to ~x (though sometimes zero, which might be thought of as 2^64, putting the set bit just beyond the top of the number). Unfortunately the power of two that we need is a bit annoying to compute; at least I could not come up with a good way to do it, so it seems like a dead end to me.
Step 1 could also be implemented in a generic way (no special operations) using a few shifts and bitwise ORs, like this:
// Get all-ones at and below the leading 1
// On x86-64, this is probably slower than Paul R's method using BSR and shift,
// even though that has to special-case x==0
uint64_t m = x | (x >> 1);
m |= m >> 2;
m |= m >> 4;
m |= m >> 8;
m |= m >> 16;
m |= m >> 32; // remove this last step if x is 32-bit
// then the flip is x ^ m
AMD CPUs have slowish BSR (but fast LZCNT; https://uops.info/), so you might want this shift/or version for uint8_t or uint16_t (where it takes fewest steps), especially if you need compatibility with all CPUs and speed on AMD is more important than on Intel.
This generic version is also useful within SIMD elements, especially narrow ones, where we don't have a leading-zero-count until AVX-512.
TL:DR: use a uint64_t shift to implement this efficiently with uint32_t when compiling for 64-bit machines that have lzcnt (AMD since K10, Intel since Haswell). Without lzcnt (only bsr, which is baseline for x86), the n==0 case is still special.
For the uint64_t version, the hard part is that you have 65 different possible positions for the highest set bit, including non-existent (lzcnt producing 64 when all bits are zero). But a single shift with 64-bit operand-size on x86 can only produce one of 64 different values (assuming a constant input), since x86 shifts mask the count, like foo >> (c&63).
Using a shift requires special-casing one leading-bit-position, typically the n==0 case. As Harold's answer shows, BMI2 bzhi avoids that, allowing bit counts from 0..64.
Same for 32-bit operand-size shifts: they mask c&31. But to generate a mask for uint32_t, we can use a 64-bit shift efficiently on x86-64. (Or 32-bit for uint16_t and uint8_t. Fun fact: x86 asm shifts with 8 or 16-bit operand-size still mask their count mod 32, so they can shift out all the bits without even using a wider operand-size. But 32-bit operand size is efficient, no need to mess with partial-register writes.)
This strategy is even more efficient than bzhi for a type narrower than register width.
// optimized for 64-bit mode, otherwise 32-bit bzhi or a cmov version of Paul R's is good
#ifdef __LZCNT__
#include <stdint.h>
#include <immintrin.h>
uint32_t flip_32_on_64(uint32_t n)
{
    uint64_t mask32 = 0xffffffff;  // i.e. (uint64_t)(uint32_t)-1
    // this needs to be _lzcnt_u32, not __builtin_clz; we need 32 for n==0
    // If lzcnt isn't available, we can't avoid handling n==0 specially
    uint32_t mask = mask32 >> _lzcnt_u32(n);
    return n ^ mask;
}
#endif
This works equivalently for uint8_t and uint16_t (literally the same code with the same mask, using a 32-bit lzcnt on them after zero-extension), but not for uint64_t. (You could use an unsigned __int128 shift, but shrd masks its shift count mod 64, so compilers still need some conditional behaviour to emulate it. So you might as well do a manual cmov or something, or sbb same,same to generate a 0 or -1 in a register as the mask to be shifted.)
Godbolt with gcc and clang. Note that it's not safe to replace _lzcnt_u32 with __builtin_clz; clang 11 and later assume that it can't produce 32 even when they compile it to an lzcnt instruction (see footnote 1), and optimize the shift operand-size down to 32, which will act as mask32 >> (clz(n) & 31).
# clang 14 -O3 -march=haswell (or znver1 or bdver4 or other BMI2 CPUs)
flip_32_on_64:
        lzcnt   eax, edi        # Skylake fixed the output false-dependency for lzcnt/tzcnt, but not popcnt. Clang doesn't care; it's reckless about false deps except inside a loop in a single function.
        mov     ecx, 4294967295
        shrx    rax, rcx, rax
        xor     eax, edi
        ret
Without BMI2, e.g. with -march=bdver1 or barcelona (aka k10), we get the same code-gen except with shr rax, cl. Those CPUs do still have lzcnt, otherwise this wouldn't compile.
(I'm curious if Intel Skylake Pentium/Celeron run lzcnt as lzcnt or bsf. They lack BMI1/BMI2, but lzcnt has its own feature flag.
It seems low-power uarches as recent as Tremont are missing lzcnt, though, according to InstLatx64 for a Pentium Silver N6005 Jasper Lake-D, Tremont core. I didn't manually look for the feature bit in the raw CPUID dumps of recent Pentium/Celeron, but Instlat does have those available if someone wants to check.)
Anyway, bzhi also requires BMI2, so if you're comparing against that for any size but uint64_t, this is the comparison.
This shrx version can keep its -1 constant around in a register across loops. So the mov reg,-1 can be hoisted out of a loop after inlining, if the compiler has a spare register. The best bzhi strategy doesn't need a mask constant so it has nothing to gain. _bzhi_u64(~x, 64 - _lzcnt_u64(x)) is 5 uops, but works for 64-bit integers on 64-bit machines. Its latency critical path length is the same as this. (lzcnt / sub / bzhi).
Without LZCNT, one option might be to always flip as a way to get FLAGS set for CMOV, and use -1 << bsr(n) to XOR some of them back to the original state. This could reduce critical path latency. IDK if a C compiler could be coaxed into emitting this. Especially not if you want to take advantage of the fact that real CPUs keep the BSR destination unchanged if the source was zero, but only AMD documents this fact. (Intel says it's an "undefined" result.)
(TODO: finish this hand-written asm idea.)
Other C ideas for the uint64_t case: cmov or cmp/sbb (to generate a 0 or -1) in parallel with lzcnt to shorten the critical path latency? See the Godbolt link where I was playing with that.
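For reference, a plain C version of that kind of fallback for uint64_t (my own sketch: the ternary special-cases zero, since __builtin_clzll(0) is undefined, and typically compiles to a branch or cmov alongside bsr/lzcnt):

#include <stdint.h>

uint64_t flip_below_top64(uint64_t x)
{
    uint64_t mask = x ? (~0ULL >> __builtin_clzll(x)) : 0;
    return x ^ mask;
}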
ARM/AArch64 saturate their shift counts, unlike how x86 masks for scalar. If one could take advantage of that safely (without C shift-count UB) that would be neat, allowing something about as good as this.
x86 SIMD shifts also saturate their counts, which Paul R took advantage of with an AVX-512 answer using vlzcnt and variable-shift. (It's not worth copying data to an XMM reg and back for one scalar shift, though; only useful if you have multiple elements to do.)
Footnote 1: clang codegen with __builtin_clz or __builtin_clzll
Using __builtin_clzll(n) will get clang to use 64-bit operand-size for the shift, since values from 32 to 63 become possible. But you can't actually use that to compile for CPUs without lzcnt: the 63-bsr(n) a compiler would use without lzcnt available would not produce the 64 we need for that case. Not unless you did n<<=1; / n|=1; or something before the bsr and adjusted the result, but that would be slower than cmov.
If you were using a 64-bit lzcnt, you'd want uint64_t mask = -1ULL, since there will be 32 extra leading zeros after zero-extending to uint64_t. Fortunately all-ones is relatively cheap to materialize on all ISAs, so use that instead of 0xffffffff00000000ULL.
Here’s a simple example for 32 bit ints that works with gcc and compatible compilers (clang et al), and is portable across most architectures.
#include <stdint.h>

uint32_t flip(uint32_t n)
{
    if (n == 0) return 0;
    uint32_t mask = ~0U >> __builtin_clz(n);
    return n ^ mask;
}
We could avoid the extra check for n==0 if we used lzcnt on x86-64 (or clz on ARM), and we were using a shift that allowed a count of 32. (In C, shifts by the type-width or larger are undefined behaviour. On x86, in practice the shift count is masked &31 for shifts other than 64-bit, so this could be usable for uint16_t or uint8_t using a uint32_t mask.)
Be careful to avoid C undefined behaviour, including any assumption about __builtin_clz with an input of 0; modern C compilers are not portable assemblers, even though we sometimes wish they were when the language doesn't portably expose the CPU features we want to take advantage of. For example, clang assumes that __builtin_clz(n) can't be 32 even when it compiles it to lzcnt.
See Peter Cordes's answer for details.
If your use case is performance-critical, you might also want to consider a SIMD implementation for performing the bit-flipping operation on a large number of elements. Here's an example using AVX-512 for 32-bit elements:
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

void flip(const uint32_t in[], uint32_t out[], size_t n)
{
    assert((n & 15) == 0); // for this example we only handle arrays which are a whole number of vectors in size
    for (size_t i = 0; i + 16 <= n; i += 16) // a __m512i holds 16 x 32-bit elements
    {
        __m512i vin = _mm512_loadu_si512(&in[i]);
        __m512i vlz = _mm512_lzcnt_epi32(vin);
        __m512i vmask = _mm512_srlv_epi32(_mm512_set1_epi32(-1), vlz);
        __m512i vout = _mm512_xor_si512(vin, vmask);
        _mm512_storeu_si512(&out[i], vout);
    }
}
This uses the same approach as the other solutions, i.e. count leading zeroes, create mask, XOR, but for 32-bit elements it processes 16 elements per loop iteration (a 512-bit vector holds sixteen 32-bit lanes). You could implement a 64-bit version of this similarly, but unfortunately there are no similar AVX512 intrinsics for element sizes < 32 bits or > 64 bits.
You can see the above 32 bit example in action on Compiler Explorer (note: you might need to hit the refresh button at the bottom of the assembly pane to get it to re-compile and run if you get "Program returned: 139" in the output pane - this seems to be due to a glitch in Compiler Explorer currently).

How does memory store signed and unsigned char?

I just started to learn C, and I feel a little bit confused.
I have some questions:
If I have the following code:
signed char x = 56;
// In RAM, I will see 00111000, yes/no?
signed char z = -56;
// In RAM, I will see 11001000, yes/no?
unsigned char y = 200;
// In RAM, I will see 11001000, yes/no?
I also have the following code:
if (z < 0){
    printf("0 is bigger than z ");
}
After compiling, how do the assembly instructions know that z is -56 and not 200? (Are there special ASM instructions for signed and unsigned?)
As I mentioned in question 1, the value of both z and y is 11001000, and there is no indication of whether it's signed or unsigned.
Apologies if I didn't find the right way to ask my question; I hope you understand me.
Thanks
Numbers are stored in binary. Negative numbers are usually stored in two's complement form, but the C language allows different representations. So for this one:
signed char z = -56;
// In RAM, I will see 11001000, yes/no?
usually yes, but maybe not on some exotic platforms.
The second question is too implementation-specific. For example, a comparison against zero on x86 may be performed as a self-comparison that sets the flags register; for an unsigned comparison the sign flag (SF) is ignored.
The compiler will generate the appropriate instructions for the signed and unsigned cases. I think it might be better to see an example. The following code
void foobar();

void foo(unsigned char a)
{
    if (a < 10)
        foobar();
}

void bar(char a)
{
    if (a < 10)
        foobar();
}
will translate to this MIPS code with GCC 5.4 using the -O3 flag:
foo:
        andi    $4,$4,0x00ff
        sltu    $4,$4,10
        bne     $4,$0,$L4
        nop
        j       $31
        nop
$L4:
        j       foobar
        nop
bar:
        sll     $4,$4,24
        sra     $4,$4,24
        slt     $4,$4,10
        bne     $4,$0,$L7
        nop
        j       $31
        nop
$L7:
        j       foobar
        nop
This is the interesting part of the foo function (which uses the unsigned char type):
foo:
        andi    $4,$4,0x00ff
        sltu    $4,$4,10
As you can see, the sltu instruction is used, which is the unsigned version of slt. (You don't really have to know what it does.)
While if we look at the relevant part of the function bar:
bar:
        sll     $4,$4,24
        sra     $4,$4,24
        slt     $4,$4,10
You can see that slt is used, which treats its register operand as signed. The sll and sra pair performs sign extension, which is needed here because the operand a is signed; in the unsigned case it is not.
So you can see that different instructions are generated with respect to the signedness of the operands.
The compiler will generate different instructions depending on whether it is an unsigned or signed type, and that is what tells the processor which way to treat it. So yes, there are separate instructions for signed and unsigned. With Intel processors, there are also separate instructions depending on the width (char, short, int).
Are there special ASM instructions for signed and unsigned?
Yes, hardware generally has machine code instructions (or instruction sequences) that can
sign extend a byte to word size
zero extend a byte to word size
compare signed quantities for the various relations <, <=, >, >=
compare unsigned quantities for the various relations <, <=, >, >=
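For example, on x86 the extensions map to movsx/movzx and the comparisons to signed ("less"/"greater") versus unsigned ("below"/"above") condition codes. A small pair of functions (my own illustration, not from the original answer) makes this visible if you compile with optimization and inspect the asm:

int less_u(unsigned char a, unsigned char b) { return a < b; } // unsigned condition (setb, reads CF)
int less_s(signed char a, signed char b)     { return a < b; } // signed condition (setl, reads SF/OF)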
how do the assembly instructions know if z is -56 and not 200?
In high level languages we associate a type with a variable. From then on the compiler knows the default way to interpret code that uses the variable. (We can override or change that default interpretation using a cast at usages of the variable.)
In machine code, there are only bytes, either in memory or in CPU registers. So, it is not how the value is stored that matters (for signed vs. unsigned), but what instructions are used to access the storage. The compiler will use the right set of machine code instructions every time the variable is accessed.
While we store lots of things in memory, the processor has no concept of variable declarations. The processor only sees machine code instructions, and interprets all data types through the eyes of the instruction it is being told to execute.
As an assembly programmer, it is your job to apply the proper instructions (here signed vs. unsigned) to the same variable each time it is used. Using a byte as a signed variable and later as an unsigned variable is a logic bug that is easy to commit in assembly language.
Some assemblers will help if you use the wrong size to access a variable, but none that I know of will help if you use the proper size but the wrong signedness.
Computers do not know or care about such things. Unsigned vs. signed is only relevant to the programmer. The value 0xFF can at the same time be -1, 255, an address or a portion of an address, part of a floating point number, and so on. The computer does not care. HOW the programmer conveys their interpretation of the bits is through the program. Addition and subtraction do not care about signed vs. unsigned, because the logic is the same. But other instructions, like multiplies (where the result is larger than the inputs) or divides (where the result is smaller than the inputs), come in unsigned and signed versions; or your processor may only have one and you have to synthesize the other (or have neither and you have to synthesize both).
int fun0 ( void )
{
    return(5);
}

unsigned int fun1 ( void )
{
    return(5);
}
00000000 <fun0>:
0: e3a00005 mov r0, #5
4: e12fff1e bx lr
00000008 <fun1>:
8: e3a00005 mov r0, #5
c: e12fff1e bx lr
no special bits nor nomenclature...bits is bits.
The compiler, driven by the user's indication of signed vs. unsigned in the high-level language, chooses instructions and data values that cause the ALU to output flags indicating greater than, less than, and equal, through single flags or combinations; a conditional branch can then be taken based on those flags. Often, but not always, the compiler will generate the opposite test: for "if z < 0 then do something", the compiler will emit "if z >= 0 then jump over the something".

What is arrangement specifier(.16b,.8b) in ARM assembly language instructions?

I want to know what exactly the arrangement specifier is in ARM assembly instructions.
I have gone through the ARM TRMs, and I think it is the size of the NEON register that will be used for the computation.
For example:
TBL Vd.Ta, {Vn.16B,Vn+1.16B }, Vm.Ta
This is taken from http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/TBL_advsimd_vector.html
They mention Ta to be an arrangement specifier with the value 16B or 8B.
I would like to know what it means (the size of the NEON register... anything...).
The arrangement specifier is the number and size of the elements in the vector. For example, 8B means that you are looking at 8 elements of one byte (this will be a 64-bit vector), and 16B is 16 elements of 1 byte (a 128-bit vector).
This is taken from the ARM Reference Manual (its table of valid arrangement specifiers is not reproduced here).
I help myself remember this by thinking:
B = Bytes (8bit)
H = Halfwords (16bit)
S = Single words (32bit)
D = Double words (64bit)
I don't know if that is official, but it's how I remember it.
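To make this concrete, here are a few hand-written AArch64 examples (my own, not from the manual) showing how the specifier sets the lane count and lane width of the same vector register:

ADD V0.16B, V1.16B, V2.16B   // 16 lanes of 8-bit adds (whole 128-bit register)
ADD V0.8B,  V1.8B,  V2.8B    // 8 lanes of 8-bit adds (low 64 bits only)
ADD V0.4S,  V1.4S,  V2.4S    // 4 lanes of 32-bit adds
ADD V0.2D,  V1.2D,  V2.2D    // 2 lanes of 64-bit adds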

Union in C changing machine behavior of Float Addition

I'm new to C programming, and I've been told to avoid unions, which in general makes perfect sense and I agree with. However, as part of an academic exercise I'm writing an emulator for hardware single-precision floating-point addition by doing bit-manipulation operations on unsigned 32-bit integers. I only mention that to explain why I want to use unions; I'm having no trouble with the emulation.
In order to test this emulator, I wrote a test program. But of course I'm trying to find the bit representation of floats on my hardware, so I thought this could be the perfect use for a union. I wrote this union:
typedef union {
    float floatRep;
    uint32_t unsignedIntRep;
} FloatExaminer;
This way, I can initialize a float with the floatRep member and then examine the bits with the unsignedIntRep member.
This worked most of the time, but when I got to NaN addition, I started running into trouble. The exact situation was that I wrote a function to automate these tests. The gist of it was this:
void addTest(float op1, float op2){
    FloatExaminer result;
    result.floatRep = op1 + op2;
    printf("%f + %f = %f\n", op1, op2, result.floatRep);
    // print bit pattern as well
    printf("Bit pattern of result: %08x", result.unsignedIntRep);
}
OK, now for the confusing part:
I added a NaN and another NaN with different mantissa bit patterns so I could differentiate between the two. On my particular hardware, the addition is supposed to return the second NaN operand (making it quiet if it was signalling). (I'll explain how I know this below.) However, passing the bit patterns op1=0x7fc00001, op2=0x7fc00002 would return op1, 0x7fc00001, every time!
I know it's supposed to return the second operand because I tried--outside the function--initializing as an integer and casting to a float as below:
uint32_t intRep1 = 0x7fc00001;
uint32_t intRep2 = 0x7fc00002;
float *op1 = (float *) &intRep1;
float *op2 = (float *) &intRep2;
float result = *op1 + *op2;
uint32_t *intResult = (uint32_t *)&result;
printf("%08x", *intResult); //bit pattern 0x7fc00002
In the end, I've concluded that unions are evil and I should never use them. However, does anyone know why I'm getting the result I am? Did I make a stupid mistake or assumption? (I understand that hardware architecture varies, but this just seems bizarre.)
I'm assuming that when you say "my particular hardware", you are referring to an Intel processor using SSE floating point. But in fact, that architecture has a different rule, according to the Intel® 64 and IA-32 Architectures Software Developer's Manual. Here's a summary of Table 4.7 ("Rules for handling NaNs") from Volume 1 of that documentation, which describes the handling of NaNs in arithmetic operations (QNaN is a quiet NaN; SNaN is a signalling NaN; I've only included information about two-operand instructions):
SNaN and QNaN
    x87 FPU — QNaN source operand.
    SSE — First source operand, converted to a QNaN.
Two SNaNs
    x87 FPU — SNaN source operand with the larger significand, converted to a QNaN.
    SSE — First source operand, converted to a QNaN.
Two QNaNs
    x87 FPU — QNaN source operand with the larger significand.
    SSE — First source operand.
NaN and a floating-point value
    x87/SSE — NaN source operand, converted to a QNaN.
SSE floating point machine instructions generally have the form op xmm1, xmm2/m32, where the first operand is the destination register and the second operand is either a register or a memory location. The instruction will then do, in effect, xmm1 <- xmm1 (op) xmm2/m32, so the first operand is both the left-hand operand of the operation and the destination. That's the meaning of "first operand" in the above chart. AVX adds three-operand instructions, where the destination might be a different register; it is then the third operand and does not figure in the above chart. The x87 FPU uses a stack-based architecture, where the top of the stack is always one of the operands and the result replaces either the top of the stack or the other operand; in the above chart, it will be noted that the rules do not attempt to decide which operand is "first", relying instead on a simple comparison.
Now, suppose we're generating code for an SSE machine, and we have to handle the C statement:
a = b + c;
where none of those variables are in a register. That means we might emit code something like this: (I'm not using real instructions here, but the principle is the same)
LOAD r1, b (r1 <- b)
ADD r1, c (r1 <- r1 + c)
STORE r1, a (a <- r1)
But we could also do this, with (almost) the same result:
LOAD r1, c (r1 <- c)
ADD r1, b (r1 <- r1 + b)
STORE r1, a (a <- r1)
That will have precisely the same effect, except for additions involving NaNs (and only when using SSE). Since arithmetic involving NaNs is unspecified by the C standard, there is no reason why the compiler should care which of these two options it chooses. In particular, if r1 happened to already have the value c in it, the compiler would probably choose the second option, since it saves a load instruction. (And who is going to complain? We all want the compiler to generate code which runs as quickly as possible, no?)
So, in short, the order of the operands of the ADD instruction will vary with the intricate details of how the compiler chooses to optimize the code, and the particular state of the registers at the moment the addition operator is being emitted. It is possible that this is affected by the use of a union, but it is equally or more likely that it has to do with the fact that in your code using the union, the values being added are arguments to the function and are therefore already placed in registers.
Indeed, different versions of gcc, and different optimization settings, produce different results for your code. And forcing the compiler to emit x87 FPU instructions produces yet different results, because the hardware operates according to a different logic.
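If you want to experiment yourself, here's a small self-contained test program (my own sketch, using memcpy instead of a union for the type punning) that makes the operand-order sensitivity observable; the exact output depends on your compiler, optimization flags and target FPU:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static float from_bits(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }
static uint32_t to_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

int main(void)
{
    float a = from_bits(0x7fc00001);   // quiet NaN, payload 1
    float b = from_bits(0x7fc00002);   // quiet NaN, payload 2
    // With SSE, each addition keeps the NaN from its first source operand,
    // but the compiler is free to commute a+b, so the two lines may differ.
    printf("a+b = %08x\n", to_bits(a + b));
    printf("b+a = %08x\n", to_bits(b + a));
    return 0;
}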
Note:
If you want some bedtime reading, you can download the entire Intel SDM (currently 4,684 pages / 23.3MB, but it keeps on getting bigger) from their site.

rlcf instruction with pic 18F4550 in C compiler

I'm new at hardware programming with a C compiler for the PIC 18F4550 from Microchip.
My question is: can someone give me an example of how to rotate bits and get back the carry, using the 'rlcf' instruction (from the C compiler)?
This instruction shifts the bits to the left and places the leftmost bit in the carry, and you should be able to read this value back from the carry.
I know how it works, but I cannot find any example code that I can adapt to my own code.
This is the data input I receive. It must be converted into binary values and then rotated:
unsigned int red = 1206420333240;
Thanks in advance!
You don't have access to carry bits in a C compiler, you'd have to use assembly to get to them.
Also, your value is too big for an unsigned int on a PIC18, which is a 16-bit type with a maximum of 65535 decimal (0xFFFF hex).
How you write assembly inside a C file varies depending on the compiler. In Hitech C, the following syntax is valid:
asm("RLCF REG,0,0"); // replace REG with your register and consider the d and a flags
asm("BC 5");         // branch if carry
But note that this is rotating one byte, not a two-byte number. You need to chain together two rotates of two registers to rotate a 16-bit number.
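As a rough illustration of that chaining in portable C (my own sketch, keeping the carry in a variable rather than in the hardware flag, so it works with any C compiler):

#include <stdint.h>

// Rotate a 16-bit value left through a software "carry" bit:
// the old carry enters at bit 0, and the bit shifted out of bit 15
// becomes the new carry.
uint16_t rlc16(uint16_t x, uint8_t *carry)
{
    uint8_t new_carry = (uint8_t)((x >> 15) & 1);
    x = (uint16_t)((x << 1) | (*carry & 1));
    *carry = new_carry;
    return x;
}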
