Can someone explain what this instruction does and translate it to C?
ubfx.w r3, r11, #0xE, #1
According to the ARM reference manual, it does a "signed and unsigned bit field extract" but I'm not good with all that bitwise stuff.
UBFX just extracts a bitfield from the source register and puts it in the least significant bits of the destination register.
The general form is:
UBFX dest, src, lsb, width
which in C would be:
dest = (src >> lsb) & ((1 << width) - 1);
The C equivalent of the example you give would be:
r3 = (r11 >> 14) & 1;
i.e. r3 will be 1 if bit 14 of r11 is set, otherwise it will be 0.
See: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489c/Cjahjhee.html
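The general form above can be wrapped in a small helper. This is only a sketch: the name ubfx32 is made up for illustration, and the width == 32 case is handled separately because shifting a 32-bit value by 32 is undefined behaviour in C.

```c
#include <stdint.h>

/* Hypothetical helper mirroring "UBFX dest, src, lsb, width".
 * Assumes width >= 1 and lsb + width <= 32, as the instruction requires. */
static uint32_t ubfx32(uint32_t src, unsigned lsb, unsigned width)
{
    /* Build a mask of `width` one-bits. width == 32 is special-cased
     * because 1u << 32 is undefined for a 32-bit operand. */
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);

    /* Move the field down to bit 0, then discard everything above it. */
    return (src >> lsb) & mask;
}
```

With this helper, the instruction from the question corresponds to r3 = ubfx32(r11, 14, 1).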
I'm trying to write a VM (LC-3), and in the ADD instruction I encountered this statement. Basically "register0" is the DR register, but I don't really understand what is actually being shifted, and why by 9. I also don't understand the AND with the value 0x7.
| 15 14 13 12 | 11 10 9 | 8 7 6 | 5 | 4 3 | 2 1 0 |
|    0001     |   DR    |  SR1  | 0 | 00  |  SR2  |
Could anyone please explain it to me in detail?
ADD {
/* destination register (DR) */
uint16_t r0 = (instr >> 9) & 0x7;
/* first operand (SR1) */
uint16_t r1 = (instr >> 6) & 0x7;
/* whether we are in immediate mode */
uint16_t imm_flag = (instr >> 5) & 0x1;
if (imm_flag) {
uint16_t imm5 = sign_extend(instr & 0x1F, 5);
reg[r0] = reg[r1] + imm5;
} else {
uint16_t r2 = instr & 0x7;
reg[r0] = reg[r1] + reg[r2];
}
update_flags(r0);
}
What it's doing is isolating the 3 bits that represent the DR register so they become a standalone number.
Let's say the entire sequence looks like this:
1110101101101011
^^^
DR
Shifting 9 bits right gives this:
1110101
and & 0x7 (bitwise AND) isolates the 3 lowest bits:
101
Similar operations are performed to isolate the values of SR1 and the immediate mode flag. Depending on that flag, SR2 may also be required, but as it's already in the lowest 3 bits, no shifting is needed.
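The same shift-and-mask pattern can be checked against the example bit sequence above (1110101101101011, which is 0xEB6B). The helper names below are made up for illustration, and sign_extend is only one common way to write the function the question calls but does not show, so treat it as an assumption.

```c
#include <stdint.h>

/* DR lives in bits 11..9: shift it down to bit 0, keep 3 bits. */
static uint16_t lc3_dr(uint16_t instr)       { return (instr >> 9) & 0x7; }

/* SR1 lives in bits 8..6. */
static uint16_t lc3_sr1(uint16_t instr)      { return (instr >> 6) & 0x7; }

/* The immediate-mode flag is the single bit 5. */
static uint16_t lc3_imm_flag(uint16_t instr) { return (instr >> 5) & 0x1; }

/* Assumed implementation: if the top bit of the bit_count-wide value
 * is set, fill all higher bits with ones (two's complement). */
static uint16_t sign_extend(uint16_t x, int bit_count)
{
    if ((x >> (bit_count - 1)) & 1) {
        x |= (uint16_t)(0xFFFFu << bit_count);
    }
    return x;
}
```

For 0xEB6B, lc3_dr gives 5 (binary 101), which matches the hand-worked result.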
I'm trying to understand how this following code works and I am not able to understand what happens here:
uint32_t mask[5] = { 0xFFFF0000, 0xFF00, 0xF0, 0b1100, 0b10 };
uint32_t shift[5] = { 16, 8, 4, 2, 1 };
char first_bit_left_dichotomy(uint32_t M) {
char pos = 0;
char i;
for (i = 0; i <= 4; i++) {
if (M & mask[i]) {
M = M >> shift[i];
pos += shift[i];
}
}
return pos;
}
I have two questions. First: if this were a size comparison, shouldn't the masks be listed in this order?
uint32_t mask[5] = { 0b100, 0b1100, 0xF0, 0xFF00, 0xFFFF0000 };
Second: what does the for loop actually do? From my research I understand & and >> and their bitwise behaviour, but what is the trick here? I would guess it only ever compares against mask[0], since that is the only mask of the same size.
The algorithm follows the "divide and conquer" principle: it applies a binary search to the bit pattern to find the position of the most significant set bit.
It basically halves the part of the machine word under consideration with every step. The beauty of this is that you always find the MSB within 5 steps, regardless of which 32-bit pattern you put in, because binary search has O(log2(n)) characteristics.
Let's pick two extremes to illustrate the behaviour and assume the word 0x00000001 as an input to the algorithm. We would expect it to output 0. What basically happens is:
0x00000001 & 0xFFFF0000 = 0x00000000
-> We don't shift anything, M=0x00000001, pos=0
0x00000001 & 0xFF00 = 0x0000
-> We don't shift anything, M=0x00000001, pos=0
0x00000001 & 0xF0 = 0x00
-> We don't shift anything, M=0x00000001, pos=0
0x00000001 & 0b1100 = 0x00
-> We don't shift anything, M=0x00000001, pos=0
0x00000001 & 0b10 = 0b0
-> We don't shift anything, M=0x00000001, pos=0
So we got our result within 5 steps. Imagine now doing a loop from left to right attempting the same thing: It would take 31 steps to get to the result.
Also the word 0x8FFFFFFF as an input to the algorithm would need 5 steps to get the expected result 31:
0x8FFFFFFF & 0xFFFF0000 = 0x8FFF0000
-> We shift by 16 right, M=0x8FFF, pos=16
0x8FFF & 0xFF00 = 0x8F00
-> We shift by 8 right, M=0x8F, pos=24
0x8F & 0xF0 = 0x80
-> We shift by 4 right, M=0x8, pos=28
0x8 & 0b1100 = 0x8
-> We shift by 2 right, M=0x2, pos=30
0x2 & 0b10 = 0x2
-> We shift by 1 right, M=0x1, pos=31
As you can see, both extremes took exactly those 5 steps. Thanks to loop unrolling, conditional execution of instructions, etc., this is supposed to run quite fast, at least far faster than scanning for the set MSB from left to right in a loop.
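To try the two traces yourself, here is a compact restatement of the routine next to a naive bit-by-bit scan for comparison. The function names are illustrative, and the masks are written in hex because 0b literals are a compiler extension before C23. Both functions return 0 for an input of 0, just like the original.

```c
#include <stdint.h>

/* Binary search for the highest set bit: test the upper half of the
 * remaining window; if anything is set there, shift it down and
 * remember how far we moved. */
static unsigned msb_binary_search(uint32_t m)
{
    static const uint32_t mask[5]  = { 0xFFFF0000u, 0xFF00u, 0xF0u, 0xCu, 0x2u };
    static const unsigned shift[5] = { 16, 8, 4, 2, 1 };
    unsigned pos = 0;

    for (int i = 0; i < 5; i++) {
        if (m & mask[i]) {
            m >>= shift[i];
            pos += shift[i];
        }
    }
    return pos;
}

/* Reference version: shift right one bit at a time and count
 * how many shifts it takes until nothing is left above bit 0. */
static unsigned msb_linear_scan(uint32_t m)
{
    unsigned pos = 0;

    while (m >>= 1) {
        pos++;
    }
    return pos;
}
```

The window shrinks 32 -> 16 -> 8 -> 4 -> 2 bits, which is why the masks run from widest to narrowest rather than the other way round.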
I would want to create a macro to get easy access to a single bit from a structure like the following:
typedef union
{
struct
{
uint8_t bit0 : 1;
uint8_t bit1 : 1;
uint8_t bit2 : 1;
uint8_t bit3 : 1;
uint8_t bit4 : 1;
uint8_t bit5 : 1;
uint8_t bit6 : 1;
uint8_t bit7 : 1;
};
uint8_t raw;
} Bitfield;
I have a bi-dimensional array(x) of this structure. The best that I could make was :
#define xstr(r,c,b) str(r,c,b)
#define str(r,c,b) (x[r][c].bit##b)
#define getBit(bitCollum,row) (xstr(row,(bitCollum/8),(bitCollum%8)))
When I try to use the macro like uint8_t a = getBit(15,2); it will expand to
uint8_t a = ( ( img [ 2 ] [ ( 15 / 8 ) ] . bit 15 % 8 ) );
and I would want to create a structure that will expand to:
uint8_t a = ( ( img [ 2 ] [ ( 15 / 8 ) ] . bit7 ) );
Is this even possible?
bitCollum and row will always be literal integers; the expression will not be run in a loop or something like that.
EDIT:
After seeing that it wasn't possible, I looked at the disassembly of a simple increment using each approach. The instructions differ, but to my surprise the masking version was faster.
x.raw = 0b10101001;
00000040 LDI R24,0xA9 Load immediate
00000041 STD Y+8,R24 Store indirect with displacement
uint8_t y = 0b10101001;
00000042 LDI R24,0xA9 Load immediate
00000043 STD Y+1,R24 Store indirect with displacement
uint16_t xSum=0;
00000044 STD Y+3,R1 Store indirect with displacement
00000045 STD Y+2,R1 Store indirect with displacement
uint16_t ySum=0;
00000046 STD Y+5,R1 Store indirect with displacement
00000047 STD Y+4,R1 Store indirect with displacement
xSum+=x.bit3;
00000048 LDD R24,Y+8 Load indirect with displacement
00000049 BST R24,3 Bit store from register to T
0000004A CLR R24 Clear Register
0000004B BLD R24,0 Bit load from T to register
0000004C MOV R24,R24 Copy register
0000004D LDI R25,0x00 Load immediate
0000004E LDD R18,Y+2 Load indirect with displacement
0000004F LDD R19,Y+3 Load indirect with displacement
00000050 ADD R24,R18 Add without carry
00000051 ADC R25,R19 Add with carry
00000052 STD Y+3,R25 Store indirect with displacement
00000053 STD Y+2,R24 Store indirect with displacement
ySum+=y&0b00010000;
00000054 LDD R24,Y+1 Load indirect with displacement
00000055 MOV R24,R24 Copy register
00000056 LDI R25,0x00 Load immediate
00000057 ANDI R24,0x10 Logical AND with immediate
00000058 CLR R25 Clear Register
00000059 LDD R18,Y+4 Load indirect with displacement
0000005A LDD R19,Y+5 Load indirect with displacement
0000005B ADD R24,R18 Add without carry
0000005C ADC R25,R19 Add with carry
0000005D STD Y+5,R25 Store indirect with displacement
0000005E STD Y+4,R24 Store indirect with displacement
Instead of the structures, use plain bytes (uint8_t):
#define GETBIT(r,c) (img[r][(c) >> 3] & (1 << ((c) & 7)))
#define SETBIT(r,c) (img[r][(c) >> 3] |= (1 << ((c) & 7)))
#define CLRBIT(r,c) (img[r][(c) >> 3] &= ~(1 << ((c) & 7)))
Note that GETBIT yields zero or a non-zero mask value, not 0 or 1. Also, if you want it to be efficient, you had better avoid manipulating things one bit at a time.
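As a rough illustration of the byte-array approach, the same operations can be written as small functions that are easy to test. The array dimensions and the function names here are assumptions made for the example; only the img name and the >> 3 and & 7 arithmetic come from the macros above.

```c
#include <stdint.h>

#define IMG_ROWS 4    /* hypothetical dimensions for the example */
#define IMG_COLS 32   /* bits per row, so 4 bytes per row */

static uint8_t img[IMG_ROWS][IMG_COLS / 8];

/* c >> 3 selects the byte (c / 8), c & 7 selects the bit in it (c % 8).
 * Unlike a bare mask test, this normalises the result to 0 or 1. */
static int get_bit(unsigned r, unsigned c)
{
    return (img[r][c >> 3] >> (c & 7)) & 1;
}

static void set_bit(unsigned r, unsigned c)
{
    img[r][c >> 3] |= (uint8_t)(1u << (c & 7));
}

static void clr_bit(unsigned r, unsigned c)
{
    img[r][c >> 3] &= (uint8_t)~(1u << (c & 7));
}
```

For row 2, column 15 this touches bit 7 of img[2][1], the same bit the question's getBit(15,2) was meant to reach.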
It could be that I'm missing some "trick", but, AFAIK, this is not possible.
Basically, you're trying to compute a value and then append it to some token. The problem here is that the preprocessor doesn't do computations (except in #if and such statements). So, for example:
#define X2(A,B) A##B
#define X(A,B) X2(A,B)
int x = X(13 + 4, 4);
this will expand to:
int x = 13 + 44;
and not to:
int x = 174;
If you try to add parentheses, you will just get compiler errors, because this is not valid:
int x = (13+4)4;
While processing macros, everything is just a "string" (token) to the preprocessor. It is actually the compiler that will, in the example above, see that 13 + 44 is constant and compile it as int x = 57; (well, an intelligent compiler will; I've worked with some C compilers in my day that were not so smart :) ).
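To make the distinction concrete, here is a small sketch showing that pasting does work when the caller supplies the digit as its own literal token, while an expression is pasted token by token without being evaluated. The struct and function names are invented for this example.

```c
/* Two-level paste, as above: the inner macro performs the ##. */
#define X2(A, B) A##B
#define X(A, B)  X2(A, B)

/* A cut-down stand-in for the bit-field struct in the question. */
struct bits {
    unsigned bit0 : 1;
    unsigned bit7 : 1;
};

/* X(bit, 7) pastes the tokens `bit` and `7` into the member name bit7. */
static unsigned read_bit7(const struct bits *b)
{
    return b->X(bit, 7);
}

/* X(13 + 4, 4) pastes only the adjacent tokens `4` and `4`,
 * producing the expression 13 + 44; nothing is computed first. */
static int pasted_sum(void)
{
    return X(13 + 4, 4);
}
```

So a macro taking the byte index and bit number as separate literals can work, but one that derives them from 15 / 8 and 15 % 8 cannot.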
#include <stdio.h>
#define GET_BIT(VAR8,IDX) (((VAR8) >> (IDX)) & 1)
int main(void){
unsigned char c=3;
int i;
printf("Bits of char %d: ",c);
for(i=0; i<8;i++){
printf("%d ",GET_BIT(c,i));
}
printf("\n");
return 0;
}
For x64 I can use this:
{
uint64_t hi, lo;
// hi,lo = 64bit x 64bit multiply of c[0] and b[0]
__asm__("mulq %3\n\t"
: "=d" (hi),
"=a" (lo)
: "%a" (c[0]),
"rm" (b[0])
: "cc" );
a[0] += hi;
a[1] += lo;
}
But I'd like to perform the same calculation portably. For instance to work on x86.
As I understand the question, you want a portable pure C implementation of 64 bit multiplication, with output to a 128 bit value, stored in two 64 bit values. In which case this article purports to have what you need. That code is written for C++. It doesn't take much to turn it into C code:
void mult64to128(uint64_t op1, uint64_t op2, uint64_t *hi, uint64_t *lo)
{
uint64_t u1 = (op1 & 0xffffffff);
uint64_t v1 = (op2 & 0xffffffff);
uint64_t t = (u1 * v1);
uint64_t w3 = (t & 0xffffffff);
uint64_t k = (t >> 32);
op1 >>= 32;
t = (op1 * v1) + k;
k = (t & 0xffffffff);
uint64_t w1 = (t >> 32);
op2 >>= 32;
t = (u1 * op2) + k;
k = (t >> 32);
*hi = (op1 * op2) + w1 + k;
*lo = (t << 32) + w3;
}
Since you have gcc as a tag, note that you can just use gcc's 128-bit integer type:
typedef unsigned __int128 uint128_t;
// ...
uint64_t x, y;
// ...
uint128_t result = (uint128_t)x * y;
uint64_t lo = result;
uint64_t hi = result >> 64;
The accepted solution isn't really the best solution, in my opinion.
It is confusing to read.
It has some funky carry handling.
It doesn't take advantage of the fact that 64-bit arithmetic may be available.
It displeases ARMv6, the God of Absolutely Ridiculous Multiplies. Whoever uses UMAAL shall not lag but have eternal 64-bit to 128-bit multiplies in 4 instructions.
Joking aside, it is much better to optimize for ARMv6 than any other platform because it will have the most benefit. x86 needs a complicated routine and it would be a dead end optimization.
The best way I have found (and used in xxHash3) is the following, which selects between multiple implementations using macros. It is a tiny bit slower than mult64to128 on x86 (by 1-2 instructions), but a lot faster on ARMv6:
#include <stdint.h>
#ifdef _MSC_VER
# include <intrin.h>
#endif
/* Prevents a partial vectorization from GCC. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__i386__)
__attribute__((__target__("no-sse")))
#endif
static uint64_t multiply64to128(uint64_t lhs, uint64_t rhs, uint64_t *high)
{
/*
* GCC and Clang usually provide __uint128_t on 64-bit targets,
* although Clang also defines it on WASM despite having to use
* builtins for most purposes - including multiplication.
*/
#if defined(__SIZEOF_INT128__) && !defined(__wasm__)
__uint128_t product = (__uint128_t)lhs * (__uint128_t)rhs;
*high = (uint64_t)(product >> 64);
return (uint64_t)(product & 0xFFFFFFFFFFFFFFFF);
/* Use the _umul128 intrinsic on MSVC x64 to hint for mulq. */
#elif defined(_MSC_VER) && defined(_M_X64)
# pragma intrinsic(_umul128)
/* This intentionally has the same signature. */
return _umul128(lhs, rhs, high);
#else
/*
* Fast yet simple grade school multiply that avoids
* 64-bit carries with the properties of multiplying by 11
* and takes advantage of UMAAL on ARMv6 to only need 4
* calculations.
*/
/* First calculate all of the cross products. */
uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
uint64_t hi_lo = (lhs >> 32) * (rhs & 0xFFFFFFFF);
uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
uint64_t hi_hi = (lhs >> 32) * (rhs >> 32);
/* Now add the products together. These will never overflow. */
uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
uint64_t upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;
*high = upper;
return (cross << 32) | (lo_lo & 0xFFFFFFFF);
#endif /* portable */
}
On ARMv6, you can't get much better than this, at least on Clang:
multiply64to128:
push {r4, r5, r11, lr}
umull r12, r5, r2, r0
umull r2, r4, r2, r1
umaal r2, r5, r3, r0
umaal r4, r5, r3, r1
ldr r0, [sp, #16]
mov r1, r2
strd r4, r5, [r0]
mov r0, r12
pop {r4, r5, r11, pc}
The accepted solution generates a bunch of adds and adc, as well as an extra umull in Clang due to an instcombine bug.
I further explain the portable method in the link I posted.
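The portable fallback can be sanity-checked in isolation against products whose 128-bit result is easy to work out by hand. The sketch below copies only the grade-school branch into a function with a made-up name, mul64_portable, so that it is exercised even on compilers that would normally take the __uint128_t path.

```c
#include <stdint.h>

/* Copy of the portable branch only (hypothetical name).
 * Returns the low 64 bits and stores the high 64 bits in *high. */
static uint64_t mul64_portable(uint64_t lhs, uint64_t rhs, uint64_t *high)
{
    /* The four 32x32 -> 64 partial products. */
    uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* Sum the middle column. The largest possible value is
     * (2^32 - 1) + (2^32 - 1) + (2^32 - 1)^2 = 2^64 - 1, so it fits. */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;

    /* Carry the top half of the middle column into the high word. */
    uint64_t upper = (hi_lo >> 32) + (cross >> 32) + hi_hi;

    *high = upper;
    return (cross << 32) | (lo_lo & 0xFFFFFFFF);
}
```

Two convenient check values: (2^64 - 1)^2 has a high word of 0xFFFFFFFFFFFFFFFE and a low word of 1, and 2^32 * 2^32 has a high word of 1 and a low word of 0.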
My question would be Arduino specific, although if you know how to do it in C it will be similar in the Arduino IDE too.
So I have 5 integer variables:
r1, r2, r3, r4, r5
Their values are either 0 (off) or 1 (on).
I would like to store these in a byte variable, let's call it relays, not by adding them but by setting individual bits to 1 or 0 according to each variable's value.
For example:
1, 1, 0, 0, 1
I would like to have exactly that bit pattern in my relays byte variable, not
r1+r2+r3+r4+r5, which in this case would be decimal 3, binary 11.
Thanks!
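I recommend using a union of a byte and a structure of bit-fields. It adds clarity, and you can give each field a single bit or any number of adjacent bits and rearrange them quickly. Note that the order in which bit-fields are laid out within the byte is implementation-defined, so check it on your compiler before relying on the raw value.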
union {
uint8_t BAR;
struct {
uint8_t r1 : 1; // bit position 0
uint8_t r2 : 2; // bit positions 1..2
uint8_t r3 : 3; // bit positions 3..5
uint8_t r4 : 2; // bit positions 6..7
// total # of bits just needs to add up to the uint8_t size
} bar;
} foo;
void setup() {
Serial.begin(9600);
foo.bar.r1 = 1;
foo.bar.r2 = 2;
foo.bar.r3 = 2;
foo.bar.r4 = 1;
Serial.print(F("foo.bar.r1 = 0x"));
Serial.println(foo.bar.r1, HEX);
Serial.print(F("foo.bar.r2 = 0x"));
Serial.println(foo.bar.r2, HEX);
Serial.print(F("foo.bar.r3 = 0x"));
Serial.println(foo.bar.r3, HEX);
Serial.print(F("foo.bar.r4 = 0x"));
Serial.println(foo.bar.r4, HEX);
Serial.print(F("foo.BAR = 0x"));
Serial.println(foo.BAR, HEX);
}
You can expand this union to be larger than one byte.
Note that uint8_t is the same as byte.
You can even expand the union to an array of bytes and then send the bytes over the serial port, or clock them out individually as one long word, etc. See a more extensive example.
How about:
byte relays = (r1 << 4) | (r2 << 3) | (r3 << 2) | (r4 << 1) | r5;
Or the other way around:
byte relays = r1 | (r2 << 1) | (r3 << 2) | (r4 << 3) | (r5 << 4);
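For the example values 1, 1, 0, 0, 1 the first ordering yields binary 11001 (decimal 25). Here is a minimal sketch of packing and reading back with that ordering. The function names are made up, and plain C is used so that it also compiles in an Arduino sketch.

```c
#include <stdint.h>

/* Pack five on/off values into the low 5 bits of a byte,
 * with r1 in bit 4 down to r5 in bit 0. Each input is masked
 * with & 1 so that any stray non-0/1 value cannot spill over. */
static uint8_t pack_relays(int r1, int r2, int r3, int r4, int r5)
{
    return (uint8_t)(((r1 & 1) << 4) |
                     ((r2 & 1) << 3) |
                     ((r3 & 1) << 2) |
                     ((r4 & 1) << 1) |
                      (r5 & 1));
}

/* Read relay number `index` (1..5) back out of the packed byte. */
static int relay_state(uint8_t relays, unsigned index)
{
    return (relays >> (5u - index)) & 1;
}
```

Reading a relay back is the same shift-and-mask idea in reverse: move the wanted bit down to position 0 and AND with 1.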