I'm trying to efficiently implement the x86 SHLD and SHRD instructions without using inline assembly.
uint32_t shld_UB_on_0(uint32_t a, uint32_t b, uint32_t c) {
return a << c | b >> 32 - c;
}
seems to work, but invokes undefined behaviour when c == 0, because the second shift's count becomes 32. The actual SHLD instruction with a third operand of 0 is well defined to do nothing. (https://www.felixcloutier.com/x86/shld) Note that the hardware masks shift counts to 5 bits, so if the compiler emits a plain shr, b >> 32 executes as b >> 0 at run time and the c == 0 result silently becomes a | b instead of a.
uint32_t shld_broken_on_0(uint32_t a, uint32_t b, uint32_t c) {
return a << c | b >> (-c & 31);
}
doesn't invoke undefined behaviour, but when c == 0 the result is a | b instead of a.
uint32_t shld_safe(uint32_t a, uint32_t b, uint32_t c) {
if (c == 0) return a;
return a << c | b >> 32 - c;
}
does what's intended, but gcc now emits a branch (je). clang, on the other hand, is smart enough to translate it to a single shld instruction.
Is there any way to implement it correctly and efficiently without inline assembly?
And why does gcc try so hard to avoid emitting shld? The shld_safe attempt is translated by gcc 11.2 -O3 as follows (Godbolt):
shld_safe:
mov eax, edi
test edx, edx
je .L1
mov ecx, 32
sub ecx, edx
shr esi, cl
mov ecx, edx
sal eax, cl
or eax, esi
.L1:
ret
while clang emits:
shld_safe:
mov ecx, edx
mov eax, edi
shld eax, esi, cl
ret
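For the record, here is one more branchless candidate that avoids the c == 0 problem by splitting the second shift in two, so that neither shift count can reach 32 (a sketch, still assuming c is in 0..31; I have not verified whether gcc or clang turns it into shld):
uint32_t shld_two_shifts(uint32_t a, uint32_t b, uint32_t c) {
// both counts stay in 0..31, so this is fully defined;
// for c == 0, (b >> 1) >> 31 is 0 and the result is just a, matching SHLD
return a << c | (b >> 1) >> (31 - c);
}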
As far as I have tested with gcc 9.3 (x86-64), it translates the following code to shldq and shrdq.
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low) << (count & 63)) >> 64);
}
uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low) >> (count & 63));
}
Also, gcc -m32 -O3 translates the following code to shld and shrd. (I have not tested with gcc (i386), though.)
uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31)) >> 32);
}
uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}
(I have just read the gcc code and written the above functions, i.e. I'm not sure they are exactly the ones you expect.)
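If I read them correctly, they match the question's shld_safe with the operands swapped: shld_safe(a, b, c) == shld_x86(b, a, c) for c in 0..31, since a becomes the high half of the 64-bit intermediate. A hypothetical wrapper:
uint32_t shld_safe_via_x86(uint32_t a, uint32_t b, uint32_t c) {
return shld_x86(b, a, c); // ((a:b) << c) >> 32, well defined for c == 0
}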
Related
clang -O3 optimizes this code:
_Bool f1(char x)
{
_Bool b1 = x == 4;
_Bool b2 = x & 3;
return b1 & b2;
}
to:
f1:
xor eax, eax
ret
However, clang -O3 does not optimize this code:
_Bool f1(char x)
{
_Bool b1 = x == 2;
_Bool b2 = x & 1;
return b1 & b2;
}
Instead it emits:
f1:
cmp dil, 2
sete al
and al, dil
ret
Why?
Note: the & in b1 & b2 is used intentionally. If && is used instead, then clang -O3 optimizes it to:
f1:
xor eax, eax
ret
How can this be explained?
Inefficient code generation (due to "missing narrowing transforms for bitwise logic"). In both versions the result is provably always 0, since x == 2 implies the low bit of x is clear, just as x == 4 implies x & 3 is 0; clang proves it for the first pattern but misses the second.
The x86_64 mul instruction can multiply two 32-bit integers and put the high and low 32 bits of the 64-bit result in EDX:EAX. However, gcc & clang can't seem to output that code. Instead, for the following source:
uint32_t mulhi32(uint32_t a, uint32_t b) {
return (a * (uint64_t) b) >> 32;
}
they output:
mov esi, esi
mov edi, edi
imul rdi, rsi
shr rdi, 32
mov rax, rdi
ret
which zero-extends the two inputs to 64 bits, does a 64*64-bit multiply (keeping only the low 64 bits of the product), shifts that right by 32, and finally returns it. Madness!
The equivalent for 64 bits:
uint64_t mulhi64(uint64_t a, uint64_t b) {
return (a * (unsigned __int128) b) >> 64;
}
produces:
mov rax, rdi
mul rsi
mov rax, rdx
ret
Why don't gcc & clang do the equivalent thing for 32 bits?
Currently, from research and various attempts, I'm pretty sure that the only solution to this problem is to use assembly. I'm posting this question to show an existing problem, and maybe get attention from compiler developers, or get some hits from searches about similar problems.
If anything changes in the future, I will accept it as an answer.
This is a closely related question for MSVC.
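For reference, a minimal inline-asm sketch of the single-mul version (gcc extended asm, so not a pure-C solution; mulhi32_asm is a hypothetical name):
static inline uint32_t mulhi32_asm(uint32_t a, uint32_t b) {
uint32_t lo = a, hi;
// mul r/m32 computes EDX:EAX = EAX * operand; we keep only the high half
__asm__("mul %2" : "+a" (lo), "=d" (hi) : "rm" (b));
return hi;
}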
On x86_64 machines, div/idiv is faster with a 32-bit operand than with a 64-bit operand. When the dividend is 64-bit, the divisor is 32-bit, and you know the quotient fits in 32 bits, you don't have to use the 64-bit div/idiv: you can split the 64-bit dividend into two 32-bit registers, and even with this overhead, a 32-bit div on the split registers is faster than a 64-bit div on a full 64-bit register.
The compiler produces a 64-bit div for this function, and that is correct, because if the quotient of a 32-bit div does not fit in 32 bits, a hardware exception occurs.
uint32_t div_c(uint64_t a, uint32_t b) {
return a / b;
}
However, if the quotient is known to fit in 32 bits, doing a full 64-bit division is unnecessary. I used __builtin_unreachable to tell the compiler this, but it doesn't make a difference.
uint32_t div_c_ur(uint64_t a, uint32_t b) {
uint64_t q = a / b;
if (q >= 1ull << 32) __builtin_unreachable();
return q;
}
For both div_c and div_c_ur, the output from gcc is:
mov rax, rdi
mov esi, esi
xor edx, edx
div rsi
ret
clang does an interesting optimization: it checks at run time whether the high half of the dividend is zero and uses a 32-bit div in that case, but it still uses a 64-bit div whenever the dividend really needs 64 bits.
mov rax, rdi
mov ecx, esi
mov rdx, rdi
shr rdx, 32
je .LBB0_1
xor edx, edx
div rcx
ret
.LBB0_1:
xor edx, edx
div ecx
ret
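Writing that branch in C reproduces clang's shape under gcc as well, but it still falls back to the 64-bit div whenever the high half is nonzero, so it cannot exploit the quotient-fits guarantee (a sketch; div_branchy is a hypothetical name):
static inline uint32_t div_branchy(uint64_t a, uint32_t b) {
if ((a >> 32) == 0)
return (uint32_t)a / b; // pure 32-bit division, should compile to div r32
return (uint32_t)(a / b); // full 64-bit division
}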
I had to write straight in assembly to achieve what I want. I couldn't find any other way to do this.
__attribute__((naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
"mov eax, edi\n\t"
"mov rdx, rdi\n\t"
"shr rdx, 32\n\t"
"div esi\n\t"
"ret\n\t"
);}
Was it worth it? At least perf reports 49.47% overhead for div_c versus 24.88% for div_asm, so on my computer (Tiger Lake), div r32 is about twice as fast as div r64.
This is the benchmark code (the volatile results keep the compiler from optimizing the divisions away).
#include <stdint.h>
#include <stdio.h>
__attribute__((noinline))
uint32_t div_c(uint64_t a, uint32_t b) {
uint64_t q = a / b;
if (q >= 1ull << 32) __builtin_unreachable();
return q;
}
__attribute__((noinline, naked, sysv_abi))
uint32_t div_asm(uint64_t, uint32_t) {__asm__(
"mov eax, edi\n\t"
"mov rdx, rdi\n\t"
"shr rdx, 32\n\t"
"div esi\n\t"
"ret\n\t"
);}
static uint64_t rdtscp() {
uint32_t _;
return __builtin_ia32_rdtscp(&_);
}
int main() {
#define n 500000000ll
uint64_t c;
c = rdtscp();
for (int i = 1; i <= n; ++i) {
volatile uint32_t _ = div_c(i + n * n, i + n);
}
printf(" c%15ul\n", rdtscp() - c);
c = rdtscp();
for (int i = 1; i <= n; ++i) {
volatile uint32_t _ = div_asm(i + n * n, i + n);
}
printf("asm%15ul\n", rdtscp() - c);
}
Every idea in this answer is based on comments by Nate Eldredge, which showed me how powerful gcc's extended inline assembly can be. Even though I still have to write assembly, it is possible to create a custom function that behaves as if it were an intrinsic.
static inline uint32_t divqd(uint64_t a, uint32_t b) {
if (__builtin_constant_p(b)) {
return a / b;
}
uint32_t lo = a;
uint32_t hi = a >> 32;
__asm__("div %2" : "+a" (lo), "+d" (hi) : "rm" (b));
return lo;
}
__builtin_constant_p returns 1 if b can be evaluated at compile time. "+a" and "+d" mean the values are read from and written to the a and d registers (eax and edx). "rm" specifies that the input b can be either a register or a memory operand.
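One caveat worth noting: div raises #DE when the quotient overflows 32 bits, which happens exactly when the upper 32 bits of the dividend are greater than or equal to the divisor. A hypothetical debug wrapper could assert that precondition:
#include <assert.h>
static inline uint32_t divqd_checked(uint64_t a, uint32_t b) {
// the quotient fits in 32 bits iff the high half of a is < b (and b != 0)
assert(b != 0 && (a >> 32) < b);
return divqd(a, b);
}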
To see whether inlining and constant propagation work smoothly:
uint32_t divqd_r(uint64_t a, uint32_t b) {
return divqd(a, b);
}
divqd_r:
mov rdx, rdi
mov rax, rdi
shr rdx, 32
div esi
ret
uint32_t divqd_m(uint64_t a) {
extern uint32_t b;
return divqd(a, b);
}
divqd_m:
mov rdx, rdi
mov rax, rdi
shr rdx, 32
div DWORD PTR b[rip]
ret
uint32_t divqd_c(uint64_t a) {
return divqd(a, 12345);
}
divqd_c:
movabs rdx, 6120523590596543007
mov rax, rdi
mul rdx
shr rdx, 12
mov eax, edx
ret
and the results are satisfying (https://godbolt.org/z/47PE4ovMM). Note how divqd_c avoids div entirely: __builtin_constant_p routes the constant case to plain C division, which gcc turns into a multiplication by the reciprocal.
I am learning NASM assembly and trying to write an LFSR routine and call it from a C program to evaluate the execution time difference, but I have failed to figure out the problem with my code.
My LFSR C version works just fine and is defined as follows:
int lfsr(){
int cont = 0;
uint32_t start_state = SEED;
uint32_t lfsr = start_state;
uint32_t bit;
while (cont != 16777216) {
bit = ((lfsr >> 1) ^ (lfsr >> 5) ^ (lfsr >> 7) ^ (lfsr >> 13)) & 1;
lfsr = (lfsr >> 1) | (bit << 23);
lfsr_nums[cont] = lfsr;
cont++;
}
return cont;
}
My NASM x86 version was based on the C version, and it generates the numbers the same way the C code does. It takes a pointer to an array as a parameter, fills that array with the generated numbers (returning them by reference), and returns the count of numbers by value. The LFSR logic works just fine, I checked the numbers created, but the code still gives me a segfault (core dumped) error.
With gdb, the message is that the error is in the do loop. While debugging I found that the error is at mov dword [esi + 4 * eax], ebx; if I comment that instruction out, the code doesn't segfault.
section .text
global lfsr_nasm
lfsr_nasm:
push dword ebx;
mov esi, edi ; vec
mov eax, 0 ; cont = 0
mov ebx, 0x1313 ; lfsr = start_state = SEED
do:
mov ecx, ebx ; ecx = lfsr
shr ecx, 1 ; lfsr >> 1
mov edx, ebx ; edx = lfsr
shr edx, 5 ; lfsr >> 5
xor ecx, edx ; lfsr >> 1 ^ lfsr >> 5
mov edx, ebx ; edx = lfsr
shr edx, 7 ; edx = lfsr >> 7
xor ecx, edx ; lfsr >> 1 ^ lfsr >> 5 ^ lfsr >> 7
mov edx, ebx ; edx = lfsr
shr edx, 13 ; edx = lfsr >> 13
xor ecx, edx ; lfsr >> 1 ^ lfsr >> 5 ^ lfsr >> 7 ^ lfsr >> 13
and ecx, 1 ; ecx = bit
shr ebx, 1 ; lfsr >> 1
shl ecx, 23 ; bit << 23
or ebx, ecx ; lfsr = (lfsr >> 1) | (bit << 23)
mov dword [esi + 4 * eax], ebx
inc eax ; cont++
cmp eax, 16777216 ; cont != 16777216
jne do
pop dword ebx;
ret
This is how I call it from C, and how I declare my vector and the NASM function:
extern int lfsr_nasm (uint32_t *vec);
uint32_t lfsr_nums[16777216];
int main(int argc, char *argv[]){
int cont;
cont = lfsr_nasm(lfsr_nums);
for(int i = 0; i < 16777216; i++){
printf("%d ", lfsr_nums[i]);
}
}
I believe the vector may be too big for NASM or C, and maybe the program is trying to access bad memory, but I couldn't find anything to confirm my belief, nor a fix for the problem. I have already tried with malloc and calloc.
Borland C has pseudo-registers (_AX, _BX, _FLAGS, etc.) that can be used in C code to save register values to temporary variables.
Is there any MSVC equivalent? I tried #AX, #BX, etc., but the compiler (MSVC 1.5) gave an error ('40' unrecognized symbol).
I'm developing a 16-bit pre-boot app and can't use .
Thanks.
You don't need pseudo-registers if you only need to move values between registers and variables. Example:
int a = 4;
int b = 999;
__asm
{
mov eax, a; // eax now equals 4
mov b, eax; // b is assigned the value of eax
}
// b equals 4 now
Edit: to copy the flags into a variable and back to the flags again, you can use the LAHF and SAHF instructions. They transfer the low byte of FLAGS (SF, ZF, AF, PF, CF) to and from AH, so OF is not covered, and since AH is bits 15:8 of eax, the flag bits sit in the variable at an offset of 8. Example:
int flags = 0;
__asm
{
lahf; // low byte of FLAGS -> AH (bits 15:8 of eax)
mov flags, eax;
}
flags |= (1 << (8 + 0)); // set bit 0 of the FLAGS byte (CF), held in AH
__asm
{
mov eax, flags;
sahf; // AH -> low byte of FLAGS
// the carry flag is now set
}