GCC hotpatching?

GCC hotpatching? - c

When I compile this piece of code
unsigned char A[] = {1, 2, 3, 4};
unsigned int
f (unsigned int x)
{
return A[x];
}
gcc outputs
mov edi, edi
movzx eax, BYTE PTR A[rdi]
ret
on a x86_64 machine.
The question is: why is a nop instruction (mov edi, edi) there for?
Im am using gcc-4.4.4.

In 64-bit mode, mov edi, edi is not a no-op. What it does is set the top 32 bits of rdi to 0.
This is a special case of the general fact that all 32-bit operations clear the top 32 bits of the destination register in 64-bit mode. (This allows a more efficient CPU than leaving them unchanged and is perhaps more useful as well.)

Related

Converting a signed int32 to an unsigned int64

int main(){
unsigned long a = 5;
int b = -6;
long c = a + b;
return 0;
}
I wanted to follow the rules explined in this link and confirm my understanding
of how the compiler emits code for a + b:
https://en.cppreference.com/w/c/language/conversion
1- b is first converted to an unsigned long:
If the unsigned type has conversion rank greater than or equal to the rank of the signed type, then the operand with the signed type is implicitly converted to the unsigned type.
So the the compiler essentially does this:
unsigned long implicit_conversion_of_b = (unsigned long) b;
2- The above implicit conversion itself is covered with this rule under Integer conversions:
if the target type is unsigned, the value 2b
, where b is the number of bits in the target type, is repeatedly subtracted or added to the source value until the result fits in the target type.
3- We finlay end up with these 64-bit values in a register before addition takes place:
a = 0x5
b = 0xfffffffffffffffa
Is the above a correct mapping to the rules?
Edit:
4- The final result is an unsigned long which needs to be converted to long needed b c using this rule:
otherwise, if the target type is signed, the behavior is implementation-defined (which may include raising a signal)

Is the above a correct mapping to the rules?
Yes.

I'm certainly no assembly expert, but it's interesting to see what's happening here. Not sure what optimizations you're talking about compiling with -O0 (seen below), obviously -O2 and -O3 look a lot different:
main:
// setup stack
push rbp
mov rbp, rsp
// move 5 into 8-byte offset from the stack frame.
// rbp is the stack frame pointer, offset of 8 shows long is 8
// bytes on this architecture
mov QWORD PTR [rbp-8], 5
// move -6 to 12 bytes offset from rbp (or 4 byte offset from
// last value. This tells you int is 4 bytes on this architecture.
mov DWORD PTR [rbp-12], -6
// move the int into eax register. This is a 32-bit general
// purpose register
mov eax, DWORD PTR [rbp-12]
// movsx is a "move with sign extension" instruction. rdx is a
// 64-bit register, so this is your conversion from 32 to 64
// bits, preserving the sign
movsx rdx, eax
// moves 5 to the 64-bit rax register
mov rax, QWORD PTR [rbp-8]
// performs the 64-bit add
add rax, rdx
// not using the result, so cleanup, prepare to return
// from function
mov QWORD PTR [rbp-24], rax
mov eax, 0
pop rbp
ret
This assembly was generated with gcc 11.2 on x64-86

When joining four 1-byte vars into one 4-byte word, which is a faster way to shift and OR ? (comparing generated assembly code)

So I'm currently studying bit-wise operators and bit-manipulation, and I have come across two different ways to combine four 1-byte words into one 4-byte wide word.
the two ways are given below
After finding out this two methods I compare the disassembly code generated by the two (compiled using gcc 11 with -O2 flag), I don't have the basic knowledge with disassembly and with the code it generates, and what I only know is the shorter the code, the faster the function is (most of the time I guess... maybe there are some exceptions), now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?
I also got curious about the order of the instructions, the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or does this have some significant effect in the performance say for example if we are dealing with a larger word?
Two methods
int method1(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
int combine = 0;
combine = byte4;
combine <<=8;
combine |= byte3;
combine <<=8;
combine |= byte2;
combine <<=8;
combine |= byte1;
return combine;
}
int method2(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
int combine = 0, temp;
temp = byte4;
temp <<= 24;
combine |= temp;
temp = byte3;
temp <<= 16;
combine |= temp;
temp = byte2;
temp <<= 8;
combine |= temp;
temp = byte1;
combine |= temp;
return combine;
}
Disassembly
// method1(unsigned char, unsigned char, unsigned char, unsigned char):
movzx edi, dil
movzx esi, sil
movzx edx, dl
movzx eax, cl
sal edi, 8
or esi, edi
sal esi, 8
or edx, esi
sal edx, 8
or eax, edx
ret
// method2(unsigned char, unsigned char, unsigned char, unsigned char):
movzx edx, dl
movzx ecx, cl
movzx esi, sil
sal edi, 24
sal edx, 8
sal esi, 16
or edx, ecx
or edx, esi
mov eax, edx
or eax, edi
ret
This might be "premature optimization", but I just want to know if there is a difference.

now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?
That hasn't been true for decades. Modern CPUs can execute instructions in parallel if they're independent of each other. See
How many CPU cycles are needed for each assembly instruction?
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
In your case, method 2 is clearly better (with GCC11 -O2 specifically) because the 3 shifts can happen in parallel, leaving only a chain of or instructions. (Most modern x86 CPUs only have 2 shift units, but the 3rd shift can be happening in parallel with the first OR).
Your first version has one long dependency chain of shift/or/shift/or (after the movzx zero-extension), so it has the same throughput but worse latency. If it's not on the critical path of some larger computation, performance would be similar.
The first version also has a redundant movzx edi, dil, because GCC11 -O2 doesn't realize that the high bits will eventually get shifted out the top of the register by the 3x8 = 24 bits of shifting. Unfortunately GCC chose to movzx into the same register (RDI into RDI), not for example movzx eax, dil which would let mov-elimination work.
The second one has a wasted mov eax, edx because GCC didn't realize it should do one of the movzx operations into EAX, instead of zero-extending each reg into itself (defeating mov-elimination). It also could have used lea eax, [edx + edi] to merge into a different reg, because it could have proved that those values couldn't have any overlapping bit-positions, so | and + would be equivalent.
That wasted mov generally only happens in small functions; apparently GCC's register allocator has a hard time when values need to be in specific hard registers. If it had its choice of where to produce the value, it would just end up with it in EDX.
So on Intel CPUs, yes by coincidence of different missed optimizations, both versions have 3 non-eliminated movzx and one instruction which can benefit from mov-elimination. On AMD CPUs, movzx is never eliminated, so the movzx eax, cl doesn't benefit.
Unfortunately, your compiler chooses to do all three or operations in sequence, instead of a tree of dependencies. (a|b) | (c|d) would be lower worst-case latency than (((a|b) | c) | d), critical path length of 2 from all inputs vs. 3 from d, 2 from c, 1 from a or b. (Writing it in C those different ways doesn't actually change how compilers make asm, because they knows that OR is associative. I'm using that familiar syntax to represent to dependency pattern of the assembly.)
So if all four inputs were ready at the same time, combining pairs would be lower latency, although it's impossible for most CPUs to produce three shift results in the same cycle.
I was able to hand-hold earlier GCC (GCC5) into making this dependency pattern (Godbolt compiler explorer). I used unsigned to avoid ISO C undefined behaviour. (GNU C does define the behaviour of left shifts even when a set bit is shifted in or out of the sign bit.)
int method4(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
unsigned combine = (unsigned)byte4 << 24;
combine |= byte3 << 16;
unsigned combine_low = (byte2 << 8) | byte1;
combine |= combine_low;
return combine;
}
# GCC5.5 -O3
method4(unsigned char, unsigned char, unsigned char, unsigned char):
movzx eax, sil
sal edi, 24
movzx ecx, cl
sal eax, 16
or edi, eax # (byte4<<24)|(byte3<<16)
movzx eax, dl
sal eax, 8
or eax, ecx # (byte2<<8) | (byte1)
or eax, edi # combine halves
ret
But GCC11.2 makes the same asm for this vs. method2. That would be good if it was optimal, but it's not.
the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or
The dependency chains are the key factor. Mainstream x86 has been out-of-order exec for over 2 decades, and there haven't been any in-order exec x86 CPUs sold for years. So instruction scheduling (ordering of independent instructions) generally isn't a big deal over very small distances like a few instructions. Of course, in the alternating shl/or version, they're not independent so you couldn't reorder them without breaking it or rewriting it.
Can we do better with partial-register shenanigans?
This part is only relevant if you're a compiler / JIT developer trying to get a compiler to do a better job for source like this. I'm not sure there's a clear win here, although maybe yes if we can't inline this function so the movzx instructions are actually needed.
We can certainly save instructions, but even modern Intel CPUs still have partial-register merging penalties for high-8-bit registers like AH. And it seems the AH-merging uop can only issue in a cycle by itself, so it effectively costs at least 4 instructions of front-end bandwidth.
movzx eax, dl
mov ah, cl
shl eax, 16 # partial register merging when reading EAX
mov ah, sil # oops, not encodeable, needs a REX for SIL which means it can't use AH
mov al, dil
Or maybe this, which avoids partial-register stalls and false dependencies on Intel Haswell and later. (And also on uarches that don't rename partial regs at all, like all AMD, and Intel Silvermont-family including the E-cores on Alder Lake.)
# good on Haswell / AMD
# partial-register stalls on Intel P6 family (Nehalem and earlier)
merge4(high=EDI, mid_hi=ESI, mid_lo=EDX, low=ECX):
mov eax, edi # mov-elimination possible on IvB and later, also AMD Zen
# zero-extension not needed because more shifts are coming
shl eax, 8
shl edx, 8
mov al, sil # AX = incoming DIL:SIL
mov dl, cl # DX = incoming DL:CL
shl eax, 16
mov ax, dx # EAX = incoming DIL:SIL:DL:CL
ret
This is using 8-bit and 16-bit mov as an ALU merge operation pretty much like movzx + or, i.e. a bitfield insert into the low 8 or low 16. I avoided ever moving into AH or other high-8 registers, so there's no partial-register merging on Haswell or later.
This is only 7 total instructions (not counting ret), all of them single-uop. And one of them is just a mov which can often be optimized away when inlining, because the compiler will have its choice of which registers to have the value in. (Unless the original value of the high byte alone is still needed in a register). It will often know it already has a value zero-extended in a register after inlining, but this version doesn't depend on that.
Of course, if you were eventually storing this value to memory, doing 4 byte stores would likely be good, especially if it's not about to be reloaded. (Store-forwarding stall.)
Related:
Why doesn't GCC use partial registers? partial regs on other CPUs, and why my "good" version would be bad on crusty old Nehalem and earlier CPUs.
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent the basis for the partial-register versions
Why does GCC chose dword movl to copy a long shift count to CL? - re: using 32-bit mov to copy an incoming uint8_t. (Although that's saying why it's fine even on CPUs that do rename DIL separate from the full RDI, because callers will have written the full register.)
Semi-related:
Doing this without shifts, just partial-register moves, as an asm exercise: How do I shift left in assembly with ADD, not using SHL? sometimes using store/reload which will cause store-forwarding stalls.

I completely agree with Émerick Poulin:
So the real answer, would be to benchmark it on your target machine to
see which one is faster.
Nevertheless, I created a "method3()", and disassembled all three with gcc version 10.3.0, with -O0 and -O2, Here's a summary of the -O2 results:
Method3:
int method3(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
int combine = (byte4 << 24)|(byte3<<16)|(byte2<<8)|byte1;
return combine;
}
gcc -O2 -S:
;method1:
sall $8, %eax
orl %edx, %eax
sall $8, %eax
movl %eax, %edx
movzbl %r8b, %eax
orl %edx, %eax
sall $8, %eax
orl %r9d, %eax
...
;method2:
sall $8, %r8d
sall $16, %edx
orl %r9d, %r8d
sall $24, %eax
orl %edx, %r8d
orl %r8d, %eax
...
;method3:
sall $8, %r8d
sall $16, %edx
orl %r9d, %r8d
sall $24, %eax
orl %edx, %r8d
orl %r8d, %eax
method1 has fewer instructions than method2, and method2 compiles to exactly the same code has method3. method1 also has a couple of "moves", which are more "expensive" than "or" or "shift".

Without optimizations, method 2 seems to be a tiny bit faster.
This is an online benchmark of the code provided: https://quick-bench.com/q/eyiiXkxYVyoogefHZZMH_IoeJss
However, it is difficult to get accurate metrics from tiny operations like this.
It also depends on the speed of the instructions on a given CPU (different instructions can take more or less clock cycles).
Furthermore, bitwise operators tend to be pretty fast since they are "basic" instructions.
So the real answer, would be to benchmark it on your target machine to see which one is faster.

a 'c' only portable.
unsigned int portable1(unsigned char byte4, unsigned char byte3,
unsigned char byte2, unsigned char byte1)
{
union tag1
{
_uint32 result;
_uint8 a[4];
} u;
u.a[0] = byte1;
u.a[1] = byte2;
u.a[2] = byte3;
u.a[3] = byte4;
return u.result;
}
This should generate 4 MOV and a register load in most environments. If the union is defined at module scope or global, the final LOAD (or MOV) disappears from the function. We can also easily allow for a 9bit byte and Intel versus Network byte ordering...

Emit DIV instruction, instead of __udivti3

Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
unsigned __int128 d = (unsigned __int128)a*(unsigned __int128)b;
return d/c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits __udivti3, instead of DIVQ instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already) slow later, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: Let's assume the quotient fits into 64-bits register.

div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to single a div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.

Converting a loop from x86 assembly to C language with AND, OR, SHR and SHL instructions and an array

I don't understand what is the problem because the result is right, but there is something wrong in it and i don't get it.
1.This is the x86 code I have to convert to C:
%include "io.inc"
SECTION .data
mask DD 0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555
SECTION .text
GLOBAL CMAIN
CMAIN:
GET_UDEC 4, EAX
MOV EBX, mask
ADD EBX, 16
MOV ECX, 1
.L:
MOV ESI, DWORD [EBX]
MOV EDI, ESI
NOT EDI
MOV EDX, EAX
AND EAX, ESI
AND EDX, EDI
SHL EAX, CL
SHR EDX, CL
OR EAX, EDX
SHL ECX, 1
SUB EBX, 4
CMP EBX, mask - 4
JNE .L
PRINT_UDEC 4, EAX
NEWLINE
XOR EAX, EAX
RET
2.My converted C code, when I input 0 it output me the right answer but there is something false in my code I don't understand what is:
#include "stdio.h"
int main(void)
{
int mask [5] = {0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555};
int eax;
int esi;
int ebx;
int edi;
int edx;
char cl = 0;
scanf("%d",&eax);
ebx = mask[4];
ebx = ebx + 16;
int ecx = 1;
L:
esi = ebx;
edi = esi;
edi = !edi;
edx = eax;
eax = eax && esi;
edx = edx && edi;
eax = eax << cl;
edx = edx >> cl ;
eax = eax || edx;
ecx = ecx << 1;
ebx = ebx - 4;
if(ebx == mask[1]) //mask - 4
{
goto L;
}
printf("%d",eax);
return 0;
}

Assembly AND is C bitwise &, not logical &&. (Same for OR). So you want eax &= esi.
(Using &= "compound assignment" makes the C even look like x86-style 2-operand asm so I'd recommend that.)
NOT is also bitwise flip-all-the-bits, not booleanize to 0/1. In C that's edi = ~edi;
Read the manual for x86 instructions like https://www.felixcloutier.com/x86/not, and for C operators like ~ and ! to check that they are / aren't what you want. https://en.cppreference.com/w/c/language/expressions https://en.cppreference.com/w/c/language/operator_arithmetic
You should be single-stepping your C and your asm in a debugger so you notice the first divergence, and know which instruction / C statement to fix. Don't just run the whole thing and look at one number for the result! Debuggers are massively useful for asm; don't waste your time without one.
CL is the low byte of ECX, not a separate C variable. You could use a union between uint32_t and uint8_t in C, or just use eax <<= ecx&31; since you don't have anything that writes CL separately from ECX. (x86 shifts mask their count; that C statement could compile to shl eax, cl. https://www.felixcloutier.com/x86/sal:sar:shl:shr). The low 5 bits of ECX are also the low 5 bits of CL.
SHR is a logical right shift, not arithmetic, so you need to be using unsigned not int at least for the >>. But really just use it for everything.
You're handling EBX completely wrong; it's a pointer.
MOV EBX, mask
ADD EBX, 16
This is like unsigned int *ebx = mask+4;
The size of a dword is 4 bytes, but C pointer math scales by the type size, so +1 is a whole element, not 1 byte. So 16 bytes is 4 dwords = 4 unsigned int elements.
MOV ESI, DWORD [EBX]
That's a load using EBX as an address. This should be easy to see if you single-step the asm in a debugger: It's not just copying the value.
CMP EBX, mask - 4
JNE .L
This is NASM syntax; it's comparing against the address of the dword before the start of the array. It's effectively the bottom of a fairly normal do{}while loop. (Why are loops always compiled into "do...while" style (tail jump)?)
do { // .L
...
} while(ebx != &mask[-1]); // cmp/jne
It's looping from the end of the mask array, stopping when the pointer goes past the end.
Equivalently, the compare could be ebx !-= mask - 1. I wrote it with unary & (address-of) cancelling out the [] to make it clear that it's the address of what would be one element before the array.
Note that it's jumping on not equal; you had your if()goto backwards, jumping only on equality. This is a loop.
unsigned mask[] should be static because it's in section .data, not on the stack. And not const, because again it's in .data not .rodata (Linux) or .rdata (Windows))
This one doesn't affect the logic, only that detail of decompiling.
There may be other bugs; I didn't try to check everything.

if(ebx != mask[1]) //mask - 4
{
goto L;
}
//JNE IMPLIES a !=

Move variable to cl and perform shr using inline assembly

So I am trying to translate the following assignment from C to inline assembly
resp = (0x1F)&(letter >> (3 - numB));
Assuming that the declaration of the variables are the following
unsigned char resp;
unsigned char letter;
int numB;
So I have tried the following:
_asm {
mov ebx, 01fh
movzx edx, letter
mov cl,3
sub cl, numB // Line 5
shr edx, cl
and ebx, edx
mov resp, ebx
}
or the following
_asm {
mov ebx, 01fh
movzx edx, letter
mov ecx,3
sub ecx, numB
mov cl, ecx // Line 5
shr edx, cl
and ebx, edx
mov resp, ebx
}
In both cases I get size operand error in Line 5.
How can I achieve the right shift?

The E*X registers are 32 bits, while the *L registers are 8 bits. Similarly, on Windows, the int type is 32 bits wide, while the char type is 8 bits wide. You cannot arbitrarily mix these sizes within a single instruction.
So, in your first piece of code:
sub cl, numB // Line 5
this is wrong because the cl register stores an 8-bit value, whereas the numB variable is of type int, which stores a 32-bit value. You cannot subtract a 32-bit value from an 8-bit value; both operands to the SUB instruction must be the same size.
Similarly, in your second piece of code:
mov cl, ecx // Line 5
you are trying to move the 32-bit value in ECX into the 8-bit CL register. That can't happen without some kind of truncation, so you have to indicate it explicitly. The MOV instruction requires that both of its operands have the same size.
(MOVZX and MOVSX are obvious exceptions to this rule that the operand types must match for a single instruction. These instructions zero-extend or sign-extend, respectively, a smaller value so that it can be stored into a larger-sized register.)
However, in this case, you don't even need the MOV instruction. Remember that CL is just the lower 8 bits of the full 32-bit ECX register. Therefore, setting ECX also implicitly sets CL. If you only need the lower 8 bits, you can just use CL in a subsequent instruction. Thus, your code becomes:
mov ebx, 01fh ; move constant into 32-bit EBX
movzx edx, BYTE PTR letter ; zero-extended move of 8-bit variable into 32-bit EDX
mov ecx, 3 ; move constant into ECX
sub ecx, DWORD PTR numB ; subtract 32-bit variable from ECX
shr edx, cl ; shift EDX right by the lower 8 bits of ECX
and ebx, edx ; bitwise AND of EDX and EBX, leaving result in EBX
mov BYTE PTR resp, bl ; move lower 8 bits of EBX into 8-bit variable
For the same operand-size matching issue discussed above, I've also had to change the final MOV instruction. You cannot move the value stored in a 32-bit register directly into an 8-bit variable. You will have to move either the lower 8 bits or the upper 8 bits, allowing you to use either the BL or BH registers, which are 8 bits and therefore match the size of resp. In the above code, I assumed that you want only the lower 8 bits, so I've used BL.
Also note that I've used the BYTE PTR and DWORD PTR specifications. These are not strictly necessary in MASM (or Visual Studio's inline assembler), since it can deduce the sizes of the types from the types of the variables. However, I think it increases readability, and is generally a recommended practice. DWORD means 32 bit; it is the same size as int and a 32-bit register (E*X). WORD means 16 bit; it is the same size as short and a 16-bit register (*X). BYTE means 8 bits; it is the same size as char and an 8-bit register (*L or *H).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight