ASM to C Code emulating nearly done.. just trying to solve these second pass problems.
Lets say I got this ASM function
401040 MOV EAX,DWORD PTR [ESP+8]
401044 MOV EDX,DWORD PTR [ESP+4]
401048 PUSH ESI
401049 MOV ESI,ECX
40104B MOV ECX,EAX
40104D DEC EAX
40104E TEST ECX,ECX
401050 JE 401083
401052 PUSH EBX
401053 PUSH EDI
401054 LEA EDI,[EAX+1]
401057 MOV AX,WORD PTR [ESI]
40105A XOR EBX,EBX
40105C MOV BL,BYTE PTR [EDX]
40105E MOV ECX,EAX
401060 AND ECX,FFFF
401066 SHR ECX,8
401069 XOR ECX,EBX
40106B XOR EBX,EBX
40106D MOV BH,AL
40106F MOV AX,WORD PTR [ECX*2+45F81C]
401077 XOR AX,BX
40107A INC EDX
40107B DEC EDI
40107C MOV WORD PTR [ESI],AX
40107F JNE 401057
401081 POP EDI
401082 POP EBX
401083 POP ESI
401084 RET 8
My program would create the following for it.
int Func_401040() {
regs.d.eax = *(unsigned int *)(regs.d.esp+0x00000008);
regs.d.edx = *(unsigned int *)(regs.d.esp+0x00000004);
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.esi;
regs.d.esi = regs.d.ecx;
regs.d.ecx = regs.d.eax;
regs.d.eax--;
if(regs.d.ecx == 0)
goto label_401083;
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.ebx;
regs.d.esp -= 4;
*(unsigned int *)(regs.d.esp) = regs.d.edi;
regs.d.edi = (regs.d.eax+0x00000001);
regs.x.ax = *(unsigned short *)(regs.d.esi);
regs.d.ebx ^= regs.d.ebx;
regs.h.bl = *(unsigned char *)(regs.d.edx);
regs.d.ecx = regs.d.eax;
regs.d.ecx &= 0x0000FFFF;
regs.d.ecx >>= 0x00000008;
regs.d.ecx ^= regs.d.ebx;
regs.d.ebx ^= regs.d.ebx;
regs.h.bh = regs.h.al;
regs.x.ax = *(unsigned short *)(regs.d.ecx*0x00000002+0x0045F81C);
regs.x.ax ^= regs.x.bx;
regs.d.edx++;
regs.d.edi--;
*(unsigned short *)(regs.d.esi) = regs.x.ax;
JNE 401057
regs.d.edi = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
regs.d.ebx = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
label_401083:
regs.d.esi = *(unsigned int *)(regs.d.esp);
regs.d.esp += 4;
return 0x8;
}
Since JNE 401057 doesn't use the CMP or TEST
How do I fix that use this in C code?
The most recent instruction that modified flags is the dec, which sets ZF when its operand hits 0. So the jne is about equivalent to if (regs.d.edi != 0) goto label_401057;.
BTW: ret 8 isn't equivalent to return 8. The ret instruction's operand is the number of bytes to add to ESP when returning. (It's commonly used to clean up the stack.) It'd be kinda like
return eax;
regs.d.esp += 8;
except that semi-obviously, this won't work in C -- the return makes any code after it unreachable.
This is actually a part of the calling convention -- [ESP+4] and [ESP+8] are arguments passed to the function, and the ret is cleaning those up. This isn't the usual C calling convention; it looks more like fastcall or thiscall, considering the function expects a value in ECX.
Related
I'm trying to get the hang of Assembly for class. So for this C code:
int a = 10;
int b = 20;
int *aPtr = &a;
int *bPtr = &b;
b += a;
*aPtr = *aPtr + *bPtr; //dereference
printf(“aPtr points to value: %d\n”, *aPtr);
*** Updated
I tried this in Assembly:
.data
var1 DWORD 10
var2 DWORD 20
var3 DWORD ?
.code
main PROC
mov eax, 10
mov ebx, 20
add ebx, eax
mov var3, ebx
mov eax, offset var1
mov ebx, offset var3
mov ecx, [eax]
mov edx, [ebx]
add ecx, edx
mov var3, ecx
INVOKE ExitProcess, 0
main endp
end
But I know that the pointers can't simply be deferenced and added together like that. We also can't use lea, so I'm at a loss on how to add a dereferenced value to another dereferenced value in Assembly; I'm also not sure how I would convert the printf statement correctly. Could I get some help?
Your code is not yet updating the a and b variables with the results from the operations.
int a = 10;
int b = 20;
int *aPtr = &a;
int *bPtr = &b;
a SDWORD 10
b SDWORD 20
aPtr DWORD offset a
bPtr DWORD offset b
b += a;
mov eax, a
add b, eax ; Result in b (30)
*aPtr = *aPtr + *bPtr;
mov edi, aPtr
mov esi, bPtr
mov eax, [edi]
add eax, [esi]
mov [edi], eax ; Result in a (40)
printf(“aPtr points to value: %d\n”, *aPtr);
msg db 'aPtr points to value: %d\n', 0
...
mov edi, aPtr
mov eax, [edi]
push eax
push offset msg
call _printf
add esp, 8
I don't understand what is the problem because the result is right, but there is something wrong in it and i don't get it.
1.This is the x86 code I have to convert to C:
%include "io.inc"
SECTION .data
mask DD 0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555
SECTION .text
GLOBAL CMAIN
CMAIN:
GET_UDEC 4, EAX
MOV EBX, mask
ADD EBX, 16
MOV ECX, 1
.L:
MOV ESI, DWORD [EBX]
MOV EDI, ESI
NOT EDI
MOV EDX, EAX
AND EAX, ESI
AND EDX, EDI
SHL EAX, CL
SHR EDX, CL
OR EAX, EDX
SHL ECX, 1
SUB EBX, 4
CMP EBX, mask - 4
JNE .L
PRINT_UDEC 4, EAX
NEWLINE
XOR EAX, EAX
RET
2.My converted C code, when I input 0 it output me the right answer but there is something false in my code I don't understand what is:
#include "stdio.h"
int main(void)
{
int mask [5] = {0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555};
int eax;
int esi;
int ebx;
int edi;
int edx;
char cl = 0;
scanf("%d",&eax);
ebx = mask[4];
ebx = ebx + 16;
int ecx = 1;
L:
esi = ebx;
edi = esi;
edi = !edi;
edx = eax;
eax = eax && esi;
edx = edx && edi;
eax = eax << cl;
edx = edx >> cl ;
eax = eax || edx;
ecx = ecx << 1;
ebx = ebx - 4;
if(ebx == mask[1]) //mask - 4
{
goto L;
}
printf("%d",eax);
return 0;
}
Assembly AND is C bitwise &, not logical &&. (Same for OR). So you want eax &= esi.
(Using &= "compound assignment" makes the C even look like x86-style 2-operand asm so I'd recommend that.)
NOT is also bitwise flip-all-the-bits, not booleanize to 0/1. In C that's edi = ~edi;
Read the manual for x86 instructions like https://www.felixcloutier.com/x86/not, and for C operators like ~ and ! to check that they are / aren't what you want. https://en.cppreference.com/w/c/language/expressions https://en.cppreference.com/w/c/language/operator_arithmetic
You should be single-stepping your C and your asm in a debugger so you notice the first divergence, and know which instruction / C statement to fix. Don't just run the whole thing and look at one number for the result! Debuggers are massively useful for asm; don't waste your time without one.
CL is the low byte of ECX, not a separate C variable. You could use a union between uint32_t and uint8_t in C, or just use eax <<= ecx&31; since you don't have anything that writes CL separately from ECX. (x86 shifts mask their count; that C statement could compile to shl eax, cl. https://www.felixcloutier.com/x86/sal:sar:shl:shr). The low 5 bits of ECX are also the low 5 bits of CL.
SHR is a logical right shift, not arithmetic, so you need to be using unsigned not int at least for the >>. But really just use it for everything.
You're handling EBX completely wrong; it's a pointer.
MOV EBX, mask
ADD EBX, 16
This is like unsigned int *ebx = mask+4;
The size of a dword is 4 bytes, but C pointer math scales by the type size, so +1 is a whole element, not 1 byte. So 16 bytes is 4 dwords = 4 unsigned int elements.
MOV ESI, DWORD [EBX]
That's a load using EBX as an address. This should be easy to see if you single-step the asm in a debugger: It's not just copying the value.
CMP EBX, mask - 4
JNE .L
This is NASM syntax; it's comparing against the address of the dword before the start of the array. It's effectively the bottom of a fairly normal do{}while loop. (Why are loops always compiled into "do...while" style (tail jump)?)
do { // .L
...
} while(ebx != &mask[-1]); // cmp/jne
It's looping from the end of the mask array, stopping when the pointer goes past the end.
Equivalently, the compare could be ebx !-= mask - 1. I wrote it with unary & (address-of) cancelling out the [] to make it clear that it's the address of what would be one element before the array.
Note that it's jumping on not equal; you had your if()goto backwards, jumping only on equality. This is a loop.
unsigned mask[] should be static because it's in section .data, not on the stack. And not const, because again it's in .data not .rodata (Linux) or .rdata (Windows))
This one doesn't affect the logic, only that detail of decompiling.
There may be other bugs; I didn't try to check everything.
if(ebx != mask[1]) //mask - 4
{
goto L;
}
//JNE IMPLIES a !=
In the following example running a 32-bit ELF on a 64-bit architecture is faster and I don't understand why. I have tried with two examples one using a division the other one with a multiplication. The performance is as expected, however, the performance for the division is surprizing.
We see on the assembly that the compiler is calling _alldiv which emulates a 64-bit division on a 32-bit architecture, so it must be slower than simply using the assembly instruction idiv. So I don't understand the results I got:
My setup is: Windows 10 x64, Visual Studio 2019
To time the code I use Measure-Command { .\out.exe }:
Multiplication
32-bit ELF: 3360 ms
64-bit ELF: 1469 ms
Division
32-bit ELF: 7383 ms
64-bit ELF: 8567 ms
Code
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>
#include <Windows.h>
volatile int64_t m = 32;
volatile int64_t n = 12;
volatile int64_t result;
int main(void)
{
for (size_t i = 0; i < (1 << 30); i++)
{
# ifdef DIVISION
result = m / n;
# else
result = m * n;
# endif
m += 1;
n += 3;
}
}
64-bit disassembly (division)
for (size_t i = 0; i < (1 << 30); i++)
00007FF60DA81000 mov r8d,40000000h
00007FF60DA81006 nop word ptr [rax+rax]
{
result = m / n;
00007FF60DA81010 mov rcx,qword ptr [n (07FF60DA83038h)]
00007FF60DA81017 mov rax,qword ptr [m (07FF60DA83040h)]
00007FF60DA8101E cqo
00007FF60DA81020 idiv rax,rcx
00007FF60DA81023 mov qword ptr [result (07FF60DA83648h)],rax
m += 1;
00007FF60DA8102A mov rax,qword ptr [m (07FF60DA83040h)]
00007FF60DA81031 inc rax
00007FF60DA81034 mov qword ptr [m (07FF60DA83040h)],rax
n += 3;
00007FF60DA8103B mov rax,qword ptr [n (07FF60DA83038h)]
00007FF60DA81042 add rax,3
00007FF60DA81046 mov qword ptr [n (07FF60DA83038h)],rax
00007FF60DA8104D sub r8,1
00007FF60DA81051 jne main+10h (07FF60DA81010h)
}
}
32-bit disassembly (division)
for (size_t i = 0; i < (1 << 30); i++)
00A41002 mov edi,40000000h
00A41007 nop word ptr [eax+eax]
{
result = m / n;
00A41010 mov edx,dword ptr [n (0A43018h)]
00A41016 mov eax,dword ptr ds:[00A4301Ch]
00A4101B mov esi,dword ptr [m (0A43020h)]
00A41021 mov ecx,dword ptr ds:[0A43024h]
00A41027 push eax
00A41028 push edx
00A41029 push ecx
00A4102A push esi
00A4102B call _alldiv (0A41CD0h)
00A41030 mov dword ptr [result (0A433A0h)],eax
00A41035 mov dword ptr ds:[0A433A4h],edx
m += 1;
00A4103B mov eax,dword ptr [m (0A43020h)]
00A41040 mov ecx,dword ptr ds:[0A43024h]
00A41046 add eax,1
00A41049 mov dword ptr [m (0A43020h)],eax
00A4104E adc ecx,0
00A41051 mov dword ptr ds:[0A43024h],ecx
n += 3;
00A41057 mov eax,dword ptr [n (0A43018h)]
00A4105C mov ecx,dword ptr ds:[0A4301Ch]
00A41062 add eax,3
00A41065 mov dword ptr [n (0A43018h)],eax
00A4106A adc ecx,0
00A4106D mov dword ptr ds:[0A4301Ch],ecx
00A41073 sub edi,1
00A41076 jne main+10h (0A41010h)
}
}
Edit
To investigate further as Chris Dodd, I have slightly modified my code as follow:
volatile int64_t m = 32000000000;
volatile int64_t n = 12000000000;
volatile int64_t result;
This time I have these results:
Division
32-bit ELF: 22407 ms
64-bit ELF: 17812 ms
If you look at instruction timings for x86 processors, it turns out that on recent Intel processors, a 64-bit divide is 3-4x as expensive as a 32-bit divide -- and if you look at the internals of alldiv (link in the comments above), for your values which will always fit in 32 bits, it will use a single 32-bit divide...
I'm using this function to convert an 8-bit binary number represented as a boolean array to an integer. Is it efficient? I'm using it in an embedded system. It performs ok but I'm interested in some opinions or suggestions for improvement (or replacement) if there are any.
uint8_t b2i( bool *bs ){
uint8_t ret = 0;
ret = bs[7] ? 1 : 0;
ret += bs[6] ? 2 : 0;
ret += bs[5] ? 4 : 0;
ret += bs[4] ? 8 : 0;
ret += bs[3] ? 16 : 0;
ret += bs[2] ? 32 : 0;
ret += bs[1] ? 64 : 0;
ret += bs[0] ? 128 : 0;
return ret;
}
It is not possible to say without a specific system in mind. Disassemble the code and see what you got. Benchmark your code on a specific system. This is the key to understanding manual optimization.
Generally, there are lots of considerations. The CPU's data word size, instruction set, compiler optimizer performance, branch prediction (if any), data cache (if any) etc etc.
To make the code perform optimally regardless of data word size, you can change uint8_t to uint_fast8_t. That is unless you need exactly 8 bits, then leave it as uint8_t.
Cache use may or may not be more efficient if given an up-counting loop. At any rate, loop unrolling is an old kind of manual optimization that we shouldn't use in modern programming - the compiler is more capable of making that call than the programmer.
The worst problem with the code is the numerous branches. These might cause a bottleneck.
Your code results in the following x86 machine code gcc -O2:
b2i:
cmp BYTE PTR [rdi+6], 0
movzx eax, BYTE PTR [rdi+7]
je .L2
add eax, 2
.L2:
cmp BYTE PTR [rdi+5], 0
je .L3
add eax, 4
.L3:
cmp BYTE PTR [rdi+4], 0
je .L4
add eax, 8
.L4:
cmp BYTE PTR [rdi+3], 0
je .L5
add eax, 16
.L5:
cmp BYTE PTR [rdi+2], 0
je .L6
add eax, 32
.L6:
cmp BYTE PTR [rdi+1], 0
je .L7
add eax, 64
.L7:
lea edx, [rax-128]
cmp BYTE PTR [rdi], 0
cmovne eax, edx
ret
Whole lot of potentially inefficient branching. We can make the code both faster and more readable by using a loop:
uint8_t b2i (const bool bs[8])
{
uint8_t result = 0;
for(size_t i=0; i<8; i++)
{
result |= bs[8-1-i] << i;
}
return result;
}
(ideally the bool array should be arranged from LSB first but that would change the meaning of the code compared to the original)
Which gives this machine code instead:
b2i:
lea rsi, [rdi-8]
mov rax, rdi
xor r8d, r8d
.L2:
movzx edx, BYTE PTR [rax+7]
mov ecx, edi
sub ecx, eax
sub rax, 1
sal edx, cl
or r8d, edx
cmp rax, rsi
jne .L2
mov eax, r8d
ret
More instructions but less branching. It will likely perform better than your code on x86 and other high end CPUs with branch prediction and instruction cache. But worse than your code on a 8 bit microcontroller where only the total number of instructions count.
You can also do this with a loop and bit shifts to reduce code repetition:
int b2i(bool *bs) {
int ret = 0;
for (int i = 0; i < 8; i++) {
ret = ret << 1;
ret += bs[i];
}
return ret;
}
Within the following block of code, num_insts is re-assigned to 0 following the first iteration of the loop.
inst_t buf[5] = {0};
num_insts = 10;
int i = 5;
for( ; i > 0; i-- )
{
buf[i] = buf[i-1];
}
buf[0] = next;
I cannot think of any possible valid reason for this behavior, but I'm also sleep deprived so a second opinion would be appreciated.
The assembly being executed for the buf shift is this:
004017ed: mov 0x90(%esp),%eax
004017f4: lea -0x1(%eax),%ecx
004017f7: mov 0x90(%esp),%edx
004017fe: mov %edx,%eax
00401800: shl $0x2,%eax
00401803: add %edx,%eax
00401805: shl $0x2,%eax
00401808: lea 0xa0(%esp),%edi
0040180f: lea (%edi,%eax,1),%eax
00401812: lea -0x7c(%eax),%edx
00401815: mov %ecx,%eax
00401817: shl $0x2,%eax
0040181a: add %ecx,%eax
0040181c: shl $0x2,%eax
0040181f: lea 0xa0(%esp),%ecx
And the register contents prior to executing the first assembly instruction above is this:
eax 0
ecx 0
edx 0
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665432
eip 0x4017ed <main+1593>
Following those instructions, this:
eax 0
ecx 0
edx 2665432
ebx 2665332
esp 0x28ab50
ebp 0x28ac08
esi 0
edi 2665456
eip 0x401848 <main+1684>
I don't know nearly enough assembly to make sense of any of this, but maybe someone answering this will benefit from it.
For first iteration with i = 5 you code:
for( ; i > 0; i-- ) // i = 5 > 0 = true
{
buf[i] = buf[i-1]; // b[5] = b [5 - 1]
}
Is buf[5] = buf[4]; because buf is just of size 5, maximum index value can be 4 so bug in your code = array out of index problem => rhs buf[5].