This question already has answers here:
How can I pass parameters in assembler x86 function call? (3 answers)
Can't pass parameter from C to Assembly code (3 answers)
Why does IA-32 have a non-intuitive caller and callee register saving convention? (4 answers)
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64 (4 answers)
Closed 3 years ago.
I am learning NASM assembly and trying to write an LFSR routine and call it from a C program to compare execution times, but I can't figure out the problem with my code.
My LFSR C version works just fine and is defined as follows:
int lfsr() {
    int cont = 0;
    uint32_t start_state = SEED;
    uint32_t lfsr = start_state;
    uint32_t bit;

    while (cont != 16777216) {
        bit = ((lfsr >> 1) ^ (lfsr >> 5) ^ (lfsr >> 7) ^ (lfsr >> 13)) & 1;
        lfsr = (lfsr >> 1) | (bit << 23);
        lfsr_nums[cont] = lfsr;
        cont++;
    }
    return cont;
}
My NASM x86 version is based on the C version and generates the numbers the same way the C code does. It should take a pointer to an array as a parameter, fill that array (returned by reference) with the generated numbers, and return the count of numbers by value. The LFSR logic works just fine (I checked the numbers it creates), but the code still gives me a segfault (core dumped).
With gdb the message is that the error is in the do procedure. While debugging I found that the fault is on mov dword [esi + 4 * eax], ebx; if I comment that line out, the code no longer segfaults.
section .text
global lfsr_nasm

lfsr_nasm:
    push dword ebx
    mov  esi, edi          ; vec
    mov  eax, 0            ; cont = 0
    mov  ebx, 0x1313       ; lfsr = start_state = seed
do:
    mov  ecx, ebx          ; ecx = lfsr
    shr  ecx, 1            ; lfsr >> 1
    mov  edx, ebx          ; edx = lfsr
    shr  edx, 5            ; lfsr >> 5
    xor  ecx, edx          ; lfsr >> 1 ^ lfsr >> 5
    mov  edx, ebx          ; edx = lfsr
    shr  edx, 7            ; edx = lfsr >> 7
    xor  ecx, edx          ; lfsr >> 1 ^ lfsr >> 5 ^ lfsr >> 7
    mov  edx, ebx          ; edx = lfsr
    shr  edx, 13           ; edx = lfsr >> 13
    xor  ecx, edx          ; lfsr >> 1 ^ lfsr >> 5 ^ lfsr >> 7 ^ lfsr >> 13
    and  ecx, 1            ; ecx = bit
    shr  ebx, 1            ; lfsr >> 1
    shl  ecx, 23           ; bit << 23
    or   ebx, ecx          ; lfsr = (lfsr >> 1) | (bit << 23)
    mov  dword [esi + 4 * eax], ebx
    inc  eax               ; cont++
    cmp  eax, 16777216     ; cont != 16777216
    jne  do
    pop  dword ebx
    ret
The way I make the call in C, and declare my vector and NASM function:
extern int lfsr_nasm(uint32_t *vec);

uint32_t lfsr_nums[16777216];

int main(int argc, char *argv[]) {
    int cont;
    cont = lfsr_nasm(lfsr_nums);
    for (int i = 0; i < 16777216; i++) {
        printf("%d ", lfsr_nums[i]);
    }
}
I believe that the vector is too big for the NASM or C code and maybe the program is trying to access bad memory, but I couldn't find anything to confirm my belief, nor a fix for the problem. I have already tried with malloc and calloc.
Related
I'm trying to efficiently implement the x86 SHLD and SHRD instructions without using inline assembly.
uint32_t shld_UB_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return a << c | b >> 32 - c;
}
seems to work, but invokes undefined behaviour when c == 0 because the second shift's operand becomes 32. The actual SHLD instruction with third operand being 0 is well defined to do nothing. (https://www.felixcloutier.com/x86/shld)
uint32_t shld_broken_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return a << c | b >> (-c & 31);
}
doesn't invoke undefined behaviour, but when c == 0 the result is a | b instead of a.
uint32_t shld_safe(uint32_t a, uint32_t b, uint32_t c) {
    if (c == 0) return a;
    return a << c | b >> 32 - c;
}
does what's intended, but gcc now emits a branch (je). clang, on the other hand, is smart enough to translate it into a single shld instruction.
Is there any way to implement it correctly and efficiently without inline assembly?
And why does gcc try so hard to avoid emitting shld? The shld_safe attempt is translated by gcc 11.2 -O3 as (Godbolt):
shld_safe:
        mov     eax, edi
        test    edx, edx
        je      .L1
        mov     ecx, 32
        sub     ecx, edx
        shr     esi, cl
        mov     ecx, edx
        sal     eax, cl
        or      eax, esi
.L1:
        ret
while clang does:
shld_safe:
        mov     ecx, edx
        mov     eax, edi
        shld    eax, esi, cl
        ret
As far as I have tested with gcc 9.3 (x86-64), it translates the following code to shldq and shrdq.
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
    return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low) << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
    return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low) >> (count & 63));
}
Also, gcc -m32 -O3 translates the following code to shld and shrd. (I have not tested with gcc (i386), though.)
uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
    return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31)) >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
    return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}
(I have just read the gcc code and written the above functions, i.e. I'm not sure they are your expected ones.)
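As a quick sanity check (my addition, not part of the original answer, and it assumes shld_safe from the question and shld_x86 from above are in scope), the two versions can be compared for every count from 0 to 31, including the tricky c == 0 case:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = 0x12345678u, b = 0x9abcdef0u;
    for (uint32_t c = 0; c < 32; c++) {
        /* shld_safe(a, b, c) shifts a left and fills from b's high bits;
           shld_x86(low, high, count) returns the high half of (high:low) << count,
           so the operands swap places */
        assert(shld_safe(a, b, c) == shld_x86(b, a, c));
    }
    puts("ok");
    return 0;
}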
I don't understand what the problem is: the result comes out right, but there is something wrong in my code and I don't get what it is.
1. This is the x86 code I have to convert to C:
%include "io.inc"

SECTION .data
mask DD 0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555

SECTION .text
GLOBAL CMAIN
CMAIN:
    GET_UDEC 4, EAX
    MOV EBX, mask
    ADD EBX, 16
    MOV ECX, 1
.L:
    MOV ESI, DWORD [EBX]
    MOV EDI, ESI
    NOT EDI
    MOV EDX, EAX
    AND EAX, ESI
    AND EDX, EDI
    SHL EAX, CL
    SHR EDX, CL
    OR  EAX, EDX
    SHL ECX, 1
    SUB EBX, 4
    CMP EBX, mask - 4
    JNE .L
    PRINT_UDEC 4, EAX
    NEWLINE
    XOR EAX, EAX
    RET
2. Here is my converted C code. When I input 0 it gives me the right answer, but there is something wrong in my code that I don't understand:
#include "stdio.h"

int main(void)
{
    int mask [5] = {0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555};
    int eax;
    int esi;
    int ebx;
    int edi;
    int edx;
    char cl = 0;
    scanf("%d", &eax);
    ebx = mask[4];
    ebx = ebx + 16;
    int ecx = 1;
L:
    esi = ebx;
    edi = esi;
    edi = !edi;
    edx = eax;
    eax = eax && esi;
    edx = edx && edi;
    eax = eax << cl;
    edx = edx >> cl;
    eax = eax || edx;
    ecx = ecx << 1;
    ebx = ebx - 4;
    if (ebx == mask[1]) //mask - 4
    {
        goto L;
    }
    printf("%d", eax);
    return 0;
}
Assembly AND is C bitwise &, not logical &&. (Same for OR). So you want eax &= esi.
(Using &= "compound assignment" makes the C even look like x86-style 2-operand asm so I'd recommend that.)
NOT is also bitwise flip-all-the-bits, not booleanize to 0/1. In C that's edi = ~edi;
Read the manual for x86 instructions like https://www.felixcloutier.com/x86/not, and for C operators like ~ and ! to check that they are / aren't what you want. https://en.cppreference.com/w/c/language/expressions https://en.cppreference.com/w/c/language/operator_arithmetic
You should be single-stepping your C and your asm in a debugger so you notice the first divergence, and know which instruction / C statement to fix. Don't just run the whole thing and look at one number for the result! Debuggers are massively useful for asm; don't waste your time without one.
CL is the low byte of ECX, not a separate C variable. You could use a union between uint32_t and uint8_t in C, or just use eax <<= ecx&31; since you don't have anything that writes CL separately from ECX. (x86 shifts mask their count; that C statement could compile to shl eax, cl. https://www.felixcloutier.com/x86/sal:sar:shl:shr). The low 5 bits of ECX are also the low 5 bits of CL.
SHR is a logical right shift, not arithmetic, so you need to be using unsigned not int at least for the >>. But really just use it for everything.
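A two-line illustration of that point (my example; the signed case assumes the usual x86 compiler behaviour of arithmetic right shift for negative values):

#include <stdio.h>

int main(void) {
    int      s = -16;           /* bit pattern 0xFFFFFFF0 */
    unsigned u = 0xFFFFFFF0u;
    /* prints "-4 1073741820": SAR copies the sign bit in, SHR shifts in zeros */
    printf("%d %u\n", s >> 2, u >> 2);
    return 0;
}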
You're handling EBX completely wrong; it's a pointer.
MOV EBX, mask
ADD EBX, 16
This is like unsigned int *ebx = mask+4;
The size of a dword is 4 bytes, but C pointer math scales by the type size, so +1 is a whole element, not 1 byte. So 16 bytes is 4 dwords = 4 unsigned int elements.
MOV ESI, DWORD [EBX]
That's a load using EBX as an address. This should be easy to see if you single-step the asm in a debugger: It's not just copying the value.
CMP EBX, mask - 4
JNE .L
This is NASM syntax; it's comparing against the address of the dword before the start of the array. It's effectively the bottom of a fairly normal do{}while loop. (Why are loops always compiled into "do...while" style (tail jump)?)
do { // .L
...
} while(ebx != &mask[-1]); // cmp/jne
It's looping from the end of the mask array, stopping when the pointer goes past the end.
Equivalently, the compare could be ebx != mask - 1. I wrote it with unary & (address-of) cancelling out the [] to make it clear that it's the address of what would be one element before the array.
Note that it's jumping on not equal; you had your if()goto backwards, jumping only on equality. This is a loop.
unsigned mask[] should be static because it's in section .data, not on the stack. And not const, because again it's in .data, not .rodata (Linux) or .rdata (Windows).
This one doesn't affect the logic, only that detail of decompiling.
There may be other bugs; I didn't try to check everything.
if (ebx != mask[1]) //mask - 4
{
    goto L;
}
// JNE implies a !=
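Putting the corrections above together, a corrected translation might look roughly like this (a sketch based on the points above, not verified against the original assembly):

#include <stdio.h>

int main(void)
{
    static unsigned int mask[5] = {0xffff, 0xff00ff, 0xf0f0f0f, 0x33333333, 0x55555555};
    unsigned int eax, esi, edi, edx;
    unsigned int ecx = 1;
    unsigned int *ebx;

    scanf("%u", &eax);             /* GET_UDEC 4, EAX */
    ebx = mask + 4;                /* MOV EBX, mask / ADD EBX, 16 */
    do {                           /* .L: */
        esi = *ebx;                /* MOV ESI, DWORD [EBX] */
        edi = ~esi;                /* MOV EDI, ESI / NOT EDI */
        edx = eax;                 /* MOV EDX, EAX */
        eax &= esi;                /* AND EAX, ESI */
        edx &= edi;                /* AND EDX, EDI */
        eax <<= ecx & 31;          /* SHL EAX, CL */
        edx >>= ecx & 31;          /* SHR EDX, CL */
        eax |= edx;                /* OR EAX, EDX */
        ecx <<= 1;                 /* SHL ECX, 1 */
        ebx -= 1;                  /* SUB EBX, 4 (one dword) */
    } while (ebx != mask - 1);     /* CMP EBX, mask - 4 / JNE .L */
    printf("%u\n", eax);           /* PRINT_UDEC 4, EAX */
    return 0;
}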
The code is very simple.
int foo(int a, int b, int c, int d, int e, int f, int g)
{
    int r = (1 << a) | (1 << b) | (1 << c) | (1 << d) | (1 << e) | (1 << f) | (1 << g);
    return r;
}
Assume all the arguments are no greater than 30.
It seems to be a very primitive function, but after compiling with the "-Ofast" flag, it still takes 28 instructions to compute r.
Is there an alternative code that can make these bitwise operations faster?
28 instructions is rather fast.
Consider what you're doing here. You have:
7 shift operations
6 OR operations
1 memory assignment operation
That already requires at least 14 instructions. Now there are additional instructions that are necessary such as storing the intermediate results and loading operands into registers.
If you want deeper analysis, post the assembly output.
Edit: Now to the possible optimization of your algorithm.
You might be able to gain a bit more speed by sacrificing some memory. Precompute the value for each possible bit position in a 32-bit word, e.g. something like int bit2value[32] = {1, 2, 4, 8, 16, 32, 64, ...};. In your function, instead of performing the shift operations, replace them with lookups into the precomputed table: int r = bit2value[a] | bit2value[b] | bit2value[c]...;. This can theoretically save some of the intermediate operations.
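As a concrete sketch of that idea (my code; the names init_bit2value and foo_lut are made up, and I haven't benchmarked it), the table can be filled once and the shifts replaced with loads:

#include <stdint.h>

static uint32_t bit2value[32];

static void init_bit2value(void) {
    for (int i = 0; i < 32; i++)
        bit2value[i] = 1u << i;    /* 1, 2, 4, 8, ... as described above */
}

int foo_lut(int a, int b, int c, int d, int e, int f, int g) {
    /* each shift becomes a table load; whether this is actually faster
       depends on the CPU and on the table staying hot in L1 cache */
    return bit2value[a] | bit2value[b] | bit2value[c] | bit2value[d]
         | bit2value[e] | bit2value[f] | bit2value[g];
}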
For each argument you need:
mov cl, argument
mov edx, 1
shl edx, cl
or eax, edx
I believe your function had 7 arguments? Up to g? 27 (4*7 - 1) is as low as it goes, since you can calculate the result of the shift directly in eax for the first argument. You can't not load each argument into a register, as the instruction doesn't operate on memory directly. You can't shift 1 without setting something to 1 every time. Obviously you can't do without the shift instructions or the or.
Using Visual Studio 2015 (32bit, optimized for space), the following code results in 21 rather than 24 instructions:
typedef struct container
{
    int a, b, c, d, e, f, g;
} CONTAINER;

int foo2(CONTAINER *ct)
{
    int r = (1 << ct->a) | (1 << ct->b) | (1 << ct->c) | (1 << ct->d) | (1 << ct->e) | (1 << ct->f) | (1 << ct->g);
    return r;
}
The assembly code (21 rather than 20 instructions, sorry!):
1 push esi
2 mov edx, ecx
3 xor esi, esi
4 inc esi
5 mov ecx, DWORD PTR [edx+24]
6 mov eax, DWORD PTR [edx+20]
7 shl esi, cl
8 mov ecx, DWORD PTR [edx+8]
9 bts esi, eax
10 mov eax, DWORD PTR [edx+16]
11 bts esi, eax
12 mov eax, DWORD PTR [edx+12]
13 bts esi, eax
14 bts esi, ecx
15 mov ecx, DWORD PTR [edx+4]
16 bts esi, ecx
17 mov ecx, DWORD PTR [edx]
18 bts esi, ecx
19 mov eax, esi
20 pop esi
21 ret 0
Following on from my two previous questions, How to improve memory performance/data locality of 64-bit C/intel assembly program and Using C/Intel assembly, what is the fastest way to test if a 128-byte memory block contains all zeros?, I have further reduced the running time of the test program mentioned in these questions from 150 seconds down to 62 seconds, as I will describe below.
The 64-bit program has five 4 GB lookup tables (bytevecM, bytevecD, bytevecC, bytevecL, bytevecX). To reduce the (huge) number of cache misses, analysed in my last question, I added five 4 MB bitmaps, one per lookup table.
Here is the original inner loop:
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
// ... rinse and repeat for bytevecD, bytevecC, bytevecL, bytevecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
The idea behind this simple "pre-check" was to avoid the expensive inner loop if all 128 bytes were zero. However, profiling showed that this pre-check was the primary bottleneck due to huge numbers of cache misses, as discussed last time. So I created a 4 MB bitmap to do the pre-check. (BTW, around 36% of 128-byte blocks are zero, not 98% as I mistakenly reported last time).
Here is the code I used to create a 4 MB bitmap from a 4 GB lookup table:
// Last chunk index (bitmap size = ((LAST_CHUNK_IDX+1) >> 3) = 4,194,304 bytes)
#define LAST_CHUNK_IDX 33554431

void make_bitmap(
    const unsigned char* bytevec,   // in:  byte vector
    unsigned char*       bitvec     // out: bitmap
)
{
    unsigned int uu;
    unsigned int ucnt = 0;
    unsigned int byte;
    unsigned int bit;
    const size_t* psz;

    for (uu = 0; uu <= LAST_CHUNK_IDX; ++uu)
    {
        psz = (size_t*)&bytevec[uu << 7];
        if (psz[0] == 0 && psz[1] == 0
            && psz[2] == 0 && psz[3] == 0
            && psz[4] == 0 && psz[5] == 0
            && psz[6] == 0 && psz[7] == 0
            && psz[8] == 0 && psz[9] == 0
            && psz[10] == 0 && psz[11] == 0
            && psz[12] == 0 && psz[13] == 0
            && psz[14] == 0 && psz[15] == 0) continue;
        ++ucnt;
        byte = uu >> 3;
        bit = (uu & 7);
        bitvec[byte] |= (1 << bit);
    }
    printf("ucnt=%u hits from %u\n", ucnt, LAST_CHUNK_IDX + 1);
}
Suggestions for a better way to do this are welcome.
With the bitmaps created via the function above, I then changed the "pre-check" to use the 4 MB bitmaps, instead of the 4 GB lookup tables, like so:
if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
// ... rinse and repeat for bitvecD, bitvecC, bitvecL, bitvecX
// expensive inner loop that scans 128 byte chunks from the 4 GB lookup tables...
This was "successful" in that the running time was reduced from 150 seconds down to 62 seconds in the simple single-threaded case. However, VTune still reports some pretty big numbers, as shown below.
I profiled a more realistic test with eight simultaneous threads running across different ranges. The VTune output of the inner loop check for zero blocks is shown below:
> m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
> if ( (bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0 ) continue;
0x1400025c7 Block 15:
mov eax, r15d 1.058s
mov edx, ebx 0.109s
xor eax, ecx 0.777s
imul eax, eax, 0xf4243 1.088s
mov r9d, eax 3.369s
shr eax, 0x7 0.123s
and eax, 0x7 1.306s
movzx ecx, al 1.319s
mov eax, r9d 0.156s
shr rax, 0xa 0.248s
shl edx, cl 1.321s
test byte ptr [rax+r10*1], dl 1.832s
jz 0x140007670 2.037s
> d7 = (unsigned int)( (s6.m128i_i32[0] ^ q7) * H_PRIME );
> if ( (bitvecD[d7 >> 10] & (1 << ((d7 >> 7) & 7))) == 0 ) continue;
0x1400025f3 Block 16:
mov eax, dword ptr [rsp+0x30] 104.983s
mov edx, ebx 1.663s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.513s
mov edi, eax 1.172s
shr eax, 0x7 0.140s
and eax, 0x7 0.062s
movzx ecx, al 0.575s
mov eax, edi 0.689s
shr rax, 0xa 0.016s
shl edx, cl 0.108s
test byte ptr [rax+r11*1], dl 1.591s
jz 0x140007670 1.087s
> c7 = (unsigned int)( (s6.m128i_i32[1] ^ q7) * H_PRIME );
> if ( (bitvecC[c7 >> 10] & (1 << ((c7 >> 7) & 7))) == 0 ) continue;
0x14000261f Block 17:
mov eax, dword ptr [rsp+0x34] 75.863s
mov edx, 0x1 1.097s
xor eax, r15d 0.031s
imul eax, eax, 0xf4243 0.265s
mov ebx, eax 0.512s
shr eax, 0x7 0.016s
and eax, 0x7 0.233s
movzx ecx, al 0.233s
mov eax, ebx 0.279s
shl edx, cl 0.109s
mov rcx, qword ptr [rsp+0x58] 0.652s
shr rax, 0xa 0.171s
movzx ecx, byte ptr [rax+rcx*1] 0.126s
test cl, dl 77.918s
jz 0x140007667
> l7 = (unsigned int)( (s6.m128i_i32[2] ^ q7) * H_PRIME );
> if ( (bitvecL[l7 >> 10] & (1 << ((l7 >> 7) & 7))) == 0 ) continue;
0x140002655 Block 18:
mov eax, dword ptr [rsp+0x38] 0.980s
mov edx, 0x1 0.794s
xor eax, r15d 0.062s
imul eax, eax, 0xf4243 0.187s
mov r11d, eax 0.278s
shr eax, 0x7 0.062s
and eax, 0x7 0.218s
movzx ecx, al 0.218s
mov eax, r11d 0.186s
shl edx, cl 0.031s
mov rcx, qword ptr [rsp+0x50] 0.373s
shr rax, 0xa 0.233s
movzx ecx, byte ptr [rax+rcx*1] 0.047s
test cl, dl 55.060s
jz 0x14000765e
In addition to that, large amounts of time were (confusingly to me) attributed to this line:
> for (q6 = 1; q6 < 128; ++q6) {
0x1400075a1 Block 779:
inc edx 0.124s
mov dword ptr [rsp+0x10], edx
cmp edx, 0x80 0.031s
jl 0x140002574
mov ecx, dword ptr [rsp+0x4]
mov ebx, dword ptr [rsp+0x48]
...
0x140007575 Block 772:
mov edx, dword ptr [rsp+0x10] 0.699s
...
0x14000765e Block 789 (note: jz in l7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 1.169s
jmp 0x14000757e 0.791s
0x140007667 Block 790 (note: jz in c7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 2.261s
jmp 0x140007583 1.461s
0x140007670 Block 791 (note: jz in m7/d7 section above jumps here if zero):
mov edx, dword ptr [rsp+0x10] 108.355s
jmp 0x140007588 6.922s
I don't fully understand the big numbers in the VTune output above. If anyone can shed more light on these numbers, I'm all ears.
It seems to me that my five 4 MB bitmaps are bigger than my Core i7 3770 processor can fit into its 8 MB L3 cache, leading to many cache misses (though far fewer than before). If my CPU had a 30 MB L3 cache (as the upcoming Ivy Bridge-E has), I speculate that this program would run a lot faster because all five bitmaps would comfortably fit into the L3 cache. Is that right?
Further to that, since the code to test the bitmaps, namely:
m7 = (unsigned int)( (m6 ^ q7) * H_PRIME );
(bitvecM[m7 >> 10] & (1 << ((m7 >> 7) & 7))) == 0
now appears five times in the inner loop, any suggestions for speeding up this code are very welcome.
Within the core bits of the loop, using the _bittest() MSVC intrinsic for the bitmap check combines the shl/test combo the compiler creates into a single instruction with (on SandyBridge) no latency/throughput penalty, i.e. it should shave a few cycles off.
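For illustration, the bitmap pre-check might look like this with the intrinsic (a sketch under my assumptions; chunk_marked is a made-up helper name, bitvecM and m7 are the names from the question, and the bitmap is simply reinterpreted as an array of long):

#include <intrin.h>

/* nonzero if the 128-byte chunk selected by hash h is marked non-zero
   in the 4 MB bitmap */
static int chunk_marked(const unsigned char *bitvec, unsigned int h)
{
    /* BT with a memory operand indexes the whole bit string, so the bit
       index can simply be the chunk number h >> 7 */
    return _bittest((const long *)bitvec, (long)(h >> 7));
}

/* usage, replacing the shift/mask test in the inner loop:
   if (!chunk_marked(bitvecM, m7)) continue;  */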
Beyond that, I can only think of calculating the bitmaps by map-reducing bit sets via recursive POR, as a variation on your zero testing that might be worth benchmarking:
for (int i = 0; i < MAX_IDX; i++) {
    __m128i v[8];
    const __m128i* ptr = ...[i << ...];        // start of the 128-byte chunk
    v[0] = _mm_load_si128(ptr + 0);
    v[1] = _mm_load_si128(ptr + 1);
    v[2] = _mm_load_si128(ptr + 2);
    v[3] = _mm_load_si128(ptr + 3);
    v[4] = _mm_load_si128(ptr + 4);
    v[5] = _mm_load_si128(ptr + 5);
    v[6] = _mm_load_si128(ptr + 6);
    v[7] = _mm_load_si128(ptr + 7);
    // OR-reduce the eight vectors tree-style
    v[0] = _mm_or_si128(v[0], v[1]);
    v[2] = _mm_or_si128(v[2], v[3]);
    v[4] = _mm_or_si128(v[4], v[5]);
    v[6] = _mm_or_si128(v[6], v[7]);
    v[0] = _mm_or_si128(v[0], v[2]);
    v[2] = _mm_or_si128(v[4], v[6]);
    v[0] = _mm_or_si128(v[0], v[2]);
    // the chunk is all zero only if every byte of the OR-reduction compares equal to zero
    if (_mm_movemask_epi8(_mm_cmpeq_epi8(_mm_setzero_si128(), v[0])) != 0xFFFF) {
        // the contents aren't all zero
    }
    ...
}
At this point, the pure load / accumulate-OR / extract-mask approach might be better than a tight loop of SSE4.1 PTEST because there's no flags dependency and no branches.
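For reference, the PTEST-style loop being compared against might look like this (my sketch; chunk_is_zero_ptest is a made-up name, and _mm_testz_si128 is the SSE4.1 intrinsic behind PTEST):

#include <smmintrin.h>   /* SSE4.1 */

/* returns 1 if the 128-byte, 16-byte-aligned block at ptr is all zero */
static int chunk_is_zero_ptest(const __m128i *ptr)
{
    for (int k = 0; k < 8; k++) {
        __m128i v = _mm_load_si128(ptr + k);
        if (!_mm_testz_si128(v, v))      /* PTEST: ZF=0 means some bit is set */
            return 0;                    /* early out, costs a branch per vector */
    }
    return 1;
}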
For the 128-byte buffer, do the comparisons with larger integers.
#include <stdbool.h>

bool is_all_zero(const unsigned char cbuf[128])
{
    const unsigned long long *lbuf = (const unsigned long long *)cbuf;
    int i;
    for (i = 0; i < 128 / sizeof(long long); i++) {
        if (lbuf[i]) return false;   // something not a zero
    }
    return true;                     // all zero
}
My ASM-to-C code emulation is nearly done; I'm just trying to solve these second-pass problems.
Let's say I have this ASM function:
401040 MOV EAX,DWORD PTR [ESP+8]
401044 MOV EDX,DWORD PTR [ESP+4]
401048 PUSH ESI
401049 MOV ESI,ECX
40104B MOV ECX,EAX
40104D DEC EAX
40104E TEST ECX,ECX
401050 JE 401083
401052 PUSH EBX
401053 PUSH EDI
401054 LEA EDI,[EAX+1]
401057 MOV AX,WORD PTR [ESI]
40105A XOR EBX,EBX
40105C MOV BL,BYTE PTR [EDX]
40105E MOV ECX,EAX
401060 AND ECX,FFFF
401066 SHR ECX,8
401069 XOR ECX,EBX
40106B XOR EBX,EBX
40106D MOV BH,AL
40106F MOV AX,WORD PTR [ECX*2+45F81C]
401077 XOR AX,BX
40107A INC EDX
40107B DEC EDI
40107C MOV WORD PTR [ESI],AX
40107F JNE 401057
401081 POP EDI
401082 POP EBX
401083 POP ESI
401084 RET 8
My program would create the following for it.
int Func_401040() {
    regs.d.eax = *(unsigned int *)(regs.d.esp+0x00000008);
    regs.d.edx = *(unsigned int *)(regs.d.esp+0x00000004);
    regs.d.esp -= 4;
    *(unsigned int *)(regs.d.esp) = regs.d.esi;
    regs.d.esi = regs.d.ecx;
    regs.d.ecx = regs.d.eax;
    regs.d.eax--;
    if(regs.d.ecx == 0)
        goto label_401083;
    regs.d.esp -= 4;
    *(unsigned int *)(regs.d.esp) = regs.d.ebx;
    regs.d.esp -= 4;
    *(unsigned int *)(regs.d.esp) = regs.d.edi;
    regs.d.edi = (regs.d.eax+0x00000001);
    regs.x.ax = *(unsigned short *)(regs.d.esi);
    regs.d.ebx ^= regs.d.ebx;
    regs.h.bl = *(unsigned char *)(regs.d.edx);
    regs.d.ecx = regs.d.eax;
    regs.d.ecx &= 0x0000FFFF;
    regs.d.ecx >>= 0x00000008;
    regs.d.ecx ^= regs.d.ebx;
    regs.d.ebx ^= regs.d.ebx;
    regs.h.bh = regs.h.al;
    regs.x.ax = *(unsigned short *)(regs.d.ecx*0x00000002+0x0045F81C);
    regs.x.ax ^= regs.x.bx;
    regs.d.edx++;
    regs.d.edi--;
    *(unsigned short *)(regs.d.esi) = regs.x.ax;
    JNE 401057
    regs.d.edi = *(unsigned int *)(regs.d.esp);
    regs.d.esp += 4;
    regs.d.ebx = *(unsigned int *)(regs.d.esp);
    regs.d.esp += 4;
label_401083:
    regs.d.esi = *(unsigned int *)(regs.d.esp);
    regs.d.esp += 4;
    return 0x8;
}
Since the JNE 401057 isn't preceded by a CMP or TEST, how do I express it in the C code?
The most recent instruction that modified flags is the dec, which sets ZF when its operand hits 0. So the jne is about equivalent to if (regs.d.edi != 0) goto label_401057;.
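Applied to the generated code, the relevant part of Func_401040 would become something like this (a sketch; it assumes a label_401057 is added where the instruction at 0x401057 starts):

label_401057:
    regs.x.ax = *(unsigned short *)(regs.d.esi);
    /* ... rest of the loop body as above ... */
    regs.d.edi--;                                  /* DEC EDI sets ZF from its result */
    *(unsigned short *)(regs.d.esi) = regs.x.ax;   /* MOV doesn't touch flags */
    if (regs.d.edi != 0)                           /* JNE 401057 */
        goto label_401057;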
BTW: ret 8 isn't equivalent to return 8. The ret instruction's operand is the number of bytes to add to ESP when returning. (It's commonly used to clean up the stack.) It'd be kinda like
return eax;
regs.d.esp += 8;
except that semi-obviously, this won't work in C -- the return makes any code after it unreachable.
This is actually a part of the calling convention -- [ESP+4] and [ESP+8] are arguments passed to the function, and the ret is cleaning those up. This isn't the usual C calling convention; it looks more like fastcall or thiscall, considering the function expects a value in ECX.