LEA EAX, [EAX]
I encountered this instruction in a binary compiled with the Microsoft C compiler. It clearly can't change the value of EAX. Then why is it there?
It is a NOP.
The following are typically used as NOPs. They all do the same thing, but they result in machine code of different lengths. Depending on the alignment requirement, one of them is chosen:
xchg eax, eax = 90
mov eax, eax = 89 C0
lea eax, [eax + 0x00] = 8D 40 00
From this article:
This trick is used by the MSVC++ compiler to emit NOP instructions of different lengths (for padding before jump targets). For example, MSVC++ generates the following code if it needs 4-byte and 6-byte padding:
8d6424 00        lea esp,[esp+00]         ; 4-byte padding
8d9b 00000000    lea ebx,[ebx+00000000]   ; 6-byte padding
The first line is marked as "npad 4" in assembly listings generated by the compiler, and the second as "npad 6". The registers (ebx, esp) can be chosen from the rarely used ones to avoid false dependencies in the code.
So this is just a kind of NOP, appearing right before targets of jmp instructions in order to align them.
Interestingly, you can identify the compiler from the characteristic nature of such instructions.
LEA EAX, [EAX]
It indeed doesn't change the value of EAX. As far as I understand, it's identical in function to:
MOV EAX, EAX
Did you see it in optimized code, or unoptimized code?
Related
Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?
How to load a single byte from address in assembly
Why doesn't GCC use partial registers?
How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
I've been reading up on what goes on inside x86 processors for about half a year now, so I decided to try my hand at x86 assembly for fun, starting with only 80386 instructions to keep it simple. (I'm mostly trying to learn, not optimize.)
I have a game I made a few months ago coded in C, so I went in and rewrote the bitmap blitting function from scratch in assembly. What I don't get is that the main pixel-plotting body of the loop is faster with the C code (which is 18 instructions) than with my assembly code (which is only 7 instructions, and I'm almost 100% certain it doesn't straddle cache-line boundaries).
So my main question is: why do 18 instructions take less time than 7 instructions?
At the bottom I have the 2 code snippets.
PS. Each color is 8 bit indexed.
C Code:
{
for (x = 0; x < src.w; x++)
00D35712 mov dword ptr [x],0 // Just initial loop setup
00D35719 jmp Renderer_DrawBitmap+174h (0D35724h) // Just initial loop setup
00D3571B mov eax,dword ptr [x]
00D3571E add eax,1
00D35721 mov dword ptr [x],eax
00D35724 mov eax,dword ptr [x]
00D35727 cmp eax,dword ptr [ebp-28h]
00D3572A jge Renderer_DrawBitmap+1BCh (0D3576Ch)
{
*dest_pixel = renderer_trans[renderer_light[*src_pixel][light]][*dest_pixel][trans];
// Start of what I consider the body
00D3572C mov eax,dword ptr [src_pixel]
00D3572F movzx ecx,byte ptr [eax]
00D35732 mov edx,dword ptr [light]
00D35735 movzx eax,byte ptr renderer_light (0EDA650h)[edx+ecx*8]
00D3573D shl eax,0Bh
00D35740 mov ecx,dword ptr [dest_pixel]
00D35743 movzx edx,byte ptr [ecx]
00D35746 lea eax,renderer_trans (0E5A650h)[eax+edx*8]
00D3574D mov ecx,dword ptr [dest_pixel]
00D35750 mov edx,dword ptr [trans]
00D35753 mov al,byte ptr [eax+edx]
00D35756 mov byte ptr [ecx],al
dest_pixel++;
00D35758 mov eax,dword ptr [dest_pixel]
00D3575B add eax,1
00D3575E mov dword ptr [dest_pixel],eax
src_pixel++;
00D35761 mov eax,dword ptr [src_pixel]
00D35764 add eax,1
00D35767 mov dword ptr [src_pixel],eax
// End of what I consider the body
}
00D3576A jmp Renderer_DrawBitmap+16Bh (0D3571Bh)
And the assembly code I wrote:
(esi is the source pixel, edi is the screen buffer, edx is the light level, ebx is the transparency level, and ecx is the width of this row)
drawing_loop:
00C55682 movzx ax,byte ptr [esi]
00C55686 mov ah,byte ptr renderer_light (0DFA650h)[edx+eax*8]
00C5568D mov al,byte ptr [edi]
00C5568F mov al,byte ptr renderer_trans (0D7A650h)[ebx+eax*8]
00C55696 mov byte ptr [edi],al
00C55698 inc esi
00C55699 inc edi
00C5569A loop drawing_loop (0C55682h)
// This isn't just the body; this is the full row-plotting loop, just like the code above
And for context, the pixel is lit using a LUT and the transparency is also done with a LUT.
Pseudo C code:
//transparencyLUT[new][old][transparency level (0 = opaque, 7 = full transparency)]
//lightLUT[color][light level (0 = white, 3 = no change, 7 = full black)]
dest_pixel = transparencyLUT[lightLUT[source_pixel][light]]
[screen_pixel]
[transparency];
What gets me is how I use pretty much the same instructions the C code does, just fewer of them.
If you need more info I'll be happy to give more, I just don't want this to be a huge question. I'm just genuinely curious, because I'm fairly new to x86 assembly programming and want to learn more about how our CPUs actually work.
My only guess is that the out-of-order execution engine doesn't like my code because it's all memory accesses moving to the same register.
Not all instructions take the same time, and modern implementations of the CPU can execute (parts of) some instructions in parallel, as long as one doesn't read the data written by the previous one and the required execution units don't collide. Recent designs translate the "machine" instructions into lower-level, very simple ones that are scheduled on the fly and executed on the various units in the CPU in parallel as much as possible, using a whole lot of shadow registers: one instruction can still be using the value in one copy of %eax (the old value) while another instruction writes a new value into another copy of %eax (the new value), decoupling the instructions even more. The hoops they jump through for performance's sake...
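To make that concrete for the loop above (a sketch only, not benchmarked): two things in the hand-written version work against this machinery. First, the initial movzx writes only AX, so the full EAX used in the address calculation still depends on whatever the previous iteration left in the upper 16 bits, creating a loop-carried false dependency, and the later AH/AL writes add partial-register merging on top of that. Second, LOOP is microcoded and slower than a plain DEC/JNZ pair on most modern x86 CPUs. Keeping the same structure but fixing just those two points might look like this:
drawing_loop:
    movzx eax, byte ptr [esi]                      ; full 32-bit write: no dependence on old EAX
    mov   ah, byte ptr renderer_light[edx+eax*8]   ; light LUT (partial write kept, by design)
    mov   al, byte ptr [edi]                       ; old destination pixel into AL
    mov   al, byte ptr renderer_trans[ebx+eax*8]   ; transparency LUT
    mov   byte ptr [edi], al
    inc   esi
    inc   edi
    dec   ecx                                      ; DEC/JNZ instead of the microcoded LOOP
    jnz   drawing_loop
The compiler-generated code sidesteps the issue differently: each intermediate value goes into its own full register (eax, ecx, edx) via movzx, so the work doesn't all have to funnel through EAX.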
While debugging one of the assembly code examples, I found the following piece of information:
(gdb) x /10i 0x4005c4
0x4005c4: push %rbp
0x4005c5: mov %rsp,%rbp
0x4005c8: sub $0xa0,%rsp
0x4005cf: mov %fs:0x28,%rax
0x4005d8: mov %rax,-0x8(%rbp)
0x4005dc: xor %eax,%eax
0x4005de: movabs $0x6673646c6a6b3432,%rax
0x4005e8: mov %rax,-0x40(%rbp)
0x4005ec: movl $0x323339,-0x38(%rbp)
0x4005f3: movl $0x553059,-0x90(%rbp)
As per my understanding, movabs should not be used; it seems like it was introduced intentionally. Am I right in my understanding?
What should be the equivalent MOV instruction to replace it?
As a direct copy from this question: https://reverseengineering.stackexchange.com/questions/2627/what-is-the-meaning-of-movabs-in-gas-x86-att-syntax
[...] The movabs instruction to load arbitrary 64-bit constant into register and to load/store integer register from/to arbitrary constant 64-bit address is available.
http://www.ucw.cz/~hubicka/papers/amd64/node1.html
It does exactly what you'd expect from it - it puts the immediate into the register.
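To tie this back to the listing above (a sketch; the constant is the one from the gdb dump): in AT&T/GAS syntax, movabs is simply the mnemonic for the mov form that takes a full 64-bit immediate (or a 64-bit absolute address), so the "equivalent MOV" is the very same instruction written in Intel syntax:
movabs $0x6673646c6a6b3432, %rax    # AT&T (GAS) spelling
mov    rax, 0x6673646c6a6b3432      ; Intel/NASM spelling of the same instruction
Both assemble to the same bytes (48 B8 followed by the 8-byte immediate), so there is nothing suspicious about the compiler emitting it.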
.data
num dw 2,4,6,8,10
.code
main proc
mov eax,0
mov si,2
mov ax,num[si*2]
I want to print the array element 6, but I get the error "invalid use of register". How do I solve it?
I'd guess this is 16-bit 80x86 assembly. I can't guess which assembler it is (it's not GAS/AT&T and probably not NASM/YASM/FASM, and possibly MASM or TASM). Without knowing which assembler I can't know the syntax it expects.
However, for 16-bit addressing on 80x86 there is no way to encode that instruction (assuming it's not a syntax error to begin with). Alternatives include:
a) do add si,si and then (depending on which assembler) either mov ax,[num+si] or mov ax,num[si].
b) do mov bx,si and then (depending on which assembler) either mov ax,[num+bx+si] or mov ax,num[bx+si].
c) do mov si,2*2 and then (depending on which assembler) either mov ax,[num+si] or mov ax,num[si].
d) if it's 16-bit code running on a 32-bit CPU; do mov esi,2 and then (depending on which assembler) either mov ax,[num+esi*2] or mov ax,num[esi*2].
e) (depending on which assembler) either mov ax,[num+2*2] or I have no idea what (maybe mov ax,(num+2*2)[]?).
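For instance, option (c) might look like this in MASM/TASM-style syntax (a sketch, reusing the num array declared in the question):
    mov si, 2*2       ; byte offset of the third word: index 2 times 2 bytes per element
    mov ax, num[si]   ; ax = 6, the third element of num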
Function in C:
PHPAPI char *php_pcre_replace(char *regex, int regex_len,
char *subject, int subject_len,
zval *replace_val, int is_callable_replace,
int *result_len, int limit, int *replace_count TSRMLS_DC)
{
pcre_cache_entry *pce; /* Compiled regular expression */
/* Compile regex or get it from cache. */
if ((pce = pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC)) == NULL) {
return NULL;
}
....
}
Its assembly:
php5ts!php_pcre_replace:
1015db70 8b442408 mov eax,dword ptr [esp+8]
1015db74 8b4c2404 mov ecx,dword ptr [esp+4]
1015db78 56 push esi
1015db79 8b74242c mov esi,dword ptr [esp+2Ch]
1015db7d 56 push esi
1015db7e 50 push eax
1015db7f 51 push ecx
1015db80 e8cbeaffff call php5ts!pcre_get_compiled_regex_cache (1015c650)
1015db85 83c40c add esp,0Ch
1015db88 85c0 test eax,eax
1015db8a 7502 jne php5ts!php_pcre_replace+0x1e (1015db8e)
php5ts!php_pcre_replace+0x1c:
1015db8c 5e pop esi
1015db8d c3 ret
The C function call pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC) corresponds to 1015db7d~1015db80, which pushes the 3 parameters onto the stack and calls it.
But my question is: among so many registers, how does the compiler decide to use eax, ecx and esi (this one is special, as it is saved before being used and restored at the end — why?) as the intermediates to carry the values to the stack?
There must be some hidden indication in C that tells the compiler to do it this way, right?
No, there is no hidden indication.
This is a typical strategy for generating 80x86 instructions used by many compiler implementations, C and otherwise. For example, the 1980s Intel Fortran-77 compiler, when optimization was turned on, did the same thing.
As for esi: in the usual 32-bit calling conventions (cdecl/stdcall) esi is a callee-saved register, so the function has to push it before using it as scratch space and pop it before returning, whereas eax, ecx and edx may be clobbered freely. That it uses eax and ecx preferentially is probably an artifact of avoiding esi and edi, since those registers cannot be used directly to load byte operands.
Why not ebx and edx? Well, those are preferred by many code generators for holding intermediate pointers when evaluating complex structure references, which is to say, there isn't much reason at all. The compiler just looked for two available registers to use and overwrote them to buffer the values.
Why not reuse eax like this?:
push esi
mov eax,dword ptr [esp+2Ch]
push eax
mov eax,dword ptr [esp+8]
push eax
mov eax,dword ptr [esp+4]
push eax
Because that causes pipeline stalls waiting for eax to complete the previous memory cycle, on 80x86 processors since the 80586 (maybe the 80486; it's too long ago to be sure off the top of my head).
The x86 architecture is a strange beast. Each register, though promoted as being "general purpose" by Intel, has its quirks (cx/ecx is tied to the loop instruction, for example, and edx:eax is tied to the multiply instruction). That, combined with the peculiar ways to optimize execution to avoid cache misses and pipeline stalls, often leads to inscrutable code from a code generator that factors all of this in.
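For instance (a short sketch of the implicit operands meant above; top_of_loop is just a placeholder label):
    loop top_of_loop    ; implicitly decrements ecx and jumps while ecx is not zero
    mul  ebx            ; implicitly uses eax: edx:eax = eax * ebx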
Possible duplicate: Why is such complex code emitted for dividing a signed integer by a power of two?
Background
I'm just learning x86 asm by examining the binary code generated by the compiler.
Code compiled using the C++ compiler in Visual Studio 2010 beta 2.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.21003.01 for 80x86
C code (sandbox.c)
int mainCRTStartup()
{
int x=5;int y=1024;
while(x) { x--; y/=2; }
return x+y;
}
Compile it using the Visual Studio Command Prompt
cl /c /O2 /Oy- /MD sandbox.c
link /NODEFAULTLIB /MANIFEST:NO /SUBSYSTEM:CONSOLE sandbox.obj
Disassemble sandbox.exe in OllyDbg
The following starts from the entry point.
00401000 >/$ B9 05000000 MOV ECX,5
00401005 |. B8 00040000 MOV EAX,400
0040100A |. 8D9B 00000000 LEA EBX,DWORD PTR DS:[EBX]
00401010 |> 99 /CDQ
00401011 |. 2BC2 |SUB EAX,EDX
00401013 |. D1F8 |SAR EAX,1
00401015 |. 49 |DEC ECX
00401016 |.^75 F8 \JNZ SHORT sandbox.00401010
00401018 \. C3 RETN
Examination
MOV ECX, 5 int x=5;
MOV EAX, 400 int y=1024;
LEA ... // no idea what LEA does here. seems like ebx=ebx. elaborate please.
// in fact, NOPing it does nothing to the original procedure and the values.
CDQ // sign-extends EAX into EDX:EAX, which here means edx = 0. no idea why.
SUB EAX, EDX // eax=eax-edx, here: eax=eax-0. no idea, pretty redundant.
SAR EAX,1 // okay, y/= 2
DEC ECX // okay, x--, sets the zero flag when reaches 0.
JNZ ... // okay, jump back to CDQ if the zero flag is not set.
This part bothers me:
0040100A |. 8D9B 00000000 LEA EBX,DWORD PTR DS:[EBX]
00401010 |> 99 /CDQ
00401011 |. 2BC2 |SUB EAX,EDX
You can nop it all and the values of EAX and ECX will remain the same at the end. So, what's the point of these instructions?
The whole thing
00401010 |> 99 /CDQ
00401011 |. 2BC2 |SUB EAX,EDX
00401013 |. D1F8 |SAR EAX,1
stands for y /= 2. You see, a standalone SAR would not perform the signed integer division the way the compiler authors intended. The C++98 standard recommends that signed integer division round the result towards 0, while SAR alone rounds towards negative infinity. (Rounding towards negative infinity is permissible; the choice is left to the implementation.) In order to implement rounding towards 0 for negative operands, the above trick is used. If you use an unsigned type instead of a signed one, the compiler will generate just a single shift instruction, since the issue with negative division does not arise.
The trick is pretty simple: for negative y sign extension will place a pattern of 11111...1 in EDX, which is actually -1 in 2's complement representation. The following SUB will effectively add 1 to EAX if the original y value was negative. If the original y was positive (or 0), the EDX will hold 0 after the sign extension and EAX will remain unchanged.
In other words, when you write y /= 2 with signed y, the compiler generates the code that does something more like the following
y = (y < 0 ? y + 1 : y) >> 1;
or, better
y = (y + (y < 0)) >> 1;
Note that the C++ standard does not require the result of the division to be rounded towards zero, so the compiler has the right to emit just a single shift even for signed types. However, compilers normally follow the recommendation to round towards zero (or offer an option to control the behavior).
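A quick worked example of the fix-up (a sketch, tracing the register values for y = -5):
    mov  eax, -5    ; y = -5 (0FFFFFFFBh)
    cdq             ; edx = 0FFFFFFFFh (-1): the sign bit of eax replicated
    sub  eax, edx   ; eax = -5 - (-1) = -4: adds 1 because y was negative
    sar  eax, 1     ; eax = -2, the truncation towards zero that C requires
A bare SAR on -5 would have produced -3, i.e. rounding towards negative infinity.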
P.S. I don't know for sure what the purpose of that LEA instruction is. It is indeed a no-op. However, I suspect that it might be just a placeholder instruction inserted into the code for later patching. If I remember correctly, the MS compiler has an option that forces the insertion of placeholder instructions at the beginning and at the end of each function. In the future this instruction can be overwritten by a patcher with a CALL or JMP instruction that will execute the patch code. This specific LEA was chosen just because it produces a no-op placeholder instruction of the correct length. Of course, it could be something completely different.
The lea ebx,[ebx] is just a NOP operation. Its purpose is to align the beginning of the loop in memory, which will make it faster. As you can see here, the beginning of the loop starts at address 0x00401010, which is divisible by 16, thanks to this instruction.
The CDQ and SUB EAX,EDX operations make sure that the division will round a negative number towards zero - otherwise SAR would round it down, giving incorrect results for negative numbers.
The reason that the compiler emits this:
LEA EBX,DWORD PTR DS:[EBX]
instead of the semantically equivalent:
NOP
NOP
NOP
NOP
NOP
NOP
...is that it's faster for the processor to execute one 6-byte instruction than six 1-byte instructions. That's all.
This doesn't really answer the question, but is a helpful hint. Instead of mucking around with the OllyDbg.exe thing, you can make Visual Studio generate the asm file for you, which has the added bonus that it can put in the original source code as comments. This isn't a big deal for your current small project, but as your project grows, you may end up spending a fair amount of time figuring out which assembly code matches which source code.
From the command line, you want the /FAs and /Fa options (MSDN).
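For example, starting from the compile line used earlier, adding /FAs is enough (a sketch; /FAs asks for an assembly listing with the source interleaved):
cl /c /O2 /Oy- /MD /FAs sandbox.c
This writes sandbox.asm next to sandbox.obj.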
Here's part of the output for your example code (I compiled debug code, so the .asm is longer, but you can do the same thing for your optimized code):
_wmain PROC ; COMDAT
; 8 : {
push ebp
mov ebp, esp
sub esp, 216 ; 000000d8H
push ebx
push esi
push edi
lea edi, DWORD PTR [ebp-216]
mov ecx, 54 ; 00000036H
mov eax, -858993460 ; ccccccccH
rep stosd
; 9 : int x=5; int y=1024;
mov DWORD PTR _x$[ebp], 5
mov DWORD PTR _y$[ebp], 1024 ; 00000400H
$LN2#wmain:
; 10 : while(x) { x--; y/=2; }
cmp DWORD PTR _x$[ebp], 0
je SHORT $LN1#wmain
mov eax, DWORD PTR _x$[ebp]
sub eax, 1
mov DWORD PTR _x$[ebp], eax
mov eax, DWORD PTR _y$[ebp]
cdq
sub eax, edx
sar eax, 1
mov DWORD PTR _y$[ebp], eax
jmp SHORT $LN2#wmain
$LN1#wmain:
; 11 : return x+y;
mov eax, DWORD PTR _x$[ebp]
add eax, DWORD PTR _y$[ebp]
; 12 : }
pop edi
pop esi
pop ebx
mov esp, ebp
pop ebp
ret 0
_wmain ENDP
Hope that helps!