I've been debugging REP STOS DWORD PTR ES:[EDI] for a while now.
From what I can tell, it always uses:
ECX as the counter.
EAX as the value that gets stored at ES:[EDI], repeated ECX times.
Watching the memory dump pointed to by EDI, the instruction overwrites the data at EDI with EAX, advancing EDI by 4 bytes each iteration, and it stops when the counter hits 0.
So I came up with this kind of code:
while(regs.d.ecx != 0)
{
*(unsigned int *)(regs.d.edi) = regs.d.eax;
regs.d.edi += 4;
regs.d.ecx--;
}
It seems to work, but I'm concerned since I arrived at this by luck and guesswork. Is it solid? Will it always be ECX as the counter and EAX as the data, and will it always copy 4 bytes, never fewer?
You are almost correct. The only difference is that the direction flag (DF) controls whether 4 is added to or subtracted from EDI (and the address actually is an offset from the ES segment base, but you probably don't care about that):
for (; regs.d.ecx != 0; regs.d.ecx--)
{
*(unsigned int *)(regs.d.edi) = regs.d.eax;
regs.d.edi += regs.eflags.df ? -4 : 4;
}
Note that the for (; regs.d.ecx != 0; regs.d.ecx--) { } is the action of the REP prefix, and the body of the loop is the action of STOS DWORD....
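If it helps to see it in one self-contained piece, here is a minimal sketch of the same semantics as an emulator-style helper. The regs32 struct and the flat mem buffer are hypothetical stand-ins for whatever register and memory structures your code actually uses (your version treats edi directly as a host pointer instead):

#include <stdint.h>

struct regs32 { uint32_t eax, ecx, edi; };

/* REP STOS DWORD: store EAX at [EDI], ECX times, stepping EDI by 4. */
static void rep_stosd(struct regs32 *r, int df, uint8_t *mem)
{
    for (; r->ecx != 0; r->ecx--)              /* REP: repeat until ECX == 0 */
    {
        *(uint32_t *)(mem + r->edi) = r->eax;  /* STOS DWORD: store EAX      */
        r->edi += df ? -4 : 4;                 /* DF picks the direction     */
    }
}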
Since you are asking a lot of these questions, I think you will find the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A and 2B to be useful. These contain descriptions of each instruction and prefix, including pseudo-code descriptions.
Related
Hello Everyone!
I'm a newbie at NASM and I just started out recently. I currently have a program that reserves an array and is supposed to copy a string from the command-line arguments into that array and display it.
Now, I am not sure if I am copying the string correctly, because every time I try to display it, I get a segmentation fault!
This is my code for copying the string:
%include "asm_io.inc"
section .bss
X: resb 50 ;;This is our array
~some code~
mov eax, dword [ebp+12] ; eax holds address of 1st arg
add eax, 4 ; eax holds address of 2nd arg
mov ebx, dword [eax] ; ebx holds 2nd arg, which is pointer to string
mov ecx, dword 0
;Where our 2nd argument is a string eg "abcdefg" i.e ebx = "abcdefg"
copyarray:
mov al, [ebx] ;; get a character
inc ebx
mov [X + ecx], al
inc ecx
cmp al, 0
jz done
jmp copyarray
My question is whether this is the correct method of copying the string, and how can I display the contents of the array afterwards?
Thank you!
The loop looks ok, but clunky. If your program is crashing, use a debugger. See the x86 tag wiki for links and a quick intro to gdb for asm.
I think you're getting argv[1] loaded correctly. (Note that this is the first command-line arg, though; argv[0] is the command name.) https://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames says ebp+12 is the usual spot for the 2nd arg to 32-bit functions that bother to set up stack frames.
Michael Petch commented on Simon's deleted answer that the asm_io library has print_int, print_string, print_char, and print_nl routines, among a few others. So presumably you pass a pointer to your buffer to one of those functions and call it a day. Or you could call sys_write(2) directly with an int 0x80 instruction, since you don't need to do any string formatting and you already have the length.
Instead of incrementing a pointer and an index separately, you could use the same index for both arrays, with an indexed addressing mode for the load.
;; main (int argc ([esp+4]), char **argv ([esp+8]))
... some code you didn't show that I guess pushes some more stuff on the stack
mov eax, dword [ebp+12] ; load argv
;; eax + 4 is &argv[1], the address of the 1st cmdline arg (not counting the command name)
mov esi, dword [eax + 4] ; esi holds 2nd arg, which is pointer to string
xor ecx, ecx
copystring:
mov al, [esi + ecx]
mov [X + ecx], al
inc ecx
test al, al
jnz copystring
I changed the comments to say "cmdline arg", to distinguish between those and "function arguments".
When it doesn't cost any extra instructions, use esi for source pointers, edi for dest pointers, for readability.
Check the ABI for which registers you can use without saving/restoring (eax, ecx, and edx at least; that might be all for 32-bit x86). Other registers have to be saved/restored if you want to use them, at least if you're writing functions that follow the usual ABI. In asm you can do what you like, as long as you don't have a C compiler call functions that break the ABI.
Also note the improvement in the end of the loop. A single jnz to loop is more efficient than jz break / jmp.
This should run at one cycle per byte on Intel, because test/jnz macro-fuse into one uop. The load is one uop, and the store micro-fuses into one uop. inc is also one uop. Intel CPUs since Core2 are 4-wide: 4 uops issued per clock.
Your original loop runs at half that speed. Since it's 6 uops, it takes 2 clock cycles to issue an iteration.
Another hacky way to do this would be to get the offset between X and ebx into another register, so one of the effective addresses could use a one-register addressing mode, even if the dest wasn't a static array: keep ebx walking the source, load with mov al, [ebx], and store with mov [ebx + ecx], al (where ecx = X - start_of_src_buf). But ideally you'd make the store the one that used a one-register addressing mode, unless the load was a memory operand to an ALU instruction that could micro-fuse it. Where the dest is a static buffer, this address-difference hack isn't useful at all.
You can't use rep string instructions (like rep movsb) to implement strcpy for implicit-length strings (C null-terminated, rather than with a separately-stored length). Well, you could, but only by scanning the source twice: once to find the length, again to memcpy.
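To make that two-pass idea concrete, here's a minimal C sketch using nothing beyond the standard strlen and memcpy (the function name is just illustrative):

#include <string.h>

/* Two passes over the source: strlen finds the terminator,
   then memcpy copies the bytes, including the trailing 0. */
void strcpy_two_pass(char *dst, const char *src)
{
    size_t len = strlen(src);
    memcpy(dst, src, len + 1);
}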
To go faster than one byte per clock, you'd have to use vector instructions to test for the null byte at 16 positions in parallel. Look up an optimized strcpy implementation for an example; it will probably use pcmpeqb against a vector register of all-zeros.
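For example, here's a minimal SSE2 sketch of just the null-byte search, using compiler intrinsics and the GCC/Clang __builtin_ctz; it assumes s is 16-byte aligned so the loads can't run off the end of a mapped page before hitting the terminator (real library code handles unaligned starts and page boundaries much more carefully):

#include <emmintrin.h>
#include <stddef.h>

/* Return the index of the first 0 byte, scanning 16 bytes at a time. */
size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16)
    {
        __m128i chunk = _mm_load_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero)); /* one bit per zero byte */
        if (mask)
            return i + __builtin_ctz(mask);                        /* index of first zero byte */
    }
}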
I've disassembled this C code (using IDA) and ran across this bit of code. I believe the second line is an array access, as is the 5th line, but I'm not sure why it uses a sign extend or a zero extend.
I need to convert the code to C, and I'm not sure why the sign/zero extend is used, or what C code would produce it.
mov ecx, [ebp+var_58]
mov dl, byte ptr [ebp+ecx*2+var_28]
mov [ebp+var_59], dl
mov eax, [ebp+var_58]
movsx ecx, [ebp+eax*2+var_20]
movzx edx, [ebp+var_59]
or edx, ecx
mov [ebp+var_59], dl
unsigned integer types will be zero-extended, while signed types will be sign-extended.
I kinda want to downvote this as too trivial. It's not like there's anything going on that the instruction reference manual doesn't cover. I guess it's different from asking for an explanation of a really simple C program because the trick here is understanding why one might string this sequence of instructions together, rather than just what each one does individually. Being familiar with the idioms used by non-optimizing compilers (store and reload from RAM after every statement) helps.
I'm guessing this is a snippet from inside a function that sets up a stack frame, so the [ebp+var_XX] operands are where local variables are spilled when they're not live in registers (IDA's var_XX names are typically negative offsets from ebp).
mov ecx, [ebp+var_58] ; load var58 into ecx
mov dl, byte ptr [ebp+ecx*2+var_28] ; load a byte from var28[2*var58]
mov [ebp+var_59], dl ; store it to var59
mov eax, [ebp+var_58] ; load var58 again for some reason? can var59 alias var58?
; otherwise we still have the value in ecx, right?
; Or is this non-optimizing compiler output that's really annoying to read?
movsx ecx, [ebp+eax*2+var_20] ; load var20[var58*2]
movzx edx, [ebp+var_59] ; load var59 again
or edx, ecx ; edx = var59|var20[var58*2]
mov [ebp+var_59], dl ; spill var59 back to memory
Presumably these are byte-to-dword forms of movsx/movzx; word-to-dword also exists, and I'm surprised your disassembler didn't disambiguate with a byte ptr on the memory operand. I'm inferring that it's a byte load because the preceding store to that address was byte-wide.
movsx is used when loading signed data that's narrower than 32 bits. C's integer-promotion rules dictate that operations on integer types narrower than int are done after promotion to int (or to unsigned int if int can't represent all the values, e.g. if unsigned short and unsigned int are the same size).
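A tiny illustration of that promotion rule, with made-up values just to show the types involved:

#include <stdio.h>

int main(void)
{
    unsigned char a = 0x80;  /* would be loaded with movzx */
    signed char   b = -1;    /* would be loaded with movsx */

    /* Both operands are promoted to int before the OR: b becomes
       0xFFFFFFFF (sign-extended), a becomes 0x00000080 (zero-extended). */
    int r = a | b;
    printf("r = %#x\n", (unsigned)r);   /* prints 0xffffffff with 32-bit int */
    return 0;
}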
8-bit and 32-bit operand sizes are available without operand-size prefix bytes. Since only Intel P6/SnB-family CPUs track partial registers separately, sign- or zero-extending to the full register width on loads can make for faster code (it avoids false dependencies on the previous contents of the register on AMD and Silvermont). So sign- or zero-extending (as appropriate for the data type) on loads is often the best way to handle narrow memory locations.
Looking at the output of non-optimizing compilers is not usually worth bothering with.
If the code had been generated by a proper optimizing compiler, it would probably be more like
mov ecx, [ebp+var_58] ; var58 is live in ecx
mov al, byte ptr [ebp+ecx*2+var_28] ; var59 = var28[2*var58]
or al, [ebp+ecx*2+var_20] ; var59 |= var20[var58*2]
mov [ebp+var_59], al ; spill var59 to memory
Much easier to read, IMO, without the noise of constantly storing/reloading. You can see when a value is used multiple times without having to notice that a load was from an address that was just stored to.
If a false dependency on the upper 24 bits of eax was causing a problem, we could use movzx or movsx loads into two registers, and do an or r32, r32 like the original, but then still store the low 8. (Using a 32bit or with a memory operand would do a 4B load, not a 1B load, which could cross a cache line or even a page and segfault.)
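Putting the pieces together, a C fragment along these lines could plausibly have produced the original sequence from a non-optimizing compiler. The names, element types, and array sizes are guesses based on the IDA offsets and on which loads used movzx vs. movsx:

/* Hypothetical reconstruction, not recovered from the binary. */
void snippet(void)
{
    int           var58 = 0;
    unsigned char var59;
    unsigned char var28[16] = {0};   /* plain byte load, stored as-is   */
    signed char   var20[16] = {0};   /* movsx on load, so a signed type */

    var59  = var28[2 * var58];   /* mov dl, byte ptr [...]; store dl     */
    var59 |= var20[2 * var58];   /* movzx + movsx, or edx, ecx; store dl */
}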
I've got a problem with memory access and postfix incrementation :/
I need to access video memory at boot, so I create a pointer to the 0xB8000 address and then increment the pointer to access the next location.
Basically, the code would be:
volatile char *p = (volatile char *)0xB8000;
for (int i = 0; i < 5; ++i)
*(p++) = 'A';
This way, p points to the proper memory address, and after each access it is incremented (I know there are 2 bytes per character displayed, but that is not the problem here).
But this doesn't work: nothing is displayed. If I change the increment to prefix like this, it works and I can see the characters on the screen!
volatile char *p = (volatile char *)0xB8000;
for (int i = 0; i < 5; ++i)
*(++p) = 'A';
So, I checked the assembly code:
; Postfix
mov ecx, DWORD PTR _p$[ebp]
mov BYTE PTR [ecx], 65 ; 'A' character
mov edx, DWORD PTR _p$[ebp]
add edx, 1
mov DWORD PTR _p$[ebp], edx
; Prefix
mov ecx, DWORD PTR _p$[ebp]
add ecx, 1
mov DWORD PTR _p$[ebp], ecx
mov edx, DWORD PTR _p$[ebp]
mov BYTE PTR [edx], 65 ; 'A' character
I can't spot a difference that would explain this. I could just use the prefix increment, but I would like to understand why the postfix version does not work :/
The assembly code is from the Visual C++ compiler; I don't have GCC at work :/
EDIT: I know the difference between prefix and postfix incrementation, and I can see the difference between the two assembly listings above. But IMO, none of those differences explains why no characters are printed on screen.
About the attribute byte: I know I should set it properly. I wanted to keep the example and its assembly small, but as written the attribute byte is also set to 'A', which leads to a blue letter on a red background.
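For what it's worth, setting both bytes of each cell explicitly would look something like this minimal C sketch (write_test_pattern is just an illustrative name; 0x07 is the standard light-grey-on-black attribute):

static void write_test_pattern(void)
{
    volatile unsigned char *vga = (volatile unsigned char *)0xB8000;

    for (int i = 0; i < 5; ++i)
    {
        vga[2 * i]     = 'A';    /* character byte */
        vga[2 * i + 1] = 0x07;   /* attribute byte: light grey on black */
    }
}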
Thank you :)
After a few more tests, I found the likely cause of this error: the .rodata section was not being linked properly. Things work better now.
For more details, I followed some of the instructions available in an OSDev tutorial ;)
In C, we often use char to represent small numbers. However, the processor always reads (fetches) 32-bit int-sized values from registers, so every time our program uses a char (8 bits), the processor needs to fetch 32 bits from a register and 'parse' 8 bits out of it.
Hence, does it make sense to use int more often in place of char if memory is not a limitation?
Will it 'help' the processor?
There's the compiler part and the cpu part.
If you tell the compiler you're using a char instead of an int, during static analysis it will know the variable's range is 0-255 instead of 0-(2^32-1). This lets it optimize your program better.
On the CPU side, your assumption isn't always correct. Take x86 as an example: it has eax and al for 32-bit and 8-bit access to the same register. If you want to use chars only, using al is sufficient; there is no performance loss.
I did some simple benchmarks in response to the comments below:
al:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc al
add al, al
dec al
dec al
loopd short loop_start
ret
eax:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc eax
add eax, eax
dec eax
dec eax
loopd short loop_start
ret
times:
$ time ./test_al.exe
./test_al.exe 0.01s user 0.00s system 0% cpu 7.102 total
$ time ./test_eax.exe
./test_eax.exe 0.01s user 0.01s system 0% cpu 7.120 total
So in this case al is slightly faster, but sometimes eax came out faster. The difference is really negligible. But CPUs aren't that simple; there might be code-alignment issues, caches, and other things going on, so it's best to benchmark your own code to see if there's any real improvement. IMO, if your code is not super tight, it's best to trust the compiler to optimize things.
I'd stick to int if I were you, as that is probably the most native integral type for your platform. Internally you could expect shorter types to be converted to int, actually degrading performance.
You should never use char and expect it to be consistent across platforms. Although the C standard defines sizeof(char) to be 1, char itself could be signed or unsigned. The choice is down to the compiler.
If you believe you can squeeze some performance gain from using an 8-bit type, then be explicit and use signed char or unsigned char.
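A quick way to check what your compiler chose for plain char, using only the <limits.h> macros:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* Plain char's signedness is implementation-defined;
       CHAR_MIN tells you which choice this compiler made. */
    printf("plain char is %s\n", CHAR_MIN == 0 ? "unsigned" : "signed");
    printf("signed char range  : %d..%d\n", SCHAR_MIN, SCHAR_MAX);
    printf("unsigned char range: 0..%u\n", (unsigned)UCHAR_MAX);
    return 0;
}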
From the ARM System Developer's Guide:
"most ARM data processing operations are 32-bit only. For this reason, you should use
a 32-bit datatype, int or long, for local variables wherever possible. Avoid using char and
short as local variable types, even if you are manipulating an 8- or 16-bit value"
Here is example code from the book to prove the point. Note the wrap-around handling for char as opposed to unsigned int.
int checksum_v1(int *data)
{
char i;
int sum = 0;
for (i = 0; i < 64; i++)
{
sum += data[i];
}
return sum;
}
ARM7 assembly when i is a char:
checksum_v1
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v1_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
AND r1,r1,#0xff ; i = (char)r1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v1_loop ; if (i<64) loop
MOV pc,r14 ; return sum
ARM7 assembly when i is an unsigned int.
checksum_v2
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum
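For comparison, the C source that produces the second listing is the same function with the counter declared as unsigned int, roughly:

int checksum_v2(int *data)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < 64; i++)
    {
        sum += data[i];
    }
    return sum;
}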
If your program is simple enough, the optimizer can do the right thing without you having to worry about it. In that case, plain int would be the simplest (and most future-proof) solution.
However, if you really want to combine a specific bit width with speed, you can use the fastest minimum-width integer types from section 7.18.1.3 of the C99 standard (requires a C99-compliant compiler).
For example:
int_fast8_t x;
uint_fast8_t y;
are the signed and unsigned types that are guaranteed to hold at least 8 bits of data while using whatever underlying type is usually fastest. Of course, it all depends on what you are doing with the data afterwards.
For example, on all the systems I have tested (see: standard type sizes in C++) the fast types were 8 bits wide.
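A quick sketch to see what your own platform maps them to (the widths printed are implementation-defined; only the 8/16/32-bit minimums are guaranteed):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    printf("int_fast8_t : %zu bytes\n", sizeof(int_fast8_t));
    printf("int_fast16_t: %zu bytes\n", sizeof(int_fast16_t));
    printf("int_fast32_t: %zu bytes\n", sizeof(int_fast32_t));
    return 0;
}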
Say I got
EDX = 0xA28
EAX = 0x0A280105
I run this ASM code
IMUL EDX
which to my understanding only uses EAX if one operand is specified.
So in C code it should be like
EAX *= EDX;
correct?
After looking in the debugger, I found out EDX got altered too.
0x0A280105 * 0xA28 = 0x67264A5AC8
In the debugger:
EAX = 264A5AC8
EDX = 00000067
Now if you take the answer, 0x67264A5AC8, and split off the leading pair of hex digits, 0x67 264A5AC8,
you can clearly see why EDX and EAX are the way they are.
Okay, so an overflow happens, since such a huge number cannot be stored in 32 bits, so the extra bits end up in EDX.
But my question is: how would I do this in C code now to get the same results?
I'm guessing it would be like
EAX *= EDX;
EDX = 0xFFFFFFFF - EAX; //blah not good with math manipulation like this.
The IMUL instruction actually produces a result twice the size of the operand (unless you use one of the newer versions that can specify a destination). So:
imul 8bit -> result = ax, 16bits
imul 16bit -> result = dx:ax, 32bits
imul 32bit -> result = edx:eax, 64bits
How to do this in C depends on the compiler, but some will accept this:
long result = (long) eax * (long) edx;
eax = result & 0xffffffff;
edx = result >> 32;
This assumes a long is 64 bits (and that eax and edx hold signed 32-bit values, since IMUL is a signed multiply). If the compiler has no 64-bit data type, calculating the result becomes much harder; you have to do long multiplication.
You could always use inline assembly for the imul instruction.
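With <stdint.h> the same calculation can be written without assuming anything about the size of long; a minimal sketch (the eax/edx variable names just mirror the registers, and the values are the ones from the question):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t eax = 0x0A280105, edx = 0x0A28;

    /* One-operand IMUL is a signed multiply, so widen through
       the signed 32-bit type before multiplying. */
    int64_t result = (int64_t)(int32_t)eax * (int32_t)edx;

    uint32_t lo = (uint32_t)result;          /* EAX after IMUL */
    uint32_t hi = (uint32_t)(result >> 32);  /* EDX after IMUL */

    printf("EDX:EAX = %08" PRIX32 ":%08" PRIX32 "\n", hi, lo);  /* 00000067:264A5AC8 */
    return 0;
}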