What does 0x4 do in "movl $0x2d, 0x4(%esp)"? - c

I am looking into assembly code generated by GCC. But I don't understand:
movl $0x2d, 0x4(%esp)
In the second operand, what does 0x4 stand for? Is it an offset address? And what is the use of register EAX?

movl $0x2d, 0x4(%esp) means to take the current value of the stack pointer (%esp), add 4 (0x4) then store the long (32-bit) value 0x2d into that location.
The eax register is one of the general purpose 32-bit registers. x86 architecture specifies the following 32-bit registers:
eax Accumulator Register
ebx Base Register
ecx Counter Register
edx Data Register
esi Source Index
edi Destination Index
ebp Base Pointer
esp Stack Pointer
and the names and purposes of some of them hark back to the days of the Intel 8080.
This page gives a good overview on the Intel-type registers. The first four of those in the above list can also be accessed as one 16-bit value or as two 8-bit values. For example:
3322222222221111111111
10987654321098765432109876543210
<------------- eax ------------>
                <----- ax ----->
                <- ah -><- al ->
The pointer and index registers do not allow use of 8-bit parts but you can have, for example, the 16-bit bp.
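The overlap in the diagram can be written out with masks and shifts. This is only a sketch in C, not anything the CPU exposes: the helper names (view_ax, view_ah, view_al, write_al) are made up for illustration, and a plain uint32_t stands in for eax.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: each returns the value a sub-register would hold. */
uint16_t view_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }      /* bits 15..0 */
uint8_t  view_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }  /* bits 15..8 */
uint8_t  view_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }         /* bits 7..0  */

/* Writing al replaces only bits 7..0; bits 31..8 keep their old value. */
uint32_t write_al(uint32_t eax, uint8_t value) {
    return (eax & 0xFFFFFF00u) | value;
}

/* Small self-check using an arbitrary sample value. */
void register_views_demo(void) {
    uint32_t eax = 0x12345678u;
    assert(view_ax(eax) == 0x5678u);
    assert(view_ah(eax) == 0x56u);
    assert(view_al(eax) == 0x78u);
    assert(write_al(eax, 0xFFu) == 0x123456FFu);
}
```

So ax, ah and al are not separate storage, just different windows onto the same 32 bits.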

0x4(%esp) means *(%esp + 4), where * means dereferencing.
The statement stores the immediate value 0x2d into the 32-bit stack slot that starts 4 bytes above the address in the stack pointer, for example a local variable or an argument being set up for a call.
(The code you've shown is in AT&T syntax. In Intel syntax it would be mov dword ptr [esp+4], 2dh)
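To make the addressing concrete, here is a minimal C sketch of the same store, assuming a byte array stands in for the memory that %esp points at. The names store_long and load_long are made up for this illustration; they are not anything GCC generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of "movl $imm, disp(%esp)": store a 32-bit value at stack + disp.
   `stack` plays the role of the address held in %esp (an assumption of this sketch). */
void store_long(uint8_t *stack, size_t stack_size, size_t disp, uint32_t imm) {
    assert(disp + sizeof imm <= stack_size);  /* stay inside the modeled stack */
    memcpy(stack + disp, &imm, sizeof imm);   /* *(uint32_t *)(esp + disp) = imm */
}

/* Read the 32-bit value back from stack + disp. */
uint32_t load_long(const uint8_t *stack, size_t stack_size, size_t disp) {
    uint32_t value;
    assert(disp + sizeof value <= stack_size);
    memcpy(&value, stack + disp, sizeof value);
    return value;
}
```

Calling store_long(stack, sizeof stack, 4, 0x2d) corresponds to the instruction in the question: the displacement 0x4 selects the slot, and the l suffix fixes the width at four bytes.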

0x4 in the second operand is an offset from the value of the register in the parens. EAX is a general purpose register used for assembly coding (computations, storing temporary values, etc.). Formally it's called the "Accumulator register", but that's more historic than relevant.
You can read this page about the x86 architecture. Most relevant to your question are the sections on Addressing modes and General purpose registers

GCC emits AT&T syntax, where instruction mnemonics carry an operand-size suffix: byte (b), word (w), long (l) and so on, such as:
movb
movw
movl
Registers are prefixed with a percentage sign (%).
Constants are prefixed with a dollar sign ($).
In the example in your question, 0x4(%esp) means the address 4 bytes above the stack pointer (esp).
Hope this helps,
Best regards,
Tom.

You're accessing something four bytes above the address in the stack pointer. In GCC output, positive offsets from the frame pointer (%ebp) are parameters and negative offsets are local variables, if I remember correctly; a positive offset from %esp like this one usually addresses a local or an argument being stored for an upcoming call. You're writing, in other words, the value 0x2D into that stack slot. If you gave more context I could probably tell you what was going on in the whole procedure.

Related

How to set function arguments in assembly during runtime in a 64bit application on Windows?

I am trying to set arguments using assembly code that are used in a generic function. The arguments of this generic function - that is resident in a dll - are not known during compile time. During runtime the pointer to this function is determined using the GetProcAddress function. However its arguments are not known. During runtime I can determine the arguments - both value and type - using a datafile (not a header file or anything that can be included or compiled). I have found a good example of how to solve this problem for 32 bit (C Pass arguments as void-pointer-list to imported function from LoadLibrary()), but for 64 bit this example does not work, because you cannot fill the stack but you have to fill the registers. So I tried to use assembly code to fill the registers but until now no success. I use C-code to call the assembly code. I use VS2015 and MASM (64 bit). The C-code below works fine, but the assembly code does not. So what is wrong with the assembly code? Thanks in advance.
C code:
...
void fill_register_xmm0(double); // proto of assembly function
...
// code determining the pointer to a func returned by the GetProcAddress()
...
double dVal = 12.0;
int v;
fill_register_xmm0(dVal);
v = func->func_i(); // integer function that will use the dVal
...
assembly code in different .asm file (MASM syntax):
TITLE fill_register_xmm0
.code
option prologue:none ; turn off default prologue creation
option epilogue:none ; turn off default epilogue creation
fill_register_xmm0 PROC variable: REAL8 ; REAL8=equivalent to double or float64
movsd xmm0, variable ; fill value of variable into xmm0
ret
fill_register_xmm0 ENDP
option prologue:PrologueDef ; turn on default prologue creation
option epilogue:EpilogueDef ; turn on default epilogue creation
END
The x86-64 Windows calling convention is fairly simple, and makes it possible to write a wrapper function that doesn't know the types of anything. Just load the first 32 bytes of args into registers, and copy the rest to the stack.
You definitely need to make the function call from asm; it can't possibly work reliably to make a bunch of function calls like fill_register_xmm0 and hope that the compiler doesn't clobber any of those registers. The C compiler emits instructions that use the registers, as part of its normal job, including passing args to functions like fill_register_xmm0.
The only alternative would be to write a C statement with a function call with all the args having the correct type, to get the compiler to emit code to make a function call normally. If there are only a few possible different combinations of args, putting those in if() blocks might be good.
And BTW, movsd xmm0, variable probably assembles to movsd xmm0, xmm0, because the first function arg is passed in XMM0 if it's FP.
In C, prepare a buffer with the args (like in the 32-bit case).
Each one needs to be padded to 8 bytes if it's narrower. See MS's docs for x86-64 __fastcall. (Note that x86-64 __vectorcall passes __m128 args by value in registers, but for __fastcall it's strictly true that the args form an array of 8-byte values, after the register args. And storing those into the shadow space creates a full array of all the args.)
Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
But the key thing that makes variadic functions easy in the Windows calling convention also works here: The register used for the 2nd arg doesn't depend on the type of the first. i.e. if an FP arg is the first arg, then that uses up an integer register arg-passing slot. So you can only have up to 4 register args, not 4 integer and 4 FP.
If the 4th arg is integer, it goes in R9, even if it's the first integer arg. Unlike in the x86-64 System V calling convention, where the first integer arg goes in rdi, regardless of how many earlier FP args are in registers and/or on the stack.
So the asm wrapper that calls the function can load the first 8 bytes into both integer and FP registers! (Variadic functions already require this, so a callee doesn't have to know whether to store the integer or FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions at the expense of efficiency for functions with a mix of integer and FP args.)
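The slot rule above can be sketched in C. This is only an illustration of the rule as described, not part of any Microsoft API: the function names win64_arg_register and win64_passed_by_value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Register used for argument number `index` (0-based) in the Windows x64
   convention: the slot is chosen by position alone, the type only picks
   the register bank. Returns NULL for arguments that go on the stack. */
const char *win64_arg_register(size_t index, int is_fp) {
    static const char *const int_regs[4] = {"rcx", "rdx", "r8", "r9"};
    static const char *const fp_regs[4]  = {"xmm0", "xmm1", "xmm2", "xmm3"};
    if (index >= 4)
        return NULL;
    return is_fp ? fp_regs[index] : int_regs[index];
}

/* An argument travels by value only if its size is 1, 2, 4, or 8 bytes;
   anything else is passed by reference. */
int win64_passed_by_value(size_t size) {
    return size == 1 || size == 2 || size == 4 || size == 8;
}

/* Self-check: a 4th integer arg lands in r9 even if args 1..3 were FP. */
void win64_slots_demo(void) {
    assert(strcmp(win64_arg_register(0, 1), "xmm0") == 0);
    assert(strcmp(win64_arg_register(3, 0), "r9") == 0);
    assert(win64_arg_register(4, 0) == NULL);
}
```

Because the slot depends only on the position, a type-agnostic wrapper can fill both banks from the same 8 bytes and let the callee read whichever one it expects.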
The C side that puts all the args into a buffer can look like this:
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
// asm wrapper: buffer of 8-byte arg slots, number of arg bytes used, function to call
int asmwrapper(const void *argbuf, size_t arg_bytes, void (*funcpointer)(void));
void somefunc() {
    alignas(16) uint64_t argbuf[256/8]; // or char argbuf[256]. But if you choose not to use alignas, then uint64_t will still give 8-byte alignment
    char *argp = (char*)argbuf;
    char *argend = (char*)argbuf + sizeof(argbuf);
    for( ; argp < argend ; argp += 8) {
        // (break out of this loop once there are no more args)
        if (figure_out_an_arg()) {
            int foo = get_int_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else if(bar) {
            double foo = get_double_arg();
            memcpy(argp, &foo, sizeof(foo));
        } else {
            // ... memcpy whatever size
            // or allocate space to pass by ref and memcpy a pointer
        }
    }
    if (argp == argend) {
        // error, ran out of space for args
    }
    asmwrapper(argbuf, (size_t)(argp - (char*)argbuf), funcpointer);
}
Unfortunately I don't think we can directly use argbuf on the stack as the args + shadow space for a function call. We have no way of stopping the compiler from putting something valuable below argbuf which would let us just set rsp to the bottom of it (and save the return address somewhere, maybe at the top of argbuf by reserving some space for use by the asm).
Anyway, just copying the whole buffer will work. Or actually, load the first 32 bytes into registers (both integer and FP), and only copy the rest. The shadow space doesn't need to be initialized.
argbuf could be a VLA if you knew ahead of time how big it needed to be, but 256 bytes is pretty small. It's not like reading past the end of it can be a problem, it can't be at the end of a page with unmapped memory later, because our parent function's stack frame definitely takes some space.
;; NASM syntax. For MASM just rename the local labels and add whatever PROC / ENDP is needed.
;; UNTESTED
;; rcx: argbuf
;; rdx: length in bytes of the args. 0..256, zero-extended to 64 bits
;; r8 : function pointer
;; reserve rdx bytes of space for arg passing
;; load first 32 bytes of argbuf into integer and FP arg-passing registers
;; copy the rest as stack-args above the shadow space
global asmwrapper
asmwrapper:
push rbp
mov rbp, rsp ; so we can efficiently restore the stack later
mov r10, r8 ; move function pointer to a volatile but non-arg-passing register
; load *both* xmm0-3 and rcx,rdx,r8,r9 from the first 32 bytes of argbuf
; regardless of types or whether there were that many arg bytes
; All bytes are loaded into registers early, some reg->reg transfers are done later
; when we're done with more registers.
; movsd xmm0, [rcx]
; movsd xmm1, [rcx+8]
movaps xmm0, [rcx] ; 16-byte alignment required for argbuf. Use movups to allow misalignment if you want
movhlps xmm1, xmm0 ; use some ALU instructions instead of just loads
; rcx,rdx can't be set yet, still in use for wrapper args
movaps xmm2, [rcx+16] ; it's ok to leave garbage in the high 64-bits of an XMM passing a float or double.
;movhlps xmm3, xmm2 ; the copyloop uses xmm3: do this later
movq r8, xmm2
mov r9, [rcx+24]
mov eax, 32
cmp edx, eax
jbe .small_args ; no copying needed, just shadow space
sub rsp, rdx
and rsp, -16 ; reserve extra space, realigning the stack by 16
; rax=32 on entry, start copying just above shadow space (which doesn't need to be copied)
.copyloop: ; do {
movaps xmm3, [rcx+rax]
movaps [rsp+rax], xmm3 ; indexed addressing modes aren't always optimal, but this loop only runs a couple times.
add eax, 16
cmp eax, edx
jb .copyloop ; } while(bytes_copied < arg_bytes);
.done_arg_copying:
; xmm0,xmm1 have the first 2 qwords of args
movq rcx, xmm0 ; RCX NO LONGER POINTS AT argbuf
movq rdx, xmm1
; xmm2 still has the 2nd 16 bytes of args
;movhlps xmm3, xmm2 ; don't use: false dependency on old value and we just used it.
pshufd xmm3, xmm2, 0xee ; xmm3 = high 64 bits of xmm2. (0xee = _MM_SHUFFLE(3,2,3,2))
; movq xmm3, r9 ; nah, can be multiple uops on AMD
; r8,r9 set earlier
call r10
leave ; restore RSP to its value on entry
ret
; could handle this branchlessly, but copy loop still needs to run zero times
; unless we bump up the min arg_bytes to 48 and sometimes copy an unnecessary 16 bytes
; As much work as possible is before the first branch, so it can happen while a mispredict recovers
.small_args:
sub rsp, rax ; reserve shadow space
;rsp still aligned by 16 after push rbp
jmp .done_arg_copying
;byte count. This wrapper is 82 bytes; would be nice to fit it in 80 so we don't waste 14 bytes before the next function.
;e.g. maybe mov rcx, [rcx] instead of movq rcx, xmm0
;mov eax, $-asmwrapper
align 16
This does assemble (on Godbolt with NASM), but I haven't tested it.
It should perform pretty well, but if you get mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so it always copies an extra 16 bytes. (Uncomment the cmp/cmovb in the version on Godbolt, but the copy loop still needs to start at 32 bytes into each buffer.)
If you often pass very few args, the 16-byte loads might hit a store-forwarding stall from two narrow stores to one wide reload, causing about an extra 8 cycles of latency. This isn't normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution can't hide that, then it's worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there to XMM, if the args are mostly integer. That will have lower latency than mem -> xmm -> integer.)
If you have more than a couple args, though, hopefully the first few have committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there's enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.
Of course, if performance was a huge issue, you'd write the code that figures out the args in asm so you didn't need this copy stuff, or use a library interface with a fixed function signature that a C compiler can call directly. I did try to make this suck as little as possible on modern Intel / AMD mainstream CPUs (http://agner.org/optimize/), but I didn't benchmark it or tune it, so probably it could be improved with some time spent profiling it, especially for some real use-case.
If you know that FP args aren't a possibility for the first 4, you can simplify by just loading integer regs.
So you need to call a function (in a DLL), but only at run time can you figure out the number and type of parameters. Then you need to prepare the parameters, either on the stack or in registers, depending on the Application Binary Interface/calling convention.
I would use the following approach: some component of your program figures out the number and type of parameters. Let's assume it creates a list of {type, value}, {type, value}, ...
You then pass this list to a function to prepare the ABI call. This will be an assembler function. For a stack-based ABI (32 bit), it just pushes the parameters on to the stack. For a register based ABI, it can prepare the register values and save them as local variables (add sp,nnn) and once all parameters have been prepared (possibly using registers needed for the call, hence first saving them), loads the registers (a series of mov instructions) and performs the call instruction.
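A rough C sketch of the {type, value} list and of the step that lays the values out in 8-byte slots, ready for an assembler routine to load or push. Everything here (arg_type, arg_desc, pack_args) is a hypothetical name chosen for the example, it handles only two types, and it assumes a double is 8 bytes as on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { ARG_INT64, ARG_DOUBLE } arg_type;   /* hypothetical type tags */

typedef struct {
    arg_type type;
    union { int64_t i; double d; } value;
} arg_desc;

/* Copy each argument into its own 8-byte slot, in call order.
   Returns the number of bytes of `slots` that were filled. */
size_t pack_args(const arg_desc *args, size_t count,
                 uint64_t *slots, size_t max_slots) {
    assert(count <= max_slots);   /* caller must provide enough slots */
    for (size_t n = 0; n < count; n++) {
        if (args[n].type == ARG_DOUBLE)
            memcpy(&slots[n], &args[n].value.d, sizeof(double));
        else
            memcpy(&slots[n], &args[n].value.i, sizeof(int64_t));
    }
    return count * sizeof(uint64_t);
}
```

The assembler function then only needs the slot array and the byte count; it does not have to understand the type tags at all when the calling convention assigns registers by position.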

Storing an unsigned char in register (x86 assembly)

I am writing a function that has 3 parameters passed to it.
void function(unsigned char* one, int two, unsigned char three) {
__asm
{
mov eax, one
mov ebx, two
mov ecx, three //Having issues storing this variable in a register
But I get a compile error "operand size conflict" for the "three". The other two store just fine. I am trying to figure out why... it compiles if I use lea ecx, three. However the value stored is wrong.
Side question. Am I understanding correctly that the first parameter is passing me a memory location for that variable?
Thanks!
Most x86 instructions require all the operands to be the same size. In this case, three is an 8-bit argument, so use a mov with an 8-bit destination to load it, like mov cl, three.
There are a few mov-like instructions which allow extending from a smaller source to a larger destination. For example, you can use movzx ecx, three (move with zero-extension) to load the byte into ecx and zero the top three bytes.
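The difference between the two loads can be modeled in C. This is a sketch only: the function names are invented for the example, and a uint32_t stands in for ecx.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "mov cl, three": only bits 7..0 of ecx change,
   whatever was in bits 31..8 stays there. */
uint32_t load_with_mov_cl(uint32_t old_ecx, uint8_t three) {
    return (old_ecx & 0xFFFFFF00u) | three;
}

/* Model of "movzx ecx, three": the byte is zero-extended,
   so bits 31..8 are cleared regardless of the old contents. */
uint32_t load_with_movzx(uint8_t three) {
    return (uint32_t)three;
}

/* Self-check with an arbitrary stale value in ecx. */
void byte_load_demo(void) {
    assert(load_with_mov_cl(0xDEADBEEFu, 0x41u) == 0xDEADBE41u);
    assert(load_with_movzx(0x41u) == 0x00000041u);
}
```

If later code reads all of ecx, only the movzx form gives the value of three; after mov cl the upper bytes are leftovers.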
I would recommend not moving parameters into registers explicitly with "mov" instructions. If you have compiled your project with the "Register" calling convention, then "one", "two" and "three" are just aliases for the registers, so you may overwrite data that you need, and you only insert unnecessary "mov" instructions into your code. Just take into consideration which parameter comes in which register and use these parameters immediately. But be careful with parameters that are smaller than the register size. Higher bits may be undefined (trashed).
It depends on the particular calling convention, but most probably, for a 32-bit register calling convention, a call passes the first parameter in EAX, the second in EDX and the third in ECX. The result is returned in EAX (32-bit) or the EDX:EAX pair (64-bit). See http://docwiki.embarcadero.com/RADStudio/Seattle/en/Program_Control#Register_Convention for more detail.
In your case, the "unsigned char three" arrives directly in ECX, "two" in EDX and "one" in EAX, so you don't need any moves. But since "three" is just a char (8 bits) and ECX is a dword (32 bits), bits 8-31 of ECX may contain trash. You should not rely on the assumption that these bits are zero.
For the 64-bit Windows calling convention, the first parameter is passed in RCX, the second in RDX, the third in R8 and the fourth in R9, and the result is returned in RAX. See https://msdn.microsoft.com/en-us/library/ms235286.aspx for more information for 64-bit.

C Pointer to EFLAGS using NASM

For a task at my school, I need to write a C program which does a 16-bit addition using an assembly program. Besides the result, the EFLAGS shall also be returned.
Here is my C Program:
int add(int a, int b, unsigned short* flags);//function must be used this way
int main(void){
unsigned short* flags = NULL;
printf("%d\n",add(30000, 36000, flags));// printing just the result
return 0;
}
For the moment the Program just prints out the result and not the flags because I am unable to get them.
Here is the assembly program for which I use NASM:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
mov esp,ebp
pop ebp
ret
Now this all works smoothly. But I have no idea how to get the pointer which must be at [ebp+16] pointing to the flag register. The professor said we will have to use the pushfd command.
My problem just lies in the assembly code. I will modify the C Program to give out the flags after I get the solution for the flags.
Normally you'd just use a debugger to look at flags, instead of writing all the code to get them into a C variable for a debug-print. Especially since decent debuggers decode the condition flags symbolically for you, instead of or as well as showing a hex value.
You don't have to know or care which bit in FLAGS is CF and which is ZF. (This information isn't relevant for writing real programs, either. I don't have it memorized, I just know which flags are tested by different conditions like jae or jl. Of course, it's good to understand that FLAGS are just data that you can copy around, save/restore, or even modify if you want)
Your function args and return value are int, which is 32-bit in the System V 32-bit x86 ABI you're using. (links to ABI docs in the x86 tag wiki). Writing a function that only looks at the low 16 bits of its input, and leaves high garbage in the high 16 bits of the output is a bug. The int return value in the prototype tells the compiler that all 32 bits of EAX are part of the return value.
As Michael points out, you seem to be saying that your assignment requires using a 16-bit ADD. That will produce carry, overflow, and other conditions with different inputs than if you looked at the full 32 bits. (BTW, this article explains carry vs. overflow very well.)
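To see which conditions a 16-bit ADD produces for given inputs without reaching for the debugger, the flag logic can be modeled in C. This is a sketch that covers only CF, ZF, SF and OF (real FLAGS also contain PF, AF and others); the type and function names are made up, while the bit positions are the architectural ones.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_CF 0x0001u  /* bit 0:  carry out of the top bit */
#define FLAG_ZF 0x0040u  /* bit 6:  result is zero */
#define FLAG_SF 0x0080u  /* bit 7:  top bit of the result */
#define FLAG_OF 0x0800u  /* bit 11: signed overflow */

typedef struct {
    uint16_t sum;    /* what AX would hold after the ADD */
    uint16_t flags;  /* only the four bits defined above */
} add16_result;

/* Model of a 16-bit "add ax, bx" restricted to CF, ZF, SF and OF. */
add16_result add16(uint16_t a, uint16_t b) {
    add16_result r;
    uint32_t wide = (uint32_t)a + (uint32_t)b;  /* exact sum, at most 17 bits */
    r.sum = (uint16_t)wide;
    r.flags = 0;
    if (wide > 0xFFFFu)  r.flags |= FLAG_CF;
    if (r.sum == 0)      r.flags |= FLAG_ZF;
    if (r.sum & 0x8000u) r.flags |= FLAG_SF;
    /* signed overflow: inputs share a sign bit that the result does not have */
    if (~(a ^ b) & (a ^ r.sum) & 0x8000) r.flags |= FLAG_OF;
    return r;
}

/* Self-check with the inputs from the question. */
void add16_demo(void) {
    add16_result r = add16(30000, 36000);
    assert(r.sum == 464);          /* 66000 wraps modulo 65536 */
    assert(r.flags == FLAG_CF);    /* unsigned carry, no signed overflow */
}
```

With the question's inputs the 16-bit sum wraps to 464 and sets the carry, while a 32-bit ADD of the same values would give 66000 with no carry at all.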
Here's what I'd do. Note the 16-bit operand size for the ADD, followed by sign-extension of the result into EAX.
global add
section .text
add:
push ebp
mov ebp,esp ; stack frames are optional, you can address things relative to ESP
mov eax, [ebp+8] ; first arg: No need to avoid loading the full 32 bits; the next insn doesn't care about the high garbage.
add ax, [ebp+12] ; low 16 bits of second arg. Operand-size implied by AX
cwde ; sign-extend AX into EAX
mov ecx, [ebp+16] ; the pointer arg
pushf ; the simple straightforward way
pop edx
mov [ecx], dx ; Store the low 16 of what we popped. Writing word [ecx] is optional, because dx implies 16-bit operand-size
; be careful not to do a 32-bit store here, because that would write outside the caller's object.
; mov esp,ebp ; redundant: ESP is still pointing at the place we pushed EBP, since the push is balanced by an equal-size pop
pop ebp
ret
CWDE (the 16->32 form of the 8086 CBW instruction) is not to be confused with CWD (the AX -> DX:AX 8086 instruction). If you're not using AX, then MOVSX / MOVZX are a good way to do this.
The fun way: instead of using the default operand size and doing 32-bit push and pop, we can do a 16-bit pop directly into the destination memory address. That would leave the stack unbalanced, so we could either uncomment the mov esp, ebp again, or use a 16-bit pushf (with an operand-size prefix, which according to the docs makes it only push the low 16 FLAGS, not the 32-bit EFLAGS.)
; What I'd *really* do: maximum efficiency if I had to use the 32-bit ABI with args on the stack, instead of args in registers
global add
section .text
add:
mov eax, [esp+4] ; first arg, first thing above the return address
add ax, [esp+8] ; second arg
cwde ; sign-extend AX into EAX
mov ecx, [esp+12] ; the pointer
pushfw ; push the low 16 of FLAGS
pop word [ecx] ; pop into memory pointed to by unsigned short* flags
ret
Both PUSHFW and POP WORD will assemble with an operand-size prefix. output from objdump -Mintel, which uses slightly different syntax from NASM:
4000c0: 66 9c pushfw
4000c2: 66 8f 01 pop WORD PTR [ecx]
PUSHFW is the same as o16 PUSHF. In NASM, o16 applies the operand-size prefix.
If you only needed the low 8 flags (not including OF), you could use LAHF to load FLAGS into AH and store that.
PUSHFing directly into the destination is not something I'd recommend. Temporarily pointing the stack at some random address is not safe in general. Programs with signal handlers will use the space below the stack asynchronously. This is why you have to reserve stack space before using it with sub esp, 32 or whatever, even if you're not going to make a function call that would overwrite it by pushing more stuff on the stack. The only exception is when you have a red-zone.
Your C caller:
You're passing a NULL pointer, so of course your asm function segfaults. Pass the address of a local to give the function somewhere to store to.
#include <stdio.h>
int add(int a, int b, unsigned short* flags);
int main(void) {
unsigned short flags;
int result = add(30000, 36000, &flags);
printf("%d %#hx\n", result, flags);
return 0;
}
This is just a simple approach. I didn't test it, but you should get the idea...
Just set ESP to the pointer value, increment it by 2 (even for 32-bit arch) and PUSHF like this:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
; --- here comes the mod
mov esp, [ebp+16] ; this will set ESP to the pointers address "unsigned short* flags"
lea esp, [esp+2] ; adjust ESP to address above target
db 66h ; operand size prefix for 16-bit PUSHF (alternatively 'db 0x66', depending on your assembler)
pushf ; this will save the lower 16 bits of EFLAGS to the WORD that the pointer at [EBP+16] points to (ESP+2-2)
; --- here ends the mod
mov esp,ebp
pop ebp
ret
This should work, because a 16-bit PUSHF decrements ESP by 2 and then saves the value to WORD PTR [ESP]. Therefore ESP has to be increased by 2 after loading the pointer into it. Setting ESP to the appropriate value lets you choose the target of PUSHF directly.

Assembly: access quad-word, double word, and byte quantity of same register in function

I have some assembly code generated from C that I'm trying to make sense of. One part I just can't understand:
movslq %edx,%rcx
movzbl (%rdi,%rcx,1),%ecx
test %cl,%cl
What doesn't make sense is that %rcx, %ecx, and %cl are all in the same register (quad word, double word, and byte, respectively). How could a data type access all three in the same function like this?
Having a char* makes it improbable to access %ecx in this way, and similarly having an int* makes accessing %cl unlikely. I simply have no idea what data type could be stored in %rcx.
re: your comment: You can tell it's a byte array because it's scaling %rcx by 1.
Like Michael said in a comment, this is doing
int func(char *array /* rdi */, int pos /* edx */)
{
if (array[pos]) ...;
// or maybe
int tmpi = array[pos];
char tmpc = tmpi;
if (tmpc) ...;
}
edx has to get sign-extended to 64 bits (movslq %edx, %rcx) before being used as an offset in an effective address. If it was unsigned, it would still need to be zero-extended (e.g. mov %edx, %ecx). The ABI doesn't guarantee that the upper 32 bits of a register are zeroed or sign-extended when the parameter being passed in a register is smaller than 64 bits.
In general, it's better to write at least 32b of a register, to avoid a false dependency on the previous contents on some CPUs. Only Intel P6/SnB family CPUs track the pieces of integer registers separately (and insert an extra uop to merge them with the old contents if you do something like read %ecx after writing %cl.)
So it's perfectly reasonable for a compiler to emit that code with the zero-extending movzbl load instead of just mov (%rdi,%rcx,1), %cl. It will potentially run faster on Silvermont and AMD. (And P4. Optimizations for old CPUs do hang around in compiler source code...)
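The three widths in the snippet correspond to three different conversions, which can be sketched in C. The function names below are borrowed from the AT&T mnemonics purely for illustration; they model the value each instruction produces and are not how a compiler represents them.

```c
#include <assert.h>
#include <stdint.h>

/* movslq %edx, %rcx : sign-extend a 32-bit int to 64 bits. */
int64_t movslq(int32_t edx) { return (int64_t)edx; }

/* mov %edx, %ecx : writing a 32-bit register zero-extends into the
   full 64-bit register, which is what an unsigned index would need. */
uint64_t mov32(uint32_t edx) { return (uint64_t)edx; }

/* movzbl (%rdi,%rcx,1), %ecx : load one byte and zero-extend it to 32 bits. */
uint32_t movzbl(const unsigned char *rdi, int64_t rcx) { return rdi[rcx]; }

/* Self-check: a negative int index only works when it is sign-extended. */
void extension_demo(void) {
    const unsigned char bytes[2] = {0x10, 0xF0};
    assert(movslq(-1) == -1);
    assert(mov32(0xFFFFFFFFu) == UINT64_C(0x00000000FFFFFFFF));
    assert(movzbl(bytes + 1, movslq(-1)) == 0x10u);  /* bytes[1 - 1] */
    assert(movzbl(bytes, 1) == 0xF0u);               /* 0xF0, not a negative value */
}
```

So %rcx holds the widened index, %ecx then holds the zero-extended byte that was loaded, and %cl is simply the low 8 bits of that same value, which is all that test needs.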

Trouble understanding gcc's assembly output

While writing some C code, I decided to compile it to assembly and read it--I just sort of, do this from time to time--sort of an exercise to keep me thinking about what the machine is doing every time I write a statement in C.
Anyways, I wrote these two lines in C
asm(";move old_string[i] to new_string[x]");
new_string[x] = old_string[i];
asm(";shift old_string[i+1] into new_string[x]");
new_string[x] |= old_string[i + 1] << 8;
(old_string is an array of char, and new_string is an array of unsigned short, so given two chars, 42 and 43, this will put 4342 into new_string[x])
Which produced the following output:
#move old_string[i] to new_string[x]
movl -20(%ebp), %esi #put address of first char of old_string in esi
movsbw (%edi,%esi),%dx #put first char into dx
movw %dx, (%ecx,%ebx,2) #put first char into new_string
#shift old_string[i+1] into new_string[x]
movsbl 1(%esi,%edi),%eax #put old_string[i+1] into eax
sall $8, %eax #shift it left by 8 bits
orl %edx, %eax #or edx into it
movw %ax, (%ecx,%ebx,2) #?
(I'm commenting it myself, so I can follow what's going on).
I compiled it with -O3, so I could also sort of see how the compiler optimizes certain constructs. Anyways, I'm sure this is probably simple, but here's what I don't get:
the first section copies a char out of old_string[i], and then movw's it (from dx) to (%ecx,%ebx). Then the next section, copies old_string[i+1], shifts it, ors it, and then puts it into the same place from ax. It puts two 16 bit values into the same place? Wouldn't this not work?
Also, it shifts old_string[i+1] to the high-order dword of eax, then ors edx (new_string[x]) into it... then puts ax into the memory! Wouldn't ax just contain what was already in new_string[x]? so it saves the same thing to the same place in memory twice?
Is there something I'm missing? Also, I'm fairly certain that the rest of the compiled program isn't relevant to this snippet... I've read around before and after, to find where each array and different variables are stored, and what the registers' values would be upon reaching that code--I think that this is the only piece of the assembly that matters for these lines of C.
--
oh, turns out GNU assembly comments are started with a #.
Okay, so it was pretty simple after all.
I figured it out with a pen and paper, writing down each step, what it did to each register, and then wrote down the contents of each register given an initial starting value...
What got me was that it was using 32 bit and 16 bit registers for 16 and 8 bit data types...
This is what I thought was happening:
first value put into memory as, say, 0001 (I was thinking 01).
second value (02) loaded into 32 bit register (so it was like, 00000002, I was thinking, 0002)
second value shifted left 8 bits (00000200, I was thinking, 0200)
first value (00000001, I thought 0001) or'd into second value (00000201, I thought 0201)
16 bit register put into memory (0201, I was thinking, just 01 again).
I didn't get why it wrote it to memory twice though, or why it was using 32 bit registers (well, actually, my guess is that a 32 bit processor is way faster at working with 32 bit values than it is with 8 and 16 bit values, but that's a totally uneducated guess), so I tried rewriting it:
movl -20(%ebp), %esi #gets pointer to old_string
movsbw (%edi,%esi),%dx #old_string[i] -> dx (0001)
movsbw 1(%edi,%esi),%ax #old_string[i + 1] -> ax (0002)
salw $8, %ax #shift ax left (0200)
orw %dx, %ax #or dx into ax (0201)
movw %ax,(%ecx,%ebx,2) #doesn't write to memory until end
This worked exactly the same.
I don't know if this is an optimization or not (aside from taking one memory write out, which obviously is), but if it is, I know it's not really worth it and didn't gain me anything. In any case, I get what this code is doing now, thanks for the help all.
I'm not sure what's not to understand, unless I'm missing something.
The first 3 instructions load a byte from old_string into dx and store it to your new_string.
The next 3 instructions use what's already in dx, combine old_string[i+1] with it, and store the result as a 16-bit value (ax) to new_string.
"Also, it shifts old_string[i+1] to the high-order dword of eax, then ors edx (new_string[x]) into it... then puts ax into the memory! Wouldn't ax just contain what was already in new_string[x]? so it saves the same thing to the same place in memory twice?"
Now you see why optimizers are a Good Thing. That kind of redundant code shows up pretty often in unoptimized, generated code, because the generated code comes more or less from templates that don't "know" what happened before or after.
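For anyone who wants to trace the register contents without pen and paper, here is a small C model of the compiled sequence. It is a sketch: the function name is invented, and it assumes char is signed, as the movsbw/movsbl sign-extending loads in the listing suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the code generated for
     new_string[x]  = old_string[i];
     new_string[x] |= old_string[i + 1] << 8;
   Returns the 16-bit value finally stored to new_string[x]. */
uint16_t pack_two_chars(signed char first, signed char second) {
    uint16_t dx  = (uint16_t)(int16_t)first;    /* movsbw: sign-extend to 16 bits */
    uint32_t eax = (uint32_t)(int32_t)second;   /* movsbl: sign-extend to 32 bits */
    eax <<= 8;                                  /* sall $8, %eax */
    eax |= dx;                                  /* orl %edx, %eax (low 16 bits matter) */
    return (uint16_t)eax;                       /* movw %ax, ... stores bits 15..0 */
}

/* Self-check with the values used in the question and its follow-up. */
void pack_demo(void) {
    assert(pack_two_chars(0x42, 0x43) == 0x4342u);
    assert(pack_two_chars(0x01, 0x02) == 0x0201u);
}
```

The shift moves the second char into bits 15..8 of eax, not into its upper half, so ax holds both bytes when it is stored. One thing the model also shows: a negative first char sign-extends to 0xFFxx and fills the high byte with ones before the OR.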

Resources