Visual C++ inline assembler: difference of two offsets

I'm porting a chunk of code from MASM to C inline assembler (x86, Windows, MS VC).
The following is not real code, just a mock-up to give the idea. Let's say I have some data defined as a static array, or even a code chunk between two labels, and I need to get its size.
label1:
bla bla bla
label2:
....
mov eax, (offset label2 - offset label1)
Such code works like a charm in MASM, but in C I get the following error message:
"error C2425: '-' : non-constant expression in 'second operand'"
I can get this compiled:
mov eax, offset label1
mov eax, offset label2
I expected the compiler to evaluate (offset label2 - offset label1) at compile time, but it looks like I was wrong. I can't add offsets either (why? these are just two integers combined at compile time...?)
Sure, I can get
mov eax, offset label2
mov edx, offset label1
sub eax, edx
compiled, but that's extra code just for calculating a constant.
Can someone please explain what is wrong with my code?
Could it be something caused by relocation? How do I push it through?
Looking forward to an answer,
thank you.

Yes, it can be caused by the threat of relocation, but also by the threat of variable-length instructions for relative jumps. Most likely, to avoid such minor trouble, the assembler writers took the easy way out and implemented a one- or two-pass assembler that makes final decisions as soon as possible, and thus some convenient expressions are unsupported.
As already suggested in the comments, the assembler still probably supports the mov + sub combination.

The real assembler probably runs over the code in several passes before it has fixed addresses for all the labels. For example, some jumps have a short and a long form depending on how far the target is. If you have such a jump between the labels, the distance between them depends on which form ends up being used.
The C compiler might leave some of that to the linker/loader and not have the values fixed at compile time.
You could very well get the address calculation code down to two instructions:
mov EAX, offset Label2
sub EAX, offset Label1
I don't think this will exactly ruin the performance of the code.
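For completeness, here is a rough, untested sketch of how that workaround might sit inside an MSVC __asm block; the function name get_block_size and the variable size are just for illustration, and it assumes (as your own experiments suggest) that offset of an inline-asm label is accepted as a single operand:
int get_block_size(void)
{
    int size;
    __asm {
        jmp compute          ; skip over the bytes between the labels
    label1:
        nop                  ; stand-in for the real data / code chunk
        nop
    label2:
    compute:
        mov eax, offset label2
        sub eax, offset label1
        mov size, eax        ; hand the byte count back to C
    }
    return size;
}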

Related

Multiplication of corresponding values in an array

I want to write an x86 program that multiplies corresponding elements of 2 arrays (array1[0]*array2[0] and so on, for 5 elements) and stores the results in a third array. I don't even know where to start. Any help is greatly appreciated.
The first thing you'll want to get is an assembler. I'm personally a big fan of NASM; in my opinion it has a very clean and concise syntax. It's also what I started on, so that's what I'll use for this answer.
Other than NASM you have:
GAS
This is the GNU assembler. Unlike NASM, it has versions for many architectures, so if you switch architectures the directives and general way of working stay about the same; only the instructions change. GAS does, however, have the unfortunate downside of being somewhat unfriendly to people who want to use Intel syntax.
FASM
This is the Flat Assembler, an assembler written in assembly itself. Like NASM, it's unfriendly to people who want to use AT&T syntax. It has a few rough edges, but some people seem to prefer it for DOS applications (especially because there's a DOS port of it) and bare-metal work.
Now you might be reading 'AT&T syntax' and 'Intel syntax' and wondering what's meant by that. These are dialects of x86 assembly; they both assemble to the same machine code but reflect slightly different ways of thinking about each instruction. AT&T syntax tends to be more verbose, whereas Intel syntax tends to be more minimal; however, certain parts of AT&T syntax have nicer operand orderings than Intel syntax. A good demonstration of the difference is the mov instruction:
AT&T syntax:
movl (0x10), %eax
This means: get the long value (1 dword, aka 4 bytes) at address 0x10 and put it in the register eax. Take note of the fact that:
The mov is suffixed with the operand length.
The memory address is surrounded by parentheses (you can think of them like a pointer dereference in C)
The register is prefixed with %
The instruction moves the left operand into the right operand
Intel Syntax:
mov eax, [0x10]
Take note of the fact that:
We do not need to suffix the instruction with the operand size; the assembler infers it. There are situations where it can't, in which case we specify the size next to the address (see the example after this list).
The register is not prefixed
Square brackets are used to address memory
The second operand is moved into the first operand
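As an aside, here is the situation mentioned above where the size cannot be inferred: adding an immediate to memory, where neither operand implies a width. A small illustrative sketch in both dialects:
AT&T syntax:
addl $1, (%ebx)
Intel syntax:
add dword [ebx], 1
In the AT&T form the l suffix supplies the size; in the Intel form the dword keyword next to the address does.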
I will be using Intel syntax for this answer.
Once you've installed NASM on your machine you'll want a simple build script (when you start writing bigger programs use a Makefile or some other proper build system, but for now this will do):
nasm -f elf arrays.asm
ld -o arrays arrays.o -melf_i386
rm arrays.o
echo
echo " Done building, the file 'arrays' is your executable"
Remember to chmod +x the script or you won't be able to execute it.
Now for the code along with some comments explaining what everything means:
global _start ; The linker will be looking for this entrypoint, so we need to make it public
section .data ; We're going on to describe our data here
array_length equ 5 ; This is effectively a macro and isn't actually being stored in memory
array1 dd 1,4,1,5,9 ; dd means declare dwords
array2 dd 2,6,5,3,5
sys_exit equ 1
section .bss ; Data that isn't initialised with any particular value
array3 resd 5 ; Leave us 5 dword sized spaces
section .text
_start:
xor ecx,ecx ; index = 0 to start
; In a Linux static executable, registers are initialized to 0 so you could leave this out if you're never going to link this as a dynamic executable.
_multiply_loop:
mov eax, [array1+ecx*4] ; move the value at the given memory address into eax
; We calculate the address we need by first taking ecx (which tells us which
; item we want) multiplying it by 4 (i.e: 4 bytes/1 dword) and then adding it
; to our array's start address to determine the address of the given item
imul eax, dword [array2+ecx*4] ; This performs a 32-bit integer multiply
mov dword [array3+ecx*4], eax ; Move our result to array3
inc ecx ; Increment ecx
; While ecx is a general purpose register the convention is to use it for
; counting hence the 'c'
cmp ecx, array_length ; Compare the value in ecx with our array_length
jb _multiply_loop ; Loop again while ecx is below array_length
; Once the loop has concluded, execution simply falls through to _exit
_exit:
mov eax, sys_exit ; The system call we want
; ebx is already equal to 0, ebx contains the exit status
mov ebp, esp ; Prepare the stack before jumping into the system
sysenter ; Call the Linux kernel and tell it that our program has concluded
If you wanted the full 64-bit result of the 32-bit multiply, use one-operand mul. But normally you only want a result that's the same width as the inputs, in which case imul is most efficient and easiest to use. See links in the x86 tag wiki for docs and tutorials.
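For instance, a hypothetical variant of the loop body that keeps the full 64-bit product (this is only a sketch, not part of the program above; storing both halves would also need array3 to be reserved with resq 5 instead of resd 5):
mov eax, [array1+ecx*4]  ; first operand
mul dword [array2+ecx*4] ; unsigned multiply: EDX:EAX = full 64-bit product
; low half in EAX, high half in EDX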
You'll notice that this program has no output. I'm not going to cover writing the algorithm to print numbers because we'd be here all day; that's an exercise for the reader (or see this Q&A).
However, in the meantime we can run our program in gdbtui and inspect the data: use your build script to build, then open your program with the command gdbtui arrays. You'll want to enter these commands:
layout asm
break _exit
run
print (int[5])array3
And GDB will display the results.
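If that cast doesn't work with the bare symbols the assembler produces, examining the raw memory is an alternative:
x/5dw &array3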

Why does main initialize stack frame when there are no variables

why does this code:
#include "stdio.h"
int main(void) {
puts("Hello, World!");
}
decide to initialize a stack frame? Here is the assembly code:
.LC0:
.string "Hello, World!"
main:
push rbp
mov rbp, rsp
mov edi, OFFSET FLAT:.LC0
call puts
mov eax, 0
pop rbp
ret
Why does the compiler initialize a stack frame only for it to be destroyed later, without it ever being used? Surely this won't cause any errors outside of the main function, because I never use the stack. Why is it compiled this way?
Having these steps in every compiled function is the "baseline" for the compiler, unoptimized. It looks clean in disassembly, and makes sense. However, the compiler can optimize the output to reduce overhead from code that has no real effect. You can see this by compiling with different optimization levels.
What you got is like this:
.LC0:
.string "Hello, World!"
main:
push rbp
mov rbp, rsp
mov edi, OFFSET FLAT:.LC0
call puts
mov eax, 0
pop rbp
ret
That's compiled in GCC with no optimization.
Adding the flag -O4 (which GCC treats the same as -O3) gives this output:
.LC0:
.string "Hello, World!"
main:
sub rsp, 8
mov edi, OFFSET FLAT:.LC0
call puts
xor eax, eax
add rsp, 8
ret
You'll notice that this still moves the stack pointer, but it skips setting up the base pointer and avoids the memory accesses involved in saving and restoring it.
The stack is assumed to be aligned on a 16-byte boundary. With the return address having been pushed, this leaves another 8 bytes to be subtracted to get to the boundary before the function call.
It's very common for compilers to generate unoptimized code in the least complicated way possible (or at least the least complicated way that doesn't lead to code that's so bad that the optimizer won't be able to fix it) to keep the code simple and to stick to the one-responsibility principle (in the sense that making code more efficient is the optimizer's job).
Generating code to initialize the stack for all functions is less complicated than only doing so where necessary. Since the optimizer will be able to remove the unnecessary code anyway (and it will do so in more cases than a simple "does this function have any local variables?" check would), generating the unnecessary code won't have any effect as long as optimizations are enabled (and if they're not, it's expected that the generated code will contain inefficiencies).
If we did add a "does this function have any local variables?" check to the function that generates the stack-initialization code, we'd be re-inventing a less powerful version of an optimization that the optimizer already performs anyway, so we'd be violating the one-responsibility principle and increasing the complexity of the part of the compiler that could otherwise be relatively simple (as opposed to the optimizer, which is full of complicated algorithms anyway).
The stack frame makes it possible to inspect the call stack during runtime. This is useful:
when debugging
in code that relies on __builtin_frame_address(level) with level > 0 (a minimal sketch follows this list)
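A minimal sketch of the builtin in action (the function names are just for the example; compile with GCC or Clang, and note that levels above 0 are only reliable when frame pointers are actually kept):
#include <stdio.h>

static void show_frames(void)
{
    /* level 0: this function's own frame; level 1: the caller's frame */
    printf("callee frame: %p\n", __builtin_frame_address(0));
    printf("caller frame: %p\n", __builtin_frame_address(1));
}

int main(void)
{
    show_frames();
    return 0;
}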
As already pointed out by others, a compiler may omit the stack frame at higher optimization levels.
See also:
How do you get gcc's __builtin_frame_address to work with -O2?

what is this assembly code in C?

I have some assembly code below that I don't really understand. My thoughts are that it is meaningless. Unfortunately I can't provide any more instruction information. What would the output in C be?
0x1000: iretd
0x1001: cli
0x1002: in eax, dx
0x1003: inc byte ptr [rdi]
0x1005: add byte ptr [rax], al
0x1007: add dword ptr [rbx], eax
0x1009: add byte ptr [rax], al
0x100b: add byte ptr [rdx], 0
0x100e: add byte ptr [rax], al
Thanks
The first four bytes (if I did reconstruct them correctly: CF FA ED FE, i.e. iretd, cli, in eax, dx, and the first byte of inc byte ptr [rdi]) form, read as a little-endian dword, the 32-bit value 0xFEEDFACF.
Putting that into google led me to:
https://gist.github.com/softboysxp/1084476#file-gistfile1-asm-L15
%define MH_MAGIC_64 0xfeedfacf
Aren't you accidentally disassembling a Mach-O x64 executable from Mac OS X as raw machine code, instead of reading the file's metadata correctly and disassembling only the code section?
P.S. In questions like this one, please also include the source machine code bytes, so experienced people can check the disassembly against a different target platform (like 32-bit x86, 16-bit real-mode code, or a completely different CPU); that helps in case you mistakenly disassembled the machine code for the wrong target platform. I had to first assemble your disassembly to see the raw bytes.
iretd is "return from interrupt", cli is "clear interrupt flag" which means disable all maskable interrupts. The C language does not understand the concept of an interrupt, so it is unlikely that this was compiled from C. In fact, this isn't a single complete fragment of code.
Also add byte ptr [rdx], 0 is adding 0 to a value which doesn't make sense to me unless it is the result of an unoptimised compilation or the result of disassembling something that isn't code.

Displaying PSW content

I'm a beginner with asm, so I've been researching my question for a while, but the answers were unsatisfactory. I'm wondering how to display the PSW content on standard output. Another thing: how do I display the Instruction Pointer value? I would be very grateful if you could give me a hint (or better, a sketch of code). It may be MASM or 8086 as well (actually I don't know what the difference is :) )
The instruction pointer is not directly accessible on the x86 family; however, it is quite straightforward to retrieve its value - though it will never be exact.
Since a subroutine call places the return address on the stack, you just need to copy it from there and voilà! You have the address of the opcode following the call instruction:
proc getInstructionPointer
push bp
mov bp,sp
mov ax,[word ptr ss:bp + 2] ; the return address pushed by the call is at [bp + 2]
mov sp,bp
pop bp
ret
endp getInstructionPointer
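Calling it then looks like this (a sketch; AX receives the offset of the instruction that follows the call):
call getInstructionPointer
mov bx, ax ; BX now holds (approximately) the current instruction pointer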
The PSW on the x86 is called the Flags register. There are two instructions that explicitly reference it: pushf and popf. As you might have guessed, you can simply push the Flags onto the stack and pop them into any general purpose register you like:
pushf
pop ax
Displaying these values consists of converting their values to ASCII and writing them onto the screen. There are several ways of doing this - search for "string output assembly", I bet you find the answer.
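As a rough illustration only, here is a sketch that shows AX (for example, the flags you just popped) as four hexadecimal digits. It assumes DOS is available, so that int 21h function 02h can print the character in DL; the label names are made up:
mov cx, 4 ; four nibbles to print
print_nibble:
push cx
mov cl, 4
rol ax, cl ; rotate the next nibble into the low 4 bits (8086-safe form)
pop cx
push ax ; AH and AL get clobbered below, so save AX
and al, 0Fh ; isolate the nibble
add al, '0'
cmp al, '9'
jbe have_digit
add al, 7 ; adjust 3Ah..3Fh up to 'A'..'F'
have_digit:
mov dl, al
mov ah, 02h ; DOS function: write the character in DL
int 21h
pop ax
loop print_nibble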
To dispel a minor confusion: 8086 is the CPU itself, whereas MASM is the assembler. The syntax is assembler-specific; MASM assembly is x86 assembly. TASM assembly is x86 assembly as well, just like NASM assembly.
When one says "x86 Assembly", he/she is referencing any of these (or others), talking about the instruction set, not the dialect.
Note that the above examples are 16-bit, intended for the 8086, and won't work on 80386+ in 32-bit mode.

Writing a JIT compiler in assembly

I've written a virtual machine in C which has decent performance for a non-JIT VM, but I want to learn something new, and improve performance. My current implementation simply uses a switch to translate from VM bytecode to instructions, which is compiled to a jump table. Like I said, decent performance for what it is, but I've hit a barrier that can only be overcome with a JIT compiler.
I've already asked a similar question not long ago about self-modifying code, but I came to realize that I wasn't asking the right question.
So my goal is to write a JIT compiler for this C virtual machine, and I want to do it in x86 assembly. (I'm using NASM as my assembler) I'm not quite sure how to go about doing this. I'm comfortable with assembly, and I've looked over some self-modifying code examples, but I haven't come to figure out how to do code generation just yet.
My main block so far is copying instructions to an executable piece of memory, with my arguments. I'm aware that I can label a certain line in NASM, and copy the entire line from that address with the static arguments, but that's not very dynamic, and doesn't work for a JIT compiler. I need to be able to interpret the instruction from bytecode, copy it to executable memory, interpret the first argument, copy it to memory, then interpret the second argument, and copy it to memory.
I've been informed about several libraries that would make this task easier, such as GNU lightning, and even LLVM. However, I'd like to write this by hand first, to understand how it works, before using external resources.
Are there any resources or examples this community could provide to help me get started on this task? A simple example showing two or three instructions like "add" and "mov" being used to generate executable code, with arguments, dynamically, in memory, would do wonders.
I wouldn't recommend writing a JIT in assembly at all. There are, however, good arguments for writing the most frequently executed bits of the interpreter in assembly. For an example of what this looks like, see this comment from Mike Pall, the author of LuaJIT.
As for the JIT, there are many different levels with varying complexity:
Compile a basic block (a sequence of non-branching instructions) by simply copying the interpreter's code. For example, the implementations of a few (register-based) bytecode instructions might look like this:
; ebp points to virtual register 0 on the stack
instr_ADD:
<decode instruction>
mov eax, [ebp + ecx * 4] ; load first operand from stack
add eax, [ebp + edx * 4] ; add second operand from stack
mov [ebp + ebx * 4], eax ; write back result
<dispatch next instruction>
instr_SUB:
... ; similar
So, given the instruction sequence ADD R3, R1, R2; SUB R3, R3, R4, a simple JIT could copy the relevant parts of the interpreter's implementation into a new machine code chunk:
mov ecx, 1
mov edx, 2
mov ebx, 3
mov eax, [ebp + ecx * 4] ; load first operand from stack
add eax, [ebp + edx * 4] ; add second operand from stack
mov [ebp + ebx * 4], eax ; write back result
mov ecx, 3
mov edx, 4
mov ebx, 3
mov eax, [ebp + ecx * 4] ; load first operand from stack
sub eax, [ebp + edx * 4] ; subtract second operand from stack
mov [ebp + ebx * 4], eax ; write back result
This simply copies the relevant code, so we need to initialise the registers used accordingly. A better solution would be to translate it directly into machine instructions such as mov eax, [ebp + 4], but then you already have to manually encode the requested instructions.
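For instance, the hand encoding of mov eax, [ebp + 4] works out to (a worked sketch using the standard x86 encoding):
8B 45 04 ; 8B = mov r32, r/m32; ModRM byte 45h selects EAX and [EBP + disp8]; 04 = the 8-bit displacement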
Copying code like this removes the overhead of interpretation, but otherwise does not improve efficiency much. If the code is executed only once or twice, it may not be worth translating it to machine code first (which requires flushing at least parts of the I-cache).
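To make the 'copy machine code into executable memory and jump to it' step concrete, here is a minimal sketch in C (the language the VM is already written in), assuming a POSIX system with mmap; it only emits mov eax, imm32 followed by ret, so it is a toy rather than a code generator. Note that some hardened systems forbid writable-and-executable mappings, in which case a real JIT maps the page writable and later uses mprotect to make it executable:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* template: mov eax, imm32 ; ret */
    unsigned char code[] = { 0xB8, 0, 0, 0, 0, 0xC3 };
    int value = 42;
    memcpy(code + 1, &value, sizeof value);  /* patch the immediate operand */

    void *mem = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return 1;
    memcpy(mem, code, sizeof code);          /* copy the generated code in */

    int (*fn)(void) = (int (*)(void))mem;    /* jump into the generated code */
    printf("%d\n", fn());                    /* prints 42 */

    munmap(mem, sizeof code);
    return 0;
}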
While some JITs use the above technique instead of an interpreter, they then employ a more complicated optimisation mechanism for frequently executed code. This involves translating the executed bytecode into an intermediate representation (IR) on which additional optimisations are performed.
Depending on the source language and the type of JIT, this can be very complex (which is why many JITs delegate this task to LLVM). A method-based JIT needs to deal with joining control-flow graphs, so such JITs use SSA form and run various analyses on it (e.g., HotSpot).
A tracing JIT (like LuaJIT 2) only compiles straight line code which makes many things easier to implement, but you have to be very careful how you pick traces and how you link multiple traces together efficiently. Gal and Franz describe one method in this paper (PDF). For another method see the LuaJIT source code. Both JITs are written in C (or perhaps C++).
I suggest you look at the project http://code.google.com/p/asmjit/. By using the framework it provides, you can save a lot of energy. If you want to write everything by hand, just read the source and rewrite it yourself; I think it's not very hard.
