Multiplication of corresponding values in an array - arrays

I want to write an x86 program that multiplies corresponding elements of 2 arrays (array1[0]*array2[0] and so on till 5 elements) and stores the results in a third array. I don't even know where to start. Any help is greatly appreciated.

The first thing you'll want is an assembler. I'm personally a big fan of NASM: in my opinion it has a very clean and concise syntax, and it's also what I started on, so that's what I'll use for this answer.
Other than NASM you have:
GAS
This is the GNU assembler, unlike NASM there are versions for many architectures so the directives and way of working will be about the same other than the instructions if you switch architectures. GAS does however have the unfortunate downside of being somewhat unfriendly for people who want to use the Intel syntax.
FASM
This is the Flat Assembler, it is an assembler written in Assembly. Like NASM it's unfriendly to people who want to use AT&T syntax. It has a few rough edges but some people seem to prefer it for DOS applications (especially because there's a DOS port of it) and bare metal work.
Now you might be reading 'AT&T syntax' and 'Intel syntax' and wondering what's meant by that. These are dialects of x86 assembly; they both assemble to the same machine code but reflect slightly different ways of thinking about each instruction. AT&T syntax tends to be more verbose whereas Intel syntax tends to be more minimal, although certain parts of AT&T syntax have nicer operand orderings than Intel syntax. A good demonstration of the difference is the mov instruction:
AT&T syntax:
movl (0x10), %eax
This means get the long value (1 dword, aka 4 bytes) stored at address 0x10 and put it in the register eax. Take note of the fact that:
The mov is suffixed with the operand length.
The memory address is surrounded by parentheses (you can think of them like a pointer dereference in C)
The register is prefixed with %
The instruction moves the left operand into the right operand
Intel Syntax:
mov eax, [0x10]
Take note of the fact that:
We do not need to suffix the instruction with the operand size; the assembler infers it. In situations where it can't, we specify the size next to the address.
The register is not prefixed
Square brackets are used to address memory
The second operand is moved into the first operand
I will be using Intel syntax for this answer.
Once you've installed NASM on your machine you'll want a simple build script (when you start writing bigger programs use a Makefile or some other proper build system, but for now this will do):
nasm -f elf arrays.asm
ld -o arrays arrays.o -melf_i386
rm arrays.o
echo
echo " Done building, the file 'arrays' is your executable"
Remember to chmod +x the script or you won't be able to execute it.
Now for the code along with some comments explaining what everything means:
global _start ; The linker will be looking for this entrypoint, so we need to make it public
section .data ; We're going on to describe our data here
array_length equ 5 ; This is effectively a macro and isn't actually being stored in memory
array1 dd 1,4,1,5,9 ; dd means declare dwords
array2 dd 2,6,5,3,5
sys_exit equ 1
section .bss ; Data that isn't initialised with any particular value
array3 resd 5 ; Leave us 5 dword sized spaces
section .text
_start:
xor ecx,ecx ; index = 0 to start
; In a Linux static executable, registers are initialized to 0 so you could leave this out if you're never going to link this as a dynamic executable.
_multiply_loop:
mov eax, [array1+ecx*4] ; move the value at the given memory address into eax
; We calculate the address we need by first taking ecx (which tells us which
; item we want) multiplying it by 4 (i.e: 4 bytes/1 dword) and then adding it
; to our array's start address to determine the address of the given item
imul eax, dword [array2+ecx*4] ; This performs a 32-bit integer multiply
mov dword [array3+ecx*4], eax ; Move our result to array3
inc ecx ; Increment ecx
; While ecx is a general purpose register the convention is to use it for
; counting hence the 'c'
cmp ecx, array_length ; Compare the value in ecx with our array_length
jb _multiply_loop ; Restart the loop while ecx is still below the array length
; If the loop has concluded the instruction pointer will continue
_exit:
mov eax, sys_exit ; The system call we want
; ebx is already equal to 0, ebx contains the exit status
mov ebp, esp ; Prepare the stack before jumping into the system
sysenter ; Call the Linux kernel and tell it that our program has concluded
If you wanted the full 64-bit result of the 32-bit multiply, use one-operand mul. But normally you only want a result that's the same width as the inputs, in which case imul is most efficient and easiest to use. See links in the x86 tag wiki for docs and tutorials.
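For comparison, here is a rough C sketch of the same loop, plus the widening variant mentioned above. The function names are made up for illustration, and unsigned types are an assumption chosen so that the same-width product wraps modulo 2^32 (the low 32 bits are the same as what imul produces).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same-width product, like two-operand imul: the high half is discarded. */
void multiply_arrays(const uint32_t *a, const uint32_t *b, uint32_t *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];           /* wraps modulo 2^32 */
}

/* Full product, like one-operand mul leaving the result in edx:eax. */
void multiply_arrays_wide(const uint32_t *a, const uint32_t *b, uint64_t *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = (uint64_t)a[i] * b[i]; /* widen one operand before multiplying */
}
```

With the arrays from the answer (1,4,1,5,9 and 2,6,5,3,5) the first function produces 2, 24, 5, 15, 45. The two functions only differ when a product no longer fits in 32 bits.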
You'll notice that this program has no output. I'm not going to cover writing the algorithm to print numbers because we'd be here all day, that's an exercise for the reader (or see this Q&A)
However in the meantime we can run our program in gdbtui and inspect the data, use your build script to build then open your program with the command gdbtui arrays. You'll want to enter these commands:
layout asm
break _exit
run
print (int[5])array3
And GDB will display the results.
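As for the number-printing exercise mentioned earlier, the usual approach is repeated division by 10, collecting remainders as digits and then reversing them. A minimal C sketch of that idea (the name u32_to_decimal is hypothetical, and it handles unsigned values only) could serve as a reference when porting it to assembly, where div gives you the quotient and remainder in one instruction:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of value into buf (NUL-terminated) and returns
   the number of digits. buf needs room for 11 bytes: 10 digits plus NUL. */
size_t u32_to_decimal(uint32_t value, char *buf)
{
    char tmp[10];
    size_t n = 0;
    assert(buf);
    do {                                  /* do/while so that 0 prints as "0" */
        tmp[n++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (size_t i = 0; i < n; i++)        /* digits came out least significant first */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
    return n;
}
```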

Related

Could you use C inline assembly to align instructions? (without Compiler optimizations)

I have to do a university project where we have to use cache optimizations to improve the performance of a given code but we must not use compiler optimizations to achieve it.
One of the ideas I had reading the bibliography is to align the beginning of a basic block to the cache line size. But can you do something like:
asm(".align 64;")
for(int i = 0; i<N; i++)
... (whole basic block)
in order to achieve what I'm looking for? I have no idea if it's possible to do that in terms of instruction alignment. I've seen some trick like _mm_malloc to achieve data alignment but none for instructions. Could anyone please give me some light on the matter?
TL:DR: This might not be very useful (since modern x86 with a uop cache often doesn't care about code alignment; see Footnote 1), but does "work" in front of a do{}while() loop, which can compile directly to asm with the same layout, without any loop setup (prologue) instructions before the actual top of the loop (the target of the backwards branch).
In general, https://gcc.gnu.org/wiki/DontUseInlineAsm and especially never use GNU C Basic asm("foo"); inside a function, but in debug mode (the -O0 default, aka optimizations disabled) each statement (including asm();) compiles to a separate block of asm in source order. So your case doesn't actually need Extended asm(".p2align 4" ::: "memory") to order the asm statement wrt. memory operations. (Also in recent GCC, a memory clobber is implicit for Basic asm with a non-empty template string). At worst with optimization enabled the padding could go somewhere useless and hurt performance, but not correctness, unlike most uses of asm().
How this actually compiles
This does not exactly work; a C for loop compiles to some asm instructions before the asm loop. Especially when using a for(a;b;c) loop with some before-first-iteration initialization in statement a! You can of course pull that out in the source, but GCC's -O0 strategy for compiling while and for loops is to enter the loop with a jmp to the condition at the bottom.
But that jmp alone is only one small (2-byte) instruction, so aligning before that would put the top of the loop near the start of a possible instruction fetch block, which still gets most of the benefit if that was ever a bottleneck. (Or near the start of a new group of uop-cache lines on Sandybridge-family x86, where 32-byte boundaries are relevant. Or even a 64-byte I-cache line, although that's rarely relevant and could result in a lot of NOPs executed to reach that boundary. And bloated code size.)
void foo(register int *p)
{
// always use .p2align n or .balign 1<<n so it's unambiguous across targets like MacOS vs. Linux, never .align
asm(" .p2align 5 # from inline asm");
for (register int *endp = p + 102400; p<endp ; p++) {
*p += 123;
}
}
Compiles as follows on the Godbolt compiler explorer. Note that the way I used register meant I got not-terrible asm despite the debug build, and didn't have to combine p++ into p++ <= endp or *(p++) += 123; to make store/reload overhead less bad (because there isn't any in the first place for register locals). And I used a pointer increment / compare to keep the asm simple, and harder for debug mode to deoptimize into more wasted asm instructions.
# GCC11.3 -O0 (the default with no options, except for -masm=intel added by Godbolt)
foo:
push rbp
mov rbp, rsp
push rbx # GCC stupidly picks a call-preserved reg it has to save
mov rax, rdi
.p2align 5 # from inline asm
lea rbx, [rax+409600] # endp = p+102400
jmp .L2 # jump to the p<endp condition before the first iteration
## The actual top of the loop. 9 bytes past the alignment boundary
.L3: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx # A memory destination add dword [rax], 123 would be 2 uops for the front-end (fused-domain) on Intel, vs. 3 for 3 separate instructions.
add rax, 4 # p++
.L2:
cmp rax, rbx
jb .L3 # }while(p<endp)
nop
nop # These aren't for alignment, IDK what this is for.
mov rbx, QWORD PTR [rbp-8] # restore RBX
leave # and restore RBP / tear down stack frame
ret
This loop is 5 uops long (assuming macro-fusion of cmp/JCC), so can run at 1 cycle per iteration on Ice Lake or Zen, if all goes well. (Load / store of 1 dword per cycle is not much memory bandwidth, so that should keep up over a large array, maybe even if it doesn't fit in L3 cache.) Or on Haswell for example, maybe 1.25 cycles per iteration, or maybe a little worse due to loop-buffer effects.
If you use "binary" output mode on Godbolt, you can see that lea rbx, [rax+409600] is a 7-byte instruction, while jmp .L2 is 2 bytes, and that the address of the top of the loop is 0x401149, i.e. 9 bytes into the 16-byte fetch-block, on CPUs that fetch in that size. I aligned by 32, so it's only wasted 2 uops out of the first uop cache line associated with this block, so we're still relatively good in terms of 32-byte blocks.
(Godbolt "binary" mode compiles and links into an executable, and runs objdump -d on that. That also lets us see the .p2align directive expanded into a NOP instruction of some width, or more than one if it had to skip more than 11 bytes, the default max NOP width for GAS for x86-64. Remember these NOP instructions have to get fetched and go through the pipeline every time control passes over this asm statement, so huge alignment inside a function is a bad thing for that as well as for I-cache footprint.)
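The "9 bytes into the fetch block" arithmetic above is just a mask of the low address bits. As a small sketch (helper names are made up for illustration), this is how one might check where a branch target sits inside an aligned block of any power-of-two size:

```c
#include <assert.h>
#include <stdint.h>

/* Offset of addr inside its aligned block; block_size must be a power of two. */
unsigned offset_in_block(uint64_t addr, unsigned block_size)
{
    assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
    return (unsigned)(addr & (block_size - 1));
}

/* Bytes from addr to the end of its aligned block, i.e. the part of the
   block that is useful when addr is a branch target. */
unsigned useful_bytes_in_block(uint64_t addr, unsigned block_size)
{
    return block_size - offset_in_block(addr, block_size);
}
```

For 0x401149 the offset is 9 for both a 16-byte and a 32-byte block, which matches the 7-byte lea plus the 2-byte jmp that follow the .p2align 5.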
A fairly obvious transformation gets the LEA before the .p2align. (See the asm in the Godbolt link for all of these source versions if you're curious).
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
for ( ; p < endp ; p++) {
*p += 123;
}
Or while (p < endp){... ; p++} also does the trick. The top of the asm loop becomes the following, with only a 2-byte jmp to the loop condition. So this is pretty decent, and gets most of the benefit.
lea rbx, [rax+409600]
.p2align 5 # from inline asm
jmp .L5 # 2-byte instruction
.L6:
It might be possible to achieve the same thing with for(foo=bar, asm(".p2align 4") ; p<endp ; p++). But if you're declaring a variable in the first part of a for statement, the comma operator won't work to let you sneak in a separate statement.
To actually align the asm loop, we can write it as a do{}while.
register int *endp = p + 102400;
asm(" .p2align 5 # from inline asm");
do {
*p += 123;
p++;
}while(p < endp);
lea rbx, [rax+409600]
.p2align 5 # from inline asm
.L8: # do{
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
cmp rax, rbx
jb .L8 # while(p<endp)
No jmp at the start, no branch-target label inside the loop. (Which is interesting if you wanted to try -falign-labels=32 to get GCC to pad for you without having it put NOPs inside the loop. See below: -falign-loops doesn't work at -O0.)
Since I'm hard-coding a non-zero size, no p == endp check runs before the first iteration. If that length was a runtime variable, e.g. a function arg, you could do if(n==0) return; before the loop. Or more generally, put the loop inside an if like GCC does when compiling a for or while loop with optimization enabled, if it can't prove that it always runs at least one iteration.
if(n!=0) {
register int *endp = p + n;
asm (".p2align 4");
do {
...
}while(p!=endp);
}
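Filled in as a complete function, the pattern above might look like the following sketch. It assumes GCC or clang with a GAS-compatible assembler; add_123 and cur are names I made up, and I spelled the keyword __asm__ so it still compiles under strict modes like -std=c11 where plain asm isn't a keyword.

```c
#include <assert.h>
#include <stddef.h>

/* Adds 123 to each of the n ints starting at p. */
void add_123(int *p, size_t n)
{
    assert(p || n == 0);
    if (n != 0) {                       /* guard: a do{}while body always runs once */
        register int *cur = p;
        register int *endp = p + n;
        __asm__ (".p2align 4");         /* GNU C Basic asm: pad to a 16-byte boundary */
        do {
            *cur += 123;
            cur++;
        } while (cur != endp);
    }
}
```

Whether the padding really lands directly in front of the loop label depends on the compiler and options, so check the asm output rather than trusting the source layout.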
Getting GCC to do this for you: -falign-loops=16 doesn't work at -O0
GCC -O2 enables -falign-loops=16:11:8 or something like that (align by 16 if that would skip fewer than 11 bytes, otherwise align by 8). That's why GCC uses a sequence of two .p2align directives, with a padding limit on the first one (see the GAS manual).
.p2align 4,,10 # what GCC does on its own
.p2align 3
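To make the effect of that directive pair concrete, here is a small C model of how much padding it produces at a given address. This is only a sketch of the documented GAS behaviour (if the alignment would need more than the max-skip count, that directive pads nothing); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Padding emitted by ".p2align log2_align,,max_skip": pad up to the boundary
   unless that needs more than max_skip bytes, in which case pad nothing. */
unsigned p2align_padding(uint64_t addr, unsigned log2_align, unsigned max_skip)
{
    assert(log2_align < 64);
    uint64_t mask = ((uint64_t)1 << log2_align) - 1;
    unsigned pad = (unsigned)((0 - addr) & mask);   /* bytes to the next boundary */
    return pad <= max_skip ? pad : 0;
}

/* Total padding from ".p2align 4,,10" followed by ".p2align 3". */
unsigned gcc_loop_padding(uint64_t addr)
{
    unsigned first = p2align_padding(addr, 4, 10);
    unsigned second = p2align_padding(addr + first, 3, 7); /* 7 = never skipped */
    return first + second;
}
```

So a loop top that would land at an address ending in 6 gets 10 bytes of padding up to the 16-byte boundary, while one ending in 5 would need 11, so only the 8-byte alignment applies and it gets 3 bytes.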
But using -falign-loops=16 does nothing at -O0. It seems GCC -O0 doesn't know what a loop is. :P
However, GCC does respect -falign-labels even at -O0. But unfortunately that applies to all labels, including the loop entry point inside the inner loop. Godbolt.
# gcc -O0 -falign-labels=16
## from compiling endp=...; asm(); while() {}
lea rbx, [rax+409600] # endp = ...
.p2align 5 # from inline asm
jmp .L5
.p2align 4 # from GCC itself, pads another 14 bytes to an odd multiple of 16 (if you didn't remove the manual .p2align 5)
.L6:
mov edx, DWORD PTR [rax]
add edx, 123
mov DWORD PTR [rax], edx
add rax, 4
.p2align 4 # from GCC itself: one 5-byte NOP in this particular case
.L5:
cmp rax, rbx
jb .L6
Putting a NOP inside the inner-most loop is worse than misaligning its start on modern x86 CPUs.
You don't have this problem with a do{}while() loop, but in that case it also seems to work to use asm() to put an alignment directive there.
(I used How to remove "noise" from GCC/clang assembly output? for the compile options to minimize clutter without filtering out directives, which would include .p2align. If I just wanted to see where the inline asm went, I could have used asm("nop #hi mom") to make it visible with directives filtered out.)
If you can use inline asm but must compile with anti-optimized debug mode, there are likely major speedups from rewriting the whole inner loop in inline asm, with input/output constraints. (But don't really do that; it's hard to get right and in real life a normal person would just enable optimizations as a first step.)
Footnote 1: code alignment doesn't help much on modern x86, may help some on others
This is unlikely to be helpful even if you do actually align the target of the backwards branch (rather than just some loop prologue); modern x86 CPUs with uop caches (Sandybridge-family and Zen-family) and loop buffers (Nehalem and later for Intel) don't care very much about loop alignment.
It could help more on an older x86 CPU, or maybe for some other ISAs; only x86 is so hard to decode that uop caches are a thing (You didn't actually specify x86, but currently most people are using x86 CPUs in their desktops/laptops so I'm assuming that.)
The main reason alignment of branch targets helps (especially tops of loops), is when the CPU fetches a 16-byte-aligned block that includes the target address, most of the machine code in that block will be after it, and thus part of loop body that's about to run another iteration. (Bytes before the branch target are wasted in that fetch cycle).
But the worst case of mis-alignment (barring other weird effects) just costs you 1 extra cycle of front-end fetch to get more instructions in the loop body. (e.g. if the top of the loop had an address ending with 0xf, so it was the last byte of a 16-byte block, the aligned 16-byte block containing that byte would only contain that one useful byte at the end.) That might be a one-byte instruction like cdq, but pipelines are often 4 instructions wide, or more.
(Or 3-wide in the early Intel P6-family days before there were buffers between fetch, pre-decode (length finding) and decode. Buffering can hide bubbles if the rest of the loop decodes efficiently and the average instruction-length is short. But decode was still a significant bottleneck until Nehalem's loop buffer could recycle the decode results (uops) for a small loop (a couple dozen uops). And Sandybridge-family added a uop cache to cache large loops that include multiple functions that get called frequently. David Kanter's deep-dive on SnB has nice block diagrams, and see also https://www.agner.org/optimize/ especially Agner's microarch pdf.)
Even then, it only helps at all when front-end (instruction fetch/decode) bandwidth is a problem, not some back-end bottleneck (actually executing those instructions). Out-of-order exec usually does a pretty good job of letting the CPU run as fast as the slowest bottleneck, not waiting until after a cache-miss load to get later instructions fetched and decoded. (See this, this, and especially Modern Microprocessors A 90-Minute Guide!.)
There are cases where it could help on a Skylake CPU where a microcode update disabled the loop buffer (LSD), so a tiny loop body split across a 32-byte boundary can run at best 1 iteration per 2 cycles (fetching uops from 2 separate cache lines). Or on Skylake again, tweaking code alignment this way could help avoid the JCC erratum (that can make part of your code run from legacy decode instead of the uop cache) if you can't pass -Wa,-mbranches-within-32B-boundaries to get the assembler to work around it. (How can I mitigate the impact of the Intel jcc erratum on gcc?). These problems are specific to Skylake-derived microarchitectures, and were fixed in Ice Lake.
Of course, anti-optimized debug-mode code is so bloated that even a tight loop is unlikely to be fewer than 8 uops anyway, so the 32-byte-boundary problem probably doesn't hurt much. But if you manage to avoid store/reload latency bottlenecks by using register on local vars (yes this does something in debug builds only, otherwise it's meaningless; see Footnote 2), the front-end bottleneck of getting all those inefficient instructions through the pipeline could well be impacted on a Skylake CPU if an inner loop ends up tripping over the JCC erratum due to where a conditional branch inside or at the bottom of the loop ends up.
Anyway, as Eric commented, your assignment is likely more about data access pattern, and possibly layout and alignment. Presumably involving a smallish loop over some large amounts of memory, since L2 or L3 cache misses are the only thing that would be slow enough to be more of a bottleneck than building with optimization disabled. Maybe L1d in some cases, if you manage to get a compiler to make non-terrible asm for debug mode, or if load-use latency (not just throughput) is part of the critical path.
Footnote 2: -O0 is dumb, but register int i can help
See C loop optimization help for final assignment (with compiler optimization disabled) re: how silly it is to optimize source code for debug mode, or benchmark that way for normal use-cases. But it also mentions some things that are faster for that case (unlike normal builds) like doing more in a single statement or expression, since the compiler doesn't keep things in registers across statements.
(See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details)
Except register variables; that obsolete keyword still does something for unoptimized builds with GCC (but not clang). It's officially deprecated or even removed in recent C++ versions, but not C as yet.
You definitely want to use register int i to let a debug build keep it in a register, and write your C like it was hand-written asm. For example, using pointer increments instead of arr[i] where appropriate, especially for ISAs that don't have an indexed addressing mode.
register variables are most important inside your inner loop, and with optimization disabled the compiler probably isn't very smart about deciding which register var actually gets a register if it runs out. (x86-64 has 15 integer regs other than the stack pointer, and a debug build will spend one of them on a frame pointer.)
Especially for variables that change inside loops, to avoid store/reload latency bottlenecks, e.g. for(register int i=1000000 ; --i ; ); probably runs 1 iteration per clock, vs. 5 or 6 without register on a modern x86-64 CPU like Skylake.
If using an integer variable as an array index, make it intptr_t or uintptr_t (#include <stdint.h>) if possible, so the compiler doesn't have to redo sign-extension from 32-bit int to 64-bit pointer width for use in addressing modes.
(Unless you're compiling for AArch64, which has addressing modes that take a 64-bit register and a 32-bit register, doing sign or zero extension and ignoring high garbage in the narrow integer reg. Exactly because this is something compilers can't always optimize away. Although often they can thanks to signed-integer overflow being Undefined Behaviour allowing the compiler to widen an integer loop variable or convert to a pointer increment.)
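A sketch of what that advice looks like in source, under the assumption that you're building with GCC at -O0: one loop indexes with a pointer-width register variable, the other uses a pointer increment. The function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer-width index: no sign-extension from 32-bit int is needed per access. */
long sum_indexed(const int *arr, size_t n)
{
    assert(arr);
    register long total = 0;
    for (register intptr_t i = 0; i < (intptr_t)n; i++)
        total += arr[i];
    return total;
}

/* Pointer increment: written the way hand-written asm would walk the array. */
long sum_pointer(const int *arr, size_t n)
{
    assert(arr);
    register long total = 0;
    register const int *end = arr + n;
    for (register const int *p = arr; p != end; p++)
        total += *p;
    return total;
}
```

Both return the same result; any difference is only in the asm a debug build produces, so it's worth comparing the two on Godbolt for your own loop.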
Also loosely related: Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs has a section on intentionally making things slow via cache effects, so do the opposite of that. Might not be very applicable, IDK what your problem is like.

How to index arrays properly in x86 assembly

I am trying to make sure that I understand the SI and DI registers. My background in assembly language is somewhat limited to 6502, so bear with me here.
I have a quick example of how I would go about using SI as a simple counter. I am a bit concerned that I might be misusing this register though.
mov si, 0 ; set si to 0
mov cx, 5 ; set cx to 5 as we will count down to 1
do:
mov ah, 02h ; setup 02h DOS character output interrupt
mov dl, [table + si] ; grab our table with the si offset
add dl, '0' ; convert to ascii integer
int 21h ; call DOS service
inc si ; increment si
loop do ; repeat until cx = 0
ret
table: db 1,2,3,4,5
---
OUTPUT:> 12345
Is this the right way to use SI? I know in 6502 assembly, you can use the X and Y registers to offset arrays / tables. However, in my studies of x86, I am starting to realize how much more there is to work with. Such as how CX is automatically decremented in the 'loop' instruction.
I am hoping that moving forward, I will be able to save resources by writing efficient code.
Thank you in advance for your input.
This use of SI is perfectly fine. SI has the benefit of being a preserved register in most Intel calling conventions. Also, historically, SI was one of the few registers that you could use as an index in a memory load operation; in a modern Intel CPU, any register would do.
SI still gets some special treatment with the lods instruction.
Your program actually works fine. Adding org $100 at the beginning, I managed to assemble it with FASM and run it in DosBox.
On the 6502 you have two index registers (X and Y) that you can use in different ways (direct, indirect, indirect indexed, indexed indirect, ...).
On the x86 you have 4 registers that can be used as pointer registers: BX, BP, SI and DI (in 32-bit mode you can use nearly all registers)
BX or BP can be combined with SI or DI (Example: [BX+DI+10])
BP is typically used for storing the old stack pointer when entering a function (when using a C compiler). However there is no misuse of registers (unless you use the stack pointer for something different) when you program in assembler. You cannot do anything wrong!
But be careful: On the x86 (in 16-bit mode) you also have to care about the segment registers - this is what the 6502 does not have!
These registers are needed because you can only address 64 KiB using a 16-bit register but the 8086 has a 1 MiB address space. To solve this an address is composed of a 16-bit segment and a 16-bit offset so an address is effectively not 16 but 32 bits long. The exact meaning of the first 16 bits depends on the operating mode of the CPU.
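For the simplest case, real mode on an 8086 (which is what a DOS .COM program runs in), the segment is shifted left by 4 bits and added to the offset. A small C sketch of that translation, with a made-up function name:

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode 8086 address translation: physical = segment * 16 + offset,
   truncated to the 20 address lines of the 8086. */
uint32_t real_mode_physical(uint16_t segment, uint16_t offset)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    uint32_t physical = linear & 0xFFFFFu;
    assert(physical < (1u << 20));      /* always inside the 1 MiB space */
    return physical;
}
```

Note that many segment:offset pairs name the same byte: 1234:0010 and 1235:0000 both give 0x12350.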
The following segment registers are present:
CS: CS:IP is the instruction pointer
SS: SS:SP is the stack pointer; used for SP and BP pointer operations by default
DS: Used for all other pointer operations (all but SP and BP) by default
ES: Additional register
FS, GS: Additional registers since 80386
You can override the default segment register to be used:
MOV AX,ES:[SI+100] ; Load from ES:SI+100 instead of DS:SI+100
String operations (like movsb) access DS:SI and ES:DI by default. A segment override prefix can change the segment used for the SI source, but the ES:DI destination cannot be changed.
That's an alright use of SI. But you could use several other registers in its place (although beware that unlike 32-bit x86, 16-bit x86 code limits the set of registers on which indexing is supported. The ModR/M encoding governs this.)
You might want to consider doing an add si, table before the loop and mov dl, [si] inside it. It makes the loop slightly easier for the human to read, because there's one less variable in play.
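In C terms, the two loop shapes discussed here (base plus index versus a single moving pointer) look roughly like this. It's only an analogy to the asm, and the function names are invented:

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form, like mov dl, [table + si]: fixed base plus a running index. */
void digits_to_ascii_indexed(const unsigned char *table, size_t n, char *out)
{
    assert(table && out);
    for (size_t i = 0; i < n; i++)
        out[i] = (char)(table[i] + '0');
    out[n] = '\0';
}

/* Pointer form, like mov dl, [si] once si holds the table address. */
void digits_to_ascii_pointer(const unsigned char *table, size_t n, char *out)
{
    assert(table && out);
    const unsigned char *end = table + n;
    while (table != end)
        *out++ = (char)(*table++ + '0');
    *out = '\0';
}
```

Both need out to have room for n + 1 bytes, and for the table 1,2,3,4,5 both produce the string "12345".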

Displaying PSW content

I'm a beginner with asm, so I've been researching my question for a while, but the answers were unsatisfactory. I'm wondering how to display the PSW content on standard output. Another thing: how do I display the Instruction Pointer value? I would be very grateful if you could give me a hint (or better, a sketch of code). It may be masm or 8086 as well (actually I don't know what the difference is :) )
The instruction pointer is not directly accessible on the x86 family, however, it is quite straightforward to retrieve its value - it will never be accurate though.
Since a subroutine call places the return address on the stack, you just need to copy it from there and voilà! You have the address of the opcode following the call instruction:
proc getInstructionPointer
push bp
mov bp,sp
mov ax,[word ptr ss:bp + 2]
mov sp,bp
pop bp
ret
endp getInstructionPointer
The PSW on the x86 is called the Flags register. There are two operations that explicitly reference it: pushf and popf. As you might have guessed, you can simply push the Flags onto the stack and load it to any general purpose register you like:
pushf
pop ax
Displaying these values consists of converting their values to ASCII and writing them onto the screen. There are several ways of doing this - search for "string output assembly", I bet you'll find the answer.
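As a rough guide to the conversion step, here is a C sketch that turns a 16-bit value (the Flags image or the IP) into four hex digits, and tests individual flag bits. The function names are hypothetical; the bit positions are the standard ones (bit 0 = CF, bit 6 = ZF, bit 7 = SF, bit 11 = OF).

```c
#include <assert.h>
#include <stdint.h>

/* Writes value as four uppercase hex digits plus a terminating NUL (5 bytes). */
void u16_to_hex(uint16_t value, char buf[5])
{
    static const char digits[] = "0123456789ABCDEF";
    assert(buf);
    for (int i = 0; i < 4; i++)
        buf[i] = digits[(value >> (12 - 4 * i)) & 0xF];  /* top nibble first */
    buf[4] = '\0';
}

/* Tests one bit of a FLAGS image, e.g. bit 0 = CF, bit 6 = ZF. */
int flag_is_set(uint16_t flags, unsigned bit)
{
    assert(bit < 16);
    return (flags >> bit) & 1;
}
```

In assembly the same thing is a loop of four shift-and-mask steps with a table lookup, followed by whatever output call your environment provides.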
To dispel a minor confusion: 8086 is the CPU itself, whereas MASM is the assembler. The syntax is assembler-specific; MASM assembly is x86 assembly. TASM assembly is x86 assembly as well, just like NASM assembly.
When one says "x86 Assembly", he/she is referencing any of these (or others), talking about the instruction set, not the dialect.
Note that the above examples are 16-bit, intended for the 8086, and won't work on 80386+ in 32-bit mode.

Visual-C++ inline assembler difference of two offsets

I'm porting a chunk of code from MASM to C inline assembler (x86, Windows, MS VC)
The following is not real code, just a mock-up to give an idea. Let's say I have some data defined as a static array, or even a code chunk between two labels, and I need to get its size.
label1:
bla bla bla
label2:
....
mov eax, (offset label2 - offset label1)
Such a code works in MASM like a charm, but in C I get following error message:
"error C2425: '-' : non-constant expression in 'second operand'"
I can get compiled:
mov eax, offset label1
mov eax, offset label2
I expect compiler to evaluate (offset label1 - offset label2) at compile time, but it looks like I'm wrong. I can't add offsets as well (why? these are just two integers added during compilation...?)
Sure, I can get
mov eax, offset label2
mov edx, offset label1
sub eax, edx
compiled, but that's an extra code just for calculating a constant.
Can someone please explain to me what is wrong in my code?
Can it be something caused by relocation? How to push it through?
Looking forward to an answer,
thank you.
Yes, it can be caused by the threat of relocation but also threat of variable length instructions dealing with relative jumps. Most likely because of some minor trouble, the assembler writers took the easy way out and implemented a 1 pass or a two pass compiler that makes final decisions as soon as possible. And thus some convenient expressions are unsupported.
As already suggested in the comment, the assembler still probably supports mov + sub combination.
The real assembler is probably running over the code in several passes before it has gotten fixed addresses for all the labels. For example, some jumps have a short and a long form depending on how far you want to jump. If you have such a jump between the labels, the distance depends on where the jump is going to.
The C compiler might leave some of that to the linker/loader and not have the values fixed at compile time.
You could very well get the address calculation code down to two instructions
mov EAX, offset Label2
sub EAX, offset Label1
I don't think this will exactly ruin the performance of the code.

Writing a JIT compiler in assembly

I've written a virtual machine in C which has decent performance for a non-JIT VM, but I want to learn something new, and improve performance. My current implementation simply uses a switch to translate from VM bytecode to instructions, which is compiled to a jump table. Like I said, decent performance for what it is, but I've hit a barrier that can only be overcome with a JIT compiler.
I've already asked a similar question not long ago about self-modifying code, but I came to realize that I wasn't asking the right question.
So my goal is to write a JIT compiler for this C virtual machine, and I want to do it in x86 assembly. (I'm using NASM as my assembler) I'm not quite sure how to go about doing this. I'm comfortable with assembly, and I've looked over some self-modifying code examples, but I haven't come to figure out how to do code generation just yet.
My main block so far is copying instructions to an executable piece of memory, with my arguments. I'm aware that I can label a certain line in NASM, and copy the entire line from that address with the static arguments, but that's not very dynamic, and doesn't work for a JIT compiler. I need to be able to interpret the instruction from bytecode, copy it to executable memory, interpret the first argument, copy it to memory, then interpret the second argument, and copy it to memory.
I've been informed about several libraries that would make this task easier, such as GNU lightning, and even LLVM. However, I'd like to write this by hand first, to understand how it works, before using external resources.
Are there any resources or examples this community could provide to help me get started on this task? A simple example showing two or three instructions like "add" and "mov" being used to generate executable code, with arguments, dynamically, in memory, would do wonders.
I wouldn't recommend writing a JIT in assembly at all. There are good arguments for writing the most frequently executed bits of the interpreter in assembly. For an example of what this looks like see this comment from Mike Pall, the author of LuaJIT.
As for the JIT, there are many different levels with varying complexity:
Compile a basic block (a sequence of non-branching instructions) by simply copying the interpreter's code. For example, the implementations of a few (register-based) bytecode instructions might look like this:
; ebp points to virtual register 0 on the stack
instr_ADD:
<decode instruction>
mov eax, [ebp + ecx * 4] ; load first operand from stack
add eax, [ebp + edx * 4] ; add second operand from stack
mov [ebp + ebx * 4], eax ; write back result
<dispatch next instruction>
instr_SUB:
... ; similar
So, given the instruction sequence ADD R3, R1, R2, SUB R3, R3, R4 a simple JIT could copy the relevant parts of the interpreters implementation into a new machine code chunk:
mov ecx, 1
mov edx, 2
mov ebx, 3
mov eax, [ebp + ecx * 4] ; load first operand from stack
add eax, [ebp + edx * 4] ; add second operand from stack
mov [ebp + ebx * 4], eax ; write back result
mov ecx, 3
mov edx, 4
mov ebx, 3
mov eax, [ebp + ecx * 4] ; load first operand from stack
sub eax, [ebp + edx * 4] ; subtract second operand from stack
mov [ebp + ebx * 4], eax ; write back result
This simply copies the relevant code, so we need to initialise the registers used accordingly. A better solution would be to translate this directly into machine instructions mov eax, [ebp + 4], but now you already have to manually encode the requested instructions.
This technique removes the overheads of interpretation, but otherwise does not improve efficiency much. If the code is executed only once or twice, then it may not be worth it to first translate it to machine code (which requires flushing at least parts of the I-cache).
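To give an idea of what manually encoding the instructions involves, here is a minimal C sketch that emits the 32-bit x86 bytes for the direct form (mov eax, [ebp + 4] and friends) into a buffer. It assumes 4-byte virtual registers at [ebp + 4*n] as in the example above, only handles displacements that fit in 8 bits, and all names are made up. It does not make the buffer executable or run it; that part is OS-specific and left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_MOV_LOAD = 0x8B, OP_ADD_LOAD = 0x03, OP_SUB_LOAD = 0x2B, OP_MOV_STORE = 0x89 };

/* Emits "<op> eax, [ebp + disp8]" (or the store form): opcode, ModRM, disp8.
   ModRM 0x45 = mod 01 (8-bit displacement), reg 000 (eax), r/m 101 (ebp). */
static size_t emit_eax_ebp(uint8_t *buf, size_t pos, uint8_t opcode, int8_t disp)
{
    buf[pos++] = opcode;
    buf[pos++] = 0x45;
    buf[pos++] = (uint8_t)disp;
    return pos;
}

/* Translates "ADD/SUB Rdst, Ra, Rb"; returns the new write position. */
size_t emit_binop(uint8_t *buf, size_t pos, uint8_t opcode, int dst, int a, int b)
{
    assert(buf);
    assert(dst >= 0 && dst < 32 && a >= 0 && a < 32 && b >= 0 && b < 32); /* disp8 range */
    pos = emit_eax_ebp(buf, pos, OP_MOV_LOAD, (int8_t)(a * 4));   /* mov eax, [ebp+4a] */
    pos = emit_eax_ebp(buf, pos, opcode, (int8_t)(b * 4));        /* add/sub eax, [ebp+4b] */
    pos = emit_eax_ebp(buf, pos, OP_MOV_STORE, (int8_t)(dst * 4));/* mov [ebp+4dst], eax */
    return pos;
}
```

Calling emit_binop(buf, 0, OP_ADD_LOAD, 3, 1, 2) for ADD R3, R1, R2 writes the 9 bytes 8B 45 04, 03 45 08, 89 45 0C. That's 3 instructions instead of the 6 in the copied version, since the register numbers are baked into the displacements.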
While some JITs use the above technique instead of an interpreter, they then employ a more complicated optimisation mechanism for frequently executed code. This involves translating the executed bytecode into an intermediate representation (IR) on which additional optimisations are performed.
Depending on the source language and the type of JIT, this can be very complex (which is why many JITs delegate this task to LLVM). A method-based JIT needs to deal with joining control-flow graphs, so they use SSA form and run various analyses on that (e.g., Hotspot).
A tracing JIT (like LuaJIT 2) only compiles straight line code which makes many things easier to implement, but you have to be very careful how you pick traces and how you link multiple traces together efficiently. Gal and Franz describe one method in this paper (PDF). For another method see the LuaJIT source code. Both JITs are written in C (or perhaps C++).
I suggest you look at the project http://code.google.com/p/asmjit/. By using the framework it provides, you can save a lot of energy. If you want to write everything by hand, just read the source and rewrite it yourself; I think it's not very hard.

Resources