Don't really understand how arrays work in Assembly

I've just started learning Assembly and I got stuck now...
%include 'io.inc'
global main
section .text
main:
; read a
mov eax, str_a
call io_writestr
call io_readint
mov [nb_array], eax
call io_writeln
; read b
mov eax, str_b
call io_writestr
call io_readint
mov [nb_array + 2], eax
call io_writeln
mov eax, [nb_array]
call io_writeint
call io_writeln
mov eax, [nb_array + 2]
call io_writeint
section .data
nb_array dw 0, 0
str_a db 'a = ', 0
str_b db 'b = ', 0
So I have a two-element array, and when I try to print the first element, it doesn't print the right value, although when I print the second element, it prints the right value. Could someone help me understand why this is happening?

The best answer is probably "because there are no arrays in Assembly". You have computer memory available, which is addressable by bytes. And you have several instructions to manipulate those bytes, either one byte at a time or in groups forming a "word" (two bytes), a "dword" (four bytes), or even more (depending on the platform and the extended instructions you use).
To use the memory in any "structured" way in Assembly, it's up to you to write the code that imposes that structure, and it takes some practice to be accurate enough and to spot all the bugs in a debugger. Just running the code and seeing correct output doesn't mean much: if you did only a single value input, your program would output the correct number, but the "a = " string would be destroyed anyway. You should rather walk through every new piece of code instruction by instruction in the debugger and verify that everything works as expected.
Bugs in similar code were so common that people preferred the much worse machine code produced by C compilers, because C structs and arrays were much easier to use: you didn't have to guard the by-size multiplication of every index, or allocate the correct amount of memory for every element, by hand.
What you see as a result is exactly what you did to the memory and its particular bytes. The fix depends on whether you want this to work for 16-bit or 32-bit input numbers: either fix the instructions storing/reading the array to work with 16 bits only, or fix the array allocation and the offsets to accommodate two 32-bit values.
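For example, a minimal 32-bit fix (a sketch, assuming io_readint returns its result in eax, as the code above implies): declare dword elements and use a 4-byte stride. The 16-bit alternative would keep nb_array dw 0, 0 and the +2 offset, but store and load only ax.
section .data
nb_array dd 0, 0 ; two 32-bit elements instead of two 16-bit words
; stores: a full dword per element, elements 4 bytes apart
mov [nb_array], eax ; element 0 at offset 0
mov [nb_array + 4], eax ; element 1 at offset 4, not 2
; loads
mov eax, [nb_array]
mov eax, [nb_array + 4]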


Why does the stack frame also store instructions (besides data)? What is the precise mechanism by which instructions on the stack frame get executed?

Short version:
0: 48 c7 c7 ee 4f 37 45 mov $0x45374fee, %rdi
7: 68 60 18 40 00 pushq $0x401860
c: c3 retq
How can these three instructions (at offsets 0, 7, and c), saved in the stack frame, get executed? I thought the stack frame only stores data; does it also store instructions? I know data is read into registers, but how do these instructions get executed?
Long version:
I am self-studying 15-213(Computer Systems) from CMU. In the Attack lab, there is an instance (phase 2) where the stack frame gets overwritten with "attack" instructions. The attack happens by then overwriting the return address from the calling function getbuf() with the address %rsp points to, which I know is the top of the stack frame. In this case, the top of the stack frame is in turn injected with the attack code mentioned above.
Here is the question: by reading the book (CSAPP), I get the sense that the stack frame only stores data that has overflown from the registers (including the return address, extra arguments, etc.). But I don't get why it can also store instructions (attack code) and have them executed. How exactly does the content of the stack frame, which %rsp points to, get executed? I also know that %rsp points to the return address of the calling function; the point being, that is an address, not an instruction. So by exactly which mechanism does a supposed address get executed as an instruction? I am very confused.
Edit: Here is a link to the question(4.2 level 2):
http://csapp.cs.cmu.edu/3e/attacklab.pdf
This is a post that is helpful for me in understanding: https://github.com/magna25/Attack-Lab/blob/master/Phase%202.md
Thanks for your explanation!
The ret instruction pops a pointer from the current position of the stack and jumps to it. If, while in a function, you modify the stack so that the slot ret will pop points to another function or piece of code, the program will "return" to it; that is what can be used maliciously.
The code below doesn't necessarily compile, and it is just meant to represent the concept.
For example, we have two functions: add(), and badcode():
int add(int a, int b)
{
return a + b;
}
void badcode()
{
// Some very bad code
}
Let's also assume that we have a stack like the one below when we call add():
...
0x00....18 extra arguments
0x00....10 return address
0x00....08 saved RBP
0x00....00 local variables and etc.
...
If during the execution of add we manage to change the return address to the address of badcode(), then on the ret instruction we will automatically start executing badcode(). I don't know if this answers your question.
Edit:
An instruction is simply an array of numbers. Where you store them is (mostly) irrelevant to their execution. A stack is essentially an abstract data structure; it is not a special place in RAM. If your OS doesn't mark the stack as non-executable, there is nothing stopping the code on the stack from being returned to by ret.
Edit 2:
I get the sense that the stack frame only stores data that is overflown
from the registers(including return address, extra arguments, etc.)
I do not think you know how registers, RAM, the stack, and programs fit together. The sense that the stack frame only stores data that has overflown is incorrect.
Let's start over.
Registers are pieces of memory on your CPU. They are independent of RAM. There are eight main general-purpose registers on an x86 CPU: a, b, c, d, si, di, sp, and bp. a is the accumulator and is generally used for arithmetic operations; likewise, b stands for base, c stands for counter, d stands for data, si stands for source, di stands for destination, sp is the stack pointer, and bp is the base pointer.
On 16-bit computers, a, b, c, d, si, di, sp, and bp are 16 bits (2 bytes). The a, b, c, and d registers are usually written as ax, bx, cx, and dx, where the x stands for extension from their original 8-bit versions. The 32-bit versions are eax, ecx, edx, ebx, esi, edi, esp, and ebp (e again stands for extended), and the 64-bit versions are rax, rcx, rdx, rbx, rsi, rdi, rsp, and rbp.
Once again, these are on your CPU and are independent of RAM. The CPU uses these registers to do everything that it does. You want to add two numbers? Put one of them in ax and the other in cx, and add them.
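For example (16-bit registers; a minimal sketch):
mov ax, 3 ; first number into the accumulator
mov cx, 4 ; second number into the counter register
add ax, cx ; ax now holds 7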
You also have RAM. RAM (Random Access Memory) is a storage device that lets you access or modify any of its values with equal computational cost (hence the term random access). Each value that RAM holds also has an address that determines where in RAM the value is. The CPU can treat numbers as addresses and use them to access locations in RAM. Numbers used for such purposes are called pointers.
A stack is an abstract data structure. It has a FILO (first in, last out) structure, which means that to access the first datum that you have stored, you have to get past all of the other data. To manipulate the stack, the CPU provides us with sp, which holds a pointer to the current position of the stack, and bp, which holds the base of the current stack frame. The position that bp holds is called the top of the stack because the stack usually grows downwards: if we start a stack at the memory address 0x100 and store 4 bytes on it, sp will then be at the memory address 0x100 - 4 = 0xFC. To do such operations automatically, we have the push and pop instructions. In that sense, a stack can be used to store any type of data, regardless of the data's relation to registers or programs.
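A sketch of what push and pop do to sp (32-bit registers here, so each value is 4 bytes):
push eax ; esp -= 4, then the value of eax is stored at [esp]
pop ebx ; the value at [esp] is loaded into ebx, then esp += 4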
Programs are pieces of structured code that are placed on the RAM by the operating system. The operating system reads program headers and relevant information and sets up an environment for the program to run on. For each program a stack is set up, usually, some space for the heap is given, and instructions (which are the building blocks of a program) are placed in arbitrary memory locations that are either predetermined by the program itself or automatically given by the OS.
Over the years, some conventions have been set to standardize CPUs. For example, on most CPUs the ret instruction pops a pointer-sized amount of data off the stack and jumps to it. Jumping means executing code at a particular RAM address. This is only a convention and has nothing to do with data overflowing from registers. For that reason, when a function is called, the return address (the address of the next instruction at the time of the call) is first pushed onto the stack so that it can be retrieved later by ret. Local variables are also stored on the stack, along with arguments if a function has more than six (in the x86-64 System V convention, the first six integer arguments are passed in registers).
Does this help?
I know it is a long read but I couldn't be sure on what you know and what you don't know.
Yet Another Edit:
Let's also take a look at the code from the PDF:
void test()
{
int val;
val = getbuf();
printf("No exploit. Getbuf returned 0x%x\n", val);
}
Phase 2 involves injecting a small amount of code as part of your exploit string.
Within the file ctarget there is code for a function touch2 having the following C representation:
void touch2(unsigned val)
{
vlevel = 2; /* Part of validation protocol */
if (val == cookie) {
printf("Touch2!: You called touch2(0x%.8x)\n", val);
validate(2);
} else {
printf("Misfire: You called touch2(0x%.8x)\n", val);
fail(2);
}
exit(0);
}
Your task is to get CTARGET to execute the code for touch2 rather than returning to test. In this case,
however, you must make it appear to touch2 as if you have passed your cookie as its argument.
Let's think about what you need to do:
You need to modify the stack of test() so that two things happen. The first is that you do not return to test() but rather to touch2(). The second is that you give touch2 an argument, which is your cookie. Since you are passing only one argument, you don't need to modify the stack for the argument at all: the first argument is passed in rdi as part of the x86_64 calling convention.
The final code that you write has to change the return address to touch2()'s address and also execute mov rdi, cookie.
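That is exactly what the three instructions in the short version at the top do; in that particular instance, 0x45374fee is presumably the cookie and 0x401860 the address of touch2 (Intel syntax):
mov rdi, 0x45374fee ; put the cookie in rdi, the first-argument register
push 0x401860 ; push touch2's address onto the stack
ret ; pop that address into rip: execution "returns" into touch2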
Edit:
Earlier I talked about RAM being able to store data at addresses and the CPU being able to interact with it. There is a special register on your CPU that you cannot access directly from your assembly code. This register is called ip/eip/rip; it stands for instruction pointer. It holds a 16/32/64-bit pointer to an address in RAM: the address of the instruction the CPU will execute next. With that in mind, we can say that what a ret instruction does is effectively
pop rip
which means: take the last 64 bits (8 bytes, the size of a pointer) off the stack and put them into the instruction pointer. Once rip is set to this value, the CPU begins executing the code at that address. The CPU doesn't do any checks on rip whatsoever. You can technically do the following (excuse me, my assembly is in Intel syntax):
mov rax, str ; move the RAM address of "str" into rax
push rax ; push rax into stack
ret ; return to the last pushed qword (8 bytes) on the stack
str: db "Hello, world!", 0 ; define a string
This code can call/execute a string. Your CPU will be very upset, though, that there is no valid instruction there, and the program will probably crash.

Why can't my own memcpy, written in NASM, copy more than 340000000 bytes?

I am learning NASM. I have written a simple function that copies memory from the source to the destination. I test it in C.
section .text
global _myMemcpy
_myMemcpy:
mov eax, [esp + 4]
mov ecx, [esp + 8]
add [esp + 12], eax
lp:
mov dl, [ecx]
mov [eax], dl
inc eax
inc ecx
cmp eax, [esp + 12]
jl lp
endlp:
mov eax, [esp + 4]
ret
And the C program:
#include <string.h>
#define Times 340000000
extern void* _myMemcpy(void* dest, void* src, size_t size);
char sr[Times];
char ds[Times];
int main(void)
{
memset(sr, 'a', Times);
_myMemcpy(ds, sr, Times);
return 0;
}
I am currently using Ubuntu. When I compile and link the two files with $ nasm -f elf m.asm && gcc -Wall -m32 m.o p.c && ./a.out, it works fine when the value of Times is less than 340000000. When it is greater, _myMemcpy copies only the first byte of the source to the destination. I can't figure out where the problem is. Any suggestion would be useful.
You're doing signed compares on pointers; don't do that. Use jne in this case since you will always reach exact equality at the exit point.
Or, if you want relational compares with pointers, unsigned conditions like jb and jae usually make the most sense. (It's normal to think of virtual address space as a flat linear 4GiB with the lowest address being 0, so pointer increments that cross the middle of that range still need to work.)
With arrays larger than your ~300MiB size, and the default linker script for PIE executables, apparently one of them will span the 2GiB boundary between signed-positive and signed-negative (see footnote 1). So the end-pointer you calculate will be "negative" if you treat it as a signed integer. (Unlike on x86-64, where the non-canonical "hole" spanning the middle of virtual address-space means that an array can never span the signed-wraparound boundary: Should pointer comparisons be signed or unsigned in 64-bit x86? - sometimes it does make sense to use signed compares there.)
You should see this with a debugger if you single-step and look at the pointer values, and the memory value you create with size += dest (add [esp + 12], eax). As a signed operation, that overflows to create a negative end_pointer, while the start pointer is still positive. pos < neg is false on the first iteration, so your loop exits, you can see this when single-stepping.
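A minimal fixed version of the loop (changing only the exit condition; like the original, this is a do-while that assumes size > 0):
_myMemcpy:
mov eax, [esp + 4] ; dest
mov ecx, [esp + 8] ; src
add [esp + 12], eax ; size += dest: one-past-the-end pointer
lp:
mov dl, [ecx]
mov [eax], dl
inc eax
inc ecx
cmp eax, [esp + 12]
jne lp ; exact equality at exit; jb (unsigned) also works
mov eax, [esp + 4] ; return dest
ret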
Footnote 1: On my system, under GDB (which disables ASLR), after start to get the executable mapped to Linux's default base address for PIEs (2/3 of the way into the low half of the address space, i.e. 0x5555...), I checked the addresses with your test case:
sr at 0x56559040
ds at 0x6a998d40
end of ds at p /x sizeof(ds) + ds = 0x7edd8a40
So if it were much bigger, it would cross 0x80000000. That's why 340000000 avoids your bug but larger sizes reveal it.
BTW, under a 32-bit kernel, Linux defaults to a 3:1 split of address space between kernel and user-space, so even there it's possible for this to happen. But under a 64-bit kernel, 32-bit processes can have the entire 4 GiB address space to themselves. (Except for a page or two reserved by the kernel: see also Why can't I mmap(MAP_FIXED) the highest virtual page in a 32-bit Linux process on a 64-bit kernel?. That also means that forming a pointer to one-past-end of any array like you're doing (which ISO C promises is valid to do), won't wrap around and will still compare above a pointer into the object.)
This won't happen in 64-bit mode: there's enough address space to just divide it evenly between user and kernel, as well as there being a giant non-canonical hole between high and low ranges.

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. If it goes down to assembly language, negation is done using two's complement. As for setting the variable to zero, you're storing the value 0 inside the int variable, and I'm not entirely sure what happens beneath that.
I got 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was in adding up the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
n = 0;
i++;
n++;
}
I got 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
n *= -1;
i++;
n++;
}
return (0);
I don't understand much about assembly, but it doesn't seem intuitive that the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice?
If it seems pointless, the reason for it is that I can easily implement a "checklist" by multiplying an integer on a map by -1 to mean it's already been checked (but I need to keep the value, so when I do the check I can just negate whatever I'm comparing it to). But I was wondering whether that's too slow; I could instead make a separate boolean 2D array to record whether each value was checked, or change my data structure into an array of structures so each element could hold an extra 1/0 int. I'm wondering what the best implementation will be: doing the -1 operation a billion times will already total around 5 seconds, not counting the rest of my program, but making a separate billion-entry int array or a billion structs doesn't seem like the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
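For what it's worth, testing and setting single bits of such a bitmap is cheap on x86; a minimal sketch (the register assignments and branch target are hypothetical):
; edi = bitmap base address, esi = element index
bts [edi], esi ; set bit esi: mark element esi as checked
; ...
bt [edi], esi ; CF = bit esi
jc .already_checked ; branch if that element was already marked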
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
That said, your results are easy to explain, since -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C statements inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code, however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning to variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.

Is there any operation in C analogous to this assembly code?

Today, I played around with incrementing function pointers in assembly code to create alternate entry points to a function:
.386
.MODEL FLAT, C
.DATA
INCLUDELIB MSVCRT
EXTRN puts:PROC
HLO DB "Hello!", 0
WLD DB "World!", 0
.CODE
dentry PROC
push offset HLO
call puts
add esp, 4
push offset WLD
call puts
add esp, 4
ret
dentry ENDP
main PROC
lea edx, offset dentry
call edx
lea edx, offset dentry
add edx, 13
call edx
ret
main ENDP
END
(I know, technically this code is invalid since it calls puts without the CRT being initialized, but it works without any assembly or runtime errors, at least on MSVC 2010 SP1.)
Note that in the second call to dentry I loaded the address of the function into the edx register, as before, but this time I incremented it by 13 bytes before calling the function.
The output of this program is therefore:
C:\Temp>dblentry
Hello!
World!
World!
C:\Temp>
The first output of "Hello!\nWorld!" is from the call to the very beginning of the function, whereas the second output is from the call starting at the "push offset WLD" instruction.
I'm wondering if this kind of thing exists in languages that are meant to be a step up from assembler like C, Pascal or FORTRAN. I know C doesn't let you increment function pointers but is there some other way to achieve this kind of thing?
AFAIK you can only write functions with multiple entry-points in asm.
You can put labels on all the entry points, so you can use normal direct calls instead of hard-coding the offsets from the first function-name.
This makes it easy to call them normally from C or any other language.
The earlier entry points work like functions that fall-through into the body of another function, if you're worried about confusing tools (or humans) that don't allow function bodies to overlap.
You might do this if the early entry-points do a tiny bit of extra stuff, and then fall through into the main function. It's mainly going to be a code-size saving technique (which might improve I-cache / uop-cache hit rate).
Compilers tend to duplicate code between functions instead of sharing large chunks of common implementation between slightly different functions.
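A sketch of that pattern in NASM syntax (the label names and the extra setup are hypothetical):
global func_with_default
global func
func_with_default: ; early entry point: a tiny bit of extra work...
mov edi, 1 ; ...e.g. supply a default value for an argument
; ...then fall through into the main body
func: ; the normal entry point
; ...main body...
ret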
However, you can probably accomplish it with only one extra jmp with something like:
int foo(int a) { return bigfunc(a + 1); }
int bar(int a) { return bigfunc(a + 2); }
int bigfunc(int x) { /* a lot of code */ }
See a real example on the Godbolt compiler explorer
foo and bar tailcall bigfunc, which is slightly worse than having bar fall-through into bigfunc. (Having foo jump over bar into bigfunc is still good, esp. if bar isn't that trivial.)
Jumping into the middle of a function isn't safe in general, because non-trivial functions usually need to save/restore some registers. The prologue pushes them, and the epilogue pops them. If you jump into the middle, skipping the prologue's pushes, then the pops in the epilogue will unbalance the stack (i.e. pop the return address into a register, and return to a garbage address).
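For instance (a sketch of why skipping the prologue breaks things):
bigfunc:
push rbx ; prologue: save a call-preserved register
; ...function body; a jump that lands here skipped the push...
pop rbx ; epilogue: this would pop the caller's return address into rbx
ret ; ...and ret would then jump to whatever garbage is next on the stack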
See also Does a function with instructions before the entry-point label cause problems for anything (linking)?
You can use the longjmp function: http://www.cplusplus.com/reference/csetjmp/longjmp/
It's a fairly horrible function, but it'll do what you seek.

Assembly: Creating an array of linked nodes

Firstly, if this question is inappropriate because I am not providing any code or not doing any thinking on my own, I apologize, and I will delete this question.
For an assignment, we are required to create an array of nodes to simulate a linked list. Each node has an integer value and a pointer to the next node in the list. Here is my .DATA section
.DATA
linked_list DWORD 5 DUP (?) ;We are allowed to assume the linked list will have 5 items
linked_node STRUCT
value BYTE ?
next BYTE ?
linked_node ENDS
I am unsure if I am defining my STRUCT correctly, as I am unsure of what the type of next should be. Also, I am confused about how to approach this problem. To insert a node into linked_list, I should be able to write mov [esi + TYPE linked_list * ecx], correct? Of course, I'd need to inc ecx every time. What I'm confused about is how to do mov linked_node.next, "pointer to next node". Is there some sort of operator that would allow me to set the pointer to the next index in the array equal to linked_node.next? Or am I thinking about this incorrectly? Any help would be appreciated!
Think about your design in terms of a language you are familiar with. Preferably C, because pointers and values in C are concepts that map directly to asm.
Let's say you want to keep track of your linked list by storing a pointer to the head element.
#include <stdint.h> // for int8_t
struct node {
int8_t next; // array index. More commonly, you'd use struct node *next;
// negative values for .next are a sentinel, like a NULL pointer, marking the end of the list
int8_t val;
};
struct node storage[5]; // .next field indexes into this array
uint8_t free_position = 0; // when you need a new node, take index = free_position++;
int8_t head = -1; // start with an empty list
There are tricks to reduce corner cases, like having the list head be a full node, rather than just a reference (pointer or index). You can treat it as a first element, instead of having to check for the empty-list case everywhere.
Anyway, given a node reference int8_t p (where p is the standard variable name for a pointer to a list node, in linked list code), the next node is storage[p.next]. The next node's val is storage[p.next].val.
Let's see what this looks like in asm. The NASM manual talks about how its macro system can help you make code using global structs more readable, but I haven't done any macro stuff for this. You might define macros for NEXT and VAL or something, with 0 and 1, so you can say [storage + rdx*2 + NEXT]. Or even a macro that takes an argument, so you could say [NEXT(rdx*2)]. If you're not careful, you could end up with code that's more confusing to read, though.
section .bss
storage: resw 5 ;; reserve 5 words of zero-initialized space
free_position: db 0 ;; uint8_t free_position = 0;
section .data
head: db -1 ;; int8_t head = -1;
section .text
; p is stored in rdx. It's an integer index into storage
; We'll access storage directly, without loading it into a register.
; (normally you'd have it in a reg, since it would be space you got from malloc/realloc)
; lea rsi, [rel storage] ;; If you want RIP-relative addressing.
;; There is no [RIP+offset + scale*index] addressing mode, because global arrays are for tiny / toy programs.
test edx, edx
js .err_empty_list ;; check for p=empty list (sign-bit means negative)
movsx eax, byte [storage + 2*rdx] ;; load p.next into eax, with sign-extension
test eax, eax
js .err_empty_list ;; check that there is a next element
movsx eax, byte [storage + 2*rax + 1] ;; load storage[p.next].val, sign extended into eax
;; The final +1 in the effective address is because the val byte is 2nd.
;; you could have used a 3rd register if you wanted to keep p.next around for future use
ret ;; or not, if this is just the middle of some larger function
.err_empty_list: ; .symbol is a local symbol, doesn't have to be unique for the whole file
ud2 ; TODO: report an error instead of running an invalid insns
Notice that we get away with shorter instruction encoding by sign-extending into a 32bit reg, not to the full 64bit rax. If the value is negative, we aren't going to use rax as part of an address. We're just using movsx as a way to zero-out the rest of the register, because mov al, [storage + 2*rdx] would leave the upper 56 bits of rax with the old contents.
Another way to do this would be to movzx eax, byte [...] / test al, al, because the 8-bit test is just as fast to encode and execute as a 32bit test instruction. Also, movzx as a load has one cycle lower latency than movsx, on AMD Bulldozer-family CPUs (although they both still take an integer execution unit, unlike Intel where movsx/zx is handled entirely by a load port).
Either way, movsx or movzx is a good way to load 8-bit data, because you avoid problems with reading the full reg after writing a partial reg, and/or a false-dependency (on the previous contents of the upper bits of the reg, even if you know you already zeroed it, the CPU hardware still has to track it). Except if you know you're not optimizing for Intel pre-Haswell, then you don't have to worry about partial-register writes. Haswell does dual-bookkeeping or something to avoid extra uops to merge the partial value with the old full value when reading. AMD CPUs, P4, and Silvermont don't track partial-regs separately from the full-reg, so all you have to worry about is the false dependency.
Also note that you can load the next and val packed together, like
.search_loop:
movzx eax, word [storage + rdx*2] ; next in al, val in ah
test ah, ah
jz .found_a_zero_val
movzx edx, al ; use .next for the next iteration
test al, al
jns .search_loop
;; if we get here, we didn't find a zero val
ret
.found_a_zero_val:
;; do something with the element referred to by `rdx`
Notice how we have to use movzx anyway, because all the registers in an effective address have to be the same size. (So word [storage + al*2] doesn't work.)
This is probably more useful going the other way, to store both fields of a node with a single store, like mov [storage + rdx*2], ax or something, after getting next into al, and val into ah, probably from separate sources. (This is a case where you might want to use a regular byte load, instead of a movzx, if you don't already have it in another register). This isn't a big deal: don't make your code hard to read or more complex just to avoid doing two byte-stores. At least, not until you find out that store-port uops are the bottleneck in some loop.
Using an index into an array, instead of a pointer, can save a lot of space, esp. on 64-bit systems where pointers take 8 bytes. If you don't need to free individual nodes (i.e. the data structure only ever grows, or is deleted all at once), then an allocator for new nodes is trivial: just keep sticking them at the end of the array, and realloc(3). Or use a C++ std::vector.
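Under that grow-only assumption, allocating a node from the storage array defined above is just a post-increment of free_position; a minimal sketch (no bounds check):
alloc_node: ; returns a fresh node index in eax
movzx eax, byte [free_position]
inc byte [free_position] ; free_position++; caller must ensure capacity
ret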
With those building blocks, you should be all set to implement the usual linked list algos. Just store bytes with mov [storage + rdx*2], al or whatever.
If you need ideas on how to implement linked lists with clean algos that handle all the special-cases with as few branches as possible, have a look at this Codereview question. It's for Java, but my answer is very C-style. The other answers have some nice tricks, too, some of which I borrowed for my answer. (e.g. using a dummy node avoids branching to handle the insertion-as-a-new-head special case).
