Assembly: Creating an array of linked nodes

Firstly, if this question is inappropriate because I am not providing any code or not doing any thinking on my own, I apologize, and I will delete this question.
For an assignment, we are required to create an array of nodes to simulate a linked list. Each node has an integer value and a pointer to the next node in the list. Here is my .DATA section
.DATA
linked_list DWORD 5 DUP (?) ;We are allowed to assume the linked list will have 5 items
linked_node STRUCT
value BYTE ?
next BYTE ?
linked_node ENDS
I am unsure if I am defining my STRUCT correctly, as I am unsure of what the type of next should be. Also, I am confused as to how to approach this problem. To insert a node into linked_list, I should be able to write mov [esi+TYPE linked_list*ecx], correct? Of course, I'd need to inc ecx every time. What I'm confused about is how to do mov linked_node.next, "pointer to next node". Is there some sort of operator that would allow me to set the pointer to the next index in the array equal to a linked_node.next? Or am I thinking about this incorrectly? Any help would be appreciated!

Think about your design in terms of a language you are familiar with. Preferably C, because pointers and values in C are concepts that map directly to asm.
Let's say you want to keep track of your linked list by storing a pointer to the head element.
#include <stdint.h> // for int8_t
struct node {
int8_t next; // array index. More commonly, you'd use struct node *next;
// negative values for .next are a sentinel, like a NULL pointer, marking the end of the list
int8_t val;
};
struct node storage[5]; // .next field indexes into this array
uint8_t free_position = 0; // when you need a new node, take index = free_position++;
int8_t head = -1; // start with an empty list
There are tricks to reduce corner cases, like having the list head be a full node, rather than just a reference (pointer or index). You can treat it as a first element, instead of having to check for the empty-list case everywhere.
Anyway, given a node reference int8_t p (where p is the standard variable name for a pointer to a list node, in linked list code), the next node's index is storage[p].next, so the next node is storage[storage[p].next]. The next node's val is storage[storage[p].next].val.
Let's see what this looks like in asm. The NASM manual talks about how its macro system can help you make code using global structs more readable, but I haven't done any macro stuff for this. You might define macros for NEXT and VAL or something, with 0 and 1, so you can say [storage + rdx*2 + NEXT]. Or even a macro that takes an argument, so you could say [NEXT(rdx*2)]. If you're not careful, you could end up with code that's more confusing to read, though.
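For illustration, here's a minimal sketch of what such parameterized single-line macros might look like (my spelling of them; the plain code below doesn't use them):
%define NEXT(x) storage + (x) + 0   ; .next is the 1st byte of a node
%define VAL(x)  storage + (x) + 1   ; .val is the 2nd byte of a node
; usage:  movsx eax, byte [NEXT(rdx*2)]   instead of   movsx eax, byte [storage + 2*rdx]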
section .bss
storage: resw 5 ;; reserve 5 words of zero-initialized space
free_position: resb 1 ;; uint8_t free_position = 0; (.bss is zeroed at startup; NASM ignores initialized data like db in .bss)
section .data
head: db -1 ;; int8_t head = -1;
section .text
; p is stored in rdx. It's an integer index into storage
; We'll access storage directly, without loading it into a register.
; (normally you'd have it in a reg, since it would be space you got from malloc/realloc)
; lea rsi, [rel storage] ;; If you want RIP-relative addressing.
;; There is no [RIP+offset + scale*index] addressing mode, because global arrays are for tiny / toy programs.
test edx, edx
js .err_empty_list ;; check for p=empty list (sign-bit means negative)
movsx eax, byte [storage + 2*rdx] ;; load p.next into eax, with sign-extension
test eax, eax
js .err_empty_list ;; check that there is a next element
movsx eax, byte [storage + 2*rax + 1] ;; load storage[p.next].val, sign extended into eax
;; The final +1 in the effective address is because the val byte is 2nd.
;; you could have used a 3rd register if you wanted to keep p.next around for future use
ret ;; or not, if this is just the middle of some larger function
.err_empty_list: ; .symbol is a local symbol, doesn't have to be unique for the whole file
ud2 ; TODO: report an error instead of running an invalid insns
Notice that we get away with shorter instruction encoding by sign-extending into a 32bit reg, not to the full 64bit rax. If the value is negative, we aren't going to use rax as part of an address. We're just using movsx as a way to write the whole register, because mov al, [storage + 2*rdx] would leave the upper 56 bits of rax with the old contents.
Another way to do this would be to movzx eax, byte [...] / test al, al, because the 8-bit test is just as fast to encode and execute as a 32bit test instruction. Also, movzx as a load has one cycle lower latency than movsx, on AMD Bulldozer-family CPUs (although they both still take an integer execution unit, unlike Intel where movsx/zx is handled entirely by a load port).
Either way, movsx or movzx is a good way to load 8-bit data, because you avoid problems with reading the full reg after writing a partial reg, and/or a false dependency (on the previous contents of the upper bits of the reg; even if you know you already zeroed it, the CPU hardware still has to track it). If you know you're not optimizing for Intel pre-Haswell, you don't have to worry about partial-register stalls: Haswell does dual bookkeeping or something to avoid extra merging uops when the full register is read after a partial write. AMD CPUs, P4, and Silvermont don't track partial regs separately from the full reg, so all you have to worry about there is the false dependency.
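For example, a tiny illustration of the difference:
mov al, [rsi]          ; writes only AL: the rest of RAX keeps (and depends on) its old contents
movzx eax, byte [rsi]  ; writes all of RAX: no merge needed, no false dependency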
Also note that you can load the next and val packed together, like
.search_loop:
movzx eax, word [storage + rdx*2] ; next in al, val in ah
test ah, ah
jz .found_a_zero_val
movzx edx, al ; use .next for the next iteration
test al, al
jns .search_loop
;; if we get here, we didn't find a zero val
ret
.found_a_zero_val:
;; do something with the element referred to by `rdx`
Notice how we have to use movzx anyway, because all the registers in an effective address have to be the same size. (So word [storage + al*2] doesn't work.)
This is probably more useful going the other way, to store both fields of a node with a single store, like mov [storage + rdx*2], ax or something, after getting next into al, and val into ah, probably from separate sources. (This is a case where you might want to use a regular byte load, instead of a movzx, if you don't already have it in another register). This isn't a big deal: don't make your code hard to read or more complex just to avoid doing two byte-stores. At least, not until you find out that store-port uops are the bottleneck in some loop.
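For example, a sketch of that single-store idea (hypothetical register assignments: the new next index arriving in cl, the new val in r8b, the node index in rdx):
mov al, cl                  ; .next goes in the low byte (offset 0)
mov ah, r8b                 ; .val goes in the high byte (offset 1)
mov [storage + rdx*2], ax   ; write both fields of node rdx with one 16-bit store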
Using an index into an array, instead of a pointer, can save a lot of space, esp. on 64bit systems where pointers take 8 bytes. If you don't need to free individual nodes (i.e. the data structure only ever grows, or is deleted all at once when it is deleted), then an allocator for new nodes is trivial: just keep sticking them at the end of the array, and realloc(3). Or use a C++ std::vector.
With those building blocks, you should be all set to implement the usual linked list algos. Just store bytes with mov [storage + rdx*2], al or whatever.
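For instance, here's a sketch of a push-front operation built from those pieces (assuming the new val arrives in cl, and not bothering to check whether storage is full):
; allocate a node: index = free_position++
movzx edx, byte [free_position]
inc   byte [free_position]
; node.val = cl, node.next = old head
mov   [storage + rdx*2 + 1], cl
mov   al, [head]
mov   [storage + rdx*2], al
; head = index of the new node
mov   [head], dl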
If you need ideas on how to implement linked lists with clean algos that handle all the special-cases with as few branches as possible, have a look at this Codereview question. It's for Java, but my answer is very C-style. The other answers have some nice tricks, too, some of which I borrowed for my answer. (e.g. using a dummy node avoids branching to handle the insertion-as-a-new-head special case).

Related

Why can't my memcpy, written in NASM, copy more than 340000000 bytes?

I am learning nasm. I have written a simple function that copies memory from the source to the destination. I test it in C.
section .text
global _myMemcpy
_myMemcpy:
mov eax, [esp + 4]
mov ecx, [esp + 8]
add [esp + 12], eax
lp:
mov dl, [ecx]
mov [eax], dl
inc eax
inc ecx
cmp eax, [esp + 12]
jl lp
endlp:
mov eax, [esp + 4]
ret
And the C program:
#include <string.h>
#define Times 340000000
extern void* _myMemcpy(void* dest, void* src, size_t size);
char sr[Times];
char ds[Times];
int main(void)
{
memset(sr, 'a', Times);
_myMemcpy(ds, sr, Times);
return 0;
}
I am currently using Ubuntu OS. When I compile and link the two files with $ nasm -f elf m.asm && gcc -Wall -m32 m.o p.c && ./a.out it works fine when the value of Times is less than 340000000. When it is greater, _myMemcpy copies only the first byte of the source to the destination. I can't figure out where the problem is. Every suggestion will be useful.
You're doing signed compares on pointers; don't do that. Use jne in this case since you will always reach exact equality at the exit point.
Or if you want relational compares with pointers, usually unsigned conditions like jb and jae make the most sense. (It's normal to think of virtual address space as a flat linear 4GiB with the lowest address being 0, so you need increments across the middle of that range to work).
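A minimal sketch of the fixed loop (only the loop condition changes; like your original do-while structure, this still assumes size > 0):
_myMemcpy:
mov eax, [esp + 4]    ; dest
mov ecx, [esp + 8]    ; src
add [esp + 12], eax   ; turn size into an end-of-dest pointer
lp:
mov dl, [ecx]
mov [eax], dl
inc eax
inc ecx
cmp eax, [esp + 12]
jne lp                ; was jl: a signed compare breaks once the end pointer crosses 0x80000000
mov eax, [esp + 4]
ret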
With arrays larger than your ~300MiB size, and the default linker script for PIE executables, apparently one of them will span the 2GiB boundary between signed-positive and signed-negative addresses (see footnote 1). So the end-pointer you calculate will be "negative" if you treat it as a signed integer. (Unlike on x86-64, where the non-canonical "hole" spanning the middle of virtual address-space means that an array can never span the signed-wraparound boundary: Should pointer comparisons be signed or unsigned in 64-bit x86? - sometimes it does make sense to use signed compares there.)
You should see this with a debugger if you single-step and look at the pointer values, and the memory value you create with size += dest (add [esp + 12], eax). As a signed operation, that overflows to create a negative end_pointer, while the start pointer is still positive. pos < neg is false on the first iteration, so your loop exits; you can see this when single-stepping.
Footnote 1: On my system, under GDB (which disables ASLR), after start to get the executable mapped to Linux's default base address for PIEs (2/3 of the way into the low half of the address space, i.e. 0x5555...), I checked the addresses with your test case:
sr at 0x56559040
ds at 0x6a998d40
end of ds at 0x7edd8a40 (computed in GDB with p /x sizeof(ds) + ds)
So if it were much bigger, it would cross 0x80000000. That's why 340000000 avoids your bug but larger sizes reveal it.
BTW, under a 32-bit kernel, Linux defaults to a 3:1 split of address space between kernel and user-space, so even there it's possible for this to happen. But under a 64-bit kernel, 32-bit processes can have the entire 4 GiB address space to themselves. (Except for a page or two reserved by the kernel: see also Why can't I mmap(MAP_FIXED) the highest virtual page in a 32-bit Linux process on a 64-bit kernel?. That also means that forming a pointer to one-past-end of any array like you're doing (which ISO C promises is valid to do), won't wrap around and will still compare above a pointer into the object.)
This won't happen in 64-bit mode: there's enough address space to just divide it evenly between user and kernel, as well as there being a giant non-canonical hole between high and low ranges.

Does multiplying a 1-100 int by -1 or setting said int to zero take more time?

This is for C, if the language matters. If it goes down to assembly language, it sets things to negative using two's complement. And with the assignment, you're storing the value "0" inside the int variable, and I'm not entirely sure what happens underneath.
I got: 1.90s user 0.01s system 99% cpu 1.928 total for the code below, and I'm guessing most of the runtime was in adding up the counter variables.
int i;
int n;
i = 0;
while (i < 999999999)
{
n = 0;
i++;
n++;
}
I got: 4.56s user 0.02s system 99% cpu 4.613 total for the code below.
int i;
int n;
i = 0;
n = 5;
while (i < 999999999)
{
n *= -1;
i++;
n++;
}
return (0);
I don't particularly understand much about assembly, but it doesn't seem intuitive that the two's complement operation takes more time than setting one thing to another. What's the underlying implementation that makes one faster than the other, and what's happening beneath the surface? Or is my test simply a bad one that doesn't accurately portray how quick it'll actually be in practice?
If it seems pointless, the reason for it is that I can easily implement a "checklist" by multiplying an integer on a map by -1, meaning it's already been checked (but I need to keep the value, so when I do the check, I can just negate whatever I'm comparing it to). But I was wondering if that's too slow; I could make a separate boolean 2D array to record whether a value was checked, or change my data structure into an array of structures so it could hold an int 1/0. I'm wondering what the best implementation will be: doing the -1 operation a billion times will already total up to around 5 seconds, not counting the rest of my program. But making a separate billion-entry int array, or creating a billion-entry struct array, doesn't seem to be the best way either.
Assigning zero is very cheap.
But your microbenchmark tells you very little about what you should do for your large array. Memory bandwidth / cache-miss / cache footprint considerations will dominate there, and your microbench doesn't test that at all.
Using one bit of your integer values to represent checked / not-checked seems reasonable compared to having a separate bitmap. (Having a separate array of 0/1 32-bit integers would be totally silly, but a bitmap is worth considering, especially if you want to search quickly for the next unchecked or the next checked entry. It's not clear what you're doing with this, so I'll mostly just stick to explaining the observed performance in your microbenchmark.)
And BTW, questions like this are a perfect example of why SO comments like "why don't you benchmark it yourself" are misguided: because you have to understand what you're testing in quite a lot of detail to write a useful microbenchmark.
You obviously compiled this in debug mode, e.g. gcc with the default -O0, which spills everything to memory after every C statement (so your program still works even if you modify variables with a debugger). Otherwise the loops would optimize away, because you didn't use volatile or an asm statement to limit optimization, and your loops are trivial to optimize.
Benchmarking with -O0 does not reflect reality (of compiling normally), and is a total waste of time (unless you're actually worried about the performance of debug builds of something like a game).
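For instance (an illustration of what typically happens, not verified output from any particular compiler version), gcc -O2 can see that n and i are never read after the loop, so either version compiles down to an empty main:
main:
xor eax, eax   ; return 0; the whole loop is optimized away
ret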
That said, your results are easy to explain, because -O0 compiles each C statement separately and predictably:
n = 0; is write-only, and breaks the dependency on the old value.
n *= -1; compiles the same as n = -n; with gcc (even with -O0). It has to read the old value from memory before writing the new value.
The store/reload between a write and a read of a C variable across statements costs about 5 cycles of store-forwarding latency on Intel Haswell for example (see http://agner.org/optimize and other links on the x86 tag wiki). (You didn't say what CPU microarchitecture you tested on, but I'm assuming some kind of x86 because that's usually "the default"). But dependency analysis still works the same way in this case.
So the n*=-1 version has a loop-carried dependency chain involving n, with an n++ and a negate.
The n=0 version breaks that dependency every iteration by doing a store without reading the old value. The loop only bottlenecks on the 6-cycle loop-carried dependency of the i++ loop counter. The latency of the n=0; n++ chain doesn't matter, because each loop iteration starts a fresh chain, so multiple can be in flight at once. (Store forwarding provides a sort of memory renaming, like register renaming but for a memory location).
This is all unrealistic nonsense: With optimization enabled, the cost of a unary - totally depends on the surrounding code. You can't just add up the costs of separate operations to get a total, that's not how pipelined out-of-order CPUs work, and compiler optimization itself also makes that model bogus.
About the code itself
I compiled your pieces of code into x86_64 assembly outputs using GCC 7.2 without any optimization. I also shortened each piece of code without changing the assembly output. Here are the results.
Code 1:
// C
int main() {
int n;
for (int i = 0; i < 999999999; i++) {
n = 0;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 0
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 0
add DWORD PTR [rbp-8], 1
add DWORD PTR [rbp-4], 1
.L2:
cmp DWORD PTR [rbp-4], 999999998
jle .L3
mov eax, 0
pop rbp
ret
Code 2:
// C
int main() {
int n = 5;
for (int i = 0; i < 999999999; i++) {
n *= -1;
n++;
}
}
// assembly
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 5
mov DWORD PTR [rbp-8], 0
jmp .L2
.L3:
neg DWORD PTR [rbp-4]
add DWORD PTR [rbp-4], 1
add DWORD PTR [rbp-8], 1
.L2:
cmp DWORD PTR [rbp-8], 999999998
jle .L3
mov eax, 0
pop rbp
ret
The C instructions inside the loop are, in the assembly, located between the two labels (.L3: and .L2:). In both cases, that's three instructions, among which only the first one is different. In the first code, it is a mov, corresponding to n = 0;. In the second code however, it is a neg, corresponding to n *= -1;.
According to this manual, these two instructions have different execution speed depending on the CPU. One can be faster than the other on one chip while being slower on another.
Thanks to aschepler in the comments for the input.
This means, all the other instructions being identical, that you cannot tell which code will be faster in general. Therefore, trying to compare their performance is pointless.
About your intent
Your reason for asking about the performance of these short pieces of code is faulty. What you want is to implement a checklist structure, and you have two conflicting ideas on how to build it. One uses a special value, -1, to add special meaning onto variables in a map. The other uses additional data, either an external boolean array or a boolean for each variable, to add the same meaning without changing the purpose of the existing variables.
The choice you have to make should be a design decision rather than be motivated by unclear performance issues. Personally, whenever I am facing this kind of choice between a special value or additional data with precise meaning, I tend to prefer the latter option. That's mainly because I don't like dealing with special values, but it's only my opinion.
My advice would be to go for the solution you can maintain better, namely the one you are most comfortable with and won't harm future code, and ask about performance when it matters, or rather if it even matters.

Delphi Optimize IndexOf Function

Could someone help me speed up my Delphi function to find a value in a byte array, without using binary search? I call this function thousands of times; is it possible to optimize it with assembly? Thank you so much.
function IndexOf(const List: TArray<Byte>; const Value: Byte): Integer;
var
  I: Integer;
begin
  for I := Low(List) to High(List) do
  begin
    if List[I] = Value then
      Exit(I);
  end;
  Result := -1;
end;
The length of the array is about 15 items.
Well, let's think. First, note that the loop line must read:
For I := Low( List ) to High( List ) do
(the 'do' at the end was missing in the original post). When we compile it without optimization, here is the assembly code for this loop:
Unit1.pas.29: If List [I] = Value then
005C5E7A 8B45FC mov eax,[ebp-$04]
005C5E7D 8B55F0 mov edx,[ebp-$10]
005C5E80 8A0410 mov al,[eax+edx]
005C5E83 3A45FB cmp al,[ebp-$05]
005C5E86 7508 jnz $005c5e90
Unit1.pas.30: Exit (I);
005C5E88 8B45F0 mov eax,[ebp-$10]
005C5E8B 8945F4 mov [ebp-$0c],eax
005C5E8E EB0F jmp $005c5e9f
005C5E90 FF45F0 inc dword ptr [ebp-$10]
Unit1.pas.28: For I := Low (List) to High (List) do
005C5E93 FF4DEC dec dword ptr [ebp-$14]
005C5E96 75E2 jnz $005c5e7a
This code is far from optimal: the local variable i really is a local variable, that is, it is stored in RAM, on the stack (you can see it by the [ebp-$10] addresses; ebp is the frame pointer).
So at each new iteration we load the address of the array into the eax register (mov eax, [ebp-$04]),
then we load i from the stack into the edx register (mov edx, [ebp-$10]),
then at last we load List[i] into the al register, which is the low byte of eax (mov al, [eax+edx]),
after which we compare it with the argument Value, taken again from memory, not from a register!
This implementation is extremely slow.
But let's turn optimization on at last! It's done in Project options -> compiling -> code generation. Let's look at new code:
Unit1.pas.29: If List [I] = Value then
005C5E5A 3A1408 cmp dl,[eax+ecx]
005C5E5D 7504 jnz $005c5e63
Unit1.pas.30: Exit (I);
005C5E5F 8BC1 mov eax,ecx
005C5E61 5E pop esi
005C5E62 C3 ret
005C5E63 41 inc ecx
Unit1.pas.28: For I := Low (List) to High (List) do
005C5E64 4E dec esi
005C5E65 75F3 jnz $005c5e5a
Now there are just a few instructions which get repeated over and over:
Value is stored inside dl register (lower byte of edx register),
address of 0-th element of array is stored in eax register,
i is stored in ecx register.
So the line 'if List[i] = Value' converts into just 1 assembly line:
005C5E5A 3A1408 cmp dl,[eax+ecx]
the next line is a conditional jump; the 3 lines after that are executed just once or never (only if the condition is true); and at last there is the increment of i,
then the decrement of the loop variable (it's easier to compare it with zero than with anything else).
So, there is little we can do that the Delphi compiler with the optimizer didn't do already!
If it's permitted by your program, you can try to reverse direction of search, from last element to first:
For I := High( List ) downto Low( List ) do
this way the compiler will be happy to compare i with zero to indicate that we checked everything (this comparison is free: when we decrement i and get zero, the CPU's zero flag turns on!)
But in such implementation behaviour may be different: if you have several entries = Value, you'll get not the first one, but the last one!
Another very easy thing is to declare this IndexOf function as inline: this way you'll probably have no function call here: this code will be inserted at each place where you call it. Function calls are rather slow things.
There are also some crazy methods described in Knuth for searching a simple array as fast as possible. He introduces a 'dummy' last element of the array which equals your Value; that way you don't have to check boundaries (the loop will always find something before going out of range), so there is just 1 condition inside the loop instead of 2, as sketched below. Another method is 'unrolling' the loop: you write down 2 or 3 or more iterations inside the loop body, so there are fewer jumps per check, but this has even more downsides: it is beneficial only for rather large arrays and may make things even slower for arrays with 1 or 2 elements.
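A rough Intel-syntax sketch of the sentinel idea (assuming there is a spare byte after the last element that you're allowed to write to):
// eax = @List[0], dl = Value, ecx = Length(List)
mov [eax+ecx], dl   // plant the sentinel one past the end (needs a spare byte!)
xor ecx, ecx        // i := 0
@scan:
cmp dl, [eax+ecx]
je @done
inc ecx
jmp @scan           // no bounds check needed: the sentinel guarantees a match
@done:
// ecx = the index; ecx = Length(List) means only the sentinel matched (not found)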
As others said: the biggest improvement would come from understanding what kind of data you store. Does it change frequently or stay the same for a long time? Do you look for random elements, or are there some 'leaders' that get most of the attention? Must the elements stay in the same order you put them in, or are you allowed to rearrange them as you wish? Then you can choose the data structure accordingly. If you look for the same 1 or 2 entries all the time and the elements can be rearranged, a simple 'move-to-front' method would be great: you don't just return the index, but first move the element to first place, so it will be found very quickly the next time.
If your arrays are long, you can use the x86 built in string scan REP SCAS.
It is coded in microcode and has a moderate start-up time, but it is
heavily optimized in the CPU and runs fast given long enough data structures (>= 100 bytes).
In fact on a modern CPU it frequently outperforms very clever RISC code.
If your arrays are short, then no amount of optimization of this routine will help, because then your problem is in code not shown in the question, so there is no answer I can give you.
See: http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Internal_Data_Formats_(Delphi)
function IndexOf({$ifndef RunInSeperateThread} const {$endif} List: TArray<byte>; const Value: byte): integer;
//Lock the array if you run this in a separate thread.
{$ifdef CPUX64}
asm
//RCX = List
//DL = Value.
mov r8,[rcx-8] //3 - get the length ASAP (the dynamic-array length is stored just below the data; assumes List <> nil).
push rdi //0 - hidden in mov r,m
mov eax,edx //0 - rename: AL = Value
mov rdi,rcx //0 - rename: RDI = start of the data
mov rcx,r8 //0 - rename: RCX = counter
mov rdx,r8 //0 - remember the length
//8 cycles setup
repne scasb //2n - repeat until byte found or RCX = 0.
pop rdi //1
jne @NotFound //ZF = 0: we ran off the end without a match.
neg rcx //0
lea rax,[rdx+rcx-1] //1 - result = length - bytes left - 1 (SCASB also counts the matching byte).
jmp @Done
@NotFound:
mov rax,-1
@Done:
end;
{$ENDIF}
{$ifdef CPUX86}
asm
//EAX = List
//DL = Value.
push edi
mov edi,eax //EDI = start of the data
mov ecx,[eax-4] //get the length (assumes List <> nil)
mov eax,edx //AL = Value
mov edx,ecx //remember the length
repne scasb //repeat until byte found or ECX = 0.
pop edi
jne @NotFound //ZF = 0: no match.
neg ecx
lea eax,[edx+ecx-1] //result = length - bytes left - 1 (SCASB also counts the matching byte).
jmp @Done
@NotFound:
mov eax,-1
@Done:
end;
{$endif}
Timings
On my laptop, using an array of 1KB with the target byte at the end, this gives the following timings (lowest time over 100,000 runs):
Code                           | CPU cycles
                               | Len=1024 | Len=16
-------------------------------+----------+--------
Your code, optimizations off   |     5775 |    146
Your code, optimizations on    |     4540 |     93
X86 my code                    |     2726 |     60
X64 my code                    |     2733 |     69
The speed-up is OK(ish), but hardly worth the effort. If your arrays are short, then this code will not help you, and you'll have to resort to other options to optimize your code.
Speed up possible when using binary search
Binary search is an O(log n) operation, vs O(n) for naive search, but it requires the array to be kept sorted. Using the same 1KB array, it would find your data in about log2(1024) = 10 iterations at ~20 CPU cycles each, i.e. around 200 cycles per search: a 10+ times speed-up over my optimized code.

Don't really understand how arrays work in Assembly

I've just started learning Assembly and I got stuck now...
%include 'io.inc'
global main
section .text
main:
; read a
mov eax, str_a
call io_writestr
call io_readint
mov [nb_array], eax
call io_writeln
; read b
mov eax, str_b
call io_writestr
call io_readint
mov [nb_array + 2], eax
call io_writeln
mov eax, [nb_array]
call io_writeint
call io_writeln
mov eax, [nb_array + 2]
call io_writeint
section .data
nb_array dw 0, 0
str_a db 'a = ', 0
str_b db 'b = ', 0
So, I have a two-element array, and when I try to print the first element, it doesn't print the right value; yet when I print the second element, it prints the right value. Could someone help me understand why this is happening?
The best answer is probably "because there are no arrays in Assembly". You have computer memory available, which is addressable by bytes. And you have several instructions to manipulate those bytes, either by single byte, or by groups of them forming "word" (two bytes) or "dword" (four bytes), or even more (depends on platform and extended instructions you use).
To use memory in any "structured" way in assembly, it's up to you to write the code that treats it that way, and it takes some practice to be accurate enough and to spot all the bugs in the debugger. Just running the code and seeing correct output doesn't mean much: if you entered only a single value, your program would output the correct number, yet the "a = " string would be destroyed anyway. With every new piece of code, you should rather walk instruction by instruction in the debugger and verify that everything works as expected.
Bugs in similar code were so common that people preferred the much worse machine code produced by early C compilers, because C structs and arrays were much easier to use: you don't have to guard the by-size multiplication of every index yourself, or allocate the correct amount of memory for every element.
What you see as a result is exactly what you did with the memory and its particular bytes: nb_array is defined as two 16-bit words, but you store the full 32-bit eax into it, so the store to [nb_array + 2] overwrites the upper half of the first value (and spills two bytes into str_a, which is why "a = " gets destroyed). The fix depends on whether you want it to work for 16-bit or 32-bit input numbers: either fix the instructions storing/reading the array to use 16-bit registers only, or fix the array definition and the offsets to accommodate two 32-bit values.
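For example, a sketch of the 16-bit fix in NASM, replacing the corresponding instructions in the question (it assumes, as the question's code does, that io_readint/io_writeint pass the number in eax):
mov [nb_array], ax         ; after the first io_readint: store only the low word
mov [nb_array + 2], ax     ; after the second io_readint: no overlap now
movzx eax, word [nb_array] ; reload zero-extended before io_writeint (use movsx for signed input)
Alternatively, widen the array: define it as nb_array dd 0, 0, keep storing the full eax, and address the second element at [nb_array + 4].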

emu8086 find minimum and maximum in an array

I have to come up with an ASM code (for emu8086) that will find the minimum and maximum value in an array of any given size. In the sample code, my instructor provides (what appears to be) a data segment that contains an array named LIST. He claims that he will replace this list with other lists of different sizes, and our code must be able to handle it.
Here's the sample code below. I've highlighted the parts that I've added, just to show you that I've done my best to solve this problem:
; You may customize this and other start-up templates;
; The location of this template is c:\emu8086\inc\0_com_template.txt
org 100h
data segment
LIST DB 05H, 31H, 34H, 30H, 38H, 37H
MINIMUM DB ?
MAXIMUM DB ?
AVARAGE DB ?
**SIZE=$-OFFSET LIST**
ends
stack segment **;**
DW 128 DUP(0) **; I have NO CLUE what this is supposed to do**
ends **;**
code segment
start proc far
; set segment registers:
MOV AX,DATA **;**
MOV DS,AX **;I'm not sure what the point of this is, especially since I'm supposed to be the programmer, not my teacher.**
MOV ES,AX **;**
; add your code here
**;the number of elements in LIST is SIZE (I think)
MOV CX,SIZE ;a loop counter, I think
;find the minimum value in LIST and store it into MINIMUM
;begin loop
AGAIN1:
LEA SI,LIST
LEA DI,MINIMUM
MOV AL,[SI]
CMP AL,[SI+1]
If carry flag=1:{I got no idea}
LOOP AGAIN1
;find the maximum value in LIST and store it into MAXIMUM
;Something similar to the other loop, but this time I gotta find the max.
AGAIN2:
LEA SI,LIST
LEA DI,MINIMUM
MOV AL,[SI]
CMP AL,[SI-1] ;???
LOOP AGAIN2
**
; exit to operating system.
MOV AX,4C00H
INT 21H
start endp
ends
end start ; set entry point and stop the assembler.
ret
I'm not positive, but I think you want to move the SIZE variable immediately after the LIST variable:
data segment
LIST DB 05H, 31H, 34H, 30H, 38H, 37H
SIZE=$-OFFSET LIST
MINIMUM DB ?
MAXIMUM DB ?
AVARAGE DB ?
ends
What it does is give you the number of bytes between the current address ($) and the beginning of the LIST variable - thus giving you the size (in bytes) of the list variable itself. Because the LIST is an array of bytes, SIZE will be the actual length of the array. If LIST was an array of WORDS, you'd have to divide SIZE by two. If your teacher wrote that code then perhaps you should leave it alone.
I'm not entirely clear on why your teacher made a stack segment; I can't think of any reason to use it here, but perhaps it will become clear in a future assignment. For now, you should know that DUP is shorthand for "duplicate". This line of code:
DW 128 DUP(0)
Is allocating 128 WORDS of memory initialized to 0.
The following lines of code:
MOV AX,DATA
MOV DS,AX
MOV ES,AX
are setting up your segment registers so that you can loop through the LIST. All you need to know is that, at this point, DS (and ES) hold the segment of your data, and therefore memory references like [SI] can reach your LIST.
As for the rest... it looks like you have an endless loop. What you need to do is this:
1. Set SI to point to the beginning of the LIST
2. Set CX to be the length of the LIST (you've done that)
3. Copy the current byte from [SI] to AL
4. Compare AL to the memory variable MINIMUM
5. If AL is smaller, copy it to MINIMUM
6. Increment SI and decrement CX
7. If CX = 0 (check the ZERO flag) exit the loop; otherwise, go back to step 3
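For instance, here is a sketch of those steps in emu8086-style assembly (the labels are mine; it seeds the running minimum and maximum from the first element, which avoids having to pre-initialize MINIMUM, and keeps both in registers until the end):
LEA SI, LIST  ; SI -> first element
MOV CX, SIZE  ; CX = number of bytes in LIST
MOV AL, [SI]  ; AL = running minimum, seeded from LIST[0]
MOV AH, [SI]  ; AH = running maximum, seeded from LIST[0]
SCAN:
MOV BL, [SI]  ; BL = current element
CMP BL, AL
JAE NOTMIN    ; unsigned compare; use the signed variants (JGE/JLE) for signed data
MOV AL, BL    ; new minimum
NOTMIN:
CMP BL, AH
JBE NOTMAX
MOV AH, BL    ; new maximum
NOTMAX:
INC SI        ; step to the next byte
LOOP SCAN     ; DEC CX, jump back while CX <> 0
MOV MINIMUM, AL
MOV MAXIMUM, AH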
