Delphi Optimize IndexOf Function - arrays

Could someone help me speed up my Delphi function that finds a value in a byte array, without using binary search? I call this function thousands of times; is it possible to optimize it with assembly? Thank you so much.
function IndexOf(const List: TArray<Byte>; const Value: Byte): Integer;
var
  I: Integer;
begin
  for I := Low(List) to High(List) do
  begin
    if List[I] = Value then
      Exit(I);
  end;
  Result := -1;
end;
The length of the array is about 15 items.

Well, let's think this through. First, one correction to this line:
For I := Low( List ) to High( List ) do
(the 'do' at the end was missing in your original post). When we compile without optimization, here is the assembly code for this loop:
Unit1.pas.29: If List [I] = Value then
005C5E7A 8B45FC mov eax,[ebp-$04]
005C5E7D 8B55F0 mov edx,[ebp-$10]
005C5E80 8A0410 mov al,[eax+edx]
005C5E83 3A45FB cmp al,[ebp-$05]
005C5E86 7508 jnz $005c5e90
Unit1.pas.30: Exit (I);
005C5E88 8B45F0 mov eax,[ebp-$10]
005C5E8B 8945F4 mov [ebp-$0c],eax
005C5E8E EB0F jmp $005c5e9f
005C5E90 FF45F0 inc dword ptr [ebp-$10]
Unit1.pas.28: For I := Low (List) to High (List) do
005C5E93 FF4DEC dec dword ptr [ebp-$14]
005C5E96 75E2 jnz $005c5e7a
This code is far from optimal: the local variable I really is local, that is, it is stored in RAM, on the stack (you can see this from the [ebp-$10] addresses; ebp is the frame pointer).
So at each iteration we load the address of the array into the eax register (mov eax,[ebp-$04]),
then load I from the stack into the edx register (mov edx,[ebp-$10]),
then at last load List[I] into the al register, which is the low byte of eax (mov al,[eax+edx]),
and then compare it with the argument Value, again taken from memory, not from a register!
This implementation is extremely slow.
Now let's turn optimization on at last! It's done in Project Options -> Compiling -> Code generation. Let's look at the new code:
Unit1.pas.29: If List [I] = Value then
005C5E5A 3A1408 cmp dl,[eax+ecx]
005C5E5D 7504 jnz $005c5e63
Unit1.pas.30: Exit (I);
005C5E5F 8BC1 mov eax,ecx
005C5E61 5E pop esi
005C5E62 C3 ret
005C5E63 41 inc ecx
Unit1.pas.28: For I := Low (List) to High (List) do
005C5E64 4E dec esi
005C5E65 75F3 jnz $005c5e5a
Now only a handful of instructions get repeated over and over.
Value is stored in the dl register (the low byte of the edx register),
the address of the 0th element of the array is stored in the eax register,
and I is stored in the ecx register.
So the line 'if List[I] = Value' compiles down to just one assembly instruction:
005C5E5A 3A1408 cmp dl,[eax+ecx]
The next instruction is a conditional jump; the 3 instructions after that are executed once or never (when the condition is true); and finally there is the increment of I and
the decrement of the loop counter (it is cheaper to compare against zero than against anything else).
So there is little we can do that the Delphi compiler with the optimizer didn't already do!
If your program permits it, you can try to reverse the direction of the search, from the last element to the first:
For I := High( List ) downto Low( List ) do
This way the compiler is happy to compare I against zero to see that everything has been checked (this comparison is free: when we decrement I and reach zero, the CPU zero flag is set!)
But the behaviour of this implementation may differ: if several entries are equal to Value, you'll get not the first one, but the last one!
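As a sketch, the whole function would then look like this (same result except for the tie-breaking just mentioned):
function IndexOfReverse(const List: TArray<Byte>; const Value: Byte): Integer;
var
  I: Integer;
begin
  // Counting down lets the compiler test the free zero flag set by DEC.
  for I := High(List) downto Low(List) do
    if List[I] = Value then
      Exit(I);
  Result := -1;
end;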
Another very easy thing is to declare this IndexOf function as inline: then there will probably be no function call at all; the code will be inserted at each place where you call it. Function calls are relatively expensive.
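Only the declaration changes (a sketch):
function IndexOf(const List: TArray<Byte>; const Value: Byte): Integer; inline;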
There are also some clever methods described in Knuth for searching a plain array as fast as possible. One of them introduces a 'dummy' last element of the array that is set equal to your Value; that way you don't have to check the bounds (the loop will always find something before running out of range), so there is just 1 condition inside the loop instead of 2. Another method is loop 'unrolling': you write down 2 or 3 or more iterations inside the loop body, so there are fewer jumps per check; but this has even more downsides: it is only beneficial for rather large arrays and may make things slower for arrays of 1 or 2 elements.
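Here is a minimal sketch of the sentinel idea, assuming you can reserve one spare slot at the end of the array for scratch use (the Count parameter and the function name are my own, not from the question):
// List must hold Count real elements plus one spare slot at index Count,
// which this routine overwrites with the sentinel value.
function IndexOfSentinel(var List: TArray<Byte>; Count: Integer; const Value: Byte): Integer;
var
  I: Integer;
begin
  List[Count] := Value;   // plant the sentinel: the loop is guaranteed to stop
  I := 0;
  while List[I] <> Value do
    Inc(I);               // one comparison per iteration, no bounds check
  if I < Count then
    Result := I           // found a real occurrence
  else
    Result := -1;         // only the sentinel matched
end;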
As others said, the biggest improvement would come from understanding the kind of data you store: does it change frequently or stay the same for a long time? Do you look for random elements, or are there a few 'leaders' that get most of the attention? Must the elements stay in the order you inserted them, or are you allowed to rearrange them as you wish? Then you can choose a data structure accordingly. If you look up the same one or two entries all the time and rearranging is allowed, a simple 'move-to-front' method would be great: you don't just return the index but first move the element to the front, so it will be found very quickly the next time.
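A minimal sketch of such a move-to-front lookup (my own illustration: here the hit is swapped to index 0, so the function returns 0 whenever the value is found):
function IndexOfMTF(var List: TArray<Byte>; const Value: Byte): Integer;
var
  I: Integer;
  Tmp: Byte;
begin
  for I := Low(List) to High(List) do
    if List[I] = Value then
    begin
      // Swap the hit to the front so the next lookup finds it immediately.
      Tmp := List[0];
      List[0] := List[I];
      List[I] := Tmp;
      Exit(0);
    end;
  Result := -1;
end;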

If your arrays are long, you can use the x86 built-in string-scan instruction (REPNE SCASB).
It is implemented in microcode and has a moderate start-up time, but it is
heavily optimized in the CPU and runs fast given long enough data structures (>= 100 bytes).
In fact, on a modern CPU it frequently outperforms very clever RISC-style code.
If your arrays are short, then no amount of optimization of this routine will help, because your problem is then in code not shown in the question, so there is no answer I can give you.
See: http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Internal_Data_Formats_(Delphi)
function IndexOf({$ifndef RunInSeperateThread} const {$endif} List: TArray<byte>; const Value: byte): integer;
//Lock the array if you run this in a separate thread.
{$ifdef CPUX64}
asm
//RCX = List
//DL = byte.
mov r8,[rcx-8] //3 - get the length ASAP.
push rdi //0 - hidden in mov r,m
mov eax,edx //0 - rename
mov rdi,rcx //0 - rename
mov rcx,r8 //0 - rename
mov rdx,r8 //0 - remember the length
//8 cycles setup
repne scasb //2n - repeat until byte found.
pop rdi //1
not rcx //0 - turns (bytes left) into -(bytes left)-1
lea rax,[rdx+rcx] //1 result = length - bytes left - 1 = index of the match.
//if Value does not occur in List, this yields Length-1, so the caller
//must ensure Value is present (or verify the result).
end;
{$ENDIF}
{$ifdef CPUX86}
asm
//EAX = List
//DL = byte.
push edi
mov edi,eax
mov ecx,[eax-4] //get the length
mov eax,edx
mov edx,ecx //remember the length
repne scasb //repeat until byte found.
pop edi
not ecx
lea eax,[edx+ecx] //result = length - bytes left - 1 = index of the match.
//as in the x64 version, an absent Value yields Length-1.
end;
{$endif}
Timings
On my laptop, using a 1KB array with the target byte at the end, this gives the following timings (lowest time over 100,000 runs):
Code                         | CPU cycles
                             | Len=1024 | Len=16
-----------------------------+----------+--------
Your code, optimizations off |     5775 |    146
Your code, optimizations on  |     4540 |     93
My code, x86                 |     2726 |     60
My code, x64                 |     2733 |     69
The speed-up is OK(ish), but hardly worth the effort.
If your arrays are short, then this code will not help you, and you'll have to resort to other options to optimize your code.
Speed-up possible using binary search
Binary search is an O(log n) operation, vs O(n) for a naive search, but it requires the array to be kept sorted. Using the same array, it would find your data in log2(1024) steps at roughly 20 CPU cycles per step, i.e. about 200 cycles: a 10x+ speed-up over my optimized code.
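For comparison, a minimal sketch of such a binary search in Delphi, assuming the array is kept sorted in ascending order:
function SortedIndexOf(const List: TArray<Byte>; const Value: Byte): Integer;
var
  Lo, Hi, Mid: Integer;
begin
  Lo := Low(List);
  Hi := High(List);
  while Lo <= Hi do
  begin
    Mid := (Lo + Hi) div 2;   // halve the search range each step
    if List[Mid] = Value then
      Exit(Mid);
    if List[Mid] < Value then
      Lo := Mid + 1
    else
      Hi := Mid - 1;
  end;
  Result := -1;               // not present
end;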

Related

Delphi XE byte array index

I use a simple circular buffer like this:
var
  Values: array [byte] of single;
  ptr: byte;
In this test example:
for ptr := 0 to 10 do Values[Byte(ptr-5)] := 1;
I expect the first 5 and the last 5 values to be set to 1, but the XE4 compiler produces incorrect code: it uses 32-bit pointer math to calculate the array index:
for ptr:=0 to 10 do Values[Byte(ptr-5)]:=1;
005B94BB C645FB00 mov byte ptr [ebp-$05],$00
005B94BF 33C0 xor eax,eax
005B94C1 8A45FB mov al,[ebp-$05]
005B94C4 C78485E0FBFFFF0000803F mov [ebp+eax*4-$0420],$3f800000
005B94CF FE45FB inc byte ptr [ebp-$05]
005B94D2 807DFB0B cmp byte ptr [ebp-$05],$0b
005B94D6 75E7 jnz $005b94bf
Is my code wrong, and what is the proper way to work with byte indexes?
The question is: is a wrap expected within the Byte() cast?
Let's compare the disassembly with overflow checking on and off.
{$Q+}
Project71.dpr.21: for ptr:= 0 to 10 do Values[Byte(ptr-5)]:= 1;
0041D568 33DB xor ebx,ebx
0041D56A 0FB6C3 movzx eax,bl
0041D56D 83E805 sub eax,$05
0041D570 7105 jno $0041d577
0041D572 E82D8DFEFF call #IntOver
0041D577 0FB6C0 movzx eax,al
0041D57A C704870000803F mov [edi+eax*4],$3f800000
0041D581 43 inc ebx
0041D582 80FB0B cmp bl,$0b
0041D585 75E3 jnz $0041d56a
{$Q-}
Project71.dpr.21: for ptr:= 0 to 10 do Values[Byte(ptr-5)]:= 1;
0041D566 B30B mov bl,$0b
0041D568 B808584200 mov eax,$00425808
0041D56D C7000000803F mov [eax],$3f800000
0041D573 83C004 add eax,$04
0041D576 FECB dec bl
0041D578 75F3 jnz $0041d56d
With {$Q+} the wrap works, while with {$Q-} it does not, and the compiler generates no range error for the wrong array indexing even when {$R+} is set.
So, to me the conclusion is: since range checking does not generate a run-time error for an array index out of bounds, a wrap is expected.
This is further supported by the fact that a wrap is done when overflow checking is on.
This should be reported as a bug in the compiler.
Done: https://quality.embarcadero.com/browse/RSP-15527 "Type cast fail within array indexing"
Note: a workaround is given by @Rudy in his answer.
Addendum:
Following code:
for ptr:= 0 to 10 do WriteLn(Byte(ptr-5));
generates:
251
252
253
254
255
0
1
2
3
4
5
for all combinations of range/overflow checking.
Likewise Values[Byte(-1)] := 1; assigns 1 to Values[255] for all compiler options.
The documentation for Value Typecasts says:
The resulting value is obtained by converting the expression in parentheses. This may involve truncation or extension if the size of the specified type differs from that of the expression. The expression's sign is always preserved.
My code is written in Delphi 10.1 Berlin, but the result seems to be the same.
Let's extend your little code piece a bit:
procedure Test;
var
  Values: array[Byte] of Single;
  Ptr: Byte;
begin
  Values[0] := 1.0;
  for Ptr := 0 to 10 do
    Values[Byte(Ptr - 5)] := 1.0;
end;
This gives the following code in the CPU view:
Project80.dpr.15: Values[0] := 1.0;
0041A1DD C785FCFBFFFF0000803F mov [ebp-$00000404],$3f800000
Project80.dpr.16: for Ptr := 0 to 10 do
0041A1E7 C645FF00 mov byte ptr [ebp-$01],$00
Project80.dpr.17: Values[Byte(Ptr-5)] := 1.0;
0041A1EB 33C0 xor eax,eax
0041A1ED 8A45FF mov al,[ebp-$01]
0041A1F0 C78485E8FBFFFF0000803F mov [ebp+eax*4-$0418],$3f800000
0041A1FB FE45FF inc byte ptr [ebp-$01]
Project80.dpr.16: for Ptr := 0 to 10 do
0041A1FE 807DFF0B cmp byte ptr [ebp-$01],$0b
0041A202 75E7 jnz $0041a1eb
As we can see, the first element of the array is at [ebp-$00000404], so [ebp+eax*4-$0418] is indeed below the array (for values 0..4).
That looks like a bug to me, because for Ptr = 0, Byte(Ptr - 5) should wrap around to $FB. The generated code should be something like:
mov byte ptr [ebp-$01],$00
xor eax,eax
#loop:
mov al,[ebp-$01]
sub al,5 // Byte(Ptr - 5)
mov [ebp+4*eax-$0404],$3f800000 // al = $FB, $FC, $FD, $FE, $FF, 00, etc..
inc byte ptr [ebp-$01]
cmp byte ptr [ebp-$01],$0b
jnz #loop
Good find!
There is a workaround, though:
Values[Byte(Ptr - 5) + 0] := 1.0;
This produces:
Project80.dpr.19: Values[Byte(Ptr - 5) + 0] := 1.0;
0040F16B 8A45FF mov al,[ebp-$01]
0040F16E 2C05 sub al,$05
0040F170 25FF000000 and eax,$000000ff
0040F175 C78485FCFBFFFF0000803F mov [ebp+eax*4-$0404],$3f800000
And that works nicely, although the and eax,$000000ff seems unnecessary to me.
FWIW, I also looked at the code generated with optimization on. Both in XE and Berlin, the error exists as well, and the workaround works too.
This sounds like unexpected behavior of the compiler. But I would never assume that typecasting integers with Byte() always wraps at $FF. It does most of the time, e.g. when you assign values between variables, but there are cases where it doesn't, as you discovered. So I would never have used this Byte() expression within an array index computation.
I have always observed that using byte variables is not worth it: you should rather use a plain integer (or NativeInt), so that it matches the CPU registers, and then not assume any implicit wrapping.
In all cases, I would rather make the wrap at 255 explicit, like this:
procedure test;
var
  Values: array [byte] of single;
  ptr: integer;
begin
  for ptr := 0 to 10 do
    Values[(ptr - 5) and high(Values)] := 1;
end;
As you can see, I've made some modifications:
Define the for loop index as an integer, so it uses a CPU register;
Use the and operation for fast binary masking (writing (ptr - 5) mod 256 would be much slower);
Use high(Values) instead of a hard-coded $ff constant, which makes it clear where the masking comes from.
Then the generated code is quick and optimized:
TestAll.dpr.114: begin
0064810C 81C400FCFFFF add esp,$fffffc00
TestAll.dpr.115: for ptr:=0 to 10 do Values[(ptr-5) and high(Values)]:=1;
00648112 33C0 xor eax,eax
00648114 8BD0 mov edx,eax
00648116 83EA05 sub edx,$05
00648119 81E2FF000000 and edx,$000000ff
0064811F C704940000803F mov [esp+edx*4],$3f800000
00648126 40 inc eax
00648127 83F80B cmp eax,$0b
0064812A 75E8 jnz -$18
TestAll.dpr.116: end;
0064812C 81C400040000 add esp,$00000400
00648132 C3 ret

Assembly: Creating an array of linked nodes

Firstly, if this question is inappropriate because I am not providing any code or not doing any thinking on my own, I apologize, and I will delete this question.
For an assignment, we are required to create an array of nodes to simulate a linked list. Each node has an integer value and a pointer to the next node in the list. Here is my .DATA section
.DATA
linked_list DWORD 5 DUP (?) ;We are allowed to assume the linked list will have 5 items
linked_node STRUCT
value BYTE ?
next BYTE ?
linked_node ENDS
I am unsure if I am defining my STRUCT correctly, as I am unsure what the type of next should be. I am also confused about how to approach this problem. To insert a node into linked_list, I should be able to write mov [esi+TYPE linked_list*ecx], correct? Of course, I'd need to inc ecx every time. What I'm confused about is how to do mov linked_node.next, "pointer to next node". Is there some sort of operator that would let me set linked_node.next equal to a pointer to the next index in the array? Or am I thinking about this incorrectly? Any help would be appreciated!
Think about your design in terms of a language you are familiar with. Preferably C, because pointers and values in C are concepts that map directly to asm.
Let's say you want to keep track of your linked list by storing a pointer to the head element.
#include <stdint.h> // for int8_t
struct node {
int8_t next; // array index. More commonly, you'd use struct node *next;
// negative values for .next are a sentinel, like a NULL pointer, marking the end of the list
int8_t val;
};
struct node storage[5]; // .next field indexes into this array
uint8_t free_position = 0; // when you need a new node, take index = free_position++;
int8_t head = -1; // start with an empty list
There are tricks to reduce corner cases, like having the list head be a full node, rather than just a reference (pointer or index). You can treat it as a first element, instead of having to check for the empty-list case everywhere.
Anyway, given a node reference int8_t p (where p is the standard variable name for a pointer to a list node, in linked list code), the next node is storage[p.next]. The next node's val is storage[p.next].val.
Let's see what this looks like in asm. The NASM manual talks about how its macro system can help you make code using global structs more readable, but I haven't done any macro stuff for this. You might define macros for NEXT and VAL or something, with values 0 and 1, so you can say [storage + rdx*2 + NEXT]. Or even a macro that takes an argument, so you could say [NEXT(rdx*2)]. If you're not careful, though, you could end up with code that's more confusing to read.
section .bss
storage: resw 5 ;; reserve 5 words of zero-initialized space
free_position: db 0 ;; uint8_t free_position = 0;
section .data
head: db -1 ;; int8_t head = -1;
section .text
; p is stored in rdx. It's an integer index into storage
; We'll access storage directly, without loading it into a register.
; (normally you'd have it in a reg, since it would be space you got from malloc/realloc)
; lea rsi, [rel storage] ;; If you want RIP-relative addressing.
;; There is no [RIP+offset + scale*index] addressing mode, because global arrays are for tiny / toy programs.
test edx, edx
js .err_empty_list ;; check for p=empty list (sign-bit means negative)
movsx eax, byte [storage + 2*rdx] ;; load p.next into eax, with sign-extension
test eax, eax
js .err_empty_list ;; check that there is a next element
movsx eax, byte [storage + 2*rax + 1] ;; load storage[p.next].val, sign extended into eax
;; The final +1 in the effective address is because the val byte is 2nd.
;; you could have used a 3rd register if you wanted to keep p.next around for future use
ret ;; or not, if this is just the middle of some larger function
.err_empty_list: ; .symbol is a local symbol, doesn't have to be unique for the whole file
ud2 ; TODO: report an error instead of running an invalid insns
Notice that we get away with shorter instruction encoding by sign-extending into a 32bit reg, not to the full 64bit rax. If the value is negative, we aren't going to use rax as part of an address. We're just using movsx as a way to zero-out the rest of the register, because mov al, [storage + 2*rdx] would leave the upper 56 bits of rax with the old contents.
Another way to do this would be to movzx eax, byte [...] / test al, al, because the 8-bit test is just as fast to encode and execute as a 32bit test instruction. Also, movzx as a load has one cycle lower latency than movsx, on AMD Bulldozer-family CPUs (although they both still take an integer execution unit, unlike Intel where movsx/zx is handled entirely by a load port).
Either way, movsx or movzx is a good way to load 8-bit data, because you avoid problems with reading the full reg after writing a partial reg, and/or a false-dependency (on the previous contents of the upper bits of the reg, even if you know you already zeroed it, the CPU hardware still has to track it). Except if you know you're not optimizing for Intel pre-Haswell, then you don't have to worry about partial-register writes. Haswell does dual-bookkeeping or something to avoid extra uops to merge the partial value with the old full value when reading. AMD CPUs, P4, and Silvermont don't track partial-regs separately from the full-reg, so all you have to worry about is the false dependency.
Also note that you can load the next and val packed together, like
.search_loop:
movzx eax, word [storage + rdx*2] ; next in al, val in ah
test ah, ah
jz .found_a_zero_val
movzx edx, al ; use .next for the next iteration
test al, al
jns .search_loop
;; if we get here, we didn't find a zero val
ret
.found_a_zero_val:
;; do something with the element referred to by `rdx`
Notice how we have to use movzx anyway, because all the registers in an effective address have to be the same size. (So word [storage + al*2] doesn't work.)
This is probably more useful going the other way, to store both fields of a node with a single store, like mov [storage + rdx*2], ax or something, after getting next into al, and val into ah, probably from separate sources. (This is a case where you might want to use a regular byte load, instead of a movzx, if you don't already have it in another register). This isn't a big deal: don't make your code hard to read or more complex just to avoid doing two byte-stores. At least, not until you find out that store-port uops are the bottleneck in some loop.
Using an index into an array, instead of a pointer, can save a lot of space, esp. on 64bit systems where pointers take 8 bytes. If you don't need to free individual nodes (i.e. data structure only ever grows, or is deleted all at once when it is deleted), then an allocator for new nodes is trivial: just keep sticking them at the end of the array, and realloc(3). Or use a c++ std::vector.
With those building blocks, you should be all set to implement the usual linked list algos. Just store bytes with mov [storage + rdx*2], al or whatever.
If you need ideas on how to implement linked lists with clean algos that handle all the special-cases with as few branches as possible, have a look at this Codereview question. It's for Java, but my answer is very C-style. The other answers have some nice tricks, too, some of which I borrowed for my answer. (e.g. using a dummy node avoids branching to handle the insertion-as-a-new-head special case).

Assembly: Error when attempting to increment at array index

Here's a small snippet of assembly code (TASM) where I simply try to increment the value at the current index of the array. The idea is that the "freq" array will store a number (DWord size) that represents how many times that ASCII character was seen in the file. To keep the code short, "b" stores the current byte being read.
Declared in data segment
freq DD 256 DUP (0)
b DB ?
___________
Assume b contains current byte
mov bl, b
sub bh, bh
add bx, bx
inc freq[bx]
I receive this error at compilation time at the line containing "inc freq[bx]": ERROR Argument to operation or instruction has illegal size.
Any insight is greatly appreciated.
There is no inc that can increment a dword in 16-bit mode (on a plain 8086 target). You will have to synthesize it from add/adc, such as:
add freq[bx], 1
adc freq[bx + 2], 0
You might need to add a size override, such as word ptr, or change your array definition to freq DW 512 DUP (0).
Also note that you have to scale the index by 4, not 2: each counter is a dword, so double bx twice (or use shl bx,2 on a 186+).

Which loop has better performance? Increment or decrement? [duplicate]

Possible Duplicate:
Is it faster to count down than it is to count up?
Which loop has better performance? I have learnt somewhere that the second is better, but I want to know the reason why.
for(int i=0;i<=10;i++)
{
/*This is better ?*/
}
for(int i=10;i>=0;i--)
{
/*This is better ?*/
}
The second "may" be better, because it's easier to compare i with 0 than to compare i with 10 but I think you can use any one of these, because compiler will optimize them.
I do not think there is much difference between the performance of both loops.
I suppose it becomes a different situation when the loops look like this:
for(int i = 0; i < getMaximum(); i++)
{
}
for(int i = getMaximum() - 1; i >= 0; i--)
{
}
Here the getMaximum() function is called only once in the second loop, but on every iteration in the first (assuming it is not an inline function).
Decrement loops down to zero can sometimes be faster if testing against zero is optimised in hardware. But it's a micro-optimisation, and you should profile to see whether it's really worth doing. The compiler will often make the optimisation for you, and given that the decrement loop is arguably a worse expression of intent, you're often better off just sticking with the 'normal' approach.
Incrementing and decrementing (INC and DEC, when translated into assembler instructions) have the same speed of 1 CPU cycle.
However, the second can theoretically be faster on some architectures (e.g. SPARC), because no constant 10 has to be fetched from memory (or cache): most architectures have instructions that deal in an optimized fashion with comparisons against the special value 0 (usually having a special hardwired zero register to use as an operand, so no register has to be "wasted" storing the 10 for each iteration's comparison).
A smart compiler (especially if the target instruction set is RISC) will detect this itself and (if your counter variable is not used inside the loop) apply the second "decrement down to 0" form.
Please see answers https://stackoverflow.com/a/2823164/1018783 and https://stackoverflow.com/a/2823095/1018783 for further details.
The compiler should optimize both code to the same assembly, so it doesn't make a difference. Both take the same time.
A more valid discussion would be whether
for(int i=0;i<10;++i) //preincrement
{
}
would be faster than
for(int i=0;i<10;i++) //postincrement
{
}
Because, theoretically, post-increment does an extra operation (it returns a copy of the old value). However, even this should be optimized to the same assembly.
Without optimizations, the code would look like this:
for ( int i = 0; i < 10 ; i++ )
0041165E mov dword ptr [i],0
00411665 jmp wmain+30h (411670h)
00411667 mov eax,dword ptr [i]
0041166A add eax,1
0041166D mov dword ptr [i],eax
00411670 cmp dword ptr [i],0Ah
00411674 jge wmain+68h (4116A8h)
for ( int i = 0; i < 10 ; ++i )
004116A8 mov dword ptr [i],0
004116AF jmp wmain+7Ah (4116BAh)
004116B1 mov eax,dword ptr [i]
004116B4 add eax,1
004116B7 mov dword ptr [i],eax
004116BA cmp dword ptr [i],0Ah
004116BE jge wmain+0B2h (4116F2h)
for ( int i = 9; i >= 0 ; i-- )
004116F2 mov dword ptr [i],9
004116F9 jmp wmain+0C4h (411704h)
004116FB mov eax,dword ptr [i]
004116FE sub eax,1
00411701 mov dword ptr [i],eax
00411704 cmp dword ptr [i],0
00411708 jl wmain+0FCh (41173Ch)
so even in this case, the speed is the same.
Again, the answer to all micro-performance questions is: measure, measure in the context of use, and don't extrapolate to other contexts.
Counting instruction execution times hasn't been feasible without extraordinary sophistication for quite a long time.
The mismatch between processor and memory speeds, and the introduction of caches to hide part of the latency (but not the bandwidth), make the execution time of a group of instructions very sensitive to memory access patterns. That is something you can still optimize for with fairly high-level thinking. But it also means that code which looks worse when the memory access pattern is ignored can turn out better once it is taken into account.
Then superscalar execution (the processor can do several things at once) and out-of-order execution (the processor can execute an instruction before a previous one in the program flow) make basic counting meaningless even if you ignore memory access. You have to know which instructions actually need to be executed (so ignoring part of the structure isn't wise) and how the processor can group instructions if you want a good a priori estimate.

emu8086 find minimum and maximum in an array

I have to come up with ASM code (for emu8086) that will find the minimum and maximum values in an array of any given size. The sample code my instructor provided contains (what appears to be) a data segment with an array named LIST. He claims that he will replace this list with other lists of different sizes, and our code must be able to handle it.
Here's the sample code below. I've highlighted the parts that I've added, just to show you that I've done my best to solve this problem:
; You may customize this and other start-up templates;
; The location of this template is c:\emu8086\inc\0_com_template.txt
org 100h
data segment
LIST DB 05H, 31H, 34H, 30H, 38H, 37H
MINIMUM DB ?
MAXIMUM DB ?
AVARAGE DB ?
**SIZE=$-OFFSET LIST**
ends
stack segment **;**
DW 128 DUP(0) **; I have NO CLUE what this is supposed to do**
ends **;**
code segment
start proc far
; set segment registers:
MOV AX,DATA **;**
MOV DS,AX **;I'm not sure what the point of this is, especially since I'm supposed to be the programmer, not my teacher.**
MOV ES,AX **;**
; add your code here
**;the number of elements in LIST is SIZE (I think)
MOV CX,SIZE ;a loop counter, I think
;find the minimum value in LIST and store it into MINIMUM
;begin loop
AGAIN1:
LEA SI,LIST
LEA DI,MINIMUM
MOV AL,[SI]
CMP AL,[SI+1]
If carry flag=1:{I got no idea}
LOOP AGAIN1
;find the maximum value in LIST and store it into MAXIMUM
;Something similar to the other loop, but this time I gotta find the max.
AGAIN2:
LEA SI,LIST
LEA DI,MINIMUM
MOV AL,[SI]
CMP AL,[SI-1] ;???
LOOP AGAIN2
**
; exit to operating system.
MOV AX,4C00H
INT 21H
start endp
ends
end start ; set entry point and stop the assembler.
ret
I'm not positive, but I think you want to move the SIZE variable immediately after the LIST variable:
data segment
LIST DB 05H, 31H, 34H, 30H, 38H, 37H
SIZE=$-OFFSET LIST
MINIMUM DB ?
MAXIMUM DB ?
AVARAGE DB ?
ends
What it does is give you the number of bytes between the current address ($) and the beginning of the LIST variable - thus giving you the size (in bytes) of the list variable itself. Because the LIST is an array of bytes, SIZE will be the actual length of the array. If LIST was an array of WORDS, you'd have to divide SIZE by two. If your teacher wrote that code then perhaps you should leave it alone.
I'm not entirely clear on why your teacher made a stack segment, I can't think of any reason to use it, but perhaps it will become clear in a future assignment. For now, you probably should know that DUP is shorthand for duplicate. This line of code:
DW 128 DUP(0)
Is allocating 128 WORDS of memory initialized to 0.
The following lines of code:
MOV AX,DATA
MOV DS,AX
MOV ES,AX
These set up your segment registers so that DS (and ES) point to the data segment; at that point your LIST is addressable through DS.
As for the rest... it looks like you have an endless loop. What you need to do is this (a high-level sketch of the same flow follows the steps):
1. Set SI to point to the beginning of the LIST
2. Set CX to the length of the LIST; you've done that
3. Copy the current byte from [SI] into AL
4. Compare AL to the memory variable MINIMUM
5. If AL is smaller, copy it to MINIMUM
6. Increment SI and decrement CX
7. If CX = 0 (check the ZERO flag), exit the loop; otherwise, go back to step 3
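If it helps to see the control flow outside of assembly, here is the min/max scan as a short Pascal sketch (an illustration only, with names of my choosing; the assignment itself still has to be written in 8086 assembly):
procedure FindMinMax(const List: array of Byte; out Minimum, Maximum: Byte);
var
  I: Integer;
begin
  Minimum := List[0];        // assumes List is not empty; start from the first element
  Maximum := List[0];
  for I := 1 to High(List) do
  begin
    if List[I] < Minimum then
      Minimum := List[I];    // the CMP + conditional jump of steps 4-5
    if List[I] > Maximum then
      Maximum := List[I];
  end;
end;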
