x86 using loops in order to minimize code (lining) [duplicate]

I'm having trouble understanding registers in x86 assembly. I know that EAX is the full 32 bits, AX is the lower 16 bits, and AH and AL are the higher and lower 8 bits of AX, but I'm stuck on this question:
If AL=10 and AH=10 what is the value in AX?
My thinking is to convert 10 into binary (1010), take that as both the higher and the lower byte of AX (0000 1010 0000 1010), and then convert the result to decimal (2570). Am I anywhere close to the right answer here, or way off?

As suggested by Peter Cordes, I would imagine the data as hexadecimal values:
RR RR RR RR EE EE HH LL
|           |     || ||
|           |     || AL
|           |     AH  |
|           |     |___|
|           |     AX  |
|           |_________|
|           EAX       |
|_____________________|
RAX
...where RAX is the 64-bit register which exists in x86-64.
So if you had AH = 0x12 and AL = 0x34, like this:
00 00 00 00 00 00 12 34
|           |     || ||
|           |     || AL
|           |     AH  |
|           |     |___|
|           |     AX  |
|           |_________|
|           EAX       |
|_____________________|
RAX
...then you would have AX = 0x1234 and EAX = 0x00001234, and so on.
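Applied to the numbers in the question, the same layout can be checked with a few lines of C. This is only a sketch of the arithmetic, not of how the CPU stores registers; the function name make_ax is made up for illustration.

```c
#include <stdint.h>

/* Combine the two byte registers the way the diagram stacks them:
   AH is the high byte of AX, AL is the low byte. */
uint16_t make_ax(uint8_t ah, uint8_t al) {
    return (uint16_t)(((unsigned)ah << 8) | al);
}
```

With AH = 10 and AL = 10 this gives 10 * 256 + 10 = 2570 (0x0A0A), so the reasoning in the question is correct.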
Note that, as shown in this chart, AH is the only "weird" register here which is not aligned with the lower bits. The others (AL, AX, EAX, RAX for 64-bit) are just different sizes but all aligned on the right. (For example, the two bytes marked EE EE in the chart don't have a register name on their own.)
Writing AL, AH, or AX merges the new value into the full RAX and leaves the other bytes unmodified; this behaviour is kept for historical reasons. Prefer a movzx eax, byte [mem] or movzx eax, word [mem] load if you don't specifically want this merging (see the related question "Why doesn't GCC use partial registers?").
Writing EAX zero-extends into RAX (see "Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?").
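The two write rules can be modelled in C on a 64-bit integer that stands in for RAX. This is a rough sketch of the architectural result only; the function names are hypothetical and nothing here reflects how hardware implements partial registers.

```c
#include <stdint.h>

/* Writes to AL, AH and AX replace only their own bits of RAX. */
uint64_t write_al(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF)) | v;
}

uint64_t write_ah(uint64_t rax, uint8_t v) {
    return (rax & ~UINT64_C(0xFF00)) | ((uint64_t)v << 8);
}

uint64_t write_ax(uint64_t rax, uint16_t v) {
    return (rax & ~UINT64_C(0xFFFF)) | v;
}

/* A write to EAX discards the old upper 32 bits: the result is v zero-extended. */
uint64_t write_eax(uint64_t rax, uint32_t v) {
    (void)rax; /* the previous contents do not matter */
    return v;
}
```

The asymmetry is the point: the first three functions depend on the old value of RAX, the last one does not.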

Related

Assembly Language x86 - Registers Set and Arithmetic and Loop

I am trying to solve this problem about loops. I am using a push and pop method instead of using a separate register to store data.
.model small
.stack
.code
m proc
        mov ax,0b800h
        mov es,ax
        mov di,7d0h
        mov ah,7        ; normal attribute
        mov al,'A'
        mov cx,5
x:      stosw
        push ax         ;mov dl,al ; dl='A'
        push di
        mov al,'1'
        stosw
        pop di
        add di,158
        pop ax          ;mov al,dl
        inc al
        loop x
        mov ah,4ch
        int 21h
m endp
end m
I can't get the digit loaded by mov al,'1' to advance on each pass through the loop, so every row shows '1'.
The output should be like this:
A1
B2
C3
D4
E5
Can anyone show the correct code? Thank you.

Consider the ASCII codes involved:
Pair   Letter   Digit   Difference
A1       65      49        16
B2       66      50        16
C3       67      51        16
D4       68      52        16
E5       69      53        16
See how the difference is always 16? That is what the next solution exploits:
...
        mov  ax, 0700h + 'A'      ; WhiteOnBlack 'A'
x:      stosw                     ; Stores one letter from {A, B, C, D, E}
        sub  al, 'A' - '1'        ; Convert from letter to digit
        stosw                     ; Stores one digit from {1, 2, 3, 4, 5}
        add  al, 'A' - '1' + 1    ; Restore AL and at the same time increment
        add  di, 160 - 4          ; Move down on the screen
        cmp  al, 'E'
        jbe  x
...
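You don't always need CX and the LOOP instruction to write a loop; any compare-and-branch pair works. LOOP is also best avoided for speed reasons: on many x86 CPUs (most Intel ones in particular) it is slower than an equivalent dec/jnz or cmp/jcc pair.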
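The same idea can be written in C against an ordinary array that stands in for the 80x25 text screen. This is a sketch under assumptions: the buffer is a plain uint16_t array rather than real video memory at B800h, each cell is the attribute in the high byte and the character in the low byte, and the name draw_pairs is made up.

```c
#include <stddef.h>
#include <stdint.h>

enum { COLS = 80, ROWS = 25 };

/* Writes "A1", "B2", ... "E5" on five consecutive rows, starting at
   start_cell. The digit is derived from the letter by subtracting the
   constant difference 'A' - '1' (16), as in the assembly version. */
void draw_pairs(uint16_t *screen, size_t start_cell) {
    const uint16_t attr = 0x0700;          /* white on black */
    size_t cell = start_cell;
    for (unsigned char letter = 'A'; letter <= 'E'; letter++) {
        unsigned char digit = (unsigned char)(letter - ('A' - '1'));
        screen[cell]     = (uint16_t)(attr | letter);
        screen[cell + 1] = (uint16_t)(attr | digit);
        cell += COLS;                      /* same column, next row */
    }
}
```

The byte offset 7D0h used in the question is 2000, which corresponds to cell 1000 (row 12, column 40) because each cell is two bytes.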

Manual vectorization using AVX vector intrinsics only runs about the same speed as 4 scalar FP adds on Ryzen?

So I decided to take a look at how to use SSE, AVX, ... in C via Intel® Intrinsics. Not because of any actual interest to use it for something, but out of pure curiosity. Trying to check if code using AVX is actually faster than non-AVX code, I was a bit surprised by the results. Here is my C code:
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <immintrin.h>
/*** Sum up two vectors using AVX ***/
#define __vec_sum_4d_d64(src_vec1, src_vec2, dst_vec) \
_mm256_store_pd(dst_vec, _mm256_add_pd(_mm256_load_pd(src_vec1), _mm256_load_pd(src_vec2)));
/*** Sum up two vectors without AVX ***/
#define __vec_sum_4d(src_vec1, src_vec2, dst_vec) \
dst_vec[0] = src_vec1[0] + src_vec2[0];\
dst_vec[1] = src_vec1[1] + src_vec2[1];\
dst_vec[2] = src_vec1[2] + src_vec2[2];\
dst_vec[3] = src_vec1[3] + src_vec2[3];
int main (int argc, char *argv[]) {
unsigned long i;
double dvec1[4] = {atof(argv[1]), atof(argv[2]), atof(argv[3]), atof(argv[4])};
double dvec2[4] = {atof(argv[5]), atof(argv[6]), atof(argv[7]), atof(argv[8])};
#if 1
for (i = 0; i < 3000000000; i++) {
__vec_sum_4d(dvec1, dvec2, dvec2);
}
#endif
#if 0
for (i = 0; i < 3000000000; i++) {
__vec_sum_4d_d64(dvec1, dvec2, dvec2);
}
#endif
printf("%10.10lf %10.10lf %10.10lf %10.10lf\n", dvec2[0], dvec2[1], dvec2[2], dvec2[3]);
}
I simply switch #if 1 to #if 0 and the other way around to switch between "modes" (AVX and non-AVX).
I expected the loop using AVX to be at least somewhat faster than the other one, but it isn't. I compiled the code with gcc version 10.2.0 (GCC) and these flags: -O2 --std=gnu99 -lm -mavx2.
> time ./noavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.150s
user 0m2.147s
sys 0m0.000s
> time ./withavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.168s
user 0m2.165s
sys 0m0.000s
As you can see, they run at practically the same speed. I also tried increasing the number of iterations by a factor of ten, but the run times simply scale up proportionally. Also note that the printed output values are the same for both executables, so I think it is safe to say that both perform the same calculations. Digging deeper, I took a look at the assembly and was even more confused. Here are the important parts of both (only the loop):
; With avx
1070: c5 fd 58 c1 vaddpd %ymm1,%ymm0,%ymm0
1074: 48 83 e8 01 sub $0x1,%rax
1078: 75 f6 jne 1070
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
In my understanding the second one should be way slower since besides decrementing the counter and the conditional jump there are four times as many instructions in it. Why is it not slower? Is the vaddsd instruction just four times faster than vaddpd?
If this is relevant, my system runs on an AMD Ryzen 5 2600X Six-Core Processor, which supports AVX.

With AVX
; With avx
1070: c5 fd 58 c1 vaddpd %ymm1,%ymm0,%ymm0
1074: 48 83 e8 01 sub $0x1,%rax
1078: 75 f6 jne 1070
This loop uses ymm0 as an accumulator. In other words, it is doing ymm0 += ymm1 (a vector operation that adds 4 double values at once). It therefore has a loop-carried dependency on ymm0: every new addition has to wait for the previous addition to finish, because it uses that result as an input. vaddpd has a latency of 3 cycles and a throughput of one instruction per cycle on Zen+ (according to https://www.uops.info/table.html). The loop-carried dependency makes this loop bottleneck on the latency of vaddpd, so it can run at best at 3 cycles/iteration. Only one vaddpd is in flight in the CPU at any time, which under-utilizes its capability by a lot.
To make this faster add more accumulators (have more vectors to sum). It can (in theory) get 3 times faster due to pipelining (3 full ymm additions in-flight), as long as it does not get limited by something else.
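The multiple-accumulator idea is easiest to see in scalar C. The sketch below is an illustration under assumptions, not a rewrite of the benchmark: it sums an array (so it contains loads, unlike the loop above), the function names are made up, and whether the compiler keeps the four chains separate depends on the compiler and flags.

```c
#include <stddef.h>

/* One dependency chain: every add needs the result of the previous add. */
double sum_one_chain(const double *x, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four independent chains: adds into a0..a3 do not depend on each other,
   so several of them can be in flight at the same time. */
double sum_four_chains(const double *x, size_t n) {
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Splitting the sum changes the order of the floating-point additions, so the two functions can differ in the last bits for general inputs; for small integer values stored as doubles they agree exactly.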
Without AVX
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
This loop accumulates results into 4 different accumulators. Basically it is doing:
xmm0 += xmm4
xmm1 += xmm5
xmm2 += xmm7
xmm3 += xmm6
All of these additions are independent of each other (and they are scalar additions, so each operates on a single 64-bit floating-point value). vaddsd has a latency of 3 cycles and a reciprocal throughput of 0.5 cycles per instruction, which means the CPU can start the first two additions in one cycle and the second pair in the next. Based on throughput alone, this loop could therefore reach 2 cycles/iteration. But each accumulator's next addition has to wait for the 3-cycle latency of the previous one, so this loop is also bottlenecked on latency. Unroll once (with 4 additional accumulators; alternatively, break the loop-carried dependency chain within the loop by adding xmm4-7 to each other before adding the result to the main accumulator) to get rid of that bottleneck; it may get ~50% faster.
Note that this ("without AVX") disassembly is still using VEX encoding, so technically still requires AVX-capable CPU.
On Benchmarking
Note that your disassembly does not have any loads or stores, so this may or may not be representative of performance comparison for adding 2 arrays of 4-double vectors.

You are dealing with a latency issue. Depending on the CPU you have to wait 3 or 4 cycles until you can use the result of a vaddpd or vaddsd instruction. But within 1 cycle up to 2 vaddpd or vaddsd instructions can be executed (if the CPU does not have to wait for source registers).
Since in your loop
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
each vaddsd depends on its own result from the previous iteration, so it has to wait 3 or 4 cycles before it can execute. But all four vaddsd instructions, the sub and the jne can execute during that time. Therefore, for this simple loop, it makes no difference whether you execute one vaddpd or four vaddsd.
To fully saturate the vaddpd instruction, you need to execute 6 or 8 of them that do not depend on each other's results (or have other instructions which do some independent work).
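The numbers in both answers follow from a very simple model, sketched here in C. It is deliberately simplified and assumes that every add in the loop body belongs to its own loop-carried chain and that nothing else (front end, loads, stores) limits the loop; the function names are made up.

```c
/* Independent adds needed to keep the FP adders busy:
   latency (cycles) times adds that can start per cycle. */
unsigned chains_to_saturate(unsigned latency_cycles, unsigned adds_per_cycle) {
    return latency_cycles * adds_per_cycle;
}

/* Cycles per loop iteration: the larger of the latency bound
   (each chain advances one add per iteration) and the throughput bound. */
double cycles_per_iteration(unsigned latency_cycles, unsigned adds_per_cycle,
                            unsigned independent_adds) {
    double throughput_bound = (double)independent_adds / adds_per_cycle;
    double latency_bound = (double)latency_cycles;
    return throughput_bound > latency_bound ? throughput_bound : latency_bound;
}
```

With the Zen+ figures quoted above, one vaddpd per iteration (latency 3, one per cycle) and four vaddsd per iteration (latency 3, two per cycle) both come out at 3 cycles per iteration, which matches the nearly identical run times. Eight independent vaddsd give 4 cycles for twice the work, and latency 3 or 4 with two adds per cycle gives the "6 or 8" independent instructions mentioned above.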

COMISD not comparing properly [duplicate]

As part of a compiler project I have to write GNU assembler code for x86 to compare floating point values. I have tried to find resources on how to do this online and from what I understand it works like this:
Assuming the two values I want to compare are the only values on the floating point stack, then the fcomi instruction will compare the values and set the CPU-flags so that the je, jne, jl, ... instructions can be used.
I'm asking because this only works sometimes. For example:
.section .data
msg: .ascii "Hallo\n\0"
f1: .float 10.0
f2: .float 9.0
.globl main
.type main, #function
main:
flds f1
flds f2
fcomi
jg leb
pushl $msg
call printf
addl $4, %esp
leb:
pushl $0
call exit
will not print "Hallo" even though I think it should, and if you switch f1 and f2 it still won't which is a logical contradiction. je and jne however seem to work fine.
What am I doing wrong?
PS: does the fcomip pop only one value or does it pop both?

TL;DR: Use above / below conditions (like for unsigned integers) to test the result of FP compares.
For historical reasons, FP compares set CF, not OF / SF: fcomi (new in PPro) sets FLAGS the same way the older fcom / fstsw / sahf sequence does when it maps the FP status word into FLAGS. See also http://www.ray.masmcode.com/tutorial/fpuchap7.htm
Modern SSE/SSE2 scalar compares into FLAGS follow this as well, with [u]comiss / sd. (Unlike SIMD compares, which have a predicate as part of the instruction, as an immediate, since they only produce a single all-zeros / all-ones result for each element, not a set of FLAGS.)
This is all coming from Volume 2 of Intel 64 and IA-32 Architectures Software Developer's Manuals.
FCOMI sets only some of the flags that CMP does. Because the values are pushed onto a stack, your code has %st(0) == 9 and %st(1) == 10. Referring to the table on page 3-348 in Volume 2A, you can see that this is the case "ST0 < ST(i)", so it will clear ZF and PF and set CF. Meanwhile, on pg. 3-544 of Vol. 2A you can read that JG means "Jump short if greater (ZF=0 and SF=OF)". In other words, it tests the sign, overflow and zero flags, but FCOMI doesn't set sign or overflow!
Depending on which conditions you wish to jump, you should look at the possible comparison results and decide when you want to jump.
+--------------------+---+---+---+
| Comparison results | Z | P | C |
+--------------------+---+---+---+
| ST0 > ST(i) | 0 | 0 | 0 |
| ST0 < ST(i) | 0 | 0 | 1 |
| ST0 = ST(i) | 1 | 0 | 0 |
| unordered | 1 | 1 | 1 | one or both operands were NaN.
+--------------------+---+---+---+
I've made this small table to make it easier to figure out:
+--------------+---+---+-----+------------------------------------+
| Test | Z | C | Jcc | Notes |
+--------------+---+---+-----+------------------------------------+
| ST0 < ST(i) | X | 1 | JB | ZF will never be set when CF = 1 |
| ST0 <= ST(i) | 1 | 1 | JBE | Either ZF or CF is ok |
| ST0 == ST(i) | 1 | X | JE | CF will never be set in this case |
| ST0 != ST(i) | 0 | X | JNE | |
| ST0 >= ST(i) | X | 0 | JAE | As long as CF is clear we are good |
| ST0 > ST(i) | 0 | 0 | JA | Both CF and ZF must be clear |
+--------------+---+---+-----+------------------------------------+
Legend: X: don't care, 0: clear, 1: set
In other words the condition codes match those for using unsigned comparisons. The same goes if you're using FMOVcc.
If either (or both) operand to fcomi is NaN, it sets ZF=1 PF=1 CF=1. (FP compares have 4 possible results: >, <, ==, or unordered). If you care what your code does with NaNs, you may need an extra jp or jnp. But not always: for example, ja is only true if CF=0 and ZF=0, so it will be not-taken in the unordered case. If you want the unordered case to take the same execution path as below or equal, then ja is all you need.
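The flag table and the branch conditions can be modelled in a few lines of C. This is only a sketch of the documented ZF/PF/CF results, not an emulator; the struct and function names are made up for illustration.

```c
#include <math.h>
#include <stdbool.h>

struct fp_flags { bool zf, pf, cf; };

/* Flags produced by comparing ST0 with ST(i), following the table above. */
struct fp_flags fcomi_flags(double st0, double sti) {
    struct fp_flags f = { false, false, false };   /* ST0 > ST(i) */
    if (isnan(st0) || isnan(sti)) {
        f.zf = f.pf = f.cf = true;                 /* unordered */
    } else if (st0 < sti) {
        f.cf = true;
    } else if (st0 == sti) {
        f.zf = true;
    }
    return f;
}

/* The "unsigned" conditions only look at CF and ZF. */
bool cond_ja(struct fp_flags f)  { return !f.cf && !f.zf; }
bool cond_jae(struct fp_flags f) { return !f.cf; }
bool cond_jb(struct fp_flags f)  { return f.cf; }
bool cond_jbe(struct fp_flags f) { return f.cf || f.zf; }
bool cond_jp(struct fp_flags f)  { return f.pf; }   /* true only when unordered */
```

For the values in the question, fcomi_flags(9.0, 10.0) sets only CF, so ja is not taken and jbe is. With a NaN operand all three flags are set, so ja is still not taken while jb, jbe and jp all are.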
Here you should use JA if you want it to print (i.e. if (!(f2 > f1)) { puts("hello"); }) and JBE if you don't (which corresponds to if (!(f2 <= f1)) { puts("hello"); }). Note that this can be a little confusing, because the message is printed only when the jump is not taken.
Regarding your second question: by default fcomi doesn't pop anything. You want its close cousin fcomip which pops %st0. You should always clear the fpu register stack after usage, so all in all your program ends up like this assuming you want the message printed:
.section .rodata
msg: .ascii "Hallo\n\0"
f1: .float 10.0
f2: .float 9.0
.text                       # code must go in an executable section
.globl main
.type main, @function
main:
    flds f1
    flds f2
    fcomip                  # compare %st(0) with %st(1), then pop %st(0)
    fstp %st(0)             # pop the remaining value to clear the stack
    ja leb                  # won't jump, jbe will
    pushl $msg
    call printf
    addl $4, %esp
leb:
    pushl $0
    call exit

Assembly language array multiplication bug using bit shift

I have a bug in one of my loops and I can't fix it. It is part of my HW assignment for school.
I have an array, with 20 elements, and I need to multiply every element by 2, using bit shift.
It kind of works, but every time I have a carry, it is adding 2 to the previous element in the array, instead of one. I can't propagate the carry through the array properly.
This is my first semester with assembly, so I appreciate your help. Also, please keep it simple if you can. Thank you.
This is what I want:
0000000009 ==> 0000000018
0000000099 ==> 0000000198
This is what I am getting.
0000000009 ==> 0000000028
0000000099 ==> 00000002108
Here is the code.
ARR1 DB 20 DUP (0)
MULTIPLYING PROC
MOV AX, 0
MOV CX, 19
.WHILE CX != 0
MOV DI, CX
MOV AL, [DIGIT_ARR1+DI]
;MOV BL, 2
;MUL BL
SHL AX, 1
.IF AX > 9 ; IF THE NEW DIGIT IS LARGER THAN 9
SUB AX, 10
MOV AH, 0
MOV [DIGIT_ARR1+DI], AL
DEC DI
ADD [DIGIT_ARR1+DI], 1
.ELSEIF
MOV [DIGIT_ARR1+DI], AL ; IF IT IS LESS THAN 9, THEN JUST INSERT IT BACK INTO THE ARRAY
.ENDIF
DEC CX
.ENDW
RET
MULTIPLYING ENDP

So it turned out that Phil was correct: I was overwriting my own data. The trick was to read the value, do the multiplication using a bit shift, add the carry (if any), and only then write back to the array. This way I can multiply each element by two without corrupting my data.
NOTE FOR BEGINNERS LIKE ME: a left shift by one bit only multiplies by two (a shift by n bits multiplies by 2^n), so if you need to multiply by something that is not a power of two, use mul or imul. Likewise, a right shift by one bit divides an unsigned value by two.
The loop below multiplies the unpacked BCD number in the array by two. Division works the same way, except that you process the array in the other direction and add 10 to the next digit when you have a carry. You also have to make sure you don't add the carry once you reach the end of the array; assembly language doesn't check whether you are out of bounds.
MULTIPLYING PROC
    PUSH CX
    PUSH AX
    MOV CARRY, 0            ; START WITH NO CARRY (CARRY IS ASSUMED TO BE A BYTE VARIABLE)
    MOV CX, 20              ; ARRAY SIZE
    .WHILE CX > 0
        MOV DI, CX
        DEC DI              ; INDEX OF THE CURRENT DIGIT (19 DOWN TO 0)
        MOV AL, [ARR1+DI]   ; READ THE DIGIT
        SHL AL, 1           ; DOUBLE THE DIGIT
        ADD AL, CARRY       ; ADD THE CARRY FROM THE PREVIOUS DIGIT
        MOV CARRY, 0        ; CLEAR THE CARRY
        .IF AL > 9          ; IF THE NEW DIGIT IS LARGER THAN 9
            SUB AL, 10
            MOV CARRY, 1    ; REMEMBER THE CARRY FOR THE NEXT DIGIT
        .ENDIF
        MOV [ARR1+DI], AL   ; WRITE THE DOUBLED DIGIT BACK TO THE ARRAY
        DEC CX
    .ENDW                   ; END OF THE DIGIT LOOP
    POP AX
    POP CX
    RET
MULTIPLYING ENDP
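For comparison, here is the same read, double, add carry, write back sequence as a C sketch. It assumes one decimal digit (0 to 9) per byte with the most significant digit at index 0, as in the examples above; the function name is made up.

```c
#include <stddef.h>

/* Doubles a decimal number stored one digit per byte, most significant
   digit first. Returns the carry out of the top digit (0 or 1). */
unsigned double_digits(unsigned char *digits, size_t n) {
    unsigned carry = 0;
    for (size_t i = n; i-- > 0; ) {            /* least significant digit first */
        unsigned v = ((unsigned)digits[i] << 1) + carry;
        carry = v > 9;                          /* 1 if the digit overflowed */
        digits[i] = (unsigned char)(carry ? v - 10 : v);
    }
    return carry;
}
```

Because the carry lives in its own variable and is added after the shift, no digit is modified before it has been read, which is exactly what went wrong in the original loop.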

I think the problem is that you're writing back to the same array of addresses that you're reading from. When you carry, it is corrupting the calculation.
e.g.
From: 99
Position 19 = 9
9*2 = 18 (set position 19 to 8, increment position 18 to 10)
Position 18 = 10
10*2 = 20 (set position 18 to 10, increment position 17 to 1)
Position 17 = 1
1*2 = 2 (set position 17 to 2)
Result: 2108
But you still need to do more work, because even with a separate, empty destination array you get
Position 19 = 9
9*2 = 18 (set position 19 to 8, increment position 18 to 1)
Position 18 = 9
9*2 = 18 (set position 18 to 8, increment position 17 to 1)
Position 17 = 1
1*2 = 2 (set position 17 to 2)
Result: 288
You need to add the 8 to the 1 already at position 18 so that you get 9. And you don't do a third multiplication, because position 17 is empty in the source array. I hope this makes sense.
You shouldn't get any digit overflow errors when multiplying by 2 like this, but you may need to handle it when multiplying by larger numbers.

Setting text and background in Assembly intel

I have a programming assignment to run through all possible combinations of text and background colors. I am using a predefined function called SetTextColor, which takes its value like this:
mov eax, white + (blue * 16)
This sets the text to white and the background to blue (to set the background you multiply by 16). That gives 16 x 16 = 256 combinations.
TITLE BACKGROUND COLORS (main.asm)
; Description: T
; Author: Chad Peppers
; Revision date: June 21, 2012
INCLUDE Irvine32.inc
.data
COUNT = 16
COUNT2 = 16
LCOUNT DWORD ?
val1 DWORD 0
val2 DWORD 0
.code
main PROC
mov ecx, COUNT
L1:
mov LCOUNT, ecx
mov ecx, COUNT2
L2:
mov eax, val1 + (val2 * 16)
call SetTextColor
inc val2
Loop L2
mov ecx, LCOUNT
Loop L1
call DumpRegs
exit
main ENDP
END main
Basically I am doing a nested loop. My thinking is that I simply do a 1 * (1 * 16), then inc the value in a nested loop until 1 * (16 * 16).
I am getting the error A2026: constant expected

I imagine the error you are getting is at this line:
mov eax, val1 + (val2 * 16)
You just can't do that: val1 and val2 are memory variables, and the assembler can only evaluate such an expression when its operands are constants. If you intend to multiply val2 by 16 and then add val1 to the result, you need to implement it step by step. (You may come across addressing in the form base + index*scale, but there base and index must be registers and scale can only be 1, 2, 4 or 8, not 16.) Try replacing this line with something like this:
mov eax, val2
imul eax, 16
add eax, val1
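The value being built, and the nested loop from the question, look like this as a C sketch. It assumes the usual layout of a text attribute byte (foreground in the low 4 bits, background in the high 4 bits) and that white is 15 and blue is 1 as in Irvine32; the function names are made up.

```c
#include <stdint.h>

/* foreground + background * 16, with both values limited to 0..15 */
uint8_t text_attribute(uint8_t foreground, uint8_t background) {
    return (uint8_t)((foreground & 0x0F) | ((background & 0x0F) << 4));
}

/* Enumerates all 16 x 16 combinations, background in the outer loop. */
void all_attributes(uint8_t out[256]) {
    unsigned n = 0;
    for (unsigned bg = 0; bg < 16; bg++)
        for (unsigned fg = 0; fg < 16; fg++)
            out[n++] = text_attribute((uint8_t)fg, (uint8_t)bg);
}
```

In the assembly version the inner loop would compute the value in EAX with the three instructions above before each call to SetTextColor, and the outer loop would advance the other color after resetting the inner one.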
