I am having trouble understanding the difference between ja and jg for assembly language. I have a section of code:
cmp dh, dl
j-- hit
and am asked which conditional jump (replacing j-- hit) will be taken when DX holds the hex value 0680.
This would make dl = 06 and dh = 80, so when comparing, 80 > 06. I know that jg fits this, as we can directly compare the results, but how should I work out whether ja fits (or, in this case, does not fit) this code?
If dx is 0x0680, then dh is 0x06 and dl is 0x80.
0x80 is interpreted as 128 in unsigned mode, and as -128 in signed mode.
Thus you have to use jg, since (signed) 6 > -128, whereas (unsigned) 6 < 128. jg does a signed comparison; ja does an unsigned one.
The difference between ja and jg is that the comparison is unsigned for ja and signed for jg (the operands are treated as unsigned vs. signed integers).
If the numbers are guaranteed to be non-negative (i.e. the sign bit is 0), then either will do. Otherwise you have to be careful.
You really can't tell from the cmp instruction itself whether ja is applicable. You have to look at the context and decide whether the sign bit will be an issue.
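To make the signed-vs-unsigned distinction concrete, here is a small C sketch (my illustration, not part of the original answer) that mimics the two interpretations of the same byte pattern:

#include <stdio.h>

int main(void)
{
    unsigned char dh = 0x06, dl = 0x80;

    /* ja semantics: compare the raw bytes as unsigned values (6 vs 128) */
    printf("unsigned 0x06 > 0x80 : %d\n", dh > dl);   /* prints 0 */

    /* jg semantics: reinterpret the same bytes as signed values
       (6 vs -128 on the usual two's-complement platforms) */
    printf("signed   0x06 > 0x80 : %d\n", (signed char)dh > (signed char)dl); /* prints 1 */
    return 0;
}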
Related
I noticed that when EDX contains some leftover value like 00401000 and I then use a DIV instruction like this:
mov eax,10
mov ebx,5
div ebx
it causes an INTEGER OVERFLOW ERROR. However, if I set edx to 0 and do the same thing it works. I believed that using div would result in the quotient overwriting eax and the remainder overwriting edx.
Getting this INTEGER OVERFLOW ERROR really confuses me.
What to do
For 32-bit / 32-bit => 32-bit division: zero- or sign-extend the 32-bit dividend from EAX into 64-bit EDX:EAX.
For 16-bit division, extend AX into DX:AX with cwd (signed) or by xor-zeroing DX (unsigned).
unsigned: XOR EDX,EDX then DIV divisor
signed: CDQ then IDIV divisor
See also When and why do we sign extend and use cdq with mul/div?
Why (TL;DR)
For DIV, the registers EDX and EAX form a single 64-bit value (often written as EDX:EAX), which is then divided, in this case, by EBX.
So if EAX = 10 (hex A) and EDX is, say, 20 (hex 14), then together they form the 64-bit value hex 14 0000 000A, or decimal 85899345930. If this is divided by 5, the result is 17179869186, or hex 4 0000 0002, which is a value that does not fit in 32 bits.
That is why you get an integer overflow.
If, however, EDX were only 1, you would divide hex 1 0000 000A by 5, which results in hex 3333 3335. That is not the value you wanted, but it does not cause an integer overflow.
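A quick C sketch of that arithmetic (my illustration, using 64-bit math to stand in for EDX:EAX):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t eax = 10, ebx = 5;
    uint32_t edx = 20;                                /* leftover garbage instead of 0 */

    uint64_t dividend = ((uint64_t)edx << 32) | eax;  /* EDX:EAX = hex 14 0000 000A */
    uint64_t quotient = dividend / ebx;               /* hex 4 0000 0002 */

    /* the real DIV raises a divide error when the quotient doesn't fit in 32 bits */
    printf("quotient %llu fits in EAX? %s\n",
           (unsigned long long)quotient,
           quotient <= 0xFFFFFFFFu ? "yes" : "no");
    return 0;
}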
To really divide 32 bit register EAX by another 32 bit register, take care that the top of the 64 bit value formed by EDX:EAX is 0.
So, before a single division, you should generally set EDX to 0.
(Or, for signed division, use cdq to sign-extend EAX into EDX:EAX before idiv.)
But EDX does not always have to be 0. It just must not be so big that the resulting quotient overflows 32 bits.
One example from my BigInteger code:
After a division with DIV, the quotient is in EAX and the remainder is in EDX. To divide something like a BigInteger, which consists of an array of many DWORDS, by 10 (for instance to convert the value to a decimal string), you do something like the following:
; ECX contains number of "limbs" (DWORDs) to divide by 10
XOR EDX,EDX ; before start of loop, set EDX to 0
MOV EBX,10
LEA ESI,[EDI + 4*ECX - 4] ; now points to top element of array
#DivLoop:
MOV EAX,[ESI]
DIV EBX ; divide EDX:EAX by EBX. After that,
; quotient in EAX, remainder in EDX
MOV [ESI],EAX
SUB ESI,4 ; remainder in EDX is re-used as top DWORD...
DEC ECX ; ... for the next iteration, and is NOT set to 0.
JNE #DivLoop
After that loop, the value represented by the entire array (i.e. by the BigInteger) is divided by 10, and EDX contains the remainder of that division.
FWIW, in the assembler I use (Delphi's built-in assembler), labels starting with # are local to the function, i.e. they don't interfere with equally named labels in other functions.
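For comparison, here is a rough C equivalent of that loop (my sketch: the names are made up, and a compiler will not necessarily emit a single DIV for the 64-by-32 division):

#include <stddef.h>
#include <stdint.h>

/* Divide a little-endian array of 32-bit limbs in place by a small
   divisor, returning the final remainder (what EDX holds after the loop). */
uint32_t limbs_div_small(uint32_t *limbs, size_t count, uint32_t divisor)
{
    uint64_t rem = 0;                          /* EDX, zeroed before the loop */
    for (size_t i = count; i-- > 0; ) {        /* top limb first, like ESI walking down */
        uint64_t cur = (rem << 32) | limbs[i]; /* EDX:EAX */
        limbs[i] = (uint32_t)(cur / divisor);  /* quotient back into the array */
        rem = cur % divisor;                   /* remainder carried into the next step */
    }
    return (uint32_t)rem;
}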
The DIV instruction divides EDX:EAX by the r/m32 operand that follows it. So, if you fail to set EDX to zero, the 64-bit dividend EDX:EAX can be extremely large, and the quotient may then not fit into EAX.
Hope that helps.
In C, we often use char for small number representations. However, the processor always reads 32-bit values from its registers. So every time we use a char (8 bits) in our program, the processor needs to fetch 32 bits from a register and extract the 8 bits out of it.
Hence, does it make sense to use int more often in place of char if memory is not the limitation?
Will it 'help' the processor?
There's the compiler part and the cpu part.
If you tell the compiler you're using a char instead of an int, then during static analysis it knows the variable's range is 0-255 instead of 0-(2^32-1). This allows it to optimize your program better.
On the CPU side, your assumption isn't always correct. Take x86 as an example: it gives you eax and al for 32-bit and 8-bit access to the same register. If you only want to work with chars, using al is sufficient; there is no performance loss.
I did some simple benchmarks in response to the comments below:
al:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc al
add al, al
dec al
dec al
loopd short loop_start
ret
eax:
format PE GUI 4.0
xor ecx, ecx
dec ecx
loop_start:
inc eax
add eax, eax
dec eax
dec eax
loopd short loop_start
ret
times:
$ time ./test_al.exe
./test_al.exe 0.01s user 0.00s system 0% cpu 7.102 total
$ time ./test_eax.exe
./test_eax.exe 0.01s user 0.01s system 0% cpu 7.120 total
So in this case, al is slightly faster, but sometimes eax came out faster. The difference is really negligible. CPUs aren't so simple, though: there can be code-alignment issues, caching effects, and other things going on, so it's best to benchmark your own code to see if there's any performance improvement. But IMO, unless your code is super tight, it's best to trust the compiler to optimize things.
I'd stick to int if I were you, as that is probably the most native integral type for your platform. Internally, you can expect shorter types to be promoted to int, which can actually degrade performance.
You should never use char and expect it to be consistent across platforms. Although the C standard defines sizeof(char) to be 1, char itself could be signed or unsigned. The choice is down to the compiler.
If you believe that you can squeeze some performance gain in using an 8 bit type then be explicit and use signed char or unsigned char.
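A tiny illustration of that point (my example): the same bit pattern prints differently depending on which char type you pick, and plain char behaves like one or the other depending on the compiler.

#include <stdio.h>

int main(void)
{
    signed char   sc = (signed char)0x80;   /* -128 on the usual two's-complement targets */
    unsigned char uc = (unsigned char)0x80; /* always 128 */
    char          c  = (char)0x80;          /* -128 or 128: the compiler's choice */
    printf("%d %d %d\n", sc, uc, c);
    return 0;
}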
From the ARM System Developer's Guide:
"most ARM data processing operations are 32-bit only. For this reason, you should use
a 32-bit datatype, int or long, for local variables wherever possible. Avoid using char and
short as local variable types, even if you are manipulating an 8- or 16-bit value"
Here is example code from the book to prove the point. Note the wrap-around handling for char, as opposed to unsigned int.
int checksum_v1(int *data)
{
char i;
int sum = 0;
for (i = 0; i < 64; i++)
{
sum += data[i];
}
return sum;
}
ARM7 assembly when using i as a char
checksum_v1
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v1_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1 = i+1
AND r1,r1,#0xff ; i = (char)r1
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v1_loop ; if (i<64) loop
MOV pc,r14 ; return sum
ARM7 assembly when i is an unsigned int. The AND r1,r1,#0xff from the char version is gone: the compiler no longer has to emulate 8-bit wrap-around in a 32-bit register.
checksum_v2
MOV r2,r0 ; r2 = data
MOV r0,#0 ; sum = 0
MOV r1,#0 ; i = 0
checksum_v2_loop
LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
ADD r1,r1,#1 ; r1++
CMP r1,#0x40 ; compare i, 64
ADD r0,r3,r0 ; sum += r3
BCC checksum_v2_loop ; if (i<64) goto loop
MOV pc,r14 ; return sum
If your program is simple enough, the optimizer can do the right thing without you having to worry about it. In that case, plain int would be the simplest (and future-proof) solution.
However, if you really want to combine a specific bit width with speed, you can use the fastest minimum-width integer types from §7.18.1.3 of the C99 standard (requires a C99-compliant compiler).
For example:
#include <stdint.h>

int_fast8_t x;
uint_fast8_t y;
are the signed and unsigned types that are guaranteed to be able to store at least 8 bits of data while using the underlying type that is usually fastest. Of course, it all depends on what you do with the data afterwards.
For example, on all systems I have tested (see: standard type sizes in C++), the fast types were 8 bits wide.
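You can check what your own implementation picks with a quick sketch (my example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* the width of the "fast" types is up to the implementation */
    printf("int_fast8_t : %zu bytes\n", sizeof(int_fast8_t));
    printf("uint_fast8_t: %zu bytes\n", sizeof(uint_fast8_t));
    return 0;
}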
While writing an integer-to-hex-string function, I noticed that I had an unnecessary mask and bit shift. But when I removed them, the code actually got bigger (by about 8-fold):
char *i2s(int n){
static char buf[(sizeof(int)<<1)+1]={0};
int i=0;
while(i<(sizeof(int)<<1)+1){ /* mask the ith hex, shift it to lsb */
// buf[i++]='0'+(0xf&(n>>((sizeof(int)<<3)-i<<2))); /* less optimizable ??? */
buf[i++]='0'+(0xf&((n&(0xf<<((sizeof(int)<<3)-i<<2)))>>((sizeof(int)<<3)-i<<2)));
if(buf[i-1]>'9')buf[i-1]+=('A'-'0'-10); /* handle A-F */
}
for(i=0;buf[i++]=='0';)
/*find first non-zero*/;
return (char *)buf+i;
}
With the extra bit shift and mask, compiled with gcc -S -O3, the loops unroll and the function reduces to:
movb $48, buf.1247
xorl %eax, %eax
movb $48, buf.1247+1
movb $48, buf.1247+2
movb $48, buf.1247+3
movb $48, buf.1247+4
movb $48, buf.1247+5
movb $48, buf.1247+6
movb $48, buf.1247+7
movb $48, buf.1247+8
.p2align 4,,7
.p2align 3
.L26:
movzbl buf.1247(%eax), %edx
addl $1, %eax
cmpb $48, %dl
je .L26
addl $buf.1247, %eax
ret
Which is what I expected for 32-bit x86 (64-bit should be similar, but with twice as many movb-like ops); however, without the seemingly redundant mask and bit shift, gcc can't seem to unroll and optimize it.
Any ideas why this would happen? I am guessing it has to do with gcc being (overly?) cautious with the sign bit. (There is no >>> operator in C, so right-shifting with >> pads with 1s instead of 0s if the sign bit is set.)
It seems you're using gcc4.7, since newer gcc versions generate different code than what you show.
gcc is able to see that your longer expression with the extra shifting and masking is always '0' + 0, but not for the shorter expression.
clang sees through them both, and optimizes them to a constant independent of the function arg n, so this is probably just a missed-optimization for gcc. When gcc or clang manage to optimize away the first loop to just storing a constant, the asm for the whole function never even references the function arg, n.
Obviously this means your function is buggy! And that's not the only bug.
off-by-one in the first loop, so you write 9 bytes, leaving no terminating 0. (Otherwise the search loop could optimize away too, and just return a pointer to the last byte. As written, it has to search off the end of the static array until it finds a non-'0' byte. Writing a 0 (not '0') before the search loop unfortunately doesn't help clang or gcc get rid of the search loop.)
off-by-one in the search loop so you always return buf+1 or later because you used buf[i++] in the condition instead of a for() loop with the increment after the check.
undefined behaviour from using i++ and i in the same statement with no sequence point separating them.
Apparently assuming that CHAR_BIT is 8. (Something like static char buf[CHAR_BIT*sizeof(n)/4 + 1] would fix that, but you actually need to round up when dividing by four.)
clang and gcc both warn about the suspicious operator precedence (- binds tighter than <<), but I didn't try to find exactly where you went wrong. Getting the ith nibble of an integer is much simpler than you make it: buf[i] = '0' + (0x0f & (n >> (4*i)));
That compiles to pretty clunky code, though. gcc probably does better with @Fabio's suggestion to do tmp >>= 4 repeatedly. If the compiler leaves that loop rolled up, it can still use shr reg, imm8 instead of needing a variable-count shift. (clang and gcc don't seem to optimize the n>>(4*i) into repeated shifts by 4.)
In both cases, gcc is fully unrolling the first loop. It's quite large when each iteration includes actual shifting, comparing, and branching or branchless handling of hex digits from A to F.
It's quite small when it can see that all it has to do is store 48 == 0x30 == '0'. (Unfortunately, it doesn't coalesce the 9 byte stores into wider stores the way clang does).
I put a bugfixed version on godbolt, along with your original.
Fabio's answer has a more optimized version. I was just trying to figure out what gcc was doing with yours, since Fabio had already provided a good version that should compile to more efficient code. (I optimized mine a bit too, but didn't replace the n>>(4*i) with n>>=4.)
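Since the godbolt link isn't reproduced here, this is roughly what a bugfixed version can look like (my sketch, not necessarily the code behind the link):

#include <limits.h>
#include <stddef.h>

/* one hex digit per 4 bits, rounding up */
#define HEX_DIGITS ((CHAR_BIT * sizeof(unsigned) + 3) / 4)

char *i2s_fixed(unsigned n)
{
    static char buf[HEX_DIGITS + 1];
    buf[HEX_DIGITS] = '\0';                 /* a real terminator, not '0' */
    for (size_t i = 0; i < HEX_DIGITS; i++) {
        unsigned nib = 0x0f & (n >> (4 * (HEX_DIGITS - 1 - i)));
        buf[i] = (char)(nib < 10 ? '0' + nib : 'A' + (nib - 10));
    }
    size_t i = 0;
    while (i < HEX_DIGITS - 1 && buf[i] == '0')  /* skip leading zeros, keep at least one digit */
        i++;
    return buf + i;
}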
gcc6.3 makes very amusing code for your bigger expression. It unrolls the search loop and optimizes away some of the compares, but keeps a lot of the conditional branches!
i2s_orig:
mov BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406, 48
cmp BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406+1, 48
mov BYTE PTR buf.1406+2, 48
mov BYTE PTR buf.1406+4, 48
mov BYTE PTR buf.1406+5, 48
mov BYTE PTR buf.1406+6, 48
mov BYTE PTR buf.1406+7, 48
mov BYTE PTR buf.1406+8, 48
mov BYTE PTR buf.1406+9, 0
jne .L7 # testing flags from the compare earlier
jne .L8
jne .L9
jne .L10
jne .L11
sete al
movzx eax, al
add eax, 8
.L3:
add eax, OFFSET FLAT:buf.1406
ret
.L7:
mov eax, 3
jmp .L3
... more of the same, setting eax to 4, or 5, etc.
Putting multiple jne instructions in a row is obviously useless: if the first one isn't taken, the flags haven't changed, so none of the following ones will be taken either.
I think it is because in the longer version you are left-shifting by ((sizeof(int)<<3)-i<<2) and then right-shifting by that same value later in the expression, so the compiler is able to optimise based on that fact.
Regarding the right-shifting, C++ can do the equivalent of both of Java's operators '>>' and '>>>'. It's just that in [GNU] C++ the result of "x >> y" will depend on whether x is signed or unsigned. If x is signed, then shift-right-arithmetic (SRA, sign-extending) is used, and if x is unsigned, then shift-right-logical (SRL, zero-extending) is used. This way, >> can be used to divide by 2 for both negative and positive numbers.
Unrolling loops is no longer a good idea because: 1) Newer processors come with a micro-op buffer which will often speed up small loops, 2) code bloat makes instruction caching less efficient by taking up more space in L1i. Micro-benchmarks will hide that effect.
The algorithm doesn't have to be that complicated. Your algorithm also has a problem: it returns '0' for multiples of 16, and for 0 itself it returns an empty string.
Below is a rewrite of the algo which is branch free except for the loop exit check (or completely branch free if compiler decides to unroll it). It is faster, generates shorter code and fixes the multiple-of-16 bug.
Branch-free code is desirable because there is a big penalty (15-20 clock cycles) if the CPU mispredicts a branch. Compare that to the bit operations in the algo: they take only 1 clock cycle each, and the CPU can execute 3 or 4 of them in the same clock cycle.
const char* i2s_brcfree(int n)
{
static char buf[ sizeof(n)*2+1] = {0};
unsigned int nibble_shifter = n;
for(char* p = buf+sizeof(buf)-2; p >= buf; --p, nibble_shifter>>=4){
const char curr_nibble = nibble_shifter & 0xF; // look only at lowest 4 bits
char digit = '0' + curr_nibble;
// "promote" to hex if nibble is over 9,
// conditionally adding the difference between ('0'+nibble) and 'A'
enum{ dec2hex_offset = ('A'-'0'-0xA) }; // compile time constant
digit += dec2hex_offset & -(curr_nibble > 9); // conditional add
*p = digit;
}
return buf;
}
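A quick usage sketch (my addition, assuming the function above is in scope), showing that multiples of 16 and zero itself now come out with their full fixed width:

#include <stdio.h>

int main(void)
{
    printf("%s\n", i2s_brcfree(0x1234ABCD)); /* 1234ABCD */
    printf("%s\n", i2s_brcfree(16));         /* 00000010 */
    printf("%s\n", i2s_brcfree(0));          /* 00000000 */
    return 0;
}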
Edit: C++ does not define the result of right-shifting negative numbers; it is implementation-defined. I only know that GCC and Visual Studio implement it as an arithmetic (sign-extending) shift on the x86 architecture.
I've been banging my head against the wall figuring this out, and this is making no sense to me...
Why does this program enter an infinite loop?!
I thought you could use test to compare two values for equality, as shown here... why doesn't it work?
int main()
{
__asm
{
mov EAX, 1;
mov EDX, EAX;
test EAX, EDX;
L: jne L;
}
}
Your expectation of what the TEST instruction does is incorrect.
The instruction is used to perform bit tests. You would typically use it to "test" if certain bits are set given a mask. It would be used in conjunction with the JZ (jump if zero) or JNZ (jump if not zero) instructions.
The test involves performing a bitwise-AND on the two operands and sets the appropriate flags (discarding the result). If none of the corresponding bits in the mask are set, then the ZF (zero flag) will be 1 (all bits are zero). If you wanted to test if any were set, you'd use the JNZ instruction. If you wanted to test if none were set, you'd use the JZ instruction.
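In C terms, TEST followed by JNZ corresponds to checking bits under a mask, roughly like this (my analogy, not from the original answer):

#include <stdio.h>

#define FLAG_READY 0x04u  /* hypothetical mask, for illustration only */

int main(void)
{
    unsigned status = 0x05;
    /* test status, FLAG_READY / jnz taken  <=>  (status & FLAG_READY) != 0 */
    if (status & FLAG_READY)
        puts("ready bit is set");
    return 0;
}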
The JE and JNE names are not appropriate for this instruction. At the machine level they are the same instructions as JZ and JNZ, but their names suggest an equality comparison, which is not what TEST performs.
You are trying to perform an equality check on some variables. You should be using the CMP instruction. You would typically use it to compare values with each other.
The comparison effectively subtracts the operands and only sets the flags (discarding the result). When equal, the difference of the two values is 0 (ZF = 1). When not equal, the difference of the two values is non-zero (ZF = 0). If you wanted to test if they were equal, you'd use the JE (jump if equal) instruction. If you wanted to test if they were not equal, you'd use the JNE (jump if not equal) instruction.
In this case, since you used TEST, the resulting flags would yield ZF = 0 (0x1 & 0x1 = 0x1, non-zero). Since ZF = 0, the JNE instruction would take the branch as you are seeing here.
tl;dr
You need to compare the values using the CMP instruction if you are checking for equality, not TEST them.
int main()
{
__asm
{
mov EAX, 1
mov EDX, EAX
cmp EAX, EDX
L: jne L ; no more infinite loop
}
}
Just reading up on the relevant instructions (my asm is very rusty):
JNE jumps on ZF (Zero Flag) = 0
TEST sets ZF = 0 if the bitwise AND of EAX and EDX is non-zero, and ZF = 1 if the bitwise AND is 0:
If the result of the AND is 0, the ZF is set to 1, otherwise set to 0.
Therefore it jumps, as 1 AND 1 leaves 0 in ZF.
Seems logical, yet counterintuitive.
I think @A.Webb is right: it should probably be JNZ if you're using the TEST instruction, since you are relying on the behaviour of a bitwise operation to set the zero flag, whereas the subtraction performed by CMP would set the zero flag the way you need.
This is pretty simple. You obviously need to know what the instructions do and what processor state they read and write. When in doubt, get a reference manual; the Intel x86 manuals are easy to find online.
Your specific program:
mov EAX, 1;
moves the constant 1 to EAX. No other state changes occur.
mov EDX, EAX;
copies the contents of EAX into EDX, so it too contains the value 1.
test EAX, EDX;
test computes the bitwise AND of two registers (did you check the reference manuals?), throws the answer away, and sets condition-code bits based on it. In your case, the upper 31 bits of each register are zero, and ANDing them produces zeros. The least significant bit is one in both registers, so ANDing them produces a 1. The net effect is that the 32-bit value one is generated, then thrown away after the condition-code bits are set. There is one condition-code bit we care about for this program: the Z(ero) bit, which is set if the last condition-code-setting operation produced an all-zero value. This test produced one, so the Z bit is cleared. I'll let you look up the other condition-code bits.
L: jne L;
This is a "Jump on Not Equal", i.e. it jumps if the Z bit is clear. For your program, Z is clear, so the jump occurs. After executing it, the processor is at the same instruction and sees the same jump again. The condition-code bits aren't changed by a jump instruction.
So... it goes into an infinite loop.
There are lots of synonyms for various opcodes supported by assemblers. For instance, "JZ" and "JE" are synonyms for the same instruction. Don't let the synonyms confuse you.
I would like to understand how cmp and je/jg work in assembly. I saw a few examples on Google, but I am still a little confused. Below I have shown part of the assembly code that I am trying to convert to C, and the corresponding C code. Is it implemented the right way, or do I have a wrong understanding of how cmp works?
cmp $0x3,%eax
je A
cmp $0x3,%eax
jg B
cmp $0x1,%eax
je C
int func(int x){
if(x == 3)
goto A;
if (x >3)
goto B;
if(*x == 1)
goto C;
A:
......
B:
......
C:
......
You understand correctly how cmp and je/jg work, but you have an error in your C code. This line:
if (*x == 1)
should be
if (x == 1)
Here is a pretty good summary of the x86 control flow instructions.
Also, there's no reason to repeat the cmp instruction for the same values. Once you've executed it, you can test the results multiple ways without repeating the comparison. So your assembly code should look like this:
cmp $0x3,%eax
je A
jg B
cmp $0x1,%eax
je C
Yes, that's correct, except that in your C code you have *x in the third comparison but x in the others, which does not make sense: there is no corresponding dereference in your assembly code.
In C, the variable's signedness is defined when declaring it, e.g. int x or unsigned int x, but in assembly the distinction between signed and unsigned variables (be they in memory or in registers) is made by using different conditional jumps for comparisons:
For signed variables:
jg ; jump if greater
jl ; jump if less
jge ; jump if greater or equal, "jnl" is synonymous
jle ; jump if less or equal, "jng" is synonymous
For unsigned variables:
ja ; jump if above
jb ; jump if below
jae ; jump if above or equal, "jnb" is synonymous
jbe ; jump if below or equal, "jna" is synonymous
Intel x86 JUMP quick reference lists all conditional jumps available in x86 assembly, together with their conditions (flags' values) and their opcodes for short and long jumps.
As you may already know, the processor keeps track of what happened during the last operations in a so-called flags register.
For example, there is a flag that records whether an operation overflowed, whether the result was zero, and so on. The cmp mnemonic tells the processor to subtract the two operands (two registers, or a register and memory contents), discard the result, and set the appropriate flags.
After that, you can use the conditional jumps: the processor checks the flags to see whether the operands were equal (je checks the zero flag) or whether one was smaller/bigger (the carry flag for unsigned comparisons; the sign and overflow flags for signed ones).
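As a worked example tying this back to the first question (my own arithmetic, simulating cmp dh, dl with DH = 0x06 and DL = 0x80):

#include <stdio.h>

int main(void)
{
    unsigned char dh = 0x06, dl = 0x80;
    unsigned char result = (unsigned char)(dh - dl);    /* 0x86 */

    int cf = dh < dl;                                   /* borrow:          CF = 1 */
    int zf = result == 0;                               /* non-zero result: ZF = 0 */
    int sf = (result & 0x80) != 0;                      /* top bit set:     SF = 1 */
    int of = ((dh ^ dl) & (dh ^ result) & 0x80) != 0;   /* signed overflow: OF = 1 */

    printf("jg taken: %d\n", !zf && (sf == of));  /* 1: signed   6 > -128 */
    printf("ja taken: %d\n", !cf && !zf);         /* 0: unsigned 6 < 128  */
    return 0;
}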