I've been banging my head against the wall figuring this out, and this is making no sense to me...
Why does this program enter an infinite loop?!
I thought you could use test to compare two values for equality, as shown here... why doesn't it work?
int main()
{
__asm
{
mov EAX, 1;
mov EDX, EAX;
test EAX, EDX;
L: jne L;
}
}
Your expectation of what the TEST instruction does is incorrect.
The instruction is used to perform bit tests. You would typically use it to "test" if certain bits are set given a mask. It would be used in conjunction with the JZ (jump if zero) or JNZ (jump if not zero) instructions.
The test involves performing a bitwise-AND on the two operands and sets the appropriate flags (discarding the result). If none of the corresponding bits in the mask are set, then the ZF (zero flag) will be 1 (all bits are zero). If you wanted to test if any were set, you'd use the JNZ instruction. If you wanted to test if none were set, you'd use the JZ instruction.
The JE and JNE are not appropriate for this instruction because they interpret the flags differently.
You are trying to perform an equality check on some variables. You should be using the CMP instruction. You would typically use it to compare values with each other.
The comparison effectively subtracts the operands and only sets the flags (discarding the result). When equal, the difference of the two values is 0 (ZF = 1). When not equal, the difference of the two values is non-zero (ZF = 0). If you wanted to test if they were equal, you'd use the JE (jump if equal) instruction. If you wanted to test if they were not equal, you'd use the JNE (jump if not equal) instruction.
In this case, since you used TEST, the resulting flags would yield ZF = 0 (0x1 & 0x1 = 0x1, non-zero). Since ZF = 0, the JNE instruction would take the branch as you are seeing here.
tl;dr
You need to compare the values using the CMP instruction if you are checking for equality, not TEST them.
int main()
{
__asm
{
mov EAX, 1
mov EDX, EAX
cmp EAX, EDX
L: jne L ; no more infinite loop
}
}
Just reading this (my asm is very rusty) and this
JNE jumps on ZF (Zero Flag) = 0
TEST sets ZF = 0 If bitwise EAX AND EDX results in 1 and 1 if bitwise AND results in 0
If the result of the AND is 0, the ZF is set to 1, otherwise set to
0.
Therefore it jumps as 1 AND 1 results in 0 in ZF.
Seems logical yet counter intuative.
I think #A.Webb is right - it should probably be JNZ if you're using the TEST instruction as you are relying on the behavior of a bitwuse operation to set the zero flag whereas the SUB instruction would set the Zero flag as you need.
This is pretty simple. You obviously need to know what the instructions do, what processor state they read, and write. When in doubt, get a reference manual. The Intel x86 manuals are easy to find online.
Your specific program:
mov EAX, 1;
moves the constant 1 to EAX. No other state changes occur.
mov EDX, EAX;
copies the contexts of EAX into EDX, so it too contains the value 1.
test EAX, EDX;
test computes the bitwise AND of two registers (did you check the reference manuals?), throws the answer away, and sets condition code bits based on the answer. In your case, the upper 31 bits of each register is zero, and'd produces zeros. The least significant bit is one in both registers; and'd produces a 1. The net effect is that the 32 binary value "one" is generated, and thrown away after the condition code bits are set. There is one condition code bit we care about for this program, and that's the "Z"(ero) bit, which is set if the last condition-code setting operation produced a full zero value. This test produced "one", so the Z bit is reset. I'll let you look up the other condition code bits.
L: jne L;
This is a "Jmp on Not Equal", e.g, it jmps if the Z bit is reset. For your program, Z is reset, the jmp occurs. After execution, the processor is at the same insruction, and sees another (the same jmp). The condition code bits aren't changed by a jmp instruction.
So... it goes into an infinite loop.
There are lots of synonyms for various opcodes supported by assemblers. For instance, "JZ" and "JE" are synonyms for the same instruction. Don't let the synonyms confuse.
Related
This was a question that was previously posed by a prof of mine and I'm assuming the 8-bit register is either CL or CH. I got it working by simply moving 01H to the CH register, but I was wondering if there was any other way of doing this since I am technically using the 16-bit CX register as a whole when running the code.
My code for reference :
MOV CH,01H
L1:INC AX ;to keep count
LOOP L1
You're right, your code uses 16-bit CX. Much worse, it depends on CL being zero before this snippet executes! An 8-bit loop counter that starts at zero will wrap back to zero 256 decrements (or increments) later.
mov al, 0 ; uint8_t i = 0. xor ax,ax is the same code size but zeros AH
loop_top: ; do {
dec al
jnz loop_top ; }while(--i != 0)
Nothing in the question said there needed to be any work inside the loop; this is just an empty delay loop.
Efficiency notes: dec ax is smaller than dec al, and loop rel8 is even more compact than dec/jnz. So if you were optimizing for real 8086 or 8088, you'd want to keep the loop body smaller because it runs more times than the code ahead of the loop. Of course if you actually wanted to just delay, this would delay longer since code-fetch would take more memory accesses. Overall code size is the same either way, for mov ax, 256 (3 bytes) vs. xor ax,ax (2 bytes) or mov al, 0 (2 bytes).
This works the same with any 8-bit register; AL isn't special for any of these instructions, so you'd often want to keep it free for stuff that can benefit from its special encodings for stuff like cmp al, imm8 in 2 bytes instead of the usual 3.
(mov al, 0 vs. xor al,al - false dependency either way on many modern CPUs. mov ah,0 might avoid a false dependency on Skylake; at least mov from another register does but maybe not immediate. See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent. Anyway, xor-zeroing is generally not useful on byte registers.)
I wonder if it's faster for the processor to negate a number or to do a subtraction. For example:
Is
int a = -3;
more efficient than
int a = 0 - 3;
In other words, does a negation is equivalent to subtracting from 0? Or is there a special CPU instruction that negate faster that a subtraction?
I suppose that the compiler does not optimize anything.
From the C language point of view, 0 - 3 is an integer constant expression and those are always calculated at compile-time.
Formal definition from C11 6.6/6:
An integer constant expression shall have integer type and shall
only have operands that are integer constants, enumeration constants,
character constants, sizeof expressions whose results are integer
constants, _Alignof expressions, and floating constants that are the
immediate operands of casts.
Knowing that these are calculated at compile time is important when writing readable code. For example if you want to declare a char array to hold 5 characters and the null terminator, you can write char str[5+1]; rather than 6, to get self-documenting code telling the reader that you have considered null termination.
Similarly when writing macros, you can make use of integer constant expressions to perform parts of the calculation at compile time.
(This answer is about negating a runtime variable, like -x or 0-x where constant-propagation doesn't result in a compile-time constant value for x. A constant like 0-3 has no runtime cost.)
I suppose that the compiler does not optimize anything.
That's not a good assumption if you're writing in C. Both are equivalent for any non-terrible compiler because of how integers work, and it would be a missed-optimization bug if one compiled to more efficient code than the other.
If you actually want to ask about asm, then how to negate efficiently depends on the ISA.
But yes, most ISAs can negate with a single instruction, usually by subtracting from an immediate or implicit zero, or from an architectural zero register.
e.g. 32-bit ARM has an rsb (reverse-subtract) instruction that can take an immediate operand. rsb rdst, rsrc, #123 does dst = 123-src. With an immediate of zero, this is just negation.
x86 has a neg instruction: neg eax is exactly equivalent to eax = 0-eax, setting flags the same way.
3-operand architectures with a zero register (hard-wired to zero) can just do something like MIPS subu $t0, $zero, $t0 to do t0 = 0 - t0. It has no need for a special instruction because the $zero register always reads as zero. Similarly AArch64 removed RSB but has a xzr / wzr 64/32-bit zero register. (Although it also has a pseudo-instruction called neg which subtracts from the zero register).
You could see most of this by using a compiler. https://godbolt.org/z/B7N8SK
But you'd have to actually compile to machine code and disassemble because gcc/clang tend to use the neg pseudo-instruction on AArch64 and RISC-V. Still, you can see ARM32's rsb r0,r0,#0 for int negate(int x){return -x;}
Both are compile time constants, and will generate the same constant initialisation in any reasonable compiler regardless of optimisation.
For example at https://godbolt.org/z/JEMWvS the following code:
void test( void )
{
int a = -3;
}
void test2( void )
{
int a = 0-3;
}
Compiled with gcc 9.2 x86-64 -std=c99 -O0 generates:
test:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], -3
nop
pop rbp
ret
test2:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], -3
nop
pop rbp
ret
Using -Os, the code:
void test( void )
{
volatile int a = -3;
}
void test2( void )
{
volatile int a = 0-3;
}
generates:
test:
mov DWORD PTR [rsp-4], -3
ret
test2:
mov DWORD PTR [rsp-4], -3
ret
The volatile being necessary to prevent the compiler removing the unused variables.
As static data it is even simpler:
int a = -3;
int b = 0-3;
outside of a function generates no executable code, just initialised data objects (initialisation is different from assignment):
a:
.long -3
b:
.long -3
Assignment of the above statics:
a = -4 ;
b = 0-4 ;
is still a compiler evaluated constant:
mov DWORD PTR a[rip], -4
mov DWORD PTR b[rip], -4
The take-home here is:
If you are interested, try it and see (with your own compiler or Godbolt set for your compiler and/or architecture),
don't sweat the small stuff, let the compiler do its job,
constant expressions are evaluated at compile time and have no run-time impact,
writing weird code in the belief you can better the compiler is almost always pointless. Compilers work better with idiomatic code the optimiser can recognise.
It's hard to tell if you ask asking if subtraction is fast then negation in a general sense, or in this specific case of implementing negation via subtraction from zero. I'll try to answer both.
General Case
For the general case, on most modern CPUs these operations are both very fast: usually each only taking a single cycle to execute, and often having a throughput of more than one per cycle (because CPUs are superscalar). On all recent AMD and Intel CPUs that I checked, both sub and neg execute at the same speed for register and immediate arguments.
Implementing -x
As regards to your specific question of implementing the -x operation, it would usually be slightly faster to implement this with a dedicated neg operation than with a sub, because with neg you don't have to prepare the zero registers. For example, a negation function int neg(int x) { return -x; }; would look something like this with the neg instruction:
neg:
mov eax, edi
neg eax
... while implementing it terms of subtraction would look something like:
neg:
xor eax, eax
sub eax, edi
Well ... sub didn't come out looking at worse there, but that's mostly a quirk of the calling convention and the fact that x86 uses a 1 argument destructive neg: the result needs to be in eax, so in the neg case 1 instruction is spent just moving the result to the right register, and one doing the negation. The sub version takes two instructions to perform the negation itself: one to zero a register, and one to do the subtraction. It so happens that this lets you avoid the ABI shuffling because you get to choose the zero register as the result register.
Still, this ABI related inefficiency wouldn't persist after inlining, so we can say in some fundamental sense that neg is slightly more efficient.
Now many ISAs may not have a neg instruction at all, so the question is more or less moot. They may have a hardcoded zero register, so you'd implement negation via subtraction from this register and there is no cost to set up the zero.
While writing an integer to hex string function I noticed that I had an unnecessary mask and bit shift, but when I removed it, the code actually got bigger (by about 8-fold)
char *i2s(int n){
static char buf[(sizeof(int)<<1)+1]={0};
int i=0;
while(i<(sizeof(int)<<1)+1){ /* mask the ith hex, shift it to lsb */
// buf[i++]='0'+(0xf&(n>>((sizeof(int)<<3)-i<<2))); /* less optimizable ??? */
buf[i++]='0'+(0xf&((n&(0xf<<((sizeof(int)<<3)-i<<2)))>>((sizeof(int)<<3)-i<<2)));
if(buf[i-1]>'9')buf[i-1]+=('A'-'0'-10); /* handle A-F */
}
for(i=0;buf[i++]=='0';)
/*find first non-zero*/;
return (char *)buf+i;
}
With the extra bit shift and mask and compiled with gcc -S -O3, the loops unroll and it reduces to:
movb $48, buf.1247
xorl %eax, %eax
movb $48, buf.1247+1
movb $48, buf.1247+2
movb $48, buf.1247+3
movb $48, buf.1247+4
movb $48, buf.1247+5
movb $48, buf.1247+6
movb $48, buf.1247+7
movb $48, buf.1247+8
.p2align 4,,7
.p2align 3
.L26:
movzbl buf.1247(%eax), %edx
addl $1, %eax
cmpb $48, %dl
je .L26
addl $buf.1247, %eax
ret
Which is what I expected for 32 bit x86 (should be similar,but twice as many movb-like ops for 64bit); however without the seemingly redundant mask and bit shift, gcc can't seem to unroll and optimize it.
Any ideas why this would happen? I am guessing it has to do with gcc being (overly?) cautious with the sign bit. (There is no >>> operator in C, so bitshifting the MSB >> pads with 1s vs. 0s if the sign bit is set)
It seems you're using gcc4.7, since newer gcc versions generate different code than what you show.
gcc is able to see that your longer expression with the extra shifting and masking is always '0' + 0, but not for the shorter expression.
clang sees through them both, and optimizes them to a constant independent of the function arg n, so this is probably just a missed-optimization for gcc. When gcc or clang manage to optimize away the first loop to just storing a constant, the asm for the whole function never even references the function arg, n.
Obviously this means your function is buggy! And that's not the only bug.
off-by-one in the first loop, so you write 9 bytes, leaving no terminating 0. (Otherwise the search loop could optimize away to, and just return a pointer to the last byte. As written, it has to search off the end of the static array until it finds a non-'0' byte. Writing a 0 (not '0') before the search loop unfortunately doesn't help clang or gcc get rid of the search loop)
off-by-one in the search loop so you always return buf+1 or later because you used buf[i++] in the condition instead of a for() loop with the increment after the check.
undefined behaviour from using i++ and i in the same statement with no sequence point separating them.
Apparently assuming that CHAR_BIT is 8. (Something like static char buf[CHAR_BIT*sizeof(n)/4 + 1], but actually you need to round up when dividing by two).
clang and gcc both warn about - having lower precedence than <<, but I didn't try to find exactly where you went wrong. Getting the ith nibble of an integer is much simpler than you make it: buf[i]='0'+ (0x0f & (n >> (4*i)));
That compiles to pretty clunky code, though. gcc probably does better with #Fabio's suggestion to do tmp >>= 4 repeatedly. If the compiler leaves that loop rolled up, it can still use shr reg, imm8 instead of needing a variable-shift. (clang and gcc don't seem to optimize the n>>(4*i) into repeated shifts by 4.)
In both cases, gcc is fully unrolling the first loop. It's quite large when each iteration includes actual shifting, comparing, and branching or branchless handling of hex digits from A to F.
It's quite small when it can see that all it has to do is store 48 == 0x30 == '0'. (Unfortunately, it doesn't coalesce the 9 byte stores into wider stores the way clang does).
I put a bugfixed version on godbolt, along with your original.
Fabio's answer has a more optimized version. I was just trying to figure out what gcc was doing with yours, since Fabio had already provided a good version that should compile to more efficient code. (I optimized mine a bit too, but didn't replace the n>>(4*i) with n>>=4.)
gcc6.3 makes very amusing code for your bigger expression. It unrolls the search loop and optimizes away some of the compares, but keeps a lot of the conditional branches!
i2s_orig:
mov BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406, 48
cmp BYTE PTR buf.1406+3, 48
mov BYTE PTR buf.1406+1, 48
mov BYTE PTR buf.1406+2, 48
mov BYTE PTR buf.1406+4, 48
mov BYTE PTR buf.1406+5, 48
mov BYTE PTR buf.1406+6, 48
mov BYTE PTR buf.1406+7, 48
mov BYTE PTR buf.1406+8, 48
mov BYTE PTR buf.1406+9, 0
jne .L7 # testing flags from the compare earlier
jne .L8
jne .L9
jne .L10
jne .L11
sete al
movzx eax, al
add eax, 8
.L3:
add eax, OFFSET FLAT:buf.1406
ret
.L7:
mov eax, 3
jmp .L3
... more of the same, setting eax to 4, or 5, etc.
Putting multiple jne instructions in a row is obviously useless.
I think it has to do that in the shorter version, you are left shifting by ((sizeof(int)<<3)-i<<2) and then right-shifting by that same value later in the expression, so the compiler is able to optimised based on that fact.
Regarding the right-shifting, C++ can do the equivalent of both of Java's operators '>>' and '>>>'. It's just that in [GNU] C++ the result of "x >> y" will depend on whether x is signed or unsigned. If x is signed, then shift-right-arithmetic (SRA, sign-extending) is used, and if x is unsigned, then shift-right-logical (SRL, zero-extending) is used. This way, >> can be used to divide by 2 for both negative and positive numbers.
Unrolling loops is no longer a good idea because: 1) Newer processors come with a micro-op buffer which will often speed up small loops, 2) code bloat makes instruction caching less efficient by taking up more space in L1i. Micro-benchmarks will hide that effect.
The algorithm doesn't have to be that complicated. Also, your algorithm has a problem that it returns '0' for multiples of 16, and for 0 itself it returns an empty string.
Below is a rewrite of the algo which is branch free except for the loop exit check (or completely branch free if compiler decides to unroll it). It is faster, generates shorter code and fixes the multiple-of-16 bug.
Branch-free code is desirable because there is a big penaly (15-20 clock cycles) if the CPU mispredicts a branch. Compare that to the bit operations in the algo: they only take 1 clock cycle each and the CPU is able to execute 3 or 4 of them in the same clock cycle.
const char* i2s_brcfree(int n)
{
static char buf[ sizeof(n)*2+1] = {0};
unsigned int nibble_shifter = n;
for(char* p = buf+sizeof(buf)-2; p >= buf; --p, nibble_shifter>>=4){
const char curr_nibble = nibble_shifter & 0xF; // look only at lowest 4 bits
char digit = '0' + curr_nibble;
// "promote" to hex if nibble is over 9,
// conditionally adding the difference between ('0'+nibble) and 'A'
enum{ dec2hex_offset = ('A'-'0'-0xA) }; // compile time constant
digit += dec2hex_offset & -(curr_nibble > 9); // conditional add
*p = digit;
}
return buf;
}
Edit: C++ does not define the result of right shifting negative numbers. I only know that GCC and visual studio do that on the x86 architecture.
I wondered when I use this code, for example:
#include <stdio.h>
int main(){
int b;
scanf("%d",&b);
if (b)
printf("right\n");
else
printf("zero entered\n");
return 0;
}
How can the compiler know that if b!= 0, it should execute printf("right\n");.....and if b == 0 it should execute printf("zero entered\n");
And if I had another variable a, and check if a > b or not, the return from the logical operations is 1 or 0; how this value is obtained? Is it a function?
In C, all non-zero values are "true", and only zero is "false".
As for the comparison a > b, depending on the types there are instructions in the CPU running your program that does comparisons, and the compiler generates those instructions when compiling your program. For types that don't have a native comparison-instruction, it's up to the implementation of the compiler how those should be handled.
The way the compiler deals with this is to translate it into the appropriate machine instructions.
In the case of this:
if (b)
then this is typically, on an x86 machine, translated to something like this:
cmp eax, eax ; compare register eax with itself
jz target ; jump to target if zero
The above code tells you that zero is a special case in the cpu, in that many, if not most, of the instructions will set some internal flags when they operate on values, so that jz and jnz (jump if not zero) can be done afterwards.
There are other flags as well, overflow, carry, sign, parity.
As for comparison, for types that can be handled natively by the cpu, there are built in instructions:
mov eax, a ; eax = a
cmp eax, b ; compare eax to b
jl target ; jump to target if less (eax < b --> a < b)
You can find more of the jump instructions here: Intel x86 JUMP quick reference.
If the types cannot be handled natively, typically it will involve a function call, which returns a 0/1 (or 0/N, note that N can be negative) value, in which case it falls back to the if (b) type of instructions to handle the results of that function.
Something like this:
mov eax, a ; eax = a
mov ebx, b ; ebx = b
call function ; call comparison function, result returned in eax
cmp eax, eax
jz notequal
If you want to check the value is true or false, you should always compare with zero because all integers are true except zero.
I would like to understand how cmp and je/jg work in assembly. I saw few examples on google but I am still little bit confused. Below I have shown a part of assembly code that I am trying to convert to C language and the corresponding C code. Is it implemented in the right way or do I have a wrong understanding of how cmp works?
cmp $0x3,%eax
je A
cmp $0x3,%eax
jg B
cmp $0x1,%eax
je C
int func(int x){
if(x == 3)
goto A;
if (x >3)
goto B;
if(x == 1)
goto C;
A:
......
B:
......
C:
......
You understand correctly how cmp and je/jg work, but you have an error in your C code. This line:
if (*x == 1)
should be
if (x == 1)
Here is a pretty good summary of the x86 control flow instructions.
Also, there's no reason to repeat the cmp instruction for the same values. Once you've executed it, you can test the results multiple ways without repeating the comparison. So your assembly code should look like this:
cmp $0x3,%eax
je A
jg B
cmp $0x1,%eax
je C
Yes, that's correct, except that in your C code you have *x in third example but x in others, that does not make sense. In your assembly code there is no correspoding code.
In C the variable type (signed/unsigned) is defined upon declaring the variable, eg. int x or unsigned int x, but in assembly the distinction between signed and unsigned variables (be they in memory or in registers) for comparisons is made by different conditional jumps:
For signed variables:
jg ; jump if greater
jl ; jump if less
jge ; jump if greater or equal, "jnl" is synonymous
jle ; jump if less or equal, "jng" is synonymous
For unsigned variables:
ja ; jump if above
jb ; jump if below
jae ; jump if above or equal, "jnb" is synonymous
jbe ; jump if below or equal, "jna" is synonymous
Intel x86 JUMP quick reference lists all conditional jumps available in x86 assembly, together with their conditions (flags' values) and their opcodes for short and long jumps.
As you may already know, the processor keeps track of the stuff that happened during last operations in a so-called flags-register.
For example, there is a flag if an operation made an overflow, or the result was zero etc. The cmp mnemonic tells the processor to subtract the two registers/ register and memory content and it changes the correct flags.
After that, you can jump using the jumps you have done. The processor checks the flags to see if it was equal-je, (checks the zero flag), or if it was smaller/bigger(overflow flag for unsigned and overflow and sign flag for signed numbers).