Help understanding DIV instruction in x86 inline assembly - c

While reading through some source code in a GNU project, I came across this bit of inline assembly:
__asm__ (
"divq %4"
: "=a" (q), "=d" (r)
: "0" (n0), "1" (n1), "rm" (d)
);
Here the variables q, r, n0, n1, and d are 64-bit integers. I know enough assembly to get the gist of what this does, but there's some details I'm not certain about.
What I understand:
We're dividing the contents of the RAX register by d, placing the quotient in q, and placing the remainder in r.
What I don't understand
Why are there three inputs
here? We only need to input a
dividend and a divisor, so what use
could there be for 3 inputs?
I can't tell which of the inputs is the dividend. More generally, I don't see anything actually
being loaded into the RAX register,
so how does it know what to divide by what?

In the input operands specification:
: "0" (n0), "1" (n1), "rm" (d)
registers "0" and "1" are forced to rax and rdx because of the output specification:
: "=a" (q), "=d" (r)
And the div instruction family wants the numerator in RDX:RAX. The divisor can be in a general purpose register (not otherwise used - ie., not RAX or RDX) or memory, which is specified by the "rm" constraint. Registers RDX, RAX, and the divisor operand make up the 3 inputs.
So this will end up performing the division: n1:n0 / d where n1:n0 is a quantity loaded into rdx:rax.

As you correctly observe the div family works on the fixed registers a and d, rax and rdx for divq. The a register gets its input from n0 which is aliased to the 0th register, namely a. n1 is a dummy input aliased to d, probably just to ensure that this register is not used for other purposes.

Related

MSVC Inline Assembly: Freeing FPU registers for performance

While playing a little with FPU using MSVC's Inline Assembly, I got a little confused about freeing FPU registers in favor of increasing performance...
For example:
#include <stdio.h>
double fpu_add(register double x, register double y) {
double res = 0.0;
__asm {
fld x
fld y
fadd
fstp res
}
return res;
}
int main(void) {
double x = fpu_add(5.0, 2.0);
(void) printf("x = %f\n", x);
return 0;
}
When do I have to ffree the FPU registers in Inline Assembly?
In that example would performance be better if I decided to ffree the st(1) register?
Also is fstp a shorthand for instructions below?
__asm {
fst res
ffree st(0)
}
NOTE: I know FPU instructions are a bit old nowdays, But dealing with them as another option along with SSE
The ffree instruction allows you to mark any slot of the x87 fo stack as free without actually changing the stack pointer. So ffree st(0) does NOT pop the stack, just marks the top value of the stack as free/invalid, so any following instruction that tries to access it will get a floating point exception.
To actually pop to the stack you need both ffree st(0) and fincstp (to increment the pointer). Or better, fstp st(0) to do both those things with a single cheap instruction. Or fstp st(1) to keep the top-of-stack value and discard the old st(1).
But it's usually even better and easier (and faster) to use the p suffixed versions of other instructions. In your case, you probably want
__asm {
fld x // push x on the stack
fld y // push y on the stack
faddp // pop a value and add it to the (now) tos
fstp res // pop and store tos
}
This ends up pushing and popping two values, leaving the fp stack in the same state as it was before. Leaving stuff on the fp stack is likely to cause problems with other fp code, if the compiler is generating x87 fp code, so should be avoided.
Or even better, use memory-source fadd to save instructions, if you're optimizing for CPUs where that's not slower. (Check Agner Fog's microarch PDF and instruction tables for P5 Pentium and newer: seems to be fine, at least break even, and saves a uop on more modern CPUs like Core2 that can do micro-fusion of memory source operands.)
__asm {
fld x // push x on the stack
fadd y // ST0 += y
fstp res // pop and store tos
}
But MSVC inline asm is inherently slow for wrapping a single instruction like fadd, forcing inputs to be in memory, even if the compiler had them available in registers before the asm statement. And forcing the result to be stored in the asm and then reloaded for the return statement, unless you use a hack like leaving a value in st(0) and falling off the end of a function without a return statement. (MSVC does actually support this even when inlining, but clang-cl / clang -fasm-blocks does not.)
GNU C inline asm could wrap a single fadd instruction with appropriate constraints to ask for inputs in x87 registers and tell the compiler where the output is (in st(0)), but you'd still have to choose between fadd and faddp, not letting the compiler pick based on whether it had values in registers or a value from memory. (https://stackoverflow.com/tags/inline-assembly/info)
Compilers aren't terrible, they will make code at least this good from plain C source. Inline asm is generally not useful for performance, unless you're writing a whole loop that's carefully tuned for a specific CPU, or for a case where the compiler does a poor job with something. (Look at the compiler's optimized asm output, e.g. on https://godbolt.org/)

Operands mismatch for mul when inserting asm into c

I'm trying to make an assembly insert into C code. However when I try to multiply two registers inside it I get an error calling for operands mismatch. I tried "mul %%bl, %%cl\n" (double %% because it's in C code). From my past experience with asm I think this should work. I also tried "mul %%cl\n" (moving bl to al first), but in this case I get tons of errors from linker
zad3:(.rodata+0x4): multiple definition of `len'
/tmp/ccJxYyIp.o:(.rodata+0x0): first defined here
zad3: In function `_fini':
(.fini+0x0): multiple definition of `_fini'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crti.o:(.fini+0x0): first defined here
zad3: In function `data_start':
(.data+0x0): multiple definition of `__data_start'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o:(.data+0x0): first defined here
zad3: In function `data_start':
(.data+0x8): multiple definition of `__dso_handle'
/usr/lib/gcc/x86_64-linux-gnu/5/crtbegin.o:(.data+0x0): first defined here
zad3:(.rodata+0x0): multiple definition of `_IO_stdin_used'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o:(.rodata.cst4+0x0): first defined here
zad3: In function `_start':
(.text+0x0): multiple definition of `_start'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crt1.o: (.text+0x0): first defined here
zad3: In function `data_start':
(.data+0x10): multiple definition of `str'
/tmp/ccJxYyIp.o:(.data+0x0): first defined here
/usr/bin/ld: Warning: size of symbol `str' changed from 4 in /tmp/ccJxYyIp.o to 9 in zad3
zad3: In function `main':
(.text+0xf6): multiple definition of `main'
/tmp/ccJxYyIp.o:zad3.c:(.text+0x0): first defined here
zad3: In function `_init':
(.init+0x0): multiple definition of `_init'
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/crti.o:(.init+0x0): first defined here
/usr/lib/gcc/x86_64-linux-gnu/5/crtend.o:(.tm_clone_table+0x0): multiple definition of `__TMC_END__'
zad3:(.data+0x20): first defined here
/usr/bin/ld: error in zad3(.eh_frame); no .eh_frame_hdr table will be created.
collect2: error: ld returned 1 exit status
From what I understand, it tells me I defined len and a few other variables a few times, but I cannot see this multiple definition.
The goal of my program is to take a string of numbers and count sum of them but using 2 as a base. So let's say string is 293, then I want to count 2*2^2+9*2^1+3*2^0
Code:
#include <stdio.h>
char str[] = "543";
const int len = 3;
int main(void)
{
asm(
"mov $0, %%rbx \n"
"mov $1, %%rcx \n"
"potega: \n"
"shl $1, %%cl \n"
"inc %%rbx \n"
"cmp len, %%ebx \n"
"jl potega \n"
"mov $0, %%rbx \n"
"petla: \n"
"mov (%0, %%rbx, 1), %%al \n"
"sub $48, %%al \n"
"mul %%al, %%cl \n"
"shr $1, %%cl \n"
"add $48, %%al \n"
"mov %%al, (%0, %%rbx, 1) \n"
"inc %%rbx \n"
"cmp len, %%ebx \n"
"jl petla \n"
:"r"(&str)
:"%rax", "%rbx", "%rcx"
);
printf("Wynik: %s\n", str);
return 0;
}
While I try to avoid "doing people's homework" for them, you have already solved this and given that it has been over a week, have probably already turned it in.
So, looking at your final solution, there are a few things you might want to consider doing differently. In no particular order:
Comments. While all code needs comments, asm REALLY needs comments. As you'll see from my solution (below), having comments alongside the code really helps clarify what the code does. It might seem like a homework project hardly needs them. But since you posted this here, 89 people have tried to read this code. Comments would have made this easier for all of us. Not to mention that it will make life easier for your 'future self,' when you come back months from now to try to maintain it. Comments. Nuff said.
Zeroing registers. While mov $0, %%rbx will indeed put zero in rbx, this is not the most efficient way to zero a register. Using xor %%rbx, %%rbx is both (microscopically) faster and produces (slightly) smaller executable code.
potega. Without comments, it took me a bit to sort out what you were doing in your first loop. You are using rbx to keep track of how many characters you have processed, and cl gets shifted one to the left for each character. A few thoughts here:
3a. First thing I'd do is look at moving the shl $1, %%cl out of the loop. Instead of doing both increment and shift, just count the characters, then do a single shift of the appropriate size. This is (slightly) complicated by the fact that if you want to shift by a variable amount, the amount must be specified in cl (ie shl %%cl, %%rbx). Why cl? Who knows? That's just how shl works. So you'd want to do the counting in cl instead of rbx.
3b. Second thing about this loop has to do with len1. Since you already know the size (it's in len1), why would you even need a loop? Perhaps a more sensible approach would be:
3c. Strings in C are terminated with a null character (aka 0). If you want to find the length of a string, normally you'd walk the string until you find it. This removes the requirement to even have len1.
3d. Your code assumes that the input string is valid. What would happen if you got passed "abc"? Or ""? Validating parameters is boring, time consuming, and makes the program bigger and run slower. On the other hand, it pays HUGE dividends when something unexpected goes wrong. At the very least you should specify your assumptions about your input.
3e. Using global variables is usually a bad idea. You run into naming collisions (2 files both using the name len1), code in several different files all changing the value (making bugs difficult to track down) and it can make your program bigger than it needs to be. There are times when globals are useful, but this does not appear to be one of them. The only purpose here seems to be to allow access to these variables from within the asm, and there are other ways to do that.
3f. You use %0 to refer to str. That works (and is better than accessing the global symbol directly), but it is harder to read than it needs to be. You can associate a name with the parameter and use that instead.
Let's take a break for a moment to see what we've got so far:
"xor %%rcx, %%rcx\n" // Zero the strlen count
// Count how many characters in string
"potega%=: \n\t"
"mov (%[pstr], %%rcx), %%bl\n\t" // Read the next char
"test %%bl, %%bl \n\t" // Check for 0 at end of string
"jz kont%= \n\t"
"cmp $'0', %%bl\n\t" // Ensure digit is 0-9
"jl gotowe%=\n\t"
"cmp $'9', %%bl\n\t"
"jg gotowe%=\n\t"
"inc %%rcx \n\t" // Increment index/len
"jmp potega%= \n"
"kont%=:\n\t"
// rcx = the number of character in the string excluding null
You'll notice that I'm using %= at the end of all the labels. You can read about what this does in the gcc docs, but mostly it just appends a number to the labels. Why do that? Well, if you wanted to try computing multiple strings in a single run (like I do below), you might call this code several times. But compilers (being the tricky devils that they are) might choose to "inline" your assembler. That would mean you'd have several chunks of code that all had the same label names in the same routine. Which would cause your compile to fail.
Note that I don't check to see if the string is "too long" or NULL. Left as an exercise for the student...
Ok, what else?
petla. Mostly my code matches yours.
4a. I did change to sub $'0', %%al instead of just using $48. It does the same thing, but subtracting '0' seems to me to be more "self-documenting."
4b. I also slightly reordered things to put the shr at the end. Why do that? You use cmp along with jz to see when it's time to exit the loop. The way cmp works is that it sets some flags in the flags register, then jz looks at those flags to figure out whether to jump or not. However shr sets those flags too. Each time you shift, you are moving that '1' further and further to the right. What happens when it's at the rightmost position and you shift it 1 more? You get zero. At which point the "jump if not zero" (aka jnz) works as expected. Since you have to do the shr anyway, why not use it to tell you when to exit the loop too?
That gives me:
"petla%=:\n\t"
"mov (%[pstr], %%rcx, 1), %%al\n\t" // read the next char
"sub $'0', %%al\n\t" // convert char to value
"mul %%bl\n\t" // mul bl * al -> ax
"add %%ax, %[res]\n\t" // Accumulate result
"inc %%rcx\n\t" // move to next char
"shr $1, %%rbx\n\t" // decrease our exponent
"jnz petla%=\n" // Has our exponent gone to 0?
"gotowe%=:"
Lastly, the parameters:
:[res] "=r"(result)
:[pstr] "r"(str), "0"(0)
:"%rax", "%rbx", "%rcx", "cc"
I'm going to store the result in the C variable named result. Since I specify =r with this constraint, I know that it is stored in a register, although I don't know which register the compiler will pick. But I don't need to. I can just refer to it using %[res] and let the compiler sort it out. Likewise I refer to the string using %[pstr]. I could use %0 like you did, except that since I've added result, pstr isn't %0 anymore, it's %1 (result is now %0). This is another reason to use names instead of numbers.
That last bit ("0"(0)) might take a bit of explaining. Using "0" for the constraint (instead of say "r") tells the compiler to put this value into the same place as parameter #0. The (0) says store a zero there before starting the asm. In other words, initialize the register that is going to hold result to 0. Yes, I could do this in the asm. But I prefer to let the compiler do this for me. While it may not matter in a tiny program like this, letting the C compiler do as much work as possible tends to produce the most efficient code.
So, when we wrap this all together, I get:
/*
my_file.c - The goal of this program is to take a string of numbers and
count sum of them but using 2 as a base.
example: "543" -> 5*(2^2)+4*(2^1)+3*(2^0)=31
*/
#include <stdio.h>
void TestOne(const char *str)
{
short result;
// Code assumes str is not NULL. Strings with non-digits and zero
// length strings return 0.
asm(
"xor %%rcx, %%rcx\n" // Zero the strlen count
// Count how many characters in string
"potega%=: \n\t"
"mov (%[pstr], %%rcx), %%bl\n\t" // Read the next char
"test %%bl, %%bl \n\t" // Check for 0 at end of string
"jz kont%= \n\t"
"cmp $'0', %%bl\n\t" // Ensure digit is 0-9
"jl gotowe%=\n\t"
"cmp $'9', %%bl\n\t"
"jg gotowe%=\n\t"
"inc %%rcx \n\t" // Increment index/len
"jmp potega%= \n"
"kont%=:\n\t"
// rcx = the number of character in the string excluding null
"dec %%rcx \n\t" // We want to shift rbx 1 less than pstr length
"jl gotowe%=\n\t" // Check for zero length string
"mov $1, %%rbx\n\t" // Set exponent for first digit
"shl %%cl, %%rbx\n\t"
"xor %%rcx, %%rcx\n" // Reset string index
"petla%=:\n\t"
"mov (%[pstr], %%rcx, 1), %%al\n\t" // read the next char
"sub $'0', %%al\n\t" // convert char to value
"mul %%bl\n\t" // mul bl * al -> ax
"add %%ax, %[res]\n\t" // Accumulate result
"inc %%rcx\n\t" // move to next char
"shr $1, %%rbx\n\t" // decrease our exponent
"jnz petla%=\n" // Has our exponent gone to 0?
"gotowe%=:"
:[res] "=r"(result)
:[pstr] "r"(str), "0"(0)
:"%rax", "%rbx", "%rcx", "cc"
);
printf("Wynik: \"%s\" = %d\n", str, result);
}
int main(){
TestOne("x");
TestOne("");
TestOne("5");
TestOne("54");
TestOne("543");
TestOne("5432");
return 0;
}
Notice: No global variables. And no len1. Just a pointer to the string.
It might be interesting to experiment and see how long a string you can support. Using mul %%bl, add %%ax and short result works for tiny strings like these, but will eventually be insufficient as the strings get longer (requiring eax or rax etc). I'll leave that for you too. Warning: There's a trick when moving 'up' from mul %%bl to mul %%bx.
One last point about letting the compiler do as much work as possible tends to produce the most efficient code: Sometimes people assume that since they are writing assembler, this will result in faster code than if they write it in C. However, these people fail to take into account the fact that the entire purpose of a C compiler is to turn your C code into assembler. When you turn on optimization (-O2), the compiler is almost certainly going to turn your (well-written) C code into better assembler code than anything you can write by hand.
There are thousands of tweaks and tricks like the ones I've mentioned here. And the people who write compilers know them all. While there are a few places where inline asm can make sense, smart programmers leave this work to the lunatics who write compilers whenever possible. See also this.
I realize this is just a school project and you are only doing what your teacher requires, but since she has elected to use the most difficult way possible to teach you asm, perhaps she failed to mention that the thing you are doing is something you should (almost) never do in real life.
This post turned out longer than I expected. Hopefully there is information here that you can use. And forgive my attempts at Polish labels. Hopefully I haven't said anything obscene...
As somebody pointed out - yes it's a student exercise.
When it comes to my original problem, when I removed line add $48,%%al \n" it worked. I also switched to mul %%cl.
When it comes to rest of problems, you pointed out, I talked with my professor and she slightly changed her mind (or I got the assgment wrong the first time - whatever you find more possible) and now she wanted me to return an argument from the inline function and said the intiger type was good. It resulted in me writing such piece of code (which actually does what I wanted)
example: "543" -> 5*(2^2)+4*(2^1)+3*(2^0)=31
#include <stdio.h>
char str[] = "543";
const int len = 3;
int len1 = 2;
int result;
int main(){
asm(
"mov $0, %%rbx\n"
"mov $1, %%rcx\n"
"mov $0, %%rdx\n"
"potega: \n"
"inc %%rbx\n"
"shl $1, %%cl\n"
"cmp len1, %%ebx \n"
"jl potega\n"
"mov $0, %%rbx\n"
"petla:\n"
"mov (%0, %%rbx, 1), %%al\n"
"sub $48, %%al\n"
"mul %%cl\n"
"shr $1, %%cl\n"
"add %%al, %%dl\n"
"inc %%rbx\n"
"cmp len, %%ebx\n"
"jl petla\n"
"movl %%edx, result\n"
://"=r"(result)
:"r"(&str), "r"(&result)
:"%rax", "%rbx", "%rcx", "%rdx"
);
printf("Wynik: %d\n", result);
return 0;
}
Also - I do realise, that normally you return variables the way it's showed in comment, but it didn't work, so by my professor's suggestion I wrote the program this way.
Thanks everybody for help!

multiplication instruction error in inline assembly

Consider following program:
#include <stdio.h>
int main(void) {
int foo = 10, bar = 15;
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
printf("foo+bar=%d\n", foo);
}
I know that add instruction is used for addition, sub instruction is used for subtraction & so on. But I didn't understand these lines:
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
What is the exact meaning of :"=a"(foo) :"a"(foo), "b"(bar) ); ? What it does ? And when I try to use mul instruction here I get following error for the following program:
#include <stdio.h>
int main(void) {
int foo = 10, bar = 15;
__asm__ __volatile__("mul %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
printf("foo*bar=%d\n", foo);
}
Error: number of operands mismatch for `mul'
So, why I am getting this error ? How do I solve this error ? I've searched on google about these, but I couldn't find solution of my problem. I am using windows 10 os & processor is intel core i3.
What is the exact meaning of :"=a"(foo) :"a"(foo), "b"(bar) );
There is a detailed description of how parameters are passed to the asm instruction here. In short, this is saying that bar goes into the ebx register, foo goes into eax, and after the asm is executed, eax will contain an updated value for foo.
Error: number of operands mismatch for `mul'
Yeah, that's not the right syntax for mul. Perhaps you should spend some time with an x86 assembler reference manual (for example, this).
I'll also add that using inline asm is usually a bad idea.
Edit: I can't fit a response to your question into a comment.
I'm not quite sure where to start. These questions seem to indicate that you don't have a very good grasp of how assembler works at all. Trying to teach you asm programming in a SO answer is not really practical.
But I can point you in the right direction.
First of all, consider this bit of asm code:
movl $10, %eax
movl $15, %ebx
addl %ebx, %eax
Do you understand what that does? What will be in eax when this completes? What will be in ebx? Now, compare that with this:
int foo = 10, bar = 15;
__asm__ __volatile__("add %%ebx,%%eax"
:"=a"(foo)
:"a"(foo), "b"(bar)
);
By using the "a" constraint, you are asking gcc to move the value of foo into eax. By using the "b" constraint you are asking it to move bar into ebx. It does this, then executes the instructions for the asm (ie add). On exit from the asm, the new value for foo will be in eax. Get it?
Now, let's look at mul. According to the docs I linked you to, we know that the syntax is mul value. That seems weird, doesn't it? How can there only be one parameter to mul? What does it multiple the value with?
But if you keep reading, you see "Always multiplies EAX by a value." Ahh. So the "eax" register is always implied here. So if you were to write mul %ebx, that would really be mean mul ebx, eax, but since it always has to be eax, there's no real point it writing it out.
However, it's a little more complicated than that. ebx can hold a 32bit value number. Since we are using ints (instead of unsigned ints), that means that ebx could have a number as big as 2,147,483,647. But wait, what happens if you multiply 2,147,483,647 * 10? Well, since 2,147,483,647 is already as big a number as you can store in a register, the result is much too big to fit into eax. So the multiplication (always) uses 2 registers to output the result from mul. This is what that link meant when it referred "stores the result in EDX:EAX."
So, you could write your multiplication like this:
int foo = 10, bar = 15;
int upper;
__asm__ ("mul %%ebx"
:"=a"(foo), "=d"(upper)
:"a"(foo), "b"(bar)
:"cc"
);
As before, this puts bar in ebx and foo in eax, then executes the multiplication instruction.
And after the asm is done, eax will contain the lower part of the result and edx will contain the upper. If foo * bar < 2,147,483,647, then foo will contain the result you need and upper will be zero. Otherwise, things get more complicated.
But that's as far as I'm willing to go. Other than that, take an asm class. Read a book.
PS You might also look at this answer and the 3 comments that follow that show why even your "add" example is "wrong."
PPS If this answer has resolved your question, don't forget to click the check mark next to it so I get my karma points.

Using FPU with C inline assembly

I wrote a vector structure like this:
struct vector {
float x1, x2, x3, x4;
};
Then I created a function which does some operations with inline assembly using the vector:
struct vector *adding(const struct vector v1[], const struct vector v2[], int size) {
struct vector vec[size];
int i;
for(i = 0; i < size; i++) {
asm(
"FLDL %4 \n" //v1.x1
"FADDL %8 \n" //v2.x1
"FSTL %0 \n"
"FLDL %5 \n" //v1.x2
"FADDL %9 \n" //v2.x2
"FSTL %1 \n"
"FLDL %6 \n" //v1.x3
"FADDL %10 \n" //v2.x3
"FSTL %2 \n"
"FLDL %7 \n" //v1.x4
"FADDL %11 \n" //v2.x4
"FSTL %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4) //wyjscie
:"g"(&v1[i].x1), "g"(&v1[i].x2), "g"(&v1[i].x3), "g"(&v1[i].x4), "g"(&v2[i].x1), "g"(&v2[i].x2), "g"(&v2[i].x3), "g"(&v2[i].x4) //wejscie
:
);
}
return vec;
}
Everything looks OK, but when I try to compile this with GCC I get these errors:
Error: Operand type mismatch for 'fadd'
Error: Invalid instruction suffix for 'fld'
On OS/X in XCode everything working correctly. What is wrong with this code?
Coding Issues
I'm not looking at making this efficient (I'd be using SSE/SIMD if the processor supports it). Since this part of the assignment is to use the FPU stack then here are some concerns I have:
Your function declares a local stack based variable:
struct vector vec[size];
The problem is that your function returns a vector * and you do this:
return vec;
This is very bad. The stack based variable could get clobbered after the function returns and before the data gets consumed by the caller. One alternative is to allocate memory on the heap rather than the stack. You can replace struct vector vec[size]; with:
struct vector *vec = malloc(sizeof(struct vector)*size);
This would allocate enough space for an array of size number of vector. The person who calls your function would have to use free to deallocate the memory from the heap when finished.
Your vector structure uses float, not double. The instructions FLDL, FADDL, FSTL all operate on double (64-bit floats). Each of these instructions will load and store 64-bits when used with a memory operand. This would lead to the wrong values being loaded/stored to/from the FPU stack. You should be using FLDS, FADDS, FSTS to operate on 32-bit floats.
In the assembler templates you use the g constraint on the inputs. This means the compiler is free to use any general purpose registers, a memory operand, or an immediate value. FLDS, FADDS, FSTS do not take immediate values or general purpose registers (non-FPU registers) so if the compiler attempts to do so it will likely produce errors similar to Error: Operand type mismatch for xxxx.
Since these instructions understand a memory reference use m instead of g constraint. You will need to remove the & (ampersands) from the input operands since m implies that it will be dealing with the memory address of a variable/C expression.
You don't pop the values off the FPU stack when finished. FST with a single operand copies the value at the top of the stack to the destination. The value on the stack remains. You should store it and pop it off with an FSTP instruction. You want the FPU stack to be empty when your assembler template ends. The FPU stack is very limited with only 8 slots available. If the FPU stack is not clear when the template completes then you run the risk of the FPU stack overflowing on subsequent calls. Since you leave 4 values on the stack on each call, calling the function adding a third time should fail.
To simplify the code a bit I'd recommend using a typedef to define vector. Define your structure this way:
typedef struct {
float x1, x2, x3, x4;
} vector;
All references to struct vector can simply become vector.
With all of these things in mind your code could look something like this:
typedef struct {
float x1, x2, x3, x4;
} vector;
vector *adding(const vector v1[], const vector v2[], int size) {
vector *vec = malloc(sizeof(vector)*size);
int i;
for(i = 0; i < size; i++) {
__asm__(
"FLDS %4 \n" //v1.x1
"FADDS %8 \n" //v2.x1
"FSTPS %0 \n"
"FLDS %5 \n" //v1.x2
"FADDS %9 \n" //v2.x2
"FSTPS %1 \n"
"FLDS %6 \n" //v1->x3
"FADDS %10 \n" //v2->x3
"FSTPS %2 \n"
"FLDS %7 \n" //v1->x4
"FADDS %11 \n" //v2->x4
"FSTPS %3 \n"
:"=m"(vec[i].x1), "=m"(vec[i].x2), "=m"(vec[i].x3), "=m"(vec[i].x4)
:"m"(v1[i].x1), "m"(v1[i].x2), "m"(v1[i].x3), "m"(v1[i].x4),
"m"(v2[i].x1), "m"(v2[i].x2), "m"(v2[i].x3), "m"(v2[i].x4)
:
);
}
return vec;
}
Alternative Solutions
I don't know the parameters of the assignment, but if it were to make you use GCC extended assembler templates to manually do an operation on the vector with an FPU instruction then you could define the vector with an array of 4 float. Use a nested loop to process each element of the vector independently passing each through to the assembler template to be added together.
Define the vector as:
typedef struct {
float x[4];
} vector;
The function as:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDPS\n"
:"=t"(vec[i].x[e])
:"0"(v1[i].x[e]), "u"(v2[i].x[e])
);
}
return vec;
}
This uses the i386 machine constraints t and u on the operands. Rather than passing a memory address we allow GCC to pass them via the top two slots on the FPU stack. t and u are defined as:
t
Top of 80387 floating-point stack (%st(0)).
u
Second from top of 80387 floating-point stack (%st(1)).
The no operand form of FADDP does this:
Add ST(0) to ST(1), store result in ST(1), and pop the register stack
We pass the two values to add at the top of the stack and perform an operation leaving ONLY the result in ST(0). We can then get the assembler template to copy the value on the top of the stack and pop it off automatically for us.
We can use an output operand of =t to specify the value we want moved is from the top of the FPU stack. =t will also pop (if needed) the value off the top of FPU stack for us. We can also use the top of the stack as an input value too! If the output operand is %0 we can reference it as an input operand with the constraint 0 (which means use the same constraint as operand 0). The second vector value will use the u constraint so it is passed as the second FPU stack element (ST(1))
A slight improvement that could potentially allow GCC to optimize the code it generates would be to use the % modifier on the first input operand. The % modifier is documented as:
Declares the instruction to be commutative for this operand and the following operand. This means that the compiler may interchange the two operands if that is the cheapest way to make all operands fit the constraints. ‘%’ applies to all alternatives and must appear as the first character in the constraint. Only read-only operands can use ‘%’.
Because x+y and y+x yield the same result we can tell the compiler that it can swap the operand marked with % with the one defined immediately after in the template. "0"(v1[i].x[e]) could be changed to "%0"(v1[i].x[e])
Disadvantages: We've reduced the code in the assembler template to a single instruction, and we've used the template to do most of the work setting things up and tearing it down. The problem is that if the vectors are likely going to be memory bound then we transfer between FPU registers and memory and back more times than we may like it to. The code generated may not be very efficient as we can see in this Godbolt output.
We can force memory usage by applying the idea in your original code to the template. This code may yield more reasonable results:
vector *adding(const vector v1[], const vector v2[], int size) {
int i, e;
vector *vec = malloc(sizeof(vector)*size);
for(i = 0; i < size; i++)
for (e = 0; e < 4; e++) {
__asm__(
"FADDS %2\n"
:"=&t"(vec[i].x[e])
:"0"(v1[i].x[e]), "m"(v2[i].x[e])
);
}
return vec;
}
Note: I've removed the % modifier in this case. In theory it should work, but GCC seems to emit less efficient code (CLANG seems okay) when targeting x86-64. I'm unsure if it is a bug; whether my understanding is lacking in how this operator should work; or there is an optimization being done I don't understand. Until I look at it closer I am leaving it off to generate the code I would expect to see.
In the last example we are forcing the FADDS instruction to operate on a memory operand. The Godbolt output is considerably cleaner, with the loop itself looking like:
.L3:
flds (%rdi) # MEM[base: _51, offset: 0B]
addq $16, %rdi #, ivtmp.6
addq $16, %rcx #, ivtmp.8
FADDS (%rsi) # _31->x
fstps -16(%rcx) # _28->x
addq $16, %rsi #, ivtmp.9
flds -12(%rdi) # MEM[base: _51, offset: 4B]
FADDS -12(%rsi) # _31->x
fstps -12(%rcx) # _28->x
flds -8(%rdi) # MEM[base: _51, offset: 8B]
FADDS -8(%rsi) # _31->x
fstps -8(%rcx) # _28->x
flds -4(%rdi) # MEM[base: _51, offset: 12B]
FADDS -4(%rsi) # _31->x
fstps -4(%rcx) # _28->x
cmpq %rdi, %rdx # ivtmp.6, D.2922
jne .L3 #,
In this final example GCC unwound the inner loop and only the outer loop remains. The code generated by the compiler is similar in nature to what was produced by hand in the original question's assembler template.

x86 Assembly: INC and DEC instruction and overflow flag

In x86 assembly, the overflow flag is set when an add or sub operation on a signed integer overflows, and the carry flag is set when an operation on an unsigned integer overflows.
However, when it comes to the inc and dec instructions, the situation seems to be somewhat different. According to this website, the inc instruction does not affect the carry flag at all.
But I can't find any information about how inc and dec affect the overflow flag, if at all.
Do inc or dec set the overflow flag when an integer overflow occurs? And is this behavior the same for both signed and unsigned integers?
============================= EDIT =============================
Okay, so essentially the consensus here is that INC and DEC should behave the same as ADD and SUB, in terms of setting flags, with the exception of the carry flag. This is also what it says in the Intel manual.
The problem is I can't actually reproduce this behavior in practice, when it comes to unsigned integers.
Consider the following assembly code (using GCC inline assembly to make it easier to print out results.)
int8_t ovf = 0;
__asm__
(
"movb $-128, %%bh;"
"decb %%bh;"
"seto %b0;"
: "=g"(ovf)
:
: "%bh"
);
printf("Overflow flag: %d\n", ovf);
Here we decrement a signed 8-bit value of -128. Since -128 is the smallest possible value, an overflow is inevitable. As expected, this prints out: Overflow flag: 1
But when we do the same with an unsigned value, the behavior isn't as I expect:
int8_t ovf = 0;
__asm__
(
"movb $255, %%bh;"
"incb %%bh;"
"seto %b0;"
: "=g"(ovf)
:
: "%bh"
);
printf("Overflow flag: %d\n", ovf);
Here I increment an unsigned 8-bit value of 255. Since 255 is the largest possible value, an overflow is inevitable. However, this prints out: Overflow flag: 0.
Huh? Why didn't it set the overflow flag in this case?
The overflow flag is set when an operation would cause a sign change. Your code is very close. I was able to set the OF flag with the following (VC++) code:
char ovf = 0;
_asm {
mov bh, 127
inc bh
seto ovf
}
cout << "ovf: " << int(ovf) << endl;
When BH is incremented the MSB changes from a 0 to a 1, causing the OF to be set.
This also sets the OF:
char ovf = 0;
_asm {
mov bh, 128
dec bh
seto ovf
}
cout << "ovf: " << int(ovf) << endl;
Keep in mind that the processor does not distinguish between signed and unsigned numbers. When you use 2's complement arithmetic, you can have one set of instructions that handle both. If you want to test for unsigned overflow, you need to use the carry flag. Since INC/DEC don't affect the carry flag, you need to use ADD/SUB for that case.
Intel® 64 and IA-32 Architectures Software Developer's Manuals
Look at the appropriate manual Instruction Set Reference, A-M. Every instruction is precisely documented.
Here is the INC section on affected flags:
The CF flag is not affected. The OF, SZ, ZF, AZ, and PF flags are set according to the result.
try changing your test to pass in the number rather than hard code it, then have a loop that tries all 256 numbers to find the one if any that affects the flag. Or have the asm perform the loop and exit out when it hits the flag and or when it wraps around to the number it started with (start with something other than 0x00, 0x7f, 0x80, or 0xFF).
EDIT
.globl inc
inc:
mov $33, %eax
top:
inc %al
jo done
jmp top
done:
ret
.globl dec
dec:
mov $33, %eax
topx:
dec %al
jo donex
jmp topx
donex:
ret
Inc overflows when it goes from 0x7F to 0x80. dec overflows when it goes from 0x80 to 0x7F, I suspect the problem is in the way you are using inline assembler.
As many of the other answers have pointed out, INC and DEC do not affect the CF, whereas ADD and SUB do.
What has not been said yet, however, is that this might make a performance difference. Not that you'd usually be bothered by that unless you are trying to optimise the hell out of a routine, but essentially not setting the CF means that INC/DEC only write to part of the flags register, which can cause a partial flag register stall, see Intel 64 and IA-32 Architectures Optimization Reference Manual or Agner Fog's optimisation manuals.
Except for the carry flag inc sets the flags the same way as add operand 1 would.
The fact that inc does not affect the carry flag is very important.
http://oopweb.com/Assembly/Documents/ArtOfAssembly/Volume/Chapter_6/CH06-2.html#HEADING2-117
The CPU/ALU is only capable of handling unsigned binary numbers, and then it uses OF, CF, AF, SF, ZF, etc., to allow you to decide whether to use it as a signed number (OF), an unsigned number (CF) or a BCD number (AF).
About your problem, remember to consider the binary numbers themselves, as unsigned.
**Also, the overflow and the OF require 3 numbers: The input number, a second number to use in the arithmetic, and the result number.
Overflow is activated only if the first and second numbers have the same value for the sign bit (the most significant bit) and the result has a different sign. As in, adding 2 negative numbers resulted in a positive number, or adding 2 positive numbers resulted in a negative number:
if( (Sign_Num1==Sign_Num2) && (Sign_Result!=Sign_Num1) ) OF=1;
else OF=0;
For your first problem, you are using -128 as the first number. The second number is implicitly -1, used by the DEC instruction. So we really have the binary numbers 0x80 and 0xFF. Both them have the sign bit set to 1. The result is 0x7F, which is a number with the sign bit set to 0. We got 2 initial numbers with the same sign, and a result with a different sign, so we indicate an overflow. -128-1 resulted in 127, and thus the overflow flag is set to indicate a wrong signed result.
For your second problem, you are using 255 as the first number. The second number is implicitly 1, used by the INC instruction. So we really have the binary numbers 0xFF and 0x01. Both them have a different sign bit, so it is not possible to get an overflow (it is only possible to overflow when basically adding 2 numbers of the same sign, but it is never possible to overflow with 2 numbers of a different sign because they will never lead to go beyond the possible signed value). The result is 0x00, and it doesn't set the overflow flag because 255+1, or more exactly, -1+1 gives 0, which is obviously correct for signed arithmetic.
Remember that for the overflow flag to be set, the 2 numbers being added/subtracted need to have the sign bit with the same value, and then the result must have a sign bit with a value different from them.
What the processor does is set the appropriate flags for the results of these instructions (add, adc, dec, inc, sbb, sub) for both the signed and unsigned cases i e two different flag results for every op. The alternative would be having two sets of instructions where one sets signed-related flags and the other the unsigned-related. If the issuing compiler is using unsigned variables in the operation it will test carry and zero (jc, jnc, jb, jbe etc), if signed it tests overflow, sign and zero (jo, jno, jg, jng, jl, jle etc).

Resources