Inserting "marker" instructions into assembly without GCC reordering them

For purposes of doing performance analysis it is useful to be able to
tell which line of C code goes with which line of generated assembly
code. This can be very difficult once a sufficient number of
optimization passes get involved, and I devised the following scheme
to make it easier (though it has a lot of caveats). I figured I would
use in-line assembly to insert an instruction that is effectively a
nop, but that the compiler would rarely or never generate itself. Then
when I looked at the generated code I could infer that assembly code
that appears between the inserted marker instructions probably comes
from C code that lies between the in-line assembly statements.
I came up with these candidates:
// Force insertion of an instruction that will only clobber
// flags and that the compiler hardly ever uses itself. Lie and say
// that it alters memory to try to prevent the compiler from moving
// code around it. Mark it volatile so the compiler can't remove it entirely.
#define ASSEMBLY_MARKER_0() \
__asm__ volatile ("cld" : /* no outputs */ : /* no inputs */ : "memory", "cc")
#define ASSEMBLY_MARKER_1() \
__asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc")
Then I decided to test whether the compiler would move instructions
across these boundaries. clang appears to do exactly what I want, but
GCC appears to not be deterred either by the memory clobbering or the
fact that this snippet is volatile. It reorders instructions anyway!
Is there any way to prevent this?
I know there are a lot of caveats to this method even if I get it to
work -- I may heavily influence generated code around the markers. But
I maintain that it would still be useful for finding things like
accidental implicit conversions between integer widths, and other
"wait that should never be necessary..." type problems.
You can see the difference between GCC and clang here: https://godbolt.org/z/ZtUPc9
C code:
int f(int x)
{
__asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc");
int j = x << 3;
__asm__ volatile ("xorl %%eax,0" : /* no outputs */ : /* no inputs */ : "memory", "cc");
return j;
}
GCC:
xorl %eax,0
xorl %eax,0
lea eax, [0+rdi*8]
ret
Clang:
xor dword ptr [0], eax
lea eax, [8*rdi]
xor dword ptr [0], eax
ret
Edits to answer questions in comments:
Why not nops? Because gcc inserts those itself often. The point is to stick out.
Why not move code into its own function? If you're doing this analysis on C++ template code, for example, there may be many layers of inlining that occur before producing the function that actually goes in the executable, and the code may be very different if you turn off the inlining (e.g. the code may have been written with the assumption that constant folding, dead code elimination, etc. would get rid of trivial things).

Then I decided to test whether the compiler would move instructions across these boundaries. clang appears to do exactly what I want, but GCC appears to not be deterred either by the memory clobbering or the fact that this snippet is volatile. It reorders instructions anyway! Is there any way to prevent this?
Not really. A "memory" clobber only stops the compiler from moving memory accesses across the statement, and volatile only orders the statement relative to other volatile operations (volatile accesses and other asm volatile statements). On x86, the "cc" (condition code) clobber likewise means only that a def-use chain of the condition codes cannot straddle the statement. None of these barriers prevent the compiler from moving unrelated instructions, such as register-only arithmetic, across the marker.
Sometimes it can be helpful to add the options -save-temps -fverbose-asm to better understand assembly code and its relation to C. New versions of GCC dump C/C++ code alongside assembly code (dumped as *.s). When you inspect assembly (as opposed to disassembly), it's sufficient to inject asm comments to show where the inline asm is injected; there is no need to add actual instructions. Legibility of the assembly might be improved by disabling debug info (-g0).
To better understand the code, you can also disable passes that usually cause a great deal of instruction reordering, like instruction scheduling (-fno-schedule-insns, -fno-schedule-insns2), but that has a big performance impact, of course.
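One way to keep a comment-only marker next to a particular computation is to give the asm statement the value of interest as a read-write operand. The following is a minimal sketch under that assumption, not something the answer above prescribes: the macro name ASM_MARKER_ON is hypothetical, and forcing the value into a register at that point can itself change the generated code.

```c
/* Hypothetical helper (not from the question). It emits only an assembler
 * comment, so no real instruction is added. The "+r"(v) operand tells the
 * compiler the statement both reads and writes v, so the code computing v
 * must come before the marker and the code using v must come after it. */
#define ASM_MARKER_ON(text, v) \
    __asm__ volatile ("/* marker: " text " */" : "+r"(v) : : "memory")

int f(int x)
{
    ASM_MARKER_ON("before shift", x);
    int j = x << 3;
    ASM_MARKER_ON("after shift", j);
    return j;
}
```

Compiling this with -O2 -S and searching the .s file for "marker:" should show the shift between the two comments, since the second marker depends on j and the shift depends on the x produced by the first.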

Related

Why can't local variable be used in GNU C basic inline asm statements?

Why can't I use local variables from main in basic inline asm? It is only allowed in extended asm, but why?
(I know local variables are on the stack after the return address (and therefore cannot be used once the function returns), but that should not be a reason not to use them.)
An example of basic asm:
int a = 10; //global a
int b = 20; //global b
int result;
int main() {
asm ( "pusha\n\t"
"movl a, %eax\n\t"
"movl b, %ebx\n\t"
"imull %ebx, %eax\n\t"
"movl %eax, result\n\t"
"popa");
printf("the answer is %d\n", result);
return 0;
}
An example of extended asm:
int main (void) {
int data1 = 10; //local var - could be used in extended
int data2 = 20;
int result;
asm ( "imull %%edx, %%ecx\n\t"
"movl %%ecx, %%eax"
: "=a"(result)
: "d"(data1), "c"(data2));
printf("The result is %d\n",result);
return 0;
}
Compiled with:
gcc -m32 somefile.c
platform:
uname -a:
Linux 5.0.0-32-generic #34-Ubuntu SMP Wed Oct 2 02:06:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

You can use local variables in extended assembly, but you need to tell the extended assembly construct about them. Consider:
#include <stdio.h>
int main (void)
{
int data1 = 10;
int data2 = 20;
int result;
__asm__(
" movl %[mydata1], %[myresult]\n"
" imull %[mydata2], %[myresult]\n"
: [myresult] "=&r" (result)
: [mydata1] "r" (data1), [mydata2] "r" (data2));
printf("The result is %d\n",result);
return 0;
}
In this [myresult] "=&r" (result) says to select a register (r) that will be used as an output (=) value for the lvalue result, and that register will be referred to in the assembly as %[myresult] and must be different from the input registers (&). (You can use the same text in both places, result instead of myresult; I just made it different for illustration.)
Similarly [mydata1] "r" (data1) says to put the value of expression data1 into a register, and it will be referred to in the assembly as %[mydata1].
I modified the code in the assembly so that it only modifies the output register. Your original code modifies %ecx but does not tell the compiler it is doing that. You could have told the compiler that by putting "ecx" after a third :, which is where the list of “clobbered” registers goes. However, since my code lets the compiler assign a register, I would not have a specific register to list in the clobbered register. There may be a way to tell the compiler that one of the input registers will be modified but is not needed for output, but I do not know. (Documentation is here.) For this task, a better solution is to tell the compiler to use the same register for one of the inputs as the output:
__asm__(
" imull %[mydata1], %[myresult]\n"
: [myresult] "=r" (result)
: [mydata1] "r" (data1), [mydata2] "0" (data2));
In this, the 0 with data2 says to make it the same as operand 0. The operands are numbered in the order they appear, starting with 0 for the first output operand and continuing into the input operands. So, when the assembly code starts, %[myresult] will refer to some register that the value of data2 has been placed in, and the compiler will expect the new value of result to be in that register when the assembly is done.
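As a self-contained illustration of the matching constraint described above, here is a small sketch that wraps the same statement in a function. The function name mul_matching is hypothetical, the "cc" clobber is an addition (imull changes the flags), and the plain C fallback for non-x86 targets is only there so the sketch builds everywhere.

```c
/* Hypothetical wrapper around the matching-constraint example. */
int mul_matching(int data1, int data2)
{
    int result;
#if defined(__i386__) || defined(__x86_64__)
    /* "0" places data2 in the same register as operand 0 (result), so the
     * single imull both consumes data2 and produces result. */
    __asm__("imull %[mydata1], %[myresult]"
            : [myresult] "=r" (result)
            : [mydata1] "r" (data1), [mydata2] "0" (data2)
            : "cc");
#else
    result = data1 * data2;   /* fallback so the sketch builds elsewhere */
#endif
    return result;
}
```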
When doing this, you have to match the constraint with how a thing will be used in assembly. For the r constraint, the compiler supplies some text that can be used in assembly language where a general processor register is accepted. Others include m for a memory reference, and i for an immediate operand.

There is little distinction between "Basic asm" and "Extended asm"; "basic asm" is just a special case where the __asm__ statement has no lists of outputs, inputs, or clobbers. The compiler does not do % substitution in the assembly string for Basic asm. If you want inputs or outputs you have to specify them, and then it's what people call "extended asm".
In practice, it may be possible to access external (or even file-scope static) objects from "basic asm". This is because these objects will (respectively may) have symbol names at the assembly level. However, to perform such access you need to be careful of whether it is position-independent (if your code will be linked into libraries or PIE executables) and meets other ABI constraints that might be imposed at linking time, and there are various considerations for compatibility with link-time optimization and other transformations the compiler may perform. In short, it's a bad idea because you can't tell the compiler that a basic asm statement modified memory. There's no way to make it safe.
A "memory" clobber (Extended asm) can make it safe to access static-storage variables by name from the asm template.
The use-case for basic asm is things that modify the machine state only, like asm("cli") in a kernel to disable interrupts, without reading or writing any C variables. (Even then, you'd often use a "memory" clobber to make sure the compiler had finished earlier memory operations before changing machine state.)
Local (automatic storage, not static ones) variables fundamentally never have symbol names, because they don't exist in a single instance; there's one object per live instance of the block they're declared in, at runtime. As such, the only possible way to access them is via input/output constraints.
Users coming from MSVC-land may find this surprising, since MSVC's inline assembly scheme papers over the issue by transforming local variable references in its version of inline asm into stack-pointer-relative accesses, among other things. The version of inline asm it offers, however, is not compatible with an optimizing compiler, and little to no optimization can happen in functions using that type of inline asm. GCC and the larger compiler world that grew alongside C out of Unix do not do anything similar.

You can't safely use globals in Basic Asm statements either; it happens to work with optimization disabled but it's not safe and you're abusing the syntax.
There's very little reason to ever use Basic Asm. Even for machine-state control like asm("cli") to disable interrupts, you'd often want a "memory" clobber to order it wrt. loads / stores to globals. In fact, GCC's https://gcc.gnu.org/wiki/ConvertBasicAsmToExtended page recommends never using Basic Asm because it differs between compilers, and GCC might change to treating it as clobbering everything instead of nothing (because of existing buggy code that makes wrong assumptions). This would make a Basic Asm statement that uses push/pop even more inefficient if the compiler is also generating stores and reloads around it.
Basically the only use-case for Basic Asm is writing the body of an __attribute__((naked)) function, where data inputs/outputs / interaction with other code follows the ABI's calling convention, instead of whatever custom convention the constraints / clobbers describe for a truly inline block of code.
The design of GNU C inline asm is that it's text that you inject into the compiler's normal asm output (which is then fed to the assembler, as). Extended asm makes the string a template that it can substitute operands into. And the constraints describe how the asm fits into the data-flow of the program logic, as well as registers it clobbers.
Instead of parsing the string, there is syntax that you need to use to describe exactly what it does. Parsing the template for var names would only solve part of the language-design problem that operands need to solve, and would make the compiler's code more complicated. (It would have to know more about every instruction to know whether memory, register, or immediate was allowed, and stuff like that. Normally its machine-description files only need to know how to go from logical operation to asm, not the other direction.)
Your Basic asm block is broken because you modify C variables without telling the compiler about it. This could break with optimization enabled, perhaps only with more complex surrounding code, but happening to work is not the same thing as actually being safe. This is why merely testing GNU C inline asm code is not even close to sufficient for it to be future-proof against new compilers and changes in surrounding code. There is no implicit "memory" clobber. (Basic asm is the same as Extended asm except for not doing % substitution on the string literal, so you don't need %% to get a literal % in the asm output. It's implicitly volatile, like Extended asm with no outputs.)
Also note that if you were targeting i386 MacOS, you'd need _result in your asm. result only happens to work because the asm symbol name exactly matches the C variable name. Using Extended asm constraints would make it portable between GNU/Linux (no leading underscore) vs. other platforms that do use a leading _.
Your Extended asm is broken because you modify an input ("c") (without telling the compiler that register is also an output, e.g. an output operand using the same register).
It's also inefficient: if a mov is the first or last instruction of your template, you're almost always doing it wrong and should have used better constraints.
Instead, you can do:
asm ("imull %%edx, %%ecx\n\t"
: "=c"(result)
: "d"(data1), "c"(data2));
Or better, use "+r"(data2) and "r"(data1) operands to give the compiler free choice when doing register allocation instead of potentially forcing the compiler to emit unnecessary mov instructions. (See @Eric's answer using named operands and "=r" and a matching "0" constraint; that's equivalent to "+r" but lets you use different C names for the input and output.)
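The "+r" form mentioned above might look like the following sketch. The function name mul_readwrite and the operand names acc and src are made up for illustration, the "cc" clobber is added because imull changes the flags, and the C fallback exists only so the sketch builds on non-x86 targets.

```c
/* Hypothetical wrapper showing a read-write ("+r") operand. */
int mul_readwrite(int data1, int data2)
{
#if defined(__i386__) || defined(__x86_64__)
    /* data2 is both input and output; the compiler picks both registers. */
    __asm__("imull %[src], %[acc]"
            : [acc] "+r" (data2)
            : [src] "r" (data1)
            : "cc");
#else
    data2 *= data1;           /* fallback so the sketch builds elsewhere */
#endif
    return data2;
}
```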
Look at the asm output of the compiler to see how code-gen happened around your asm statement, if you want to make sure it was efficient.
Since local vars don't have a symbol / label in the asm text (instead they live in registers or at some offset from the stack or frame pointer, i.e. automatic storage), it can't work to use symbol names for them in asm.
Even for global vars, you want the compiler to be able to optimize around your inline asm as much as possible, so you want to give the compiler the option of using a copy of a global var that's already in a register, instead of getting the value in memory in sync with a store just so your asm can reload that.
Having the compiler try to parse your asm and figure out which C local var names are inputs and outputs would have been possible. (But would be a complication.)
But if you want it to be efficient, you need to figure out when x in the asm can be a register like EAX, instead of doing something braindead like always storing x into memory before the asm statement, and then replacing x with 8(%rsp) or whatever. If you want to give the asm statement control over where inputs can be, you need constraints in some form. Doing it on a per-operand basis makes total sense, and means the inline-asm handling doesn't have to know that bts can take an immediate or register source but not memory, and other machine-specific details like that. (Remember, GCC is a portable compiler; baking a huge amount of per-machine info into the inline-asm parser would be bad.)
(MSVC forces all C vars in _asm{} blocks to be memory. It's impossible to use it to efficiently wrap a single instruction because the input has to bounce through memory, even if you wrap it in a function so you can use the officially supported hack of leaving a value in EAX and falling off the end of a non-void function. See "What is the difference between 'asm', '__asm' and '__asm__'?". And in practice MSVC's implementation was apparently pretty brittle and hard to maintain, so much so that they removed it for x86-64, and it was documented as not supported in functions with register args even in 32-bit mode! That's not the fault of the syntax design, though, just the actual implementation.)
Clang does support -fasm-blocks for _asm { ... } MSVC-style syntax where it parses the asm and you use C var names. It probably forces inputs and outputs into memory but I haven't checked.
Also note that GCC's inline asm syntax with constraints is designed around the same system of constraints that GCC-internals machine-description files use to describe the ISA to the compiler. (The .md files in the GCC source that tell the compiler about an instruction to add numbers that takes inputs in "r" registers, and has the text string for the mnemonic. Notice the "r" and "m" in some examples in https://gcc.gnu.org/onlinedocs/gccint/RTL-Template.html).
The design model of asm in GNU C is that it's a black box to the optimizer; you must fully describe the effects of the code (to the optimizer) using constraints. If you clobber a register, you have to tell the compiler. If you have an input operand that you want to destroy, you need to use a dummy output operand with a matching constraint, or a "+r" operand to update the corresponding C variable's value.
If you read or write memory pointed to by a register input, you have to tell the compiler. (See "How can I indicate that the memory *pointed* to by an inline ASM argument may be used?")
If you use the stack, you have to tell the compiler (but you can't, so instead you have to avoid stepping on the red zone; see "Using base pointer register in C++ inline asm"). See also the inline-assembly tag wiki.
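For the pointed-to-memory rule above, one common approach is a dummy "m" input operand covering the object the asm reads, which avoids a full "memory" clobber. This is a minimal sketch assuming an x86 target; the function name load_via_asm and the operand names are hypothetical, and the C fallback is only there so the sketch builds on other targets.

```c
/* Hypothetical example: the asm dereferences a pointer held in a register. */
int load_via_asm(const int *p)
{
    int v;
#if defined(__i386__) || defined(__x86_64__)
    /* [ptr] only says the pointer value is an input. The extra "m"(*p)
     * operand, unused in the template, tells the compiler that the int
     * p points to is also read, so any pending store to it is done first. */
    __asm__("movl (%[ptr]), %[val]"
            : [val] "=r" (v)
            : [ptr] "r" (p), "m" (*p));
#else
    v = *p;                   /* fallback so the sketch builds elsewhere */
#endif
    return v;
}
```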
GCC's design makes it possible for the compiler to give you an input in a register, and use the same register for a different output. (Use an early-clobber constraint if that's not ok; GCC's syntax is designed to efficiently wrap a single instruction that reads all its inputs before writing any of its outputs.)
If GCC could only infer all of these things from C var names appearing in asm source, I don't think that level of control would be possible. (At least not plausible.) And there'd probably be surprising effects all over the place, not to mention missed optimizations. You only ever use inline asm when you want maximum control over things, so the last thing you want is the compiler using a lot of complex opaque logic to figure out what to do.
(Inline asm is complex enough in its current design, and not used much compared to plain C, so a design that requires very complex compiler support would probably end up with a lot of compiler bugs.)
GNU C inline asm isn't designed for low-performance low-effort. If you want easy, just write in pure C or use intrinsics and let the compiler do its job. (And file missed-optimization bug reports if it makes sub-optimal code.)

This is because asm is a defined language which is common for all compilers on the same processor family. After using the __asm__ keyword, you can reliably use any good manual for the processor to then start writing useful code.
But it does not have a defined interface for C, and let's be honest, if you don't interface your assembler with your C code then why is it there?
Examples of useful very simple asm: generate a debug interrupt; set the floating point register mode (exceptions/accuracy);
Each compiler writer has invented their own mechanism to interface to C. For example in one old compiler you had to declare the variables you want to share as named registers in the C code. In GCC and clang they allow you to use their quite messy 2-step system to reference an input or output index, then associate that index with a local variable.
This mechanism is the "extension" to the asm standard.
Of course, the asm is not really a standard. Change processor and your asm code is trash. When we talk in general about sticking to the c/c++ standards and not using extensions, we don't talk about asm, because you are already breaking every portability rule there is.
Then, on top of that, if you are going to call C functions, or your asm declares functions that are callable by C then you will have to match to the calling conventions of your compiler. These rules are implicit. They constrain the way you write your asm, but it will still be legal asm, by some criteria.
But if you were just writing your own asm functions, and calling them from asm, you may not be constrained so much by the c/c++ conventions: make up your own register argument rules; return values in any register you want; make stack frames, or don't; preserve the stack frame through exceptions - who cares?
Note that you might still be constrained by the platform's relocatable code conventions (these are not "C" conventions, but are often described using C syntax), but this is still one way that you can write a chunk of "portable" asm functions, then call them using "extended" embedded asm.

How to specify clobbered bottom of the x87 FPU stack with extended gcc assembly?

In a codebase of ours I found this snippet for fast, towards-negative-infinity [1] rounding on x87:
inline int my_int(double x)
{
int r;
#ifdef _GCC_
asm ("fldl %1\n"
"fistpl %0\n"
:"=m"(r)
:"m"(x));
#else
// ...
#endif
return r;
}
I'm not extremely familiar with GCC extended assembly syntax, but from what I gather from the documentation:
r must be a memory location, where I'm writing back stuff;
x must be a memory location too, whence the data comes from.
there's no clobber specification, so the compiler can rest assured that at the end of the snippet the registers are as it left them.
Now, to come to my question: it's true that in the end the FPU stack is balanced, but what if all the 8 locations were already in use and I'm overflowing it? How can the compiler know that it cannot trust ST(7) to be where it left it? Should some clobber be added?
Edit I tried to specify st(7) in the clobber list and it seems to affect the codegen, now I'll wait for some confirmation of this fact.
As a side note: looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
__asm__ __volatile__ ("fistpl %0"
: "=m" (retval)
: "t" (x)
: "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl); what is that "st" clobber? The docs seem to mention only t (i.e. the top of the stack).
[1] Yes, it depends on the current rounding mode, which in our application should always be "towards negative infinity".

looking at the implementation of the barebones lrint both in glibc and in MinGW I see something like
__asm__ __volatile__ ("fistpl %0"
: "=m" (retval)
: "t" (x)
: "st");
where we are asking for the input to be placed directly in ST(0) (which avoids that potentially useless fldl)
This is actually the correct way to represent the code you want as inline assembly.
To get the most optimal possible code generated, you want to make use of the inputs and outputs. Rather than hard-coding the necessary load/store instructions, let the compiler generate them. Not only does this introduce the possibility of eliding potentially unnecessary instructions, it also means that the compiler can better schedule these instructions when they are required (that is, it can interleave the instruction within a prior sequence of code, often minimizing its cost).
what is that "st" clobber? The docs seems to mention only t (i.e. the top of the stack).
The "st" clobber refers to the st(0) register, i.e., the top of the x87 FPU stack. What Intel/MASM notation calls st(0), AT&T/GAS notation generally refers to as simply st. And, as per GCC's documentation for clobbers, the items in the clobber list are "either register names or the special clobbers" ("cc" (condition codes/flags) and "memory"). So this just means that the inline assembly clobbers (overwrites) the st(0) register. The reason why this clobber is necessary is that the fistpl instruction pops the top of the stack, thus clobbering the original contents of st(0).
The only thing that concerns me regarding this code is the following paragraph from the documentation:
Clobber descriptions may not in any way overlap with an input or output operand. For example, you may not have an operand describing a register class with one member when listing that register in the clobber list. Variables declared to live in specific registers (see Explicit Register Variables) and used as asm input or output operands must have no part mentioned in the clobber description. In particular, there is no way to specify that input operands get modified without also specifying them as output operands.
When the compiler selects which registers to use to represent input and output operands, it does not use any of the clobbered registers. As a result, clobbered registers are available for any use in the assembler code.
As you already know, the t constraint means the top of the x87 FPU stack. The problem is, this is the same as the st register, and the documentation very clearly said that we could not have a clobber that specifies the same register as one of the input/output operands. Furthermore, since the documentation states that the compiler is forbidden to use any of the clobbered registers to represent input/output operands, this inline assembly makes an impossible request—load this value at the top of the x87 FPU stack without putting it in st!
Now, I would assume that the authors of glibc know what they are doing and are more familiar with the compiler's implementation of inline assembly than you or I, so this code is probably legal and legitimate.
Actually, it seems that the unusual case of the x87's stack-like registers forces an exception to the normal interactions between clobbers and operands. The official documentation says:
On x86 targets, there are several rules on the usage of stack-like registers in the operands of an asm. These rules apply only to the operands that are stack-like registers:
Given a set of input registers that die in an asm, it is necessary to know which are implicitly popped by the asm, and which must be explicitly popped by GCC.
An input register that is implicitly popped by the asm must be explicitly clobbered, unless it is constrained to match an output operand.
That fits our case exactly.
Further confirmation is provided by an example appearing in the official documentation (bottom of the linked section):
This asm takes two inputs, which are popped by the fyl2xp1 opcode, and replaces them with one output. The st(1) clobber is necessary for the compiler to know that fyl2xp1 pops both inputs.
asm ("fyl2xp1" : "=t" (result) : "0" (x), "u" (y) : "st(1)");
Here, the clobber st(1) is the same as the input constraint u, which seems to violate the above-quoted documentation regarding clobbers, but is used and justified for precisely the same reason that "st" is used as the clobber in your original code, because fistpl pops the input.
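Applying the same rule to the function from the question gives something like the sketch below. It assumes an x86 target with the x87 FPU available; the name round_x87 is hypothetical, and the non-x87 fallback is only a rough stand-in (it rounds halves away from zero rather than following the current rounding mode).

```c
/* Hypothetical rewrite of my_int using the "t" input and "st" clobber. */
int round_x87(double x)
{
    int r;
#if defined(__i386__) || defined(__x86_64__)
    /* The compiler loads x into st(0) itself; fistpl stores it as an int
     * and pops it, which is why "st" is listed as clobbered. The result
     * follows the current x87 rounding mode. */
    __asm__ volatile ("fistpl %0"
                      : "=m" (r)
                      : "t" (x)
                      : "st");
#else
    r = (int)(x < 0 ? x - 0.5 : x + 0.5);   /* approximate fallback */
#endif
    return r;
}
```

With the default round-to-nearest mode, round_x87(3.7) yields 4; with the rounding mode set towards negative infinity, as in the question's application, it would yield 3.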
All of that said, and now that you know how to correctly write the code in inline assembly, I have to echo previous commenters who suggested that the best solution would be not to use inline assembly at all. Just call lrint, which not only has the exact semantics that you want, but can also be better optimized by the compiler under certain circumstances (e.g., transforming it into a single cvtsd2si instruction when the target architecture supports SSE).

C, Inline Assembly - manual function call [duplicate]

I don't have experience in assembly, but this is what I've been working on. I would like input if I'm missing any fundamental aspects to passing parameters and calling a function via pointer in assembly.
For instance, I'm wondering if I am supposed to restore ecx, edx, esi, edi. I read they are general purpose registers, but I couldn't find out whether they need to be restored. Is there any kind of cleanup I am supposed to do after a call?
This is the code I have now, and it does work:
#include "stdio.h"
void foo(int a, int b, int c, int d)
{
printf("values = %d and %d and %d and %d\r\n", a, b, c, d);
}
int main()
{
int a=3,b=6,c=9,d=12;
__asm__(
"mov %3, %%ecx;"
"mov %2, %%edx;"
"mov %1, %%esi;"
"mov %0, %%edi;"
"call %4;"
:
: "g"(a), "g"(b), "g"(c), "g"(d), "a"(foo)
);
}

The original question was Is this assembly function call safe/complete?. The answer to that is: no. While it may appear to work in this simple example (especially if optimizations are disabled), you are violating rules that will eventually lead to failures (ones that are really hard to track down).
I'd like to address the (obvious) followup question of how to make it safe, but without feedback from the OP on the actual intent, I can't really do that.
So, I'll do the best I can with what we have and try to describe the things that make it unsafe and some of the things you can do about it.
Let's start by simplifying that asm:
__asm__(
"mov %0, %%edi;"
:
: "g"(a)
);
Even with this single statement, this code is already unsafe. Why? Because we are changing the value of a register (edi) without letting the compiler know.
How can the compiler not know you ask? After all, it's right there in the asm! The answer comes from this line in the gcc docs:
GCC does not parse the assembler instructions themselves and does not
know what they mean or even whether they are valid assembler input.
In that case, how do you let gcc know what's going on? The answer lies in using the constraints (the stuff after the colons) to describe the impact of the asm.
Perhaps the simplest way to fix this code would be like this:
__asm__(
"mov %0, %%edi;"
:
: "g"(a)
: "edi"
);
This adds edi to the clobber list. In brief, this tells gcc that the value of edi is going to be changed by the code, and that gcc shouldn't assume any particular value will be in it when the asm exits.
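To see the clobber list doing its job in something that can be compiled and run, here is a small sketch that uses edi as a scratch register. The function name copy_through_edi and the output operand are additions for illustration, and the C fallback is only there so the sketch builds on non-x86 targets.

```c
/* Hypothetical example: edi is used as scratch and declared as clobbered. */
int copy_through_edi(int a)
{
    int out;
#if defined(__i386__) || defined(__x86_64__)
    /* Because "edi" is in the clobber list, the compiler will not place
     * either operand in edi and will not expect edi to survive the asm. */
    __asm__("movl %[in], %%edi\n\t"
            "movl %%edi, %[out]"
            : [out] "=r" (out)
            : [in] "g" (a)
            : "edi");
#else
    out = a;                  /* fallback so the sketch builds elsewhere */
#endif
    return out;
}
```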
Now, while that's the easiest, it's not necessarily the best way. Consider this code:
__asm__(
""
:
: "D"(a)
);
This uses a machine constraint to tell gcc to put the value of the variable a into the edi register for you. Doing it this way, gcc will load the register for you at a 'convenient' time, perhaps by always keeping a in edi.
There is one (significant) caveat to this code: By putting the parameter after the 2nd colon, we are declaring it to be an input. Input parameters are required to be read-only (ie they must have the same value on exiting the asm).
In your case, the call statement means that we won't be able to guarantee that edi won't be changed, so this doesn't quite work. There are a few ways to deal with this. The easiest is to move the constraint up after the first colon, making it an output, and specify "+D" to indicate that the value is read+write. But then the contents of a are going to be pretty much undefined after the asm (printf could set it to anything). If destroying a is unacceptable, there's always something like this:
int junk;
__asm__ volatile (
""
: "=D" (junk)
: "0"(a)
);
This tells gcc that on starting the asm, it should put the value of the variable a into the same place as output constraint #0 (ie edi). It also says that on output, edi won't be a anymore, it will contain the variable junk.
Edit: Since the 'junk' variable isn't actually going to be used, we need to add the volatile qualifier. Volatile was implicit when there weren't any output parameters.
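The dummy-output pattern can be tried in isolation with a sketch like the one below, where the asm deliberately overwrites the register that held its input. The names double_it, scratch and out are hypothetical, and unlike the snippet above this version lets the compiler choose the register ("=r") and has a real output, so volatile is not needed. The C fallback is only there so the sketch builds on non-x86 targets.

```c
/* Hypothetical example: the input register is destroyed inside the asm. */
int double_it(int a)
{
    int junk;   /* receives the destroyed register; never read */
    int out;
#if defined(__i386__) || defined(__x86_64__)
    /* "0"(a) puts a in the same register as operand 0 (junk). Declaring
     * that register as an output tells the compiler it no longer holds a
     * afterwards, so the variable a itself stays intact. */
    __asm__("addl %[scratch], %[scratch]\n\t"
            "movl %[scratch], %[out]"
            : [scratch] "=r" (junk), [out] "=r" (out)
            : "0" (a)
            : "cc");
#else
    junk = a + a;             /* fallback so the sketch builds elsewhere */
    out = junk;
#endif
    (void)junk;
    return out;
}
```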
One other point on that line: You end it with a semi-colon. This is legal and will work as expected. However, if you ever want to use the -S command line option to see exactly what code got generated (and if you want to get good with inline asm, you will), you will find that produces difficult-to-read code. I'd recommend using \n\t instead of a semi-colon.
All that and we're still on the first line...
Obviously the same would apply to the other three mov statements.
Which brings us to the call statement.
Both Michael and I have listed a number of reasons doing call in inline asm is difficult.
Handling all the registers that may be clobbered by the function call's ABI.
Handling red-zone.
Handling alignment.
Memory clobber.
If the goal here is 'learning,' then feel free to experiment. But I don't know that I would ever feel comfortable doing this in production code. Even when it looks like it works, I'd never feel confident there wasn't some weird case I'd missed. That's aside from my normal concerns about using inline asm at all.
I know, that's a lot of information. Probably more than you were looking for as an introduction to gcc's asm command, but you've picked a challenging place to start.
If you haven't done so already, spend time looking over all the docs in gcc's Assembly Language interface. There's a lot of good information there along with examples to try to explain how it all works.

I read they are general purpose registers, but I couldn't find if they
need to be restored?
I am not an expert in the field, but from my reading of the x86-64 ABI (Figure 3.4), the registers %rdi, %rsi, %rdx, and %rcx are not preserved across function calls, so apparently they don't need to be restored.
As David Wohlferd commented, you should be careful: either way, the compiler will not be aware of the "custom" function call, so you may get in its way, particularly because it may not be aware that the registers were modified.

The difference between asm, asm volatile and clobbering memory

When implementing lock-free data structures and timing code it's often necessary to suppress the compiler's optimisations. Normally people do this using asm volatile with memory in the clobber list, but you sometimes see just asm volatile or just a plain asm clobbering memory.
What impact do these different statements have on code generation (particularly in GCC, as it's unlikely to be portable)?
Just for reference, these are the interesting variations:
asm (""); // presumably this has no effect on code generation
asm volatile ("");
asm ("" ::: "memory");
asm volatile ("" ::: "memory");

See the "Extended Asm" page in the GCC documentation.
You can prevent an asm instruction from being deleted by writing the keyword volatile after the asm. [...] The volatile keyword indicates that the instruction has important side-effects. GCC will not delete a volatile asm if it is reachable.
and
An asm instruction without any output operands will be treated identically to a volatile asm instruction.
None of your examples have output operands specified, so the asm and asm volatile forms behave identically: they create a point in the code which may not be deleted (unless it is proved to be unreachable).
This is not quite the same as doing nothing. See this question for an example of a dummy asm which changes code generation - in that example, code that goes round a loop 1000 times gets vectorised into code which calculates 16 iterations of the loop at once; but the presence of an asm inside the loop inhibits the optimisation (the asm must be reached 1000 times).
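A minimal sketch of that situation, assuming GCC or clang; the function name sum_with_marker is hypothetical. The result is the same as for the plain loop, only the generated code differs.

```c
/* Hypothetical example: a reduction loop with an empty asm in the body. */
static unsigned sum_with_marker(const unsigned *v, unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 0; i < n; i++) {
        total += v[i];
        /* No outputs, so this is treated as volatile: it has to be
         * reached once per iteration, which tends to block vectorisation. */
        __asm__ volatile ("");
    }
    return total;
}
```

Comparing the -O3 assembly with and without the asm statement is an easy way to see the effect on a given compiler version.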
The "memory" clobber makes GCC assume that any memory may be arbitrarily read or written by the asm block, so will prevent the compiler from reordering loads or stores across it:
This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.
(That does not prevent a CPU from reordering loads and stores with respect to another CPU, though; you need real memory barrier instructions for that.)
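As a sketch of the compiler-only barrier described above (GCC/clang syntax assumed; COMPILER_BARRIER, payload, ready and publish are illustrative names, not from the question):

```c
/* Compiler-only fence: emits no instruction at all. */
#define COMPILER_BARRIER() __asm__ volatile ("" ::: "memory")

static int payload;
static int ready;

/* Hypothetical example: store the data, then set the flag. */
static void publish(int value)
{
    payload = value;
    COMPILER_BARRIER();  /* the compiler may not move memory accesses across this point */
    ready = 1;           /* the CPU itself may still reorder; that needs a real fence */
}
```

This only constrains what the compiler emits. For ordering between threads on other cores, something like C11 atomic_thread_fence or a hardware fence instruction is still required.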

asm ("") does nothing (or at least, it's not supposed to do anything).
asm volatile ("") also does nothing.
asm ("" ::: "memory") is a simple compiler fence.
asm volatile ("" ::: "memory") is, AFAIK, the same as the previous one. The volatile keyword tells the compiler that it's not allowed to move this assembly block. Without volatile, for example, an asm block may be hoisted out of a loop if the compiler decides that the input values are the same in every invocation. I'm not really sure under what conditions the compiler will decide that it understands enough about the assembly to try to optimize its placement, but the volatile keyword suppresses that entirely. That said, I would be very surprised if the compiler attempted to move an asm statement that had no declared inputs or outputs.
Incidentally, volatile also prevents the compiler from deleting the expression if it decides that the output values are unused. This can only happen if there are output values though, so it doesn't apply to asm ("" ::: "memory").
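To show a case where volatile does make a difference, here is a small sketch of an asm that has an output operand (GCC/clang assumed; the name opaque is hypothetical). It is the kind of helper sometimes used in timing code to stop the compiler from folding a value away.

```c
/* Hypothetical helper: returns x unchanged, but hides that fact from the optimiser. */
static int opaque(int x)
{
    /* "+r": x is both read and written in some general-purpose register.
     * The template is empty, yet the compiler must assume x changed.
     * volatile keeps the statement even if the result ends up unused. */
    __asm__ volatile ("" : "+r" (x));
    return x;
}
```

Without volatile, a call whose result is never used could be removed entirely, since the asm would then look like a pure computation with an unused output.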

Just for completeness on Lily Ballard's answer, Visual Studio 2010 offers _ReadBarrier(), _WriteBarrier() and _ReadWriteBarrier() to do the same (VS2010 doesn't allow inline assembly for 64-bit apps).
These don't generate any instructions but affect the behaviour of the compiler. A nice example is here.
MemoryBarrier() generates lock or DWORD PTR [rsp], 0

Embedding assembly in C with compiler finding registers for you

When embedding assembly code into a C/C++ program, you can avoid clobbering registers by saving them with a push instruction (or by specifying a clobber list, if the compiler supports it).
If you are including assembly inline and want to avoid the overhead of pushing and popping clobbered registers, is there a way of letting gcc choose registers for you (e.g. ones it knows have no useful info in them)?

Yes. You can specify that you want a particular variable (input or output) to be stored in a register, but you don't have to specify a register. See this document for a detailed explanation. Essentially, the inline assembly looks like this:
asm("your assembly instructions"
: output1("=a"), // I want output1 in the eax register
output2("=r"), // output2 can be in any general-purpose register
output3("=q"), // output3 can be in eax, ebx, ecx, or edx
output4("=A") // output4 can be in eax or edx
: /* inputs */
: /* clobbered registers */
);

Compiler intrinsics are a very useful way to mix assembly and C/C++ code. They're declarations that look like functions, but actually compile directly to individual native instructions (via a special case inside the compiler). This gives you much of the control of working in assembly, but leaves the register coloring and scheduling up to the compiler.
An advantage is that then you can pass an ordinary C variable into an intrinsic, and let the compiler take care of loading it onto the register and scheduling other ops around it. For example,
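#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_add_ps

struct TwoVectors
{
    __m128 a;
    __m128 b;
};
// adds two vectors A += B using the native SSE opcode
static inline void SimdADD( struct TwoVectors *v )
{
    v->a = _mm_add_ps( v->a , v->b ); // packed add of four floats; typically a single ADDPS instruction
}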
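A self-contained sketch of the same idea operating on plain float arrays, assuming an x86 target with SSE enabled (the scalar branch is only a fallback for other targets). The helper name add4 is hypothetical.

```c
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* Hypothetical helper: out[i] = x[i] + y[i] for exactly four floats. */
static void add4(const float *x, const float *y, float *out)
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);             /* unaligned load of 4 floats */
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));  /* packed add, then unaligned store */
#else
    for (int i = 0; i < 4; i++)              /* scalar fallback */
        out[i] = x[i] + y[i];
#endif
}
```

The compiler chooses the xmm registers and schedules the loads and stores, which is the main attraction over hand-written inline asm.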

OK, so I can't leave a comment above, but I'm pretty sure that the correct syntax (different from that shown above) is:
asm ( "your assembly instructions"
: "=a"(output1),
"=r"(output2),
"=q"(output3),
"=A"(output4)
: /* inputs */
: /* clobbered registers */
);
Although you can leave the allocation of input and output registers to the compiler, there's no obvious way of leaving the allocation of scratch/temp registers (i.e. used for intermediate values but not an input or output) to the compiler. Historically, I just listed them explicitly in the clobber list (e.g. "%xmm1", "%rcx"), but I now think it might be better to list them as outputs in order to allow the compiler to choose them. I don't know of any source that addresses this issue definitively.
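One way the "list the scratch register as an output" idea might look, as a sketch only: it assumes an x86 target, AT&T syntax, and GCC-style constraints, and the function triple is a made-up example. The dummy output uses the earlyclobber modifier "&" so that the compiler does not place it in the same register as the input.

```c
/* Hypothetical example: computes 3 * x using a compiler-chosen scratch register. */
static int triple(int x)
{
#if defined(__x86_64__) || defined(__i386__)
    int result;
    int scratch;  /* dummy output: never used in C, only reserves a register */
    __asm__ (
        "movl %2, %1\n\t"   /* scratch = x */
        "addl %1, %1\n\t"   /* scratch = 2 * x */
        "movl %2, %0\n\t"   /* result = x */
        "addl %1, %0"       /* result = x + 2 * x */
        : "=&r" (result), "=&r" (scratch)  /* "&": written before all inputs are consumed */
        : "r" (x)
        : "cc"              /* addl changes the flags */
    );
    return result;
#else
    return 3 * x;           /* plain C fallback for non-x86 targets */
#endif
}
```

Compared with naming a fixed register in the clobber list, this leaves the choice to the register allocator, at the cost of declaring a C variable that is never read.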