Intel vs. AT&T syntax when addressing XMM and floating-point instructions - C

Hello everyone
I am working on writing an assembly program, and before I start I would like to learn how AT&T and Intel syntax differ when addressing XMM and floating-point operands. I know that for regular instructions, a push that operates on a byte is "pushb" in AT&T syntax but "push byte" in Intel syntax. Can anyone provide a similar comparison for XMM or FP instructions? In short, I want to know how XMM operands are addressed.
Thanks in advance

I'm not an AT&T fan/user, but the first place to start for Intel syntax would be the Intel developer manuals (volumes 2A and 2B contain the instruction references). These list the sizes each instruction operates on, which almost all Intel-syntax assemblers will try to deduce if not specified (push will try to narrow the variable or align it, depending on settings). Otherwise you'll generally be using qword/dword for FP (for the likes of fld) and dword/qword/dqword for MMX/SSE ops.
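For a concrete side-by-side, here is a minimal sketch (GNU as for the AT&T column, NASM-style for the Intel column). AT&T reverses the operand order (source first), prefixes registers with % and immediates with $, and puts the operand size in a mnemonic suffix for x87 loads; SSE/SSE2 mnemonics already encode the size (sd = scalar double, pd = packed double), so XMM operands need no extra suffix or size keyword:
# AT&T (GNU as)                  # Intel (NASM-style)
movsd  (%rax), %xmm0             # movsd  xmm0, qword [rax]   ; load a scalar double
addsd  %xmm1, %xmm0              # addsd  xmm0, xmm1          ; xmm0 += xmm1 (low 64 bits)
mulpd  %xmm2, %xmm0              # mulpd  xmm0, xmm2          ; packed double multiply
fldl   (%rax)                    # fld    qword [rax]         ; x87 load: AT&T size suffix 'l'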

Related

question about an assembly code correspondence to a C code practice question [duplicate]

I am reading about x86-64 (and assembly in general) through the book "Computer Systems: A Programmer's Perspective" (3rd edition). The author, in agreement with other sources from the web, states that idivq takes one operand only - just as this one claims. But then, some chapters later, the author gives an example with the instruction idivq $9, %rcx.
Two operands? I first thought this was a mistake but it happens a lot in the book from there.
Also, the dividend should be given from the quantity in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits) - so if this is defined in the architecture then it does not seem possible that the second operand could be a specified dividend.
Here is an example of an exercise (too lazy to write it all down - so a picture is the way to go). It claims that GCC emits idivq $9, %rcx when compiling a short C function.
That's a mistake. Only imul has immediate and 2-register forms.
mul, div, and idiv still exist only in the one-operand form introduced with 8086, using RDX:RAX as the implicit double-width operand for output (and input for division).
Or EDX:EAX, DX:AX, or AH:AL, depending on operand-size of course. Consult an ISA reference like Intel's manual, not this book! https://www.felixcloutier.com/x86/idiv
Also see When and why do we sign extend and use cdq with mul/div? and Why should EDX be 0 before using the DIV instruction?
x86-64's only hardware division instructions are idiv and div. 64-bit mode removed aam, which does 8-bit division by an immediate. (Dividing in Assembler x86 and Displaying Time in Assembly has an example of using aam in 16-bit mode).
Of course for division by constants idiv and div (and aam) are very inefficient. Use shifts for powers of 2, or a multiplicative inverse otherwise, unless you're optimizing for code-size instead of performance.
CS:APP 3e Global Edition apparently has multiple serious x86-64 instruction-set mistakes like this in practice problems, claiming that GCC emits impossible instructions. Not just typos or subtle mistakes, but misleading nonsense that's very obviously wrong to people familiar with the x86-64 instruction set. It's not just a syntax mistake, it's trying to use instructions that aren't encodeable (no syntax can exist to express them, other than a macro that expands to multiple instructions. Defining idivq as a pseudo-instruction using a macro would be pretty weird).
e.g. I correctly guessed missing part of a function, but gcc generated assembly code doesn't match the answer is another one where it suggests that (%rbx, %rdi, %rsi) and (%rsi, %rsi, 9) are valid addressing modes! The scale factor is actually a 2-bit shift count so these are total garbage and a sign of a serious lack of knowledge by the authors about the ISA they're teaching, not a typo.
Their code won't assemble with any AT&T syntax assembler.
Also What does this x86-64 addq instruction mean, which only have one operand? (From CSAPP book 3rd Edition) is another example, where they have a nonsensical addq %eax instead of inc %rdx, and a mismatched operand-size in a mov store.
It seems that they're just making stuff up and claiming it was emitted by GCC. IDK if they start with real GCC output and edit it into what they think is a better example, or actually write it by hand from scratch without testing it.
GCC's actual output would have used multiplication by a magic constant (fixed-point multiplicative inverse) to divide by 9 (even at -O0, but this is clearly not debug-mode code. They could have used -Os).
Presumably they didn't want to talk about Why does GCC use multiplication by a strange number in implementing integer division? and replaced that block of code with their made-up instruction. From context you can probably figure out where they expect the output to go; perhaps they mean rcx /= 9.
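For reference, the code GCC emits for x / 9 has roughly this shape (sketched here for a 32-bit signed dividend in %ecx; the registers and the constant differ for a 64-bit dividend):
movl  %ecx, %eax
movl  $954437177, %edx   # 0x38E38E39 = ceil(2^33 / 9), the multiplicative inverse
imull %edx               # EDX:EAX = EAX * 954437177 (signed widening multiply)
sarl  $1, %edx           # EDX = (x * magic) >> 33, approximately x / 9
movl  %ecx, %eax
sarl  $31, %eax          # EAX = 0 or -1: the sign of x
subl  %eax, %edx         # EDX = x / 9, rounded toward zero like idiv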
These errors are from 3rd-party practice problems in the Global Edition
From the publisher's web site (https://csapp.cs.cmu.edu/3e/errata.html)
Note on the Global Edition: Unfortunately, the publisher arranged for the generation of a different set of practice and homework problems in the global edition. The person doing this didn't do a very good job, and so these problems and their solutions have many errors. We have not created an errata for this edition.
So CS:APP 3e is probably a good textbook, as long as you get the North American edition, or ignore the practice / homework problems. This explains the huge disconnect between the textbook's reputation and wide use vs. the serious and obvious (to people familiar with x86-64 asm) errors like this one that go beyond sloppy into don't-know-the-language territory.
How a hypothetical idiv reg, reg or idiv $imm, reg would be designed
Also, the dividend should be given from the quantity in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits) - so if this is defined in the architecture then it does not seem possible that the second operand could be a specified dividend.
If Intel or AMD had introduced a new convenient form for div or idiv, they would have designed it to use a single-width dividend, because that's how compilers always use it.
Most languages are like C and implicitly promote both operands for + - * / to the same type and produce a result of that width. Of course if the inputs are known to be narrow that can be optimized away. (e.g. using one imul r32 to implement a * (int64_t)b).
But div and idiv fault if the quotient overflows so it's not safe to use a single 32-bit idiv when compiling int32_t q = (int64_t)a / (int32_t)b.
Compilers always use xor edx,edx before DIV, or cdq / cqo before IDIV, to actually do n-bit / n-bit => n-bit division.
Real full-width division using a dividend that isn't just zero- or sign-extended is only done by hand with intrinsics or asm (because gcc/clang and other compilers don't know when the optimization is safe), or in gcc helper functions that do e.g. 64-bit / 64-bit division in 32-bit code. (Or 128-bit division in 64-bit code).
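Concretely, the non-widening sequences compilers emit look like this (a sketch; cltd is GAS's AT&T name for cdq). Dividend in EAX, divisor in ECX:
cltd                # sign-extend EAX into EDX:EAX
idivl %ecx          # signed:   EAX = quotient, EDX = remainder

xorl  %edx, %edx    # zero the high half of the dividend
divl  %ecx          # unsigned: EAX = quotient, EDX = remainder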
So what would be most helpful is a div/idiv that avoids the extra instruction to set up RDX, too, as well as minimizing the number of implicit register operands. (Like imul r32, r/m32 and imul r32, r/m32, imm do: making the common case of non-widening multiplication more convenient with no implicit registers. That's Intel-syntax like the manuals, destination first)
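In AT&T syntax (operand order reversed, destination last), those existing imul forms look like:
imull %ecx, %eax         # EAX *= ECX         (imul r32, r/m32)
imull $15, %ecx, %eax    # EAX  = ECX * 15    (imul r32, r/m32, imm)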
The simplest way would be a 2-operand instruction that did dst /= src. Or maybe replaced both operands with quotient and remainder. Using a VEX encoding for 3 operands like BMI1 andn, you could maybe have
idivx remainder_dst, dividend, divisor, with the 2nd operand also being an output for the quotient. Or you could have the remainder written to RDX with a non-destructive destination for the quotient.
Or, more likely, optimize for the simple case where only the quotient is needed: idivx quot, dividend, divisor, and don't store the remainder anywhere. You can always use regular idiv when you want the remainder too.
BMI2 mulx uses an implicit rdx input operand because its purpose is to allow multiple dep chains of add-with-carry for extended-precision multiply. So it still has to produce 2 outputs. But this hypothetical new form of idiv would exist to save code-size and uops around normal uses of idiv that aren't widening. So 386 imul reg, reg/mem is the point of comparison, not BMI2 mulx.
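For comparison, a sketch of mulx (assuming BMI2 support; shown in AT&T operand order, so the explicit source comes first and the two destinations follow, low half before high half):
mulxq %rcx, %rax, %rbx   # RBX:RAX = RDX * RCX; implicit RDX source, FLAGS untouched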
IDK if it would make sense to introduce an immediate form of idivx as well; you'd only use it for code-size reasons. Multiplicative inverses are a more efficient way to divide by constants, so there's very little real-world use-case for such an instruction.
I think your book has made a mistake.
idivq only has one operand. If I try to assemble this snippet:
idivq $9, %rcx
I get this error:
test.s: Assembler messages:
test.s:1: Error: operand type mismatch for `idiv'
This works:
idivq %rcx
but you probably already know that.
It may also be a macro (unlikely, but possible; credit to @HansPassant for this).
Perhaps you should contact the book's author so that they can add an entry to the errata.
Interestingly, gas seems to allow the following:
mov $20, %rax
mov $0, %rdx
mov $5, %rcx
idivq %rcx, %rax
ret
This is still performing the one operand division under the hood, but it LOOKS like two-operand form. As long as the first operand is a register and the second operand is specifically %rax, this works. However, in general idivq seems to require the one operand form.

SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

I am somehow confused by the MOVSD assembly instruction. I wrote some numerical code computing some matrix multiplication, simply using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for compilation. But when I check the assembler output, I see that:
1) 128-bit vector registers XMM are used;
2) SSE2 instruction MOVSD is invoked.
I understand that MOVSD essentially operates on a single double-precision floating point value. It only uses the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things:
1) I never give the compiler any hint for using SSE2. Plus, I am using GCC, not the Intel compiler. As far as I know, the Intel compiler will automatically seek opportunities for vectorization, but GCC will not. So how does GCC know to use MOVSD? Or has this x86 instruction been around since long before the SSE instruction set, and is the _mm_load_sd() intrinsic in SSE2 just there to provide backward compatibility for using XMM registers for scalar computation?
2) Why does the compiler not use other floating point registers, either the 80-bit floating point stack or 64-bit floating point registers? Why must it pay the toll of using an XMM register (setting the upper 64 bits to 0 and essentially wasting that storage)? Do XMM registers provide faster access?
By the way, I have another question regarding SSE2. I just can't see the difference between _mm_store_sd() and _mm_storel_sd(). Both store the lower 64-bit value to an address. What is the difference? Performance difference?? Alignment difference??
Thank you.
Update 1:
OK, obviously when I first asked this question, I lacked some basic knowledge of how a CPU manages floating point operations, so experts tended to find the question nonsensical. Since I did not include even the shortest sample C code, people might have found it vague as well. Here I provide a review as an answer, which will hopefully be useful to anyone unclear about floating point operations on modern CPUs.
A review of floating point scalar/vector processing on modern CPUs
The idea of vector processing dates back to old-time vector processors, but those machines had been superseded by modern architectures with cache systems. So we focus on modern CPUs, especially x86 and x86-64. These architectures are the mainstream in high performance scientific computing.
Starting with the 8087 coprocessor, Intel introduced the floating point stack, where floating point numbers up to 80 bits wide can be held. This stack is commonly known as the x87 or 387 floating point "registers", with a set of x87 FPU instructions. The x87 stack registers are not real, directly addressable registers like general purpose registers, as they live on a stack. Access to register %st(i) is by offsetting from the stack-top register %st(0) (or simply %st). With the help of the FXCH instruction, which swaps the contents of the current stack top %st and some offset register %st(i), random access can be achieved. But FXCH can impose some performance penalty, though it has been minimized over time. The x87 stack provides high precision computation by calculating intermediate results with 80 bits of precision by default, to minimise roundoff error in numerically unstable algorithms. However, x87 instructions are completely scalar.
The first effort at vectorization was the MMX instruction set, which implemented integer vector operations. The vector registers under MMX are the 64-bit wide registers MMX0, MMX1, ..., MMX7. Each can be used to hold either a 64-bit integer, or multiple smaller integers in a "packed" format. A single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. So now there are the legacy general purpose registers for scalar integer operations, as well as the new MMX registers for integer vector operations, with no shared execution resources. But MMX shared state with the scalar x87 FPU: each MMX register corresponds to the lower 64 bits of an x87 register, and the upper 16 bits of the x87 registers are unused. These MMX registers are each directly addressable. But the aliasing made it difficult to work with floating point and integer vector operations in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.
Later, SSE created a separate set of 128-bit wide registers XMM0–XMM7, alongside the x87 stack. SSE instructions focused exclusively on single-precision floating-point operations (32-bit); integer vector operations were still performed using the MMX registers and MMX instruction set. But now both kinds of operations can proceed at the same time, as they share no execution resources. It is important to know that SSE not only does floating point vector operations, but also floating point scalar operations. Essentially, it provides a new place where floating point operations take place, and the x87 stack is no longer the preferred choice for them. Using XMM registers for scalar floating point operations is faster than using the x87 stack, as all XMM registers are easier to access, while the x87 stack can't be randomly accessed without FXCH. When I posted my question, I was clearly unaware of this fact. The other concept I was not clear about is that general purpose registers are integer/address registers. Even though they are 64-bit on x86-64, they cannot hold 64-bit floating point values. The main reason is that the execution unit associated with general purpose registers is the ALU (arithmetic & logic unit), which is not built for floating point computation.
SSE2 was a major step forward, as it extends the vector data types, so SSE2 instructions, either scalar or vector, can work with all C standard data types. This extension in fact made MMX obsolete. Also, the x87 stack is no longer as important as it once was. Since there are two alternative places where floating point operations can take place, you can specify your choice to the compiler. For example, for GCC, compilation with the flag
-mfpmath=387
will schedule floating point operations on the legacy x87 stack. Note that this seems to be the default for 32-bit x86, even if SSE is already available. For example, I have an Intel Core2 Duo laptop made in 2007, which was already equipped with SSE up to version SSE4, yet GCC will still by default use the x87 stack, which makes scientific computations unnecessarily slower. In this case, we need to compile with the flag
-mfpmath=sse
and GCC will schedule floating point operations on XMM registers. x86-64 users need not worry about this configuration, as it is the default on x86-64. This flag only affects scalar floating point operations. If we have written code using vector instructions and compile the code with the flag
-msse2
then XMM registers will be the only place where computation takes place. In other words, this flag turns on -mfpmath=sse. For more information see GCC's configuration of x86, x86-64. For examples of writing SSE2 C code, see my other post How to ask GCC to completely unroll this loop (i.e., peel this loop)?.
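To see what this choice means in the generated code, here is a hedged sketch of the scalar code each target produces for double add(double a, double b) { return a + b; } (register and ABI details vary by compiler and version):
# x86-64 (SSE2 scalar math is the default; args arrive in %xmm0 / %xmm1):
addsd %xmm1, %xmm0   # scalar double add in the low 64 bits of XMM0
ret

# 32-bit x86 with -mfpmath=387 (args on the stack, result returned in %st(0)):
fldl  4(%esp)        # push a onto the x87 stack
faddl 12(%esp)       # st(0) += b
ret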
The SSE instruction set, though very useful, is not the latest vector extension. AVX, the advanced vector extensions, enhances SSE by providing 3- and 4-operand instructions. See number of operands in instruction set if you are unclear about what this means. 3-operand instructions optimize the commonly seen fused multiply-add (FMA) operation in scientific computing by 1) using 1 fewer register; 2) reducing the explicit amount of data movement between registers; 3) speeding up FMA computations themselves. For an example of using AVX, see @Nominal Animal's answer to my post.

Drawing a character in VGA memory with GNU C inline assembly

I'm learning to do some low level VGA programming in DOS with C and inline assembly. Right now I'm trying to create a function that prints out a character on screen.
This is my code:
//This is the characters BITMAPS
uint8_t characters[464] = {
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x20,0x20,0x20,0x20,0x00,0x20,0x00,0x50,
0x50,0x00,0x00,0x00,0x00,0x00,0x50,0xf8,0x50,0x50,0xf8,0x50,0x00,0x20,0xf8,0xa0,
0xf8,0x28,0xf8,0x00,0xc8,0xd0,0x20,0x20,0x58,0x98,0x00,0x40,0xa0,0x40,0xa8,0x90,
0x68,0x00,0x20,0x40,0x00,0x00,0x00,0x00,0x00,0x20,0x40,0x40,0x40,0x40,0x20,0x00,
0x20,0x10,0x10,0x10,0x10,0x20,0x00,0x50,0x20,0xf8,0x20,0x50,0x00,0x00,0x20,0x20,
0xf8,0x20,0x20,0x00,0x00,0x00,0x00,0x00,0x60,0x20,0x40,0x00,0x00,0x00,0xf8,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x60,0x60,0x00,0x00,0x08,0x10,0x20,0x40,0x80,
0x00,0x70,0x88,0x98,0xa8,0xc8,0x70,0x00,0x20,0x60,0x20,0x20,0x20,0x70,0x00,0x70,
0x88,0x08,0x70,0x80,0xf8,0x00,0xf8,0x10,0x30,0x08,0x88,0x70,0x00,0x20,0x40,0x90,
0x90,0xf8,0x10,0x00,0xf8,0x80,0xf0,0x08,0x88,0x70,0x00,0x70,0x80,0xf0,0x88,0x88,
0x70,0x00,0xf8,0x08,0x10,0x20,0x20,0x20,0x00,0x70,0x88,0x70,0x88,0x88,0x70,0x00,
0x70,0x88,0x88,0x78,0x08,0x70,0x00,0x30,0x30,0x00,0x00,0x30,0x30,0x00,0x30,0x30,
0x00,0x30,0x10,0x20,0x00,0x00,0x10,0x20,0x40,0x20,0x10,0x00,0x00,0xf8,0x00,0xf8,
0x00,0x00,0x00,0x00,0x20,0x10,0x08,0x10,0x20,0x00,0x70,0x88,0x10,0x20,0x00,0x20,
0x00,0x70,0x90,0xa8,0xb8,0x80,0x70,0x00,0x70,0x88,0x88,0xf8,0x88,0x88,0x00,0xf0,
0x88,0xf0,0x88,0x88,0xf0,0x00,0x70,0x88,0x80,0x80,0x88,0x70,0x00,0xe0,0x90,0x88,
0x88,0x90,0xe0,0x00,0xf8,0x80,0xf0,0x80,0x80,0xf8,0x00,0xf8,0x80,0xf0,0x80,0x80,
0x80,0x00,0x70,0x88,0x80,0x98,0x88,0x70,0x00,0x88,0x88,0xf8,0x88,0x88,0x88,0x00,
0x70,0x20,0x20,0x20,0x20,0x70,0x00,0x10,0x10,0x10,0x10,0x90,0x60,0x00,0x90,0xa0,
0xc0,0xa0,0x90,0x88,0x00,0x80,0x80,0x80,0x80,0x80,0xf8,0x00,0x88,0xd8,0xa8,0x88,
0x88,0x88,0x00,0x88,0xc8,0xa8,0x98,0x88,0x88,0x00,0x70,0x88,0x88,0x88,0x88,0x70,
0x00,0xf0,0x88,0x88,0xf0,0x80,0x80,0x00,0x70,0x88,0x88,0xa8,0x98,0x70,0x00,0xf0,
0x88,0x88,0xf0,0x90,0x88,0x00,0x70,0x80,0x70,0x08,0x88,0x70,0x00,0xf8,0x20,0x20,
0x20,0x20,0x20,0x00,0x88,0x88,0x88,0x88,0x88,0x70,0x00,0x88,0x88,0x88,0x88,0x50,
0x20,0x00,0x88,0x88,0x88,0xa8,0xa8,0x50,0x00,0x88,0x50,0x20,0x20,0x50,0x88,0x00,
0x88,0x50,0x20,0x20,0x20,0x20,0x00,0xf8,0x10,0x20,0x40,0x80,0xf8,0x00,0x60,0x40,
0x40,0x40,0x40,0x60,0x00,0x00,0x80,0x40,0x20,0x10,0x08,0x00,0x30,0x10,0x10,0x10,
0x10,0x30,0x00,0x20,0x50,0x88,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xf8,
0x00,0xf8,0xf8,0xf8,0xf8,0xf8,0xf8};
/**************************************************************************
* put_char *
* Print char *
**************************************************************************/
void put_char(int x ,int y,int ascii_char ,byte color){
__asm__(
"push %si\n\t"
"push %di\n\t"
"push %cx\n\t"
"mov color,%dl\n\t" //test color
"mov ascii_char,%al\n\t" //test char
"sub $32,%al\n\t"
"mov $7,%ah\n\t"
"mul %ah\n\t"
"lea $characters,%si\n\t"
"add %ax,%si\n\t"
"mov $7,%cl\n\t"
"0:\n\t"
"segCS %lodsb\n\t"
"mov $6,%ch\n\t"
"1:\n\t"
"shl $1,%al\n\t"
"jnc 2f\n\t"
"mov %dl,%ES:(%di)\n\t"
"2:\n\t"
"inc %di\n\t"
"dec %ch\n\t"
"jnz 1b\n\t"
"add $320-6,%di\n\t"
"dec %cl\n\t"
"jnz 0b\n\t"
"pop %cx\n\t"
"pop %di\n\t"
"pop %si\n\t"
"retn"
);
}
I'm guiding myself with this series of tutorials written in PASCAL: http://www.joco.homeserver.hu/vgalessons/lesson8.html .
I changed the assembly syntax to match the gcc compiler, but I'm still getting these errors:
Operand mismatch type for 'lea'
No such instruction 'segcs lodsb'
No such instruction 'retn'
EDIT:
I have been working on improving my code and at least now I see something on the screen. Here's my updated code:
/**************************************************************************
* put_char *
* Print char *
**************************************************************************/
void put_char(int x,int y){
int char_offset;
int l,i,j,h,offset;
j,h,l,i=0;
offset = (y<<8) + (y<<6) + x;
__asm__(
"movl _VGA, %%ebx;" // VGA memory pointer
"addl %%ebx,%%edi;" //%di points to screen
"mov _ascii_char,%%al;"
"sub $32,%%al;"
"mov $7,%%ah;"
"mul %%ah;"
"lea _characters,%%si;"
"add %%ax,%%si;" //SI point to bitmap
"mov $7,%%cl;"
"0:;"
"lodsb %%cs:(%%si);" //load next byte of bitmap
"mov $6,%%ch;"
"1:;"
"shl $1,%%al;"
"jnc 2f;"
"movb %%dl,(%%edi);" //plot the pixel
"2:\n\t"
"incl %%edi;"
"dec %%ch;"
"jnz 1b;"
"addl $320-6,%%edi;"
"dec %%cl;"
"jnz 0b;"
: "=D" (offset)
: "d" (current_color)
);
}
If you see the image above, I was trying to write the letter "S". The results are the green pixels that you see on the upper left side of the screen. No matter what x and y I give the function, it always plots the pixels on that same spot.
Can anyone help me correct my code?
See below for an analysis of some things that are specifically wrong with your put_char function, and a version that might work. (I'm not sure about the %cs segment override, but other than that it should do what you intend).
Learning DOS and 16-bit asm isn't the best way to learn asm
First of all, DOS and 16-bit x86 are thoroughly obsolete, and are not easier to learn than normal 64-bit x86. Even 32-bit x86 is obsolete, but still in wide use in the Windows world.
32-bit and 64-bit code don't have to care about a lot of 16-bit limitations / complications like segments or limited register choice in addressing modes. Some modern systems do use segment overrides for thread-local storage, but learning how to use segments in 16-bit code is barely connected to that.
One of the major benefits to knowing asm is for debugging / profiling / optimizing real programs. If you want to understand how to write C or other high-level code that can (and actually does) compile to efficient asm, you'll probably be looking at compiler output. This will be 64-bit (or 32-bit). (e.g. see Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” which has an excellent intro to reading x86 asm for total beginners, and to looking at compiler output).
Asm knowledge is useful when looking at performance-counter results annotating a disassembly of your binary (perf stat ./a.out && perf report -Mintel: see Chandler Carruth's CppCon2015 talk: "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"). Aggressive compiler optimizations mean that looking at cycle / cache-miss / stall counts per source line are much less informative than per instruction.
Also, for your program to actually do anything, it has to either talk to hardware directly, or make system calls. Learning DOS system calls for file access and user input is a complete waste of time (except for answering the steady stream of SO questions about how to read and print multi-digit numbers in 16-bit code). They're quite different from the APIs in the current major OSes. Developing new DOS applications is not useful, so you'd have to learn another API (as well as ABI) when you get to the stage of doing something with your asm knowledge.
Learning asm on an 8086 simulator is even more limiting: 186, 286, and 386 added many convenient instructions like imul ecx, 15, making ax less "special". Limiting yourself to only instructions that work on 8086 means you'll figure out "bad" ways to do things. Other big ones are movzx / movsx, shift by an immediate count (other than 1), and push immediate. Besides performance, it's also easier to write code when these are available, because you don't have to write a loop to shift by more than 1 bit.
Suggestions for better ways to teach yourself asm
I mostly learned asm from reading compiler output, then making small changes. I didn't try to write stuff in asm when I didn't really understand things, but if you're going to learn quickly (rather than just evolve an understanding while debugging / profiling C), you probably need to test your understanding by writing your own code. You do need to understand the basics: that there are 8 or 16 integer registers plus the flags and instruction pointer, and that every instruction makes a well-defined modification to the current architectural state of the machine. (See the Intel insn ref manual for complete descriptions of every instruction; links are in the x86 wiki, along with much more good stuff.)
You might want to start with simple things like writing a single function in asm, as part of a bigger program. Understanding the kind of asm needed to make system calls is useful, but in real programs it's normally only useful to hand-write asm for inner loops that don't involve any system calls. It's time-consuming to write asm to read input and print results, so I'd suggest doing that part in C. Make sure you read the compiler output and understand what's going on, and the difference between an integer and a string, and what strtol and printf do, even if you don't write them yourself.
Once you think you understand enough of the basics, find a function in some program you're familiar with and/or interested in, and see if you can beat the compiler and save instructions (or use faster instructions). Or implement it yourself without using the compiler output as a starting point, whichever you find more interesting. This answer might be interesting, although the focus there was finding C source that got the compiler to produce the optimal ASM.
How to try to solve your own problems (before asking an SO question)
There are many SO questions from people asking "how do I do X in asm", and the answer is usually "the same as you would in C". Don't get so caught up in asm being unfamiliar that you forget how to program. Figure out what needs to happen to the data the function operates on, then figure out how to do that in asm. If you get stuck and have to ask a question, you should have most of a working implementation, with just one part that you don't know what instructions to use for one step.
You should do this with 32 or 64bit x86. I'd suggest 64bit, since the ABI is nicer, but 32bit functions will force you to make more use of the stack. So that might help you understand how a call instruction puts the return address on the stack, and where the args the caller pushed actually are after that. (This appears to be what you tried to avoid dealing with by using inline asm).
Programming hardware directly is neat, but not a generally useful skill
Learning how to do graphics by directly modifying video RAM is not useful, other than to satisfy curiosity about how computers used to work. You can't use that knowledge for anything. Modern graphics APIs exist to let multiple programs draw in their own regions of the screen, and to allow indirection (e.g. draw on a texture instead of the screen directly, so 3D window-flipping alt-tab can look fancy). There are too many reasons to list here for not drawing directly on video RAM.
Drawing on a pixmap buffer and then using a graphics API to copy it to the screen is possible. Still, doing bitmap graphics at all is more or less obsolete, unless you're generating images for PNG or JPEG or something (e.g. optimize converting histogram bins to a scatter plot in the back-end code for a web service). Modern graphics APIs abstract away the resolution, so your app can draw things at a reasonable size regardless of how big each pixel is. (small but extremely high rez screen vs. big TV at low rez).
It is kind of cool to write to memory and see something change on-screen. Or even better, hook up LEDs (with small resistors) to the data bits on a parallel port, and run an outb instruction to turn them on/off. I did this on my Linux system ages ago. I made a little wrapper program that used iopl(2) and inline asm, and ran it as root. You can probably do similar on Windows. You don't need DOS or 16bit code to get your feet wet talking to the hardware.
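A sketch of that kind of port I/O, assuming the classic LPT1 data port at 0x378 (the traditional address; it varies by machine, and you need I/O privilege first, e.g. iopl() or ioperm() on Linux, run as root):
movb $0xAA, %al      # LED bit pattern: alternate pins on/off
movw $0x378, %dx     # parallel port data register
outb %al, %dx        # drive the data pins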
in/out instructions, and normal loads/stores to memory-mapped IO, and DMA, are how real drivers talk to hardware, including things far more complicated than parallel ports. It's fun to know how your hardware "really" works, but only spend time on it if you're actually interested, or want to write drivers. The Linux source tree includes drivers for boatloads of hardware, and is often well commented, so if you like reading code as much as writing code, that's another way to get a feel for what real drivers do when they talk to hardware.
It's generally good to have some idea how things work under the hood. If you want to learn about how graphics used to work ages ago (with VGA text mode and color / attribute bytes), then sure, go nuts. Just be aware that modern OSes don't use VGA text mode, so you aren't even learning what happens under the hood on modern computers.
Many people enjoy https://retrocomputing.stackexchange.com/, reliving a simpler time when computers were less complex and couldn't support as many layers of abstraction. Just be aware that's what you're doing. It might be a good stepping stone to learning to write drivers for modern hardware, if you're sure that's why you want to understand asm / hardware.
Inline asm
You are taking a totally incorrect approach to using inline ASM. You seem to want to write whole functions in asm, so you should just do that. e.g. put your code in asmfuncs.S or something. Use .S if you want to keep using GNU / AT&T syntax; or use .asm if you want to use Intel / NASM / YASM syntax (which I would recommend, since the official manuals all use Intel syntax. See the x86 wiki for guides and manuals.)
GNU inline asm is the hardest way to learn ASM. You have to understand everything that your asm does, and what the compiler needs to know about it. It's really hard to get everything right. For example, in your edit, that block of inline asm modifies many registers that you don't list as clobbered, including %ebx which is a call-preserved register (so this is broken even if that function isn't inlined). At least you took out the ret, so things won't break as spectacularly when the compiler inlines this function into the loop that calls it. If that sounds really complicated, that's because it is, and part of why you shouldn't use inline asm to learn asm.
This answer to a similar question from misusing inline asm while trying to learn asm in the first place has more links about inline asm and how to use it well.
Getting this mess working, maybe
This part could be a separate answer, but I'll leave it together.
Besides your whole approach being fundamentally a bad idea, there is at least one specific problem with your put_char function: you use offset as an output-only operand. gcc quite happily compiles your whole function to a single ret instruction, because the asm statement isn't volatile, and its output isn't used. (Inline asm statements without outputs are assumed to be volatile.)
I put your function on godbolt, so I could look at what assembly the compiler generates surrounding it. That link is to the fixed maybe-working version, with correctly-declared clobbers, comments, cleanups, and optimizations. See below for the same code, if that external link ever breaks.
I used gcc 5.3 with the -m16 option, which is different from using a real 16bit compiler. It still does everything the 32bit way (using 32bit addresses, 32bit ints, and 32bit function args on the stack), but tells the assembler that the CPU will be in 16bit mode, so it will know when to emit operand-size and address-size prefixes.
Even if you compile your original version with -O0, the compiler computes offset = (y<<8) + (y<<6) + x;, but doesn't put it in %edi, because you didn't ask it to. Specifying it as another input operand would have worked. After the inline asm, it stores %edi into -12(%ebp), where offset lives.
Other stuff wrong with put_char:
You pass 2 things (ascii_char and current_color) into your function through globals, instead of function arguments. Yuck, that's disgusting. VGA and characters are constants, so loading them from globals doesn't look so bad. Writing in asm means you should ignore good coding practices only when it helps performance by a reasonable amount. Since the caller probably had to store those values into the globals, you're not saving anything compared to the caller storing them on the stack as function args. And for x86-64, you'd be losing perf because the caller could just pass them in registers.
Also:
j,h,l,i=0; // sets i=0, does nothing to j, h, or l.
// gcc warns: left-hand operand of comma expression has no effect
j;h;l;i=0; // equivalent to this
j=h=l=i=0; // This is probably what you meant
All the local variables are unused anyway, other than offset. Were you going to write it in C or something?
You use 16bit addresses for characters, but 32bit addressing modes for VGA memory. I assume this is intentional, but I have no idea if it's correct. Also, are you sure you should use a CS: override for the loads from characters? Does the .rodata section go into the code segment? Although you didn't declare uint8_t characters[464] as const, so it's probably just in the .data section anyway. I consider myself fortunate that I haven't actually written code for a segmented memory model, but that still looks suspicious.
If you're really using djgpp, then according to Michael Petch's comment, your code will run in 32bit mode. Using 16bit addresses is thus a bad idea.
Optimizations
You can avoid using %ebx entirely by doing this, instead of loading into ebx and then adding %ebx to %edi.
"add _VGA, %%edi\n\t" // load from _VGA, add to edi.
You don't need lea to get an address into a register. You can just use
"mov %%ax, %%si\n\t"
"add $_characters, %%si\n\t"
$_characters means the address as an immediate constant. We can save a lot of instructions by combining this with the previous calculation of the offset into the characters array of bitmaps. The immediate-operand form of imul lets us produce the result in %si in the first place:
"movzbw _ascii_char,%%si\n\t"
//"sub $32,%%ax\n\t" // AX = ascii_char - 32
"imul $7, %%si, %%si\n\t"
"add $(_characters - 32*7), %%si\n\t" // Do the -32 at the same time as adding the table address, after multiplying
// SI points to characters[(ascii_char-32)*7]
// i.e. the start of the bitmap for the current ascii character.
Since this form of imul only keeps the low 16b of the 16*16 -> 32b multiply, the 2- and 3-operand forms of imul can be used for signed or unsigned multiplies, which is why only imul (not mul) has those extra forms. For larger operand-size multiplies, 2- and 3-operand imul is faster, because it doesn't have to store the high half in %[er]dx.
You could simplify the inner loop a bit, but it would complicate the outer loop slightly: you could branch on the zero flag, as set by shl $1, %al, instead of using a counter. That would make it also unpredictable, like the jump over store for non-foreground pixels, so the increased branch mispredictions might be worse than the extra do-nothing loops. It would also mean you'd need to recalculate %edi in the outer loop each time, because the inner loop wouldn't run a constant number of times. But it could look like:
... same first part of the loop as before
// re-initialize %edi to first_pixel-1, based on outer-loop counter
"lea -1(%%edi), %%ebx\n"
".Lbit_loop:\n\t" // map the 1bpp bitmap to 8bpp VGA memory
"incl %%ebx\n\t" // inc before shift, to preserve flags
"shl $1,%%al\n\t"
"jnc .Lskip_store\n\t" // transparency: only store on foreground pixels
"movb %%dl,(%%ebx)\n" //plot the pixel
".Lskip_store:\n\t"
"jnz .Lbit_loop\n\t" // flags still set from shl
"addl $320,%%edi\n\t" // WITHOUT the -6
"dec %%cl\n\t"
"jnz .Lbyte_loop\n\t"
Note that the bits in your character bitmaps are going to map to bytes in VGA memory like {7 6 5 4 3 2 1 0}, because you're testing the bit shifted out by a left shift. So it starts with the MSB. Bits in a register are always "big endian". A left shift multiplies by two, even on a little-endian machine like x86. Little-endian only affects ordering of bytes in memory, not bits in a byte, and not even bytes inside registers.
A version of your function that might do what you intended.
This is the same as the godbolt link.
void put_char(int x,int y){
int offset = (y<<8) + (y<<6) + x;
__asm__ volatile ( // volatile is implicit for asm statements with no outputs, but better safe than sorry.
"add _VGA, %%edi\n\t" // edi points to VGA + offset.
"movzbw _ascii_char,%%si\n\t" // Better: use an input operand
//"sub $32,%%ax\n\t" // AX = ascii_char - 32
"imul $7, %%si, %%si\n\t" // can't fold the load into this because it's not zero-padded
"add $(_characters - 32*7), %%si\n\t" // Do the -32 at the same time as adding the table address, after multiplying
// SI points to characters[(ascii_char-32)*7]
// i.e. the start of the bitmap for the current ascii character.
"mov $7,%%cl\n"
".Lbyte_loop:\n\t"
"lodsb %%cs:(%%si)\n\t" //load next byte of bitmap
"mov $6,%%ch\n"
".Lbit_loop:\n\t" // map the 1bpp bitmap to 8bpp VGA memory
"shl $1,%%al\n\t"
"jnc .Lskip_store\n\t" // transparency: only store on foreground pixels
"movb %%dl,(%%edi)\n" //plot the pixel
".Lskip_store:\n\t"
"incl %%edi\n\t"
"dec %%ch\n\t"
"jnz .Lbit_loop\n\t"
"addl $320-6,%%edi\n\t"
"dec %%cl\n\t"
"jnz .Lbyte_loop\n\t"
: "+&D" (offset) // EDI modified by the asm, compiler needs to know that, so use a read-write "+" input. Early-clobber "&" because we read the other input after modifying this.
: "d" (current_color) // used read-only
: "%eax", "%ecx", "%esi", "memory"
// omit the memory clobber if your C never touches VGA memory, and your asm never loads/stores anywhere else.
// but that's not the case here: the asm loads from memory written by C
// without listing it as a memory operand (even a pointer in a register isn't sufficient)
// so gcc might optimize away "dead" stores to it, or reorder the asm with loads/stores to it.
);
}
Re: the "memory" clobber, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
I didn't use dummy output operands to leave register allocation up to the compiler's discretion, but that's a good idea to reduce the overhead of getting data in the right places for inline asm. (extra mov instructions). For example, here there was no need to force the compiler to put offset in %edi. It could have been any register we aren't already using.
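For illustration, a dummy-output version might look like this (a sketch; scratch_a / scratch_b are hypothetical names whose values are simply discarded, and "=&r" asks for an early-clobbered register of the compiler's choosing):
int scratch_a, scratch_b;   // hypothetical scratch variables, values unused
__asm__ volatile ("..."     // the asm body would use %0 / %1 as free scratch registers
    : "=&r" (scratch_a), "=&r" (scratch_b), "+&D" (offset)
    : "d" (current_color)
    : "memory");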

Efficiency of C vs Assembler

How much faster is the following assembler code:
shl ax, 1
Versus the following C code:
num = num * 2;
How can I even find out?
Your assembly variant might be faster, might be slower. What made you think that it is necessarily faster?
On the x86 platform, there are quite a few ways to multiply something by 2. I would expect a compiler to do add ax, ax, which is intuitively more efficient than your shl because it doesn't involve a potentially stored constant ('1' in your case).
Also, for quite a long time on the x86 platform, the preferred way of multiplying things by constants was not a shift, but rather a lea operation (when possible). In the above example that would be lea eax, [eax*2]. (Multiplication by 3 would be done through lea eax, [eax*2+eax].)
The belief in shift operations being somehow "faster" is a nice old story for newbies, which has virtually no relevance today. And, as usual, most of the time your compiler (if it is up-to-date) has much better knowledge about the underlying hardware platform than people with naive love for shift operations.
Is this, by any chance, an academic question? I assume you understand it is in the general category of "getting a haircut to lose weight".
If you are using GCC, ask to see the generated assembly with option -S. You may find it's the same as your assembler instruction.
To answer the original question, on Out-Of-Order processors instruction speed is measured by throughput and latency, and you would measure both using the rdtsc assembly instruction. But someone else did it for you for a lot of processors, so you don't need to bother. PDF
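A minimal sketch of the rdtsc approach (lfence is one common way to stop the out-of-order core from reordering the timestamp reads; measure a long loop and subtract the loop overhead rather than timing a single instruction):
lfence              # don't let rdtsc execute early
rdtsc               # EDX:EAX = time-stamp counter
movl %eax, %esi     # save the low 32 bits of the start time
# ... code under test, ideally a loop with many iterations ...
lfence
rdtsc
subl %esi, %eax     # EAX = elapsed reference cycles (low 32 bits)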
In most circumstances, it won't make a difference. Multiplication is fast on nearly all modern hardware. In particular, it is usually fast enough that unless you have meticulously hand-optimized code, the pipeline will hide the entirety of the latency and you will see no speed difference at all between the two cases.
You may be able to measure a performance difference on multiplies and shifts when you execute them in isolation, but there will typically not be any difference in the context of the rest of your compiled code. (As I noted, this may not hold true if the code is meticulously optimized).
Now, that said, shifts are still generally faster than multiplies, and almost any reasonable compiler will map a fixed power-of-two multiply into a shift, anyway (assuming that the semantics are actually equivalent on the target architecture).
Edit: one more thing you may want to try if you really care about this is x+x. I know of at least one architecture on which this can actually be faster than shifting, depending on the surrounding context.
If you have a decent compiler, it will produce the same or similar code. The best way is to disassemble and check the generated code.
The answer depends, as you've been able to see here, upon many things. What the compiler will do with your C code depends on a lot of things. If we're talking x86-32 the following should be generally applicable.
At the basic level, your C code indicates a memory variable, which would require at least one memory-operand instruction to multiply by two: "shl mem,1"; in such a simple case the C code will be slower.
If num is a local variable the compiler may decide to put it in a register (if it is used often enough and/or the function is small enough) and then you will have your "shl reg,1" instruction - maybe.
What instruction is fastest has everything to do with how they are implemented in the processor. Shl may not be the best choice since it affects the C and Z flags which slows it down. A few years ago the recommendation was "lea reg,[reg+reg]" (all reg are the same) because lea didn't affect any flags and there were variants such as (using the eax register on x86-32 platform as an example):
lea eax,[eax+eax] ; *2
lea eax,[eax+eax*2] ; *3
lea eax,[eax+eax*4] ; *5
lea eax,[eax+eax*8] ; *9
I don't know what the norm is today, but your compiler probably does.
As for measuring, search for information here on the rdtsc instruction, which is the hands-down best alternative, as it counts actual clock cycles.
Put them in a loop with a counter that goes so high that it runs for at least a second in the fastest case. Use your favorite timing mechanism to see how long each takes.
The assembly case should be done with inline assembly in the same C program as you use for the pure C test. Otherwise, you're not comparing apples to apples.
By the way, I think you should add a third test:
num <<= 1;
The question then is whether that does the same thing as the assembly version.
If, for your target platform, shifting left is the quickest way to multiply a number by two, then the chances are your compiler will do that when compiling the code. Look at the disassembly to check
So, for that one line, it's probably exactly the same speed. However, as you're unlikely to have a function containing just that one line, you might well find the compiler would defer the shift until the value is used, or otherwise mix it up with surrounding code, making it less clear cut. A good optimizing compiler will generally do a good job of beating poor to average hand written assembly.
If the up-to-date compiler (VC9) were really doing a good job, it would outperform VC6 by a wide margin, and this doesn't occur. This is why I even prefer to use VC6 for some code, which runs faster than code compiled with MinGW at -O3 or VC9 with /Ox.
