NASM Assembly Add Array elements [duplicate]

NASM Assembly Add Array elements [duplicate] - arrays

mov ecx, 16
looptop: .
.
.
loop looptop
How many times will this loop execute?
What happens if ecx = 0 to start with? Does loop jump or fall-through in that case?

loop is exactly like dec ecx / jnz, except it doesn't set flags.
It's like the bottom of a do {} while(--ecx != 0); in C. If execution enters the loop with ecx = 0, wrap-around means the loop will run 2^32 times. (Or 2^64 times in 64-bit mode, because it uses RCX.)
Unlike rep movsb/stosb/etc., it doesn't check for ECX=0 before decrementing, only after1.
The address-size determines whether it uses CX, ECX, or RCX. So in 64-bit code, addr32 loop is like dec ecx / jnz, while a regular loop is like dec rcx / jnz. Or in 16-bit code, it normally uses CX, but an address-size prefix (0x67) will make it use ecx. As Intel's manual says, it ignores REX.W, because that sets the operand-size, not the address-size.
rep string instructions use the address-size prefix the same way, overriding the address size but also RCX vs. ECX (or CX vs. ECX in modes other than 64-bit). The operand-size for string instructions is already used to determine movsw vs. movsd vs. movsq, and you want address/repeat size to be orthogonal to that. Having loop and jrcxz/jecxz follow that behaviour is just continuing the design intent from 8086 of loop being intended for use with string operations when a simple rep couldn't get the job done; see below.
Related: Why are loops always compiled into "do...while" style (tail jump)? for more about loop structure in asm, while() {} vs. do {} while() and how to lay them out.
Footnote 1: jcxz (or x86-64 jrcxz) was intended for use before the top of a do {} while style loop, to skip it if it should run 0 times. On modern CPUs test rcx, rcx / jz is more efficient.
Stephen Morse, architect of 8086, wrote about the intended uses of loop/jcxz with string instructions in that section of his book, The 8086 Primer, available for free on his web site: https://www.stevemorse.org/8086/index.html. See the "complex string instructions" subsection, starting at the bottom of page 71. (Or start reading from earlier in the chapter, the whole String Instructions section starts on page 66. But note #ecm's review of a few things the book seems to explain poorly or incorrectly.)
If you're wondering about the design intent of x86 instructions, you won't find a better source than this. That's separate from the best / most efficient way to use them, especially on modern x86, but very good intro for beginners into what you can do with asm instructions as building blocks.
Extra debugging tips
If you ever want to know the details on an instruction, check the manual: either Intel's official vol.2 PDF instruction set reference manual, or an html extract with each entry on a different page (http://felixcloutier.com/x86/). But note that the HTML leaves out the intro and appendices that have details on how to interpret stuff, like when it says "flags are set according to the result" for instructions like add.
And you can (and should) also just try stuff in a debugger: single-step and watch registers change. Use a smaller starting value for ecx so you get to the interesting ecx=1 part sooner. See also the x86 tag wiki for links to manuals, guides, and asm debugging tips at the bottom.
And BTW, if the instructions inside the loop that aren't shown modify ecx, it could loop any number of times. For the question to have a simple and unique answer, you need a guarantee that the instructions between the label and the loop instruction don't modify ecx. (They could save/restore it, but if you're going to do that it's usually better to just use a different register as the loop counter. push/pop inside a loop makes your code hard to read.)
Rant about over-use of LOOP even when you already need to increment something else in the loop. LOOP isn't the only way to loop, and usually it's the worst.
You should normally never use the loop instruction unless optimizing for code-size at the expense of speed, because it's slow. Compilers don't use it. (So CPU vendors don't bother to make it fast; catch 22.) Use dec / jnz, or an entirely different loop condition. (See also http://agner.org/optimize/ to learn more about what's efficient.)
Loops don't even have to use a counter; it's often just as good if not better to compare a pointer to an end address, or to check for some other condition. (Pointless use of loop is one of my pet peeves, especially when you already have something in another register that would work as a loop counter.) Using cx as a loop counter often just ties up one of your precious few registers when you could have used cmp/jcc on another register you were incrementing anyway.
IMO, loop should be considered one of those obscure x86 instructions that beginners shouldn't be distracted with. Like stosd (without a rep prefix), aam or xlatb. It does have real uses when optimizing for code size, though. (That's sometimes useful in real life for machine code (like for boot sectors), not just for stuff like code golf.)
IMO, just teach / learn how conditional branches work, and how to make loops out of them. Then you won't get stuck into thinking there's something special about a loop that uses loop. I've seen an SO question or comment that said something like "I thought you had to declare loops", and didn't realize that loop was just an instruction.
</rant>. Like I said, loop is one of my pet peeves. It's an obscure code-golfing instruction, unless you're optimizing for an actual 8086.

Related

Why does Knuth use this clunky decrement?

I'm looking at some of Prof. Don Knuth's code, written in CWEB that is converted to C. A specific example is dlx1.w, available from Knuth's website
At one stage, the .len value of a struct nd[cc] is decremented, and it is done in a clunky way:
o,t=nd[cc].len-1;
o,nd[cc].len=t;
(This is a Knuth-specific question, so maybe you already know that "o," is a preprocessor macro for incrementing "mems", which is a running total of effort expended, as measured by accesses to 64-bit words.) The value remaining in "t" is definitely not used for anything else. (The example here is on line 665 of dlx1.w, or line 193 of dlx1.c after ctangle.)
My question is: why does Knuth write it this way, rather than
nd[cc].len--;
which he does actually use elsewhere (line 551 of dlx1.w):
oo,nd[k].len--,nd[k].aux=i-1;
(And "oo" is a similar macro for incrementing "mems" twice -- but there is some subtlety here, because .len and .aux are stored in the same 64-bit word. To assign values to S.len and S.aux, only one increment to mems would normally be counted.)
My only theory is that a decrement consists of two memory accesses: first to look up, then to assign. (Is that correct?) And this way of writing it is a reminder of the two steps. This would be unusually verbose of Knuth, but maybe it is an instinctive aide-memoire rather than didacticism.
For what it's worth, I've searched in CWEB documentation without finding an answer. My question probably relates more to Knuth's standard practices, which I am picking up bit by bit. I'd be interested in any resources where these practices are laid out (and maybe critiqued) as a block -- but for now, let's focus on why Knuth writes it this way.

A preliminary remark: with Knuth-style literate programming (i.e. when reading WEB or CWEB programs) the “real” program, as conceived by Knuth, is neither the “source” .w file nor the generated (tangled) .c file, but the typeset (woven) output. The source .w file is best thought of as a means to produce it (and of course also the .c source that's fed to the compiler). (If you don't have cweave and TeX handy; I've typeset some of these programs here; this program DLX1 is here.)
So in this case, I'd describe the location in the code as module 25 of DLX1, or subroutine "cover":
Anyway, to return to the actual question: note that this (DLX1) is one of the programs written for The Art of Computer Programming. Because reporting the time taken by a program “seconds” or “minutes” becomes meaningless from year to year, he reports how long a program took in number of “mems” plus “oops”, that's dominated by the “mems”, i.e. the number of memory accesses to 64-bit words (usually). So the book contains statements like “this program finds the answer to this problem in 3.5 gigamems of running time”. Further, the statements are intended to be fundamentally about the program/algorithm itself, not the specific code generated by a specific version of a compiler for certain hardware. (Ideally when the details are very important he writes the program in MMIX or MMIXAL and analyses its operations on the MMIX hardware, but this is rare.) Counting the mems (to be reported as above) is the purpose of inserting o and oo instructions into the program. Note that it's more important to get this right for the “inner loop” instructions that are executed a lot of times, such as everything in the subroutine cover in this case.
This is elaborated in Section 1.3.1′ (part of Fascicle 1):
Timing. […] The running time of a program depends not only on the clock rate but also on the number of functional units that can be active simultaneously and the degree to which they are pipelined; it depends on the techniques used to prefetch instructions before they are executed; it depends on the size of the random-access memory that is used to give the illusion of 264 virtual bytes; and it depends on the sizes and allocation strategies of caches and other buffers, etc., etc.
For practical purposes, the running time of an MMIX program can often be estimated satisfactorily by assigning a fixed cost to each operation, based on the approximate running time that would be obtained on a high-performance machine with lots of main memory; so that’s what we will do. Each operation will be assumed to take an integer number of υ, where υ (pronounced “oops”) is a unit that represents the clock cycle time in a pipelined implementation. Although the value of υ decreases as technology improves, we always keep up with the latest advances because we measure time in units of υ, not in nanoseconds. The running time in our estimates will also be assumed to depend on the number of memory references or mems that a program uses; this is the number of load and store instructions. For example, we will assume that each LDO (load octa) instruction costs µ + υ, where µ is the average cost of a memory reference. The total running time of a program might be reported as, say, 35µ+ 1000υ, meaning “35 mems plus 1000 oops.” The ratio µ/υ has been increasing steadily for many years; nobody knows for sure whether this trend will continue, but experience has shown that µ and υ deserve to be considered independently.
And he does of course understand the difference from reality:
Even though we will often use the assumptions of Table 1 for seat-of-the-pants estimates of running time, we must remember that the actual running time might be quite sensitive to the ordering of instructions. For example, integer division might cost only one cycle if we can find 60 other things to do between the time we issue the command and the time we need the result. Several LDB (load byte) instructions might need to reference memory only once, if they refer to the same octabyte. Yet the result of a load command is usually not ready for use in the immediately following instruction. Experience has shown that some algorithms work well with cache memory, and others do not; therefore µ is not really constant. Even the location of instructions in memory can have a significant effect on performance, because some instructions can be fetched together with others. […] Only the meta-simulator can be trusted to give reliable information about a program’s actual behavior in practice; but such results can be difficult to interpret, because infinitely many configurations are possible. That’s why we often resort to the much simpler estimates of Table 1.
Finally, we can use Godbolt's Compiler Explorer to look at the code generated by a typical compiler for this code. (Ideally we'd look at MMIX instructions but as we can't do that, let's settle for the default there, which seems to be x68-64 gcc 8.2.) I removed all the os and oos.
For the version of the code with:
/*o*/ t = nd[cc].len - 1;
/*o*/ nd[cc].len = t;
the generated code for the first line is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov eax, DWORD PTR [rax]
lea r14d, [rax-1]
and for the second line is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov DWORD PTR [rax], r14d
For the version of the code with:
/*o ?*/ nd[cc].len --;
the generated code is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov eax, DWORD PTR [rax]
lea edx, [rax-1]
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov DWORD PTR [rax], edx
which as you can see (even without knowing much about x86-64 assembly) is simply the concatenation of the code generated in the former case (except for using register edx instead of r14d), so it's not as if writing the decrement in one line has saved you any mems. In particular, it would not be correct to count it as a single one, especially in something like cover that is called a huge number of times in this algorithm (dancing links for exact cover).
So the version as written by Knuth is correct, for its goal of counting the number of mems. He could also write oo,nd[cc].len--; (counting two mems) as you observed, but perhaps it might look like a bug at first glance in that case. (BTW, in your example in the question of oo,nd[k].len--,nd[k].aux=i-1; the two mems come from the load and the store in --; not two stores.)

This whole practice seems to be based on a mistaken idea/model of how C works, that there's some correspondence between the work performed by the abstract machine and by the actual program as executed (i.e. the "C is portable assembler" fallacy). I don't think we can answer a lot more about why that exact code fragment appears, except that it happens to be an unusual idiom for counting loads and stores on the abstract machine as separate.

How exactly does the x86 LOOP instruction work?

mov ecx, 16
looptop: .
.
.
loop looptop
How many times will this loop execute?
What happens if ecx = 0 to start with? Does loop jump or fall-through in that case?

Drawing a character in VGA memory with GNU C inline assembly

I´m learning to do some low level VGA programming in DOS with C and inline assembly. Right now I´m trying to create a function that prints out a character on screen.
This is my code:
//This is the characters BITMAPS
uint8_t characters[464] = {
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x20,0x20,0x20,0x20,0x00,0x20,0x00,0x50,
0x50,0x00,0x00,0x00,0x00,0x00,0x50,0xf8,0x50,0x50,0xf8,0x50,0x00,0x20,0xf8,0xa0,
0xf8,0x28,0xf8,0x00,0xc8,0xd0,0x20,0x20,0x58,0x98,0x00,0x40,0xa0,0x40,0xa8,0x90,
0x68,0x00,0x20,0x40,0x00,0x00,0x00,0x00,0x00,0x20,0x40,0x40,0x40,0x40,0x20,0x00,
0x20,0x10,0x10,0x10,0x10,0x20,0x00,0x50,0x20,0xf8,0x20,0x50,0x00,0x00,0x20,0x20,
0xf8,0x20,0x20,0x00,0x00,0x00,0x00,0x00,0x60,0x20,0x40,0x00,0x00,0x00,0xf8,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x60,0x60,0x00,0x00,0x08,0x10,0x20,0x40,0x80,
0x00,0x70,0x88,0x98,0xa8,0xc8,0x70,0x00,0x20,0x60,0x20,0x20,0x20,0x70,0x00,0x70,
0x88,0x08,0x70,0x80,0xf8,0x00,0xf8,0x10,0x30,0x08,0x88,0x70,0x00,0x20,0x40,0x90,
0x90,0xf8,0x10,0x00,0xf8,0x80,0xf0,0x08,0x88,0x70,0x00,0x70,0x80,0xf0,0x88,0x88,
0x70,0x00,0xf8,0x08,0x10,0x20,0x20,0x20,0x00,0x70,0x88,0x70,0x88,0x88,0x70,0x00,
0x70,0x88,0x88,0x78,0x08,0x70,0x00,0x30,0x30,0x00,0x00,0x30,0x30,0x00,0x30,0x30,
0x00,0x30,0x10,0x20,0x00,0x00,0x10,0x20,0x40,0x20,0x10,0x00,0x00,0xf8,0x00,0xf8,
0x00,0x00,0x00,0x00,0x20,0x10,0x08,0x10,0x20,0x00,0x70,0x88,0x10,0x20,0x00,0x20,
0x00,0x70,0x90,0xa8,0xb8,0x80,0x70,0x00,0x70,0x88,0x88,0xf8,0x88,0x88,0x00,0xf0,
0x88,0xf0,0x88,0x88,0xf0,0x00,0x70,0x88,0x80,0x80,0x88,0x70,0x00,0xe0,0x90,0x88,
0x88,0x90,0xe0,0x00,0xf8,0x80,0xf0,0x80,0x80,0xf8,0x00,0xf8,0x80,0xf0,0x80,0x80,
0x80,0x00,0x70,0x88,0x80,0x98,0x88,0x70,0x00,0x88,0x88,0xf8,0x88,0x88,0x88,0x00,
0x70,0x20,0x20,0x20,0x20,0x70,0x00,0x10,0x10,0x10,0x10,0x90,0x60,0x00,0x90,0xa0,
0xc0,0xa0,0x90,0x88,0x00,0x80,0x80,0x80,0x80,0x80,0xf8,0x00,0x88,0xd8,0xa8,0x88,
0x88,0x88,0x00,0x88,0xc8,0xa8,0x98,0x88,0x88,0x00,0x70,0x88,0x88,0x88,0x88,0x70,
0x00,0xf0,0x88,0x88,0xf0,0x80,0x80,0x00,0x70,0x88,0x88,0xa8,0x98,0x70,0x00,0xf0,
0x88,0x88,0xf0,0x90,0x88,0x00,0x70,0x80,0x70,0x08,0x88,0x70,0x00,0xf8,0x20,0x20,
0x20,0x20,0x20,0x00,0x88,0x88,0x88,0x88,0x88,0x70,0x00,0x88,0x88,0x88,0x88,0x50,
0x20,0x00,0x88,0x88,0x88,0xa8,0xa8,0x50,0x00,0x88,0x50,0x20,0x20,0x50,0x88,0x00,
0x88,0x50,0x20,0x20,0x20,0x20,0x00,0xf8,0x10,0x20,0x40,0x80,0xf8,0x00,0x60,0x40,
0x40,0x40,0x40,0x60,0x00,0x00,0x80,0x40,0x20,0x10,0x08,0x00,0x30,0x10,0x10,0x10,
0x10,0x30,0x00,0x20,0x50,0x88,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xf8,
0x00,0xf8,0xf8,0xf8,0xf8,0xf8,0xf8};
/**************************************************************************
* put_char *
* Print char *
**************************************************************************/
void put_char(int x ,int y,int ascii_char ,byte color){
__asm__(
"push %si\n\t"
"push %di\n\t"
"push %cx\n\t"
"mov color,%dl\n\t" //test color
"mov ascii_char,%al\n\t" //test char
"sub $32,%al\n\t"
"mov $7,%ah\n\t"
"mul %ah\n\t"
"lea $characters,%si\n\t"
"add %ax,%si\n\t"
"mov $7,%cl\n\t"
"0:\n\t"
"segCS %lodsb\n\t"
"mov $6,%ch\n\t"
"1:\n\t"
"shl $1,%al\n\t"
"jnc 2f\n\t"
"mov %dl,%ES:(%di)\n\t"
"2:\n\t"
"inc %di\n\t"
"dec %ch\n\t"
"jnz 1b\n\t"
"add $320-6,%di\n\t"
"dec %cl\n\t"
"jnz 0b\n\t"
"pop %cx\n\t"
"pop %di\n\t"
"pop %si\n\t"
"retn"
);
}
I´m guiding myself from this series of tutorials written in PASCAL: http://www.joco.homeserver.hu/vgalessons/lesson8.html .
I changed the assembly syntax according to the gcc compiler, but I´m still getting this errors:
Operand mismatch type for 'lea'
No such instruction 'segcs lodsb'
No such instruction 'retn'
EDIT:
I have been working on improving my code and at least now I see something on the screen. Here´s my updated code:
/**************************************************************************
* put_char *
* Print char *
**************************************************************************/
void put_char(int x,int y){
int char_offset;
int l,i,j,h,offset;
j,h,l,i=0;
offset = (y<<8) + (y<<6) + x;
__asm__(
"movl _VGA, %%ebx;" // VGA memory pointer
"addl %%ebx,%%edi;" //%di points to screen
"mov _ascii_char,%%al;"
"sub $32,%%al;"
"mov $7,%%ah;"
"mul %%ah;"
"lea _characters,%%si;"
"add %%ax,%%si;" //SI point to bitmap
"mov $7,%%cl;"
"0:;"
"lodsb %%cs:(%%si);" //load next byte of bitmap
"mov $6,%%ch;"
"1:;"
"shl $1,%%al;"
"jnc 2f;"
"movb %%dl,(%%edi);" //plot the pixel
"2:\n\t"
"incl %%edi;"
"dec %%ch;"
"jnz 1b;"
"addl $320-6,%%edi;"
"dec %%cl;"
"jnz 0b;"
: "=D" (offset)
: "d" (current_color)
);
}
If you see the image above I was trying to write the letter "S". The results are the green pixels that you see on the upper left side of the screen. No matter what x and y I give the functon it always plots the pixels on that same spot.
Can anyone help me correct my code?

See below for an analysis of some things that are specifically wrong with your put_char function, and a version that might work. (I'm not sure about the %cs segment override, but other than that it should do what you intend).
Learning DOS and 16-bit asm isn't the best way to learn asm
First of all, DOS and 16-bit x86 are thoroughly obsolete, and are not easier to learn than normal 64-bit x86. Even 32-bit x86 is obsolete, but still in wide use in the Windows world.
32-bit and 64-bit code don't have to care about a lot of 16-bit limitations / complications like segments or limited register choice in addressing modes. Some modern systems do use segment overrides for thread-local storage, but learning how to use segments in 16-bit code is barely connected to that.
One of the major benefits to knowing asm is for debugging / profiling / optimizing real programs. If you want to understand how to write C or other high-level code that can (and actually does) compile to efficient asm, you'll probably be looking at compiler output. This will be 64-bit (or 32-bit). (e.g. see Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” which has an excellent intro to reading x86 asm for total beginners, and to looking at compiler output).
Asm knowledge is useful when looking at performance-counter results annotating a disassembly of your binary (perf stat ./a.out && perf report -Mintel: see Chandler Carruth's CppCon2015 talk: "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"). Aggressive compiler optimizations mean that looking at cycle / cache-miss / stall counts per source line are much less informative than per instruction.
Also, for your program to actually do anything, it has to either talk to hardware directly, or make system calls. Learning DOS system calls for file access and user input is a complete waste of time (except for answering the steady stream of SO questions about how to read and print multi-digit numbers in 16-bit code). They're quite different from the APIs in the current major OSes. Developing new DOS applications is not useful, so you'd have to learn another API (as well as ABI) when you get to the stage of doing something with your asm knowledge.
Learning asm on an 8086 simulator is even more limiting: 186, 286, and 386 added many convenient instructions like imul ecx, 15, making ax less "special". Limiting yourself to only instructions that work on 8086 means you'll figure out "bad" ways to do things. Other big ones are movzx / movsx, shift by an immediate count (other than 1), and push immediate. Besides performance, it's also easier to write code when these are available, because you don't have to write a loop to shift by more than 1 bit.
Suggestions for better ways to teach yourself asm
I mostly learned asm from reading compiler output, then making small changes. I didn't try to write stuff in asm when I didn't really understand things, but if you're going to learn quickly (rather than just evolve an understanding while debugging / profiling C), you probably need to test your understanding by writing your own code. You do need to understand the basics, that there are 8 or 16 integer registers + the flags and instruction pointer, and that every instruction makes a well-defined modification to the current architectural state of the machine. (See the Intel insn ref manual for complete descriptions of every instruction (links in the x86 wiki, along with much more good stuff).
You might want to start with simple things like writing a single function in asm, as part of a bigger program. Understanding the kind of asm needed to make system calls is useful, but in real programs it's normally only useful to hand-write asm for inner loops that don't involve any system calls. It's time-consuming to write asm to read input and print results, so I'd suggest doing that part in C. Make sure you read the compiler output and understand what's going on, and the difference between an integer and a string, and what strtol and printf do, even if you don't write them yourself.
Once you think you understand enough of the basics, find a function in some program you're familiar with and/or interested in, and see if you can beat the compiler and save instructions (or use faster instructions). Or implement it yourself without using the compiler output as a starting point, whichever you find more interesting. This answer might be interesting, although the focus there was finding C source that got the compiler to produce the optimal ASM.
How to try to solve your own problems (before asking an SO question)
There are many SO questions from people asking "how do I do X in asm", and the answer is usually "the same as you would in C". Don't get so caught up in asm being unfamiliar that you forget how to program. Figure out what needs to happen to the data the function operates on, then figure out how to do that in asm. If you get stuck and have to ask a question, you should have most of a working implementation, with just one part that you don't know what instructions to use for one step.
You should do this with 32 or 64bit x86. I'd suggest 64bit, since the ABI is nicer, but 32bit functions will force you to make more use of the stack. So that might help you understand how a call instruction puts the return address on the stack, and where the args the caller pushed actually are after that. (This appears to be what you tried to avoid dealing with by using inline asm).
Programming hardware directly is neat, but not a generally useful skill
Learning how to do graphics by directly modifying video RAM is not useful, other than to satisfy curiosity about how computers used to work. You can't use that knowledge for anything. Modern graphics APIs exist to let multiple programs draw in their own regions of the screen, and to allow indirection (e.g. draw on a texture instead of the screen directly, so 3D window-flipping alt-tab can look fancy). There too many reasons to list here for not drawing directly on video RAM.
Drawing on a pixmap buffer and then using a graphics API to copy it to the screen is possible. Still, doing bitmap graphics at all is more or less obsolete, unless you're generating images for PNG or JPEG or something (e.g. optimize converting histogram bins to a scatter plot in the back-end code for a web service). Modern graphics APIs abstract away the resolution, so your app can draw things at a reasonable size regardless of how big each pixel is. (small but extremely high rez screen vs. big TV at low rez).
It is kind of cool to write to memory and see something change on-screen. Or even better, hook up LEDs (with small resistors) to the data bits on a parallel port, and run an outb instruction to turn them on/off. I did this on my Linux system ages ago. I made a little wrapper program that used iopl(2) and inline asm, and ran it as root. You can probably do similar on Windows. You don't need DOS or 16bit code to get your feet wet talking to the hardware.
in/out instructions, and normal loads/stores to memory-mapped IO, and DMA, are how real drivers talk to hardware, including things far more complicated than parallel ports. It's fun to know how your hardware "really" works, but only spend time on it if you're actually interested, or want to write drivers. The Linux source tree includes drivers for boatloads of hardware, and is often well commented, so if you like reading code as much as writing code, that's another way to get a feel for what read drivers do when they talk to hardware.
It's generally good to have some idea how things work under the hood. If you want to learn about how graphics used to work ages ago (with VGA text mode and color / attribute bytes), then sure, go nuts. Just be aware that modern OSes don't use VGA text mode, so you aren't even learning what happens under the hood on modern computers.
Many people enjoy https://retrocomputing.stackexchange.com/, reliving a simpler time when computers were less complex and couldn't support as many layers of abstraction. Just be aware that's what you're doing. I might be a good stepping stone to learning to write drivers for modern hardware, if you're sure that's why you want to understand asm / hardware.
Inline asm
You are taking a totally incorrect approach to using inline ASM. You seem to want to write whole functions in asm, so you should just do that. e.g. put your code in asmfuncs.S or something. Use .S if you want to keep using GNU / AT&T syntax; or use .asm if you want to use Intel / NASM / YASM syntax (which I would recommend, since the official manuals all use Intel syntax. See the x86 wiki for guides and manuals.)
GNU inline asm is the hardest way to learn ASM. You have to understand everything that your asm does, and what the compiler needs to know about it. It's really hard to get everything right. For example, in your edit, that block of inline asm modifies many registers that you don't list as clobbered, including %ebx which is a call-preserved register (so this is broken even if that function isn't inlined). At least you took out the ret, so things won't break as spectacularly when the compiler inlines this function into the loop that calls it. If that sounds really complicated, that's because it is, and part of why you shouldn't use inline asm to learn asm.
This answer to a similar question from misusing inline asm while trying to learn asm in the first place has more links about inline asm and how to use it well.
Getting this mess working, maybe
This part could be a separate answer, but I'll leave it together.
Besides your whole approach being fundamentally a bad idea, there is at least one specific problem with your put_char function: you use offset as an output-only operand. gcc quite happily compiles your whole function to a single ret instruction, because the asm statement isn't volatile, and its output isn't used. (Inline asm statements without outputs are assumed to be volatile.)
I put your function on godbolt, so I could look at what assembly the compiler generates surrounding it. That link is to the fixed maybe-working version, with correctly-declared clobbers, comments, cleanups, and optimizations. See below for the same code, if that external link ever breaks.
I used gcc 5.3 with the -m16 option, which is different from using a real 16bit compiler. It still does everything the 32bit way (using 32bit addresses, 32bit ints, and 32bit function args on the stack), but tells the assembler that the CPU will be in 16bit mode, so it will know when to emit operand-size and address-size prefixes.
Even if you compile your original version with -O0, the compiler computes offset = (y<<8) + (y<<6) + x;, but doesn't put it in %edi, because you didn't ask it to. Specifying it as another input operand would have worked. After the inline asm, it stores %edi into -12(%ebp), where offset lives.
Other stuff wrong with put_char:
You pass 2 things (ascii_char and current_color) into your function through globals, instead of function arguments. Yuck, that's disgusting. VGA and characters are constants, so loading them from globals doesn't look so bad. Writing in asm means you should ignore good coding practices only when it helps performance by a reasonable amount. Since the caller probably had to store those values into the globals, you're not saving anything compared to the caller storing them on the stack as function args. And for x86-64, you'd be losing perf because the caller could just pass them in registers.
Also:
j,h,l,i=0; // sets i=0, does nothing to j, h, or l.
// gcc warns: left-hand operand of comma expression has no effect
j;h;l;i=0; // equivalent to this
j=h=l=i=0; // This is probably what you meant
All the local variables are unused anyway, other than offset. Were you going to write it in C or something?
You use 16bit addresses for characters, but 32bit addressing modes for VGA memory. I assume this is intentional, but I have no idea if it's correct. Also, are you sure you should use a CS: override for the loads from characters? Does the .rodata section go into the code segment? Although you didn't declare uint8_t characters[464] as const, so it's probably just in the .data section anyway. I consider myself fortunate that I haven't actually written code for a segmented memory model, but that still looks suspicious.
If you're really using djgpp, then according to Michael Petch's comment, your code will run in 32bit mode. Using 16bit addresses is thus a bad idea.
Optimizations
You can avoid using %ebx entirely by doing this, instead of loading into ebx and then adding %ebx to %edi.
"add _VGA, %%edi\n\t" // load from _VGA, add to edi.
You don't need lea to get an address into a register. You can just use
"mov %%ax, %%si\n\t"
"add $_characters, %%si\n\t"
$_characters means the address as an immediate constant. We can save a lot of instructions by combining this with the previous calculation of the offset into the characters array of bitmaps. The immediate-operand form of imul lets us produce the result in %si in the first place:
"movzbw _ascii_char,%%si\n\t"
//"sub $32,%%ax\n\t" // AX = ascii_char - 32
"imul $7, %%si, %%si\n\t"
"add $(_characters - 32*7), %%si\n\t" // Do the -32 at the same time as adding the table address, after multiplying
// SI points to characters[(ascii_char-32)*7]
// i.e. the start of the bitmap for the current ascii character.
Since this form of imul only keeps the low 16b of the 16*16 -> 32b multiply, the 2 and 3 operand forms imul can be used for signed or unsigned multiplies, which is why only imul (not mul) has those extra forms. For larger operand-size multiplies, 2 and 3 operand imul is faster, because it doesn't have to store the high half in %[er]dx.
You could simplify the inner loop a bit, but it would complicate the outer loop slightly: you could branch on the zero flag, as set by shl $1, %al, instead of using a counter. That would make it also unpredictable, like the jump over store for non-foreground pixels, so the increased branch mispredictions might be worse than the extra do-nothing loops. It would also mean you'd need to recalculate %edi in the outer loop each time, because the inner loop wouldn't run a constant number of times. But it could look like:
... same first part of the loop as before
// re-initialize %edi to first_pixel-1, based on outer-loop counter
"lea -1(%%edi), %%ebx\n"
".Lbit_loop:\n\t" // map the 1bpp bitmap to 8bpp VGA memory
"incl %%ebx\n\t" // inc before shift, to preserve flags
"shl $1,%%al\n\t"
"jnc .Lskip_store\n\t" // transparency: only store on foreground pixels
"movb %%dl,(%%ebx)\n" //plot the pixel
".Lskip_store:\n\t"
"jnz .Lbit_loop\n\t" // flags still set from shl
"addl $320,%%edi\n\t" // WITHOUT the -6
"dec %%cl\n\t"
"jnz .Lbyte_loop\n\t"
Note that the bits in your character bitmaps are going to map to bytes in VGA memory like {7 6 5 4 3 2 1 0}, because you're testing the bit shifted out by a left shift. So it starts with the MSB. Bits in a register are always "big endian". A left shift multiplies by two, even on a little-endian machine like x86. Little-endian only affects ordering of bytes in memory, not bits in a byte, and not even bytes inside registers.
A version of your function that might do what you intended.
This is the same as the godbolt link.
void put_char(int x,int y){
int offset = (y<<8) + (y<<6) + x;
__asm__ volatile ( // volatile is implicit for asm statements with no outputs, but better safe than sorry.
"add _VGA, %%edi\n\t" // edi points to VGA + offset.
"movzbw _ascii_char,%%si\n\t" // Better: use an input operand
//"sub $32,%%ax\n\t" // AX = ascii_char - 32
"imul $7, %%si, %%si\n\t" // can't fold the load into this because it's not zero-padded
"add $(_characters - 32*7), %%si\n\t" // Do the -32 at the same time as adding the table address, after multiplying
// SI points to characters[(ascii_char-32)*7]
// i.e. the start of the bitmap for the current ascii character.
"mov $7,%%cl\n"
".Lbyte_loop:\n\t"
"lodsb %%cs:(%%si)\n\t" //load next byte of bitmap
"mov $6,%%ch\n"
".Lbit_loop:\n\t" // map the 1bpp bitmap to 8bpp VGA memory
"shl $1,%%al\n\t"
"jnc .Lskip_store\n\t" // transparency: only store on foreground pixels
"movb %%dl,(%%edi)\n" //plot the pixel
".Lskip_store:\n\t"
"incl %%edi\n\t"
"dec %%ch\n\t"
"jnz .Lbit_loop\n\t"
"addl $320-6,%%edi\n\t"
"dec %%cl\n\t"
"jnz .Lbyte_loop\n\t"
: "+&D" (offset) // EDI modified by the asm, compiler needs to know that, so use a read-write "+" input. Early-clobber "&" because we read the other input after modifying this.
: "d" (current_color) // used read-only
: "%eax", "%ecx", "%esi", "memory"
// omit the memory clobber if your C never touches VGA memory, and your asm never loads/stores anywhere else.
// but that's not the case here: the asm loads from memory written by C
// without listing it as a memory operand (even a pointer in a register isn't sufficient)
// so gcc might optimize away "dead" stores to it, or reorder the asm with loads/stores to it.
);
}
Re: the "memory" clobber, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
I didn't use dummy output operands to leave register allocation up to the compiler's discretion, but that's a good idea to reduce the overhead of getting data in the right places for inline asm. (extra mov instructions). For example, here there was no need to force the compiler to put offset in %edi. It could have been any register we aren't already using.

Compiler's alleged optimisation of while loop

In this segment of an embedded programming lecture, Dr. Samek explains how the compiler is able to realise efficiencies over the "original code" (presumably by this he means an implementation dictated by the logical order of the syntax):
I hope you have noticed
that the disassembled code implements a different flow of control from
what I've described for the while loop. The original code was supposed
to test the condition first and then jump over the loop body if the
condition wasn't true. The compiled code starts with an unconditional
branch and reverses the order of the loop body and the testing of the
condition. When you think about it, though, those two flows of control
are equivalent, except the generated one is faster, because it has only
one conditional branch at the bottom of the loop.
Throughout his explanation, he draws reference to the two flow charts shown. The chart depicting the compiled code illustrates how "it has only one conditional branch at the bottom of the loop." I can't seem to see how this works in practice:
Here is a screen-capture I took showing how the original code would be implemented. Sequence of (initial) machine instructions:
0x2815 CMP -> 0x1c40 ADDS ->repeat until condition is false
Here is another screen-capture showing the sequence used by the compiler. Sequence of (initial) machine instructions:
0x2815 CMP -> 0xdbfc BLT.N -> 0x1c40 ADDS -> repeat until condition is false
I can understand how both flows of control are equivalent (at least from the diagrams alone), but certainly not how the generated one (on the right) is faster. First, the arrows at the very beginning of both diagrams lead to the very same branches. Second, although the branch does seem to occur later on in the schematic on the right, as you've seen from the screen-captures, the simulator seems to show that both methods commence with the machine instructions pertaining to the comparison (0x2815 CMP).
How can I reconcile the flow-charts to what I'm seeing in practice?

I concur with Kerren SB's answer. I'll discuss it in more detail:
The original logic has the following jump instructions that are executed during each iteration:
begin_loop:
cmp counter, 21
jge end_loop ; jump when greater or equal
...
jmp begin_loop
end_loop:
....
The revised logic executes one instruction less during each iteration:
jmp test_loop ; executed once; not part of loop
begin_loop:
...
test_loop:
cmp counter, 21
jlt begin_loop ; jump when less than
...
So in the loop the unconditional jmp begin_loop is saved.

The only instructions that matter are the ones that are repeated in every loop round. And the point of the rearrangement of the logic is that there is only one jump per loop round. There may be several jumps before and after the entire loop, but we don't care about those. The only things that are expensive are the ones that are done over and over.

Efficiency of C vs Assembler

How much faster is the following assembler code:
shl ax, 1
Versus the following C code:
num = num * 2;
How can I even find out?

Your assembly variant might be faster, might be slower. What made you think that it is necessarily faster?
On the x86 platform, there are quite a few ways to multiply something by 2. I would expect a compiler to do add ax, ax, which is intuitively more efficient than your shl because it doesn't involve a potentially stored constant ('1' in your case).
Also, for quite a long time, on a x86 platform the preferred way of multiplying things by constants was not a shift, but rather a lea operation (when possible). In the above example that would be lea eax, [eax*2]. (Multiplication by 3 would be done through lea eax, [eax*2+eax])
The belief in shift operations being somehow "faster" is a nice old story for newbies, which has virtually no relevance today. And, as usual, most of the time your compiler (if it is up-to-date) has much better knowledge about the underlying hardware platform than people with naive love for shift operations.

Is this, by any chance, an academic question? I assume you understand it is in the general category of "getting a haircut to lose weight".

If you are using GCC, ask to see the generated assembly with option -S. You may find it's the same as your assembler instruction.
To answer the original question, on Out-Of-Order processors instruction speed is measured by throughput and latency, and you would measure both using the rdtsc assembly instruction. But someone else did it for you for a lot of processors, so you don't need to bother. PDF

In most circumstances, it won't make a difference. Multiplication is fast on nearly all modern hardware. In particular, it is usually fast enough that unless you have meticulously hand-optimized code, the pipeline will hide the entirety of the latency and you will see no speed difference at all between the two cases.
You may be able to measure a performance difference on multiplies and shifts when you execute them in isolation, but there will typically not be any difference in the context of the rest of your compiled code. (As I noted, this may not hold true if the code is meticulously optimized).
Now, that said, shifts are still generally faster than multiplies, and almost any reasonable compiler will map a fixed power-of-two multiply into a shift, anyway (assuming that the semantics are actually equivalent on the target architecture).
Edit: one more thing you may want to try if you really care about this is x+x. I know of at least one architecture on which this can actually be faster than shifting, depending on the surrounding context.

If you have a decent compiler it will produce the same or similar code. The best way is to disassemble and checked the created code.

The answer depends, as you've been able to see here, upon many things. What the compiler will do with your C code depends on a lot of things. If we're talking x86-32 the following should be generally applicable.
At the basic level your C code indicates a memory variable which would require at least one instruction to multiply by two: "shl mem,1" and in such a simple case the C code will be slower.
If num is a local variable the compiler may decide to put it in a register (if it is used often enough and/or the function is small enough) and then you will have your "shl reg,1" instruction - maybe.
What instruction is fastest has everything to do with how they are implemented in the processor. Shl may not be the best choice since it affects the C and Z flags which slows it down. A few years ago the recommendation was "lea reg,[reg+reg]" (all reg are the same) because lea didn't affect any flags and there were variants such as (using the eax register on x86-32 platform as an example):
lea eax,[eax+eax] ; *2
lea eax,[eax+eax*2] ; *3
lea eax,[eax+eax*4] ; *5
lea eax,[eax+eax*8] ; *9
I don't know what is the norm today but your compiler probably will.
As for measuring search for information here on the rdtsc instruction which is the hands-down best alternative as it counts actual clock cycles.

Put them in a loop with a counter that goes so high that it runs for at least a second in the fastest case. Use your favorite timing mechanism to see how long each takes.
The assembly case should be done with inline assembly in the same C program as you use for the pure C test. Otherwise, you're not comparing apples to apples.
By the way, I think you should add a third test:
num <<= 1;
The question then is whether that does the same thing as the assembly version.

If, for your target platform, shifting left is the quickest way to multiply a number by two, then the chances are your compiler will do that when compiling the code. Look at the disassembly to check
So, for that one line, it's probably exactly the same speed. However, as you're unlikely to have a function containing just that one line, you might well find the compiler would defer the shift until the value is used, or otherwise mix it up with surrounding code, making it less clear cut. A good optimizing compiler will generally do a good job of beating poor to average hand written assembly.

If the compiler up to date now ( vc9 ) was really doing a good job it would outperform vc6 by a wide margin and this dont occur, this is why I even prefer to use VC6 for some code that run faster than code compiled in mingw with -O3 and VC9 with /Ox

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight