What exists under Assembly? - c

Before I learned a bit of assembly, I had heard that you had to "program the hardware directly" and "do everything from scratch". For example, to write a character without an operating system, I thought I would have to know how my monitor works and draw the character pixel by pixel.
So I got interested and learned a little, and I saw that it was not so "close to the metal". Now I would like someone to explain how this works, and whether it is possible to go deeper and really control all of the hardware.
Here is some code that prints a character:
[BITS 16]
[ORG 0x7C00]
MOV AL, 65
CALL PrintCharacter
JMP $
PrintCharacter:
MOV AH, 0x0E
MOV BH, 0x00
MOV BL, 0x07
INT 0x10
RET
TIMES 510 - ($ - $$) db 0
DW 0xAA55

Lower than assembler is machine code.
However, machine code instructions have a 1:1 relation to assembly instructions, so there is nothing that can be done in machine code which cannot be done in assembler.
In the early days of computing there were computers where you had to enter the machine code directly. The MITS Altair 680b is one example of such a computer:
It had a lot of front panel switches which allowed you to modify the content of the RAM without (!) using the CPU: The CPU was stopped when the front panel switches were in use. You had to translate assembler code into binary code and load the program into the RAM this way. Then you started the CPU.
Later the KIM-1 computer (said to be the first affordable hobbyist computer) was released. This computer allowed entering the machine code as hexadecimal code, but in contrast to the MITS computer, a program running in the background (which means: the CPU) was responsible for writing the data entered on the keyboard into the RAM.
In theory it is still possible to enter Windows programs in hexadecimal code (using a hexadecimal editor) if you want to. However, this brings no benefit compared to assembler code!

Related

How to perform the most basic gate-level operations in a computer?

How can I use my hardware directly to perform an operation at the level of bits in my computer without a programming language (and if possible even without the help of the kernel)?
For example, some C code such as
unsigned char a = 0;
unsigned char b = 1;
unsigned char c = a|b;
I want to do it (request 3 bytes of memory and modify those bytes) directly with my computer hardware and without using any programming language (i.e. I want to write the machine code myself). If possible, even without the help of kernel. How to do it and where to learn about these?
I am currently using Ubuntu 18, kernel 5.4.0-52-generic, on a laptop with an 8th-gen Intel Core i5. Let me know if I need to be more specific about my system specs.
First you have to of course have the documentation for the processor that defines the instruction set including the machine code.
Next you would write something that is not dead code:
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a|b);
}
Intentionally not using x86
Disassembly of section .text:
00000000 <fun>:
0: e1800001 orr r0, r0, r1
4: e12fff1e bx lr
Then you can examine that machine code against the documentation and understand the encoding of these instructions.
That is all quite trivial.
The biggest issue is how you plan to run this code. You say without the kernel; that means no operating system, which means bare metal, which means you have to boot the processor, and that takes detailed processor and system (chip/motherboard/media/etc.) information. It is obviously possible, since your computer boots and runs, as does your microcontroller (a significantly better choice for this kind of education; avoiding x86 as a first ISA is also a good choice). Even better is an emulator/simulator, because you have better visibility and it is un-brickable, whereas hardware is not hard to brick and sometimes not hard to let the smoke out of with bad software.
Based on your question, you need to crawl before you walk, and walk before you run. Start with simple functions as shown. With certain tools (gnu is your friend but also your enemy, as it takes time to master, but it is very feature rich) you can write the machine code yourself and feed it to an assembler:
.globl fun
fun:
.inst 0xe1800001
.inst 0xe12fff1e
Disassembly of section .text:
00000000 <fun>:
0: e1800001 orr r0, r0, r1
4: e12fff1e bx lr
If you don't want to use an assembler and want to create the binary file yourself then you have to read up on the file format, many are published, a somewhat trivial task even/especially if you write a tool from scratch to do it and not rely on libraries, but it depends on you and the file format. Not sure how many tangents and research projects you are really interested in. They all have value but what is it you really want to learn and/or what is the priority of your desires.
You are looking at potentially months to years of work, depending on what you are really after. As hard as it is to accept, the best path to this is to use the tools and sandboxes available first, then replace them later as desired, rather than write everything from scratch up front without any help from existing tools.
You want to build a better hammer you start with using the hammers that exist, decide what you do and don't like then make your own. You don't just go off without ever using one and try to create one and expect any kind of success. Or success within any kind of reasonable schedule.
A big problem with doing everything from scratch is that you need to get the bytes into a flash or ram or some media so that the processor can fetch and run them. This you likely cannot do without some tools, a fully working computer with an operating system where you take your raw bytes, some hardware tools that are capable of programming a flash and using all of that to program your raw bits into that flash. Now some flashes you can probably use switches (have to solve the bouncing problem) like on the front of an old DEC or something and toggle your way through the protocol used to program the device and thus only needing pencil/pen and paper and the switches and wiring as the tools.
You are far better off with an instruction set simulator and, depending on the file formats it supports, rolling your own binary creation tool or using an assembler or something similar to make the file to feed the sim. Or even better, just make your own simulator; you learn the instruction set better than most seasoned professionals that way... and then of course you can create your own binary creation tool to match. You will most likely fail if you have not taken an existing instruction set and set of tools and learned to program at the assembly/machine language level, seen how it works, seen the instructions used/generated, and, with the documentation, seen the encoding, etc.
With most new instruction sets, the software folks have direct access to the silicon folks (walk down the hall to the person's office, or call them on the phone) and can ask questions about the encoding of an instruction. Since you cannot do that, you have to ask existing, debugged tools instead, so as I started way above, ask the compiler, ask the assembler. Then disassemble the result, assume the assembler produced the right instruction, and compare that to the processor documentation (understanding that both the tools and the documentation can have bugs, so you have to work through that).

Why does Knuth use this clunky decrement?

I'm looking at some of Prof. Don Knuth's code, written in CWEB that is converted to C. A specific example is dlx1.w, available from Knuth's website
At one stage, the .len value of a struct nd[cc] is decremented, and it is done in a clunky way:
o,t=nd[cc].len-1;
o,nd[cc].len=t;
(This is a Knuth-specific question, so maybe you already know that "o," is a preprocessor macro for incrementing "mems", which is a running total of effort expended, as measured by accesses to 64-bit words.) The value remaining in "t" is definitely not used for anything else. (The example here is on line 665 of dlx1.w, or line 193 of dlx1.c after ctangle.)
My question is: why does Knuth write it this way, rather than
nd[cc].len--;
which he does actually use elsewhere (line 551 of dlx1.w):
oo,nd[k].len--,nd[k].aux=i-1;
(And "oo" is a similar macro for incrementing "mems" twice -- but there is some subtlety here, because .len and .aux are stored in the same 64-bit word. To assign values to S.len and S.aux, only one increment to mems would normally be counted.)
My only theory is that a decrement consists of two memory accesses: first to look up, then to assign. (Is that correct?) And this way of writing it is a reminder of the two steps. This would be unusually verbose of Knuth, but maybe it is an instinctive aide-memoire rather than didacticism.
For what it's worth, I've searched in CWEB documentation without finding an answer. My question probably relates more to Knuth's standard practices, which I am picking up bit by bit. I'd be interested in any resources where these practices are laid out (and maybe critiqued) as a block -- but for now, let's focus on why Knuth writes it this way.
A preliminary remark: with Knuth-style literate programming (i.e. when reading WEB or CWEB programs) the “real” program, as conceived by Knuth, is neither the “source” .w file nor the generated (tangled) .c file, but the typeset (woven) output. The source .w file is best thought of as a means to produce it (and of course also the .c source that's fed to the compiler). (If you don't have cweave and TeX handy, I've typeset some of these programs here; this program DLX1 is here.)
So in this case, I'd describe the location in the code as module 25 of DLX1, or subroutine "cover":
Anyway, to return to the actual question: note that this (DLX1) is one of the programs written for The Art of Computer Programming. Because reporting the time taken by a program in “seconds” or “minutes” becomes meaningless from year to year, he reports how long a program took as a number of “mems” plus “oops”, which is dominated by the “mems”, i.e. the number of memory accesses to 64-bit words (usually). So the book contains statements like “this program finds the answer to this problem in 3.5 gigamems of running time”. Further, the statements are intended to be fundamentally about the program/algorithm itself, not the specific code generated by a specific version of a compiler for certain hardware. (Ideally when the details are very important he writes the program in MMIX or MMIXAL and analyses its operations on the MMIX hardware, but this is rare.) Counting the mems (to be reported as above) is the purpose of inserting o and oo instructions into the program. Note that it's more important to get this right for the “inner loop” instructions that are executed a lot of times, such as everything in the subroutine cover in this case.
This is elaborated in Section 1.3.1′ (part of Fascicle 1):
Timing. […] The running time of a program depends not only on the clock rate but also on the number of functional units that can be active simultaneously and the degree to which they are pipelined; it depends on the techniques used to prefetch instructions before they are executed; it depends on the size of the random-access memory that is used to give the illusion of 2^64 virtual bytes; and it depends on the sizes and allocation strategies of caches and other buffers, etc., etc.
For practical purposes, the running time of an MMIX program can often be estimated satisfactorily by assigning a fixed cost to each operation, based on the approximate running time that would be obtained on a high-performance machine with lots of main memory; so that’s what we will do. Each operation will be assumed to take an integer number of υ, where υ (pronounced “oops”) is a unit that represents the clock cycle time in a pipelined implementation. Although the value of υ decreases as technology improves, we always keep up with the latest advances because we measure time in units of υ, not in nanoseconds. The running time in our estimates will also be assumed to depend on the number of memory references or mems that a program uses; this is the number of load and store instructions. For example, we will assume that each LDO (load octa) instruction costs µ + υ, where µ is the average cost of a memory reference. The total running time of a program might be reported as, say, 35µ+ 1000υ, meaning “35 mems plus 1000 oops.” The ratio µ/υ has been increasing steadily for many years; nobody knows for sure whether this trend will continue, but experience has shown that µ and υ deserve to be considered independently.
And he does of course understand the difference from reality:
Even though we will often use the assumptions of Table 1 for seat-of-the-pants estimates of running time, we must remember that the actual running time might be quite sensitive to the ordering of instructions. For example, integer division might cost only one cycle if we can find 60 other things to do between the time we issue the command and the time we need the result. Several LDB (load byte) instructions might need to reference memory only once, if they refer to the same octabyte. Yet the result of a load command is usually not ready for use in the immediately following instruction. Experience has shown that some algorithms work well with cache memory, and others do not; therefore µ is not really constant. Even the location of instructions in memory can have a significant effect on performance, because some instructions can be fetched together with others. […] Only the meta-simulator can be trusted to give reliable information about a program’s actual behavior in practice; but such results can be difficult to interpret, because infinitely many configurations are possible. That’s why we often resort to the much simpler estimates of Table 1.
Finally, we can use Godbolt's Compiler Explorer to look at the code generated by a typical compiler for this code. (Ideally we'd look at MMIX instructions but as we can't do that, let's settle for the default there, which seems to be x86-64 gcc 8.2.) I removed all the os and oos.
For the version of the code with:
/*o*/ t = nd[cc].len - 1;
/*o*/ nd[cc].len = t;
the generated code for the first line is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov eax, DWORD PTR [rax]
lea r14d, [rax-1]
and for the second line is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov DWORD PTR [rax], r14d
For the version of the code with:
/*o ?*/ nd[cc].len --;
the generated code is:
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov eax, DWORD PTR [rax]
lea edx, [rax-1]
movsx rax, r13d
sal rax, 4
add rax, OFFSET FLAT:nd+8
mov DWORD PTR [rax], edx
which as you can see (even without knowing much about x86-64 assembly) is simply the concatenation of the code generated in the former case (except for using register edx instead of r14d), so it's not as if writing the decrement in one line has saved you any mems. In particular, it would not be correct to count it as a single one, especially in something like cover that is called a huge number of times in this algorithm (dancing links for exact cover).
So the version as written by Knuth is correct, for its goal of counting the number of mems. He could also write oo,nd[cc].len--; (counting two mems) as you observed, but perhaps it might look like a bug at first glance in that case. (BTW, in your example in the question of oo,nd[k].len--,nd[k].aux=i-1; the two mems come from the load and the store in --; not two stores.)
This whole practice seems to be based on a mistaken idea/model of how C works, that there's some correspondence between the work performed by the abstract machine and by the actual program as executed (i.e. the "C is portable assembler" fallacy). I don't think we can answer a lot more about why that exact code fragment appears, except that it happens to be an unusual idiom for counting loads and stores on the abstract machine as separate.

How exactly does the x86 LOOP instruction work?

mov ecx, 16
looptop: .
.
.
loop looptop
How many times will this loop execute?
What happens if ecx = 0 to start with? Does loop jump or fall-through in that case?
loop is exactly like dec ecx / jnz, except it doesn't set flags.
It's like the bottom of a do {} while(--ecx != 0); in C. If execution enters the loop with ecx = 0, wrap-around means the loop will run 2^32 times. (Or 2^64 times in 64-bit mode, because it uses RCX.)
Unlike rep movsb/stosb/etc., it doesn't check for ECX=0 before decrementing, only after (see footnote 1).
The address-size determines whether it uses CX, ECX, or RCX. So in 64-bit code, addr32 loop is like dec ecx / jnz, while a regular loop is like dec rcx / jnz. Or in 16-bit code, it normally uses CX, but an address-size prefix (0x67) will make it use ecx. As Intel's manual says, it ignores REX.W, because that sets the operand-size, not the address-size.
rep string instructions use the address-size prefix the same way, overriding the address size but also RCX vs. ECX (or CX vs. ECX in modes other than 64-bit). The operand-size for string instructions is already used to determine movsw vs. movsd vs. movsq, and you want address/repeat size to be orthogonal to that. Having loop and jrcxz/jecxz follow that behaviour is just continuing the design intent from 8086 of loop being intended for use with string operations when a simple rep couldn't get the job done; see below.
Related: Why are loops always compiled into "do...while" style (tail jump)? for more about loop structure in asm, while() {} vs. do {} while() and how to lay them out.
Footnote 1: jcxz (or x86-64 jrcxz) was intended for use before the top of a do {} while style loop, to skip it if it should run 0 times. On modern CPUs test rcx, rcx / jz is more efficient.
Stephen Morse, architect of 8086, wrote about the intended uses of loop/jcxz with string instructions in that section of his book, The 8086 Primer, available for free on his web site: https://www.stevemorse.org/8086/index.html. See the "complex string instructions" subsection, starting at the bottom of page 71. (Or start reading from earlier in the chapter, the whole String Instructions section starts on page 66. But note @ecm's review of a few things the book seems to explain poorly or incorrectly.)
If you're wondering about the design intent of x86 instructions, you won't find a better source than this. That's separate from the best / most efficient way to use them, especially on modern x86, but very good intro for beginners into what you can do with asm instructions as building blocks.
Extra debugging tips
If you ever want to know the details on an instruction, check the manual: either Intel's official vol.2 PDF instruction set reference manual, or an html extract with each entry on a different page (http://felixcloutier.com/x86/). But note that the HTML leaves out the intro and appendices that have details on how to interpret stuff, like when it says "flags are set according to the result" for instructions like add.
And you can (and should) also just try stuff in a debugger: single-step and watch registers change. Use a smaller starting value for ecx so you get to the interesting ecx=1 part sooner. See also the x86 tag wiki for links to manuals, guides, and asm debugging tips at the bottom.
And BTW, if the instructions inside the loop that aren't shown modify ecx, it could loop any number of times. For the question to have a simple and unique answer, you need a guarantee that the instructions between the label and the loop instruction don't modify ecx. (They could save/restore it, but if you're going to do that it's usually better to just use a different register as the loop counter. push/pop inside a loop makes your code hard to read.)
Rant about over-use of LOOP even when you already need to increment something else in the loop. LOOP isn't the only way to loop, and usually it's the worst.
You should normally never use the loop instruction unless optimizing for code-size at the expense of speed, because it's slow. Compilers don't use it. (So CPU vendors don't bother to make it fast; catch 22.) Use dec / jnz, or an entirely different loop condition. (See also http://agner.org/optimize/ to learn more about what's efficient.)
Loops don't even have to use a counter; it's often just as good if not better to compare a pointer to an end address, or to check for some other condition. (Pointless use of loop is one of my pet peeves, especially when you already have something in another register that would work as a loop counter.) Using cx as a loop counter often just ties up one of your precious few registers when you could have used cmp/jcc on another register you were incrementing anyway.
IMO, loop should be considered one of those obscure x86 instructions that beginners shouldn't be distracted with. Like stosd (without a rep prefix), aam or xlatb. It does have real uses when optimizing for code size, though. (That's sometimes useful in real life for machine code (like for boot sectors), not just for stuff like code golf.)
IMO, just teach / learn how conditional branches work, and how to make loops out of them. Then you won't get stuck into thinking there's something special about a loop that uses loop. I've seen an SO question or comment that said something like "I thought you had to declare loops", and didn't realize that loop was just an instruction.
</rant>. Like I said, loop is one of my pet peeves. It's an obscure code-golfing instruction, unless you're optimizing for an actual 8086.

decode ARM BL instruction

I'm just getting started with the ARM architecture on my Nucleo STM32F303RE, and I'm trying to understand how the instructions are encoded.
I have running a simple LED-blinking program, and the first few disassembled application instructions are:
08000188: push {lr}
0800018a: sub sp, #12
235 __initialize_hardware_early ();
0800018c: bl 0x80005b8 <__initialize_hardware_early>
These instructions resolve to the following in the hex file (displayed weird in Eclipse -- each 32-bit word is in MSB order, but Eclipse doesn't seem to know it... but that's for another topic):
address 0x08000188: B083B500 FA14F000
Using the ARM Architecture Ref Manual, I've confirmed the first 2 instructions, push (0xB500) and sub (0xB083). But I can't make any sense out of the "bl" instruction.
The hex instruction is 0xFA14F000. The Ref Manual says it breaks down like this:
31.28 27 26 25 24 23............0
cond 1 0 1 L signed_immed_24
The first "F" (0xF......) makes sense: all conditions are set (ALways).
The "A" doesn't make sense though, since the L bit should be set (1011). Shouldn't it be 0xFB......?
And the signed_immed_24 doesn't make sense, either. The ref manual says:
- start with 0x14F000
- sign extend to 30 bits (signed 2's-complement), giving 0x0014F000
- shift left to form 32-bit value, giving 0x0053C000
- add to the PC, which is the current instruction + 8, giving 0x0800018c + 8 + 0x0053C000, or 0x0853C194.
So I get a branch address of 0x0853C194, but the disassembly shows 0x080005B8.
What am I missing?
Thanks!
-Eric
bl is two, separate, 16 bit instructions. The armv5 (and older) ARM ARM does a better job of documenting them.
15 14 13 12 11 10............0
1 1 1 H offset_11
From the ARM ARM
The first Thumb instruction has H == 10 and supplies the high part of
the branch offset. This instruction sets up for the subroutine call
and is shared between the BL and BLX forms.
The second Thumb instruction has H == 11 (for BL) or H == 01 (for
BLX). It supplies the low part of the branch offset and causes the
subroutine call to take place.
0xFA14 0xF000
0xF000 is the first instruction; its upper offset is all zeros.
0xFA14 is the second instruction; its offset is 0x214.
Starting at 0x0800018C, the target is 0x0800018C + 4 + (0x0000214<<1) = 0x080005B8. The 4 is because the PC reads as two instructions ahead of the current one, and the offset is in units of (16-bit) instructions.
I guess the armv7-m ARM ARM covers it as well, but is harder to read, and apparently features were added. But they do not affect you with this branch link.
The ARMv5 ARM ARM does a better job of describing what happens as well. You can certainly take these two separate instructions and move them apart:
.byte 0x00,0xF0
nop
nop
nop
nop
nop
.byte 0x14,0xFA
and it will branch to the same offset (relative to the second instruction). Maybe they broke that in some cores, but I know that in some (after ARMv5) it works.

Drawing a character in VGA memory with GNU C inline assembly

I'm learning to do some low-level VGA programming in DOS with C and inline assembly. Right now I'm trying to create a function that prints a character on screen.
This is my code:
//This is the characters BITMAPS
uint8_t characters[464] = {
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x20,0x20,0x20,0x20,0x00,0x20,0x00,0x50,
0x50,0x00,0x00,0x00,0x00,0x00,0x50,0xf8,0x50,0x50,0xf8,0x50,0x00,0x20,0xf8,0xa0,
0xf8,0x28,0xf8,0x00,0xc8,0xd0,0x20,0x20,0x58,0x98,0x00,0x40,0xa0,0x40,0xa8,0x90,
0x68,0x00,0x20,0x40,0x00,0x00,0x00,0x00,0x00,0x20,0x40,0x40,0x40,0x40,0x20,0x00,
0x20,0x10,0x10,0x10,0x10,0x20,0x00,0x50,0x20,0xf8,0x20,0x50,0x00,0x00,0x20,0x20,
0xf8,0x20,0x20,0x00,0x00,0x00,0x00,0x00,0x60,0x20,0x40,0x00,0x00,0x00,0xf8,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x60,0x60,0x00,0x00,0x08,0x10,0x20,0x40,0x80,
0x00,0x70,0x88,0x98,0xa8,0xc8,0x70,0x00,0x20,0x60,0x20,0x20,0x20,0x70,0x00,0x70,
0x88,0x08,0x70,0x80,0xf8,0x00,0xf8,0x10,0x30,0x08,0x88,0x70,0x00,0x20,0x40,0x90,
0x90,0xf8,0x10,0x00,0xf8,0x80,0xf0,0x08,0x88,0x70,0x00,0x70,0x80,0xf0,0x88,0x88,
0x70,0x00,0xf8,0x08,0x10,0x20,0x20,0x20,0x00,0x70,0x88,0x70,0x88,0x88,0x70,0x00,
0x70,0x88,0x88,0x78,0x08,0x70,0x00,0x30,0x30,0x00,0x00,0x30,0x30,0x00,0x30,0x30,
0x00,0x30,0x10,0x20,0x00,0x00,0x10,0x20,0x40,0x20,0x10,0x00,0x00,0xf8,0x00,0xf8,
0x00,0x00,0x00,0x00,0x20,0x10,0x08,0x10,0x20,0x00,0x70,0x88,0x10,0x20,0x00,0x20,
0x00,0x70,0x90,0xa8,0xb8,0x80,0x70,0x00,0x70,0x88,0x88,0xf8,0x88,0x88,0x00,0xf0,
0x88,0xf0,0x88,0x88,0xf0,0x00,0x70,0x88,0x80,0x80,0x88,0x70,0x00,0xe0,0x90,0x88,
0x88,0x90,0xe0,0x00,0xf8,0x80,0xf0,0x80,0x80,0xf8,0x00,0xf8,0x80,0xf0,0x80,0x80,
0x80,0x00,0x70,0x88,0x80,0x98,0x88,0x70,0x00,0x88,0x88,0xf8,0x88,0x88,0x88,0x00,
0x70,0x20,0x20,0x20,0x20,0x70,0x00,0x10,0x10,0x10,0x10,0x90,0x60,0x00,0x90,0xa0,
0xc0,0xa0,0x90,0x88,0x00,0x80,0x80,0x80,0x80,0x80,0xf8,0x00,0x88,0xd8,0xa8,0x88,
0x88,0x88,0x00,0x88,0xc8,0xa8,0x98,0x88,0x88,0x00,0x70,0x88,0x88,0x88,0x88,0x70,
0x00,0xf0,0x88,0x88,0xf0,0x80,0x80,0x00,0x70,0x88,0x88,0xa8,0x98,0x70,0x00,0xf0,
0x88,0x88,0xf0,0x90,0x88,0x00,0x70,0x80,0x70,0x08,0x88,0x70,0x00,0xf8,0x20,0x20,
0x20,0x20,0x20,0x00,0x88,0x88,0x88,0x88,0x88,0x70,0x00,0x88,0x88,0x88,0x88,0x50,
0x20,0x00,0x88,0x88,0x88,0xa8,0xa8,0x50,0x00,0x88,0x50,0x20,0x20,0x50,0x88,0x00,
0x88,0x50,0x20,0x20,0x20,0x20,0x00,0xf8,0x10,0x20,0x40,0x80,0xf8,0x00,0x60,0x40,
0x40,0x40,0x40,0x60,0x00,0x00,0x80,0x40,0x20,0x10,0x08,0x00,0x30,0x10,0x10,0x10,
0x10,0x30,0x00,0x20,0x50,0x88,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0xf8,
0x00,0xf8,0xf8,0xf8,0xf8,0xf8,0xf8};
/**************************************************************************
* put_char *
* Print char *
**************************************************************************/
void put_char(int x ,int y,int ascii_char ,byte color){
__asm__(
"push %si\n\t"
"push %di\n\t"
"push %cx\n\t"
"mov color,%dl\n\t" //test color
"mov ascii_char,%al\n\t" //test char
"sub $32,%al\n\t"
"mov $7,%ah\n\t"
"mul %ah\n\t"
"lea $characters,%si\n\t"
"add %ax,%si\n\t"
"mov $7,%cl\n\t"
"0:\n\t"
"segCS %lodsb\n\t"
"mov $6,%ch\n\t"
"1:\n\t"
"shl $1,%al\n\t"
"jnc 2f\n\t"
"mov %dl,%ES:(%di)\n\t"
"2:\n\t"
"inc %di\n\t"
"dec %ch\n\t"
"jnz 1b\n\t"
"add $320-6,%di\n\t"
"dec %cl\n\t"
"jnz 0b\n\t"
"pop %cx\n\t"
"pop %di\n\t"
"pop %si\n\t"
"retn"
);
}
I'm following this series of tutorials written in Pascal: http://www.joco.homeserver.hu/vgalessons/lesson8.html .
I changed the assembly syntax to suit the gcc compiler, but I'm still getting these errors:
Operand mismatch type for 'lea'
No such instruction 'segcs lodsb'
No such instruction 'retn'
EDIT:
I have been working on improving my code, and at least now I see something on the screen. Here's my updated code:
/**************************************************************************
* put_char *
* Print char *
**************************************************************************/
void put_char(int x,int y){
int char_offset;
int l,i,j,h,offset;
j,h,l,i=0;
offset = (y<<8) + (y<<6) + x;
__asm__(
"movl _VGA, %%ebx;" // VGA memory pointer
"addl %%ebx,%%edi;" //%di points to screen
"mov _ascii_char,%%al;"
"sub $32,%%al;"
"mov $7,%%ah;"
"mul %%ah;"
"lea _characters,%%si;"
"add %%ax,%%si;" //SI point to bitmap
"mov $7,%%cl;"
"0:;"
"lodsb %%cs:(%%si);" //load next byte of bitmap
"mov $6,%%ch;"
"1:;"
"shl $1,%%al;"
"jnc 2f;"
"movb %%dl,(%%edi);" //plot the pixel
"2:\n\t"
"incl %%edi;"
"dec %%ch;"
"jnz 1b;"
"addl $320-6,%%edi;"
"dec %%cl;"
"jnz 0b;"
: "=D" (offset)
: "d" (current_color)
);
}
In the image above I was trying to write the letter "S". The result is the green pixels in the upper left of the screen. No matter what x and y I give the function, it always plots the pixels in that same spot.
Can anyone help me correct my code?
See below for an analysis of some things that are specifically wrong with your put_char function, and a version that might work. (I'm not sure about the %cs segment override, but other than that it should do what you intend).
Learning DOS and 16-bit asm isn't the best way to learn asm
First of all, DOS and 16-bit x86 are thoroughly obsolete, and are not easier to learn than normal 64-bit x86. Even 32-bit x86 is obsolete, but still in wide use in the Windows world.
32-bit and 64-bit code don't have to care about a lot of 16-bit limitations / complications like segments or limited register choice in addressing modes. Some modern systems do use segment overrides for thread-local storage, but learning how to use segments in 16-bit code is barely connected to that.
One of the major benefits to knowing asm is for debugging / profiling / optimizing real programs. If you want to understand how to write C or other high-level code that can (and actually does) compile to efficient asm, you'll probably be looking at compiler output. This will be 64-bit (or 32-bit). (e.g. see Matt Godbolt's CppCon2017 talk: “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid” which has an excellent intro to reading x86 asm for total beginners, and to looking at compiler output).
Asm knowledge is useful when looking at performance-counter results annotating a disassembly of your binary (perf record ./a.out && perf report -Mintel: see Chandler Carruth's CppCon2015 talk: "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!"). Aggressive compiler optimizations mean that looking at cycle / cache-miss / stall counts per source line is much less informative than per instruction.
Also, for your program to actually do anything, it has to either talk to hardware directly, or make system calls. Learning DOS system calls for file access and user input is a complete waste of time (except for answering the steady stream of SO questions about how to read and print multi-digit numbers in 16-bit code). They're quite different from the APIs in the current major OSes. Developing new DOS applications is not useful, so you'd have to learn another API (as well as ABI) when you get to the stage of doing something with your asm knowledge.
Learning asm on an 8086 simulator is even more limiting: 186, 286, and 386 added many convenient instructions like imul ecx, 15, making ax less "special". Limiting yourself to only instructions that work on 8086 means you'll figure out "bad" ways to do things. Other big ones are movzx / movsx, shift by an immediate count (other than 1), and push immediate. Besides performance, it's also easier to write code when these are available, because you don't have to write a loop to shift by more than 1 bit.
Suggestions for better ways to teach yourself asm
I mostly learned asm from reading compiler output, then making small changes. I didn't try to write stuff in asm when I didn't really understand things, but if you're going to learn quickly (rather than just evolve an understanding while debugging / profiling C), you probably need to test your understanding by writing your own code. You do need to understand the basics: that there are 8 or 16 integer registers + the flags and instruction pointer, and that every instruction makes a well-defined modification to the current architectural state of the machine. (See the Intel insn ref manual for complete descriptions of every instruction; links are in the x86 wiki, along with much more good stuff.)
You might want to start with simple things like writing a single function in asm, as part of a bigger program. Understanding the kind of asm needed to make system calls is useful, but in real programs it's normally only useful to hand-write asm for inner loops that don't involve any system calls. It's time-consuming to write asm to read input and print results, so I'd suggest doing that part in C. Make sure you read the compiler output and understand what's going on, and the difference between an integer and a string, and what strtol and printf do, even if you don't write them yourself.
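As a rough sketch of that split (not anything from the question), the C side might look like the following. The names sum_of_squares, parse_count and run are hypothetical, made up for illustration: sum_of_squares stands in for the one function you would later rewrite by hand in a separate .S or .asm file, while the text-to-integer conversion (strtol) and the printing (printf) stay in C.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical function under study: the only part you would later
   replace with a hand-written asm version that has the same prototype. */
unsigned long sum_of_squares(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += i * i;
    return total;
}

/* String -> integer conversion stays in C.
   Returns 0 on success, -1 if the text is not a non-negative decimal number. */
int parse_count(const char *text, unsigned long *out)
{
    char *end;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0' || value < 0)
        return -1;
    *out = (unsigned long)value;
    return 0;
}

/* What a main(argc, argv) would do with argv[1]: parse, call, print.
   Returns a process-style exit status. */
int run(const char *arg)
{
    unsigned long n;
    if (parse_count(arg, &n) != 0) {
        fprintf(stderr, "not a number: %s\n", arg);
        return 1;
    }
    printf("%lu\n", sum_of_squares(n));
    return 0;
}
```

Compiling this with gcc -O2 -S (or pasting it into the Godbolt compiler explorer) gives you compiler output for sum_of_squares to read before you try writing your own version.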
Once you think you understand enough of the basics, find a function in some program you're familiar with and/or interested in, and see if you can beat the compiler and save instructions (or use faster instructions). Or implement it yourself without using the compiler output as a starting point, whichever you find more interesting. This answer might be interesting, although the focus there was finding C source that got the compiler to produce the optimal ASM.
How to try to solve your own problems (before asking an SO question)
There are many SO questions from people asking "how do I do X in asm", and the answer is usually "the same as you would in C". Don't get so caught up in asm being unfamiliar that you forget how to program. Figure out what needs to happen to the data the function operates on, then figure out how to do that in asm. If you get stuck and have to ask a question, you should have most of a working implementation, with just one step where you don't know what instructions to use.
You should do this with 32 or 64bit x86. I'd suggest 64bit, since the ABI is nicer, but 32bit functions will force you to make more use of the stack. So that might help you understand how a call instruction puts the return address on the stack, and where the args the caller pushed actually are after that. (This appears to be what you tried to avoid dealing with by using inline asm).
Programming hardware directly is neat, but not a generally useful skill
Learning how to do graphics by directly modifying video RAM is not useful, other than to satisfy curiosity about how computers used to work. You can't use that knowledge for anything. Modern graphics APIs exist to let multiple programs draw in their own regions of the screen, and to allow indirection (e.g. draw on a texture instead of the screen directly, so 3D window-flipping alt-tab can look fancy). There are too many reasons to list here for not drawing directly on video RAM.
Drawing on a pixmap buffer and then using a graphics API to copy it to the screen is possible. Still, doing bitmap graphics at all is more or less obsolete, unless you're generating images for PNG or JPEG or something (e.g. optimize converting histogram bins to a scatter plot in the back-end code for a web service). Modern graphics APIs abstract away the resolution, so your app can draw things at a reasonable size regardless of how big each pixel is. (small but extremely high rez screen vs. big TV at low rez).
It is kind of cool to write to memory and see something change on-screen. Or even better, hook up LEDs (with small resistors) to the data bits on a parallel port, and run an outb instruction to turn them on/off. I did this on my Linux system ages ago. I made a little wrapper program that used iopl(2) and inline asm, and ran it as root. You can probably do similar on Windows. You don't need DOS or 16bit code to get your feet wet talking to the hardware.
in/out instructions, and normal loads/stores to memory-mapped IO, and DMA, are how real drivers talk to hardware, including things far more complicated than parallel ports. It's fun to know how your hardware "really" works, but only spend time on it if you're actually interested, or want to write drivers. The Linux source tree includes drivers for boatloads of hardware, and is often well commented, so if you like reading code as much as writing code, that's another way to get a feel for what real drivers do when they talk to hardware.
It's generally good to have some idea how things work under the hood. If you want to learn about how graphics used to work ages ago (with VGA text mode and color / attribute bytes), then sure, go nuts. Just be aware that modern OSes don't use VGA text mode, so you aren't even learning what happens under the hood on modern computers.
Many people enjoy https://retrocomputing.stackexchange.com/, reliving a simpler time when computers were less complex and couldn't support as many layers of abstraction. Just be aware that's what you're doing. It might be a good stepping stone to learning to write drivers for modern hardware, if you're sure that's why you want to understand asm / hardware.
Inline asm
You are taking a totally incorrect approach to using inline ASM. You seem to want to write whole functions in asm, so you should just do that. e.g. put your code in asmfuncs.S or something. Use .S if you want to keep using GNU / AT&T syntax; or use .asm if you want to use Intel / NASM / YASM syntax (which I would recommend, since the official manuals all use Intel syntax. See the x86 wiki for guides and manuals.)
GNU inline asm is the hardest way to learn ASM. You have to understand everything that your asm does, and what the compiler needs to know about it. It's really hard to get everything right. For example, in your edit, that block of inline asm modifies many registers that you don't list as clobbered, including %ebx which is a call-preserved register (so this is broken even if that function isn't inlined). At least you took out the ret, so things won't break as spectacularly when the compiler inlines this function into the loop that calls it. If that sounds really complicated, that's because it is, and part of why you shouldn't use inline asm to learn asm.
This answer to a similar question, also from someone misusing inline asm while trying to learn asm in the first place, has more links about inline asm and how to use it well.
Getting this mess working, maybe
This part could be a separate answer, but I'll leave it together.
Besides your whole approach being fundamentally a bad idea, there is at least one specific problem with your put_char function: you use offset as an output-only operand. gcc quite happily compiles your whole function to a single ret instruction, because the asm statement isn't volatile, and its output isn't used. (Inline asm statements without outputs are assumed to be volatile.)
I put your function on godbolt, so I could look at what assembly the compiler generates surrounding it. That link is to the fixed maybe-working version, with correctly-declared clobbers, comments, cleanups, and optimizations. See below for the same code, if that external link ever breaks.
I used gcc 5.3 with the -m16 option, which is different from using a real 16bit compiler. It still does everything the 32bit way (using 32bit addresses, 32bit ints, and 32bit function args on the stack), but tells the assembler that the CPU will be in 16bit mode, so it will know when to emit operand-size and address-size prefixes.
Even if you compile your original version with -O0, the compiler computes offset = (y<<8) + (y<<6) + x;, but doesn't put it in %edi, because you didn't ask it to. Specifying it as another input operand would have worked. After the inline asm, it stores %edi into -12(%ebp), where offset lives.
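In case the shift expression looks like magic: (y<<8) + (y<<6) is just y*256 + y*64 = y*320, i.e. y rows of 320 bytes each. Here's a small sketch that checks this, assuming a 320-byte row stride with one byte per pixel (which is what the addl $320 in the asm implies). The function names are made up for illustration.

```c
/* Assumption: 320 bytes per screen row, 1 byte per pixel. */
enum { SCREEN_WIDTH = 320 };

/* The expression from put_char: two shifts and two adds. */
unsigned offset_shifts(unsigned x, unsigned y)
{
    return (y << 8) + (y << 6) + x;   /* y*256 + y*64 + x */
}

/* The obvious way to write the same thing. */
unsigned offset_multiply(unsigned x, unsigned y)
{
    return y * SCREEN_WIDTH + x;
}
```

With optimization enabled, a compiler is free to turn the multiply into shifts, adds or lea on its own, so writing the shifts by hand in C mostly costs readability. Check the compiler output to see what you actually get.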
Other stuff wrong with put_char:
You pass 2 things (ascii_char and current_color) into your function through globals, instead of function arguments. Yuck, that's disgusting. VGA and characters are constants, so loading them from globals doesn't look so bad. Writing in asm means you should ignore good coding practices only when it helps performance by a reasonable amount. Since the caller probably had to store those values into the globals, you're not saving anything compared to the caller storing them on the stack as function args. And for x86-64, you'd be losing perf because the caller could just pass them in registers.
Also:
j,h,l,i=0; // sets i=0, does nothing to j, h, or l.
// gcc warns: left-hand operand of comma expression has no effect
j;h;l;i=0; // equivalent to this
j=h=l=i=0; // This is probably what you meant
All the local variables are unused anyway, other than offset. Were you going to write it in C or something?
You use 16bit addresses for characters, but 32bit addressing modes for VGA memory. I assume this is intentional, but I have no idea if it's correct. Also, are you sure you should use a CS: override for the loads from characters? Does the .rodata section go into the code segment? Although you didn't declare uint8_t characters[464] as const, so it's probably just in the .data section anyway. I consider myself fortunate that I haven't actually written code for a segmented memory model, but that still looks suspicious.
If you're really using djgpp, then according to Michael Petch's comment, your code will run in 32bit mode. Using 16bit addresses is thus a bad idea.
Optimizations
You can avoid using %ebx entirely by doing this, instead of loading into ebx and then adding %ebx to %edi.
"add _VGA, %%edi\n\t" // load from _VGA, add to edi.
You don't need lea to get an address into a register. You can just use
"mov %%ax, %%si\n\t"
"add $_characters, %%si\n\t"
$_characters means the address as an immediate constant. We can save a lot of instructions by combining this with the previous calculation of the offset into the characters array of bitmaps. The immediate-operand form of imul lets us produce the result in %si in the first place:
"movzbw _ascii_char,%%si\n\t"
//"sub $32,%%ax\n\t" // AX = ascii_char - 32
"imul $7, %%si, %%si\n\t"
"add $(_characters - 32*7), %%si\n\t" // Do the -32 at the same time as adding the table address, after multiplying
// SI points to characters[(ascii_char-32)*7]
// i.e. the start of the bitmap for the current ascii character.
Since this form of imul only keeps the low 16b of the 16*16 -> 32b multiply, the 2 and 3 operand forms of imul can be used for signed or unsigned multiplies, which is why only imul (not mul) has those extra forms. For larger operand-size multiplies, 2 and 3 operand imul is faster, because it doesn't have to store the high half in %[er]dx.
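The claim that the low half doesn't care about signedness can be checked in plain C. This is only a sketch with made-up function names. It assumes the usual two's-complement behaviour when converting an out-of-range uint16_t to int16_t (implementation-defined in C, but that's what gcc and clang do).

```c
#include <stdint.h>

/* Low 16 bits of the product, treating both inputs as unsigned. */
uint16_t low16_unsigned(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * (uint32_t)b);
}

/* Low 16 bits of the product, treating the same bit patterns as signed. */
uint16_t low16_signed(uint16_t a, uint16_t b)
{
    int32_t product = (int32_t)(int16_t)a * (int32_t)(int16_t)b;
    return (uint16_t)(uint32_t)product;
}

/* High 16 bits: this is where signed and unsigned multiplies differ. */
uint16_t high16_unsigned(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * (uint32_t)b) >> 16);
}

uint16_t high16_signed(uint16_t a, uint16_t b)
{
    int32_t product = (int32_t)(int16_t)a * (int32_t)(int16_t)b;
    return (uint16_t)((uint32_t)product >> 16);
}
```

For example, with a = 0xFFFF and b = 2, both low-half functions return 0xFFFE, but the high halves are 0x0001 (unsigned, 65535*2) and 0xFFFF (signed, -1*2).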
You could simplify the inner loop a bit, but it would complicate the outer loop slightly: you could branch on the zero flag, as set by shl $1, %al, instead of using a counter. That would also make the loop branch unpredictable, like the jump over the store for non-foreground pixels, so the increased branch mispredictions might be worse than the extra do-nothing iterations. It would also mean you'd need to recalculate %edi in the outer loop each time, because the inner loop wouldn't run a constant number of times. But it could look like:
... same first part of the loop as before
// re-initialize %edi to first_pixel-1, based on outer-loop counter
"lea -1(%%edi), %%ebx\n"
".Lbit_loop:\n\t" // map the 1bpp bitmap to 8bpp VGA memory
"incl %%ebx\n\t" // inc before shift, to preserve flags
"shl $1,%%al\n\t"
"jnc .Lskip_store\n\t" // transparency: only store on foreground pixels
"movb %%dl,(%%ebx)\n" //plot the pixel
".Lskip_store:\n\t"
"jnz .Lbit_loop\n\t" // flags still set from shl
"addl $320,%%edi\n\t" // WITHOUT the -6
"dec %%cl\n\t"
"jnz .Lbyte_loop\n\t"
Note that the bits in your character bitmaps are going to map to bytes in VGA memory like {7 6 5 4 3 2 1 0}, because you're testing the bit shifted out by a left shift. So it starts with the MSB. Bits in a register are always "big endian". A left shift multiplies by two, even on a little-endian machine like x86. Little-endian only affects ordering of bytes in memory, not bits in a byte, and not even bytes inside registers.
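Here's a sketch of both points in C, with hypothetical function names. expand_row walks a bitmap byte the way the shl / jnc loop does (most-significant bit first), and first_byte_in_memory shows the one place where endianness actually matters: the order of bytes in memory.

```c
#include <stdint.h>
#include <string.h>

/* Expand one bitmap byte into 8 characters, MSB first, mimicking
   "shl $1, %al" followed by a test of the bit that was shifted out. */
void expand_row(uint8_t row, char out[9])
{
    for (int i = 0; i < 8; i++) {
        out[i] = (row & 0x80) ? '#' : '.';   /* the bit shl would move into CF */
        row = (uint8_t)(row << 1);           /* left shift = multiply by 2 */
    }
    out[8] = '\0';
}

/* Byte order only shows up when a multi-byte value is viewed in memory.
   On little-endian x86 this returns the least-significant byte. */
uint8_t first_byte_in_memory(uint32_t value)
{
    uint8_t bytes[sizeof value];
    memcpy(bytes, &value, sizeof value);
    return bytes[0];
}
```

expand_row(0xA0, buf) leaves "#.#....." in buf on any machine, because shifts operate on values, not on memory. first_byte_in_memory(0x11223344) gives 0x44 on x86 and would give 0x11 on a big-endian machine.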
A version of your function that might do what you intended.
This is the same as the godbolt link.
void put_char(int x,int y){
int offset = (y<<8) + (y<<6) + x;
__asm__ volatile ( // volatile is required here: the only output (offset) is never used afterwards, so without it gcc could delete the whole asm statement.
"add _VGA, %%edi\n\t" // edi points to VGA + offset.
"movzbw _ascii_char,%%si\n\t" // Better: use an input operand
//"sub $32,%%ax\n\t" // AX = ascii_char - 32
"imul $7, %%si, %%si\n\t" // can't fold the load into this because it's not zero-padded
"add $(_characters - 32*7), %%si\n\t" // Do the -32 at the same time as adding the table address, after multiplying
// SI points to characters[(ascii_char-32)*7]
// i.e. the start of the bitmap for the current ascii character.
"mov $7,%%cl\n"
".Lbyte_loop:\n\t"
"lodsb %%cs:(%%si)\n\t" //load next byte of bitmap
"mov $6,%%ch\n"
".Lbit_loop:\n\t" // map the 1bpp bitmap to 8bpp VGA memory
"shl $1,%%al\n\t"
"jnc .Lskip_store\n\t" // transparency: only store on foreground pixels
"movb %%dl,(%%edi)\n" //plot the pixel
".Lskip_store:\n\t"
"incl %%edi\n\t"
"dec %%ch\n\t"
"jnz .Lbit_loop\n\t"
"addl $320-6,%%edi\n\t"
"dec %%cl\n\t"
"jnz .Lbyte_loop\n\t"
: "+&D" (offset) // EDI modified by the asm, compiler needs to know that, so use a read-write "+" input. Early-clobber "&" because we read the other input after modifying this.
: "d" (current_color) // used read-only
: "%eax", "%ecx", "%esi", "memory"
// omit the memory clobber if your C never touches VGA memory, and your asm never loads/stores anywhere else.
// but that's not the case here: the asm loads from memory written by C
// without listing it as a memory operand (even a pointer in a register isn't sufficient)
// so gcc might optimize away "dead" stores to it, or reorder the asm with loads/stores to it.
);
}
Re: the "memory" clobber, see How can I indicate that the memory *pointed* to by an inline ASM argument may be used?
I didn't use dummy output operands to leave register allocation up to the compiler's discretion, but that's a good idea to reduce the overhead of getting data in the right places for inline asm. (extra mov instructions). For example, here there was no need to force the compiler to put offset in %edi. It could have been any register we aren't already using.
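For comparison, here's roughly what that asm is doing, written as plain C. This is a sketch of my reading of the loop, not a drop-in replacement: it takes the screen buffer, glyph table, character and color as arguments instead of using the globals, and the name put_char_c and the enum constants are made up. The assumptions, all taken from the asm above: 320 bytes per screen row, 7 bytes per glyph, the top 6 bits of each byte used (MSB = leftmost pixel), and a glyph table that starts at ASCII 32.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    SCREEN_WIDTH = 320,   /* bytes per row, 1 byte per pixel */
    GLYPH_ROWS   = 7,     /* bytes per glyph (mov $7, %cl) */
    GLYPH_COLS   = 6,     /* bits used per row (mov $6, %ch) */
    FIRST_GLYPH  = 32     /* table starts at ASCII space */
};

/* Hypothetical C model of put_char. Pixels for 0 bits are left
   untouched (transparent background), like the jnc over the store. */
void put_char_c(uint8_t *screen, const uint8_t *glyphs,
                unsigned char ch, uint8_t color, int x, int y)
{
    uint8_t *dst = screen + (size_t)y * SCREEN_WIDTH + x;
    const uint8_t *src = glyphs + (size_t)(ch - FIRST_GLYPH) * GLYPH_ROWS;

    for (int row = 0; row < GLYPH_ROWS; row++) {
        uint8_t bits = src[row];               /* lodsb */
        for (int col = 0; col < GLYPH_COLS; col++) {
            if (bits & 0x80)                   /* bit that shl moves into CF */
                dst[col] = color;              /* plot the pixel */
            bits = (uint8_t)(bits << 1);
        }
        dst += SCREEN_WIDTH;                   /* next screen row */
    }
}
```

Looking at what gcc -O2 makes of something like this is a reasonable starting point before deciding whether hand-written asm is worth it at all.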

Resources