How do I interpret this 68K assembly instruction:
MOVE.W #5100!$13ec,-(A7)
What is the meaning of the '!' symbol between the decimal 5100 and the hexadecimal $13ec?
I have noticed that 5100 is equal to $13ec.
This is your disassembler being "helpful" and showing you two possible interpretations of the value. Sometimes the decimal view is what you want (e.g. it's a loop counter, a fixed size, or a decimal constant), and sometimes the hex view is what you want (e.g. it's an address, a block size, flags, or a hex constant). By printing both, the disassembler leaves the choice of interpretation to you.
If you were going to assemble this instruction, you'd only use one interpretation, e.g.
MOVE.W #5100,-(A7)
or
MOVE.W #$13ec,-(A7)
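If it helps to convince yourself that the two spellings really are the same value, here is a minimal C check (my own illustration, not part of the answer):

#include <stdio.h>

int main(void)
{
    int imm = 5100;                              /* the immediate from the disassembly */
    printf("%d = $%X\n", imm, (unsigned)imm);    /* prints: 5100 = $13EC */
    return 0;
}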
Related
I want to disassemble an m68k compiled binary myself and write an emulator.
I've disassembled the binary file into a text file with the third-party tool m68k-linux-gnu-objdump -b binary -m m68k:68000 ... to get a better view of what is going on in the binary.
Here I have an instruction:
0604 0320 addib #32,%d4
From this table I see that the addi instruction has the following binary scheme:
0000 011 0 <S> <M> <Xn>
and my instruction has representation of:
0000 011 0 00 000 100
Which means I have an addi operation with byte size (8 bits), addressing mode "data register", and the register encoded as D4.
OK, addib with destination %d4, but what does the data column on the right side of the table mean?
I see that the second word (2 bytes of data) in the disassembly is 0x0320, where the low byte 0x20 is actually my #32 literal. But what is this 0x03 in the upper byte? I've looked at some other addi instructions in the disassembly, and everywhere there was something in that upper part while the low byte was my number in hex.
I'm probably not taking the last column of the table ("data") into account, but I failed to understand how to interpret it.
For the example above the table says data type "any type" + immediate mode, but what is this "any type"?
The size of the addi instruction is said to be any of b/w/l in the second (green) column of the table. Are these three things related: the first sub-column of the blue data column (B,W,/), the green size column (B/W/L), and the pink sector of the scheme (00 = B, 01 = W, 10 = L)? I'm completely confused.
Another problem is that I don't understand the boundaries of the instructions. I've seen instructions that were at most 16 bits long (as shown in the general scheme for each operation), but there are also "brief extension words" and "full extension words", and I can't completely follow what the book says about them. The only thing I think I understood is that the first 16 bits of the opcode are the "Single Effective Address Operation Word", and that's it.
This is my first attempt at understanding programming at such a low level.
Do what the CPU does with the first byte of the immediate data word of a byte size instruction: Ignore it.
By encoding the two size bits as "00", you told the CPU that you want to add an 8-bit immediate value to the byte-size part of d4. That means the upper byte of the immediate data word is not used, but the 68000 still reads instructions only word-wise. Thus, the upper part of this data word is simply "don't care": you can put anything in there without changing the effect of the instruction, because the CPU won't use it. (So the actual value "3" you see there in your case is irrelevant and probably just some random value left over from the assembler.)
If you encode the instruction as ".w" (that is, you want to do a 16-bit add), the upper byte of the data word becomes relevant. If you encode the very same instruction as .l (32-bit add), the assembler will add yet another word to the instruction and put the 32-bit immediate in those 2 words.
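A minimal C sketch of that decoding rule (my own illustration, not the answer author's code): it pulls the size field out of the operation word and then uses only the relevant part of the immediate data, so for the byte form the upper byte of 0x0320 never matters.

#include <stdio.h>
#include <stdint.h>

/* Extract the immediate operand of a 68000 ADDI instruction.
   op = operation word, e.g. 0x0604 for "addi.b #...,%d4"
   w1 = first immediate/data word, e.g. 0x0320
   w2 = second immediate word (only used for the .l form)               */
static uint32_t addi_immediate(uint16_t op, uint16_t w1, uint16_t w2)
{
    unsigned size = (op >> 6) & 0x3;            /* 00 = byte, 01 = word, 10 = long */

    switch (size) {
    case 0:  return w1 & 0x00FF;                 /* byte: upper byte is "don't care" */
    case 1:  return w1;                          /* word: the whole data word        */
    case 2:  return ((uint32_t)w1 << 16) | w2;   /* long: two extension words        */
    default: return 0;                           /* 11 is not a valid size           */
    }
}

int main(void)
{
    /* 0604 0320   addib #32,%d4   -- the 0x03 is simply ignored */
    printf("#%u\n", (unsigned)addi_immediate(0x0604, 0x0320, 0));   /* prints: #32 */
    return 0;
}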
For the sake of specifics, let's consider the GCC compiler, the latest version.
Consider the statement int i = 7;.
In assembly it will be something like
MOV 7, R1
This will put the value seven into register R1. The exact instruction may not be important here.
In my understanding, the compiler will now convert the MOV instruction to a processor-specific opcode. Then it will allocate a (possibly virtual) register. Then the constant value 7 needs to go into the register.
My question:
How is the 7 actually converted to binary?
Does the compiler actually repeatedly divide by 2 to get the binary representation? (Maybe afterwards it will convert to hex, but let's stay with the binary step.)
Or, considering that the 7 is written as a character in a text file, is there a clever lookup-table-based technique to convert any string (representing a number) to a binary value?
If the current GCC compiler uses a built-in function to convert the string 7 to the binary 0111, then how did the first compiler convert a text-based string to a binary value?
Thank you.
How is the 7 actually converted to binary?
First of all, there's a distinction between the binary base 2 number format and what professional programmers call "a binary executable", meaning generated machine code and most often expressed in hex for convenience. Addressing the latter meaning:
Disassemble binaries (for example at https://godbolt.org/) and see for yourself:
int main (void)
{
int i = 7;
return i;
}
does indeed get translated to something like
mov eax,0x7
ret
Translated to binary op codes:
B8 07 00 00 00
C3
Where B8 = mov eax, B9 = mov ecx and so on. The 7 is translated into 07 00 00 00 since mov expects 4 bytes and this is a little endian CPU.
And this is the point where the compiler/linker stops caring. The code was generated according to the CPU's instruction set and the platform's ABI (Application Binary Interface), and how to deal with this machine code from here on is up to the CPU.
As for how this makes it into the hardware in the actual form of base 2 binary... it's already in that form. Everything we see in a PC is a translated convenience for the human users, who have an easier time reading decimal or hex than raw binary.
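To make the first part of the question concrete: the compiler never divides by 2. Reading the characters of a decimal literal already produces the value in binary, because every variable inside the compiler is binary to begin with. A minimal sketch of the classic multiply-by-ten loop (my own illustration, not GCC's actual code):

#include <stdio.h>

/* Turn the text "7" (a string of ASCII digits) into an integer value.
   The accumulator, like everything else in the machine, is already binary;
   no repeated division by 2 is ever needed. */
static int parse_decimal(const char *s)
{
    int value = 0;
    while (*s >= '0' && *s <= '9')
        value = value * 10 + (*s++ - '0');   /* fold in one more decimal digit */
    return value;
}

int main(void)
{
    int i = parse_decimal("7");
    printf("%d (hex %X)\n", i, (unsigned)i);   /* prints: 7 (hex 7) */
    return 0;
}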
If the current GCC compiler uses a built-in function to convert a string 7 to a binary 0111, then how did the first compiler convert a text-based string to a binary value? This is a chicken-and-egg problem, but simply put, these compilers are created step by step, and at some point the compiler is written in its own language, such that a C compiler is written in C, etc.
Before answering your question we should define what we mean by "compilation", or what a compiler does. Simply put, compilation is a pipeline: it takes your high-level code, does some operations, and generates assembly code (specific to the machine); then a machine-specific assembler takes your assembly code and converts it into a binary object file.
At the compiler level, all it does is create the corresponding assembly in a text file.
The assembler is another program that takes this text file and converts it into "binary" format. The assembler can also be written in C; here we also need a mapping, i.e. movl -> (0000110101110...), but this one is binary, not ASCII, and we need to write this binary into a file as-is.
Converting numbers into binary format is not an extra step either, because numbers are already in binary form when they are loaded into memory.
How they are placed into memory is a problem for the loader program of the operating system, which exceeds my knowledge.
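As a toy illustration of that last assembling step (my own sketch; the mnemonic name is hypothetical and everything is deliberately oversimplified): the assembler just looks the mnemonic up in a table and writes the resulting opcode bytes to the output file as-is.

#include <stdio.h>
#include <string.h>

/* A toy mnemonic-to-opcode mapping, in the spirit of "movl -> 10111000...".
   0xB8 really is the x86 opcode byte for "mov eax, imm32"; everything else
   here is simplified for illustration. */
static int opcode_for(const char *mnemonic)
{
    if (strcmp(mnemonic, "mov_eax_imm32") == 0)
        return 0xB8;
    return -1;   /* unknown mnemonic */
}

int main(void)
{
    FILE *out = fopen("a.bin", "wb");
    if (!out)
        return 1;

    /* "Assemble" mov eax, 7: the opcode byte followed by a 4-byte
       little-endian immediate, written to the file as raw bytes. */
    unsigned char code[5] = { (unsigned char)opcode_for("mov_eax_imm32"), 7, 0, 0, 0 };
    fwrite(code, 1, sizeof code, out);
    fclose(out);
    return 0;
}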
representations of values on a computer can vary “culturally” from architecture to architecture or are determined by the type the programmer gave to the value. Therefore, we should try to reason primarily about values and not about representations if we want to write portable code.
Specifying values. We have already seen several ways in which numerical constants (literals) can be specified:
123 Decimal integer constant.
077 Octal integer constant.
0xFFFF Hexadecimal integer constant.
et cetera
Question: Are decimal integer constants and hexadecimal integer constants different ways to 'represent' values, or are they values themselves? If the latter, what are the different ways to represent them on different architectures?
The source of the above is the book "Modern C" by Jens Gustedt, which is freely available online, specifically pages 38 to 46.
The word "representation" can be used here in two different contexts.
One is when we (the programmers) specify e.g. integer constants. For example, the value 37 may be represented in the C code as 37 or 0x25 or 045. Regardless of which representation we have chosen, the C compiler will arrive at the same value when generating the binary code. Hence, these statements all generate the same code:
int a = 37;
int a = 0x25;
int a = 045;
Another context is how the compiler chooses to store the value 37 internally. The C standard states a few requirements (e.g. that the representation of int must at least be able to represent values in the range -32767 to +32767). Within the rules of the C standard the compiler will use a bit representation which can be operated on efficiently by the native language of the target system's CPU. The most common representation for signed integers is Two's complement and usually a signed integer with type int will occupy 2 or 4 bytes of 8 bits each.
However, the C standard is sufficiently flexible to allow for other internal representations (e.g. bytes with more than 8 bits or Ones' complement representation of signed integers). A common difference between representations of multibyte integers on different systems is the use of different byte order.
The C standard is primarily concerned with the result of standard operations. E.g. 5+6 must give the same result no matter on which platform the expression is executed, but how 5, 6 and 11 are represented on the given platform is largely up to the compiler to decide.
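A small sketch of both points (my own, not from the answer above): the three literal spellings denote one value, while the in-memory representation is whatever the platform uses, here assumed to be a 4-byte two's complement little-endian int.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Three source-code spellings of the same value. */
    printf("same value: %d\n", 37 == 0x25 && 0x25 == 045);     /* prints: 1 */

    /* How this particular platform chooses to store a negative int.
       On a 4-byte two's complement little-endian system the bytes of -37
       come out as DB FF FF FF. */
    int x = -37;
    unsigned char bytes[sizeof x];
    memcpy(bytes, &x, sizeof x);
    printf("bytes of -37:");
    for (size_t i = 0; i < sizeof x; i++)
        printf(" %02X", (unsigned)bytes[i]);
    printf("\n");
    return 0;
}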
It is of utmost importance to every C programmer to understand that C is an abstraction layer that shields you from the underlying hardware. This service is the raison d'être for the language, the reason it was developed. Among other things, the language shields you from the different internal byte patterns used to hold the same values on different platforms: You write a value and operations on it, and the compiler will see to producing the proper code. This would be different in assembler where you are intimately concerned with memory layout, register sizes etc.
In case it wasn't obvious: I'm emphasizing this because I struggled with these concepts myself when I learned C.
The first thing to hammer down is that C program code is text. What we deal with here are text representations of values, a succession of (most likely) ASCII codes much as if you wrote a letter to your grandma.
Integer literals like 0443 (the less usual octal format), 0x0123 or 291 are simply different string representations for the same value. Here and in the standard, "value" is a value in the mathematical sense. As much as we think "oh, C!" when we see "0x0123", it is nothing else than a way to write down the mathematical value of 291. That's what is meant by "value", for example when the standard specifies that "the type of an integer constant is the first of the corresponding list in which its value can be represented." The compiler has to create a binary representation of that value in the program's memory. This means it has to find out what value it is (291 in all cases) and then produce the proper byte pattern for it. The integer literal in the C code is not a binary form of anything, no matter whether you choose to write its string representation down base 10, base 16 or base 8. In particular, 0x0123 does not mean that the two bytes 01 and 23 will be anywhere in the compiled program, or in which order. [1]
To demonstrate the abstraction consider the expression (0x0123 << 4) == 0x1230, which should be true on all machines. Both hex literals are of type int here. The beauty of hex code is that it makes bit manipulations in multiples of 4 really easy to compute.
On a typical contemporary Intel architecture an int has 4 bytes and is organized "little end first", or "little endian" for short: the lowest-value byte comes first if we inspect the memory in ascending order. 0x123 is represented as 00100011-00000001-00000000-00000000 (because the two highest-value bytes are zero for such a small number). 0x1230 is, consequently, 00110000-00010010-00000000-00000000. No left-shift whatsoever took place on the hardware (but also no right-shift!). The bit-shift operators' semantics are an abstraction: "Imagine a regular binary number, following the old Arab fashion of starting with the highest-value digit, and shift that imagined binary number." It is an abstraction that bears zero resemblance to anything happening on the hardware, and the compiler simply translates this abstract operation into the right thing for that particular hardware.
[1] Now admittedly, they probably are there, but on your prevalent x86 platform their order will be reversed, as assumed below.
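To check the specific claims above on a real machine, here is a tiny sketch, assuming a little-endian platform with a 4-byte int (mine, not the author's):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The abstract shift semantics hold on every conforming platform. */
    printf("shift holds: %d\n", (0x0123 << 4) == 0x1230);          /* prints: 1 */

    /* On a little-endian machine with 4-byte int, the bytes of 0x0123 sit in
       memory as 23 01 00 00, exactly as described above. */
    int v = 0x0123;
    unsigned char expected[4] = { 0x23, 0x01, 0x00, 0x00 };
    printf("little-endian layout matches: %d\n",
           sizeof v == 4 && memcmp(&v, expected, 4) == 0);
    return 0;
}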
Are decimal integer constants and hexadecimal integer constants different ways to 'represent' values, or are they values themselves?
This is philosophy! They are different ways to represent values, like:
0x2 means 2 (for a C compiler)
two means 2 (English language)
a couple means 2 (for an English speaker)
zwei means 2 (...)
A C compiler translates from "some form of human understandable language" to "a very precise form understandable by the machine": the only thing which is retained from the various forms, is the intimate meaning (the value!).
It happens that C, in order to be more friendly, lets you specify integers in two different ways, decimal and hexadecimal (OK, even octal and, more recently, binary notation). What the C compiler is interested in is the value, and, as already noted in a comment, after the compiler has "understood" the value, there is no longer any difference between a "0xC" and a "12". From that point, the compiler must make the machine understand the value 12, using the representation the target machine uses, and, again, what is important is the value.
Most probably, the phrase
we should try to reason primarily about values and not about representations
is an invitation to programmers to choose correct data types and values, but not only that: also to give useful names to types and variables and so on. A not-very-good example: even if we know that a line feed is (often) represented by decimal 10, we should use LF or "\n" or similar, which is the value we want, not its representation.
About data types, especially integers, C is not particularly brilliant compared to other languages which let you define types based on their possible values (for example with the "-3 .. 5" notation, which states that the possible values go from -3 to 5, and lets the compiler choose the number of bits needed to represent that range).
Let's say I executed an instruction such as MOV R1, #9. Would this store the number 9 in binary in memory in ARM assembly? Would the ARM assembly code see the number 00001001 stored in the location of R1 in memory? If not, how would it store the decimal number 9 in binary?
The processor's storage and logic are composed of gates, which manipulate values as voltages. For simplicity, only two voltages are recognized: high and low, making a binary system.
The processor can manipulate (compute with) these values using simple boolean logic (AND, OR, etc..) — this is called combinational logic.
The processor can also store binary values for a time using cycles in logic (here think circular, as in cycles in a graph) called sequential logic. A cycle creates a feedback loop that will preserve values until told to preserve a new one (or power is lost).
Most anything meaningful requires multiple binary values, which we can call bit strings. We can think of these bit strings as numbers, characters, etc.., usually depending on context.
The processor's registers are an example of sequential logic used to store bit strings, here strings of 32 bits per register.
Machine code is the language of the processor. The processor interprets bit strings as instruction sequences to carry out. Some bit strings command the processor to store values in registers, which are composed of gates using strings of sequential logic, here 32 bits per register: each bit stores a value as a voltage in a binary system.
The registers are storage, but the only way we can actually visualize them is to retrieve their values using machine code instructions to send these values to an output device for viewing as numbers, characters, colors, etc..
In short, if you send a number #9 to a register, it will store a certain bit string, and some subsequent machine code instruction will be able to retrieve that same bit pattern. Whether that 9 represents a numeric 9 or tab character or something else is a matter of interpretation, which is done by programming, which is source code ultimately turned into sequences of machine code instructions.
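A small illustration of that last point (my own example): the same 8-bit pattern 00001001 can be read back as the number 9 or as the ASCII tab character, depending purely on how the program chooses to interpret it.

#include <stdio.h>

int main(void)
{
    unsigned char bits = 9;   /* the bit pattern 00001001 */

    printf("as a number:    %d\n", bits);      /* prints 9                */
    printf("as a character: [%c]\n", bits);    /* prints a horizontal tab */
    return 0;
}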
Everything is always binary in a computer; decimal and hex are just human-friendly notations for writing the values that a binary bit-pattern represents.
Even after you convert a binary integer to a string of decimal digits (e.g. by repeated division by 10 to get digit=remainder), those digits are stored one per byte in ASCII.
For example the digit '0' is represented by the ASCII / UTF-8 bit-pattern 00110000 aka hex 0x30 aka decimal 48.
We call this "converting to decimal" because the value is now represented in a way that's ready to copy into a text file or into video memory where it will be shown to humans as a series of decimal digits.
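For concreteness, here is a minimal sketch of that repeated-division conversion (my own code; the helper name is made up for illustration):

#include <stdio.h>

/* Convert a non-negative binary integer into a string of ASCII decimal digits
   by repeated division by 10: each remainder becomes one digit character. */
static void to_decimal_ascii(unsigned value, char *out)
{
    char tmp[16];
    int n = 0;

    do {
        tmp[n++] = '0' + value % 10;   /* remainder -> ASCII digit 0x30..0x39 */
        value /= 10;
    } while (value != 0);

    while (n > 0)                      /* the digits came out backwards; reverse them */
        *out++ = tmp[--n];
    *out = '\0';
}

int main(void)
{
    char buf[16];
    to_decimal_ascii(48u, buf);
    printf("\"%s\"  first byte = 0x%02X\n", buf, (unsigned char)buf[0]);
    /* prints: "48"  first byte = 0x34, the ASCII code for '4' */
    return 0;
}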
But really all the information is still represented in terms of binary bits which are either on or off. There are no single wires or transistors inside a computer that can represent a single 10-state value using e.g. one of ten different voltage levels. It's always either all the way on or all the way off.
https://en.wikipedia.org/wiki/Three-valued_logic aka Ternary logic is the usual example of non-binary logic. True decimal logic would be one instance of n-valued logic, where a "bit" can have one of 10 values instead of one of 3.
In practice CPUs are based on strictly binary (Boolean) logic. I mention ternary logic only to illustrate what it would really mean for computers to truly use decimal logic/numbers instead of just using binary to represent decimal characters.
A processor has a number of places to store content; you can call it a memory hierarchy: "memories" that are very close to the processor are very fast and small, e.g. registers, and "memories" that are far from the processor are slow and large, like a disk. In between are cache memories at various levels and the main memory.
Since these memories are technologically very different, they are accessed in different ways: registers are accessed by specifying them in assembly code directly (as R1 in MOV R1, #9), whereas the main memory storage locations are consecutively numbered with "addresses".
So if you execute MOV R1, #9, the binary number 9 is stored in register 1 of the CPU, not in main memory, and assembly code reading out R1 would read back the number 9.
All numbers are stored in a binary format. The best-known formats are integers (as in your example, binary 00001001 for decimal 9) and floating point.
I'm a little confused over the concept of machine code...
Is machine code synonymous to assembly language?
What would be an example of the machine code in LC-3?
Assembly instructions (LD, ST, ADD, etc. in the case of the LC-3 simulator) correspond to binary instructions that are loaded and executed as a program. In the case of the LC-3, these "opcodes" are assembled into 16-bit strings of 1s and 0s that the LC-3 architecture is designed to execute accordingly.
For example, the assembly "ADD R4 R4 #10" corresponds to the LC-3 "machine code":
0001100100101010
Which can be broken down as:
0001 - ADD (the opcode)
100 - 4 in binary (the destination register, R4)
100 - 4 in binary (the first source register, R4)
1 - the mode bit, set to 1 because we are adding an immediate value rather than a register
01010 - 10 in binary (the 5-bit immediate value)
Note that each opcode has a distinct binary equivalent, so there are 2^4=16 possible opcodes.
The LC-3 sneakily deals with this limited opcode space by introducing that mode bit in certain instructions. For ADD, it changes depending on what we're adding: if we are adding two registers (i.e. "ADD R4 R4 R7" as opposed to a register and a value), the mode bit is 0, the next two bits must be 00, and the last three bits encode the second source register.
This machine code instructs the LC-3 to add decimal 10 to the value in register 4, and store the result in register 4.
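If it helps, here is a tiny C sketch (mine, not part of the original answer; the function name is made up) that builds the immediate form of ADD from its fields and reproduces that exact word:

#include <stdio.h>
#include <stdint.h>

/* Build an LC-3 "ADD DR, SR1, #imm5" instruction word:
   bits 15-12 opcode (0001), 11-9 DR, 8-6 SR1, bit 5 = 1 (immediate mode),
   bits 4-0 the 5-bit immediate. */
static uint16_t lc3_add_imm(unsigned dr, unsigned sr1, int imm5)
{
    return (uint16_t)((0x1u << 12) | (dr << 9) | (sr1 << 6) | (1u << 5) | (imm5 & 0x1F));
}

int main(void)
{
    uint16_t word = lc3_add_imm(4, 4, 10);          /* ADD R4, R4, #10 */

    printf("hex: %04X, binary: ", (unsigned)word);  /* hex: 192A */
    for (int bit = 15; bit >= 0; bit--)
        printf("%d", (word >> bit) & 1);            /* 0001100100101010 */
    printf("\n");
    return 0;
}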
An "Assembly language" is a symbolic (human-readable) language, which a program known as the "assembler" translates into the binary form, "machine code", which the CPU can load and execute. In LC-3, each instruction in machine code is a 16-bit word, while the corresponding instruction in the assembly language is a line of human-readable text (other architectures may have longer or shorter machine-code instructions, but the general concept is the same).
The above holds true for any high-level language (such as C, Pascal, or Basic). The difference between an HLL and assembly is that each assembly language statement corresponds to one machine operation (macros excepted). Meanwhile, in an HLL, a single statement can compile into a whole lot of machine code.
You can say that assembly is a thin layer of mnemonics on top of machine code.