Is the machine code in LC-3 also known as assembly?

I'm a little confused over the concept of machine code...
Is machine code synonymous with assembly language?
What would be an example of the machine code in LC-3?

Assembly instructions (LD, ST, ADD, etc. in the case of the LC-3 simulator) correspond to binary instructions that are loaded and executed as a program. In the case of the LC-3, these "opcodes" are assembled into 16-bit strings of 1s and 0s that the LC-3 architecture is designed to execute accordingly.
For example, the assembly "ADD R4 R4 #10" corresponds to the LC-3 "machine code":
0001100100101010
Which can be broken down as:
0001 - the ADD opcode
100 - 4 in binary (destination register R4)
100 - 4 in binary (first source register R4)
1 - a flag bit indicating that we are adding an immediate value to a register
01010 - 10 in binary (the 5-bit immediate)
Note that each opcode has a distinct binary equivalent, and the opcode field is 4 bits wide, so there are 2^4 = 16 possible opcodes.
The LC-3 stretches this limited opcode space by using flag bits within certain instructions. For ADD, the flag bit (bit 5) changes depending on what we're adding: if we are adding two registers (i.e. "ADD R4 R4 R7" as opposed to a register and an immediate value), the bit is set to 0 and the instruction ends with 00 followed by the 3-bit second source register, instead of the 5-bit immediate.
This machine code instructs the LC-3 to add decimal 10 to the value in register 4, and store the result in register 4.
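The field layout above can be sketched as a small encoder (a hypothetical helper, not part of any LC-3 toolchain):

```python
# Minimal sketch of encoding LC-3 "ADD DR, SR1, #imm5" (immediate form).
# Field layout: opcode(4) | DR(3) | SR1(3) | 1 | imm5(5).
def encode_add_imm(dr, sr1, imm5):
    opcode = 0b0001                      # ADD
    return (opcode << 12) | (dr << 9) | (sr1 << 6) | (1 << 5) | (imm5 & 0x1F)

word = encode_add_imm(4, 4, 10)          # ADD R4, R4, #10
print(format(word, "016b"))              # 0001100100101010
```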

An "assembly language" is a symbolic (human-readable) language, which a program known as an "assembler" translates into the binary form, "machine code", which the CPU can load and execute. In LC-3, each machine-code instruction is a 16-bit word, while the corresponding instruction in the assembly language is a line of human-readable text (other architectures may have longer or shorter machine-code instructions, but the general concept is the same).

The same holds true for any high-level language (such as C, Pascal, or BASIC). The difference between an HLL and assembly is that each assembly-language statement corresponds to one machine operation (macros excepted), whereas in an HLL a single statement can compile into a whole lot of machine code.
You can say that assembly is a thin layer of mnemonics on top of machine code.

Related

How does a C compiler convert a constant to binary

For the sake of specifics, let's consider GCC compiler, the latest version.
Consider the instruction int i = 7;.
In assembly it will be something like
MOV 7, R1
This will insert the value seven into register R1. The exact instruction may not be important here.
In my understanding, the compiler will now convert the MOV instruction to a processor-specific opcode. Then it will allocate a (possibly virtual) register, and then the constant value 7 needs to go into the register.
My question:
How is the 7 actually converted to binary?
Does the compiler actually repeatedly divide by 2 to get the binary representation? (Maybe afterwards it will convert to hex, but let's stay with the binary step.)
Or, considering that the 7 is written as a character in a text file, is there a clever lookup-table-based technique to convert any string (representing a number) to a binary value?
If the current GCC compiler uses a built-in function to convert the string 7 to a binary 0111, then how did the first compiler convert a text-based string to a binary value?
Thank you.
How is the 7 actually converted to binary?
First of all, there's a distinction between the binary (base-2) number format and what professional programmers call "a binary executable", meaning generated machine code, most often expressed in hex for convenience. Addressing the latter meaning:
Disassemble the binary (for example at https://godbolt.org/) and see for yourself:
int main (void)
{
int i = 7;
return i;
}
Does indeed get translated to something like
mov eax,0x7
ret
Translated to binary op codes:
B8 07 00 00 00
C3
Where B8 = mov eax, B9 = mov ecx, and so on. The 7 is translated into 07 00 00 00 since mov into a 32-bit register expects a 4-byte immediate and this is a little-endian CPU.
And this is the point where the compiler/linker stops caring. The code was generated according to the CPU's ABI (Application Binary Interface) and how to deal with this machine code from here on is up to the CPU.
As for how this makes it into the hardware in the actual form of base 2 binary... it's already in that form. Everything we see in a PC is a translated convenience for the human users, who have an easier time reading decimal or hex than raw binary.
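To connect the two halves of the question: the assembler first parses the decimal text into a number (no repeated division by 2 needed; the value is binary as soon as it exists in memory), then emits the value's bytes directly after the opcode. A minimal Python sketch with hypothetical helper names:

```python
import struct

# Parse decimal text the way an assembler might: n = n*10 + digit.
def parse_decimal(text):
    n = 0
    for ch in text:
        n = n * 10 + (ord(ch) - ord("0"))  # '7' is 0x37; subtracting '0' (0x30) gives 7
    return n

imm = parse_decimal("7")
# Emit "mov eax, imm32": opcode B8 followed by a 4-byte little-endian immediate.
code = bytes([0xB8]) + struct.pack("<I", imm)
print(code.hex(" "))  # b8 07 00 00 00
```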
If the current GCC compiler uses a built-in function to convert a string 7 to a binary 0111, then how did the first compiler convert a text-based string to a binary value? This is a chicken-and-egg problem, but simply put, compilers are created step by step (a process called bootstrapping), and at some point a compiler is written in its own language, such that a C compiler is written in C, etc.
Before answering your question, we should define what we mean by "compilation", or what a compiler does. Simply put, compilation is a pipeline: it takes your high-level code, performs some operations, and generates assembly code (specific to the machine); then a machine-specific assembler takes that assembly code and converts it into a binary object file.
At the compiler level, all the compiler does is create the corresponding assembly in a text file.
The assembler is another program that takes this text file and converts it into "binary" format. The assembler can also be written in C; here we again need a mapping, e.g. movl -> 0000110101110..., but this one is binary, not ASCII, and we write these binary values into a file as-is.
Converting numbers into binary format is also redundant in a sense, because numbers are already in binary form once they are loaded into memory.
How they are converted and placed into memory is a problem for the operating system's loader program, which exceeds my knowledge.
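The mapping mentioned above can be sketched as a toy one-instruction assembler (hypothetical table, covering only the x86 "mov r32, imm32" form, whose opcode byte depends on the register):

```python
import struct

# Toy mapping table: (mnemonic, register) -> opcode byte.
OPCODES = {("mov", "eax"): 0xB8, ("mov", "ecx"): 0xB9}

def assemble(line):
    mnem, reg, imm = line.replace(",", " ").split()
    # Opcode byte from the table, then the immediate written out binary, as-is.
    return bytes([OPCODES[(mnem, reg)]]) + struct.pack("<I", int(imm, 0))

print(assemble("mov eax, 7").hex())  # b807000000
```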

Detecting and extracting opcode sequences

May I get an explanation about what opcode sequences are and how to find them in PE32 files?
I am trying to extract them from PE32 files.
what opcode sequences are
A CPU instruction is composed of one or more bytes, and each of those bytes has a different meaning.
An opcode (operation code) is the part of an instruction that defines the behavior of the instruction itself (as in, this instruction is an 'ADD', or an 'XOR', a NOP, etc.).
For x86 / x64 CPUs (IA-32; IA-32e in Intel lingo), an instruction is composed of at least an opcode (1 to 3 bytes), but can come with multiple other bytes (various prefixes, ModR/M, SIB, Disp. and Imm.) depending on its encoding:
"Opcode" is often a synonym for "instruction" (since the opcode defines the behavior of the instruction); therefore, when you have multiple instructions you then have an opcode sequence (which is a bit of a misnomer, since it's really an instruction sequence, unless all instructions in the sequence are composed only of opcodes).
how to find them in PE32 files?
As instructions can be multiple bytes long, you can't just start at a random location in the .text section (which, for a PE file, contains the executable code of the program). There's a specific location in the PE file - called the "entry point" - which defines the start of the program.
The entry point for a PE File is given by the AddressOfEntryPoint member of the IMAGE_OPTIONAL_HEADER structure (parts of the PE header structures). Note that this member is an RVA, not a "full" VA.
From there you know you are at the start of an instruction. You can start disassembling / counting instructions from this point, following the encoding rules for instructions (these rules are explained at great length in the Intel and AMD manuals).
Most instructions are "fall-through", which means that once an instruction has executed, the next one to execute is the following one (this seems obvious, but!). The trick is that when there's a non-fall-through instruction, you must know what that instruction does to continue your disassembly (e.g. it might jump somewhere, go to a specific handler, etc.).
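The entry-point lookup described above can be sketched with nothing but the stated offsets (a synthetic header is built here for demonstration; real PE files have many more fields, and field offsets follow the PE/COFF layout: e_lfanew at 0x3C, then signature (4 bytes) + IMAGE_FILE_HEADER (20 bytes) + 16 bytes into IMAGE_OPTIONAL_HEADER):

```python
import struct

def entry_point_rva(pe_bytes):
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    e_lfanew = struct.unpack_from("<I", pe_bytes, 0x3C)[0]
    assert pe_bytes[e_lfanew:e_lfanew + 4] == b"PE\0\0"
    # Signature (4) + IMAGE_FILE_HEADER (20) + 16 = AddressOfEntryPoint.
    return struct.unpack_from("<I", pe_bytes, e_lfanew + 4 + 20 + 16)[0]

# Build a minimal fake header just to exercise the parser.
hdr = bytearray(0x80)
hdr[0:2] = b"MZ"
struct.pack_into("<I", hdr, 0x3C, 0x40)           # e_lfanew = 0x40
hdr[0x40:0x44] = b"PE\0\0"
struct.pack_into("<I", hdr, 0x40 + 40, 0x1000)    # AddressOfEntryPoint = 0x1000
print(hex(entry_point_rva(bytes(hdr))))           # 0x1000
```

Remember the result is an RVA: to get a file offset you still have to map it through the section headers.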
Use the radare2 library; it can extract opcode sequences very quickly.

Is memory stored in ARM Assembly in binary?

Let's say I executed an instruction such as MOV R1, #9. Would this store the number 9 in binary in memory? Would the ARM assembly code see the number 00001001 stored in the location of R1? If not, how would it store the decimal number 9 in binary?
The processor's storage and logic are composed of gates, which manipulate values as voltages. For simplicity, only two voltages are recognized: high and low, making a binary system.
The processor can manipulate (compute with) these values using simple boolean logic (AND, OR, etc..) — this is called combinational logic.
The processor can also store binary values for a time using cycles in logic (here think circular, as in cycles in a graph) called sequential logic. A cycle creates a feedback loop that will preserve values until told to preserve a new one (or power is lost).
Most anything meaningful requires multiple binary values, which we can call bit strings. We can think of these bit strings as numbers, characters, etc.., usually depending on context.
The processor's registers are an example of sequential logic used to store bit strings, here strings of 32 bits per register.
Machine code is the language of the processor. The processor interprets bit strings as sequences of instructions to carry out. Some bit strings command the processor to store values in registers; as described above, each bit of a register stores one value as a voltage in a binary system.
The registers are storage, but the only way we can actually visualize them is to retrieve their values using machine code instructions to send these values to an output device for viewing as numbers, characters, colors, etc..
In short, if you send a number #9 to a register, it will store a certain bit string, and some subsequent machine code instruction will be able to retrieve that same bit pattern. Whether that 9 represents a numeric 9 or tab character or something else is a matter of interpretation, which is done by programming, which is source code ultimately turned into sequences of machine code instructions.
Everything is always binary in a computer; decimal and hex are just human-friendly notations for writing the values that a binary bit-pattern represents.
Even after you convert a binary integer to a string of decimal digits (e.g. by repeated division by 10 to get digit=remainder), those digits are stored one per byte in ASCII.
For example the digit '0' is represented by the ASCII / UTF-8 bit-pattern 00110000 aka hex 0x30 aka decimal 48.
We call this "converting to decimal" because the value is now represented in a way that's ready to copy into a text file or into video memory where it will be shown to humans as a series of decimal digits.
But really all the information is still represented in terms of binary bits which are either on or off. There are no single wires or transistors inside a computer that can represent a single 10-state value using e.g. one of ten different voltage levels. It's always either all the way on or all the way off.
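The "repeated division by 10" conversion described above can be sketched in a few lines (a hypothetical helper, not any particular library's routine):

```python
# Convert a (binary) integer to its ASCII decimal digits by repeated division.
def to_decimal_ascii(n):
    digits = b""
    while True:
        n, rem = divmod(n, 10)
        digits = bytes([0x30 + rem]) + digits   # '0' is ASCII 0x30, so 0x30+rem is the digit
        if n == 0:
            return digits

print(to_decimal_ascii(48))        # b'48'
print(to_decimal_ascii(48).hex())  # 3438, i.e. the ASCII bytes for '4' (0x34) and '8' (0x38)
```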
Ternary logic (https://en.wikipedia.org/wiki/Three-valued_logic) is the usual example of non-binary logic. True decimal logic would be one instance of n-valued logic, where a "bit" can have one of 10 values instead of one of 3.
In practice CPUs are based on strictly binary (Boolean) logic. I mention ternary logic only to illustrate what it would really mean for computers to truly use decimal logic/numbers instead of just using binary to represent decimal characters.
A processor has a number of places to store content; you can call it a memory hierarchy: "memories" that are very close to the processor are very fast and small, e.g. registers, while "memories" that are far from the processor are slow and large, like a disk. In between are cache memories at various levels and the main memory.
Since these memories are technologically very different, they are accessed in different ways: registers are accessed by specifying them directly in assembly code (as R1 in MOV R1, #9), whereas main-memory storage locations are consecutively numbered with "addresses".
So if you execute MOV R1, #9, the binary number 9 is stored in register 1 of the CPU, not in main memory, and assembly code reading out R1 would read back the number 9.
All numbers are stored in a binary format. The most well known are integers (as your example binary 00001001 for decimal 9), and floating point.
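Both representations can be inspected with Python's struct module (a sketch; the bit patterns shown are for a little-endian 32-bit int and an IEEE-754 single-precision float):

```python
import struct

# The same value 9 as a 32-bit integer and as a single-precision float.
int_bits = struct.unpack("<I", struct.pack("<i", 9))[0]
float_bits = struct.unpack("<I", struct.pack("<f", 9.0))[0]

print(format(int_bits, "032b"))    # ...00001001: plain binary integer
print(format(float_bits, "032b"))  # sign | exponent | mantissa fields
print(hex(float_bits))             # 0x41100000
```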

Disassembly of a mixed ARM/Thumb2 ELF file

I'm trying to disassemble an ELF executable which I compiled using arm-linux-gnueabihf to target thumb-2. However, ARM instruction encoding is making me confused while debugging my disassembler. Let's consider the following instruction:
mov.w fp, #0
Which I disassembled using objdump and Hopper as a Thumb-2 instruction. The instruction appears in memory as 4ff0000b, which means that it's actually 0b00f04f (little endian). Therefore, the binary encoding of the instruction is:
0000 1011 0000 0000 1111 0000 0100 1111
According to the ARM Architecture Reference Manual, it seems like ALL 32-bit Thumb-2 instructions should start with 111[10|01|11]. Therefore, the above encoding doesn't correspond to any Thumb-2 instruction. Further, it doesn't match any of the encodings found in section A8.8.102 (page 484).
Am I missing something?
I think you're missing the subtle distinction that wide Thumb-2 encodings are not 32-bit words like ARM encodings, they are a pair of 16-bit halfwords (note the bit numbering above the ARM ARM encoding diagram). Thus whilst the halfwords themselves are little-endian, they are still stored in 'normal' order relative to each other. If the bytes in memory are 4ff0000b, then the actual instruction encoded is f04f 0b00.
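The halfword ordering can be checked in a few lines (a sketch; the 111xx prefix test follows the encoding rule quoted in the question):

```python
import struct

raw = bytes.fromhex("4ff0000b")              # bytes as they appear in memory
hw1, hw2 = struct.unpack("<2H", raw)         # each halfword is little-endian
print(f"{hw1:04x} {hw2:04x}")                # f04f 0b00

# A halfword starts a 32-bit Thumb-2 encoding iff its top 5 bits are
# 11101, 11110, or 11111.
def is_32bit_first_halfword(hw):
    return (hw >> 11) in (0b11101, 0b11110, 0b11111)

print(is_32bit_first_halfword(hw1))          # True: f04f begins mov.w fp, #0
```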
Thumb2 instructions are extensions to the Thumb instruction set: formerly undefined instructions, some of which are now defined. ARM is a completely different instruction set. If the toolchain has not left you clues as to which code is Thumb vs ARM, then the only way to figure it out is to start with an assumption at an entry point and disassemble in execution order from there, and even then you might not figure out some of the code.
You cannot distinguish ARM instructions from Thumb or Thumb+Thumb2 extensions simply by bit pattern. Also remember that ARM instructions are aligned on 4-byte boundaries while Thumb instructions are aligned on 2-byte boundaries, and a Thumb2 extension doesn't have to be in the same 4-byte boundary as its parent Thumb instruction, making this all that much more fun. (Thumb+Thumb2 is a variable-length instruction set made from multiples of 16-bit values.)
If all of your code is Thumb and there are no ARM instructions in there, you still have the problem you would have with any variable-length instruction set, and to do it right you have to walk the code in execution order. For example, it would not be hard to embed a data value in .text that looks like the first half of a Thumb2 extension, and follow that with a real Thumb2 extension, causing your disassembler to go off the rails. This is the elementary variable-word-length disassembly problem (and an elementary way to defeat simple disassemblers).
Consider 16-bit words A, B, C, D.
If C + D are a Thumb2 instruction (which is known by decoding C), A is a Thumb instruction, and B is a data value that resembles the first half of a Thumb2 extension, then when linearly decoding through RAM, A is decoded as the Thumb instruction, B and C are decoded as a Thumb2 extension, and D, which is actually the second half of a Thumb2 extension, is now decoded as the first 16 bits of an instruction. All bets are off as to how that decodes, or whether it causes all or many of the following instructions to be decoded wrong.
So start off by looking to see if the ELF tells you something. If not, then you have to make passes through the code in execution order (you have to make an assumption as to an entry point), following all the possible branches and linear execution, to mark 16-bit sections as the first or additional blocks of instructions. The unmarked blocks cannot necessarily be determined to be instructions vs data, and care must be taken.
And yes, it is possible to play other games to defeat disassemblers, such as intentionally branching into the second half of a Thumb2 instruction which is hand-crafted to be a valid Thumb instruction or the beginning of a Thumb2.
Fixed-length instruction sets like ARM and MIPS you can decode linearly; some data decodes as strange or undefined instructions, but your disassembler doesn't go off the rails and fail to do its job. With variable-length instruction sets, disassembly at best is just a guess: the only way to truly decode is to execute the instructions the same way the processor would.
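The A/B/C/D scenario above can be simulated. Here a halfword is (naively) treated as the start of a 32-bit encoding iff its top 5 bits are 11101/11110/11111; everything else about the encodings is ignored, so this is only a sketch of the alignment failure, not a real disassembler:

```python
def is_32bit_first(hw):
    """True iff the halfword's top 5 bits mark a 32-bit Thumb-2 encoding."""
    return (hw >> 11) in (0b11101, 0b11110, 0b11111)

def linear_decode(halfwords, start=0):
    """Return (index, size-in-halfwords) pairs decoded from a start index."""
    out, i = [], start
    while i < len(halfwords):
        size = 2 if is_32bit_first(halfwords[i]) else 1
        out.append((i, size))
        i += size
    return out

# A = a Thumb instruction, B = data resembling a Thumb-2 first half,
# C + D = a real 32-bit Thumb-2 instruction.
A, B, C, D = 0x1888, 0xF04F, 0xF04F, 0x0B00
stream = [A, B, C, D]
print(linear_decode(stream))     # [(0, 1), (1, 2), (3, 1)] -- B+C fused, D orphaned
print(linear_decode(stream, 2))  # [(2, 2)]                 -- C+D decoded correctly
```

Starting at the wrong place (or hitting embedded data) shifts every later boundary, which is exactly how the disassembler "goes off the rails".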

What is the ARM Thumb Instruction set?

Under "The Thumb instruction set" in section 1-34 of the "ARM11 Technical Reference Manual", it says:
"The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions. Thumb instructions are 16 bits long, and have a corresponding 32-bit ARM instruction that has the same effect on processor model."
Can anyone explain more about this, especially the second sentence, and explain how the processor performs it?
The ARM processor has 2 instruction sets: the traditional ARM set, where the instructions are all 32 bits long, and the more condensed Thumb set, where most common instructions are 16 bits long (and some are 32 bits long). Which instruction set to run can be chosen by the developer, and only one set can be active (i.e. once the processor is switched to Thumb mode, all instructions will be decoded as Thumb instead of ARM).
Although they are different instruction sets, they share similar functionality, and can be represented using the same assembly language. For example, the instruction
ADDS R0, R1, R2
can be compiled to ARM (E0910002 / 11100000 10010001 00000000 00000010) or Thumb (1888 / 00011000 10001000). Of course, they perform the same function (add r1 and r2 and store the result to r0), even if they have different encodings. This is the meaning of "Thumb instructions are 16 bits long, and have a corresponding 32-bit ARM instruction that has the same effect on processor model".
Every* instruction in Thumb encoding also has a corresponding encoding in ARM, which is meant by the "subset" sentence.
*: Not strictly true; there is no "IT" instruction in ARM, although ARM doesn't need "IT" anyway (it will be ignored by the assembler).
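The two encodings given above can be reconstructed from their fields (a sketch based on the ARM data-processing format and the Thumb add-register format; field positions are as I recall them from the ARM manuals, so treat this as illustrative):

```python
# ARM "ADDS Rd, Rn, Rm": cond=1110, data-processing, opcode ADD=0100, S=1.
def arm_adds(rd, rn, rm):
    return (0b1110 << 28) | (0b0100 << 21) | (1 << 20) | (rn << 16) | (rd << 12) | rm

# Thumb "ADDS Rd, Rn, Rm": 0001100 | Rm | Rn | Rd (all registers 3-bit, R0-R7).
def thumb_adds(rd, rn, rm):
    return (0b0001100 << 9) | (rm << 6) | (rn << 3) | rd

print(f"{arm_adds(0, 1, 2):08X}")    # E0910002
print(f"{thumb_adds(0, 1, 2):04X}")  # 1888
```

Note how the Thumb form only has room for 3-bit register numbers; that restriction is part of why Thumb is a subset.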
