Can't understand how to decode M68K binary code - disassembly

I want to disassemble an m68k-compiled binary myself and write an emulator.
I've disassembled the binary file into a text file with the third-party tool m68k-linux-gnu-objdump -b binary -m m68k:68000 ... to get a better view of what is going on in the binary.
Here I have an instruction:
0604 0320 addib #32,%d4
From this table I see that the addi instruction has the following binary scheme:
0000 011 0 <S> <M> <Xn>
and my instruction is encoded as:
0000 011 0 00 000 100
Which means I have an addi operation with byte size (8 bits), addressing mode "data register direct", and the register is D4.
OK, so it's addib with destination %d4, but what does the data column on the right side of the table mean?
I see that the second word (2 bytes of data) in the disassembly is 0x0320, where the last byte, 0x20, is actually my #32 literal in decimal. But what is the 0x03 in front of it? I've looked at some other addi instructions in the disassembly, and everywhere there was a byte of something in the middle and the last byte was my number in hex.
I'm probably not taking the last column of the table ("data") into account, but I've failed to understand how to interpret it.
For the example above, the table says the data type is "any type" plus immediate mode, but what is this "any type"?
The size of the addi instruction is said to be any of b/w/l in the second (green) column of the table. Are these three things related: the blue data column's first sub-column (B, W, /), the green size column (B/W/L), and the pink sector of the scheme (00 = B, 01 = W, 10 = L)? I'm completely confused.
And the other problem: I don't understand the boundaries of the instructions. I've seen instructions that were at most 16 bits long (as shown in the general schema for each operation), but there are also "brief extension words" and "full extension words", and I can't completely follow what the book says about them. The only thing I think I understood is that the first 16 bits of the opcode are the "Single Effective Address Operation Word", and that's it.
This is my first attempt at understanding such a low level of programming.

Do what the CPU does with the first byte of the immediate data word of a byte-size instruction: ignore it.
By encoding the two size bits as "00", you told the CPU that you want to add an 8-bit immediate value to the byte-size part of d4. That means the upper byte of the immediate data word is not used, but the 68000 will still only read instructions word-wise. Thus, the upper half of this data word is simply "don't care": you can put anything in there without changing the effect of the instruction, because the CPU won't use it. (So the value "3" you see there in your case is irrelevant and probably just some random value left over from the assembler.)
If you encode the instruction as ".w" (that is, you want to do a 16-bit add), the upper byte of the data word becomes relevant. If you encode the very same instruction as ".l" (a 32-bit add), the assembler will add yet another word to the instruction and put the 32-bit immediate in those two words.
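To make the field layout concrete, here is a minimal C sketch (purely illustrative, not taken from any real disassembler) of how you might pull the pieces out of the two words 0x0604 0x0320, assuming they have already been read big-endian the way the 68000 fetches them:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t op   = 0x0604;  /* ADDI operation word */
    uint16_t data = 0x0320;  /* immediate data word */

    unsigned size = (op >> 6) & 0x3;  /* 00 = byte, 01 = word, 10 = long */
    unsigned mode = (op >> 3) & 0x7;  /* 000 = data register direct */
    unsigned reg  =  op       & 0x7;  /* register number, here 4 */

    /* For a byte-size ADDI only the low byte of the data word is used;
       the high byte (the 0x03 here) is "don't care" padding. */
    if (size == 0 && mode == 0)
        printf("addi.b #%u,%%d%u\n", data & 0xFF, reg);

    return 0;
}
Running this prints addi.b #32,%d4. For a .w instruction you would use the whole data word, and for .l you would read one more word and combine the two into a 32-bit immediate.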

Related

Is memory stored in ARM Assembly in binary?

Let's say I execute an instruction such as MOV R1, #9. Would this store the number 9 in binary in memory in ARM Assembly? Would the ARM Assembly code see the number 00001001 stored in the location of R1 in memory? If not, how would it store the decimal number 9 in binary?
The processor's storage and logic are composed of gates, which manipulate values as voltages. For simplicity, only two voltages are recognized: high and low, making a binary system.
The processor can manipulate (compute with) these values using simple boolean logic (AND, OR, etc..) — this is called combinational logic.
The processor can also store binary values for a time using cycles in logic (here think circular, as in cycles in a graph) called sequential logic. A cycle creates a feedback loop that will preserve values until told to preserve a new one (or power is lost).
Most anything meaningful requires multiple binary values, which we can call bit strings. We can think of these bit strings as numbers, characters, etc.., usually depending on context.
The processor's registers are an example of sequential logic used to store bit strings, here strings of 32 bits per register.
Machine code is the language of the processor: it interprets bit strings as sequences of instructions to carry out. Some bit strings command the processor to store values in registers, which, as described above, are built from sequential logic, here 32 bits per register; each bit stores its value as a voltage in a binary system.
The registers are storage, but the only way we can actually visualize them is to retrieve their values using machine code instructions to send these values to an output device for viewing as numbers, characters, colors, etc..
In short, if you send a number #9 to a register, it will store a certain bit string, and some subsequent machine code instruction will be able to retrieve that same bit pattern. Whether that 9 represents a numeric 9 or tab character or something else is a matter of interpretation, which is done by programming, which is source code ultimately turned into sequences of machine code instructions.
Everything is always binary in a computer; decimal and hex are just human-friendly notations for writing the values that a binary bit-pattern represents.
Even after you convert a binary integer to a string of decimal digits (e.g. by repeated division by 10 to get digit=remainder), those digits are stored one per byte in ASCII.
For example the digit '0' is represented by the ASCII / UTF-8 bit-pattern 00110000 aka hex 0x30 aka decimal 48.
We call this "converting to decimal" because the value is now represented in a way that's ready to copy into a text file or into video memory where it will be shown to humans as a series of decimal digits.
But really all the information is still represented in terms of binary bits which are either on or off. There are no single wires or transistors inside a computer that can represent a single 10-state value using e.g. one of ten different voltage levels. It's always either all the way on or all the way off.
Ternary logic (https://en.wikipedia.org/wiki/Three-valued_logic) is the usual example of non-binary logic. True decimal logic would be another instance of n-valued logic, where a "digit" can take one of 10 values instead of ternary's 3 (or binary's 2).
In practice CPUs are based on strictly binary (Boolean) logic. I mention ternary logic only to illustrate what it would really mean for computers to truly use decimal logic/numbers instead of just using binary to represent decimal characters.
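To make the "convert to decimal" step above concrete, here is a small, purely illustrative C sketch of the repeated-division approach: it turns the bit pattern held in an integer into ASCII digit bytes.
#include <stdio.h>

int main(void)
{
    unsigned value = 9;            /* the bit pattern 00001001, e.g. from a register */
    char text[12];
    int i = sizeof text - 1;

    text[i] = '\0';
    do {
        text[--i] = '0' + (value % 10);  /* '0' is the ASCII byte 0x30 */
        value /= 10;
    } while (value != 0);

    printf("%s\n", text + i);      /* prints "9", stored as the ASCII byte 0x39 */
    return 0;
}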
A processor has a number of places to store content; you can call it a memory hierarchy: „Memories“ that are very close to the processor are very fast and small, e.g. registers, and „memories“ that are far from the processor are slow and large, like a disk. In between are cache memories at various levels and the main memory.
Since these memories are technologically very different, they are accessed in different ways: registers are accessed by specifying them in assembly code directly (as R1 in MOV R1, #9), whereas the main memory storage locations are consecutively numbered with „addresses“.
So if you execute MOV R1, #9, the binary number 9 is stored in register 1 of the CPU, not in main memory, and an assembly code reading out R1 would read back number 9.
All numbers are stored in a binary format. The most well-known formats are integer (as in your example, binary 00001001 for decimal 9) and floating point.

LTO CM CRC Function

I'm looking for some advice on writing a C function to calculate a 16 bit CRC for an LTO RFID chip.
The spec says:
For commands and data that are protected by the 16-bit CRC, the generator polynomial shall be G(x) = x^16 + x^12 + x^5 + 1. The CRC bytes shall be generated by processing all bytes through a generator circuit. See figure F.11. Registers R0 to R15 shall be 1 bit wide, where R0 shall be the least significant bit and R15 the most significant bit. These registers shall be set to (6363) prior to the beginning of processing. The bytes shall be fed sequentially into the encoder, least significant bit first. After the bytes have been processed, the content of R0 is CRC0 and shall be the least significant bit. The content of R15 is CRC15 and shall be the most significant bit.
But I'm just a humble self-taught C programmer and that means nothing to me.
Can anybody help me with some code, or an explanation of the formula?
The diagram (figure F.11) in the ECMA-319 standard shows what to do, though it contains an error: the exclusive-or between R11 and R10 should have another input tapped off the wire going to R15.
The bits from the input come in the wire at the top, starting with the least significant bit from the first input byte. At each clock, each register is set to its input. The plus signs in circles are exclusive-or gates.
You can implement this in C with the bit-wise operations ^, &, and >>. Enjoy!
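Here is a minimal sketch of that circuit in C, under the assumption that it reduces to a standard reflected (LSB-first) CRC-16 with polynomial 0x1021 (bit-reversed: 0x8408) and initial register value 0x6363; the function name is made up, and you should verify the output against known-good CM data before trusting it:
#include <stdint.h>
#include <stddef.h>

uint16_t lto_cm_crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0x6363;                   /* R15..R0 preset value from the spec */

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];                      /* feed the next byte, LSB first */
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1)
                crc = (crc >> 1) ^ 0x8408;   /* 0x8408 = bit-reversed x^16+x^12+x^5+1 */
            else
                crc >>= 1;
        }
    }
    return crc;                              /* bit 0 = CRC0, bit 15 = CRC15 */
}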

Disassembly of a mixed ARM/Thumb2 ELF file

I'm trying to disassemble an ELF executable which I compiled using arm-linux-gnueabihf to target thumb-2. However, ARM instruction encoding is making me confused while debugging my disassembler. Let's consider the following instruction:
mov.w fp, #0
Which I disassembled using objdump and Hopper as a Thumb-2 instruction. The instruction appears in memory as 4ff0000b, which means that it's actually 0b00f04f (little endian). Therefore, the binary encoding of the instruction is:
0000 1011 0000 0000 1111 0000 0100 1111
According to the ARM architecture manual, it seems like ALL Thumb-2 instructions should start with 111[10|01|11]. Therefore, the above encoding doesn't correspond to any Thumb-2 instruction. Further, it doesn't match any of the encodings found in section A8.8.102 (page 484).
Am I missing something?
I think you're missing the subtle distinction that wide Thumb-2 encodings are not 32-bit words like ARM encodings, they are a pair of 16-bit halfwords (note the bit numbering above the ARM ARM encoding diagram). Thus whilst the halfwords themselves are little-endian, they are still stored in 'normal' order relative to each other. If the bytes in memory are 4ff0000b, then the actual instruction encoded is f04f 0b00.
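A small, purely illustrative C sketch of that halfword handling (the values are hard-coded from the question; nothing here comes from a real disassembler):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Bytes exactly as they appear in memory: 4f f0 00 0b */
    uint8_t mem[4] = { 0x4f, 0xf0, 0x00, 0x0b };

    /* Each halfword is little-endian, but the halfwords keep their order. */
    uint16_t hw1 = mem[0] | (mem[1] << 8);   /* 0xf04f */
    uint16_t hw2 = mem[2] | (mem[3] << 8);   /* 0x0b00 */

    /* 32-bit Thumb-2 encodings have 0b11101, 0b11110 or 0b11111 in the
       top five bits of the first halfword; anything else is a 16-bit instruction. */
    unsigned top5 = hw1 >> 11;
    if (top5 == 0x1D || top5 == 0x1E || top5 == 0x1F)
        printf("32-bit encoding: %04x %04x\n", hw1, hw2);
    else
        printf("16-bit encoding: %04x\n", hw1);

    return 0;
}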
Thumb-2 consists of extensions to the Thumb instruction set: formerly undefined instructions, some of which are now defined. ARM is a completely different instruction set. If the toolchain has not left you clues as to what code is Thumb vs ARM, then the only way to figure it out is to start with an assumption at an entry point and disassemble in execution order from there, and even then you might not figure out some of the code.
You cannot distinguish ARM instructions from Thumb or Thumb+Thumb-2 extensions simply by bit pattern. Also remember that ARM instructions are aligned on 4-byte boundaries whereas Thumb instructions are 2-byte aligned, and a Thumb-2 extension doesn't have to be in the same 4-byte boundary as its parent Thumb halfword, making this all that much more fun. (Thumb+Thumb-2 is a variable-length instruction set made from multiples of 16-bit values.)
If all of your code is Thumb and there are no ARM instructions in there, you still have the problem you would have with any variable-length instruction set, and to do it right you have to walk the code in execution order. For example, it would not be hard to embed a data value in .text that looks like the first half of a Thumb-2 extension and follow that with a real Thumb-2 extension, causing your disassembler to go off the rails. It's an elementary variable-word-length disassembly problem (and an elementary way to defeat simple disassemblers).
Take 16-bit words A, B, C, D.
Suppose C + D are a Thumb-2 instruction (which is known by decoding C), A is, say, a Thumb instruction, and B is a data value that resembles the first half of a Thumb-2 extension. Then, linearly decoding through RAM, A is the Thumb instruction, B and C are decoded as a Thumb-2 extension, and D, which is actually the second half of a Thumb-2 extension, is now decoded as the first 16 bits of an instruction; all bets are off as to how that decodes or whether it causes all or many of the following instructions to be decoded wrong.
So start off by looking to see if the ELF tells you something. If not, then you have to make passes through the code in execution order (you have to make an assumption as to an entry point), following all the possible branches and linear execution, marking 16-bit sections as the first or additional halfwords of instructions; the unmarked blocks cannot necessarily be determined to be instruction vs data, and care must be taken.
And yes, it is possible to play other games to defeat disassemblers, e.g. intentionally branching into the second half of a Thumb-2 instruction which is hand-crafted to be a valid Thumb instruction or the beginning of another Thumb-2 instruction.
With fixed-length instruction sets like ARM and MIPS you can linearly decode; some data decodes as strange or undefined instructions, but your disassembler doesn't go off the rails and fail to do its job. With variable-length instruction sets, disassembly is at best a guess... the only way to truly decode is to execute the instructions the same way the processor would.

Writing a C function that uses pointers and bit operators to change one bit in memory?

First off, here is the exact wording of the problem I need to solve:
The BIOS (Basic Input Output Services) controls low-level I/O on a computer. When a computer first starts up, the system BIOS creates a data area starting at memory address 0x400 for its own use. Address 0x0417 is the keyboard shift flags register; the bits of this byte have the following meanings:
Bit Value Meaning
7 0/1 Insert off/on
6 0/1 CapsLock off/on
5 0/1 NumLock off/on
4 0/1 ScrollLock off/on
3 0/1 Alt key up/down
2 0/1 Control key up/down
1 0/1 Left shift key up/down
0 0/1 Right shift key up/down
This byte can be written as well as read. Thus we may change the status of the CapsLock, NumLock and ScrollLock LEDs on the keyboard by setting or clearing the relevant bit.
Write a C function using pointers and bit operators to turn Caps lock on without changing the other bits.
Our teacher didn't go over this at all, and I've referenced the textbook and conducted many Google searches looking for some help.
I understand how bitwise operators work, and understand that the solution is to OR this byte with the binary value '00000010'. However, I'm stuck when it comes to implementing this. How do I write this in C code? I don't know how to declare a pointer to exactly 1 byte of memory. Besides that, I'm assuming the answer looks like the following (with byte replaced with something proper):
byte* b_ptr = 0x417;
(*b_ptr) |= 00000010;
Is the above solution correct?
unsigned char is the typical synonym for byte. You can typedef byte to mean that if it's not available.
That notation you're using isn't binary. The notation for a bit mask is easiest in hex, so I'd just use 0x02 instead of the 00000010, which is actually octal notation for a literal number (a leading 0 makes an integer literal octal in C).
I think that you've reversed the order of the bits. The table numbers the bits from most significant (7) to least significant (0), so in the solution you propose the bitmask 00000010 is the left shift key; CapsLock is bit 6, i.e. the mask 0x40.
Otherwise, assuming byte is typedef'd to unsigned char, byte *b_ptr = (byte *) 0x417; (note the cast from integer to pointer) will point you at memory address 0x417, which is what you want.
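Putting those pieces together, a minimal sketch of the kind of function the assignment seems to want might look like the following; it assumes an environment (real-mode DOS or similar) where physical address 0x417 is directly addressable, and the function name is just illustrative:
typedef unsigned char byte;

void caps_lock_on(void)
{
    /* Point at the BIOS keyboard shift flags byte at 0x0417. */
    volatile byte *flags = (volatile byte *) 0x417;

    /* Set bit 6 (CapsLock) without disturbing the other bits. */
    *flags |= 0x40;
}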

Is the machine code in LC-3 also known as assembly?

I'm a little confused over the concept of machine code...
Is machine code synonymous to assembly language?
What would be an example of the machine code in LC-3?
Assembly instructions (LD, ST, ADD, etc. in the case of the LC-3 simulator) correspond to binary instructions that are loaded and executed as a program. In the case of the LC-3, these "opcodes" are assembled into 16-bit strings of 1s and 0s that the LC-3 architecture is designed to execute accordingly.
For example, the assembly "ADD R4 R4 #10" corresponds to the LC-3 "machine code":
0001100100101010
Which can be broken down as:
0001 - ADD
100 - 4 in binary (destination register R4)
100 - 4 in binary (source register R4)
1 - indicates that we are adding an immediate value to a register
01010 - 10 in binary (the 5-bit immediate)
Note that each opcode has a distinct binary equivalent, so there are 2^4=16 possible opcodes.
The LC-3 sneakily deals with this limited opcode space by introducing flag bits in certain instructions. For ADD, bit 5 is that flag: it is 1 when we are adding an immediate value and 0 when we are adding two registers. For example, "ADD R4 R4 R7" (a register instead of a value) would be encoded as 0001 100 100 0 00 111, with bit 5 cleared and bits 4-3 required to be 00.
This machine code instructs the LC-3 to add decimal 10 to the value in register 4, and store the result in register 4.
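As a purely illustrative aside, here is how a small C snippet might pull those fields back out of the 16-bit word (0001100100101010 = 0x192A); none of this comes from the LC-3 tools themselves:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t instr = 0x192A;                /* 0001 100 100 1 01010 */

    unsigned opcode = (instr >> 12) & 0xF;  /* 0001 = ADD */
    unsigned dr     = (instr >> 9)  & 0x7;  /* destination register */
    unsigned sr1    = (instr >> 6)  & 0x7;  /* first source register */

    if (instr & 0x20) {                     /* bit 5 set: immediate form */
        int imm5 = instr & 0x1F;
        if (imm5 & 0x10)                    /* imm5 is sign-extended */
            imm5 -= 32;
        printf("opcode %u: ADD R%u R%u #%d\n", opcode, dr, sr1, imm5);
    } else {                                /* bit 5 clear: register form */
        unsigned sr2 = instr & 0x7;
        printf("opcode %u: ADD R%u R%u R%u\n", opcode, dr, sr1, sr2);
    }
    return 0;
}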
An "Assembly language" is a symbolic (human-readable) language, which a program known as the "assembler" translates into the binary form, "machine code", which the CPU can load and execute. In LC-3, each instruction in machine code is a 16-bit word, while the corresponding instruction in the assembly language is a line of human-readable text (other architectures may have longer or shorter machine-code instructions, but the general concept it the same).
The above stands true for any high level language (such as C, Pascal, or Basic). The differece between HLL and assembly is that each assembly language statement corresponds to one machine operation (macros excepted). Meanwhile, in a HLL, a single statement can compile into a whole lot of machine code.
You can say that assembly is a thin layer of mnemonics on top of machine code.
