Detecting and extracting opcode sequences - database

May I get an explanation about what opcode sequences are and how to find them in PE32 files?
I am trying to extract them from PE32 files.

what opcode sequences are
A CPU instruction is composed of one or more bytes, and each of those bytes has a different meaning.
An opcode (operation code) is the part of an instruction that defines the behavior of the instruction itself (as in, this instruction is an 'ADD', or an 'XOR', a NOP, etc.).
For x86 / x64 CPUs (IA-32 and IA-32e in Intel lingo) an instruction is composed of at least an opcode (1 to 3 bytes), but can come with multiple other bytes (various prefixes, ModR/M, SIB, Disp. and Imm.) depending on its encoding.
Opcode is often used as a synonym for "instruction" (since the opcode defines the behavior of the instruction); therefore when you have multiple instructions you have an opcode sequence (which is a bit of a misnomer since it's really an instruction sequence, unless all instructions in the sequence are composed only of opcodes).
how to find them in PE32 files?
As instructions can be multiple bytes long, you can't just start at a random location in the .text section (which, for a PE file, contains the executable code of the program). There's a specific location in the PE file - called the "entry point" - which defines the start of the program.
The entry point for a PE file is given by the AddressOfEntryPoint member of the IMAGE_OPTIONAL_HEADER structure (part of the PE header structures). Note that this member is an RVA, not a "full" VA.
From there you know you are at the start of an instruction. You can start disassembling / counting instructions from this point, following the encoding rules for instructions (these rules are explained at great length in the Intel and AMD manuals).
Most instructions are "fall-through", which means that once an instruction has executed, the next to execute is the one that follows it (this seems obvious, but!). The trick is that when you hit a non-fall-through instruction, you must know what this instruction does to continue your disassembly (e.g. it might jump somewhere, go to a specific handler, etc.).
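As a rough illustration of the entry point lookup described above, here is a minimal Python sketch that reads AddressOfEntryPoint straight from the headers and maps the RVA to a file offset through the section table. The function names are made up for this example, the field offsets are the ones documented for the PE format, and there is no handling of malformed or packed files, so treat it as a starting point rather than a robust parser.

```python
import struct

def entry_point_rva(pe: bytes) -> int:
    """Return AddressOfEntryPoint (an RVA) from a PE image read into memory."""
    if pe[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # e_lfanew (offset 0x3C of the DOS header) points at the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", pe, 0x3C)
    if pe[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The optional header follows the 4-byte signature and the 20-byte COFF
    # header; AddressOfEntryPoint sits 16 bytes into the optional header.
    optional_header = e_lfanew + 4 + 20
    (rva,) = struct.unpack_from("<I", pe, optional_header + 16)
    return rva

def rva_to_file_offset(pe: bytes, rva: int) -> int:
    """Map an RVA to a file offset using the section table (hypothetical helper)."""
    (e_lfanew,) = struct.unpack_from("<I", pe, 0x3C)
    coff = e_lfanew + 4
    (num_sections,) = struct.unpack_from("<H", pe, coff + 2)
    (opt_size,) = struct.unpack_from("<H", pe, coff + 16)
    table = coff + 20 + opt_size
    for i in range(num_sections):
        # Each section header is 40 bytes; after the 8-byte name come
        # VirtualSize, VirtualAddress, SizeOfRawData, PointerToRawData.
        vsize, vaddr, raw_size, raw_ptr = struct.unpack_from(
            "<4I", pe, table + 40 * i + 8)
        size = vsize or raw_size
        if vaddr <= rva < vaddr + size:
            return rva - vaddr + raw_ptr
    raise ValueError("RVA is not inside any section")
```

The file offset returned for the entry point is where a disassembler would start decoding the first instruction.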

Use the radare2 library; it can extract opcode sequences very quickly.

Related

Detecting Thumb-2 instruction and location of PC offset

I'm kinda new to ARM and I am trying to understand how instructions are interpreted/executed:
From what I know, ARM is quite simple since every instruction takes up 4 bytes and everything is aligned to 4 bytes as well.
The problem comes with Thumb-2, where instructions can be either 16 or 32 bits long. I've read that to determine if the current instruction is 16 or 32 bits long the processor reads a word (32 bits) and evaluates bits [15:11] of the first halfword. If those bits are 0b11101/0b11110/0b11111 then that halfword is the first halfword of a 32-bit instruction, else it's a 16-bit instruction (I don't quite get why those specific bits determine that). So an example should be:
0x4000 16-bit
0x4002 32-bit
0x4006 16-bit
0x4008 16-bit
0x400a 32-bit
Then the processor should grab from 0x4000 to 0x4004, evaluate the first halfword (0x4000 to 0x4002) and if the instruction is 16-bit then it just moves to the next halfword and repeats the process, but if the halfword indicates a 32-bit instruction then it consumes the next halfword as well and executes that 32-bit instruction?
Also, I'm confused about where PC points in Thumb-2. Is it still two instructions further?
Most of us don't/won't know exactly how it is implemented in the logic (and there are various cores, so each could be different). But what used to be undefined instructions became thumb-2 extensions: a couple dozen in armv6-m, then like 150 new ones in armv7-m.
Think of the processor fetching 16 bit instructions, and sometimes it runs across a variable length one. It is just like other variable length processors: the x86 will look at the one byte instruction, then based on that it may or may not need to look at the next byte, and so on until it has resolved the whole instruction. Same here, it looks at a halfword and determines if it has everything it needs; if not, it grabs the next halfword for the rest of the information.
0x4000 16-bit
0x4002 32-bit
0x4006 16-bit
0x4008 16-bit
0x400a 32-bit
The processor grabs 0x4000, sees it has what it needs, executes. The processor grabs 0x4002, sees it needs another halfword, grabs 0x4004, executes. The processor grabs 0x4006, has what it needs, executes. Grabs 0x4008, has what it needs, executes. Grabs 0x400A, sees it needs another halfword, grabs 0x400C, executes.
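The halfword-by-halfword walk above can be sketched in a few lines of Python. This only models the length decision from bits [15:11] mentioned in the question; the function names are invented for the example, and it says nothing about how a real core's fetch logic is built.

```python
import struct

def thumb_insn_length(first_halfword: int) -> int:
    """Return 4 if the halfword starts a 32-bit Thumb-2 encoding, else 2."""
    return 4 if (first_halfword >> 11) in (0b11101, 0b11110, 0b11111) else 2

def walk(code: bytes, base: int = 0):
    """Yield (address, length) pairs by walking little-endian Thumb code linearly."""
    offset = 0
    while offset + 2 <= len(code):
        (halfword,) = struct.unpack_from("<H", code, offset)
        length = thumb_insn_length(halfword)
        yield base + offset, length
        offset += length

# Halfwords: b.n, nop, nop, b.w (two halfwords), nop, nop, nop
code = struct.pack("<8H", 0xE005, 0xBF00, 0xBF00, 0xF000, 0xB802,
                   0xBF00, 0xBF00, 0xBF00)
for address, length in walk(code):
    print(f"{address:#06x} {length * 8}-bit")
```

Only the first halfword of each instruction is inspected; the second halfword of a 32-bit encoding is consumed without being classified.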
Those bit patterns were formerly undefined instructions, now they are part of the definition of a variable length instruction. Just like instructions that start with 0b010000 are data processing instructions and to determine is it an add or an xor, you have to look at other bits. These bit patterns define thumb-2 extensions then other bits in those two half words define what the full instruction is.
Why these bit patterns? You can think of it as arbitrary if you want. For all instruction sets, someone (or some group) sat down and decided what bit patterns were going to mean what, no different here. There was room in the instruction set space with certain patterns, so those were used. It is not uncommon to add instructions later in the life of a processor family, take x86 for example, plus many others. For an 8 bitter like x86 or 6502 or whatever, you can either consume an 8 bit instruction/opcode as your next new instruction, or you take that formerly unused byte/opcode and expand it into many more. For example, you take a byte/opcode that was unused and that byte now means look at the next byte; that next byte could be up to 256 new instructions, or it could simply supplement the first byte, specifying registers or operations, etc. No different here: down the road arm extended the thumb instruction set. Some percentage of the instruction is consumed indicating this is a variable length instruction, but of those 32 bits there still remain quite a few bits to allow for a larger instruction with more options (but losing the one to one relationship between thumb and arm instructions; all thumb instructions (not thumb-2 extensions) map directly into a full sized arm instruction).
Each core is different, they don't all fetch a word at a time. Thumb-2 extensions don't have to be aligned, so a whole thumb-2 instruction won't necessarily fit in an aligned word fetch for the processors that do word fetches. Think of the (pre)fetcher and decoder as two separate things, since they are. Functionally the decoder takes 16 bits at a time in thumb mode; how is it specifically implemented? I don't know. Do they wait for two half words to be ready before decoding the first? I don't know. Is every implementation the same? I don't know, would expect not. As far as fetching goes they are not the same, as you can see in the ARM documentation, and I think for at least one core, if not more, the chip vendor can choose the fetch size at compile time.
If you are coming from, for example, a MIPS based textbook and trying to understand other processors, this can be confusing. Understand that those text books and terms are for understanding and vocabulary; pipelines are not that deep in general and you don't fetch whole instructions at a time in general (the x86 does not fetch one byte at a time, it fetches MANY instructions at a time). Risc-v has even worse of a problem than arm and mips, as you can have 16 bit compressed instructions, 32 bit instructions, and 64 bit instructions. The 32 bit instructions do not have to be aligned on a risc-v (nor the 64 bit), so fetching 32 at a time doesn't get you a whole instruction. The fetcher is separate from the decoder; once enough is there, the decoder can complete.
I want to say that thumb is two ahead (independent of a thumb2 extension or not) so pc+4, should be easy to figure out though.
Disassembly of section .text:
00000000 <hello-0xe>:
0: e005 b.n e <hello>
2: bf00 nop
4: bf00 nop
6: f000 b802 b.w e <hello>
a: bf00 nop
c: bf00 nop
0000000e <hello>:
e: bf00 nop
Yes, so two thumb sized halfwords ahead (pc+4) in both cases. It would be significantly more complicated if it were two instructions ahead which is how it used to be to make it easy to remember. If it were two instructions ahead then sometimes pc+4, sometimes pc+6, and sometimes pc+8 the logic would have to decode two instructions in order to know how the pc was offset for the first of the two, so sticking with pc+4 as it has always been for thumb mode is the sane way to do it.
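To check the pc+4 claim against the listing above, here is a small Python sketch that computes the branch targets by hand. It assumes the unconditional branch encodings as I read them in the ARM ARM (the 16-bit form with an 11-bit immediate, and the 32-bit form with S, J1, J2, imm10 and imm11); the function names are made up for this example.

```python
def b_n_target(address: int, halfword: int) -> int:
    """Target of a 16-bit unconditional branch (11100 imm11)."""
    offset = (halfword & 0x7FF) << 1      # imm11:'0', a 12-bit value
    if offset & 0x800:                    # sign-extend from bit 11
        offset -= 0x1000
    return address + 4 + offset           # PC reads as instruction address + 4

def b_w_target(address: int, hw1: int, hw2: int) -> int:
    """Target of a 32-bit unconditional branch (11110 S imm10 / 10 J1 1 J2 imm11)."""
    s = (hw1 >> 10) & 1
    imm10 = hw1 & 0x3FF
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF
    i1 = 1 - (j1 ^ s)                     # I1 = NOT(J1 XOR S)
    i2 = 1 - (j2 ^ s)                     # I2 = NOT(J2 XOR S)
    offset = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    if s:                                 # sign-extend the 25-bit value
        offset -= 1 << 25
    return address + 4 + offset           # still +4, even for a 32-bit encoding

print(hex(b_n_target(0x0, 0xE005)))          # b.n at address 0
print(hex(b_w_target(0x6, 0xF000, 0xB802)))  # b.w at address 6
```

Both calls land on 0xe, the address of hello in the listing, which is only possible if the PC offset is 4 in both the 16-bit and the 32-bit case.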

Disassembly of a mixed ARM/Thumb2 ELF file

I'm trying to disassemble an ELF executable which I compiled using arm-linux-gnueabihf to target thumb-2. However, ARM instruction encoding is making me confused while debugging my disassembler. Let's consider the following instruction:
mov.w fp, #0
Which I disassembled using objdump and hopper as a thumb-2 instruction. The instruction appears in memory as 4ff0000b, which means that it's actually 0x0b00f04f (little endian). Therefore, the binary encoding of the instruction is:
0000 1011 0000 0000 1111 0000 0100 1111
According to ARM architecture manual, it seems like ALL thumb-2 instructions should start with 111[10|01|11]. Therefore, the above encoding doesn't correspond to any thumb-2 instruction. Further, it doesn't match any of the encodings found on section A8.8.102 (page 484).
Am I missing something?
I think you're missing the subtle distinction that wide Thumb-2 encodings are not 32-bit words like ARM encodings, they are a pair of 16-bit halfwords (note the bit numbering above the ARM ARM encoding diagram). Thus whilst the halfwords themselves are little-endian, they are still stored in 'normal' order relative to each other. If the bytes in memory are 4ff0000b, then the actual instruction encoded is f04f 0b00.
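The difference between the two readings is easy to see in Python: unpack the same four bytes once as a single 32-bit little-endian word and once as two 16-bit little-endian halfwords. This is only an illustration of the byte ordering, not a decoder.

```python
import struct

raw = bytes.fromhex("4ff0000b")            # bytes as they appear in memory

(word,) = struct.unpack("<I", raw)         # wrong view: one 32-bit word
hw1, hw2 = struct.unpack("<2H", raw)       # right view: two 16-bit halfwords

print(f"as one word:      {word:08x}")
print(f"as two halfwords: {hw1:04x} {hw2:04x}")
print(f"bits [15:11] of the first halfword: {hw1 >> 11:05b}")
```

Read as halfwords, the first one is f04f, and its top five bits are 11110, which is one of the prefixes that marks a 32-bit Thumb-2 encoding.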
Thumb2 instructions are extensions to the thumb instruction set: formerly undefined instructions, now some of them defined. Arm is a completely different instruction set. If the toolchain has not left you clues as to what code is thumb vs arm, then the only way to figure it out is to start with an assumption at an entry point and disassemble in execution order from there, and even then you might not figure out some of the code.
You cannot distinguish arm instructions from thumb or thumb+thumb2 extensions simply by bit pattern. Also remember arm instructions are aligned on 4 byte boundaries where thumb are 2 byte, and a thumb2 extension doesn't have to be in the same 4 byte boundary as its parent thumb, making this all that much more fun. (thumb+thumb2 is a variable length instruction set made from multiples of 16 bit values)
If all of your code is thumb and there are no arm instructions in there, then you still have the problem you would have with any variable length instruction set, and to do it right you have to walk the code in execution order. For example, it would not be hard to embed a data value in .text that looks like the first half of a thumb2 extension, and follow that by a real thumb2 extension, causing your disassembler to go off the rails. This is the elementary variable word length disassembly problem (and an elementary way to defeat simple disassemblers).
16 bit words A,B,C,D
Say C + D are a thumb2 instruction (which is known by decoding C), A is a thumb instruction, and B is a data value which resembles the first half of a thumb2 extension. Linearly decoding through ram, A is decoded as the thumb instruction, B and C are decoded as a thumb2 extension, and D, which is actually the second half of a thumb2 extension, is now decoded as the first 16 bits of an instruction. All bets are off as to how that decodes, or whether it causes all or many of the following instructions to be decoded wrong.
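The A, B, C, D scenario can be reproduced with a naive linear sweep in Python. The halfword values below are hypothetical, picked only so that B looks like the first half of a 32-bit encoding; the length rule is the bits [15:11] check, and nothing here decodes the instructions themselves.

```python
def thumb_insn_length(first_halfword: int) -> int:
    """Return 4 if the halfword starts a 32-bit Thumb-2 encoding, else 2."""
    return 4 if (first_halfword >> 11) in (0b11101, 0b11110, 0b11111) else 2

def linear_sweep(halfwords):
    """Return the halfword indices a naive linear decoder treats as instruction starts."""
    starts, index = [], 0
    while index < len(halfwords):
        starts.append(index)
        index += thumb_insn_length(halfwords[index]) // 2
    return starts

A = 0xBF00            # a real 16-bit instruction (nop)
B = 0xF000            # data that happens to look like the start of a 32-bit encoding
C, D = 0xF000, 0xB802 # a real 32-bit instruction (first and second halfword)

naive = linear_sweep([A, B, C, D])
print(naive)          # indices the naive sweep believes are instruction starts
```

The sweep reports starts at A, B and D; the real instruction beginning at C (index 2) is swallowed as the second half of the bogus B+C pair, which is exactly the desynchronization described above.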
So start off looking to see if the elf tells you something, if not then you have to make passes through the code in execution order (you have to make an assumption as to an entry point) following all the possible branches and linear execution to mark 16 bit sections as first or additional blocks for instructions, the unmarked blocks cannot be determined necessarily as instruction vs data, and care must be taken.
And yes, it is possible to play other games to defeat disassemblers, such as intentionally branching into the second half of a thumb2 instruction which is hand crafted to be a valid thumb instruction or the beginning of a thumb2.
With fixed length instruction sets like arm and mips, you can linearly decode; some data decodes as strange or undefined instructions, but your disassembler doesn't go off the rails and fail to do its job. With variable length instruction sets, disassembly at best is just a guess... the only way to truly decode is to execute the instructions the same way the processor would.

MASM Assembly Listing File - interpretation

I've created a listing file of my asm code using commands
cd c:\masm32\bin\
ml.exe /c /Fl"c:\path\file.lst" /Sc "c:\path\file.asm"
The lst file contains three columns: the first one is the hex address of the specific line, the third one is the opcode, but I don't understand the meaning of the values in the second column. I think it's called "timing" and the values are something like: 2 or 10m or even 7m,3. What is the meaning of these numbers, what do they represent?
With the /Sc command-line switch, which generates instruction timings, each line has this syntax:
offset [[timing]] [[code]]
The offset is the offset from the beginning of the current code segment. The timing shows the number of cycles the processor needs to execute the instruction. The value of timing reflects the CPU type; for example, specifying the .386 directive produces instruction timings for the 80386 processor. If the statement generates code or data, code shows the numeric value in hexadecimal notation if the value is known at assembly time. If the value is calculated at run time, the assembler indicates what action is necessary to compute the value.
When assembling under the default .8086 directive, timing includes an effective address value if the instruction accesses memory. The 80186/486 processors do not use effective address values. For more information on effective address timing, see the "Processor" section in the Reference book.
(source)
I'm not sure how much I'd trust those timing values unless you're actually going to execute the code on an 80486 or earlier processor.

Are all of the ARM opcodes 1 byte?

In x86 architecture there are one, two and three byte opcodes. What about ARM? For example, when I disassemble a binary, can I always take first byte as opcode?
No. There are now a few arm instruction sets. In the primary arm instruction set, all instructions are 32 bits and the decoding of the instruction is not isolated to one field within the instruction. The thumb instruction set is based on 16 bit instructions; same answer though, depending on the instruction the decoding is in different places. Then there are thumb2 extensions to the thumb instruction set, same answer. The thumb2 instructions start with what used to be an undefined thumb instruction, then add some more bits to distinguish which instruction it is.
Get one of the ARM Architecture Reference Manuals from infocenter.arm.com to see all of this. There are a couple of nice diagrams which show how this all works, the msbits are used to divide the instructions into different types then depending on the type the rest of the bits are decoded.
x86 comes from the 8 bit world where the designs tended to use one memory location(/access) (at the time that was a byte) for the more commonly used instructions, then have some opcodes multiplex into a second byte. Also immediates were added in their entirety. It was okay then, but is inefficient today. x86 should be the last instruction set you learn if ever, not representative of a good instruction set, you will need to unlearn a number of things to move forward.
Wikipedia has a number of instruction sets on the wiki page itself; for others it often has a direct link. For IP vendors like arm and mips, as well as many chip vendors, you can go right to their site and get the documentation.
ARM instructions don't really have a concept of 'opcodes'. I don't know the x86 instruction set very well, but what you describe sounds like 6502 machine code where some bytes identify which instruction to execute and others specify data that the instruction uses.
Ignoring Thumb, all ARM instructions are four bytes long. The operands used by the instruction are contained within those four bytes.
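To make the "no single opcode byte" point concrete, here is a small Python sketch that pulls a few fields out of one 32-bit ARM instruction. The field positions are the ones used for data-processing instructions (condition in bits [31:28], class in bits [27:25], operation in bits [24:21]); other instruction classes use those bits differently, and the function name is made up for this example.

```python
import struct

def arm_fields(word: int) -> dict:
    """Extract a few fields of a 32-bit ARM (A32) data-processing encoding."""
    return {
        "cond": (word >> 28) & 0xF,     # condition code, 0b1110 means "always"
        "class": (word >> 25) & 0x7,    # 0b001: data-processing with immediate
        "opcode": (word >> 21) & 0xF,   # 0b1101: MOV within that class
    }

raw = bytes.fromhex("0000a0e3")          # mov r0, #0 as stored in memory
(word,) = struct.unpack("<I", raw)       # ARM words are little-endian here
print(f"first byte in memory: {raw[0]:#04x}")
print(arm_fields(word))
```

The first byte in memory is 0x00 and carries only operand bits, while the bits that identify the instruction sit in the upper half of the word.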

The machine code in LC-3 are also known as assembly?

I'm a little confused over the concept of machine code...
Is machine code synonymous to assembly language?
What would be an example of the machine code in LC-3?
Assembly instructions (LD, ST, ADD, etc. in the case of the LC-3 simulator) correspond to binary instructions that are loaded and executed as a program. In the case of the LC-3, these "opcodes" are assembled into 16-bit strings of 1s and 0s that the LC-3 architecture is designed to execute accordingly.
For example, the assembly "ADD R4 R4 #10" corresponds to the LC-3 "machine code":
0001100100101010
Which can be broken down as:
0001 - ADD.
100 - 4 in binary (destination register R4)
100 - 4 in binary (source register R4)
1 - mode bit, indicates that we are adding an immediate value to a register
01010 - 10 in binary (5-bit immediate)
Note that the opcode field is 4 bits wide, so there are only 2^4=16 possible opcodes.
The LC-3 sneakily deals with this limit by introducing mode bits in certain instructions. For ADD, bit 5 changes depending on what we're adding. For example, if we are adding two registers (i.e. "ADD R4 R4 R7" as opposed to a register and a value), bit 5 would be 0 instead of 1, the next two bits would be 00, and the last three bits would hold the second source register (111).
This machine code instructs the LC-3 to add decimal 10 to the value in register 4, and store the result in register 4.
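For anyone who wants to check the bit pattern, here is a tiny Python sketch that assembles the two forms of ADD, going by the LC-3 field layout (4-bit opcode, 3-bit destination, 3-bit source, then either a mode bit of 1 plus a 5-bit immediate, or 000 plus a second source register). The function names are made up for this example.

```python
ADD_OPCODE = 0b0001

def encode_add_imm(dr: int, sr1: int, imm5: int) -> int:
    """Encode ADD DR, SR1, #imm5 (immediate form, bit 5 set)."""
    if not -16 <= imm5 <= 15:
        raise ValueError("imm5 must fit in 5 bits, two's complement")
    return (ADD_OPCODE << 12) | (dr << 9) | (sr1 << 6) | (1 << 5) | (imm5 & 0x1F)

def encode_add_reg(dr: int, sr1: int, sr2: int) -> int:
    """Encode ADD DR, SR1, SR2 (register form, bits [5:3] are zero)."""
    return (ADD_OPCODE << 12) | (dr << 9) | (sr1 << 6) | sr2

print(format(encode_add_imm(4, 4, 10), "016b"))  # ADD R4 R4 #10
print(format(encode_add_reg(4, 4, 7), "016b"))   # ADD R4 R4 R7
```

The first line printed is the same 16-bit string as the machine code shown above.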
An "Assembly language" is a symbolic (human-readable) language, which a program known as the "assembler" translates into the binary form, "machine code", which the CPU can load and execute. In LC-3, each instruction in machine code is a 16-bit word, while the corresponding instruction in the assembly language is a line of human-readable text (other architectures may have longer or shorter machine-code instructions, but the general concept is the same).
The above holds true for any high level language (such as C, Pascal, or Basic). The difference between an HLL and assembly is that each assembly language statement corresponds to one machine operation (macros excepted). Meanwhile, in an HLL, a single statement can compile into a whole lot of machine code.
You can say that assembly is a thin layer of mnemonics on top of machine code.
