Interpreting ARM/MachO with LLVM for analysis and optimization?

I've spent a great deal of time reading the LLVM source tree. It is quite an impressive piece of engineering!
Anyhow, I have been trying to convert some Mach-O ARM binaries that I have into LLVM bitcode for basic static analysis. Mainly, I'd like to create backwards static slices on certain calls depending on which registers are used. Additionally, I am trying to do forward propagation of obvious constants (for instance, loading a function name from the symbol table and passing it in a register).
At this point, I have been able to dump a file and parse it as native ARM assembly using this command line:
bash-3.2$ llvm-objdump -d ~/code/osx/HelloWorldThin -triple=thumb -mattr=+thumb2,+32bit,+v7,+v6t2,+thumb-mode,+neon
/Users/steve/code/osx/HelloWorldThin: file format Mach-O arm
Disassembly of section __TEXT,__text:
_main:
2fd4: f0 b5 push {r4, r5, r6, r7, lr}
2fd6: 03 af add r7, sp, #12
2fd8: 4d f8 04 8d str r8, [sp, #-4]!
2fdc: 0d 46 mov r5, r1
2fde: 06 46 mov r6, r0
2fe0: 00 f0 fe ef blx #4092
...snipped...
This is great, as it saves me a bunch of time writing a parser!
After looking through MachODump.cpp, I see that these are lowered to MCInst, which, as I understand it, is just a parsed opcode with its operands.
So my questions are:
1) Is there a way to convert from ARM to LLVM (for optimization passes, etc)? There is no need to emit back to ARM, only a need to have an analysis result.
1.5) I notice that all the analysis passes operate on Instruction rather than MCInst. Is there a way to promote an MCInst to an Instruction and provide the required information?
2) Is there a way to emulate/simulate ARM or LLVM instructions? I ask because things like slicing and constant propagation need dataflow analysis in order to determine what contents are in memory and registers.
Operations like this require tracking how data is loaded from and stored to memory, along with registers. Can LLVM understand the side effects of these instructions for analysis?
__text:000032DE LDR R1, [R0] ; "viewDidLoad"
__text:000032E0 MOV R0, SP
__text:000032E2 BLX _objc_msgSendSuper2
3) If it seems like I have a fundamental misunderstanding of something going on in LLVM, I'd love any feedback.
Thanks and let me know if I can provide any more information about my problem.

For static analysis of ARM binaries, it is better to translate the semantics of each ARM instruction directly to LLVM IR and apply data-flow analysis on the latter. For example, ADD rd, rd, rm in ARM can be translated to the LLVM IR %rd2 = add i32 %rd1, %rm1.
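To make the idea concrete, here is a minimal sketch in Python of what such a direct translation could look like for three-operand ALU instructions. It is not how any existing lifter works; the Lifter class, the ARM_TO_LLVM table and the %reg.N naming scheme are all made up for illustration, and flags, memory, and condition codes are ignored entirely.

```python
from collections import defaultdict

# Hypothetical table: ARM three-operand ALU mnemonic -> LLVM IR opcode.
ARM_TO_LLVM = {"add": "add", "sub": "sub", "and": "and", "orr": "or", "eor": "xor"}


class Lifter:
    """Toy lifter: every write to an ARM register creates a fresh SSA value."""

    def __init__(self):
        # Register name -> number of writes seen so far (0 = value on entry).
        self.version = defaultdict(int)

    def use(self, reg):
        return f"%{reg}.{self.version[reg]}"

    def define(self, reg):
        self.version[reg] += 1
        return f"%{reg}.{self.version[reg]}"

    def lift(self, line):
        mnemonic, operands = line.split(None, 1)
        rd, rn, rm = [op.strip() for op in operands.split(",")]
        # Read the sources before renaming the destination, so that
        # "ADD r0, r0, r1" uses the old r0 and defines a new one.
        lhs, rhs = self.use(rn), self.use(rm)
        dest = self.define(rd)
        return f"{dest} = {ARM_TO_LLVM[mnemonic.lower()]} i32 {lhs}, {rhs}"


lifter = Lifter()
ir = [lifter.lift("ADD r0, r0, r1"), lifter.lift("EOR r2, r0, r1")]
print("\n".join(ir))
```

Because each register write gets its own name, following the uses of a value backwards (for slicing) or substituting known constants forwards becomes a matter of following names, which is the reason for going through an SSA form in the first place.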
Decompilation of ARM machine code to C (for the purpose of recompiling it back to LLVM IR) is both cumbersome and unnecessary. Note that the focus of decompilers like IDA Pro is on binary understanding and not on recompilation per se. Therefore, you would have a hard time recompiling the software, and an even harder time linking your analysis results back to the original binary.
The following links might be useful:
Fracture is an open source project attempting to directly translate ARM binaries to LLVM IR.
LLBT is a research project that implemented ARM translation to LLVM IR. Their focus, however, is on static binary rewriting rather than binary analysis.
Note that you need a robust disassembler if you are considering analyzing stripped binaries. objdump can emit too many disassembly errors on binaries without symbols.
I'm in the early phases of a research project where we develop a processor description language that can make describing instruction semantics in LLVM IR easier. I'll update this answer when we have more results.

For (1) - not within the framework of LLVM. There's no "decompiler" in there. You're free to use an external decompiler that translates machine code into C, and then compile that into LLVM IR with clang. YMMV with regards to the quality of such a translation, of course.
(1.5) If I understand what you're asking, then no. Instruction and MCInst are quite different animals, very far apart in their abstraction levels. Read this: http://eli.thegreenplace.net/2012/11/24/life-of-an-instruction-in-llvm/
(2) Yes, LLVM has an interpreter you can use from the lli tool. It directly "emulates" LLVM IR without lowering it.

Related

Meaning of @ zero_extendqisi2

I was wondering what @ zero_extendqisi2 actually means in GCC assembly output, and how it is used. I couldn't find what qisi stands for or anything along those lines.
For context, the line is ldrb r3, [fp, #-9] @ zero_extendqisi2 and this is ARM on a Raspberry Pi Zero W, compiled with GCC. It shows up, for example, when reloading an unsigned char with conversion to int, with optimization disabled (GCC 9.2 with no options): https://godbolt.org/z/7xnfqh. Older GCC versions, all the way back to the earliest on Godbolt (4.5) and presumably earlier, print the same comment.
This is an RTL instruction name, included in the Standard Names list of the GCC internals manual under zero_extendmn2. Here m,n are the machine modes qi and si, which are respectively a byte and a 32-bit integer. So this is GCC's indication that it is generating an instruction which takes a byte (here loaded from memory) and zero-extends it into a 32-bit integer (here in the register r3). Which is exactly what the ARM ldrb instruction does.
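As a rough illustration of what "zero-extend a byte into a 32-bit integer" means, here is a small Python sketch. The function names are made up to mirror the qi/si mode names; the signed variant is only there for contrast (it corresponds to what ldrsb would do) and returns the raw 32-bit pattern.

```python
def zero_extend_qi_to_si(byte):
    """Widen 8 bits to 32 by filling the upper 24 bits with zeros (ldrb)."""
    return byte & 0xFF


def sign_extend_qi_to_si(byte):
    """Widen 8 bits to 32 by copying bit 7 into the upper 24 bits (ldrsb)."""
    byte &= 0xFF
    return (byte | 0xFFFFFF00) if (byte & 0x80) else byte


value = 0xF7  # an unsigned char with the top bit set
print(f"zero-extended: 0x{zero_extend_qi_to_si(value):08X}")
print(f"sign-extended: 0x{sign_extend_qi_to_si(value):08X}")
```

The two only differ when bit 7 of the byte is set, which is why the compiler has to pick the right one based on whether the C type is unsigned char or signed char.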
As far as I can tell, the trailing 2 is the number of operands the pattern takes (destination and source), following the same naming convention as three-operand patterns such as addm3.
As Peter points out, it's a little odd that GCC would include such a comment in the assembly without -fverbose-asm. Indeed the comment is coded in as part of the template string in the machine description file, arm.md. It could have been a debugging aid that some GCC developer added and then forgot to take out.
(If you submit this for your assignment, please cite this post properly.)

Debugging a compiled C program with GDB to learn Assembly programming

I'm very new to gdb. I wrote a very simple hello world program
#include <stdio.h>
int main() {
printf("Hello world\n");
return 0;
}
I compiled it with -g to add debugging symbols
gcc -g -o hello hello.c
I'm not sure what to do next since I'm not familiar with gdb. I'd like to be able to use gdb to inspect assembly code. That's what I was told on IRC.
First, start the program so that it stops exactly at the beginning of the main function.
(gdb) start
Switch to assembly layout to see assembly instructions interactively in a separate window.
(gdb) layout asm
Use the stepi or nexti commands to step through the program. You will see the current instruction pointer moving in the assembly window as you walk over the assembly instructions in your program.
printf is pretty much the last function you would want to use to learn assembly. Library calls would come later, and you don't need library or system calls at all to begin with. Using a debugger is going to lead you into a rat's nest of system calls as well. Try something like this, particularly if you want to learn assembly language from this exercise.
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a^b^3);
}
gcc -O2 -c so.c -o so.o
objdump -D so.o
Disassembly of section .text:
0000000000000000 <fun>:
0: 89 f0 mov %esi,%eax
2: 83 f0 03 xor $0x3,%eax
5: 31 f8 xor %edi,%eax
7: c3 retq
I highly recommend you avoid x86 as your first instruction set. Try something cleaner...
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-gcc -O2 -c -mthumb so.c -o so.o
arm-none-eabi-objdump -D so.o
00000000 <fun>:
0: 2303 movs r3, #3
2: 4059 eors r1, r3
4: 4048 eors r0, r1
6: 4770 bx lr
msp430-gcc -O2 -c so.c -o so.o
msp430-objdump -D so.o
00000000 <fun>:
0: 3f e0 03 00 xor #3, r15 ;#0x0003
4: 0f ee xor r14, r15
6: 30 41 ret
I'm dead serious about this one being the first instruction set. msp430 is close to it, but this one makes the most sense. Unfortunately the gnu assembler syntax doesn't match the books, and also unfortunately the world thought in octal then and we think in hex now...
pdp11-aout-gcc -O2 -c so.c -o so.o
pdp11-aout-objdump -D so.o
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 15c0 0003 mov $3, r0
8: 1d41 0006 mov 6(r5), r1
c: 7840 xor r1, r0
e: 1d41 0004 mov 4(r5), r1
12: 7840 xor r1, r0
14: 1585 mov (sp)+, r5
16: 0087 rts pc
There are nice simulators or hardware for all of these; it is better to learn in a simulator than on real hardware...
Most of the instruction sets I learned, I learned by writing a disassembler; arm and thumb fall into this category as they are fixed instruction length (if you avoid the thumb2 extensions). Or just write a simulator; msp430 and pdp11 fall into this category. Either of the latter is an afternoon project, either of the former is a long weekend project. You will know each instruction set better than the average person, even some who have been programming in it for a while.
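To give a feel for how small the start of such a disassembler is, here is a sketch in Python that decodes just the 16-bit Thumb encodings needed for the four halfwords in the thumb listing of fun (2303, 4059, 4048, 4770). The names REG, ALU and decode_thumb16 are made up for this sketch, only a handful of the ALU opcodes are filled in, and anything it does not recognize is printed as raw data.

```python
REG = [f"r{i}" for i in range(13)] + ["sp", "lr", "pc"]

# Subset of the 010000-prefixed two-register ALU group (opcode in bits 9..6).
ALU = {0b0000: "ands", 0b0001: "eors", 0b1100: "orrs", 0b1110: "bics"}


def decode_thumb16(hw):
    """Decode one 16-bit Thumb halfword into text, or fall back to .hword."""
    if hw >> 11 == 0b00100:                 # 00100 Rd imm8 : MOVS Rd, #imm8
        return f"movs {REG[(hw >> 8) & 7]}, #{hw & 0xFF}"
    if hw >> 10 == 0b010000:                # 010000 op Rm Rdn : ALU Rdn, Rm
        op = ALU.get((hw >> 6) & 0xF)
        if op is not None:
            return f"{op} {REG[hw & 7]}, {REG[(hw >> 3) & 7]}"
    if hw & 0xFF87 == 0x4700:               # 010001110 Rm 000 : BX Rm
        return f"bx {REG[(hw >> 3) & 0xF]}"
    return f".hword 0x{hw:04x}"


listing = [decode_thumb16(hw) for hw in (0x2303, 0x4059, 0x4048, 0x4770)]
print("\n".join(listing))
```

Growing this into a full Thumb disassembler is mostly a matter of adding one branch per encoding group from the architecture manual, which is what makes the exercise a good way to learn the instruction set.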
If you insist on x86 (I strongly urge you away from this), use an 8086/8088 simulator like pcemu and stick to the original instruction set; use nasm or a86 or whatever as needed to do this. It was not that nice an instruction set even back then, but the original makes more sense than what it has become. bitsavers has nice scanned, searchable versions of the original Intel documents, the best place to start.
The arm docs are at arm's site (look for the architectural reference manual for armv5, I think they call it now). For msp430, just look at wikipedia; the instruction set is there. For pdp11, google it, and go from C to machine code to disassembly to figure out the syntax.
If you really, really want to have fun, get the amber core from opencores. It is an arm2/3, almost all the instructions are the same as in armv4 and later, and you can use the gnu tools. Use verilator to build and simulate it and see a working processor from the inside. Understand that, just as giving 100 programmers a programming task gets you anywhere from 1 to 100 different solutions, giving 100 engineers the task of implementing an instruction set gets you anywhere from 1 to 100 different implementations. Arm itself has redesigned its cores for the same instruction sets several times over, to say nothing of the few legal clones.
Recommended order: pdp11, msp430, thumb, arm, then mips, and then, if you still feel you need to, disassemble some x86. PIC12/14 is simple and educational (it should take you half an hour to an hour to make a simulator for that). 6502, z80, 8051, 6800 and a number of others are, like x86, historically educational: worth reading the documentation for, but you don't need to write programs for them. If you start with a good one, each instruction set from the second one on is that much easier. They are more alike than different, but you do get to see different things, like how to do things without flags in mips. I have left out several other instruction sets that are either still available in silicon or are interesting for various reasons.
Another approach is to install clang/llvm and take a quick or longer look at every instruction set that llc can produce (compile to bitcode, then use llc as the backend for whatever instruction set you like). As above, taking the same code and seeing what it looks like in different instruction sets, at least with that compiler and its settings, is very educational and helps you get a mental feel for how to break programming tasks down into these atomic steps.

Illegal instruction when running simple ELLCC-generated ELF binary on a Raspberry Pi

I have an empty program in LLVM IR:
define i32 @main(i32 %argc, i8** %argv) nounwind {
entry:
ret i32 0
}
I'm cross-compiling it on Intel x86-64 Windows for ARM Linux using ELLCC, with the following command:
ecc++ hw.ll -o hw.o -target arm-linux-engeabihf
It completes without errors and generates an ELF binary.
When I take the binary to a Raspberry Pi Model B+ (running Raspbian), I get only the following error:
Illegal instruction
I don't know how to tell what's wrong from the disassembled code. I tried other ARM Linux targets but the behavior was the same. What's wrong?
The exact same file builds, links and runs fine for other targets like i386-linux-eng, x86_64-w64-mingw32, etc (that I could test on), again using the ELLCC toolchain.
Assuming the library and startup code isn't at fault, this is what the disassembly of main itself looks like:
.text:00010188 e24dd008 sub sp, sp, #8
.text:0001018c e3002000 movw r2, #0
.text:00010190 e58d0004 str r0, [sp, #4]
.text:00010194 e1a00002 mov r0, r2
.text:00010198 e58d1000 str r1, [sp]
.text:0001019c e28dd008 add sp, sp, #8
.text:000101a0 e12fff1e bx lr
I'd guess it's choking on the movw at 0x0001018c. The movw/movt encodings which can handle full 16-bit immediate values first appeared in the ARMv6T2 version of the architecture - the ARM1176 in the original Pi models predates that, only supporting original ARMv6*.
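A quick way to check a dump for this without reading every line is to test each 32-bit word against the MOVW bit pattern (cond 0011 0000 imm4 Rd imm12). The sketch below is in Python; the helper names is_movw and describe_movw are invented here, and the word list is simply copied from the disassembly of main in the question.

```python
def is_movw(word):
    """True if an A32 instruction word matches cond 0011 0000 imm4 Rd imm12."""
    return (word >> 28) != 0xF and ((word >> 20) & 0xFF) == 0x30


def describe_movw(word):
    """Pull the destination register and 16-bit immediate out of a MOVW word."""
    rd = (word >> 12) & 0xF
    imm16 = (((word >> 16) & 0xF) << 12) | (word & 0xFFF)
    return f"movw r{rd}, #{imm16}"


BASE = 0x00010188
MAIN_WORDS = [0xE24DD008, 0xE3002000, 0xE58D0004, 0xE1A00002,
              0xE58D1000, 0xE28DD008, 0xE12FFF1E]

# Addresses and text of every MOVW found in main.
hits = [(BASE + 4 * i, describe_movw(w))
        for i, w in enumerate(MAIN_WORDS) if is_movw(w)]
for addr, text in hits:
    print(f"{addr:08x}: {text}")
```

For comparison, the classic encoding of mov r2, #0 is e3a02000, which does not match this pattern and is available on every ARM architecture version.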
You need to tell the compiler to generate code appropriate to the thing you're running on - I don't know ELLCC, but I'd guess from this it's fairly modern and up-to-date and thus defaulting to something newer like ARMv6T2 or ARMv7. Otherwise, it's akin to generating code for a Pentium and hoping it works on an 80486 - you might be lucky, you might not. That said, there's no good reason it should have chosen that encoding in the first place - it's not as if 0 can't be encoded in a 'classic' mov instruction...
The decadent option, however, would be to consider this a perfect excuse to replace the Pi with a Pi 2 - the Cortex-A7s in that are nice capable ARMv7 cores ;)
* Lies for clarity. I think 1176 might actually be v6K, but that's irrelevant here. I'm not sure if anything actually exists as plain ARMv6, and all the various architecture extensions are frankly a hideous mess

Where is declaration for get_pc() in GNU ARM?

I'm building legacy code using the GNUARM C compiler and trying to resolve all the implicit declarations of functions.
I've come across some ARM specific functions and can't find the header file containing the declarations for these functions:
get_pc
get_cpsr
get_sp
I have searched the web and only came up with source code containing these functions without any non-standard include files.
I'll also settle for the function declarations.
Since I will also be porting the code to the Cygwin / Windows platform, what are the equivalent declarations for Cygwin GNU GCC?
Thanks.
Just write your own if you really need those functions, asm is easier than inline asm:
.globl get_pc
get_pc:
mov r0,pc
bx lr
.globl get_sp
get_sp:
mov r0,sp
bx lr
.globl get_cpsr
get_cpsr:
mrs r0,cpsr
bx lr
That covers arm, at least. If you are porting to x86 and need the equivalents, I have to wonder what the code needs with those things anyway. For the cpsr in particular, you would likely have to change any code that uses the result, as the status registers across processor vendors/families pretty much never match. The x86 equivalents should still be about the same level of effort: it takes longer to do a google search and read the results than it does to just write the code (if you know the processor).
Depending on what your application is doing it is probably better to just comment out any code that calls those functions and/or uses the return value. I can imagine a few reasons why those items would be used, but it could get into architecture specific stuff and that is more involved than just porting a few register read functions. So what user786653 asked is the key question. How are these functions used? Not where can I find them but how are they used and why do you think you need them.
Are you sure those are functions? I'm not very familiar with ARM, but those sound like compiler intrinsics to me. If you're moving to GCC, you might be better off replacing those with inline assembly.

68000, portable JIT library

There are several JIT libraries, but is there one that emits Motorola 68000-family instructions, for instance for the 68000, 68040, 68060 or any of the Coldfire CPUs?
Bonus points if it could emit for other platforms too, but 68k is most important.
Something easily integrated with C is preferred, but other languages are interesting too.
Ideally something like libjit, but with a 68k backend.
Although this doesn't really answer your question, you could consider generating the 68k machine code yourself. It shouldn't be too terribly difficult if you are already familiar with 68k assembly.
The Motorola M68000 Family Programmer's Reference Manual documents the syntax, availability, and bit configuration of every 680x0 instruction. However, a less tedious way to figure out the machine code for instructions is to use a 68k assembler that can generate a listing of the hex codes for each instruction produced. If you're on Windows, Easy68K should be able to generate such a listing, but I haven't tried it myself.
If you're not on Windows, you could try this assembler (only supports 68000, I think). You'll have to blow the dust off of it, but it works (at least in Linux). The command-line assembler (assembler/asm) has a -l flag that tells the assembler to generate a listing. Example:
$ asmlab/assembler/asm -ln test.asm
68000 Assembler by PGM
No errors detected
No warnings generated
test.asm
Leading space is required before each instruction, and the assembler doesn't handle whitespace between tokens well.
move.l #$12345678,-(a6)
jmp ($12345678)
rts
test.LIS
00000000 2D3C 12345678 1 move.l #$12345678,-(a6)
00000006 4EF9 12345678 2 jmp ($12345678)
0000000C 4E75 3 rts
No errors detected
No warnings generated
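Once you have the hex codes from a listing like this, emitting them yourself is straightforward. Here is a minimal Python sketch that builds the same three instructions as raw big-endian bytes; the function names are made up for illustration, the opcode words (2D3C, 4EF9, 4E75) are taken directly from the listing above, and each helper covers only the one addressing-mode combination shown there.

```python
import struct


def move_l_imm_predec_a6(imm):
    """move.l #imm,-(a6): opcode word 0x2D3C followed by a 32-bit immediate."""
    return struct.pack(">HI", 0x2D3C, imm & 0xFFFFFFFF)


def jmp_abs_long(addr):
    """jmp (addr).L: opcode word 0x4EF9 followed by a 32-bit absolute address."""
    return struct.pack(">HI", 0x4EF9, addr & 0xFFFFFFFF)


def rts():
    """rts: single opcode word 0x4E75."""
    return struct.pack(">H", 0x4E75)


# 68k is big-endian, hence the ">" in every format string.
code = move_l_imm_predec_a6(0x12345678) + jmp_abs_long(0x12345678) + rts()
print(code.hex())
```

A real code generator would compute the opcode word from the size, register and addressing-mode fields instead of hard-coding it, but the byte-emitting part stays this simple.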
