arm (bare metal): call binary file as function - c

I have the AT91Bootloader for an AT91SAM9 ARM controller. I need to add some extra hardware initialization, but I only have a compiled .bin file.
I loaded the .bin file into memory and tried to call it:
((void (*)())0x00005000)();
But I get no results. Please use as little assembler as possible. I was introduced to assembler before, but I cannot understand ARM assembler because of its complexity. How can I make a call from the middle of the bootloader, execute the .bin file (it will be in some memory sector, 0x00005000 for example), and then return to the bootloader and continue executing its own code?

If ARM asm is "too complex", you will find it very difficult to debug any problems you're having. Basic* ARM assembly is one of the least complex assembly languages I've come across.
Your code ought to work (though I would not use a hard-coded address there) provided the ".bin" is of the correct format. Common issues:
The entry point should be ARM code; some compilers default to Thumb. It's possible (if a little tricky) to make Thumb code work.
The entry point needs to be at the start of the file. Without disassembling, it's hard to tell if you've done this correctly.
The linker will insert "thunks" (a.k.a. "stubs") where necessary. A quirk in some linkers means that the thunk can be placed before the entry point. You can work around this by using --stub-group-size=-1 (docs here).
* Ignoring things like Thumb/VFP/NEON which you don't need to get started.
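As a rough illustration of the ARM/Thumb point, here is a minimal sketch of how the call address could be formed in C. The names branch_target and call_image are made up for this example and are not part of AT91Bootloader. It assumes a core with ARM/Thumb interworking and a compiler that emits an interworking-capable indirect call, in which case bit 0 of the target address selects the instruction set.

```c
#include <stdint.h>

/* Hypothetical helper names; nothing here comes from AT91Bootloader. */
typedef void (*image_entry_fn)(void);

/* On ARM cores with interworking, bit 0 of an indirect branch target
 * selects the instruction set: 0 = ARM, 1 = Thumb. */
uintptr_t branch_target(uintptr_t load_addr, int entry_is_thumb)
{
    return entry_is_thumb ? (load_addr | (uintptr_t)1) : load_addr;
}

/* Calls the image and comes back here if the image ends with a normal
 * function return. Only meaningful on the target, with the image loaded. */
void call_image(uintptr_t load_addr, int entry_is_thumb)
{
    image_entry_fn entry =
        (image_entry_fn)branch_target(load_addr, entry_is_thumb);
    entry();
}
```

With the address from the question, call_image(0x00005000, 0) would behave like the original cast, while passing 1 as the second argument would enter the image in Thumb state.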

ARM assembly is one of the simpler ones, very straightforward. If you want to continue to do bare metal you are going to need to learn at least some assembly, for example to understand Alexey's comment.
The instruction you are looking for is BX, which branches to an address. The assembly you need to branch to the code your bootloader downloaded is:
.globl tramp
tramp:
bx r0
The C prototype is
void tramp ( unsigned int address );
As mentioned in the comments, the program needs to be compiled for the address you are running it from and/or it needs to be position independent, otherwise it won't work. You also need to build the application with the proper entry point: if it is a raw binary and you branch to the address where the binary was loaded, the first word in the binary has to be the entry point for execution.
Also understand that an ELF file contains the data you want to load, but the file as a whole is not the data you want to load. It is a "binary file", yes, but to run the program contained in it you need to parse it, extract the loadable portions, and load them in the right places.
If you don't know what those terms mean, use Google and/or search SO; the answers are there.
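To make the point about ELF files concrete, here is a minimal sketch of extracting the loadable portions. It assumes a little-endian ELF32 image already sitting in a buffer, and it copies each PT_LOAD segment to its physical address (p_paddr), which is an assumption that suits many bare-metal setups but not all. The function name elf32_load and the mem/mem_base parameters are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PT_LOAD_TYPE 1u  /* p_type value of a loadable segment */

/* Read little-endian fields without assuming host alignment or endianness. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Copies every PT_LOAD segment of a little-endian ELF32 image into `mem`,
 * which stands for target memory starting at physical address `mem_base`.
 * Returns 0 and stores the entry point on success, -1 on any error. */
int elf32_load(const uint8_t *img, size_t len,
               uint8_t *mem, size_t mem_size, uint32_t mem_base,
               uint32_t *entry)
{
    if (len < 52 || memcmp(img, "\x7f" "ELF", 4) != 0)
        return -1;
    if (img[4] != 1 || img[5] != 1)      /* EI_CLASS = 32-bit, EI_DATA = LSB */
        return -1;

    uint32_t phoff     = rd32(img + 28);  /* e_phoff */
    uint16_t phentsize = rd16(img + 42);  /* e_phentsize */
    uint16_t phnum     = rd16(img + 44);  /* e_phnum */
    if (phentsize < 32)
        return -1;

    for (uint16_t i = 0; i < phnum; i++) {
        size_t off = (size_t)phoff + (size_t)i * phentsize;
        if (off > len || len - off < 32)
            return -1;
        const uint8_t *ph = img + off;
        if (rd32(ph) != PT_LOAD_TYPE)
            continue;                     /* not a loadable segment */
        uint32_t p_offset = rd32(ph + 4);
        uint32_t p_paddr  = rd32(ph + 12);
        uint32_t p_filesz = rd32(ph + 16);
        uint32_t p_memsz  = rd32(ph + 20);
        if (p_filesz > p_memsz || p_offset > len || p_filesz > len - p_offset)
            return -1;
        if (p_paddr < mem_base || p_paddr - mem_base > mem_size ||
            p_memsz > mem_size - (p_paddr - mem_base))
            return -1;
        uint8_t *dst = mem + (p_paddr - mem_base);
        memcpy(dst, img + p_offset, p_filesz);
        memset(dst + p_filesz, 0, p_memsz - p_filesz);  /* zero-filled tail */
    }
    *entry = rd32(img + 24);              /* e_entry */
    return 0;
}
```

A raw .bin needs none of this, which is why it can simply be copied to its link address and branched to.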

Related

Instruction disassembler ARM. [ARM/Thumb mode]

I would like to ask you how to determine in which ISA (ARM/Thumb/Thumb-2) an instruction is encoded?
First of all, I tried to do it following the instructions here (section 4.5.5).
However, when I use readelf -s ./arm_binary, and arm_binary was built in release mode, it appears that there is no .symtab in the binary. And anyway, I don't understand how to use this command to find the type for the instructions.
Secondly, I know the other way to differentiate is to look at the PC address for the ARM/Thumb instruction. If it is odd then it is a Thumb instruction; if it is even, then ARM. But how can I do this without loading the file into memory? When I parse the sections of the file and find the executable section, all that I have is the start (offset) location in the file, and the file offset is always even, because instructions are 2 or 4 bytes in size...
Finally, the last way to check is to detect BX Rm, extract the value from Rm, and then check whether that address is even or odd. But this may be difficult, because I would need to emulate the whole program.
So what is the correct way to identify the ISA for disassembly?
Thank you for your attention and I hope you will help me.
I don't believe it's possible to tell, in a mixed mode binary, without inspecting the instructions as you describe.
If the whole file will be one ISA or the other, then you can determine the ISA of the entry point by running this:
readelf -h ./arm_binary
And checking whether the entry point is even or odd: an odd address (bit 0 set) means Thumb, an even one means ARM.
However, what I would do is simply disassemble it both ways, and see what looks right. As long as you start the disassembly at the start of a function (or any 4-byte boundary), then this will work fine. Most code will produce nonsense when disassembled in the wrong ISA.
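The entry-point check can also be done programmatically. The following is a minimal sketch, assuming a little-endian ELF32 file for ARM whose first 52 bytes have been read into a buffer; the function name elf32_entry_is_thumb is invented for this example. Like the readelf approach, it says nothing about the rest of a mixed-mode binary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the ELF32 entry point has bit 0 set (Thumb), 0 if it is
 * clear (ARM), and -1 if the buffer is not a little-endian ELF32 header
 * for ARM. */
int elf32_entry_is_thumb(const uint8_t *hdr, size_t len)
{
    if (len < 52 || memcmp(hdr, "\x7f" "ELF", 4) != 0)
        return -1;
    if (hdr[4] != 1 || hdr[5] != 1)      /* EI_CLASS = 32-bit, EI_DATA = LSB */
        return -1;
    if (hdr[18] != 40 || hdr[19] != 0)   /* e_machine must be EM_ARM (40) */
        return -1;
    /* e_entry lives at byte offset 24; only its lowest bit matters here. */
    return hdr[24] & 1;
}
```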

How do I determine the start and end of instructions in an object file?

So, I've been trying to write an emulator, or at least understand how stuff works. I have a decent grasp of assembly, particularly z80 and x86, but I've never really understood how an object file (or in my case, a .gb ROM file) indicates the start and end of an instruction.
I'm trying to parse out the opcode for each instruction, but it occurred to me that it's not like there's a line break after every instruction. So how does this happen? To me, it just looks like a bunch of bytes, with no way to tell the difference between an opcode and its operands.
For most CPUs - and I believe Z80 falls in this category - the length of an instruction is implicit.
That is, you must decode the instruction in order to figure out how long it is.
If you're writing an emulator you don't really ever need to be able to obtain a full disassembly. You know what the program counter is now, you know whether you're expecting a fresh opcode, an address, a CB page opcode or whatever and you just deal with it. What people end up writing, in effect, is usually a per-opcode recursive descent parser.
To get to a full disassembler, most people impute some mild simulation, recursively tracking flow. Instructions are found, data is then left by deduction.
Not so much on the GB where storage was plentiful (by comparison) and piracy had a physical barrier, but on other platforms it was reasonably common to save space or to effect disassembly-proof code by writing code where a branch into the middle of an opcode would create a multiplexed second stream of operations, or where the same thing might be achieved by suddenly reusing valid data as valid code. One of Orlando's 6502 efforts even re-used some of the loader text — regular ASCII — as decrypting code. That sort of stuff is very hard to crack because there's no simple assembly for it and a disassembler therefore usually won't be able to figure out what to do heuristically. Conversely, on a suitably accurate emulator such code should just work exactly as it did originally.
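To illustrate the earlier point that the length of an instruction is implied by its first byte, here is a minimal sketch for a handful of Game Boy CPU opcodes. The table is deliberately incomplete and the function names are invented for this example; a real emulator would cover the whole opcode map.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of a few Game Boy CPU instructions, keyed by first byte.
 * Deliberately incomplete: returns 0 for opcodes this sketch does not know. */
size_t gb_insn_length(uint8_t opcode)
{
    switch (opcode) {
    case 0x00: return 1;   /* NOP */
    case 0xC9: return 1;   /* RET */
    case 0x06: return 2;   /* LD B,d8 */
    case 0x3E: return 2;   /* LD A,d8 */
    case 0x18: return 2;   /* JR r8 */
    case 0xCB: return 2;   /* prefix byte + one extended opcode byte */
    case 0x01: return 3;   /* LD BC,d16 */
    case 0xC3: return 3;   /* JP a16 */
    case 0xCD: return 3;   /* CALL a16 */
    default:   return 0;
    }
}

/* Walks `code` linearly from offset 0 and counts whole instructions.
 * Stops at the first unknown opcode or truncated instruction. */
size_t gb_count_insns(const uint8_t *code, size_t len)
{
    size_t pc = 0, count = 0;
    while (pc < len) {
        size_t n = gb_insn_length(code[pc]);
        if (n == 0 || n > len - pc)
            break;
        pc += n;
        count++;
    }
    return count;
}
```

In the byte sequence 00 3E 42 C3 50 01 CB 7C C9, the 01 is the high byte of the JP address rather than an LD BC,d16, and only decoding from a known instruction start reveals that.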

I want to create a simple assembler in C. Where should I begin? [duplicate]

I've recently been trying to immerse myself in the world of assembly programming with the eventual goal of creating my own programming language. I want my first real project to be a simple assembler written in C that will be able to assemble a very small portion of the x86 machine language and create a Windows executable. No macros, no linkers. Just assembly.
On paper, it seems simple enough. Assembly code comes in, machine code comes out.
But as soon as I start thinking about all the details, it suddenly becomes very daunting. What conventions does the operating system demand? How do I align data and calculate jumps? What does the inside of an executable even look like?
I'm feeling lost. There aren't any tutorials on this that I could find and looking at the source code of popular assemblers was not inspiring (I'm willing to try again, though).
Where do I go from here? How would you have done it? Are there any good tutorials or literature on this topic?
I have written a few myself (assemblers and disassemblers) and I would not start with x86. If you know x86 or any other instruction set you can pick up and learn the syntax for another instruction set in short order (an evening/afternoon), at least the lion's share of it. The act of writing an assembler (or disassembler) will definitely teach you an instruction set, fast, and you will know that instruction set better than many seasoned assembly programmers who have not examined the machine code at that level. msp430, pdp11, and thumb (not the thumb2 extensions), or mips or openrisc, are all good places to start: not a lot of instructions, not overly complicated, etc.
I recommend a disassembler first, and with that a fixed-length instruction set like arm or thumb or mips or openrisc, etc. If not, then at least use a disassembler (definitely choose an instruction set for which you already have an assembler, linker, and disassembler) and with pencil and paper understand the relationship between the machine code and the assembly, in particular the branches. They usually have one or more quirks: the program counter is an instruction or two ahead when the offset is added, and to gain another bit the offset is sometimes measured in whole instructions rather than bytes.
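Both branch quirks show up in the ARM-state B instruction, where the offset is relative to the instruction address plus 8 and is stored in 4-byte words. Here is a minimal sketch of encoding and decoding it; the function names are invented for this example, and range and alignment checks are left out.

```c
#include <stdint.h>

/* Encodes an unconditional ARM-state `B target` placed at `addr`.
 * The offset is relative to addr + 8 and stored in words, not bytes.
 * Range and alignment checks are omitted for brevity. */
uint32_t arm_encode_b(uint32_t addr, uint32_t target)
{
    uint32_t byte_offset = target - (addr + 8u);   /* wraps for backward branches */
    uint32_t imm24 = (byte_offset >> 2) & 0x00FFFFFFu;
    return 0xEA000000u | imm24;                    /* cond = always, opcode = B */
}

/* Recovers the branch target from an encoded B/BL instruction at `addr`. */
uint32_t arm_branch_target(uint32_t addr, uint32_t insn)
{
    uint32_t imm24 = insn & 0x00FFFFFFu;
    if (imm24 & 0x00800000u)
        imm24 |= 0xFF000000u;                      /* sign-extend to 32 bits */
    return addr + 8u + (imm24 << 2);
}
```

A branch to itself therefore encodes an offset of -8 bytes, that is -2 words, which gives the familiar 0xEAFFFFFE.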
It is pretty easy to brute force parse the text with a C program to read the instructions. A harder task but perhaps as educational, would be to use bison/flex and learn that programming language to allow those tools to create (an even more extreme brute force) parser which then interfaces to your code to tell you what was found where.
The assembler itself is pretty straightforward: just read the ASCII and set the bits in the machine code. Branches and other pc-relative instructions are a little more painful, as they can take multiple passes through the source/tables to completely resolve.
mov r0,r1
mov r2 ,#1
The assembler begins by parsing the text a line at a time (a line being defined as the bytes that follow a carriage return 0xD or line feed 0xA). Discard the white space (spaces and tabs) until you get to something that is not white space, then strncmp that with the known mnemonics. If you hit one, parse the possible combinations of that instruction. In the simple case above, after the mov, skip over the white space to non-white space; perhaps the first thing you find must be a register, then optional white space, then a comma. Remove the white space and comma and compare that against a table of strings, or just parse through it. Once that register is done, go past where the comma is found, and let's say it is either another register or an immediate. If it is an immediate, let's say it has to have a # sign; if it is a register, let's say it has to start with a lower or upper case 'r'. After parsing that register or immediate, make sure there is nothing else on the line that shouldn't be on the line. Build the machine code for this instruction, or at least as much as you can, and move on to the next line. It may be tedious but it is not difficult to parse ASCII...
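The parsing steps described above can be sketched roughly as follows for the two mov forms shown. The struct mov_insn type and the function names are invented for this example, registers are assumed to be r0 to r15, and negative immediates are not handled.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for one parsed `mov` line. */
struct mov_insn {
    int rd;          /* destination register number */
    int src_is_imm;  /* 1 if the source is an immediate, 0 if a register */
    long src;        /* source register number or immediate value */
};

static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Parses "rN" (N = 0..15); returns the position after it, or NULL. */
static const char *parse_reg(const char *p, int *reg)
{
    if ((*p != 'r' && *p != 'R') || !isdigit((unsigned char)p[1]))
        return NULL;
    char *end;
    long n = strtol(p + 1, &end, 10);
    if (n > 15)
        return NULL;
    *reg = (int)n;
    return end;
}

/* Returns 0 and fills `out` if `line` is "mov rD, rS" or "mov rD, #imm". */
int parse_mov(const char *line, struct mov_insn *out)
{
    const char *p = skip_ws(line);
    if (strncmp(p, "mov", 3) != 0 || (p[3] != ' ' && p[3] != '\t'))
        return -1;
    p = parse_reg(skip_ws(p + 3), &out->rd);
    if (p == NULL)
        return -1;
    p = skip_ws(p);
    if (*p != ',')
        return -1;
    p = skip_ws(p + 1);
    if (*p == '#') {                      /* immediate source */
        if (!isdigit((unsigned char)p[1]))
            return -1;
        char *end;
        out->src = strtol(p + 1, &end, 0);
        out->src_is_imm = 1;
        p = end;
    } else {                              /* register source */
        int rs;
        p = parse_reg(p, &rs);
        if (p == NULL)
            return -1;
        out->src = rs;
        out->src_is_imm = 0;
    }
    p = skip_ws(p);                       /* nothing else may follow */
    return (*p == '\0' || *p == '\r' || *p == '\n') ? 0 : -1;
}
```

Turning the parsed fields into machine code is then a matter of shifting them into the bit positions the chosen instruction set defines.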
At a minimum you will want a table/array that accumulates the machine code/data as it is created, plus some method for marking instructions as incomplete, i.e. the pc-relative instructions to be completed on a future pass. You will also want a table/array that collects the labels you find and the address/offset in the machine code table where each was found, as well as the labels used in an instruction as a destination/source, together with the offset in the table/array of the partially complete instruction they go with. After the first pass, go back through these tables until you have matched up all the label definitions with the labels used as a source or destination, using the label definition address/offset to compute the distance to the instruction in question, and then finish creating the machine code for that instruction. (Some disassembly may be required, and/or some other method for remembering what kind of encoding it was when you come back to it later to finish building the machine code.)
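Here is a minimal sketch of those tables and of the second pass. The instruction it emits is a made-up 2-byte "jump rel8" with opcode 0x10, chosen only to keep the example short, and all type and function names are invented for this example. The displacement is measured from the byte after the operand.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_SYMS 32

struct label { const char *name; size_t addr; };     /* where a label was defined */
struct fixup { const char *name; size_t operand; };  /* operand byte still to fill */

struct asm_state {
    uint8_t code[256];                               /* accumulated machine code */
    size_t  code_len;
    struct label labels[MAX_SYMS]; size_t n_labels;
    struct fixup fixups[MAX_SYMS]; size_t n_fixups;
};

/* Pass 1 helper: remember that `name` refers to the current output offset. */
int define_label(struct asm_state *s, const char *name)
{
    if (s->n_labels == MAX_SYMS) return -1;
    s->labels[s->n_labels++] = (struct label){ name, s->code_len };
    return 0;
}

/* Pass 1 helper: emit the made-up "jump rel8" with a zero displacement
 * and record the operand position for the later pass. */
int emit_jump(struct asm_state *s, const char *target)
{
    if (s->n_fixups == MAX_SYMS || s->code_len + 2 > sizeof s->code) return -1;
    s->code[s->code_len++] = 0x10;
    s->fixups[s->n_fixups++] = (struct fixup){ target, s->code_len };
    s->code[s->code_len++] = 0;
    return 0;
}

/* Pass 2: match every recorded use with its definition and patch the
 * displacement, measured from the byte after the operand. */
int resolve_fixups(struct asm_state *s)
{
    for (size_t i = 0; i < s->n_fixups; i++) {
        const struct label *def = NULL;
        for (size_t j = 0; j < s->n_labels; j++)
            if (strcmp(s->labels[j].name, s->fixups[i].name) == 0)
                def = &s->labels[j];
        if (def == NULL) return -1;                   /* undefined label */
        long disp = (long)def->addr - (long)(s->fixups[i].operand + 1);
        if (disp < -128 || disp > 127) return -1;     /* does not fit in rel8 */
        s->code[s->fixups[i].operand] = (uint8_t)disp;
    }
    return 0;
}
```

Because each fixup records where its operand lives, forward references cost nothing extra: the jump is emitted with a placeholder and patched once the label's address is known.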
The next step is allowing for multiple source files, if that is something you want to allow. Now you have labels that don't get resolved by the assembler, so you have to leave placeholders in the output and use some flavor of the longest jump/branch instruction, because you don't know how far away the destination will be; expect the worst. Then there is the output file format you choose to create/use, and then there is the linker, which is mostly simple, but you have to remember to fill in the machine code for the final pc-relative instructions, which is no harder than it was in the assembler itself.
Note, writing an assembler is not necessarily related to creating a programming language and then writing a compiler for it; that is a separate thing with different problems. Actually, if you want to make a new programming language, just use an existing assembler for an existing instruction set. It is not required of course, but most teachings and tutorials are going to use the bison/flex approach for programming languages, and there are many college course lecture notes/resources out there for beginning compiler classes that you can use to get started, then modify the script to add the features of your language. The middle and back ends are a bigger challenge than the front end. There are many books on this topic and many online resources as well. As mentioned in another answer, llvm is not a bad place to create a new programming language: the middle and back ends are done for you, so you only need to focus on the programming language itself, the front end.
You should look at LLVM. LLVM is a modular compiler back end; the most popular front end is Clang, for compiling C/C++/Objective-C. The good thing about LLVM is that you can pick the part of the compiler chain that you are interested in and just focus on that, ignoring all of the others. If you want to create your own language, write a parser that generates the LLVM intermediate representation, and for free you get all of the middle-layer target-independent optimisations and compilation to many different targets. If you are interested in a compiler for some exotic CPU, write a compiler back end that takes the LLVM intermediate code and generates your assembly. If you have some ideas about optimisation techniques, automatic threading perhaps, write a middle layer which processes LLVM intermediate code. LLVM is a collection of libraries, not a standalone binary like GCC, and so it is very easy to use in your own projects.
What you're looking for is not a tutorial or source code, it's a specification. See http://msdn.microsoft.com/en-us/library/windows/hardware/gg463119.aspx
Once you understand the specification of an executable, write a program to generate one. The executable you build should be as simple as possible. Once you have mastered that, then you can write a simple line-oriented parser that reads instruction names and numeric arguments to generate a block of code to plug into the exe. Later you can add symbols, branches, sections, whatever you want, and that's where something like http://www.davidsalomon.name/assem.advertis/asl.pdf will come in.
P.S. Carl Norum has a good point in the comment above. If your goal is to create your own programming language, learning to write an assembler is irrelevant and is very much not the right way to start (unless the language you want to create is an assembly language). There are already assemblers that produce executables from assembler source, so your compiler could produce assembler source and you could avoid the work of recreating the assembler ... and you should. Or you could use something like LLVM, which will solve many other daunting problems of compiler construction. The odds are very small that you will ever actually produce your own programming language, but they're much smaller if you start from scratch when there's no need to. Decide what your goal is and use the best tools available to achieve it.

C program for display assembly in binary files

I'm trying to display the assembly instructions in a binary file, but how can I do that?
How can I know if an argument of MOV (for example) is a pointer or a number?
This is for educational purposes; I know that there is GDB and other tools.
Thanks in advance!
You mean a disassembler? then you have many tools to pick from, such as:
OllyDbg
IDA
objdump
If you want to integrate this into an existing program, then you need a disassembly engine, such as BeaEngine or diStorm.
You can utilize many of the libraries inside binutils like BFD and opcodes.
BFD Binary File Descriptor library, to do low-level manipulation.
opcodes library is used to assemble and disassemble machine instructions.
You might find useful information from the source to an emulator, which has to perform the same decoding task before performing the simulated instruction.
I highly recommend targeting a small subset first, ideally the bare 8086, and then adding extensions in the same sequence they historically happened. This will help you decide what to ignore when looking for more information, so as not to get overwhelmed.
For the MOV operation, the operands are specified (in the most general form) by the second byte, the MOD-REG-REG/MEM byte. Operands are almost always registers or memory locations (pointers, possibly constructed on-the-fly using "indexing registers"). Only a few instructions accept a literal operand (a number), and only as the source; they are clearly marked in the table in the 1979 8086 Manual, on page 180.
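As a small illustration of that second byte, here is a sketch that splits it into its three fields. The struct and function names are invented for this example; the bit layout (mod in bits 7..6, reg in bits 5..3, r/m in bits 2..0) is the standard one.

```c
#include <stdint.h>

struct modrm { uint8_t mod, reg, rm; };

/* Splits a MOD-REG-R/M byte into its three bit fields:
 * mod = bits 7..6, reg = bits 5..3, r/m = bits 2..0. */
struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m;
    m.mod = (uint8_t)(byte >> 6);
    m.reg = (uint8_t)((byte >> 3) & 7);
    m.rm  = (uint8_t)(byte & 7);
    return m;
}

/* mod == 3 means the r/m field names a register; any other value means
 * a memory operand (possibly with a displacement following). */
int modrm_rm_is_register(uint8_t byte)
{
    return decode_modrm(byte).mod == 3;
}
```

For example, in the 8086 encoding 89 D8 the second byte D8 splits into mod = 3, reg = 3 (BX) and r/m = 0 (AX), so both operands are registers and the instruction is mov ax, bx. In 89 07 the mod field is 0, so the destination is a memory location.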

x86 way to tell instruction from data

Is there a more or less reliable way to tell whether data at some location in memory is a beginning of a processor instruction or some other data?
For example, E8 3F BD 6A 00 may be call instruction (E8) with relative offset of 0x6ABD3F, or it might be three bytes of data belonging to some other instruction, followed by push 0 (6A 00).
I know the question sounds silly and there is probably no simple way, but maybe instruction set was designed with this problem in mind and maybe some simple code examining +-100 bytes around the location can give an answer that is very likely correct.
I want to know this because I scan the program's code and replace all calls to some function with calls to my replacement. It's working so far, but it's not impossible that at some point, as I increase the number of functions I'm replacing, some data will look exactly like a function call to that exact address and will be replaced, and this will cause the program to break in a most unexpected fashion. I want to reduce the probability of that.
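For reference, the kind of byte-pattern check described in the question might look like the following minimal sketch, assuming 32-bit code where E8 is followed by a 4-byte little-endian displacement relative to the next instruction. The function name and the base parameter are invented for this example. Note that it has exactly the weakness the question worries about: it cannot tell whether buf[pos] really is the start of an instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* If buf[pos] starts a 5-byte `call rel32` (opcode E8, 32-bit operand size),
 * stores the call target and returns 1; otherwise returns 0.
 * `base` is the address at which buf[0] is mapped. */
int decode_call_rel32(const uint8_t *buf, size_t len, size_t pos,
                      uint32_t base, uint32_t *target)
{
    if (pos >= len || len - pos < 5 || buf[pos] != 0xE8)
        return 0;
    uint32_t rel = (uint32_t)buf[pos + 1] |
                   ((uint32_t)buf[pos + 2] << 8) |
                   ((uint32_t)buf[pos + 3] << 16) |
                   ((uint32_t)buf[pos + 4] << 24);
    /* The displacement is relative to the address of the next instruction. */
    *target = base + (uint32_t)pos + 5u + rel;
    return 1;
}
```

With the bytes E8 3F BD 6A 00 mapped at 0x400000, the computed target is 0x400000 + 5 + 0x6ABD3F = 0xAABD44.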
If it is your code (or other code that retains linking and debug info), the best way is to scan the symbol/relocation tables in the object file. Otherwise there's no reliable way to determine whether some byte is instruction or data.
Possibly the most efficient method to qualify data is recursive disassembling, i.e. disassembling code from the entry point and from all jump destinations found. But this is not completely reliable, because it does not traverse jump tables (you can try to use some heuristics for this, but they are not completely reliable either).
A solution for your problem would be to patch the function being replaced itself: overwrite its beginning with a jump instruction to your function.
Unfortunately, there is no 100% reliable way to distinguish code from data. From the CPU point of view, code is code only when some jump opcode induces the processor into trying to execute the bytes as if they were code. You could try to make a control flow analysis by beginning with the program entry point, and following all possible execution paths, but this may fail in the presence of pointers to function.
For your specific problem: I gather that you want to replace an existing function with a replacement of your own. I suggest that you patch the replaced function itself. I.e., instead of locating all calls to the foo() function and replacing them with a call to bar(), just replace the first bytes of foo() with a jump to bar() (a jmp, not a call: you do not want to mess with the stack). This is less satisfactory because of the double jump, but it is reliable.
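Here is a minimal sketch of building such a jump, assuming 32-bit x86 and the 5-byte jmp rel32 form (opcode E9). The function name and parameters are invented for this example. It only fills a byte buffer; making the code page writable and keeping other threads out of foo() while it is patched are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes a 5-byte `jmp rel32` (opcode E9) into `patch`, which must hold the
 * bytes that will execute at address `from`, so that it jumps to `to`.
 * Returns 0 on success, -1 if fewer than 5 bytes are available. */
int write_jmp_rel32(uint8_t *patch, size_t patch_len, uint32_t from, uint32_t to)
{
    if (patch_len < 5)
        return -1;
    uint32_t rel = to - (from + 5u);   /* relative to the next instruction */
    patch[0] = 0xE9;
    patch[1] = (uint8_t)(rel);
    patch[2] = (uint8_t)(rel >> 8);
    patch[3] = (uint8_t)(rel >> 16);
    patch[4] = (uint8_t)(rel >> 24);
    return 0;
}
```

For foo() at 0x401000 and bar() at 0x402000 the displacement is 0x402000 - 0x401005 = 0xFFB, so the patch bytes are E9 FB 0F 00 00.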
It is impossible to distinguish data from instructions in general, and this is because of the von Neumann architecture. Analyzing the surrounding code is helpful, and disassembly tools do this. (This may be helpful. If you can't use IDA Pro /it is commercial/, use another disassembly tool.)
Plain code has a very specific entropy, so it's quite easy to distinguish it from most data. It is a probabilistic approach, but a large enough buffer of plain code can be recognized (especially compiler output, where you can also recognize patterns, like the beginning of a function).
Also, some opcodes are reserved for future, others are available only from kernel mode. In this case by knowing them and knowing how to compute the instruction lengths (you could try a routine written by Z0mbie for that), you can do it.
Thomas suggests the right idea. To implement it properly, you need to disassemble the first few instructions (the part you would overwrite with the JMP) and generate a simple trampoline function that executes them then jumps to the rest of the original function.
There are libraries that do this for you. A well-known one is Detours, but it has somewhat awkward licensing conditions. A nice implementation of the same idea with a more permissive license is Mhook.

Resources