C program for display assembly in binary files - c

I'm trying to display the assembly instructions in a binary file, but how can I do that?
How can I know whether an argument of MOV (for example) is a pointer or a number?
This is for educational purposes; I know that there are GDB and other tools.
Thanks in advance!

You mean a disassembler? Then you have many tools to pick from, such as:
OllyDbg
IDA
objdump
If you want to integrate this into an existing program, then you need a disassembly engine, such as BeaEngine or diStorm.

You can utilize many of the libraries inside binutils like BFD and opcodes.
BFD Binary File Descriptor library, to do low-level manipulation.
opcodes library is used to assemble and disassemble machine instructions.

You might find useful information from the source to an emulator, which has to perform the same decoding task before performing the simulated instruction.
I highly recommend targeting a small subset first, ideally the bare 8086, and then adding extensions in the same sequence they historically happened. This will help you decide what to ignore when looking for more information, so as not to get overwhelmed.
For the MOV operation, the operands are specified (in the most general form) by the second byte, the MOD-REG-REG/MEM byte (often written MOD-REG-R/M). Operands are almost always registers or memory locations (pointers, possibly constructed on the fly using "indexing registers"). Only a few instructions accept a literal operand (a number), and only as the source; they are clearly marked in the table in the 1979 8086 Manual, on page 180.
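As a rough illustration of the paragraph above, here is a minimal C sketch that splits the second byte into its three fields and uses it to say whether the source of a MOV is a register, a memory location, or a literal. It only covers a handful of 8086 MOV opcodes (0x88..0x8B, 0xB0..0xBF, 0xC6, 0xC7); the type and function names are made up for this example and a real disassembler would be table driven.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the 8086 addressing-mode byte (the byte after the opcode). */
typedef struct {
    uint8_t mod; /* bits 7..6: 3 = register operand, 0..2 = memory operand */
    uint8_t reg; /* bits 5..3: register number */
    uint8_t rm;  /* bits 2..0: register number or addressing formula */
} ModRM;

ModRM split_modrm(uint8_t byte)
{
    ModRM m;
    m.mod = (uint8_t)((byte >> 6) & 3);
    m.reg = (uint8_t)((byte >> 3) & 7);
    m.rm  = (uint8_t)(byte & 7);
    return m;
}

typedef enum {
    OPERAND_REGISTER,
    OPERAND_MEMORY,
    OPERAND_IMMEDIATE,
    OPERAND_UNKNOWN
} OperandKind;

/* Classify the SOURCE operand of a few 8086 MOV forms (hypothetical helper). */
OperandKind mov_source_kind(const uint8_t *code)
{
    uint8_t op;
    assert(code != NULL);
    op = code[0];
    if (op >= 0xB0 && op <= 0xBF) return OPERAND_IMMEDIATE; /* MOV reg, imm */
    if (op == 0xC6 || op == 0xC7) return OPERAND_IMMEDIATE; /* MOV r/m, imm */
    if (op >= 0x88 && op <= 0x8B) {                          /* MOV r/m <-> reg */
        ModRM m = split_modrm(code[1]);
        int reg_is_dest = (op & 2) != 0;                     /* direction bit */
        if (!reg_is_dest) return OPERAND_REGISTER;           /* source is REG field */
        return m.mod == 3 ? OPERAND_REGISTER : OPERAND_MEMORY;
    }
    return OPERAND_UNKNOWN;
}
```

For instance, the bytes 8B 07 (MOV AX, [BX]) classify as a memory source, while B8 34 12 (MOV AX, 0x1234) classifies as an immediate.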

Related

How to perform the most basic gate-level operations in a computer?

How can I use my hardware directly to perform an operation at the level of bits in my computer without a programming language (and if possible even without the help of the kernel)?
For example, a code in C maybe
unsigned char a = 0;
unsigned char b = 1;
unsigned char c = a|b;
I want to do it (request 3 bytes of memory and modify those bytes) directly with my computer hardware and without using any programming language (i.e. I want to write the machine code myself). If possible, even without the help of kernel. How to do it and where to learn about these?
I am currently using Ubuntu 18, kernel 5.4.0-52-generic. I have intel 8th gen core i5 laptop. Let me know if I need to be more specific about my system specs.
First you have to of course have the documentation for the processor that defines the instruction set including the machine code.
Next you would write something that is not dead code:
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a|b);
}
Intentionally not using x86
Disassembly of section .text:
00000000 <fun>:
0: e1800001 orr r0, r0, r1
4: e12fff1e bx lr
Then you can examine that machine code against the documentation and understand the encoding of these instructions.
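To make "understand the encoding" concrete, here is a small C sketch that pulls the fields out of the word 0xe1800001 using the classic 32-bit ARM data-processing layout (condition in bits 31..28, opcode in bits 24..21, Rn in 19..16, Rd in 15..12, operand 2 in 11..0). The struct and function names are invented for this example; check the field positions against your own copy of the ARM documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a classic ARM data-processing instruction word. */
typedef struct {
    unsigned cond;      /* bits 31..28: 0xE means "always" */
    unsigned immediate; /* bit 25: 1 if operand 2 is an immediate */
    unsigned opcode;    /* bits 24..21: 0xC is ORR */
    unsigned set_flags; /* bit 20: the S suffix */
    unsigned rn;        /* bits 19..16: first source register */
    unsigned rd;        /* bits 15..12: destination register */
    unsigned operand2;  /* bits 11..0: register + shift, or immediate */
} ArmDataProc;

ArmDataProc decode_data_processing(uint32_t word)
{
    ArmDataProc d;
    /* Data-processing instructions have bits 27..26 equal to 00. */
    assert(((word >> 26) & 3u) == 0);
    d.cond      = (word >> 28) & 0xFu;
    d.immediate = (word >> 25) & 1u;
    d.opcode    = (word >> 21) & 0xFu;
    d.set_flags = (word >> 20) & 1u;
    d.rn        = (word >> 16) & 0xFu;
    d.rd        = (word >> 12) & 0xFu;
    d.operand2  = word & 0xFFFu;
    return d;
}
```

Decoding 0xe1800001 this way gives condition 0xE, opcode 0xC, Rn = 0, Rd = 0 and a register operand with Rm = 1, which matches the disassembly orr r0, r0, r1.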
That is all quite trivial.
The biggest issue is how you plan to run this code. You say without the kernel; that means no operating system, which means bare metal, which means you have to boot the processor yourself, and that takes detailed processor and system (chip/motherboard/media/etc.) information. It is obviously possible, since your computer boots and runs, and so does a microcontroller (a significantly better choice for this kind of education; avoiding x86 as a first ISA is also a good choice). Better still is an emulator/simulator, because you have better visibility and it is un-brickable, whereas hardware is not hard to brick and sometimes not hard to let the smoke out of with bad software.
Based on your question, you need to crawl before you walk, and walk before you run. Start with simple functions as shown. With certain tools (GNU is your friend but also your enemy, as it takes time to master, but it is very feature rich) you can write the machine code yourself and feed it to an assembler:
.globl fun
fun:
.inst 0xe1800001
.inst 0xe12fff1e
Disassembly of section .text:
00000000 <fun>:
0: e1800001 orr r0, r0, r1
4: e12fff1e bx lr
If you don't want to use an assembler and want to create the binary file yourself, then you have to read up on the file format. Many are published, and writing one out is a somewhat trivial task, even (or especially) if you write a tool from scratch rather than rely on libraries, but it depends on you and the file format. I am not sure how many tangents and research projects you are really interested in. They all have value, but what is it you really want to learn, and what is the priority of your desires?
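The simplest "file format" is no format at all: a raw image holding just the instruction bytes. A minimal C sketch, assuming a little-endian ARM target and using made-up function names, could look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Store one 32-bit word in little-endian byte order. */
void put_le32(uint8_t *out, uint32_t word)
{
    assert(out != NULL);
    out[0] = (uint8_t)(word & 0xFF);
    out[1] = (uint8_t)((word >> 8) & 0xFF);
    out[2] = (uint8_t)((word >> 16) & 0xFF);
    out[3] = (uint8_t)((word >> 24) & 0xFF);
}

/* Write the words to a raw binary file; returns 0 on success, -1 on error. */
int write_raw_image(const char *path, const uint32_t *words, size_t count)
{
    size_t i;
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    for (i = 0; i < count; i++) {
        uint8_t bytes[4];
        put_le32(bytes, words[i]);
        if (fwrite(bytes, 1, sizeof bytes, f) != sizeof bytes) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Calling write_raw_image("fun.bin", words, 2) with the two words 0xe1800001 and 0xe12fff1e would produce an 8-byte file (the file name is only an example). Formats with headers, such as ELF, need considerably more than this.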
You are looking at potentially months to years of work depending on what you are really after. Hard as it is to accept, the best path is to use the available tools and sandboxes first and replace them later as desired, rather than write everything from scratch up front without help from any existing tools.
You want to build a better hammer you start with using the hammers that exist, decide what you do and don't like then make your own. You don't just go off without ever using one and try to create one and expect any kind of success. Or success within any kind of reasonable schedule.
A big problem with doing everything from scratch is that you need to get the bytes into a flash or RAM or some other media so that the processor can fetch and run them. This you likely cannot do without some tools: a fully working computer with an operating system to hold your raw bytes, plus some hardware capable of programming a flash, all used together to program your raw bits into that flash. With some flashes you could probably use switches (you have to solve the bouncing problem), like on the front of an old DEC or something, and toggle your way through the protocol used to program the device, needing only pencil/pen and paper plus the switches and wiring as tools.
You are far better off with an instruction set simulator and, depending on the file formats it supports, rolling your own binary creation tool or using an assembler or something similar to make the file to feed the sim. Or, even better, write your own simulator: you learn the instruction set better than most seasoned professionals that way, and then of course you can create your own binary creation tool to match. You will most likely fail if you have not first taken an existing instruction set and set of tools and learned to program at the assembly/machine language level: see how it works, see the instructions used/generated, and with the documentation see the encoding, etc.
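To give a feel for how small a first simulator can be, here is a C sketch that executes only the two instruction forms from the example above: ORR with an unshifted register operand, and BX. It ignores condition codes, flags and the fact that reading r15 as an operand yields the current address plus 8, so it is a toy under those stated assumptions; the names and the return-address sentinel are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RETURN_SENTINEL 0xFFFFFFFCu /* fake return address placed in lr */

typedef struct {
    uint32_t r[16]; /* r15 is the program counter */
} Cpu;

/* Execute one instruction; returns 0 on success, -1 if unsupported. */
int step(Cpu *cpu, const uint32_t *mem)
{
    uint32_t insn = mem[cpu->r[15] / 4];
    cpu->r[15] += 4;
    if ((insn & 0x0FFFFFF0u) == 0x012FFF10u) {       /* BX Rm */
        cpu->r[15] = cpu->r[insn & 0xFu] & ~1u;
        return 0;
    }
    if ((insn & 0x0FF00FF0u) == 0x01800000u) {       /* ORR Rd, Rn, Rm */
        unsigned rn = (insn >> 16) & 0xFu;
        unsigned rd = (insn >> 12) & 0xFu;
        unsigned rm = insn & 0xFu;
        cpu->r[rd] = cpu->r[rn] | cpu->r[rm];
        return 0;
    }
    return -1;
}

/* Call the code as fun(a, b) and return r0 once it branches back to lr. */
uint32_t run_fun(const uint32_t *code, size_t nwords, uint32_t a, uint32_t b)
{
    Cpu cpu = {{0}};
    cpu.r[0] = a;
    cpu.r[1] = b;
    cpu.r[14] = RETURN_SENTINEL;
    while (cpu.r[15] != RETURN_SENTINEL) {
        int rc;
        assert(cpu.r[15] / 4 < nwords);
        rc = step(&cpu, code);
        assert(rc == 0);
        (void)rc;
    }
    return cpu.r[0];
}
```

Feeding it the two words 0xe1800001 and 0xe12fff1e with a = 0x0F and b = 0xF0 leaves 0xFF in r0, the same result the C function fun computes.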
With most new instruction sets, the software folks have direct access to the silicon folks (walk down the hall to the person's office, or call them on the phone) and can ask questions about the encoding of an instruction. Since you cannot do that, you have to ask existing, debugged tools instead. So, as I started with way above, ask the compiler, ask the assembler. Then disassemble the result, assume the assembler produced the right instruction, and compare that to the processor documentation (understanding that both the tools and the documentation can have bugs, so you have to work through that).

Does libjit dynamically translate a piece of code to something executable?

Is GNU libjit meant to translate a piece of code into something executable (say, machine code for x86) at run time? I don't see how the examples from the libjit tutorial actually shows this. Any ideas? Thanks.
Not as a single step. It's probably more appropriate to say that libjit can generate executable code in memory, if given a low-level description of what that code should do. That's what the calls to functions like jit_insn_add() in the tutorial are all about; libjit converts a sequence of notional instructions (e.g, "add register 1 to register 2", "perform the next block of instructions if register 3 is zero") fed to it into a sequence of bytes in memory which can be run by your CPU to perform those operations.
If you want to convert a textual representation of some code (e.g, the string a = b + c;) to executable code of some variety, that's an entirely different and unrelated task. A full explanation of everything involved is beyond the scope of an answer on this site, but a general study of formal compiler implementation would be my recommended starting point. (Ignore for a moment that you intend to execute this code at runtime, rather than compiling it to an executable; this has surprisingly little bearing on the techniques used.) An excellent textbook on the subject is "Compilers: Principles, Techniques, and Tools", aka. the "Dragon Book".

Instruction disassembler ARM. [ARM/Thumb mode]

I would like to ask you how to determine in which ISA (ARM/Thumb/Thumb-2) an instruction is encoded?
First of all, I tried to do it following the instructions here (section 4.5.5).
However, when I run readelf -s ./arm_binary on an arm_binary that was built in release mode, it appears that there is no .symtab in the binary. And anyway, I don't understand how to use this command to find the type of the instructions.
Secondly, I know the other way to differentiate is to look at the address used to branch to the ARM/Thumb instruction. If it is odd (bit 0 set) then it is a Thumb instruction; if it is even, then ARM. But how can I do this without loading the file into memory? When I parse the sections of the file and find the executable section, all that I have is the start (offset) location in the file, and the file offset is always even; it will always be even because instructions are 2 or 4 bytes long...
Finally, the last way to check is to detect BX Rm, extract the value from Rm, and then check whether the address in Rm is even or odd. But this may be difficult because I would need to emulate the whole program.
So what is the correct way to identify the ISA for disassembly?
Thank you for your attention and I hope you will help me.
I don't believe it's possible to tell, in a mixed mode binary, without inspecting the instructions as you describe.
If the whole file will be one ISA or the other, then you can determine the ISA of the entry point by running this:
readelf -h ./arm_binary
And checking whether the entry point is even or odd.
However, what I would do is simply disassemble it both ways, and see what looks right. As long as you start the disassembly at the start of a function (or any 4-byte boundary), then this will work fine. Most code will produce nonsense when disassembled in the wrong ISA.
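The entry-point check can also be done without readelf. The following C sketch reads e_entry straight from the ELF header bytes and tests bit 0; it assumes a 32-bit little-endian ARM ELF and uses the standard header offsets (e_machine at byte 18, e_entry at byte 24). The function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EM_ARM 40u /* e_machine value for ARM */

/*
 * Returns 1 if the ELF entry point has bit 0 set (Thumb), 0 if it is clear
 * (ARM), or -1 if the buffer is not a 32-bit little-endian ARM ELF header.
 */
int elf_entry_is_thumb(const uint8_t *hdr, size_t len)
{
    unsigned machine;
    uint32_t entry;
    assert(hdr != NULL);
    if (len < 28) return -1;
    if (hdr[0] != 0x7F || hdr[1] != 'E' || hdr[2] != 'L' || hdr[3] != 'F')
        return -1;
    if (hdr[4] != 1 || hdr[5] != 1) return -1;   /* ELFCLASS32, little-endian */
    machine = (unsigned)hdr[18] | ((unsigned)hdr[19] << 8);
    if (machine != EM_ARM) return -1;
    entry = (uint32_t)hdr[24] | ((uint32_t)hdr[25] << 8) |
            ((uint32_t)hdr[26] << 16) | ((uint32_t)hdr[27] << 24);
    return (int)(entry & 1u);
}
```

This only tells you the state at the entry point; in a mixed-mode binary, functions reached later may still be in the other instruction set.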

I want to create a simple assembler in C. Where should I begin? [duplicate]

This question already has answers here:
Building an assembler
(4 answers)
How Do You Make An Assembler? [closed]
(4 answers)
Closed 9 years ago.
I've recently been trying to immerse myself in the world of assembly programming with the eventual goal of creating my own programming language. I want my first real project to be a simple assembler written in C that will be able to assemble a very small portion of the x86 machine language and create a Windows executable. No macros, no linkers. Just assembly.
On paper, it seems simple enough. Assembly code comes in, machine code comes out.
But as soon as I start thinking about all the details, it suddenly becomes very daunting. What conventions does the operating system demand? How do I align data and calculate jumps? What does the inside of an executable even look like?
I'm feeling lost. There aren't any tutorials on this that I could find and looking at the source code of popular assemblers was not inspiring (I'm willing to try again, though).
Where do I go from here? How would you have done it? Are there any good tutorials or literature on this topic?
I have written a few myself (assemblers and disassemblers) and I would not start with x86. If you know x86 or any other instruction set, you can pick up the syntax for another instruction set in short order (an evening/afternoon), at least the lion's share of it. The act of writing an assembler (or disassembler) will definitely teach you an instruction set, fast, and you will know that instruction set better than many seasoned assembly programmers who have not examined the machine code at that level. msp430, pdp11, and thumb (not the thumb2 extensions), or mips or openrisc, are all good places to start: not a lot of instructions, not overly complicated, etc.
I recommend a disassembler first, and with that a fixed-length instruction set like arm or thumb or mips or openrisc, etc. If you don't write one, then at least use a disassembler (definitely choose an instruction set for which you already have an assembler, linker, and disassembler) and, with pencil and paper, understand the relationship between the machine code and the assembly. Pay particular attention to the branches: they usually have one or more quirks, such as the program counter being an instruction or two ahead when the offset is added, or, to gain another bit or two of range, the offset being measured in whole instructions rather than bytes.
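Both branch quirks mentioned above show up in the classic 32-bit ARM B instruction: the offset is relative to the instruction's address plus 8, and it is stored in units of 4-byte instructions in a 24-bit field. A minimal C sketch (the function name is made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Encode an unconditional ARM "B target" placed at insn_addr.
 * The hardware adds the offset to insn_addr + 8, and the 24-bit field
 * counts words, not bytes.
 */
uint32_t encode_arm_b(uint32_t insn_addr, uint32_t target_addr)
{
    int32_t offset = (int32_t)(target_addr - (insn_addr + 8u));
    assert(offset % 4 == 0);                              /* word aligned */
    assert(offset >= -(1 << 25) && offset < (1 << 25));   /* fits in 24 bits */
    return 0xEA000000u | ((uint32_t)(offset / 4) & 0x00FFFFFFu);
}
```

A branch to itself therefore encodes as 0xEAFFFFFE (offset of minus two instructions), which is a handy value to compare against your assembler's output.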
It is pretty easy to brute force parse the text with a C program to read the instructions. A harder task but perhaps as educational, would be to use bison/flex and learn that programming language to allow those tools to create (an even more extreme brute force) parser which then interfaces to your code to tell you what was found where.
The assembler itself is pretty straightforward: just read the ASCII and set the bits in the machine code. Branches and other pc-relative instructions are a little more painful, as they can take multiple passes through the source/tables to completely resolve.
mov r0,r1
mov r2 ,#1
The assembler begins by parsing the text one line at a time (a line being defined as the bytes up to the next carriage return 0xD or line feed 0xA). Discard the white space (spaces and tabs) until you get to something that is not white space, then strncmp that with the known mnemonics. If you hit one, parse the possible operand combinations of that instruction. In the simple case above, after the mov, skip over the white space; perhaps the first thing you find must be a register, then optional white space, then a comma. Strip the white space and comma and compare what remains against a table of strings, or just parse through it. Once that register is done, go past the comma, and let's say what follows is either another register or an immediate. If it is an immediate, let's say it has to have a # sign; if it is a register, let's say it has to start with a lower or upper case 'r'. After parsing that register or immediate, make sure there is nothing else on the line that shouldn't be there. Build the machine code for this instruction, or at least as much of it as you can, and move on to the next line. It may be tedious, but it is not difficult to parse ASCII...
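The parsing steps just described can be sketched in C roughly as follows. The answer does not say which instruction set the two mov lines are for; this example assumes classic 32-bit ARM encodings (0xE1A00000 for the register form, 0xE3A00000 for the immediate form) and only accepts immediates from 0 to 255. All function names are invented for the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static const char *skip_space(const char *p)
{
    while (*p == ' ' || *p == '\t') p++;
    return p;
}

/* Parse "rN" or "RN" with N in 0..15; returns the position after it or NULL. */
static const char *parse_reg(const char *p, unsigned *reg)
{
    char *end;
    unsigned long n;
    if (*p != 'r' && *p != 'R') return NULL;
    if (!isdigit((unsigned char)p[1])) return NULL;
    n = strtoul(p + 1, &end, 10);
    if (n > 15) return NULL;
    *reg = (unsigned)n;
    return end;
}

/* Assemble one "mov" line; returns 0 and stores the word, or -1 on error. */
int assemble_mov(const char *line, uint32_t *word)
{
    unsigned rd = 0, rm = 0;
    const char *p;
    assert(line != NULL && word != NULL);
    p = skip_space(line);
    if (strncmp(p, "mov", 3) != 0 || (p[3] != ' ' && p[3] != '\t')) return -1;
    p = skip_space(p + 3);
    if ((p = parse_reg(p, &rd)) == NULL) return -1;    /* destination register */
    p = skip_space(p);
    if (*p != ',') return -1;
    p = skip_space(p + 1);
    if (*p == '#') {                                   /* immediate source */
        char *end;
        unsigned long imm;
        if (!isdigit((unsigned char)p[1])) return -1;
        imm = strtoul(p + 1, &end, 0);
        if (imm > 255) return -1;   /* sketch: plain 8-bit immediates only */
        *word = 0xE3A00000u | (rd << 12) | (uint32_t)imm;
        p = end;
    } else {                                           /* register source */
        if ((p = parse_reg(p, &rm)) == NULL) return -1;
        *word = 0xE1A00000u | (rd << 12) | rm;
    }
    p = skip_space(p);                                 /* nothing else allowed */
    return (*p == '\0' || *p == '\n' || *p == '\r') ? 0 : -1;
}
```

Under these assumptions, "mov r0,r1" assembles to 0xE1A00001 and "mov r2 ,#1" to 0xE3A02001, while a line with trailing junk is rejected.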
At a minimum you will want a table/array that accumulates the machine code/data as it is created, plus some method for marking instructions as incomplete: the pc-relative instructions to be completed on a future pass. You will also want a table/array that collects the labels you find and the address/offset in the machine code table where each was found, as well as a record of the labels used in instructions as a destination/source, with the offset in the table/array of the partially complete instruction they go with. After the first pass, go back through these tables until you have matched up all the label definitions with the labels used as a source or destination, using the label definition's address/offset to compute the distance to the instruction in question, and then finish creating the machine code for that instruction. (Some disassembly may be required, and/or some other method for remembering what kind of encoding it was when you come back to it later to finish building the machine code.)
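One possible shape for those tables, as a hedged C sketch: a code array, a label table, and a list of "fixups" naming the incomplete branch instructions. It assumes a fixed 4-byte instruction size and the ARM B encoding (offset relative to the instruction address plus 8, counted in words); all names and the fixed table size are choices made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_ITEMS 64

typedef struct { char name[32]; uint32_t addr; } Label;   /* label definition */
typedef struct { char name[32]; size_t index; } Fixup;    /* incomplete branch */

typedef struct {
    uint32_t code[MAX_ITEMS];   size_t ncode;
    Label    labels[MAX_ITEMS]; size_t nlabels;
    Fixup    fixups[MAX_ITEMS]; size_t nfixups;
} Assembler;

void emit_word(Assembler *as, uint32_t word)
{
    assert(as->ncode < MAX_ITEMS);
    as->code[as->ncode++] = word;
}

/* Record a label at the current output address. */
void define_label(Assembler *as, const char *name)
{
    Label *l;
    assert(as->nlabels < MAX_ITEMS);
    l = &as->labels[as->nlabels++];
    snprintf(l->name, sizeof l->name, "%s", name);
    l->addr = (uint32_t)(as->ncode * 4);
}

/* First pass: emit a branch with an empty offset and remember the target. */
void emit_branch(Assembler *as, const char *target)
{
    Fixup *f;
    assert(as->nfixups < MAX_ITEMS);
    f = &as->fixups[as->nfixups++];
    snprintf(f->name, sizeof f->name, "%s", target);
    f->index = as->ncode;
    emit_word(as, 0xEA000000u);
}

/* Later pass: patch every recorded branch; returns -1 on an undefined label. */
int resolve_fixups(Assembler *as)
{
    size_t i, j;
    for (i = 0; i < as->nfixups; i++) {
        const Fixup *f = &as->fixups[i];
        for (j = 0; j < as->nlabels; j++)
            if (strcmp(as->labels[j].name, f->name) == 0) break;
        if (j == as->nlabels) return -1;
        {
            int32_t from = (int32_t)(f->index * 4) + 8;
            int32_t offset = (int32_t)as->labels[j].addr - from;
            as->code[f->index] |= (uint32_t)(offset / 4) & 0x00FFFFFFu;
        }
    }
    return 0;
}
```

Because the fixup remembers which instruction it belongs to and this sketch has only one branch form, no disassembly is needed on the second pass; with several pc-relative encodings you would also store the encoding kind in the fixup.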
The next step is allowing for multiple source files, if that is something you want to allow. Now you have labels that don't get resolved by the assembler, so you have to leave placeholders in the output and use some flavor of the longest jump/branch instruction, because you don't know how far away the destination will be; expect the worst. Then there is the output file format you choose to create/use, and then there is the linker, which is mostly simple, but you have to remember to fill in the machine code for the final pc-relative instructions; that is no harder than it was in the assembler itself.
Note that writing an assembler is not necessarily related to creating a programming language and then writing a compiler for it; it is a separate thing with different problems. Actually, if you want to make a new programming language, just use an existing assembler for an existing instruction set. It is not required, of course, but most teachings and tutorials are going to use the bison/flex approach for programming languages, and there are many college course lecture notes/resources out there for beginning compiler classes that you can use to get started, then modify the script to add the features of your language. The middle and back ends are a bigger challenge than the front end. There are many books on this topic and many online resources as well. As mentioned in another answer, LLVM is not a bad place to create a new programming language: the middle and back ends are done for you, so you only need to focus on the programming language itself, the front end.
You should look at LLVM. LLVM is a modular compiler back end; the most popular front end is Clang, for compiling C/C++/Objective-C. The good thing about LLVM is that you can pick the part of the compiler chain that you are interested in and focus on just that, ignoring all of the others. If you want to create your own language, write a parser that generates LLVM's intermediate representation, and for free you get all of the middle-layer, target-independent optimisations and compilation to many different targets. If you are interested in a compiler for some exotic CPU, write a compiler back end that takes the LLVM intermediate code and generates your assembly. If you have some ideas about optimisation techniques, automatic threading perhaps, write a middle layer that processes LLVM intermediate code. LLVM is a collection of libraries, not a standalone binary like GCC, and so it is very easy to use in your own projects.
What you're looking for is not a tutorial or source code, it's a specification. See http://msdn.microsoft.com/en-us/library/windows/hardware/gg463119.aspx
Once you understand the specification of an executable, write a program to generate one. The executable you build should be as simple as possible. Once you have mastered that, then you can write a simple line-oriented parser that reads instruction names and numeric arguments to generate a block of code to plug into the exe. Later you can add symbols, branches, sections, whatever you want, and that's where something like http://www.davidsalomon.name/assem.advertis/asl.pdf will come in.
P.S. Carl Norum has a good point in the comment above. If your goal is to create your own programming language, learning to write an assembler is irrelevant and is very much not the right way to start (unless the language you want to create is an assembly language). There are already assemblers that produce executables from assembler source, so your compiler could produce assembler source and you could avoid the work of recreating the assembler ... and you should. Or you could use something like LLVM, which will solve many other daunting problems of compiler construction. The odds are very small that you will ever actually produce your own programming language, but they're much smaller if you start from scratch, and there's no need to. Decide what your goal is and use the best tools available to achieve it.

arm (bare metal): call binary file as function

I have AT91Bootloader for an AT91SAM9 ARM controller. I need to add some extra hardware initialization, but I have only a compiled .bin file.
I loaded the bin file into memory and tried to call it:
((void (*)())0x00005000)();
But I got no results. Please use as little assembler as possible. I was introduced to assembler before, but I cannot understand ARM assembler due to its complexity. How can I make a call from the middle of the bootloader, execute the bin file (it will be in some memory sector, 0x00005000 for example), and then return to the bootloader and continue executing its own code?
If ARM asm is "too complex", you will find it very difficult to debug any problems you're having. Basic* ARM assembly is one of the least complex assembly languages I've come across.
Your code ought to work (though I would not use a hard-coded address there) provided the ".bin" is of the correct format. Common issues:
The entry point should be ARM code; some compilers default to Thumb. It's possible (if a little tricky) to make Thumb code work.
The entry point needs to be at the start of the file. Without disassembling, it's hard to tell if you've done this correctly.
The linker will insert "thunks" (a.k.a. "stubs") where necessary. A quirk in some linkers means that the thunk can be placed before the entry point. You can work around this by using --stub-group-size=-1 (docs here).
* Ignoring things like Thumb/VFP/NEON which you don't need to get started.
ARM assembly is one of the simpler ones, very straightforward. If you want to continue to do bare metal, you are going to need to learn at least some assembly, for example to understand Alexey's comment.
The instruction you are looking for is BX, which branches to an address. The assembly you need to branch to the code your bootloader downloaded is:
.globl tramp
tramp:
bx r0
The C prototype is
void tramp ( unsigned int address );
As mentioned in the comments, the program needs to be compiled for the address you are running it from and/or it needs to be position independent; otherwise it won't work. You also need to build the application with the proper entry point: if it is a raw binary and you branch to the address where the binary was loaded, the first word in the binary has to be the entry point for execution.
Also understand that an ELF format file contains the data you want to load, but as a whole is not the data you want to load. It is a "binary file", yes, but to run the program contained in it you need to parse it, extract the loadable portions, and load them in the right places.
If you don't know what those terms mean, use Google and/or search SO; the answers are there.
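"Extract the loadable portions" means walking the ELF program headers and copying each PT_LOAD segment to its load address. The C sketch below does this for a 32-bit little-endian ELF held in a buffer, copying into a destination buffer that stands in for target memory. It uses the standard ELF32 field offsets and places segments by their physical address (p_paddr), which is an assumption that suits many bare-metal setups but not all; the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t rd16(const uint8_t *p) { return (uint32_t)p[0] | ((uint32_t)p[1] << 8); }
static uint32_t rd32(const uint8_t *p) { return rd16(p) | (rd16(p + 2) << 16); }

/*
 * Copy every PT_LOAD segment of an ELF32 little-endian image into mem, which
 * represents target memory starting at address mem_base. Returns the number
 * of segments loaded, or -1 if the image is malformed or does not fit.
 */
int load_elf32_segments(const uint8_t *file, size_t len,
                        uint8_t *mem, uint32_t mem_base, size_t mem_size)
{
    uint32_t phoff, phentsize, phnum, i;
    int loaded = 0;
    assert(file != NULL && mem != NULL);
    if (len < 52 || memcmp(file, "\177ELF", 4) != 0) return -1;
    if (file[4] != 1 || file[5] != 1) return -1;      /* 32-bit, little-endian */
    phoff = rd32(file + 28);                          /* e_phoff */
    phentsize = rd16(file + 42);                      /* e_phentsize */
    phnum = rd16(file + 44);                          /* e_phnum */
    if (phentsize < 32) return -1;
    for (i = 0; i < phnum; i++) {
        size_t at = (size_t)phoff + (size_t)i * phentsize;
        const uint8_t *ph;
        uint32_t offset, paddr, filesz, memsz;
        if (at > len || len - at < 32) return -1;
        ph = file + at;
        if (rd32(ph) != 1) continue;                  /* p_type != PT_LOAD */
        offset = rd32(ph + 4);
        paddr  = rd32(ph + 12);
        filesz = rd32(ph + 16);
        memsz  = rd32(ph + 20);
        if (filesz > memsz || offset > len || filesz > len - offset) return -1;
        if (paddr < mem_base || paddr - mem_base > mem_size ||
            memsz > mem_size - (paddr - mem_base)) return -1;
        memcpy(mem + (paddr - mem_base), file + offset, filesz);
        /* Bytes beyond the file contents (e.g. .bss) start out as zero. */
        memset(mem + (paddr - mem_base) + filesz, 0, memsz - filesz);
        loaded++;
    }
    return loaded;
}
```

On a real board the destination would be the actual memory at those addresses rather than a buffer, and you would then branch to e_entry instead of to the start of the file.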
