Instruction disassembler ARM. [ARM/Thumb mode]

I would like to ask you how to determine in which ISA (ARM/Thumb/Thumb-2) an instruction is encoded?
First of all, I tried to do it following the instructions here (section 4.5.5).
However, when I run readelf -s ./arm_binary on a binary built in release mode, it appears that there is no .symtab in it. And anyway, I don't understand how to use this command to find the instruction type.
Secondly, I know that another way to differentiate is to look at the address loaded into the PC by a branch: if it is odd (bit 0 set), the target is Thumb code; if it is even, the target is ARM code. But how can I do this without loading the file into memory? When I parse the sections of the file and find the executable section, all I have is its start offset in the file, and that offset is always even, because instructions are 2 or 4 bytes long...
Finally, the last way is to detect BX Rm, extract the value from Rm, and check whether that address is even or odd. But this may be difficult, because I would need to emulate the whole program.
So what is the correct way to identify the ISA for disassembly?
Thank you for your attention and I hope you will help me.

I don't believe it's possible to tell, in a mixed mode binary, without inspecting the instructions as you describe.
If the whole file is in one ISA or the other, you can determine the ISA of the entry point by running:
readelf -h ./arm_binary
and checking whether the entry point address is even (ARM) or odd (Thumb).
However, what I would do is simply disassemble it both ways, and see what looks right. As long as you start the disassembly at the start of a function (or any 4-byte boundary), then this will work fine. Most code will produce nonsense when disassembled in the wrong ISA.
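The entry-point check described above can also be done without readelf, by reading the ELF header directly. The following is only a minimal sketch: it assumes a 32-bit little-endian ARM ELF image already read into memory, and the names classify_entry and entry_isa are made up for illustration. It says nothing about code elsewhere in a mixed-mode binary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Result of inspecting the ELF header's entry point. */
enum entry_isa { ENTRY_UNKNOWN = -1, ENTRY_ARM = 0, ENTRY_THUMB = 1 };

/*
 * Classify the entry point of an ARM ELF image held in memory.
 * Assumption: only ELF32, little-endian, e_machine == EM_ARM is handled;
 * anything else is reported as ENTRY_UNKNOWN.
 */
enum entry_isa classify_entry(const uint8_t *image, size_t len)
{
    static const uint8_t magic[4] = { 0x7f, 'E', 'L', 'F' };

    /* An ELF32 header is 52 bytes long and starts with the magic number. */
    if (len < 52 || memcmp(image, magic, 4) != 0)
        return ENTRY_UNKNOWN;

    /* e_ident[4] is the class (1 = 32-bit), e_ident[5] the data
     * encoding (1 = little-endian). */
    if (image[4] != 1 || image[5] != 1)
        return ENTRY_UNKNOWN;

    /* e_machine is a 16-bit field at offset 18; EM_ARM is 40. */
    uint16_t machine = (uint16_t)(image[18] | (image[19] << 8));
    if (machine != 40)
        return ENTRY_UNKNOWN;

    /* e_entry is a 32-bit field at offset 24 in an ELF32 header. */
    uint32_t entry = (uint32_t)image[24]
                   | ((uint32_t)image[25] << 8)
                   | ((uint32_t)image[26] << 16)
                   | ((uint32_t)image[27] << 24);

    /* Bit 0 set means the entry point is Thumb code. */
    return (entry & 1u) ? ENTRY_THUMB : ENTRY_ARM;
}
```

Reading the first 52 bytes of the file into a buffer and passing them to classify_entry is enough; the rest of the file is not touched.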


How to perform the most basic gate-level operations in a computer?

How can I use my hardware directly to perform an operation at the level of bits in my computer without a programming language (and if possible even without the help of the kernel)?
For example, some C code such as:
unsigned char a = 0;
unsigned char b = 1;
unsigned char c = a|b;
I want to do this (request 3 bytes of memory and modify those bytes) directly with my computer hardware and without using any programming language (i.e. I want to write the machine code myself), and if possible even without the help of the kernel. How can I do it, and where can I learn about this?
I am currently using Ubuntu 18 with kernel 5.4.0-52-generic, on a laptop with an 8th-gen Intel Core i5. Let me know if I need to be more specific about my system specs.
First, of course, you need the processor documentation that defines the instruction set, including the machine-code encodings.
Next you would write something that is not dead code:
unsigned int fun ( unsigned int a, unsigned int b )
{
    return(a|b);
}
Intentionally not using x86; this is the ARM disassembly:
Disassembly of section .text:
00000000 <fun>:
0: e1800001 orr r0, r0, r1
4: e12fff1e bx lr
Then you can examine that machine code against the documentation and understand the encoding of these instructions.
That is all quite trivial.
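As a small illustration of examining the encoding, here is a sketch that splits the word 0xe1800001 from the disassembly into the fields of an ARM data-processing instruction. The struct and function names are made up for this example, and only the register-operand form without a shift is considered; check the field positions against the architecture manual for your core.

```c
#include <stdint.h>

/* Fields of an ARM (32-bit) data-processing instruction. */
struct dp_fields {
    unsigned cond;    /* bits 31..28: condition code (0xE = always)     */
    unsigned imm;     /* bit 25: 1 if operand 2 is an immediate         */
    unsigned opcode;  /* bits 24..21: operation (0xC = ORR)             */
    unsigned s;       /* bit 20: 1 if the flags are updated             */
    unsigned rn;      /* bits 19..16: first source register             */
    unsigned rd;      /* bits 15..12: destination register              */
    unsigned rm;      /* bits 3..0: second source register (imm == 0)   */
};

/* Extract the fields by shifting and masking; no validation is done. */
struct dp_fields decode_dp(uint32_t insn)
{
    struct dp_fields f;
    f.cond   = (insn >> 28) & 0xf;
    f.imm    = (insn >> 25) & 0x1;
    f.opcode = (insn >> 21) & 0xf;
    f.s      = (insn >> 20) & 0x1;
    f.rn     = (insn >> 16) & 0xf;
    f.rd     = (insn >> 12) & 0xf;
    f.rm     = insn & 0xf;
    return f;
}
```

For 0xe1800001 this gives cond = 0xE, opcode = 0xC, rn = 0, rd = 0 and rm = 1, which matches orr r0, r0, r1.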
The biggest issue is: how do you plan to run this code? You say without the kernel; that means no operating system, which means bare metal, which means you have to boot the processor, and that takes detailed processor and system (chip/motherboard/media/etc.) information. It is obviously possible, since your computer boots and runs, as does a microcontroller (a significantly better choice for this kind of education; avoiding x86 as a first ISA is also a good choice). Even better is an emulator/simulator, because you have better visibility and it is un-brickable, whereas hardware is not hard to brick, and sometimes not hard to let the smoke out of, with bad software.
Based on your question, you need to crawl before you walk, and walk before you run. Start with simple functions as shown. With certain tools (GNU is your friend but also your enemy, as it takes time to master, but it is very feature-rich) you can write the machine code yourself and feed it to an assembler:
.globl fun
fun:
.inst 0xe1800001
.inst 0xe12fff1e
Disassembly of section .text:
00000000 <fun>:
0: e1800001 orr r0, r0, r1
4: e12fff1e bx lr
If you don't want to use an assembler and want to create the binary file yourself, then you have to read up on the file format. Many are published, and writing one out is a somewhat trivial task, even (or especially) if you write a tool from scratch and do not rely on libraries, but it depends on you and the file format. I am not sure how many tangents and research projects you are really interested in. They all have value, but what is it you really want to learn, and what is the priority of your desires?
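The simplest "file format" is no format at all: a raw, headerless image containing only the instruction bytes. The sketch below writes the two instruction words from the example that way. It assumes a little-endian target, and the function names and the idea of using a flat image are illustrative choices, not something your simulator or loader is guaranteed to accept.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Store each 32-bit word least-significant byte first.
 * `out` must have room for 4 * count bytes. Returns bytes written. */
size_t encode_words_le(const uint32_t *words, size_t count, uint8_t *out)
{
    for (size_t i = 0; i < count; ++i) {
        out[4 * i + 0] = (uint8_t)(words[i] & 0xff);
        out[4 * i + 1] = (uint8_t)((words[i] >> 8) & 0xff);
        out[4 * i + 2] = (uint8_t)((words[i] >> 16) & 0xff);
        out[4 * i + 3] = (uint8_t)((words[i] >> 24) & 0xff);
    }
    return 4 * count;
}

/* Write the words to `path` as a raw image. Returns 0 on success, -1 on error. */
int write_flat_binary(const char *path, const uint32_t *words, size_t count)
{
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    int ok = 1;
    for (size_t i = 0; i < count && ok; ++i) {
        uint8_t bytes[4];
        encode_words_le(&words[i], 1, bytes);
        ok = (fwrite(bytes, 1, sizeof bytes, f) == sizeof bytes);
    }
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling write_flat_binary with the words 0xe1800001 and 0xe12fff1e produces an 8-byte file. A real object format such as ELF adds headers and section tables on top of these same bytes.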
You are looking at potentially months to years of work, depending on what you are really after. As hard as it is to deal with, the best path is to use the available tools and sandboxes first and replace them later as desired, rather than writing everything from scratch up front without any help from existing tools.
If you want to build a better hammer, you start by using the hammers that exist, decide what you do and don't like, then make your own. You don't just go off without ever having used one, try to create one, and expect any kind of success, or success within any kind of reasonable schedule.
A big problem with doing everything from scratch is that you need to get the bytes into flash or RAM or some other media so that the processor can fetch and run them. This you likely cannot do without some tools: a fully working computer with an operating system that holds your raw bytes, plus hardware tools capable of programming a flash, all used together to program your raw bits into that flash. For some flashes you could probably use switches (you have to solve the bouncing problem), like those on the front of an old DEC machine, and toggle your way through the protocol used to program the device, thus needing only pencil or pen, paper, switches, and wiring as tools.
You are far better off with an instruction set simulator and, depending on the file formats it supports, rolling your own binary creation tool or using an assembler or something similar to make the file to feed the simulator. Or, even better, write your own simulator: you learn the instruction set better than most seasoned professionals that way, and then of course you can create your own binary creation tool to match. You will most likely fail if you have not taken an existing instruction set and set of tools and learned to program at the assembly/machine-language level, seen how it works, seen the instructions used and generated, and, with the documentation, seen the encoding.
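To give a feel for how small a first simulator can be, here is a sketch that executes exactly the two instructions from the fun example and nothing else. It is not a real ARM simulator: it handles only the "always" condition, ORR with an unshifted register operand, and BX, and it ignores details such as the Thumb state switch and the PC reading ahead of the current instruction. All names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Register file: r13 = sp, r14 = lr, r15 = pc. */
struct cpu { uint32_t r[16]; };

/* Execute one instruction word. Returns 0 on success, -1 if not handled. */
int step(struct cpu *c, uint32_t insn)
{
    if ((insn >> 28) != 0xE)
        return -1;                            /* only condition "always" */

    /* BX Rm: bits 27..4 are 0001 0010 1111 1111 1111 0001. */
    if ((insn & 0x0ffffff0u) == 0x012fff10u) {
        c->r[15] = c->r[insn & 0xf] & ~1u;    /* bit 0 would select Thumb */
        return 0;
    }

    /* ORR Rd, Rn, Rm: no shift, flags unchanged. */
    if ((insn & 0x0ff00ff0u) == 0x01800000u) {
        unsigned rn = (insn >> 16) & 0xf;
        unsigned rd = (insn >> 12) & 0xf;
        unsigned rm = insn & 0xf;
        c->r[rd] = c->r[rn] | c->r[rm];
        c->r[15] += 4;                        /* advance to next instruction */
        return 0;
    }

    return -1;
}

/* Run code loaded at address 0 with a and b in r0 and r1; returns r0. */
uint32_t run(const uint32_t *code, size_t nwords, uint32_t a, uint32_t b)
{
    struct cpu c = {{0}};
    const uint32_t sentinel = 0xfffffff0u;    /* fake return address in lr */

    c.r[0] = a;
    c.r[1] = b;
    c.r[14] = sentinel;

    while (c.r[15] != sentinel) {
        uint32_t index = c.r[15] / 4;
        if (index >= nwords || step(&c, code[index]) != 0)
            break;                            /* ran off the code or hit an unknown word */
    }
    return c.r[0];
}
```

Feeding run the words 0xe1800001 and 0xe12fff1e with arguments 0 and 1 returns 1, the same result as the C function. Growing the decoder one instruction at a time from here is how the encoding tables in the documentation start to make sense.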
With most new instruction sets, the software folks have direct access to the silicon folks (they can walk down the hall to the person's office, or call them on the phone) and can ask questions about the encoding of an instruction. Since you cannot do that, you have to ask existing, debugged tools instead. So, as I started way above: ask the compiler, ask the assembler. Then disassemble the result, assume the assembler produced the right instruction, and compare that with the processor documentation (understanding that both the tools and the documentation can have bugs, so you have to work through that).

Mapping inputs to an array (A better way?)

I'm working in an embedded system and have "mapped" some defines to an array for inputs.
volatile int INPUT_ARRAY[40];
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]
// section 2
if ( INPUT01 && INPUT02 ) {
    writepin(outputpin, value);
}
If I want to read from Input 1, I can simply say newvariable = INPUT01, or I can compare data with Input 1, as in section 2 of my code. I'm not sure if this is a normal way of mapping the name INPUT01 to an array position, or of representing an input pin in the first place. Each array value represents a binary pin, and the values are read into the array by decoding a 16-bit port value. Question: Is using the defines and array like this reasonably efficient?
Yes, your solution is efficient.
Before the C compiler even sees your code, the C preprocessor substitutes INPUT_ARRAY[0] for INPUT01 and, similarly, INPUT_ARRAY[1] for INPUT02; so this substitution uses zero time and zero power at run time.
Moreover, when the C compiler sees INPUT_ARRAY[1] in the preprocessed code, it adds 1 at compile time to the base address of INPUT_ARRAY. Therefore, you get maximal efficiency at run time.
Admittedly, were you manually to turn your C compiler's optimizer off, as with the -O0 option of GCC, then it is conceivable that the compiler would emit assembly code to add the 1 at run time. So don't do that.
The only likely exception to the foregoing would be the case that the base address of INPUT_ARRAY were unknown to the compiler at compile time, not likely because INPUT_ARRAY were dynamically allocated on the heap (which would make little sense for hardware device addressing), but likely because the base address of INPUT_ARRAY were configurable during boot via device configuration registers. Some hardware does this, but if yours does, why, that is exactly the reason your MCU (or MPU) possesses an index-offset indirect addressing mode in the first place. Though this mode engages the MCU's integer arithmetic unit, [a] the mode does not multiply (multiplication being a power-hungry operation); and, [b] anyway, the mode is such a normal, often-used mode that MCUs are invariably designed to support it efficiently: not perhaps as efficiently as precomputed direct addressing, but as efficiently as one can reasonably expect for such a use. The MCU's manufacturer knows that device pins are things you need to address. The engineer who designed your MCU will have given priority to making the index-offset indirect mode as efficient as possible for this and other reasons.
(You could maybe still cheat the matter to save a few millijoules via self-modifying code, if your MCU even allowed that; but, as an engineer, you'd regret the cheat, I suspect, unless security and maintainability were non-issues to you. The problem probably is not much of a real problem. Index-offset indirect addressing is the normal technique when the base address remains unknown until run time. If you really need to save that last millijoule, then you might not be using a C compiler for your code's inner loop, anyway, but might be handcrafting assembly code.)
I suspect that you would find it instructive to tell your compiler to emit assembly code for your inspection. I do not know which compiler you are using but, if you were using GCC, you could run gcc -S myfile.c.
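For completeness, here is a sketch of the scheme from the question with the port-decoding step filled in, so that it can be compiled and inspected as suggested. The decode_port function, the INPUT_COUNT constant, and the choice that bit 0 of the port maps to the first slot are assumptions; the question does not show how the real decoding is done.

```c
#include <stdint.h>

#define INPUT_COUNT 40                 /* array size from the question */
volatile int INPUT_ARRAY[INPUT_COUNT];

/* The preprocessor replaces these names with constant-index accesses. */
#define INPUT01 INPUT_ARRAY[0]
#define INPUT02 INPUT_ARRAY[1]

/*
 * Unpack one 16-bit port reading into 16 consecutive array slots,
 * starting at index `first`. Assumption: bit 0 corresponds to `first`.
 * Slots that would fall outside the array are skipped.
 */
void decode_port(uint16_t port_value, int first)
{
    if (first < 0)
        return;
    for (int bit = 0; bit < 16 && first + bit < INPUT_COUNT; ++bit)
        INPUT_ARRAY[first + bit] = (port_value >> bit) & 1;
}
```

After decode_port(0x0003, 0), both INPUT01 and INPUT02 read as 1. The constant indices in the macros are the part the compiler can resolve ahead of time; the loop index inside decode_port is computed at run time.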

How do I determine the start and end of instructions in an object file?

So, I've been trying to write an emulator, or at least understand how stuff works. I have a decent grasp of assembly, particularly z80 and x86, but I've never really understood how an object file (or in my case, a .gb ROM file) indicates the start and end of an instruction.
I'm trying to parse out the opcode for each instruction, but it occurred to me that it's not like there's a line break after every instruction. So how does this happen? To me, it just looks like a bunch of bytes, with no way to tell the difference between an opcode and its operands.
For most CPUs - and I believe Z80 falls in this category - the length of an instruction is implicit.
That is, you must decode the instruction in order to figure out how long it is.
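A sketch of what "decode it to find its length" looks like for the Game Boy CPU follows. The table is deliberately incomplete: it covers only a handful of opcodes and returns 0 for everything else, and the function names are invented for this example. A real decoder needs an entry for every opcode in the instruction set documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the instruction that starts with `opcode`,
 * or 0 if this small table does not cover it. */
size_t gb_insn_length(uint8_t opcode)
{
    switch (opcode) {
    case 0x00: /* NOP          */
    case 0xAF: /* XOR A        */
    case 0xC9: /* RET          */
        return 1;
    case 0x06: /* LD B,d8      */
    case 0x18: /* JR r8        */
    case 0x3E: /* LD A,d8      */
    case 0xCB: /* CB prefix + second opcode byte */
    case 0xE0: /* LDH (a8),A   */
    case 0xFE: /* CP d8        */
        return 2;
    case 0x01: /* LD BC,d16    */
    case 0xC3: /* JP a16       */
    case 0xCD: /* CALL a16     */
    case 0xEA: /* LD (a16),A   */
        return 3;
    default:
        return 0;
    }
}

/*
 * Walk a buffer from its first byte, recording the offset at which each
 * instruction begins. Stops at an opcode the table does not cover or at
 * a truncated instruction. Returns the number of offsets stored.
 */
size_t gb_find_starts(const uint8_t *code, size_t len,
                      size_t *starts, size_t max_starts)
{
    size_t pos = 0, n = 0;
    while (pos < len && n < max_starts) {
        size_t ilen = gb_insn_length(code[pos]);
        if (ilen == 0 || pos + ilen > len)
            break;
        starts[n++] = pos;
        pos += ilen;
    }
    return n;
}
```

For the bytes 3E 01 00 C3 50 01 the walk finds instructions at offsets 0, 2 and 3. A straight-line walk like this only works if it starts at a real instruction boundary and the buffer contains no data.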
If you're writing an emulator you don't really ever need to be able to obtain a full disassembly. You know what the program counter is now, you know whether you're expecting a fresh opcode, an address, a CB page opcode or whatever and you just deal with it. What people end up writing, in effect, is usually a per-opcode recursive descent parser.
To get to a full disassembler, most people use some mild simulation, recursively tracking control flow. Instructions are found by following the flow; whatever is left over is deduced to be data.
Not so much on the GB where storage was plentiful (by comparison) and piracy had a physical barrier, but on other platforms it was reasonably common to save space or to effect disassembly-proof code by writing code where a branch into the middle of an opcode would create a multiplexed second stream of operations, or where the same thing might be achieved by suddenly reusing valid data as valid code. One of Orlando's 6502 efforts even re-used some of the loader text — regular ASCII — as decrypting code. That sort of stuff is very hard to crack because there's no simple assembly for it and a disassembler therefore usually won't be able to figure out what to do heuristically. Conversely, on a suitably accurate emulator such code should just work exactly as it did originally.

arm (bare metal): call binary file as function

I have AT91Bootloader for an AT91SAM9 ARM controller. I need to add some extra hardware initialization, but I have only a compiled .bin file.
I loaded bin file to memory and tried to call it:
((void (*)())0x00005000)();
But I got no results. Please use as little assembler as possible. I was introduced to assembler before, but I cannot understand ARM assembler due to its complexity. How can I make a call from the middle of the bootloader, execute the bin file (it will be in some memory sector, 0x00005000 for example), and then return to the bootloader and continue executing its own code?
If ARM asm is "too complex", you will find it very difficult to debug any problems you're having. Basic* ARM assembly is one of the least complex assembly languages I've come across.
Your code ought to work (though I would not use a hard-coded address there) provided the ".bin" is of the correct format. Common issues:
The entry point should be ARM code; some compilers default to Thumb. It's possible (if a little tricky) to make Thumb code work.
The entry point needs to be at the start of the file. Without disassembling, it's hard to tell if you've done this correctly.
The linker will insert "thunks" (a.k.a. "stubs") where necessary. A quirk in some linkers means that the thunk can be placed before the entry point. You can work around this by using --stub-group-size=-1 (docs here).
* Ignoring things like Thumb/VFP/NEON which you don't need to get started.
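The call from the question, with the ARM/Thumb point made explicit, might be sketched as below. The make_entry helper is hypothetical. It relies on the convention that bit 0 of a branch target selects Thumb state, and it assumes the compiler emits an interworking branch (BX or BLX) for calls through a function pointer and that the image was linked to run at the address it was loaded to.

```c
#include <stdint.h>

/* Signature of the loaded image's entry point: no arguments, returns
 * to the caller when done. */
typedef void (*entry_fn)(void);

/*
 * Build a callable pointer from the address the .bin was loaded at.
 * For ARM code bit 0 must be clear; for Thumb code it must be set.
 */
entry_fn make_entry(uintptr_t load_addr, int is_thumb)
{
    uintptr_t target = is_thumb ? (load_addr | 1u)
                                : (load_addr & ~(uintptr_t)1);
    return (entry_fn)target;
}
```

In the bootloader this would be used as make_entry(0x00005000, 0)(); for an ARM-mode image. For control to come back, the loaded code has to return with the link register intact, like an ordinary function.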
ARM assembly is one of the simpler ones, very straightforward. If you want to continue doing bare metal, you are going to need to learn at least some assembly, for example to understand Alexey's comment.
The instruction you are looking for is BX, which branches to an address. The assembly you need to branch to the code your bootloader downloaded is:
.globl tramp
tramp:
bx r0
The C prototype is
void tramp ( unsigned int address );
As mentioned in the comments, the program needs to be compiled for the address you are running it from and/or it needs to be position-independent; otherwise it won't work. You also need to build the application with the proper entry point: if it is a raw binary and you branch to the address where the binary was loaded, the binary must be startable that way, with the first word in the binary being the entry point for execution.
Also understand that an ELF file contains the data you want to load, but as a whole is not the data you want to load. It is a "binary file", yes, but to run the program contained in it you need to parse it, extract the loadable portions, and load them in the right places.
If you don't know what those terms mean, use Google and/or search SO; the answers are there.

C program for display assembly in binary files

I'm trying to display the assembly instructions in a binary file. How can I do that?
How can I know whether an argument of MOV (for example) is a pointer or a number?
This is for educational purposes; I know that there are GDB and other tools.
Thanks in advance!
You mean a disassembler? Then you have many tools to pick from, such as:
OllyDbg
IDA
objdump
If you want to integrate this into an existing program, then you need a disassembly engine, such as BeaEngine or diStorm.
You can utilize many of the libraries inside binutils like BFD and opcodes.
BFD: the Binary File Descriptor library, for low-level manipulation of object files.
opcodes: a library used to assemble and disassemble machine instructions.
You might find useful information from the source to an emulator, which has to perform the same decoding task before performing the simulated instruction.
I highly recommend targeting a small subset first, ideally the bare 8086, and then adding extensions in the same sequence in which they historically appeared. This will help you decide what to ignore when looking for more information, so that you do not get overwhelmed.
For the MOV operation, the operands are specified (in the most general form) by the second byte, the MOD-REG-R/M byte. Operands are almost always registers or memory locations (pointers, possibly constructed on the fly using "indexing registers"). Only a few instructions accept a literal operand (a number), and only as the source; they are clearly marked in the table on page 180 of the 1979 8086 manual.
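Splitting that second byte into its three fields is the first step in telling a register operand from a memory operand. A minimal sketch, with made-up names, that covers only the field extraction and the register-versus-memory test:

```c
#include <stdint.h>

/* The three fields of the MOD-REG-R/M byte. */
struct modrm {
    unsigned mod;  /* bits 7..6: addressing mode (3 = register direct) */
    unsigned reg;  /* bits 5..3: register operand                      */
    unsigned rm;   /* bits 2..0: register or memory operand            */
};

/* Split the byte by shifting and masking. */
struct modrm split_modrm(uint8_t byte)
{
    struct modrm m;
    m.mod = (byte >> 6) & 0x3;
    m.reg = (byte >> 3) & 0x7;
    m.rm  = byte & 0x7;
    return m;
}

/* Returns 1 if the r/m operand names a register, 0 if it refers to memory. */
int rm_is_register(struct modrm m)
{
    return m.mod == 3;
}
```

For the bytes 89 D8 (mov ax, bx in 16-bit code), the second byte 0xD8 splits into mod = 3, reg = 3 (BX) and rm = 0 (AX), so both operands are registers. For 8B 07 (mov ax, [bx]), the byte 0x07 has mod = 0, so the r/m operand is a memory location.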
