Indexed addressing in x86 assembly - is something like mov array[ebx],eax a valid instruction? - arrays

My understanding is that indexed addressing essentially produces an address whose offset is the number in the brackets. Is this understanding correct? But I also understand that this address is dereferenced somehow. I don't understand exactly how this works. The book Assembly Language for Intel-Based Computers shows a lot of instructions of the form mov reg,array[reg] that move data from an indexed array element into a register. But I need a way to move data back from the register into the array. How can I do this? Do I use the opposite of that, which would be mov array[reg],reg? Or would this dereference array[reg] and move the data into the address given by the value stored in that array element?
For example, suppose the array index is 3 and it's stored in register EBX, and I want to move the value stored in EAX into that array element. The value currently stored in that array element is 500 hex. If I use the instruction mov array[ebx],eax, will this instruction move the value in EAX into array[3], or will it move it into memory location 500 hex? And if it's the latter case, what instruction can I use to avoid this effect and do what I actually want to do, which is move the data into array[3]?
Note: The syntax I am using is for MASM. I do not have MASM installed on my machine, and it's not really an option since I'm using Ubuntu. But the book I'm reading is written for MASM, so I'm learning MASM first, just to get a feel for how x86 assembly language works. I'm not assembling any programs, but I'd like to understand them.
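For reference, here is a minimal C sketch of what such a store does, assuming MASM's array[ebx] denotes the effective address array + EBX, with EBX acting as a byte offset rather than an element index. The helper names store_indexed and load_indexed are hypothetical and exist only for this illustration. The point it tries to show is that the old contents of the element (500 hex in the example) are simply overwritten and are never used as an address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of `mov array[ebx], eax`:
   compute the effective address array + ebx (ebx is a byte offset)
   and store the 32-bit value there. Whatever was stored at that
   location before is overwritten, not dereferenced. */
void store_indexed(uint8_t *array, size_t size, uint32_t ebx, uint32_t eax)
{
    assert((size_t)ebx + sizeof eax <= size);  /* stay inside the array */
    memcpy(array + ebx, &eax, sizeof eax);
}

/* Hypothetical model of `mov eax, array[ebx]`:
   load 32 bits from the same effective address. */
uint32_t load_indexed(const uint8_t *array, size_t size, uint32_t ebx)
{
    uint32_t eax;
    assert((size_t)ebx + sizeof eax <= size);
    memcpy(&eax, array + ebx, sizeof eax);
    return eax;
}
```

Under this model, reaching element 3 of an array of 4-byte values needs an offset of 12 in EBX; with 32-bit registers MASM also accepts a scaled form such as array[ebx*4] with EBX = 3.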

Related

Is there any way to analyze the "type" of register in x86 assembly source code?

So basically what I am trying to do is distinguish data from memory addresses while analyzing assembly code.
Here is an example I can hardly deal with.
Suppose we have a variable val declared in .data section.
0x08048054 01 00 00 00
and here is one line of assembly code obtained by disassembling the ELF file.
mov $0x08048054, %eax
So probably this is an indirect reference of variable val, like this :
mov $0x8048054,%eax
mov %edx,0x4(%esp)
mov %eax,(%esp)
call printf
then I will transform $0x8048054 into variable name val like this:
mov $val,%eax
mov %edx,0x4(%esp)
mov %eax,(%esp)
call printf
But there is another situation, 0x8048054 is just used as a number in one calculation:
mov $0x8048054,%eax
add 0x8(%ebp), %eax
which is probably equal to (I know we can hardly see this in real code, but it is a possibility)
b = 0x8048054 + argc;
and in this situation, I should not rewrite $0x8048054 as val.
So what I am thinking is that if I can figure out the type of the %eax register, I can probably distinguish these two situations.
In the first situation, %eax's type is pointer;
in the second one, its type is integer.
Am I on the right track?
Could anyone give me some help?
Thank you!
One view of "type" is the set of operations which apply to a value.
So, the way to understand the "type" of a value in a register (or a memory location(s)), is to determine what operations the program applies to it. Each operation applied to the register suggests a set of possible types the value may be, e.g., "type constraints".
If a register is used in an operation to determine an address, which in turn causes a memory fetch (the x86 LEA instruction "forms an address", but doesn't cause a memory fetch!), then it is some kind of pointer. The kind of memory fetch hints at the type of the pointer: if it is a byte fetch, it might be a "pointer to char"; if it is a fetch of a value to a floating point unit, it may be a "pointer to a double". So, the way in which the register is used establishes some type constraints (e.g., "may be type T").
If the register is added to another, or added-to, it may be a pointer (e.g., pointer arithmetic) or a number (integer or natural). If the register is multiplied or divided, it probably isn't a pointer.
But these analyses are limited to what you can determine by direct inspection of the few instructions which use the value of the register (e.g., those instructions that can be "reached" by the specific register value).
However, many machine operations are only copying values, often through registers. What you really want to do is a data flow analysis of where the register value came from, and where it goes to. All operators on the value which flows into, is in, or flows out of, the register should be used to establish type constraints. A better characterization of the type is the intersection of the type constraints of the value that (data)flowed through the register. (You have to worry about whether an invisible coercion has occurred: a pointer to a string can be "invisibly converted" into a pointer to its first character on many architectures, without any specific machine instructions.)
So your type inference process needs to do dataflow analysis on the whole program (and since some of the data flow depends on the type of values, this may be iterative), estimate the intersection of the types of each value, and then consider whether implied conversions may take place. (you may do this inference process in your head, but if you have to do it on a big program you will really need tools to manage the sheer volume of data).
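A minimal sketch of the "intersection of type constraints" idea, written in C. Everything here is an assumption made for illustration: the two-element type set, the use_kind categories, and the function names are hypothetical, and a real analysis would first need the data flow information described above to collect the uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type set: each bit is one type the value might still have. */
enum {
    T_INT = 1u << 0,
    T_PTR = 1u << 1,
    T_ANY = T_INT | T_PTR
};

/* Hypothetical kinds of use observed along a value's data flow. */
typedef enum { USE_COPY, USE_ADD, USE_MUL, USE_DEREF } use_kind;

/* Constraint that a single use places on the value. */
unsigned constraint_for(use_kind u)
{
    switch (u) {
    case USE_DEREF: return T_PTR;  /* forms an address that is then fetched */
    case USE_MUL:   return T_INT;  /* multiplied or divided: probably a number */
    case USE_ADD:   return T_ANY;  /* pointer arithmetic or plain addition */
    case USE_COPY:  return T_ANY;  /* a copy tells us nothing by itself */
    }
    return T_ANY;
}

/* Intersect the constraints of every use the value reaches. */
unsigned infer_type(const use_kind *uses, size_t n)
{
    unsigned t = T_ANY;
    assert(uses != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        t &= constraint_for(uses[i]);
    return t;
}
```

An empty result (0) means the observed uses contradict each other, which is where an implied conversion would have to be considered.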
In general, you can't do this perfectly; one can easily turn type inference into a Turing-halting problem:
if Turing(x) then op1(register1) else op2(register1) endif
[so, is register1 always used only in op1 or only in op2?] So you have to take your estimates of the type with a grain of salt.
Looks like you're on the right track - in general, the difference between a pointer and a number that just happens to look like a memory address is that a pointer will be dereferenced somewhere. Obviously you can only observe this when it happens, so you're going to have to analyse the code for the lifetime of that value to see how it's used.
If a value ends up in a register that is then used as the base register for a memory operation, it was a pointer. Anything else is a number-that-looks-like-a-pointer until proven otherwise. There might be short-cuts like seeing it passed as an argument to a function that you know takes a pointer (if you can assume the code is correct in the first place).
The complication comes in the fact that that value may be loaded, added to another value, shoved on the stack, passed around, stashed in another variable, etc., and eventually reloaded and dereferenced by a completely different part of the program.
For more ideas, I'd suggest looking at what the OS program loader does, since that typically needs to detect and fix up pointers, particularly for relocatable code.

Creating and addressing array NASM

I just made a snake game in 8086 assembly and tried to assemble it with NASM. I discovered that I must adapt my program to NASM's syntax. First, I'd be glad if someone could list all the changes NASM requires. Second, the terminal gives me the following message: "comma, colon or end of line expected".
The data segment
BOARDARR: TIMES 1896 DB 0
The code segment
mov bx, 3d7h
mov BOARDARR[BX], 1
Can someone please help me? thanks.
comma, colon or end of line expected in this case is caused by an improper syntax of the code itself, namely mov BOARDARR[BX], 1. In NASM, all memory references need to be made in brackets, in which the effective address of the operand is calculated. Therefore, what you want is (I assume) mov [BOARDARR+BX], 1, which will cause a 1 to be written to the address BOARDARR + 3d7h.
However, doing only that correction will cause another error related to the operand size not being specified. Since NASM doesn't care about variable types, it doesn't care that your BOARDARR was declared with a db and treats it as an ordinary, untyped chunk of memory, not an array of byte-sized elements.
In order to remedy this, you need to explicitly state the size of the operand that you want to write to the specified address, since - even in real mode, which I assume that you're using - MOV with a memory operand has two flavors: byte-sized and word-sized. In this case, you have two options to write that instruction:
mov [BOARDARR+BX], byte 1, which will cause 01 to be written to BOARDARR+BX, or
mov [BOARDARR+BX], word 1, which will cause 01 00 (in that particular order, since x86 is little-endian) to be written to BOARDARR+BX.
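The difference between the two stores can be modelled with a small C sketch. This is only an illustration under the assumption that the board is a plain byte buffer of 1896 bytes; the names BOARD_SIZE, store_byte and store_word are hypothetical. The word store writes the low byte first, which is what little-endian means here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOARD_SIZE 1896  /* matches TIMES 1896 DB 0 */

/* Model of `mov [BOARDARR+BX], byte 1`: exactly one byte is written. */
void store_byte(uint8_t *board, size_t bx, uint8_t value)
{
    assert(bx < BOARD_SIZE);
    board[bx] = value;
}

/* Model of `mov [BOARDARR+BX], word 1`: two bytes are written,
   low byte at the lower address, high byte right after it. */
void store_word(uint8_t *board, size_t bx, uint16_t value)
{
    assert(bx + 1 < BOARD_SIZE);
    board[bx]     = (uint8_t)(value & 0xFF);
    board[bx + 1] = (uint8_t)(value >> 8);
}
```

In a board of byte-sized cells the word form would also clear the neighbouring cell, so the byte form is most likely the one wanted.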
Hope this clears things up.

Displaying PSW content

I'm a beginner with asm. I've been researching my question for a while, but the answers were unsatisfactory. I'm wondering how to display the PSW content on standard output. Also, how do I display the Instruction Pointer value? I would be very grateful if you could give me a hint (or better, a sketch of code). It may be MASM or 8086 as well (actually I don't know what the difference is :) )
The instruction pointer is not directly accessible on the x86 family. However, it is quite straightforward to retrieve its value, with the caveat that what you get is the address of the instruction following a call, not of the instruction executing at the moment you look at it.
Since a subroutine call places the return address on the stack, you just need to copy it from there and voilà! You have the address of the opcode following the call instruction:
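; returns in ax the address of the instruction after the call (near call assumed)
proc getInstructionPointer
push bp
mov bp,sp
mov ax,[word ptr ss:bp + 2] ; the return address sits just above the saved bp
mov sp,bp
pop bp
ret
endp getInstructionPointer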
The PSW on the x86 is called the Flags register. There are two operations that explicitly reference it: pushf and popf. As you might have guessed, you can simply push the Flags onto the stack and load it to any general purpose register you like:
pushf
pop ax
Displaying these values consists of converting their values to ASCII and writing them onto the screen. There are several ways of doing this - search for "string output assembly", I bet you find the answer.
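The conversion step itself is the same in any language. As a rough sketch in C (the function names word_to_hex and flag_is_set are hypothetical), a 16-bit value such as the Flags register or an offset can be turned into four hexadecimal digits one nibble at a time, and individual status flags can be tested with a bit mask. The bit positions used below (CF = bit 0, ZF = bit 6, SF = bit 7, OF = bit 11) are those of the 8086 Flags register.

```c
#include <assert.h>
#include <stdint.h>

/* Write a 16-bit value as four uppercase hex digits plus a terminating NUL. */
void word_to_hex(uint16_t value, char out[5])
{
    static const char digits[] = "0123456789ABCDEF";
    assert(out != 0);
    for (int i = 0; i < 4; i++) {
        /* Take nibbles from the most significant to the least significant. */
        out[i] = digits[(value >> (12 - 4 * i)) & 0xF];
    }
    out[4] = '\0';
}

/* Bit masks of some status flags within the Flags register. */
enum {
    FLAG_CF = 1u << 0,
    FLAG_ZF = 1u << 6,
    FLAG_SF = 1u << 7,
    FLAG_OF = 1u << 11
};

/* Return 1 if the flag selected by mask is set, 0 otherwise. */
int flag_is_set(uint16_t flags, unsigned mask)
{
    return (flags & mask) != 0;
}
```

The same loop translates directly to assembly: shift or rotate by four bits, mask with 0Fh, and look the digit up in a 16-byte table before printing it.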
To dispel a minor confusion: 8086 is the CPU itself, whereas MASM is the assembler. The syntax is assembler-specific; MASM assembly is x86 assembly. TASM assembly is x86 assembly as well, just like NASM assembly.
When one says "x86 assembly", they are referring to any of these (or others), talking about the instruction set, not the dialect.
Note that the above examples are 16-bit code intended for the 8086 and won't work on the 80386 and later in 32-bit mode.

Understanding disassembled c code (particularly things like var_28 = dword ptr -28h) (binary bomb lab)

So I'm disassembling some code (binary bomb lab) and need some help figuring out what's going on.
Here's an IDA screen shot:
(there's some jump table stuff and another comparison below, but I feel a bit more comfortable about that stuff (I think))
Now, I think I know what's going on in this phase, as I've read:
http://teamterradactyl.blogspot.com/2007/10/binary-bomb.html (scroll down to phase 3)
However, I'm used to a different form of the assembly.
The biggest thing I don't understand is all this var_28 = dword ptr -28h stuff at the top.
When sscanf gets called, how does it know where to put each token? And there are only going to be three tokens (which is what the link above says, although I see a %d, %d... so maybe two, I think three though). Basically, can anyone tell me what each of these var_x (and arg_0) will point to after sscanf is called?
They are just relative addressing to the stack pointer right...? But how are these addresses getting filled with the tokens from sscanf?
NOTE: This is homework, but it says not to add the homework tag, because it's obsolete or something. The homework is to figure out the secret phrase to enter via the command line to get past each phase.
NOTE2: I don't really know how to use IDA, my friend just told me to open the bomb file in IDA. Perhaps there's an easy way for me to experiment and figure it out in IDA, but I don't know how.
Local variables are stored just below the frame pointer. Arguments are above the frame pointer. x86 uses BP/EBP/RBP as a frame pointer.
A naïve disassembly would just disassemble lea eax, [ebp+var_10] as lea eax, [ebp-10h]. This instruction is referencing a local variable whose address is 10h (16 bytes) below where the frame pointer points. LEA means Load Effective Address: it loads the address of the variable at [ebp - 10h] into eax, so eax now contains a pointer to that variable.
IDA apparently is trying to give meaningful names to local variables, but since apparently there is no debug info available it ends up using dummy names. Anyway:
var_10= dword ptr -10h
is just IDA's way of telling you that it has created an alias var_10 for the offset -10h.
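As for how sscanf knows where to put each token: it is handed the addresses of the locals as arguments, which is what those lea instructions compute. A minimal C sketch of the same pattern follows; the function name, the variable names and the "%d %d" format string are hypothetical stand-ins, since the real format string and layout of the bomb are not shown here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a phase that reads two numbers. */
int read_two_numbers(const char *input, int *first, int *second)
{
    int var_a = 0;  /* a local; a disassembler would list it as some [ebp+var_N] */
    int var_b = 0;
    int matched;

    assert(input && first && second);

    /* The addresses of the locals are passed as arguments (what `lea` produces);
       sscanf stores each converted token through the matching pointer. */
    matched = sscanf(input, "%d %d", &var_a, &var_b);

    *first = var_a;
    *second = var_b;
    return matched;  /* number of tokens successfully converted */
}
```

The return value is the number of converted tokens, which is why such code usually compares eax against a small constant right after the call.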

How to use relative position in c/assembly?

It's said that position-independent code uses only relative addresses instead of absolute ones. How is this implemented in C and assembly respectively?
Let's take char test[] = "string"; as an example, how to reference it by relative address?
In C, position-independent code is a detail of the compiler's implementation. See your compiler manual to determine whether it is supported and how.
In assembly, position-independent code is a detail of the instruction set architecture. See your CPU manual to find out how to read the PC (program counter) register, how efficient that is, and what the recommended best practices are in translating a code address to a data address.
Position-relative data is less popular now that code and data are separated into different pages on most modern operating systems. It is a good way to implement self-contained executable modules, but the most common such things nowadays are viruses.
On x86, position-independent code in principle looks like this:
call 1f
1: popl %ebx
followed by use of ebx as a base pointer with a displacement equal to the distance between the data to be accessed and the address of the popl instruction.
In reality it's often more complicated, and typically a tiny thunk function might be used to load the PIC register like this:
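load_ebx:
movl (%esp),%ebx # on entry the return address is at the top of the stack
addl $some_offset,%ebx # adjust it to point at the designated special point
ret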
where the offset is chosen such that, when the thunk returns, ebx contains a pointer to a designated special point in the program/library (usually the start of the global offset table), and then all subsequent ebx-relative accesses can simply use the distance between the desired data and the designated special point as the offset.
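The "base register plus displacement" idea can also be mimicked in plain C. The following is only a sketch under assumptions of my own: the image struct, its field names and the offset 8 are hypothetical, and a real compiler does this for you when asked for position-independent code. Data is located by its distance from a base address, so the lookup keeps working after the whole block is moved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module image: data is found via its distance from the
   image base instead of via an absolute address. */
typedef struct {
    char      bytes[32];
    ptrdiff_t test_offset;  /* displacement of the string from the base */
} image;

void image_init(image *img)
{
    memset(img->bytes, 0, sizeof img->bytes);
    img->test_offset = 8;  /* arbitrary placement inside the image */
    strcpy(img->bytes + img->test_offset, "string");
}

/* Plays the role of "ebx + displacement": valid wherever the image lives. */
const char *image_test(const image *img)
{
    assert(img->test_offset >= 0 && (size_t)img->test_offset < sizeof img->bytes);
    return img->bytes + img->test_offset;
}
```

Copying the struct to another address and calling image_test on the copy still yields "string", whereas a stored absolute pointer would keep pointing into the original.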
On other architectures everything is similar in principle, but there may be easier ways to load the program counter. Many simply let you use the pc or ip register as an ordinary register in relative addressing modes.
In pseudo code it could look like:
lea str1(pc), r0 ; load address of string relative to the pc (assuming constant strings, maybe)
st r0, test ; save the address in test (test could also be PIC, in which case it could be relative
; to some register)
A lot depends on your compiler and CPU architecture, as the previous answer stated. One way to find out would be to compile with the appropriate flags (-fPIC -S for gcc) and look at the assembly language you get.

Resources