MIPS Assembly - String (ASCII) Instructions - c

I am writing an assembler in C for MIPS assembly (so it converts MIPS assembly to machine code).
Now MIPS has three different instruction formats: R-Type, I-Type and J-Type. However, in the .data section, we might have something like message: .asciiz "hello world". In this case, how would we convert an ASCII string into machine code for MIPS?
Thanks

ASCII text is not converted to machine code. It is stored as raw bytes, using the standard ASCII encoding (the table you can find on Wikipedia).
MIPS uses this encoding to store ASCII strings. As for .asciiz in particular, it is the string plus a terminating NUL character. So, according to the table, A is 41 in hexadecimal, which is just 0100 0001 in binary. But don't forget the NUL character, so the full encoding is: 0100 0001 0000 0000.
When storing the string, I'd follow the MARS MIPS simulator's approach: start the data section at a known address in memory and resolve any reference to the label message to that location in memory.
Please note that nothing in the data section is R-type, I-type, or J-type. It is just raw data.
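To make that concrete, here is a minimal sketch of how an assembler written in C might handle an .asciiz directive; the buffer and function names (data_seg, emit_asciiz, data_base_address) are hypothetical, not taken from any particular assembler:

#include <stdint.h>
#include <string.h>

/* Hypothetical data-segment image; bytes are appended as directives are seen. */
static uint8_t  data_seg[4096];
static uint32_t data_len = 0;

/* Emit the bytes of an .asciiz string: the raw ASCII bytes plus a trailing NUL.
   Returns the offset of the string inside the data segment so the label
   (e.g. "message") can be bound to data_base_address + offset. */
static uint32_t emit_asciiz(const char *s)
{
    uint32_t offset = data_len;
    size_t   n = strlen(s) + 1;            /* +1 keeps the terminating NUL */

    memcpy(data_seg + data_len, s, n);
    data_len += (uint32_t)n;
    return offset;
}

Any instruction that later references message (for example la $a0, message) is then encoded against that recorded address; the string bytes themselves never become an R-, I-, or J-type instruction.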

Data is not executable and should not be converted to machine code. It should be encoded in the proper binary representation of the data type for your target.

As other answers have noted, the ASCII contained in a .ascii "string" directive is encoded in its raw binary form in the data segment of the object file. What happens from there depends on the binary format the assembler is targeting. Ordinarily data is not encoded into machine code; however, GNU as will happily assemble this:
.text
message:
.ascii "Hello, world"
addi $t1, $zero, 0x1
end:
If you disassemble the output with objdump (I'm using the mips-img-elf toolchain here), you'll see this:
Disassembly of section .text:
00000000 <message>:
0: 48656c6c 0x48656c6c
4: 6f2c2077 0x6f2c2077
8: 6f726c64 0x6f726c64
c: 20090001 addi t1,zero,1
The hexadecimal sequence 48 65 6c 6c 6f 2c 20 77 6f 72 6c 64 spells out "Hello, world".
I came here while looking for an answer as to why GAS behaves like this. MARS won't assemble the above program, giving an error that data directives can't be used in the text segment.
Does anyone have any insight here?

Related

What is in the address of main?

A simple piece of code like this
#include<stdio.h>
int main()
{
return 0;
}
Checking the value of "&main" with gdb, I got 0xe5894855; I wonder what this is?
(gdb) x/x &main
0x401550 <main>: 0xe5894855
(gdb)
0xe5894855 is the hex encoding of the first instructions in main, but since you used x/x, gdb is displaying it as just a hex number, and it appears backwards because x86-64 is little-endian. 55 is the opcode for push rbp, the first instruction of main. Use x/i &main to view the instructions.
Checking the value of "&main" with gdb, I got 0xe5894855; I wonder what this is?
The C expression &main evaluates to a pointer to (function) main.
The gdb command
x/x &main
prints (eXamines) the value stored at the address expressed by &main, in hexadecimal format (/x). The result in your case is 0xe5894855, but the C language does not specify the significance of that value. In fact, C does not define any strictly conforming way even to read it from inside the program.
In practice, that value probably represents the first four bytes of the function's machine code, interpreted as a four-byte unsigned integer in native byte order. But that depends on implementation details both of GDB and of the C implementation involved.
Ok so the 0x401550 is the address of main() and the hex goo to the right is the "contents" of that address, which doesn't make much sense since it's code stored there, not data.
To explain what that hex goo is coming from, we can toy around with some artificial examples:
#include <stdio.h>
int main (void)
{
printf("%llx\n", (unsigned long long)&main);
}
Running this code with gcc on x86_64, I get 401040, which is the address of main() on my particular system (this time). Then I modify the example with some ugly hard-coding:
#include <stdio.h>
int main (void)
{
printf("%llx\n", (unsigned long long)&main);
printf("%.8x\n", *(unsigned int*)0x401040);
}
(Please note that accessing absolute addresses of program code memory like this is dirty hacking. It is very questionable practice and some systems might raise a hardware exception if you attempt it.)
I get
401040
08ec8348
The gibberish second line is something similar to what gdb would give: the raw op codes for the instructions stored there.
(That is, it's actually a program that prints out the machine code used for printing out the machine code... and now my head hurts...)
Upon disassembling the executable and viewing the numerical op codes alongside the annotated assembly, I get:
main:
  401040:  48 83 ec 08    sub    rsp,0x8
Where 48 83 ec 08 is the raw machine code, including the sub instruction with its operands (x86 assembly isn't exactly my forte, but I believe 48 is a REX prefix and 83 is the opcode for sub). Upon attempting to print this as if it were integer data rather than machine code, it got reordered according to x86 little-endian byte order, from 48 83 ec 08 to 08 ec 83 48. And that's the hex gibberish 08ec8348 from before.
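If you want to see the same byte swap without hard-coding an address, a rough sketch like the following works on typical x86-64 Linux (reading code through a data pointer and casting a function pointer like this is not strictly portable C; it is only a demonstration):

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char bytes[4];
    unsigned int  word;

    /* Copy the first four bytes of main's machine code, then reinterpret
       the same four bytes as one 32-bit integer. */
    memcpy(bytes, (const void *)&main, sizeof bytes);
    memcpy(&word, bytes, sizeof word);

    printf("bytes in memory order: %02x %02x %02x %02x\n",
           bytes[0], bytes[1], bytes[2], bytes[3]);
    printf("read as a 32-bit integer: %08x\n", word);
    return 0;
}

On a little-endian machine the second line prints the same four bytes in reversed order, which is exactly the effect seen with x/x and with the hard-coded example above.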

[]A\A]A^A_ and ;*3$" in compiled C binary

I'm on an Ubuntu 18.04 laptop coding C with VSCode and compiling it with GNU's gcc.
I'm doing some basic reverse engineering on my own C code and I noticed a few interesting details, one of which is the pair []A\A]A^A_ and ;*3$" that seems to appear in every one of my compiled C binaries. Between them are usually (or always) the strings that I hard-code for printf() calls.
An example is this short piece of code here:
#include <stdio.h>
#include <stdbool.h>
int f(int i);
int main()
{
int x = 5;
int o = f(x);
printf("The factorial of %d is: %d\n", x, o);
return 0;
}
int f(int i)
{
if(i == 0)
{
return i;
}
else
{
return i*f(i-1);
}
}
... is then compiled using gcc test.c -o test.
When I run strings test, the following is outputted:
/lib64/ld-linux-x86-64.so.2
0HSn(
libc.so.6
printf
__cxa_finalize
__libc_start_main
GLIBC_2.2.5
_ITM_deregisterTMCloneTable
__gmon_start__
_ITM_registerTMCloneTable
AWAVI
AUATL
[]A\A]A^A_
The factorial of %d is: %d
;*3$"
GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
crtstuff.c
deregister_tm_clones
__do_global_dtors_aux
completed.7697
__do_global_dtors_aux_fini_array_entry
frame_dummy
__frame_dummy_init_array_entry
test.c
__FRAME_END__
__init_array_end
_DYNAMIC
__init_array_start
__GNU_EH_FRAME_HDR
_GLOBAL_OFFSET_TABLE_
__libc_csu_fini
_ITM_deregisterTMCloneTable
_edata
printf@@GLIBC_2.2.5
__libc_start_main@@GLIBC_2.2.5
__data_start
__gmon_start__
__dso_handle
_IO_stdin_used
__libc_csu_init
__bss_start
main
__TMC_END__
_ITM_registerTMCloneTable
__cxa_finalize@@GLIBC_2.2.5
.symtab
.strtab
.shstrtab
.interp
.note.ABI-tag
.note.gnu.build-id
.gnu.hash
.dynsym
.dynstr
.gnu.version
.gnu.version_r
.rela.dyn
.rela.plt
.init
.plt.got
.text
.fini
.rodata
.eh_frame_hdr
.eh_frame
.init_array
.fini_array
.dynamic
.data
.bss
.comment
As with other programs I've written, the two pieces []A\A]A^A_ and ;*3$" always pop up, one right before the strings used with printf and one right after.
I'm curious: what exactly do those strings mean? I'm guessing they mainly mark the beginning and end of the hard-coded output strings.
Our digital computers work on bits, most commonly clustered in bytes containing 8 bits each. The meaning of such a combination depends on the context and the interpretation.
A non-exhaustive list of possible interpretations:
ASCII characters with the eighth bit ignored or accepted only if 0;
signed or unsigned 8-bit integer;
operation code (or part of it) of one specific machine language, each processor (family) has its own different set.
For example, the hex value 0x43 can be seen as:
ASCII character 'C';
Unsigned 8-bit integer 67 (signed is the same if 2's complement is used);
Operation code "LD B,E" for a Z80 CPU (see, I'm really old and learned that processor in depth);
Operation code "EORS ari" for an ARM CPU.
Now strings simply (not to say "primitively") scans through the given file and tries to interpret the bytes as sequences of printable ASCII characters. By default a sequence has to have at least 4 characters and the bytes are interpreted as 7-bit ASCII. BTW, the file does not have to be an executable. You can scan any file, but if you give it an object file, by default it scans only the sections that are loaded into memory.
So what you see are sequences of bytes which by chance are at least 4 printable characters in a row. And because some patterns are always present in an executable, it just looks as if they have a special meaning. Actually they do, but they don't have to relate to your program's strings.
You can use strings to quickly peek into a file to find, well, strings which might help you with whatever you're trying to accomplish.
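To make the behaviour concrete, here is a rough C sketch of the core idea behind strings: scan a file and print every run of at least four printable 7-bit ASCII bytes. The real tool has many more options and, for object files, by default only scans the sections that are loaded into memory:

#include <stdio.h>
#include <ctype.h>

/* Minimal "strings"-like scanner: prints runs of >= 4 printable 7-bit
   ASCII characters found in the file given as argv[1]. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char run[4096];
    size_t len = 0;
    int c;

    while ((c = fgetc(f)) != EOF) {
        if (c < 0x80 && (isprint(c) || c == '\t')) {
            if (len < sizeof run - 1)
                run[len++] = (char)c;    /* extend the current run */
        } else {
            if (len >= 4) {              /* flush runs of 4+ characters */
                run[len] = '\0';
                puts(run);
            }
            len = 0;
        }
    }
    if (len >= 4) {
        run[len] = '\0';
        puts(run);
    }
    fclose(f);
    return 0;
}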
What you're seeing is an ASCII representation of a particular bit pattern that happens to be common in executable programs generated by that particular compiler. The pattern might correspond to a particular sequence of machine language instructions which the compiler is fond of emitting. Or it might correspond to a particular data structure which the compiler or linker uses to mark the various other pieces of data stored in the executable.
Given enough work, it would probably be possible to work out the actual details, for your C code and your particular version of your particular compiler, precisely what the bit patterns behind []A\A]A^A_ and ;*3$" correspond to. But I don't do much machine-language programming any more, so I'm not going to try, and the answers probably wouldn't be too interesting in the end, anyway.
But it reminds me of a little quirk which I have noticed and can explain. Suppose you wrote the very simple program
int i = 12345;
If you compiled that program and ran strings on it, and if you told it to look for strings as short as two characters, you'd probably see (among lots of other short, meaningless strings) the string
90
and that bit pattern would, in fact, correspond to your variable! What's up with that?
Well, 12345 in hexadecimal is 0x3039, and most machines these days are little-endian, so those two bytes in memory are stored in the other order as
39 30
and in ASCII, 0x39 is '9', while 0x30 is '0'.
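A quick way to check this yourself is to print the individual bytes of the variable; a small sketch, with nothing compiler-specific about it:

#include <stdio.h>

int main(void)
{
    int i = 12345;                                   /* 0x3039 */
    const unsigned char *p = (const unsigned char *)&i;

    /* On a little-endian machine this prints 39 30 00 00: the '9' (0x39)
       and '0' (0x30) that strings will happily report as "90". */
    for (size_t k = 0; k < sizeof i; k++)
        printf("%02x ", p[k]);
    printf("\n");
    return 0;
}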
And if this is interesting to you, you can try compiling the program fragment
int i = 12345;
long int a = 1936287860;
long int b = 1629516649;
long int c = 1953719668;
long long int x = 48857072035144;
long long int y = 36715199885175;
and running strings -2 on it, and see what else you get.

How can I access the size of a symbol as set by .size directive in C

I have set the size of a symbol in assembly using the .size directive of the GNU assembler. How do I access this size in C?
void my_func(void){}
asm(
"_my_stub:\n\t"
"call my_func\n\t"
".size _my_stub, .-_my_stub"
);
extern char* _my_stub;
int main(){
/* print size of _my_stub here */
}
Here is the relevant objdump output:
0000000000000007 <_my_stub>:
_my_stub():
7: e8 00 00 00 00 callq c <main>
Here is the relevant portion of the readelf output:
Symbol table '.symtab' contains 14 entries:
Num: Value Size Type Bind Vis Ndx Name
5: 0000000000000007 5 NOTYPE LOCAL DEFAULT 1 _my_stub
I can see from the objdump and symbol table that the size of _my_stub is 5. How can I get this value in C?
I don't know of a way to access the size attribute from within gas. As an alternative, how about replacing .size with
hhh:
.long .-_my_stub # 32-bit integer constant, regardless of the size of a C long
and
extern const uint32_t gfoo asm("hhh");
// asm("asm_name") sidesteps the _name vs. name issue across OSes
I can't try it here, but I expect you should be able to printf("%" PRIu32 "\n", gfoo);. I tried using .equ so this would be a constant rather than allocating memory for it, but I never got it to work right.
This does leave the question as to the purpose of the .size attribute. Why set it if you can't read it? I'm not an expert, but I've got a theory:
According to the docs, for COFF outputs, .size must be within a .def/.endef pair. Looking at .def, we see that it's used to "begin defining debugging information for a symbol name".
While ELF doesn't have the same nesting requirement, it seems plausible to assume that debugging is the intent there too. If this is only intended to be used by debuggers, it (kinda) makes sense that there's no way to access it from within the assembler.
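Putting the pieces together, an untested sketch of the whole arrangement might look like this (it keeps the hypothetical symbol names hhh and gfoo from above, and everything lives in one translation unit so the assembler can resolve the local labels):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

void my_func(void) {}

asm(
    "_my_stub:\n\t"
    "call my_func\n\t"
    "hhh:\n\t"
    ".long . - _my_stub\n"      /* length of the stub, stored as 32-bit data */
);

/* Bind the C name gfoo to the assembly symbol hhh. */
extern const uint32_t gfoo asm("hhh");

int main(void)
{
    /* The call instruction is 5 bytes, so this should print 5. */
    printf("stub size: %" PRIu32 "\n", gfoo);
    return 0;
}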
I guess you just want to get the size of a subset of a code segment or data segment. Here is an assembly example (GAS/AT&T style) you can refer to:
target_start:
// Put your code segment or data segment here
// ...
target_end = .
// Use .global to export the label
.global target_size
target_size = target_end - target_start
In the C/C++ source file you can then pick up target_size as an external symbol; because it is an absolute symbol, its address (not its contents) carries the computed value, as the sketch below shows.
Note: this example hasn't been tested.
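On the C side the value lives in the symbol's address, not in the memory it points to, so (equally untested) you would read it roughly like this; note that in a position-independent executable the linker or loader may adjust absolute-looking symbols, so this is simplest when linked with -no-pie:

#include <stdio.h>
#include <stdint.h>

/* target_size is an absolute assembler symbol: its *address* is the computed
   size, so we take the address and never dereference it. */
extern const char target_size[];

int main(void)
{
    size_t n = (size_t)(uintptr_t)target_size;
    printf("segment size: %zu bytes\n", n);
    return 0;
}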

Unable to access correct global label data of assembly from C in linux

I have an assembly file (hello1.s) where the global label A_Td is defined, and I want to access all the long data values defined under A_Td from inside a C program.
.file "hello1.s"
.globl A_Td
.text
.align 64
A_Td:
.long 1353184337,1353184337
.long 1399144830,1399144830
.long 3282310938,3282310938
.long 2522752826,2522752826
.long 3412831035,3412831035
.long 4047871263,4047871263
.long 2874735276,2874735276
.long 2466505547,2466505547
As A_Td is defined in the text section, it is placed in the code segment and only one copy is loaded into memory.
Using yasm, I have generated the hello1.o file:
yasm -p gas -f elf32 hello1.s
Now, to access all the long data using the global label A_Td, I have written the following C code (test_glob.c), taking a clue from here: global label.
//test_glob.c
#include <stdio.h>
extern A_Td;
int main()
{
long *p;
int i;
p=(long *)(&A_Td);
for(i=0;i<16;i++)
{
printf("p+%d %p %ld\n",i, p+i,*(p+i));
}
return 0;
}
Using the following commands, I have compiled the C program and then run it:
gcc hello1.o test_glob.c
./a.out
I am getting the following output:
p+0 0x8048400 1353184337
p+1 0x8048404 1353184337
p+2 0x8048408 1399144830
p+3 0x804840c 1399144830 -----> correct till this place
p+4 0x8048410 -1012656358 -----> incorrect value retrieved from this place
p+5 0x8048414 -1012656358
p+6 0x8048418 -1772214470
p+7 0x804841c -1772214470
p+8 0x8048420 -882136261
p+9 0x8048424 -882136261
p+10 0x8048428 -247096033
p+11 0x804842c -247096033
p+12 0x8048430 -1420232020
p+13 0x8048434 -1420232020
p+14 0x8048438 -1828461749
p+15 0x804843c -1828461749
Only the first 4 long values are correctly accessed from the C program. Why is this happening?
What needs to be done inside the C program to access the rest of the data correctly?
I am using Linux. Any help to resolve this issue, or any relevant link, would be a great help. Thanks in advance.
How many bytes does "long" have in this system?
It seems to me that printf interprets the numbers as four-byte signed integers. The value 3282310938 has the hex value C3A4171A, which is above 7FFFFFFF (decimal 2147483647), the largest positive four-byte signed number, and hence it prints as the negative value -1012656358.
I assume that the assembler just interprets these four-byte numbers as unsigned.
If you use %lu instead of %ld, printf will interpret the numbers as unsigned and should show what you expected.
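For reference, a corrected sketch of test_glob.c for the 32-bit build in the question (where long is 4 bytes, matching the .long entries) could look like this:

//test_glob.c -- corrected: treat the table entries as unsigned
#include <stdio.h>

extern unsigned long A_Td[];      /* the values emitted with .long */

int main(void)
{
    int i;
    for (i = 0; i < 16; i++)
    {
        /* %lu prints the values unsigned, so 3282310938 no longer shows
           up as the wrapped-around -1012656358. */
        printf("p+%d %p %lu\n", i, (void *)&A_Td[i], A_Td[i]);
    }
    return 0;
}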

C read() produces wrong byte order compared to hexdump? [duplicate]

I am playing with the Unix hexdump utility. My input file is UTF-8 encoded, containing a single character ñ, which is C3 B1 in hexadecimal UTF-8.
hexdump test.txt
0000000 b1c3
0000002
Huh? This shows B1 C3 - the reverse of what I expected! Can someone explain?
To get the expected output, I do:
hexdump -C test.txt
00000000 c3 b1 |..|
00000002
I thought I understood encoding systems.
This is because hexdump defaults to 16-bit words and you are running on a little-endian architecture. The byte sequence c3 b1 is therefore interpreted as the 16-bit word b1c3. The -C option forces hexdump to work with individual bytes instead of words.
I found two ways to avoid that:
hexdump -C file
or
od -tx1 < file
I think it is stupid that hexdump decided files are, by default, sequences of little-endian 16-bit words. Very confusing IMO.
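For what it's worth, what hexdump's default format does is roughly the following; a small sketch that ignores line grouping and the exact handling of a trailing odd byte:

#include <stdio.h>

/* Rough model of hexdump's default output: read the file two bytes at a
   time and print each pair as one 16-bit word in host (little-endian)
   byte order, so the bytes c3 b1 come out as the word b1c3. */
int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    int lo, hi;
    unsigned long offset = 0;

    while ((lo = fgetc(f)) != EOF) {
        hi = fgetc(f);                    /* second byte of the pair */
        if (hi == EOF)
            hi = 0;                       /* odd trailing byte: pad with 0 */
        /* The byte at the lower address becomes the low half of the word. */
        printf("%07lx %04x\n", offset, (hi << 8) | lo);
        offset += 2;
    }
    fclose(f);
    return 0;
}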
