[]A\A]A^A_ and ;*3$" in compiled C binary - c

I'm on an Ubuntu 18.04 laptop coding C with VSCode and compiling it with GNU's gcc.
I'm doing some basic reverse engineering on my own C code and I noticed a few interesting details, one of which is the pair []A\A]A^A_ and ;*3$" that seems to appear in every one of my compiled C binaries. Between them are usually (or always) the strings that I hard-code for printf() calls.
An example is this short piece of code here:
#include <stdio.h>
#include <stdbool.h>

int f(int i);

int main()
{
    int x = 5;
    int o = f(x);
    printf("The factorial of %d is: %d\n", x, o);
    return 0;
}

int f(int i)
{
    if (i == 0)
    {
        return 1; /* base case: 0! == 1 */
    }
    else
    {
        return i * f(i - 1);
    }
}
... is then compiled using gcc test.c -o test.
When I run strings test, the following is output:
/lib64/ld-linux-x86-64.so.2
0HSn(
libc.so.6
printf
__cxa_finalize
__libc_start_main
GLIBC_2.2.5
_ITM_deregisterTMCloneTable
__gmon_start__
_ITM_registerTMCloneTable
AWAVI
AUATL
[]A\A]A^A_
The factorial of %d is: %d
;*3$"
GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
crtstuff.c
deregister_tm_clones
__do_global_dtors_aux
completed.7697
__do_global_dtors_aux_fini_array_entry
frame_dummy
__frame_dummy_init_array_entry
test.c
__FRAME_END__
__init_array_end
_DYNAMIC
__init_array_start
__GNU_EH_FRAME_HDR
_GLOBAL_OFFSET_TABLE_
__libc_csu_fini
_ITM_deregisterTMCloneTable
_edata
printf@@GLIBC_2.2.5
__libc_start_main@@GLIBC_2.2.5
__data_start
__gmon_start__
__dso_handle
_IO_stdin_used
__libc_csu_init
__bss_start
main
__TMC_END__
_ITM_registerTMCloneTable
__cxa_finalize@@GLIBC_2.2.5
.symtab
.strtab
.shstrtab
.interp
.note.ABI-tag
.note.gnu.build-id
.gnu.hash
.dynsym
.dynstr
.gnu.version
.gnu.version_r
.rela.dyn
.rela.plt
.init
.plt.got
.text
.fini
.rodata
.eh_frame_hdr
.eh_frame
.init_array
.fini_array
.dynamic
.data
.bss
.comment
As with other programs I've written, the two pieces []A\A]A^A_ and ;*3$" always pop up, one right before the strings used with printf and one right after.
I'm curious: what exactly do those strings mean? I'm guessing they mark the beginning and end of the hard-coded output strings.

Our digital computers work on bits, most commonly clustered in bytes containing 8 bits each. The meaning of such a combination depends on the context and the interpretation.
A non-exhaustive list of possible interpretations:
ASCII characters with the eighth bit ignored or accepted only if 0;
signed or unsigned 8-bit integer;
operation code (or part of it) of one specific machine language, each processor (family) has its own different set.
For example, the hex value 0x43 can be seen as:
ASCII character 'C';
Unsigned 8-bit integer 67 (signed is the same if 2's complement is used);
Operation code "LD B,E" for a Z80 CPU (see, I'm really old and learned that processor in depth);
Operation code "EORS ari" for an ARM CPU.
Now strings simply (not to say "primitively") scans through the given file and tries to interpret the bytes as sequences of printable ASCII characters. By default a sequence has to be at least 4 characters long and the bytes are interpreted as 7-bit ASCII. By the way, the file does not have to be an executable. You can scan any file, but if you give it an object file, by default it scans only the sections that are loaded in memory.
So what you see are sequences of bytes which by chance are at least 4 printable characters in a row. And because some patterns are always in an executable it just looks as if they have a special meaning. Actually they have but they don't have to relate to your program's strings.
You can use strings to quickly peek into a file to find, well, strings which might help you with whatever you're trying to accomplish.
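To make the mechanics concrete, here is a minimal sketch of a strings-like scanner in C (my own illustration of the idea, not the actual GNU implementation):
#include <stdio.h>
#include <ctype.h>

/* Print every run of at least MINLEN printable 7-bit ASCII bytes,
   which is roughly what strings does by default. */
#define MINLEN 4

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char buf[4096];
    int n = 0, c;
    while ((c = fgetc(f)) != EOF) {
        if (c < 0x80 && isprint(c)) {   /* 7-bit and printable? */
            if (n < (int)sizeof buf - 1)
                buf[n++] = (char)c;
        } else {
            if (n >= MINLEN) {          /* flush a long-enough run */
                buf[n] = '\0';
                puts(buf);
            }
            n = 0;
        }
    }
    if (n >= MINLEN) {                  /* flush the final run, if any */
        buf[n] = '\0';
        puts(buf);
    }
    fclose(f);
    return 0;
}
Run it on any file and you get roughly the same kind of output the real tool produces, minus its options and object-file handling.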

What you're seeing is an ASCII representation of a particular bit pattern that happens to be common in executable programs generated by that particular compiler. The pattern might correspond to a particular sequence of machine language instructions which the compiler is fond of emitting. Or it might correspond to a particular data structure which the compiler or linker uses to mark the various other pieces of data stored in the executable.
Given enough work, it would probably be possible to work out, for your C code and your particular version of your particular compiler, precisely what the bit patterns behind []A\A]A^A_ and ;*3$" correspond to. (For what it's worth, on x86-64 the bytes of []A\A]A^A_ — 5B 41 5C 41 5D 41 5E 41 5F — decode to pop rbx; pop r12; pop r13; pop r14; pop r15, a common function epilogue restoring callee-saved registers, and AWAVI / AUATL just above it contain the matching pushes from a prologue.) But I don't do much machine-language programming any more, so I'm not going to dig further, and the answers probably wouldn't be too interesting in the end, anyway.
But it reminds me of a little quirk which I have noticed and can explain.
int i = 12345;
If you compiled that program and ran strings on it, and if you told it to look for strings as short as two characters, you'd probably see (among lots of other short, meaningless strings) the string
90
and that bit pattern would, in fact, correspond to your variable! What's up with that?
Well, 12345 in hexadecimal is 0x3039, and most machines these days are little-endian, so those two bytes in memory are stored in the other order as
39 30
and in ASCII, 0x39 is '9', while 0x30 is '0'.
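You can check the byte order directly with a few lines of C (a quick sketch):
#include <stdio.h>
#include <string.h>

int main(void)
{
    int i = 12345;                  /* 0x3039 */
    unsigned char bytes[sizeof i];
    memcpy(bytes, &i, sizeof i);    /* copy out the object representation */
    for (size_t n = 0; n < sizeof i; n++)
        printf("%02x ", bytes[n]);  /* "39 30 00 00" on a little-endian machine */
    putchar('\n');
    return 0;
}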
And if this is interesting to you, you can try compiling the program fragment
int i = 12345;
long int a = 1936287860;
long int b = 1629516649;
long int c = 1953719668;
long long int x = 48857072035144;
long long int y = 36715199885175;
and running strings -2 on it, and see what else you get.

Related

Where in the object file does the code of the function "main" start?

I have an object file of a C program which prints hello world, just for the question.
I am trying to understand, using the readelf utility or gdb or hexedit (I can't figure out which tool is the correct one), where in the file the code of the function "main" starts.
I know from readelf that the symbols _start and main occur, and the addresses where they are mapped in virtual memory. Moreover, I also know the size of the .text section and, of course, where the entry point is specified, i.e. the address of the .text section.
The question is: where in the file does the code of the function "main" start? I thought it was the entry point plus the offset of the .text section, but as I understand it the sections .data, .bss and .rodata should come before main, and they appear after the .text section in readelf.
I also thought we could sum the sizes of all the entries before main in the symbol table, but I am not sure at all whether that is correct.
A follow-up question: if I want to replace the main function with NOP instructions, or plant one ret instruction, in my object file, how can I find the offset at which to do it using hexedit?
So, let's go through it step by step.
Start with this C file:
#include <stdio.h>

void printit()
{
    puts("Hello world!");
}

int main(void)
{
    printit();
    return 0;
}
As the comments suggest you are on x86, compile it as a 32-bit non-PIE executable like this:
$ gcc -m32 -no-pie -o test test.c
The -m32 option is needed because I am working on an x86-64 machine. As you already know, you can get the virtual memory address of main using readelf, objdump or nm, for example like this:
$ nm test | grep -w main
0804918d T main
Obviously, 804918d can not be an offset in the file that is just 15 kB big. You need to find the mapping between virtual memory addresses and file offsets. In a typical ELF file, the mapping is included twice. Once in a detailed form for linkers (as object files are also ELF files) and debuggers, and a second time in a condensed form that is used by the kernel for loading programs. The detailed form is the list of sections, consisting of section headers, and you can view it like this (the output is shortened a bit, to make the answer more readable):
$ readelf --section-headers test
There are 29 section headers, starting at offset 0x3748:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[...]
[11] .init PROGBITS 08049000 001000 000020 00 AX 0 0 4
[12] .plt PROGBITS 08049020 001020 000030 04 AX 0 0 16
[13] .text PROGBITS 08049050 001050 0001c1 00 AX 0 0 16
[14] .fini PROGBITS 08049214 001214 000014 00 AX 0 0 4
[15] .rodata PROGBITS 0804a000 002000 000015 00 A 0 0 4
[...]
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
p (processor specific)
Here you find that the .text section starts at (virtual) address 08049050 and has a size of 1c1 bytes, so it ends at address 08049211. The address of main, 804918d is in this range, so you know main is a member of the text section. If you subtract the base of the text section from the address of main, you find that main is 13d bytes into the text section. The section listing also contains the file offset where the data for the text section starts. It's 1050, so the first byte of main is at offset 0x1050 + 0x13d == 0x118d.
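If you want to automate that calculation, here is a rough sketch in C using the Elf32_* types from <elf.h> (vaddr_to_offset is my own helper name; it assumes a well-formed 32-bit ELF file and omits error handling):
#include <elf.h>
#include <stdio.h>
#include <stdlib.h>

/* Map a virtual address to a file offset by walking the section headers. */
static long vaddr_to_offset(FILE *f, Elf32_Addr vaddr)
{
    Elf32_Ehdr eh;
    fseek(f, 0, SEEK_SET);
    fread(&eh, sizeof eh, 1, f);

    for (int i = 0; i < eh.e_shnum; i++) {
        Elf32_Shdr sh;
        fseek(f, (long)eh.e_shoff + (long)i * eh.e_shentsize, SEEK_SET);
        fread(&sh, sizeof sh, 1, f);
        /* only sections that are both loaded and backed by file contents */
        if ((sh.sh_flags & SHF_ALLOC) && sh.sh_type != SHT_NOBITS &&
            vaddr >= sh.sh_addr && vaddr < sh.sh_addr + sh.sh_size)
            return (long)sh.sh_offset + (long)(vaddr - sh.sh_addr);
    }
    return -1; /* address is not backed by the file */
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s elf-file hex-vaddr\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    long off = vaddr_to_offset(f, (Elf32_Addr)strtoul(argv[2], NULL, 16));
    printf("file offset: 0x%lx\n", off);
    fclose(f);
    return 0;
}
Running it as ./a.out test 804918d should print 0x118d, matching the manual calculation.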
You can do the same calculation using program headers:
$ readelf --program-headers test
[...]
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x08048034 0x08048034 0x00160 0x00160 R 0x4
INTERP 0x000194 0x08048194 0x08048194 0x00013 0x00013 R 0x1
[Requesting program interpreter: /lib/ld-linux.so.2]
LOAD 0x000000 0x08048000 0x08048000 0x002e8 0x002e8 R 0x1000
LOAD 0x001000 0x08049000 0x08049000 0x00228 0x00228 R E 0x1000
LOAD 0x002000 0x0804a000 0x0804a000 0x0019c 0x0019c R 0x1000
LOAD 0x002f0c 0x0804bf0c 0x0804bf0c 0x00110 0x00114 RW 0x1000
[...]
The second load line tells you that the area 08049000 (VirtAddr) to 08049228 (VirtAddr + MemSiz) is readable and executable, and loaded from offset 1000 in the file. So again you can calculate that the address of main is 18d bytes into this load area, so it has to reside at offset 0x118d inside the executable. Let's test that:
$ ./test
Hello world!
$ echo -ne '\xc3' | dd of=test conv=notrunc bs=1 count=1 seek=$((0x118d))
1+0 records in
1+0 records out
1 byte copied, 0.0116672 s, 0.1 kB/s
$ ./test
$
Overwriting the first byte of main with 0xc3, the opcode for return (near) on x86, causes the program to not output anything anymore.
_start normally belongs to a module (a *.o file) that is fixed (it is called differently on different systems, but a common name is crt0.o, which is written in assembler). That fixed code prepares the stack (normally the arguments and the environment are stored in the initial stack segment by the execve(2) system call); the mission of crt0.o is to prepare the initial C stack frame and call main(). Once main() ends, it is responsible for taking the return value of main and calling all the atexit() handlers before finally invoking the _exit(2) system call.
The linking of crt0.o is normally transparent, because you always invoke the compiler driver to do the linking itself, so you normally don't have to add crt0.o as the first object module; the compiler knows to do it (lately, all this machinery has grown considerably, since parameter passing depends on the architecture and ABI).
If you execute the compiler with the -v option, you'll get the exact command line it uses to call the linker, and you'll see how your program's final memory map is put together in its first stages.

Why does GCC store global and static int differently?

Here is my C program with one static, two global, one local and one extern variable.
#include <stdio.h>

int gvar1;
int gvar2 = 12;
extern int evar = 1;

int main(void)
{
    int lvar;
    static int svar = 4;

    lvar = 2;
    gvar1 = 3;
    printf("global1-%d global2-%d local+1-%d static-%d extern-%d\n", gvar1, gvar2, (lvar+1), svar, evar);
    return 0;
}
Note that gvar1, gvar2, evar, lvar and svar are all defined as integers.
I dumped the binary using objdump, and the .debug_str section for it shows as below:
Contents of section .debug_str:
0000 76617269 61626c65 732e6300 6c6f6e67 variables.c.long
0010 20756e73 69676e65 6420696e 74002f75 unsigned int./u
0020 73657273 2f686f6d 6534302f 72616f70 sers/home40/raop
0030 2f626b75 702f6578 616d706c 65730075 /bkup/examples.u
0040 6e736967 6e656420 63686172 00737661 nsigned char.sva
0050 72006d61 696e006c 6f6e6720 696e7400 r.main.long int.
0060 6c766172 0073686f 72742075 6e736967 lvar.short unsig
0070 6e656420 696e7400 67766172 31006776 ned int.gvar1.gv
0080 61723200 65766172 00474e55 20432034 ar2.evar.GNU C 4
0090 2e342e36 20323031 31303733 31202852 .4.6 20110731 (R
00a0 65642048 61742034 2e342e36 2d332900 ed Hat 4.4.6-3).
00b0 73686f72 7420696e 7400 short int.
Why is it showing the following?
unsigned char.svar
long int.lvar
short unsigned int.gvar1.gvar2.evar
How does GCC decide which type it should be stored as?
I am using GCC 4.4.6 20110731 (Red Hat 4.4.6-3)
Why is it showing the following?
Simple answer: It is not showing what you think but it is showing:
1 "variables.c"
2 "long unsigned int"
2a "unsigned int"
2b "int"
3 "/users/home40/raop/bkup/examples"
4 "unsigned char"
4a "char"
5 "svar"
6 "main"
7 "long int"
8 "lvar"
9 "short unsigned int"
10 "gvar1"
11 "gvar2"
12 "evar"
13 "GNU C 4.4.6 20110731 (Red Hat 4.4.6-3)"
14 "short int"
The section is named .debug_str; it contains a list of strings which are separated by NUL bytes. These strings appear in no particular order and are referenced from the .debug_info section. So the fact that svar follows unsigned char has no meaning at all.
The .debug_info section contains the actual debugging information. This section does not contain strings. Instead it will contain information like this:
...
Item 123:
Type of information: Data type
Name: 2b /* String #2b in ".debug_str" is "int" */
Kind of data type: Signed integer
Number of bits: 32
... some more information ...
Item 124:
Type of information: Global variable
Name: 8 /* "lvar" */
Data type defined by: Item 123
Stored at: Address 0x1234
... some more information ...
Begin item 125:
Type of information: Function
Name: 6 /* "main" */
... some more information ...
Item 126:
Type of information: Local variable
Name: 5 /* "svar" */
Data type defined by: Item 123
Stored at: Address 0x1238
... some more information ...
End item 125 /* Function "main" */
Item 127:
...
You can see this information using the following command:
readelf --debug-dump filename.o
Why does GCC store global and static int differently?
I compiled your example twice: Once with optimization and once without optimization.
Without optimization, svar and gvar1 were stored exactly the same way: data type int, stored at a fixed address. lvar was: data type int, stored on the stack.
With optimization, lvar and svar were stored the same way: data type int, not stored at all; instead they are treated as constant values.
(This makes sense because the values of these variables never change.)
The C11 specification (read n1570), like older C standards, does not define at which addresses or offsets global or static variables are stored, so the implementation (your gcc compiler and your ld linker) is free to put them at any place.
The organization and layout of the data segments is an implementation detail.
You may want to read more about DWARF to understand debug information, which is useful to the gdb debugger.
You may want to read more about linkers and loaders, and about the ELF format, if you want to understand how they are working. On Linux, there are several utilities to inspect elf(5) files, including objdump(1), readelf(1), nm(1).
Notice that your GCC 4.4 is an old, obsolete version of GCC. The current version is GCC 7, and GCC 8 will be released in a few weeks (spring 2018). I strongly recommend upgrading your compiler.
If you need to understand how and why the data segments are organized in such a way and why your implementation chooses such a layout, you can take advantage of the fact that both gcc and ld (from binutils) are free software, and study their source in detail. You'll need many years of work, since they are complex software (more than ten million lines of source code).
If you happen to start studying the internals of GCC, be sure to study a recent version. Most people in the GCC community have probably forgotten the details of GCC 4.4 (released in 2009). A lot of things have changed in GCC since that ancient release. A few years ago, I wrote many slides about GCC internals; see the documentation of GCC MELT.
BTW, the layout of data segments, or of variables inside them, might vary with optimization options. It might happen that lvar does not sit in memory (e.g. stays in a register only); it could happen that a static variable is removed (using something like the as-if rule) etc.
For a single translation unit foo.c, you might compile it into assembler code using gcc -fverbose-asm -S -O foo.c and look into the emitted foo.s assembler code.
To understand more how your ld linker work, you might look into some relevant linker script. You could find how ld is invoked from gcc by using gcc -v (instead of gcc) in your compilation and linking command.
In most cases, you should not care about the particular offsets (in object files or executables) or addresses (in the virtual address space of your process) of global or static variables. Be also aware of ASLR. The proc(5) filesystem can be used to understand your process.
(your question is severely lacking some motivation and context)

Unable to access correct global label data of assembly from C in Linux

I have an assembly file (hello1.s) in which the global label A_Td is defined, and I want to access all the long data values defined under the global label A_Td from/inside a C program.
.file "hello1.s"
.globl A_Td
.text
.align 64
A_Td:
.long 1353184337,1353184337
.long 1399144830,1399144830
.long 3282310938,3282310938
.long 2522752826,2522752826
.long 3412831035,3412831035
.long 4047871263,4047871263
.long 2874735276,2874735276
.long 2466505547,2466505547
As A_Td is defined in the .text section, it is placed in the code section and only one copy is loaded into memory.
Using yasm, I have generated the hello1.o file:
yasm -p gas -f elf32 hello1.s
Now, to access all the long data using the global label A_Td, I have written the following C code (test_glob.c), taking a clue from this answer about global labels.
//test_glob.c
#include <stdio.h>

extern A_Td;

int main()
{
    long *p;
    int i;

    p = (long *)(&A_Td);
    for (i = 0; i < 16; i++)
    {
        printf("p+%d %p %ld\n", i, p + i, *(p + i));
    }
    return 0;
}
Using the following commands, I compiled the C program and then ran it:
gcc hello1.o test_glob.c
./a.out
I am getting the following output:
p+0 0x8048400 1353184337
p+1 0x8048404 1353184337
p+2 0x8048408 1399144830
p+3 0x804840c 1399144830 -----> correct till this place
p+4 0x8048410 -1012656358 -----> incorrect value retrieved from this place
p+5 0x8048414 -1012656358
p+6 0x8048418 -1772214470
p+7 0x804841c -1772214470
p+8 0x8048420 -882136261
p+9 0x8048424 -882136261
p+10 0x8048428 -247096033
p+11 0x804842c -247096033
p+12 0x8048430 -1420232020
p+13 0x8048434 -1420232020
p+14 0x8048438 -1828461749
p+15 0x804843c -1828461749
ONLY the first 4 long values are correctly accessed from the C program. Why is this happening?
What needs to be done inside the C program to access the rest of the data correctly?
I am using Linux. Any help to resolve this issue, or any link, would be a great help. Thanks in advance.
How many bytes does "long" have in this system?
It seems to me that printf interprets the numbers as four-byte signed integers: the value 3282310938 has the hex value C3A4171A, which is above 7FFFFFFF (in decimal: 2147483647), the largest four-byte positive signed number, and hence prints as the negative value -1012656358.
I assume that the assembler just interprets these four-byte numbers as unsigned.
If you used %lu instead of %ld, printf would interpret the numbers as unsigned and show what you expected.
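You can see the same effect in isolation with a tiny test program (a sketch; the value is the first one that comes out wrong in your output):
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t v = 3282310938u;                /* 0xC3A4171A, larger than INT32_MAX */
    printf("as signed:   %d\n", (int32_t)v); /* prints -1012656358 */
    printf("as unsigned: %u\n", v);          /* prints 3282310938 */
    return 0;
}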

asprintf - how to get string input in C [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 6 years ago.
I am reading the book "21st Century C" (first edition) and found an interesting program that uses asprintf
to get a string without using malloc/sizeof for string length or space allocation. Please read the attached image from the same book to understand the context. The following program is also from the book. The program compiles and runs, but does NOT take string input from the keyboard; instead it prints the following message. The question is: why doesn't the program take string input from the keyboard, and why does it show this long (unusual) message instead?
#define _GNU_SOURCE // stdio.h to include asprintf
#include <stdlib.h>
#include <stdio.h>

void get_strings(char const *in) {
    char *cmd;
    asprintf(&cmd, "strings %s", in);
    if (system(cmd))
        fprintf(stderr, "Something went Wrong %s.\n", cmd);
    free(cmd);
}

int main(int argc, char **argv) {
    get_strings(argv[0]);
    //return 0;
}
When I run the program, the output is:
/lib64/ld-linux-x86-64.so.2
libc.so.6
__stack_chk_fail
asprintf
stderr
system
fprintf
__libc_start_main
free
__gmon_start__
GLIBC_2.4
GLIBC_2.2.5
UH-X
AWAVA
AUATL
[]A\A]A^A_
strings %s
Something went Wrong %s.
;*3$"
GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413
crtstuff.c
__JCR_LIST__
deregister_tm_clones
__do_global_dtors_aux
completed.7585
__do_global_dtors_aux_fini_array_entry
frame_dummy
__frame_dummy_init_array_entry
get_strings.c
__FRAME_END__
__JCR_END__
__init_array_end
_DYNAMIC
__init_array_start
__GNU_EH_FRAME_HDR
_GLOBAL_OFFSET_TABLE_
__libc_csu_fini
free@@GLIBC_2.2.5
_ITM_deregisterTMCloneTable
_edata
__stack_chk_fail@@GLIBC_2.4
system@@GLIBC_2.2.5
get_strings
__libc_start_main@@GLIBC_2.2.5
__data_start
fprintf@@GLIBC_2.2.5
__gmon_start__
__dso_handle
_IO_stdin_used
__libc_csu_init
__bss_start
asprintf@@GLIBC_2.2.5
main
_Jv_RegisterClasses
__TMC_END__
_ITM_registerTMCloneTable
stderr@@GLIBC_2.2.5
.symtab
.strtab
.shstrtab
.interp
.note.ABI-tag
.note.gnu.build-id
.gnu.hash
.dynsym
.dynstr
.gnu.version
.gnu.version_r
.rela.dyn
.rela.plt
.init
.plt.got
.text
.fini
.rodata
.eh_frame_hdr
.eh_frame
.init_array
.fini_array
.jcr
.dynamic
.got.plt
.data
.bss
.comment
------------------
(program exited with code: 0)
Press return to continue
I am running it on Linux Mint 18, GCC version 5.3.1.
Build setting: gcc -Wall -c "%f"
Compile: gcc -Wall -o "%e" "%f"
The purpose of the program is not to get input from the user: it uses the system() function to run the strings program with its own name as the only argument.
If you are running in a Unix environment, the strings program scans files for printable strings. The output you observe is more or less expected: your program executable as produced by gcc contains many printable strings:
you can spot the string literal present in the source code:
Something went Wrong %s.
numerous symbol names to be resolved dynamically at load time
debugging information, such as the name of the source file: crtstuff.c
section names starting with .
there are also some random items ([]A\A]A^A_, ;*3$"...) that are just sequences of printable characters present in the executable file's code or binary data, mistakenly interpreted by strings as C strings because they are followed by a null byte.
There is no place in your program where it reads from standard input/keyboard. And the system("strings ...") call is passing a filename to the strings command, so strings reads from that file and not from the keyboard.
If you intend to read from the files whose names are passed to your program, you need to keep in mind that argv[0] is the program name. You need to look at argv[1], argv[2] and so on:
for (int i = 1; i < argc; ++i)
    get_strings(argv[i]);
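Putting that together, a corrected main might look like this (a sketch built around the loop above):
int main(int argc, char **argv) {
    // argv[0] is the program's own name; real arguments start at argv[1]
    for (int i = 1; i < argc; ++i)
        get_strings(argv[i]);
    return 0;
}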

Why does the int type take up 8 bytes in the BSS section but 4 bytes in the DATA section?

I am trying to learn the structure of executable files of C programs. My environment is GCC and a 64-bit Intel processor.
Consider the following C code a.cc.
#include <cstdlib>
#include <cstdio>

int x;

int main(){
    printf("%zu\n", sizeof(x)); /* %zu is the correct conversion for size_t */
    return 10;
}
The size -o a shows
text data bss dec hex filename
1134 552 8 1694 69e a
After I added another initialized global variable y.
int y=10;
The size a shows (where a is the name of the executable file from a.cc)
text data bss dec hex filename
1134 556 12 1702 6a6 a
As we know, the BSS section holds uninitialized global variables and DATA holds initialized ones.
Why does the int take up 8 bytes in BSS? The sizeof(x) in my code shows that an int actually takes up 4 bytes.
The int y=10 added 4 bytes to DATA, which makes sense since an int should take 4 bytes. But why does it also add 4 bytes to BSS?
The difference between the two size outputs stays the same after deleting the two #include lines.
Update:
I think my understanding of BSS is wrong. It may not store the uninitialized global variables themselves. As Wikipedia says, "The size that BSS will require at runtime is recorded in the object file, but BSS (unlike the data segment) doesn't take up any actual space in the object file." For example, even the one-line C program int main(){} has bss 8.
Does the 8 or 16 in BSS come from alignment?
It doesn't: it takes up 4 bytes regardless of which segment it's in. You can use the nm tool (from the GNU binutils package) with the -S argument to get the names and sizes of all of the symbols in the object file. You're likely seeing secondary effects of the compiler including or not including certain other symbols for whatever reasons.
For example:
$ cat a1.c
int x;
$ cat a2.c
int x = 1;
$ gcc -c a1.c a2.c
$ nm -S a1.o a2.o
a1.o:
0000000000000004 0000000000000004 C x
a2.o:
0000000000000000 0000000000000004 D x
One object file has a 4-byte object named x in the uninitialized data segment (C), while the other object file has a 4-byte object named x in the initialized data segment (D).
