Finding a string in a compiled executable - c

I have a very simple program as below:
#include <stdio.h>
int main(){
char* mystring = "ABCDEFGHIJKLMNO";
puts(mystring);
char otherstring[15];
otherstring[0] = 'a';
otherstring[1] = 'b';
otherstring[2] = 'c';
otherstring[3] = 'd';
otherstring[4] = 'e';
otherstring[5] = 'f';
otherstring[6] = 'g';
otherstring[7] = 'h';
otherstring[8] = 'i';
otherstring[9] = 'j';
otherstring[10] = 'k';
otherstring[11] = 'l';
otherstring[12] = 'm';
otherstring[13] = 'n';
otherstring[14] = 'o';
puts(otherstring);
return 0;
}
Compiler was MS VC++.
Whether I build this program with or without optimisations I can find the string "ABCDEFGHIJKLMNO" in the executable using a hex editor.
However, I cannot find the string "abcdefghijklmno"
What is the compiler doing that is different for otherstring?
The hex editor I used was Hexedit - but tried others and still couldn't find otherstring. Anyone any ideas why not or how to find?
By the way I am not doing this for hacking reasons.

This is what my gcc did with this code. I assume your compiler does a similar thing. The string constant is stored in the read-only section and mystring is initialized with its address.
The individual chars are placed directly into their array locations on the stack. Also note that otherstring is not NUL-terminated when you call puts with it.
.file "test.c"
.section .rodata
.LC0:
.string "ABCDEFGHIJKLMNO"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $48, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
/* here is where mystring is loaded with the address of "ABCDEFGHIJKLMNO" */
movq $.LC0, -40(%rbp)
/* this is the call to puts */
movq -40(%rbp), %rax
movq %rax, %rdi
call puts
/* here is where the bytes are loaded into otherstring on the stack */
movb $97, -32(%rbp) //'a'
movb $98, -31(%rbp) //'b'
movb $99, -30(%rbp) //'c'
movb $100, -29(%rbp) //'d'
movb $101, -28(%rbp) //'e'
movb $102, -27(%rbp) //'f'
movb $103, -26(%rbp) //'g'
movb $104, -25(%rbp) //'h'
movb $105, -24(%rbp) //'i'
movb $106, -23(%rbp) //'j'
movb $107, -22(%rbp) //'k'
movb $108, -21(%rbp) //'l'
movb $109, -20(%rbp) //'m'
movb $110, -19(%rbp) //'n'
movb $111, -18(%rbp) //'o'

The compiler is likely storing the numeric value of each character into each array position, just as you wrote it, without merging the stores into a single string. Remember that a single character is no different from a number in C, so you could even use the ASCII codes instead of 'a'. In a hex editor I would expect you to see those letters, but spaced apart by the instruction bytes of each store.

In the first case the compiler initializes data with the exact string "ABC...".
In the second case, each assignment is done sequentially, so the compiler generates code to perform each assignment. In the executable you should see 15 repeating byte sequences where only the initializer ('a', 'b', 'c'...) and the destination offset change.

Related

Addresses in C Language

The GCC compiler on CSLab translates the following C function:
int func(int x) {
return 13 + x;
}
Into the following Assembly code:
func:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
addl $13, %eax
popq %rbp
ret
I have completed this code and was then asked the following question:
In the Assembly code for func shown in the previous question, suppose %rsp has the value
0x7fffffffe3e0
What is the address corresponding to the parameter (local variable) x? Include the 0x prefix.
(Note that the address has 12 significant hex digits, or 6 bytes. The value for the top two hex digits is 0. Omit the 0s to the left just as shown above.)
I answered 0xd and it was incorrect.
Taking the given value for %rsp as 0x7fffffffe3e0 (assumed to be its value when the movq executes, i.e. after the pushq), we have
movq %rsp, %rbp
which copies the value of %rsp to %rbp, then we have
movl -4(%rbp), %eax
addl $13, %eax
We copy something to %eax and add 13 to it, so that something must be x. That something is -4(%rbp), which translates to "an object 4 bytes below the address value stored in %rbp".
Thus, the address of x must be 0x7fffffffe3e0 - 4, or 0x7fffffffe3dc.
Read up on your x86 assembly addressing modes.

Hex Character bit rotation [duplicate]

This question already has answers here:
What are the calling conventions for UNIX & Linux system calls (and user-space functions) on i386 and x86-64
(4 answers)
Segfault when loading from [esp] in 64-bit code
(1 answer)
I wrote the following x86-64 functions to be called in a C program:
The first one takes in a 2-digit hexadecimal value and rotates its bits to the right by one bit, i.e. '12' (meaning 0x12, but the 0x is not fed as input) is rotated right by one bit to give '09' (0x09)
.file "rotate_right.s"
.section .rodata
.data
.globl rotate_right
.type rotate_right, @function
.text
rotate_right:
pushq %rbp
movq %rsp,%rbp
pushq %rsi
pushq %rdi
pushq %rbx
subl $4, %esp
movb 8(%ebp), %al
sarb $1, %al
leave
ret
.size rotate_right, .-rotate_right
Similarly, this function rotates the bits to the left one position, so '12'(0x12) becomes '24'(0x24).
.file "rotate_left.s"
.section .rodata
.data
.globl rotate_left
.type rotate_left, @function
.text
rotate_left:
pushq %rbp
movq %rsp,%rbp
pushq %rsi
pushq %rdi
pushq %rbx
subl $4, %esp
movb 8(%ebp), %al
sarb $1, %al
leave
ret
.size rotate_left, .-rotate_left
The create_key() function gets a four bit input like 0110 and outputs an 8-bit output as unsigned int i.e 01100110:
.file "create_key.s"
.section .rodata
ask_key:
.string "Enter 4-bit key:"
.data
.globl create_key
.type create_key, @function
.text
create_key:
pushq %rbp
# stack holding
movq %rsp, %rbp
movl $ask_key, %edi
# printing the ask string
movl $0, %eax
# calling the C functions
call printf
movl $0,%esi
# rsi is set to 0. the key
pushq %rsi
# take it to the stack
# call getchar for getting each key bit
call getchar
popq %rsi
subl $48,%eax
# doing this will give us 1 or 0 if the we input 1 or 0
sall $1,%esi
# shift the key by one bit
orl %eax,%esi
# OR key and the sigle input
pushq %rsi
# push rsi to stack to save the value in rsi
# Do the above operation a total of 4 times
call getchar
popq %rsi
subl $48,%eax
sall $1,%esi
orl %eax,%esi
pushq %rsi
call getchar
popq %rsi
subl $48,%eax
sall $1,%esi
orl %eax,%esi
pushq %rsi
call getchar
popq %rsi
subl $48,%eax
sall $1,%esi
orl %eax,%esi
# copy first 4-bits of the key into %rax
movl %esi,%eax
#left shift the key 4 times
sall $4,%esi
# OR the secont 4-bits into %rax
orl %esi,%eax
leave
ret
# return the values and end the function
.size create_key, .-create_key
Here is the C-program,
#include<stdio.h>
#include<stdlib.h>
#include<math.h>
#include<string.h>
extern unsigned int create_key();
extern unsigned char rotate_left(unsigned char x);
extern unsigned char rotate_right(unsigned char x);
int main(){
/* This Part Takes The Input To Be Ciphered
And Prints The Hexadecimal Value*/
char word[200], outword[400], xor_hex[400], hex_rot[400], ch, hex[2];
unsigned int i_key, encodant, rotated;
int i, len, j;
/* i_key is the integer equvivalent of the cipher key*/
printf("enter cleartext:");
i = 0;
ch = getchar();
while(ch!='\n'){
word[i] = ch;
++i;
ch = getchar();
}
fflush(stdin);
len = i;
word[i] = '\0';
printf("%s\n", word);
printf("hex encoding:\n");
for(i = 0; i<len; i++){
sprintf(outword+(i*2), "%02X", word[i]);
}
for(i=0;i<(2*len);i++){
if(i%2==0 & i>0)printf(" ");
if(i%20==0 & i>0)printf("\n");
printf("%c", outword[i]);
}
printf("\n");
/* This Part Asks For The Cipher Key */
i_key = create_key();
/* XOR Encoding of the Hex Cyphertext*/
for(i=0;i<len*2;i+=2){
hex[0] = outword[i];
hex[1] = outword[i+1];
encodant = (int)strtol(hex, NULL, 16);
sprintf(xor_hex+(i), "%02X", (i_key^encodant));
}
/* Encoding the text using bit rotation */
j=1;
for(i=0;i<len*2;i+=2){
hex[0]=xor_hex[i];
hex[1]=xor_hex[i+1];
encodant = (int)strtol(hex, NULL, 16);
if(j%2==0)rotated = rotate_right(encodant);
else rotated = rotate_left(encodant);
j++;
sprintf(hex_rot+(i), "%02X", rotated);
}
/* Printing The Finished Ciphered Text */
printf("hex ciphertext:\n");
for(i=0;i<(2*len);i++){
if(i%2==0 & i>0)printf(" ");
if(i%20==0 & i>0)printf("\n");
printf("%c", hex_rot[i]);
}
printf("\n");
return 0;
}
The function prototypes can't be changed, i.e. the rotate functions must return char and take char parameters. The create_key function is apparently fine, but my code gives a segmentation fault. I don't know what to do in this situation, so any help is appreciated.
There is no need to do anything with the stack here. Take the argument from rdi (System V) or rcx (Windows x64), put it in al, rotate and return:
.file "rotate.S"
.text
.globl rotate_right
rotate_right:
mov %rdi, %rax
shrb $1, %al
ret
.globl rotate_left
rotate_left:
mov %rdi, %rax
shlb $1, %al
ret
.end
This is GNU as (AT&T) syntax; it may require some adjustment for other assemblers.
To test it:
#include <stdio.h>
unsigned char rotate_left(unsigned char);
unsigned char rotate_right(unsigned char);
int main() {
printf("%02x %02x\n", rotate_left(0x12), rotate_right(0x12));
return 0;
}
Prints:
24 09

In which data segment is the C string stored?

I'm wondering what's the difference between char s[] = "hello" and char *s = "hello".
After reading this and this, I'm still not very clear on this question.
As I know, there are five data segments in memory, Text, BSS, Data, Stack and Heap.
From my understanding,
in case of char s[] = "hello":
"hello" is in Text.
s is in Data if it is a global variable or in Stack if it is a local variable.
We also have a copy of "hello" where the s is stored, so we can modify the value of this string via s.
in case of char *s = "hello":
"hello" is in Text.
s is in Data if it is a global variable or in Stack if it is a local variable.
s just points to "hello" in Text and we don't have a copy of it, therefore modifying the value of string via this pointer should cause "Segmentation Fault".
Am I right?
You are right that "hello" is mutable in the first case and an immutable string in the second. In both cases the literal's bytes are initially kept in read-only memory.
In the first case the mutable memory is initialized by copying from the immutable string. In the second case the pointer refers to the immutable string.
For the first case, Wikipedia says,
The values for these variables are initially stored within the
read-only memory (typically within .text) and are copied into the
.data segment during the start-up routine of the program.
Let us examine segment.c file.
char*s = "hello"; // string
char sar[] = "hello"; // string array
char content[32];
int main(int argc, char*argv[]) {
char psar[] = "parhello"; // local/private string array
char*ps = "phello"; // private string
content[0] = 1;
sar[3] = 1; // OK
// sar++; // not allowed
// s[2] = 1; // segmentation fault
s = sar;
s[2] = 1; // OK
psar[3] = 1; // OK
// ps[2] = 1; // segmentation fault
ps = psar;
ps[2] = 1; // OK
return 0;
}
Here is the assembly generated for segment.c. Note that both s and sar are global, in the .data segment. sar behaves like a const pointer to mutable initialized memory, but it is not a pointer at all: it is an array. One consequence is that sizeof(sar) = 6 differs from sizeof(s) = 8. "hello" and "phello" are in the read-only (.rodata) section and are effectively immutable.
.file "segment.c"
.globl s
.section .rodata
.LC0:
.string "hello"
.data
.align 8
.type s, @object
.size s, 8
s:
.quad .LC0
.globl sar
.type sar, @object
.size sar, 6
sar:
.string "hello"
.comm content,32,32
.section .rodata
.LC1:
.string "phello"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $64, %rsp
movl %edi, -52(%rbp)
movq %rsi, -64(%rbp)
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1752326512, -32(%rbp)
movl $1869376613, -28(%rbp)
movb $0, -24(%rbp)
movq $.LC1, -40(%rbp)
movb $1, content(%rip)
movb $1, sar+3(%rip)
movq $sar, s(%rip)
movq s(%rip), %rax
addq $2, %rax
movb $1, (%rax)
movb $1, -29(%rbp)
leaq -32(%rbp), %rax
movq %rax, -40(%rbp)
movq -40(%rbp), %rax
addq $2, %rax
movb $1, (%rax)
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L2
call __stack_chk_fail
.L2:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
.section .note.GNU-stack,"",@progbits
Again, for a local variable in main, the compiler does not bother to create a name. It may keep it in a register or in stack memory.
Note that the local variable's value "parhello" is encoded as the immediates 1752326512 and 1869376613. I discovered this by changing "parhello" to "parhellp". The diff of the assembly output is as follows,
39c39
< movl $1886153829, -28(%rbp)
---
> movl $1869376613, -28(%rbp)
So there is no separate immutable store for psar. It is turned into integer immediates in the code segment.
answer to your first question:
char s[] = "hello";
s is an array of type char. An array is not a pointer, but its name cannot be reassigned or incremented (i.e. s++ does not compile). The data aren't const, though, so you can change them.
See this example C code:
#include <stdio.h>
void reverse(char *p){
char c;
char* q = p;
while (*q) q++;
q--; // point to the end
while (p < q) {
c = *p;
*p++ = *q;
*q-- = c;
}
}
int main(){
char s[] = "DCBA";
reverse( s);
printf("%s\n", s); // ABCD
}
which reverses the text "DCBA" and produces "ABCD".
char *p = "hello"
p is a pointer to a char. You can do pointer arithmetic (p++ will compile), but the literal it points to is typically placed in a read-only part of memory (const data).
Using p[0]='a'; will therefore typically result in a runtime error:
#include <stdio.h>
int main(){
char* s = "DCBA";
s[0]='D'; // compiles, but undefined behaviour: typically crashes here
printf("%s\n", s); // usually never reached
}
This compiles, but typically crashes at run time.
const char* const s = "DCBA";
With a const char* const, you can change neither s nor the data it points to (i.e. "DCBA"), so both the data and the pointer are const:
#include <stdio.h>
int main(){
const char* const s = "DCBA";
s[0]='D'; // compile error
printf("%s\n", s); // would print DCBA if the line above were removed
}
The Text segment is normally the segment where your code is stored and is const; i.e. unchangeable. In embedded systems, this is the ROM, PROM, or flash memory; in a desktop computer, it can be in RAM.
The Stack is RAM memory used for local variables in functions.
The Heap is RAM used for dynamically allocated memory (e.g. from malloc).
Data contains global and static variables with non-zero initializers; BSS contains global and static variables that are initialized to zero or not initialized at all.
For more information, see the relevant Wikipedia and this relevant Stack Overflow question
With regards to s itself: The compiler decides where to put it (in stack space or CPU registers).
For more information about memory protection and access violations or segmentation faults, see the relevant Wikipedia page
This is a very broad topic, and ultimately the exact answers depend on your hardware and compiler.

understanding pointers and casting in assembly

I was given a function in assembly which basically converted uppercase letters to lowercase letters. Here is some of the assembly,
Q1:
pushq %rbp
movq %rsp, %rbp
subq $24, %rsp
movq %rdi, -24(%rbp)
movl $0, -4(%rbp)
movl $0, -8(%rbp)
jmp .L2
.L2:
movl -4(%rbp), %edx
movq -24(%rbp), %rax
addq %rdx, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L4
...
Much of the rest is repetitive but L2 is what really is confusing me. This is my logic so far:
We store param1 into -24(%rbp). We create local1 and local2, set them both to 0 and then jump to L2. I move local1 into %edx, param1 into %rax. Now this is where things get confusing for me,
I was told the following line, addq ended up in local1 being a pointer to param1. I just reasoned add local1 + param1 and store them into %rax. How is that possible?
Next is, movzbl. From my understanding we dereference %rax so we get something like eax = (int) rax.
I was also told to think of it as converting a char to int. Which one is true, how do I know that I'm typecasting? What about if %rax didn't have parentheses around it? Is it an int because it's 4 bytes and %eax is a 32 bit register. Thank you in advance for your help, I'm kind of lost here....
local1 is not a pointer, it's an index (a counter).
That code is doing something like:
void to_lower(char* text)
{
int i = 0; /* at rbp-4 */
int j = 0; /* unused, at rbp-8 */
int ch; /* in eax */
while((ch = *(text + i)) != 0)
{
...
}
}
Note that in C pointer arithmetic *(text + i) is of course equivalent to text[i].
Yes, the movzbl is converting an unsigned char to an int; you can see that from the instruction name itself: MOVe Zero-extended Byte to Long.
The parentheses denote pointer dereferencing.

Where string data is stored?

I wrote a small C program:
#include <stdio.h>
int main()
{
char s[] = "Hello, world!";
printf("%s\n", s);
return 0;
}
which compiles to (on my linux machine):
.file "hello.c"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
movw $33, -20(%rbp)
leaq -32(%rbp), %rax
movq %rax, %rdi
call puts
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L3
call __stack_chk_fail
.L3:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",@progbits
I don't understand the assembly code, but I can't see the string message anywhere. So how does the executable know what to print?
It's here:
movl $1819043144, -32(%rbp) ; 1819043144 = 0x6C6C6548 = "lleH"
movl $1998597231, -28(%rbp) ; 1998597231 = 0x77202C6F = "w ,o"
movl $1684828783, -24(%rbp) ; 1684828783 = 0x646C726F = "dlro"
movw $33, -20(%rbp) ; 33 = 0x0021 = "\0!"
In this particular case the compiler is generating inline instructions to build the string on the stack before the call (the printf was compiled to a call to puts). Of course in other situations it may not do this but may instead store a string constant in another section of memory. Bottom line: you cannot make any assumptions about how or where the compiler will generate and store string literals.
The string is here:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
This copies a bunch of values to the stack. Those values happen to be your string.
String constants are stored in the binary of your application. Exactly where is up to your compiler.
Assembly has no "string" concept. Thus, the "string" is actually a chunk of memory. The string is stored somewhere in memory (where is up to the compiler), and you can then manipulate this chunk of data using its memory address (pointer).
If your string is constant, the compiler may encode it as immediate operands in the instructions instead of loading it from a separate data section, which can be faster. This is your case, as pointed out by Paul R:
movl $1819043144, -32(%rbp)
movl $1998597231, -28(%rbp)
movl $1684828783, -24(%rbp)
You cannot make assumptions about how the compiler will treat your string.
In addition to the above, the compiler can see that your string literal cannot be referenced directly (i.e. there can't be any valid pointers to your string), which is why it can just copy it inline. If however you assign a character pointer instead, i.e.
char *s = "Hello, world!";
The compiler will initialise a string literal somewhere in memory, since you can of course now point to it. This modification produces on my machine:
.LC0:
.string "Hello, world!"
.text
.globl main
.type main, @function
One assumption can be made about string literals: if a pointer is initialised to a literal, it will point to a static char array held somewhere in memory. As a result the pointer is valid in any part of the program, e.g. you can return a pointer to a string literal initialised in a function, and it will still be valid.
