When I run the code below, I can't clearly work out which system calls malloc makes internally. What confuses me is that mmap is called only once for two or more malloc calls. Even if I allocate more than 4096 bytes, only one mmap call is made internally (traced with strace -p processid).
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> /* sleep() */
int main(void)
{
    int *p, *q;
    sleep(20);
    p = malloc(5096);
    printf("p=%p\n", (void *)p);
    q = malloc(4096);
    printf("q=%p\n", (void *)q);
    sleep(2);
    return 0;
}
strace OUTPUT:
root#TEST:/home/harish# strace -p 6109
Process 6109 attached
restart_syscall(<... resuming interrupted call ...>
) = 0
brk(0) = 0xeca000
brk(0xeec000) = 0xeec000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 14), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f10b7bc7000
write(1, "p=0xeca010\n", 11) = 11
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({20, 0},
0x7ffc34a51790) = 0
write(1, "q=0xecb020\n", 11) = 11
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({2, 0}, 0x7ffc34a51790) = 0
exit_group(0) = ?
+++ exited with 0 +++
What I want to know is: if malloc is called more than once, will it call mmap more than once, since the two mallocs together exceed 4096 bytes?
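For small requests like these, malloc() does not call mmap(); it generally grows the heap with brk(). However, not every malloc() call results in a brk() call: that depends on how much heap is already allocated, the size requested, and other factors. The single mmap() in your trace comes right after fstat(1, ...), so it most likely belongs to printf() setting up its stdout buffer rather than to malloc().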
Your process' internal heap (accessed via malloc, free and realloc) manages memory as it sees fit - this includes:
growing the heap by large or fixed increments to amortize the cost of expensive brk/sbrk syscalls over multiple (de)allocations
dealing with smaller (de)allocations inside that heap area itself
managing (de)fragmentation of allocated records
It's also common to use different mechanisms for large and small allocations, for example small objects are allocated from that contiguous area managed by brk/sbrk, but individual large objects may be allocated directly with mmap.
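One way to see this amortization from inside a program is to watch the program break with sbrk(0) around each malloc() call. The sketch below is only an illustration: count_break_moves is a made-up helper name, and it assumes an allocator that serves small requests from the brk/sbrk heap (as glibc normally does). Other allocators may never move the break at all.

```c
#define _DEFAULT_SOURCE   /* for sbrk() on glibc */
#include <stdlib.h>
#include <unistd.h>

/* Allocate `count` blocks of `size` bytes and return how many of those
   malloc() calls moved the program break (i.e. needed brk/sbrk).
   Returns -1 if the bookkeeping array cannot be allocated. */
static int count_break_moves(int count, size_t size)
{
    void **blocks = malloc((size_t)count * sizeof *blocks);
    int moves = 0;

    if (blocks == NULL)
        return -1;
    for (int i = 0; i < count; i++) {
        void *before = sbrk(0);          /* current program break */
        blocks[i] = malloc(size);
        if (sbrk(0) != before)
            moves++;                     /* this call grew the heap */
    }
    for (int i = 0; i < count; i++)
        free(blocks[i]);
    free(blocks);
    return moves;
}
```

On a typical glibc system, count_break_moves(100, 1024) should report far fewer than 100 moves, which is consistent with seeing only one or two brk() lines in strace for many malloc() calls.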
I am studying memory management and have a question about how malloc works.
The malloc man page states that:
Normally, malloc() allocates memory from the heap, and adjusts the
size of the heap as required, using sbrk(2). When allocating blocks
of memory larger than MMAP_THRESHOLD bytes, the glibc malloc()
implementation allocates the memory as a private anonymous mapping
using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is
adjustable using mallopt(3).
To verify it, I did an experiment with a piece of code:
#include<stdlib.h>
#include<stdio.h>
int main()
{
int size = 10;
int *p = malloc(size);
if(p)
{
printf("allocated %d bytes at addr: %p \n", size, p);
free(p);
}
else
{
free(p);
}
return 0;
}
I traced this program with strace to see what syscall was used. Here is the result:
Why in this example did malloc call mmap instead of brk?
All those mmap() calls are part of your program's startup when it's loading shared libraries. It's standard stuff you'll see when you strace most programs.
The real action is in the last few lines:
Two calls to brk() coming from malloc().
An fstat() and a write() call coming from printf().
You can add a printout to the top of main() to see when your code actually starts running.
(It's important to call the write() syscall directly instead of printing with printf() or puts(). The stdio functions call malloc() internally which muddles what we're trying to test.)
#include <unistd.h>
int main()
{
write(1, "start\n", 6);
...
}
When I do that I see the write() call right before the brk(NULL), near the end of the trace:
...
mmap(0x7f1b34802000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f1b34802000
mmap(0x7f1b34808000, 15072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1b34808000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7f1b34a124c0) = 0
mprotect(0x7f1b34802000, 16384, PROT_READ) = 0
mprotect(0x558c3cd9a000, 4096, PROT_READ) = 0
mprotect(0x7f1b34a33000, 4096, PROT_READ) = 0
munmap(0x7f1b34a13000, 128122) = 0
write(1, "start\n", 6) = 6
brk(NULL) = 0x558c3dc58000
brk(0x558c3dc79000) = 0x558c3dc79000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 4), ...}) = 0
write(1, "allocated 10 bytes at addr: 0x55"..., 44) = 44
exit_group(0) = ?
+++ exited with 0 +++
Most libc implementations are open source. Study the source code of glibc or of
musl-libc; both implement malloc and free. Also use strace(1).
Usually, they obtain memory with mmap(2) or sometimes sbrk(2).
Of course, they try to minimize the number of system calls, at least for small allocations.
I want to create a new dynamic library to replace another one whose source code is lost. I have created a library with the exported functions, but the program does not load it. The strace output is almost the same; the only difference is that when my library is loaded, there is no call to fstat64() after the call to read().
strace original library:
open("/usr/local/lpr/li2/libSA.so", O_RDONLY) = 12
read(12, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\3409\0"..., 1024) = 1024
fstat64(12, {st_mode=S_IFREG|0644, st_size=46166, ...}) = 0
old_mmap(NULL, 40256, PROT_READ|PROT_EXEC, MAP_PRIVATE, 12, 0) = 0x40150000
mprotect(0x40159000, 3392, PROT_NONE) = 0
old_mmap(0x40159000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 12, 0x8000) = 0x40159000
close(12) = 0
my library:
open("/usr/local/lpr/li2/libSA.so", O_RDONLY) = 12
read(12, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\210\0\0"..., 1024) = 1024
close(12) = 0
time(NULL)
You're trying to load a 64-bit shared object into a 32-bit process.
The ELF header read by these two read() calls:
read(12, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\3409\0"..., 1024) = 1024
and
read(12, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`\210\0\0"..., 1024) = 1024
differ. Note that the fifth byte in the first read() is 1. That's the successful load of a 32-bit shared object.
That fifth byte is 2 on the unsuccessful attempt - and that 2 means that the shared object is a 64-bit shared object.
You likely need to compile and link with the -m32 option.
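The same check can be done in code. The following is a minimal sketch, not part of any loader: elf_class_bits is a hypothetical helper, and the constants are copied from the ELF specification (on Linux they are also available from <elf.h> as EI_CLASS, ELFCLASS32 and ELFCLASS64).

```c
#include <stddef.h>
#include <string.h>

/* Values from the ELF specification (also defined in <elf.h> on Linux). */
#define EI_CLASS_INDEX 4   /* offset of the class byte in e_ident */
#define CLASS_32 1         /* ELFCLASS32 */
#define CLASS_64 2         /* ELFCLASS64 */

/* Hypothetical helper: returns 32 or 64 for a valid ELF header,
   or 0 if the buffer is not an ELF header or has an unknown class. */
static int elf_class_bits(const unsigned char *hdr, size_t len)
{
    if (len <= EI_CLASS_INDEX || memcmp(hdr, "\177ELF", 4) != 0)
        return 0;
    switch (hdr[EI_CLASS_INDEX]) {
    case CLASS_32: return 32;
    case CLASS_64: return 64;
    default:       return 0;
    }
}
```

Feeding it the first bytes from the two read() calls above distinguishes the 32-bit original from the 64-bit rebuild. From the shell, running file on the library reports the same information.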
I work with C and I make apache modules and I work with strace as my main tool for debugging timings. Here's code I threw together. My apologies if variable names do not meet standards.
#include <stdio.h>
int main(){
long ct2,ct; //counters
int a=0; //dummy value
FILE *f0=fopen("/","r"); //measuring point
ct2=10;
while (--ct2>0){
ct=5000000;
while (--ct>0){
if (!!a){
printf("%d",a);
}
}
}
FILE *f=fopen("/","r"); //measuring point
ct2=10;
while (--ct2>0){
ct=5000000;
while (--ct>0){
if (a){
printf("%d",a);
}
}
}
FILE *f2=fopen("/","r"); //measuring point
return 0;
}
This code does compile. I then run it through strace (by typing in a terminal: strace -r -ttt ./a.out) and I see:
0.000000 execve("./a.out", ["./a.out"], [/* 47 vars */]) = 0
0.000315 brk(0) = 0x804a000
0.000124 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
0.000144 open("/etc/ld.so.cache", O_RDONLY) = 3
0.000116 fstat64(3, {st_mode=S_IFREG|0644, st_size=139721, ...}) = 0
0.000138 mmap2(NULL, 139721, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7ece000
0.000114 close(3) = 0
0.000109 open("/lib/libc.so.6", O_RDONLY) = 3
0.000113 read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\360d\1"..., 512) = 512
0.000130 fstat64(3, {st_mode=S_IFREG|0755, st_size=1575187, ...}) = 0
0.000131 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ecd000
0.000122 mmap2(NULL, 1357360, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7d81000
0.000119 mmap2(0xb7ec7000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x146) = 0xb7ec7000
0.000146 mmap2(0xb7eca000, 9776, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7eca000
0.000139 close(3) = 0
0.000112 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d80000
0.000119 set_thread_area({entry_number:-1 -> 6, base_addr:0xb7d806c0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
0.000217 mprotect(0xb7ec7000, 4096, PROT_READ) = 0
0.000108 munmap(0xb7ece000, 139721) = 0
0.000174 brk(0) = 0x804a000
0.000099 brk(0x806b000) = 0x806b000
0.000110 open("/", O_RDONLY) = 3
0.203487 open("/", O_RDONLY) = 4
0.202225 open("/", O_RDONLY) = 5
0.000133 exit_group(0) = ?
I can tell right off at the end that:
0.000110 open("/", O_RDONLY) = 3
0.203487 open("/", O_RDONLY) = 4
0.202225 open("/", O_RDONLY) = 5
correspond to the three measuring points I set up.
I want to be able to adjust the measuring point lines in my code so that when I run strace I can find my measuring points like I do now, but where the system makes less intensive operations. I don't see anything else from strace related to my program other than the file calls.
I'm thinking maybe if there was such a thing as a built-in MeasureMe function in C that I would use that in place of the measuring point lines in my code, then strace could output:
0.000110 MeasureMe called in code
0.203487 MeasureMe called in code
0.202225 MeasureMe called in code
Is there any way I can go about this with Strace?
The reason why I'm asking about strace instead of gdb is because I use it to debug requests to my apache server like the person in this video does it, and I'll be able to see apache modules in action:
https://www.youtube.com/watch?v=eF-p--AH37E
Any idea how I can solve this? or will I have to continue to make failed attempts at opening non-existing files?
I gather what you are currently using is open("/",O_RDONLY) [or open("/i_do_not_exist",O_RDONLY)] for a "tracepoint". Unfortunately, because you're using strace, you're constrained to using syscalls. But, there is a way to achieve the effect you want.
What you need/want for a tracepoint that you're manually inserting at various points in your source code is:
Any unique syscall that doesn't harm anything
Is easily distinguishable from real code [even code that may return errors such as opening a file or checking for existence with access]
Minimal overhead / fastest execution
Actually, dup on a bad fildes fills the bill nicely:
dup(-10000);
It will return EBADF. It is easily distinguishable as a tracepoint because most real dup calls that are "bad" will be dup(-1).
You can have as many of these as you want. The actual argument becomes the "tracepoint number":
dup(-10001); // tracepoint 1
...
dup(-10002); // tracepoint 2
...
dup(-10003); // tracepoint 3
The output will look like:
0.000044 dup(-10001) = -1 EBADF (Bad file descriptor)
0.000022 dup(-10002) = -1 EBADF (Bad file descriptor)
0.000019 dup(-10003) = -1 EBADF (Bad file descriptor)
I usually encapsulate this in a macro:
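#include <unistd.h> /* dup() */
#ifdef DEBUG
#define TRACEPOINT(_tno) tracepoint(_tno)
#else
#define TRACEPOINT(_tno) /**/
#endif
/* Issue a harmless, always-failing dup() that strace shows as dup(-1000N). */
void
tracepoint(int tno)
{
    dup(-10000 - tno);
}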
Then, I add something like:
TRACEPOINT(1); // initialization phase
...
TRACEPOINT(2); // execution phase
...
TRACEPOINT(3); // cleanup/shutdown
Now, I'll write a perl or python script to read in the source files, extracting the comments for the given tracepoints, and append them to the matching lines in the strace output file:
0.000044 TRACEPOINT(1) initialization phase
0.000022 TRACEPOINT(2) execution phase
0.000019 TRACEPOINT(3) cleanup/shutdown
A more sophisticated version of the post-processing script can do all sorts of things:
keep track of timestamps and append a time difference between one tracepoint and the previous one to the trace line
add file name and line number information to the tracepoint lines
keep track of the number of times a given tracepoint is hit [similar to gdb and breakpoints]
generate summary reports relating to tracepoints
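The answer suggests perl or python for the post-processing; to stay in the same language as the rest of the code here, this is a rough C sketch of just the matching step. parse_tracepoint is a hypothetical name, and it assumes the strace -r line format shown above (relative time, then the syscall) together with the -10000 offset used by tracepoint().

```c
#include <stdio.h>

/* Hypothetical helper: parse one line of `strace -r` output.
   If the line is a tracepoint such as
       "0.000044 dup(-10001) = -1 EBADF (Bad file descriptor)"
   store the relative time in *rel_time and return the tracepoint
   number (1 for the line above). Return -1 for any other line. */
static int parse_tracepoint(const char *line, double *rel_time)
{
    double t;
    int arg;

    if (sscanf(line, " %lf dup(-%d)", &t, &arg) != 2)
        return -1;
    if (arg < 10000)
        return -1;          /* an ordinary failed dup(), e.g. dup(-1) */
    *rel_time = t;
    return arg - 10000;
}
```

A full script would call this for every line of the strace output, look up the comment for the returned tracepoint number, and accumulate the relative times between hits.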
How do I exit with EXIT_SUCCESS after strict mode seccomp is set? Is it correct practice to call syscall(SYS_exit, EXIT_SUCCESS); at the end of main?
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>
int main(int argc, char **argv) {
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
//return EXIT_SUCCESS; // does not work
//_exit(EXIT_SUCCESS); // does not work
// syscall(__NR_exit, EXIT_SUCCESS); // (EDIT) This works! Is this the ultimate answer and the right way to exit success from seccomp-ed programs?
syscall(SYS_exit, EXIT_SUCCESS); // (EDIT) works; SYS_exit equals __NR_exit
}
// gcc seccomp.c -o seccomp && ./seccomp; echo "${?}" # I want 0
As explained on eigenstate.org and in seccomp(2):
The only system calls that the calling thread is permitted to
make are read(2), write(2), _exit(2) (but not exit_group(2)),
and sigreturn(2). Other system calls result in the delivery
of a SIGKILL signal.
As a result, one would expect _exit() to work, but it's a wrapper function that invokes exit_group(2) which is not allowed in strict mode ([1], [2]), thus the process gets killed.
It's even reported in exit(2) - Linux man page:
In glibc up to version 2.3, the _exit() wrapper function invoked the kernel system call of the same name. Since glibc 2.3, the wrapper function invokes exit_group(2), in order to terminate all of the threads in a process.
The same happens with the return statement: returning from main() also ends in a call to exit_group(), so the process is killed in the same way as with _exit().
Stracing the process provides further confirmation (for this to show up, you must not set PR_SET_SECCOMP; just comment out the prctl() call). I got similar output for both non-working cases:
linux12:/home/users/grad1459>gcc seccomp.c -o seccomp
linux12:/home/users/grad1459>strace ./seccomp
execve("./seccomp", ["./seccomp"], [/* 24 vars */]) = 0
brk(0) = 0x8784000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb775f000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=97472, ...}) = 0
mmap2(NULL, 97472, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7747000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\226\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1730024, ...}) = 0
mmap2(NULL, 1739484, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xdd0000
mmap2(0xf73000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a3) = 0xf73000
mmap2(0xf76000, 10972, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf76000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7746000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7746900, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xf73000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x16e000, 4096, PROT_READ) = 0
munmap(0xb7747000, 97472) = 0
exit_group(0) = ?
linux12:/home/users/grad1459>
As you can see, exit_group() is called, explaining everything!
Now as you correctly stated, "SYS_exit equals __NR_exit"; for example it's defined in mit.syscall.h:
#define SYS_exit __NR_exit
so the last two calls are equivalent, i.e. you can use the one you like, and the output should be this:
linux12:/home/users/grad1459>gcc seccomp.c -o seccomp && ./seccomp ; echo "${?}"
0
PS
You could of course define a filter yourself and use:
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, filter);
as explained in the eigenstate link, to allow _exit() (or, strictly speaking, exit_group(2)), but do that only if you really need to and know what you are doing.
The problem occurs, because the GNU C library uses the exit_group syscall, if it is available, in Linux instead of exit, for the _exit() function (see sysdeps/unix/sysv/linux/_exit.c for verification), and as documented in the man 2 prctl, the exit_group syscall is not allowed by the strict seccomp filter.
Because the _exit() function call occurs inside the C library, we cannot interpose it with our own version (that would just do the exit syscall). (The normal process cleanup is done elsewhere; in Linux, the _exit() function only does the final syscall that terminates the process.)
We could ask the GNU C library developers to use the exit_group syscall in Linux only when there is more than one thread in the current process, but unfortunately that would not be easy, and even if added right now, it would take quite some time for the feature to reach most Linux distributions.
Fortunately, we can ditch the default strict filter, and instead define our own. There is a small difference in behaviour: the apparent signal that kills the process will change from SIGKILL to SIGSYS. (The signal is not actually delivered, as the kernel does kill the process; only the apparent signal number that caused the process to die changes.)
Furthermore, this is not even that difficult. I did waste a bit of time looking into some GCC macro trickery that would make it trivial to manage the allowed syscalls' list, but I decided it would not be a good approach: the list of allowed syscalls should be carefully considered -- we only add exit_group() compared to the strict filter, here! -- so making it a bit difficult is okay.
The following code, say example.c, has been verified to work on a 4.4 kernel (should work on kernels 3.5 or later) on x86-64 (for both x86 and x86-64, i.e. 32-bit and 64-bit binaries). It should work on all Linux architectures, however, and it does not require or use the libseccomp library.
#define _GNU_SOURCE
#include <stdlib.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <stdio.h>
static const struct sock_filter strict_filter[] = {
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read, 4, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write, 3, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit, 2, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group, 1, 0),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};
static const struct sock_fprog strict = {
.len = (unsigned short)( sizeof strict_filter / sizeof strict_filter[0] ),
.filter = (struct sock_filter *)strict_filter
};
int main(void)
{
/* To be able to set a custom filter, we need to set the "no new privs" flag.
The Documentation/prctl/no_new_privs.txt file in the Linux kernel
recommends this exact form: */
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
fprintf(stderr, "Cannot set no_new_privs: %m.\n");
return EXIT_FAILURE;
}
if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &strict)) {
fprintf(stderr, "Cannot install seccomp filter: %m.\n");
return EXIT_FAILURE;
}
/* The seccomp filter is now active.
It differs from SECCOMP_SET_MODE_STRICT in two ways:
1. exit_group syscall is allowed; it just terminates the
process
2. Parent/reaper sees SIGSYS as the killing signal instead of
SIGKILL, if the process tries to do a syscall not in the
explicitly allowed list
*/
return EXIT_SUCCESS;
}
Compile using e.g.
gcc -Wall -O2 example.c -o example
and run using
./example
or under strace to see the syscalls it makes:
strace ./example
The strict_filter BPF program is really trivial. The first opcode loads the syscall number into the accumulator. The next five opcodes compare it to an acceptable syscall number, and if found, jump to the final opcode that allows the syscall. Otherwise the second-to-last opcode kills the process.
Note that although the documentation refers to sigreturn being the allowed syscall, the actual name of the syscall in Linux is rt_sigreturn. (sigreturn was deprecated in favour of rt_sigreturn ages ago.)
Furthermore, when the filter is installed, the opcodes are copied to kernel memory (see kernel/seccomp.c in the Linux kernel sources), so it does not affect the filter in any way if the data is modified later. Having the structures static const has zero security impact, in other words.
I used static since there is no need for the symbols to be visible outside this compilation unit (or in a stripped binary), and const to put the data into the read-only data section of the ELF binary.
The form of a BPF_JUMP(BPF_JMP | BPF_JEQ, nr, equals, differs) is simple: the accumulator (the syscall number) is compared to nr. If they are equal, then the next equals opcodes are skipped. Otherwise, the next differs opcodes are skipped.
Since the equals cases jump to the very final opcode, you can add new opcodes at the top (that is, just after the initial opcode), incrementing the equals skip count for each one.
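To make the skip-count arithmetic concrete, here is a small user-space model of the jump logic. It is only a simulation under stated assumptions, not the kernel's BPF interpreter: toy_op, toy_filter and toy_allows are made-up names, the initial load opcode is left out, brk is picked arbitrarily as the newly added syscall, and the syscall numbers are hard-coded x86-64 values for illustration.

```c
#include <stddef.h>

/* Simplified stand-ins for the opcodes in strict_filter (not kernel types). */
enum toy_kind { TOY_JEQ, TOY_KILL, TOY_ALLOW };

struct toy_op {
    enum toy_kind kind;
    int nr;        /* syscall number to compare against (TOY_JEQ only) */
    int equals;    /* opcodes to skip when the numbers match */
    int differs;   /* opcodes to skip when they do not */
};

/* Same layout as strict_filter minus the initial load, with one new
   comparison (brk) added at the top. Its skip count is 6, one more than
   the rt_sigreturn line below it. Numbers are x86-64 syscall numbers. */
static const struct toy_op toy_filter[] = {
    { TOY_JEQ,   12, 6, 0 },  /* brk (newly added) */
    { TOY_JEQ,   15, 5, 0 },  /* rt_sigreturn */
    { TOY_JEQ,    0, 4, 0 },  /* read */
    { TOY_JEQ,    1, 3, 0 },  /* write */
    { TOY_JEQ,   60, 2, 0 },  /* exit */
    { TOY_JEQ,  231, 1, 0 },  /* exit_group */
    { TOY_KILL,   0, 0, 0 },
    { TOY_ALLOW,  0, 0, 0 },
};

/* Returns 1 if the toy filter allows syscall number nr, 0 if it kills. */
static int toy_allows(int nr)
{
    size_t len = sizeof toy_filter / sizeof toy_filter[0];
    size_t pc = 0;

    while (pc < len) {
        const struct toy_op *op = &toy_filter[pc];
        if (op->kind == TOY_ALLOW)
            return 1;
        if (op->kind == TOY_KILL)
            return 0;
        /* Move to the next opcode, then skip `equals` or `differs` more. */
        pc += 1 + (size_t)(nr == op->nr ? op->equals : op->differs);
    }
    return 0;  /* ran off the end: treat as kill */
}
```

Every matching comparison lands on the final allow entry, and anything that matches nothing falls through to the kill entry. In the real filter the equivalent change would be one more BPF_JUMP line right after the load, with the skip count one higher than the line below it.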
Note that printf() will not work after the seccomp filter is installed, because internally, the C library wants to do a fstat syscall (on standard output), and a brk syscall to allocate some memory for a buffer.
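Since write(2) is in the allowed list, output is still possible as long as stdio is bypassed. A minimal sketch, with write_str as a hypothetical helper name:

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a whole string using only write(2),
   continuing after short writes. Returns 0 on success, -1 on any error. */
static int write_str(int fd, const char *s)
{
    size_t left = strlen(s);   /* strlen() makes no syscall */

    while (left > 0) {
        ssize_t n = write(fd, s, left);
        if (n < 0)
            return -1;
        s += n;
        left -= (size_t)n;
    }
    return 0;
}
```

Calling write_str(STDOUT_FILENO, "done\n") after the filter is installed should then only produce write syscalls, which the filter allows.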
I've been trying to piece together how stack memory is handed out to threads. I haven't been able to piece the whole thing together. I tried to go to the code, but I'm more confused, so I'm asking for your help.
I asked this question a little while ago. So assume that particular program (therefore, all threads are within the same process). If I print the beginning of each thread's stack and how much is allocated for it, I get output like the table at the end of this message. The first column is a time_t usec, the second doesn't matter, the third is the tid of the thread, and the fourth is the guard size. Then come the beginning and end of the stack (rows are sorted by beginning of stack), then the allocated stack size (8 MB by default), and the last column is the difference between the end of one allocated stack and the beginning of the next.
This means (I think) that if the last column is 0, the stacks are contiguous. If it is positive, then since the stack grows down in memory, there is "free space" of that many MB between one tid and the next (in memory). If it is negative, memory is being reused, which may mean that the stack space had been freed before this thread was created.
My problem is: what exactly is the algorithm that assigns stack space to threads (at a higher level than code) and why do I sometimes get contiguous stacks, and sometimes not, and sometimes get values like 7.94140625 and 0.0625 in the last column?
This is all Linux 2.6, C and pthreads.
This may be a question we will have to iterate on to get it right, and for this I apologize, but I'm telling you what I know right now. Feel free to ask for clarifications.
Thanks for this. The table follows.
usec   -   tid    guard  stack_begin  stack_end  stack_size  gap_MB
52815  14  14786  4096   92549120     100941824  8392704     0
52481  14  14784  4096   100941824    109334528  8392704     0
51700  14  14777  4096   109334528    117727232  8392704     0
70747  14  14806  4096   117727232    126119936  8392704     8.00390625
75813  14  14824  4096   117727232    126119936  8392704     0
51464  14  14776  4096   126119936    134512640  8392704     8.00390625
76679  14  14833  4096   126119936    134512640  8392704     -4.51953125
53799  14  14791  4096   139251712    147644416  8392704     -4.90234375
52708  14  14785  4096   152784896    161177600  8392704     0
50912  14  14773  4096   161177600    169570304  8392704     0
51617  14  14775  4096   169570304    177963008  8392704     0
70028  14  14793  4096   177963008    186355712  8392704     0
51048  14  14774  4096   186355712    194748416  8392704     0
50596  14  14771  4096   194748416    203141120  8392704     8.00390625
First, by stracing a simple test program that launches a single thread, we can see the syscalls it used to create a new thread. Here's a simple test program:
#include <pthread.h>
#include <stdio.h>
void *test(void *x) { (void)x; return NULL; }
int main() {
pthread_t thr;
printf("start\n");
pthread_create(&thr, NULL, test, NULL);
pthread_join(thr, NULL);
printf("end\n");
return 0;
}
And the relevant portion of its strace output:
write(1, "start\n", 6start
) = 6
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xf6e32000
brk(0) = 0x8915000
brk(0x8936000) = 0x8936000
mprotect(0xf6e32000, 4096, PROT_NONE) = 0
clone(child_stack=0xf7632494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xf7632bd8, {entry_number:12, base_addr:0xf7632b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xf7632bd8) = 9181
futex(0xf7632bd8, FUTEX_WAIT, 9181, NULL) = -1 EAGAIN (Resource temporarily unavailable)
write(1, "end\n", 4end
) = 4
exit_group(0) = ?
We can see that it obtains a stack from mmap with PROT_READ|PROT_WRITE protection and MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK flags. It then protects the first (ie, lowest) page of the stack, to detect stack overflows. The rest of the calls aren't really relevant to the discussion at hand.
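The mmap and mprotect pair from the trace can be reproduced directly. This is only a sketch of the same two calls, not glibc's actual stack allocator: map_stack_with_guard is a hypothetical helper, and both sizes are assumed to be multiples of the page size.

```c
#define _GNU_SOURCE         /* for MAP_ANONYMOUS and MAP_STACK */
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper mirroring the two syscalls in the trace:
   map `size` bytes, then make the lowest `guard` bytes inaccessible.
   Both sizes are assumed to be multiples of the page size.
   Returns the base of the mapping, or NULL on failure. */
static void *map_stack_with_guard(size_t size, size_t guard)
{
    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, guard, PROT_NONE) != 0) {
        munmap(base, size);
        return NULL;
    }
    return base;   /* usable stack is [base + guard, base + size) */
}
```

With the values from the trace this would be map_stack_with_guard(8392704, 4096). Because the stack grows down from the top of the mapping, an overflow eventually touches the protected lowest page and faults instead of silently corrupting whatever is mapped below.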
So how does mmap pick the address for the stack? Let's start at mmap_pgoff in the Linux kernel, the entry point for the modern mmap2 syscall. It delegates to do_mmap_pgoff after taking some locks, which in turn calls get_unmapped_area to find an appropriate range of unmapped pages.
Unfortunately, this then calls a function pointer stored in the process's memory descriptor (the mm), probably so that 32-bit and 64-bit processes can have different ideas of which addresses can be mapped. In the case of x86, the pointer is set in arch_pick_mmap_layout, which switches based on whether it's using a 32-bit or 64-bit architecture for this process.
So let's look at the implementation of arch_get_unmapped_area then. It first gets some reasonable defaults for its search from find_start_end, then tests to see if the address hint passed in is valid (for thread stacks, no hint is passed). It then starts scanning through the virtual memory map, starting from a cached address, until it finds a hole. It saves the end of the hole for use in the next search, then returns the location of this hole. If it reaches the end of the address space, it starts again from the start, giving it one more chance to find an open area.
So as you can see, normally, it will assign stacks in an increasing manner (for x86; x86-64 uses arch_get_unmapped_area_topdown and will likely assign them decreasing). However, it also keeps a cache of where to start a search, so it might leave gaps depending on when areas are freed. In particular, when a mmaped area is freed, it might update the free-address-search-cache, so you might see out of order allocations there as well.
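The search described above can be modelled in a few lines. This is a toy model of the bottom-up scan, not kernel code: the real implementation walks the kernel's VMA structures, while region, scan_for_hole and find_hole here are hypothetical names working on a plain sorted array, and 0 is used as the "no hole found" value.

```c
#include <stddef.h>

struct region { unsigned long start, end; };   /* mapped range [start, end) */

/* Scan regions (sorted by start, non-overlapping) upward from `from` and
   return the first address where `len` bytes fit below `limit`, or 0. */
static unsigned long scan_for_hole(const struct region *r, size_t n,
                                   unsigned long from, unsigned long limit,
                                   unsigned long len)
{
    unsigned long addr = from;

    for (size_t i = 0; i < n; i++) {
        if (r[i].end <= addr)
            continue;                 /* region lies entirely below addr */
        if (r[i].start >= addr && r[i].start - addr >= len)
            return addr;              /* hole before this region is big enough */
        addr = r[i].end;              /* otherwise step past the region */
    }
    return (limit >= addr && limit - addr >= len) ? addr : 0;
}

/* Toy version of the search: start at the cached address, retry once from
   the bottom (`base`, assumed non-zero), and remember where the hole ended. */
static unsigned long find_hole(const struct region *r, size_t n,
                               unsigned long base, unsigned long limit,
                               unsigned long *cache, unsigned long len)
{
    unsigned long addr = scan_for_hole(r, n, *cache, limit, len);

    if (addr == 0 && *cache != base)
        addr = scan_for_hole(r, n, base, limit, len);
    if (addr != 0)
        *cache = addr + len;
    return addr;
}
```

Successive calls tend to return increasing addresses because the cache keeps moving up, but a search that wraps around, or a cache reset after an unmap, can hand back a lower address. That is one plausible source of the gaps and reuse visible in the question's table.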
That said, this is all an implementation detail. Do not rely on any of this in your program. Just take what addresses mmap hands out and be happy :)
glibc handles this in nptl/allocatestack.c.
Key line is:
mem = mmap (NULL, size, prot,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
So it just asks the kernel for some anonymous memory, not unlike malloc does for large blocks. Which block it actually gets is up to the kernel...