Can I get a thread's stack address from pthread_self() - c

I want to get the stack address of a thread through some function to which I can pass pthread_self(). Is that possible? The reason is that I want to write my own assigned thread identifier for a thread somewhere in its stack. I can write near the end of the stack (the end of the stack's memory, not the current stack address; we can of course expect the application never to reach the bottom of the stack, so the space there is usable).
In other words, I want to use the thread stack for putting a kind of thread local variable there. So, do we have some function like the following provided by pthread?
stack_address = stack_address_for_thread( pthread_self() );
I can use the syntax for thread local variables by gcc for this purpose, but I'm in a situation where I can't use them.

It's probably better to use pthread_key_create and pthread_getspecific and let the implementation worry about those details.
A good example of usage is here:
pthread_key_create
Edit: I should clarify -- I'm suggesting you use the libpthread provided method of creating thread-local information, instead of rolling your own by pushing something onto the end of the stack where it's possible your information could be lost.

With GCC, it is simpler to declare your thread local variables with __thread keyword, like
__thread int i;
extern __thread struct state s;
static __thread char *p;
That is GCC-specific (but I'd guess Clang also has it, and the newest C++ and future C standards have something similar), yet less brittle than pointer hacks based upon pthread_self() (and it should be a bit faster, but less portable, than pthread_getspecific, as suggested by Denniston).
But I would really like you to give more context and motivation in your questions.

I want to write my own assigned thread identifier for a thread
There are multiple ways to achieve that. The most obvious one:
__thread int my_id;
I can use the syntax for thread local variables by gcc for this purpose, but I'm in a situation where I can't use them.
You need to explain why you can't use thread-locals. Chances are high that other solutions, such as pthread_getattr_np, wouldn't work either.

First, get the bottom of the stack and give it read/write permission with the following code.
pthread_attr_t attr;
void *stackaddr;
int *plocal_var;
size_t stacksize;

pthread_getattr_np(pthread_self(), &attr);
pthread_attr_getstack(&attr, &stackaddr, &stacksize);
printf("stackaddr = %p, stacksize = %zu\n", stackaddr, stacksize);

plocal_var = (int *)mmap(stackaddr, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
// Now try to write something
*plocal_var = 4;
and then you can get the thread ID, with the function get_thread_id() shown below. Note that calling mmap with size 4096 has the effect of pushing the boundary of the stack by 4096, that is why we subtract 4096 when getting the local variable address.
int get_thread_id()
{
    pthread_attr_t attr;
    char *stackaddr;
    int *plocal_var;
    size_t stacksize;

    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, (void **)&stackaddr, &stacksize);
    //printf("stackaddr = %p, stacksize = %zu\n", stackaddr, stacksize);
    plocal_var = (int *)(stackaddr - 4096);
    return *plocal_var;
}

Related

Why does the page fault not cause the thread to finish its execution later?

I have the below code where I'm intentionally creating a page fault in one of the threads in file.c
util.c
#include "util.h"

// to use as a fence() instruction
extern inline __attribute__((always_inline))
CYCLES rdtscp(void) {
    CYCLES cycles;
    // rdtscp also writes EDX and ECX, so they must be listed as clobbers
    asm volatile ("rdtscp" : "=a" (cycles) : : "rdx", "rcx");
    return cycles;
}

// initialize address
void init_ram_address(char *FILE_NAME) {
    char *filename = FILE_NAME;
    int fd = open(filename, O_RDWR);
    if (fd == -1) {
        printf("Could not open file.\n");
        exit(0);
    }
    void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_POPULATE, fd, 0);
    ram_address = (int *)file_address;
}

// initialize address
void init_disk_address(char *FILE_NAME) {
    char *filename = FILE_NAME;
    int fd = open(filename, O_RDWR);
    if (fd == -1) {
        printf("Could not open file.\n");
        exit(0);
    }
    void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    disk_address = (int *)file_address;
}
file.c
#include "util.h"

void *f1(void *arg);
void *f2(void *arg);

pthread_barrier_t barrier;
pthread_mutex_t mutex;

int main(int argc, char **argv)
{
    pthread_t t1, t2;
    // in ram
    init_ram_address(RAM_FILE_NAME);
    // in disk
    init_disk_address(DISK_FILE_NAME);
    pthread_create(&t1, NULL, &f1, NULL);
    pthread_create(&t2, NULL, &f2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}

void *f1(void *arg)
{
    rdtscp();
    int load = *(ram_address);
    rdtscp();
    printf("Expecting this to be run first.\n");
    return NULL;
}

void *f2(void *arg)
{
    rdtscp();
    int load = *(disk_address);
    rdtscp();
    printf("Expecting this to be run second.\n");
    return NULL;
}
I've used rdtscp() in the above code for fencing purposes (to ensure that the print statement get executed only after the load operation is done).
Since t2 will incur a page fault, I expect t1 to finish executing its print statement first.
To run both the threads on the same core, I run taskset -c 10 ./file.
I see that t2 prints its statement before t1. What could be the reason for this?
I think you're expecting t2's int load = *(disk_address); to cause a context switch to t1, and since you're pinning everything to the same CPU core, that would give t1 time to win the race to take the lock for stdout.
A soft page fault doesn't need to context-switch, just update the page tables with a file page from the pagecache. Despite the mapping being backed by a disk file, not anonymous memory or just copy-on-write tricks, if the file has been read or written recently it will be hot in the pagecache and not require I/O (which would make it a hard page fault).
Maybe try evicting disk cache before a test run, like with echo 3 | sudo tee /proc/sys/vm/drop_caches if this is Linux, so access to the mmap region without MAP_POPULATE will be a hard page fault (requiring I/O).
(See https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache; sync first, at least on the disk file, if it was recently written, to make sure its pages are clean and able to be evicted aka dropped. Dropping caches is mainly useful for benchmarking.)
Or programmatically, you can hint the kernel with the madvise(2) system call, like madvise(MADV_DONTNEED) on a page, encouraging it to evict it from pagecache soon. (Or at least hint that your process doesn't need it; other processes might keep it hot).
In Linux kernel 5.4 and later, MADV_COLD works as a hint to evict the specified page(s) on memory pressure. ("Deactivate" probably means remove from HW page tables, so next access will at least be a soft page fault.) Or MADV_PAGEOUT is apparently supposed to get the kernel to reclaim the specified page(s) right away, I guess before the system call returns. After that, the next access should be a hard page fault.
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure. This is a nondestructive operation. The advice might be ignored for some pages in the range when it is not applicable.
MADV_PAGEOUT (since Linux 5.4)
Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous, it will be swapped out. If a page is file-backed and dirty, it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable.
These madvise args are Linux-specific. The madvise system call itself (as opposed to posix_madvise) is not guaranteed portable, but the man page gives the impression that some other systems have their own madvise system calls supporting some standard "advice" hints to the kernel.
You haven't shown the declaration of ram_address or disk_address.
If it's not a pointer-to-volatile like volatile int *disk_address, the loads may be optimized away at compile time. Writes to non-escaped local vars like int load don't have to respect "memory" clobbers because nothing else could possibly have a reference to them.
If you compiled without optimization or something, then yes the load will still happen even without volatile.

Gaining access to heap metadata of a process from within itself

While I can write reasonable C code, my expertise is mainly with Java and so I apologize if this question makes no sense.
I am writing some code to help me do heap analysis. I'm doing this via instrumentation with LLVM. What I'm looking for is a way to access the heap metadata for a process from within itself. Is such a thing possible? I know that information about the heap is stored in many malloc_state structs (main_arena for example). If I can gain access to main_arena, I can start enumerating the different arenas, heaps, bins, etc. As I understand, these variables are all defined statically and so they can't be accessed.
But is there some way of getting this information? For example, could I use /proc/$pid/mem to leak the information somehow?
Once I have this information, I basically want to get information about all the different freelists. So I want, for every bin in each bin type, the number of chunks in the bin and their sizes. For fast, small, and tcache bins I know that I just need the index to figure out the size. I have looked at how these structures are implemented and how to iterate through them. So all I need is to gain access to these internal structures.
I have looked at malloc_info and that is my fallback, but I would also like to get information about tcache and I don't think that is included in malloc_info.
An option I have considered is to build a custom version of glibc that has the malloc_state variables declared non-statically. But from what I can see, it's not very straightforward to build your own custom glibc, as you have to build the entire toolchain. I'm using clang, so I would have to build LLVM from source against my custom glibc (at least this is what I've understood from researching this approach).
I had a similar requirement recently, so I do think that being able to get to main_arena for a given process does have its value, one example being post-mortem memory usage analysis.
Using dl_iterate_phdr and elf.h, it's relatively straightforward to resolve main_arena based on the local symbol:
#define _GNU_SOURCE
#include <fcntl.h>
#include <link.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

// Ignored:
// - Non-x86_64 architectures
// - Resource and error handling
// - Style
static int cb(struct dl_phdr_info *info, size_t size, void *data)
{
    if (strcmp(info->dlpi_name, "/lib64/libc.so.6") == 0) {
        int fd = open(info->dlpi_name, O_RDONLY);
        struct stat stat;
        fstat(fd, &stat);
        char *base = mmap(NULL, stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        Elf64_Ehdr *header = (Elf64_Ehdr *)base;
        Elf64_Shdr *secs = (Elf64_Shdr *)(base + header->e_shoff);
        for (unsigned secinx = 0; secinx < header->e_shnum; secinx++) {
            if (secs[secinx].sh_type == SHT_SYMTAB) {
                Elf64_Sym *symtab = (Elf64_Sym *)(base + secs[secinx].sh_offset);
                char *symnames = (char *)(base + secs[secs[secinx].sh_link].sh_offset);
                unsigned symcount = secs[secinx].sh_size / secs[secinx].sh_entsize;
                for (unsigned syminx = 0; syminx < symcount; syminx++) {
                    if (strcmp(symnames + symtab[syminx].st_name, "main_arena") == 0) {
                        void *mainarena = ((char *)info->dlpi_addr) + symtab[syminx].st_value;
                        printf("main_arena found: %p\n", mainarena);
                        raise(SIGTRAP);
                        return 0;
                    }
                }
            }
        }
    }
    return 0;
}

int main()
{
    dl_iterate_phdr(cb, NULL);
    return 0;
}
dl_iterate_phdr is used to get the base address of the mapped glibc. The mapping does not contain the symbol table needed (.symtab), so the library has to be mapped again. The final address is determined by the base address plus the symbol value.
(gdb) run
Starting program: a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff77f0700 (LWP 24834)]
main_arena found: 0x7ffff7baec60
Thread 1 "a.out" received signal SIGTRAP, Trace/breakpoint trap.
raise (sig=5) at ../sysdeps/unix/sysv/linux/raise.c:50
50 return ret;
(gdb) select 1
(gdb) print mainarena
$1 = (void *) 0x7ffff7baec60 <main_arena>
(gdb) print &main_arena
$3 = (struct malloc_state *) 0x7ffff7baec60 <main_arena>
The value matches that of main_arena, so the correct address was found.
There are other ways to get to main_arena without relying on the library itself. Walking the existing heap allows for discovering main_arena, for example, but that strategy is considerably less straightforward.
Of course, once you have main_arena, you need all internal type definitions to be able to inspect the data.
I am writing some code to help me do heap analysis.
What kind of heap analysis?
I basically want to get information about all the different freelists. So I want, for every bin in each bin type, the number of chunks in the bin and their sizes. For fast, small, and tcache bins I know that I just need the index to figure out the size.
This information only makes sense if you are planning to change the malloc implementation. It does not make sense to attempt to collect it if your goal is to analyze or improve heap usage by the application, so it sounds like you have an XY problem.
In addition, things like bin and tcache only make sense in a context of particular malloc implementation (TCMalloc and jemalloc would not have any bins).
For analysis of application heap usage, you may want to use TCmalloc, as it provides a lot of tools for heap profiling and introspection.

Why can't malloc/free APIs be correctly called in threads created by clone?

Why can't some of glibc's APIs (such as malloc(), realloc(), or free()) be called correctly in threads that are created by the clone syscall?
Here is my code only for testing:
int thread_func( void *arg )
{
    void *ptr = malloc( 4096 );
    printf( "tid=%d, ptr=%x\n", gettid(), ptr );
    sleep(1);
    if( ptr )
        free( ptr );
    return 0;
}

int main( int argc, char **argv )
{
    int i, m;
    void *stk;
    int stksz = 1024 * 128;
    int flag = CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND;
    for( i=m=0; i < 100; i++ )
    {
        stk = malloc( stksz );
        if( !stk ) break;
        if( clone( thread_func, stk+stksz, flags, NULL, NULL, NULL, NULL ) != -1 )
            m++;
    }
    printf( "create %d thread\n", m );
    sleep(10);
    return 0;
}
Test result: the thread_func threads or the main thread block inside malloc() or free() at random, and sometimes malloc() or free() crashes.
I think malloc() and free() may need certain TLS data to distinguish between threads.
Does anyone know the reason, and what solution can be used to resolve this problem?
I think malloc() and free() may need certain TLS data to distinguish between threads.
Glibc's malloc() and free() do not rely on TLS. They use mutexes to protect the shared memory-allocation data structures. To reduce contention for those, they employ a strategy of maintaining separate memory-allocation arenas with independent metadata and mutexes. This is documented on their manual page.
After correcting the syntax errors in your code and dummying-out the call to non-existent function gettid() (see comments on the question), I was able to produce segmentation faults, but not blockage. Perhaps you confused the exit delay caused by your program's 10-second sleep with blockage.
In addition to any issue that may have been related to your undisclosed implementation of gettid(), your program contains two semantic errors, each producing undefined behavior:
As I already noted in comments, it passes the wrong child-stack pointer values.*
It uses the wrong printf() directive in thread_func() for printing the pointer. The directive for pointer values is %p; %x is for arguments of type unsigned int.
After I corrected those errors as well, the program consistently ran to completion for me. Revised code:
int thread_func(void *arg) {
    void *ptr = malloc(4096);
    // printf( "tid=%d, ptr=%x\n", gettid(), ptr );
    printf("tid=%d, ptr=%p\n", 1, ptr);
    sleep(1);
    if (ptr) {
        free(ptr);
    }
    return 0;
}

int main(int argc, char **argv) {
    int i, m;
    char *stk; // Note: char * instead of void * to afford arithmetic
    int stksz = 1024 * 128;
    int flags = CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND;
    for (i = m = 0; i < 100; i++) {
        stk = malloc(stksz);
        if (!stk) break;
        if (clone(thread_func, stk + stksz - 1, flags, NULL, NULL, NULL, NULL) != -1) {
            m++;
        }
    }
    printf("create %d thread\n", m);
    sleep(10);
    return 0;
}
Even with that, however, all is not completely well: I see various anomalies in the program output, especially near the beginning.
The bottom line is that, contrary to your assertion, you are not creating any threads, at least not in the sense that the C library recognizes. You are merely creating processes that have behavior similar to threads'. That may be sufficient for some purposes, but you cannot rely on the system to treat such processes identically to threads.
On Linux, bona fide threads that the system and standard library will recognize are POSIX threads, launched via pthread_create(). (I note here that modifying your program to use pthread_create() instead of clone() resolved the output anomalies for me.) You might be able to add flags and arguments to your clone() calls that make the resulting processes enough like the Linux implementation of pthreads to be effectively identical, but whyever would you do such a thing instead of just using real pthreads in the first place?
* The program also performs pointer arithmetic on a void *, which C does not permit. GCC accepts that as an extension, however, and since your code is deeply Linux-specific anyway, I'm letting that slide with only this note.
Correct, malloc and free need TLS for at least the following things:
The malloc arena attached to the current thread (used for allocation operations).
The errno TLS variable (written to when system calls fail).
The stack protector canary (if enabled and the architecture stores the canary in the TCB).
The malloc thread cache (enabled by default in the upcoming glibc 2.26 release).
All these items need a properly initialized thread control block (TCB), but curiously, until recently and as far as malloc/free were concerned, it almost did not matter if a thread created with clone shared its TCB with another thread (so that the data was no longer thread-local):
Threads basically never reattach themselves to a different arena, so the arena TLS variable is practically read-only after initialization—and multiple threads can share a single arena. errno can be shared as long as system calls only fail in one of the threads undergoing sharing. The stack protector canary is read-only after process startup, and its value is identical across threads anyway.
But all this is an implementation detail, and things change radically in glibc 2.26 with its malloc thread cache: The cache is read and written without synchronization, so it is very likely that what you are trying to do results in memory corruption.
This is not a material change in glibc 2.26, it is always how things were: calling any glibc function from a thread created with clone is undefined. As John Bollinger pointed out, this mostly worked by accident before, but I can assure you that it has always been completely undefined.

Move memory pages per-thread in NUMA architecture

I have two questions in one:
(i) Suppose thread X is running on CPU Y. Is it possible to use the syscalls migrate_pages, or even better move_pages (or their libnuma wrappers), to move the pages associated with X to the node to which Y is connected?
This question arises because the first argument of both syscalls is a PID (and I need a per-thread approach for some research I'm doing).
(ii) If the answer to (i) is positive, how can I get all the pages used by some thread? My aim is to move the page(s) that contain array M[], for example... how do I "link" data structures with their memory pages, for the sake of using the syscalls above?
Extra information: I'm using C with pthreads. Thanks in advance!
You want to use the higher level libnuma interfaces instead of the low level system calls.
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the task is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind tasks to specific nodes.
The man pages for the low level numa_* system calls warn you away from using them:
Link with -lnuma to get the system call definitions. libnuma and the required <numaif.h> header are available in the numactl package.
However, applications should not use these system calls directly. Instead, the higher level interface provided by the numa(3) functions in the numactl package is recommended. The numactl package is available at <ftp://oss.sgi.com/www/projects/libnuma/download/>. The package is also included in some Linux distributions. Some distributions include the development library and header in the separate numactl-devel package.
Here's the code I use for pinning a thread to a single CPU and moving the stack to the corresponding NUMA node (slightly adapted to remove some constants defined elsewhere). Note that I first create the thread normally, and then call SetAffinityAndRelocateStack() below from within the thread. I think this is much better than trying to create your own stack, since stacks have special support for growing when the bottom is reached.
The code can also be adapted to operate on the newly created thread from outside, but this could give rise to race conditions (e.g. if the thread performs I/O into its stack), so I wouldn't recommend it.
void* PreFaultStack()
{
    const size_t NUM_PAGES_TO_PRE_FAULT = 50;
    const size_t size = NUM_PAGES_TO_PRE_FAULT * numa_pagesize();
    void *allocaBase = alloca(size);
    memset(allocaBase, 0, size);
    return allocaBase;
}

void SetAffinityAndRelocateStack(int cpuNum)
{
    assert(-1 != cpuNum);
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpuNum, &cpuset);
    const int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    assert(0 == rc);

    pthread_attr_t attr;
    void *stackAddr = nullptr;
    size_t stackSize = 0;
    if ((0 != pthread_getattr_np(pthread_self(), &attr)) ||
        (0 != pthread_attr_getstack(&attr, &stackAddr, &stackSize))) {
        assert(false);
    }

    const unsigned long nodeMask = 1UL << numa_node_of_cpu(cpuNum);
    const auto bindRc = mbind(stackAddr, stackSize, MPOL_BIND, &nodeMask, sizeof(nodeMask),
                              MPOL_MF_MOVE | MPOL_MF_STRICT);
    assert(0 == bindRc);

    PreFaultStack();
    // TODO: Also lock the stack with mlock() to guarantee it stays resident in RAM
    return;
}

Why am I getting segmentation fault here?

I have the following code, where I try to write something into the stack. I write at the bottom of the stack, which the application still hasn't touched (note that the stack grows downwards and stackaddr here points to the bottom).
However, I get a segmentation fault even after calling mprotect to give both write and read permissions to that memory region. I get the segmentation fault even with the compilation flag -fno-stack-protector. What is happening here?
pthread_attr_t attr;
void *stackaddr;
int *plocal_var;
size_t stacksize;

pthread_getattr_np(pthread_self(), &attr);
pthread_attr_getstack(&attr, &stackaddr, &stacksize);
printf("stackaddr = %p, stacksize = %zu\n", stackaddr, stacksize);

plocal_var = (int *)stackaddr;
mprotect((void *)plocal_var, 4096, PROT_READ | PROT_WRITE);
*plocal_var = 4;
printf("local_var = %d!\n", *plocal_var);
You are almost certainly trying to mprotect() pages which are not yet mapped. You should check the return code: mprotect() is probably returning -1 and setting errno to ENOMEM (this is documented in the mprotect(2) man page).
Stack pages are mapped on demand, but the kernel is clever enough to distinguish between page faults caused by an access at or above the current stack pointer (which are caused by valid attempts to expand the stack downwards, by decrementing the stack pointer, and then performing a read or write to some positive offset from the new value), and page faults caused by an access below the stack pointer (which are not valid).