Segmentation Fault using getcontext() in thread library - c

I am trying to implement a user-level thread library in C using system calls such as getcontext(), swapcontext(), etc.
I have a thread control block that looks like this:
struct tcb {
    int thread_id;
    int thread_pri;
    ucontext_t *thread_context;
    struct tcb *next;
};
And I have a function called init() that looks like this:
void t_init()
{
    tcb *tmp;
    tmp = malloc(sizeof(tcb));
    getcontext(tmp->thread_context); /* let tmp be the context of main() */
    running_head = tmp;
}
Using gdb, I found that the segmentation fault happens at runtime inside the call to getcontext(tmp->thread_context).
I have read the man page for getcontext() but am unsure why this call is segfaulting!
Any suggestions, please?

You haven't allocated any space for thread_context; try:
int t_init(void)
{
    struct tcb *tmp;

    tmp = malloc(sizeof(struct tcb));
    if (!tmp)
        return -1;
    memset(tmp, 0, sizeof(struct tcb)); /* zero the block itself, not &tmp */
    tmp->thread_context = malloc(sizeof(ucontext_t));
    if (!tmp->thread_context) {
        free(tmp);
        return -1;
    }
    getcontext(tmp->thread_context);
    running_head = tmp;
    return 0;
}

The GNU C Library Reference Manual (chapter 23, "Non-Local Exits", page 622) says the following about getcontext/setcontext:
While allocating the memory for the stack one has to be careful.
Most modern processors keep track of whether a certain memory region is allowed to contain code which is executed or not. Data segments and heap memory are normally not tagged to allow this. The result is that programs would fail. Examples of such code include the calling sequences the GNU C compiler generates for calls to nested functions.
Safe ways to allocate stacks correctly include using memory on the original thread's stack or explicitly allocating memory tagged for execution using memory-mapped I/O.
This is what is causing the problem, and you should allocate the stack memory the recommended way (using memory-mapped I/O, i.e. mmap; for more information, please refer to the libc manual).


Why can't malloc/free APIs be correctly called in threads created by clone?

Why can some of glibc's APIs (such as malloc(), realloc(), or free()) not be called correctly in threads that are created by the clone syscall?
Here is my code only for testing:
int thread_func( void *arg )
{
    void *ptr = malloc( 4096 );
    printf( "tid=%d, ptr=%x\n", gettid(), ptr );
    sleep(1);
    if( ptr )
        free( ptr );
    return 0;
}

int main( int argc, char **argv )
{
    int i, m;
    void *stk;
    int stksz = 1024 * 128;
    int flag = CLONE_VM | CLONE _FILES | CLONE_FS | CLONE_SIGHAND;

    for( i=m=0; i < 100; i++ )
    {
        stk = malloc( stksz );
        if( !stk ) break;
        if( clone( thread_func, stk+stksz, flags, NULL, NULL, NULL, NULL ) != -1 )
            m++;
    }
    printf( "create %d thread\n", m );
    sleep(10);
    return 0;
}
Testing result: the thread function thread_func or the main thread will block in malloc() or free() randomly, or sometimes malloc() or free() crashes.
I think malloc() and free() may need certain TLS data to distinguish the threads from each other.
Does anyone know the reason, and what solution can be used to resolve this problem?
I think malloc() and free() may need certain TLS data to distinguish the threads from each other.
Glibc's malloc() and free() do not rely on TLS. They use mutexes to protect the shared memory-allocation data structures. To reduce contention for those, they employ a strategy of maintaining separate memory-allocation arenas with independent metadata and mutexes. This is documented on their manual page.
After correcting the syntax errors in your code and dummying-out the call to non-existent function gettid() (see comments on the question), I was able to produce segmentation faults, but not blockage. Perhaps you confused the exit delay caused by your program's 10-second sleep with blockage.
In addition to any issue that may have been related to your undisclosed implementation of gettid(), your program contains two semantic errors, each producing undefined behavior:
As I already noted in comments, it passes the wrong child-stack pointer values.*
It uses the wrong printf() directive in thread_func() for printing the pointer. The directive for pointer values is %p; %x is for arguments of type unsigned int.
After I corrected those errors as well, the program consistently ran to completion for me. Revised code:
#define _GNU_SOURCE   /* for clone() */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int thread_func(void *arg) {
    void *ptr = malloc(4096);
    // printf( "tid=%d, ptr=%x\n", gettid(), ptr );
    printf("tid=%d, ptr=%p\n", 1, ptr);
    sleep(1);
    if (ptr) {
        free(ptr);
    }
    return 0;
}

int main(int argc, char **argv) {
    int i, m;
    char *stk; // Note: char * instead of void * to afford arithmetic
    int stksz = 1024 * 128;
    int flags = CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND;

    for (i = m = 0; i < 100; i++) {
        stk = malloc(stksz);
        if (!stk) break;
        if (clone(thread_func, stk + stksz - 1, flags, NULL, NULL, NULL, NULL) != -1) {
            m++;
        }
    }
    printf("create %d thread\n", m);
    sleep(10);
    return 0;
}
Even with that, however, all is not completely well: I see various anomalies in the program output, especially near the beginning.
The bottom line is that, contrary to your assertion, you are not creating any threads, at least not in the sense that the C library recognizes. You are merely creating processes that have behavior similar to threads'. That may be sufficient for some purposes, but you cannot rely on the system to treat such processes identically to threads.
On Linux, bona fide threads that the system and standard library will recognize are POSIX threads, launched via pthread_create(). (I note here that modifying your program to use pthread_create() instead of clone() resolved the output anomalies for me.) You might be able to add flags and arguments to your clone() calls that make the resulting processes enough like the Linux implementation of pthreads to be effectively identical, but whyever would you do such a thing instead of just using real pthreads in the first place?
* The program also performs pointer arithmetic on a void *, which C does not permit. GCC accepts that as an extension, however, and since your code is deeply Linux-specific anyway, I'm letting that slide with only this note.
Correct, malloc and free need TLS for at least the following things:
The malloc arena attached to the current thread (used for allocation operations).
The errno TLS variable (written to when system calls fail).
The stack protector canary (if enabled and the architecture stores the canary in the TCB).
The malloc thread cache (enabled by default in the upcoming glibc 2.26 release).
All these items need a properly initialized thread control block (TCB), but curiously, until recently, as far as malloc/free were concerned, it almost did not matter if a thread created with clone shared its TCB with another thread (so that the data was no longer thread-local):
Threads basically never reattach themselves to a different arena, so the arena TLS variable is practically read-only after initialization—and multiple threads can share a single arena. errno can be shared as long as system calls only fail in one of the threads undergoing sharing. The stack protector canary is read-only after process startup, and its value is identical across threads anyway.
But all this is an implementation detail, and things change radically in glibc 2.26 with its malloc thread cache: The cache is read and written without synchronization, so it is very likely that what you are trying to do results in memory corruption.
This is not a material change in glibc 2.26; it is how things have always been: calling any glibc function from a thread created with raw clone is undefined. As John Bollinger pointed out, this mostly worked by accident before, but I can assure you that it has always been completely undefined.

How do userspace programs pass memory back to the kernel after free()?

I've been reading a lot about memory allocation on the heap and how certain heap management allocators do it.
Say I have the following program:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    // allocate 4 gigabytes of RAM
    void *much_mems = malloc(4294967296);
    // sleep for 10 minutes
    sleep(600);
    // free teh ram
    free(much_mems);
    // sleep some moar
    sleep(600);
    return 0;
}
Let's say for sake of argument that my compiler doesn't optimize out anything above, that I can actually allocate 4GiB of RAM, that the malloc() call returns an actual pointer and not NULL, that size_t can hold an integer as big as 4294967296 on my given platform, that the allocater implemented by the malloc call actually does allocate that amount of RAM in the heap. Pretend that the above code does exactly what it looks like it will do.
After the call to free executes, how does the kernel know that those 4 GiB of RAM are now eligible for use for other processes and for the kernel itself? I'm not assuming the kernel is Linux, but that would be a good example. Before the call to free, this process has a heap size of at least 4GiB, and afterward, does it still have that heap size?
How do modern operating systems allow userspace programs to return memory back to kernel space? Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available? And is it possible that my 4 GiB allocation will be non-contiguous?
Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available?
Yes.
A modern implementation of malloc on Linux will call mmap to allocate a large amount of memory. The kernel will find an unused virtual address, mark it as allocated, and return it. (The kernel may also return an error if there isn't enough free memory)
free would then call munmap to deallocate the memory, passing the address and size of the allocation.
On Windows, malloc will call VirtualAlloc and free will call VirtualFree.
On GNU/Linux with glibc, large memory allocations of more than a few hundred kilobytes are handled by calling mmap. When the free function is invoked on such a block, the library knows that the memory was allocated this way (thanks to metadata stored in a header) and simply calls munmap on it to release it. That's how the kernel knows: its mmap and munmap API is being used.
You can see these calls if you run strace on the program.
The kernel keeps track of all mmap-ed regions using a red-black tree. Given an arbitrary virtual address, it can quickly determine whether it lands in the mmap area, and which mapping, by performing a tree walk.
Before the call to free, this process has a heap size of at least 4GiB...
The C language defines neither "heap" nor "stack". Before the call to free, this process has a chunk of 4 GiB of dynamically allocated memory...
and afterward, does it still have that heap size?
...and after the free(), access to that memory would be undefined behaviour, so for practical purposes, that dynamically allocated memory is no longer "there".
What the library does "under the hood" (e.g. caching, see below) is up to the library, and is subject to change without further notice. This could change with the amount of available physical memory, system load, runtime parameters, ...
How do modern operating systems allow userspace programs to return memory back to kernel space?
It's up to the standard library's implementation to decide (which, of course, has to talk to the operating system to actually, physically allocate / free memory).
Others have pointed out how certain, existing implementations do it. Other libraries, operating systems, and environments exist.
Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available?
Possibly. A common optimization done by library implementations is to "cache" free()d memory, so subsequent malloc() calls can be served without talking to the kernel (which is a costly operation). When, how much, and how long memory is cached this way is, you guessed it, implementation-defined.
And is it possible that my 4 GiB allocation will be non-contiguous?
The process will always "see" contiguous memory. In a system supporting virtual memory (i.e. "modern" desktop OSes like Linux or Windows), the physical memory might be non-contiguous, but the virtual addresses your process gets to see will be contiguous (or the malloc() would have failed if this requirement could not be serviced).
Again, other systems exist. You might be looking at a system that doesn't virtualize addresses (i.e. gives physical addresses to the process). You might be looking at a system that assigns a given amount of memory to a process on startup, serves any malloc() requests from that, and doesn't support the allocation of additional memory. And so on.
If we're using Linux as an example, it uses mmap to allocate large chunks of memory. This means that when you free such a chunk, it gets unmapped, i.e. the kernel gets told that it can now unmap this memory. Also read up on the brk and sbrk system calls. A good place to start would be here...
What does brk( ) system call do?
and here. The following post discusses how malloc is implemented which will give you a good idea what's happening under the covers...
How is malloc() implemented internally?
Doug Lea's malloc can be found here. It's well commented and public domain...
ftp://g.oswego.edu/pub/misc/malloc.c
malloc() and free() are not system calls: they are C library functions that manage the heap in user space and only turn to the kernel (via brk/sbrk or mmap) when they need more address space.
Inside a kernel itself, though, a heap manager has to do all of that bookkeeping by hand.
See the following heap implementation code, taken from a small educational kernel:
void *heap_alloc(uint32_t nbytes) {
    heap_header *p, *prev_p; // used to keep track of the current unit
    unsigned int nunits;     // the number of "allocation units" needed for nbytes of memory

    nunits = (nbytes + sizeof(heap_header) - 1) / sizeof(heap_header) + 1; // how much we need for this call

    // check to see if the list has been created yet; start it if not
    if ((prev_p = _heap_free) == NULL) {
        _heap_base.s.next = _heap_free = prev_p = &_heap_base; // point at the base of the memory
        _heap_base.s.alloc_sz = 0;                             // and set its allocation size to zero
    }

    // now enter a loop to find a block of memory
    for (p = prev_p->s.next;; prev_p = p, p = p->s.next) {
        // did we find a big enough block?
        if (p->s.alloc_sz >= nunits) {
            // the block is the exact length
            if (p->s.alloc_sz == nunits)
                prev_p->s.next = p->s.next;
            // the block needs to be cut
            else {
                p->s.alloc_sz -= nunits;
                p += p->s.alloc_sz;
                p->s.alloc_sz = nunits;
            }
            _heap_free = prev_p;
            return (void *)(p + 1);
        }
        // not enough space!! Try to get more from the kernel
        if (p == _heap_free) {
            // if the kernel has no more memory, return error!
            if ((p = morecore()) == NULL)
                return NULL;
        }
    }
}
This heap_alloc function uses the morecore function, which is implemented as follows:
heap_header *morecore() {
    char *cp;
    heap_header *up;

    cp = (char *)pmmngr_alloc_block(); // allocate more memory for the heap
    // if cp is null we have no memory left
    if (cp == NULL)
        return NULL;
    //vmmngr_mapPhysicalAddress(cp, (void *)_virt_addr); // map its virtual address to its physical address
    vmmngr_mapPhysicalAddress(vmmngr_get_directory(), _virt_addr, (uint32_t)cp, I86_PTE_PRESENT | I86_PTE_WRITABLE);
    _virt_addr += BLOCK_SIZE; // advance the virtual address by BLOCK_SIZE; this will be our next allocation address
    up = (heap_header *)cp;
    up->s.alloc_sz = BLOCK_SIZE;
    heap_free((void *)(up + 1));
    return _heap_free;
}
As you can see, this function asks the physical memory manager to allocate a block:
cp = (char *)pmmngr_alloc_block();
and then maps the allocated block into virtual memory:
vmmngr_mapPhysicalAddress(vmmngr_get_directory(), _virt_addr, (uint32_t)cp, I86_PTE_PRESENT | I86_PTE_WRITABLE);
So in a kernel like this, the whole story is controlled by a heap manager running at kernel level.

malloc works, cudaHostAlloc segfaults?

I am new to CUDA and I want to use cudaHostAlloc. I was able to isolate my problem to the following code. Using malloc for host allocation works; using cudaHostAlloc results in a segfault, possibly because the allocated area is invalid? When I dump the pointer, it is not null in either case, so cudaHostAlloc returns something...
works
in_h = (int*) malloc(length*sizeof(int)); //works
for (int i = 0;i<length;i++)
in_h[i]=2;
doesn't work
cudaHostAlloc((void**)&in_h,length*sizeof(int),cudaHostAllocDefault);
for (int i = 0;i<length;i++)
in_h[i]=2; //segfaults
Standalone Code
#include <stdio.h>
#include <stdlib.h>

void checkDevice()
{
    cudaDeviceProp info;
    int deviceName;
    cudaGetDevice(&deviceName);
    cudaGetDeviceProperties(&info, deviceName);
    if (!info.deviceOverlap)
    {
        printf("Compute device can't use streams and should be discarded.");
        exit(EXIT_FAILURE);
    }
}

int main()
{
    checkDevice();
    int *in_h;
    const int length = 10000;
    cudaHostAlloc((void**)&in_h, length*sizeof(int), cudaHostAllocDefault);
    printf("segfault comming %d\n", in_h);
    for (int i = 0; i < length; i++)
    {
        in_h[i] = 2; // Segfaults here
    }
    return EXIT_SUCCESS;
}
Invocation
[id129]$ nvcc fun.cu
[id129]$ ./a.out
segfault comming 327641824
Segmentation fault (core dumped)
Details
The program is run in interactive mode on a cluster. I was told that invoking the program from the compute node pushes it to the cluster. I have not had any trouble with other homemade toy CUDA codes.
Edit
cudaError_t err = cudaHostAlloc((void**)&in_h,length*sizeof(int),cudaHostAllocDefault);
printf("Error status is %s\n",cudaGetErrorString(err));
gives a driver error...
Error status is CUDA driver version is insufficient for CUDA runtime version
Always check for errors. It is likely that cudaHostAlloc is failing to allocate any memory. If it fails, you are not bailing out but are instead writing to unallocated address space. When you use malloc here, it allocates memory as requested and does not fail, but there are cases where malloc may fail as well, so it is best to check the pointer before writing through it.
For future, it may be best to do something like this
int *ptr = NULL;
// Allocate using cudaHostAlloc or malloc
// If using cudaHostAlloc check for success
if (!ptr) ERROR_OUT();
// Write to this memory
EDIT (Response to edit in the question)
The error message indicates that your driver is older than the toolkit. If you do not want to be stuck for a while, try downloading an older version of the CUDA toolkit that is compatible with your driver. You can install it in your user account and use its nvcc and libraries temporarily.
Your segfault is not caused by the writes to the block of memory allocated by cudaHostAlloc, but rather by trying to free() an address returned from cudaHostAlloc. I was able to reproduce your problem using the code you provided, but replacing free with cudaFreeHost fixed the segfault for me.
cudaFreeHost

kernel crash with kmalloc

I am trying to allocate memory using kmalloc in kernel code (in fact, in a queueing discipline). I want to assign memory to q->agg_queue_hdr, where q is a queueing discipline and agg_queue_hdr is a struct, so if I allocate memory like this:
q->agg_queue_hdr=kmalloc(sizeof(struct agg_queue), GFP_ATOMIC);
the kernel crashes. Based on the examples of kmalloc I found, I changed it to:
agg_queue_hdr=kmalloc(sizeof(struct agg_queue), GFP_ATOMIC);
with which the kernel doesn't crash. Now I want to know how I can assign memory to the pointer q->agg_queue_hdr.
Make sure q points to a valid area of memory. Then you should be able to assign q->agg_queue_hdr as you had it to begin with.
Why don't you modify your code in the following way, which would avoid the kernel panic:
if (q->agg_queue_hdr) {
    q->agg_queue_hdr = kmalloc(sizeof(struct agg_queue), GFP_ATOMIC);
}
else {
    printk("[+] q->agg_queue_hdr invalid \n");
    dump_stack(); // print the call stack in the kernel log
}
If you disassemble the access to q->agg_queue_hdr, you will see an "ldr" instruction performing the load; that is where the kernel panic occurs.

How to allocate an executable page in a Linux kernel module?

I'm writing a Linux kernel module, and I'd like to allocate an executable page. Plain kmalloc() returns a pointer within a non-executable page, and I get a kernel panic when executing code there. It has to work on Ubuntu Karmic x86, 2.6.31-20-generic-pae.
#include <linux/vmalloc.h>
#include <asm/pgtable_types.h>
...
char *p = __vmalloc(byte_size, GFP_KERNEL, PAGE_KERNEL_EXEC);
...
if (p != NULL) vfree(p);
/**
* vmalloc_exec - allocate virtually contiguous, executable memory
* @size: allocation size
*
* Kernel-internal function to allocate enough pages to cover @size
* the page level allocator and map them into contiguous and
* executable kernel virtual space.
*
* For tight control over page level allocator and protection flags
* use __vmalloc() instead.
*
* Return: pointer to the allocated memory or %NULL on error
*/
void *vmalloc_exec(unsigned long size)
{
return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC,
NUMA_NO_NODE, __builtin_return_address(0));
}
Linux 5.4 and above no longer export these interfaces, so arbitrary kernel modules can no longer modify page attributes to turn the executable bit on or off. There might be a specific kernel configuration that still allows it, but I'm not aware of one.
If you are on a kernel version lower than 5.4, you can use set_memory_x(unsigned long addr, int numpages) and friends to change the attributes of a page, or, if you need to pass a struct page, you can use set_pages_x.
Keep in mind that it is considered dangerous and you have to know what you are doing.
