I'm writing a Linux kernel module, and I'd like to allocate an executable page. Plain kmalloc() returns a pointer within a non-executable page, and I get a kernel panic when executing code there. It has to work on Ubuntu Karmic x86, 2.6.31-20-generic-pae.
#include <linux/vmalloc.h>
#include <asm/pgtable_types.h>
...
char *p = __vmalloc(byte_size, GFP_KERNEL, PAGE_KERNEL_EXEC);
...
if (p != NULL) vfree(p);
/**
* vmalloc_exec - allocate virtually contiguous, executable memory
* @size: allocation size
*
* Kernel-internal function to allocate enough pages to cover @size
* from the page level allocator and map them into contiguous and
* executable kernel virtual space.
*
* For tight control over page level allocator and protection flags
* use __vmalloc() instead.
*
* Return: pointer to the allocated memory or %NULL on error
*/
void *vmalloc_exec(unsigned long size)
{
return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC,
NUMA_NO_NODE, __builtin_return_address(0));
}
Linux 5.4 and above no longer export these interfaces, precisely so that arbitrary kernel modules can't modify page attributes to turn the executable bit on and off. There might be a specific configuration that still allows it, but I'm not aware of one.
If you're on a kernel version lower than 5.4, you can use set_memory_x(unsigned long addr, int numpages); and friends to change the attributes of a page, or, if you need to pass a struct page, set_pages_x.
Keep in mind that this is considered dangerous and you have to know what you are doing.
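For example, on such older kernels a module-side sketch might look like this (hedged: whether set_memory_x is exported to modules varies by kernel version and architecture, so verify against your target kernel; the header is <asm/cacheflush.h> on older trees, <asm/set_memory.h> on newer ones):
#include <linux/vmalloc.h>
#include <linux/mm.h>
#include <asm/cacheflush.h> /* set_memory_x/set_memory_nx on older kernels */
static void *exec_buf;
static int alloc_exec_page(void)
{
    exec_buf = vmalloc(PAGE_SIZE); /* vmalloc memory is page-aligned */
    if (!exec_buf)
        return -ENOMEM;
    set_memory_x((unsigned long)exec_buf, 1); /* clear NX on that one page */
    return 0;
}
static void free_exec_page(void)
{
    set_memory_nx((unsigned long)exec_buf, 1); /* restore NX before freeing */
    vfree(exec_buf);
}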
I've been reading a lot about memory allocation on the heap and how certain heap management allocators do it.
Say I have the following program:
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
// allocate 4 gigabytes of RAM
void *much_mems = malloc(4294967296);
// sleep for 10 minutes
sleep(600);
// free teh ram
free(much_mems);
// sleep some moar
sleep(600);
return 0;
}
Let's say, for the sake of argument, that my compiler doesn't optimize out anything above, that I can actually allocate 4 GiB of RAM, that the malloc() call returns an actual pointer and not NULL, that size_t can hold an integer as big as 4294967296 on my given platform, and that the allocator implemented by the malloc call actually does allocate that amount of RAM on the heap. Pretend that the above code does exactly what it looks like it will do.
After the call to free executes, how does the kernel know that those 4 GiB of RAM are now eligible for use for other processes and for the kernel itself? I'm not assuming the kernel is Linux, but that would be a good example. Before the call to free, this process has a heap size of at least 4GiB, and afterward, does it still have that heap size?
How do modern operating systems allow userspace programs to return memory back to kernel space? Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available? And is it possible that my 4 GiB allocation will be non-contiguous?
Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available?
Yes.
A modern implementation of malloc on Linux will call mmap to allocate a large amount of memory. The kernel will find an unused virtual address range, mark it as allocated, and return it. (The kernel may also return an error if there isn't enough free memory.)
free would then call munmap to deallocate the memory, passing the address and size of the allocation.
On Windows, malloc will call VirtualAlloc and free will call VirtualFree.
On GNU/Linux with glibc, large memory allocations, of more than a few hundred kilobytes, are handled by calling mmap. When the free function is invoked on this, the library knows that the memory was allocated this way (thanks to meta-data stored in a header). It simply calls munmap on it to release it. That's how the kernel knows; its mmap and munmap API is being used.
You can see these calls if you run strace on the program.
The kernel keeps track of all mmap-ed regions using a red-black tree. Given an arbitrary virtual address, it can quickly determine whether it lands in the mmap area, and which mapping, by performing a tree walk.
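To make the mmap/munmap round trip concrete, here is a minimal sketch of what the allocator effectively does for a large request (a hand-written illustration, not glibc's actual code; the 4 GiB size assumes a 64-bit system):
#include <stdio.h>
#include <sys/mman.h>
int main(void)
{
    size_t size = (size_t)4096 * 1024 * 1024; /* 4 GiB */
    /* Anonymous private mapping: what malloc does for large requests. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... use the memory ... */
    /* What free() does for an mmap-backed chunk: return it wholesale. */
    if (munmap(p, size) == -1) {
        perror("munmap");
        return 1;
    }
    return 0;
}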
Before the call to free, this process has a heap size of at least 4GiB...
The C language does not define either "heap" or "stack". Before the call to free, this process has a chunk of 4 GiB of dynamically allocated memory...
and afterward, does it still have that heap size?
...and after the free(), access to that memory would be undefined behaviour, so for practical purposes, that dynamically allocated memory is no longer "there".
What the library does "under the hood" (e.g. caching, see below) is up to the library, and is subject to change without further notice. This could change with the amount of available physical memory, system load, runtime parameters, ...
How do modern operating systems allow userspace programs to return memory back to kernel space?
It's up to the standard library's implementation to decide (which, of course, has to talk to the operating system to actually, physically allocate / free memory).
Others have pointed out how certain, existing implementations do it. Other libraries, operating systems, and environments exist.
Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available?
Possibly. A common optimization done by library implementations is to "cache" free()d memory, so subsequent malloc() calls can be served without talking to the kernel (which is a costly operation). When, how much, and how long memory is cached this way is, you guessed it, implementation-defined.
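With glibc specifically you can observe and tune this caching; a hedged sketch (mallopt and malloc_trim are glibc extensions, and the threshold value here is arbitrary):
#include <malloc.h>
int main(void)
{
    /* Requests at or above this threshold go straight to mmap and are
     * returned to the kernel immediately on free(); smaller ones are
     * served from, and cached on, the heap. */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    void *p = malloc(64 * 1024); /* below threshold: heap allocation */
    free(p);                     /* likely cached, not returned to the kernel */
    /* Ask glibc to hand cached heap memory back to the kernel now. */
    malloc_trim(0);
    return 0;
}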
And is it possible that my 4 GiB allocation will be non-contiguous?
The process will always "see" contiguous memory. On a system supporting virtual memory (i.e. "modern" desktop OSes like Linux or Windows), the physical memory might be non-contiguous, but the virtual addresses your process sees will be contiguous (or the malloc() would have failed if this requirement could not be serviced).
Again, other systems exist. You might be looking at a system that doesn't virtualize addresses (i.e. gives physical addresses to the process). You might be looking at a system that assigns a given amount of memory to a process on startup, serves any malloc() requests from that, and doesn't support the allocation of additional memory. And so on.
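Back on Linux, you can check the contiguity claim yourself by dumping the process's mappings while the allocation is live (a quick sketch; on a 32-bit system the 4 GiB malloc would simply fail):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(void)
{
    void *p = malloc((size_t)4096 * 1024 * 1024); /* 4 GiB */
    if (p == NULL)
        return 1;
    /* Print this process's virtual memory map; the allocation shows up
     * as a single contiguous anonymous mapping. */
    char cmd[64];
    snprintf(cmd, sizeof cmd, "cat /proc/%d/maps", (int)getpid());
    system(cmd);
    free(p);
    return 0;
}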
If we're using Linux as an example, it uses mmap to allocate large chunks of memory. This means that when you free the memory it gets unmapped, i.e. the kernel is told that it can now unmap this memory. Also read up on the brk and sbrk system calls. A good place to start would be here...
What does brk( ) system call do?
and here. The following post discusses how malloc is implemented which will give you a good idea what's happening under the covers...
How is malloc() implemented internally?
Doug Lea's malloc can be found here. It's well commented and public domain...
ftp://g.oswego.edu/pub/misc/malloc.c
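To watch the program break (manipulated by brk/sbrk) move yourself, a tiny sketch (whether the break actually moves for a given malloc depends on what the allocator already has cached):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main(void)
{
    void *before = sbrk(0); /* current program break */
    void *p = malloc(1024); /* small request: typically served from the brk heap */
    void *after = sbrk(0);
    printf("break before: %p\n", before);
    printf("break after:  %p (malloc returned %p)\n", after, p);
    free(p);
    return 0;
}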
malloc() and free() are not system calls; they are C library functions. The application calls them to allocate and free memory on the heap, and the library in turn asks the kernel for more memory (via brk/sbrk or mmap) only when its own pool runs dry.
A heap manager can also live in the kernel itself, though, as in a hobby OS.
See the heap implementation code below, which works that way:
void *heap_alloc(uint32_t nbytes) {
heap_header *p, *prev_p; // used to keep track of the current unit
unsigned int nunits; // this is the number of "allocation units" needed by nbytes of memory
nunits = (nbytes + sizeof(heap_header) - 1) / sizeof(heap_header) + 1; // see how much we will need to allocate for this call
// check to see if the list has been created yet; start it if not
if ((prev_p = _heap_free) == NULL) {
_heap_base.s.next = _heap_free = prev_p = &_heap_base; // point at the base of the memory
_heap_base.s.alloc_sz = 0; // and set its allocation size to zero
}
// now enter a for loop to find a block of memory
for (p = prev_p->s.next;; prev_p = p, p = p->s.next) {
// did we find a big enough block?
if (p->s.alloc_sz >= nunits) {
// the block is exact length
if (p->s.alloc_sz == nunits)
prev_p->s.next = p->s.next;
// the block needs to be cut
else {
p->s.alloc_sz -= nunits;
p += p->s.alloc_sz;
p->s.alloc_sz = nunits;
}
_heap_free = prev_p;
return (void *)(p + 1);
}
// not enough space!! Try to get more from the kernel
if (p == _heap_free) {
// if the kernel has no more memory, return error!
if ((p = morecore()) == NULL)
return NULL;
}
}
}
This heap_alloc function uses a morecore function, which is implemented as below:
heap_header *morecore() {
char *cp;
heap_header *up;
cp = (char *)pmmngr_alloc_block(); // allocate more memory for the heap
// if cp is null we have no memory left
if (cp == NULL)
return NULL;
//vmmngr_mapPhysicalAddress(cp, (void *)_virt_addr); // and map its virtual address to its physical address
vmmngr_mapPhysicalAddress(vmmngr_get_directory(), _virt_addr, (uint32_t)cp, I86_PTE_PRESENT | I86_PTE_WRITABLE);
_virt_addr += BLOCK_SIZE; // advance the virtual address by BLOCK_SIZE; this will be our next allocation address
up = (heap_header *)cp;
up->s.alloc_sz = BLOCK_SIZE;
heap_free((void *)(up + 1));
return _heap_free;
}
As you can see, this function asks the physical memory manager to allocate a block:
cp = (char *)pmmngr_alloc_block();
and then maps the allocated block into virtual memory:
vmmngr_mapPhysicalAddress(vmmngr_get_directory(), _virt_addr, (uint32_t)cp, I86_PTE_PRESENT | I86_PTE_WRITABLE);
As you can see, in this hobby-OS example the whole story is controlled by a heap manager running at kernel level.
I am trying to implement a user-level thread library in C using getcontext(), swapcontext(), etc.
I have a thread control block that looks like this:
struct tcb {
int thread_id;
int thread_pri;
ucontext_t *thread_context;
struct tcb *next;
};
And I have a function called t_init() that looks like this:
void t_init()
{
tcb *tmp;
tmp = malloc(sizeof(tcb));
getcontext(tmp->thread_context); /* let tmp be the context of main() */
running_head = tmp;
}
I used gdb, and the segmentation fault occurs at runtime in the call to getcontext(tmp->thread_context).
I have read the man pages for getcontext() but am unsure why this call segfaults!
Any suggestions please?
You haven't allocated any space for thread_context; try
int t_init(void)
{
    struct tcb *tmp;
    tmp = malloc(sizeof(struct tcb));
    if (!tmp)
        return -1;
    memset(tmp, 0, sizeof(struct tcb)); /* zero the block itself, not &tmp */
    tmp->thread_context = malloc(sizeof(ucontext_t));
    if (!tmp->thread_context) {
        free(tmp);
        return -1;
    }
    getcontext(tmp->thread_context);
    running_head = tmp;
    return 0;
}
We can find the following information about getcontext/setcontext in The GNU C Library Reference Manual (Chapter 23, "Non-Local Exits", page 622):
While allocating the memory for the stack one has to be careful. Most modern processors keep track of whether a certain memory region is allowed to contain code which is executed or not. Data segments and heap memory is normally not tagged to allow this. The result is that programs would fail. Examples for such code include the calling sequences the GNU C compiler generates for calls to nested functions. Safe ways to allocate stacks correctly include using memory on the original thread's stack or explicitly allocating memory tagged for execution using memory mapped I/O.
This is what's causing the problem: allocate the stack in one of the recommended ways (e.g. memory obtained via memory-mapped I/O; for more information, please refer to the libc manual).
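For illustration, a minimal sketch of giving a context an mmap-allocated stack (my own example, not from the manual; add PROT_EXEC only if you really need an executable stack, e.g. for GCC nested-function trampolines):
#include <stdio.h>
#include <sys/mman.h>
#include <ucontext.h>
#define STACK_SIZE (64 * 1024)
static ucontext_t main_ctx, thread_ctx;
static void thread_func(void)
{
    printf("running on the mmap'd stack\n");
    swapcontext(&thread_ctx, &main_ctx); /* hand control back to main */
}
int main(void)
{
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (stack == MAP_FAILED)
        return 1;
    getcontext(&thread_ctx);
    thread_ctx.uc_stack.ss_sp = stack;
    thread_ctx.uc_stack.ss_size = STACK_SIZE;
    thread_ctx.uc_link = &main_ctx;
    makecontext(&thread_ctx, thread_func, 0);
    swapcontext(&main_ctx, &thread_ctx); /* run thread_func */
    munmap(stack, STACK_SIZE);
    return 0;
}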
I have two questions in one:
(i) Suppose thread X is running on CPU Y. Is it possible to use the syscalls migrate_pages, or even better move_pages (or their libnuma wrappers), to move the pages associated with X to the node that Y is connected to?
This question arises because the first argument of both syscalls is a PID (and I need a per-thread approach for some research I'm doing).
(ii) If the answer to (i) is yes, how can I get all the pages used by some thread? My aim is to move the page(s) that contain array M[], for example: how do I "link" data structures with their memory pages, for the sake of using the syscalls above?
An extra piece of information: I'm using C with pthreads. Thanks in advance!
You want to use the higher level libnuma interfaces instead of the low level system calls.
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the task is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind tasks to specific nodes.
The man pages for the low level numa_* system calls warn you away from using them:
Link with -lnuma to get the system call definitions. libnuma and the required <numaif.h> header are available in the numactl package.
However, applications should not use these system calls directly. Instead, the higher level interface provided by the numa(3) functions in the numactl package is recommended. The numactl package is available at <ftp://oss.sgi.com/www/projects/libnuma/download/>. The package is also included in some Linux distributions. Some distributions include the development library and header in the separate numactl-devel package.
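For example, a minimal sketch using the high-level API (link with -lnuma; the choice of node 0 and the 1 MiB size are arbitrary):
#include <stdio.h>
#include <numa.h>
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    /* Allocate 1 MiB with its pages placed on node 0. */
    size_t size = 1024 * 1024;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL)
        return 1;
    /* Run the calling thread on node 0 so accesses stay local. */
    numa_run_on_node(0);
    /* ... work on buf ... */
    numa_free(buf, size);
    return 0;
}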
Here's the code I use for pinning a thread to a single CPU and moving the stack to the corresponding NUMA node (slightly adapted to remove some constants defined elsewhere). Note that I first create the thread normally, and then call SetAffinityAndRelocateStack() below from within the thread. I think this is much better than trying to create your own stack, since stacks have special support for growing in case the bottom is reached.
The code can also be adapted to operate on the newly created thread from outside, but this could give rise to race conditions (e.g. if the thread performs I/O into its stack), so I wouldn't recommend it.
void* PreFaultStack()
{
const size_t NUM_PAGES_TO_PRE_FAULT = 50;
const size_t size = NUM_PAGES_TO_PRE_FAULT * numa_pagesize();
void *allocaBase = alloca(size);
memset(allocaBase, 0, size);
return allocaBase;
}
void SetAffinityAndRelocateStack(int cpuNum)
{
assert(-1 != cpuNum);
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpuNum, &cpuset);
const int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
assert(0 == rc);
pthread_attr_t attr;
void *stackAddr = nullptr;
size_t stackSize = 0;
if ((0 != pthread_getattr_np(pthread_self(), &attr)) || (0 != pthread_attr_getstack(&attr, &stackAddr, &stackSize))) {
assert(false);
}
const unsigned long nodeMask = 1UL << numa_node_of_cpu(cpuNum);
const auto bindRc = mbind(stackAddr, stackSize, MPOL_BIND, &nodeMask, sizeof(nodeMask) * 8, MPOL_MF_MOVE | MPOL_MF_STRICT); // maxnode counts bits, not bytes
assert(0 == bindRc);
PreFaultStack();
// TODO: Also lock the stack with mlock() to guarantee it stays resident in RAM
return;
}
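Usage would look roughly like this (worker and ThreadArg are hypothetical names for illustration; the relocation call runs first thing inside the new thread, as noted above):
#include <pthread.h>
struct ThreadArg { int cpuNum; };
static void *worker(void *p)
{
    struct ThreadArg *arg = p;
    /* First thing in the new thread: pin it and move its stack. */
    SetAffinityAndRelocateStack(arg->cpuNum);
    /* ... actual work, now with a node-local, pre-faulted stack ... */
    return NULL;
}
/* Create the thread normally, e.g. pthread_create(&tid, NULL, worker, &arg);
 * no custom stack allocation is needed. */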
I have some data stored in a FLASH memory that I need to access with C pointers to be able to make a non-Linux graphics driver work (I think this requirement is DMA related, not sure). Calling read works, but I don't want to have intermediate RAM buffers between the FLASH and the non-Linux driver.
However, just creating a pointer and storing the address that I want on it is making Linux emit an exception about invalid access on me.
int *ptr = (int *)0xdeadbeef;
int a = *ptr; // invalid access!
What am I missing here? And could someone point me to material that makes these concepts clear for me?
I'm reading about mmap but I'm not sure that this is what I need.
The problem you have is that Linux runs your program in a virtual address space. Every address you use directly in the code (like 0xdeadbeef) is a virtual address that gets translated by the memory management unit into a physical address, which is not necessarily the same as your virtual address. This allows easy separation of multiple independent processes, and other features like paging, etc.
The problem now is that, in your case, no physical address is mapped to the virtual address 0xdeadbeef, causing the kernel to abort execution.
The mmap call you already found asks the kernel to assign a specific file (from a specific offset) to a virtual address of your process. Note that the address mmap returns could be completely different, so don't make any assumptions about the virtual address you get.
That's why there are examples with mmap and /dev/mem out there where the offset into the memory device is the physical address. After the kernel has assigned the file at the offset you gave to a virtual address of your process, you can access the memory area as if it were a direct access.
Once you don't need the area anymore, don't forget to munmap it. Otherwise you'll cause something similar to a memory leak.
One problem with the /dev/mem method is that the user running the process needs access to this device. This could introduce a security issue (e.g. Samsung recently introduced such a security hole in their handheld devices).
A more secure way is the one described in an article I found (The Userspace I/O HOWTO), as you still have control over the memory areas accessible to the user's process.
You need to access the memory differently. Basically you need to open /dev/mem and use mmap(), as you suggested. Simple example (note the file-scope variables: the original snippet used mask and mmap_base_address without declaring them, and size must be a power of two at least one page large):
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
static void *mmap_base_address; /* base of the mapped window, set by openMem() */
static unsigned int mask;       /* offset mask within the mapped window */
int openMem(unsigned int address, unsigned int size)
{
    int mmapFD;
    int page_size;
    unsigned int page_start_address;
    /* Offset mask for the mmapped region; size must be a power of two. */
    mask = size - 1;
    /* Get the page size. */
    page_size = (int) sysconf(_SC_PAGE_SIZE);
    /* We have to map shared memory to the beginning of a memory page, so
     * adjust the address accordingly. */
    page_start_address = address - (address % page_size);
    /* Open the file that will be mapped. */
    if((mmapFD = open("/dev/mem", (O_RDWR | O_SYNC))) == -1)
    {
        printf("Opening shared memory device failed\n");
        return -1;
    }
    mmap_base_address = mmap(0, size, (PROT_READ|PROT_WRITE), MAP_SHARED, mmapFD, (off_t)page_start_address & ~(off_t)mask);
    if(mmap_base_address == MAP_FAILED)
    {
        printf("Mapping memory failed\n");
        return -1;
    }
    return 0;
}
unsigned int *getAddress(unsigned int address)
{
    /* Translate a physical address to a pointer inside the mapping. */
    return (unsigned int *)((char *)mmap_base_address + (address & mask));
}
...
result = openMem(address, 0x10000);
if (result < 0)
    return result;
target_address = getAddress(address);
*(unsigned int *)target_address = value;
This writes "value" to "address".
You need to call ioremap - something like:
void *myaddr = ioremap(0xdeadbeef, size);
where size is the size of your memory region. You probably want to use a page-aligned address for the first argument, e.g. 0xdeadb000 - but I expect your actual device isn't at "0xdeadbeef" anyway.
Edit: The call to ioremap must be done from a driver!
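For example, a driver-side sketch along these lines (the base address and size are hypothetical placeholders; on current kernels use the ioread/iowrite accessors rather than dereferencing the pointer directly):
#include <linux/io.h>
#include <linux/errno.h>
#define DEV_PHYS_BASE 0xdeadb000 /* hypothetical, page-aligned */
#define DEV_REGION_SZ 0x1000
static int read_device_word(u32 *out)
{
    void __iomem *base = ioremap(DEV_PHYS_BASE, DEV_REGION_SZ);
    if (!base)
        return -ENOMEM;
    *out = ioread32(base); /* read the first 32-bit register */
    iounmap(base);
    return 0;
}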
For my bachelor thesis I want to visualize the data remanence of memory and how it persists after rebooting a system.
I had the simple idea of mmapping a picture into memory, shutting the computer down, waiting x seconds, booting the computer, and seeing if the picture is still there.
int mmap_lena(void)
{
FILE *fd = NULL;
size_t lena_size;
void *addr = NULL;
fd = fopen("lena.png", "r");
fseek(fd, 0, SEEK_END);
lena_size = ftell(fd);
addr = mmap((void *) 0x12345678, (size_t) lena_size, (int) PROT_READ, (int) MAP_SHARED, (int) fileno(fd), (off_t) 0);
fprintf(stdout, "Addr = %p\n", addr);
munmap((void *) addr, (size_t) lena_size);
fclose(fd);
return EXIT_SUCCESS;
}
I omitted checking return values for clarity's sake.
So after the mmap I tried to somehow get at the address again, but I usually end up with a segmentation fault, as to my understanding the memory is protected by my operating system.
int fetch_lena(void)
{
FILE *fd = NULL;
FILE *fd_out = NULL;
size_t lenna_size;
FILE *addr = (FILE *) 0x12346000;
fd = fopen("lena.png", "r");
fd_out = fopen("lena_out.png", "rw");
fseek(fd, 0, SEEK_END);
lenna_size = ftell(fd);
// Segfault
fwrite((FILE *) addr, (size_t) 1, (size_t) lenna_size, (FILE *) fd_out);
fclose(fd);
fclose(fd_out);
return 0;
}
Please also note that I hard-coded the addresses in this example, so whenever you run mmap_lena, the value I use in fetch_lena could be wrong, as the operating system takes the first parameter to mmap only as a hint (on my system it always defaults to 0x12346000 somehow).
If there is any trivial coding error, I am sorry; my C skills have not fully developed.
I would like to know if there is any way to get at the data I want without implementing any malloc hooks or memory allocator hacks.
Thanks in advance,
David
One issue you have is that you are getting back a virtual address, not the physical address where the memory resides. Next time you boot, the mapping probably won't be the same.
This can definitely be done within a kernel module in Linux, but I don't think there is any sort of API in userspace you can use.
If you have permission (and I assume you could be root on this machine if you are rebooting it), then you can peek at /dev/mem to see the actual physical layout. Maybe you should try sampling values, rebooting, and seeing how many of those values persisted.
There is a similar project where a cold boot attack is demonstrated. The source code is available, maybe you can get some inspiration there.
However, AFAIR they read out the memory without loading an OS first and therefore do not have to mess with the OS's memory protection. Maybe you should try this too, to avoid memory being overwritten or cleared by the OS after boot.
(Also check the video on the site, it's pretty impressive ;)
In the question Direct Memory Access in Linux we worked out most of the fundamentals needed to accomplish this. Note, mmap() is not the answer to this, for exactly the reasons stated by others: you need a real (physical) address, not a virtual one, which you can only get inside the kernel (or by writing a driver to relay one to userspace).
The simplest method would be to write a character device driver that can be read or written to, with an ioctl to give you a valid start or ending address. Again, if you want pointers on the memory management functions to use in the kernel, see the question I've linked to; most of it was worked out in the comments on the first (and accepted) answer.
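A rough skeleton of such a driver using the misc-device framework (all names and the ioctl number are made up for illustration; a real driver needs proper ioctl number allocation, locking, and error handling):
#include <linux/module.h>
#include <linux/miscdevice.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/uaccess.h>
#define PHYSMEM_GET_PHYS 0x1234 /* hypothetical ioctl number */
static void *buf;
static long physmem_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    unsigned long phys;
    if (cmd != PHYSMEM_GET_PHYS)
        return -ENOTTY;
    /* Hand the kernel buffer's physical address to userspace. */
    phys = virt_to_phys(buf);
    return copy_to_user((void __user *)arg, &phys, sizeof(phys)) ? -EFAULT : 0;
}
static const struct file_operations physmem_fops = {
    .owner          = THIS_MODULE,
    .unlocked_ioctl = physmem_ioctl,
};
static struct miscdevice physmem_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "physmem_probe",
    .fops  = &physmem_fops,
};
static int __init physmem_init(void)
{
    int rc;
    buf = kmalloc(PAGE_SIZE, GFP_KERNEL); /* physically contiguous */
    if (!buf)
        return -ENOMEM;
    rc = misc_register(&physmem_dev);
    if (rc)
        kfree(buf);
    return rc;
}
static void __exit physmem_exit(void)
{
    misc_deregister(&physmem_dev);
    kfree(buf);
}
module_init(physmem_init);
module_exit(physmem_exit);
MODULE_LICENSE("GPL");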
Your test code looks odd:
FILE *addr = (FILE *) 0x12346000;
fwrite((FILE *) fd_out, (size_t) 1,
(size_t) lenna_size, (FILE *) addr);
You can't just cast an integer to a FILE pointer and expect to get something sane.
Did you also switch the first and last argument to fwrite ? The last argument is supposed to be the FILE* to write to.
You probably want as little OS as possible for this purpose; the more software you load, the more chances of overwriting something you want to examine.
DOS might be a good bet; it uses < 640k of memory. If you don't load HIMEM and instead write your own (assembly required) routine to jump into protected mode, copy a block of high memory into low memory, then jump back into real mode, you could write a mostly-real-mode program which can dump out the physical RAM (minus however much the BIOS, DOS, and your app use). It could dump it to a flash disk or something.
Of course the real problem may be that the BIOS clears the memory during POST.
I'm not familiar with Linux, but you'll likely need to write a device driver. Device drivers must have some way to convert virtual memory addresses to physical memory addresses for DMA purposes (DMA controllers only deal with physical memory addresses). You should be able to use those interfaces to deal directly with physical memory.
I don't say it's the lowest-effort option, but for completeness' sake:
You can Compile an MMU-less Linux Kernel.
Then you can have a blast, as all addresses are real.
Note 1: You might still get errors if you access a few hardware/BIOS-mapped address spaces.
Note 2: You don't access memory through files in this case; you just assign an address to a pointer and read its contents.
int* pStart = (int*)0x600000;
int* pEnd = (int*)0x800000;
for(int* p = pStart; p < pEnd; ++p)
{
// do what you want with *p
}