I have two questions in one:
(i) Suppose thread X is running on CPU Y. Is it possible to use the syscalls migrate_pages - or, even better, move_pages (or their libnuma wrappers) - to move the pages associated with X to the node that Y is connected to?
This question arises because the first argument of both syscalls is a PID (and I need a per-thread approach for some research I'm doing).
(ii) In the case of a positive answer to (i), how can I get all the pages used by some thread? My aim is to move the page(s) that contain array M[], for example... how do I "link" data structures with their memory pages, for the sake of using the syscalls above?
One extra piece of information: I'm using C with pthreads. Thanks in advance!
You want to use the higher level libnuma interfaces instead of the low level system calls.
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the task is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind tasks to specific nodes.
The man pages for the low level numa_* system calls warn you away from using them:
Link with -lnuma to get the system call definitions. libnuma and the required <numaif.h> header are available in the numactl package.
However, applications should not use these system calls directly. Instead, the higher level interface provided by the numa(3) functions in the numactl package is recommended. The numactl package is available at <ftp://oss.sgi.com/www/projects/libnuma/download/>. The package is also included in some Linux distributions. Some distributions include the development library and header in the separate numactl-devel package.
Here's the code I use for pinning a thread to a single CPU and moving the stack to the corresponding NUMA node (slightly adapted to remove some constants defined elsewhere). Note that I first create the thread normally, and then call SetAffinityAndRelocateStack() below from within the thread. I think this is much better than trying to create your own stack, since stacks have special support for growing in case the bottom is reached.
The code can also be adapted to operate on the newly created thread from the outside, but this could give rise to race conditions (e.g., if the thread performs I/O into its stack), so I wouldn't recommend it.
#include <alloca.h>
#include <assert.h>
#include <cstring>
#include <numa.h>      // numa_pagesize(), numa_node_of_cpu(); link with -lnuma
#include <numaif.h>    // mbind()
#include <pthread.h>

void* PreFaultStack()
{
    const size_t NUM_PAGES_TO_PRE_FAULT = 50;
    const size_t size = NUM_PAGES_TO_PRE_FAULT * numa_pagesize();
    void *allocaBase = alloca(size);
    memset(allocaBase, 0, size);
    return allocaBase;
}

void SetAffinityAndRelocateStack(int cpuNum)
{
    assert(-1 != cpuNum);
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpuNum, &cpuset);
    const int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    assert(0 == rc);

    pthread_attr_t attr;
    void *stackAddr = nullptr;
    size_t stackSize = 0;
    if ((0 != pthread_getattr_np(pthread_self(), &attr)) ||
        (0 != pthread_attr_getstack(&attr, &stackAddr, &stackSize))) {
        assert(false);
    }

    const unsigned long nodeMask = 1UL << numa_node_of_cpu(cpuNum);
    // mbind's maxnode argument counts bits, not bytes
    const auto bindRc = mbind(stackAddr, stackSize, MPOL_BIND, &nodeMask,
                              sizeof(nodeMask) * 8, MPOL_MF_MOVE | MPOL_MF_STRICT);
    assert(0 == bindRc);

    PreFaultStack();
    // TODO: Also lock the stack with mlock() to guarantee it stays resident in RAM
    return;
}
I am now trying to make progress on the MIT 6.828 (2018) course on Operating Systems Engineering, and I like it a lot. It is fun and challenging, and I have learned a lot of basic OS knowledge from it. Now I am struggling with the fine-grained locking challenge: https://pdos.csail.mit.edu/6.828/2018/labs/lab4/
But when I run make run-primes-nox CPUS=4, forking the child fails; I suspect the kernel stack data gets corrupted or replaced during scheduling.
The parent sometimes won't recover from the fork system call.
In the scheduler, before making a round, I acquire a lock (lock_scheduler();) to prevent other CPUs from accessing the process list:
int i = 1, curpos = -1, k = 0;
if (curenv)
    curpos = ENVX(curenv->env_id);
lock_scheduler();
for (; i < NENV; i++)
{
    k = (i + curpos) % NENV; // in a circular way
    if (envs[k].env_status == ENV_RUNNABLE)
    {
        env_run(&envs[k]);
    }
}
if (curenv != NULL && curenv->env_status == ENV_RUNNING)
{
    env_run(curenv);
}
// sched_halt never returns
sched_halt();
During sched_halt, or just before env_run, we release the lock:
if (kernel_lock.locked && kernel_lock.cpu == thiscpu)
    unlock_kernel();
if (scheduler_lock.locked && scheduler_lock.cpu == thiscpu)
    unlock_scheduler();
When trapped into the kernel from an interrupt or a system call (explicitly, with int $0x30), we take the kernel lock, the original big kernel lock (BKL), and before exiting the trap, e.g. via env_run, we release it:
void
trap(struct Trapframe *tf)
{
    // The environment may have set DF and some versions
    // of GCC rely on DF being clear
    asm volatile("cld" ::: "cc");

    // Halt the CPU if some other CPU has called panic()
    extern char *panicstr;
    if (panicstr)
        asm volatile("hlt");

    // Re-acquire the big kernel lock if we were halted in
    // sched_yield()
    xchg(&thiscpu->cpu_status, CPU_STARTED);

    // Check that interrupts are disabled.  If this assertion
    // fails, DO NOT be tempted to fix it by inserting a "cli" in
    // the interrupt path.
    assert(!(read_eflags() & FL_IF));

    // only apply in trap
    lock_kernel();
    ......
Currently:
I keep the kernel_lock when trapped into the kernel from user space
I use the page_lock to protect the page_free_list when allocating or deallocating the memory
I acquire the scheduler_lock when getting into the sched_yield method, unlock it just before running any user process (env_pop_tf)
Sorry if the information is not sufficient; I have uploaded my workspace to GitHub here:
https://github.com/k0Iry/6.828_2018_mit_jos
It contains all my implementation from lab 1 through lab 4. Thanks for reviewing!
To reproduce the issue:
git clone https://github.com/k0Iry/6.828_2018_mit_jos.git && cd 6.828_2018_mit_jos
wget https://raw.githubusercontent.com/k0Iry/xv6-jos-i386-lab/master/labs/0001-trying-with-fine-grained-locks.patch
git apply 0001-trying-with-fine-grained-locks.patch
make run-primes-nox CPUS=4
You should get the error while the processes are forking.
I am working on an embedded device with only 512 MB of RAM, running the Linux kernel. I want to do the memory management of all the processes running in userspace with my own library. Is it possible to do so? From my understanding, memory management is done by the kernel; is it possible to have that functionality in user space?
If your embedded device runs Linux, it has an MMU. Controlling the MMU is normally a privileged operation, so only an operating-system kernel has access to it. Therefore the answer is: no, you can't.
Of course you can write software that runs directly on the device, without an operating system, but I guess that's not what you want. You should probably take a step back and ask yourself what gave you the idea about the memory management, and what a better way to solve that original problem could be.
You can consider using setrlimit. Refer to another Q&A.
I wrote the test code and ran it on my PC. I can see that memory usage is limited; the exact relationship between the units requires further analysis.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(int argc, char *argv[])
{
    long limitSize = 1;
    long testSize = 140000;

    // 1. BEFORE: getrlimit
    {
        struct rlimit asLimit;
        getrlimit(RLIMIT_AS, &asLimit);
        printf("BEFORE: rlimit(RLIMIT_AS) = %ld,%ld\n",
               (long)asLimit.rlim_cur, (long)asLimit.rlim_max);
    }

    // 2. BEFORE: test malloc
    {
        char *xx = malloc(testSize);
        if (xx == NULL)
            perror("malloc FAIL");
        else
            printf("malloc(%ld) OK\n", testSize);
        free(xx);
    }

    // 3. setrlimit
    {
        struct rlimit new;
        new.rlim_cur = limitSize;
        new.rlim_max = limitSize;
        setrlimit(RLIMIT_AS, &new);
    }

    // 4. AFTER: getrlimit
    {
        struct rlimit asLimit;
        getrlimit(RLIMIT_AS, &asLimit);
        printf("AFTER: rlimit(RLIMIT_AS) = %ld,%ld\n",
               (long)asLimit.rlim_cur, (long)asLimit.rlim_max);
    }

    // 5. AFTER: test malloc
    {
        char *xx = malloc(testSize);
        if (xx == NULL)
            perror("malloc FAIL");
        else
            printf("malloc(%ld) OK\n", testSize);
        free(xx);
    }
    return 0;
}
Result:
BEFORE: rlimit(RLIMIT_AS) = -1,-1
malloc(140000) OK
AFTER: rlimit(RLIMIT_AS) = 1,1
malloc FAIL: Cannot allocate memory
From what I understand of your question, you want to somehow use your own library to manage the memory of the userspace processes. I presume you are doing this to make sure that rogue processes don't use too much memory, so that your process can use as much memory as is available. I believe this idea is flawed.
For example, imagine this scenario:
Total memory 512MB
Process 1 limit of 128MB - Uses 64MB
Process 2 limit of 128MB - Uses 64MB
Process 3 limit of 256MB - Uses 256MB then runs out of memory, when in fact 128MB is still available.
I know you THINK this is the answer to your problem, and on 'normal' embedded systems, this would probably work, but you are using a complex kernel, running processes you don't have total control over. You should write YOUR software to be robust when memory gets tight because that is all you can control.
Can anyone help me understand the difference between the below-mentioned APIs in the Linux kernel?
struct workqueue_struct *create_workqueue(const char *name);
struct workqueue_struct *create_singlethread_workqueue(const char *name);
I wrote sample modules, and when I look at the result using ps -aef, both have created a workqueue, but I was not able to see any difference.
I have referred to http://www.makelinux.net/ldd3/chp-7-sect-6, and according to LDD3:
If you use create_workqueue, you get a workqueue that has a dedicated thread for each processor on the system. In many cases, all those threads are simply overkill; if a single worker thread will suffice, create the workqueue with create_singlethread_workqueue instead.
But I was not able to see multiple worker threads (each for a processor).
Workqueues have changed since LDD3 was written.
These two functions are actually macros:
#define create_workqueue(name)						\
	alloc_workqueue("%s", WQ_MEM_RECLAIM, 1, (name))
#define create_singlethread_workqueue(name)				\
	alloc_workqueue("%s", WQ_UNBOUND | WQ_MEM_RECLAIM, 1, (name))
The alloc_workqueue documentation says:
Allocate a workqueue with the specified parameters. For detailed
information on WQ_* flags, please refer to Documentation/workqueue.txt.
That file is too big to quote entirely, but it says:
alloc_workqueue() allocates a wq. The original create_*workqueue()
functions are deprecated and scheduled for removal.
[...]
A wq no longer manages execution resources but serves as a domain for
forward progress guarantee, flush and work item attributes.
if (singlethread) {
    cwq = init_cpu_workqueue(wq, singlethread_cpu);
    err = create_workqueue_thread(cwq, singlethread_cpu);
    start_workqueue_thread(cwq, -1);
} else {
    list_add(&wq->list, &workqueues);
    for_each_possible_cpu(cpu) {
        cwq = init_cpu_workqueue(wq, cpu);
        err = create_workqueue_thread(cwq, cpu);
        start_workqueue_thread(cwq, cpu);
    }
}
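The snippet above shows the older per-CPU-thread implementation that LDD3 describes. Under the modern concurrency-managed workqueue API, both of the macros reduce to alloc_workqueue() flags, so a kernel-side sketch of the current style might look like this (module scaffolding is illustrative; this builds only against kernel headers, not in userspace):

```c
#include <linux/workqueue.h>

static struct workqueue_struct *wq;

static int __init example_init(void)
{
    /* Today's equivalent of create_singlethread_workqueue("mywq"): */
    wq = alloc_workqueue("mywq", WQ_UNBOUND | WQ_MEM_RECLAIM, 1);
    return wq ? 0 : -ENOMEM;
}

static void __exit example_exit(void)
{
    destroy_workqueue(wq);
}
```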
Consider a program which uses a large number of roughly page-sized memory regions (say 64 kB or so), each of which is rather short-lived. (In my particular case, these are alternate stacks for green threads.)
How would one best do to allocate these regions, such that their pages can be returned to the kernel once the region isn't in use anymore? The naïve solution would clearly be to simply mmap each of the regions individually, and munmap them again as soon as I'm done with them. I feel this is a bad idea, though, since there are so many of them. I suspect that the VMM may start scaling badly after a while; but even if it doesn't, I'm still interested in the theoretical case.
If I instead just mmap myself a huge anonymous mapping from which I allocate the regions on demand, is there a way to "punch holes" through that mapping for a region that I'm done with? Kind of like madvise(MADV_DONTNEED), but with the difference that the pages should be considered deleted, so that the kernel doesn't actually need to keep their contents anywhere but can just reuse zeroed pages whenever they are faulted again.
I'm using Linux, and in this case I'm not bothered by using Linux-specific calls.
I did a lot of research into this topic (for a different use) at some point. In my case I needed a large hashmap that was very sparsely populated + the ability to zero it every now and then.
mmap solution:
The easiest solution (which is portable; madvise(MADV_DONTNEED) is Linux-specific) to zero a mapping like this is to mmap a new mapping on top of it. In sketch form (the real mmap() takes six arguments):
void *mapping = mmap(NULL, length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
/* ... use the mapping ... */

/* Zero certain pages by mapping fresh anonymous memory over them. */
mmap((char *)mapping + page_aligned_offset, length, PROT_READ | PROT_WRITE,
     MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
The last call is performance-wise equivalent to a munmap followed by an mmap with MAP_FIXED, but is thread-safe.
Performance-wise, the problem with this solution is that the pages have to be faulted in again on a subsequent write access, which issues an interrupt and a context change. This is only efficient if very few pages were faulted in in the first place.
memset solution:
After getting such poor performance when most of the mapping has to be unmapped, I decided to zero the memory manually with memset. If roughly over 70% of the pages are already faulted in (and if not, they are after the first round of memset), then this is faster than remapping those pages.
mincore solution:
My next idea was to memset only those pages that had been faulted in before. This solution is NOT thread-safe. Calling mincore to determine whether a page is faulted in, and then selectively memsetting those pages to zero, was a significant performance improvement until over 50% of the mapping was faulted in, at which point memsetting the entire mapping became simpler (mincore is a system call and requires a context change).
incore table solution:
My final approach was to keep my own in-core table (one bit per page) that says whether the page has been used since the last wipe. This is by far the most efficient way, since in each round you only zero the pages you actually used. It obviously also is not thread-safe and requires you to track in user space which pages have been written to, but if you need the performance, this is by far the most efficient approach.
I don't see why doing lots of calls to mmap/munmap should be that bad. The lookup performance for mappings in the kernel should be O(log n).
Your only option for punching holes in mappings, as it seems to be implemented in Linux right now, is mprotect(PROT_NONE), and that still fragments the mappings in the kernel, so it's mostly equivalent to mmap/munmap, except that something else won't be able to steal that VM range from you. What you'd really want is madvise(MADV_REMOVE), or as it's called in BSD, madvise(MADV_FREE). That is explicitly designed to do exactly what you want: the cheapest way to reclaim pages without fragmenting the mappings. But at least according to the man page on my two flavors of Linux, it's not fully implemented for all kinds of mappings.
Disclaimer: I'm mostly familiar with the internals of BSD VM systems, but this should be quite similar on Linux.
As in the discussion in comments below, surprisingly enough MADV_DONTNEED seems to do the trick:
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>
#include <err.h>

int
main(int argc, char **argv)
{
    int ps = getpagesize();
    struct rusage ru = {0};
    char *map;
    int n = 15;
    int i;

    if ((map = mmap(NULL, ps * n, PROT_READ|PROT_WRITE,
                    MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)) == MAP_FAILED)
        err(1, "mmap");

    for (i = 0; i < n; i++) {
        map[ps * i] = i + 10;
    }
    printf("unnecessary printf to fault stuff in: %d %ld\n", map[0], ru.ru_minflt);

    /* Unnecessary call to madvise to fault in that part of libc. */
    if (madvise(&map[ps], ps, MADV_NORMAL) == -1)
        err(1, "madvise");
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_NORMAL, before touching pages: %d %ld\n", map[0], ru.ru_minflt);

    for (i = 0; i < n; i++) {
        map[ps * i] = i + 10;
    }
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_NORMAL, after touching pages: %d %ld\n", map[0], ru.ru_minflt);

    if (madvise(map, ps * n, MADV_DONTNEED) == -1)
        err(1, "madvise");
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_DONTNEED, before touching pages: %d %ld\n", map[0], ru.ru_minflt);

    for (i = 0; i < n; i++) {
        map[ps * i] = i + 10;
    }
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        err(1, "getrusage");
    printf("after MADV_DONTNEED, after touching pages: %d %ld\n", map[0], ru.ru_minflt);
    return 0;
}
I'm measuring ru_minflt as a proxy for how many pages we needed to allocate (this isn't exactly accurate, but the surrounding printfs make it more plausible). We can also see that we get fresh zero pages back after MADV_DONTNEED, because map[0] reads 0 in the "before touching pages" printf.
I'm writing a Linux kernel module, and I'd like to allocate an executable page. Plain kmalloc() returns a pointer within a non-executable page, and I get a kernel panic when executing code there. It has to work on Ubuntu Karmic x86, 2.6.31-20-generic-pae.
#include <linux/vmalloc.h>
#include <asm/pgtable_types.h>
...
char *p = __vmalloc(byte_size, GFP_KERNEL, PAGE_KERNEL_EXEC);
...
if (p != NULL) vfree(p);
/**
 * vmalloc_exec - allocate virtually contiguous, executable memory
 * @size:	  allocation size
 *
 * Kernel-internal function to allocate enough pages to cover @size
 * from the page level allocator and map them into contiguous and
 * executable kernel virtual space.
 *
 * For tight control over page level allocator and protection flags
 * use __vmalloc() instead.
 *
 * Return: pointer to the allocated memory or %NULL on error
 */
void *vmalloc_exec(unsigned long size)
{
	return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC,
			NUMA_NO_NODE, __builtin_return_address(0));
}
Linux 5.4 and above no longer make these interfaces available to arbitrary kernel modules, so modules can no longer modify page attributes to turn the executable bit on or off. There might be a specific kernel configuration that still allows it, but I'm not aware of one.
If you are on a kernel version lower than 5.4, you can use set_memory_x(unsigned long addr, int numpages); and friends to change the attributes of a page, or, if you need to pass a struct page, you can use set_pages_x.
Keep in mind that it is considered dangerous and you have to know what you are doing.