While I can write reasonable C code, my expertise is mainly with Java and so I apologize if this question makes no sense.
I am writing some code to help me do heap analysis. I'm doing this via instrumentation with LLVM. What I'm looking for is a way to access the heap metadata for a process from within itself. Is such a thing possible? I know that information about the heap is stored in many malloc_state structs (main_arena for example). If I can gain access to main_arena, I can start enumerating the different arenas, heaps, bins, etc. As I understand, these variables are all defined statically and so they can't be accessed.
But is there some way of getting this information? For example, could I use /proc/$pid/mem to leak the information somehow?
Once I have this information, I want want to basically get information about all the different freelists. So I want, for every bin in each bin type, the number of chunks in the bin and their sizes. For fast, small, and tcache bins I know that I just need the index to figure out the size. I have looked at how these structures are implemented and how to iterate through them. So all I need is to gain access to these internal structures.
I have looked at malloc_info and that is my fallback, but I would also like to get information about tcache and I don't think that is included in malloc_info.
An option I have considered is to build a custom version of glibc has the malloc_struct variables declared non-statically. But from what I can see, it's not very straightforward to build your own custom glibc as you have to build the entire toolchain. I'm using clang so I would have to build LLVM from source against my custom glibc (at least this is what I've understood from researching this approach).
I had a similar requirement recently, so I do think that being able to get to main_arena for a given process does have its value, one example being post-mortem memory usage analysis.
Using dl_iterate_phdr and elf.h, it's relatively straightforward to resolve main_arena based on the local symbol:
#define _GNU_SOURCE
#include <fcntl.h>
#include <link.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
// Ignored:
// - Non-x86_64 architectures
// - Resource and error handling
// - Style
static int cb(struct dl_phdr_info *info, size_t size, void *data)
{
if (strcmp(info->dlpi_name, "/lib64/libc.so.6") == 0) {
int fd = open(info->dlpi_name, O_RDONLY);
struct stat stat;
fstat(fd, &stat);
char *base = mmap(NULL, stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
Elf64_Ehdr *header = (Elf64_Ehdr *)base;
Elf64_Shdr *secs = (Elf64_Shdr*)(base+header->e_shoff);
for (unsigned secinx = 0; secinx < header->e_shnum; secinx++) {
if (secs[secinx].sh_type == SHT_SYMTAB) {
Elf64_Sym *symtab = (Elf64_Sym *)(base+secs[secinx].sh_offset);
char *symnames = (char *)(base + secs[secs[secinx].sh_link].sh_offset);
unsigned symcount = secs[secinx].sh_size/secs[secinx].sh_entsize;
for (unsigned syminx = 0; syminx < symcount; syminx++) {
if (strcmp(symnames+symtab[syminx].st_name, "main_arena") == 0) {
void *mainarena = ((char *)info->dlpi_addr)+symtab[syminx].st_value;
printf("main_arena found: %p\n", mainarena);
raise(SIGTRAP);
return 0;
}
}
}
}
}
return 0;
}
int main()
{
dl_iterate_phdr(cb, NULL);
return 0;
}
dl_iterate_phdr is used to get the base address of the mapped glibc. The mapping does not contain the symbol table needed (.symtab), so the library has to be mapped again. The final address is determined by the base address plus the symbol value.
(gdb) run
Starting program: a.out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff77f0700 (LWP 24834)]
main_arena found: 0x7ffff7baec60
Thread 1 "a.out" received signal SIGTRAP, Trace/breakpoint trap.
raise (sig=5) at ../sysdeps/unix/sysv/linux/raise.c:50
50 return ret;
(gdb) select 1
(gdb) print mainarena
$1 = (void *) 0x7ffff7baec60 <main_arena>
(gdb) print &main_arena
$3 = (struct malloc_state *) 0x7ffff7baec60 <main_arena>
The value matches that of main_arena, so the correct address was found.
There are other ways to get to main_arena without relying on the library itself. Walking the existing heap allows for discovering main_arena, for example, but that strategy is considerably less straightforward.
Of course, once you have main_arena, you need all internal type definitions to be able to inspect the data.
I am writing some code to help me do heap analysis.
What kind of heap analysis?
I want want to basically get information about all the different freelists. So I want, for every bin in each bin type, the number of chunks in the bin and their sizes. For fast, small, and tcache bins I know that I just need the index to figure out the size.
This information only makes sense if you are planning to change the malloc implementation. It does not make sense to attempt to collect it if your goal is to analyze or improve heap usage by the application, so it sounds like you have an XY problem.
In addition, things like bin and tcache only make sense in a context of particular malloc implementation (TCMalloc and jemalloc would not have any bins).
For analysis of application heap usage, you may want to use TCmalloc, as it provides a lot of tools for heap profiling and introspection.
Related
I am working on an embedded device with only 512MB of RAM and the device is running Linux kernel. I want to do the memory management of all the processes running in the userspace by my own library. is it possible to do so. from my understanding, the memory management is done by kernel, Is it possible to have that functionality in User space.
If your embedded device runs Linux, it has an MMU. Controling the MMU is normally a privileged operation, so only an operating system kernel has access to it. Therefore the answer is: No, you can't.
Of course you can write software running directly on the device, without operating system, but I guess that's not what you wanted. You should probably take one step back, ask yourself what gave you the idea about the memory management and what could be a better way to solve this original problem.
You can consider using setrlimit. Refer another Q&A.
I wrote the test code and run it on my PC. I can see that memory usage is limited. The exact relationship of units requires further analysis.
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
int main(int argc, char* argv)
{
long limitSize = 1;
long testSize = 140000;
// 1. BEFORE: getrlimit
{
struct rlimit asLimit;
getrlimit(RLIMIT_AS, &asLimit);
printf("BEFORE: rlimit(RLIMIT_AS) = %ld,%ld\n", asLimit.rlim_cur, asLimit.rlim_max);
}
// 2. BEFORE: test malloc
{
char *xx = malloc(testSize);
if (xx == NULL)
perror("malloc FAIL");
else
printf("malloc(%ld) OK\n", testSize);
free(xx);
}
// 3. setrlimit
{
struct rlimit new;
new.rlim_cur = limitSize;
new.rlim_max = limitSize;
setrlimit(RLIMIT_AS, &new);
}
// 4. AFTER: getrlimit
{
struct rlimit asLimit;
getrlimit(RLIMIT_AS, &asLimit);
printf("AFTER: rlimit(RLIMIT_AS) = %ld,%ld\n", asLimit.rlim_cur, asLimit.rlim_max);
}
// 5. AFTER: test malloc
{
char *xx = malloc(testSize);
if (xx == NULL)
perror("malloc FAIL");
else
printf("malloc(%ld) OK\n", testSize);
free(xx);
}
return 0;
}
Result:
BEFORE: rlimit(RLIMIT_AS) = -1,-1
malloc(140000) OK
AFTER: rlimit(RLIMIT_AS) = 1,1
malloc FAIL: Cannot allocate memory
From what I understand of your question, you want to somehow use your own library for handling memory of kernel processes. I presume you are doing this to make sure that rogue processes don't use too much memory, which allows your process to use as much memory as is available. I believe this idea is flawed.
For example, imagine this scenario:
Total memory 512MB
Process 1 limit of 128MB - Uses 64MB
Process 2 imit of 128MB - Uses 64MB
Process 3 limit of 256MB - Uses 256MB then runs out of memory, when in fact 128MB is still available.
I know you THINK this is the answer to your problem, and on 'normal' embedded systems, this would probably work, but you are using a complex kernel, running processes you don't have total control over. You should write YOUR software to be robust when memory gets tight because that is all you can control.
Consider a program which uses a large number of roughly page-sized memory regions (say 64 kB or so), each of which is rather short-lived. (In my particular case, these are alternate stacks for green threads.)
How would one best do to allocate these regions, such that their pages can be returned to the kernel once the region isn't in use anymore? The naïve solution would clearly be to simply mmap each of the regions individually, and munmap them again as soon as I'm done with them. I feel this is a bad idea, though, since there are so many of them. I suspect that the VMM may start scaling badly after a while; but even if it doesn't, I'm still interested in the theoretical case.
If I instead just mmap myself a huge anonymous mapping from which I allocate the regions on demand, is there a way to "punch holes" through that mapping for a region that I'm done with? Kind of like madvise(MADV_DONTNEED), but with the difference that the pages should be considered deleted, so that the kernel doesn't actually need to keep their contents anywhere but can just reuse zeroed pages whenever they are faulted again.
I'm using Linux, and in this case I'm not bothered by using Linux-specific calls.
I did a lot of research into this topic (for a different use) at some point. In my case I needed a large hashmap that was very sparsely populated + the ability to zero it every now and then.
mmap solution:
The easiest solution (which is portable, madvise(MADV_DONTNEED) is linux specific) to zero a mapping like this is to mmap a new mapping above it.
void * mapping = mmap(MAP_ANONYMOUS);
// use the mapping
// zero certain pages
mmap(mapping + page_aligned_offset, length, MAP_FIXED | MAP_ANONYMOUS);
The last call is performance wise equivalent to subsequent munmap/mmap/MAP_FIXED, but is thread safe.
Performance wise the problem with this solution is that the pages have to be faulted in again on a subsequence write access which issues an interrupt and a context change. This is only efficient if very few pages were faulted in in the first place.
memset solution:
After having such crap performance if most of the mapping has to be unmapped I decided to zero the memory manually with memset. If roughly over 70% of the pages are already faulted in (and if not they are after the first round of memset) then this is faster then remapping those pages.
mincore solution:
My next idea was to actually only memset on those pages that have been faulted in before. This solution is NOT thread-safe. Calling mincore to determine if a page is faulted in and then selectively memset them to zero was a significant performance improvement until over 50% of the mapping was faulted in, at which point memsetting the entire mapping became simpler (mincore is a system call and requires one context change).
incore table solution:
My final approach which I then took was having my own in-core table (one bit per page) that says if it has been used since the last wipe. This is by far the most efficient way since you will only be actually zeroing the pages in each round that you actually used. It obviously also is not thread safe and requires you to track which pages have been written to in user space, but if you need this performance then this is by far the most efficient approach.
I don't see why doing lots of calls to mmap/munmap should be that bad. The lookup performance in the kernel for mappings should be O(log n).
Your only options as it seems to be implemented in Linux right now is to punch holes in the mappings to do what you want is mprotect(PROT_NONE) and that is still fragmenting the mappings in the kernel so it's mostly equivalent to mmap/munmap except that something else won't be able to steal that VM range from you. You'd probably want madvise(MADV_REMOVE) work or as it's called in BSD - madvise(MADV_FREE). That is explicitly designed to do exactly what you want - the cheapest way to reclaim pages without fragmenting the mappings. But at least according to the man page on my two flavors of Linux it's not fully implemented for all kinds of mappings.
Disclaimer: I'm mostly familiar with the internals of BSD VM systems, but this should be quite similar on Linux.
As in the discussion in comments below, surprisingly enough MADV_DONTNEED seems to do the trick:
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>
#include <err.h>
int
main(int argc, char **argv)
{
int ps = getpagesize();
struct rusage ru = {0};
char *map;
int n = 15;
int i;
if ((map = mmap(NULL, ps * n, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)) == MAP_FAILED)
err(1, "mmap");
for (i = 0; i < n; i++) {
map[ps * i] = i + 10;
}
printf("unnecessary printf to fault stuff in: %d %ld\n", map[0], ru.ru_minflt);
/* Unnecessary call to madvise to fault in that part of libc. */
if (madvise(&map[ps], ps, MADV_NORMAL) == -1)
err(1, "madvise");
if (getrusage(RUSAGE_SELF, &ru) == -1)
err(1, "getrusage");
printf("after MADV_NORMAL, before touching pages: %d %ld\n", map[0], ru.ru_minflt);
for (i = 0; i < n; i++) {
map[ps * i] = i + 10;
}
if (getrusage(RUSAGE_SELF, &ru) == -1)
err(1, "getrusage");
printf("after MADV_NORMAL, after touching pages: %d %ld\n", map[0], ru.ru_minflt);
if (madvise(map, ps * n, MADV_DONTNEED) == -1)
err(1, "madvise");
if (getrusage(RUSAGE_SELF, &ru) == -1)
err(1, "getrusage");
printf("after MADV_DONTNEED, before touching pages: %d %ld\n", map[0], ru.ru_minflt);
for (i = 0; i < n; i++) {
map[ps * i] = i + 10;
}
if (getrusage(RUSAGE_SELF, &ru) == -1)
err(1, "getrusage");
printf("after MADV_DONTNEED, after touching pages: %d %ld\n", map[0], ru.ru_minflt);
return 0;
}
I'm measuring ru_minflt as a proxy to see how many pages we needed to allocate (this isn't exactly true, but the next sentence makes it more likely). We can see that we get new pages in the third printf because the contents of map[0] are 0.
I am looking for a way to implement a function that gets an address, and tells the page size used in this address. One solution looks for the address in the segments in /proc//smaps and returns the value of "KernelPageSize:". This solution is very slow because it involves reading a file linearly, a file which might be long. I need a faster and more efficient solution.
Is there a system call for this? (int getpagesizefromaddr(void *addr);)
If not, is there a way to deduce the page size?
Many Linux architectures support "huge pages", see Documentation/vm/hugetlbpage.txt for detailed information. On x86-64, for example, sysconf(_SC_PAGESIZE) reports 4096 as page size, but 2097152-byte huge pages are also available. From the application's perspective, this rarely matters; the kernel is perfectly capable of converting from one page type to another as needed, without the userspace application having to worry about it.
However, for specific workloads the performance benefits are significant. This is why transparent huge page support (see Documentation/vm/transhuge.txt) was developed. This is especially noticeable in virtual environments, i.e. where the workload is running in a guest environment. The new advice flags MADV_HUGEPAGE and MADV_NOHUGEPAGE for madvise() allows an application to tell the kernel about its preferences, so that mmap(...MAP_HUGETLB...) is not the only way to obtain these performance benefits.
I personally assumed Eldad's guestion was related to a workload running in a guest environment, and the point is to observe the page mapping types (normal or huge page) while benchmarking, to find out the most effective configurations for specific workloads.
Let's dispel all misconceptions by showing a real-world example, huge.c:
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#define PAGES 1024
int main(void)
{
FILE *in;
void *ptr;
size_t page;
page = (size_t)sysconf(_SC_PAGESIZE);
ptr = mmap(NULL, PAGES * page, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, (off_t)0);
if (ptr == MAP_FAILED) {
fprintf(stderr, "Cannot map %ld pages (%ld bytes): %s.\n", (long)PAGES, (long)PAGES * page, strerror(errno));
return 1;
}
/* Dump /proc/self/smaps to standard out. */
in = fopen("/proc/self/smaps", "rb");
if (!in) {
fprintf(stderr, "Cannot open /proc/self/smaps: %s.\n", strerror(errno));
return 1;
}
while (1) {
char *line, buffer[1024];
line = fgets(buffer, sizeof buffer, in);
if (!line)
break;
if ((line[0] >= '0' && line[0] <= '9') ||
(line[0] >= 'a' && line[0] <= 'f') ||
(strstr(line, "Page")) ||
(strstr(line, "Size")) ||
(strstr(line, "Huge"))) {
fputs(line, stdout);
continue;
}
}
fclose(in);
return 0;
}
The above allocates 1024 pages using huge pages, if possible. (On x86-64, one huge page is 2 MiB or 512 normal pages, so this should allocate two huge pages' worth, or 4 MiB, of private anonymous memory. Adjust the PAGES constant if you run on a different architecture.)
Make sure huge pages are enabled by verifying /proc/sys/vm/nr_hugepages is greater than zero. On most systems it defaults to zero, so you need to raise it, for example using
sudo sh -c 'echo 10 > /proc/sys/vm/nr_hugepages'
which tells the kernel to keep a pool of 10 huge pages (20 MiB on x86-64) available.
Compile and run the above program,
gcc -W -Wall -O3 huge.c -o huge && ./huge
and you will obtain an abbreviated /proc/PID/smaps output. On my machine, the interesting part contains
2aaaaac00000-2aaaab000000 rw-p 00000000 00:0c 21613022 /anon_hugepage (deleted)
Size: 4096 kB
AnonHugePages: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
which obviously differs from the typical parts, e.g.
01830000-01851000 rw-p 00000000 00:00 0 [heap]
Size: 132 kB
AnonHugePages: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
The exact format of the complete /proc/self/smaps file is described in man 5 proc, and is quite straightforward to parse. Note that this is a pseudofile generated by the kernel, so it is never localized; the whitespace characters are HT (code 9) and SP (code 32), and newline is LF (code 10).
My recommended approach would be to maintain a structure describing the mappings, for example
struct region {
size_t start; /* first in region at (void *)start */
size_t length; /* last in region at (void *)(start + length - 1) */
size_t pagesize; /* KernelPageSize field */
};
struct maps {
size_t length; /* of /proc/self/smaps */
unsigned long hash; /* fast hash, say DJB XOR */
size_t count; /* number of regions */
pthread_rwlock_t lock; /* region array lock */
struct region *region;
};
where the lock member is only needed if it is possible that one thread examines the region array while another thread is updating or replacing it.
The idea is that at desired intervals, the /proc/self/smaps pseudofile is read, and a fast, simple hash (or CRC) is calculated. If the length and the hash match, then assume mappings have not changed, and reuse the existing information. Otherwise, the write lock is taken (remember, the information is already stale), the mapping information parsed, and a new region array is generated.
If multithreaded, the lock member allows multiple concurrent readers, but protects against using a discarded region array.
Note: When calculating the hash, you can also calculate the number of map entries, as property lines all begin with an uppercase ASCII letter (A-Z, codes 65 to 90). In other words, the number of lines that begin with a lowercase hex digit (0-9, codes 48 to 57, or a-f, codes 97 to 102) is the number of memory regions described.
Of the functions provided by the C library, mmap(), munmap(), mremap(), madvise() (and posix_madvise()), mprotect(), malloc(), calloc(), realloc(), free(), brk(), and sbrk() may change the memory mappings (although I'm not certain this list contains them all). These library calls can be interposed, and the memory region list updated after each (successful) call. This should allow an application to rely on the memory region structures for accurate information.
Personally, I would create this facility as a preload library (loaded using LD_PRELOAD). That allows easily interposing the above functions with just a few lines of code: the interposed function calls the original function, and if successful, calls an internal function that reloads the memory region information from /proc/self/smaps. Care should be taken to call the original memory management functions, and to keep errno unchanged; otherwise it should be quite straightforward. I personally would also avoid using library functions (including string.h) to parse the fields, but I am overly careful anyway.
The interposed library would obviously also provide the function to query the page size at a specific address, say pagesizeat(). (If your application exports a weak version that always returns -1 with errno==ENOTSUP, your preload library can override it, and you don't need to worry about whether the preload library is loaded or not -- if not, the function will just return an error.)
Questions?
i have 2 questions in one:
(i) Suppose thread X is running at CPU Y. Is it possible to use the syscalls migrate_pages - or even better move_pages (or their libnuma wrapper) - to move the pages associated with X to the node in which Y is connected?
This question arrises because first argument of both syscalls is PID (and i need a per-thread approach for some researching i'm doing)
(ii) in the case of positive answer for (i), how can i get all the pages used by some thread? My aim is, move the page(s) that contains array M[] for exemple...how to "link" data structures with their memory pages, for the sake of using the syscalls above?
An extra information: i'm using C with pthreads. Thanks in advance !
You want to use the higher level libnuma interfaces instead of the low level system calls.
The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.
Available policies are page interleaving (i.e., allocate in a round-robin fashion from all, or a subset, of the nodes on the system), preferred node allocation (i.e., preferably allocate on a particular node), local allocation (i.e., allocate on the node on which the task is currently executing), or allocation only on specific nodes (i.e., allocate on some subset of the available nodes). It is also possible to bind tasks to specific nodes.
The man pages for the low level numa_* system calls warn you away from using them:
Link with -lnuma to get the system call definitions. libnuma and the required <numaif.h> header are available in the numactl package.
However, applications should not use these system calls directly. Instead, the higher level interface provided by the numa(3) functions in the numactl package is recommended. The numactl package is available at <ftp://oss.sgi.com/www/projects/libnuma/download/>. The package is also included in some Linux distributions. Some distributions include the development library and header in the separate numactl-devel package.
Here's the code I use for pinning a thread to a single CPU and moving the stack to the corresponding NUMA node (slightly adapted to remove some constants defined elsewhere). Note that I first create the thread normally, and then call the SetAffinityAndRelocateStack() below from within the thread. I think this is much better then trying to create your own stack, since stacks have special support for growing in case the bottom is reached.
The code can also be adapted to operate on the newly created thread from outside, but this could give rise to race conditions (e.g. if the thread performs I/O into its stack), so I wouldn't recommend it.
void* PreFaultStack()
{
const size_t NUM_PAGES_TO_PRE_FAULT = 50;
const size_t size = NUM_PAGES_TO_PRE_FAULT * numa_pagesize();
void *allocaBase = alloca(size);
memset(allocaBase, 0, size);
return allocaBase;
}
void SetAffinityAndRelocateStack(int cpuNum)
{
assert(-1 != cpuNum);
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpuNum, &cpuset);
const int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
assert(0 == rc);
pthread_attr_t attr;
void *stackAddr = nullptr;
size_t stackSize = 0;
if ((0 != pthread_getattr_np(pthread_self(), &attr)) || (0 != pthread_attr_getstack(&attr, &stackAddr, &stackSize))) {
assert(false);
}
const unsigned long nodeMask = 1UL << numa_node_of_cpu(cpuNum);
const auto bindRc = mbind(stackAddr, stackSize, MPOL_BIND, &nodeMask, sizeof(nodeMask), MPOL_MF_MOVE | MPOL_MF_STRICT);
assert(0 == bindRc);
PreFaultStack();
// TODO: Also lock the stack with mlock() to guarantee it stays resident in RAM
return;
}
I'm writing a Linux kernel module, and I'd like to allocate an executable page. Plain kmalloc() returns a pointer within a non-executable page, and I get a kernel panic when executing code there. It has to work on Ubuntu Karmic x86, 2.6.31-20-generic-pae.
#include <linux/vmalloc.h>
#include <asm/pgtype_types.h>
...
char *p = __vmalloc(byte_size, GFP_KERNEL, PAGE_KERNEL_EXEC);
...
if (p != NULL) vfree(p);
/**
* vmalloc_exec - allocate virtually contiguous, executable memory
* #size: allocation size
*
* Kernel-internal function to allocate enough pages to cover #size
* the page level allocator and map them into contiguous and
* executable kernel virtual space.
*
* For tight control over page level allocator and protection flags
* use __vmalloc() instead.
*
* Return: pointer to the allocated memory or %NULL on error
*/
void *vmalloc_exec(unsigned long size)
{
return __vmalloc_node(size, 1, GFP_KERNEL, PAGE_KERNEL_EXEC,
NUMA_NO_NODE, __builtin_return_address(0));
}
Linux 5.4 and above no longer make interfaces available so that arbitrary kernel modules won't modify page attributes to turn the executable bit on/off. However, there might be a specific configuration that will allow to do so, but I'm not aware of such.
If someone uses a kernel version that is lower than 5.4, you can use set_memory_x(unsigned long addr, int numpages); and friends to change the attributes of a page, or if you need to pass a struct page, you could use set_pages_x.
Keep in mind that it is considered dangerous and you have to know what you are doing.