I've been trying to piece together how stack memory is handed out to threads, but I haven't managed to work out the whole picture. I tried reading the code, which only left me more confused, so I'm asking for your help.
I asked this question a little while ago, so assume that particular program (and therefore that all threads are within the same process). If I print, for each thread, where its stack begins and how much is allocated for it, I get output like the table at the end of this message. The first column is a time_t in usec, the second doesn't matter, the third is the tid of the thread, the fourth is the guard size; then come the beginning and end of the stack (sorted by beginning of stack); the second-to-last column is the allocated stack size (8 MB by default), and the last column is the difference, in MB, between the end of the first allocated stack and the beginning of the next one.
The way I read the last column (I think): if it's 0, the stacks are contiguous; if it's positive, then, since the stack grows down in memory, there is that many MB of "free space" between one tid's stack and the next one in memory; if it's negative, memory is being reused, which may mean that stack space was freed before this thread was created.
My problem is: what exactly is the algorithm that assigns stack space to threads (at a higher level than the code itself), and why do I sometimes get contiguous stacks, sometimes not, and sometimes values like 7.94140625 and 0.0625 in the last column?
This is all Linux 2.6, C and pthreads.
This may be a question we will have to iterate on to get it right, and for this I apologize, but I'm telling you what I know right now. Feel free to ask for clarifications.
Thanks for this. The table follows.
52815 14 14786 4096 92549120 100941824 8392704 0
52481 14 14784 4096 100941824 109334528 8392704 0
51700 14 14777 4096 109334528 117727232 8392704 0
70747 14 14806 4096 117727232 126119936 8392704 8.00390625
75813 14 14824 4096 117727232 126119936 8392704 0
51464 14 14776 4096 126119936 134512640 8392704 8.00390625
76679 14 14833 4096 126119936 134512640 8392704 -4.51953125
53799 14 14791 4096 139251712 147644416 8392704 -4.90234375
52708 14 14785 4096 152784896 161177600 8392704 0
50912 14 14773 4096 161177600 169570304 8392704 0
51617 14 14775 4096 169570304 177963008 8392704 0
70028 14 14793 4096 177963008 186355712 8392704 0
51048 14 14774 4096 186355712 194748416 8392704 0
50596 14 14771 4096 194748416 203141120 8392704 8.00390625
First, by stracing a simple test program that launches a single thread, we can see the syscalls used to create it:
#include <pthread.h>
#include <stdio.h>
void *test(void *x) { return NULL; }

int main() {
    pthread_t thr;

    printf("start\n");
    pthread_create(&thr, NULL, test, NULL);
    pthread_join(thr, NULL);
    printf("end\n");
    return 0;
}
And the relevant portion of its strace output:
write(1, "start\n", 6start
) = 6
mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xf6e32000
brk(0) = 0x8915000
brk(0x8936000) = 0x8936000
mprotect(0xf6e32000, 4096, PROT_NONE) = 0
clone(child_stack=0xf7632494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xf7632bd8, {entry_number:12, base_addr:0xf7632b70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xf7632bd8) = 9181
futex(0xf7632bd8, FUTEX_WAIT, 9181, NULL) = -1 EAGAIN (Resource temporarily unavailable)
write(1, "end\n", 4end
) = 4
exit_group(0) = ?
We can see that it obtains a stack from mmap with PROT_READ|PROT_WRITE protection and the MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK flags. It then protects the first (i.e., lowest) page of the stack with PROT_NONE, so a stack overflow faults instead of silently trampling other memory. The rest of the calls aren't really relevant to the discussion at hand.
So how does mmap allocate the stack, then? Let's start at mmap_pgoff in the Linux kernel, the entry point for the modern mmap2 syscall. It delegates to do_mmap_pgoff after taking some locks, which in turn calls get_unmapped_area to find an appropriate range of unmapped pages.
Unfortunately, this then calls a function pointer stored in the process's memory descriptor - presumably so that 32-bit and 64-bit processes can have different ideas of which addresses can be mapped. On x86, this pointer is set in arch_pick_mmap_layout, which switches based on whether the process is running as a 32-bit or 64-bit architecture.
So let's look at the implementation of arch_get_unmapped_area then. It first gets some reasonable defaults for its search from find_start_end, then tests to see if the address hint passed in is valid (for thread stacks, no hint is passed). It then starts scanning through the virtual memory map, starting from a cached address, until it finds a hole. It saves the end of the hole for use in the next search, then returns the location of this hole. If it reaches the end of the address space, it starts again from the start, giving it one more chance to find an open area.
So as you can see, it will normally assign stacks at increasing addresses (for x86; x86-64 uses arch_get_unmapped_area_topdown and will likely assign them in decreasing order). However, it also keeps a cache of where to start searching, so it may leave gaps depending on when areas are freed. In particular, freeing an mmapped area can update that free-address search cache, so you may see out-of-order allocations there as well.
That said, this is all an implementation detail. Do not rely on any of this in your program. Just take what addresses mmap hands out and be happy :)
glibc handles this in nptl/allocatestack.c.
The key line is:
mem = mmap (NULL, size, prot,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
So it just asks the kernel for some anonymous memory, not unlike malloc does for large blocks. Which block it actually gets is up to the kernel...
Related
I'm currently working with FreeRTOS on an STM32F4, in a project created with CubeMX using the configuration below.
It seems the RTOS has around 25k bytes for me to allocate as thread stacks. But somehow, when I create a thread with stack size 1000, only 20888 bytes are left in the total RTOS heap. If I allocate 2000, 16888 are left. It seems like it always consumes 4 times the stack size I allocate. I'm really confused about what is happening.
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 1000);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
osThreadId osThreadCreate (const osThreadDef_t *thread_def, void *argument)
{
    TaskHandle_t handle;

    if (xTaskCreate((TaskFunction_t)thread_def->pthread, (const portCHAR *)thread_def->name,
                    thread_def->stacksize, argument, makeFreeRtosPriority(thread_def->tpriority),
                    &handle) != pdPASS) {
        return NULL;
    }
    return handle;
}
Looking at the CMSIS manual:
Configuration of Thread count and Stack Space
osThreadDef defines a thread function.
The parameter stacksz specifies thereby the stack requirements of this thread function. CMSIS-RTOS RTX defines two methods for defining the stack requirements:
when stacksz is 0, a fixed-size memory pool is used for the thread stack. In this case OS_STKSIZE specifies the stack size for the thread function.
when stacksz is not 0, the thread stack is allocated from a user space. The size of this user space is specified with OS_PRIVSTKSIZE.
(Emphasis mine.)
where the documentation for OS_PRIVSTKSIZE says it
Is the combined stack requirement (in words) of all threads that are defined with osThreadDef stacksz != 0 (excluding main).
(Emphasis mine.)
The FreeRTOS API is online, and the description of the usStackDepth parameter to the xTaskCreate() function clearly states the stack size is given in words, not bytes. FreeRTOS runs on 8-, 16-, 32- and 64-bit processors, so the word size depends on the architecture - in your case it is 4 bytes, which matches your observation. http://www.freertos.org/a00125.html
Which of the three values, vsize, size and rss, from ps output is suitable for quick memory-leak detection? For my purposes, if a process has been running for a few days and its memory has kept increasing, that is a good enough indicator that it is leaking memory. I understand that a tool like valgrind should ultimately be used, but its use is intrusive and so not always desirable.
To check my understanding, I wrote a simple piece of C code that basically allocates 1 MiB of memory, frees it, and then allocates 1 MiB again. It also sleeps for 10 seconds before every step, giving me time to see the output from ps -p <pid> -ovsize=,size=,rss=. Here it is:
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>
#define info(args...) printf(args)
char* bytes(char* str, uint32_t size, uint32_t n)
{
    char* unit = "B";

    if (n > 1000) {
        n /= 1000;
        unit = "KB";
    }
    if (n > 1000) {
        n /= 1000;
        unit = "MB";
    }
    snprintf(str, size, "%u %s", n, unit);
    return(str);
}
void* xmalloc(size_t size)
{
    char msg[64];
    size_t max = sizeof(msg);
    void *p = NULL;

    info("Allocating %s\n", bytes(msg, max, size));
    p = malloc(size);
    if (p != NULL)
        memset(p, '1', size);
    return(p);
}
void* xfree(void* p, size_t size)
{
    char msg[64];
    size_t max = sizeof(msg);

    info("Freeing %s\n", bytes(msg, max, size));
    free(p);
    return(NULL);
}

void nap(void)
{
    const int dur = 10;

    info("Sleeping for %d seconds\n", dur);
    sleep(dur);
}
int main(void)
{
    int err = 0;
    size_t kb = 1024;
    size_t block = 1024 * kb;
    char* p = NULL;

    nap();
    p = xmalloc(block);
    nap();
    p = xfree(p, block);
    nap();
    p = xmalloc(block);
    nap();
    return(err);
}
Now, ps was run every two seconds from a shell script that also printed measurement timestamps. Its output was:
# time vsize size rss
1429207116 3940 188 312
1429207118 3940 188 312
1429207120 3940 188 312
1429207122 3940 188 312
1429207124 3940 188 312
1429207126 4968 1216 1364
1429207128 4968 1216 1364
1429207130 4968 1216 1364
1429207132 4968 1216 1364
1429207135 4968 1216 1364
1429207137 3940 188 488
1429207139 3940 188 488
1429207141 3940 188 488
1429207143 3940 188 488
1429207145 5096 1344 1276
1429207147 5096 1344 1276
1429207149 5096 1344 1276
1429207151 5096 1344 1276
1429207153 5096 1344 1276
From the values above, and keeping in mind the descriptions in the ps(1) man page, it seems to me that the best measure is vsize. Is this understanding correct? Note that the man page says size is a measure of the total amount of dirty pages, and rss the amount of pages in physical memory. Both can be much lower than the total memory used by the process.
These experiments were tried on Debian 7.8 running GNU/Linux 3.2.0-4-amd64.
Generally speaking the total virtual size (vsize) of your process is the main measure of process size. rss is just the portion that happens to be using real memory at the moment. size is a measure of how many pages have actually been modified.
A constantly increasing vsize, with relatively stable or cyclic size and rss values might suggest heap fragmentation or a poor heap allocator algorithm.
A constantly increasing vsize and size, with a relatively stable rss might suggest a memory leak, heap fragmentation, or a poor heap allocator algorithm.
You will have to understand something of how a given program uses memory resources in order to use just these external measures of process resource usage to estimate whether it suffers from a memory leak or not.
Part of that involves knowing a little bit about how the heap is managed by the C library malloc() and free() routines, including what additional memory it might require internally to manage the list of active allocations, how it deals with fragmentation of the heap, and how it might release unused parts of the heap back to the operating system.
For example your test shows that both the total virtual size of the process, and the number of "dirty" pages it required, grew slightly larger the second time the program allocated the same amount of memory again. This probably shows some of the overhead of malloc(), i.e. the amount of memory its own internal data structures required up to that point. It would have been interesting to see what happened if the program had done another free() and sleep() before exiting. It might also be instructive to modify your code so that it calls sleep() between calling malloc() and memset(), and then observe the results from ps.
So, a simple program which should only require a fixed amount of memory to run, or which allocates memory to do a specific unit of work and then should free all of that memory once that unit of work is completed, should show a relatively stable vsize, assuming it doesn't ever process more than one unit of work at a time and have a "bad" pattern of allocation that would lead to heap fragmentation.
As you noted, a tool like valgrind, along with intimate knowledge of the program's internal implementation, is necessary to show actual memory leaks and prove they are solely the responsibility of the program.
(BTW, you might want to simplify your code somewhat -- don't use unnecessary macros like info() in particular, and for this type of example trying to be fancy with printing values in larger units, using extra variables to do size calculations, etc., is also more of an obfuscation than a help. Too many printfs also obfuscate the code -- use only those you need to see what step the program is at and to see values that are not known at compile time.)
I'm working on a driver that uses a buffer backed by hugepages, and I'm running into problems with the sequentiality of the hugepages.
In userspace, the program allocates a big buffer backed by hugepages using the mmap syscall. The buffer is then communicated to the driver through an ioctl call. The driver uses the get_user_pages function to get the memory address of that buffer.
This works perfectly with a buffer size of 1 GB (1 hugepage). get_user_pages returns a lot of pages (HUGE_PAGE_SIZE / PAGE_SIZE) but they're all contiguous, so there's no problem. I just grab the address of the first page with page_address and work with that. The driver can also map that buffer back to userspace with remap_pfn_range when another program does an mmap call on the char device.
However, things get complicated when the buffer is backed by more than one hugepage. It seems that the kernel can return a buffer backed by non-sequential hugepages. I.e., if the hugepage pool's layout is something like this
+------+------+------+------+
| HP 1 | HP 2 | HP 3 | HP 4 |
+------+------+------+------+
, a request for a hugepage-backed buffer could be fulfilled by reserving HP1 and HP4, or maybe HP3 and then HP2. That means that, when I get the pages with get_user_pages in the last case, the address of page 0 is actually 1 GB above the address of page 262144 (the head of the next hugepage).
Is there any way to sequentialize access to those pages? I tried reordering the addresses to find the lowest one so I could use the whole buffer (e.g., if the kernel gives me a buffer backed by HP3 then HP2, I use HP2's address as the base), but it seems that would scramble the data as seen from userspace (offset 0 in that reordered buffer might be offset 1 GB in the userspace buffer).
TL;DR: Given >1 unordered hugepages, is there any way to access them sequentially in a Linux kernel driver?
By the way, I'm working on a Linux machine with 3.8.0-29-generic kernel.
Using the function suggested by CL, vm_map_ram, I was able to remap the memory so it can be accessed sequentially, regardless of the number of hugepages mapped. I leave the code here (error handling not included) in case it helps anyone.
struct page** pages;
int retval;
int nid;
unsigned long npages;
unsigned long buffer_start = (unsigned long) huge->addr; // Address from user-space map.
void* remapped;

npages = 1 + ((bufsize - 1) / PAGE_SIZE);
pages = vmalloc(npages * sizeof(struct page *));

down_read(&current->mm->mmap_sem);
retval = get_user_pages(current, current->mm, buffer_start, npages,
                        1 /* Write enable */, 0 /* Force */, pages, NULL);
up_read(&current->mm->mmap_sem);

nid = page_to_nid(pages[0]); // Remap on the same NUMA node.
remapped = vm_map_ram(pages, npages, nid, PAGE_KERNEL);

// Do work on remapped.
I'm trying to optimize my dynamic memory usage. The thing is that I initially allocate some amount of memory for the data I get from a socket. Then, on the new data arrival I'm reallocating memory so the newly arrived part will fit into the local buffer. After some poking around I've found that malloc actually allocates a greater block than requested. In some cases significantly greater; here comes some debug info from malloc_usable_size(ptr):
requested 284 bytes, allocated 320 bytes
requested 644 bytes, reallocated 1024 bytes
It's well known that malloc/realloc are expensive operations. In most cases the newly arrived data will fit into the previously allocated block (at least when I requested 644 bytes and got 1024 instead), but I have no idea how I can figure that out.
The trouble is that malloc_usable_size should not be relied upon (as the manual describes), and if the program requested 644 bytes while malloc allocated 1024, the excess 380 bytes cannot be used safely. So calling malloc for a given amount of data and then using malloc_usable_size to figure out how many bytes were really allocated isn't the way to go.
What I want is to know the block grid before calling malloc, so I can request exactly the rounded-up number of bytes above what I need, store the allocated size, and on realloc check whether I actually need to reallocate, or whether the previously allocated block is fine simply because it's bigger.
In other words, if I were to request 644 bytes, and malloc actually gave me 1024, I want to have predicted that and requested 1024 instead.
Depending on your particular libc implementation you will get different behaviour. I have found two approaches that do the trick in most cases:
Use the stack. This is not always feasible, but C allows VLAs on the stack, and this is the most efficient option if you don't intend to pass your buffer to another thread:
while (1) {
    char buffer[known_buffer_size];
    read(fd, buffer, known_buffer_size);
    // use buffer
    // released at the end of scope
}
On Linux you can make excellent use of mremap, which can enlarge or shrink memory with a zero-copy guarantee. It may move your VM mapping, though. The only problem is that it works only in multiples of the system page size, sysconf(_SC_PAGESIZE), which is usually 0x1000 (4096 bytes).
void * buffer = mmap(NULL, init_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

while (1) {
    // if needs remapping
    {
        // zero copy, but involves a system call
        buffer = mremap(buffer, current_size, new_size, MREMAP_MAYMOVE);
    }
    // use buffer
}
munmap(buffer, current_size);
OS X has semantics similar to Linux's mremap through the Mach vm_remap call, though it's a little more complicated.
void * buffer = mmap(NULL, init_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
mach_port_t this_task = mach_task_self();

while (1) {
    // if needs remapping
    {
        // zero copy, but involves a system call
        void * new_address = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        vm_prot_t cur_prot, max_prot;
        munmap(new_address, current_size); // vm needs to be empty for remap
        // there is a race condition between these two calls
        vm_remap(this_task,
                 &new_address,   // new address
                 current_size,   // has to be page-aligned
                 0,              // auto alignment
                 0,              // remap fixed
                 this_task,      // same task
                 buffer,         // source address
                 0,              // MAP READ-WRITE, NOT COPY
                 &cur_prot,      // unused protection struct
                 &max_prot,      // unused protection struct
                 VM_INHERIT_DEFAULT);
        munmap(buffer, current_size); // remove old mapping
        buffer = new_address;
    }
    // use buffer
}
The short answer is that the standard malloc interface does not provide the information you are looking for; using such information would break the abstraction it provides.
Some alternatives are:
Rethink your usage model. Perhaps pre-allocate a pool of buffers at start, filling them as you go. Unfortunately this could complicate your program more than you would like.
Use a different memory allocation library that does provide the needed interface. Different libraries provide different tradeoffs in terms of fragmentation, max run time, average run time, etc.
Use your OS memory allocation API. These are often written to be efficient, but will generally require a system call (unlike a user-space library).
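As a concrete version of the first alternative: stop guessing malloc's size classes and track capacity yourself, doubling on demand so realloc runs only O(log n) times for n appended bytes. A minimal sketch; struct buf and buf_append are illustrative names, not a library API:

```c
#include <stdlib.h>
#include <string.h>

/* A buffer that remembers its own capacity and grows geometrically,
   sidestepping any guessing about malloc's internal block grid. */
struct buf {
    char  *data;
    size_t len; /* bytes in use */
    size_t cap; /* bytes allocated */
};

static int buf_append(struct buf *b, const void *src, size_t n)
{
    if (b->len + n > b->cap) {
        size_t cap = b->cap ? b->cap : 64; /* arbitrary starting capacity */
        char *p;

        while (cap < b->len + n)
            cap *= 2;                      /* double until the data fits */
        p = realloc(b->data, cap);
        if (!p)
            return -1;                     /* old block is still valid */
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

With this, the "do I need to realloc?" question is answered by the stored cap field rather than by malloc_usable_size, which keeps the code portable across allocators.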
In my professional code, I often take advantage of the actual size allocated by malloc() [etc.], rather than the requested size. This is my function for determining the actual allocation size:
int MM_MEM_Stat(
    void   *I__ptr_A,
    size_t *_O_allocationSize
    )
{
    int rCode = GAPI_SUCCESS;
    size_t size = 0;

    /*-----------------------------------------------------------------
    ** Validate caller arg(s).
    */
#ifdef __linux__   // Not required for __APPLE__, as malloc_size() will
                   // return 0 for non-malloc'ed refs.
    if(NULL == I__ptr_A)
    {
        rCode=EINVAL;
        goto CLEANUP;
    }
#endif

    /*-----------------------------------------------------------------
    ** Calculate the size.
    */
#if defined(__APPLE__)
    size=malloc_size(I__ptr_A);
#elif defined(__linux__)
    size=malloc_usable_size(I__ptr_A);
#else
#error "Unsupported platform"
#endif

    if(0 == size)
    {
        rCode=EFAULT;
        goto CLEANUP;
    }

    /*-----------------------------------------------------------------
    ** Return requested values to caller.
    */
    if(_O_allocationSize)
        *_O_allocationSize = size;

CLEANUP:

    return(rCode);
}
I did some more research and found two interesting things about the malloc implementations on Linux and FreeBSD:
1) on Linux, malloc grows its blocks linearly in 16-byte steps, at least up to 8K, so no optimization is needed at all; it just isn't worth it;
2) on FreeBSD the situation is different: the steps are bigger and tend to grow with the requested block size.
So any kind of optimization is only needed for FreeBSD, since Linux allocates blocks in very small steps and it's very unlikely to receive fewer than 16 bytes of data from a socket.
I am looking for a way to implement a function that, given an address, tells the page size used at that address. One solution is to look up the address among the segments in /proc/<pid>/smaps and return the value of the "KernelPageSize:" field. This solution is very slow because it involves linearly reading a file that might be long. I need a faster and more efficient solution.
Is there a system call for this? (int getpagesizefromaddr(void *addr);)
If not, is there a way to deduce the page size?
Many Linux architectures support "huge pages", see Documentation/vm/hugetlbpage.txt for detailed information. On x86-64, for example, sysconf(_SC_PAGESIZE) reports 4096 as page size, but 2097152-byte huge pages are also available. From the application's perspective, this rarely matters; the kernel is perfectly capable of converting from one page type to another as needed, without the userspace application having to worry about it.
However, for specific workloads the performance benefits are significant. This is why transparent huge page support (see Documentation/vm/transhuge.txt) was developed. This is especially noticeable in virtual environments, i.e. where the workload is running in a guest environment. The new advice flags MADV_HUGEPAGE and MADV_NOHUGEPAGE for madvise() allows an application to tell the kernel about its preferences, so that mmap(...MAP_HUGETLB...) is not the only way to obtain these performance benefits.
I personally assumed Eldad's question was related to a workload running in a guest environment, where the point is to observe the page mapping types (normal or huge page) while benchmarking, to find the most effective configuration for specific workloads.
Let's dispel all misconceptions by showing a real-world example, huge.c:
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#define PAGES 1024
int main(void)
{
    FILE *in;
    void *ptr;
    size_t page;

    page = (size_t)sysconf(_SC_PAGESIZE);
    ptr = mmap(NULL, PAGES * page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, (off_t)0);
    if (ptr == MAP_FAILED) {
        fprintf(stderr, "Cannot map %ld pages (%ld bytes): %s.\n",
                (long)PAGES, (long)PAGES * page, strerror(errno));
        return 1;
    }

    /* Dump /proc/self/smaps to standard out. */
    in = fopen("/proc/self/smaps", "rb");
    if (!in) {
        fprintf(stderr, "Cannot open /proc/self/smaps: %s.\n", strerror(errno));
        return 1;
    }

    while (1) {
        char *line, buffer[1024];

        line = fgets(buffer, sizeof buffer, in);
        if (!line)
            break;

        if ((line[0] >= '0' && line[0] <= '9') ||
            (line[0] >= 'a' && line[0] <= 'f') ||
            (strstr(line, "Page")) ||
            (strstr(line, "Size")) ||
            (strstr(line, "Huge"))) {
            fputs(line, stdout);
            continue;
        }
    }

    fclose(in);
    return 0;
}
The above allocates 1024 pages using huge pages, if possible. (On x86-64, one huge page is 2 MiB or 512 normal pages, so this should allocate two huge pages' worth, or 4 MiB, of private anonymous memory. Adjust the PAGES constant if you run on a different architecture.)
Make sure huge pages are enabled by verifying /proc/sys/vm/nr_hugepages is greater than zero. On most systems it defaults to zero, so you need to raise it, for example using
sudo sh -c 'echo 10 > /proc/sys/vm/nr_hugepages'
which tells the kernel to keep a pool of 10 huge pages (20 MiB on x86-64) available.
Compile and run the above program,
gcc -W -Wall -O3 huge.c -o huge && ./huge
and you will obtain an abbreviated /proc/PID/smaps output. On my machine, the interesting part contains
2aaaaac00000-2aaaab000000 rw-p 00000000 00:0c 21613022 /anon_hugepage (deleted)
Size: 4096 kB
AnonHugePages: 0 kB
KernelPageSize: 2048 kB
MMUPageSize: 2048 kB
which obviously differs from the typical parts, e.g.
01830000-01851000 rw-p 00000000 00:00 0 [heap]
Size: 132 kB
AnonHugePages: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
The exact format of the complete /proc/self/smaps file is described in man 5 proc, and is quite straightforward to parse. Note that this is a pseudofile generated by the kernel, so it is never localized; the whitespace characters are HT (code 9) and SP (code 32), and newline is LF (code 10).
My recommended approach would be to maintain a structure describing the mappings, for example
struct region {
    size_t start;    /* first in region at (void *)start */
    size_t length;   /* last in region at (void *)(start + length - 1) */
    size_t pagesize; /* KernelPageSize field */
};

struct maps {
    size_t           length; /* of /proc/self/smaps */
    unsigned long    hash;   /* fast hash, say DJB XOR */
    size_t           count;  /* number of regions */
    pthread_rwlock_t lock;   /* region array lock */
    struct region   *region;
};
where the lock member is only needed if it is possible that one thread examines the region array while another thread is updating or replacing it.
The idea is that at desired intervals, the /proc/self/smaps pseudofile is read, and a fast, simple hash (or CRC) is calculated. If the length and the hash match, then assume mappings have not changed, and reuse the existing information. Otherwise, the write lock is taken (remember, the information is already stale), the mapping information parsed, and a new region array is generated.
If multithreaded, the lock member allows multiple concurrent readers, but protects against using a discarded region array.
Note: When calculating the hash, you can also calculate the number of map entries, as property lines all begin with an uppercase ASCII letter (A-Z, codes 65 to 90). In other words, the number of lines that begin with a lowercase hex digit (0-9, codes 48 to 57, or a-f, codes 97 to 102) is the number of memory regions described.
Of the functions provided by the C library, mmap(), munmap(), mremap(), madvise() (and posix_madvise()), mprotect(), malloc(), calloc(), realloc(), free(), brk(), and sbrk() may change the memory mappings (although I'm not certain this list contains them all). These library calls can be interposed, and the memory region list updated after each (successful) call. This should allow an application to rely on the memory region structures for accurate information.
Personally, I would create this facility as a preload library (loaded using LD_PRELOAD). That allows easily interposing the above functions with just a few lines of code: the interposed function calls the original function, and if successful, calls an internal function that reloads the memory region information from /proc/self/smaps. Care should be taken to call the original memory management functions, and to keep errno unchanged; otherwise it should be quite straightforward. I personally would also avoid using library functions (including string.h) to parse the fields, but I am overly careful anyway.
The interposed library would obviously also provide the function to query the page size at a specific address, say pagesizeat(). (If your application exports a weak version that always returns -1 with errno==ENOTSUP, your preload library can override it, and you don't need to worry about whether the preload library is loaded or not -- if not, the function will just return an error.)
Questions?