Can static memory be lazily allocated? - c

Having a static array in a C program:
#define MAXN (1<<13)
void f() {
static int X[MAXN];
//...
}
Can the Linux kernel choose to not map the addresses to physical memory until the each page is actually used? How can X be full of 0s then, is the memory zeroed when each page is accessed? How does that not impact the performance of the program?

Can the Linux kernel choose to not map the addresses to physical memory until the each page is actually used?
Yes, it does this for all memory (except special memory used by drivers and the kernel itself).
How can X be full of 0s then, is the memory zeroed when each page is accessed?
You're supposed to ignore this detail. As long as the memory is full of zeroes when you access it, we say it's full of zeroes.
How does that not impact the performance of the program?
It does.

Can the Linux kernel choose to not map the addresses to physical memory until the each page is actually used?
Yes, with userspace memory it is always done.
How can X be full of 0s then, is the memory zeroed when each page is accessed?
The kernel maintains a page full of 0s, when the user asks for a new page of the static array (static thus full of 0s before first use), the kernel provides the zeroed page, without permissions for the program to write. Writing to the array causes the copy-on-write mechanism to trigger: a page fault occurs, the kernel then allocates a writable page, maps it and resumes the program from the last instruction (the one that couldn't complete because of the page fault). Note that prezeroing optimizations change the implementation details here, but the theory's the same.
How does that not impact the performance of the program?
The program doesn't have to zero a (potentially) lot of pages on start, and the kernel doesn't actually have to have the memory (one can ask for more memory than the system's got, as long as you don't use it). Page faults will be generated during the execution of the program, but they can be minimized, see mmap() and madvise() with MADV_SEQUENTIAL. Remember that the Translation Lookaside Buffer is not infinite, there are so many entries it can maintain.
Sources: A linux memory FAQ, Introduction to Memory Management in Linux by Alan Ott

Related

heap overflow affecting other programs

I was trying to create the condition for malloc to return a NULL pointer. In the below program, though I can see malloc returning NULL, once the program is forcebly terminated, I see that all other programs are becoming slow and finally I had to reboot the system. So my question is whether the memory for heap is shared with other programs? If not, other programs should not have affected. Is OS is not allocating certain amount of memory at the time of execution? I am using windows 10, Mingw.
#include <stdio.h>
#include <malloc.h>
void mallocInFunction(void)
{
int *ptr=malloc(500);
if(ptr==NULL)
{
printf("Memory Could not be allocated\n");
}
else
{
printf("Allocated memory successfully\n");
}
}
int main (void)
{
while(1)
{
mallocInFunction();
}
return(0);
}
So my question is whether the memory for heap is shared with other programs?
Physical memory (RAM) is a resource that is shared by all processes. The operating system makes decisions about how much RAM to allocate to each process and adjusts that over time.
If not, other programs should not have affected. Is OS is not allocating certain amount of memory at the time of execution?
At the time the program starts executing, the operating system has no idea how much memory the program will want or need. Instead, it deals with allocations as they happen. Unless configured otherwise, it will typically do everything it possibly can to allow the program's allocation to succeed because presumably there's a reason the program is doing what it's doing and the operating system won't try to second guess it.
... whether the memory for heap is shared with other programs?
Well, the C standard doesn't exactly require a heap, but in the context of a task-switching, multi-user and multi-threaded OS, of course memory is shared between processes! The C standard doesn't require any of this, but this is all pretty common stuff:
CPU cache memory tends to be preferred for code that's executed often, though this might get swapped around quite a bit; that may or may not be swapped to a heap.
Task switching causes registers to be swapped to other forms of memory; that may or may not be swapped to a heap.
Entire pages are swapped to and from disk, so that other programs can make use of them when your OS switches execution away from your program and to the other programs, and when it's your programs turn to execute again among other reasons. This may or may not involve manipulating the heap.
FWIW, you're referring to memory that has allocated storage duration. It's best to avoid using terms like heap and stack, as they're virtually meaningless. The memory you're referring to is on a silicon chip, regardless of whether it uses a heap or a stack.
... Is OS is not allocating certain amount of memory at the time of execution?
Speaking of silicon chips and execution, your OS likely only has control of one processor (a silicon chip which contains some logic circuits and memory, among other things I'm sure) with which to execute many programs! To summarise this post, yes, your program is most likely sharing those silicon chips with other programs!
On a tangential note, I don't think heap overflow means what you think it means.
Your question cannot be answered in the context of C, the language. For C, there's no such thing as a heap, a process, ...
But it can be answered in the context of operating systems. Even a bit generically because many modern multitasking OSes do similar things.
Given a modern multitasking OS, it will use virtual address spaces for each process. The OS manages a fixed size of physical RAM and divides this into pages, when a process needs memory, such pages are mapped into the process' virtual address space (typically using a different virtual address than the physical one). So when all memory pages are claimed by the OS itself and by the processes running, the OS will typically save some of these pages that are not in active use to disk, in a swap area, in order to serve this page as a fresh page to the next process requesting one. But when the original page is touched (and this is typically the case with free(), see below), it must first be loaded from disk again, but to have a free page for this, another page must be saved to swap space.
This is, like all disk I/O, slow, and it's probably what you see happening here.
Now to fully understand this: what does malloc() do? It typically requests from the operating system to have the memory of the own process increased (and if necessary, the OS does this by mapping another page), and it uses this new memory by writing some information there about the block of memory requested (so free() can work correctly later) and ultimately returns a pointer to a block that's free to use for the program. free() uses the information written by malloc(), modifies it to indicate this block is free again, and it typically can't give any memory back to the OS because there are other malloc()d blocks in the same page. It will give memory back when possible, but that's the exception in a typical scenario where dynamic allocations are heavily used.
So, the answer to your question is: Yes, the RAM is shared because there is only one set of physical RAM. The OS does the best it can to hide that fact and virtualize RAM, but if a process consumes all that is there, this will have visible effects.
malloc() is not system call but libc library function. So when a program ask for allocating memory via malloc(), system call brk()/sbrk() OR mmap() to allocated page(s), more details here.
Please keep in mind that the memory you get is all virtual in nature, that means if you have 3GB of physical RAM you can actually allocate almost infinite memory. So how does this happens? This happens via concept called 'paging', where system stores and retrieves data from secondary memory storage(HDD/SDD) to main memory(RAM), more details here.
So with this theory, out of memory usually quite rare but program like above which is checking system limits, this can happen. This is nicely explained here.
Now, why other programs are sort of hanged OR slow? Because they all share the same operating system and system is starving for resource. In fact at a point the system will crash and reboot again.
Hope this helps?

About sbrk() and malloc()

I've read the linux manual about sbrk() thoroughly:
sbrk() changes the location of the program break, which defines the end
of the process's data segment (i.e., the program break is the first
location after the end of the uninitialized data segment).
And I do know that user space memory's organization is like the following:
The problem is:
When I call sbrk(1), why does it say I am increasing the size of heap? As the manual says, I am changing the end position of "data segment & bss". So, what increases should be the size of data segment & bss, right?
The data and bss segments are a fixed size. The space allocated to the process after the end of those segments is therefore not a part of those segments; it is merely contiguous with them. And that space is called the heap space and is used for dynamic memory allocation.
If you want to regard it as 'extending the data/bss segment', that's fine too. It won't make any difference to the behaviour of the program, or the space that's allocated, or anything.
The manual page on Mac OS X indicates you really shouldn't be using them very much:
The brk and sbrk functions are historical curiosities left over from earlier days before the advent of virtual memory management. The brk() function sets the break or lowest address of a process's data segment (uninitialized data) to addr (immediately above bss). Data addressing is restricted between addr and the lowest stack pointer to the stack segment. Memory is allocated by brk in page size pieces; if addr is not evenly divisible by the system page size, it is increased to the next page boundary.
The current value of the program break is reliably returned by sbrk(0) (see also end(3)). The getrlimit(2) system call may be used to determine the maximum permissible size of the data segment; it will not be possible to set the break beyond the rlim_max value returned from a call to getrlimit, e.g. etext + rlp->rlim_max (see end(3) for the definition of etext).
It is mildly exasperating that I can't find a manual page for end(3), despite the pointers to look at it. Even this (slightly old) manual page for sbrk() does not have a link for it.
Notice that today sbrk(2) is rarely used. Most malloc implementations are using mmap(2) -at least for large allocations- to acquire a memory segment (and munmap to release it). Quite often, free simply marks a memory zone to be reusable by some future malloc (and does not release any memory to the Linux kernel).
(so practically, the heap of a modern linux process is made of several segments, so is more subtle than your picture; and multi-threaded processes have one stack per thread)
Use proc(5), notably /proc/self/maps and /proc/$pid/maps, to understand the virtual address space of some process. Try first to understand the output of cat /proc/self/maps (showing the address space of that cat command) and of cat /proc/$$/maps (showing the address space of your shell). Try also to look at the maps pseudo-file for your web browser (e.g. cat /proc/$(pidof firefox)/maps or cat /proc/$(pidof iceweasel)/maps etc...); I have more than a thousand lines (so process segments) in it.
Use strace(1) to understand the system calls done by a given command or process.
Take advantage that on Linux most (and probably all) C standard library implementations are free software, so you can study their source code. The source code of musl-libc is quite easy to read.
Read also about ELF, ASLR, dynamic linking & ld-linux(8), and the Advanced Linux Programming book then syscalls(2)

process allocated memory blocks from kernel

i need to have reliable measurement of allocated memory in a linux process. I've been looking into mallinfo but i've read that it is deprecated. What is the state of the art alternative for this sort of statistics?
basically i'm interested in at least two numbers:
number (and size) of allocated memory blocks/pages from the kernel by any malloc or whatever implementation uses the C library of choice
(optional but still important) number of allocated memory by userspace code (via malloc, new, etc.) minus the deallocated memory by it (via free, delete, etc.)
one possibility i have is to override malloc calls with LD_PRELOAD, but it might introduce an unwanted overhead at runtime, also it might not interact properly with other libraries i'm using that also rely on LD_PRELOAD aop-ness.
another possibility i've read is with rusage.
To be clear, this is NOT for debugging purposes, the memory usage is intrinsic feature of the application (similar to Mathematica or Matlab that display the amount of memory used, only that more precise at the block-level)
For this purpose - a "memory usage" introspection feature within an application - the most appropriate interface is malloc_hook(3). These are GNU extensions that allow you to hook every malloc(), realloc() and free() call, maintaining your statistics.
To see how much memory is mapped by your application from the kernel's point of view, you can read and collate the information in the /proc/self/smaps pseudofile. This also lets you see how much of each allocation is resident, swapped, shared/private, clean/dirty etc.
/proc/PID/status contains a few useful pieces of information (try running cat /proc/$$/status for example).
VmPeak is the largest your process's virtual memory space ever became during its execution. This includes all pages mapped into your process, including executable pages, mmap'ed files, stack, and heap.
VmSize is the current size of your process's virtual memory space.
VmRSS is the Resident Set Size of your process; i.e., how much of it is taking up physical RAM right now. (A typical process will have lots of stuff mapped that it never uses, like most of the C library. If no processes need a page, eventually it will be evicted and become non-resident. RSS measures the pages that remain resident and are mapped into your process.)
VmHWM is the High Water Mark of VmRSS; i.e. the highest that number has been during the lifetime of the process.
VmData is the size of your process's "data" segment; i.e., roughly its heap usage. Note that small blocks on which you have done malloc and then free will still be in use from the kernel's point of view; large blocks will actually be returned to the kernel when freed. (If memory serves, "large" means greater than 128k for current glibc.) This is probably the closest to what you are looking for.
These measurements are probably better than trying to track malloc and free, since they indicate what is "really going on" from a system-wide point of view. Just because you have called free() on some memory, that does not mean it has been returned to the system for other processes to use.

How does physical pages are allocated and freed during the malloc and free call?

Malloc allocates memory from one of the virtual memory regions of the process called Heap.
What is the initial size of the Heap (just after the execution begins and prior to any malloc call)? Say, if Heap starts from X virtual address and ends at Y virtual address I want to know the difference between X and Y.
I have read the answers to the duplicate question which was asked earlier.
How do malloc() and free() work?
The answers written are all in the context of virtual address but I want to know how the physical pages are allocated.
I am not sure but I think that this initial size (X-Y) would not have the corresponding page table entries in the operating system. Please correct me if I am wrong.
Now, say there is a request for allocating (and using) 10 bytes of memory, a new page would be allocated. Then, all the further requests for memory would be satisfied from this page or every time a new page would be allocated? Who would decide this?
When the memory would be freed (using free()) then at what time this allocated physical page would be freed and marked as available? I understand that the virtual address and physical page would not be freed immediately as the amount of memory freed could be very less. Then at what time the corresponding association between the physical and virtual address would be terminated?
I am sorry if my questions may sound strange. I am just a newbie and trying to understand the internals.
Normally you can think of physical pages as being allocated temporarily. If the memory that your program is using is swapped to disk, then at any time the association between your virtual addresses and physical RAM can be dropped, and that physical RAM used for something else.
If the program later accesses that memory, the OS will assign a new physical page to that virtual page, copy the data back from the page file into the physical memory, and complete the memory access.
So, to answer your question, the physical page might be marked as available when your program is no longer using the allocations that were put in it, or before. Or after, since malloc doesn't always bother freeing memory back to the OS. You don't really get to predict this stuff.
This all happens in the kernel, it's invisible from the point of view of C, just as CPU caching of memory is invisible from C. Well, invisible until your program slows down massively due to swapping. Obviously if you disable the swap file then things change a bit: instead of your program slowing down due to swapping, some program somewhere will fail to allocate memory, or something will be killed by the OOM killer.
How pages are allocated is different in each os, Linux, Mac, Windows, etc. In most/all implementations there is a kernel mechanism that defines how it is allocated.
http://www.linuxjournal.com/article/1133
How the OS handles this is quite OS dependent. In most (if not all) cases, the OS will at least takes note in its table that there was an allocation. You probably are confusing with the fact that some OS in some situation do not commit memory until it has been accessed. (keyword: overcommit; if you want my opinion on this, it should be a per process setting, and not a global one, and defaulting to committing the memory).
Now for returning freed memory to the OS, that depends on the allocator. It can't return anything less than a page, so while a page contains allocated memory, it won't be returned. And depending on how it has been allocated, there may be other constraints; for instance when using sbreak() as traditionally done on Unix, you can return only the latest allocated pages (i.e. if you return a page, all the one allocated after are also returned). More modern approach on Unix use mmapped memory for large blocks, under the rationale that mmapped memory can be returned as wanted. For small allocation blocks, it is often deemed not worthwhile to check if pages in the middle could be returned, and so mmapped memory isn't used.

Why malloc+memset is slower than calloc?

It's known that calloc is different than malloc in that it initializes the memory allocated. With calloc, the memory is set to zero. With malloc, the memory is not cleared.
So in everyday work, I regard calloc as malloc+memset.
Incidentally, for fun, I wrote the following code for a benchmark.
The result is confusing.
Code 1:
#include<stdio.h>
#include<stdlib.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
int i=0;
char *buf[10];
while(i<10)
{
buf[i] = (char*)calloc(1,BLOCK_SIZE);
i++;
}
}
Output of Code 1:
time ./a.out
**real 0m0.287s**
user 0m0.095s
sys 0m0.192s
Code 2:
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
int i=0;
char *buf[10];
while(i<10)
{
buf[i] = (char*)malloc(BLOCK_SIZE);
memset(buf[i],'\0',BLOCK_SIZE);
i++;
}
}
Output of Code 2:
time ./a.out
**real 0m2.693s**
user 0m0.973s
sys 0m1.721s
Replacing memset with bzero(buf[i],BLOCK_SIZE) in Code 2 produces the same result.
My question is: Why is malloc+memset so much slower than calloc? How can calloc do that?
The short version: Always use calloc() instead of malloc()+memset(). In most cases, they will be the same. In some cases, calloc() will do less work because it can skip memset() entirely. In other cases, calloc() can even cheat and not allocate any memory! However, malloc()+memset() will always do the full amount of work.
Understanding this requires a short tour of the memory system.
Quick tour of memory
There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...
Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to 100s of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask for more memory from the kernel when the pool runs dry. However, since the program you're asking about is allocating for a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.
The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with the memory of other processes. This is called memory protection, it has been dirt common since the 1990s, and it's the reason why one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take the memory, but instead it asks for the memory from the kernel using a system call like mmap() or sbrk(). The kernel will give RAM to each process by modifying the page table.
The page table maps memory addresses to actual physical RAM. Your process's addresses, 0x00000000 to 0xFFFFFFFF on a 32-bit system, aren't real memory but instead are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.
How it doesn't work
Here's how allocating 256 MiB does not work:
Your process calls calloc() and asks for 256 MiB.
The standard library calls mmap() and asks for 256 MiB.
The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table.
The standard library zeroes the RAM with memset() and returns from calloc().
Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process.
How it actually works
The above process would work, but it just doesn't happen this way. There are three major differences.
When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so mmap() guarantees that the new memory it returns is always zeroed.
There are a lot of programs out there that allocate memory but don't use the memory right away. Sometimes memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and makes a promise that it will put RAM there if your program ever actually uses it. When your program tries to read or write from those addresses, the processor triggers a page fault and the kernel steps in to assign RAM to those addresses and resumes your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM.
Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point to a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.
The final process looks more like this:
Your process calls calloc() and asks for 256 MiB.
The standard library calls mmap() and asks for 256 MiB.
The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns.
The standard library knows that the result of mmap() is always filled with zeroes (or will be once it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.
Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place.
If you use memset() to zero the page, memset() will trigger the page fault, cause the RAM to get allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and explains why calloc() is faster than malloc() and memset(). If you end up using the memory anyway, calloc() is still faster than malloc() and memset() but the difference is not quite so ridiculous.
This doesn't always work
Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors which are just too small for a sophisticated memory management unit.
This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. In general, the shared pool might have junk data stored in it from old memory that was used and freed with free(), so calloc() could take that memory and call memset() to clear it out. Common implementations will track which parts of the shared pool are pristine and still filled with zeroes, but not all implementations do this.
Dispelling some wrong answers
Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need to get some zeroed memory later. Linux does not zero memory ahead of time, and Dragonfly BSD recently also removed this feature from their kernel. Some other kernels do zero memory ahead of time, however. Zeroing pages during idle isn't enough to explain the large performance differences anyway.
The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this:
function memset(dest, c, len)
// one byte at a time, until the dest is aligned...
while (len > 0 && ((unsigned int)dest & 15))
*dest++ = c
len -= 1
// now write big chunks at a time (processor-specific)...
// block size might not be 16, it's just pseudocode
while (len >= 16)
// some optimized vector code goes here
// glibc uses SSE2 when available
dest += 16
len -= 16
// the end is not aligned, so one byte at a time
while (len > 0)
*dest++ = c
len -= 1
So you can see, memset() is very fast and you're not really going to get anything better for large blocks of memory.
The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (I measured more than three orders of magnitude on my system between malloc()+memset() and calloc()).
Party trick
Instead of looping 10 times, write a program that allocates memory until malloc() or calloc() returns NULL.
What happens if you add memset()?
Because on many systems, in spare processing time, the OS goes around setting free memory to zero on its own and marking it safe for calloc(), so when you call calloc(), it may already have free, zeroed memory to give you.
On some platforms in some modes malloc initialises the memory to some typically non-zero value before returning it, so the second version could well initialize the memory twice

Resources