The malloc function uses both sbrk and mmap functions. Now the sbrk function increases or decreases the data segment. So it grows linearly. Now my question is, is this linearity always maintained, or for example, an mmap call can allocate memory overlapping the data segment?
I'm talking about multithreaded programs running on multicore systems. This blog talks about some serious flaws of sbrk for multithreaded programs, and it points out that memory allocated with sbrk can end up intermingled with memory allocated with mmap (the sbrk heap could become discontinuous because an mmapped region or a shared object obstructs the growth of the heap).
That blog post doesn't see the forest for the trees; only the malloc implementation is allowed to call sbrk with a nonzero argument. More precisely, most malloc implementations for Unix will stop functioning correctly (and by that I mean "your program will crash") if application code calls sbrk with a nonzero argument. If you want to make a large allocation directly from the OS you must use mmap to do it.
(It is true that in a multi-threaded program, malloc must internally wrap a mutex around its calls to sbrk, but that's an implementation detail. POSIX says malloc is thread safe, that's the important thing for an application programmer.)
mmap will not allocate memory overlapping the brk area unless you use MAP_FIXED. If you use MAP_FIXED and your program blows up you get to keep all the pieces.
The kernel tries to avoid doing it, but mmap in normal operation could conceivably allocate memory close to the top of the brk area. If this happens, a subsequent sbrk call that would collide with the mmap region will fail. It will not allocate discontiguous memory. Good implementations of malloc ought to detect this condition and start using mmap for everything. I have not actually tried it, but a test program would be pretty easy to write.
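A minimal sketch of such a test program, assuming Linux with glibc. The function name sbrk_blocked_by_mapping is made up for illustration. It places one PROT_NONE page right above the current break (without MAP_FIXED, so the address is only a hint and the result has to be checked) and then asks sbrk to grow past it:

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper, not part of any library.
 * Returns 1 if sbrk() refused to grow into a blocking mapping,
 * 0 if it grew anyway, -1 if the experiment could not be set up. */
int sbrk_blocked_by_mapping(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *brk_now = sbrk(0);
    if (page <= 0 || brk_now == (void *)-1)
        return -1;

    /* Round the current break up to a page boundary. */
    uintptr_t mask = (uintptr_t)page - 1;
    uintptr_t top = ((uintptr_t)brk_now + mask) & ~mask;

    /* One inaccessible page directly above the break. No MAP_FIXED:
     * the kernel may place it elsewhere, so verify the address. */
    void *wall = mmap((void *)top, (size_t)page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (wall == MAP_FAILED)
        return -1;
    if ((uintptr_t)wall != top) {
        munmap(wall, (size_t)page);
        return -1;
    }

    /* Growing the break would now run into the mapping. */
    void *old = sbrk((intptr_t)(4 * page));
    int blocked = (old == (void *)-1);
    if (!blocked)
        sbrk(-(intptr_t)(4 * page));   /* undo the growth */

    munmap(wall, (size_t)page);
    return blocked;
}
```

If the claim above holds on your system, the function returns 1: sbrk reports failure rather than handing back memory on the far side of the mapping.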
is this linearity always maintained, or for example, an mmap call can allocate memory overlapping the data segment?
Observed behavior is that the brk area is always linear. Implementation details: If enlarging the brk area is not possible, for example due to a blocking mapping, glibc will switch to mmap-only. Small allocations (<128KB) seem to be obtained by glibc via brk if possible, so blocking that with:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
int main(void)
{
    int i;
    for (i = 0; i < 1024; ++i) {
        malloc(2048);
        if (i == 512) {
            void *r, *end = sbrk(0);
            r = mmap(end, 4096, PROT_NONE,
                     MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
        }
    }
}
when straced, yields indeed
[...]
brk(0x1e7d000) = 0x1e7d000
mmap(0x1e7d000, 4096, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0) = 0x1e7d000
brk(0x1e9e000) = 0x1e7d000 <-- (!)
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd9bc9000
What is the upper limit of the increment that can be used in a sbrk call?
I am unable to successfully call sbrk with a 2e10 increment, but I am able to call sbrk with a 1e10 increment three times in a row.
I have a similar issue with mmap.
Tested on Arch Linux 5.4.3-arch1-1 x86-64 with glibc 2.30.
Code example compiled with gcc 9.2.0.
Code example:
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <unistd.h>
int main() {
    /* sizeof yields a size_t, so print it with %zu */
    printf("sizeof(intptr_t)=%zu\n", sizeof(intptr_t));
    intptr_t increment = 1e10;
    if (sbrk(increment * 2) == (void *) -1) {
        printf("error sbrk 1\n");
    }
    if (sbrk(increment) == (void *) -1) {
        printf("error sbrk 2\n");
    }
    if (sbrk(increment) == (void *) -1) {
        printf("error sbrk 3\n");
    }
    if (sbrk(increment) == (void *) -1) {
        printf("error sbrk 4\n");
    }
    return 0;
}
The code above results in the following output:
sizeof(intptr_t)=8
error sbrk 1
Technically, sbrk takes an intptr_t on most modern systems. That is a 32-bit signed integer (-2^31 to 2^31-1) when compiling with 32-bit pointers, and a 64-bit signed integer (-2^63 to 2^63-1) when compiling with 64-bit pointers.
Of course, the practical range is much smaller. For the sbrk call to succeed the following should be true:
The parameter should be within the address space supported by the processor/machine. Many modern systems allow up to 2^48 addresses (as opposed to the theoretical limit of 2^64).
The total size of requested heap size should be within the configuration limits: the process specific hard/soft limits (ulimit), and system configuration limit.
When increasing the size of the heap, there should be enough physical memory or swap space to support the requested memory.
Back to the user's question: when running a 32-bit program (on a 32- or 64-bit system), the limit is 2 GB or less, so 1e10 will not work. When running a 64-bit program, the request may fail or pass depending on configuration, resources and setup.
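The per-process limits mentioned above can be checked from inside the program. A minimal sketch, assuming a POSIX system; fits_soft_limit is a made-up helper name, and it only looks at the soft limit, not at the other conditions in the list:

```c
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if `bytes` fits under the soft limit
 * of `resource`, 0 if it does not, and -1 if the limit cannot be read. */
int fits_soft_limit(int resource, unsigned long long bytes)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;                       /* "unlimited" */
    return bytes <= (unsigned long long)rl.rlim_cur;
}
```

For the program in the question, one would call it as fits_soft_limit(RLIMIT_DATA, 20000000000ULL) and fits_soft_limit(RLIMIT_AS, 20000000000ULL) before the first sbrk. A result of 1 only rules out ulimit as the cause; the request can still fail for the other reasons.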
I did more experiments to try to check what configurations and resources could explain the described behavior.
I found out that I am unable to allocate a contiguous block of memory larger than the total physical Memory + Swap.
This seems to answer my question.
However, is this a configuration setting?
I've been reading a lot about memory allocation on the heap and how certain heap management allocators do it.
Say I have the following program:
#include<stdlib.h>
#include<stdio.h>
#include<unistd.h>
int main(int argc, char *argv[]) {
    // allocate 4 gigabytes of RAM
    void *much_mems = malloc(4294967296);
    // sleep for 10 minutes
    sleep(600);
    // free teh ram (free takes the pointer itself, not *much_mems)
    free(much_mems);
    // sleep some moar
    sleep(600);
    return 0;
}
Let's say for the sake of argument that my compiler doesn't optimize out anything above, that I can actually allocate 4GiB of RAM, that the malloc() call returns an actual pointer and not NULL, that size_t can hold an integer as big as 4294967296 on my given platform, and that the allocator implemented by the malloc call actually does allocate that amount of RAM in the heap. Pretend that the above code does exactly what it looks like it will do.
After the call to free executes, how does the kernel know that those 4 GiB of RAM are now eligible for use for other processes and for the kernel itself? I'm not assuming the kernel is Linux, but that would be a good example. Before the call to free, this process has a heap size of at least 4GiB, and afterward, does it still have that heap size?
How do modern operating systems allow userspace programs to return memory back to kernel space? Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available? And is it possible that my 4 GiB allocation will be non-contiguous?
Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available?
Yes.
A modern implementation of malloc on Linux will call mmap to allocate a large amount of memory. The kernel will find an unused virtual address, mark it as allocated, and return it. (The kernel may also return an error if there isn't enough free memory)
free would then call munmap to deallocate the memory, passing the address and size of the allocation.
On Windows, malloc will call VirtualAlloc and free will call VirtualFree.
On GNU/Linux with Glibc, large memory allocations, of more than a few hundred kilobytes, are handled by calling mmap. When the free function is invoked on this, the library knows that the memory was allocated this way (thanks to meta-data stored in a header). It simply calls munmap on it to release it. That's how the kernel knows; its mmap and munmap API is being used.
You can see these calls if you run strace on the program.
The kernel keeps track of all mmap-ed regions using a balanced tree (a red-black tree in older Linux kernels, a maple tree since Linux 6.1). Given an arbitrary virtual address, it can quickly determine whether it lands in the mmap area, and which mapping, by performing a tree walk.
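To make the "meta-data stored in a header" idea concrete, here is a minimal sketch of an allocator that serves every request with its own mmap and remembers the mapping length in front of the block. This is not glibc's code; big_alloc, big_free and union big_hdr are names invented for the example, and Linux-style MAP_ANONYMOUS is assumed:

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical header placed in front of every block. */
union big_hdr {
    size_t map_len;      /* total length that was passed to mmap() */
    max_align_t align;   /* keeps the user pointer suitably aligned */
};

void *big_alloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(union big_hdr))
        return NULL;

    size_t map_len = size + sizeof(union big_hdr);
    union big_hdr *h = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (h == MAP_FAILED)
        return NULL;

    h->map_len = map_len;   /* remembered for big_free() */
    return h + 1;           /* user memory starts after the header */
}

int big_free(void *p)
{
    if (p == NULL)
        return 0;

    /* Step back to the header to learn how much to unmap. */
    union big_hdr *h = (union big_hdr *)p - 1;
    return munmap(h, h->map_len);
}
```

Running a program that uses these two functions under strace should show one mmap and one munmap per block, which is the same pattern described above for large malloc/free pairs.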
Before the call to free, this process has a heap size of at least 4GiB...
The C language does not define either "heap" or "stack". Before the call to free, this process has a chunk of 4 GB dynamically allocated memory...
and afterward, does it still have that heap size?
...and after the free(), access to that memory would be undefined behaviour, so for practical purposes, that dynamically allocated memory is no longer "there".
What the library does "under the hood" (e.g. caching, see below) is up to the library, and is subject to change without further notice. This could change with the amount of available physical memory, system load, runtime parameters, ...
How do modern operating systems allow userspace programs to return memory back to kernel space?
It's up to the standard library's implementation to decide (which, of course, has to talk to the operating system to actually, physically allocate / free memory).
Others have pointed out how certain, existing implementations do it. Other libraries, operating systems, and environments exist.
Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available?
Possibly. A common optimization done by library implementations is to "cache" free()d memory, so subsequent malloc() calls can be served without talking to the kernel (which is a costly operation). When, how much, and how long memory is cached this way is, you guessed it, implementation-defined.
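As an illustration of that caching idea only, and not of how any particular libc does it, here is a sketch of a single-size free list layered on top of malloc. The names cache_get, cache_put and CACHE_BLOCK_SIZE are made up for the example:

```c
#include <stdlib.h>

#define CACHE_BLOCK_SIZE 256   /* every block handed out has this size */

/* A released block is reused to hold the link to the next free block. */
struct cache_node {
    struct cache_node *next;
};

static struct cache_node *cache_head;

/* Serve from the cache if possible; otherwise fall back to malloc(). */
void *cache_get(void)
{
    if (cache_head != NULL) {
        void *p = cache_head;
        cache_head = cache_head->next;
        return p;
    }
    return malloc(CACHE_BLOCK_SIZE);
}

/* Keep the block for later instead of returning it to the allocator.
 * Only pointers obtained from cache_get() may be passed in. */
void cache_put(void *p)
{
    struct cache_node *n = p;
    n->next = cache_head;
    cache_head = n;
}
```

A cache_put followed by a cache_get returns the same block without going any further down the stack. A real allocator applies the same trick one level lower, between free() and the kernel.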
And is it possible that my 4 GiB allocation will be non-contiguous?
The process will always "see" contiguous memory. In a system supporting virtual memory (i.e. "modern" desktop OS's like Linux or Windows), the physical memory might be non-contiguous, but the virtual addresses your process gets to see will be contiguous (or the malloc() would have failed if this requirement could not be serviced).
Again, other systems exist. You might be looking at a system that doesn't virtualize addresses (i.e. gives physical addresses to the process). You might be looking at a system that assigns a given amount of memory to a process on startup, serves any malloc() requests from that, and doesn't support the allocation of additional memory. And so on.
If we're using Linux as an example, it uses mmap to allocate large chunks of memory. This means that when you free it, it gets munmapped, i.e. the kernel gets told that it can now unmap this memory. Read up on the brk and sbrk system calls. A good place to start would be here...
What does brk( ) system call do?
and here. The following post discusses how malloc is implemented which will give you a good idea what's happening under the covers...
How is malloc() implemented internally?
Doug Lea's malloc can be found here. It's well commented and public domain...
ftp://g.oswego.edu/pub/misc/malloc.c
malloc() and free() are not system calls themselves; they are library functions that the application calls to allocate and free memory on the heap.
The application itself does not map physical memory into its address space.
That part of the mechanism is executed at kernel level.
See the heap implementation code below, taken from a kernel's own heap manager:
void *heap_alloc(uint32_t nbytes) {
heap_header *p, *prev_p; // used to keep track of the current unit
unsigned int nunits; // this is the number of "allocation units" needed by nbytes of memory
nunits = (nbytes + sizeof(heap_header) - 1) / sizeof(heap_header) + 1; // see how much we will need to allocate for this call
// check to see if the list has been created yet; start it if not
if ((prev_p = _heap_free) == NULL) {
_heap_base.s.next = _heap_free = prev_p = &_heap_base; // point at the base of the memory
_heap_base.s.alloc_sz = 0; // and set its allocation size to zero
}
// now enter a for loop to find a block of memory
for (p = prev_p->s.next;; prev_p = p, p = p->s.next) {
// did we find a big enough block?
if (p->s.alloc_sz >= nunits) {
// the block is exact length
if (p->s.alloc_sz == nunits)
prev_p->s.next = p->s.next;
// the block needs to be cut
else {
p->s.alloc_sz -= nunits;
p += p->s.alloc_sz;
p->s.alloc_sz = nunits;
}
_heap_free = prev_p;
return (void *)(p + 1);
}
// not enough space!! Try to get more from the kernel
if (p == _heap_free) {
// if the kernel has no more memory, return error!
if ((p = morecore()) == NULL)
return NULL;
}
}
}
this heap_alloc function uses morecore function which is implemented as below :
heap_header *morecore() {
char *cp;
heap_header *up;
cp = (char *)pmmngr_alloc_block(); // allocate more memory for the heap
// if cp is null we have no memory left
if (cp == NULL)
return NULL;
//vmmngr_mapPhysicalAddress(cp, (void *)_virt_addr); // and map it's virtual address to it's physical address
vmmngr_mapPhysicalAddress(vmmngr_get_directory(), _virt_addr, (uint32_t)cp, I86_PTE_PRESENT | I86_PTE_WRITABLE);
_virt_addr += BLOCK_SIZE; // tack on nu bytes to the virtual address; this will be our next allocation address
up = (heap_header *)cp;
up->s.alloc_sz = BLOCK_SIZE;
heap_free((void *)(up + 1));
return _heap_free;
}
as you can see this function is asking the physical memory manager to allocate a block :
cp = (char *)pmmngr_alloc_block();
and then map the allocated block into virtual memory :
vmmngr_mapPhysicalAddress(vmmngr_get_directory(), _virt_addr, (uint32_t)cp, I86_PTE_PRESENT | I86_PTE_WRITABLE);
As you can see, in this kernel the whole story is controlled by the heap manager at kernel level.
I'm trying to optimize my dynamic memory usage. The thing is that I initially allocate some amount of memory for the data I get from a socket. Then, on the new data arrival I'm reallocating memory so the newly arrived part will fit into the local buffer. After some poking around I've found that malloc actually allocates a greater block than requested. In some cases significantly greater; here comes some debug info from malloc_usable_size(ptr):
requested 284 bytes, allocated 320 bytes
requested 644 bytes, reallocated 1024 bytes
It's well known that malloc/realloc are expensive operations. In most cases newly arrived data will fit into a previously allocated block (at least when I requested 644 bytes and got 1024 instead), but I have no idea how I can figure that out.
The trouble is that malloc_usable_size should not be relied upon (as described in the manual), and if the program requested 644 bytes and malloc allocated 1024, the excess 380 bytes may be overwritten and cannot be used safely. So calling malloc for a given amount of data and then using malloc_usable_size to figure out how many bytes were really allocated isn't the way to go.
What I want is to know the block grid before calling malloc, so that I can request exactly the largest size the allocator would hand out anyway, store the allocated size, and on realloc check whether I really need to reallocate or whether the previously allocated block is already big enough.
In other words, if I were to request 644 bytes, and malloc actually gave me 1024, I want to have predicted that and requested 1024 instead.
Depending on your particular implementation of libc you will have different behaviour. I have found in most cases two approaches to do the trick:
Use the stack, this is not always feasible, but C allows VLAs on the stack and is the most effective if you don't intend to pass your buffer to an external thread
while (1) {
char buffer[known_buffer_size];
read(fd, buffer, known_buffer_size);
// use buffer
// released at the end of scope
}
In Linux you can make excellent use of mremap which can enlarge/shrink memory with zero-copy guaranteed. It may move your VM mapping though. Only problem here is that it only works in chunks of system page size sysconf(_SC_PAGESIZE) which is usually 0x1000.
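void * buffer = mmap(NULL, init_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
size_t current_size = init_size;
while(1) {
    // if needs remapping
    {
        // zero copy, but involves a system call;
        // mremap takes the old size as well as the new one
        // and returns MAP_FAILED on error
        buffer = mremap(buffer, current_size, new_size, MREMAP_MAYMOVE);
        current_size = new_size;
    }
    // use buffer
}
munmap(buffer, current_size);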
OS X has similar semantics to Linux's mremap through the Mach vm_remap; it's a little more complicated though.
void * buffer = mmap(NULL, init_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
mach_port_t this_task = mach_task_self();
while(1) {
// if needs remapping
{
// zero copy, but involves a system call
void * new_address = mmap(NULL, new_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
vm_prot_t cur_prot, max_prot;
munmap(new_address, current_size); // vm needs to be empty for remap
// there is a race condition between these two calls
vm_remap(this_task,
&new_address, // new address
current_size, // has to be page-aligned
0, // auto alignment
0, // remap fixed
this_task, // same task
buffer, // source address
0, // MAP READ-WRITE, NOT COPY
&cur_prot, // unused protection struct
&max_prot, // unused protection struct
VM_INHERIT_DEFAULT);
munmap(buffer, current_size); // remove old mapping
buffer = new_address;
}
// use buffer
}
The short answer is that the standard malloc interface does not provide the information you are looking for. To use the information breaks the abstraction provided.
Some alternatives are:
Rethink your usage model. Perhaps pre-allocate a pool of buffers at start, filling them as you go. Unfortunately this could complicate your program more than you would like.
Use a different memory allocation library that does provide the needed interface. Different libraries provide different tradeoffs in terms of fragmentation, max run time, average run time, etc.
Use your OS memory allocation API. These are often written to be efficient, but will generally require a system call (unlike a user-space library).
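For the first alternative, one common pattern is to track the capacity yourself and grow it geometrically, so that realloc is only called when the data really no longer fits. A minimal sketch; struct buf and buf_append are hypothetical names, and the starting capacity of 64 and the doubling policy are arbitrary choices for the example:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Growable byte buffer that remembers its own capacity. */
struct buf {
    char  *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
};

/* Append n bytes; returns 0 on success, -1 on overflow or out of memory. */
int buf_append(struct buf *b, const void *src, size_t n)
{
    if (n > SIZE_MAX - b->len)
        return -1;

    size_t need = b->len + n;
    if (need > b->cap) {
        /* Double the capacity until the new data fits. */
        size_t ncap = b->cap ? b->cap : 64;
        while (ncap < need) {
            if (ncap > SIZE_MAX / 2) {
                ncap = need;
                break;
            }
            ncap *= 2;
        }
        char *p = realloc(b->data, ncap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = ncap;
    }

    memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}
```

Start from a zero-initialized struct buf and call buf_append for every chunk read from the socket. Whether realloc is needed is then decided from b->cap alone, without asking the allocator what it really handed out.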
In my professional code, I often take advantage of the actual size allocated by malloc()[etc], rather than the requested size. This is my function for determining the actual allocation size:
int MM_MEM_Stat(
void *I__ptr_A,
size_t *_O_allocationSize
)
{
int rCode = GAPI_SUCCESS;
size_t size = 0;
/*-----------------------------------------------------------------
** Validate caller arg(s).
*/
#ifdef __linux__ // Not required for __APPLE__, as malloc_size() will
// return 0 for non-malloc'ed refs.
if(NULL == I__ptr_A)
{
rCode=EINVAL;
goto CLEANUP;
}
#endif
/*-----------------------------------------------------------------
** Calculate the size.
*/
#if defined(__APPLE__)
size=malloc_size(I__ptr_A);
#elif defined(__linux__)
size=malloc_usable_size(I__ptr_A);
#else
#error "MM_MEM_Stat: unsupported platform"
#endif
if(0 == size)
{
rCode=EFAULT;
goto CLEANUP;
}
/*-----------------------------------------------------------------
** Return requested values to caller.
*/
if(_O_allocationSize)
*_O_allocationSize = size;
CLEANUP:
return(rCode);
}
I did some more research and found two interesting things about the malloc implementations in Linux and FreeBSD:
1) on Linux, malloc grows block sizes linearly in 16-byte steps, at least up to 8K, so no optimization is needed at all; it's just not reasonable;
2) on FreeBSD the situation is different: the steps are bigger and tend to grow with the requested block size.
So any kind of optimization is needed only for FreeBSD, as Linux allocates blocks in very small steps and it's very unlikely to receive less than 16 bytes of data from a socket.
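The step sizes can be measured on a given system with a small probe. A sketch, assuming a libc that provides malloc_usable_size in <malloc.h> (glibc does); usable_size_for is a made-up helper name:

```c
#include <malloc.h>
#include <stdlib.h>

/* Hypothetical probe: allocate `request` bytes, ask the allocator how
 * many bytes the block can really hold, and release it again.
 * Returns 0 if the allocation fails. */
size_t usable_size_for(size_t request)
{
    void *p = malloc(request);
    if (p == NULL)
        return 0;

    size_t usable = malloc_usable_size(p);
    free(p);
    return usable;
}
```

Calling it in a loop over request sizes and printing each value where the result changes shows the grid on that particular allocator. The numbers are an implementation detail and may differ between libc versions.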
I am trying to "mmap" a binary file (~8 GB) using the following code (test.c).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define handle_error(msg) \
do { perror(msg); exit(EXIT_FAILURE); } while (0)
int main(int argc, char *argv[])
{
const char *memblock;
int fd;
struct stat sb;
fd = open(argv[1], O_RDONLY);
fstat(fd, &sb);
printf("Size: %lu\n", (uint64_t)sb.st_size);
memblock = mmap(NULL, sb.st_size, PROT_WRITE, MAP_PRIVATE, fd, 0);
if (memblock == MAP_FAILED) handle_error("mmap");
for(uint64_t i = 0; i < 10; i++)
{
printf("[%lu]=%X ", i, memblock[i]);
}
printf("\n");
return 0;
}
test.c is compiled using gcc -std=c99 test.c -o test and file of test returns: test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped
Although this works fine for small files, I get a segmentation fault when I try to load a big one. The program actually returns:
Size: 8274324021
mmap: Cannot allocate memory
I managed to map the whole file using boost::iostreams::mapped_file but I want to do it using C and system calls. What is wrong with my code?
MAP_PRIVATE mappings require a memory reservation, as writing to these pages may result in copy-on-write allocations. This means that you can't map something too much larger than your physical ram + swap. Try using a MAP_SHARED mapping instead. This means that writes to the mapping will be reflected on disk - as such, the kernel knows it can always free up memory by doing writeback, so it won't limit you.
I also note that you're mapping with PROT_WRITE, but you then go on and read from the memory mapping. You also opened the file with O_RDONLY - this itself may be another problem for you; you must specify O_RDWR if you want to use PROT_WRITE with MAP_SHARED.
As for PROT_WRITE only, this happens to work on x86, because x86 doesn't support write-only mappings, but may cause segfaults on other platforms. Request PROT_READ|PROT_WRITE - or, if you only need to read, PROT_READ.
On my system (VPS with 676MB RAM, 256MB swap), I reproduced your problem; changing to MAP_SHARED results in an EACCES error (since I'm not allowed to write to the backing file opened with O_RDONLY). Changing to PROT_READ and MAP_SHARED allows the mapping to succeed.
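Put together, the read-only variant could look like the sketch below. map_file_readonly is a name made up for the example, and the function refuses empty files because mmap rejects a zero length:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only.
 * Returns the mapping and stores its length in *len_out,
 * or returns NULL on any error. */
const unsigned char *map_file_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) != 0 || sb.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ + MAP_SHARED: nothing can be copied on write, so no
     * memory reservation is needed for the whole file. */
    void *p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)sb.st_size;
    return p;
}
```

The caller releases the mapping with munmap(ptr, len) when done.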
If you need to modify bytes in the file, one option would be to make private just the ranges of the file you're going to write to. That is, munmap and remap with MAP_PRIVATE the areas you intend to write to. Of course, if you intend to write to the entire file then you need 8GB of memory to do so.
Alternately, you can write 1 to /proc/sys/vm/overcommit_memory. This will allow the mapping request to succeed; however, keep in mind that if you actually try to use the full 8GB of COW memory, your program (or some other program!) will be killed by the OOM killer.
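Before changing the setting it can help to look at the current value. A minimal sketch that reads it on Linux; read_overcommit_mode is a hypothetical helper name. The kernel documents 0 as heuristic overcommit, 1 as always overcommit and 2 as strict accounting:

```c
#include <stdio.h>

/* Hypothetical helper: returns the current overcommit mode (0, 1 or 2),
 * or -1 if the file cannot be read, e.g. on a non-Linux system. */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (f == NULL)
        return -1;

    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;

    fclose(f);
    return mode;
}
```

With mode 0 the kernel refuses only obvious overcommits, which fits the failure seen for a private writable mapping far larger than RAM plus swap.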
Linux (and apparently a few other UNIX systems) have the MAP_NORESERVE flag for mmap(2), which can be used to explicitly enable swap space overcommitting. This can be useful when you wish to map a file larger than the amount of free memory available on your system.
This is particularly handy when you use MAP_PRIVATE and only intend to write to a small portion of the memory-mapped range, since this would otherwise trigger swap space reservation for the entire file (or cause the system to return ENOMEM, if system-wide overcommitting hasn't been enabled and you exceed the free memory of the system).
The issue to watch out for is that if you do write to a large portion of this memory, the lazy swap space reservation may cause your application to consume all the free RAM and swap on the system, eventually triggering the OOM killer (Linux) or causing your app to receive a SIGSEGV.
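A sketch of such a mapping, assuming Linux. map_private_noreserve is a made-up name. The file is opened read-only, which is sufficient for a private mapping because writes never reach the file:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file copy-on-write without
 * reserving swap space for it up front.
 * Returns NULL on any error. */
unsigned char *map_private_noreserve(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) != 0 || sb.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* Pages are only backed by RAM or swap once they are written to. */
    void *p = mmap(NULL, (size_t)sb.st_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_NORESERVE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)sb.st_size;
    return p;
}
```

Writes through the returned pointer stay private to the process. As noted above, writing to a large part of the range can still exhaust memory later.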
You don't have enough virtual memory to handle that mapping.
As an example, I have a machine here with 8G RAM, and ~8G swap (so 16G total virtual memory available).
If I run your code on a VirtualBox snapshot that is ~8G, it works fine:
$ ls -lh /media/vms/.../snap.vdi
-rw------- 1 me users 9.2G Aug 6 16:02 /media/vms/.../snap.vdi
$ ./a.out /media/vms/.../snap.vdi
Size: 9820000256
[0]=3C [1]=3C [2]=3C [3]=20 [4]=4F [5]=72 [6]=61 [7]=63 [8]=6C [9]=65
Now, if I drop the swap, I'm left with 8G total memory. (Don't run this on an active server.) And the result is:
$ sudo swapoff -a
$ ./a.out /media/vms/.../snap.vdi
Size: 9820000256
mmap: Cannot allocate memory
So make sure you have enough virtual memory to hold that mapping (even if you only touch a few pages in that file).
I have a small example program which simply fopens a file and uses fgets to read it. Using strace, I notice that the first call to fgets runs a mmap system call, and then read system calls are used to actually read the contents of the file. On fclose, the region is munmapped. If I instead read the file with open/read directly, this obviously does not occur. I'm curious what the purpose of this mmap is, and what it is accomplishing.
On my Linux 2.6.31 based system, when under heavy virtual memory demand these mmaps will sometimes hang for several seconds, and appear to me to be unnecessary.
The example code:
#include <stdlib.h>
#include <stdio.h>
int main ()
{
FILE *f;
if ( NULL == ( f=fopen( "foo.txt","r" )))
{
printf ("Fail to open\n");
}
char buf[256];
fgets(buf,256,f);
fclose(f);
}
And here is the relevant strace output when the above code is run:
open("foo.txt", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb8039000
read(3, "foo\nbar\n\n"..., 4096) = 9
close(3) = 0
munmap(0xb8039000, 4096) = 0
It's not the file that is mmap'ed - in this case mmap is used anonymously (not on a file), probably to allocate memory for the buffer that the subsequent reads will use.
malloc in fact results in such a call to mmap. Similarly, the munmap corresponds to a call to free.
The mmap is not mapping the file; instead it's allocating memory for the stdio FILE buffering. Normally malloc would not use mmap to service such a small allocation, but it seems glibc's stdio implementation is using mmap directly to get the buffer. This is probably to ensure it's page-aligned (though posix_memalign could achieve the same thing) and/or to make sure closing the file returns the buffer memory to the kernel. I question the usefulness of page-aligning the buffer. Presumably it's for performance, but I can't see any way it would help unless the file offset you're reading from is also page-aligned, and even then it seems like a dubious micro-optimization.
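If the buffer allocation itself is the concern, standard C lets the caller supply the buffer with setvbuf before the first read. A minimal sketch; read_first_line is a made-up name, and whether a given glibc version then skips its own allocation is an assumption worth checking with strace:

```c
#include <stdio.h>

/* Hypothetical helper: read the first line of `path` into `line`
 * using a caller-side stdio buffer.
 * Returns 0 on success, -1 on error or empty file. */
int read_first_line(const char *path, char *line, int line_size)
{
    char io_buf[BUFSIZ];   /* must stay alive until fclose() */

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    /* Must be called before any other operation on the stream. */
    if (setvbuf(f, io_buf, _IOFBF, sizeof io_buf) != 0) {
        fclose(f);
        return -1;
    }

    int ok = fgets(line, line_size, f) != NULL;
    fclose(f);
    return ok ? 0 : -1;
}
```

For the example in the question, the call would be read_first_line("foo.txt", buf, 256).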
From what I have read, memory-mapping functions are useful when handling large files. What counts as large I have no idea, but for large files they can be significantly faster than the 'buffered' I/O calls.
In the example that you have posted, I think the file is opened by the open() function and mmap is used for allocating memory or something else.
This can be seen clearly from the signature of the mmap function:
void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);
The second-to-last parameter takes the file descriptor, which should be non-negative when a file is being mapped,
while in the strace output it is -1.
The source code of fopen in glibc shows that mmap can actually be used.
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofopen.c;h=965d21cd978f3acb25ca23152993d9cac9f120e3;hb=HEAD#l36