In malloc, why use brk at all? Why not just use mmap?

Typical implementations of malloc use brk/sbrk as the primary means of claiming memory from the OS. However, they also use mmap to get chunks for large allocations. Is there a real benefit to using brk instead of mmap, or is it just tradition? Wouldn't it work just as well to do it all with mmap?
(Note: I use sbrk and brk interchangeably here because they are interfaces to the same Linux system call, brk.)
For reference, here are a couple of documents describing the glibc malloc:
GNU C Library Reference Manual: The GNU Allocator
https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html
glibc wiki: Overview of Malloc
https://sourceware.org/glibc/wiki/MallocInternals
What these documents describe is that sbrk is used to claim a primary arena for small allocations, mmap is used to claim secondary arenas, and mmap is also used to claim space for large objects ("much larger than a page").
The use of both the application heap (claimed with sbrk) and mmap introduces some additional complexity that might be unnecessary:
Allocated Arena - the main arena uses the application's heap. Other arenas use mmap'd heaps. To map a chunk to a heap, you need to know which case applies. If this bit is 0, the chunk comes from the main arena and the main heap. If this bit is 1, the chunk comes from mmap'd memory and the location of the heap can be computed from the chunk's address.
[Glibc malloc is derived from ptmalloc, which was derived from dlmalloc, which was started in 1987.]
The jemalloc manpage (http://jemalloc.net/jemalloc.3.html) has this to say:
Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
So, they even say here that sbrk is suboptimal but they use it anyway, even though they've already gone to the trouble of writing their code so that it works without it.
[Writing of jemalloc started in 2005.]
UPDATE: Thinking about this more, that bit about "in order of preference" gives me a line of inquiry. Why that order of preference? Are they just keeping sbrk as a fallback in case mmap is not supported (or lacks necessary features), or is it possible for the process to get into some state where it can use sbrk but not mmap? I'll look at their code and see if I can figure out what it's doing.
I'm asking because I'm implementing a garbage collection system in C, and so far I see no reason to use anything besides mmap. I'm wondering if there's something I'm missing, though.
(In my case I have an additional reason to avoid brk, which is that I might need to use malloc at some point.)

The system call brk() has the advantage of having only a single data item to track memory use, which happily is also directly related to the total size of the heap.
This has been in the exact same form since 1975's Unix V6. Mind you, V6 supported a user address space of 65,535 bytes. So there wasn't a lot of thought given for managing much more than 64K, certainly not terabytes.
Using mmap seems reasonable until I start wondering how a garbage collector that is altered or bolted on afterward could use mmap without also rewriting the allocation algorithm.
Will that work nicely with realloc(), fork(), etc.?

Calling mmap(2) once per memory allocation is not a viable approach for a general purpose memory allocator because the allocation granularity (the smallest individual unit which may be allocated at a time) for mmap(2) is PAGESIZE (usually 4096 bytes), and because it requires a slow and complicated syscall. The allocator fast path for small allocations with low fragmentation should require no syscalls.
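To make that concrete, here is a minimal sketch of the arena idea (illustrative names, not glibc's actual code): one mmap() up front, after which the fast path for small allocations is pure pointer arithmetic with no syscalls.

#define _DEFAULT_SOURCE                 /* for MAP_ANONYMOUS on older glibc */
#include <stddef.h>
#include <sys/mman.h>

/* Illustrative bump allocator: one mmap() up front, then the fast path
   for small allocations is pure pointer arithmetic -- no syscalls. */
#define ARENA_SIZE (1 << 20)            /* 1 MiB arena, for illustration */

static unsigned char *arena_base, *arena_top;

static void *arena_alloc(size_t n)
{
    if (arena_base == NULL) {           /* slow path: one syscall, once */
        void *m = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED)
            return NULL;
        arena_base = arena_top = m;
    }
    n = (n + 15) & ~(size_t)15;         /* keep 16-byte alignment */
    if ((size_t)(arena_base + ARENA_SIZE - arena_top) < n)
        return NULL;                    /* arena exhausted; a real allocator
                                           would map another arena here */
    void *p = arena_top;
    arena_top += n;
    return p;
}

A real allocator layers free lists, size classes and per-thread arenas on top of this, but the shape is the same: syscalls only when a whole arena is claimed or released.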
So regardless of what strategy you use, you still need to support multiple instances of what glibc calls memory arenas, and the GNU manual mentions: "The presence of multiple arenas allows multiple threads to allocate memory simultaneously in separate arenas, thus improving performance."
The jemalloc manpage (http://jemalloc.net/jemalloc.3.html) has this to say:
Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
I don't see how any of these apply to the modern use of sbrk(2), as I understand it. Race conditions are handled by threading primitives. Fragmentation is handled just as would be done with memory arenas allocated by mmap(2). The maximum usable memory is irrelevant, because mmap(2) should be used for any large allocation to reduce fragmentation and to release memory back to the operating system immediately on free(3).
The use of both the application heap (claimed with sbrk) and mmap introduces some additional complexity that might be unnecessary:
Allocated Arena - the main arena uses the application's heap. Other arenas use mmap'd heaps. To map a chunk to a heap, you need to know which case applies. If this bit is 0, the chunk comes from the main arena and the main heap. If this bit is 1, the chunk comes from mmap'd memory and the location of the heap can be computed from the chunk's address.
So the question now is, if we're already using mmap(2), why not just allocate an arena at process start with mmap(2) instead of using sbrk(2)? Especially so if, as quoted, it is necessary to track which allocation type was used. There are several reasons:
mmap(2) may not be supported.
sbrk(2) is already initialized for a process, whereas mmap(2) would introduce additional requirements.
As the glibc wiki says, "If the request is large enough, mmap() is used to request memory directly from the operating system [...] and there may be a limit to how many such mappings there can be at one time."
A memory map allocated with mmap(2) cannot be extended as easily. Linux has mremap(2), but using it limits the allocator to kernels that support it. Premapping many pages with PROT_NONE access uses too much virtual memory. Using MAP_FIXED unmaps, without warning, any mapping that may have been there before. sbrk(2) has none of these problems: it is explicitly designed to let its memory be extended safely, as the sketch below demonstrates.
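A tiny demonstration of that extension property (Linux/glibc assumed): successive sbrk() calls hand back contiguous memory, so the top of the heap can grow in place.

#define _DEFAULT_SOURCE                 /* exposes sbrk() in glibc */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *start = sbrk(0);              /* current program break */
    void *a = sbrk(4096);               /* grow the heap by one page */
    void *b = sbrk(4096);               /* lands contiguously after 'a' */
    printf("break was %p; chunks at %p and %p\n", start, a, b);
    return 0;
}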

mmap() didn't exist in the early versions of Unix. brk() was the only way to increase the size of the data segment of the process at that time. The first version of Unix with mmap() was SunOS in the mid 80's, the first open-source version was BSD-Reno in 1990.
And to be usable for malloc() you don't want to require a real file to back up the memory. In 1988 SunOS implemented /dev/zero for this purpose, and in the 1990's HP-UX implemented the MAP_ANONYMOUS flag.
There are now versions of mmap() that offer a variety of ways to allocate anonymous memory for the heap, with no backing file required.
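For example, on any modern POSIX-ish system you can get zero-filled, file-less heap memory like this (a sketch; MAP_ANONYMOUS is the flag HP-UX pioneered):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16 * 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    /* ... zero-filled memory, usable as heap space ... */
    munmap(p, len);
    return 0;
}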

The obvious advantage is that you can grow the last allocation in place, which is something you can't do with mmap(2) (mremap(2) is a Linux extension, not portable).
For naive (and not-so-naive) programs that use realloc(3), e.g. to append to a string, this translates into a speed boost of one or two orders of magnitude ;-)
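A sketch of the pattern in question (append() is a hypothetical helper, not a real API): each call grows the buffer a little, which is cheap when the allocator can extend the top chunk in place instead of copying.

#include <stdlib.h>
#include <string.h>

/* Naive append: each call grows the buffer by strlen(s) bytes. If the
   allocator can extend the top chunk in place (as a brk-based heap often
   can), most of these realloc() calls avoid a copy. */
char *append(char *buf, size_t *len, const char *s)
{
    size_t n = strlen(s);
    char *p = realloc(buf, *len + n + 1);
    if (p == NULL)
        return NULL;                    /* caller still owns buf */
    memcpy(p + *len, s, n + 1);         /* copy including the NUL */
    *len += n;
    return p;
}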

I don't know the details on Linux specifically, but on FreeBSD for several years now mmap is preferred and jemalloc in FreeBSD's libc has sbrk() completely disabled. brk()/sbrk() are not implemented in the kernel on the newer ports to aarch64 and risc-v.
If I understand the history of jemalloc correctly, it was originally the new allocator in FreeBSD's libc before it was broken out and made portable; now FreeBSD is a downstream consumer of jemalloc. It's very possible that its preference for mmap() over sbrk() originated with the characteristics of the FreeBSD VM system, which was built around implementing the mmap interface.
It's worth noting that brk/sbrk were marked legacy in SUSv2 and removed from POSIX.1-2001, so they should be considered non-portable at this point. If you are working on a new allocator, you probably don't want to depend on them.

Related

Can I enforce sbrk return address to be within a certain specific range?

I want to make sure the return address of sbrk is within a certain specific range. I read somewhere that sbrk allocates from an area allocated at program initialization. So I'm wondering if there's any way I can force the program initialization to allocate from a specific address. For example, with mmap I'd be able to do so with MAP_FIXED_NOREPLACE. Is it possible to have something similar?
No, this is not possible. brk and sbrk refer to the data segment of the program, and that can be loaded at any valid address that meets the needs of the dynamic linker. Different architectures can and do use different addresses, and even machines of the same architecture can use different ranges depending on the configuration of the kernel. Using a fixed address or address range is extremely nonportable and will make your program very brittle to future changes. I fully expect that doing this will cause your program to break in the future simply by upgrading libc.
In addition, modern programs are typically compiled as position-independent executables so that ASLR can be used to improve security. Therefore, even if you knew the address range that was used for one invocation of your program, the very next invocation of your program might use a totally different address range.
In addition, you almost never want to invoke brk or sbrk by hand. In almost all cases, you will want to use the system memory allocator (or a replacement like jemalloc), which will handle this case for you. For example, glibc's malloc implementation, like most others, will allocate large chunks of memory using mmap, which can significantly reduce memory usage in long-running programs, since these large chunks can be freed independently. The memory allocator also may not appreciate you changing the size of the data segment without consulting it.
Finally, in case you care about portability to other Unix systems, not all systems even have brk and sbrk. OpenBSD allocates all memory using mmap which improves security by expanding the use of ASLR (at the cost of performance).
If you absolutely must use a fixed address or address range and there is no alternative, you'll need to use mmap to allocate that range of memory.
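For example (Linux-specific: MAP_FIXED_NOREPLACE requires kernel 4.17 or later, and the address below is purely illustrative):

#define _GNU_SOURCE                     /* MAP_FIXED_NOREPLACE is Linux-only */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    void *want = (void *)0x200000000000; /* purely illustrative address */
    void *p = mmap(want, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED) {              /* EEXIST if the range is occupied */
        perror("mmap");
        return 1;
    }
    if (p != want) {                    /* pre-4.17 kernels ignore the flag */
        munmap(p, 1 << 20);
        return 1;
    }
    printf("mapped at %p\n", p);
    return 0;
}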

When is it more appropriate to use valloc() as opposed to malloc()?

C (and C++) include a family of dynamic memory allocation functions, most of which are intuitively named and easy to explain to a programmer with a basic understanding of memory. malloc() simply allocates memory, while calloc() allocates some memory and clears it eagerly. There are also realloc() and free(), which are pretty self-explanatory.
The manpage for malloc() also mentions valloc(), which allocates (size) bytes aligned to the page border.
Unfortunately, my background isn't thorough enough in low-level intricacies; what are the implications of allocating and using page border-aligned memory, and when is this appropriate as opposed to regular malloc() or calloc()?
The manpage for valloc contains an important note:
The function valloc() appeared in 3.0BSD. It is documented as being obsolete in 4.3BSD, and as legacy in SUSv2. It does not appear in POSIX.1-2001.
valloc is obsolete and nonstandard - to answer your question, it would never be appropriate to use in new code.
While there are some reasons to want to allocate aligned memory - this question lists a few good ones - it is usually better to let the memory allocator figure out which bit of memory to give you. If you are certain that you need your freshly-allocated memory aligned to something, use aligned_alloc (C11) or posix_memalign (POSIX) instead.
Allocations with page alignment usually are not done for speed - they're because you want to take advantage of some feature of your processor's MMU, which typically works with page granularity.
One example is if you want to use mprotect(2) to change the access rights on that memory. Suppose, for instance, that you want to store some data in a chunk of memory, and then make it read only, so that any buggy part of your program that tries to write there will trigger a segfault. Since mprotect(2) can only change permissions page by page (since this is what the underlying CPU hardware can enforce), the block where you store your data had better be page aligned, and its size had better be a multiple of the page size. Otherwise the area you set read-only might include other, unrelated data that still needs to be written.
Or, perhaps you are going to generate some executable code in memory and then want to execute it later. Memory you allocate by default probably isn't set to allow code execution, so you'll have to use mprotect to give it execute permission. Again, this has to be done with page granularity.
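A sketch of the read-only-data case (on Linux, mprotect() on aligned_alloc() memory works in practice, though strictly portable code would obtain the pages from mmap() directly):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    /* aligned_alloc requires the size to be a multiple of the alignment */
    char *p = aligned_alloc((size_t)page, (size_t)page);
    if (p == NULL)
        return 1;
    strcpy(p, "precious data");
    if (mprotect(p, (size_t)page, PROT_READ) != 0)
        return 1;
    /* p[0] = 'X';  -- would now fault immediately */
    mprotect(p, (size_t)page, PROT_READ | PROT_WRITE); /* restore before free */
    free(p);
    return 0;
}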
Another example is if you want to allocate memory now, but might want to mmap something on top of it later.
So in general, a need for page-aligned memory would relate to some fairly low-level application, often involving something system-specific. If you needed it, you'd know. (And as mentioned, you should allocate it not with valloc, but using posix_memalign, or perhaps an anonymous mmap.)
First of all, valloc is obsolete, and memalign should be used instead.
Second, it's not part of the C (or C++) standard at all.
It's a special allocation which is aligned to _SC_PAGESIZE boundary.
When is it useful? I guess never, unless you have some specific low-level requirement. If you needed it, you would know you needed it, since it's rarely useful (maybe just when trying some micro-optimizations or creating shared memory between processes).
The self-evident answer is that it is appropriate to use valloc when malloc is unsuitable (less efficient) for the application (virtual) memory usage pattern and valloc is better suited (more efficient). This will depend on the OS and libraries and architecture and application...
malloc traditionally allocated real memory from freed memory if available and by increasing the brk point if not, in which case it is cleared by the OS for security reasons.
calloc in a dumb implementation does a malloc and then (re)clears the memory, while a smart implementation would avoid reclearing newly allocated memory that is automatically cleared by the operating system.
valloc relates to virtual memory. In a virtual memory system backed by the file system, you can allocate a large amount of memory or filespace/swapspace, even more than physical memory, and it will be swapped in by pages, so alignment is a factor. In Unix, creating a file of a specified size is done using inodes to define the file, without dealing with actual disk blocks until they are needed, at which point they are created cleared. So I would expect a valloc system to increase the size of the data segment swap without actually allocating physical or swap pages, and without running a loop to clear it all, as the file and paging system does that as needed. Thus valloc should be a great deal faster than malloc. But as with calloc, how particular idiosyncratic *x/C flavours do it is up to them, and the valloc man page is totally unhelpful about these expectations.
Traditionally this was implemented with brk/sbrk. Of course, in a virtual memory system, whether paged or segmented, there is no real need for any of this brk/sbrk machinery: it is enough to simply write to the last location in a file or address space to extend it up to that point.
Re the allocation to page boundaries, that is not usually something the user wants or needs, but rather is usually something the system wants or needs.
A (probably more expensive) way to simulate valloc is to determine the page boundary and then call aligned_alloc or posix_memalign with this alignment spec.
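Something like this, assuming the standard POSIX interfaces (my_valloc is of course a made-up name):

#include <stdlib.h>
#include <unistd.h>

/* One way to get what valloc() gave you, using standard interfaces:
   query the page size at runtime and request that alignment. */
void *my_valloc(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = NULL;
    if (posix_memalign(&p, (size_t)page, size) != 0)
        return NULL;
    return p;                           /* release with free(), as usual */
}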
The fact that valloc is deprecated, or has been removed, or is not required in some OSes doesn't mean that it isn't still useful and required for best efficiency in others. If it has been deprecated or removed, one would hope that there are replacements that are as efficient (but I wouldn't bet on it; I have, indeed, written my own malloc replacement).
Over the last 40 years the tradeoffs of real and (once invented) virtual memory have changed periodically, and mainstream OS has tended to go for frills rather than efficiency, with programmers who don't have (time or space) efficiency as a major imperative. In the embedded systems, efficiency is more critical, but even there efficiency is often not well supported by the standard OS and/or tools. But when in doubt, you can roll your own malloc replacement for your application that does what you need, rather than depend on what someone else woke up and decided to do/implement, or to undo/deprecate.
So the real answer is you don't necessarily want to use valloc or malloc or calloc or any of the replacements your current subversion of an OS provides.

Why malloc(1) gives more than one page size?

I have tried on my machine using sbrk(1) and then deliberately writing out of bounds to test the page size, which is 4096 bytes. But when I call malloc(1), I get SEGV only after accessing 135152 bytes, which is way more than one page. I know that malloc is a library function and is implementation dependent, but considering that it eventually calls sbrk, why does it give more than one page? Can anyone tell me about its internal working?
My operating system is ubuntu 14.04 and my architecture is x86
Update: Now I am wondering if it's because malloc returns the address of a free-list block that is large enough to hold my data. That address may be in the middle of the heap, so I can keep writing until the upper limit of the heap is reached.
Older malloc() implementations on UNIX used the sbrk()/brk() system calls, but these days implementations use both mmap() and sbrk(). glibc's malloc() (probably the one on your Ubuntu 14.04) uses both, and the choice of which to use for a given request typically depends on the requested allocation size; glibc makes this decision dynamically.
For small allocations, glibc uses sbrk(), and for larger allocations it uses mmap(). The parameter M_MMAP_THRESHOLD is used to decide this; its default value is 128 KiB. This explains why your code managed to access 135152 bytes, which is roughly 128K: even though you requested only 1 byte, the implementation grew the heap by about 128K for efficiency, so the segfault didn't occur until you crossed that limit.
You can play with M_MMAP_THRESHOLD using mallopt() to change the default parameters.
M_MMAP_THRESHOLD
For allocations greater than or equal to the limit specified (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from the free list, the memory-allocation functions employ mmap(2) instead of increasing the program break using sbrk(2).
Allocating memory using mmap(2) has the significant advantage that the allocated memory blocks can always be independently released back to the system. (By contrast, the heap can be trimmed only if memory is freed at the top end.) On the other hand, there are some disadvantages to the use of mmap(2): deallocated space is not placed on the free list for reuse by later allocations; memory may be wasted because mmap(2) allocations must be page-aligned; and the kernel must perform the expensive task of zeroing out memory allocated via mmap(2). Balancing these factors leads to a default setting of 128*1024 for the M_MMAP_THRESHOLD parameter.
The lower limit for this parameter is 0. The upper limit is DEFAULT_MMAP_THRESHOLD_MAX: 512*1024 on 32-bit systems or 4*1024*1024*sizeof(long) on 64-bit systems.
Note: Nowadays, glibc uses a dynamic mmap threshold by default. The initial value of the threshold is 128*1024, but when blocks larger than the current threshold and less than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold is adjusted upward to the size of the freed block. When dynamic mmap thresholding is in effect, the threshold for trimming the heap is also dynamically adjusted to be twice the dynamic mmap threshold. Dynamic adjustment of the mmap threshold is disabled if any of the M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX parameters is set.
For example, if you do:
#include <malloc.h>

mallopt(M_MMAP_THRESHOLD, 0);
before calling malloc(), you'll likely see a different limit. Most of these are implementation details, and the C standard says it's undefined behaviour to write into memory that your process doesn't own. So do it at your own risk -- otherwise, demons may fly out of your nose ;-)
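If you want to see the behaviour directly, a small probe like this (glibc/Linux assumed; the exact number varies by version) shows the break moving by far more than a page for a one-byte malloc:

#define _DEFAULT_SOURCE                 /* for sbrk() */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Probe: watch how far the break moves for a 1-byte malloc().
   On glibc you will typically see a jump of ~132 KiB, not one page. */
int main(void)
{
    void *before = sbrk(0);
    char *p = malloc(1);
    void *after = sbrk(0);
    printf("malloc(1) -> %p, break moved by %ld bytes\n",
           p, (long)((char *)after - (char *)before));
    free(p);
    return 0;
}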
malloc allocates memory in large blocks for performance reasons. Subsequent calls to malloc can give you memory from the large block instead of having to ask the operating system for a lot of small blocks. This cuts down on the number of system calls needed.
From this article:
When a process needs memory, some room is created by moving the upper bound of the heap forward, using the brk() or sbrk() system calls. Because a system call is expensive in terms of CPU usage, a better strategy is to call brk() to grab a large chunk of memory and then split it as needed to get smaller chunks. This is exactly what malloc() does. It aggregates a lot of smaller malloc() requests into fewer large brk() calls. Doing so yields a significant performance improvement.
Note that some modern implementations of malloc use mmap instead of brk/sbrk to allocate memory, but otherwise the above is still true.

Implement a user-defined malloc() function?

How do you write your own malloc() function in C?
I don't have even the faintest hint of how to do this: how do you map virtual address space to physical space, and what algorithm do you follow? I do know that malloc() is not very efficient, as it can cause fragmentation, so something more efficient could be created. But even if it's not efficient, how would one implement a naive malloc in C?
It was asked recently in an interview.
Wikipedia actually provides a good summary of various malloc implementations, including ones optimized for specific conditions, along with links to learn more about the implementation.
http://en.wikipedia.org/wiki/C_dynamic_memory_allocation
dlmalloc
A general-purpose allocator. The GNU C library (glibc) uses an allocator based on dlmalloc.
jemalloc
In order to avoid lock contention, jemalloc uses separate "arenas" for each CPU. Experiments measuring number of allocations per second in multithreading application have shown that this makes it scale linearly with the number of threads, while for both phkmalloc and dlmalloc performance was inversely proportional to the number of threads.
Hoard memory allocator
Hoard uses mmap exclusively, but manages memory in chunks of 64 kilobytes called superblocks. Hoard's heap is logically divided into a single global heap and a number of per-processor heaps. In addition, there is a thread-local cache that can hold a limited number of superblocks. By allocating only from superblocks on the local per-thread or per-processor heap, and moving mostly-empty superblocks to the global heap so they can be reused by other processors, Hoard keeps fragmentation low while achieving near linear scalability with the number of threads.
tcmalloc
Every thread has local storage for small allocations. For large allocations, mmap or sbrk can be used. TCMalloc, a malloc developed by Google, has garbage collection for the local storage of dead threads. TCMalloc is considered to be more than twice as fast as glibc's ptmalloc for multithreaded programs.
dmalloc (not covered by Wikipedia)
The debug memory allocation or dmalloc library has been designed as a drop in replacement for the system's malloc, realloc, calloc, free and other memory management routines while providing powerful debugging facilities configurable at runtime. These facilities include such things as memory-leak tracking, fence-post write detection, file/line number reporting, and general logging of statistics.
There is a nice example of a malloc implementation in Chapter 8.7 of The C Programming Language (K&R):
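The book's code is worth reading in full; what follows is a condensed sketch in the same spirit (a circular free list with first-fit search and sbrk() for more core), purely illustrative, with no thread safety or error diagnostics.

#define _DEFAULT_SOURCE                 /* exposes sbrk() in glibc */
#include <stddef.h>
#include <unistd.h>

typedef union header {
    struct {
        union header *next;             /* next block on the free list */
        size_t size;                    /* block size, in header units */
    } s;
    long double align;                  /* force worst-case alignment */
} Header;

static Header base;                     /* degenerate list to start with */
static Header *freep = NULL;

void kr_free(void *ap)                  /* assumes kr_malloc ran at least once */
{
    Header *bp = (Header *)ap - 1, *p;
    for (p = freep; !(bp > p && bp < p->s.next); p = p->s.next)
        if (p >= p->s.next && (bp > p || bp < p->s.next))
            break;                      /* freed block at start or end of arena */
    if (bp + bp->s.size == p->s.next) { /* coalesce with upper neighbour */
        bp->s.size += p->s.next->s.size;
        bp->s.next = p->s.next->s.next;
    } else
        bp->s.next = p->s.next;
    if (p + p->s.size == bp) {          /* coalesce with lower neighbour */
        p->s.size += bp->s.size;
        p->s.next = bp->s.next;
    } else
        p->s.next = bp;
    freep = p;
}

static Header *morecore(size_t nunits)
{
    if (nunits < 4096)                  /* ask the OS in big chunks */
        nunits = 4096;
    char *cp = sbrk(nunits * sizeof(Header));
    if (cp == (char *)-1)
        return NULL;
    Header *up = (Header *)cp;
    up->s.size = nunits;
    kr_free(up + 1);                    /* add the new core to the free list */
    return freep;
}

void *kr_malloc(size_t nbytes)
{
    size_t nunits = (nbytes + sizeof(Header) - 1) / sizeof(Header) + 1;
    Header *p, *prevp = freep;
    if (prevp == NULL) {                /* first call: make a one-element list */
        base.s.next = freep = prevp = &base;
        base.s.size = 0;
    }
    for (p = prevp->s.next; ; prevp = p, p = p->s.next) {
        if (p->s.size >= nunits) {      /* first fit */
            if (p->s.size == nunits)    /* exact fit: unlink the block */
                prevp->s.next = p->s.next;
            else {                      /* split: return the tail end */
                p->s.size -= nunits;
                p += p->s.size;
                p->s.size = nunits;
            }
            freep = prevp;
            return (void *)(p + 1);
        }
        if (p == freep)                 /* wrapped around: need more core */
            if ((p = morecore(nunits)) == NULL)
                return NULL;
    }
}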

making your own malloc function?

I read that some games write their own malloc to be more efficient. I don't understand how this is possible in a virtual memory world. If I recall correctly, malloc actually calls an OS-specific function, which maps the virtual address to a real address with the MMU. So how can someone make their own memory allocator and allocate real memory without calling the actual runtime's malloc?
Thanks
It's certainly possible to write an allocator more efficient than a general purpose one.
If you know the properties of your allocations, you can blow general purpose allocators out of the water.
Case in point: many years ago, we had to design and code up a communication subsystem (HDLC, X.25 and proprietary layers) for embedded systems. The fact that we knew the maximum allocation would always be less than 128 bytes (or something like that) meant that we didn't have to mess around with variable sized blocks at all. Every allocation was for 128 bytes no matter how much you asked for.
Of course, if you asked for more, it returned NULL.
By using fixed-length blocks, we were able to speed up allocations and de-allocations greatly, using bitmaps and associated structures to hold accounting information rather than relying on slower linked lists. In addition, there was no need to coalesce freed blocks.
Granted, this was a special case but you'll find that's so for games as well. In fact, we've even used this in a general purpose system where allocations below a certain threshold got a fixed amount of memory from a self-managed pre-allocated pool done the same way. Any other allocations (larger than the threshold or if the pool was fully allocated) were sent through to the "real" malloc.
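A sketch of that fixed-block scheme (illustrative names and sizes; a real implementation would also worry about alignment guarantees and thread safety):

#include <stddef.h>
#include <stdint.h>

/* Every allocation gets one 128-byte block; a bitmap tracks which blocks
   are in use. Alloc/free are cheap and no coalescing is ever needed. */
#define BLOCK_SIZE 128
#define NBLOCKS    1024

static _Alignas(16) unsigned char pool[NBLOCKS][BLOCK_SIZE];
static uint32_t used[NBLOCKS / 32];     /* 1 bit per block */

void *pool_alloc(size_t n)
{
    if (n > BLOCK_SIZE)
        return NULL;                    /* in the hybrid scheme, fall through
                                           to the "real" malloc here */
    for (size_t w = 0; w < NBLOCKS / 32; w++) {
        if (used[w] != 0xFFFFFFFFu) {   /* this word has a free bit */
            unsigned b = 0;
            while (used[w] & (1u << b))
                b++;
            used[w] |= 1u << b;
            return pool[w * 32 + b];
        }
    }
    return NULL;                        /* pool exhausted */
}

void pool_free(void *p)
{
    size_t i = (size_t)((unsigned char (*)[BLOCK_SIZE])p - pool);
    used[i / 32] &= ~(1u << (i % 32));
}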
Just because malloc() is a standard C function doesn't mean that it's the lowest level access you have to the memory system. In fact, malloc() is probably implemented in terms of lower-level operating system functionality. That means you could call those lower level interfaces too. They might be OS-specific, but they might allow you better performance than you would get from the malloc() interface. If that were the case, you could implement your own memory allocation system any way you want, and maybe be even more efficient about it - optimizing the algorithm for the characteristics of the size and frequency of allocations you're going to make, for example.
In general, malloc will call an OS-specific function to obtain a bunch of memory (at least one VM page), and will then divide that memory up into smaller chunks as needed to return to the caller of malloc.
The malloc library will also have a list (or lists) of free blocks, so it can often meet a request without asking the OS for more memory. Determining how many different block sizes to handle, deciding whether to attempt to combine adjacent free blocks, and so forth, are the choices the malloc library implementor has to make.
It's possible for you to bypass the malloc library and directly invoke the OS-level "give me some memory" function and do your own allocation/freeing within the memory you get from the OS. Such implementations are likely to be OS-specific. Another alternative is to use malloc for initial allocations, but maintain your own cache of freed objects.
One thing you can do is have your allocator allocate a pool of memory, then service requests from that (and allocate a bigger pool if it runs out). I'm not sure if that's what they're doing, though.
If I recall correctly, malloc actually calls an OS specific function
Not quite. Most hardware has a 4KB page size. Operating systems generally don't expose a memory allocation interface offering anything smaller than page-sized (and page-aligned) chunks.
malloc spends most of its time managing the virtual memory space that has already been allocated, and only occasionally requests more memory from the OS (obviously this depends on the size of the items you allocate and how often you free).
There is a common misconception that when you free something it is immediately returned to the operating system. While this sometimes occurs (particularly for larger memory blocks) it is generally the case that freed memory remains allocated to the process and can then be re-used by later mallocs.
So most of the work is in bookkeeping of already-allocated virtual space. Allocation strategies can have many aims, such as fast operation, low memory wastage, good locality, space for dynamic growth (e.g. realloc) and so on.
If you know more about your pattern of memory allocation and release, you can optimise malloc and free for your usage patterns or provide a more extensive interface.
For instance, you may be allocating lots of equal-sized objects, which may change the optimal allocation parameters. Or you may always free large amounts of objects at once, in which case you don't want free to be doing fancy things.
Have a look at memory pools and obstacks.
See How do games like GTA IV not fragment the heap?.
