Why does malloc(1) give more than one page? - c

I have tried on my machine using sbrk(1) and then deliberately writing out of bounds to test the page size, which is 4096 bytes. But when I call malloc(1), I get a SEGV only after accessing 135152 bytes, which is far more than one page. I know that malloc is a library function and is implementation dependent, but considering that it eventually calls sbrk, why does it give more than one page? Can anyone explain its internal workings?
My operating system is Ubuntu 14.04 and my architecture is x86.
Update: Now I am wondering whether it's because malloc returns the address of a free-list block that is large enough to hold my data. That address may be in the middle of the heap, so I can keep writing until the upper limit of the heap is reached.

Older malloc() implementations on UNIX used the sbrk()/brk() system calls. These days, implementations use mmap() and sbrk(). The malloc() implementation in glibc (probably the one you use on your Ubuntu 14.04) uses both sbrk() and mmap(), and the choice of which one to use for a given request typically depends on the size of the allocation, which glibc decides dynamically.
For small allocations, glibc uses sbrk(), and for larger allocations it uses mmap(). The macro M_MMAP_THRESHOLD is used to decide this; currently its default value is 128 KB. This explains why your code managed to access 135152 bytes, which is roughly 128 KB: even though you requested only 1 byte, your implementation extended the heap by about 128 KB for efficient memory allocation, so the segfault didn't occur until you crossed this limit.
You can play with M_MMAP_THRESHOLD by using mallopt() to change the default parameters.
M_MMAP_THRESHOLD
For allocations greater than or equal to the limit
specified (in bytes) by M_MMAP_THRESHOLD that can't be satisfied from
the free list, the memory-allocation functions employ mmap(2) instead
of increasing the program break using sbrk(2).
Allocating memory using mmap(2) has the significant advantage that
the allocated memory blocks can always be independently released back
to the system. (By contrast, the heap can be trimmed only if memory
is freed at the top end.) On the other hand, there are some
disadvantages to the use of mmap(2): deallocated space is not placed
on the free list for reuse by later allocations; memory may be wasted
because mmap(2) allocations must be page-aligned; and the kernel must
perform the expensive task of zeroing out memory allocated via
mmap(2). Balancing these factors leads to a default setting of
128*1024 for the M_MMAP_THRESHOLD parameter.
The lower limit for this parameter is 0. The upper limit is
DEFAULT_MMAP_THRESHOLD_MAX: 512*1024 on 32-bit systems or
4*1024*1024*sizeof(long) on 64-bit systems.
Note: Nowadays, glibc uses a dynamic mmap threshold by default.
The initial value of the threshold is 128*1024, but when blocks
larger than the current threshold and less than or equal to
DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold is adjusted
upward to the size of the freed block. When dynamic mmap
thresholding is in effect, the threshold for trimming the heap is also
dynamically adjusted to be twice the dynamic mmap threshold. Dynamic
adjustment of the mmap threshold is disabled if any of the
M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX
parameters is set.
For example, if you do:
#include <malloc.h>
mallopt(M_MMAP_THRESHOLD, 0);
before calling malloc(), you'll likely see a different limit. Most of these are implementation details, and the C standard says it's undefined behaviour to write into memory that your process doesn't own. So do it at your own risk -- otherwise, demons may fly out of your nose ;-)

malloc allocates memory in large blocks for performance reasons. Subsequent calls to malloc can give you memory from the large block instead of having to ask the operating system for a lot of small blocks. This cuts down on the number of system calls needed.
From this article:
When a process needs memory, some room is created by moving the upper bound of the heap forward, using the brk() or sbrk() system calls. Because a system call is expensive in terms of CPU usage, a better strategy is to call brk() to grab a large chunk of memory and then split it as needed to get smaller chunks. This is exactly what malloc() does. It aggregates a lot of smaller malloc() requests into fewer large brk() calls. Doing so yields a significant performance improvement.
Note that some modern implementations of malloc use mmap instead of brk/sbrk to allocate memory, but otherwise the above is still true.

Related

Advantages of mmap() over sbrk()?

From my book:
Recall from our first discussion that modern dynamic memory managers
not only use sbrk() but also mmap(). This process helps reduce the
negative effects of memory fragmentation when large blocks of memory
are freed but locked by smaller, more recently allocated blocks lying
between them and the end of the allocated space. In this case, had the
block been allocated with sbrk(), it would have probably remained
unused by the system for some time (or at least most of it).
Can someone kindly explain how using mmap reduces the negative effects of memory fragmentation? The given example didn't make sense to me and wasn't clear at all.
it would have probably remained unused by the system for some time
Why was this claim made? When we free the block, the system can use it later. Maybe the OS keeps a list of freed blocks in the heap so it can reuse them when possible instead of consuming more heap space.
Please address both questions.
Advantages of mmap() over sbrk()?
brk/sbrk is LIFO. Let's say you increase the segment size by X bytes to make room for allocation A, then by another X bytes for allocation B, and then free A. You cannot reduce the allocated memory, because B is still allocated. And since the segment is shared across the entire program, if multiple parts of the program use it directly, you have no way of knowing whether a particular part is still in use. And if one part of the program (say, malloc) assumes entire control over the use of brk/sbrk, then calling them elsewhere will break the program.
By contrast, mmap can be unmapped in any order and allocation by one part of the program doesn't conflict with other parts of the program.
brk/sbrk are not part of the POSIX standard and thus not portable.
By contrast, mmap is standard and portable.
mmap can also do things like map files into memory which is not possible using brk/sbrk.
it would have probably remained unused by the system for some time
Why this claim was made
See 1.
Maybe the OS keeps list of freed block
There are no "blocks". There is one (virtual) block called the data segment. brk/sbrk sets the size of that block.
But doesn't mmap allocate on the heap?
No. "Heap" is at the end of the data segment and heap is what grows using brk/sbrk. mmap does not allocate in the area of memory that has been allocated using brk/sbrk.
mmap creates a new segment elsewhere in the address space.
does malloc actually save the free blocks that were allocated with sbrk for later usage?
If it was allocated using brk/sbrk in the first place, and if malloc hasn't reduced the size of the "heap" (in case that was possible), then malloc may reuse a free "slot" that was previously freed. It would be a useful thing to do.
"then calling them elsewhere will break the program." Can you give an example?
malloc(42);
sbrk(42);
malloc(42); // maybe kaboom, who knows?
In conclusion: Just don't use brk/sbrk to set the segment size. Perhaps there's little reason to use (anonymous) mmap either. Use malloc in C.
When sbrk() is used, the heap is just one, large block of memory. If your pattern of allocating and freeing doesn't leave large, contiguous blocks of memory, every large allocation will need to grow the heap. This can result in inefficient memory use, because of all the unused gaps that are left in the heap.
With mmap(), you can have a bunch of independent blocks of mapped memory. So you could use the sbrk() heap for your small allocations, which can be packed neatly, and use mmap() for large allocations. When you're done with one of these large blocks, you can just remove the entire mapping.

Is it safe to use mmap and malloc to allocate memory in same program?

Till now what I understood is as follow:
malloc internally uses sbrk and brk to allocate memory by increasing top of heap.
mmap allocate memory in form of pages.
Now, let's say the current top of the sbrk heap is 0x001000, and I use mmap to allocate a 4 KB page, which ends up at 0x002000. Later, if I use malloc many times and it keeps raising the sbrk top, what happens when that top reaches 0x002000?
So, it would be great if someone could clarify the following:
Is the above scenario possible?
If not, please point out the flaw in my understanding of malloc and mmap.
If yes, then I assume it is not safe to use them this way. So, is there any other way to use both safely?
Thank you.
malloc is normally not implemented this way today. malloc used sbrk(2) in old implementations, when extending the data segment was the only way to ask the system for more virtual memory. Newer systems use mmap(2) if available, as it allows more flexibility when the virtual address space is large enough (each mmapped chunk is managed as a new data segment for the process requesting it). sbrk(2) expands and shrinks the data segment, just like a stack, so you have to be careful using sbrk(2) yourself in case it is intermixed with an sbrk-based implementation of malloc. The way malloc operates normally disallows returning any memory obtained with sbrk(2) if you intermix the calls, so you can only use it to grow the data segment safely.
sbrk(2) also allocates memory in pages. Since paged virtual memory emerged, almost all OS allocation is made in page units. Newer systems even have more than one page size (e.g. 4 KB and 2 MB), so you can benefit from that, depending on the application.
As 64-bit systems see more and more use, there's no problem allocating an address space large enough for both mechanisms to live together. This is an advantage for a multiple-heap malloc implementation, as memory is allocated and deallocated independently, and never in LIFO order.
malloc uses different approaches to allocate memory, but implementations normally try not to interfere with the user's own sbrk(2) usage. You have to be careful: if you intermix malloc(3) calls with sbrk(2) on an sbrk-based malloc system, you run the risk of sbrk(2)ing over the malloc-adjusted data segment and breaking malloc's internal data structures. You had better not use sbrk(2) yourself if you are using an sbrk(2) implementation of malloc.
Finally, to answer your question: mmap(2) allocates memory just as malloc(3) does, so malloc is not, and need not be, aware of the memory you allocated for your own use with mmap(2).

In malloc, why use brk at all? Why not just use mmap?

Typical implementations of malloc use brk/sbrk as the primary means of claiming memory from the OS. However, they also use mmap to get chunks for large allocations. Is there a real benefit to using brk instead of mmap, or is it just tradition? Wouldn't it work just as well to do it all with mmap?
(Note: I use sbrk and brk interchangeably here because they are interfaces to the same Linux system call, brk.)
For reference, here are a couple of documents describing the glibc malloc:
GNU C Library Reference Manual: The GNU Allocator
https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html
glibc wiki: Overview of Malloc
https://sourceware.org/glibc/wiki/MallocInternals
What these documents describe is that sbrk is used to claim a primary arena for small allocations, mmap is used to claim secondary arenas, and mmap is also used to claim space for large objects ("much larger than a page").
The use of both the application heap (claimed with sbrk) and mmap introduces some additional complexity that might be unnecessary:
Allocated Arena - the main arena uses the application's heap. Other arenas use mmap'd heaps. To map a chunk to a heap, you need to know which case applies. If this bit is 0, the chunk comes from the main arena and the main heap. If this bit is 1, the chunk comes from mmap'd memory and the location of the heap can be computed from the chunk's address.
[Glibc malloc is derived from ptmalloc, which was derived from dlmalloc, which was started in 1987.]
The jemalloc manpage (http://jemalloc.net/jemalloc.3.html) has this to say:
Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
So, they even say here that sbrk is suboptimal but they use it anyway, even though they've already gone to the trouble of writing their code so that it works without it.
[Writing of jemalloc started in 2005.]
UPDATE: Thinking about this more, that bit about "in order of preference" gives me a line of inquiry. Why the order of preference? Are they just using sbrk as a fallback in case mmap is not supported (or lacks necessary features), or is it possible for the process to get into some state where it can use sbrk but not mmap? I'll look at their code and see if I can figure out what it's doing.
I'm asking because I'm implementing a garbage collection system in C, and so far I see no reason to use anything besides mmap. I'm wondering if there's something I'm missing, though.
(In my case I have an additional reason to avoid brk, which is that I might need to use malloc at some point.)
The system call brk() has the advantage of having only a single data item to track memory use, which happily is also directly related to the total size of the heap.
This has been in the exact same form since 1975's Unix V6. Mind you, V6 supported a user address space of 65,535 bytes. So there wasn't a lot of thought given for managing much more than 64K, certainly not terabytes.
Using mmap seems reasonable until I start wondering how a modified or bolted-on garbage collector could use mmap without also rewriting the allocation algorithm.
Will that work nicely with realloc(), fork(), etc.?
Calling mmap(2) once per memory allocation is not a viable approach for a general purpose memory allocator because the allocation granularity (the smallest individual unit which may be allocated at a time) for mmap(2) is PAGESIZE (usually 4096 bytes), and because it requires a slow and complicated syscall. The allocator fast path for small allocations with low fragmentation should require no syscalls.
So regardless what strategy you use, you still need to support multiple of what glibc calls memory arenas, and the GNU manual mentions: "The presence of multiple arenas allows multiple threads to allocate memory simultaneously in separate arenas, thus improving performance."
The jemalloc manpage (http://jemalloc.net/jemalloc.3.html) has this to say:
Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
I don't see how any of these apply to the modern use of sbrk(2), as I understand it. Race conditions are handled by threading primitives. Fragmentation is handled just as would be done with memory arenas allocated by mmap(2). The maximum usable memory is irrelevant, because mmap(2) should be used for any large allocation to reduce fragmentation and to release memory back to the operating system immediately on free(3).
The use of both the application heap (claimed with sbrk) and mmap introduces some additional complexity that might be unnecessary:
Allocated Arena - the main arena uses the application's heap. Other arenas use mmap'd heaps. To map a chunk to a heap, you need to know which case applies. If this bit is 0, the chunk comes from the main arena and the main heap. If this bit is 1, the chunk comes from mmap'd memory and the location of the heap can be computed from the chunk's address.
So the question now is, if we're already using mmap(2), why not just allocate an arena at process start with mmap(2) instead of using sbrk(2)? Especially so if, as quoted, it is necessary to track which allocation type was used. There are several reasons:
mmap(2) may not be supported.
sbrk(2) is already initialized for a process, whereas mmap(2) would introduce additional requirements.
As glibc wiki says, "If the request is large enough, mmap() is used to request memory directly from the operating system [...] and there may be a limit to how many such mappings there can be at one time. "
A memory map allocated with mmap(2) cannot be extended as easily. Linux has mremap(2), but its use limits the allocator to kernels which support it. Premapping many pages with PROT_NONE access uses too much virtual memory. Using MAP_FIXED unmaps any mapping which may have been there before, without warning. sbrk(2) has none of these problems, and is explicitly designed to allow for extending its memory safely.
mmap() didn't exist in the early versions of Unix. brk() was the only way to increase the size of the data segment of the process at that time. The first version of Unix with mmap() was SunOS in the mid 80's, the first open-source version was BSD-Reno in 1990.
And to be usable for malloc() you don't want to require a real file to back the memory. In 1988 SunOS implemented /dev/zero for this purpose, and in the 1990s HP-UX implemented the MAP_ANONYMOUS flag.
There are now versions of mmap() that offer a variety of methods to allocate the heap.
The obvious advantage is that you can grow the last allocation in place, which is something you can't do with mmap(2) (mremap(2) is a Linux extension and not portable).
For naive (and not-so-naive) programs which use realloc(3), e.g. to append to a string, this translates into a speed boost of one or two orders of magnitude ;-)
I don't know the details on Linux specifically, but on FreeBSD for several years now mmap is preferred and jemalloc in FreeBSD's libc has sbrk() completely disabled. brk()/sbrk() are not implemented in the kernel on the newer ports to aarch64 and risc-v.
If I understand the history of jemalloc correctly, it was originally the new allocator in FreeBSD's libc before it was broken out and made portable. Now FreeBSD is a downstream consumer of jemalloc. It's very possible that its preference for mmap() over sbrk() originated with the characteristics of the FreeBSD VM system, which was built around implementing the mmap interface.
It's worth noting that in SUS and POSIX brk/sbrk are deprecated and should be considered non-portable at this point. If you are working on a new allocator you probably don't want to depend on them.

How to find how much memory is actually used up by a malloc call?

If I call:
char *myChar = (char *)malloc(sizeof(char));
I am likely to be using more than 1 byte of memory, because malloc is likely to be using some memory on its own to keep track of free blocks in the heap, and it may effectively cost me some memory by always aligning allocations along certain boundaries.
My question is: Is there a way to find out how much memory is really used up by a particular malloc call, including the effective cost of alignment, and the overhead used by malloc/free?
Just to be clear, I am not asking to find out how much memory a pointer points to after a call to malloc. Rather, I am debugging a program that uses a great deal of memory, and I want to be aware of which parts of the code are allocating how much memory. I'd like to be able to have internal memory accounting that very closely matches the numbers reported by top. Ideally, I'd like to be able to do this programmatically on a per-malloc-call basis, as opposed to getting a summary at a checkpoint.
There isn't a portable solution to this, however there may be operating-system specific solutions for the environments you're interested in.
For example, with glibc on Linux, you can use the mallinfo() function from <malloc.h>, which returns a struct mallinfo. The uordblks and hblkhd members of this structure contain the dynamically allocated address space used by the program, including book-keeping overhead; if you take the difference before and after each malloc() call, you will know the amount of space used by that call. (The overhead is not necessarily constant for every call to malloc().)
Using your example:
#include <malloc.h>   /* mallinfo (glibc) */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t s = sizeof(char);
    struct mallinfo before, after;
    int mused;

    before = mallinfo();
    char *myChar = malloc(s);
    after = mallinfo();

    mused = (after.uordblks - before.uordblks) + (after.hblkhd - before.hblkhd);
    printf("Requested size %zu, used space %d, overhead %d\n",
           s, mused, mused - (int)s);

    free(myChar);
    return 0;
}
Really though, the overhead is likely to be pretty minor unless you are making a very large number of very small allocations, which is a bad idea anyway.
It really depends on the implementation. You should really use some memory debugger. On Linux, Valgrind's Massif tool can be useful. There are also memory-debugging libraries like dmalloc, ...
That said, typical overhead:
1 int for storing size + flags of this block.
possibly 1 int for storing the size of the previous/next block, to assist in coalescing blocks.
2 pointers, but these may only be used in free()'d blocks, being reused for application storage in allocated blocks.
Alignment to an appropriate type, e.g. double.
-1 int (yes, that's a minus) for the next/previous chunk's field containing our size if we are an allocated block, since we cannot be coalesced until we're freed.
So, the minimum size can be 16 to 24 bytes, and the minimum overhead can be 4 bytes.
But you could also satisfy every allocation by mapping whole memory pages (typically 4 KB), which would mean the overhead for smaller allocations would be huge. I think OpenBSD does this.
There is nothing defined in the C library to query the total amount of physical memory used by a malloc() call. The amount of memory allocated is controlled by whatever memory manager is hooked up behind the scenes that malloc() calls into. That memory manager can allocate as much extra memory as it deems necessary for its internal tracking purposes, on top of whatever extra memory the OS itself requires. When you call free(), it accesses the memory manager, which knows how to access that extra memory so it all gets released properly, but there is no way for you to know how much memory that involves. If you need that much fine detail, then you need to write your own memory manager.
If you do use valgrind/Massif, there's an option to show either the malloc value or the top value, which differ a LOT in my experience. Here's an excerpt from the Valgrind manual http://valgrind.org/docs/manual/ms-manual.html :
...However, if you wish to measure all the memory used by your program,
you can use the --pages-as-heap=yes. When this option is enabled,
Massif's normal heap block profiling is replaced by lower-level page
profiling. Every page allocated via mmap and similar system calls is
treated as a distinct block. This means that code, data and BSS
segments are all measured, as they are just memory pages. Even the
stack is measured...

making your own malloc function?

I read that some games rewrite their own malloc to be more efficient. I don't understand how this is possible in a virtual memory world. If I recall correctly, malloc actually calls an OS specific function, which maps the virtual address to a real address with the MMU. So then how can someone make their own memory allocator and allocate real memory, without calling the actual runtime's malloc?
Thanks
It's certainly possible to write an allocator more efficient than a general purpose one.
If you know the properties of your allocations, you can blow general purpose allocators out of the water.
Case in point: many years ago, we had to design and code up a communication subsystem (HDLC, X.25 and proprietary layers) for embedded systems. The fact that we knew the maximum allocation would always be less than 128 bytes (or something like that) meant that we didn't have to mess around with variable sized blocks at all. Every allocation was for 128 bytes no matter how much you asked for.
Of course, if you asked for more, it returned NULL.
By using fixed-length blocks, we were able to speed up allocations and de-allocations greatly, using bitmaps and associated structures to hold accounting information rather than relying on slower linked lists. In addition, there was no need to coalesce freed blocks.
Granted, this was a special case but you'll find that's so for games as well. In fact, we've even used this in a general purpose system where allocations below a certain threshold got a fixed amount of memory from a self-managed pre-allocated pool done the same way. Any other allocations (larger than the threshold or if the pool was fully allocated) were sent through to the "real" malloc.
Just because malloc() is a standard C function doesn't mean that it's the lowest level access you have to the memory system. In fact, malloc() is probably implemented in terms of lower-level operating system functionality. That means you could call those lower level interfaces too. They might be OS-specific, but they might allow you better performance than you would get from the malloc() interface. If that were the case, you could implement your own memory allocation system any way you want, and maybe be even more efficient about it - optimizing the algorithm for the characteristics of the size and frequency of allocations you're going to make, for example.
In general, malloc will call an OS-specific function to obtain a bunch of memory (at least one VM page), and will then divide that memory up into smaller chunks as needed to return to the caller of malloc.
The malloc library will also have a list (or lists) of free blocks, so it can often meet a request without asking the OS for more memory. Determining how many different block sizes to handle, deciding whether to attempt to combine adjacent free blocks, and so forth, are the choices the malloc library implementor has to make.
It's possible for you to bypass the malloc library and directly invoke the OS-level "give me some memory" function and do your own allocation/freeing within the memory you get from the OS. Such implementations are likely to be OS-specific. Another alternative is to use malloc for initial allocations, but maintain your own cache of freed objects.
One thing you can do is have your allocator allocate a pool of memory, then service requests from that (and allocate a bigger pool if it runs out). I'm not sure if that's what they're doing, though.
If I recall correctly, malloc actually
calls an OS specific function
Not quite. Most hardware has a 4KB page size. Operating systems generally don't expose a memory allocation interface offering anything smaller than page-sized (and page-aligned) chunks.
malloc spends most of its time managing the virtual memory space that has already been allocated, and only occasionally requests more memory from the OS (obviously this depends on the size of the items you allocate and how often you free).
There is a common misconception that when you free something it is immediately returned to the operating system. While this sometimes occurs (particularly for larger memory blocks) it is generally the case that freed memory remains allocated to the process and can then be re-used by later mallocs.
So most of the work is in bookkeeping of already-allocated virtual space. Allocation strategies can have many aims, such as fast operation, low memory wastage, good locality, space for dynamic growth (e.g. realloc) and so on.
If you know more about your pattern of memory allocation and release, you can optimise malloc and free for your usage patterns or provide a more extensive interface.
For instance, you may be allocating lots of equal-sized objects, which may change the optimal allocation parameters. Or you may always free large amounts of objects at once, in which case you don't want free to be doing fancy things.
Have a look at memory pools and obstacks.
See How do games like GTA IV not fragment the heap?.
