Implement a user-defined malloc() function? - c

How do you create a new malloc() function in C?
I don't have even the slightest hint as to how to do this: how to map virtual address space to physical memory, or what algorithm to follow. I do know that malloc() is not so efficient, as it can cause fragmentation, so something more efficient could be created. Even if it is not efficient, how do you implement a naive C malloc function?
It was asked recently in an interview.

Wikipedia actually provides a good summary of various malloc implementations, including ones optimized for specific conditions, along with links to learn more about the implementation.
http://en.wikipedia.org/wiki/C_dynamic_memory_allocation
dlmalloc
A general-purpose allocator. The GNU C library (glibc) uses an allocator based on dlmalloc.
jemalloc
In order to avoid lock contention, jemalloc uses separate "arenas" for each CPU. Experiments measuring number of allocations per second in multithreading application have shown that this makes it scale linearly with the number of threads, while for both phkmalloc and dlmalloc performance was inversely proportional to the number of threads.
Hoard memory allocator
Hoard uses mmap exclusively, but manages memory in chunks of 64 kilobytes called superblocks. Hoard's heap is logically divided into a single global heap and a number of per-processor heaps. In addition, there is a thread-local cache that can hold a limited number of superblocks. By allocating only from superblocks on the local per-thread or per-processor heap, and moving mostly-empty superblocks to the global heap so they can be reused by other processors, Hoard keeps fragmentation low while achieving near linear scalability with the number of threads.
tcmalloc
Every thread has local storage for small allocations. For large allocations mmap or sbrk can be used. TCMalloc, a malloc developed by Google, has garbage-collection for local storage of dead threads. The TCMalloc is considered to be more than twice as fast as glibc's ptmalloc for multithreaded programs.
dmalloc (not covered by Wikipedia)
The debug memory allocation or dmalloc library has been designed as a drop in replacement for the system's malloc, realloc, calloc, free and other memory management routines while providing powerful debugging facilities configurable at runtime. These facilities include such things as memory-leak tracking, fence-post write detection, file/line number reporting, and general logging of statistics.

I think there is a nice example of malloc in Section 8.7 of the book The C Programming Language.
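For readers without the book at hand, here is a minimal first-fit free-list sketch in the same spirit. It is not the book's code: it carves blocks out of a fixed static array instead of asking the OS for memory, and the names my_malloc, my_free and HEAP_UNITS are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Block header; the union forces worst-case alignment for the payload. */
typedef union header {
    struct {
        union header *next;  /* next free block, in address order */
        size_t units;        /* block size in header-sized units, header included */
    } s;
    max_align_t align;
} Header;

#define HEAP_UNITS 1024          /* hypothetical heap size, in units */
static Header heap[HEAP_UNITS];  /* assumption: fixed heap instead of sbrk()/mmap() */
static Header *free_list = NULL;
static int initialized = 0;

void *my_malloc(size_t nbytes)
{
    if (!initialized) {          /* start with one big free block */
        heap[0].s.next = NULL;
        heap[0].s.units = HEAP_UNITS;
        free_list = &heap[0];
        initialized = 1;
    }
    if (nbytes > sizeof heap)
        return NULL;
    /* Payload rounded up to whole units, plus one unit for the header. */
    size_t units = (nbytes + sizeof(Header) - 1) / sizeof(Header) + 1;

    Header **link = &free_list;
    for (Header *p = free_list; p != NULL; link = &p->s.next, p = p->s.next) {
        if (p->s.units < units)
            continue;            /* too small: keep scanning (first fit) */
        if (p->s.units == units) {
            *link = p->s.next;   /* exact fit: unlink the whole block */
        } else {
            p->s.units -= units; /* split: hand out the tail end */
            p += p->s.units;
            p->s.units = units;
        }
        return (void *)(p + 1);  /* payload starts right after the header */
    }
    return NULL;                 /* no block is large enough */
}

void my_free(void *ptr)
{
    if (ptr == NULL)
        return;
    Header *b = (Header *)ptr - 1;
    assert(b >= heap && b < heap + HEAP_UNITS);
    Header *prev = NULL, *p = free_list;
    while (p != NULL && p < b) { /* find the insertion point by address */
        prev = p;
        p = p->s.next;
    }
    if (p != NULL && b + b->s.units == p) {  /* merge with upper neighbour */
        b->s.units += p->s.units;
        b->s.next = p->s.next;
    } else {
        b->s.next = p;
    }
    if (prev != NULL && prev + prev->s.units == b) {  /* merge with lower neighbour */
        prev->s.units += b->s.units;
        prev->s.next = b->s.next;
    } else if (prev != NULL) {
        prev->s.next = b;
    } else {
        free_list = b;
    }
}
```

Keeping the free list sorted by address is what makes coalescing on free cheap: adjacent free blocks are always list neighbours.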

Related

In malloc, why use brk at all? Why not just use mmap?

Typical implementations of malloc use brk/sbrk as the primary means of claiming memory from the OS. However, they also use mmap to get chunks for large allocations. Is there a real benefit to using brk instead of mmap, or is it just tradition? Wouldn't it work just as well to do it all with mmap?
(Note: I use sbrk and brk interchangeably here because they are interfaces to the same Linux system call, brk.)
For reference, here are a couple of documents describing the glibc malloc:
GNU C Library Reference Manual: The GNU Allocator
https://www.gnu.org/software/libc/manual/html_node/The-GNU-Allocator.html
glibc wiki: Overview of Malloc
https://sourceware.org/glibc/wiki/MallocInternals
What these documents describe is that sbrk is used to claim a primary arena for small allocations, mmap is used to claim secondary arenas, and mmap is also used to claim space for large objects ("much larger than a page").
The use of both the application heap (claimed with sbrk) and mmap introduces some additional complexity that might be unnecessary:
Allocated Arena - the main arena uses the application's heap. Other arenas use mmap'd heaps. To map a chunk to a heap, you need to know which case applies. If this bit is 0, the chunk comes from the main arena and the main heap. If this bit is 1, the chunk comes from mmap'd memory and the location of the heap can be computed from the chunk's address.
[Glibc malloc is derived from ptmalloc, which was derived from dlmalloc, which was started in 1987.]
The jemalloc manpage (http://jemalloc.net/jemalloc.3.html) has this to say:
Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
So, they even say here that sbrk is suboptimal but they use it anyway, even though they've already gone to the trouble of writing their code so that it works without it.
[Writing of jemalloc started in 2005.]
UPDATE: Thinking about this more, that bit about "in that order of preference" gives me a line of inquiry. Why the order of preference? Are they just using sbrk as a fallback in case mmap is not supported (or lacks necessary features), or is it possible for the process to get into some state where it can use sbrk but not mmap? I'll look at their code and see if I can figure out what it's doing.
I'm asking because I'm implementing a garbage collection system in C, and so far I see no reason to use anything besides mmap. I'm wondering if there's something I'm missing, though.
(In my case I have an additional reason to avoid brk, which is that I might need to use malloc at some point.)
The system call brk() has the advantage of having only a single data item to track memory use, which happily is also directly related to the total size of the heap.
This has been in the exact same form since 1975's Unix V6. Mind you, V6 supported a user address space of 65,536 bytes. So there wasn't a lot of thought given to managing much more than 64K, certainly not terabytes.
Using mmap seems reasonable until I start wondering how altered or added-on garbage collection could use mmap but without rewriting the allocation algorithm too.
Will that work nicely with realloc(), fork(), etc.?
Calling mmap(2) once per memory allocation is not a viable approach for a general purpose memory allocator because the allocation granularity (the smallest individual unit which may be allocated at a time) for mmap(2) is PAGESIZE (usually 4096 bytes), and because it requires a slow and complicated syscall. The allocator fast path for small allocations with low fragmentation should require no syscalls.
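To put a number on the granularity point, here is a small sketch that computes how much address space a request would consume if it got its own mapping. The function name mmap_footprint is made up; the only real call is sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Bytes consumed if `size` bytes were obtained with a dedicated mmap()
   call: the request is rounded up to a whole number of pages. */
size_t mmap_footprint(size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t p = (size_t)page;
    return (size + p - 1) / p * p;
}
```

With 4096-byte pages, a 16-byte request would occupy 4096 bytes, which is why small allocations are carved out of larger regions instead.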
So regardless what strategy you use, you still need to support multiple of what glibc calls memory arenas, and the GNU manual mentions: "The presence of multiple arenas allows multiple threads to allocate memory simultaneously in separate arenas, thus improving performance."
The jemalloc manpage (http://jemalloc.net/jemalloc.3.html) has this to say:
Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If sbrk(2) is supported by the operating system, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
I don't see how any of these apply to the modern use of sbrk(2), as I understand it. Race conditions are handled by threading primitives. Fragmentation is handled just as would be done with memory arenas allocated by mmap(2). The maximum usable memory is irrelevant, because mmap(2) should be used for any large allocation to reduce fragmentation and to release memory back to the operating system immediately on free(3).
The use of both the application heap (claimed with sbrk) and mmap introduces some additional complexity that might be unnecessary:
Allocated Arena - the main arena uses the application's heap. Other arenas use mmap'd heaps. To map a chunk to a heap, you need to know which case applies. If this bit is 0, the chunk comes from the main arena and the main heap. If this bit is 1, the chunk comes from mmap'd memory and the location of the heap can be computed from the chunk's address.
So the question now is, if we're already using mmap(2), why not just allocate an arena at process start with mmap(2) instead of using sbrk(2)? Especially so if, as quoted, it is necessary to track which allocation type was used. There are several reasons:
mmap(2) may not be supported.
sbrk(2) is already initialized for a process, whereas mmap(2) would introduce additional requirements.
As the glibc wiki says, "If the request is large enough, mmap() is used to request memory directly from the operating system [...] and there may be a limit to how many such mappings there can be at one time."
A memory map allocated with mmap(2) cannot be extended as easily. Linux has mremap(2), but its use limits the allocator to kernels which support it. Premapping many pages with PROT_NONE access uses too much virtual memory. Using MAP_FIXED unmaps any mapping which may have been there before without warning. sbrk(2) has none of these problems, and is explicitly designed to allow for extending its memory safely.
mmap() didn't exist in the early versions of Unix. brk() was the only way to increase the size of the data segment of the process at that time. The first version of Unix with mmap() was SunOS in the mid 80's, the first open-source version was BSD-Reno in 1990.
And to be usable for malloc() you don't want to require a real file to back up the memory. In 1988 SunOS implemented /dev/zero for this purpose, and in the 1990's HP-UX implemented the MAP_ANONYMOUS flag.
There are now versions of mmap() that offer a variety of methods to allocate the heap.
The obvious advantage is that you can grow the last allocation in place, which is something you can't do with mmap(2) (mremap(2) is a Linux extension, not portable).
For naive (and not-so-naive) programs which use realloc(3), e.g. to append to a string, this translates into a speed boost of one or two orders of magnitude ;-)
I don't know the details on Linux specifically, but on FreeBSD for several years now mmap is preferred and jemalloc in FreeBSD's libc has sbrk() completely disabled. brk()/sbrk() are not implemented in the kernel on the newer ports to aarch64 and risc-v.
If I understand the history of jemalloc correctly, it was originally the new allocator in FreeBSD's libc before it was broken out and made portable. Now FreeBSD is a downstream consumer of jemalloc. It's very possible that its preference for mmap() over sbrk() originated with the characteristics of the FreeBSD VM system, which was built around implementing the mmap interface.
It's worth noting that brk/sbrk were marked LEGACY in SUSv2 and removed from POSIX.1-2001, so they should be considered non-portable at this point. If you are working on a new allocator you probably don't want to depend on them.

memory allocation/deallocation for embedded devices

Currently we use the malloc/free library functions on Linux for memory allocation and deallocation in our C-based embedded application. I have heard that this causes memory fragmentation as the heap grows and shrinks, which would degrade performance. Other programming languages with efficient garbage collection solve this issue by freeing memory when it is no longer in use.
Are there any alternative approaches that would solve this issue in C-based embedded programs?
You may take a look at a solution called memory pool allocation.
See: Memory pools implementation in C
Yes, there's an easy solution: don't use dynamic memory allocation outside of initialization.
It is common (in my experience) in embedded systems to only allow calls to malloc when a program starts (this is usually done by convention, there's nothing in C to enforce this. Although you can create your own wrapper for malloc to do this). This requires more work to analyze what memory your program could possibly use since you have to allocate it all at once. The benefit you get, however, is a complete understanding of what memory your program uses.
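A minimal sketch of such a wrapper, assuming the convention is enforced with a single flag. The names init_malloc and init_finish are hypothetical, not from any framework.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static bool init_done = false;

/* Allocate only while the program is still initializing. */
void *init_malloc(size_t size)
{
    if (init_done)
        return NULL;   /* a stricter build could abort() here instead */
    return malloc(size);
}

/* Call once at the end of start-up; later init_malloc() calls fail. */
void init_finish(void)
{
    assert(!init_done);  /* finishing twice would indicate a logic error */
    init_done = true;
}
```

The rest of the code base then calls init_malloc() instead of malloc(), so any late allocation shows up immediately as a NULL return during testing.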
In some cases this is fairly straightforward, in particular if your system has enough memory to allocate everything it could possibly need all at once. In severely memory-limited systems, however, you're left with managing the memory yourself. I've seen this done by writing "custom allocators" which you allocate and free memory from. I'll provide an example.
Let's say you're implementing some mathematical program that needs lots of big matrices (not terribly huge, but for example 1000x1000 floats). Your system may not have the memory to allocate many of these matrices, but if you can allocate at least one of them, you could create a pool of memory used for matrix objects, and every time you need a matrix you grab memory from that pool, and when you're done with it you return it to the pool. This is easy if you can return them in the same order you got them in, meaning the memory pool works just like a stack. If this isn't the case, perhaps you could just clear the entire pool at the end of each "iteration" (assuming this math system is periodic).
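A rough sketch of the stack-like pool described above, with a reset for the "clear everything at the end of each iteration" case. The pool size and the names pool_push, pool_pop and pool_reset are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BYTES (1u << 16)   /* hypothetical pool size */

static _Alignas(max_align_t) unsigned char pool[POOL_BYTES];
static size_t pool_top = 0;     /* offset of the first unused byte */

/* Take `size` bytes from the top of the pool, or NULL if it is full. */
void *pool_push(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    if (size > POOL_BYTES)
        return NULL;
    size = (size + align - 1) / align * align;  /* keep later blocks aligned */
    if (size > POOL_BYTES - pool_top)
        return NULL;
    void *p = &pool[pool_top];
    pool_top += size;
    return p;
}

/* Release block `p` and everything pushed after it (LIFO order). */
void pool_pop(void *p)
{
    unsigned char *b = p;
    assert(b >= pool && b <= pool + pool_top);
    pool_top = (size_t)(b - pool);
}

/* Drop every allocation at once, e.g. at the end of an iteration. */
void pool_reset(void)
{
    pool_top = 0;
}
```

Because nothing is ever freed out of order, the pool cannot fragment, and both operations are a few instructions.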
With more detail about what exactly you're trying to implement I could provide more relevant/specific examples.
Edit: See sg7's answer as well: that user provides a link to well-established frameworks which implement what I describe here.

Optimization of C program with SLAB-like technologies

I have a programming project with highly intensive use of malloc/free functions.
It has three types of structures that are created and destroyed very frequently and in large numbers, so malloc and free are heavily used, called thousands of times per second. Can replacing the standard memory allocator with a user-space version of SLAB solve this problem? Are there any implementations of such algorithms?
P.S.
System is Linux-oriented.
The structures are smaller than 100 bytes.
Finally, I'd prefer to use a ready-made implementation because memory management is a really hard topic.
If you only have three different structure types, then you would gain greatly by using a pool allocator (either custom made or something like boost::pool but for C). Doug Lea's binning-based malloc would serve as a very good base for a pool allocator (it's used in glibc).
However, you also need to take into account other factors, such as multi-threading and memory reuse (will objects be allocated, freed, then reallocated, or just allocated then freed?). From this angle you can look into tcmalloc (which is designed for extreme allocations, both in quantity and memory usage), nedmalloc or Hoard. All of these allocators are open source and thus can be easily altered to suit the sizes of the objects you allocate.
Without knowing more it's impossible to give you a good answer, but yes, managing your own memory (often by allocating a large block and then doing your own allocations within that large block) can avoid the high cost associated with general purpose memory managers. For example, in Windows many small allocations will bring performance to its knees. Implementations exist for almost every type of memory manager, but I'm not sure what kind you're asking for exactly...
When programming in Windows I find calling malloc/free is like death for performance -- almost any in-app memory allocation that amortizes memory allocations by batching will save you gobs of processor time when allocating/freeing, so it may not be so important which approach you use, as long as you're not calling the default allocator.
That being said, here are some simplistic multithreading-naive ideas:
This isn't strictly a slab manager, but it seems to achieve a good balance and is commonly used.
I personally find I often end up using a fairly simple-to-implement memory-reusing manager for memory blocks of the same sizes -- it maintains a linked list of unused memory of a fixed size and allocates a new block of memory when it needs to. The trick here is to store the pointers for the linked list in the unused memory blocks -- that way there's a very tiny overhead of four bytes. The entire process is O(1) whenever it's reusing memory. When it has to allocate memory it calls a slab allocator (which itself is trivial.)
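A minimal sketch of that memory-reusing manager, assuming a single fixed block size and using plain malloc() as a stand-in for the slab allocator. BLOCK_SIZE, block_alloc and block_free are made-up names.

```c
#include <assert.h>
#include <stdlib.h>

#define BLOCK_SIZE 64   /* hypothetical fixed block size */

/* While a block is unused, its first bytes hold the list link. */
typedef struct free_block {
    struct free_block *next;
} free_block;

_Static_assert(BLOCK_SIZE >= sizeof(free_block), "block must hold a pointer");

static free_block *free_head = NULL;

void *block_alloc(void)
{
    if (free_head != NULL) {       /* O(1): reuse a previously freed block */
        free_block *b = free_head;
        free_head = b->next;
        return b;
    }
    return malloc(BLOCK_SIZE);     /* assumption: stand-in for a slab allocator */
}

void block_free(void *p)
{
    assert(p != NULL);
    free_block *b = p;
    b->next = free_head;           /* O(1): push onto the list, keep the memory */
    free_head = b;
}
```

The link lives inside the unused block, so the bookkeeping costs no extra memory while a block is in use.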
For a pure allocate-only slab allocator you just ask the system (nicely) to give you a large chunk of memory and keep track of what space you haven't used yet (just maintain a pointer to the start of the unused area and a pointer to the end). When you don't have enough space to allocate the requested size, allocate a new slab. (For large chunks, just pass through to the system allocator.)
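As a sketch of the allocate-only variant, assuming slabs come from malloc() and a made-up threshold decides what counts as "large". The names bump_alloc, SLAB_BYTES and LARGE_THRESHOLD are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_BYTES 4096                   /* hypothetical slab size */
#define LARGE_THRESHOLD (SLAB_BYTES / 4)  /* hypothetical cut-off for pass-through */

static unsigned char *slab_cur = NULL;    /* start of the unused area */
static unsigned char *slab_end = NULL;    /* end of the current slab */

void *bump_alloc(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    if (size > LARGE_THRESHOLD)
        return malloc(size);              /* large chunk: pass through */
    if (size == 0)
        size = 1;                         /* give every request a distinct address */
    size = (size + align - 1) / align * align;
    if (slab_cur == NULL || size > (size_t)(slab_end - slab_cur)) {
        slab_cur = malloc(SLAB_BYTES);    /* the old slab's remainder is abandoned */
        if (slab_cur == NULL)
            return NULL;
        slab_end = slab_cur + SLAB_BYTES;
    }
    assert((size_t)(slab_end - slab_cur) >= size);
    void *p = slab_cur;
    slab_cur += size;                     /* bump the pointer past the new block */
    return p;
}
```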
The problem with chaining these approaches? Your application will never free any memory, but performance-critical applications often are either one-shot processing applications or create many objects of the same sizes and then stop using them.
If you're careful, the above approach isn't too hard to make multithread friendly, even with just atomic operations.
I recently implemented my own userspace slab allocator, and it proved to be much more efficient (speedwise and memory-wise) than malloc/free for a large amount of fixed-size allocations. You can find it here.
Allocation and freeing work in O(1) time, and are sped up by using bitvectors to represent empty/full slots. When allocating, the __builtin_ctzll GCC intrinsic is used to locate the first set bit in the bitvector (representing an empty slot), which should translate to a single instruction on modern hardware. When freeing, some clever bitwise arithmetic is performed on the pointer itself, in order to locate the header of the corresponding slab and to mark the corresponding slot as free.
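The linked code is not reproduced here, so the following is only a guess at the general shape of such a design, not that allocator's implementation. It assumes GCC or Clang (for __builtin_ctzll), C11 aligned_alloc(), one 4096-byte slab aligned to its own size, and 64-byte slots with slot 0 reserved for the header. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_SIZE 4096u   /* slab size and alignment (power of two) */
#define SLOT_SIZE 64u     /* slot 0 holds the header; slots 1..63 are usable */

typedef struct {
    uint64_t free_mask;   /* bit i set => slot i is free */
} slab_header;

slab_header *slab_new(void)
{
    slab_header *s = aligned_alloc(SLAB_SIZE, SLAB_SIZE);
    if (s != NULL)
        s->free_mask = ~UINT64_C(1);   /* everything free except header slot 0 */
    return s;
}

void *slab_alloc(slab_header *s)
{
    if (s->free_mask == 0)
        return NULL;                   /* slab is full */
    unsigned slot = (unsigned)__builtin_ctzll(s->free_mask);  /* lowest free slot */
    s->free_mask &= ~(UINT64_C(1) << slot);
    return (unsigned char *)s + slot * SLOT_SIZE;
}

void slab_free(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    /* Slabs are SLAB_SIZE-aligned, so clearing the low bits finds the header. */
    slab_header *s = (slab_header *)(addr & ~(uintptr_t)(SLAB_SIZE - 1));
    unsigned slot = (unsigned)((addr & (SLAB_SIZE - 1)) / SLOT_SIZE);
    assert(slot != 0 && !(s->free_mask & (UINT64_C(1) << slot)));
    s->free_mask |= UINT64_C(1) << slot;   /* mark the slot free again */
}
```

Note that slab_free() needs only the pointer: the alignment of the slab makes the header address recoverable without any per-object metadata.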

making your own malloc function?

I read that some games rewrite their own malloc to be more efficient. I don't understand how this is possible in a virtual memory world. If I recall correctly, malloc actually calls an OS specific function, which maps the virtual address to a real address with the MMU. So then how can someone make their own memory allocator and allocate real memory, without calling the actual runtime's malloc?
Thanks
It's certainly possible to write an allocator more efficient than a general purpose one.
If you know the properties of your allocations, you can blow general purpose allocators out of the water.
Case in point: many years ago, we had to design and code up a communication subsystem (HDLC, X.25 and proprietary layers) for embedded systems. The fact that we knew the maximum allocation would always be less than 128 bytes (or something like that) meant that we didn't have to mess around with variable sized blocks at all. Every allocation was for 128 bytes no matter how much you asked for.
Of course, if you asked for more, it returned NULL.
By using fixed-length blocks, we were able to speed up allocations and de-allocations greatly, using bitmaps and associated structures to hold accounting information rather than relying on slower linked lists. In addition, there was no need to coalesce freed blocks.
Granted, this was a special case but you'll find that's so for games as well. In fact, we've even used this in a general purpose system where allocations below a certain threshold got a fixed amount of memory from a self-managed pre-allocated pool done the same way. Any other allocations (larger than the threshold or if the pool was fully allocated) were sent through to the "real" malloc.
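A sketch of that hybrid scheme, under the assumption of a small static pool of 128-byte blocks tracked by a bitmap, with everything else passed to the real malloc. The pool capacity and the names fixed_alloc and fixed_free are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLK_SIZE  128u   /* every pooled allocation gets this much */
#define BLK_COUNT 32u    /* hypothetical pool capacity */

static _Alignas(16) unsigned char blocks[BLK_COUNT][BLK_SIZE];
static uint32_t used_bits = 0;            /* bit i set => block i is in use */

void *fixed_alloc(size_t size)
{
    if (size <= BLK_SIZE) {
        for (unsigned i = 0; i < BLK_COUNT; i++) {
            if (!(used_bits & (UINT32_C(1) << i))) {
                used_bits |= UINT32_C(1) << i;
                return blocks[i];
            }
        }
    }
    return malloc(size);                  /* too large, or the pool is exhausted */
}

void fixed_free(void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t base = (uintptr_t)blocks;
    if (addr >= base && addr - base < sizeof blocks) {
        size_t i = (addr - base) / BLK_SIZE;
        assert(used_bits & (UINT32_C(1) << i));
        used_bits &= ~(UINT32_C(1) << i); /* clearing a bit is all the bookkeeping */
    } else {
        free(p);                          /* came from the real malloc */
    }
}
```

The address-range check in fixed_free() is what lets callers use one free function for both kinds of allocation.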
Just because malloc() is a standard C function doesn't mean that it's the lowest level access you have to the memory system. In fact, malloc() is probably implemented in terms of lower-level operating system functionality. That means you could call those lower level interfaces too. They might be OS-specific, but they might allow you better performance than you would get from the malloc() interface. If that were the case, you could implement your own memory allocation system any way you want, and maybe be even more efficient about it - optimizing the algorithm for the characteristics of the size and frequency of allocations you're going to make, for example.
In general, malloc will call an OS-specific function to obtain a bunch of memory (at least one VM page), and will then divide that memory up into smaller chunks as needed to return to the caller of malloc.
The malloc library will also have a list (or lists) of free blocks, so it can often meet a request without asking the OS for more memory. Determining how many different block sizes to handle, deciding whether to attempt to combine adjacent free blocks, and so forth, are the choices the malloc library implementor has to make.
It's possible for you to bypass the malloc library and directly invoke the OS-level "give me some memory" function and do your own allocation/freeing within the memory you get from the OS. Such implementations are likely to be OS-specific. Another alternative is to use malloc for initial allocations, but maintain your own cache of freed objects.
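On Linux and most Unix-like systems the OS-level "give me some memory" call can be mmap(2). A minimal sketch, assuming an anonymous private mapping, a fixed 1 MiB arena and no freeing at all (arena_alloc and ARENA_BYTES are made-up names):

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std=c11 on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define ARENA_BYTES (1u << 20)   /* hypothetical arena size: 1 MiB */

static unsigned char *arena = NULL;
static size_t arena_used = 0;

void *arena_alloc(size_t size)
{
    if (arena == NULL) {
        /* One system call up front; later requests need none. */
        void *m = mmap(NULL, ARENA_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED)
            return NULL;
        arena = m;
    }
    if (size > ARENA_BYTES)
        return NULL;
    size = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment */
    assert(arena_used <= ARENA_BYTES);
    if (size > ARENA_BYTES - arena_used)
        return NULL;                    /* arena exhausted */
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}
```

A real allocator would add freeing and reuse on top of this; the sketch only shows that the runtime's malloc is not needed to obtain usable memory.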
One thing you can do is have your allocator allocate a pool of memory, then service requests from that (and allocate a bigger pool if it runs out). I'm not sure if that's what they're doing though.
If I recall correctly, malloc actually calls an OS specific function
Not quite. Most hardware has a 4KB page size. Operating systems generally don't expose a memory allocation interface offering anything smaller than page-sized (and page-aligned) chunks.
malloc spends most of its time managing the virtual memory space that has already been allocated, and only occasionally requests more memory from the OS (obviously this depends on the size of the items you allocate and how often you free).
There is a common misconception that when you free something it is immediately returned to the operating system. While this sometimes occurs (particularly for larger memory blocks) it is generally the case that freed memory remains allocated to the process and can then be re-used by later mallocs.
So most of the work is in bookkeeping of already-allocated virtual space. Allocation strategies can have many aims, such as fast operation, low memory wastage, good locality, space for dynamic growth (e.g. realloc) and so on.
If you know more about your pattern of memory allocation and release, you can optimise malloc and free for your usage patterns or provide a more extensive interface.
For instance, you may be allocating lots of equal-sized objects, which may change the optimal allocation parameters. Or you may always free large amounts of objects at once, in which case you don't want free to be doing fancy things.
Have a look at memory pools and obstacks.
See How do games like GTA IV not fragment the heap?.

(C) which heap policies are most often used?

I have heard that 'better-fit' is pretty commonly used, but I don't seem to read much about it online. What are the most commonly used policies for heap allocators, or the ones thought to be the most efficient?
(I admit my vocabulary may be flawed; when I say 'policy' I mean things such as 'best fit,' 'first fit,' 'next fit,' etc.)
Edit: I am also particularly interested in how the heap policies of 'better fit' and doug lea's strategy (http://gee.cs.oswego.edu/dl/html/malloc.html) compare. Doug uses a type of best fit, but his approach uses index bins, whereas better fit uses a Cartesian tree.
C programming environments use the malloc implementation provided by the standard C library that comes with the operating system. The concepts in Doug Lea's memory allocator (called dlmalloc) are the most widely used, in one form or another, in memory allocators on UNIX systems. dlmalloc uses bins of different sizes to accommodate objects: the bin closest to the object size is used to allocate the object.
FreeBSD uses a new multi-threaded memory-allocator called jemalloc designed to be concurrent and thread-safe, which provides good performance characteristics when used in multi-core systems of today. A comparison of the old malloc and the new multithreaded one can be found here. Even though it's multi-threaded it still uses the concepts of chunks of different sizes to accommodate objects according to their size (chunk(s) closest to the size of the object are used to allocate the object).
Inside the UNIX kernels the most popular memory allocator is the slab allocator, which was introduced by Sun Microsystems. The slab allocator uses large chunks of memory called slabs. These slabs are divided among caches of objects (or pools) of different sizes. Each object is allocated from the cache which contains objects closest to its size.
As you may notice, the above bins, chunks and slab caches are all forms of the best fit algorithm. So you can safely assume that "best fit" is one of the most widely used malloc algorithms (although memory allocators differ in other important ways).
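To make the "closest bin" idea concrete, here is a toy size-class lookup. The power-of-two bins from 16 to 2048 bytes are an assumption chosen for simplicity and are not the actual bin layout of dlmalloc, jemalloc or the slab allocator.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BINS 8   /* hypothetical bins of 16, 32, ..., 2048 bytes */

/* Index of the smallest bin whose block size can hold `size` bytes,
   or -1 if the request is too large for any bin. */
int bin_index(size_t size)
{
    size_t bin_size = 16;
    for (int i = 0; i < NUM_BINS; i++, bin_size *= 2) {
        if (size <= bin_size)
            return i;
    }
    return -1;
}

/* Block size served by a given bin. */
size_t bin_block_size(int index)
{
    assert(index >= 0 && index < NUM_BINS);
    return (size_t)16 << index;
}
```

A 100-byte request lands in the 128-byte bin, so the waste per object is bounded by the spacing between adjacent bin sizes.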

Resources