I read that some games rewrite their own malloc to be more efficient. I don't understand how this is possible in a virtual memory world. If I recall correctly, malloc actually calls an OS specific function, which maps the virtual address to a real address with the MMU. So then how can someone make their own memory allocator and allocate real memory, without calling the actual runtime's malloc?
Thanks
It's certainly possible to write an allocator more efficient than a general purpose one.
If you know the properties of your allocations, you can blow general purpose allocators out of the water.
Case in point: many years ago, we had to design and code up a communication subsystem (HDLC, X.25 and proprietary layers) for embedded systems. The fact that we knew the maximum allocation would always be less than 128 bytes (or something like that) meant that we didn't have to mess around with variable sized blocks at all. Every allocation was for 128 bytes no matter how much you asked for.
Of course, if you asked for more, it returned NULL.
By using fixed-length blocks, we were able to speed up allocations and de-allocations greatly, using bitmaps and associated structures to hold accounting information rather than relying on slower linked lists. In addition, the need to coalesce freed blocks was not needed.
Granted, this was a special case but you'll find that's so for games as well. In fact, we've even used this in a general purpose system where allocations below a certain threshold got a fixed amount of memory from a self-managed pre-allocated pool done the same way. Any other allocations (larger than the threshold or if the pool was fully allocated) were sent through to the "real" malloc.
Just because malloc() is a standard C function doesn't mean that it's the lowest level access you have to the memory system. In fact, malloc() is probably implemented in terms of lower-level operating system functionality. That means you could call those lower level interfaces too. They might be OS-specific, but they might allow you better performance than you would get from the malloc() interface. If that were the case, you could implement your own memory allocation system any way you want, and maybe be even more efficient about it - optimizing the algorithm for the characteristics of the size and frequency of allocations you're going to make, for example.
In general, malloc will call an OS-specific function to obtain a bunch of memory (at least one VM page), and will then divide that memory up into smaller chunks as needed to return to the caller of malloc.
The malloc library will also have a list (or lists) of free blocks, so it can often meet a request without asking the OS for more memory. Determining how many different block sizes to handle, deciding whether to attempt to combine adjacent free blocks, and so forth, are the choices the malloc library implementor has to make.
It's possible for you to bypass the malloc library and directly invoke the OS-level "give me some memory" function and do your own allocation/freeing within the memory you get from the OS. Such implementations are likely to be OS-specific. Another alternative is to use malloc for initial allocations, but maintain your own cache of freed objects.
One thing you can do is have your allocator allocate a pool of memory, then service requests from than (and allocate a bigger pool if it runs out). I'm not sure if that's what they're doing though.
If I recall correctly, malloc actually
calls an OS specific function
Not quite. Most hardware has a 4KB page size. Operating systems generally don't expose a memory allocation interface offering anything smaller than page-sized (and page-aligned) chunks.
malloc spends most of its time managing the virtual memory space that has already been allocated, and only occasionally requests more memory from the OS (obviously this depends on the size of the items you allocate and how often you free).
There is a common misconception that when you free something it is immediately returned to the operating system. While this sometimes occurs (particularly for larger memory blocks) it is generally the case that freed memory remains allocated to the process and can then be re-used by later mallocs.
So most of the work is in bookkeeping of already-allocated virtual space. Allocation strategies can have many aims, such as fast operation, low memory wastage, good locality, space for dynamic growth (e.g. realloc) and so on.
If you know more about your pattern of memory allocation and release, you can optimise malloc and free for your usage patterns or provide a more extensive interface.
For instance, you may be allocating lots of equal-sized objects, which may change the optimal allocation parameters. Or you may always free large amounts of objects at once, in which case you don't want free to be doing fancy things.
Have a look at memory pools and obstacks.
See How do games like GTA IV not fragment the heap?.
Related
C (and C++) include a family of dynamic memory allocation functions, most of which are intuitively named and easy to explain to a programmer with a basic understanding of memory. malloc() simply allocates memory, while calloc() allocates some memory and clears it eagerly. There are also realloc() and free(), which are pretty self-explanatory.
The manpage for malloc() also mentions valloc(), which allocates (size) bytes aligned to the page border.
Unfortunately, my background isn't thorough enough in low-level intricacies; what are the implications of allocating and using page border-aligned memory, and when is this appropriate as opposed to regular malloc() or calloc()?
The manpage for valloc contains an important note:
The function valloc() appeared in 3.0BSD. It is documented as being obsolete in 4.3BSD, and as legacy in SUSv2. It does not appear in POSIX.1-2001.
valloc is obsolete and nonstandard - to answer your question, it would never be appropriate to use in new code.
While there are some reasons to want to allocate aligned memory - this question lists a few good ones - it is usually better to let the memory allocator figure out which bit of memory to give you. If you are certain that you need your freshly-allocated memory aligned to something, use aligned_alloc (C11) or posix_memalign (POSIX) instead.
Allocations with page alignment usually are not done for speed - they're because you want to take advantage of some feature of your processor's MMU, which typically works with page granularity.
One example is if you want to use mprotect(2) to change the access rights on that memory. Suppose, for instance, that you want to store some data in a chunk of memory, and then make it read only, so that any buggy part of your program that tries to write there will trigger a segfault. Since mprotect(2) can only change permissions page by page (since this is what the underlying CPU hardware can enforce), the block where you store your data had better be page aligned, and its size had better be a multiple of the page size. Otherwise the area you set read-only might include other, unrelated data that still needs to be written.
Or, perhaps you are going to generate some executable code in memory and then want to execute it later. Memory you allocate by default probably isn't set to allow code execution, so you'll have to use mprotect to give it execute permission. Again, this has to be done with page granularity.
Another example is if you want to allocate memory now, but might want to mmap something on top of it later.
So in general, a need for page-aligned memory would relate to some fairly low-level application, often involving something system-specific. If you needed it, you'd know. (And as mentioned, you should allocate it not with valloc, but using posix_memalign, or perhaps an anonymous mmap.)
First of all valloc is obsolete, and memalignshould be used instead.
Second thing it's not part of the C (C++) standard at all.
It's a special allocation which is aligned to _SC_PAGESIZE boundary.
When is it useful to use it? I guess never, unless you have some specific low level requirement. If you would need it, you would know to need it, since it's rarely useful (maybe just when trying some micro-optimizations or creating shared memory between processes).
The self-evident answer is that it is appropriate to use valloc when malloc is unsuitable (less efficient) for the application (virtual) memory usage pattern and valloc is better suited (more efficient). This will depend on the OS and libraries and architecture and application...
malloc traditionally allocated real memory from freed memory if available and by increasing the brk point if not, in which case it is cleared by the OS for security reasons.
calloc in a dumb implementation does a malloc and then (re)clears the memory, while a smart implementation would avoid reclearing newly allocated memory that is automatically cleared by the operating system.
valloc relates to virtual memory. In a virtual memory system using the file system, you can allocate a large amount of memory or filespace/swapspace, even more than physical memory, and it will be swapped in by pages so alignment is a factor. In Unix creation of file of a specified file and adding/deleting pages is done using inodes to define the file but doesn't deal with actual disk blocks till needed, in which case it creates them cleared. So I would expect a valloc system to increase the size of the data segment swap without actually allocating physical or swap pages, or running a for loop to clear it all - as the file and paging system does that as needed. Thus valloc should be a heck of a lot faster than malloc. But as with calloc, how particular idiotsyncratic *x/C flavours do it is up to them, and the valloc man page is totally unhelpful about these expectations.
Traditionally this was implemented with brk/sbrk. Of course in a virtual memory system, whether a paged or a segmented system, there is no real need for any of this brk/sbrk stuff and it is enough to simply write the last location in a file or address space to extend up to that point.
Re the allocation to page boundaries, that is not usually something the user wants or needs, but rather is usually something the system wants or needs.
A (probably more expensive) way to simulate valloc is to determine the page boundary and then call aligned_alloc or posix_memalign with this alignment spec.
The fact that valloc is deprecated or has been removed or is not required in some OS' doesn't mean that it isn't still useful and required for best efficiency in others. If it has been deprecated or removed, one would hope that there are replacements that are as efficient (but I wouldn't bet on it, and might, indeed have, written my own malloc replacement).
Over the last 40 years the tradeoffs of real and (once invented) virtual memory have changed periodically, and mainstream OS has tended to go for frills rather than efficiency, with programmers who don't have (time or space) efficiency as a major imperative. In the embedded systems, efficiency is more critical, but even there efficiency is often not well supported by the standard OS and/or tools. But when in doubt, you can roll your own malloc replacement for your application that does what you need, rather than depend on what someone else woke up and decided to do/implement, or to undo/deprecate.
So the real answer is you don't necessarily want to use valloc or malloc or calloc or any of the replacements your current subversion of an OS provides.
I have a programming project with highly intensive use of malloc/free functions.
It has three types of structures with very high dynamics and big numbers. By this way, malloc and free are heavily used, called thousands of times per second. Can replacement of standard memory allocation by user-space version of SLAB solve this problem? Is there any implementation of such algorithms?
P.S.
System is Linux-oriented.
Sizes of structures is less than 100 bytes.
Finally, I'll prefer to use ready implementation because memory management is really hard topic.
If you only have three different then you would greatly gain by using a pool allocator (either custom made or something like boost::pool but for C). Doug Lea's binning based malloc would serve as a very good base for a pool allocator (its used in glibc).
However, you also need to take into account other factors, such as multi-threading and memory reusage (will objects be allocated, freed then realloced or just alloced then freed?). from this angle you can check into tcmalloc (which is designed for extreme allocations, both quantity and memory usage), nedmalloc or hoard. all of these allocators are open source and thus can be easily altered to suite the sizes of the objects you allocate.
Without knowing more it's impossible to give you a good answer, but yes, managing your own memory (often by allocating a large block and then doing your own allocations with in that large block) can avoid the high cost associated with general purpose memory managers. For example, in Windows many small allocations will bring performance to its knees. Existing implementations exist for almost every type of memory manager, but I'm not sure what kind you're asking for exactly...
When programming in Windows I find calling malloc/free is like death for performance -- almost any in-app memory allocation that amortizes memory allocations by batching will save you gobs of processor time when allocating/freeing, so it may not be so important which approach you use, as long as you're not calling the default allocator.
That being said, here's some simplistic multithreading-naive ideas:
This isn't strictly a slab manager, but it seems to achieve a good balance and is commonly used.
I personally find I often end up using a fairly simple-to-implement memory-reusing manager for memory blocks of the same sizes -- it maintains a linked list of unused memory of a fixed size and allocates a new block of memory when it needs to. The trick here is to store the pointers for the linked list in the unused memory blocks -- that way there's a very tiny overhead of four bytes. The entire process is O(1) whenever it's reusing memory. When it has to allocate memory it calls a slab allocator (which itself is trivial.)
For a pure allocate-only slab allocator you just ask the system (nicely) to give you a large chunk of memory and keep track of what space you haven't used yet (just maintain a pointer to the start of the unused area and a pointer to the end). When you don't have enough space to allocate the requested size, allocate a new slab. (For large chunks, just pass through to the system allocator.)
The problem with chaining these approaches? Your application will never free any memory, but performance-critical applications often are either one-shot processing applications or create many objects of the same sizes and then stop using them.
If you're careful, the above approach isn't too hard to make multithread friendly, even with just atomic operations.
I recently implemented my own userspace slab allocator, and it proved to be much more efficient (speedwise and memory-wise) than malloc/free for a large amount of fixed-size allocations. You can find it here.
Allocations and freeing work in O(1) time, and are speeded up because of bitvectors being used to represent empty/full slots. When allocating, the __builtin_ctzll GCC intrinsic is used to locate the first set bit in the bitvector (representing an empty slot), which should translate to a single instruction on modern hardware. When freeing, some clever bitwise arithmetic is performed with the pointer itself, in order to locate the header of the corresponding slab and to mark the corresponding slot as free.
While developing a piece of software for embedded system I used realloc() function many times. Now I've been said that I "should not use realloc() in embedded" without any explanation.
Is realloc() dangerous for embedded system and why?
Yes, all dynamic memory allocation is regarded as dangerous, and it is banned from most "high integrity" embedded systems, such as industrial/automotive/aerospace/med-tech etc etc. The answer to your question depends on what sort of embedded system you are doing.
The reasons it's banned from high integrity embedded systems is not only the potential memory leaks, but also a lot of dangerous undefined/unspecified/impl.defined behavior asociated with those functions.
EDIT: I also forgot to mention heap fragmentation, which is another danger. In addition, MISRA-C also mentions "data inconsistency, memory exhaustion, non-deterministic behaviour" as reasons why it shouldn't be used. The former two seem rather subjective, but non-deterministic behaviour is definitely something that isn't allowed in these kind of systems.
References:
MISRA-C:2004 Rule 20.4 "Dynamic heap memory allocation shall not be used."
IEC 61508 Functional safety, 61508-3 Annex B (normative) Table B1, >SIL1: "No dynamic objects", "No dynamic variables".
It depends on the particular embedded system. Dynamic memory management on an small embedded system is tricky to begin with, but realloc is no more complicated than a free and malloc (of course, that's not what it does). On some embedded systems you'd never dream of calling malloc in the first place. On other embedded systems, you almost pretend it's a desktop.
If your embedded system has a poor allocator or not much RAM, then realloc might cause fragmentation problems. Which is why you avoid malloc too, cause it causes the same problems.
The other reason is that some embedded systems must be high reliability, and malloc / realloc can return NULL. In these situations, all memory is allocated statically.
In many embedded systems, a custom memory manager can provide better semantics than are available with malloc/realloc/free. Some applications, for example, can get by with a simple mark-and-release allocator. Keep a pointer to the start of not-yet-allocated memory, allocate things by moving the pointer upward, and jettison them by moving the pointer below them. That won't work if it's necessary to jettison some things while keeping other things that were allocated after them, but in situations where that isn't necessary the mark-and-release allocator is cheaper than any other allocation method. In some cases where the mark-and-release allocator isn't quite good enough, it may be helpful to allocate some things from the start of the heap and other things from the end of the heap; one may free up the things allocated from one end without affecting those allocated from the other.
Another approach that can sometimes be useful in non-multitasking or cooperative-multitasking systems is to use memory handles rather than direct pointers. In a typical handle-based system, there's a table of all allocated objects, built at the top of memory working downward, and objects themselves are allocated from the bottom up. Each allocated object in memory holds either a reference to the table slot that references it (if live) or else an indication of its size (if dead). The table entry for each object will hold the object's size as well as a pointer to the object in memory. Objects may be allocated by simply finding a free table slot (easy, since table slots are all fixed size), storing the address of the object's table slot at the start of free memory, storing the object itself just beyond that, and updating the start of free memory to point just past the object. Objects may be freed by replacing the back-reference with a length indication, and freeing the object in the table. If an allocation would fail, relocate all live objects starting at the top of memory, overwriting any dead objects, and updating the object table to point to their new addresses.
The performance of this approach is non-deterministic, but fragmentation is not a problem. Further, it may be possible in some cooperative multitasking systems to perform garbage collection "in the background"; provided that the garbage collector can complete a pass in the time it takes to chug through the slack space, long waits can be avoided. Further, some fairly simple "generational" logic may be used to improve average-case performance at the expense of worst-case performance.
realloc can fail, just like malloc can. This is one reason why you probably should not use either in an embedded system.
realloc is worse than malloc in that you will need to have the old and new pointers valid during the realloc. In other words, you will need 2X the memory space of the original malloc, plus any additional amount (assuming realloc is increasing the buffer size).
Using realloc is going to be very dangerous, because it may return a new pointer to your memory location. This means:
All references to the old pointer must be corrected after realloc.
For a multi-threaded system, the realloc must be atomic. If you are disabling interrupts to achieve this, the realloc time might be long enough to cause a hardware reset by the watchdog.
Update: I just wanted to make it clear. I'm not saying that realloc is worse than implementing realloc using a malloc/free. That would be just as bad. If you can do a single malloc and free, without resizing, it's slightly better, yet still dangerous.
The issues with realloc() in embedded systems are no different than in any other system, but the consequences may be more severe in systems where memory is more constrained, and the sonsequences of failure less acceptable.
One problem not mentioned so far is that realloc() (and any other dynamic memory operation for that matter) is non-deterministic; that is it's execution time is variable and unpredictable. Many embedded systems are also real-time systems, and in such systems, non-deterministic behaviour is unacceptable.
Another issue is that of thread-safety. Check your library's documantation to see if your library is thread-safe for dynamic memory allocation. Generally if it is, you will need to implement mutex stubs to integrate it with your particular thread library or RTOS.
Not all emebdded systems are alike; if your embedded system is not real-time (or the process/task/thread in question is not real-time, and is independent of the real-time elements), and you have large amounts of memory unused, or virtual memory capabilities, then the use of realloc() may be acceptable, if perhaps ill-advised in most cases.
Rather than accept "conventional wisdom" and bar dynamic memory regardless, you should understand your system requirements, and the behaviour of dynamic memory functions and make an appropriate decision. That said, if you are building code for reuability and portability to as wide a range of platforms and applications as possible, then reallocation is probably a really bad idea. Don't hide it in a library for example.
Note too that the same problem exists with C++ STL container classes that dynamically reallocate and copy data when the container capacity is increased.
Well, it's better to avoid using realloc if it's possible, since this operation is costly especially being put into the loop: for example, if some allocated memory needs to be extended and there no gap between after current block and the next allocated block - this operation is almost equals: malloc + memcopy + free.
Today, I appeared for an interview and the interviewer asked me this,
Tell me the steps how will you design your own free( ) function for
deallocate the allocated memory.
How can it be more efficient than C's default free() function ? What can you conclude ?
I was confused, couldn't think of the way to design.
What do you think guys ?
EDIT : Since we need to know about how malloc() works, can you tell me the steps to write our own malloc() function
That's actually a pretty vague question, and that's probably why you got confused. Does he mean, given an existing malloc implementation, how would you go about trying to develop a more efficient way to free the underlying memory? Or was he expecting you to start discussing different kinds of malloc implementations and their benefits and problems? Did he expect you to know how virtual memory functions on the x86 architecture?
Also, by more efficient, does he mean more space efficient or more time efficient? Does free() have to be deterministic? Does it have to return as much memory to the OS as possible because it's in a low-memory, multi-tasking environment? What's our criteria here?
It's hard to say where to start with a vague question like that, other than to start asking your own questions to get clarification. After all, in order to design your own free function, you first have to know how malloc is implemented. So chances are, the question was really about whether or not you knew anything about how malloc can be implemented.
If you're not familiar with the internals of memory management, the easiest way to get started with understanding how malloc is implemented is to first write your own.
Check out this IBM DeveloperWorks article called "Inside Memory Management" for starters.
But before you can write your own malloc/free, you first need memory to allocate/free. Unfortunately, in a protected mode OS, you can't directly address the memory on the machine. So how do you get it?
You ask the OS for it. With the virtual memory features of the x86, any piece of RAM or swap memory can be mapped to a memory address by the OS. What your program sees as memory could be physically fragmented throughout the entire system, but thanks to the kernel's virtual memory manager, it all looks the same.
The kernel usually provides system calls that allow you to map in additional memory for your process. On older UNIX OS's this was usually brk/sbrk to grow heap memory onto the edge of your process or shrink it off, but a lot of systems also provide mmap/munmap to simply map a large block of heap memory in. It's only once you have access to a large, contiguous looking block of memory that you need malloc/free to manage it.
Once your process has some heap memory available to it, it's all about splitting it into chunks, with each chunk containing its own meta information about its size and position and whether or not it's allocated, and then managing those chunks. A simple list of structs, each containing some fields for meta information and a large array of bytes, could work, in which case malloc has to run through the list until if finds a large enough unallocated chunk (or chunks it can combine), and then map in more memory if it can't find a big enough chunk. Once you find a chunk, you just return a pointer to the data. free() can then use that pointer to reverse back a few bytes to the member fields that exist in the structure, which it can then modify (i.e. marking chunk.allocated = false;). If there's enough unallocated chunks at the end of your list, you can even remove them from the list and unmap or shrink that memory off your process's heap.
That's a real simple method of implementing malloc though. As you can imagine, there's a lot of possible ways of splitting your memory into chunks and then managing those chunks. There's as many ways as there are data structures and algorithms. They're all designed for different purposes too, like limiting fragmentation due to small, allocated chunks mixed with small, unallocated chunks, or ensuring that malloc and free run fast (or sometimes even more slowly, but predictably slowly). There's dlmalloc, ptmalloc, jemalloc, Hoard's malloc, and many more out there, and many of them are quite small and succinct, so don't be afraid to read them. If I remember correctly, "The C Programming Language" by Kernighan and Ritchie even uses a simple malloc implementation as one of their examples.
You can't blindly design free() without knowing how malloc() works under the hood because your implementation of free() would need to know how to manipulate the bookkeeping data and that's impossible without knowing how malloc() is implemented.
So an unswerable question could be how you would design malloc() and free() instead which is not a trivial question but you could answer it partially for example by proposing some very simple implementation of a memory pool that would not be equivalent to malloc() of course but would indicate your presence of knowledge.
One common approach when you only have access to user space (generally known as memory pool) is to get a large chunk of memory from the OS on application start-up. Your malloc needs to check which areas of the right size of that pool are still free (through some data structure) and hand out pointers to that memory. Your free needs to mark the memory as free again in the data structure and possibly needs to check for fragmentation of the pool.
The benefits are that you can do allocation in nearly constant time, the drawback is that your application consumes more memory than actually is needed.
Tell me the steps how will you design your own free( ) function for deallocate the allocated memory.
#include <stdlib.h>
#undef free
#define free(X) my_free(X)
inline void my_free(void *ptr) { }
How can it be more efficient than C's default free() function ?
It is extremely fast, requiring zero machine cycles. It also makes use-after-free bugs go away. It's a very useful free function for use in programs which are instantiated as short-lived batch processes; it can usefully be deployed in some production situations.
What can you conclude ?
I really want this job, but in another company.
Memory usage patterns could be a factor. A default implementation of free can't assume anything about how often you allocate/deallocate and what sizes you allocate when you do.
For example, if you frequently allocate and deallocate objects that are of similar size, you could gain speed, memory efficiency, and reduced fragmentation by using a memory pool.
EDIT: as sharptooth noted, only makes sense to design free and malloc together. So the first thing would be to figure out how malloc is implemented.
malloc and free only have a meaning if your app is to work on top of an OS. If you would like to write your own memory management functions you would have to know how to request the memory from that specific OS or you could reserve the heap memory right away using existing malloc and then use your own functions to distribute/redistribute the allocated memory through out your app
There is an architecture that malloc and free are supposed to adhere to -- essentially a class architecture permitting different strategies to coexist. Then the version of free that is executed corresponds to the version of malloc used.
However, I'm not sure how often this architecture is observed.
The knowledge of working of malloc() is necessary to implement free(). You can find a implementation of malloc() and free() using the sbrk() system call in K&R The C Programming Language Chapter 8, Section 8.7 "Example--A Storage Allocator" pp.185-189.
Consider two applications: one (num. 1) that invokes malloc() many times, and the other (num. 2) that invokes malloc() few times.
Both applications allocate the same amount of memory (assume 100MB).
For which application the next malloc() call will be faster, #1 or #2?
In other words: Does malloc() have an index of allocated locations in memory?
You asked 2 questions:
for which application the next malloc() call will be faster, #1 or #2?
In other words: Does malloc() have an index of allocated locations in memory?
You've implied that they are the same question, but they are not. The answer to the latter question is YES.
As for which will be faster, it is impossible to say. It depends on the allocator algorithm, the machine state, the fragmentation in the current process, and so on.
Your idea is sound, though: you should think about how malloc usage will affect performance.
There was once an app I wrote that used lots of little blobs of memory, each allocated with malloc(). It worked correctly but was slow. I replaced the many calls to malloc with just one, and then sliced up that large block within my app. It was much much faster.
I don't recommend this approach; it's just an illustration of the point that malloc usage can materially affect performance.
My advice is to measure it.
Of course this completely depends on the malloc implementation, but in this case, with no calls to free, most malloc implementations will probably give you the same algorithmic speed.
As another answer commented, usually there will be a list of free blocks, but if you have not called free, there will just be one, so it should be O(1) in both cases.
This assumes that the memory allocated for the heap is big enough in both cases. In case #1, you will have allocated more total memory, as each allocation involves memory overhead to store meta-data, as a result you may need to call sbrk(), or equivalent to grow the heap in case #1, which would add an additional overhead.
They will probably be different due to cache and other second order effects, since the memory alignments for the new allocation won't be the same.
If you have been freeing some of the memory blocks, then it is likely that #2 will be faster due to less fragmentation, and so a smaller list of free blocks to search.
If you have freed all the memory blocks, it should end up being exactly the same, since any sane free implementation will have coalesced the blocks back into a single arena of memory.
Malloc has to run through a linked list of free blocks to find one to allocate. This takes time. So, #1 will usually be slower:
The more often you call malloc, the more time it will take - so reducing the number of calls will give you a speed improvement (though whether it is significant will depend on your exact circumstances).
In addition, if you malloc many small blocks, then as you free those blocks, you will fragment the heap much more than if you only allocate and free a few large blocks. So you are likely to end up with many small free blocks on your heap rather than a few big blocks, and therefore your mallocs may have to search further through the free-space lists to find a suitable block to allocate. WHich again will make them slower.
These are of course implementation details, but typically free() will insert the memory into a list of free blocks. malloc() will then look at this list for a free block that is the right size, or larger. Typically, only if this fails does malloc() ask the kernel for more memory.
There are also other considerations, such as when to coalesce multiple adjacent blocks into a single, larger block.
And, another reason that malloc() is expensive: If malloc() is called from multiple threads, there must be some kind of synchronization on these global structures. (i.e. locks.) There exist malloc() implementations with different optimization schemes to make it better for multple threads, but generally, keeping it multi-thread safe adds to the cost, as multiple threads will contend for those locks and block progress on each other.
You can always do a better job using malloc() to allocate a large chunk of memory and sub-dividing it yourself. Malloc() was optimized to work well in the general case and makes no assumptions whether or not you use threads or what the size of the program's allocations might be.
Whether it is a good idea to implement your own sub-allocator is a secondary question. It rarely is, explicit memory management is already hard enough. You rarely need another layer of code that can screw up and crash your program without any good way to debug it. Unless you are writing a debug allocator.
The answer is that it depends, most of the potential slowness rather comes from malloc() and free() in combination and usually #1 and #2 will be of similar speed.
All malloc() implementations do have an indexing mechanism, but the speed of adding a new block to the index is usually not dependant on the number of blocks already in the index.
Most of the slowness of malloc comes from two sources
searching for a suitable free block among the previously freed(blocks)
multi-processor problems with locking
Writing my own almost standards compliant malloc() replacement tool malloc() && free() times from 35% to 3-4%, and it seriously optimised those two factors. It would likely have been a similar speed to use some other high-performance malloc, but having our own was more portable to esoteric devices and of course allows free to be inlined in some places.
You don't define the relative difference between "many" and "few" but I suspect most mallocs would function almost identically in both scenarios. The question implies that each call to malloc has as much overhead as a system call and page table updates. When you do a malloc call, e.g. malloc(14), in a non-brain-dead environment, malloc will actually allocate more memory than you ask for, often a multiple of the system MMU page size. You get your 14 bytes and malloc keeps track of the newly allocated area so that later calls can just return a chunk of the already allocated memory, until more memory needs to be requested from the OS.
In other words, if I call malloc(14) 100 times or malloc(1400) once, the overhead will be about the same. I'll just have to manage the bigger allocated memory chunk myself.
Allocating one block of memory is faster than allocating many blocks. There is the overhead of the system call and also searching for available blocks. In programming reducing the number of operations usually speeds up the execution time.
Memory allocators may have to search to find a block of memory that is the correct size. This adds to the overhead of the execution time.
However, there may be better chances of success when allocating small blocks of memory versus one large block. Is your program allocating one small block and releasing it or does it need to allocate (and preserve) small blocks. When memory becomes fragmented, there are less big chunks available, so the memory allocator may have to coalesce all the blocks to form a block big enough for the allocation.
If your program is allocating and destroying many small blocks of memory you may want to consider allocating a static array and using that for your memory.