C program slowly takes up more memory [duplicate] - c

I wrote a C program in Linux that mallocs memory, ran it in a loop, and TOP didn't show any memory consumption.
then I've done something with that memory, and TOP did show memory consumption.
When I malloc, do I really "get memory", or is there a "lazy" memory management, that only gives me the memory if/when I use it?
(There is also an option that TOP only know about memory consumption when I use it, so I'm not sure about this..)
Thanks

On Linux, malloc requests memory with sbrk() or mmap() - either way, your address space is expanded immediately, but Linux does not assign actual pages of physical memory until the first write to the page in question. You can see the address space expansion in the VIRT column, while the actual, physical memory usage in RES.

This starts a little off subject ( and then I'll tie it in to your question ), but what's happening is similar to what happens when you fork a process in Linux. When forking there is a mechanism called copy on write which only copies the memory space for the new process when the memory is written too. This way if the forked process exec's a new program right away then you've saved the overhead of copying the original programs memory.
Getting back to your question, the idea is similar. As others have pointed out, requesting the memory gets you the virtual memory space immediately, but the actual pages are only allocated when write to them.
What's the purpose of this? It basically makes mallocing memory a more or less constant time operation Big O(1) instead of a Big O(n) operation ( similar to the way the linux scheduler spreads it's work out instead of doing it in one big chunk ).
To demonstrate what I mean I did the following experiment:
rbarnes#rbarnes-desktop:~/test_code$ time ./bigmalloc
real 0m0.005s
user 0m0.000s
sys 0m0.004s
rbarnes#rbarnes-desktop:~/test_code$ time ./deadbeef
real 0m0.558s
user 0m0.000s
sys 0m0.492s
rbarnes#rbarnes-desktop:~/test_code$ time ./justwrites
real 0m0.006s
user 0m0.000s
sys 0m0.008s
The bigmalloc program allocates 20 million ints, but doesn't do anything with them. deadbeef writes one int to each page resulting in 19531 writes and justwrites allocates 19531 ints and zeros them out. As you can see deadbeef takes about 100 times longer to execute than bigmalloc and about 50 times longer than justwrites.
#include <stdlib.h>
int main(int argc, char **argv) {
int *big = malloc(sizeof(int)*20000000); // allocate 80 million bytes
return 0;
}
.
#include <stdlib.h>
int main(int argc, char **argv) {
int *big = malloc(sizeof(int)*20000000); // allocate 80 million bytes
// immediately write to each page to simulate all at once allocation
// assuming 4k page size on 32bit machine
for ( int* end = big + 20000000; big < end; big+=1024 ) *big = 0xDEADBEEF ;
return 0;
}
.
#include <stdlib.h>
int main(int argc, char **argv) {
int *big = calloc(sizeof(int),19531); // number of writes
return 0;
}

Yes, the memory isn't mapped into your memoryspace unless you touch it. mallocing memory will only setup the paging tables so they know when you get a pagefault in the allocated memory, the memory should be mapped in.

Are you using compiler optimizations? Maybe the optimizer has removed the allocation since you're not using the allocated resources?

The feature is called overcommit - kernel "promises" you memory by increasing the data segment size, but does not allocate physical memory to it. When you touch an address in that new space the process page-faults into the kernel, which then tries to map physical pages to it.

Yes, note the VirtualAlloc flags,
MEM_RESERVE
MEM_COMMIT
.
Heh, but for Linux, or any POSIX/BSD/SVR# system, vfork(), has been around for ages and provides simular functionality.
The vfork() function differs from
fork() only in that the child process
can share code and data with the
calling process (parent process). This
speeds cloning activity significantly
at a risk to the integrity of the
parent process if vfork() is misused.
The use of vfork() for any purpose
except as a prelude to an immediate
call to a function from the exec
family, or to _exit(), is not advised.
The vfork() function can be used to
create new processes without fully
copying the address space of the old
process. If a forked process is simply
going to call exec, the data space
copied from the parent to the child by
fork() is not used. This is
particularly inefficient in a paged
environment, making vfork()
particularly useful. Depending upon
the size of the parent's data space,
vfork() can give a significant
performance improvement over fork().

Related

About Dynamic Memory Allocation in C [duplicate]

I was trying to figure out how much memory I can malloc to maximum extent on my machine
(1 Gb RAM 160 Gb HD Windows platform).
I read that the maximum memory malloc can allocate is limited to physical memory (on heap).
Also when a program exceeds consumption of memory to a certain level, the computer stops working because other applications do not get enough memory that they require.
So to confirm, I wrote a small program in C:
int main(){
int *p;
while(1){
p=(int *)malloc(4);
if(!p)break;
}
}
I was hoping that there would be a time when memory allocation would fail and the loop would break, but my computer hung as it was an infinite loop.
I waited for about an hour and finally I had to force shut down my computer.
Some questions:
Does malloc allocate memory from HD also?
What was the reason for above behaviour?
Why didn't loop break at any point of time?
Why wasn't there any allocation failure?
I read that the maximum memory malloc can allocate is limited to physical memory (on heap).
Wrong: most computers/OSs support virtual memory, backed by disk space.
Some questions: does malloc allocate memory from HDD also?
malloc asks the OS, which in turn may well use some disk space.
What was the reason for above behavior? Why didn't the loop break at any time?
Why wasn't there any allocation failure?
You just asked for too little at a time: the loop would have broken eventually (well after your machine slowed to a crawl due to the large excess of virtual vs physical memory and the consequent super-frequent disk access, an issue known as "thrashing") but it exhausted your patience well before then. Try getting e.g. a megabyte at a time instead.
When a program exceeds consumption of memory to a certain level, the
computer stops working because other applications do not get enough
memory that they require.
A total stop is unlikely, but when an operation that normally would take a few microseconds ends up taking (e.g.) tens of milliseconds, those four orders of magnitude may certainly make it feel as if the computer had basically stopped, and what would normally take a minute could take a week.
I know this thread is old, but for anyone willing to give it a try oneself, use this code snipped
#include <stdlib.h>
int main() {
int *p;
while(1) {
int inc=1024*1024*sizeof(char);
p=(int*) calloc(1,inc);
if(!p) break;
}
}
run
$ gcc memtest.c
$ ./a.out
upon running, this code fills up ones RAM until killed by the kernel. Using calloc instead of malloc to prevent "lazy evaluation". Ideas taken from this thread:
Malloc Memory Questions
This code quickly filled my RAM (4Gb) and then in about 2 minutes my 20Gb swap partition before it died. 64bit Linux of course.
/proc/sys/vm/overcommit_memory controls the maximum on Linux
On Ubuntu 19.04 for example, we can easily see that malloc is implemented with mmap(MAP_ANONYMOUS by using strace.
Then man proc then describes how /proc/sys/vm/overcommit_memory controls the maximum allocation:
This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed".
In mode 1, the kernel pretends there is always enough memory, until memory actually runs out. One use case for this mode is scientific computing applications that em‐ ploy large sparse arrays. In Linux kernel versions before 2.6.0, any nonzero value implies mode 1.
In mode 2 (available since Linux 2.6), the total virtual address space that can be allocated (CommitLimit in /proc/meminfo) is calculated as
CommitLimit = (total_RAM - total_huge_TLB) * overcommit_ratio / 100 + total_swap
where:
total_RAM is the total amount of RAM on the system;
total_huge_TLB is the amount of memory set aside for huge pages;
overcommit_ratio is the value in /proc/sys/vm/overcommit_ratio; and
total_swap is the amount of swap space.
For example, on a system with 16GB of physical RAM, 16GB of swap, no space dedicated to huge pages, and an overcommit_ratio of 50, this formula yields a Com‐ mitLimit of 24GB.
Since Linux 3.14, if the value in /proc/sys/vm/overcommit_kbytes is nonzero, then CommitLimit is instead calculated as:
CommitLimit = overcommit_kbytes + total_swap
See also the description of /proc/sys/vm/admiin_reserve_kbytes and /proc/sys/vm/user_reserve_kbytes.
Documentation/vm/overcommit-accounting.rst in the 5.2.1 kernel tree also gives some information, although lol a bit less:
The Linux kernel supports the following overcommit handling modes
0 Heuristic overcommit handling. Obvious overcommits of address
space are refused. Used for a typical system. It ensures a
seriously wild allocation fails while allowing overcommit to
reduce swap usage. root is allowed to allocate slightly more
memory in this mode. This is the default.
1 Always overcommit. Appropriate for some scientific
applications. Classic example is code using sparse arrays and
just relying on the virtual memory consisting almost entirely
of zero pages.
2 Don't overcommit. The total address space commit for the
system is not permitted to exceed swap + a configurable amount
(default is 50%) of physical RAM. Depending on the amount you
use, in most situations this means a process will not be
killed while accessing pages but will receive errors on memory
allocation as appropriate.
Useful for applications that want to guarantee their memory
allocations will be available in the future without having to
initialize every page.
Minimal experiment
We can easily see the maximum allowed value with:
main.c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char **argv) {
char *chars;
size_t nbytes;
/* Decide how many ints to allocate. */
if (argc < 2) {
nbytes = 2;
} else {
nbytes = strtoull(argv[1], NULL, 0);
}
/* Allocate the bytes. */
chars = mmap(
NULL,
nbytes,
PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS,
-1,
0
);
/* This can happen for example if we ask for too much memory. */
if (chars == MAP_FAILED) {
perror("mmap");
exit(EXIT_FAILURE);
}
/* Free the allocated memory. */
munmap(chars, nbytes);
return EXIT_SUCCESS;
}
GitHub upstream.
Compile and run to allocate 1GiB and 1TiB:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out 0x40000000
./main.out 0x10000000000
We can then play around with the allocation value to see what the system allows.
I can't find a precise documentation for 0 (the default), but on my 32GiB RAM machine it does not allow the 1TiB allocation:
mmap: Cannot allocate memory
If I enable unlimited overcommit however:
echo 1 | sudo tee /proc/sys/vm/overcommit_memory
then the 1TiB allocation works fine.
Mode 2 is well documented, but I'm lazy to carry out precise calculations to verify it. But I will just point out that in practice we are allowed to allocate about:
overcommit_ratio / 100
of total RAM, and overcommit_ratio is 50 by default, so we can allocate about half of total RAM.
VSZ vs RSS and the out-of-memory killer
So far, we have just allocated virtual memory.
However, at some point of course, if you use enough of those pages, Linux will have to start killing some processes.
I have illustrated that in detail at: What is RSS and VSZ in Linux memory management
Try this
#include <stdlib.h>
#include <stdio.h>
main() {
int Mb = 0;
while (malloc(1<<20)) ++Mb;
printf("Allocated %d Mb total\n", Mb);
}
Include stdlib and stdio for it.
This extract is taken from deep c secrets.
malloc does its own memory management, managing small memory blocks itself, but ultimately it uses the Win32 Heap functions to allocate memory. You can think of malloc as a "memory reseller".
The windows memory subsystem comprises physical memory (RAM) and virtual memory (HD). When physical memory becomes scarce, some of the pages can be copied from physical memory to virtual memory on the hard drive. Windows does this transparently.
By default, Virtual Memory is enabled and will consume the available space on the HD. So, your test will continue running until it has either allocated the full amount of virtual memory for the process (2GB on 32-bit windows) or filled the hard disk.
As per C90 standard guarantees that you can get at least one object 32 kBytes in size, and this may be static, dynamic, or automatic memory. C99 guarantees at least 64 kBytes. For any higher limit, refer your compiler's documentation.
Also, malloc's argument is a size_t and the range of that type is [0,SIZE_MAX], so the maximum you can request is SIZE_MAX, which value varies upon implementation and is defined in <limits.h>.
I don't actually know why that failed, but one thing to note is that `malloc(4)" may not actually give you 4 bytes, so this technique is not really an accurate way to find your maximum heap size.
I found this out from my question here.
For instance, when you declare 4 bytes of memory, the space directly before your memory could contain the integer 4, as an indication to the kernel of how much memory you asked for.
Does malloc allocate memory from HD also?
Implementation of malloc() depends on libc implementation and operating system (OS). Typically malloc() doesn't always request RAM from the OS but returns a pointer to previously allocated memory block "owned" by libc.
In case of POSIX compatible systems, this libc controlled memory area is usually increased using syscall brk(). That doesn't allow releasing any memory between two still existing allocations which causes the process to look still using all the RAM after allocating areas A, B, C in sequence and releasing B. This is because areas A and C around the area B are still in use so the memory allocated from the OS cannot be returned.
Many modern malloc() implementations have some kind of heuristic where small allocations use the memory area reserved via brk() and "big" allocations use anonymous virtual memory blocks reserved via mmap() using MAP_ANONYMOUS flag. This allows immediately returning these big allocations when free() is later called. Typically the runtime performance of mmap() is slightly slower than using previously reserved memory which is the reason malloc() implements this heuristic.
Both brk() and mmap() allocate virtual memory from the OS. And virtual memory can be always backed up by swap which may be stored in any storage that the OS supports, including HDD.
In case you run Windows, the syscalls have different names but the underlying behavior is probably about the same.
What was the reason for above behaviour?
Since your example code never touched the memory, I'd guess you're seeing behavior where OS implements copy-on-write for virtual RAM and the memory is mapped to shared page with whole page filled with zeroes by default. Modern operating systems do this because many programs allocate more RAM than they actually need and using shared zero page by default for all memory allocations avoids needing to use real RAM for these allocations.
If you want to test how OS handles your loop and actually reserve true storage, you need to write something to the memory you allocated. For x86 compatible hardware you only need to write one byte per each 4096 byte segment because page size is 4096 and the hardware cannot implement copy-on-write behavior for smaller segments; once one byte is modified, the whole 4096 byte segment called page must be reserved for your process. I'm not aware of any modern CPU that would support smaller than 4096 byte pages. Modern Intel CPUs support 2 MB and 1 GB pages in addition to 4096 byte pages but the 1 GB pages are rarely used because the overhead of using 2 MB pages is small enough for any sensible RAM amounts. 1 GB pages might make sense if your system has hundreds of terabytes of RAM.
So basically your program only tested reserving virtual memory without ever using said virtual memory. Your OS probably has special optimization for this which avoids needing more than 4 KB of RAM to support this.
Unless your objective is to try to measure the overhead caused by your malloc() implementation, you should avoid trying to allocate memory block smaller than 16-32 bytes. For mmap() allocations the minimum possible overhead is 8 bytes per allocation on x86-64 hardware due the data needed to return the memory to the operating system so it really doesn't make sense for malloc() to use mmap() syscall for a single 4 byte allocation.
The overhead is needed to keep track of memory allocations because the memory is freed using void free(void*) so memory allocation routines must keep track of the allocated memory segment size somewhere. Many malloc() implementations also need additional metadata and if they need to keep track of any memory addresses, those need 8 bytes per address.
If you truly want to search for the limits of your system, you should probably do binary search for the limit where malloc() fails. In practice, you try to allocate ..., 1KB, 2KB, 4KB, 8KB, ..., 32 GB which then fails and you know that the real world limit is between 16 GB and 32 GB. You can then split this size in half and figure out the exact limit with additional testing. If you do this kind of search, it may be easier to always release any successful allocation and reserve the test block with a single malloc() call. That should also avoid accidentally accounting for malloc() overhead so much because you need only one allocation at any time at max.
Update: As pointed out by Peter Cordes in the comments, your malloc() implementation may be writing bookkeeping data about your allocations in the reserved RAM which causes real memory to be used and that can cause system to start swapping so heavily that you cannot recover it in any sensible timescale without shutting down the computer. In case you're running Linux and have enabled "Magic SysRq" keys, you could just press Alt+SysRq+f to kill the offending process taking all the RAM and system would run just fine again. It is possible to write malloc() implementation that doesn't usually touch the RAM allocated via brk() and I assumed you would be using one. (This kind of implementation would allocate memory in 2^n sized segments and all similarly sized segments are reserved in the same range of addresses. When free() is later called, the malloc() implementation knows the size of the allocation from the address and bookkeeping about free memory segments are kept in separate bitmap in single location.) In case of Linux, malloc() implementation touching the reserved pages for internal bookkeeping is called dirtying the memory, which prevents sharing memory pages because of copy-on-write handling.
Why didn't loop break at any point of time?
If your OS implements the special behavior described above and you're running 64-bit system, you're not going to run out of virtual memory in any sensible timescale so your loop seems infinite.
Why wasn't there any allocation failure?
You didn't actually use the memory so you're allocating virtual memory only. You're basically increasing the maximum pointer value allowed for your process but since you never access the memory, the OS never bothers the reserve any physical memory for your process.
In case you're running Linux and want the system to enforce virtual memory usage to match actually available memory, you have to write 2 to kernel setting /proc/sys/vm/overcommit_memory and maybe adjust overcommit_ratio, too. See https://unix.stackexchange.com/q/441364/20336 for details about memory overcommit on Linux. As far as I know, Windows implements overcommit, too, but I don't know how to adjust its behavior.
when first time you allocate any size to *p, every next time you leave that memory to be unreferenced. That means
at a time your program is allocating memory of 4 bytes only
. then how can you thing you have used entire RAM, that's why SWAP device( temporary space on HDD) is out of discussion. I know an memory management algorithm in which when no one program is referencing to memory block, that block is eligible to allocate for programs memory request. That's why you are just keeping busy to RAM Driver and that's why it can't give chance to service other programs. Also this a dangling reference problem.
Ans : You can at most allocate the memory of your RAM size. Because no program has access to swap device.
I hope your all questions has got satisfactory answers.

Does mmap or malloc allocate RAM?

I know this is probably a stupid question but i've been looking for awhile and can't find a definitive answer. If I use mmap or malloc (in C, on a linux machine) does either one allocate space in RAM? For example, if I have 2GB of RAM and wanted to use all available RAM could I just use a malloc/memset combo, mmap, or is there another option I don't know of?
I want to write a series of simple programs that can run simultaneously and keep all RAM used in the process to force swap to be used, and pages swapped in/out frequently. I tried this already with the program below, but it's not exactly what I want. It does allocate memory (RAM?), and force swap to be used (if enough instances are running), but when I call sleep doesn't that just lock the memory from being used (so nothing is actually being swapped in or out from other processes?), or am I misunderstanding something.
For example, if I ran this 3 times would I be using 2GB (all) of RAM from the first two instances, and the third instance would then swap one of the previous two instances out (of RAM) and the current instance into RAM? Or would instance #3 just run using disk or virtual memory?
This brings up another point, would I need to allocate enough memory to use all available virtual memory as well for the swap partition to be used?
Lastly, would mmap (or any other C function. Hell, even another language if applicable) be better for doing this?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MB(size) ( (size) * 1024 * 1024)
#define GB(size) ( (size) * 1024 * 1024 * 1024)
int main(){
char *p;
p = (char *)malloc(MB(512));
memset(p, 'T', MB(512));
printf(".5 GB allocated...\n");
char *q;
q = (char *)malloc(MB(512));
memset(q, 'T', MB(512));
printf("1 GB allocated...\n");
printf("Sleeping...\n");
sleep(300);
}
** Edit: I am using CentOS 6.4 (with 3.6.0 kernel) for my OS, if that helps any.
This is very OS/machine dependent.
In most OSes neither allocates RAM. They both allocate VM space. They make a certain range of your processes virtual memory valid for use. RAM is normally allocated later by the OS on first write. Until then those allocations do not use RAM (aside from the page table that lists them as valid VM space).
If you want to allocate physical RAM then you have to make each page (sysconf(_SC_PAGESIZE) gives you the system pagesize) dirty.
In Linux you can see your VM mappings with all details in /proc/self/smaps. Rss is your resident set of that mapping (how much is resident in RAM), everything else that is dirty will have been swapped out. All non-dirty memory will be available for use, but won't exist until then.
You can make all pages dirty with something like
size_t mem_length;
char (*my_memory)[sysconf(_SC_PAGESIZE)] = mmap(
NULL
, mem_length
, PROT_READ | PROT_WRITE
, MAP_PRIVATE | MAP_ANONYMOUS
, -1
, 0
);
int i;
for (i = 0; i * sizeof(*my_memory) < mem_length; i++) {
my_memory[i][0] = 1;
}
On some Implementations this can also be achieved by passing the MAP_POPULATE flag to mmap, but (depending on your system) it may just fail mmap with ENOMEM if you try to map more then you have RAM available.
Theory and practice differ greatly here. In theory, neither mmap nor malloc allocate actual RAM, but in practice they do.
mmap will allocate RAM to store a virtual memory area data structure (VMA). If mmap is used with an actual file to be mapped, it will (unless explicitly told differently) further allocate several pages of RAM to prefetch the mapped file's contents.
Other than that, it only reserves address space, and RAM will be allocated as it is accessed for the first time.
malloc, similarly, only logically reserves amounts of address space within the virtual address space of your process by telling the operating system either via sbrk or mmap that it wants to manage some (usually much larger than you request) area of address space. It then subdivides this huge area via some more or less complicated algorithm and finally reserves a portion of this address space (properly aligned and rounded) for your use and returns a pointer to it.
But: malloc also needs to store some additional information somewhere, or it would be impossible for free to do its job at a later time. At the very least free needs to know the size of an allocated block in addition to the start address. Usually, malloc therefore secretly allocates a few extra bytes which are immediately preceding the address that you get -- you don't know about that, it doesn't tell you.
Now the crux of the matter is that while in theory malloc does not touch the memory that it manages and does not allocate physical RAM, in practice it does. And this does indeed cause page faults and memory pages to be created (i.e. RAM being used).
You can verify this under Linux by keeping to call malloc and watch the OOP killer blast your process out of existence because the system runs out of physical RAM when in fact there should be plenty left.

Zero RAM using C in Linux

How can I zero unused RAM in Linux for security purposes ? I wrote this simple C program but I do not know if the RAM called by malloc will be reused at the next loop or if new RAM will be used. Hopefully, after a few minutes the entire RAM will have been zeroed.
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
char *a = NULL; // declare variable
while(1) // infinite loop
{
a = malloc(524288); // half a MB
memset(a, 0, 524288); // zero
free(a); // free
sleep(1); // sleep for 1 second
}
}
Linux already has a kernel process that is zeroing memory using idle cycles so it will have memory ready to hand to processes that request it.
Your loop may or may not zero different memory depending on the particular malloc implementation. If you really want to write a process like you describe, look into using sbrk directly to ensure you're cycling memory in and out of your process. I bet if you check you'll find every byte given to you by sbrk is already zero, though.
You can't zero system RAM. The system owns it. If you want to run a system which zeros the RAM then you need to write your own OS!
As long as you never access uninitialized memory, you don't have to worry about what someone else left behind. As long as you never free memory before zeroing it out, you don't have to worry about what you have left behind.
I think you need to write a kernel module to actually do this reliably. And then you still could only zero unused pages. Note that pages that were used by other processes will be cleared automatically by the kernel on allocation.
What are you trying to do? Avoid cold boot attacks?
Typically, on my system (2.6.36) I can free all the unused (but allocated) memory by just doing a while(1) malloc(); loop, and killing it when it stops allocating memory.

Memory Leak Using malloc fails

I am writing a program to leak memory( main memory ) to test how the system behaves with low system memory and swap memory. We are using the following loop which runs periodically and leaks memory
main(int argc, char* argv[] )
{
int arg_mem = argv[1];
while(1)
{
u_int_ptr =(unsigned int*) malloc(arg_mem * 1024 * 1024);
if( u_int_ptr == NULL )
printf("\n leakyapp Daemon FAILED due to insufficient available memory....");
sleep( arg_time );
}
}
Above loop runs for sometime and prints the message "leakyapp Daemon FAILED due to insufficient available memory...." . But when I run the command "free" I can see that running this program has no effect either on Main memory or Swap.
Am I doing something wrong ?
Physical memory is not committed to your allocations until you actually write into it.
If you have a kernel version after 2.6.23, use mmap() with the MAP_POPULATE flag instead of malloc():
u_int_ptr = mmap(NULL, arg_mem * 1024 * 1024, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
if (u_int_ptr == MAP_FAILED)
/* ... */
If you have an older kernel, you'll have to touch each page in the allocation.
There might be some sort of copy-on-write optimization. I would suggest actually writing something to the memory you are allocating.
What is happening is that malloc requests argmem * 256 pages from the heap (assuming a 4 Kbyte page size). The heap in turn requests the memory from the operating system. However, all that does is create entries in the page table for the newly allocated memory block. No actual physical RAM is allocated to the process, except that required by the heap to track the malloc request.
As soon as the process tries to access one of those pages by reading or writing, a page fault is generated because the entry in the page table is effectively a dangling pointer. The operating system will then allocate a physical page to the process. It's only then that you'll see the available physical memory go down.
Since all new pages start completely zeroed out, Linux might employ a "copy on write" strategy to optimise page allocation. i.e. it might keep a single page totally zeroed and always allocate that one when a process tries to read from a previously unused page. Only when the process tries to write to that new page would it actually allocate a completely fresh page from physical RAM. I don't know if Linux actually does this, but if it does, merely reading from a new page is not going to be enough to increase physical memory usage.
So, your best strategy is to allocate your large block of RAM and then write something at 4096 byte intervals throughout it.
What does ulimit -m -v print?
Explanation: On any server OS, you can limit the amount of resources a process can allocate to make sure that a single runaway process can't bring down the whole machine.
I'm guessing (based on the command line argument) that you're using a desktop/server OS and not an embedded system.
Allocating memory like this is probably not consuming much RAM. Your memory allocation might not have even succeeded - on some OSs (e.g. Linux), malloc() can return non-NULL even when you ask for more memory than is available.
Without knowing what your OS is and exactly what you're trying to test, it's difficult to suggest anything specific, but you might want to look at more low level ways of allocating memory than malloc(), or ways of controlling the virtual memory system. On Linux you might want to look at mlock().
I think caf already explained it. Linux is usually configured to allow overcommitting memory. You allocate huge chunks of memory, but internally there happens nothing but just making a note that you process wants this huge chunk of memory. It's not before you try to write that chunk, that the kernel tries to find free virtual memory to satisfy the read/write access. This is a bit like flight booking: Airlines usually overbook the flights, because there's always a percentage of passengers who do not show up.
You can force the memory to be committed by writing to the chunk with memset() after allocation. calloc should work too.

Why malloc+memset is slower than calloc?

It's known that calloc is different than malloc in that it initializes the memory allocated. With calloc, the memory is set to zero. With malloc, the memory is not cleared.
So in everyday work, I regard calloc as malloc+memset.
Incidentally, for fun, I wrote the following code for a benchmark.
The result is confusing.
Code 1:
#include<stdio.h>
#include<stdlib.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
int i=0;
char *buf[10];
while(i<10)
{
buf[i] = (char*)calloc(1,BLOCK_SIZE);
i++;
}
}
Output of Code 1:
time ./a.out
**real 0m0.287s**
user 0m0.095s
sys 0m0.192s
Code 2:
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define BLOCK_SIZE 1024*1024*256
int main()
{
int i=0;
char *buf[10];
while(i<10)
{
buf[i] = (char*)malloc(BLOCK_SIZE);
memset(buf[i],'\0',BLOCK_SIZE);
i++;
}
}
Output of Code 2:
time ./a.out
**real 0m2.693s**
user 0m0.973s
sys 0m1.721s
Replacing memset with bzero(buf[i],BLOCK_SIZE) in Code 2 produces the same result.
My question is: Why is malloc+memset so much slower than calloc? How can calloc do that?
The short version: Always use calloc() instead of malloc()+memset(). In most cases, they will be the same. In some cases, calloc() will do less work because it can skip memset() entirely. In other cases, calloc() can even cheat and not allocate any memory! However, malloc()+memset() will always do the full amount of work.
Understanding this requires a short tour of the memory system.
Quick tour of memory
There are four main parts here: your program, the standard library, the kernel, and the page tables. You already know your program, so...
Memory allocators like malloc() and calloc() are mostly there to take small allocations (anything from 1 byte to 100s of KB) and group them into larger pools of memory. For example, if you allocate 16 bytes, malloc() will first try to get 16 bytes out of one of its pools, and then ask for more memory from the kernel when the pool runs dry. However, since the program you're asking about is allocating for a large amount of memory at once, malloc() and calloc() will just ask for that memory directly from the kernel. The threshold for this behavior depends on your system, but I've seen 1 MiB used as the threshold.
The kernel is responsible for allocating actual RAM to each process and making sure that processes don't interfere with the memory of other processes. This is called memory protection, it has been dirt common since the 1990s, and it's the reason why one program can crash without bringing down the whole system. So when a program needs more memory, it can't just take the memory, but instead it asks for the memory from the kernel using a system call like mmap() or sbrk(). The kernel will give RAM to each process by modifying the page table.
The page table maps memory addresses to actual physical RAM. Your process's addresses, 0x00000000 to 0xFFFFFFFF on a 32-bit system, aren't real memory but instead are addresses in virtual memory. The processor divides these addresses into 4 KiB pages, and each page can be assigned to a different piece of physical RAM by modifying the page table. Only the kernel is permitted to modify the page table.
How it doesn't work
Here's how allocating 256 MiB does not work:
Your process calls calloc() and asks for 256 MiB.
The standard library calls mmap() and asks for 256 MiB.
The kernel finds 256 MiB of unused RAM and gives it to your process by modifying the page table.
The standard library zeroes the RAM with memset() and returns from calloc().
Your process eventually exits, and the kernel reclaims the RAM so it can be used by another process.
How it actually works
The above process would work, but it just doesn't happen this way. There are three major differences.
When your process gets new memory from the kernel, that memory was probably used by some other process previously. This is a security risk. What if that memory has passwords, encryption keys, or secret salsa recipes? To keep sensitive data from leaking, the kernel always scrubs memory before giving it to a process. We might as well scrub the memory by zeroing it, and if new memory is zeroed we might as well make it a guarantee, so mmap() guarantees that the new memory it returns is always zeroed.
There are a lot of programs out there that allocate memory but don't use the memory right away. Sometimes memory is allocated but never used. The kernel knows this and is lazy. When you allocate new memory, the kernel doesn't touch the page table at all and doesn't give any RAM to your process. Instead, it finds some address space in your process, makes a note of what is supposed to go there, and makes a promise that it will put RAM there if your program ever actually uses it. When your program tries to read or write from those addresses, the processor triggers a page fault and the kernel steps in to assign RAM to those addresses and resumes your program. If you never use the memory, the page fault never happens and your program never actually gets the RAM.
Some processes allocate memory and then read from it without modifying it. This means that a lot of pages in memory across different processes may be filled with pristine zeroes returned from mmap(). Since these pages are all the same, the kernel makes all these virtual addresses point to a single shared 4 KiB page of memory filled with zeroes. If you try to write to that memory, the processor triggers another page fault and the kernel steps in to give you a fresh page of zeroes that isn't shared with any other programs.
The final process looks more like this:
Your process calls calloc() and asks for 256 MiB.
The standard library calls mmap() and asks for 256 MiB.
The kernel finds 256 MiB of unused address space, makes a note about what that address space is now used for, and returns.
The standard library knows that the result of mmap() is always filled with zeroes (or will be once it actually gets some RAM), so it doesn't touch the memory, so there is no page fault, and the RAM is never given to your process.
Your process eventually exits, and the kernel doesn't need to reclaim the RAM because it was never allocated in the first place.
If you use memset() to zero the page, memset() will trigger the page fault, cause the RAM to get allocated, and then zero it even though it is already filled with zeroes. This is an enormous amount of extra work, and explains why calloc() is faster than malloc() and memset(). If you end up using the memory anyway, calloc() is still faster than malloc() and memset() but the difference is not quite so ridiculous.
This doesn't always work
Not all systems have paged virtual memory, so not all systems can use these optimizations. This applies to very old processors like the 80286 as well as embedded processors which are just too small for a sophisticated memory management unit.
This also won't always work with smaller allocations. With smaller allocations, calloc() gets memory from a shared pool instead of going directly to the kernel. In general, the shared pool might have junk data stored in it from old memory that was used and freed with free(), so calloc() could take that memory and call memset() to clear it out. Common implementations will track which parts of the shared pool are pristine and still filled with zeroes, but not all implementations do this.
Dispelling some wrong answers
Depending on the operating system, the kernel may or may not zero memory in its free time, in case you need to get some zeroed memory later. Linux does not zero memory ahead of time, and Dragonfly BSD recently also removed this feature from their kernel. Some other kernels do zero memory ahead of time, however. Zeroing pages during idle isn't enough to explain the large performance differences anyway.
The calloc() function is not using some special memory-aligned version of memset(), and that wouldn't make it much faster anyway. Most memset() implementations for modern processors look kind of like this:
function memset(dest, c, len)
// one byte at a time, until the dest is aligned...
while (len > 0 && ((unsigned int)dest & 15))
*dest++ = c
len -= 1
// now write big chunks at a time (processor-specific)...
// block size might not be 16, it's just pseudocode
while (len >= 16)
// some optimized vector code goes here
// glibc uses SSE2 when available
dest += 16
len -= 16
// the end is not aligned, so one byte at a time
while (len > 0)
*dest++ = c
len -= 1
So you can see, memset() is very fast and you're not really going to get anything better for large blocks of memory.
The fact that memset() is zeroing memory that is already zeroed does mean that the memory gets zeroed twice, but that only explains a 2x performance difference. The performance difference here is much larger (I measured more than three orders of magnitude on my system between malloc()+memset() and calloc()).
Party trick
Instead of looping 10 times, write a program that allocates memory until malloc() or calloc() returns NULL.
What happens if you add memset()?
Because on many systems, in spare processing time, the OS goes around setting free memory to zero on its own and marking it safe for calloc(), so when you call calloc(), it may already have free, zeroed memory to give you.
On some platforms in some modes malloc initialises the memory to some typically non-zero value before returning it, so the second version could well initialize the memory twice

Resources