I'm trying to determine why my application consumes 4 GB of Private Bytes. I took a full memory dump and loaded it in WinDbg, but analyzing it with !heap -stat -h produces weird results which don't add up:
0:000> !heap -s
(...)
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-------------------------------------------------------------------------------------
000002d0a0000000 00000002 2800804 2780508 2800700 2984 1980 177 0 6 LFH
000002d09fe10000 00008000 64 4 64 2 1 1 0 0
000002d09ff70000 00001002 1342924 1334876 1342820 13042 3342 87 0 0 LFH
Ok, I got a 2.8GB heap and a 1.34GB heap. Let's look at the allocations of the first one:
0:000> !heap -stat -h 000002d0a0000000
heap # 000002d0a0000000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
651 291 - 1035e1 (16.00)
79c 1df - e3ce4 (14.06)
28 156d - 35908 (3.31)
(...)
IIUC, the first line means block size 0x651 (=1617 bytes), number of blocks 0x291 (=657), for a total of 0x1035e1 (=1,062,369 bytes, ~1 MB), and that's 16% of total busy bytes. But looking at the summary, there should be ~2.8 GB of busy bytes!
Another disparity:
0:000> !heap -stat -h 000002d0a0000000 -grp A
heap # 000002d0a0000000
group-by: ALLOCATIONSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
a160 1 - a160 (0.62)
7e50 2 - fca0 (0.97)
0:000> !heap -h 000002d0a0000000
(...)
(509 lines that note allocations with size 7e50, like this one:)
000002d0a3f48000: 11560 . 07e60 [101] - busy (7e50)
Edit: Many lines also say Internal at the end, which appears to mean HEAP_ENTRY_VIRTUAL_ALLOC - but the 509 lines with (7e50) don't.
My question: How can I get !heap -stat -h to show all the allocations, so they add up to the output of !heap -s?
At the moment I can only explain the busy percentage, but that may already be helpful: its value is a bit misleading.
Virtual memory is memory taken from VirtualAlloc(). The C++ heap manager uses that basic mechanism to get memory from the operating system. That virtual memory can be committed (ready to use) or reserved (can be committed later). The output of !heap -s tells you the status of the heaps with respect to that virtual memory.
So we agree that any memory the C++ heap manager can use is committed memory. This coarse granular virtual memory is split into finer blocks by the C++ heap manager. The heap manager may allocate such smaller blocks and free them, depending on the need of malloc()/free() or new/delete operations.
When blocks become free, they are no longer busy. At the same time, the C++ heap manager may decide to not give the free block back to the OS, because
it can't, since other parts of the 64k virtual memory are still in use
or it doesn't want to (internal reasons we can't exactly know, e.g. performance reasons)
Since the free parts do not count as busy, the busy percentage seems to be too high when compared to the virtual memory.
Mapped to your case, this means:
you have 2.8 GB of virtual memory
in heap 000002d0a0000000, you have ~1 MB / 16% = 6.25 MB of memory in use, the rest could be in free heap blocks (it possibly isn't)
The following example is based on this C++ code:
#include "stdafx.h"
#include <iostream>
#include <Windows.h>
#include <string>
#include <iomanip>
int main()
{
HANDLE hHeap = HeapCreate(0, 0x1000000, 0x10000000); // no options, initial 16M, max 256M
HeapAlloc(hHeap, HEAP_GENERATE_EXCEPTIONS, 511000); // max. allocation size for non-growing heap
std::cout << "Debug now, handle is 0x" << std::hex << std::setfill('0') << std::setw(sizeof(HANDLE)) << hHeap << std::endl;
std::string dummy;
std::getline(std::cin, dummy);
return 0;
}
The only 511kB block will be reported as 100%, although it is only ~1/32 of the 16 MB:
0:001> !heap -stat -h 009c0000
heap # 009c0000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
7cc18 1 - 7cc18 (100.00)
To see the free parts as well, use !heap -h <heap> -f:
0:001> !heap -h 0x01430000 -f
Index Address Name Debugging options enabled
3: 01430000
Segment at 01430000 to 11430000 (01000000 bytes committed)
Flags: 00001000
ForceFlags: 00000000
Granularity: 8 bytes
Segment Reserve: 00100000
Segment Commit: 00002000
DeCommit Block Thres: 00000200
DeCommit Total Thres: 00002000
Total Free Size: 001f05c7
Max. Allocation Size: 7ffdefff
Lock Variable at: 01430138
Next TagIndex: 0000
Maximum TagIndex: 0000
Tag Entries: 00000000
PsuedoTag Entries: 00000000
Virtual Alloc List: 014300a0
Uncommitted ranges: 01430090
FreeList[ 00 ] at 014300c4: 01430590 . 0240e1b0
0240e1a8: 7cc20 . 21e38 [100] - free <-- no. 1
02312588: 7f000 . 7f000 [100] - free <-- no. 2
[...]
01430588: 00588 . 7f000 [100] - free <-- no. 32
Heap entries for Segment00 in Heap 01430000
address: psize . size flags state (requested size)
01430000: 00000 . 00588 [101] - busy (587)
01430588: 00588 . 7f000 [100]
[...]
02312588: 7f000 . 7f000 [100]
02391588: 7f000 . 7cc20 [101] - busy (7cc18)
0240e1a8: 7cc20 . 21e38 [100]
0242ffe0: 21e38 . 00020 [111] - busy (1d)
02430000: 0f000000 - uncommitted bytes.
0:001> ? 7cc18
Evaluate expression: 511000 = 0007cc18
Here we see that I have a heap of 256 MB (240 MB uncommitted, 0x0f000000 + 16 MB committed, 0x01000000). Summing up the items in the FreeList, I get
0:001> ? 0n31 * 7f000 + 21e38
Evaluate expression: 16264760 = 00f82e38
So almost everything (~16 MB) is considered as free and not busy by the C++ heap manager. Memory like that 16 MB is reported by !heap -s in this way in WinDbg 6.2.9200:
0:001> !heap -s
LFH Key : 0x23e41d0e
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
004d0000 00000002 1024 212 1024 6 5 1 0 0 LFH
00750000 00001002 64 20 64 9 2 1 0 0
01430000 00001000 262144 16384 262144 15883 32 1 0 0
External fragmentation 96 % (32 free blocks)
-----------------------------------------------------------------------------
IMHO there's a bug regarding reserved and committed memory: it should be 262144k virtual - 16384k committed = 245760k reserved.
Note how the list length matches the number of free blocks reported before.
The above explains the busy percentage only. The remaining question is that the free memory reported in your case doesn't match this scenario.
Usually I'd say the remaining memory is in virtual blocks, i.e. memory blocks that are larger than 512 kB (32 bit) or 1 MB (64 bit) as mentioned on MSDN for growable heaps. But that's not the case here.
There is no output about virtual blocks and the number of virtual blocks is reported as 0.
A program that generates a virtual block would be
#include "stdafx.h"
#include <iostream>
#include <Windows.h>
#include <string>
#include <iomanip>
int main()
{
HANDLE hHeap = HeapCreate(0, 0x1000000, 0); // no options, initial 16M, growable
HeapAlloc(hHeap, HEAP_GENERATE_EXCEPTIONS, 20*1024*1024); // 20 MB, force growing
std::cout << "Debug now, handle is 0x" << std::hex << std::setfill('0') << std::setw(sizeof(HANDLE)) << hHeap << std::endl;
std::string dummy;
std::getline(std::cin, dummy);
return 0;
}
and the !heap command would mention the virtual block:
0:001> !heap -s
LFH Key : 0x7140028b
Termination on corruption : ENABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
006d0000 00000002 1024 212 1024 6 5 1 0 0 LFH
001d0000 00001002 64 20 64 9 2 1 0 0
Virtual block: 01810000 - 01810000 (size 00000000)
00810000 00001002 16384 16384 16384 16382 33 1 1 0
External fragmentation 99 % (33 free blocks)
-----------------------------------------------------------------------------
In your case, however, the number of virtual blocks is 0. Perhaps this is what is reported as "Internal" in your version of WinDbg. If you have not upgraded yet, try version 6.2.9200 to get the same output as I do.
Situation : board with an Arm CPU that has Nand flash next to it. On power-up, U-boot bootloader starts up and copies the flash contents to RAM, then it transfers control to that code in RAM. A Linux system with some application code, composed through Buildroot, starts running. Its entire filesystem is stored as a single UBIFS file in flash, and it starts using that.
When a certain byte is set, the bootloader keeps in control, and starts a TFTP transfer to download and store a new flash image.
Trigger : a board came back defective. Linux kernel startup clearly shows the issue:
[ 1.931150] Creating 8 MTD partitions on "atmel_nand":
[ 1.936285] 0x000000000000-0x000000040000 : "at91bootstrap"
[ 1.945280] 0x000000040000-0x0000000c0000 : "bootloader"
[ 1.954065] 0x0000000c0000-0x000000100000 : "bootloader env"
[ 1.963262] 0x000000100000-0x000000140000 : "bootloader redundant env"
[ 1.973221] 0x000000140000-0x000000180000 : "spare"
[ 1.981552] 0x000000180000-0x000000200000 : "device tree"
[ 1.990466] 0x000000200000-0x000000800000 : "kernel"
[ 1.999210] 0x000000800000-0x000010000000 : "rootfs"
...
[ 4.016251] ubi0: attached mtd7 (name "rootfs", size 248 MiB)
[ 4.022181] ubi0: PEB size: 131072 bytes (128 KiB), LEB size: 126976 bytes
[ 4.029040] ubi0: min./max. I/O unit sizes: 2048/2048, sub-page size 2048
[ 4.035941] ubi0: VID header offset: 2048 (aligned 2048), data offset: 4096
[ 4.042960] ubi0: good PEBs: 1980, bad PEBs: 4, corrupted PEBs: 0
[ 4.049033] ubi0: user volume: 2, internal volumes: 1, max. volumes count: 128
[ 4.056359] ubi0: max/mean erase counter: 2/0, WL threshold: 4096, image sequence number: 861993884
[ 4.065476] ubi0: available PEBs: 0, total reserved PEBs: 1980, PEBs reserved for bad PEB handling: 36
[ 4.074898] ubi0: background thread "ubi_bgt0d" started, PID 77
...
[ 4.298009] UBIFS (ubi0:0): UBIFS: mounted UBI device 0, volume 0, name "rootfs", R/O mode
[ 4.306415] UBIFS (ubi0:0): LEB size: 126976 bytes (124 KiB), min./max. I/O unit sizes: 2048 bytes/2048 bytes
[ 4.316418] UBIFS (ubi0:0): FS size: 155926528 bytes (148 MiB, 1228 LEBs), journal size 9023488 bytes (8 MiB, 72 LEBs)
[ 4.327197] UBIFS (ubi0:0): reserved for root: 0 bytes (0 KiB)
[ 4.333095] UBIFS (ubi0:0): media format: w4/r0 (latest is w5/r0), UUID AE9F77DC-04AF-433F-92BC-D3375C83B518, small LPT model
[ 4.346924] VFS: Mounted root (ubifs filesystem) readonly on device 0:15.
[ 4.356186] devtmpfs: mounted
[ 4.367038] Freeing unused kernel memory: 1024K
[ 4.371812] Run /sbin/init as init process
[ 4.568143] UBIFS (ubi0:1): background thread "ubifs_bgt0_1" started, PID 83
[ 4.644809] UBIFS (ubi0:1): recovery needed
[ 4.685823] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 235:4096, read only 126976 bytes, retry
[ 4.732212] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 235:4096, read only 126976 bytes, retry
[ 4.778705] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 235:4096, read only 126976 bytes, retry
[ 4.824159] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 235:4096, read 126976 bytes
... which causes an exception, but the kernel keeps on going, then another error is detected :
[ 5.071518] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 709:4096, read only 126976 bytes, retry
[ 5.118110] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 709:4096, read only 126976 bytes, retry
[ 5.164447] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 709:4096, read only 126976 bytes, retry
[ 5.210987] ubi0 error: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 709:4096, read 126976 bytes
... but impressively, the system still comes up alive and behaves almost fine.
Why does the kernel not mark these flash blocks as bad ? That data can't be read anyway, and at least the next image flashing might skip the bad blocks...
Investigation : so the Kernel found a defective PEB #235 (decimal) in the "rootfs" partition of the flash. Each PEB is 128KB, so the error sits somewhere beyond byte 30,801,920 (decimal). Since the "rootfs" partition only starts from byte 0x800000 of the flash, the actual damaged page must be somewhere beyond byte 39,190,528 (decimal) or 0x2560000. And sure enough, when using the nand read utility within U-boot :
U-Boot> nand read 0x20000000 0x2560000 0x1000
NAND read: device 0 offset 0x2560000, size 0x1000
4096 bytes read: OK
U-Boot> nand read 0x20000000 0x2561000 0x1000
NAND read: device 0 offset 0x2561000, size 0x1000
4096 bytes read: OK
U-Boot> nand read 0x20000000 0x2562000 0x1000
NAND read: device 0 offset 0x2562000, size 0x1000
PMECC: Too many errors
NAND read from offset 2562000 failed -5
0 bytes read: ERROR
so the damaged page sits at offset 8K within that block of flash.
From various other posts, I learned that nand flash with 2K pages organized in 128K blocks, has an extra 64 "Out Of Band" bytes over every 2048 payload bytes, bringing each page to a gross size of 2112 bytes. Anyway, the entire block of 128K will have to be disused, as this is the erase size. No problem, there is storage to spare, I just want to make sure that the next flashing will skip over this bad block.
Since neither the Linux kernel nor the bootloader bothered to mark the bad block, I'll do it by hand in U-boot:
U-Boot> nand markbad 2562000
block 0x02562000 successfully marked as bad
A similar investigation for the 2nd bad flash page reveals that the other error sits at flash address 0x60a1000 :
U-Boot> nand read 0 60A1000 800
NAND read: device 0 offset 0x60a1000, size 0x800
PMECC: Too many errors
NAND read from offset 60a1000 failed -5
0 bytes read: ERROR
so here too, the nand markbad utility is used to manually put a permanent mark on this block :
U-Boot> nand markbad 60a1000
block 0x060a1000 successfully marked as bad
and to verify that everything is taken into account :
U-Boot> nand bad
Device 0 bad blocks:
02560000
060a0000
Just like it should be - from the start of each 128K block, both blocks are marked.
Problem : so I learned that the 64 OOB bytes are divided into a 2-byte marker, 38 bytes of error-correcting code, and 24 bytes of journaling data. Of all the OOB bytes accompanying each 2048 payload bytes, only the very first group of 64 bytes, accompanying the first 2 KB page, lends its 2 marker bytes to indicate the status of the entire 128 KB block. These 2 bytes should be modified in the flash device itself so that the status is persistent. So in my U-boot session, instead of launching the Linux system, I restarted the CPU and remained in U-boot :
U-Boot> reset
resetting ...
RomBOOT
ba_offset = 0xc ...
AT91Bootstrap 3.6.0-00029-g0cd4e6a (Wed Nov 12 12:14:04 CET 2014)
NAND: ONFI flash detected
NAND: Manufacturer ID: 0x2c Chip ID: 0x32
NAND: Disable On-Die ECC
PMECC: page_size: 0x800, oob_size: 0x40, pmecc_cap: 0x4, sector_size: 0x200
NAND: Initialize PMECC params, cap: 0x4, sector: 0x200
NAND: Image: Copy 0x80000 bytes from 0x40000 to 0x26f00000
NAND: Done to load image
U-Boot 2013.10-00403-g1f9a20a (Nov 12 2014 - 12:14:27)
CPU: SAMA5D31
Crystal frequency: 12 MHz
CPU clock : 528 MHz
Master clock : 132 MHz
DRAM: 128 MiB
NAND: 256 MiB
MMC: mci: 0
In: serial
Out: serial
Err: serial
Net: macb0
Hit any key to stop autoboot: 0
U-Boot> nand info
Device 0: nand0, sector size 128 KiB
Page size 2048 b
OOB size 64 b
Erase size 131072 b
U-Boot> nand bad
Device 0 bad blocks:
U-Boot>
The bad blocks have been forgotten - the marker code was not applied persistently ?
Granted, this U-boot version seems rather old. Has the nand markbad utility been improved since then ?
Workaround : I modified the OOB bytes of the first page within the bad block myself. I read all 2112 bytes of the first page into RAM, then modified the 2 marker bytes, and wrote the 2112 bytes back from RAM into flash. Technically, I should have erased the whole 128K flash block and then written back all 128K of contents. But my laziness has been challenged enough today. NAND flash can be toggled from 1 to 0 arbitrarily - it's the reverse operation that is hard, requiring an erase to restore a whole 128K block back to all-0xFF. I noticed that all the "block good" markers are encoded as 0xFFFF, so I figured that writing "0x0000" instead should suffice.
U-Boot> nand read.raw 0x20200000 0x2560000 1
NAND read: 2112 bytes read: OK
The format for nand read.raw is a bit quirky: as opposed to nand read, which expects its size argument in bytes, it wants the size expressed in number of pages instead. The first page is all we need, so argument '1' does the trick. The contents, which have now been transferred to RAM, can be inspected with U-boot's md utility :
U-Boot> md 0x20200000 0x210
20200000: 23494255 00000001 00000000 01000000 UBI#............
20200010: 00080000 00100000 9cfb6033 00000000 ........3`......
...
202007e0: 00000000 00000000 00000000 00000000 ................
202007f0: 00000000 00000000 00000000 00000000 ................
20200800: ffffffff ffffffff ffffffff ffffffff ................
20200810: ffffffff ffffffff ffffffff ffffffff ................
20200820: ffffffff b0c9aa24 0008fdb8 00000000 ....$...........
20200830: 00000000 00000000 00000000 00000000 ................
Note how the md utility expects its size argument in yet a different format : this one expects it in units of words. Just to keep us alert.
The dump at address 0x20200800 clearly shows how markbad has failed its purpose: the 2 marker bytes of the bad block are still merrily on 0xFFFF.
Then to modify these bytes, another U-boot utility comes in handy :
U-Boot> mm 0x20200800
20200800: ffffffff ? 00000000
20200804: ffffffff ? q
It's a bit crude: I've changed the first 4 OOB bytes instead of just the first 2 marker bytes. Finally, to write the modified contents back into flash :
U-Boot> nand write.raw 0x20200000 0x2560000 1
NAND write: 2112 bytes written: OK
Funny enough, the nand bad diagnostic doesn't notice the block which has just been marked, even after some nand read attempts which do fail.
U-Boot> nand bad
Device 0 bad blocks:
U-Boot>
But this is no cause for alarm. The 2nd bad block was marked manually in a similar fashion, and upon another reset :
U-Boot> reset
resetting ...
RomBOOT
ba_offset = 0xc ...
AT91Bootstrap 3.6.0-00029-g0cd4e6a (Wed Nov 12 12:14:04 CET 2014)
...
U-Boot 2013.10-00403-g1f9a20a (Nov 12 2014 - 12:14:27)
...
Hit any key to stop autoboot: 0
U-Boot> nand bad
Device 0 bad blocks:
02560000
060a0000
U-Boot>
Lo and behold, the 'bad block' marking has persisted ! The next flash storage operation neatly skipped over the bad blocks, saving a consistent kernel and filesystem in the various partitions of the flash. This was the intention all along, but it seems to require gritty manual work. Is there no automated way ?
U-Boot has changed quite a bit since 2014. Patches possibly of relevance to your problem include:
dc0b69fa9f97 ("mtd: nand: mxs_nand: allow to enable BBT support")
c4adf9db5d38 ("spl: nand: sunxi: remove support for so-called 'syndrome' mode")
8d1809a96699 ("spl: nand: simple: replace readb() with chip specific read_buf()")
Please retest with U-Boot Git HEAD. If there is still something missing, please report it to the U-Boot developer list, or even better, send your patch.
I'm working on an embedded project on an ARM MCU that has a custom linker file with several different memory spaces:
/* Memory Spaces Definitions */
MEMORY
{
rom (rx) : ORIGIN = 0x00400000, LENGTH = 0x00200000
data_tcm (rw) : ORIGIN = 0x20000000, LENGTH = 0x00008000
prog_tcm (rwx) : ORIGIN = 0x00000000, LENGTH = 0x00008000
ram (rwx) : ORIGIN = 0x20400000, LENGTH = 0x00050000
sdram (rw) : ORIGIN = 0x70000000, LENGTH = 0x00200000
}
Specifically, I have a number of different memory devices with different characteristics (TCM, plain RAM (with a D-Cache in the way), and an external SDRAM), all mapped as part of the same address space.
I'm specifically placing different variables in the different memory spaces, depending on the requirements (am I DMA'ing into it, do I have cache-coherence issues, do I expect to overflow the D-cache, etc...).
If I exceed any one of the sections, I get a linker error. However, unless I do so, the linker only prints the memory usage as bulk percentage:
Program Memory Usage : 33608 bytes 1.6 % Full
Data Memory Usage : 2267792 bytes 91.1 % Full
Given that I have 3 actively used memory spaces, and I know for a fact that I'm using 100% of one of them (the SDRAM), it's kind of a useless output.
Is there any way to make the linker output the percentage of use for each memory space individually? Right now, I have to manually open the .map file, search for the section header, and then manually subtract the size from the total available memory specified in the .ld file.
While this is kind of a minor thing, it'd sure be nice to just have the linker do:
Program Memory Usage : 33608 bytes 1.6 % Full
Data Memory Usage : 2267792 bytes 91.1 % Full
data_tcm : xxx bytes xx % Full
ram : xxx bytes xx % Full
sdram : xxx bytes xx % Full
This is with GCC-ARM, and therefore GCC-LD.
Arrrgh, so of course, I find the answer right after asking the question:
--print-memory-usage
Used as -Wl,--print-memory-usage, you get the following:
Memory region Used Size Region Size %age Used
rom: 31284 B 2 MB 1.49%
data_tcm: 26224 B 32 KB 80.03%
prog_tcm: 0 GB 32 KB 0.00%
ram: 146744 B 320 KB 44.78%
sdram: 2 MB 2 MB 100.00%
I need to know the amount of heap space allocated to a process. In theory, calling malloc(1) once (in a C program) gives me a start address at the beginning of the heap, and sbrk(0) is a system call which returns the end of the heap space allocated to that process. Below is my test code.
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#ifndef ALLOCATE_BYTES
#define ALLOCATE_BYTES 0x1000000
#endif /* #ifndef ALLOCATE_BYTES */
void subroutine (void);
int main()
{
void *init_ptr = malloc((size_t) 1);
void *end_ptr = sbrk((intptr_t) 0);
printf(" Total Heap space = %lld B = %lld kB \n",
(long long) (init_ptr - end_ptr),
(long long) (init_ptr - end_ptr)/1024);
subroutine();
end_ptr = sbrk((intptr_t) 0); /* New edit 1 */
printf(" Total Heap space = %lld B = %lld kB \n",
(long long) (init_ptr - end_ptr),
(long long) (init_ptr - end_ptr)/1024);
return 0;
}
void subroutine (void)
{
void *ptr;
long long count = 0;
size_t size = ALLOCATE_BYTES;
size_t size_p = ALLOCATE_BYTES + 1;
long long mem_alocated = 0;
printf(" start rate to %zu \n", size);
while (1)
{
ptr = malloc((size_t)size);
if (ptr != NULL)
{
mem_alocated += size;
count++;
}
if ((ptr == NULL) && (size == size_p))
{
size/=2;
printf("overflow --> reduced Bytes SIZE %*zu & "
"current count = %lld ---> total bytes %lld \n"
, 7, size, count, mem_alocated);
}
if ((ptr == NULL) && (size == 1))
{
printf("overflow....!! at %lld for %lld bytes\n",
count, count * ALLOCATE_BYTES);
break;
}
size_p = size;
}
}
The following is the results:
$ gcc -o exmpl_heap_consume exmpl_heap_consume.c -DALLOCATE_BYTES=0x10000
$ ./exmpl_heap_consume
Total Heap space = -135160 B = -131 kB
start rate to 65536
overflow --> reduced Bytes SIZE 32768 & current count = 48792 ---> total bytes 3197632512
overflow --> reduced Bytes SIZE 16384 & current count = 49084 ---> total bytes 3207200768
overflow --> reduced Bytes SIZE 8192 & current count = 49371 ---> total bytes 3211902976
overflow --> reduced Bytes SIZE 4096 & current count = 49658 ---> total bytes 3214254080
overflow --> reduced Bytes SIZE 2048 & current count = 49945 ---> total bytes 3215429632
overflow --> reduced Bytes SIZE 1024 & current count = 50233 ---> total bytes 3216019456
overflow --> reduced Bytes SIZE 512 & current count = 50521 ---> total bytes 3216314368
overflow --> reduced Bytes SIZE 256 & current count = 50809 ---> total bytes 3216461824
overflow --> reduced Bytes SIZE 128 & current count = 51098 ---> total bytes 3216535808
overflow --> reduced Bytes SIZE 64 & current count = 51100 ---> total bytes 3216536064
overflow --> reduced Bytes SIZE 32 & current count = 51100 ---> total bytes 3216536064
overflow --> reduced Bytes SIZE 16 & current count = 51387 ---> total bytes 3216545248
overflow --> reduced Bytes SIZE 8 & current count = 51389 ---> total bytes 3216545280
overflow --> reduced Bytes SIZE 4 & current count = 51676 ---> total bytes 3216547576
overflow --> reduced Bytes SIZE 2 & current count = 51676 ---> total bytes 3216547576
overflow --> reduced Bytes SIZE 1 & current count = 51676 ---> total bytes 3216547576
overflow....!! at 51676 for 3386638336 bytes
Total Heap space = -135160 B = -131 kB
This result says that, in theory, I have 135,160 bytes of memory for the heap. So I start consuming all of it, until the repeated malloc() call returns NULL, keeping track of how many bytes my program has consumed along the way.
But here is the question: my theoretical heap space (135,160 bytes) does not match my practical count (3,386,638,336 bytes). Am I missing anything?
Edit 2:
I added some checks for the pointer returned by the malloc() call and aggregated the sizes to see the total. I observed that the total bytes allocated is just less than my theoretical heap space. This would suggest two things: malloc() is internally not calling sbrk(), and secondly, it is allocating memory elsewhere. Am I right up to this point, or is anything missing here?
You've completely misunderstood brk/sbrk. sbrk(0) tells you the location of the current program break, aka the end of the data segment. When malloc runs out of space, it calls sbrk with a positive offset to resize the data segment, thus moving the program break further. The Linux man page for sbrk clarifies this:
DESCRIPTION
brk() and sbrk() change the location of the program break, which
defines the end of the process's data segment (i.e., the program break
is the first location after the end of the uninitialized data seg‐
ment). Increasing the program break has the effect of allocating mem‐
ory to the process; decreasing the break deallocates memory.
brk() sets the end of the data segment to the value specified by addr,
when that value is reasonable, the system has enough memory, and the
process does not exceed its maximum data size (see setrlimit(2)).
sbrk() increments the program's data space by increment bytes. Call‐
ing sbrk() with an increment of 0 can be used to find the current
location of the program break.
(emphasis mine)
Furthermore, on modern malloc implementations, allocations of sufficient size (say >128 kB) are usually allocated directly with mmap, outside of the data segment. I'd test whether the pointer returned from malloc(65536) is even between an address in the initialized data segment (do a global int foo = 42;, then &foo should be before the .bss/uninitialized data) and the current program break.
I think what you are not remembering is that your "practical count" is the available RAM on your system, which is used for multiple things. For example, I modified your program to:
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#ifndef ALLOCATE_BYTES
#define ALLOCATE_BYTES 0x1000000
#endif /* #ifndef ALLOCATE_BYTES */
int main()
{
void *init_ptr = malloc((size_t) 1);
void *end_ptr = sbrk((intptr_t) 0);
printf("\nmalloc returned: %p\n", init_ptr);
printf("sbrk returned: %p\n", end_ptr);
printf(" Total Heap space = %lld B = %lld kB \n",
(long long) ((char *) end_ptr - (char *) init_ptr),
(long long) ((char *) end_ptr - (char *) init_ptr)/1024);
sleep(300);
return 0;
}
I compiled the above using gcc (version 4.8.4) on Ubuntu 14.04 as: gcc -std=c11 -pedantic -Wall temp.c -o temp; then I ran the following commands at the command prompt:
******#crossbow:~/junk$ ./temp &
[2] 2069
******#crossbow:~/junk$
malloc returned: 0x01cac010
sbrk returned: 0x01ccd000
Total Heap space = 135152 B = 131 kB
******#crossbow:ps
PID TTY TIME CMD
1681 pts/0 00:00:00 bash
1775 pts/0 00:00:15 emacs
2069 pts/0 00:00:00 temp
2070 pts/0 00:00:00 ps
******#crossbow:~/junk$ cat /proc/2069/maps
00400000-00401000 r-xp 00000000 fc:02 10616872 /home/******/junk/temp
00600000-00601000 r--p 00000000 fc:02 10616872 /home/******/junk/temp
00601000-00602000 rw-p 00001000 fc:02 10616872 /home/******/junk/temp
01cac000-01ccd000 rw-p 00000000 00:00 0 [heap]
7f16a2e44000-7f16a2fff000 r-xp 00000000 fc:02 70779581 /lib/x86_64-linux-gnu/libc-2.19.so
7f16a2fff000-7f16a31fe000 ---p 001bb000 fc:02 70779581 /lib/x86_64-linux-gnu/libc-2.19.so
7f16a31fe000-7f16a3202000 r--p 001ba000 fc:02 70779581 /lib/x86_64-linux-gnu/libc-2.19.so
7f16a3202000-7f16a3204000 rw-p 001be000 fc:02 70779581 /lib/x86_64-linux-gnu/libc-2.19.so
7f16a3204000-7f16a3209000 rw-p 00000000 00:00 0
7f16a3209000-7f16a322c000 r-xp 00000000 fc:02 70779574 /lib/x86_64-linux-gnu/ld-2.19.so
7f16a3407000-7f16a340a000 rw-p 00000000 00:00 0
7f16a3428000-7f16a342b000 rw-p 00000000 00:00 0
7f16a342b000-7f16a342c000 r--p 00022000 fc:02 70779574 /lib/x86_64-linux-gnu/ld-2.19.so
7f16a342c000-7f16a342d000 rw-p 00023000 fc:02 70779574 /lib/x86_64-linux-gnu/ld-2.19.so
7f16a342d000-7f16a342e000 rw-p 00000000 00:00 0
7ffcee9ab000-7ffcee9cc000 rw-p 00000000 00:00 0 [stack]
7ffcee9ef000-7ffcee9f1000 r-xp 00000000 00:00 0 [vdso]
7ffcee9f1000-7ffcee9f3000 r--p 00000000 00:00 0 [vvar]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Now, from this you can see that currently the heap starts at 0x01cac000 and runs to 0x01ccd000. Notice that malloc does not allocate memory starting from the very beginning of the heap.
Also notice that the heap can grow upwards until it bumps into the area where the dynamic libraries are loaded (in this case, at 0x7f16a2e44000).
From the memory map listing above, you can also see where the stack is located, if you have any interest in that, as well as where various parts of your program are loaded. You can determine what those blocks are by using the readelf utility, and you can see:
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000040 0x0000000000400040 0x0000000000400040 0x0001f8 0x0001f8 R E 0x8
INTERP 0x000238 0x0000000000400238 0x0000000000400238 0x00001c 0x00001c R 0x1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x0008d4 0x0008d4 R E 0x200000
LOAD 0x000e10 0x0000000000600e10 0x0000000000600e10 0x000248 0x000250 RW 0x200000
DYNAMIC 0x000e28 0x0000000000600e28 0x0000000000600e28 0x0001d0 0x0001d0 RW 0x8
NOTE 0x000254 0x0000000000400254 0x0000000000400254 0x000044 0x000044 R 0x4
GNU_EH_FRAME 0x0007a8 0x00000000004007a8 0x00000000004007a8 0x000034 0x000034 R 0x4
GNU_STACK 0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW 0x10
GNU_RELRO 0x000e10 0x0000000000600e10 0x0000000000600e10 0x0001f0 0x0001f0 R 0x1
From looking at this, we can see that the area in memory from 0x00400000 to 0x00401000 is occupied by the contents of the file described by the second program header. This segment contains the following program sections: .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame. Probably the most important for you would be the .text section, which contains your code. In fact, the program's entry point is located at 0x00400520 (as determined by readelf -h).
Looking at the next area in memory from 0x00601000 to 0x00602000 and referring to the above, we can see that this corresponds to the section of the file described by the third program header. This segment contains the following program sections: .init_array .fini_array .jcr .dynamic .got .got.plt .data .bss.
Context: I am currently learning how to properly use CUDA, in particular how to generate random numbers using CURAND. I learned here that it might be wise to generate my random numbers directly when I need them, inside the kernel which performs the core calculation in my code.
Following the documentation, I decided to play a bit and try to come up with a simple running piece of code which I can later adapt to my needs.
I excluded MTGP32 because of the limit of 256 concurrent threads in a block (and just 200 pre-generated parameter sets). Besides, I do not want to use doubles, so I decided to stick to the default generator (XORWOW).
Problem: I am having a hard time understanding why the same seed value in my code is generating different sequences of numbers for a number of threads per block bigger than 128 (when blockSize<129, everything runs as I would expect). After doing proper CUDA error checking, as suggested by Robert in his comment, it is somewhat clear that hardware limitations play a role. Moreover, not using the "-G -g" flags at compile time raises the threshold for trouble from 128 to 384.
Questions: What exactly is causing this? Robert wrote in his comment that "it might be a registers per thread issue". What does this mean? Is there an easy way to look at the hardware specs and say where this limit will be? Can I get around this issue without having to generate more random numbers per thread?
A related issue seems to have been discussed here but I do not think it applies to my case.
My code (see below) was mostly inspired by these examples.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand_kernel.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
if (code != cudaSuccess){
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void setup_kernel(curandState *state, int seed, int n){
int id = threadIdx.x + blockIdx.x*blockDim.x;
if(id<n){
curand_init(seed, id, 0, &state[id]);
}
}
__global__ void generate_uniform_kernel(curandState *state, float *result, int n){
int id = threadIdx.x + blockIdx.x*blockDim.x;
float x;
if(id<n){
curandState localState = state[id];
x = curand_uniform(&localState);
state[id] = localState;
result[id] = x;
}
}
int main(int argc, char *argv[]){
curandState *devStates;
float *devResults, *hostResults;
int n = atoi(argv[1]);
int s = atoi(argv[2]);
int blockSize = atoi(argv[3]);
int nBlocks = n/blockSize + (n%blockSize == 0?0:1);
printf("\nn: %d, blockSize: %d, nBlocks: %d, seed: %d\n", n, blockSize, nBlocks, s);
hostResults = (float *)calloc(n, sizeof(float));
cudaMalloc((void **)&devResults, n*sizeof(float));
cudaMalloc((void **)&devStates, n*sizeof(curandState));
setup_kernel<<<nBlocks, blockSize>>>(devStates, s, n);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
generate_uniform_kernel<<<nBlocks, blockSize>>>(devStates, devResults, n);
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
cudaMemcpy(hostResults, devResults, n*sizeof(float), cudaMemcpyDeviceToHost);
for(int i=0; i<n; i++) {
printf("\n%10.13f", hostResults[i]);
}
cudaFree(devStates);
cudaFree(devResults);
free(hostResults);
return 0;
}
I compiled two binaries, one using the "-G -g" debugging flags and the other without. I named them rng_gen_d and rng_gen, respectively:
$ nvcc -lcuda -lcurand -O3 -G -g --ptxas-options=-v rng_gen.cu -o rng_gen_d
ptxas /tmp/tmpxft_00002257_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
ptxas info : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
ptxas info : Used 43 registers, 32 bytes smem, 72 bytes cmem[1], 6480 bytes lmem
ptxas info : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
ptxas info : Used 10 registers, 36 bytes smem, 40 bytes cmem[1], 48 bytes lmem
$ nvcc -lcuda -lcurand -O3 --ptxas-options=-v rng_gen.cu -o rng_gen
ptxas /tmp/tmpxft_00002b73_00000000-5_rng_gen.ptx, line 533; warning : Double is not supported. Demoting to float
ptxas info : 77696 bytes gmem, 72 bytes cmem[0], 32 bytes cmem[14]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWii' for 'sm_10'
ptxas info : Used 20 registers, 32 bytes smem, 48 bytes cmem[1], 6440 bytes lmem
ptxas info : Compiling entry function '_Z23generate_uniform_kernelP17curandStateXORWOWPfi' for 'sm_10'
ptxas info : Used 19 registers, 36 bytes smem, 4 bytes cmem[1]
To start with, there is a strange warning message at compile time (see above):
ptxas /tmp/tmpxft_00002b31_00000000-5_rng_gen.ptx, line 2143; warning : Double is not supported. Demoting to float
Some debugging showed that the line causing this warning is:
curandState localState = state[id];
There are no doubles declared, so I do not know exactly how to solve this (or even if this needs solving).
Now, an example of the (actual) problem I am facing:
$ ./rng_gen_d 5 314 127
n: 5, blockSize: 127, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen_d 5 314 128
n: 5, blockSize: 128, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen_d 5 314 129
n: 5, blockSize: 129, nBlocks: 1, seed: 314
GPUassert: too many resources requested for launch rng_gen.cu 54
Line 54 is gpuErrchk() right after setup_kernel().
With the other binary (no "-G -g" flags at compile time), the "threshold for trouble" is raised to 384:
$ ./rng_gen 5 314 129
n: 5, blockSize: 129, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen 5 314 384
n: 5, blockSize: 384, nBlocks: 1, seed: 314
0.9151657223701
0.3925153017044
0.7007563710213
0.8806988000870
0.5301177501678
$ ./rng_gen 5 314 385
n: 5, blockSize: 385, nBlocks: 1, seed: 314
GPUassert: too many resources requested for launch rng_gen.cu 54
Finally, should this be somehow related to the hardware I am using for this preliminary testing (the project will be later launched on a much more powerful machine), here are the specs of the card I am using:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro NVS 160M"
CUDA Driver Version / Runtime Version 5.5 / 5.5
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 256 MBytes (268107776 bytes)
( 1) Multiprocessors, ( 8) CUDA Cores/MP: 8 CUDA Cores
GPU Clock rate: 1450 MHz (1.45 GHz)
Memory Clock rate: 702 Mhz
Memory Bus Width: 64-bit
Maximum Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(8192), 512 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(8192, 8192), 512 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Max dimension size of a thread block (x,y,z): (512, 512, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 1)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: No with 0 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Quadro NVS 160M
Result = PASS
And this is it. Any guidance on this matter will be most welcome. Thanks!
EDIT:
1) Added proper cuda error checking, as suggested by Robert.
2) Deleted the cudaMemset line, which was useless anyway.
3) Compiled and ran the code without the "-G -g" flags.
4) Updated the output accordingly.
First of all, when you're having trouble with CUDA code, it's always advisable to do proper cuda error checking. It will eliminate a certain amount of head scratching, probably save you some time, and will certainly improve the ability of folks to help you on sites like this one.
Now you've discovered you have a registers-per-thread issue. The compiler, while generating code, will use registers for various purposes. Each thread requires this complement of registers to run its thread code. When you attempt to launch a kernel, one of the requirements that must be met is that the number of registers required per thread times the number of requested threads in the launch must not exceed the total number of registers available per block. Note that the number of registers required per thread may have to be rounded up to some granular allocation increment. Also note that the number of threads requested will normally be rounded up to the next higher multiple of 32 (if not evenly divisible by 32), as threads are launched in warps of 32. Also note that the max registers per block varies by compute capability, and this quantity can be inspected via the deviceQuery sample, as you've shown. And, as you've discovered, certain command line switches like -G can affect how nvcc utilizes registers.
To get advance notice of these types of resource issues, you can compile your code with additional command line switches:
nvcc -arch=sm_11 -Xptxas=-v -o mycode mycode.cu
The -Xptxas=-v switch will generate resource usage output from the ptxas assembler (which converts intermediate ptx code to sass assembly code, i.e. machine code), including registers required per thread. Note that the output is delivered per kernel, as each kernel may have its own requirements. You can get more info about the nvcc compiler in the documentation.
As a crude workaround, you can specify a switch at compile time to limit all kernel compilation to a max register usage number:
nvcc -arch=sm_11 -Xptxas=-v -maxrregcount=16 -o mycode mycode.cu
This would limit each kernel to using no more than 16 registers per thread. When multiplied by 512 (the hardware limit of threads per block for a cc1.x device) this yields a value of 8192, which is the hardware limit on total registers per threadblock for your device.
However the above method is crude in that it applies the same limit to all kernels in your program. If you wanted to tailor this to each kernel launch (for example if different kernels in your program were launching different numbers of threads) you could use the launch bounds methodology, which is described here.
On my machine Time A and Time B swap depending on whether A is defined or not (which changes the order in which the two callocs are called). I initially attributed this to the paging system. Weirdly, when mmap is used instead of calloc, the situation is even more bizarre: both loops take the same amount of time, as expected. As can be seen with strace, the callocs ultimately result in two mmaps, so there is no return-already-allocated-memory magic going on.
I'm running Debian testing on an Intel i7.
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#define SIZE 500002816
#ifndef USE_MMAP
#define ALLOC calloc
#else
#define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#endif
int main() {
clock_t start, finish;
#ifdef A
int *arr1 = ALLOC(sizeof(int), SIZE);
int *arr2 = ALLOC(sizeof(int), SIZE);
#else
int *arr2 = ALLOC(sizeof(int), SIZE);
int *arr1 = ALLOC(sizeof(int), SIZE);
#endif
int i;
start = clock();
{
for (i = 0; i < SIZE; i++)
arr1[i] = (i + 13) * 5;
}
finish = clock();
printf("Time A: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);
start = clock();
{
for (i = 0; i < SIZE; i++)
arr2[i] = (i + 13) * 5;
}
finish = clock();
printf("Time B: %.2f\n", ((double)(finish - start))/CLOCKS_PER_SEC);
return 0;
}
The output I get:
~/directory $ cc -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.94
Time B: 0.34
~/directory $ cc -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.34
Time B: 0.90
~/directory $ cc -DUSE_MMAP -DA -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.89
Time B: 0.90
~/directory $ cc -DUSE_MMAP -Wall -O3 bench-loop.c -o bench-loop
~/directory $ ./bench-loop
Time A: 0.91
Time B: 0.92
You should also test using malloc instead of calloc. One thing that calloc does is to fill the allocated memory with zeros.
I believe in your case that when you calloc arr1 last and then assign to it, it is already faulted into cache memory, since it was the last one allocated and zero-filled. When you calloc arr1 first and arr2 second, then the zero-fill of arr2 pushes arr1 out of cache.
Guess I could have written more, or less, especially as less is more.
The reason can differ from system to system. However, for the C library: the total time used for each operation is the other way around if you time the calloc plus the iteration.
I.e.:
Calloc arr1 : 0.494992654
Calloc arr2 : 0.000021250
Itr arr1 : 0.430646035
Itr arr2 : 0.790992411
Sum arr1 : 0.925638689
Sum arr2 : 0.791013661
Calloc arr1 : 0.503130736
Calloc arr2 : 0.000025906
Itr arr1 : 0.427719162
Itr arr2 : 0.809686047
Sum arr1 : 0.930849898
Sum arr2 : 0.809711953
The first calloc (and the malloc underneath it) has a longer execution time than the second. A call such as malloc(0) before any calloc etc. evens out the time used for malloc-like calls in the same process (explanation below). One can, however, see a slight decline in time for these calls if one does several in a row. The iteration time, on the other hand, will flatten out.
So, in short: the total system time used is highest for whichever gets alloc'ed first. This is, however, an overhead that can't be escaped within the confines of a process.
There is a lot of maintenance going on. A quick touch on some of the cases:
Short on pages
When a process requests memory, it is served a virtual address range. This range is translated to physical memory by a page table. If pages were translated byte by byte, we would quickly get huge page tables. This, for one, is a reason why memory ranges are served in chunks - or pages. The page size is system dependent. The architecture can also provide various page sizes.
If we look at the execution of the above code and add some reads from /proc/PID/stat, we see this in action (note especially RSS):
PID Stat {
PID : 4830 Process ID
MINFLT : 214 Minor faults, (no page memory read)
UTIME : 0 Time user mode
STIME : 0 Time kernel mode
VSIZE : 2039808 Virtual memory size, bytes
RSS : 73 Resident Set Size, Number of pages in real memory
} : Init
PID Stat {
PID : 4830 Process ID
MINFLT : 51504 Minor faults, (no page memory read)
UTIME : 4 Time user mode
STIME : 33 Time kernel mode
VSIZE : 212135936 Virtual memory size, bytes
RSS : 51420 Resident Set Size, Number of pages in real memory
} : Post calloc arr1
PID Stat {
PID : 4830 Process ID
MINFLT : 51515 Minor faults, (no page memory read)
UTIME : 4 Time user mode
STIME : 33 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 51428 Resident Set Size, Number of pages in real memory
} : Post calloc arr2
PID Stat {
PID : 4830 Process ID
MINFLT : 51516 Minor faults, (no page memory read)
UTIME : 36 Time user mode
STIME : 33 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 51431 Resident Set Size, Number of pages in real memory
} : Post iteration arr1
PID Stat {
PID : 4830 Process ID
MINFLT : 102775 Minor faults, (no page memory read)
UTIME : 68 Time user mode
STIME : 58 Time kernel mode
VSIZE : 422092800 Virtual memory size, bytes
RSS : 102646 Resident Set Size, Number of pages in real memory
} : Post iteration arr2
PID Stat {
PID : 4830 Process ID
MINFLT : 102776 Minor faults, (no page memory read)
UTIME : 68 Time user mode
STIME : 69 Time kernel mode
VSIZE : 2179072 Virtual memory size, bytes
RSS : 171 Resident Set Size, Number of pages in real memory
} : Post free()
As we can see, the actual allocation of pages in memory is postponed for arr2, awaiting page requests; this lasts until iteration begins. If we add a malloc(0) before the calloc of arr1, we can register that neither array is allocated in physical memory before iteration.
As a page might never be used, it is more efficient to do the mapping on request. This is why, when the process e.g. does a calloc, a sufficient number of pages is reserved, but not necessarily actually allocated in real memory.
When an address is referenced, the page table is consulted. If the address is in a page which is not allocated, the system serves a page fault and the page is subsequently allocated. The total sum of allocated pages is called the Resident Set Size (RSS).
We can do an experiment with our array by iterating over (touching) e.g. 1/4 of it. Here I have also added a malloc(0) before any calloc.
Pre iteration 1/4:
RSS : 171 Resident Set Size, Number of pages in real memory
for (i = 0; i < SIZE / 4; ++i)
arr1[i] = 0;
Post iteration 1/4:
RSS : 12967 Resident Set Size, Number of pages in real memory
Post iteration 1/1:
RSS : 51134 Resident Set Size, Number of pages in real memory
To further speed up things most systems additionally cache the N most recent
page table entries in a translation lookaside buffer (TLB).
brk, mmap
When a process (c|m|…)allocs, the upper bound of the heap is expanded by brk() or sbrk(). These system calls are expensive, and to compensate for this, malloc collects multiple smaller calls into one bigger brk().
This also affects free(): as a negative brk() is also resource expensive, such calls are collected and performed as one bigger operation.
For huge requests, like the one in your code, malloc() uses mmap(). The threshold for this, which is configurable by mallopt(), is an educated value.
We can have fun with this by modifying the SIZE in your code. If we include malloc.h and use
struct mallinfo minf = mallinfo();
we can show this (note Arena and Hblkhd, …):
Initial:
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : Initial
MAX = ((128 * 1024) / sizeof(int))
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 1 (Number of chunks allocated with mmap)
Hblkhd : 135168 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : After malloc arr1
mallinfo {
Arena : 0 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 2 (Number of chunks allocated with mmap)
Hblkhd : 270336 (Bytes allocated with mmap)
Uordblks: 0 (Memory occupied by chunks handed out by malloc)
Fordblks: 0 (Memory occupied by free chunks)
Keepcost: 0 (Size of the top-most releasable chunk)
} : After malloc arr2
Then we subtract sizeof(int) from MAX and get:
mallinfo {
Arena : 266240 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 131064 (Memory occupied by chunks handed out by malloc)
Fordblks: 135176 (Memory occupied by free chunks)
Keepcost: 135176 (Size of the top-most releasable chunk)
} : After malloc arr1
mallinfo {
Arena : 266240 (Bytes of memory allocated with sbrk by malloc)
Ordblks : 1 (Number of chunks not in use)
Hblks : 0 (Number of chunks allocated with mmap)
Hblkhd : 0 (Bytes allocated with mmap)
Uordblks: 262128 (Memory occupied by chunks handed out by malloc)
Fordblks: 4112 (Memory occupied by free chunks)
Keepcost: 4112 (Size of the top-most releasable chunk)
} : After malloc arr2
We register that the system works as advertised: if the size of the allocation is below the threshold, sbrk is used and the memory is handled internally by malloc; otherwise mmap is used.
The structure of this also helps prevent fragmentation of memory, etc.
The point being that the malloc family is optimized for general usage. However, the mmap limits can be modified to meet special needs.
Note this (and down through 100+ lines of it) when/if modifying the mmap threshold.
This can be further observed if we fill (touch) every page of arr1 and arr2
before we do the timing:
Touch pages … (Here with page size of 4 kB)
for (i = 0; i < SIZE; i += 4096 / sizeof(int)) {
arr1[i] = 0;
arr2[i] = 0;
}
Itr arr1 : 0.312462317
CPU arr1 : 0.32
Itr arr2 : 0.312869158
CPU arr2 : 0.31
Also see:
Synopsis of compile-time options
Vital statistics
… actually the entire file is a nice read.
Sub notes:
So, the CPU knows the physical address then? Nah.
In the world of memory a lot has to be addressed ;). A core piece of hardware for this is the memory management unit (MMU), either an integrated part of the CPU or an external chip.
The operating system configures the MMU at boot and defines access for various regions (read only, read-write, etc.), thus giving a level of security.
The address we as mortals see is the logical address that the CPU uses. The MMU translates this to a physical address.
The CPU's address consists of two parts: a page address and an offset.
[PAGE_ADDRESS.OFFSET]
And the process of getting a physical address looks something like:
.-----. .--------------.
| CPU > --- Request page 2 ----> | MMU |
+-----+ | Pg 2 == Pg 4 |
| +------v-------+
+--Request offset 1 -+ |
| (Logical page 2 EQ Physical page 4)
[ ... ] __ | |
[ OFFSET 0 ] | | |
[ OFFSET 1 ] | | |
[ OFFSET 2 ] | | |
[ OFFSET 3 ] +--- Page 3 | |
[ OFFSET 4 ] | | |
[ OFFSET 5 ] | | |
[ OFFSET 6 ]__| ___________|____________+
[ OFFSET 0 ] | |
[ OFFSET 1 ] | ...........+
[ OFFSET 2 ] |
[ OFFSET 3 ] +--- Page 4
[ OFFSET 4 ] |
[ OFFSET 5 ] |
[ OFFSET 6 ]__|
[ ... ]
A CPU's logical address space is directly linked to the address length. A 32-bit address processor has a logical address space of 2^32 bytes.
The physical address space is how much memory the system can afford.
There is also the handling of fragmented memory, re-alignment etc.
This brings us into the world of swap files. If a process requests more memory than is physically available, one or several pages of other process(es) are transferred to disk/swap and their pages "stolen" by the requesting process. The MMU keeps track of this; thus the CPU doesn't have to worry about where the memory is actually located.
This further brings us on to dirty memory.
If we print some information from /proc/[pid]/smaps, more specifically the range of our arrays, we get something like:
Start:
b76f3000-b76f5000
Private_Dirty: 8 kB
Post calloc arr1:
aaeb8000-b76f5000
Private_Dirty: 12 kB
Post calloc arr2:
9e67c000-b76f5000
Private_Dirty: 20 kB
Post iterate 1/4 arr1
9e67b000-b76f5000
Private_Dirty: 51280 kB
Post iterate arr1:
9e67a000-b76f5000
Private_Dirty: 205060 kB
Post iterate arr2:
9e679000-b76f5000
Private_Dirty: 410096 kB
Post free:
9e679000-9e67d000
Private_Dirty: 16 kB
b76f2000-b76f5000
Private_Dirty: 12 kB
When a virtual page is created, the system typically clears a dirty bit in the page.
When the CPU writes to a part of this page, the dirty bit is set; thus, when swapping, pages with the dirty bit set are written out and clean pages are skipped.
Short Answer
The first time calloc is called, it explicitly zeroes out the memory. The next time it is called, it assumes that the memory returned from mmap is already zeroed out.
Details
Here's some of the things that I checked to come to this conclusion that you could try yourself if you wanted:
Insert a calloc call before your first ALLOC call. You will see that after this the times for Time A and Time B are the same.
Use the clock() function to check how long each of the ALLOC calls take. In the case where they are both using calloc you will see that the first call takes much longer than the second one.
Use time to time the execution time of the calloc version and the USE_MMAP version. When I did this I saw that the execution time for USE_MMAP was consistently slightly less.
I ran with strace -tt -T which shows both the time of when the system call was made and how long it took. Here is part of the output:
Strace output:
21:29:06.127536 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff806fd000 <0.000014>
21:29:07.778442 mmap(NULL, 2000015360, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fff093a0000 <0.000021>
21:29:07.778563 times({tms_utime=63, tms_stime=102, tms_cutime=0, tms_cstime=0}) = 4324241005 <0.000011>
You can see that the first mmap call took 0.000014 seconds, but that about 1.5 seconds elapsed before the next system call. Then the second mmap call took 0.000021 seconds, and was followed by the times call a few hundred microseconds later.
I also stepped through part of the application execution with gdb and saw that the first call to calloc resulted in numerous calls to memset while the second call to calloc did not make any calls to memset. You can see the source code for calloc here (look for __libc_calloc) if you are interested. As for why calloc is doing the memset on the first call but not subsequent ones I don't know. But I feel fairly confident that this explains the behavior you have asked about.
As for why the array that was zeroed with memset has improved performance, my guess is that it is because of values being loaded into the TLB rather than the cache, since it is a very large array. Regardless, the specific reason for the performance difference that you asked about is that the two calloc calls behave differently when they are executed.
It's just a matter of when the process memory image expands by a page.
Summary: The time difference is explained by analysing the time it takes to allocate the arrays. The array calloc'ed last takes just a bit more time, whereas the others (or all, when using mmap) take virtually no time. The real allocation in memory is probably deferred until first access.
I don't know enough about the internals of memory allocation on Linux. But I ran your program slightly modified: I've added a third array and some extra iterations per array operation. And I have taken into account the remark of Old Pro that the time to allocate the arrays was not taken into account.
Conclusion: Using calloc takes longer than using mmap for the allocation (mmap uses virtually no time when you allocate the memory; the real allocation is probably postponed until first access), and using my program there is almost no difference in the end between using mmap or calloc for the overall program execution.
Anyway, a first remark: both memory allocations happen in the memory mapping region and not in the heap. To verify this, I've added a quick n' dirty pause so you can check the memory mapping of the process (/proc/[pid]/maps).
Now to your question: the last array allocated with calloc seems to be really allocated in memory (not postponed), as arr1 and arr2 now behave exactly the same (the first iteration is slow, subsequent iterations are faster). Arr3 is faster for the first iteration because its memory was allocated earlier. When using the A macro, it is arr1 which benefits from this. My guess would be that the kernel preallocates the array in memory for the last calloc. Why? I don't know... I've also tested it with only one array (removing all occurrences of arr2 and arr3), and then I get the same time (roughly) for all 10 iterations of arr1.
Both malloc and mmap behave the same (results not shown below): the first iteration is slow, subsequent iterations are faster, for all 3 arrays.
Note: all results were coherent across the various gcc optimisation flags (-O0 to -O3), so it doesn't look like the root of the behaviour is derived from some kind of gcc optimisation.
Note2: Test run on Ubuntu Precise Pangolin (kernel 3.2), with GCC 4.6.3
#include <stdlib.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#define SIZE 500002816
#define ITERATION 10
#if defined(USE_MMAP)
# define ALLOC(a, b) (mmap(NULL, a * b, PROT_READ | PROT_WRITE, \
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
#elif defined(USE_MALLOC)
# define ALLOC(a, b) (malloc(b * a))
#elif defined(USE_CALLOC)
# define ALLOC calloc
#else
# error "No alloc routine specified"
#endif
int main() {
clock_t start, finish, gstart, gfinish;
start = clock();
gstart = start;
#ifdef A
unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
#else
unsigned int *arr3 = ALLOC(sizeof(unsigned int), SIZE);
unsigned int *arr2 = ALLOC(sizeof(unsigned int), SIZE);
unsigned int *arr1 = ALLOC(sizeof(unsigned int), SIZE);
#endif
finish = clock();
unsigned int i, j;
double intermed, finalres;
intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
printf("Time to create: %.2f\n", intermed);
printf("arr1 addr: %p\narr2 addr: %p\narr3 addr: %p\n", arr1, arr2, arr3);
finalres = 0;
for (j = 0; j < ITERATION; j++)
{
start = clock();
{
for (i = 0; i < SIZE; i++)
arr1[i] = (i + 13) * 5;
}
finish = clock();
intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
finalres += intermed;
printf("Time A: %.2f\n", intermed);
}
printf("Time A (average): %.2f\n", finalres/ITERATION);
finalres = 0;
for (j = 0; j < ITERATION; j++)
{
start = clock();
{
for (i = 0; i < SIZE; i++)
arr2[i] = (i + 13) * 5;
}
finish = clock();
intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
finalres += intermed;
printf("Time B: %.2f\n", intermed);
}
printf("Time B (average): %.2f\n", finalres/ITERATION);
finalres = 0;
for (j = 0; j < ITERATION; j++)
{
start = clock();
{
for (i = 0; i < SIZE; i++)
arr3[i] = (i + 13) * 5;
}
finish = clock();
intermed = ((double)(finish - start))/CLOCKS_PER_SEC;
finalres += intermed;
printf("Time C: %.2f\n", intermed);
}
printf("Time C (average): %.2f\n", finalres/ITERATION);
gfinish = clock();
intermed = ((double)(gfinish - gstart))/CLOCKS_PER_SEC;
printf("Global Time: %.2f\n", intermed);
return 0;
}
Results:
Using USE_CALLOC
Time to create: 0.13
arr1 addr: 0x7fabcb4a6000
arr2 addr: 0x7fabe917d000
arr3 addr: 0x7fac06e54000
Time A: 0.67
Time A: 0.48
...
Time A: 0.47
Time A (average): 0.48
Time B: 0.63
Time B: 0.47
...
Time B: 0.48
Time B (average): 0.48
Time C: 0.45
...
Time C: 0.46
Time C (average): 0.46
With USE_CALLOC and A
Time to create: 0.13
arr1 addr: 0x7fc2fa206010
arr2 addr: 0x7fc2dc52e010
arr3 addr: 0x7fc2be856010
Time A: 0.44
...
Time A: 0.43
Time A (average): 0.45
Time B: 0.65
Time B: 0.47
...
Time B: 0.46
Time B (average): 0.48
Time C: 0.65
Time C: 0.48
...
Time C: 0.45
Time C (average): 0.48
Using USE_MMAP
Time to create: 0.0
arr1 addr: 0x7fe6332b7000
arr2 addr: 0x7fe650f8e000
arr3 addr: 0x7fe66ec65000
Time A: 0.55
Time A: 0.48
...
Time A: 0.45
Time A (average): 0.49
Time B: 0.54
Time B: 0.46
...
Time B: 0.49
Time B (average): 0.50
Time C: 0.57
...
Time C: 0.40
Time C (average): 0.43