I am working on a NUMA machine. It has two nodes with 16 GB of RAM on each node. While running a large program, I used both htop and numactl --hardware to observe memory consumption, but I got two different results.
htop showed that my program consumed around 20 GB of memory in total, while numactl --hardware showed that almost 32 GB was used. So which one is correct? Or does numactl --hardware report something other than actual resident memory?
numactl --hardware memory output comes from the numa_node_size64() function in libnuma, which in turn gets the info from the MemTotal and MemFree values in /sys/devices/system/node/node%d/meminfo.
Assuming you are on Linux, you might try cat /sys/devices/system/node/node0/meminfo (and the same for node1) to see much more detailed memory info. You should be able to correlate some of those values with your htop output. If that does not help, one would have to look at the kernel source to see how the MemFree value is derived.
Here's sample output from my single-node system. You see there's a lot of information:
Node 0 MemTotal: 7069704 kB
Node 0 MemFree: 4099480 kB
Node 0 MemUsed: 2970224 kB
Node 0 Active: 1677108 kB
Node 0 Inactive: 934216 kB
Node 0 Active(anon): 1056284 kB
Node 0 Inactive(anon): 46232 kB
Node 0 Active(file): 620824 kB
Node 0 Inactive(file): 887984 kB
Node 0 Unevictable: 16 kB
Node 0 Mlocked: 16 kB
Node 0 Dirty: 220 kB
Node 0 Writeback: 0 kB
Node 0 FilePages: 1556076 kB
Node 0 Mapped: 249100 kB
Node 0 AnonPages: 1055236 kB
Node 0 Shmem: 47276 kB
Node 0 KernelStack: 3712 kB
Node 0 PageTables: 33648 kB
Node 0 NFS_Unstable: 0 kB
Node 0 Bounce: 0 kB
Node 0 WritebackTmp: 0 kB
Node 0 Slab: 218156 kB
Node 0 SReclaimable: 168548 kB
Node 0 SUnreclaim: 49608 kB
Node 0 AnonHugePages: 0 kB
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 0 HugePages_Surp: 0
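To correlate these values programmatically, a line like the ones above can be pulled apart with sscanf. This is just a sketch; parse_node_meminfo_line is a hypothetical helper, not a real API:

```c
#include <stdio.h>
#include <string.h>

/* Extract one field from a /sys/devices/system/node/nodeN/meminfo line.
 * Lines there look like: "Node 0 MemFree:        4099480 kB"
 * Returns 1 and stores the kB value when the field name matches. */
int parse_node_meminfo_line(const char *line, const char *field,
                            long long *kb)
{
    int node;
    char name[64];
    long long value;

    if (sscanf(line, "Node %d %63[^:]: %lld kB", &node, name, &value) != 3)
        return 0;
    if (strcmp(name, field) != 0)
        return 0;
    *kb = value;
    return 1;
}
```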
It turns out that numactl --hardware counts page-cache memory as "used memory", not as "free memory". That's why it shows much more memory consumption than htop does.
A good read:
http://www.linuxatemyram.com/
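The node0 numbers above bear this out. Here is a minimal sketch of the two views; the second function is a rough approximation of my own (the kernel's MemAvailable heuristic is more involved than this simple sum):

```c
/* "Used" the way numactl --hardware reports it:
 * MemTotal - MemFree, which still counts reclaimable page cache. */
long long node_used_kb(long long mem_total_kb, long long mem_free_kb)
{
    return mem_total_kb - mem_free_kb;
}

/* Rough estimate of memory actually available to applications:
 * free pages plus page cache the kernel can drop on demand.
 * (Hypothetical approximation; the kernel's MemAvailable is smarter.) */
long long approx_available_kb(long long mem_free_kb,
                              long long file_pages_kb,
                              long long sreclaimable_kb)
{
    return mem_free_kb + file_pages_kb + sreclaimable_kb;
}
```

With the sample output above: used = 7069704 - 4099480 = 2970224 kB, exactly the MemUsed line, even though a large part of that is cache.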
I'm trying to understand "how memory works". As far as I understand, when calling mmap to create a MAP_ANONYMOUS mapping, the OS (Linux in my case) creates:
mmap() creates a new mapping in the virtual address
space of the calling process
As far as I know, the virtual address space of a process may exceed the actual physical memory available.
Also, as far as I know, the actual mapping to physical memory occurs when the CPU triggers a page fault, i.e. when it tries to access a memory page that is not in the page table yet.
The OS catches the page fault and creates an entry in the page table.
What should happen if I mmapped some anonymous memory (but did not touch any of the pages), then other processes exhausted all the physical memory, and then I try to use one of the mmapped pages (I have swap disabled)?
The CPU should trigger a page fault and then try to create a page-table entry. But since no physical memory is left, it will not be able to do so...
Using mmap (MAP_ANONYMOUS) or malloc changes nothing in your case: if the kernel refuses the request (for instance because it exceeds the address space or the overcommit limit), mmap returns MAP_FAILED and malloc returns NULL.
If I use this program:
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int n;
    size_t len;
    void *m;

    if (argc < 2)
        return 1;
    n = atoi(argv[1]);              /* size in MiB, from the first argument */
    len = (size_t)n * 1024 * 1024;

    if (argc == 2) {                /* one argument: use mmap */
        m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED) {
            puts("ko");
            return 0;
        }
    }
    else {                          /* a second argument selects malloc */
        m = malloc(len);
        if (m == 0) {
            puts("ko");
            return 0;
        }
    }
    puts("ok");
    getchar();

    /* write one byte every 512 bytes so every page really gets allocated */
    char *p = (char *) m;
    char *sup = p + len;
    while (p < sup) {
        *p = 0;
        p += 512;
    }
    puts("done");
    getchar();
    return 0;
}
I am on a Raspberry Pi with 1 GB of memory and 100 MB of swap; the memory is already partly used by Chromium because I am on SO.
/proc/meminfo gives:
MemTotal: 949448 kB
MemFree: 295008 kB
MemAvailable: 633560 kB
Buffers: 39296 kB
Cached: 360372 kB
SwapCached: 0 kB
Active: 350416 kB
Inactive: 260960 kB
Active(anon): 191976 kB
Inactive(anon): 41908 kB
Active(file): 158440 kB
Inactive(file): 219052 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 102396 kB
SwapFree: 102396 kB
Dirty: 352 kB
Writeback: 0 kB
AnonPages: 211704 kB
Mapped: 215924 kB
Shmem: 42304 kB
Slab: 24528 kB
SReclaimable: 12108 kB
SUnreclaim: 12420 kB
KernelStack: 2128 kB
PageTables: 5676 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 577120 kB
Committed_AS: 1675164 kB
VmallocTotal: 1114112 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
CmaTotal: 8192 kB
CmaFree: 6796 kB
If I do this:
pi@raspberrypi:/tmp $ ./a.out 750
ko
750 is too large, but
pi@raspberrypi:/tmp $ ./a.out 600 &
[1] 1525
pi@raspberrypi:/tmp $ ok
The used memory (top, etc.) doesn't reflect the 600 MB because I do not read/write in it.
/proc/meminfo gives:
MemTotal: 949448 kB
MemFree: 282860 kB
MemAvailable: 626016 kB
Buffers: 39432 kB
Cached: 362860 kB
SwapCached: 0 kB
Active: 362696 kB
Inactive: 260580 kB
Active(anon): 199880 kB
Inactive(anon): 41392 kB
Active(file): 162816 kB
Inactive(file): 219188 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 102396 kB
SwapFree: 102396 kB
Dirty: 624 kB
Writeback: 0 kB
AnonPages: 220988 kB
Mapped: 215672 kB
Shmem: 41788 kB
Slab: 24788 kB
SReclaimable: 12296 kB
SUnreclaim: 12492 kB
KernelStack: 2136 kB
PageTables: 5692 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 577120 kB
Committed_AS: 2288564 kB
VmallocTotal: 1114112 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
CmaTotal: 8192 kB
CmaFree: 6796 kB
And I can do it again:
pi@raspberrypi:/tmp $ ./a.out 600 &
[2] 7088
pi@raspberrypi:/tmp $ ok
pi@raspberrypi:/tmp $ jobs
[1]- stopped ./a.out 600
[2]+ stopped ./a.out 600
pi@raspberrypi:/tmp $
Even though the total is now too large for memory + swap, /proc/meminfo gives:
MemTotal: 949448 kB
MemFree: 282532 kB
MemAvailable: 626112 kB
Buffers: 39432 kB
Cached: 359980 kB
SwapCached: 0 kB
Active: 365200 kB
Inactive: 257736 kB
Active(anon): 202280 kB
Inactive(anon): 38320 kB
Active(file): 162920 kB
Inactive(file): 219416 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 102396 kB
SwapFree: 102396 kB
Dirty: 52 kB
Writeback: 0 kB
AnonPages: 223520 kB
Mapped: 212600 kB
Shmem: 38716 kB
Slab: 24956 kB
SReclaimable: 12476 kB
SUnreclaim: 12480 kB
KernelStack: 2120 kB
PageTables: 5736 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 577120 kB
Committed_AS: 2876612 kB
VmallocTotal: 1114112 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
CmaTotal: 8192 kB
CmaFree: 6796 kB
If I write to the memory of %1 and then stop it, a lot of swapping to the flash happens:
pi@raspberrypi:/tmp $ %1
./a.out 600
done
^Z
[1]+ stopped ./a.out 600
Now there is almost no free swap and almost no free memory; /proc/meminfo gives:
MemTotal: 949448 kB
MemFree: 33884 kB
MemAvailable: 32544 kB
Buffers: 796 kB
Cached: 66032 kB
SwapCached: 66608 kB
Active: 483668 kB
Inactive: 390360 kB
Active(anon): 462456 kB
Inactive(anon): 374188 kB
Active(file): 21212 kB
Inactive(file): 16172 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 102396 kB
SwapFree: 3080 kB
Dirty: 96 kB
Writeback: 0 kB
AnonPages: 740984 kB
Mapped: 61176 kB
Shmem: 29288 kB
Slab: 21932 kB
SReclaimable: 9084 kB
SUnreclaim: 12848 kB
KernelStack: 2064 kB
PageTables: 7012 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 577120 kB
Committed_AS: 2873112 kB
VmallocTotal: 1114112 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
CmaTotal: 8192 kB
CmaFree: 6796 kB
%1 is still waiting on the getchar. If I do the same for %2 it works, but in fact only because process %1 disappears (without any message on the shell, presumably killed by the OOM killer).
The behavior is the same with malloc (giving a second argument to the program).
See also What is the purpose of MAP_ANONYMOUS flag in mmap system call?
First of all, if you disable swap (you don't add any swap partition), that doesn't mean you are not using swap. Read below.
You can run a system without any secondary swap space, but that doesn't mean you are not using virtual memory. You cannot disable virtual memory, and virtual memory is a fundamental concept in the implementation of the mmap(2) syscall.
A file-backed mmap(2) uses the file to fill the initial contents of the pages of the segment. But it does more: it uses normal virtual memory for the allocated pages of that segment, and when the kernel needs to reclaim some of those pages, it doesn't have to swap them out, because there is already a file to store the page contents; it just writes the contents of each page back to the proper place in the file. When another process has the same shared memory segment attached, the same page is mapped in both processes, so when one process writes the page, the other sees the change immediately. And if some process reads or writes the file itself, the same page-cache block is used to access the disk file, so the data it sees is the same data shared with both processes. This is how it works.
The kernel saves a lot of swap with this mechanism, and it also allows the kernel to discard parts of the text segment of a program without swapping them out to a secondary device (as they are already in the program file's text segment).
When you say
What should happen if I mmaped some anonymous memory (but did not touch any of the pages)...
if you didn't touch any of the pages, then probably none of them has actually been mapped yet; the resource is prepared for use, but nothing has been allocated. It is when you fault on one of those pages (e.g. by reading it, and you promised not to touch them) that the page gets mapped to an actual memory page, and for a file mapping the backing store (playing the role of swap space) is the file, not the swap device. That page is also backed by the very disk blocks (more precisely, the set of disk blocks) used to store the data by the disk driver, so no multiple copies of the same data are in use.
EDIT
An anonymous mmap(2) probably also uses disk blocks (on some default disk device). So even if you don't use a swap device, you may be allowed to use mmap(2) with virtual space mapped to a disk inode. I have not checked this, but old Unix pipes worked this way: a temporary inode (with no entry allocated in a directory, like an erased file that a process still holds open) can be used for this.
I'm just wondering whether it is possible to calculate a unique row number for groups without using GROUP BY...
My dataset is as follows:
FileSize FileSize(KB/MB)
--------------------------------
0 0.00 KB
0 0.00 KB
36 0.04 KB
39 0.04 KB
425 0.42 KB
435 0.42 KB
435 0.42 KB
1000960 0.95 MB
1001290 0.95 MB
1266831853 1.27 GB
1266831968 1.27 GB
1312708509 1.31 GB
1312711756 1.31 GB
1367911756 1.36 GB
I would like the output to be as follows, which requires sorting FileSize(KB/MB) by the FileSize column in the Tabular model:
FileSizeRank FileSize FileSize(KB/MB)
-------------------------------------------
1 0 0.00 KB
1 0 0.00 KB
2 36 0.04 KB
2 39 0.04 KB
3 425 0.42 KB
3 435 0.42 KB
3 435 0.42 KB
4 1000960 0.95 MB
4 1001290 0.95 MB
5 1266831853 1.27 GB
5 1266831968 1.27 GB
6 1312708509 1.31 GB
6 1312711756 1.31 GB
7 1367911756 1.36 GB
I have tried this, but it didn't help:
ROW_NUMBER() OVER(PARTITION BY [Filesize(KB/MB)] ORDER BY FileSize) AS FileSizeRank
You need to use DENSE_RANK:
SELECT *,DENSE_RANK() OVER( ORDER BY [Filesize(KB/MB)]) AS FileSizeRank
FROM tab
Rextester Demo
EDIT:
To avoid possible problems with ordering by a text column, I suggest using:
SELECT *,DENSE_RANK() OVER( ORDER BY ROUND([File Size]/1000.0,2)) AS FileSizeRank
FROM tab;
Rextester Demo2
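DENSE_RANK gives equal values the same rank and increments by exactly one each time the value changes, leaving no gaps (unlike RANK or ROW_NUMBER). A minimal sketch of that semantics in C, ranking already-sorted display values (the rounded KB/MB numbers from the question, not the raw byte counts):

```c
#include <stddef.h>

/* Dense-rank a sorted array: equal neighbours share a rank, and the
 * rank grows by exactly 1 at every change of value. */
void dense_rank(const double *sorted, size_t n, int *rank)
{
    int r = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || sorted[i] != sorted[i - 1])
            r++;                     /* new distinct value -> next rank */
        rank[i] = r;
    }
}
```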
I am trying to get the physical address ranges of all the available RAM in the system, inside a Linux kernel module.
I looked at cat /proc/iomem and saw that physical memory is itself not contiguous.
I understand that for 32-bit compatibility, PCI and other peripheral memory needs to be inside the 4 GB address range,
as does the initial 640 kB region for DOS.
The output below is from an x86_64 system:
00000000-00000fff : reserved
00001000-0009d7ff : System RAM //640kB here
0009d800-0009ffff : reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000cedff : Video ROM
000e0000-000fffff : reserved
000f0000-000fffff : System ROM
00100000-daa85fff : System RAM //~3.5 gb here
01000000-0177acb8 : Kernel code
0177acb9-01d1b53f : Kernel data
01e79000-01fbdfff : Kernel bss
daa86000-daa87fff : reserved
daa88000-dad0efff : System RAM //some RAM here
dad0f000-dae75fff : reserved
dae76000-dae95fff : ACPI Non-volatile Storage
dae96000-daf1efff : reserved
daf1f000-daf9efff : ACPI Non-volatile Storage
daf9f000-daffefff : ACPI Tables
dafff000-daffffff : System RAM //some RAM here
db000000-df9fffff : reserved
dba00000-df9fffff : Graphics Stolen Memory
dfa00000-feafffff : PCI Bus 0000:00
e0000000-efffffff : 0000:00:02.0
f0000000-f03fffff : 0000:00:02.0
f0400000-f04fffff : PCI Bus 0000:02
f0400000-f0403fff : 0000:02:00.0
f0400000-f0403fff : r8169
f0404000-f0404fff : 0000:02:00.0
f0404000-f0404fff : r8169
f0500000-f05fffff : PCI Bus 0000:01
f0500000-f0503fff : 0000:01:00.0
f0500000-f0503fff : bcma-pci-bridge
f0600000-f0603fff : 0000:00:1b.0
f0600000-f0603fff : ICH HD audio
f0604000-f06040ff : 0000:00:1f.3
f0605000-f060500f : 0000:00:16.0
f0605000-f060500f : mei_me
f0608000-f06087ff : 0000:00:1f.2
f0608000-f06087ff : ahci
f0609000-f06093ff : 0000:00:1d.0
f0609000-f06093ff : ehci_hcd
f060a000-f060a3ff : 0000:00:1a.0
f060a000-f060a3ff : ehci_hcd
f8000000-fbffffff : PCI MMCONFIG 0000 [bus 00-3f]
f8000000-fbffffff : reserved
f8000000-fbffffff : pnp 00:05
fec00000-fec00fff : reserved
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
fed00000-fed003ff : PNP0103:00
fed08000-fed08fff : reserved
fed10000-fed19fff : reserved
fed10000-fed17fff : pnp 00:05
fed18000-fed18fff : pnp 00:05
fed19000-fed19fff : pnp 00:05
fed1c000-fed1ffff : reserved
fed1c000-fed1ffff : pnp 00:05
fed1f410-fed1f414 : iTCO_wdt
fed20000-fed3ffff : pnp 00:05
fed40000-fed44fff : PCI Bus 0000:00
fed45000-fed8ffff : pnp 00:05
fed90000-fed93fff : pnp 00:05
fee00000-fee00fff : Local APIC
fee00000-fee00fff : reserved
ff000000-ffffffff : INT0800:00
ffd80000-ffffffff : reserved
100000000-15fdfffff : System RAM //~1.5 gB here
15fe00000-15fffffff : RAM buffer
My questions are:
1. How can I get all of the RAM that can be used for DMA, using kernel code?
2. Why are there extra RAM regions, and why isn't the RAM split at some proper boundary, e.g. 2 GB + 3 GB?
3. Will only the 3.5 GB area be used for DMA, or can the higher 1.5 GB also be used for DMA in Linux?
There are a few commands that can be used from the Linux terminal for this. They show details about the physical memory in your Linux system.
cat /proc/meminfo: This will print values in the terminal such as:
MemTotal: 8027952 kB
MemFree: 3893748 kB
Buffers: 132208 kB
Cached: 1666864 kB
SwapCached: 226556 kB
Active: 1979556 kB
Inactive: 1849480 kB
Active(anon): 1592580 kB
Inactive(anon): 886080 kB
Active(file): 386976 kB
Inactive(file): 963400 kB
Unevictable: 68 kB
Mlocked: 68 kB
SwapTotal: 15624188 kB
SwapFree: 15050964 kB
Dirty: 172 kB
Writeback: 0 kB
AnonPages: 1907548 kB
Mapped: 223484 kB
Shmem: 448696 kB
Slab: 140444 kB
SReclaimable: 101456 kB
SUnreclaim: 38988 kB
KernelStack: 4960 kB
PageTables: 53108 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 19638164 kB
Committed_AS: 7822876 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 567356 kB
VmallocChunk: 34359151824 kB
or vmstat -s: This will print values such as:
8027952 K total memory
4114688 K used memory
1960100 K active memory
1849792 K inactive memory
3913264 K free memory
132240 K buffer memory
1667108 K swap cache
15624188 K total swap
573224 K used swap
15050964 K free swap
931285 non-nice user cpu ticks
6391 nice user cpu ticks
152567 system cpu ticks
7019826 idle cpu ticks
181109 IO-wait cpu ticks
19 IRQ cpu ticks
2262 softirq cpu ticks
There is one more command, dmidecode: you can use sudo dmidecode -t memory to check the details of the RAM in your Linux system.
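Those commands show totals rather than address ranges. For the first question, one user-space sketch (not kernel code; parse_system_ram_line is a hypothetical helper) is to read /proc/iomem and pick out the "System RAM" lines; note that on recent kernels non-root users see zeroed addresses there:

```c
#include <stdio.h>
#include <string.h>

/* Parse one /proc/iomem line such as
 *   "00100000-daa85fff : System RAM"
 * into a [start, end] physical address range.
 * Returns 1 if the line describes System RAM, 0 otherwise. */
int parse_system_ram_line(const char *line,
                          unsigned long long *start,
                          unsigned long long *end)
{
    unsigned long long s, e;
    char desc[64];

    if (sscanf(line, "%llx-%llx : %63[^\n]", &s, &e, desc) != 3)
        return 0;
    if (strcmp(desc, "System RAM") != 0)
        return 0;
    *start = s;
    *end = e;
    return 1;
}
```

Inside the kernel itself, walking the iomem_resource tree (or a helper like walk_system_ram_range()) gives the same information without parsing text.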
Environment: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04)
3.16.0-30-generic #40~14.04.1-Ubuntu SMP Thu Jan 15 17:45:15 UTC 2015 i686 i686 i686 GNU
C code a2.c: it has a 40 MB global array, and each item is assigned to:
int b[10000000];    /* ~40 MB global array */

int main(void)
{
    int i;
    for (i = 0; i < 10000000; i++)
        b[i] = i;
    while (1)
        ;           /* spin so the process stays alive for inspection */
}
and I build it with gcc -o a2 a2.c.
When I run this code and look at the smaps file with cat /proc/25739/smaps, the contents are as follows:
08048000-08049000 r-xp 00000000 08:11 46930087 /home/jzd/test/a2
Size: 4 kB
Rss: 4 kB
Pss: 4 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 4 kB
Private_Dirty: 0 kB
Referenced: 4 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd ex mr mw me dw
// some sections omitted here
0804b000-0a670000 rw-p 00000000 00:00 0
Size: 39060 kB
Rss: 39060 kB // the Rss is the global array's size
Pss: 2196 kB // the array is only used by this program,
 // so why is its Pss not equal to its Rss?
Shared_Clean: 0 kB // all shared sizes are 0
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 39060 kB
Referenced: 39060 kB
Anonymous: 39060 kB
AnonHugePages: 36864 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac
// other sections omitted here
Why does that happen?
You have the support for transparent huge pages (THP) enabled and your executable's BSS is backed by huge pages:
0804b000-0a670000 rw-p 00000000 00:00 0
Size: 39060 kB
Rss: 39060 kB
Pss: 2196 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 39060 kB
Referenced: 39060 kB
Anonymous: 39060 kB
AnonHugePages: 36864 kB <------
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac
If you look closely, the reported Pss value of 2196 KiB is exactly the amount of anonymous memory mappings backed by regular 4 KiB pages, i.e. the difference between Anonymous and AnonHugePages.
My guess is that the accounting of THP in PSS is broken in 3.16.0-30-generic. Between your kernel version and the version of @Evan's kernel, there are several commits affecting the part of the Linux kernel that generates the contents of the smaps file (fs/proc/task_mmu.c); more specifically, this change between 3.18 and 3.19 likely fixed things.
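The arithmetic behind that observation is easy to check. A small sketch using the numbers from the smaps output above (the helper name is mine, for illustration):

```c
/* Anonymous memory NOT backed by transparent huge pages, i.e. the
 * part accounted with regular 4 KiB pages. On the affected kernel,
 * this remainder is exactly what showed up as Pss. */
long long regular_page_anon_kb(long long anonymous_kb,
                               long long anon_huge_pages_kb)
{
    return anonymous_kb - anon_huge_pages_kb;
}
```

With the values above: 39060 - 36864 = 2196 kB, the reported Pss.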
I'm not sure why you are seeing that; I ran your test program and got a different result, in line with what you were expecting:
00602000-02c27000 rw-p 00000000 00:00 0
Size: 39060 kB
Rss: 39060 kB
Pss: 39060 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 39060 kB
Referenced: 38824 kB
Anonymous: 39060 kB
AnonHugePages: 8192 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
My kernel version is 3.19.0-30-generic #34-Ubuntu SMP. Are you sure you are running the program exactly as you posted it? It is also possible that kernel memory reporting changed at some point, or that this behavior depends on how the kernel is built.
We have a project written in ANSI C. Generally, memory consumption was not a big concern, but now we have a request to fit our program into 256 KB of RAM. I don't have the exact target platform on hand, so I compile the project under 32-bit x86 Linux (because it provides enough different tools to evaluate memory consumption), optimize what I can, and remove some features, in order to eventually reach a conclusion: which features do we need to sacrifice to be able to run on very small systems (if we can at all)?

First of all, I researched what exactly counts as a program's memory size on Linux, and it seems I have to optimize the RSS size, not VSZ. But on Linux even the smallest program, one that prints "Hello world!" once a second, consumes 285-320 KB of RSS:
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
unsigned char cuStopCycle = 0;
void SigIntHandler(int signo)
{
printf("SIGINT received, terminating the program\n");
cuStopCycle = 1;
}
int main()
{
signal( SIGINT, SigIntHandler);
while(!cuStopCycle)
{
printf("Hello, World!\n");
sleep(1);
}
printf("Exiting...\n");
}
user@Ubuntu12-vm:~/tmp/prog_size$ size ./prog_size
text data bss dec hex filename
1456 272 12 1740 6cc ./prog_size
root@Ubuntu12-vm:/home/app# ps -C prog_size -o pid,rss,vsz,args
PID RSS VSZ COMMAND
22348 316 2120 ./prog_size
Obviously this program would run perfectly well on small PLCs with 64 KB of RAM; it is just that Linux loads a lot of libraries. I generated a map file for this program, and all of this data + bss comes from the CRT library. I should mention that if I add some code to this project (10,000 repetitions of "a = a + b", or manipulation of an array of 2,000 long int variables), I see the difference in the code and bss sizes, but the RSS of the process eventually stays the same; it isn't affected.
So I take this as the baseline, the point I want to reach (and which I will never reach, because I need to do more than just print a message once a second).
So here comes my project, where I removed all extra features, removed all auxiliary functions, removed everything except the basic functionality. There are ways to optimize further, but not by much; what could be removed has already been taken away:
root@Ubuntu12-vm:/home/app/workspace/proj_sizeopt/Cmds# ls -l App
-rwxr-xr-x 1 root root 42520 Jul 13 18:33 App
root@Ubuntu12-vm:/home/app/workspace/proj_sizeopt/Cmds# size ./App
text data bss dec hex filename
37027 404 736 38167 9517 ./App
So I have ~36 KB of code and ~1 KB of data. I do not call malloc inside my project; I use shared-memory allocation with a wrapper library, so I can control how much memory is allocated:
The total memory size allocated is 2052 bytes
Under the hood there are obviously malloc calls; if I substitute the malloc calls with my own function, which sums up all allocation requests, I see that ~2.4 KB of memory is allocated:
root@Ubuntu12-vm:/home/app/workspace/proj_sizeopt/Cmds# LD_PRELOAD=./override_malloc.so ./App
Malloc allocates 2464 bytes total
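The override_malloc.so preload library isn't shown in the question. Its accounting idea can be sketched with a plain wrapper (counted_malloc is a hypothetical name; a real LD_PRELOAD interposer would instead define malloc itself and fetch the original via dlsym(RTLD_NEXT, "malloc")):

```c
#include <stdlib.h>

static size_t total_allocated;   /* running sum of all requested bytes */

/* Hypothetical accounting wrapper: tally the request, then forward
 * to the real malloc. */
void *counted_malloc(size_t n)
{
    total_allocated += n;
    return malloc(n);
}

size_t counted_total(void)
{
    return total_allocated;
}
```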
Now I run my project and see that it consumes 600 KB of RAM:
root@Ubuntu12-vm:/home/app/workspace/proj_sizeopt# ps -C App -o pid,rss,vsz,args
PID RSS VSZ COMMAND
22093 604 2340 ./App
I do not understand why it eats so much memory. The code size is small, not much memory is allocated, and the size of the data is small. Why does it take so much memory? I tried to analyze the mapping of the process:
root@Ubuntu12-vm:/home/app/workspace/proj_sizeopt# pmap -x 22093
22093: ./App
Address Kbytes RSS Dirty Mode Mapping
08048000 0 28 0 r-x-- App
08052000 0 4 4 r---- App
08053000 0 4 4 rw--- App
09e6a000 0 4 4 rw--- [ anon ]
b7553000 0 4 4 rw--- [ anon ]
b7554000 0 48 0 r-x-- libpthread-2.15.so
b756b000 0 4 4 r---- libpthread-2.15.so
b756c000 0 4 4 rw--- libpthread-2.15.so
b756d000 0 8 8 rw--- [ anon ]
b7570000 0 300 0 r-x-- libc-2.15.so
b7714000 0 8 8 r---- libc-2.15.so
b7716000 0 4 4 rw--- libc-2.15.so
b7717000 0 12 12 rw--- [ anon ]
b771a000 0 16 0 r-x-- librt-2.15.so
b7721000 0 4 4 r---- librt-2.15.so
b7722000 0 4 4 rw--- librt-2.15.so
b7731000 0 4 4 rw-s- [ shmid=0x70000c ]
b7732000 0 4 4 rw-s- [ shmid=0x6f800b ]
b7733000 0 4 4 rw-s- [ shmid=0x6f000a ]
b7734000 0 4 4 rw-s- [ shmid=0x6e8009 ]
b7735000 0 12 12 rw--- [ anon ]
b7738000 0 4 0 r-x-- [ anon ]
b7739000 0 104 0 r-x-- ld-2.15.so
b7759000 0 4 4 r---- ld-2.15.so
b775a000 0 4 4 rw--- ld-2.15.so
bfb41000 0 12 12 rw--- [ stack ]
-------- ------- ------- ------- -------
total kB 2336 - - -
And it looks like the program itself is only 28 KB in RSS; the rest is consumed by shared libraries. BTW, I do not use POSIX threads and do not explicitly link against libpthread, but somehow the linker links this library anyway; I have no idea why (this is not really important). If we look at the mapping in more detail:
root@Ubuntu12-vm:/home/app/workspace/proj_sizeopt# cat /proc/22093/smaps
08048000-08052000 r-xp 00000000 08:01 344838 /home/app/workspace/proj_sizeopt/Cmds/App
Size: 40 kB
Rss: 28 kB
Pss: 28 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 28 kB
Private_Dirty: 0 kB
Referenced: 28 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
...
09e6a000-09e8b000 rw-p 00000000 00:00 0 [heap]
Size: 132 kB
Rss: 4 kB
Pss: 4 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 4 kB
Referenced: 4 kB
Anonymous: 4 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
...
b7570000-b7714000 r-xp 00000000 08:01 34450 /lib/i386-linux-gnu/libc-2.15.so
Size: 1680 kB
Rss: 300 kB
Pss: 7 kB
Shared_Clean: 300 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 300 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
...
b7739000-b7759000 r-xp 00000000 08:01 33401 /lib/i386-linux-gnu/ld-2.15.so
Size: 128 kB
Rss: 104 kB
Pss: 3 kB
Shared_Clean: 104 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 104 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
...
bfb41000-bfb62000 rw-p 00000000 00:00 0 [stack]
Size: 136 kB
Rss: 12 kB
Pss: 12 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 12 kB
Referenced: 12 kB
Anonymous: 12 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
So I see that the Size of my project's code mapping is 40 KB, but only 28 KB of it is resident. Does that mean this project will fit into 256 KB of RAM?
The heap size is 132 KB, but only 4 KB is used. Why is that? I'm sure it will be different on a small embedded platform.
The stack is 136 KB, but only 12 KB is used.
GLIBC/LD obviously consume some memory, but how much would that be on the embedded platform?
I do not look at PSS because it doesn't make sense in my case; I look only at RSS.
What conclusions can I draw from this picture? How exactly should I evaluate the application's memory consumption: look at the RSS of the process, or subtract the RSS of all mapped system libraries from it? And what about the heap/stack sizes?
I would be very grateful for any advice, notes, or memory-consumption optimization techniques, and DOs and DON'Ts for platforms with extremely small amounts of RAM (beyond the obvious: keep the amount of data and code to the very minimum).
I would also appreciate an explanation of WHY a program with a small amount of code and data (and which doesn't allocate much memory) still consumes a lot of RAM in RSS.
Thank you in advance
... fit our program into 256 KB of RAM. I don't have this exact platform on hand, so I compile my project under 32-bit x86 Linux...
And what you now see is that the Linux platform tools make reasonable assumptions about your possible need for stack and heap, given that they know you run on a big machine, and link in a reasonable set of library functions for your needs. Some of them you won't need, but you get them "for free".
To fit in 256 Kb on your target platform, you must compile for your target platform and link with the target platform's libraries (and CRT) using the target platform's linker.
Those will make different assumptions: possibly smaller library footprints, smaller assumptions about stack and heap space, etcetera. For example, create "Hello World" for the target platform and check its needs on that target platform. Or use a realistic simulator of the target platform and libraries (and, not to forget, the OS, which partly dictates what the libraries must do).
And if it is then still too big, you will have to rewrite or tweak the whole CRT and all the libraries...
The program needs to be compiled/linked with the embedded device in mind:
For best results, use a makefile.
Use the 'rt' library written for the embedded device.
Use the start.s file, located via the makefile, where execution begins.
Use 'static' in the linker parameters.
Use linker parameters that exclude every library except what is specifically requested.
Do not use libraries written for your development machine; only use libraries written for the embedded device.
Do NOT include stdio.h etc. unless they were specifically written for the embedded device.
Do NOT call printf() inside a signal handler.
If possible, do not call printf() at all;
instead, write a small character-output function and have it perform the output through the UART.
Do not use signals; use interrupts instead.
The resulting application will not run on your PC but, once loaded, will run on the 256 KB device.
Do not call sleep(); rather, write your own function that uses a device timer peripheral, sets the timer, and puts the device into power-down mode.
The timer interrupt needs to bring the device out of power-down mode.
In the makefile, specifically set the sizes of the stack, heap, etc.
Have the link step output a .map file; study that map file until you understand everything in it.
Use a compiler/linker that is specific to the embedded device.
You will probably need to include a function that initializes the peripherals on the device, like the clock, the UART, the timer, the watchdog, and any other built-in peripherals that the code actually uses.
You will need a file that allocates the interrupt table, and a small function to handle each of the interrupts, even though most of those functions will do nothing beyond clearing the appropriate interrupt-pending flag and returning from the interrupt.
You will probably need a function that periodically refreshes the watchdog, conditionally, depending on an indication that the main function is still cycling regularly; i.e. the main function loop and the initialization function will refresh the watchdog.
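The "small char output function" suggested above can be sketched like this, assuming a hypothetical 16550-style UART where bit 5 of the line-status register means "transmit holding register empty"; the register addresses are device-specific, so they are passed in here (which also keeps the routine testable off-device):

```c
/* Poll the line-status register until the transmitter is ready,
 * then write the character to the transmit holding register.
 * 0x20 = "THR empty" bit on a 16550-style UART (hypothetical device). */
void uart_putc(volatile unsigned char *lsr, volatile unsigned char *thr,
               char c)
{
    while ((*lsr & 0x20) == 0)
        ;                        /* busy-wait: transmitter not ready */
    *thr = (unsigned char)c;
}

void uart_puts(volatile unsigned char *lsr, volatile unsigned char *thr,
               const char *s)
{
    while (*s)
        uart_putc(lsr, thr, *s++);
}
```

On real hardware the two pointers would be the device's fixed memory-mapped register addresses, typically taken from the chip's datasheet.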