The n-body program does this at the beginning:
real4 *pin = (real4*)malloc(n * sizeof(real4));
real4 *pout = (real4*)malloc(n * sizeof(real4));
real3 *v = (real3*)malloc(n * sizeof(real3));
real3 *f = (real3*)malloc(n * sizeof(real3));
The total size of this should be (with n = 100): 100*32 + 100*32 + 100*24 + 100*24 = 11200 bytes, but Valgrind's massif shows me this:
I am not familiar with massif, but when talking about heap memory there are two interesting numbers: how much memory the allocator has requested from the OS, and how much the allocator has handed to your program through malloc(). If your program has requested ~10K bytes, it is reasonable to expect that the allocator requested a round number like 32K from the OS. Allocators typically request memory from the OS in large blocks, since kernel calls are slow (among other reasons).
So I would guess that the 32K you are seeing is what the allocator has acquired from the OS, ready to be handed out through any additional malloc() calls that may happen.
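If you want to see the two numbers yourself, glibc exposes them through mallinfo2() (glibc 2.33 or newer; older versions only have the deprecated mallinfo()). A minimal sketch, with allocation sizes mirroring the question:

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Mirror the n-body allocations (sizes taken from the question). */
    void *pin  = malloc(100 * 32);
    void *pout = malloc(100 * 32);
    void *v    = malloc(100 * 24);
    void *f    = malloc(100 * 24);

    struct mallinfo2 mi = mallinfo2();        /* glibc >= 2.33 */
    printf("handed out by malloc (uordblks): %zu bytes\n", mi.uordblks);
    printf("obtained from the OS (arena + hblkhd): %zu bytes\n",
           mi.arena + mi.hblkhd);

    free(pin); free(pout); free(v); free(f);
    return 0;
}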
I'm currently trying to figure out the maximum memory that can be allocated through the malloc() function in C.
So far I've tried a simple algorithm that increments a counter whose value is then allocated. If malloc() returns NULL, I know that there is not enough memory available.
ULONG ulMaxSize = 0;
for (ULONG ulSize = /*0x40036FF0*/ 0x40A00000; ulSize <= 0xffffffff; ulSize++)
{
    void* pBuffer = malloc(ulSize);
    if (pBuffer == NULL)
    {
        ulMaxSize = ulSize - 1;
        break;
    }
    free(pBuffer);
}
void* pMaxBuffer = malloc(ulMaxSize);
However, this algorithm takes a very long time to execute, since malloc() turns out to be a time-consuming operation.
My question is: is there a more efficient way to find the maximum amount of memory that can be allocated?
The maximum memory that can be allocated depends mostly on a few factors:
Address space limits on the process (max memory, virtual memory and friends).
Virtual space available
Physical space available
Fragmentation, which limits the size of contiguous memory blocks.
... Other limits ...
From your description (extreme slowness), it looks like the process starts using swap, which is VERY slow compared to real memory.
Consider the following alternatives:
For the address space limits, look at ulimit -a (or use getrlimit() to access the same data from a C program; a sketch follows below) - look for 'max memory size' and 'virtual memory'.
For swap space and physical memory, use top.
ulimit -a (filtered)
data seg size (kbytes, -d) unlimited
max memory size (kbytes, -m) 2048
stack size (kbytes, -s) 8192
virtual memory (kbytes, -v) unlimited
From a practical point of view, given that a program has no control over system resources, you should focus on 'max memory size'.
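As mentioned above, the same limits that ulimit -a reports can be read from a C program with getrlimit(). A minimal sketch (the resource constants are from <sys/resource.h>; note that getrlimit() reports bytes while ulimit prints kbytes):

#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int resource) {
    struct rlimit rl;
    if (getrlimit(resource, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("%-16s unlimited\n", name);
        else
            printf("%-16s %llu bytes\n", name, (unsigned long long)rl.rlim_cur);
    }
}

int main(void) {
    show("max memory size", RLIMIT_RSS);    /* ulimit -m */
    show("virtual memory",  RLIMIT_AS);     /* ulimit -v */
    show("data seg size",   RLIMIT_DATA);   /* ulimit -d */
    show("stack size",      RLIMIT_STACK);  /* ulimit -s */
    return 0;
}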
You can also ask the OS directly through an OS-specific API:
sysinfo() on Linux, or reading /proc/meminfo
GlobalMemoryStatusEx() on Win32
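For the Linux case, a minimal sysinfo() sketch (the structure and its fields come from <sys/sysinfo.h>; /proc/meminfo exposes roughly the same data in text form):

#include <stdio.h>
#include <sys/sysinfo.h>

int main(void) {
    struct sysinfo si;
    if (sysinfo(&si) != 0) {
        perror("sysinfo");
        return 1;
    }
    /* mem_unit is the size in bytes of one unit in the fields below */
    unsigned long long unit = si.mem_unit;
    printf("total RAM : %llu bytes\n", (unsigned long long)si.totalram  * unit);
    printf("free RAM  : %llu bytes\n", (unsigned long long)si.freeram   * unit);
    printf("total swap: %llu bytes\n", (unsigned long long)si.totalswap * unit);
    return 0;
}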
Alternatively, you can do a binary search. This is not recommended, as the state of the system may be in flux and the result can vary over time:
ULONG getMax() {
    ULONG min = 0x0;
    ULONG max = 0xffffffff;
    void* t = malloc(max);
    if (t != NULL) {
        free(t);
        return max;
    }
    while (max - min > 1) {
        ULONG mid = min + (max - min) / 2;
        t = malloc(mid);
        if (t == NULL) {
            max = mid;
            continue;
        }
        free(t);
        min = mid;
    }
    return min;
}
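For completeness, a minimal way to exercise it (the ULONG typedef is an assumption, since the original snippet looks like Win32 code). Keep in mind that on Linux with memory overcommit enabled a successful malloc() only reserves address space, so the value returned is optimistic about how much memory you can actually touch:

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int ULONG;   /* assumption: a 32-bit unsigned type */

ULONG getMax(void);           /* the binary-search function above */

int main(void) {
    printf("largest single malloc: %u bytes\n", getMax());
    return 0;
}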
I'm working with an STM32 + FreeRTOS to implement a file system based on SPI flash. For FreeRTOS, I adopted the heap_1 implementation. This is how I create my task:
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 2500);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
I allocated 10000 bytes of memory to this thread.
In this thread I try to write data into the flash. The first few calls work successfully, but somehow it crashes when I try to write more times.
VATAPI_RESULT STM32SPIWriteSector(void *writebuf, uint8_t* SectorAddr, uint32_t buff_size){
    if(STM32SPIEraseSector(SectorAddr) == VAT_SUCCESS){
        DBGSTR("ERASE SECTOR - 0x%2x %2x %2x", SectorAddr[0], SectorAddr[1], SectorAddr[2]);
    }else return VAT_UNKNOWN;
    if(STM32SPIProgram_multiPage(writebuf, SectorAddr, buff_size) == VAT_SUCCESS){
        DBGSTR("WRITE SECTOR SUCCESSFUL");
        return VAT_SUCCESS;
    }else return VAT_UNKNOWN;
    return VAT_UNKNOWN;
}
.
VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr, uint32_t buff_size){
    VATAPI_RESULT nres;
    uint8_t tmpaddr[3] = {writeAddr[0], writeAddr[1], writeAddr[2]};
    uint8_t* sectorBuf = malloc(4096 * sizeof(uint8_t));
    uint8_t* pagebuf = malloc(255 * sizeof(uint8_t));
    memset(&sectorBuf[0],0,4096);
    memset(&pagebuf[0],0,255);
    uint32_t i = 0, tmp_convert1, times = 0;
    if(buff_size < Page_bufferSize)
        times = 1;
    else{
        times = buff_size / (Page_bufferSize-1);
        if((times%(Page_bufferSize-1))!=0)
            times++;
    }
    /* Note: according to the Winbond flash datasheet, the last byte of every 256 bytes should be 0, so we need to add one byte for every 256 bytes */
    i = 0;
    while(i < times){
        memset(&pagebuf[0], 0, Page_bufferSize - 1);
        memcpy(&pagebuf[0], &writebuf[i*255], Page_bufferSize - 1);
        memcpy(&sectorBuf[i*Page_bufferSize], &pagebuf[0], Page_bufferSize - 1);
        sectorBuf[((i+1)*Page_bufferSize)-1] = 0;
        i++;
    }
    i = 0;
    while(i < times){
        if((nres=STM32SPIPageProgram(&sectorBuf[Page_bufferSize*i], &tmpaddr[0], Page_bufferSize)) != VAT_SUCCESS){
            DBGSTR("STM32SPIProgram_allData write data fail on %d times!",i);
            free(sectorBuf);
            free(pagebuf);
            return nres;
        }
        tmp_convert1 = (tmpaddr[0]<<16 | tmpaddr[1]<<8 | tmpaddr[2]) + Page_bufferSize;
        tmpaddr[0] = (tmp_convert1&0xFF0000) >> 16;
        tmpaddr[1] = (tmp_convert1&0xFF00) >> 8;
        tmpaddr[2] = 0x00;
        i++;
    }
    free(sectorBuf);
    free(pagebuf);
    return nres;
}
I opened the debugger and it seems to crash when I malloc "sectorBuf" in the function "STM32SPIProgram_multiPage". What confuses me is that I did free the memory after the malloc. Does anyone have an idea about it?
arm-none-eabi-size "RTOS.elf"
text data bss dec hex filename
77564 988 100756 179308 2bc6c RTOS.elf
Reading the manual:
Memory Management
[...]
If RTOS objects are created dynamically then the standard C library malloc() and free() functions can sometimes be used for the purpose, but ...
they are not always available on embedded systems,
they take up valuable code space,
they are not thread safe, and
they are not deterministic (the amount of time taken to execute the function will differ from call to call)
... so more often than not an alternative memory allocation implementation is required.
One embedded / real time system can have very different RAM and timing requirements to another - so a single RAM allocation algorithm will only ever be appropriate for a subset of applications.
To get around this problem, FreeRTOS keeps the memory allocation API in its portable layer. The portable layer is outside of the source files that implement the core RTOS functionality, allowing an application specific implementation appropriate for the real time system being developed to be provided. When the RTOS kernel requires RAM, instead of calling malloc(), it instead calls pvPortMalloc(). When RAM is being freed, instead of calling free(), the RTOS kernel calls vPortFree().
[...]
(Emphasis mine.)
So the point is that if you call malloc() directly, FreeRTOS cannot track the heap consumed by that system function. The same applies if you choose the heap_3 scheme, which is a simple malloc() wrapper.
Note also that the memory management scheme you chose (heap_1) has no free capability.
heap_1.c
This is the simplest implementation of all. It does not permit memory to be freed once it has been allocated. Despite this, heap_1.c is appropriate for a large number of embedded applications. This is because many small and deeply embedded applications create all the tasks, queues, semaphores, etc. required when the system boots, and then use all of these objects for the lifetime of program (until the application is switched off again, or is rebooted). Nothing ever gets deleted.
The implementation simply subdivides a single array into smaller blocks as RAM is requested. The total size of the array (the total size of the heap) is set by configTOTAL_HEAP_SIZE - which is defined in FreeRTOSConfig.h. The configAPPLICATION_ALLOCATED_HEAP FreeRTOSConfig.h configuration constant is provided to allow the heap to be placed at a specific address in memory.
The xPortGetFreeHeapSize() API function returns the total amount of heap space that remains unallocated, allowing the configTOTAL_HEAP_SIZE setting to be optimised.
The heap_1 implementation:
Can be used if your application never deletes a task, queue, semaphore, mutex, etc. (which actually covers the majority of applications in which FreeRTOS gets used).
Is always deterministic (always takes the same amount of time to execute) and cannot result in memory fragmentation.
Is very simple and allocates memory from a statically allocated array, meaning it is often suitable for use in applications that do not permit true dynamic memory allocation.
(Emphasis mine.)
Side note: always check that malloc()'s return value is not NULL.
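To make that concrete, here is a minimal sketch of how the question's two buffers could be taken from the FreeRTOS heap instead of the C library heap. It assumes you switch to a scheme that can actually free (for example heap_4), since heap_1 cannot, and the helper name is made up for illustration:

#include <stdint.h>
#include "FreeRTOS.h"   /* pvPortMalloc(), vPortFree() */

/* Hypothetical helper mirroring the buffers used in STM32SPIProgram_multiPage(). */
int write_with_rtos_heap(const uint8_t *writebuf, uint32_t buff_size)
{
    uint8_t *sectorBuf = pvPortMalloc(4096);
    uint8_t *pagebuf   = pvPortMalloc(255);

    if (sectorBuf == NULL || pagebuf == NULL) {
        /* Out of configTOTAL_HEAP_SIZE: release whatever was obtained and bail out. */
        vPortFree(sectorBuf);   /* vPortFree(NULL) is a no-op in heap_4 */
        vPortFree(pagebuf);
        return -1;
    }

    /* ... assemble the pages from writebuf and program the flash here ... */
    (void)writebuf;
    (void)buff_size;

    vPortFree(pagebuf);
    vPortFree(sectorBuf);
    return 0;
}

xPortGetFreeHeapSize() can then be used at run time to see how much of configTOTAL_HEAP_SIZE is left and to size it appropriately.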
I have a very large (~1E9) array of objects that I malloc, realloc, and free over the iterations of a single-threaded program.
Specifically,
//(Individual *ind)
//malloc, old implementation
ind->obj = (double *)malloc(sizeof(double)*acb->in.nobj);
//new implementation
ind->obj = (double *)a_allocate(acb->in.nobj*sizeof(double));
void *a_allocate (int siz){
    void *buf;
    buf = calloc(1,siz);
    acb->totmemMalloc += siz;
    if (buf==NULL){
        a_throw2("a_allocate...failed to allocate buf...<%d>",siz);
    }
    return buf;
}
...
//realloc
ind->obj = (double *)a_realloc(ind->obj, acb->in.nobj*sizeof(double));
void *a_realloc (void *bufIn,int siz)
{
    void *buf = bufIn;
    if (buf==NULL){
        a_throw2("a_realloc called with null bufIn...");
    }
    buf = realloc(buf,siz);
    return buf;
}
...
//deallocate
free(ind->obj);
The other three dozen properties are processed similarly.
However, every few test runs, the code fails heap validation on the deallocation of only this object property (the free() statement). At the time of failure, the ind->obj property is not NULL and holds a valid-looking value.
Is there any obvious problem with what I'm doing?
I'm very new to C and am not entirely sure I'm performing the memory operations correctly.
Thanks!
EDIT: using _CRTDBG_REPORT_FLAG
HEAP[DEMO.exe]: Heap block at 010172B0 modified at 010172E4 past requested size of 2c
Heap validation is a delayed check. The Visual Studio debug heap (in a debug build) can be configured to check more frequently; see Microsoft: Debug Heap flags.
Alternatively, using Application Verifier with heap checking turned on will help find the point that is causing the corruption.
+-----+----------+-----+ +----+-------------+-----+
| chk | memory |chk2 | | chk| different m | chk2|
+-----+----------+-----+ +----+-------------+-----+
When the system allocates memory, it puts meta-information about the allocation before the returned pointer (or possibly after it). When these pieces of memory get overwritten, that causes the heap validation failure.
This may be the memory you are freeing, or the memory that was allocated directly before it.
Edit - to address comments
A message such as "HEAP[DEMO.exe]: Heap block at 010172B0 modified at 010172E4 past requested size of 2c"
implies that the code wrote beyond the end of the memory allocated at 010172B0.
This could be because the amount malloc'd/realloc'd was too small, or because of an error in your loops.
+---+-----------------+----+--------------------------+
|chk|d0|d1|d2|d3|d4|d5|chk2| memory |
+---+-----------------+----+--------------------------+
So if you tried to write into d6 above, that would overwrite 'chk2', which is what is being detected. In this case the overrun is small: the requested size is 0x2c, and the difference is E4 - B0 = 0x34.
Turning on these debug checks should make your code crash sooner and more predictably. If there is no randomness in your data, then turn off ASLR (only for debugging) so the addresses being used are predictable, and you can put a breakpoint in malloc/realloc for a given memory address.
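A minimal sketch of turning those debug heap checks on from code (MSVC debug build; _CrtSetDbgFlag() and _CrtCheckMemory() come from <crtdbg.h>, and the deliberate overrun below is only there to show what gets caught):

#include <crtdbg.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Validate the whole debug heap on every allocation and free. */
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_CHECK_ALWAYS_DF);

    char *p = malloc(0x2c);
    if (p == NULL)
        return 1;
    memset(p, 0, 0x2c + 8);   /* deliberate overrun, like the one diagnosed above */

    /* With _CRTDBG_CHECK_ALWAYS_DF the next heap call reports the damage;
       _CrtCheckMemory() can also be called explicitly at suspicious points. */
    _CrtCheckMemory();

    free(p);
    return 0;
}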
I was recently delving into the details of Linux's memory management, as I want to implement something similar for my own toy kernel, and I was hoping someone familiar with the details could help me understand one thing. Apparently the physical memory manager is a buddy allocator, specialised to return blocks of pages of a particular order (0 to 9, with order 0 being a single page). For each order the free blocks are stored in a linked list. Say a block of order 5 is requested but none is found on the order-5 list; the algorithm then looks for a block of order 6, splits it in two, returns the requested half, and moves the other half one order lower (as it is half the size).
What I don't get is how the kernel stores these structures, or how it allocates space for them. Since for order 0 pages you would need 1M entries (each is a 4KiB page), does it mean that the kernel allocates 1MiB * sizeof(struct page)? What about the blocks of order 1 and above? Does the kernel reuse allocated blocks by marking them as a higher order, and when it needs to split it in two just return the block and get one that is unused?
What I don't get is how the kernel stores these structures, or how it allocates space for them. Since for order 0 pages you would need 1M entries (each is a 4KiB page), does it mean that the kernel allocates 1MiB * sizeof(struct page)?
Initialization of zones is done by calling paging_init() (arch/x86/mm/init_32.c; some descriptions: https://www.kernel.org/doc/gorman/html/understand/understand005.html, section 2.3 Zone Initialisation, and http://repo.hackerzvoice.net/depot_madchat/ebooks/Mem_virtuelle/linux-mm/vminit.html, Initializing the Kernel Page Tables) from setup_arch(), via native_pagetable_init() and the indirect call at line 1166, x86_init.paging.pagetable_init():
690 /*
691 * paging_init() sets up the page tables - note that the first 8MB are
692 * already mapped by head.S.
...*/
697 void __init paging_init(void)
698 {
699 pagetable_init();
...
711 zone_sizes_init();
712 }
pagetable_init() creates the kernel page tables in the swapper_pg_dir array of 1024 pgd_ts.
zone_sizes_init() actually defines the zones of physical memory and calls free_area_init_nodes() to initialize them, with the actual work done (for each NUMA node: for_each_online_node(nid) {...}) in free_area_init_node(), which calls three functions:
calculate_node_totalpages() prints page counts for every node in dmesg
alloc_node_mem_map() does the actual job of allocating a struct page for every physical page in this node; the memory for them is allocated by the bootmem allocator (doc1, doc2; you can see its debug output with the bootmem_debug=1 kernel boot option):
4936 size = (end - start) * sizeof(struct page);
4937 map = alloc_remap(pgdat->node_id, size);
if (!map) map = memblock_virt_alloc_node_nopanic(size, pgdat->node_id);
free_area_init_core() (which also fills the bitmaps in struct zone). The functionality of free_area_init_core() is described for older kernels in http://repo.hackerzvoice.net/depot_madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html#INITIALIZE as:
free_area_init_core() - The memory map is built, and the freelists and buddy bitmaps initialized, in free_area_init_core().
The free lists for each order in each zone are initialized, and every order is marked as having no free pages: free_area_init_core() -> init_currently_empty_zone() -> zone_init_free_lists():
4147 static void __meminit zone_init_free_lists(struct zone *zone)
4148 {
4149 unsigned int order, t;
4150 for_each_migratetype_order(order, t) {
4151 INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
4152 zone->free_area[order].nr_free = 0;
4153 }
4154 }
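For orientation, a stripped-down sketch of the data structures that the code above is touching (illustrative only; the real definitions live in include/linux/mmzone.h, have many more fields, and the constants depend on kernel version and config):

/* Illustrative values - both depend on the kernel configuration. */
#define MAX_ORDER     11
#define MIGRATE_TYPES  5

struct list_head {
    struct list_head *next, *prev;
};

/* Per order, per zone: the heads of the buddy free lists plus a count. */
struct free_area {
    struct list_head free_list[MIGRATE_TYPES];
    unsigned long    nr_free;
};

struct zone {
    struct free_area free_area[MAX_ORDER];
    /* ... watermarks, per-CPU page lists, spanned/present page counts ... */
};

The struct page array set up by alloc_node_mem_map() above is the only per-page cost: with 4 KiB pages, 4 GiB of RAM needs 2^20 entries, i.e. 2^20 * sizeof(struct page), roughly 64 MiB if struct page is 64 bytes (a typical x86_64 value). Free blocks of any order are threaded onto free_list through a list_head embedded in the struct page of their first page, with the order kept in that page's private field, so the free lists need no storage beyond the struct page array itself.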
PS: There is an init() in the kernel; it is called start_kernel(), and LXR (the Linux cross-reference) will help you navigate between functions (I posted links to lxr.free-electrons.com, but there are several online LXRs):
501 asmlinkage __visible void __init start_kernel(void)
...
528 boot_cpu_init();
529 page_address_init();
530 pr_notice("%s", linux_banner);
531 setup_arch(&command_line);
The following static allocation gives a segmentation fault:
double U[100][2048][2048];
But the following dynamic allocation works fine:
double ***U = (double ***)malloc(100 * sizeof(double **));
for(i=0;i<100;i++)
{
    U[i] = (double **)malloc(2048 * sizeof(double *));
    for(j=0;j<2048;j++)
    {
        U[i][j] = (double *)malloc(2048*sizeof(double));
    }
}
The ulimit is set to unlimited on Linux.
Can anyone give me a hint on what's happening?
When you say the ulimit is set to unlimited, are you using the -s option? Otherwise this doesn't change the stack limit, only the file size limit.
There appear to be stack limits regardless, though. I can allocate:
double *u = malloc(200*2048*2048*(sizeof(double))); // 6gb contiguous memory
And running the binary I get:
VmData: 6553660 kB
However, if I allocate on the stack, it's:
double u[200][2048][2048];
VmStk: 2359308 kB
Which is clearly not correct (suggesting overflow). With the original allocations, the two give the same results:
Array: VmStk: 3276820 kB
malloc: VmData: 3276860 kB
However, running the stack version, I cannot generate a segfault no matter what the size of the array -- even if it's more than the total memory actually on the system, if -s unlimited is set.
EDIT:
I did a test with malloc in a loop until it failed:
VmData: 137435723384 kB // my system doesn't quite have 131068gb RAM
Stack usage never gets above 4gb, however.
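If you want to reproduce those numbers, a small Linux-specific sketch that prints the process's own VmData and VmStk lines from /proc/self/status:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print this process's VmData / VmStk lines from /proc/self/status. */
static void print_mem_lines(void)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "VmData:", 7) == 0 || strncmp(line, "VmStk:", 6) == 0)
            fputs(line, stdout);
    }
    fclose(f);
}

int main(void)
{
    char *p = malloc(100UL * 1024 * 1024);   /* ~100 MB heap allocation */
    print_mem_lines();                        /* VmData should reflect it */
    free(p);
    return 0;
}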
Assuming your machine actually has enough free memory to allocate 3.125 GiB of data, the difference most likely lies in the fact that the static allocation needs all of this memory to be contiguous (it's actually a 3-dimensional array), while the dynamic allocation only needs contiguous blocks of about 2048*8 = 16 KiB (it's an array of pointers to arrays of pointers to quite small actual arrays).
It is also possible that your operating system uses swap files for heap memory when it runs out, but not for stack memory.
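If you do want one contiguous block with the same layout as the static array, but on the heap so that the stack limit does not apply, a single allocation through a pointer-to-array type works; a minimal sketch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* One contiguous 100 x 2048 x 2048 block of doubles (~3.125 GiB),
       indexed exactly like the static array but living on the heap. */
    double (*U)[2048][2048] = malloc(100 * sizeof *U);
    if (U == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    U[99][2047][2047] = 1.0;
    printf("%f\n", U[99][2047][2047]);

    free(U);
    return 0;
}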
There is a very good discussion of Linux memory management - and specifically the stack - here: 9.7 Stack overflow, it is worth the read.
You can use this command to find out what your current stack soft limit is
ulimit -s
On Mac OS X the hard limit is 64MB, see How to change the stack size using ulimit or per process on Mac OS X for a C or Ruby program?
You can modify the stack limit at run-time from your program, see Change stack size for a C++ application in Linux during compilation with GNU compiler
I combined your code with the sample there; here's a working program:
#include <stdio.h>
#include <sys/resource.h>

unsigned myrand() {
    static unsigned x = 1;
    return (x = x * 1664525 + 1013904223);
}

void increase_stack( rlim_t stack_size )
{
    rlim_t MIN_STACK = 1024 * 1024;
    stack_size += MIN_STACK;

    struct rlimit rl;
    int result;
    result = getrlimit(RLIMIT_STACK, &rl);
    if (result == 0)
    {
        if (rl.rlim_cur < stack_size)
        {
            rl.rlim_cur = stack_size;
            result = setrlimit(RLIMIT_STACK, &rl);
            if (result != 0)
            {
                fprintf(stderr, "setrlimit returned result = %d\n", result);
            }
        }
    }
}

void my_func() {
    double U[100][2048][2048];
    int i,j,k;
    for(i=0;i<100;++i)
        for(j=0;j<2048;++j)
            for(k=0;k<2048;++k)
                U[i][j][k] = myrand();

    double sum = 0;
    int n;
    for(n=0;n<1000;++n)
        sum += U[myrand()%100][myrand()%2048][myrand()%2048];
    printf("sum=%g\n",sum);
}

int main() {
    increase_stack( sizeof(double) * 100 * 2048 * 2048 );
    my_func();
    return 0;
}
You are hitting a limit of the stack. By default on Windows the stack is 1 MB, but it can grow more if there is enough memory.
On many *nix systems, the default stack size is 512K.
You are trying to allocate 2048 * 2048 * 100 * 8 bytes, which is more than 2^31 bytes (over 3 GiB), far beyond any default stack size. If you have a lot of virtual memory available and still want to allocate this on the stack, use a different stack limit while linking the application.
Linux:
How to increase the gcc executable stack size?
Change stack size for a C++ application in Linux during compilation with GNU compiler
Windows:
http://msdn.microsoft.com/en-us/library/tdkhxaks%28v=vs.110%29.aspx