How does Linux allocate memory for its physical allocator?

I was recently delving into the details of Linux's memory management as I want to implement something similar for my own toy kernel, so I was hoping someone familiar with the details could help me understand one thing. Apparently the physical memory manager is a buddy allocator, specialised to return blocks of pages of a particular order (0 to 9, with 0 being just a single page). For each order the free blocks are stored in a linked list. If a block of order 5 is requested but none is found on the order-5 list, the algorithm searches the order-6 list, splits a block in two, returns the requested half and moves the other half one order lower (as it is half the size).
What I don't get is how the kernel stores these structures, or how it allocates space for them. Since for order-0 pages you would need 1M entries (each describing a 4KiB page), does that mean the kernel allocates 1M * sizeof(struct page)? What about blocks of order 1 and above? Does the kernel reuse the allocated entries by marking them as a higher order, and when it needs to split a block in two, just return one half to the list and hand out the other, unused one?

What I don't get is how the kernel stores these structures, or how it allocates space for them. Since for order-0 pages you would need 1M entries (each describing a 4KiB page), does that mean the kernel allocates 1M * sizeof(struct page)?
Initialization of zones is done by calling paging_init() (arch/x86/mm/init_32.c; some descriptions: https://www.kernel.org/doc/gorman/html/understand/understand005.html, 2.3 Zone Initialisation, and http://repo.hackerzvoice.net/depot_madchat/ebooks/Mem_virtuelle/linux-mm/vminit.html, Initializing the Kernel Page Tables). It is reached from setup_arch() via native_pagetable_init() and the indirect call x86_init.paging.pagetable_init():
/*
 * paging_init() sets up the page tables - note that the first 8MB are
 * already mapped by head.S.
 * ...
 */
void __init paging_init(void)
{
        pagetable_init();
        ...
        zone_sizes_init();
}
pagetable_init() creates the kernel page tables in swapper_pg_dir, an array of 1024 pgd_t entries.
zone_sizes_init() actually defines the zones of physical memory and calls free_area_init_nodes() to initialize them; the real work is done (for each NUMA node, for_each_online_node(nid) {...}) in free_area_init_node(), which calls three functions:
calculate_node_totalpages() prints page counts for every node in dmesg
alloc_node_mem_map() does the actual job of allocating a struct page for every physical page in the node; the memory for them is allocated by the bootmem allocator (you can see its debug output with the bootmem_debug=1 kernel boot option; a rough sizing sketch follows after this walk-through):
        size = (end - start) * sizeof(struct page);
        map = alloc_remap(pgdat->node_id, size);
        if (!map)
                map = memblock_virt_alloc_node_nopanic(size, pgdat->node_id);
free_area_init_core() (which also fills the bitmaps in struct zone). The functionality of free_area_init_core() is described for older kernels in http://repo.hackerzvoice.net/depot_madchat/ebooks/Mem_virtuelle/linux-mm/zonealloc.html#INITIALIZE as:
The memory map is built, and the freelists and buddy bitmaps initialized, in free_area_init_core().
The free lists for each order in each zone are initialized, and every order is marked as having no free pages: free_area_init_core() -> init_currently_empty_zone() -> zone_init_free_lists():
static void __meminit zone_init_free_lists(struct zone *zone)
{
        unsigned int order, t;
        for_each_migratetype_order(order, t) {
                INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
                zone->free_area[order].nr_free = 0;
        }
}
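To tie this back to the two parts of the question. First, the sizing: yes, there is one struct page for every physical page frame, and the whole array (mem_map) is allocated up front by the boot-time allocator, as the alloc_node_mem_map() snippet above shows. A back-of-the-envelope sketch in user-space C (the 4 GiB of RAM and the 64-byte struct page are assumptions for illustration; the real sizeof(struct page) depends on the kernel configuration):

    #include <stdio.h>

    int main(void)
    {
        unsigned long long ram_bytes   = 4ULL << 30; /* assumed: 4 GiB of RAM        */
        unsigned long long page_size   = 4096;       /* 4 KiB pages                  */
        unsigned long long page_struct = 64;         /* assumed sizeof(struct page)  */

        unsigned long long nr_pages = ram_bytes / page_size;  /* ~1M page frames */
        unsigned long long map_size = nr_pages * page_struct; /* size of mem_map */

        printf("%llu page frames -> %llu MiB of struct page\n",
               nr_pages, map_size >> 20);
        return 0;
    }

With those assumed numbers this prints 1048576 page frames -> 64 MiB of struct page, i.e. the metadata costs roughly 1.5% of RAM.

Second, blocks of order 1 and above: the kernel never allocates separate descriptors for them. A free block of order n is represented by the struct page of its first page, linked into one of the zone->free_area[n].free_list lists initialized above, so splitting or merging only moves existing descriptors between the per-order lists. A minimal sketch of that idea (simplified, invented types; the real code also deals with migrate types, buddy merging, zones, and so on):

    struct page {                    /* one descriptor per physical page frame          */
        struct page *next;           /* list linkage, only used while the block is free */
    };

    struct free_area {
        struct page *free_list;      /* free blocks of this order */
        unsigned long nr_free;
    };

    #define MAX_ORDER 10
    static struct page mem_map[1 << 20];          /* one entry per page frame */
    static struct free_area free_area[MAX_ORDER];

    /* Allocate a block of 2^order pages, splitting a larger block if needed. */
    static struct page *alloc_block(unsigned int order)
    {
        unsigned int cur;

        for (cur = order; cur < MAX_ORDER; cur++) {
            struct page *block = free_area[cur].free_list;

            if (!block)
                continue;                          /* nothing here, try a bigger order */

            free_area[cur].free_list = block->next;
            free_area[cur].nr_free--;

            while (cur > order) {                  /* split: hand back the upper buddy */
                cur--;
                struct page *buddy = block + (1UL << cur);
                buddy->next = free_area[cur].free_list;
                free_area[cur].free_list = buddy;
                free_area[cur].nr_free++;
            }
            return block;                          /* descriptor of the block's first page */
        }
        return NULL;                               /* no block large enough */
    }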
PS: There is an init() in the kernel; it is called start_kernel(), and LXR (the Linux cross-reference) will help you navigate between functions (I posted links to lxr.free-electrons.com, but there are several online LXRs):
asmlinkage __visible void __init start_kernel(void)
{
        ...
        boot_cpu_init();
        page_address_init();
        pr_notice("%s", linux_banner);
        setup_arch(&command_line);
        ...
}

Related

Retroarch Memory Map undefined for PSX (Beetle PSX HW)

I am working on pulling data from Retroarch via JSON format using software called Gamehook.
I am trying to add support for the PSX with the Beetle PSX HW core.
The issue is that even with extensive searching through the core and retroarch's source code, I cannot find a proper memory map for it. I have tried adding the ranges from the two sources below, which gives me a "no memory map defined" error.
http://www.raphnet.net/electronique/psx_adaptor/Playstation.txt
0x8000_0000 - 0x801f_ffff   Kernel and User Memory Mirror (2 Meg) Cached
0xa000_0000 - 0xa01f_ffff   Kernel and User Memory Mirror (2 Meg) Uncached
0x0000_0000 - 0x0000_ffff   Kernel (64K)
0x0001_0000 - 0x001f_ffff   User Memory (1.9 Meg)
0x1f80_0000 - 0x1f80_03ff   Scratch Pad (1024 bytes)
and
https://psx-spx.consoledev.net/memorymap/
KUSEG      KSEG0      KSEG1
00000000h  80000000h  A0000000h   2048K   Main RAM (first 64K reserved for BIOS)
1F000000h  9F000000h  BF000000h   8192K   Expansion Region 1 (ROM/RAM)
1F800000h  9F800000h  --          1K      Scratchpad (D-Cache used as Fast RAM)
1F801000h  9F801000h  BF801000h   8K      I/O Ports
1F802000h  9F802000h  BF802000h   8K      Expansion Region 2 (I/O Ports)
1FA00000h  9FA00000h  BFA00000h   2048K   Expansion Region 3 (SRAM BIOS region for DTL cards)
1FC00000h  9FC00000h  BFC00000h   512K    BIOS ROM (Kernel) (4096K max)
FFFE0000h (in KSEG2)              0.5K    Internal CPU control registers (Cache Control)
Looking at consoleinfo.c in the RetroArch source code, it defines the memory regions for the PlayStation as:
/* ===== PlayStation ===== */
/* http://www.raphnet.net/electronique/psx_adaptor/Playstation.txt */
static const rc_memory_region_t _rc_memory_regions_playstation[] = {
    { 0x000000U, 0x00FFFFU, 0x000000U, RC_MEMORY_TYPE_SYSTEM_RAM, "Kernel RAM" },
    { 0x010000U, 0x1FFFFFU, 0x010000U, RC_MEMORY_TYPE_SYSTEM_RAM, "System RAM" }
};
static const rc_memory_regions_t rc_memory_regions_playstation = { _rc_memory_regions_playstation, 2 };
And the definition of rc_memory_region_t is:
typedef struct rc_memory_region_t {
    unsigned start_address;      /* first address of block as queried by RetroAchievements */
    unsigned end_address;        /* last address of block as queried by RetroAchievements */
    unsigned real_address;       /* real address for first address of block */
    char type;                   /* RC_MEMORY_TYPE_ for block */
    const char* description;     /* short description of block */
} rc_memory_region_t;
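For reference, my reading of those two structures (this is only my interpretation; the helper below is hypothetical, not part of rcheevos, and it assumes the rc_memory_regions_t fields are named region and num_regions) is that each entry maps a flat address range, as queried by RetroAchievements, onto the console's real starting address:

    /* Hypothetical lookup, for illustration only: translate an address as
     * queried by RetroAchievements into the console's real address. */
    static int translate_address(const rc_memory_regions_t* regions,
                                 unsigned query_address, unsigned* real_address)
    {
        unsigned i;
        for (i = 0; i < regions->num_regions; ++i) {
            const rc_memory_region_t* r = &regions->region[i];
            if (query_address >= r->start_address && query_address <= r->end_address) {
                *real_address = r->real_address + (query_address - r->start_address);
                return 1;
            }
        }
        return 0; /* address not covered by any region */
    }

So, for the PlayStation table above, query address 0x010000 would land in the "System RAM" entry and translate to real address 0x010000.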
I have asked in the RetroArch Discord server and they have unfortunately given me conflicting information, such as that I should use the RetroAchievements memory addresses, or that I need to find out how RetroArch deals with memory addressing for the cores.
Regarding that second suggestion, I have looked at the entire source code in depth and have only found the memory addressing information shown above.
So my question comes down to this: does anyone know how to find the mappable memory addresses exposed by the cores in RetroArch, or does anyone happen to already know them?

Freertos + STM32 - thread memory overflow with malloc

I'm working with an STM32 plus FreeRTOS to implement a file system based on SPI flash. For FreeRTOS, I adopted the heap_1 implementation. This is how I create my task:
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 2500);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
I allocated 10000 bytes of memory to this thread, and in this thread I tried to write data into the flash. The first few calls worked successfully, but somehow it crashes when I try to write more times.
VATAPI_RESULT STM32SPIWriteSector(void *writebuf, uint8_t* SectorAddr, uint32_t buff_size){
    if(STM32SPIEraseSector(SectorAddr) == VAT_SUCCESS){
        DBGSTR("ERASE SECTOR - 0x%2x %2x %2x", SectorAddr[0], SectorAddr[1], SectorAddr[2]);
    }else return VAT_UNKNOWN;
    if(STM32SPIProgram_multiPage(writebuf, SectorAddr, buff_size) == VAT_SUCCESS){
        DBGSTR("WRTIE SECTOR SUCCESSFUL");
        return VAT_SUCCESS;
    }else return VAT_UNKNOWN;
    return VAT_UNKNOWN;
}
.
VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr, uint32_t buff_size){
    VATAPI_RESULT nres;
    uint8_t tmpaddr[3] = {writeAddr[0], writeAddr[1], writeAddr[2]};
    uint8_t* sectorBuf = malloc(4096 * sizeof(uint8_t));
    uint8_t* pagebuf = malloc(255 * sizeof(uint8_t));
    memset(&sectorBuf[0],0,4096);
    memset(&pagebuf[0],0,255);
    uint32_t i = 0, tmp_convert1, times = 0;
    if(buff_size < Page_bufferSize)
        times = 1;
    else{
        times = buff_size / (Page_bufferSize-1);
        if((times%(Page_bufferSize-1))!=0)
            times++;
    }
    /* Note : According to winbond flash feature, the last bytes of every 256 bytes should be 0, so we need to plus one byte on every 256 bytes*/
    i = 0;
    while(i < times){
        memset(&pagebuf[0], 0, Page_bufferSize - 1);
        memcpy(&pagebuf[0], &writebuf[i*255], Page_bufferSize - 1);
        memcpy(&sectorBuf[i*Page_bufferSize], &pagebuf[0], Page_bufferSize - 1);
        sectorBuf[((i+1)*Page_bufferSize)-1] = 0;
        i++;
    }
    i = 0;
    while(i < times){
        if((nres=STM32SPIPageProgram(&sectorBuf[Page_bufferSize*i], &tmpaddr[0], Page_bufferSize)) != VAT_SUCCESS){
            DBGSTR("STM32SPIProgram_allData write data fail on %d times!",i);
            free(sectorBuf);
            free(pagebuf);
            return nres;
        }
        tmp_convert1 = (tmpaddr[0]<<16 | tmpaddr[1]<<8 | tmpaddr[2]) + Page_bufferSize;
        tmpaddr[0] = (tmp_convert1&0xFF0000) >> 16;
        tmpaddr[1] = (tmp_convert1&0xFF00) >> 8;
        tmpaddr[2] = 0x00;
        i++;
    }
    free(sectorBuf);
    free(pagebuf);
    return nres;
}
I opened the debugger and it seems like it crashes when I malloc "sectorBuf" in the function "STM32SPIProgram_multiPage". What confuses me is that I did free the memory after the malloc. Does anyone have an idea about it?
arm-none-eabi-size "RTOS.elf"
text data bss dec hex filename
77564 988 100756 179308 2bc6c RTOS.elf
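(For scale: data + bss is 988 + 100756 = 101744 bytes, so roughly 99 KiB of RAM is already committed statically, which typically includes the heap_1 array. On a part with, say, 128 KiB of SRAM that would leave under 30 KiB for the libc heap that malloc() draws from plus the main stack, while the two buffers above need another 4096 + 255 bytes per call.)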
Reading the manual:
Memory Management
[...]
If RTOS objects are created dynamically then the standard C library malloc() and free() functions can sometimes be used for the purpose, but ...
they are not always available on embedded systems,
they take up valuable code space,
they are not thread safe, and
they are not deterministic (the amount of time taken to execute the function will differ from call to call)
... so more often than not an alternative memory allocation implementation is required.
One embedded / real time system can have very different RAM and timing requirements to another - so a single RAM allocation algorithm will only ever be appropriate for a subset of applications.
To get around this problem, FreeRTOS keeps the memory allocation API in its portable layer. The portable layer is outside of the source files that implement the core RTOS functionality, allowing an application specific implementation appropriate for the real time system being developed to be provided. When the RTOS kernel requires RAM, instead of calling malloc(), it instead calls pvPortMalloc(). When RAM is being freed, instead of calling free(), the RTOS kernel calls vPortFree().
[...]
(Emphasis mine.)
So the point is that if you call malloc() directly, FreeRTOS is not able to account for the heap consumed by that standard library function. The same applies if you choose the heap_3 scheme, which is a simple malloc() wrapper.
Also take note that the memory management scheme you chose (heap_1) has no free capability.
heap_1.c
This is the simplest implementation of all. It does not permit memory to be freed once it has been allocated. Despite this, heap_1.c is appropriate for a large number of embedded applications. This is because many small and deeply embedded applications create all the tasks, queues, semaphores, etc. required when the system boots, and then use all of these objects for the lifetime of program (until the application is switched off again, or is rebooted). Nothing ever gets deleted.
The implementation simply subdivides a single array into smaller blocks as RAM is requested. The total size of the array (the total size of the heap) is set by configTOTAL_HEAP_SIZE - which is defined in FreeRTOSConfig.h. The configAPPLICATION_ALLOCATED_HEAP FreeRTOSConfig.h configuration constant is provided to allow the heap to be placed at a specific address in memory.
The xPortGetFreeHeapSize() API function returns the total amount of heap space that remains unallocated, allowing the configTOTAL_HEAP_SIZE setting to be optimised.
The heap_1 implementation:
Can be used if your application never deletes a task, queue, semaphore, mutex, etc. (which actually covers the majority of applications in which FreeRTOS gets used).
Is always deterministic (always takes the same amount of time to execute) and cannot result in memory fragmentation.
Is very simple and allocates memory from a statically allocated array, meaning it is often suitable for use in applications that do not permit true dynamic memory allocation.
(Emphasis mine.)
Side note: you always have to check that the malloc() return value is not NULL.
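As a sketch of that direction (assuming you also switch from heap_1 to a scheme that supports freeing, such as heap_4, and size configTOTAL_HEAP_SIZE to cover these buffers), the two buffers in STM32SPIProgram_multiPage() would come from the FreeRTOS-managed heap instead of the libc one:

    #include "FreeRTOS.h"   /* brings in pvPortMalloc() / vPortFree() */

    /* Allocate the working buffers from the FreeRTOS heap so the kernel can
     * account for them and xPortGetFreeHeapSize() reflects reality. */
    uint8_t *sectorBuf = pvPortMalloc(4096);
    uint8_t *pagebuf   = pvPortMalloc(255);

    if (sectorBuf == NULL || pagebuf == NULL) {
        /* Always check before using the buffers. */
        if (sectorBuf != NULL) vPortFree(sectorBuf);
        if (pagebuf  != NULL) vPortFree(pagebuf);
        return VAT_UNKNOWN;
    }

    /* ... build the sector image and program it as before ... */

    vPortFree(pagebuf);
    vPortFree(sectorBuf);

Since both buffer sizes are fixed, another option that avoids dynamic allocation entirely is to make them static buffers, which fits the heap_1 "allocate everything once at boot" philosophy.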

What is the memory node in kzalloc_node in the Linux kernel

I do not understand what the memory node is in the kzalloc_node function. The description says, "allocate zeroed memory from a particular memory node." But what is a memory node? I am specifically looking at a portion of the deadline I/O scheduler (shown below).
static int deadline_init_queue(struct request_queue *q, struct elevator_type *e)
{
        struct deadline_data *dd;
        ...
        dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
        ...
}
There's a very good description here:
https://www.kernel.org/doc/gorman/html/understand/understand009.html
...the function alloc_pages() calls numa_node_id() to return the logical ID of the node associated with the current running CPU. This NID is passed to _alloc_pages() which calls NODE_DATA() with the NID as a parameter.

On UMA architectures, this will unconditionally result in contig_page_data being returned but NUMA architectures instead set up an array which NODE_DATA() uses NID as an offset into. In other words, architectures are responsible for setting up a CPU ID to NUMA memory node mapping.

This is effectively still a node-local allocation policy as is used in 2.4 but it is a lot more clearly defined.
See also: https://en.wikipedia.org/wiki/Non-uniform_memory_access
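In short, the third argument just names which NUMA node's memory the allocation should come from. A hypothetical illustration (kernel C, not taken from the scheduler source; the deadline snippet above uses the first form, the alternatives are shown only for contrast):

    #include <linux/slab.h>      /* kzalloc_node() */
    #include <linux/topology.h>  /* numa_node_id() */
    #include <linux/numa.h>      /* NUMA_NO_NODE */

    static struct deadline_data *alloc_dd_example(struct request_queue *q)
    {
            struct deadline_data *dd;

            /* Memory close to the device this queue serves, as above. */
            dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);

            /* Alternatives:
             *   kzalloc_node(sizeof(*dd), GFP_KERNEL, numa_node_id());  near the current CPU
             *   kzalloc_node(sizeof(*dd), GFP_KERNEL, NUMA_NO_NODE);    no preference
             * On a UMA machine all of these come from the single node 0. */
            return dd;
    }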

Can CUDA Unified Memory be used as Pinned memory (Unified Virtual Memory)?

As far as I know, we can allocate a pinned memory area within kernel memory (from KGPU), then allocate Linux kernel data in that pinned memory and transfer it to the GPU.
But the problem is that the Linux kernel data should be arranged as an array, and in my case today it is a tree.
I have tried passing it from pinned memory to the GPU, but when a node accesses the next node a memory access error occurs.
I was wondering whether Unified Memory can be allocated as a pinned memory area in kernel memory, so the tree can be built in the Unified Memory area and used by the GPU without another runtime API like cudaMallocManaged.
Or must Unified Memory only be used through cudaMallocManaged?
But when a node accesses the next node, a memory access error occurs.
This just means you have a bug in your code.
Or must Unified Memory only be used through cudaMallocManaged?
Currently, the only way to access the features of Unified Memory is to use a managed allocator. For dynamic allocations, that is cudaMallocManaged(). For static allocations, it is via the __managed__ keyword.
The programming guide has additional information.
In response to the comments below, here is a trivial worked example of creating a singly-linked list using pinned memory, and traversing that list in device code:
$ cat t1115.cu
#include <stdio.h>
#define NUM_ELE 5

struct ListElem{
    int id;
    bool last;
    ListElem *next;
};

__global__ void test_kernel(ListElem *list){
    int count = 0;
    while (!(list->last)){
        printf("List element %d has id %d\n", count++, list->id);
        list = list->next;}
    printf("List element %d is the last item in the list\n", count);
}

int main(){
    ListElem *h_list, *my_list;
    cudaHostAlloc(&h_list, sizeof(ListElem), cudaHostAllocDefault);
    my_list = h_list;
    for (int i = 0; i < NUM_ELE-1; i++){
        my_list->id = i+101;
        my_list->last = false;
        cudaHostAlloc(&(my_list->next), sizeof(ListElem), cudaHostAllocDefault);
        my_list = my_list->next;}
    my_list->last = true;
    test_kernel<<<1,1>>>(h_list);
    cudaDeviceSynchronize();
}
$ nvcc -o t1115 t1115.cu
$ cuda-memcheck ./t1115
========= CUDA-MEMCHECK
List element 0 has id 101
List element 1 has id 102
List element 2 has id 103
List element 3 has id 104
List element 4 is the last item in the list
========= ERROR SUMMARY: 0 errors
$
Note that in the interest of brevity of presentation, I have dispensed with proper CUDA error checking in this example (although running the code with cuda-memcheck demonstrates there are no CUDA run-time errors), but I recommend it any time you are having trouble with a CUDA code. Also note that this example assumes a proper UVA environment.

HeapFree Breakpoint on Free()

I have a very large (~1E9) array of objects that I malloc, realloc, and free over the iterations of a single-threaded program.
Specifically,
//(Individual *ind)
//malloc, old implementation
ind->obj = (double *)malloc(sizeof(double)*acb->in.nobj);

//new implementation
ind->obj = (double *)a_allocate(acb->in.nobj*sizeof(double));

void *a_allocate (int siz){
    void *buf;
    buf = calloc(1,siz);
    acb->totmemMalloc+=siz;
    if (buf==NULL){
        a_throw2("a_allocate...failed to allocate buf...<%d>",siz);
    }
    return buf;
}
...
//realloc
ind->obj = (double *)a_realloc(ind->obj, acb->in.nobj*sizeof(double));

void *a_realloc (void *bufIn,int siz)
{
    void *buf = bufIn;
    if (buf==NULL){
        a_throw2("a_realloc called with null bufIn...");
    }
    buf = realloc(buf,siz);
    return buf;
}
...
//deallocate
free(ind->obj);
The other three dozen properties are processed similarly.
However, every few test runs, the code fails a heap validation on the deallocation of only this object property (the free() statement). At the time of failure, the ind->obj property is not null and has some valid value.
Is there any obvious problem with what I'm doing?
I'm very new to C and am not entirely sure I'm performing the memory operations correctly.
Thanks!
EDIT: using _CRTLDBG_REPORT_FLAG
HEAP[DEMO.exe]: Heap block at 010172B0 modified at 010172E4 past requested size of 2c
Heap validation is a delayed check: the corruption is only noticed some time after it happens. The Visual Studio debug heap can be used (in a debug build) with more frequent checks (see Microsoft: Debug Heap flags).
Alternatively, using Application Verifier and turning on heap checking will help find the point which is causing this.
+-----+----------+-----+ +----+-------------+-----+
| chk | memory |chk2 | | chk| different m | chk2|
+-----+----------+-----+ +----+-------------+-----+
When the system allocates memory, it puts meta-information about the allocation before the returned pointer (or maybe after). When these pieces of memory get overwritten, that causes the heap failure.
This may be the memory you are freeing, or the block that was allocated directly before it.
Edit - to address comments
A message such as "HEAP[DEMO.exe]: Heap block at 010172B0 modified at 010172E4 past requested size of 2c"
implies that something wrote beyond the end of the memory block allocated at 010172B0.
This could be because the amount malloc'd/realloc'd was too small, or because of an error in your loops.
+---+-----------------+----+--------------------------+
|chk|d0|d1|d2|d3|d4|d5|chk2| memory |
+---+-----------------+----+--------------------------+
So if you tried to write into d6 above, that would cause 'chk2' to be overwritten, which is what is being detected. In this case the difference is small: the requested size is 0x2c (44 bytes) and the offset of the modification is 0xE4 - 0xB0 = 0x34 (52 bytes), i.e. 8 bytes (one double) past the end of the block.
Turning on these debug checks should make your code crash earlier and more predictably. If there is no randomness in your data, then turn off ASLR (only for debugging) so the addresses being used are repeatable; you can then put a breakpoint in the malloc/realloc for a given memory address.
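For illustration only (this is a generic example of the failure mode, not taken from your code): with an array of doubles, a classic way to get exactly this message is an off-by-one loop bound, which writes one 8-byte element past the requested size and is only reported later, when the debug heap checks the block:

    #include <stdlib.h>

    int main(void)
    {
        int nobj = 5;                               /* 5 doubles = 40 bytes requested */
        double *obj = (double *)malloc(sizeof(double) * nobj);
        if (obj == NULL)
            return 1;

        for (int i = 0; i <= nobj; i++)             /* BUG: <= writes obj[5], 8 bytes  */
            obj[i] = 0.0;                           /* past the end, trashing the heap */
                                                    /* bookkeeping after the block     */

        free(obj);                                  /* the corruption is reported here, */
        return 0;                                   /* not at the bad write             */
    }

So the place to look is every loop that fills ind->obj (and the sizes passed to malloc/realloc), checking whether an index can reach acb->in.nobj.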
