I am scratching my head over this problem.
If I check my application with the top command in Linux, VIRT always stays the same (the application has been running for a couple of days), while RES increases a bit (between 4 and 32 bytes) after each operation. I perform an operation once every 60 minutes.
An operation consists of reading some frames over SPI, adding them to several linked lists and, after a while, extracting them in another thread.
I executed Valgrind with the following options:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes -v ./application
Once I close it (my app would run forever if everything went well), I check for leaks and see nothing but the threads I haven't closed: only "possibly lost" bytes in threads, no "definitely lost" and no "indirectly lost".
I do not know if this is normal. I have used Valgrind in the past to find leaks and it always worked well. I am pretty sure it is working well now too, but I can't explain the RES issue.
I have checked the linked list to see whether I was leaving some nodes unfreed but, to me, it seems to be right.
I have 32 linked lists in one array. I did this to make the push/pop operations easier than with 32 separate lists. I don't know if this could be causing the problem. If this is the case, I will split them:
typedef struct PR_LL_M_node {
uint8_t val[60];
struct PR_LL_M_node *next;
} PR_LL_M_node_t;
pthread_mutex_t PR_LL_M_lock[32];
PR_LL_M_node_t *PR_LL_M_head[32];
uint16_t PR_LL_M_counter[32];
int16_t LL_M_InitLinkedList(uint8_t LL_M_number) {
if (pthread_mutex_init(&PR_LL_M_lock[LL_M_number], NULL) != 0) {
printf("Mutex LL M %d init failed\n", LL_M_number);
return -1;
}
PR_LL_M_ready[LL_M_number] = 0;
PR_LL_M_counter[LL_M_number] = 0;
PR_LL_M_head[LL_M_number] = NULL;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
int16_t LL_M_Push(uint8_t LL_M_number, uint8_t *LL_M_frame, uint16_t LL_M_size) {
pthread_mutex_lock(&PR_LL_M_lock[LL_M_number]);
PR_LL_M_node_t *current = PR_LL_M_head[LL_M_number];
if (current != NULL) {
while (current->next != NULL) {
current = current->next;
}
/* now we can add a new variable */
current->next = malloc(sizeof(PR_LL_M_node_t));
memset(current->next->val, 0x00, 60);
/* Clean buffer before using it */
memcpy(current->next->val, LL_M_frame, LL_M_size);
current->next->next = NULL;
} else {
PR_LL_M_head[LL_M_number] = malloc(sizeof(PR_LL_M_node_t));
memcpy(PR_LL_M_head[LL_M_number]->val, LL_M_frame, LL_M_size);
PR_LL_M_head[LL_M_number]->next = NULL;
}
PR_LL_M_counter[LL_M_number]++;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
int16_t LL_M_Pop(uint8_t LL_M_number, uint8_t *LL_M_frame) {
PR_LL_M_node_t *next_node = NULL;
pthread_mutex_lock(&PR_LL_M_lock[LL_M_number]);
if ((PR_LL_M_head[LL_M_number] == NULL)) {
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return -1;
}
if ((PR_LL_M_counter[LL_M_number] == 0)) {
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return -1;
}
next_node = PR_LL_M_head[LL_M_number]->next;
memcpy(LL_M_frame, PR_LL_M_head[LL_M_number]->val, 60);
free(PR_LL_M_head[LL_M_number]);
PR_LL_M_counter[LL_M_number]--;
PR_LL_M_head[LL_M_number] = next_node;
pthread_mutex_unlock(&PR_LL_M_lock[LL_M_number]);
return PR_LL_M_counter[LL_M_number];
}
This way I pass the number of the linked list I want to manage and operate on it. What do you think? Is RES a real problem? I think it could be related to other parts of the application, but I have commented out most of it and it always happens when the push/pop operations are used. If I leave push/pop out, RES stays constant.
When I extract the values, I use a do/while loop until pop returns -1.
Your observations do not seem to indicate a problem:
VIRT and RES are expressed in KiB (units of 1024 bytes). Depending on how virtual memory works on your system, the numbers should always be multiples of the page size, which is most likely 4KiB.
RES is the amount of resident memory, in other words the amount of RAM actually mapped for your program at a given time.
If the program goes to sleep for 60 minutes at a time, the system (Linux) may determine that some of its pages are good candidates for discarding or swapping should it need to map memory for other processes. RES will diminish accordingly. Note incidentally that top is one such process that needs memory and thus can disturb your process while observing it, a computing variant of Heisenberg's Principle.
When the process wakes up, whatever memory is accessed by the running thread is mapped back into memory, either from the executable file(s) if it was discarded or from the swap file, or from nowhere if the discarded page was all null bytes or an unused part of the stack. This causes RES to increase again.
There might be other problems in the code, especially in code that you did not post, but if VIRT does not change, you do not seem to have a memory leak.
My AF-XDP userspace program is based on this tutorial: https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP
I am currently trying to parse ~360.000 RTP packets per second (checking for continuous sequence numbers), but I lose around 25 per second (meaning that for 25 packets the condition previous_rtp_sqnz_nmbr + 1 == current_rtp_sqnz_nmbr doesn't hold).
So I tried to increase the number of allocated packets NUM_FRAMES from 228.000 to 328.000. With default FRAME_SIZE of XSK_UMEM__DEFAULT_FRAME_SIZE = 4096 this results in 1281Mbyte being allocated (no problem because I have 32GB of RAM) but for whatever reason, this function call:
static struct xsk_umem_info *configure_xsk_umem(void *buffer, uint64_t size)
{
printf("Try to allocate %lu\n", size);
struct xsk_umem_info *umem;
int ret;
umem = calloc(1, sizeof(*umem));
if (!umem)
return NULL;
ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
NULL);
if (ret) {
free(umem); /* do not leak the descriptor when creation fails */
errno = -ret;
return NULL;
}
umem->buffer = buffer;
return umem;
}
fails with
Try to allocate 1343488000
ERROR: Can't create umem "Cannot allocate memory"
I don't know why. But because I know that my RTP packets are no larger than 1500 bytes, I set FRAME_SIZE to 3072, so I am now at around 960 MByte (which works without an error).
However, I am now losing half of the received packets (meaning that for 180.000 packets the previous sequence number doesn't line up with the current sequence number).
Because of this I ask the question: What is the relationship between FRAME_SIZE and the actual size of a packet? Because obviously it can not be the same.
Edit: I am using 5.4.0-4-amd64 #1 SMP Debian 5.4.19-1 (2020-02-13) x86_64 GNU/Linux and just copied the libbpf-repository from here into my code-base: https://github.com/libbpf/libbpf
So I don't know whether the error mentioned here: https://github.com/xdp-project/xdp-tutorial/issues/76 is still valid?
I came up with an idea I am trying to implement for a lock free stack that does not rely on reference counting to resolve the ABA problem, and also handles memory reclamation properly. It is similar in concept to RCU, and relies on two features: marking a list entry as removed, and tracking readers traversing the list. The former is simple, it just uses the LSB of the pointer. The latter is my "clever" attempt at an approach to implementing an unbounded lock free stack.
Basically, when any thread attempts to traverse the list, one atomic counter (list.entries) is incremented. When the traversal is complete, a second counter (list.exits) is incremented.
Node allocation is handled by push, and deallocation is handled by pop.
The push and pop operations are fairly similar to the naive lock-free stack implementation, but the nodes marked for removal must be traversed to arrive at a non-marked entry. Push is therefore much like a linked list insertion.
The pop operation similarly traverses the list, but it uses atomic_fetch_or to mark the nodes as removed while traversing, until it reaches a non-marked node.
After traversing the list of 0 or more marked nodes, a thread that is popping will attempt to CAS the head of the stack. At least one thread concurrently popping will succeed, and after this point all readers entering the stack will no longer see the formerly marked nodes.
The thread that successfully updates the list then loads the atomic list.entries, and basically spin-loads atomic.exits until that counter finally exceeds list.entries. This should imply that all readers of the "old" version of the list have completed. The thread then simply frees the list of marked nodes that it swapped off the top of the list.
So the implications from the pop operation should be (I think) that there can be no ABA problem because the nodes that are freed are not returned to the usable pool of pointers until all concurrent readers using them have completed, and obviously the memory reclamation issue is handled as well, for the same reason.
So anyhow, that is theory, but I'm still scratching my head on the implementation, because it is currently not working (in the multithreaded case). It seems like I am getting some write after free issues among other things, but I'm having trouble spotting the issue, or maybe my assumptions are flawed and it just won't work.
Any insights would be greatly appreciated, both on the concept, and on approaches to debugging the code.
Here is my current (broken) code (compile with gcc -D_GNU_SOURCE -std=c11 -Wall -O0 -g -pthread -o list list.c):
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>
#define NUM_THREADS 8
#define NUM_OPS (1024 * 1024)
typedef uint64_t list_data_t;
typedef struct list_node_t {
struct list_node_t * _Atomic next;
list_data_t data;
} list_node_t;
typedef struct {
list_node_t * _Atomic head;
int64_t _Atomic size;
uint64_t _Atomic entries;
uint64_t _Atomic exits;
} list_t;
enum {
NODE_IDLE = (0x0),
NODE_REMOVED = (0x1 << 0),
NODE_FREED = (0x1 << 1),
NODE_FLAGS = (0x3),
};
static __thread struct {
uint64_t add_count;
uint64_t remove_count;
uint64_t added;
uint64_t removed;
uint64_t mallocd;
uint64_t freed;
} stats;
#define NODE_IS_SET(p, f) (((uintptr_t)p & f) == f)
#define NODE_SET_FLAG(p, f) ((void *)((uintptr_t)p | f))
#define NODE_CLR_FLAG(p, f) ((void *)((uintptr_t)p & ~f))
#define NODE_POINTER(p) ((void *)((uintptr_t)p & ~NODE_FLAGS))
list_node_t * list_node_new(list_data_t data)
{
list_node_t * new = malloc(sizeof(*new));
new->data = data;
stats.mallocd++;
return new;
}
void list_node_free(list_node_t * node)
{
free(node);
stats.freed++;
}
static void list_add(list_t * list, list_data_t data)
{
atomic_fetch_add_explicit(&list->entries, 1, memory_order_seq_cst);
list_node_t * new = list_node_new(data);
list_node_t * _Atomic * next = &list->head;
list_node_t * current = atomic_load_explicit(next, memory_order_seq_cst);
do
{
stats.add_count++;
while ((NODE_POINTER(current) != NULL) &&
NODE_IS_SET(current, NODE_REMOVED))
{
stats.add_count++;
current = NODE_POINTER(current);
next = &current->next;
current = atomic_load_explicit(next, memory_order_seq_cst);
}
atomic_store_explicit(&new->next, current, memory_order_seq_cst);
}
while(!atomic_compare_exchange_weak_explicit(
next, &current, new,
memory_order_seq_cst, memory_order_seq_cst));
atomic_fetch_add_explicit(&list->exits, 1, memory_order_seq_cst);
atomic_fetch_add_explicit(&list->size, 1, memory_order_seq_cst);
stats.added++;
}
static bool list_remove(list_t * list, list_data_t * pData)
{
uint64_t entries = atomic_fetch_add_explicit(
&list->entries, 1, memory_order_seq_cst);
list_node_t * start = atomic_fetch_or_explicit(
&list->head, NODE_REMOVED, memory_order_seq_cst);
list_node_t * current = start;
stats.remove_count++;
while ((NODE_POINTER(current) != NULL) &&
NODE_IS_SET(current, NODE_REMOVED))
{
stats.remove_count++;
current = NODE_POINTER(current);
current = atomic_fetch_or_explicit(&current->next,
NODE_REMOVED, memory_order_seq_cst);
}
uint64_t exits = atomic_fetch_add_explicit(
&list->exits, 1, memory_order_seq_cst) + 1;
bool result = false;
current = NODE_POINTER(current);
if (current != NULL)
{
result = true;
*pData = current->data;
current = atomic_load_explicit(
&current->next, memory_order_seq_cst);
atomic_fetch_add_explicit(&list->size,
-1, memory_order_seq_cst);
stats.removed++;
}
start = NODE_SET_FLAG(start, NODE_REMOVED);
if (atomic_compare_exchange_strong_explicit(
&list->head, &start, current,
memory_order_seq_cst, memory_order_seq_cst))
{
entries = atomic_load_explicit(&list->entries, memory_order_seq_cst);
while ((int64_t)(entries - exits) > 0)
{
pthread_yield();
exits = atomic_load_explicit(&list->exits, memory_order_seq_cst);
}
list_node_t * end = NODE_POINTER(current);
list_node_t * current = NODE_POINTER(start);
while (current != end)
{
list_node_t * tmp = current;
current = atomic_load_explicit(&current->next, memory_order_seq_cst);
list_node_free(tmp);
current = NODE_POINTER(current);
}
}
return result;
}
static list_t list;
pthread_mutex_t ioLock = PTHREAD_MUTEX_INITIALIZER;
void * thread_entry(void * arg)
{
sleep(2);
int id = *(int *)arg;
for (int i = 0; i < NUM_OPS; i++)
{
bool insert = random() % 2;
if (insert)
{
list_add(&list, i);
}
else
{
list_data_t data;
list_remove(&list, &data);
}
}
struct rusage u;
getrusage(RUSAGE_THREAD, &u);
pthread_mutex_lock(&ioLock);
printf("Thread %d stats:\n", id);
printf("\tadded = %lu\n", stats.added);
printf("\tremoved = %lu\n", stats.removed);
printf("\ttotal added = %ld\n", (int64_t)(stats.added - stats.removed));
printf("\tadded count = %lu\n", stats.add_count);
printf("\tremoved count = %lu\n", stats.remove_count);
printf("\tadd average = %f\n", (float)stats.add_count / stats.added);
printf("\tremove average = %f\n", (float)stats.remove_count / stats.removed);
printf("\tmallocd = %lu\n", stats.mallocd);
printf("\tfreed = %lu\n", stats.freed);
printf("\ttotal mallocd = %ld\n", (int64_t)(stats.mallocd - stats.freed));
printf("\tutime = %f\n", u.ru_utime.tv_sec
+ u.ru_utime.tv_usec / 1000000.0f);
printf("\tstime = %f\n", u.ru_stime.tv_sec
+ u.ru_stime.tv_usec / 1000000.0f);
pthread_mutex_unlock(&ioLock);
return NULL;
}
int main(int argc, char ** argv)
{
struct {
pthread_t thread;
int id;
}
threads[NUM_THREADS];
for (int i = 0; i < NUM_THREADS; i++)
{
threads[i].id = i;
pthread_create(&threads[i].thread, NULL, thread_entry, &threads[i].id);
}
for (int i = 0; i < NUM_THREADS; i++)
{
pthread_join(threads[i].thread, NULL);
}
printf("Size = %ld\n", atomic_load(&list.size));
uint32_t count = 0;
list_data_t data;
while(list_remove(&list, &data))
{
count++;
}
printf("Removed %u\n", count);
}
You mention you are trying to solve the ABA problem, but the description and code are actually an attempt to solve a harder problem: the memory reclamation problem.
This problem typically arises in the "deletion" functionality of lock-free collections implemented in languages without garbage collection. The core issue is that a thread removing a node from a shared structure often doesn't know when it is safe to free the removed node, because other readers may still hold a reference to it. Solving this problem often, as a side effect, also solves the ABA problem, which is specifically about a CAS operation succeeding even though the underlying pointer (and state of the object) has been changed at least twice in the meantime, ending up with the original value but presenting a totally different state.
The ABA problem is easier in the sense that there are several straightforward solutions to it that don't lead to a solution to the memory reclamation problem. It is also easier in the sense that hardware that can detect the modification of the location, e.g., with LL/SC or transactional memory primitives, might not exhibit the problem at all.
So that said, you are hunting for a solution to the memory reclamation problem, and it will also avoid the ABA problem.
The core of your issue is this statement:
The thread that successfully updates the list then loads the atomic
list.entries, and basically spin-loads atomic.exits until that counter
finally exceeds list.entries. This should imply that all readers of
the "old" version of the list have completed. The thread then simply
frees the list of marked nodes that it swapped off the top of the
list.
This logic doesn't hold. Waiting for list.exits (you say atomic.exits but I think it's a typo as you only talk about list.exits elsewhere) to be greater than list.entries only tells you there have now been more total exits than there were entries at the point the mutating thread captured the entry count. However, these exits may have been generated by new readers coming and going: it doesn't at all imply that all the old readers have finished as you claim!
Here's a simple example. First a writing thread T1 and a reading thread T2 access the list around the same time, so list.entries is 2 and list.exits is 0. The writing thread pops a node, saves the current value (2) of list.entries and waits for list.exits to be greater than 2. Now three more reading threads, T3, T4 and T5, arrive, do a quick read of the list and leave. Now list.exits is 3, your condition is met and T1 frees the node. T2 hasn't gone anywhere though, and blows up since it is reading a freed node!
The basic idea you have can work, but your two counter approach in particular definitely doesn't work.
This is a well-studied problem, so you don't have to invent your own algorithm (see the link above), or even write your own code, since things like liburcu and concurrencykit already exist.
For Educational Purposes
If you wanted to make this work for educational purposes though, one approach would be to ensure that threads coming in after a modification has started use a different set of list.entries/list.exits counters. One way to do this would be a generation counter: when the writer wants to modify the list, it increments the generation counter, which causes new readers to switch to a different set of list.entries/list.exits counters.
Now the writer just has to wait for list.entries[old] == list.exits[old], which means all the old readers have left. You could also get away with a single counter per generation: you don't really need two entry/exit counters (although it might help reduce contention).
Of course, you now have a new problem of managing this list of separate counters per generation... which kind of looks like the original problem of building a lock-free list! This problem is a bit easier though, because you might put some reasonable bound on the number of generations "in flight" and just allocate them all up-front, or you might implement a limited type of lock-free list that is easier to reason about because additions and deletions only occur at the head or tail.
I'm working with STM32 + RTOS to implement a file system based on SPI flash. For FreeRTOS, I adopted the heap_1 implementation. This is how I create my task:
osThreadDef(Task_Embedded, Task_VATEmbedded, osPriorityNormal, 0, 2500);
VATEmbeddedTaskHandle = osThreadCreate(osThread(Task_Embedded), NULL);
I allocated 10000 bytes of memory to this thread.
In this thread, I tried to write data into flash. The first few calls worked successfully, but somehow it crashes when I write more times.
VATAPI_RESULT STM32SPIWriteSector(void *writebuf, uint8_t* SectorAddr, uint32_t buff_size){
if(STM32SPIEraseSector(SectorAddr) == VAT_SUCCESS){
DBGSTR("ERASE SECTOR - 0x%2x %2x %2x", SectorAddr[0], SectorAddr[1], SectorAddr[2]);
}else return VAT_UNKNOWN;
if(STM32SPIProgram_multiPage(writebuf, SectorAddr, buff_size) == VAT_SUCCESS){
DBGSTR("WRTIE SECTOR SUCCESSFUL");
return VAT_SUCCESS;
}else return VAT_UNKNOWN;
return VAT_UNKNOWN;
}
VATAPI_RESULT STM32SPIProgram_multiPage(uint8_t *writebuf, uint8_t *writeAddr, uint32_t buff_size){
VATAPI_RESULT nres;
uint8_t tmpaddr[3] = {writeAddr[0], writeAddr[1], writeAddr[2]};
uint8_t* sectorBuf = malloc(4096 * sizeof(uint8_t));
uint8_t* pagebuf = malloc(255* sizeof(uint8_t));
memset(&sectorBuf[0],0,4096);
memset(&pagebuf[0],0,255);
uint32_t i = 0, tmp_convert1, times = 0;
if(buff_size < Page_bufferSize)
times = 1;
else{
times = buff_size / (Page_bufferSize-1);
if((times%(Page_bufferSize-1))!=0)
times++;
}
/* Note : According to winbond flash feature, the last bytes of every 256 bytes should be 0, so we need to plus one byte on every 256 bytes*/
i = 0;
while(i < times){
memset(&pagebuf[0], 0, Page_bufferSize - 1);
memcpy(&pagebuf[0], &writebuf[i*255], Page_bufferSize - 1);
memcpy(&sectorBuf[i*Page_bufferSize], &pagebuf[0], Page_bufferSize - 1);
sectorBuf[((i+1)*Page_bufferSize)-1] = 0;
i++;
}
i = 0;
while(i < times){
if((nres=STM32SPIPageProgram(&sectorBuf[Page_bufferSize*i], &tmpaddr[0], Page_bufferSize)) != VAT_SUCCESS){
DBGSTR("STM32SPIProgram_allData write data fail on %d times!",i);
free(sectorBuf);
free(pagebuf);
return nres;
}
tmp_convert1 = (tmpaddr[0]<<16 | tmpaddr[1]<<8 | tmpaddr[2]) + Page_bufferSize;
tmpaddr[0] = (tmp_convert1&0xFF0000) >> 16;
tmpaddr[1] = (tmp_convert1&0xFF00) >>8;
tmpaddr[2] = 0x00;
i++;
}
free(sectorBuf);
free(pagebuf);
return nres;
}
I opened the debugger and it seems to crash when I malloc "sectorBuf" in the function "STM32SPIProgram_multiPage". What confuses me is that I did free the memory after "malloc". Does anyone have an idea about it?
arm-none-eabi-size "RTOS.elf"
text data bss dec hex filename
77564 988 100756 179308 2bc6c RTOS.elf
Reading the manual:
Memory Management
[...]
If RTOS objects are created dynamically then the standard C library malloc() and free() functions can sometimes be used for the purpose, but ...
they are not always available on embedded systems,
they take up valuable code space,
they are not thread safe, and
they are not deterministic (the amount of time taken to execute the function will differ from call to call)
... so more often than not an alternative memory allocation implementation is required.
One embedded / real time system can have very different RAM and timing requirements to another - so a single RAM allocation algorithm will only ever be appropriate for a subset of applications.
To get around this problem, FreeRTOS keeps the memory allocation API in its portable layer. The portable layer is outside of the source files that implement the core RTOS functionality, allowing an application specific implementation appropriate for the real time system being developed to be provided. When the RTOS kernel requires RAM, instead of calling malloc(), it instead calls pvPortMalloc(). When RAM is being freed, instead of calling free(), the RTOS kernel calls vPortFree().
[...]
(Emphasis mine.)
So the meaning is that if you use malloc directly, FreeRTOS is not able to manage the heap consumed by that system function. The same applies if you choose heap_3 memory management, which is a simple malloc wrapper.
Also note that the memory management scheme you chose has no free capability.
heap_1.c
This is the simplest implementation of all. It does not permit memory to be freed once it has been allocated. Despite this, heap_1.c is appropriate for a large number of embedded applications. This is because many small and deeply embedded applications create all the tasks, queues, semaphores, etc. required when the system boots, and then use all of these objects for the lifetime of program (until the application is switched off again, or is rebooted). Nothing ever gets deleted.
The implementation simply subdivides a single array into smaller blocks as RAM is requested. The total size of the array (the total size of the heap) is set by configTOTAL_HEAP_SIZE - which is defined in FreeRTOSConfig.h. The configAPPLICATION_ALLOCATED_HEAP FreeRTOSConfig.h configuration constant is provided to allow the heap to be placed at a specific address in memory.
The xPortGetFreeHeapSize() API function returns the total amount of heap space that remains unallocated, allowing the configTOTAL_HEAP_SIZE setting to be optimised.
The heap_1 implementation:
Can be used if your application never deletes a task, queue, semaphore, mutex, etc. (which actually covers the majority of applications in which FreeRTOS gets used).
Is always deterministic (always takes the same amount of time to execute) and cannot result in memory fragmentation.
Is very simple and allocates memory from a statically allocated array, meaning it is often suitable for use in applications that do not permit true dynamic memory allocation.
(Emphasis mine.)
Side note: you should always check that the malloc return value is != NULL.
After reading the following paper https://people.freebsd.org/~lstewart/articles/cpumemory.pdf ("What every programmer should know about memory") I wanted to try one of the author's tests, that is, measuring the effect of the TLB on the final execution time.
I am working on a Samsung Galaxy S3 that embeds a Cortex-A9.
According to the documentation:
we have two micro TLBs for instruction and data cache in L1 (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)
The main TLB is located in L2 (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddiifa.html)
Data micro TLB has 32 entries (instruction micro TLB has either 32 or 64 entries)
L1 size == 32 KBytes
L1 cache line == 32 bytes
L2 size == 1MB
I wrote a small program that allocates an array of structs with N entries. Each entry's size is 32 bytes so it fits in a cache line.
I perform several read accesses and measure the execution time.
typedef struct {
int elmt; // sizeof(int) == 4 bytes
char padding[28]; // 4 + 28 = 32B == cache line size
}entry;
volatile entry ** entries = NULL;
//Allocate memory and init to 0
entries = calloc(NB_ENTRIES, sizeof(entry *));
if(entries == NULL) perror("calloc failed"); exit(1);
for(i = 0; i < NB_ENTRIES; i++)
{
entries[i] = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
}
entries[LAST_ELEMENT]->elmt = -1
//Randomly access and init with random values
n = -1;
i = 0;
while(++n < NB_ENTRIES -1)
{
//init with random value
entries[i]->elmt = rand() % NB_ENTRIES;
//loop till we reach the last element
while(entries[entries[i]->elmt]->elmt != -1)
{
entries[i]->elmt++;
if(entries[i]->elmt == NB_ENTRIES)
entries[i]->elmt = 0;
}
i = entries[i]->elmt;
}
gettimeofday(&tStart, NULL);
for(i = 0; i < NB_LOOPS; i++)
{
j = 0;
while(j != -1)
{
j = entries[j]->elmt
}
}
gettimeofday(&tEnd, NULL);
time = (tEnd.tv_sec - tStart.tv_sec);
time *= 1000000;
time += tEnd.tv_usec - tStart.tv_usec;
time *= 100000
time /= (NB_ENTRIES * NBLOOPS);
fprintf(stdout, "%d %3lld.%02lld\n", NB_ENTRIES, time / 100, time % 100);
I have an outer loop that makes NB_ENTRIES vary from 4 to 1024.
As one can see in the figure below, when NB_ENTRIES == 256 entries, the execution time is longer.
When NB_ENTRIES == 404 I get an "out of memory" (why? micro TLBs exceeded? main TLBs exceeded? Page Tables exceeded? virtual memory for the process exceeded?)
Can someone please explain to me what is really going on from 4 to 256 entries, then from 257 to 404 entries?
EDIT 1
As suggested, I ran membench (src code); the results are below:
EDIT 2
In the following paper (page 3) they ran (I suppose) the same benchmark. But the different steps are clearly visible in their plots, which is not the case for me.
Right now, according to their results and explanations, I can only identify a few things.
- The plots confirm that the L1 cache line size is 32 bytes because, as they said,
"once the array size exceeds the size of the data cache (32KB), the reads begin to generate misses [...] an inflection point occurs when every read generates a miss".
In my case the very first inflection point appears when stride == 32 Bytes.
- The graph shows that we have a second-level (L2) cache. I think it is depicted by the yellow line (1MB == L2 size).
- Therefore the last two plots above the latter probably reflect the latency of accessing main memory (+ TLB?).
However, from this benchmark, I am not able to identify:
- The cache associativity. Normally the D-cache and I-cache are 4-way associative (Cortex-A9 TRM).
- The TLB effects. As they said,
"in most systems, a secondary increase in latency is indicative of the TLB, which caches a limited number of virtual to physical translations. [...] The absence of a rise in latency attributable to TLB indicates that [...]"
large page sizes have probably been used/implemented.
EDIT 3
This link explains the TLB effects from another membench graph. One can actually retrieve the same effects on my graph.
On a 4KB page system, as you grow your strides, while they're still < 4K, you'll enjoy less and less utilization of each page [...] you'll have to access the 2nd level TLB on each access [...]
The cortex-A9 supports 4KB pages mode.
Indeed, as one can see in my graph, latencies increase up to strides == 4K; then, when it reaches 4K,
you suddenly start benefiting again since you're actually skipping whole pages.
tl;dr -> Provide a proper MCVE.
This answer should be a comment, but it is too big to be posted as one, so I am posting it as an answer instead:
I had to fix a bunch of syntax errors (missing semicolons) and declare undefined variables.
After fixing all those problems, the code did NOTHING (the program quit even before executing the first mmap). My tip is to use curly brackets all the time; here are your first and second errors caused by NOT doing so:
// after calloc:
if(entries == NULL) perror("calloc failed"); exit(1);
// after mmap
if(entries[i] == MAP_FAILED) perror("mmap failed"); exit(1);
Both lines terminate your program regardless of the condition, because only the perror call is guarded by the if.
Here you have an endless loop (reformatted, curly brackets added, but no other change):
//Randomly access and init with random values
n = -1;
i = 0;
while (++n < NB_ENTRIES -1) {
//init with random value
entries[i]->elmt = rand() % NB_ENTRIES;
//loop till we reach the last element
while (entries[entries[i]->elmt]->elmt != -1) {
entries[i]->elmt++;
if (entries[i]->elmt == NB_ENTRIES) {
entries[i]->elmt = 0;
}
}
i = entries[i]->elmt;
}
The first iteration starts by setting entries[0]->elmt to some random value, then the inner loop increments it until it reaches LAST_ELEMENT. Then i is set to that value (i.e. LAST_ELEMENT) and the second iteration overwrites the end marker -1 with some other random value. After that, it is incremented mod NB_ENTRIES in the inner loop forever, until you hit CTRL+C.
Conclusion
If you want help, then post a Minimal, Complete, and Verifiable example and not something else.
I recently learnt about malloc() and free() and I was just wondering whether there is a way to check that all the mallocs are properly freed.
I have this code here for an implementation of a doubly linked list, and I'm unclear whether I need to go through every node and deallocate each p1 and p2, or whether doing it once counts:
struct s {
int data;
struct s *p1;
struct s *p2;
};
void freedl(struct s *p)
{
if(p->p1 != NULL)
{
printf("free %d \n", p->p1->data);
}
if(p->p2 != NULL)
{
freedl(p->p2);
}
else
{
printf("free last %d", p->data);
free(p);
}
}
int main(void) {
struct s *one, *two, *three, *four, *five, *six;
one = malloc(sizeof *one);
two = malloc(sizeof *two);
three = malloc(sizeof *three);
four = malloc(sizeof *four);
five = malloc(sizeof *five);
six = malloc(sizeof *six);
one->data = 1;
one->p1 = NULL;
one->p2 = two;
two->data = 2;
two->p1 = one;
two->p2 = three;
three->data = 3;
three->p1 = two;
three->p2 = four;
four->data = 4;
four->p1 = three;
four->p2 = five;
five->data = 5;
five->p1 = four;
five->p2 = six;
six->data = 6;
six->p1 = five;
six->p2 = NULL;
freedl(one);
return EXIT_SUCCESS;
}
and I'd just like to make sure I'm doing it right!
The answer is "yes"; exactly how hard it is depends on which OS you're on.
If you're on Linux, Mac OS X or one of the BSDs (FreeBSD, OpenBSD, ...):
(and possibly Haiku)
You can use a utility called valgrind. It is an excellent utility, designed for (among other things) exactly that --- checking for memory leaks.
The basic usage is simple:
valgrind ./my-program
It is a complex utility though, so I'd recommend checking out the valgrind manual for more advanced usage.
It actually does far more than that: it can detect many (but not all) out-of-bounds accesses and similar problems. It also includes other tools that may come in useful, such as callgrind for profiling code.
Do note, however, that valgrind will make your program run very slowly, due to the way it operates.
Any other OS (incl. Windows):
Unfortunately, there is no such utility for Windows (no free ones, anyway; and the commercial ones cost a small fortune, last I checked --- and none does quite as much as valgrind & friends can).
What you can do, however, is implement macros and check manually on exit:
#define malloc(size) chk_malloc(size, __FILE__, __LINE__)
#define free(ptr) chk_free(ptr, __FILE__, __LINE__)
// etc... for realloc, calloc
...
// at start of main():
atexit(chk_report); // report problems when program exits normally
You then have to implement chk_malloc, chk_free and so on. There can be some "leaks", however, if you do things such as setAllocator(malloc). If you're okay with losing line information, you can then try doing:
#define malloc chk_malloc // chk_malloc now only takes 1 argument
#define free chk_free
...
There are certain hacks which would allow you to keep file/line info even with this #define, but they would seriously complicate matters (it would involve basically hacking closures into C).
If you don't want to change the code in any way, you could try your luck with replacing just those functions (by replacing the stdlib with your own shim DLL), though you won't get file/line information that way. It might also be problematic if the compilation was done statically, or if the compiler has replaced it with some intrinsic (which is unlikely for malloc, but not inconceivable).
The implementation can be very simple, or it can be complex, it's up to you. I don't know of any existing implementations, but you can probably find something online.
System allocators have some way of getting statistics (such as bytes allocated) out of them (mallinfo on Linux). At the beginning of your program you should store the number of bytes allocated, and at the end you need to make sure the number is the same. If the number is different, you have a potential memory leak.
Finding that leak is another story. Tools like valgrind will help.
You can use valgrind, but it can be a little bit slow.
Or use something built into the compiler. For example, clang (and, I think, gcc 4.9 too) has LeakSanitizer:
$ cat example.c
#include <stdlib.h>
int main()
{
malloc(100);
return 0;
}
$ clang -fsanitize=leak -g example.c -fsanitize=address
$ ASAN_OPTIONS=detect_leaks=1 ./a.out
=================================================================
==9038==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 100 byte(s) in 1 object(s) allocated from:
#0 0x46c871 (/home/debian/a.out+0x46c871)
#1 0x49888c (/home/debian/a.out+0x49888c)
#2 0x7fea542e4ec4 (/lib/x86_64-linux-gnu/libc.so.6+0x21ec4)
SUMMARY: AddressSanitizer: 100 byte(s) leaked in 1 allocation(s).