Using UNIX pipelines with C

I am required to create 6 threads to perform a task (increment/decrement a number) concurrently until the integer becomes 0. I am supposed to be using only UNIX commands (Pipelines to be specific) and I can't get my head around how pipelines work, or how I can implement this program.
This integer can be stored in a text file.
I would really appreciate it if anyone could explain how to implement this program.

The book is right: pipes can be used to protect critical sections, although how to do so is non-obvious.
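#include <stdlib.h>
#include <unistd.h>

/* Take one unit: blocks until a byte (token) can be read from the pipe. */
void pipe_wait(int *sem)
{
    char x;
    read(sem[0], &x, 1);
}

/* Give back one unit: put one token into the pipe. */
void pipe_release(int *sem)
{
    char x = 0;
    write(sem[1], &x, 1);
}

/* Create a semaphore backed by a pipe that starts with initial_count tokens. */
int *make_pipe_semaphore(int initial_count)
{
    int *ptr = malloc(2 * sizeof(int));
    if (ptr == NULL)
        return NULL;
    if (pipe(ptr)) {
        free(ptr);
        return NULL;
    }
    while (initial_count-- > 0)
        pipe_release(ptr);
    return ptr;
}

void free_pipe_semaphore(int *sem)
{
    close(sem[0]);
    close(sem[1]);
    free(sem);
}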
The maximum number of free resources the semaphore can hold is limited by the pipe's capacity; it varies from OS to OS but is usually at least 4096. This doesn't matter for protecting a critical section, where the initial and maximum values are both 1.
Usage:
/* Initialization section */
int *sem = make_pipe_semaphore(1);
/* critical worker */
{
pipe_wait(sem);
/* do work */
/* end critical section */
pipe_release(sem);
}
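To connect this back to the question, here is a minimal sketch of six threads counting a shared integer down to 0, with a pipe holding a single token as the lock. It assumes POSIX threads are available and keeps the integer in memory rather than in a text file; the names run_countdown, token_acquire, token_release and NUM_WORKERS are made up for this illustration.

```c
#include <pthread.h>
#include <unistd.h>

#define NUM_WORKERS 6

static int sem_fd[2];   /* sem_fd[0] = read end, sem_fd[1] = write end */
static int counter;     /* shared integer; only touched while holding the token */

/* Block until the single token can be read from the pipe. */
static void token_acquire(void)
{
    char t;
    while (read(sem_fd[0], &t, 1) != 1)
        ;   /* retry if interrupted; a real program should also handle hard errors */
}

/* Put the token back so another thread can enter the critical section. */
static void token_release(void)
{
    char t = 0;
    while (write(sem_fd[1], &t, 1) != 1)
        ;
}

static void *worker(void *arg)
{
    long *my_decrements = arg;
    for (;;) {
        int done;
        token_acquire();
        done = (counter == 0);
        if (!done) {
            counter--;
            (*my_decrements)++;
        }
        token_release();
        if (done)
            break;
    }
    return NULL;
}

/* Count down from start with NUM_WORKERS threads.
 * Returns the total number of decrements performed, or -1 on error. */
long run_countdown(int start)
{
    pthread_t tid[NUM_WORKERS];
    long per_worker[NUM_WORKERS] = {0};
    long total = 0;
    int i, started = 0;

    if (start < 0 || pipe(sem_fd) != 0)
        return -1;
    counter = start;
    token_release();    /* one token in the pipe: a binary semaphore */

    for (i = 0; i < NUM_WORKERS; i++) {
        if (pthread_create(&tid[i], NULL, worker, &per_worker[i]) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += per_worker[i];
    }
    close(sem_fd[0]);
    close(sem_fd[1]);
    return started > 0 ? total : -1;
}
```

Calling run_countdown(1000) should return 1000 however the decrements end up distributed among the workers, because no two threads can hold the token at the same time.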


understand membarrier function in linux

Example of using membarrier function from linux manual: https://man7.org/linux/man-pages/man2/membarrier.2.html
#include <stdlib.h>
static volatile int a, b;
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("mfence" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
asm volatile ("mfence" : : : "memory");
*read_a = a;
}
int
main(int argc, char **argv)
{
int read_a, read_b;
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
The code above transformed to use membarrier() becomes:
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>
static volatile int a, b;
static int
membarrier(int cmd, unsigned int flags, int cpu_id)
{
return syscall(__NR_membarrier, cmd, flags, cpu_id);
}
static int
init_membarrier(void)
{
int ret;
/* Check that membarrier() is supported. */
ret = membarrier(MEMBARRIER_CMD_QUERY, 0, 0);
if (ret < 0) {
perror("membarrier");
return -1;
}
if (!(ret & MEMBARRIER_CMD_GLOBAL)) {
fprintf(stderr,
"membarrier does not support MEMBARRIER_CMD_GLOBAL\n");
return -1;
}
return 0;
}
static void
fast_path(int *read_b)
{
a = 1;
asm volatile ("" : : : "memory");
*read_b = b;
}
static void
slow_path(int *read_a)
{
b = 1;
membarrier(MEMBARRIER_CMD_GLOBAL, 0, 0);
*read_a = a;
}
int
main(int argc, char **argv)
{
int read_a, read_b;
if (init_membarrier())
exit(EXIT_FAILURE);
/*
* Real applications would call fast_path() and slow_path()
* from different threads. Call those from main() to keep
* this example short.
*/
slow_path(&read_a);
fast_path(&read_b);
/*
* read_b == 0 implies read_a == 1 and
* read_a == 0 implies read_b == 1.
*/
if (read_b == 0 && read_a == 0)
abort();
exit(EXIT_SUCCESS);
}
This "membarrier" description is taken from the Linux manual. I am still confused about how the membarrier() function adds overhead to the slow side and removes overhead from the fast side, thus resulting in an overall performance increase as long as the slow side is infrequent enough that the overhead of the membarrier() calls does not outweigh the performance gain on the fast side.
Could you please help me by describing it in more detail?
Thanks!
This pair of writes-then-read-the-other-var is the same pattern as in Jeff Preshing's "Memory Reordering Caught in the Act" (https://preshing.com/20120515/memory-reordering-caught-in-the-act/), a demo of StoreLoad reordering (the only kind x86 allows, given its program-order + store buffer with store forwarding memory model).
With only one local MFENCE you could still get reordering:
FAST                                                 | SLOW (using just mfence, not membarrier)
a = 1 exec                                           |
read_b = b; // 0                                     |
                                                     | b = 1;
                                                     | mfence (force b=1 to be visible before reading a)
                                                     | read_a = a; // 0
a = 1 visible (global vis. delayed by store buffer)  |
But consider what would happen if an mfence-on-every-core had to be part of every possible order, between the slow-path's store and its reload.
This ordering would no longer be possible. If read_b=b has already read a 0, then a=1 is already pending (see footnote 1), if it isn't visible already. It's impossible for it to stay private until after read_a = a because membarrier() makes sure a full barrier runs on every core, and SLOW waits for that to happen (membarrier to return) before reading a.
And there's no way to get 0,0 from having SLOW execute first; it runs membarrier itself so its store is definitely visible to other threads before it reads a.
footnote 1: Waiting to execute, or waiting in the store buffer to commit to L1d cache. The asm("":::"memory") ensures that, but is actually redundant because volatile itself guarantees that the accesses happen in asm in program order. And we basically need volatile for other reasons when hand-rolling atomics instead of using C11 _Atomic. (But generally don't do that unless you're actually writing kernel code. Use atomic_store_explicit(&a, 1, memory_order_release);).
Note it's actually the store buffer that creates StoreLoad reordering (the only kind x86 allows), not so much OoO exec. In fact, a store buffer also lets x86 execute stores out-of-order and then make them globally visible in program order (if it turns out they weren't the result of mis-speculation or something!).
Also note that in-order CPUs can do their memory accesses out of order. They start instructions (including loads) in order, but can let them complete out of order, e.g. by scoreboarding loads to allow hit-under-miss. See also "How is load->store reordering possible with in-order commit?"
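As footnote 1 suggests, outside kernel code the mfence variant is usually better written with C11 atomics than with volatile plus inline asm. The following is a minimal sketch of that, assuming a C11 compiler with <stdatomic.h>; it corresponds to the first (mfence on both sides) listing, not to the membarrier() version, and the function names are simply reused from the man page example.

```c
#include <stdatomic.h>

static atomic_int a, b;

/* Store to a, full barrier, then load b. */
void fast_path(int *read_b)
{
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* blocks StoreLoad reordering */
    *read_b = atomic_load_explicit(&b, memory_order_relaxed);
}

/* Store to b, full barrier, then load a. */
void slow_path(int *read_a)
{
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    *read_a = atomic_load_explicit(&a, memory_order_relaxed);
}
```

With a sequentially consistent fence on both sides, the outcome read_a == 0 && read_b == 0 is ruled out even when the two functions run in different threads. The membarrier() version keeps that guarantee while reducing the fast side's fence to a compiler-only barrier.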

What kinds of data in a device driver can be shared among processes?

In device drivers, how can we tell what data is shared among processes and what is local to a process? The Linux Device Drivers book mentions
Any time that a hardware or software resource is shared beyond a single thread of execution, and the possibility exists that one thread could encounter an inconsistent view of that resource, you must explicitly manage access to that resource.
But what kinds of software resources can be shared among threads and what kinds of data cannot be shared? I know that global variables are generally considered as shared memory but what other kinds of things need to be protected?
For example, are the struct inode and struct file objects passed to file operations like open, release, read, write, etc. considered to be shared?
In the open call inside main.c , why is dev (in the line dev = container_of(inode->i_cdev, struct scull_dev, cdev);) not protected with a lock if it points to a struct scull_dev entry in the global array scull_devices?
In scull_write, why isn't the line int quantum = dev->quantum, qset = dev->qset; locked with a semaphore since it's accessing a global variable?
/* In scull.h */
struct scull_qset {
void **data; /* pointer to an array of pointers which each point to a quantum buffer */
struct scull_qset *next;
};
struct scull_dev {
struct scull_qset *data; /* Pointer to first quantum set */
int quantum; /* the current quantum size */
int qset; /* the current array size */
unsigned long size; /* amount of data stored here */
unsigned int access_key; /* used by sculluid and scullpriv */
struct semaphore sem; /* mutual exclusion semaphore */
struct cdev cdev; /* Char device structure */
};
/* In main.c */
struct scull_dev *scull_devices; /* allocated in scull_init_module */
int scull_major = SCULL_MAJOR;
int scull_minor = 0;
int scull_nr_devs = SCULL_NR_DEVS;
int scull_quantum = SCULL_QUANTUM;
int scull_qset = SCULL_QSET;
ssize_t scull_write(struct file *filp, const char __user *buf, size_t count,
loff_t *f_pos)
{
struct scull_dev *dev = filp->private_data; /* filp->private_data assigned in scull_open */
struct scull_qset *dptr;
int quantum = dev->quantum, qset = dev->qset;
int itemsize = quantum * qset;
int item; /* item in linked list */
int s_pos; /* position in qset data array */
int q_pos; /* position in quantum */
int rest;
ssize_t retval = -ENOMEM; /* value used in "goto out" statements */
if (down_interruptible(&dev->sem))
return -ERESTARTSYS;
/* find listitem, qset index and offset in the quantum */
item = (long)*f_pos / itemsize;
rest = (long)*f_pos % itemsize;
s_pos = rest / quantum;
q_pos = rest % quantum;
/* follow the list up to the right position */
dptr = scull_follow(dev, item);
if (dptr == NULL)
goto out;
if (!dptr->data) {
dptr->data = kmalloc(qset * sizeof(char *), GFP_KERNEL);
if (!dptr->data)
goto out;
memset(dptr->data, 0, qset * sizeof(char *));
}
if (!dptr->data[s_pos]) {
dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
if (!dptr->data[s_pos])
goto out;
}
/* write only up to the end of this quantum */
if (count > quantum - q_pos)
count = quantum - q_pos;
if (copy_from_user(dptr->data[s_pos]+q_pos, buf, count)) {
retval = -EFAULT;
goto out;
}
*f_pos += count;
retval = count;
/* update the size */
if (dev->size < *f_pos)
dev->size = *f_pos;
out:
up(&dev->sem);
return retval;
}
int scull_open(struct inode *inode, struct file *filp)
{
struct scull_dev *dev; /* device information */
/* Question: Why was the lock not placed here? */
dev = container_of(inode->i_cdev, struct scull_dev, cdev);
filp->private_data = dev; /* for other methods */
/* now trim to 0 the length of the device if open was write-only */
if ( (filp->f_flags & O_ACCMODE) == O_WRONLY) {
if (down_interruptible(&dev->sem))
return -ERESTARTSYS;
scull_trim(dev); /* ignore errors */
up(&dev->sem);
}
return 0; /* success */
}
int scull_init_module(void)
{
int result, i;
dev_t dev = 0;
/* assigns major and minor numbers (left out for brevity) */
/*
* allocate the devices -- we can't have them static, as the number
* can be specified at load time
*/
scull_devices = kmalloc(scull_nr_devs * sizeof(struct scull_dev), GFP_KERNEL);
if (!scull_devices) {
result = -ENOMEM;
goto fail; /* isn't this redundant? */
}
memset(scull_devices, 0, scull_nr_devs * sizeof(struct scull_dev));
/* Initialize each device. */
for (i = 0; i < scull_nr_devs; i++) {
scull_devices[i].quantum = scull_quantum;
scull_devices[i].qset = scull_qset;
init_MUTEX(&scull_devices[i].sem);
scull_setup_cdev(&scull_devices[i], i);
}
/* some other stuff (left out for brevity) */
return 0; /* succeed */
fail:
scull_cleanup_module(); /* left out for brevity */
return result;
}
/*
* Set up the char_dev structure for this device.
*/
static void scull_setup_cdev(struct scull_dev *dev, int index)
{
int err, devno = MKDEV(scull_major, scull_minor + index);
cdev_init(&dev->cdev, &scull_fops);
dev->cdev.owner = THIS_MODULE;
dev->cdev.ops = &scull_fops; /* isn't this redundant? */
err = cdev_add (&dev->cdev, devno, 1);
/* Fail gracefully if need be */
if (err)
printk(KERN_NOTICE "Error %d adding scull%d", err, index);
}
All data in memory can be considered a "shared resource" if both threads are able to access it*. The only resource that wouldn't be shared between processors is the data in the registers, which is abstracted away in C.
There are two reasons that you would not, in practice, consider a resource to be shared (neither means that two threads could not theoretically access it; some nightmarish code could sometimes bypass them).
(1) Only one thread can/does access it. Clearly, if only one thread accesses a variable then there can be no race conditions. This is the reason local variables and single-threaded programs do not need locking mechanisms.
(2) The value is constant. You can't get different results based on order of access if the value can never change.
The program you have shown here is incomplete, so it is hard to say, but each of the variables accessed without locking must meet one of the criteria for this program to be thread safe.
There are some non-obvious ways to meet the criteria, such as if a variable is constant or limited to one thread only in a specific context.
You gave two examples of lines that were not locked. The first is
dev = container_of(inode->i_cdev, struct scull_dev, cdev);
This line does not actually change any variable other than the local pointer dev; it just computes where the struct containing cdev would be. There can be no race condition because nobody else has access to your pointers: they are only accessible within the function (this is not true of what they point to). This meets criterion (1).
The other example is
int quantum = dev->quantum, qset = dev->qset;
This one is a bit harder to judge without context, but my best guess is that dev->quantum and dev->qset are assumed never to change during the function call. This seems supported by the fact that, in the code shown, they are only assigned in scull_init_module, which should only be called once at the very beginning. I believe this fits criterion (2).
This brings up another way that you might change a shared variable without locking: if you know, for some other reason, that other threads will not try to access it until you are done (e.g. they do not exist yet).
In short, all memory is shared, but sometimes you can get away with acting like it's not.
*There are embedded systems where each processor has some amount of RAM that only it could use, but this is not the typical case.
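The two criteria can be illustrated with a small user-space analogy. This is not kernel code: it assumes POSIX threads, and demo_dev, demo_writer and demo_run are names invented for the sketch. The quantum field is written once before any thread exists and is then read without a lock, while size is modified by several threads and is therefore locked.

```c
#include <pthread.h>

#define DEMO_THREADS 4
#define DEMO_WRITES  10000

struct demo_dev {
    int quantum;            /* set once before threads start: criterion (2) */
    unsigned long size;     /* modified concurrently: must be protected */
    pthread_mutex_t lock;
};

static void *demo_writer(void *p)
{
    struct demo_dev *dev = p;
    int quantum = dev->quantum; /* no lock needed: never changes after init */
    int i;                      /* local variable: criterion (1) */

    for (i = 0; i < DEMO_WRITES; i++) {
        pthread_mutex_lock(&dev->lock);
        dev->size += (unsigned long)quantum;   /* read-modify-write on shared data */
        pthread_mutex_unlock(&dev->lock);
    }
    return NULL;
}

/* Runs the writers and returns the final size. */
unsigned long demo_run(int quantum)
{
    struct demo_dev dev;
    pthread_t tid[DEMO_THREADS];
    int i, started = 0;

    /* No other thread exists yet, so plain stores are safe here. */
    dev.quantum = quantum;
    dev.size = 0;
    pthread_mutex_init(&dev.lock, NULL);

    for (i = 0; i < DEMO_THREADS; i++) {
        if (pthread_create(&tid[i], NULL, demo_writer, &dev) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&dev.lock);
    return dev.size;
}
```

If the lock around dev->size were removed, the result would typically come out lower than quantum * DEMO_THREADS * DEMO_WRITES because concurrent updates would be lost; the unlocked read of quantum is harmless because nothing writes it after initialization.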

How to create a process that runs a routine with variable number of parameters?

I know there are lots of questions here about functions that take a variable number of arguments. I also know there's lots of docs about stdarg.h and its macros. And I also know how printf-like functions take a variable number of arguments. I already tried each of those alternatives and they didn't help me. So, please, keep that in mind before marking this question as duplicate.
I'm working on the process management features of a little embedded operating system and I'm stuck on the design of a function that can create processes that run a function with a variable number of parameters. Here's a simplified version of how I want my API to look:
// create a new process
// * function is a pointer to the routine the process will run
// * nargs is the number of arguments the routine takes
void create(void* function, uint8_t nargs, ...);
void f1();
void f2(int i);
void f3(float f, int i, const char* str);
int main()
{
create(f1, 0);
create(f2, 1, 9);
create(f3, 3, 3.14f, 9, "string");
return 0;
}
And here is a pseudocode for the relevant part of the implementation of system call create:
void create(void* function, uint8_t nargs, ...)
{
process_stack = create_stack();
first_arg = &nargs + 1;
copy_args_list_to_process_stack(process_stack, first_arg);
}
Of course I'll need to know the calling convention in order to be able to copy from create's activation record to the new process stack, but that's not the problem. The problem is how many bytes do I need to copy. Even though I know how many arguments I need to copy, I don't know how much space each of those arguments occupy. So I don't know when to stop copying.
The Xinu Operating System does something very similar to what I want to do, but I tried hard to understand the code and didn't succeed. I'll transcribe a very simplified version of Xinu's create function here. Maybe someone can understand it and help me.
pid32 create(void* procaddr, uint32 ssize, pri16 priority, char *name, int32 nargs, ...)
{
int32 i;
uint32 *a; /* points to list of args */
uint32 *saddr; /* stack address */
saddr = (uint32 *)getstk(ssize); // return a pointer to the new process's stack
*saddr = STACKMAGIC; // STACKMAGIC is just a marker to detect stack overflow
// this is the cryptic part
/* push arguments */
a = (uint32 *)(&nargs + 1); /* start of args */
a += nargs -1; /* last argument */
for ( ; nargs > 4 ; nargs--) /* machine dependent; copy args */
*--saddr = *a--; /* onto created process's stack */
*--saddr = (long)procaddr;
for(i = 11; i >= 4; i--)
*--saddr = 0;
for(i = 4; i > 0; i--) {
if(i <= nargs)
*--saddr = *a--;
else
*--saddr = 0;
}
}
I got stuck on this line: a += nargs -1;. This should move the pointer a 4*(nargs - 1) ahead in memory, right? What if an argument's size is not 4 bytes? But that is just the first question. I also didn't understand the next lines of the code.
If you are writing an operating system, you also define the calling convention(s), right? Settle on an argument size of sizeof(void*) and pad as necessary.
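A minimal sketch of that idea follows, assuming every argument is passed as exactly one uintptr_t word and that the new process's stack grows downward. The helper name push_args is hypothetical and is not part of Xinu; callers are expected to cast each argument to uintptr_t, and values that do not fit in a word (or floating-point values) would have to be passed through a pointer.

```c
#include <stdarg.h>
#include <stdint.h>

/* Copy nargs word-sized arguments onto a descending stack.
 * stack_top points one past the highest usable word.
 * Returns the new stack pointer; the first argument ends up at sp[0]. */
uintptr_t *push_args(uintptr_t *stack_top, unsigned nargs, ...)
{
    va_list ap;
    uintptr_t *sp = stack_top - nargs;
    unsigned i;

    va_start(ap, nargs);
    for (i = 0; i < nargs; i++)
        sp[i] = va_arg(ap, uintptr_t);  /* every argument has the same size */
    va_end(ap);
    return sp;
}
```

Because each argument occupies one word, the number of bytes to copy is simply nargs * sizeof(uintptr_t), which is the piece of information the question was missing. A call might look like push_args(top, 2, (uintptr_t)9, (uintptr_t)"string").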

What's a good way to implement simple clone() based multithread library?

I'm trying to build a simple multithreading library on Linux using clone() and other kernel utilities. I've come to a point where I'm not really sure what the correct way to do things is. I tried going through the original NPTL code but it's a bit too much.
For instance, this is how I imagine the create method:
typedef int sk_thr_id;
typedef void *sk_thr_arg;
typedef int (*sk_thr_func)(sk_thr_arg);
sk_thr_id sk_thr_create(sk_thr_func f, sk_thr_arg a){
void* stack;
stack = malloc( 1024*64 );
if ( stack == 0 ){
perror( "malloc: could not allocate stack" );
exit( 1 );
}
return ( clone(f, (char*) stack + FIBER_STACK, SIGCHLD | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_VM, a ) );
}
1: I'm not really sure what the correct clone() flags should be. I just found these being used in a simple example. Any general directions here will be welcome.
Here are parts of the mutex primitives created using futexes(not my own code for now):
#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))
#define cpu_relax() asm volatile("pause\n": : :"memory")
#define barrier() asm volatile("": : :"memory")
static inline unsigned xchg_32(void *ptr, unsigned x)
{
__asm__ __volatile__("xchgl %0,%1"
:"=r" ((unsigned) x)
:"m" (*(volatile unsigned *)ptr), "0" (x)
:"memory");
return x;
}
static inline unsigned short xchg_8(void *ptr, char x)
{
__asm__ __volatile__("xchgb %0,%1"
:"=r" ((char) x)
:"m" (*(volatile char *)ptr), "0" (x)
:"memory");
return x;
}
int sys_futex(void *addr1, int op, int val1, struct timespec *timeout, void *addr2, int val3)
{
return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}
typedef union mutex mutex;
union mutex
{
unsigned u;
struct
{
unsigned char locked;
unsigned char contended;
} b;
};
int mutex_init(mutex *m, const pthread_mutexattr_t *a)
{
(void) a;
m->u = 0;
return 0;
}
int mutex_lock(mutex *m)
{
int i;
/* Try to grab lock */
for (i = 0; i < 100; i++)
{
if (!xchg_8(&m->b.locked, 1)) return 0;
cpu_relax();
}
/* Have to sleep */
while (xchg_32(&m->u, 257) & 1)
{
sys_futex(m, FUTEX_WAIT_PRIVATE, 257, NULL, NULL, 0);
}
return 0;
}
int mutex_unlock(mutex *m)
{
int i;
/* Locked and not contended */
if ((m->u == 1) && (cmpxchg(&m->u, 1, 0) == 1)) return 0;
/* Unlock */
m->b.locked = 0;
barrier();
/* Spin and hope someone takes the lock */
for (i = 0; i < 200; i++)
{
if (m->b.locked) return 0;
cpu_relax();
}
/* We need to wake someone up */
m->b.contended = 0;
sys_futex(m, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
return 0;
}
2: The main question for me is how to implement the "join" primitive. I know it's supposed to be based on futexes too, but so far I'm struggling to come up with something.
3: I need some way to clean up resources (like the allocated stack) after a thread has finished. I can't really think of a good way to do this either.
For these I'll probably need an additional structure in user space for every thread, with some information saved in it. Can someone point me in a good direction for solving these issues?
4: I'd like a way to tell how much time a thread has been running, how long it's been since it was last scheduled, and other things like that. Are there kernel calls that provide such info?
Thanks in advance!
The idea that there can exist a "multithreading library" as a third-party library separate from the rest of the standard library is an outdated and flawed notion. If you want to do this, you'll have to first drop all use of the standard library; particularly, your call to malloc is completely unsafe if you're calling clone yourself, because:
malloc will have no idea that multiple threads exist, and therefore may fail to perform proper synchronization.
Even if it knew they existed, malloc will need to access an unspecified, implementation-specific structure located at the address given by the thread pointer. As this structure is implementation-specific, you have no way of creating such a structure that will be interpreted correctly by both the current and all future versions of your system's libc.
These issues don't apply just to malloc but to most of the standard library; even async-signal-safe functions may be unsafe to use, as they might dereference the thread pointer for cancellation-related purposes, performing optimal syscall mechanisms, etc.
If you really insist on making your own threads implementation, you'll have to abstain from using glibc or any modern libc that's integrated with threads, and instead opt for something much more naive like klibc. This could be an educational experiment, but it would not be appropriate for a deployed application.
1) You are using an example from LinuxThreads. I will not rewrite good references here, but I recommend "The Linux Programming Interface" by Michael Kerrisk, chapter 28. It explains what you need in 25 pages.
2) If you set the CLONE_CHILD_CLEARTID flag, then when the child terminates the kernel clears the thread ID stored at the ctid argument of clone and does a futex wake on that address. If you treat that location as a futex, you can implement the join primitive. Good luck :-) If you don't want to use futexes, also have a look at wait3 and wait4.
3) I do not know what you want to clean up, but you can use the clone tls argument. This is a thread-local storage buffer. When the thread has finished, you can clean up that buffer.
4) See getrusage.
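To make point 2 concrete, here is a rough sketch of start/join built on CLONE_CHILD_CLEARTID, assuming Linux with glibc's clone() wrapper. The names sk_thread, sk_thread_start, sk_thread_join, sk_demo_store and SK_STACK_SIZE are invented for the illustration, and the flag set is one plausible choice rather than a recommendation. The caveat from the other answer still applies: a thread created this way has no TLS of its own, so the thread function here deliberately does nothing but a plain store.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SK_STACK_SIZE (64 * 1024)   /* hypothetical fixed stack size */

typedef struct {
    volatile pid_t tid; /* set by the kernel at clone(), zeroed when the thread exits */
    void *stack;
} sk_thread;

/* Trivial thread body for demonstration: no libc calls, just a store. */
int sk_demo_store(void *arg)
{
    *(volatile int *)arg = 42;
    return 0;
}

int sk_thread_start(sk_thread *t, int (*fn)(void *), void *arg)
{
    const int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                      CLONE_THREAD | CLONE_SYSVSEM |
                      CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;

    t->stack = mmap(NULL, SK_STACK_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (t->stack == MAP_FAILED)
        return -1;

    /* parent_tid and child_tid both point at t->tid, so the word is already
     * nonzero when clone() returns and is cleared when the thread exits. */
    if (clone(fn, (char *)t->stack + SK_STACK_SIZE, flags, arg,
              (pid_t *)&t->tid, NULL, (pid_t *)&t->tid) == -1) {
        munmap(t->stack, SK_STACK_SIZE);
        return -1;
    }
    return 0;
}

int sk_thread_join(sk_thread *t)
{
    pid_t tid;

    /* Sleep until the kernel zeroes the tid word and wakes the futex. */
    while ((tid = t->tid) != 0)
        syscall(SYS_futex, &t->tid, FUTEX_WAIT, tid, NULL, NULL, 0);

    /* The thread has left user space for good, so its stack can be released. */
    return munmap(t->stack, SK_STACK_SIZE);
}
```

The per-thread structure also addresses point 3 of the question: it is the natural place to remember the stack so that join can free it. The wait uses FUTEX_WAIT rather than FUTEX_WAIT_PRIVATE because the kernel's wake-up on the cleared TID is a non-private one.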

Simple C implementation to track memory malloc/free?

programming language: C
platform: ARM
Compiler: ADS 1.2
I need to keep track of simple malloc/free calls in my project. I just need to get a very basic idea of how much heap memory is required when the program has allocated all its resources. Therefore, I have provided a wrapper for the malloc/free calls. In these wrappers I need to increment a current memory count when malloc is called and decrement it when free is called. The malloc case is straightforward as I have the size to allocate from the caller. I am wondering how to deal with the free case as I need to store the pointer/size mapping somewhere. This being C, I do not have a standard map to implement this easily.
I am trying to avoid linking in any libraries so would prefer *.c/h implementation.
So I am wondering if there already is a simple implementation one may lead me to. If not, this is motivation to go ahead and implement one.
EDIT: Purely for debugging and this code is not shipped with the product.
EDIT: Initial implementation based on answer from Makis. I would appreciate feedback on this.
EDIT: Reworked implementation
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <string.h>
#include <limits.h>
static size_t gnCurrentMemory = 0;
static size_t gnPeakMemory = 0;
void *MemAlloc (size_t nSize)
{
void *pMem = malloc(sizeof(size_t) + nSize);
if (pMem)
{
size_t *pSize = (size_t *)pMem;
memcpy(pSize, &nSize, sizeof(nSize));
gnCurrentMemory += nSize;
if (gnCurrentMemory > gnPeakMemory)
{
gnPeakMemory = gnCurrentMemory;
}
printf("PMemAlloc (%p) - Size (%lu), Current (%lu), Peak (%lu)\n",
(void *)(pSize + 1), (unsigned long)nSize, (unsigned long)gnCurrentMemory, (unsigned long)gnPeakMemory);
return(pSize + 1);
}
return NULL;
}
void MemFree (void *pMem)
{
if(pMem)
{
size_t *pSize = (size_t *)pMem;
// Get the size
--pSize;
assert(gnCurrentMemory >= *pSize);
printf("PMemFree (%p) - Size (%lu), Current (%lu), Peak (%lu)\n",
pMem, (unsigned long)*pSize, (unsigned long)gnCurrentMemory, (unsigned long)gnPeakMemory);
gnCurrentMemory -= *pSize;
free(pSize);
}
}
#define BUFFERSIZE (1024*1024)
typedef struct
{
bool flag;
int buffer[BUFFERSIZE];
bool bools[BUFFERSIZE];
} sample_buffer;
typedef struct
{
unsigned int whichbuffer;
char ch;
} buffer_info;
int main(void)
{
unsigned int i;
buffer_info *bufferinfo;
sample_buffer *mybuffer;
char *pCh;
printf("Testing MemAlloc - MemFree\n");
mybuffer = (sample_buffer *) MemAlloc(sizeof(sample_buffer));
if (mybuffer == NULL)
{
printf("ERROR ALLOCATING mybuffer\n");
return EXIT_FAILURE;
}
bufferinfo = (buffer_info *) MemAlloc(sizeof(buffer_info));
if (bufferinfo == NULL)
{
printf("ERROR ALLOCATING bufferinfo\n");
MemFree(mybuffer);
return EXIT_FAILURE;
}
pCh = (char *)MemAlloc(sizeof(char));
printf("finished malloc\n");
// fill allocated memory with integers and read back some values
for(i = 0; i < BUFFERSIZE; ++i)
{
mybuffer->buffer[i] = i;
mybuffer->bools[i] = true;
bufferinfo->whichbuffer = (unsigned int)(i/100);
}
MemFree(bufferinfo);
MemFree(mybuffer);
if(pCh)
{
MemFree(pCh);
}
return EXIT_SUCCESS;
}
You could allocate a few extra bytes in your wrapper and put either an id (if you want to be able to couple malloc() and free()) or just the size there. Just malloc() that much more memory, store the information at the beginning of your memory block and move the pointer you return that many bytes forward.
This can, btw, also easily be used for fence pointers/finger-prints and such.
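A minimal sketch of this header technique is shown below. One detail worth getting right is alignment: if the header is only sizeof(size_t) bytes, the pointer handed back to the caller may be less aligned than what malloc() guarantees. The sketch pads the header with a union of a few strictly aligned types, which is an approximation of the platform's maximum alignment; all track_* names are made up for the example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every block. The extra union members only
 * exist to make the header large and aligned enough for common types. */
typedef union {
    size_t size;
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} track_header;

static size_t track_current;    /* bytes currently allocated */
static size_t track_peak;       /* highest value track_current has reached */

void *track_malloc(size_t size)
{
    track_header *h;

    if (size > (size_t)-1 - sizeof *h)  /* size + header would overflow */
        return NULL;
    h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;

    h->size = size;
    track_current += size;
    if (track_current > track_peak)
        track_peak = track_current;
    return h + 1;               /* user memory starts right after the header */
}

void track_free(void *ptr)
{
    track_header *h;

    if (ptr == NULL)
        return;
    h = (track_header *)ptr - 1;    /* step back to the header */
    track_current -= h->size;
    free(h);
}

size_t track_current_bytes(void) { return track_current; }
size_t track_peak_bytes(void)    { return track_peak; }
```

After the program has allocated all its resources, track_peak_bytes() gives the figure the question asks for, not counting the per-block header and the allocator's own bookkeeping.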
Either you can have access to internal tables used by malloc/free (see this question: Where Do malloc() / free() Store Allocated Sizes and Addresses? for some hints), or you have to manage your own tables in your wrappers.
You could always use valgrind instead of rolling your own implementation. If you don't care about the amount of memory you allocate you could use an even simpler implementation: (I did this really quickly so there could be errors and I realize that it is not the most efficient implementation. The pAllocedStorage should be given an initial size and increase by some factor for a resize etc. but you get the idea.)
EDIT: I missed that this was for ARM, to my knowledge valgrind is not available on ARM so that might not be an option.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static size_t indexAllocedStorage = 0;
static size_t *pAllocedStorage = NULL;
static unsigned int free_calls = 0;
static unsigned long long int total_mem_alloced = 0;
void *
my_malloc(size_t size){
size_t *temp;
void *p = malloc(size);
if(p == NULL){
fprintf(stderr,"my_malloc malloc failed, %s", strerror(errno));
exit(EXIT_FAILURE);
}
total_mem_alloced += size;
temp = (size_t *)realloc(pAllocedStorage, (indexAllocedStorage+1) * sizeof(size_t));
if(temp == NULL){
fprintf(stderr,"my_malloc realloc failed, %s", strerror(errno));
exit(EXIT_FAILURE);
}
pAllocedStorage = temp;
pAllocedStorage[indexAllocedStorage++] = (size_t)p;
return p;
}
void
my_free(void *p){
size_t i;
int found = 0;
for(i = 0; i < indexAllocedStorage; i++){
if(pAllocedStorage[i] == (size_t)p){
pAllocedStorage[i] = (size_t)NULL;
found = 1;
break;
}
}
if(!found){
printf("Free Called on unknown\n");
}
free_calls++;
free(p);
}
void
free_check(void) {
size_t i;
printf("checking freed memory\n");
for(i = 0; i < indexAllocedStorage; i++){
if(pAllocedStorage[i] != (size_t)NULL){
printf( "Memory leak %p\n", (void *)pAllocedStorage[i]);
free((void *)pAllocedStorage[i]);
}
}
free(pAllocedStorage);
pAllocedStorage = NULL;
}
I would use rmalloc. It is a simple library (actually it is only two files) to debug memory usage, but it also has support for statistics. Since you already have wrapper functions, it should be very easy to use rmalloc for it. Keep in mind that you also need to replace strdup, etc.
Your program may also need to intercept realloc(), calloc(), getcwd() (as it may allocate memory when buffer is NULL in some implementations) and maybe strdup() or a similar function, if it is supported by your compiler.
If you are running on x86 you could just run your binary under valgrind and it would gather all this information for you, using the standard implementation of malloc and free. Simple.
I've been trying out some of the same techniques mentioned on this page and wound up here from a google search. I know this question is old, but wanted to add for the record...
1) Does your operating system not provide any tools to see how much heap memory is in use in a running process? I see you're talking about ARM, so this may well be the case. In most full-featured OSes, this is just a matter of using a cmd-line tool to see the heap size.
2) If available in your libc, sbrk(0) on most platforms will tell you the end address of your data segment. If you have it, all you need to do is store that address at the start of your program (say, startBrk=sbrk(0)), then at any time your allocated size is sbrk(0) - startBrk.
3) If shared objects can be used, you're dynamically linking to your libc, and your OS's runtime loader has something like an LD_PRELOAD environment variable, you might find it more useful to build your own shared object that defines the actual libc functions with the same symbols (malloc(), not MemAlloc()), then have the loader load your lib first and "interpose" the libc functions. You can further obtain the addresses of the actual libc functions with dlsym() and the RTLD_NEXT flag so you can do what you are doing above without having to recompile all your code to use your malloc/free wrappers. It is then just a runtime decision when you start your program (or any program that fits the description in the first sentence) where you set an environment variable like LD_PRELOAD=mymemdebug.so and then run it. (google for shared object interposition.. it's a great technique and one used by many debuggers/profilers)
