Memory compactor in C?

I have a program that simulates best-fit memory management.
Basically, as long as the holes list contains a hole large enough for the process we are trying to allocate, the process is allocated and added to the process list. Eventually, however, the holes become very fragmented and we need to perform compaction.
The easiest way to do this is obviously to create a new list and add all the processes in sequential order. However, that is not very realistic, since in the real world you wouldn't have spare space to move things to while building a new list.
Can you think of a way to push all the processes to one end of memory and the free space to the other? The setup is an array of holes (structs that contain a starting index and a size) and an array of processes (structs that contain a process id, a starting index and a size).

You could shuffle the allocated memory to the end of available memory, one block after another.
Pseudocode:
sort procs by start_index (descending)
avail_end = END_OF_MEM (adjust for alignment)
for p in procs
    avail_end = avail_end - p.size
    memmove( avail_end, p.start_index, p.size )
    p.start_index = avail_end
This should lead to one free memory block at the beginning of available memory. You could also stop this function after one region has been moved (once a time threshold has been reached) and continue later, testing whether there is a gap between subsequent allocated process regions so that unnecessary moves are skipped.
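The pseudocode above could be sketched in C roughly as follows. This is only a sketch under assumptions: memory is simulated as a byte array, and the proc_t layout and the name compact_to_end are made up for illustration, not taken from the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical process record; field names are assumptions. */
typedef struct {
    int    pid;
    size_t start;   /* offset of the block inside the simulated memory */
    size_t size;    /* block length in bytes */
} proc_t;

/* qsort comparator: highest start offset first. */
static int by_start_desc(const void *a, const void *b)
{
    const proc_t *pa = a, *pb = b;
    return (pa->start < pb->start) - (pa->start > pb->start);
}

/* Slide every block towards the end of mem[0..mem_size).
 * Returns the size of the single hole left at offset 0. */
size_t compact_to_end(unsigned char *mem, size_t mem_size,
                      proc_t *procs, size_t n)
{
    size_t avail_end = mem_size;

    qsort(procs, n, sizeof *procs, by_start_desc);
    for (size_t i = 0; i < n; i++) {
        assert(avail_end >= procs[i].size);
        avail_end -= procs[i].size;
        if (procs[i].start != avail_end) {        /* skip blocks already in place */
            memmove(mem + avail_end, mem + procs[i].start, procs[i].size);
            procs[i].start = avail_end;
        }
    }
    return avail_end;
}
```

Because the blocks are visited from the highest address down, each destination lies at or above its source and above every block that has not been moved yet, so nothing still needed gets overwritten.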

If the array of holes is sorted on starting index, you could iterate over the array, take two neighbouring holes, and move the memory chunk in between to the start index of the first hole.
Further explanation:
Each hole has a starting index and an ending index equal to start_index + size.
By comparing the ending index of the first hole to the starting index of the second, you get the size of the memory chunk in between. Then you can do a memmove of the memory chunk to the first starting index. After the move, the two holes form one larger hole that starts right after the moved chunk, and you repeat with the next hole.
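A minimal C sketch of this hole-driven variant, assuming memory is a byte array and the holes are sorted, non-overlapping and non-adjacent. The type names hole_t and procrec_t and the function compact_to_start are hypothetical; the linear scan that fixes up process start indices is just the simplest thing that works, not a tuned solution.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { size_t start, size; } hole_t;               /* assumed layout */
typedef struct { int pid; size_t start, size; } procrec_t;   /* assumed layout */

/* Move the chunk lying between the end of hole *h and chunk_end down to
 * h->start, then slide the hole up behind it. */
static void shift_chunk(unsigned char *mem, hole_t *h, size_t chunk_end,
                        procrec_t *procs, size_t np)
{
    size_t chunk_start = h->start + h->size;
    size_t len = chunk_end - chunk_start;

    memmove(mem + h->start, mem + chunk_start, len);
    for (size_t j = 0; j < np; j++)             /* fix up the moved processes */
        if (procs[j].start >= chunk_start && procs[j].start < chunk_end)
            procs[j].start -= h->size;
    h->start += len;
}

/* Returns the single hole that remains at the end of memory. */
hole_t compact_to_start(unsigned char *mem, size_t mem_size,
                        const hole_t *holes, size_t nh,
                        procrec_t *procs, size_t np)
{
    assert(nh > 0);
    hole_t h = holes[0];

    for (size_t i = 1; i < nh; i++) {
        shift_chunk(mem, &h, holes[i].start, procs, np);
        h.size += holes[i].size;                /* absorb the next hole */
    }
    shift_chunk(mem, &h, mem_size, procs, np);  /* chunk after the last hole */
    return h;
}
```

Each call moves one chunk, so the loop can be interrupted between chunks and resumed later, as long as the merged hole is written back to the holes list.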

When freeing memory (and creating a hole), do you already merge it with adjacent holes? If not, doing so is a good way to lessen fragmentation.
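Merging on free could look roughly like this in C. The sketch assumes the holes are kept in an array sorted by start index with room for one more entry; hole_rec and free_block are illustrative names, not part of the original program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { size_t start, size; } hole_rec;   /* assumed layout */

/* Return the block [start, start + size) to the sorted hole list,
 * coalescing with the hole before and/or after it when they touch.
 * The array must have capacity for at least *n + 1 entries. */
void free_block(hole_rec *holes, size_t *n, size_t start, size_t size)
{
    assert(size > 0);

    size_t i = 0;
    while (i < *n && holes[i].start < start)
        i++;                                    /* i = insertion point */

    int merge_prev = i > 0 && holes[i - 1].start + holes[i - 1].size == start;
    int merge_next = i < *n && start + size == holes[i].start;

    if (merge_prev && merge_next) {             /* fills the gap between two holes */
        holes[i - 1].size += size + holes[i].size;
        memmove(&holes[i], &holes[i + 1], (*n - i - 1) * sizeof *holes);
        (*n)--;
    } else if (merge_prev) {
        holes[i - 1].size += size;
    } else if (merge_next) {
        holes[i].start = start;
        holes[i].size += size;
    } else {                                    /* isolated: insert a new hole */
        memmove(&holes[i + 1], &holes[i], (*n - i) * sizeof *holes);
        holes[i] = (hole_rec){ start, size };
        (*n)++;
    }
}
```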

Related

Handle memory properly with a pool of structs

I have a program with three pools of structs. For each of them I use a list of used structs and another one for the unused structs. During execution the program consumes structs and returns them to the pool on demand. There is also a garbage collector that cleans up the "zombie" structs and returns them to the pool.
At the beginning of execution, the virtual memory, as expected, shows around 10GB* of memory allocated, and as the program uses the pool, the RSS memory increases.
Although the used nodes go back into the pool, marked as unused, the RSS memory does not decrease. I expect this: the OS doesn't know what I'm doing with the memory, so it cannot tell whether I'm really using it or just managing a pool.
What I would like to do is force the unused memory back to virtual memory whenever I want, for example when the RSS memory rises above X GB.
Is there any way, given a pointer, to mark a memory area so that it is put back in virtual memory? I know this is the operating system's responsibility, but maybe there is a way to force it.
Maybe I shouldn't care about this, what do you think?
Thanks in advance.
Note 1: This program is used in High Performance Computing, that's why it's using this amount of memory.
I provide a picture of the pool usage vs the memory usage for a few files. As you can see, the sudden drops in pool usage are due to the garbage collector; what I would like to see is this drop reflected in the memory usage.
You can do this as long as you are allocating your memory via mmap and not via malloc. You want to use the madvise function with the MADV_DONTNEED argument (POSIX_MADV_DONTNEED is the corresponding constant for posix_madvise).
Just remember to run madvise with MADV_WILLNEED (POSIX_MADV_WILLNEED for posix_madvise) before using them again to ensure there is actually memory behind them.
This does not actually guarantee the pages will be swapped out but gives the kernel a strong hint to do so when it has time.
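A minimal sketch of that idea in C, assuming Linux/glibc and a pool obtained from an anonymous private mapping. The helper names pool_reserve and pool_release_unused are hypothetical. Note that on Linux, MADV_DONTNEED on such a mapping discards the page contents, and the next access sees zero-filled pages, so this is only suitable for ranges whose contents (including any free-list links stored inside them) are no longer needed.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS and MADV_* on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve a page-aligned pool with mmap; returns NULL on failure. */
void *pool_reserve(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Tell the kernel that [addr, addr + len) is unused. madvise works on
 * whole pages, so only the pages lying fully inside the range are given up.
 * Returns 0 on success (or if no whole page fits), -1 on error. */
int pool_release_unused(void *addr, size_t len)
{
    assert(addr != NULL);
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t lo = ((uintptr_t)addr + page - 1) & ~(page - 1);  /* round up   */
    uintptr_t hi = ((uintptr_t)addr + len) & ~(page - 1);       /* round down */

    if (hi <= lo)
        return 0;
    return madvise((void *)lo, hi - lo, MADV_DONTNEED);
}
```

A pool manager could call pool_release_unused on runs of unused structs once RSS crosses its threshold; structs smaller than a page only help when enough neighbouring ones are free to cover whole pages.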
Git 2.19 (Q3 2018) offers an example of a memory pool of structs, using mmap, not malloc.
For a large tree, the index needs to hold many cache entries allocated on heap.
These cache entries are now allocated out of a dedicated memory pool to amortize malloc(3) overhead.
See commit 8616a2d, commit 8e72d67, commit 0e58301, commit 158dfef, commit 8fb8e3f, commit a849735, commit 825ed4d, commit 768d796 (02 Jul 2018) by Jameson Miller (jamill).
(Merged by Junio C Hamano -- gitster -- in commit ae533c4, 02 Aug 2018)
block alloc: allocate cache entries from mem_pool
When reading large indexes from disk, a portion of the time is dominated in malloc() calls.
This can be mitigated by allocating a large block of memory and manage it ourselves via memory pools.
This change moves the cache entry allocation to be on top of memory pools.
Design:
The index_state struct will gain a notion of an associated memory_pool from which cache_entries will be allocated from.
When reading in the index from disk, we have information on the number of entries and their size, which can guide us in deciding how large our initial memory allocation should be.
When an index is discarded, the associated memory_pool will be discarded as well - so the lifetime of a cache_entry is tied to the lifetime of the index_state that it was allocated for.
In the case of a Split Index, the following rules are followed.
1st, some terminology is defined:
Terminology:
'the_index': represents the logical view of the index
'split_index': represents the "base" cache entries.
Read from the split index file.
'the_index' can reference a single split_index, as well as cache_entries from the split_index. the_index will be discarded before the split_index is.
This means that when we are allocating cache_entries in the presence of a split index, we need to allocate the entries from the split_index's memory pool.
This allows us to follow the pattern that the_index can reference cache_entries from the split_index, and that the cache_entries will not be freed while they are still being referenced.
Managing transient cache_entry structs:
Cache entries are usually allocated for an index, but this is not always the case. Cache entries are sometimes allocated because this is the type that the existing checkout_entry function works with.
Because of this, the existing code needs to handle cache entries associated with an index / memory pool, and those that only exist transiently.
Several strategies were contemplated around how to handle this.
Chosen approach:
An extra field was added to the cache_entry type to track whether the cache_entry was allocated from a memory pool or not.
This is currently an int field, as there are no more available bits in the existing
ce_flags bit field.
If / when more bits are needed, this new field can be turned into a proper bit field.
We decided tracking and iterating over known memory pool regions was
less desirable than adding an extra field to track this state.
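To make the idea of a memory pool concrete, here is a generic bump-pointer pool in C. It is not Git's mem_pool API: the names pool, pool_block, pool_alloc and pool_discard are made up for this sketch, and it obtains its large blocks with malloc for simplicity. What it shows is the pattern the commit message describes: many small allocations carved out of a few large blocks, all released together when the owner is discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct pool_block {
    struct pool_block *next;
    size_t used, cap;
    max_align_t data[];          /* payload, aligned for any object type */
} pool_block;

typedef struct {
    pool_block *head;            /* most recently added block */
    size_t block_size;           /* default payload size of a block */
} pool;

/* Hand out len bytes from the current block, adding a block if needed. */
void *pool_alloc(pool *p, size_t len)
{
    assert(p != NULL);
    const size_t a = sizeof(max_align_t);
    len = (len + a - 1) / a * a;             /* keep later objects aligned */

    if (!p->head || p->head->cap - p->head->used < len) {
        size_t cap = len > p->block_size ? len : p->block_size;
        pool_block *b = malloc(sizeof *b + cap);
        if (!b)
            return NULL;
        b->next = p->head;
        b->used = 0;
        b->cap = cap;
        p->head = b;
    }
    void *out = (unsigned char *)p->head->data + p->head->used;
    p->head->used += len;
    return out;
}

/* Free every block at once; individual objects are never freed. */
void pool_discard(pool *p)
{
    while (p->head) {
        pool_block *next = p->head->next;
        free(p->head);
        p->head = next;
    }
}
```

Tying the pool to its owner, as the commit does with index_state, means that one pool_discard call ends the lifetime of every object allocated from it.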

Re-use threads in CUDA

I have a large series of numbers in an array, about 150MB of numbers, and I need to find consecutive sequences of numbers; the sequences may be from 3 to 160 numbers long. To keep it simple, I decided that each thread should start at ThreadID = CellID.
So thread0 looks at cell0, and if the number in cell0 matches my sequence, thread0 moves on to cell1, and so on; if the number does not match, the thread is stopped. I do that for my 20000 threads.
That works fine, but I want to know how to reuse threads, because the array in which I'm looking for the series of numbers is much bigger.
Should I divide my array into smaller arrays, load them into shared memory, and loop over the smaller arrays (padding the last one if necessary)? Or should I keep the big array in global memory and have each thread start at ThreadID = cellID, then continue at cellID + 20000, and so on? Or is there a better way to go through it?
To clarify: at the moment I use 20 000 threads, one array of numbers in global memory (150MB), and a sequence of numbers in shared memory (e.g. 1,2,3,4,5), represented as an array. Thread0 starts at cell0 and checks whether cell0 in global memory is equal to cell0 in shared memory; if so, thread0 compares cell1 in global memory to cell1 in shared memory, and so on until there is a full match.
If the numbers in the two cells (global and shared memory) are not equal, that thread is simply discarded. Since most of the numbers in the global memory array will not match the first number of my sequence, I thought it was a good idea to use one thread to match Cell_N in GM to Cell_N in ShM and to overlap the threads. This technique also allows coalesced memory access the first time, since threads 0 to 19 999 access contiguous memory.
But what I would like to know is: what would be the best way to re-use the threads that have been discarded, or the threads that have finished matching, so that I can match the entire 150MB array instead of just (20000 numbers + (length of sequence - 1))?
You asked what would be "the best way to re-use the threads" that have been discarded or that have finished matching, so as to match the entire 150MB array instead of only (20000 numbers + (length of sequence - 1)).
You can re-use threads in a fashion similar to the canonical CUDA reduction sample (using the final implementation as a reference).
int idx = threadIdx.x+blockDim.x*blockIdx.x;
while (idx < DSIZE){
    perform_sequence_matching(idx);
    idx += gridDim.x*blockDim.x;
}
In this way, with an arbitrary number of threads in your grid, you can cover an arbitrary problem size (DSIZE).
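To see what each thread ends up doing, here is a CPU-side emulation of the stride loop in plain C. It is not CUDA code: tid and nthreads stand in for the thread's global index and the total number of threads in the grid, and match_at is a hypothetical stand-in for perform_sequence_matching, including the bounds check needed near the end of the array.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if seq[0..m) occurs in data[0..n) starting at idx. */
static int match_at(const int *data, size_t n,
                    const int *seq, size_t m, size_t idx)
{
    if (m > n || idx > n - m)
        return 0;                       /* sequence would run past the end */
    for (size_t k = 0; k < m; k++)
        if (data[idx + k] != seq[k])
            return 0;
    return 1;
}

/* Work done by one emulated thread: it visits tid, tid + nthreads,
 * tid + 2 * nthreads, ... and counts the matches it finds. */
size_t stride_worker(const int *data, size_t n, const int *seq, size_t m,
                     size_t tid, size_t nthreads)
{
    assert(nthreads > 0);
    size_t hits = 0;

    for (size_t idx = tid; idx < n; idx += nthreads)
        hits += (size_t)match_at(data, n, seq, m, idx);
    return hits;
}
```

Calling stride_worker once for every tid from 0 to nthreads - 1 visits every start position exactly once, whatever the ratio between the array length and the thread count.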

CUDA threads appending variable amounts of data to common array

My application takes millions of input records, each 8 bytes, and hashes each one into two or more output bins. That is, each input key K creates a small number of pairs (B1,K), (B2,K), ... The number of output bins per key is not known until the key is processed. It's usually 2 but could occasionally be 10 or more.
All those output pairs need to be eventually stored in one array since all the keys in each bin will later be processed together. How to do this efficiently?
Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow. Another obvious method would be to init a hash table as an array of pointers to some sort of storage per bin. That looks slower.
I'm thinking of pre-reserving 2 pairs per input record in a block shared array, then grabbing more space as needed (i.e., a reimplementation of the STL vector reserve operation), then having the last thread in each block copy the block shared array to global memory.
However I'm not looking forward to implementing that. Help? Thanks.
"Using an atomic increment to repeatedly reserve a pair from a global array sounds horribly slow."
You could reserve bins of a global array instead of one entry at a time. In other words, you could have a large array in which each thread starts with room for 10 output entries. If a thread overflows, it requests the next available bin from the global array. If you're worried about the speed of a single atomic counter, you could use 10 atomic counters for 10 portions of the array and distribute the accesses. If one portion gets full, find another one.
"I'm also considering processing the data twice: the 1st time just to determine the number of output records for each input record. Then allocate just enough space and finally process all the data again."
This is another valid method. The bottleneck is calculating each thread's offset into the global array once you have the number of results for each thread. I haven't figured out a reasonable parallel way to do that.
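The offset calculation described here is an exclusive prefix sum (scan) over the per-record counts. The sketch below is a sequential C version, with exclusive_scan chosen as an illustrative name; on a GPU the same result is usually obtained from a parallel scan primitive (for example, Thrust provides an exclusive_scan), which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>

/* offsets[i] = counts[0] + ... + counts[i - 1].
 * Returns the grand total, i.e. the size the output array needs. */
size_t exclusive_scan(const size_t *counts, size_t *offsets, size_t n)
{
    assert(n == 0 || (counts != NULL && offsets != NULL));
    size_t running = 0;

    for (size_t i = 0; i < n; i++) {
        offsets[i] = running;     /* where record i starts writing */
        running += counts[i];
    }
    return running;
}
```

With counts 2, 3, 0, 10 the offsets are 0, 2, 5, 5 and the total is 15, so record i writes its pairs into [offsets[i], offsets[i] + counts[i]) with no atomics in the second pass.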
The last option I can think of would be to allocate a large array, distribute it among the blocks, and use a shared atomic int (which would help with slow global atomics). If you run out of space, mark that the block didn't finish and record where it left off. On your next iteration, complete the work that hasn't been finished.
Downside of course of the distributed portions of global memory is like talonmies said... you need a gather or compaction to make the results dense.
Good luck!

Differentiating between queue full and queue empty

I am using an array with a write index and a read index to implement a straightforward FIFO queue. I do the usual MOD ArraySize when incrementing the write and read indices.
Is there a way to differentiate between the queue-full and queue-empty conditions (both give wrIndex == rdIndex) without using an additional queue count and also without wasting an array entry, i.e. without treating the queue as full when (WrIndex + 1) MOD ArraySize == ReadIndex?
I'd go with 'wasting' an array entry to detect the queue full condition, especially if you're dealing with different threads/tasks being producers and consumers. Having another flag keep track of that situation increases the locking necessary to keep things consistent and increases the likelihood of some sort of bug that introduces a race condition. This is even more true in the case where you can't use a critical section (as you mention in a comment) to ensure that things are in-sync.
You'll need at least a bit somewhere to keep track of that condition, and that probably means at least a byte. Assuming that your queue contains ints you're only saving 3 bytes of RAM and you're going to chew up several more bytes of program image (which might not be as precious, so that might not matter). If you keep a flag bit inside a byte used to store other flag bits, then you have to additionally deal with setting/testing/clearing that flag bit in a thread safe manner to ensure that the other bits don't get corrupted.
If you're queuing bytes, then you probably save nothing - you can consider the sentinel element to be the flag that you'd have to put somewhere else. But now you need no extra code to deal with the flag.
Consider carefully if you really need that extra queue item, and keep in mind that if you're queuing bytes, then the extra queue item probably isn't really extra space.
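A sketch of the "waste one entry" scheme in C, with QSIZE, queue and the q_* function names chosen here for illustration. The put side only writes wr and the get side only writes rd, which is what makes this layout attractive for one producer and one consumer; a real multi-threaded version would still need appropriate atomics or memory barriers, which are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 8                       /* array length; holds QSIZE - 1 items */

typedef struct {
    int    buf[QSIZE];
    size_t rd, wr;                    /* rd == wr means empty */
} queue;

bool q_empty(const queue *q) { return q->rd == q->wr; }
bool q_full(const queue *q)  { return (q->wr + 1) % QSIZE == q->rd; }

/* Producer side: touches only buf and wr. */
bool q_put(queue *q, int v)
{
    if (q_full(q))
        return false;
    q->buf[q->wr] = v;
    q->wr = (q->wr + 1) % QSIZE;
    return true;
}

/* Consumer side: touches only buf and rd. */
bool q_get(queue *q, int *out)
{
    assert(out != NULL);
    if (q_empty(q))
        return false;
    *out = q->buf[q->rd];
    q->rd = (q->rd + 1) % QSIZE;
    return true;
}
```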
Instead of a read and write index, you could use a read index and a queue count. From the queue count, you can easily tell if the queue is empty or full. And the write index can be computed as (read index + queue count) mod array_size.
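The read-index-plus-count variant could be sketched as follows; CQ_SIZE, cqueue and the cq_* names are illustrative. Every slot is usable, at the cost that both put and get modify count, so concurrent producers and consumers would have to synchronize on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CQ_SIZE 8                     /* all CQ_SIZE slots can hold items */

typedef struct {
    int    buf[CQ_SIZE];
    size_t rd;                        /* index of the oldest item */
    size_t count;                     /* number of queued items */
} cqueue;

bool cq_empty(const cqueue *q) { return q->count == 0; }
bool cq_full(const cqueue *q)  { return q->count == CQ_SIZE; }

bool cq_put(cqueue *q, int v)
{
    if (cq_full(q))
        return false;
    q->buf[(q->rd + q->count) % CQ_SIZE] = v;   /* derived write index */
    q->count++;
    return true;
}

bool cq_get(cqueue *q, int *out)
{
    assert(out != NULL);
    if (cq_empty(q))
        return false;
    *out = q->buf[q->rd];
    q->rd = (q->rd + 1) % CQ_SIZE;
    q->count--;
    return true;
}
```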
What's wrong with a queue count? It sounds like you're going for maximum efficiency and minimal logic, and while I would do the same, I think I'd still use a queue count variable. Otherwise, one other potential solution would be to use a linked list. Low memory usage, and removing first element would be easy, just make sure that you have pointers to the head and tail of the list.
Basically you only need a single additional bit somewhere to signal that the queue is currently empty. You can probably stash that away somewhere, e.g., in the most significant bit of one of your indices (and then mask it off with an AND in the places where you need only the actual index into your array).
But honestly, I'd go with a queue count first and only cut that if I really need that space, instead of putting up with bit fiddling.

A program to control memory for a task in C

I have been given a task that is a bit cryptic. The task is to write a program in C that handles text messages. The program should simulate a system with a small amount of memory: the system should only be able to hold X messages of at most X characters each, and every character takes 1 byte (ASCII). To manage the messages, I should build a system that is held in primary memory (to simulate a system with limited memory). When the program starts, it should allocate ONE memory area for all the message information.
This is called the metadatastructure in the task:
The memory area used for storage must be contiguous in memory in its entirety, but divided into 32-byte data blocks; the number of data blocks in the system should be limited to 512.
The task also says that I should create X data blocks, where X depends on the number of messages the system is set to contain.
I believe I need to create a structure like a ring buffer to hold every message (data block?).
This is called the bitmap for data blocks:
To keep track of which data blocks are free and which are busy, I have to implement a bitmap with 1 bit for each data block. The bit value is 0 (busy) / 1 (free). This bitmap should be used to find free data blocks when I want to add a message, and it should be kept up to date when the system deletes or creates a data block for a message.
The allocated memory for this system should be divided into 3 areas: 1 for the metadatastructure, 1 for the bitmap, and 1 for the data blocks.
I need help thinking aloud about solutions and how this can be solved in C.
Thanks
At the beginning of your program malloc a large block. The return pointer is where it starts and you know how big the block you asked for is so you know where it ends.
That's your memory store.
Write an allocator and de-allocator that use the store (and only the store) and call them from the rest of your program instead of calling malloc and free...
This task can also be done with a whopping big array and using array offsets as pointer equivalents, but that would be silly in C. I only mention it because I used one constructed that way in Fortran for years in a major piece of particle physics software called PAW.
Concerning the bitmap:
Your allocator must know at all times which parts of the store are in use and which are not. That's the only way it can reliably give you a currently unused block, right? Maintaining a bitmap is one way to do that.
Why is this good? Imagine that you've been using this memory for a while. Many objects have been allocated in that space and some have been freed. The available space is no longer contiguous, but instead is rather patchy.
Suddenly you need to allocate a big object.
Where do you find a large chunk of continuous free memory to put it in? Scanning the bitmap will be faster than walking a complicated data structure.
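A minimal sketch of such a bitmap allocator in C, using the numbers from the question (512 blocks of 32 bytes, bit value 1 = free, 0 = busy). The store_t layout and the store_* function names are assumptions made for illustration, and the metadata area the task asks for is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK_SIZE = 32, NUM_BLOCKS = 512 };

typedef struct {
    unsigned char bitmap[NUM_BLOCKS / 8];         /* 1 bit per block, 1 = free */
    unsigned char data[NUM_BLOCKS * BLOCK_SIZE];  /* the data blocks */
} store_t;

static int bit_get(const store_t *s, size_t i)
{
    return (s->bitmap[i / 8] >> (i % 8)) & 1;
}

static void bit_set(store_t *s, size_t i, int v)
{
    if (v) s->bitmap[i / 8] |= (unsigned char)(1u << (i % 8));
    else   s->bitmap[i / 8] &= (unsigned char)~(1u << (i % 8));
}

/* Mark every block as free. */
void store_init(store_t *s) { memset(s->bitmap, 0xFF, sizeof s->bitmap); }

/* First fit: find `count` consecutive free blocks, mark them busy and
 * return the index of the first one, or -1 if no such run exists. */
long store_alloc(store_t *s, size_t count)
{
    assert(count > 0);
    size_t run = 0;

    for (size_t i = 0; i < NUM_BLOCKS; i++) {
        run = bit_get(s, i) ? run + 1 : 0;
        if (run == count) {
            size_t first = i + 1 - count;
            for (size_t j = first; j <= i; j++)
                bit_set(s, j, 0);
            return (long)first;
        }
    }
    return -1;
}

/* Mark blocks [first, first + count) as free again. */
void store_free(store_t *s, size_t first, size_t count)
{
    assert(first + count <= NUM_BLOCKS);
    for (size_t j = first; j < first + count; j++)
        bit_set(s, j, 1);
}
```

A message stored in block index b lives at s->data + b * BLOCK_SIZE, so the block index plays the role of the pointer that the rest of the program keeps.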
