I'm thinking about re-implementing the malloc(3) function (as well as free(3), realloc(3) and calloc(3)) using mmap(2) and munmap(2) syscalls (as sbrk(2) is now deprecated on some operating systems).
My strategy to allocate memory blocks on the pages returned by mmap(2) would be to store metadata right before each block of data. The metadata would consist of 3 attributes:
is_free: a char (1 byte) telling whether the block is considered free or not;
size: a size_t (4 bytes) with the size of the block in bytes;
next: a pointer (1 byte) to the next block's metadata (or to the first block of the next page if there is no more space after this block).
But as I can't use malloc to allocate a struct for them, I would simply put these 6 bytes of metadata in front of the block each time I create one:
+---------+---------+--------+------------------+---------+---------+--------+---------+---
| is_free | size | next | Block1 | is_free | size | next | Block2 | ...
+---------+---------+--------+------------------+---------+---------+--------+---------+---
| 1 byte | 4 bytes | 1 byte | n bytes | 1 byte | 4 bytes | 1 byte | m bytes | ...
+---------+---------+--------+------------------+---------+---------+--------+---------+---
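For what it's worth, here is a minimal sketch of what such a header could look like as a C struct. The names (block_meta, is_free, size, next) are only illustrative, and with real field widths (a full size_t and a full pointer) the header is larger than the 6 bytes drawn above:

#include <stddef.h>

/* Illustrative header stored immediately before each user block.
   On a 64-bit system this is typically 24 bytes once padded,
   not the 6 bytes assumed in the figure above. */
typedef struct block_meta {
    char               is_free; /* non-zero if the block is free       */
    size_t             size;    /* usable size of the block in bytes   */
    struct block_meta *next;    /* next block's metadata, or the first
                                   block of the next page              */
} block_meta;

/* The user-visible block starts right after its header. */
static void *block_data(block_meta *m) { return (void *)(m + 1); }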
The question is:
How can I be sure the user/process using my malloc won't be able to read/write the metadata of the blocks with such an architecture?
E.g.: with the previous schema, I return Block1's first byte to the user/process. If it does *(Block1 + n) = Some1ByteData, it can alter the metadata of the next block, which will cause issues in my program if I try to allocate a new block later on.
On the mmap(2) man page I read that I could give protection flags for the pages, but if I use them, then the user/process using my malloc won't be able to use the blocks I hand out. How is this achieved in the real malloc?
PS: For the moment I'm not considering a thread-safe implementation, nor looking for top-tier performance. I just want something robust and functional.
Thanks.
Related
I'm re-coding the malloc function using brk, sbrk & getpagesize()
I must follow two rules:
1)
I must align my memory on a power of 2
It means: if the call to malloc is malloc(9), I must return a block of 16 bytes (the next power of 2);
2)
I must align the break (program end data segment) on a multiple of 2 pages.
I'm thinking about the rules, and I'm wondering if my understanding is right.
Rule 1)
I just need to make the return of my malloc (so the address returned by malloc, in hex) a multiple of 2?
And for the Rule 2)
the break is the last address in the heap, if I'm not wrong;
do I need to set my break like this: (the break - the heap start) % (2 * getpagesize()) == 0?
or just: the break % (2 * getpagesize()) == 0?
Thanks
1) I must align my memory on a power of 2
…
Rule 1) I just need to make the return of my malloc (so the address returned by malloc, in hex) a multiple of 2?
For an address to be aligned on a power of 2 that is 2^p, the address must be a multiple of 2^p.
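For rule 1 as you describe it (malloc(9) returning a 16-byte block), you also need to round the requested size up to the next power of 2. A minimal sketch (the helper name next_pow2 is just illustrative):

#include <stddef.h>

/* Round n up to the next power of two, e.g. 9 -> 16, 16 -> 16.
   Sketch only: assumes n > 0 and ignores overflow. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}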
2) I must align the break (program end data segment) on a multiple of 2 pages.
…
the break is the last address in the heap, if I'm not wrong. Do I need to set my break like this: (the break - the heap start) % (2 * getpagesize()) == 0? Or just: the break % (2 * getpagesize()) == 0?
The phrase “set my break” is unclear. You need to use sbrk(0) to get the current value of the break and calculate how much you need to add to it to make it a multiple of twice the page size. That tells you where you need to start a block of memory that is aligned to a multiple of twice the page size. Then you need additional memory to contain whatever amount of data you want to put there (the amount being allocated).
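A minimal sketch of that calculation (the name align_break_to_two_pages is illustrative, and error handling is kept minimal):

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Advance the break, if necessary, so that it lands on a multiple of
   two pages, and return the aligned break address (or NULL on failure). */
static void *align_break_to_two_pages(void)
{
    size_t    align = 2 * (size_t)getpagesize();
    uintptr_t cur   = (uintptr_t)sbrk(0);            /* current break    */
    size_t    pad   = (align - cur % align) % align;

    if (pad != 0 && sbrk((intptr_t)pad) == (void *)-1)
        return NULL;                                  /* could not extend */
    return (void *)(cur + pad);
}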
I used both functions to search for queries in a very large data set. Their speed is about the same at first, but when the size gets very large, the binary search on the array is slightly faster. Is that because of caching effects? The array is stored sequentially; is the tree stored that way too?
int binary_array_search(int array[], int length, int query){
    // the array has been sorted
    int left = 0, right = length - 1;
    int mid;
    while(left <= right){
        mid = (left + right) / 2;
        if(query == array[mid]){
            return 1;
        }
        else if(query < array[mid]){
            right = mid - 1;
        }
        else{
            left = mid + 1;
        }
    }
    return 0;
}
// Search a binary search tree
int binary_tree_search(bst_t *tree, int ignore, int query){
    node_t *node = tree->root;
    while(node != NULL){
        int data = node->data;
        if(query < data){
            node = node->left;
        }
        else if(query > data){
            node = node->right;
        }
        else{
            return 1;
        }
    }
    return 0;
}
Here are some results:
LENGTH    SEARCHES    binary search array    binary search tree
  1024       10240           7.336000e-03          8.230000e-03
  2048       20480           1.478000e-02          1.727900e-02
  4096       40960           3.001100e-02          3.596800e-02
  8192       81920           6.132700e-02          7.663800e-02
 16384      163840           1.251240e-01          1.637960e-01
There are several reasons why an array may be and should be faster:
A node in the tree is at least 3 times bigger than an item in the array, due to the left and right pointers.
For example, on a 32-bit system you'll have 12 bytes instead of 4. Chances are those 12 bytes are padded to or aligned on 16 bytes. On a 64-bit system you get 8 bytes versus 24 to 32 bytes.
This means that with an array, 3 to 4 times more items can be loaded into the L1 cache (see the sizeof check below).
Nodes in the tree are allocated on the heap and could be anywhere in memory, depending on the order in which they were allocated (also, the heap can get fragmented). Creating those nodes (with new or malloc) also takes more time compared to a possible one-time allocation for the array, but that is probably not part of the speed test here.
To access a single value in the array only one read has to be done; for the tree we need two: the left or right pointer, and then the value.
When the lower levels of the search are reached, the items to compare will be close together in the array (and possibly already in the L1 cache) while they are probably spread in memory for the tree.
Most of the time arrays will be faster due to locality of reference.
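To make the size difference concrete, here is a small check you could run (the node layout is an assumption, not taken from the question's bst_t):

#include <stdio.h>

/* Assumed node layout: one int plus two child pointers. On a typical
   64-bit system this prints 4 and 24 (4-byte int, 4 bytes of padding,
   two 8-byte pointers). */
struct node { int data; struct node *left, *right; };

int main(void){
    printf("sizeof(int)         = %zu\n", sizeof(int));
    printf("sizeof(struct node) = %zu\n", sizeof(struct node));
    return 0;
}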
Is that because of caching effects?
Sure, that is the main reason. On modern CPUs, cache is transparently used to read/write data in memory.
Cache is much faster than the main memory (DRAM). Just to give you some perspective, accessing data in the Level 1 cache takes ~4 CPU cycles, while accessing DRAM on the same CPU takes ~200 CPU cycles, i.e. roughly 50 times longer.
Caches operate on small blocks called cache lines, which are usually 64 bytes long.
More info: https://en.wikipedia.org/wiki/CPU_cache
The array is stored sequentially; is the tree stored that way too?
An array is a single block of data. Each element of an array is adjacent to its neighbors, i.e.:
+-------------------------------+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------------------------------+
block of 32 bytes (8 times 4)
Each array access fetches a cache line, i.e. 64 bytes or 16 int values. So, for the array there is a quite high probability (especially at the end of the binary search) that the next access will be within the same cache line, and no memory access will be needed.
On the other hand, tree nodes are allocated one by one:
                         +----------------------------------------------------------+
+------------------+     |      +------------------+       +------------------+     |
| 0 | left | right | ----+      | 2 | left | right | <---- | 1 | left | right | <---+
+------------------+            +------------------+       +------------------+
 block 0 of 24 bytes             block 2 of 24 bytes         block 1 of 24 bytes
As we can see, to store just 3 values we used more than twice the memory needed to store 8 values in the array above. So the tree structure is more sparse, and statistically has less data per 64-byte cache line.
Also, each memory allocation returns a block in memory which might not be adjacent to the previously allocated tree nodes.
Also, the allocator aligns each memory block to at least 8 bytes (on 64-bit CPUs), so some bytes are wasted there. Not to mention that we need to store those left and right pointers in each node...
So each tree access, even at the very end of the search, needs to fetch a cache line, i.e. it is slower than the array access.
So why, then, is the array only a tad faster in the tests? It is because of the binary search. At the very beginning of the search we access the data quite randomly, and each access is quite far from the previous one. So the array layout only gets its boost towards the end of the search.
Just for fun, try comparing linear search (i.e. a basic search loop) over the array with binary search in the tree. I bet you will be surprised by the results ;)
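For reference, such a linear search could look like this (same signature style as the array function above):

// Plain linear search: O(n) comparisons, but the accesses are perfectly
// sequential, so the prefetcher and the cache work in its favour.
int linear_array_search(int array[], int length, int query){
    for(int i = 0; i < length; i++){
        if(array[i] == query){
            return 1;
        }
    }
    return 0;
}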
I'm trying to print a dynamic 2D array in C. The problem is that it prints fine the first time, but if it is printed any more times it is missing the bottom row.
This is the code I'm working with:
void display_board( int height, int width, char** gameBoard ) {
    int i;
    char *rowString;
    rowString = malloc(((width*2)+1) * sizeof( char ));
    for( i = 0; i < ((height*2) + 1); i++ ){
        rowString = *(gameBoard + i);
        printf("%s\n",rowString);
    }
    free(rowString);
}
The game being made is dots and boxes, so width and height are the number of boxes. The arrays are actually allocated as height*2+1 and width*2+1, and the board is set up to look like this if the height is 2 and the width is 4 (note that the example has all the edges filled in already; normally the edges would just be white spaces):
x-x-x-x-x
| | | | |
x-x-x-x-x
| | | | |
x-x-x-x-x
When I print this the first time it looks like that, but if I try and print it again it looks like this:
x-x-x-x-x
| | | | |
x-x-x-x-x
| | | | |
Any idea on why this is happening?
Before the loop you allocate memory and assign it to the pointer variable rowString. But inside the loop you reassign the variable to point somewhere else, losing the original pointer to your allocated memory. Later, when you try to free the memory, you free memory belonging to the "2d matrix" instead of the memory you just allocated.
All of this leads to undefined behavior as you then later try to dereference the previously free'd memory (*(gameBoard + i)).
The obvious solution here is to not reassign the pointer, but to copy the string, e.g. with strcpy. Another pretty obvious solution would be to not allocate memory at all, and not even use the rowString variable, as you don't really need it but can use the pointer *(gameBoard + i) directly in your printf call.
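That second suggestion would look roughly like this (a sketch; it assumes each row string in gameBoard is NUL-terminated):

#include <stdio.h>

void display_board( int height, int width, char** gameBoard ) {
    int i;
    (void)width;                          /* not needed for printing */
    for( i = 0; i < ((height*2) + 1); i++ ){
        printf("%s\n", *(gameBoard + i)); /* print each row directly */
    }
}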
Assume I have used ptr = malloc(old_size); to allocate a memory block of old_size bytes. Only the first header_size bytes are meaningful. I'm going to increase the size to new_size.
new_size is greater than old_size and old_size is greater than header_size.
before:
/- - - - - - - old_size - - - - - - - \
+===============+---------------------+
\-header_size-/
after:
/- - - - - - - - - - - - - - - new_size - - - - - - - - - - - - - - - - - - -\
+===============+------------------------------------------------------------+
\- header_size-/
I don't care what is stored after ptr + header_size because I'll read some data into that area anyway.
method 1: go straight to new_size
ptr = realloc(ptr, new_size);
method 2: shrink to header_size and grow to new_size
ptr = realloc(ptr, header_size);
ptr = realloc(ptr, new_size);
method 3: allocate a new memory block and copy the first header_size bytes
void *newptr = malloc(new_size);
memcpy(newptr, ptr, header_size);
free(ptr);
ptr = newptr;
Which is faster?
Neither malloc (for the whole block) nor realloc (for the space beyond the size of the old block when increasing the size) guarantees what the memory you receive will contain, so, if you want those excess bytes set to zero (for example), you'll have to do it yourself with something like:
// ptr contains current block.
void *saveptr = ptr;
ptr = realloc (ptr, new_size);
if (ptr == NULL) {
    // do something intelligent like recover saveptr and exit.
}
memset ((char *)ptr + header_size, 0, new_size - header_size);
However, since you've stated that you don't care about the content beyond the header, the fastest is almost certainly a single realloc since that's likely to be optimised under the covers.
Calling it twice for contraction and expansion, or calling malloc-new/memcpy/free-old, is very unlikely to be as efficient. Though, as with all optimisations, you should measure, don't guess!
Keep in mind that realloc doesn't necessarily have to copy your memory at all. If the expansion can be done in place, then an intelligent heap manager will just increase the size of the block without copying anything, such as:
+-----------+  ^        +-----------+ <- At same address,
| Old block |  |  Need  | New block |    no copying
|           |  |  this  |           |    involved.
+-----------+  |  much  |           |
|   Free    |  |  now.  |           |
|           |  v        +-----------+
|           |           |   Free    |
|           |           |           |
+-----------+           +-----------+
It almost certainly depends on the values of old_size, new_size and header_size, and also it depends on the implementation. You'd have to pick some values and measure.
1) is probably best in the case where header_size == old_size-1 && old_size == new_size-1, since it gives you the best chance of the single realloc being basically a no-op. (2) should be only very slightly slower in that case (2 almost-no-ops being marginally slower than 1).
3) is probably best in the case where header_size == 1 && old_size == 1024*1024 && new_size == 2048*1024, because the realloc would have to move the allocation, but you avoid copying 1MB of data you don't care about. (2) should be only very slightly slower in that case.
2) is probably best when header_size is much smaller than old_size, and new_size is in a range where it's reasonably likely that the realloc will relocate, but also reasonably likely that it won't. Then you can't predict which of (1) and (3) it is that will be very slightly faster than (2).
In analyzing (2), I have assumed that realloc downwards is approximately free and returns the same pointer. This is not guaranteed. I can think of two things that can mess you up:
realloc downwards copies to a new allocation
realloc downwards splits the buffer to create a new chunk of free memory, but then when you realloc back up again the allocator doesn't merge that new free chunk straight back onto your buffer again in order to return without copying.
Either of those could make (2) significantly more expensive than (1). So it's an implementation detail whether or not (2) is a good way of hedging your bets between the advantages of (1) (sometimes avoids copying anything) and the advantages of (3) (sometimes avoids copying too much).
Btw, this kind of idle speculation about performance is better used to tentatively explain observations you have already made than to predict the observations you would make in the unlikely event that you actually cared enough about performance to test it.
Furthermore, I suspect that for large allocations, the implementation might be able to do even a relocating realloc without copying anything, by re-mapping the memory to a new address. In which case they would all be fast. I haven't looked into whether implementations actually do that, though.
That probably depends on what the sizes are and if copying is needed.
Method 1 will copy everything contained in the old block - but if you don't do that too often, you won't notice.
Method 2 will only copy what you need to keep, as you discard everything else beforehand.
Method 3 will copy unconditionally, while the others only copy if the memory block cannot be resized where it is.
Personally, I would prefer method 2 if you do this quite often, or method 1 if you do it more seldom. Either way, I would profile which of these actually is faster.
Given the code:
typedef int array[4][4];   /* assumed: the matrices are 4x4 arrays of int */

void transpose2(array dst, array src)
{
    int i, j;
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            dst[i][j] = src[j][i];
        }
    }
}
Assumptions :
int is 4 bytes
src array starts at address 0 , dst starts at address 64
the size of the cache is 32 bytes , at the beginning the cache is empty
Assuming that I have a cache of 32 bytes, with write-through, write-allocate & LRU, using a 2-way set-associative organisation, where each block is 8 bytes:
When I read from memory, how many bytes do I take from memory in each iteration?
Is it 4 or 8?
What I'm quite sure about is that the cache has 4 cells, or rows, and each row holds 8 bytes. Is this correct?
What is a little confusing is the 2-way part. I think that each way has 4 bytes, right? Please correct me if I'm wrong...
Then, when I "take" a block from memory, I just don't understand exactly how many bytes that is!?
Thanks in advance
Ron
The cache way (aka its associativity) does not affect the amount of data that's transferred when a transfer occurs; the block size is the block size.
Associativity is simply a measure how many possible locations there are in the cache that a given block from memory could be stored. So:
For a direct-mapped cache (associativity=1), memory address xyz will always map to the same cache location.
For a two-way cache, xyz could map to either of two cache locations.
For a fully-associative cache, xyz could map to anywhere in cache.
I'm really not saying anything here which isn't already explained at e.g. Wikipedia: http://en.wikipedia.org/wiki/CPU_cache#Associativity.
When the CPU references (loads or stores) a word from a block that is not in the cache, that block is fetched from memory. So, with the parameters supplied, every cache miss involves an 8-byte transfer from memory to the cache.
Regarding terminology: your cache has 4 entries, containers or cache lines (32 bytes / 8 bytes per block). As it is 2-way set associative, there are 2 sets of 2 entries. Blocks with even block addresses map to set 0, while blocks with odd block addresses map to set 1.
Block addresses are obtained by shifting the byte address right by log2(block_size) bits (3 bits for your cache).
For example:
address 64 belongs to block 8
address 72 belongs to block 9
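A quick sketch of that calculation, with the block size and number of sets taken from your cache parameters:

#include <stdio.h>

int main(void)
{
    const unsigned block_size = 8;    /* bytes per block                 */
    const unsigned num_sets   = 2;    /* 4 lines, 2-way => 2 sets        */
    unsigned addrs[] = { 64, 72 };

    for (int i = 0; i < 2; i++) {
        unsigned block = addrs[i] / block_size;   /* address >> 3        */
        unsigned set   = block % num_sets;        /* even -> 0, odd -> 1 */
        printf("address %u -> block %u -> set %u\n", addrs[i], block, set);
    }
    return 0;
}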