I used both of the functions below to search for queries in a very large data set. Their speed is about the same at first, but when the size gets very large, the binary search over the array is slightly faster. Is that because of caching effects? The array is stored sequentially in memory. Is the tree stored that way too?
int binary_array_search(int array[], int length, int query){
    // the array has been sorted
    int left = 0, right = length - 1;
    int mid;
    while(left <= right){
        mid = (left + right) / 2;
        if(query == array[mid]){
            return 1;
        }
        else if(query < array[mid]){
            right = mid - 1;
        }
        else{
            left = mid + 1;
        }
    }
    return 0;
}
// Search a binary search tree
int binary_tree_search(bst_t *tree, int ignore, int query){
    node_t *node = tree->root;
    while(node != NULL){
        int data = node->data;
        if(query < data){
            node = node->left;
        }
        else if(query > data){
            node = node->right;
        }
        else{
            return 1;
        }
    }
    return 0;
}
Here are some results:
LENGTH    SEARCHES    binary search array    binary search tree
  1024       10240       7.336000e-03           8.230000e-03
  2048       20480       1.478000e-02           1.727900e-02
  4096       40960       3.001100e-02           3.596800e-02
  8192       81920       6.132700e-02           7.663800e-02
 16384      163840       1.251240e-01           1.637960e-01
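For reference, a timing harness along these lines (hypothetical; not the code that produced the table above) is enough to reproduce the array column of such a test:

/* Hypothetical timing harness: times repeated array lookups with clock().
   Compile together with binary_array_search() from above. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int binary_array_search(int array[], int length, int query);

int main(void) {
    int length = 16384;
    int searches = length * 10;
    int *array = malloc(length * sizeof *array);
    if (array == NULL) return 1;
    for (int i = 0; i < length; i++) array[i] = 2 * i;   /* sorted data */

    clock_t start = clock();
    int hits = 0;
    for (int s = 0; s < searches; s++)
        hits += binary_array_search(array, length, rand() % (2 * length));
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("%d searches, %d hits, %e seconds\n", searches, hits, seconds);
    free(array);
    return 0;
}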
There are several reasons why an array can be, and should be, faster:
A node in the tree is at least 3 times bigger than an item in the array because of the left and right pointers.
For example, on a 32-bit system you'll have 12 bytes instead of 4, and chances are those 12 bytes are padded to, or aligned on, 16 bytes. On a 64-bit system pointers are 8 bytes, so a node ends up at 24 to 32 bytes.
This means that with an array, 3 to 4 times more items can be loaded into the L1 cache.
Nodes in the tree are allocated on the heap, and they could be anywhere in memory, depending on the order in which they were allocated (the heap can also get fragmented). Creating those nodes (with new or malloc) also takes more time than a possible one-time allocation for the array, but that is probably not part of the speed test here.
To access a single value in the array only one read has to be done; for the tree we need two: the left or right pointer and the value.
When the lower levels of the search are reached, the items to compare will be close together in the array (and possibly already in the L1 cache), while they are probably spread around in memory for the tree.
Most of the time arrays will be faster due to locality of reference.
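To check the size difference from the first point on your own machine, a quick sizeof comparison is enough (node_t here is a hypothetical stand-in with the layout implied by the tree code above):

#include <stdio.h>

/* hypothetical node layout matching the tree search code above */
typedef struct node {
    int data;
    struct node *left;
    struct node *right;
} node_t;

int main(void) {
    printf("sizeof(int)    = %zu bytes\n", sizeof(int));    /* typically 4 */
    printf("sizeof(node_t) = %zu bytes\n", sizeof(node_t)); /* typically 24 on 64-bit, due to pointers and padding */
    return 0;
}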
Is that because of caching effects?
Sure, that is the main reason. On modern CPUs, a cache is used transparently to read and write data in memory.
Cache is much faster than main memory (DRAM). Just to give you some perspective, accessing data in the Level 1 cache takes ~4 CPU cycles, while accessing DRAM on the same CPU takes ~200 CPU cycles, i.e. the cache is about 50 times faster.
Caches operate on small blocks called cache lines, which are usually 64 bytes long.
More info: https://en.wikipedia.org/wiki/CPU_cache
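If you want to see the cache line size on your own machine, glibc on Linux exposes it through sysconf(); this is a sketch assuming a glibc-based system (the constant is a glibc extension):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; other systems may not report it */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        printf("L1 data cache line size: %ld bytes\n", line);  /* commonly 64 */
    else
        printf("cache line size not reported on this system\n");
    return 0;
}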
The array is stored sequentially in memory. Is the tree stored that way too?
An array is a single block of data. Each element of an array is adjacent to its neighbors, i.e.:
+-------------------------------+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------------------------------+
block of 32 bytes (8 times 4)
Each array access fetches a whole cache line, i.e. 64 bytes or 16 int values. So for the array there is quite a high probability (especially at the end of the binary search) that the next access will fall within the same cache line, so no memory access will be needed.
On the other hand, tree nodes are allocated one by one:
+------------------------------------------------+
+------------------+ | +------------------+ +------------------+ |
| 0 | left | right | -+ | 2 | left | right | <- | 1 | left | right | <-+
+------------------+ +------------------+ +------------------+
block 0 of 24 bytes block 2 of 24 bytes block 1 of 24 bytes
As we can see, to store just 3 values we used more than twice the memory needed to store 8 values in the array above. So the tree structure is more sparse, and statistically has less data in each 64-byte cache line.
Also, each memory allocation returns a block that might not be adjacent to the previously allocated tree nodes.
Also, the allocator aligns each memory block to at least 8 bytes (on 64-bit CPUs), so some bytes are wasted there. Not to mention that we need to store those left and right pointers in each node...
So each tree access, even at the very end of the search, needs to fetch a cache line, i.e. it is slower than the array access.
So why is the array only a tad faster in the tests? It is because of how binary search works: at the very beginning of the search we access data quite randomly, and each access is quite far from the previous one. So the array layout only gets its boost towards the end of the search.
Just for fun, try to compare linear search (i.e. basic search loop) in array vs binary search in tree. I bet you will be surprised with the results ;)
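For that comparison, a minimal linear search over the same sorted array could look like this (just a sketch mirroring the signature of binary_array_search() above):

/* Linear search over a sorted array; stops early once values exceed the query. */
int linear_array_search(int array[], int length, int query) {
    for (int i = 0; i < length; i++) {
        if (array[i] == query)
            return 1;              /* found */
        if (array[i] > query)
            return 0;              /* sorted, so query cannot appear later */
    }
    return 0;                      /* not found */
}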
Related
I'm thinking about re-implementing the malloc(3) function (as well as free(3), realloc(3) and calloc(3)) using mmap(2) and munmap(2) syscalls (as sbrk(2) is now deprecated on some operating systems).
My strategy for allocating memory blocks on the page returned by mmap(2) would be to store the metadata right before the block of data. The metadata would consist of 3 attributes:
is_free: a char (1 byte) telling whether the block is considered free or not;
size: a size_t (4 bytes) with the size of the block in bytes;
next: a pointer (1 byte) to the next block's metadata (or to the next page's first block if there's no more space after the block).
But since I can't use malloc to allocate a struct for them, I would simply put 6 bytes of metadata in front of the block each time I create one:
+---------+---------+--------+------------------+---------+---------+--------+---------+---
| is_free | size | next | Block1 | is_free | size | next | Block2 | ...
+---------+---------+--------+------------------+---------+---------+--------+---------+---
| 1 byte | 4 bytes | 1 byte | n bytes | 1 byte | 4 bytes | 1 byte | m bytes | ...
+---------+---------+--------+------------------+---------+---------+--------+---------+---
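For illustration only, a header struct along those lines could be declared as in the sketch below; note that a real pointer member occupies 4 or 8 bytes (not 1), so the actual header would be larger than the 6 bytes shown in the diagram:

#include <stddef.h>

/* Hypothetical per-block header, stored just before the user data.
   Field names follow the description above; real sizes are platform-dependent. */
typedef struct block_meta {
    unsigned char      is_free;  /* block free (1) or in use (0)                  */
    size_t             size;     /* usable size of the block, in bytes            */
    struct block_meta *next;     /* next block's metadata (or next page's first)  */
} block_meta_t;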
The question is:
How can I be sure the user/process using my malloc won't be able to read/write the metadata of the blocks with such a layout?
E.g.: with the previous schema, I return Block1's first byte to the user/process. If they do *(Block1 + n) = Some1ByteData, they can alter the metadata of the next block, which will cause issues in my program when I try to allocate a new block later on.
On the mmap(2) man page I read that I could set protection flags for the pages, but if I use them, then the user/process using my malloc won't be able to use the blocks I hand out. How is this achieved in the real malloc?
PS: for the moment I'm not considering a thread-safe implementation, nor am I looking for top-tier performance. I just want something robust and functional.
Thanks.
I use lists and arrays very often, and I am wondering which is faster: an array or a list?
Let's say we have an array of integers and a linked list, both holding the same values.
int array_data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
typedef struct node_data{
int data;
struct node_data * next;
} list_data;
                  ____    ____    ____    ____    ____    ____    ____    ____    ____    ____
list_data *d ---> | 1| -> | 2| -> | 3| -> | 4| -> | 5| -> | 6| -> | 7| -> | 8| -> | 9| -> |10| -> NULL
If I want to get the value of array_data at index 6, I'll use array_data[6] and get the value 7. But what if I want the same data from the list? Should I start at the head and count hops until I reach the requested index, as in get_data(d, 6)? Or is there a better way to do this?
int get_data(list_data *l, int index){
    int i = 0;
    while(l != NULL){
        if(index == i){
            return l->data;
        }
        i++;
        l = l->next;
    }
    return 0;
}
How about using an array of pointers to the list elements? Would that be the best solution in case I have more than 100,000 records (or even more) to store, where each record contains more than one data type? I'll mostly need insertion at the end, and very frequent access to the elements.
Thanks!
You are correct to consider the question each time you decide whether to implement an array, a linked list, or some other structure.
ARRAY
+ Fast random access.
+ Dynamically allocated arrays can be re-sized using `realloc()`.
+ Can be sorted using `qsort()`.
+ For sorted arrays, a specific record can be located using `bsearch()` (see the sketch after these lists).
- Must occupy a contiguous block of memory.
- For long-lived applications, frequent enlargement of the array can eventually lead to a fragmented memory space, and perhaps even eventual failure of `realloc()`.
- Inserting and deleting elements is expensive. Inserting an element (in a sorted array) requires all elements of the array beyond the insertion point to be moved. A similar movement of elements is required when deleting an element.
LINKED-LIST
+ Does not require a contiguous block of memory.
+ Much more efficient than an array to re-size dynamically. Outperforms an array when it comes to fragmented memory usage.
+ Sequential access is good, but perhaps still not as fast as an array (due to CPU cache misses, etc.).
- Random access is not really possible.
- Extra memory overhead for node pointers (priorNode, nextNode).
There are other structures that even combine arrays with linked list, such as hash tables, binary trees, n-trees, random-access-list, etc., each comes with various characteristics to consider.
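As a small illustration of the `qsort()`/`bsearch()` points in the ARRAY list above, here is a sketch on a dynamically allocated array (the sample values are arbitrary):

#include <stdio.h>
#include <stdlib.h>

/* comparator for qsort()/bsearch() over ints */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    size_t count = 6;
    int *data = malloc(count * sizeof *data);
    if (data == NULL) return 1;
    int init[] = {42, 7, 19, 3, 88, 19};
    for (size_t i = 0; i < count; i++) data[i] = init[i];

    qsort(data, count, sizeof *data, cmp_int);            /* sort in place */

    int key = 19;
    int *hit = bsearch(&key, data, count, sizeof *data, cmp_int);
    printf("%d %s\n", key, hit ? "found" : "not found");

    free(data);
    return 0;
}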
Arrays have constant access time, i.e. it takes the same amount of time to access any element.
That's not true for lists: the average time taken is linear in the number of elements.
However, lists don't require a contiguous block of memory, unlike arrays. As such, appending to an array can cause memory reallocation which can wreak havoc with any pointers to array elements that you've stored.
These points are the principal considerations when choosing between an array and a list.
Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024-byte direct-mapped cache with a block size of 16 bytes, so that makes 64 lines (sets, in this case). Assume the cache starts empty. Consider the following code:
struct pos {
int x;
int y;
};
struct pos grid[16][16];
int total_x = 0; int total_y = 0;
void function1() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[j][i].x;
            total_y += grid[j][i].y;
        }
    }
}

void function2() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[i][j].x;
            total_y += grid[i][j].y;
        }
    }
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How data are stored in memory
Every struct pos has a size of 8 bytes, so the total size of pos[16][16] is 2048 bytes. The array elements are laid out in the following order:
pos[0][0] pos[0][1] pos[0][2] ...... pos[0][15] pos[1][0] ...... pos[1][15] ...... pos[15][0] ...... pos[15][15]
The cache organization compared to the data
For the cache, each block is 16 bytes, which is the same size as two elements of the array. The entire cache is 1024 bytes, half the size of the entire array. Since the cache is direct-mapped, if we label the cache blocks from 0 to 63 we can safely assume that the mapping looks like this:
------------ memory ---------------------------- cache
pos[0][0]   pos[0][1]     -----------> block 0
pos[0][2]   pos[0][3]     -----------> block 1
pos[0][4]   pos[0][5]     -----------> block 2
.......
pos[0][14]  pos[0][15]    -----------> block 7
pos[1][0]   pos[1][1]     -----------> block 8
pos[1][2]   pos[1][3]     -----------> block 9
.......
pos[7][14]  pos[7][15]    -----------> block 63
pos[8][0]   pos[8][1]     -----------> block 0
.......
pos[15][14] pos[15][15]   -----------> block 63
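A tiny helper expresses this mapping; it is only a sketch for the parameters assumed in the question (16-byte blocks, 64 direct-mapped lines), with offsets measured from the start of grid, which is assumed to map to line 0:

#include <stddef.h>

/* Direct-mapped line index for a byte offset into grid,
   assuming 16-byte blocks and 1024/16 = 64 lines. */
static unsigned cache_line_index(size_t byte_offset) {
    return (unsigned)((byte_offset / 16) % 64);
}
/* e.g. grid[0][0] is at offset 0 -> line 0, and grid[8][0] is at
   offset 8 * 16 * sizeof(struct pos) = 1024 -> line 0 as well. */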
How function1 manipulates memory
The loop has a column-wise inner loop: the first iteration loads pos[0][0] and pos[0][1] into cache block 0, the second iteration loads pos[1][0] and pos[1][1] into cache block 8. The cache is cold, so during the first column pass the x accesses always miss while the y accesses always hit. You might expect the second column's data to have all been loaded into the cache during the first column's accesses, but that is NOT the case: the access to pos[8][0] has already evicted the earlier pos[0][0] line (they both map to block 0!). And so on; the miss rate is 50%.
How function2 manipulates memory
The second function has a nice stride-1 access pattern. That means when accessing pos[0][0].x, pos[0][0].y, pos[0][1].x, pos[0][1].y, only the first access is a miss, due to the cold cache. All the following groups behave the same way, so the miss rate is only 25%.
A k-way associative cache follows the same analysis, although it may be more tedious. To get the most out of the cache system, try to create a nice access pattern, say stride-1, and use the data as much as possible during each load from memory. Real-world CPU microarchitectures employ other intelligent designs and algorithms to enhance efficiency. The best method is always to measure the time in the real world, dump the core code, and do a thorough analysis.
OK, my computer science lectures are a bit far off, but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 x 4). Since your cache blocks are 16 bytes, a memory access grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index, only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that has been fetched; that depends on the number of fetches in the inner loop vs. the overall cache size, though.
Now we have to think about the cache size as well: you say you have 64 lines that are direct-mapped. In function1, an inner loop is 16 fetches. That means on the 17th fetch you get to grid[j][i+1]. This should actually be a hit, since it should have been kept in the cache since the last inner-loop pass. Every second inner loop should therefore consist only of hits.
Well, if my reasoning is correct, the answer that was given to you should be wrong. Both functions should perform with 25% misses. Maybe someone has a better answer, but if you understand my reasoning I'd ask a TA about it.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimization settings should optimize this to
pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."
Given the code:
/* assumed from the loop bounds and the 64-byte offset between src and dst: */
typedef int array[4][4];

void transpose2(array dst, array src)
{
    int i, j;
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            dst[i][j] = src[j][i];
        }
    }
}
Assumptions :
int is 4 bytes
the src array starts at address 0, dst starts at address 64
the size of the cache is 32 bytes; at the beginning the cache is empty
Assuming that I have a 32-byte cache, with write-through, write-allocate & LRU, organised as 2-way set associative with 8-byte blocks:
When I read from memory, how many bytes do I take from memory on each iteration?
Is it 4 or 8?
What I'm quite sure about is that the cache has 4 cells, or rows, and each row holds 8 bytes. Is this correct?
What is a little confusing is the 2-way part. I think that each way has 4 bytes, right? Please correct me if I'm wrong...
Then, when I "take" a block from memory, I just don't understand exactly how many bytes I get!?
Thanks in advance
Ron
The cache way (aka its associativity) does not affect the amount of data that's transferred when a transfer occurs; the block size is the block size.
Associativity is simply a measure of how many possible locations there are in the cache where a given block from memory could be stored. So:
For a direct-mapped cache (associativity=1), memory address xyz will always map to the same cache location.
For a two-way cache, xyz could map to either of two cache locations.
For a fully-associative cache, xyz could map to anywhere in cache.
I'm really not saying anything here which isn't already explained at e.g. Wikipedia: http://en.wikipedia.org/wiki/CPU_cache#Associativity.
When the CPU references (loads or stores) a word from a block that is not in the cache, that block is fetched from memory. So, with the parameters supplied, every cache miss involves an 8-byte transfer from memory to the cache.
Regarding the terminology: your cache has 4 entries, containers or cache lines (32 bytes / 8 bytes per block). As it is 2-way associative, there are 2 sets of 2 entries each. Blocks with even addresses map to set 0, while blocks with odd addresses map to set 1.
Block addresses are obtained by shifting the (byte) address right by log2(block_size) bits (3 bits in your cache).
For example:
address 64 belongs to block 8
address 72 belongs to block 9
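A small sketch of those two steps, using the parameters given in this question (8-byte blocks, 32/8 = 4 lines, 2-way associative, hence 2 sets):

#include <stdio.h>

int main(void) {
    unsigned addresses[] = {0, 64, 72};
    for (int i = 0; i < 3; i++) {
        unsigned addr  = addresses[i];
        unsigned block = addr >> 3;  /* shift by log2(block_size) = 3 bits */
        unsigned set   = block & 1;  /* even block -> set 0, odd -> set 1  */
        printf("address %u -> block %u, set %u\n", addr, block, set);
    }
    return 0;
}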
Imagine you have some memory containing a bunch of bytes:
++++ ++-- ---+ +++-
-++- ++++ ++++ ----
---- ++++ +
Let us say + means allocated and - means free.
I'm looking for a formula to calculate the percentage of fragmentation.
Background
I'm implementing a tiny dynamic memory manager for an embedded device with static memory. My goal is to have something I can use for storing small amounts of data, mostly incoming packets over a wireless connection at about 128 bytes each.
As R. says, it depends exactly what you mean by "percentage of fragmentation" - but one simple formula you could use would be:
(free - freemax)
---------------- x 100% (or 100% for free=0)
free
where
free = total number of bytes free
freemax = size of largest free block
That way, if all memory is in one big block, the fragmentation is 0%, and if memory is all carved up into hundreds of tiny blocks, it will be close to 100%.
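A direct translation of that formula into C might look like this (free and freemax as defined above):

#include <stddef.h>

/* Fragmentation as a percentage: 0% when all free memory is one block,
   approaching 100% when it is carved into many tiny pieces.
   Returns 100% when nothing is free, as in the formula above. */
double fragmentation_percent(size_t free_total, size_t freemax) {
    if (free_total == 0)
        return 100.0;
    return 100.0 * (double)(free_total - freemax) / (double)free_total;
}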
Calculate how many 128-byte packets you could fit in the current memory layout.
Let that number be n.
Calculate how many 128-byte packets you could fit in a memory layout with the same number of bytes allocated as the current one, but with no holes (that is, move all the + to the left, for example).
Let that number be N.
Your "fragmentation ratio" would be alpha = n/N.
If your allocations are all roughly the same size, just split your memory up into TOTAL/MAXSIZE pieces each consisting of MAXSIZE bytes. Then fragmentation is irrelevant.
To answer your question in general, there is no magic number for "fragmentation". You have to evaluate the merits of different functions in reflecting how fragmented memory is. Here is one I would recommend, as a function of a size n:
fragmentation(n) = -log(n * number_of_free_slots_of_size_n / total_bytes_free)
Note that the log is just there to map things to a "0 to infinity" scale; you should not actually evaluate that in practice. Instead you might simply evaluate:
freespace_quality(n) = n * number_of_free_slots_of_size_n / total_bytes_free
with 1.0 being ideal (able to allocate the maximum possible number of objects of size n) and 0.0 being very bad (unable to allocate any).
If you had [++++++-----++++--++-++++++++--------+++++] and you wanted to measure the fragmentation of the free space (or any other allocation)
You could measure the average contiguous block size
Total blocks / Number of contiguous regions.
In this case (measuring the free space) it would be:
(5 + 2 + 1 + 8) / 4 = 4
Based on R.. GitHub STOP HELPING ICE's answer, I came up with the following way of computing fragmentation as a single percentage number:
fragmentation = 1 - (sum over i = 1..n of FreeSlots(i) / IdealFreeSlots(i)) / n
Where:
n is the total number of free blocks
FreeSlots(i) means how many i-sized slots you can fit in the available free memory space
IdealFreeSlots(i) means how many i-sized slots would fit in a perfectly unfragmented memory of size n. This is a simple calculation: IdealFreeSlots(i) = floor(n / i).
How I came up with this formula:
I was thinking about how I could combine all the freespace_quality(i) values to get a single fragmentation percentage, but I wasn't very happy with the result of this function. Even in an ideal scenario, you could have freespace_quality(i) != 1 if the free space size n is not divisible by i. For example, if n=10 and i=3, freespace_quality(3) = 9/10 = 0.9.
So, I created a derived function freespace_relative_quality(i), which looks like this:
freespace_relative_quality(i) = freespace_quality(i) / ideal_freespace_quality(i)
(where ideal_freespace_quality(i) is the value freespace_quality(i) would take in a perfectly unfragmented memory of size n)
This would always have the output 1 in the ideal "perfectly unfragmented" scenario.
After doing the math:
freespace_relative_quality(i) = FreeSlots(i) / IdealFreeSlots(i)
All that's left to do now to get to the final fragmentation formula is to calculate the average freespace relative quality for all values of i (from 1 to n), and then invert the range by taking 1 minus the average quality, so that 0 means completely unfragmented (maximum quality) and 1 means most fragmented (minimum quality).
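A sketch of how that metric could be computed, assuming the caller supplies the lengths of the contiguous free regions (this representation is an assumption for illustration, not part of the original answer):

#include <stddef.h>

/* free_runs[] holds the lengths of the contiguous free regions and n is the
   total number of free bytes (the sum of free_runs). Returns 0.0 for a single
   unfragmented free block, approaching 1.0 for heavily fragmented free space.
   This is a quadratic sketch; fine for small n. */
double fragmentation(const size_t *free_runs, size_t run_count, size_t n) {
    if (n == 0)
        return 0.0;                           /* nothing free to fragment  */
    double quality_sum = 0.0;
    for (size_t i = 1; i <= n; i++) {
        size_t free_slots = 0;                /* FreeSlots(i)              */
        for (size_t r = 0; r < run_count; r++)
            free_slots += free_runs[r] / i;
        size_t ideal_slots = n / i;           /* IdealFreeSlots(i), >= 1   */
        quality_sum += (double)free_slots / (double)ideal_slots;
    }
    return 1.0 - quality_sum / (double)n;     /* 1 - average quality       */
}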