Arrays or linked list for frequent random access? - c

I use lists and arrays very often, and I am wondering which is faster: an array or a list?
Let's say we have an array of integers and a linked list, both holding the same values.
int array_data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
typedef struct node_data{
int data;
struct node_data * next;
} list_data;
                  ____    ____    ____    ____    ____    ____    ____    ____    ____    ____
list_data *d = -> | 1| -> | 2| -> | 3| -> | 4| -> | 5| -> | 6| -> | 7| -> | 8| -> | 9| -> |10| -> NULL
If I want the value of array_data at index 6, I use array_data[6] and get 7. But what if I want the same element from the list? Should I start at the head and count hops until I reach the requested index, as in get_data(d, 6)? Or is there a better way to do this?
int get_data(list_data *l, int index){
int i = 0;
while(l != NULL){
if(index == i){
return l -> data;
}
i++;
l = l -> next;
}
return 0;
}
How about using an array of pointers to the list elements? Would this be the best solution if I have 100,000 or more records to store, each containing more than one data type? I'll mostly need insertion at the end and very frequent access to elements.
Thanks!

You are right to consider this question each time you decide whether to implement an array, a linked list, or some other structure.
ARRAY
+ Fast random access.
+ Dynamically allocated arrays can be re-sized using `realloc()`.
+ Sorted using `qsort()`.
+ For sorted arrays, a specific record can be located using `bsearch()`
- Must occupy a contiguous block of memory.
- For a long-lived application, frequent enlargement of the array can eventually lead to a fragmented memory space, and perhaps even eventual failure of `realloc()`.
- Inserting and deleting elements is expensive. Inserting an element (in a sorted array) requires all elements of the array beyond the insertion point to be moved. A similar movement of elements is required when deleting an element.
LINKED-LIST
+ Does not require a contiguous block of memory.
+ Much more efficient than an array to re-size dynamically. Outperforms an array when memory is fragmented.
+ Sequential access is good, but perhaps still not as fast as an array (due to CPU cache misses, etc.).
- Random access is slow: reaching element i means following i links from the head.
- Extra memory overhead for node pointers (priorNode, nextNode).
There are other structures, some of which even combine arrays with linked lists, such as hash tables, binary trees, n-trees, random-access lists, etc. Each comes with its own characteristics to consider.

Arrays are constant in access time; i.e. it takes the same amount of time to access any element.
That's not true for lists: the average access time is linear in the number of elements.
However, lists don't require a contiguous block of memory, unlike arrays. As such, appending to an array can cause memory reallocation which can wreak havoc with any pointers to array elements that you've stored.
These points are the principal considerations when choosing between an array and a list.
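For the use case in the question (mostly appending, very frequent indexed access), a growable array is one option. The following is only a minimal sketch: the `record` type, `record_vec`, `vec_push` and `vec_free` are hypothetical names, and the doubling growth policy is an assumption, not something the answers prescribe. Because `realloc()` may move the block, the sketch refers to elements by index rather than by stored pointer.

```c
#include <stdlib.h>

/* Hypothetical record holding more than one data type. */
typedef struct {
    int id;
    double value;
} record;

/* Growable array: len elements in use, cap elements allocated. */
typedef struct {
    record *items;
    size_t len, cap;
} record_vec;

/* Append one record. Returns 0 on success, -1 if allocation fails.
 * Capacity doubles when full, so appends are amortized O(1). */
int vec_push(record_vec *v, record r) {
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 16;
        record *p = realloc(v->items, new_cap * sizeof *p);
        if (p == NULL)
            return -1;          /* the old block is still valid */
        v->items = p;           /* the block may have moved */
        v->cap = new_cap;
    }
    v->items[v->len++] = r;
    return 0;
}

/* Release the storage and reset the vector to empty. */
void vec_free(record_vec *v) {
    free(v->items);
    v->items = NULL;
    v->len = v->cap = 0;
}
```

A vector starts as `record_vec v = {0};`, and element i is then simply `v.items[i]`, which takes constant time regardless of how many records are stored.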

Related

Seg faulting with 4D arrays & initializing dynamic arrays

I ran into a bit of a problem with a Tetris program I'm currently writing in C.
I am trying to use a 4D multi-dimensional array e.g.
uint8_t shape[7][4][4][4]
but I keep getting seg faults when I try that, I've read around and it seems to be that I'm using up all the stack memory with this kind of array (all I'm doing is filling the array with 0s and 1s to depict a shape so I'm not inputting a ridiculously high number or something).
Here is a version of it (on pastebin because, as you can imagine, it's very ugly and long).
If I make the array smaller it seems to work but I'm trying to avoid a way around it as theoretically each "shape" represents a rotation as well.
https://pastebin.com/57JVMN20
I've read that you should use dynamic arrays so they end up on the heap, but then I run into the issue of how to initialize a dynamic array in the way shown in the link above. It seems like it would be a headache, as I would have to go through loops and handle each shape specifically?
I would also be grateful for anybody to let me pick their brain on dynamic arrays how best to go about them and if it's even worth doing normal arrays at all.
I don't understand why you use a 4D array to store shapes for a Tetris game, and I agree with bolov's comment that such an array should not overflow the stack (7*4*4*4*1 = 448 bytes), so you should probably check the other code you wrote.
Now, to your question on how to manage 4D (N-dimensional) dynamically sized arrays. You can do this in two ways:
The first way consists of creating an array of (N-1)-dimensional arrays. If N = 2 (a table) you end up with a "linearized" version of the table (a normal array) whose length is R * C, where R is the number of rows and C the number of columns. Inductively speaking, you can do the very same thing for N-dimensional arrays without too much effort. This method has some drawbacks though:
You need to know beforehand all the dimensions except one (the "latest") and all the dimensions are fixed. Back to the N = 2 example: if you use this method on a table of C columns and R rows, you can change the number of rows by allocating C * sizeof(<your_array_type>) more bytes at the end of the preallocated space, but not the number of columns (not without rebuilding the entire linearized array). Moreover, different rows must have the same number of columns C (you cannot have a 2D array that looks like a triangle when drawn on paper, just to get things clear).
You need to carefully manage the indices: you cannot simply write my_array[row][column]; instead you must access the array with my_array[row*C + column]. If N is not 2, then this formula gets... interesting.
The second way is to use N-1 levels of arrays of pointers. That's my favourite solution because it has none of the drawbacks of the first one, although you need to manage pointers to pointers to pointers to ... to pointers to a type (but that's what you do when you access my_array[7][4][4][4]).
Solution 1
Let's say you want to build an N-Dimensional array in C using the first solution.
You know the length of each dimension of the array up to the (N-1)-th (let's call them d_1, d_2, ..., d_(N-1)). We can build this inductively:
We know how to build a dynamic 1-dimensional array
Supposing we know how to build a (N-1)-dimensional array, we show that we can build a N-Dimensional array by putting each (N-1)-dimensional array we have available in a 1-Dimensional array, thus increasing the available dimensions by 1.
Let's also assume that the data type that the arrays must hold is called T.
Let's suppose we want to create an array with R (N-1)-dimensional arrays inside it. For that we need to know the size of each (N-1)-dimensional array, so we need to calculate it.
For N = 1 the size is just sizeof(T)
For N = 2 the size is d_1 * sizeof(T)
For N = 3 the size is d_2 * d_1 * sizeof(T)
You can easily prove by induction that the number of bytes required to store R (N-1)-dimensional arrays is R*(d_1 * d_2 * ... * d_(N-1) * sizeof(T)). And that's done.
Now, we need to access a random element inside this massive N-dimensional array. Let's say we want to access the item with indices (i_1, i_2, ..., i_N). For this we are going to repeat the inductive reasoning:
For N = 1, the index of the i_1 element is just my_array[i_1]
For N = 2, the index of the (i_1, i_2) element can be calculated by noting that a new row begins every d_1 elements, so the element is my_array[i_1 * d_1 + i_2].
For N = 3, we can repeat the same process and end up with the element my_array[d_2 * ((i_1 * d_1) + i_2) + i_3]. Here i_2 ranges over 0..d_1-1 and i_3 over 0..d_2-1.
And so on.
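The N = 3 case above can be sketched as follows. This is only an illustration under stated assumptions: `idx3` and `alloc_flat3` are hypothetical helper names, the element type is taken to be uint8_t, and `b` and `c` stand for the extents of the second and third index (d_1 and d_2 in the formula above).

```c
#include <stdint.h>
#include <stdlib.h>

/* Flat offset of element (i, j, k) in an a x b x c array stored as one
 * contiguous block. b is the extent of j, c is the extent of k. */
size_t idx3(size_t i, size_t j, size_t k, size_t b, size_t c) {
    return (i * b + j) * c + k;
}

/* Allocate a zero-filled a x b x c array as a single block.
 * Returns NULL if the allocation fails. */
uint8_t *alloc_flat3(size_t a, size_t b, size_t c) {
    return calloc(a * b * c, sizeof(uint8_t));
}
```

With a 5 x 7 x 3 array, the element (2, 6, 1) lives at `grid[idx3(2, 6, 1, 7, 3)]`, and a single `free(grid)` releases everything.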
Solution 2
The second solution wastes a bit more memory, but it's more straightforward, both to understand and to implement.
Let's just stick to the N = 2 case to keep things simple. Imagine taking a table, splitting it row by row, and placing each row in its own memory slot. A row is a 1-dimensional array, so to make a 2-dimensional array we only need an ordered array holding a reference to each row, as the following drawing shows (the last row is the R-th row):
+------+
| R1 -------> [1,2,3,4]
|------|
| R2 -------> [2,4,6,8]
|------|
| R3 -------> [3,6,9,12]
|------|
| .... |
|------|
| RR -------> [R, 2*R, 3*R, 4*R]
+------+
In order to do that, you first allocate the array of references (R elements long), then iterate through it and assign to each entry a pointer to a newly allocated memory area of size d_1.
We can easily extend this to N dimensions. Simply build an R-element array and, for each entry in this array, allocate a new 1-dimensional array of size d_(N-1), and do the same for each newly created array until you get to the arrays of size d_1.
Notice how you can easily access each element by simply using the expression my_array[i_1][i_2][i_3]...[i_N].
For example, let's suppose N = 3, T is int, and d1, d2 and d3 are known (and initialized) in the following code:
size_t d1 = 5, d2 = 7, d3 = 3;
int ***my_array;
my_array = malloc(d1 * sizeof(int**));
for(size_t x = 0; x<d1; x++){
my_array[x] = malloc(d2 * sizeof(int*));
for (size_t y = 0; y < d2; y++){
my_array[x][y] = malloc(d3 * sizeof(int));
}
}
//Accessing a random element
size_t x1 = 2, y1 = 6, z1 = 1;
my_array[x1][y1][z1] = 32;
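The code above never checks malloc's return value and never frees anything. A possible way to handle both is sketched below, under the assumption that the element type is int; `alloc_nested3` and `free_nested3` are hypothetical names, and calloc is used so that partially built arrays contain NULL pointers that are safe to pass to free.

```c
#include <stdlib.h>

/* Free a nested array built by alloc_nested3. Also safe on a partially
 * built array, because unallocated slots are NULL. */
void free_nested3(int ***a, size_t d1, size_t d2) {
    if (a == NULL)
        return;
    for (size_t x = 0; x < d1; x++) {
        if (a[x] == NULL)
            continue;
        for (size_t y = 0; y < d2; y++)
            free(a[x][y]);      /* free(NULL) is a no-op */
        free(a[x]);
    }
    free(a);
}

/* Allocate a zero-filled d1 x d2 x d3 array of int using arrays of
 * pointers. Returns NULL (and leaks nothing) if any allocation fails. */
int ***alloc_nested3(size_t d1, size_t d2, size_t d3) {
    int ***a = calloc(d1, sizeof *a);
    if (a == NULL)
        return NULL;
    for (size_t x = 0; x < d1; x++) {
        a[x] = calloc(d2, sizeof *a[x]);
        if (a[x] == NULL) {
            free_nested3(a, d1, d2);
            return NULL;
        }
        for (size_t y = 0; y < d2; y++) {
            a[x][y] = calloc(d3, sizeof *a[x][y]);
            if (a[x][y] == NULL) {
                free_nested3(a, d1, d2);
                return NULL;
            }
        }
    }
    return a;
}
```

Elements are still accessed as `a[x][y][z]`, and one call to `free_nested3(a, d1, d2)` releases every level.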
I hope this helps. Please feel free to comment if you have questions.

Constructing sequential Huffman Tree From Scratch

Given a text file, I need to read each alphanumeric character and encode it using Huffman's algorithm.
Reading characters, storing probabilities and creating nodes are solved, as is building the Huffman trie using pointers.
However, I need to create and initialize Huffman's tree using a sequential representation of a binary tree, without any pointers.
This could be done by creating a regular tree using pointers and then just reading it into the array, but I aim to directly populate an array with the nodes.
I considered creating smaller trees and merging them together but opted for a matrix representation where I would gather elements with the smallest probabilities from a binary heap and store them into the rows of a matrix where row of a matrix would represent the level at which the node should be in a binary tree, in a reverse order that is.
E.g. Given characters and their probabilities as char[int] pairs.
a[1], b[1], c[2], d[1], e[3], f[11], g[2]
I aim to create matrix that looks like
____________________________________
a | b | d | g |
____________________________________
ab | c | dg | e |
____________________________________
abc | deg | | |
____________________________________
abcdeg | f | | |
____________________________________
abcdefg | | | |
____________________________________
Where levels of a, b, c, d, e & f would be rows of a matrix.
Currently, I'm stuck on how to recursively increment the levels of elements when their "parent" moves (if I'm combining two nodes from different levels ['ab' and 'c'], I can simply set the level of c equal to that of ab and solve the problem, but not when, for example, 'c' and 'd' were both in the second row) and on how to create the full binary tree (if a node has a left son, it needs to have a right one) with only the levels of the terminal nodes.
In advance, I understand that the question is not very specific and would appreciate to hear if there's another approach to this problem instead of just solving the mentioned one.
Is this a contrived problem for homework? I ask because representations of trees that don't use links require O(2^h) space to store a tree of height h. This is because they assume the tree is complete, allowing index calculations to replace pointers. Since Huffman trees can have height h=m-1 for an alphabet of size m, the size of the worst case array could be enormous. Most of it would be unused.
But if you give up the idea that a link must be a pointer and allow it to be an array index, then you're fine. A long time ago - before the dynamic memory allocators became common - this was standard. This problem is particularly good for this method because you always know the number of nodes in the tree in advance: one less than twice the alphabet size. In C you might do something like this
typedef struct {
char ch;
int f;
int left, right; // Indices of children. If both -1, this is leaf for char ch.
} NODE;
#define ALPHABET_SIZE 7
NODE nodes[2 * ALPHABET_SIZE - 1] = {
{ 'a', 1, -1, -1 },
{ 'b', 1, -1, -1 },
{ 'c', 2, -1, -1 },
{ 'd', 1, -1, -1 },
{ 'e', 3, -1, -1 },
{ 'f', 11, -1, -1 },
{ 'g', 2, -1, -1 },
// Rest of array for internal nodes
};
int n_nodes = ALPHABET_SIZE;
int add_internal_node(int f, int left, int right) {
// Allocate a new node in the array and fill in its values.
int i = n_nodes++;
nodes[i] = (NODE) { .f = f, .left = left, .right = right };
return i;
}
Now you'd use the standard tree-building algorithm like this:
int build_huffman_tree(void) {
// Add the indices of the leaf nodes to the priority queue.
for (int i = 0; i < ALPHABET_SIZE; ++i)
add_to_frequency_priority_queue(i);
while (priority_queue_size() > 1) {
int a = remove_min_frequency(); // Removes index of lowest freq node from the queue.
int b = remove_min_frequency();
int p = add_internal_node(nodes[a].f + nodes[b].f, a, b);
add_to_frequency_priority_queue(p);
}
// Last node is huffman tree root.
return remove_min_frequency();
}
The decoding algorithm will use the index of the root like this:
char decode(BIT bits[], int huffman_tree_root_index) {
int i = 0, p = huffman_tree_root_index;
while (nodes[p].left != -1 || nodes[p].right != -1) // while not a leaf
p = bits[i++] ? nodes[p].right : nodes[p].left;
return nodes[p].ch;
}
Of course this doesn't return how many bits were consumed, which a real decoder needs to do. A real decoder is also not getting its bits in an array. Finally, for encoding you want parent indices in addition to the children. Working out these matters ought to be fun. Good luck with it.
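The pieces above rely on priority-queue functions that are not shown. As a self-contained illustration, here is a sketch that replaces the priority queue with a plain linear scan for the two lowest-frequency parentless nodes (simpler, but O(n^2) overall) and adds the parent indices mentioned for encoding. The names `hnode`, `huff_build` and `huff_code_length` are hypothetical, and the alphabet size is fixed at 7 to match the example.

```c
#include <stddef.h>

#define ALPHABET_SIZE 7
#define MAX_NODES (2 * ALPHABET_SIZE - 1)

typedef struct {
    char ch;
    int f;
    int left, right;    /* child indices, -1 if none (leaf) */
    int parent;         /* parent index, -1 for the root */
} hnode;

/* Build the tree in place. The first ALPHABET_SIZE entries of nodes must
 * be leaves with ch and f set. Returns the index of the root. */
int huff_build(hnode nodes[MAX_NODES]) {
    int n = ALPHABET_SIZE;
    for (int i = 0; i < n; i++)
        nodes[i].left = nodes[i].right = nodes[i].parent = -1;
    while (n < MAX_NODES) {
        int a = -1, b = -1;     /* lowest and second-lowest frequency */
        for (int i = 0; i < n; i++) {
            if (nodes[i].parent != -1)
                continue;       /* already merged into a subtree */
            if (a == -1 || nodes[i].f < nodes[a].f) {
                b = a;
                a = i;
            } else if (b == -1 || nodes[i].f < nodes[b].f) {
                b = i;
            }
        }
        nodes[n] = (hnode){ .ch = 0, .f = nodes[a].f + nodes[b].f,
                            .left = a, .right = b, .parent = -1 };
        nodes[a].parent = nodes[b].parent = n;
        n++;
    }
    return n - 1;
}

/* Code length of a leaf = number of parent links up to the root. */
int huff_code_length(const hnode nodes[], int leaf) {
    int len = 0;
    for (int p = nodes[leaf].parent; p != -1; p = nodes[p].parent)
        len++;
    return len;
}
```

Walking the parent links from a leaf to the root also yields the code bits in reverse order: at each step, emit 0 if the node is its parent's left child and 1 otherwise.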

Why binary search array is slightly faster than binary search tree?

I used both functions to search queries from a very large set of data. Their speed is about the same at first, but when the size gets very large, binary search array is slightly faster. Is that because of caching effects? Arrays are stored sequentially. Are trees?
int binary_array_search(int array[], int length, int query){
//the array has been sorted
int left=0, right=length-1;
int mid;
while(left <= right){
mid = (left+right)/2;
if(query == array[mid]){
return 1;
}
else if(query < array[mid]){
right = mid-1;
}
else{
left = mid+1;
}
}
return 0;
}
// Search a binary search tree
int binary_tree_search(bst_t *tree, int ignore, int query){
node_t *node = tree->root;
while(node != NULL){
int data = node->data;
if(query < data){
node = node->left;
}
else if(query > data){
node =node->right;
}
else{
return 1;
}
}
return 0;
}
Here are some results:
LENGTH   SEARCHES   binary search array   binary search tree
1024     10240      7.336000e-03          8.230000e-03
2048     20480      1.478000e-02          1.727900e-02
4096     40960      3.001100e-02          3.596800e-02
8192     81920      6.132700e-02          7.663800e-02
16384    163840     1.251240e-01          1.637960e-01
There are several reasons why an array may be, and should be, faster:
A node in the tree is at least 3 times bigger than an item in the array due to the left and right pointers.
For example, on a 32 bit system you'll have 12 bytes instead of 4. Chances are those 12 bytes are padded to or aligned on 16 bytes. On a 64 bit system we get 8 and 24 to 32 bytes.
This means that with an array 3 to 4 times more items can be loaded in the L1 cache.
Nodes in the tree are allocated on the heap, and those could be anywhere in memory, depending on the order in which they were allocated (also, the heap can get fragmented). Creating those nodes (with new or malloc) will also take more time than a possible one-time allocation for the array, but this is probably not part of the speed test here.
To access a single value in the array only one read has to be done, for the tree we need two: the left or right pointer and the value.
When the lower levels of the search are reached, the items to compare will be close together in the array (and possibly already in the L1 cache) while they are probably spread in memory for the tree.
Most of the time arrays will be faster due to locality of reference.
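The size argument can be checked directly with sizeof. The sketch below assumes a `node_t` layout (the question does not show it) and a 64-byte cache line; `ints_per_line` and `nodes_per_line` are hypothetical helper names.

```c
#include <stddef.h>

/* Assumed node layout; the question's node_t is not shown. */
typedef struct node {
    int data;
    struct node *left, *right;
} node_t;

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* How many array items fit into one cache line. */
size_t ints_per_line(void) {
    return CACHE_LINE / sizeof(int);
}

/* How many whole tree nodes fit into one cache line, in the best case
 * where the nodes happen to be adjacent in memory. */
size_t nodes_per_line(void) {
    return CACHE_LINE / sizeof(node_t);
}
```

Printing both values on the target machine shows how many fewer tree nodes than array items a single cache line can hold there.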
Is that because of caching effects?
Sure, that is the main reason. On modern CPUs, cache is transparently used to read/write data in memory.
Cache is much faster than main memory (DRAM). To give you some perspective, accessing data in Level 1 cache takes ~4 CPU cycles, while accessing DRAM on the same CPU takes ~200 CPU cycles, i.e. the cache is about 50 times faster.
Caches operate on small blocks called cache lines, which are usually 64 bytes long.
More info: https://en.wikipedia.org/wiki/CPU_cache
Arrays are stored sequentially. Are trees?
Array is a single block of data. Each element of an array is adjacent to its neighbors, i.e.:
+-------------------------------+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
+-------------------------------+
block of 32 bytes (8 times 4)
Each array access fetches a cache line, i.e. 64 bytes or 16 int values. So for an array there is quite a high probability (especially towards the end of the binary search) that the next access falls within the same cache line, so no memory access is needed.
On the other hand, tree nodes are allocated one by one:
+------------------------------------------------+
+------------------+ | +------------------+ +------------------+ |
| 0 | left | right | -+ | 2 | left | right | <- | 1 | left | right | <-+
+------------------+ +------------------+ +------------------+
block 0 of 24 bytes block 2 of 24 bytes block 1 of 24 bytes
As we can see, to store just 3 values we used about twice as much memory as storing 8 values in the array above. So the tree structure is sparser and statistically holds less data per 64-byte cache line.
Also each memory allocation returns a block in memory which might not be adjacent to the previously allocated tree nodes.
Also, the allocator aligns each memory block to at least 8 bytes (on 64-bit CPUs), so some bytes are wasted there. Not to mention that we need to store those left and right pointers in each node...
So each tree access, even at the very end of the search, will need to fetch a cache line, i.e. it is slower than the array access.
So why is the array only a tad faster in the tests? It is due to the binary search. At the very beginning of the search we access data quite randomly, and each access is quite far from the previous one. So the array structure gets its boost only at the end of the search.
Just for fun, try to compare linear search (i.e. basic search loop) in array vs binary search in tree. I bet you will be surprised with the results ;)

How should you add elements to a multi dimensional array? (In C)

I'm working on a table football cup program (in C), where I have 16 people facing off to get to the final. I'm having trouble putting elements into the different elements of the array (which has sort of stopped my progress until I figure it out). I've searched on the internet (not extensively) about pointers, but I can't find anything on multi dimensional arrays.
I have 8 games, each with 2 participants, who each play 5 matches against each other. Hopefully that means I define the array as int lastSixteen[8][2][5]. All participants have a unique ID
Assuming I have declared my arrays correctly... On to the main question.
This is what I'm currently doing:
int i;
for(i=0; i<MAX_PLAYERS/2;i++){
roundOne[i] = i;
}
I want to set the first dimension of my array to be the numbers 1 through 8 incl. but I run into 'error: incompatible types in assignment'.
I tried setting the line with the assignment to be roundOne[i][][] = i; but as I expected, that didn't work either.
Later on in the program I need to set the second set of numbers to be the games participants to be the 16 participants (to keep it simple I'm doing it in ascending numerical order) so Game 1 is Player 1 and Player 2, Game 2 is Player 3 and 4 etc.
for(i=0; i<16; i++){
if(i % 2 != 0){
roundOne[(MAX_PLAYERS/2)-1][0] = i; /* puts 1,3,5,7,9,11,13,15 */
}
else{
roundOne[(MAX_PLAYERS/2)-1][1] = i; /* puts 2,4,6,8,10,12,14,16 */
}
}
I'm assuming the second part will be fixed by the answer to the first part since they return the same error, but I included it because I don't know.
Here is a Minimal, Complete, and Verifiable example:
#include <stdio.h>
#define MAX_PLAYERS 16
int main(void){
int i;
int roundOne[8][2][5];
/* seeded in numerical order.*/
for(i=0; i<MAX_PLAYERS/2;i++){
roundOne[i] = i;
}
for(i=0; i<MAX_PLAYERS; i++){
if(i % 2 != 0){
roundOne[(MAX_PLAYERS/2)-1][0] = i;
}
else{
roundOne[(MAX_PLAYERS/2)-1][1] = i;
}
}
return 0;
}
Thanks in Advance,
Rinslep
You can't just use a multi-dimensional array - it doesn't do what you want. Here is why: let's say you have 8 games and 2 players (forget that there are 5 matches for a second). That means your multi-dimensional array would have 16 spots:
        Player
        0   1
      +---+---+
    0 |   |   |
      +---+---+
    1 |   |   |
      +---+---+
    2 |   |   |
 G    +---+---+
 a  3 |   |   |
 m    +---+---+
 e  4 |   |   |
      +---+---+
    5 |   |   |
      +---+---+
    6 |   |   |
      +---+---+
    7 |   |   |
      +---+---+
Now you want to put the game number in there AND you want to put the unique player IDs in there AND you might want to put other stuff in there (like who won and the score)? How are you going to do that? There are a couple choices:
The game number is the index into the array - not a value you store in the array. Now you can store the player IDs for each game in the array. But this still doesn't address storing other stuff (like who won and the score).
If the game number needs to be stored in the array (or other things like who won and the score) you will need to store more than one thing in the array so the array cannot hold ints - you need an array of structures.
It is hard to guess what the right data structure is because it depends on what your program is going to do, but I think I would do this:
typedef struct match
{
int score[2]; /* index 0 is player 1, index 1 is player 2 */
int winner; /* index into the player and score arrays (either 0 or 1) */
} match;
typedef struct game
{
int players[2]; /* index 0 is player 1, index 1 is player 2 */
match matches[5];
} game;
game games[8];
Now the game number (1-8) is just the index into games plus 1, the match number (1-5) is just the index into matches plus 1, and if you want to make unique player numbers that go from 1-16 you can do this:
int i=1;
for(int g=0;g<8;g++)
for(int p=0;p<2;p++) /* both players of each game */
games[g].players[p]=i++;
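To show how the "who won and the score" part might be stored in such structures, here is a small sketch. The type names `match_t` and `game_t` and the helpers `record_match` and `game_winner` are hypothetical, and sending ties to the first player is an arbitrary assumption.

```c
typedef struct {
    int score[2];       /* index 0 is player 1, index 1 is player 2 */
    int winner;         /* 0 or 1, index into score and players */
} match_t;

typedef struct {
    int players[2];     /* unique player IDs */
    match_t matches[5];
} game_t;

/* Store the score of match m and derive its winner.
 * Ties go to index 0 here, which is an arbitrary choice. */
void record_match(game_t *g, int m, int s0, int s1) {
    g->matches[m].score[0] = s0;
    g->matches[m].score[1] = s1;
    g->matches[m].winner = (s1 > s0) ? 1 : 0;
}

/* Return the ID of the player who won more of the first n_played matches. */
int game_winner(const game_t *g, int n_played) {
    int wins[2] = { 0, 0 };
    for (int m = 0; m < n_played; m++)
        wins[g->matches[m].winner]++;
    return g->players[wins[1] > wins[0] ? 1 : 0];
}
```

The game number never needs to be stored: it is the position of the `game_t` inside the enclosing array.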
You need to initialize the array with dynamic allocation.
How do I work with dynamic multi-dimensional arrays in C?
Think of it this way
A = [
[B],
[C],
[D],
...
]
So let's say we need an array round with 10 rows, where each row has 20 columns. They will all be filled with integer values.
Option One - Dynamically Allocating
Define the number of buckets the array will have. We take the size of a pointer because each bucket will contain a pointer that represents the inner array.
int** round;
round = malloc(10 * sizeof(int*));
Now that we have allocated the space for the buckets, go through and allocate space for each row's elements. Each element is just a normal integer, so we take sizeof(int).
for (int i = 0; i < 10; i++) {
round[i] = malloc(20 * sizeof(int));
}
Option Two
We can define the size of the multidimensional array in a bit of an easier way. We know the number of rows and the number of columns. So alternatively we can allocate the space like this:
int* round;
round = malloc (10 * 20 * sizeof(int));
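Both of these allocate memory for a 10 x 20 array. With option one you access an element as round[i][j]; with option two the block is flat, so you access it as round[i * 20 + j]. In C you can't add elements to an array on the fly without reallocating it, so if the size is not known in advance, linked lists are in my experience better for this.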
Edit: I see that you updated the question, this code can be used with a 3D array also. You can easily use option two as 3D like round = malloc(x * y * z * sizeof(int)), where x, y, and z are equal to the dimensional values. You can also modify option one to work with this also.
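When only the first dimension needs to be dynamic, C also allows a pointer to an array type, which gives one contiguous block while keeping the round[g][p][m] syntax. This is a sketch under the assumption that the inner dimensions (2 players, 5 matches) are compile-time constants; `game_scores` and `alloc_round` are hypothetical names.

```c
#include <stdlib.h>

#define GAMES 8
#define PLAYERS_PER_GAME 2
#define MATCHES 5

/* One game's worth of data: a 2 x 5 block of int. */
typedef int game_scores[PLAYERS_PER_GAME][MATCHES];

/* Allocate GAMES zero-filled blocks in a single contiguous allocation.
 * Returns NULL if the allocation fails. */
game_scores *alloc_round(void) {
    return calloc(GAMES, sizeof(game_scores));
}
```

After `game_scores *r = alloc_round();`, an element is written as `r[7][1][4] = 3;` and the whole array is released with a single `free(r)`.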

How do you remove a cycle of integers (e.g. 1-2-3-1) from an array

If you have an array of integers, such as 1 2 5 4 3 2 1 5 9
What is the best way in C to remove cycles of integers from an array?
i.e. above, 1-2-5-4-3-2-1 is a cycle and should be removed to be left with just 1 5 9.
How can I do this?
Thanks!!
A straightforward search in an array could look like this:
int arr[] = {1, 2, 5, 4, 3, 2, 1, 5, 9};
int len = 9;
int i, j;
for (i = 0; i < len; i++) {
for (j = 0; j < i; j++) {
if (arr[i] == arr[j]) {
// remove elements between i and j
memmove(&arr[j], &arr[i], (len-i)*sizeof(int));
len -= i-j;
i = j;
break;
}
}
}
Build a graph and select edges based on running depth first search on it.
Mark vertices when you visit them, add edges as you traverse graph, don't add edges that have already been selected - they would connect previously visited components and therefore create a cycle.
From the array in your example we can't tell what is considered a cycle.
In your example there are 2 -> 5 and 1 -> 5 as well as 1 -> 2, so as a graph (?):
1 -> 2
|    |
|    V
+--> 5
So where is the information of which elements are connected?
There is a simple way, with O(n^2) complexity: simply iterate over each array entry from the beginning, and search the array for the last identical value. If that is in the same position as your current position, move on. Otherwise, delete the sequence (except for the initial value) and move on. You should be able to implement this using two nested for loops plus a conditional memmove (not memcpy, since the source and destination regions overlap).
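The O(n^2) approach just described can be sketched as below. `remove_cycles` is a hypothetical name; the function edits the array in place and returns the new length.

```c
#include <stddef.h>
#include <string.h>

/* For each position i, find the last later position holding the same
 * value and delete everything after arr[i] up to and including it.
 * Returns the new logical length of the array. */
size_t remove_cycles(int *arr, size_t len) {
    for (size_t i = 0; i < len; i++) {
        size_t last = i;
        for (size_t j = i + 1; j < len; j++) {
            if (arr[j] == arr[i])
                last = j;
        }
        if (last != i) {
            /* Keep arr[i], drop arr[i+1..last]. The regions overlap,
             * so memmove is required rather than memcpy. */
            memmove(&arr[i + 1], &arr[last + 1],
                    (len - last - 1) * sizeof arr[0]);
            len -= last - i;
        }
    }
    return len;
}
```

On the array from the question, the first pass finds the last 1 at index 6, removes indices 1 through 6, and leaves 1 5 9.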
There is a more complex way, with O(n log n) complexity. If your data set is large, this one will be preferable for performance, though it is more complex to implement and therefore more error-prone.
1) Sort the array - this is the O(n log n) part if you use a good sorting algorithm. Do so by reference - you want to keep the original. This moves all identical values together. Break sort-order ties by position in the original array, this will help in the next step.
2) Iterate once over the sorted array (O(n)), looking for runs of the same value. Because these runs are themselves sorted by position, you can trivially find each cycle involving that value by comparing adjacent pairs for equality. Erase (not delete) each cycle from the original array by replacing each value except the last with a sentinel (zero might work). Don't close the gaps yet, or the references will break.
NB: At this stage you need to ignore any endpoints that have already been erased from the array. Because they will resolve to sentinels, you simply have to be careful to not erase "runs" that involve the sentinel value at either end.
3) Throw away the sorted array, and use the sentinels to close the gaps in the original array. This should be O(n).
Actually implementing this in any given language is left as an exercise for the reader. :-)
