Space efficient trie - C

I'm trying to implement a space efficient trie in C. This is my struct:
struct node {
    char val;                    // character stored in this node
    int key;                     // key value if this character is the end of a word
    struct node *children[256];  // one slot per possible byte value
};
When I add a node, its index is the unsigned char cast of the character. For example, if I want to add "c", then
children[(unsigned char)'c']
is the pointer to the newly added node. However, this implementation requires me to declare a node* array of 256 elements. What I want to do is:
struct node** children;
and then when adding a node, just malloc space for the node and have
children[(unsigned char)'c']
point to the new node. The issue is that if I don't malloc space for children first, I obviously can't index into it, since dereferencing unallocated memory is undefined behavior.
So my question is: how do I implement a trie such that it only stores the non-null pointers to its children?

You could try using a de la Briandais trie, where you only have one child pointer for each node, and every node also has a pointer to a "sibling", so that all siblings are effectively stored as a linked list rather than directly pointed to by the parent.
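A minimal sketch of such a node and its child lookup might look like this (field names are illustrative):
#include <stddef.h>

struct dlb_node {
    char val;                  /* character stored in this node */
    int key;                   /* key value if this node ends a word, e.g. -1 otherwise */
    struct dlb_node *child;    /* first child: next character of the word */
    struct dlb_node *sibling;  /* next alternative at the same level (same parent) */
};

/* walk the sibling chain starting at 'first', looking for the character c */
struct dlb_node *find_child(struct dlb_node *first, char c)
{
    while (first != NULL && first->val != c)
        first = first->sibling;
    return first;              /* NULL if no sibling stores c */
}
Lookup of a child is now a linear walk over the siblings instead of a direct index, which trades a little time for a lot of space.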

You can't really have it both ways: be space efficient and also have O(1) lookup of child nodes.
When you only allocate space for the entries that are actually added, and not for the null pointers, you can no longer do
children[(unsigned char)'c']
as you can no longer index directly into the array.
One alternative is to simply do a linear search through the children and store an additional count of how many entries the children array has, i.e.
children[(unsigned char)'c'] = ...;
has to become
for (i = 0; i < len; i++) {
    if (children[i]->val == 'c')   /* compare the stored character, not the pointer */
        break;
}
if (i == len) {
    /* ...reallocate and add space for one more item in children */
}
children[i] = ...;
If your tree ends up with a lot of non-empty entries at one level, you might insert the children in sorted order and do a binary search. Or you might store the children as a linked list instead of an array.
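As a rough sketch of that approach, assuming the node is changed to keep a children pointer and a count (names are illustrative):
#include <stdlib.h>

struct node {
    char val;
    int key;
    int len;                 /* number of children currently stored */
    struct node **children;  /* grown one slot at a time with realloc */
};

/* return the child storing c, creating it if necessary; NULL on allocation failure */
struct node *get_or_add_child(struct node *parent, char c)
{
    for (int i = 0; i < parent->len; i++)
        if (parent->children[i]->val == c)
            return parent->children[i];

    struct node **grown = realloc(parent->children,
                                  (parent->len + 1) * sizeof *grown);
    if (grown == NULL)
        return NULL;
    parent->children = grown;

    struct node *child = calloc(1, sizeof *child);   /* children = NULL, len = 0 */
    if (child == NULL)
        return NULL;
    child->val = c;
    child->key = -1;                                 /* -1 marks "not an end of word" here */
    parent->children[parent->len++] = child;
    return child;
}
Growing by one slot per insert keeps the memory footprint minimal; growing geometrically would reduce the number of realloc calls at the cost of some slack space.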

If you just want to do an English keyword search, I think you can shrink your children array from 256 entries to just 26 - enough to cover the 26 letters a-z.
Furthermore, you can use a linked list so that only the children actually present are stored, which also makes iteration more efficient.
I haven't gone through the libraries yet, but I think an existing trie implementation will help.
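For lowercase a-z only, the smaller node might look something like this (a sketch; it assumes input restricted to 'a'..'z'):
struct node26 {
    int key;                      /* key value if this node ends a word, else -1 */
    struct node26 *children[26];  /* one slot per lowercase letter */
};

/* index a child by letter; the caller must guarantee c is in 'a'..'z' */
struct node26 *child_for(struct node26 *n, char c)
{
    return n->children[c - 'a'];
}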

You can be both space efficient and keep the constant lookup time by making child nodes of every node a hash table of nodes. Especially when Unicode characters are involved and the set of characters you can have in your dictionary is not limited to 52 + some, this becomes more of a requirement than a nicety. This way you can keep the advantages of using a trie and be time and space efficient at the same time.
I must also add that if the character set you are using approaches unbounded, chances are a linked list of child nodes will do just fine. If you like an unmanageable nightmare, you can opt for a hybrid approach where the first few levels keep their children in hash tables while the lower levels have a linked list of them. For a true bug farm, opt for a dynamic one where, as each linked list passes a threshold, you convert it to a hash table on the fly. You could easily amortize the cost.
Possibilities are endless!
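As a minimal sketch of the hash-table-of-children idea, each node could own a small chained hash table (the bucket count and field names here are illustrative, not tuned):
#include <stddef.h>

#define CHILD_BUCKETS 16                      /* small power of two; purely illustrative */

struct hnode {
    unsigned int cp;                          /* character (or code point) on the incoming edge */
    int key;                                  /* key value if this node ends a word, else -1 */
    struct hnode *buckets[CHILD_BUCKETS];     /* chained hash table of children */
    struct hnode *next_in_bucket;             /* collision chain inside the parent's table */
};

/* expected constant-time child lookup by code point */
struct hnode *find_child(struct hnode *n, unsigned int cp)
{
    struct hnode *c = n->buckets[cp % CHILD_BUCKETS];
    while (c != NULL && c->cp != cp)
        c = c->next_in_bucket;
    return c;                                 /* NULL if no child stores cp */
}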

Related

How will I determine if a certain index in my hash table has no value yet?

I'm trying to create a program that reads a file filled with dictionary words and stores every word in a hash table. I already have a hash function. Say the hash function returns an index of 123: how will I be able to determine whether that index has no value yet? And if that index already has a value, should I make the word the new head of the list or add it to the end of the list? Should I initialize the whole array first to something like NULL, since an uninitialized variable contains a garbage value? Does that work the same way with arrays of structs?
typedef struct node
{
    char word[LENGTH + 1];
    struct node *next;
}
node;

// Number of buckets in hash table
// N = 2 ^ 13
const unsigned int N = 8192;

// Hash table
node *table[N];
This is part of my code; LENGTH here is defined above with the value of 45.
how will I be able to determine if that index right there has no value yet
The "slots" in your table are linked lists. The table stores pointers to the head nodes of these linked lists. If that pointer is NULL, the list is empty, but you don't need to make it a special case: When you look up a word, just walk the list while the pointer to the next node is not null. If the pointer to the head node is null, your walk is stopped short early, that's all.
should I just make the word the new head of the list or should I add it to the end of the list?
It shouldn't really matter. The individual lists at the nodes are supposed to be short. The idea of the hash table is to turn a linear search on all W words into a faster linear search on W/N words on average. If you see that your table has only a few long lists, your hash function isn't good.
You must walk the list once to ensure that you don't insert duplicates anyway, so you can insert at the end. Or you could try to keep each linked list alphabetically sorted. Pick one method and stick with it.
Should I initialize the whole array first to something like "NULL" because if a variable wasn't initialized it contains garbage value, does that work the same with arrays from a struct.
Yes, please initialize your array of head node pointers to NULL, so that the hash table is in a defined state. (If your array is at file scope or static, the table should be initialized to null pointers already, but it doesn't hurt to make the initialization explicit.)
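A minimal sketch of lookup and head insertion against a NULL-initialized table, using your existing hash() (assumed to return a value below N):
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// true if word is already in the table; an empty bucket just means the loop body never runs
bool contains(const char *word)
{
    for (node *p = table[hash(word)]; p != NULL; p = p->next)
        if (strcmp(p->word, word) == 0)
            return true;
    return false;
}

// insert at the head of the bucket's list (order within a bucket doesn't matter)
bool insert(const char *word)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
        return false;
    strcpy(n->word, word);          // assumes strlen(word) <= LENGTH
    unsigned int i = hash(word);
    n->next = table[i];
    table[i] = n;
    return true;
}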

Simple redundancy in a linked list

So I have the following structure:
typedef struct listElement
{
    element value;
    struct listElement *next;
} listElement, *List;
element is not a known type, meaning I don't know exactly what data type I'm dealing with, whether it's integers or floats or strings.
The goal is to make a function that deletes the listElements whose value appears more than twice (meaning a value can only appear zero times, once, or twice, not more).
I've already made a function that uses brute force, with a nested loop, but that's a cluster**** as I'm dealing with a large number of elements in my list (going through every element and comparing it to the rest of the elements in the list).
I was wondering if there was a better solution that uses fewer instructions and has lower complexity.
You can use a hash table that maps each element to its count.
If hashTable[element] (the count for this particular element) already equals 2, delete the current element.
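A rough sketch of that single pass, assuming purely for illustration that element is an int and using a fixed-size chained hash table (any element type works with a suitable hash function; error handling omitted):
#include <stdlib.h>

#define BUCKETS 1024                  /* illustrative size */

struct count { int value; int n; struct count *next; };
static struct count *counts[BUCKETS];

/* increment and return the running count for this value */
static int bump(int value)
{
    unsigned int b = (unsigned int)value % BUCKETS;
    for (struct count *c = counts[b]; c != NULL; c = c->next)
        if (c->value == value)
            return ++c->n;
    struct count *c = malloc(sizeof *c);
    c->value = value;
    c->n = 1;
    c->next = counts[b];
    counts[b] = c;
    return 1;
}

/* single pass over the list: unlink any node whose value has already appeared twice */
void drop_extras(List *head)
{
    List *link = head;
    while (*link != NULL) {
        if (bump((*link)->value) > 2) {
            List dead = *link;
            *link = dead->next;
            free(dead);
        } else {
            link = &(*link)->next;
        }
    }
}
This is O(n) expected time instead of the O(n^2) nested loop, at the cost of the memory used by the counts.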

Best algo to retrieve elements from random id

I'm currently trying to find the best data structure / algorithm that fits to my case:
I receive single unique random ids (uint32_t), and I create an element associated to each new id.
I need to retrieve elements from the id.
I also need to access the next and the previous element from any element (or even id) in the order of creation. The order of creation mainly depends on the current element, which is always accessible separately, so the new element should become its next.
Here is an example:
(12) <-> (5) <-> (8) <-> (1)
^ ^
'------------------------'
If I suppose the current element to be (8) and a new element (3) is created, it should look like:
(12) <-> (5) <-> (8) <-> (3) <-> (1)
^ ^
'--------------------------------'
An important thing to consider is that insertion, deletion and search happen with almost the same (high) frequency. Not completely sure about how many elements will live at the same time, but I would say max ~1000.
Knowing all of this, I think about using an AVL with ids as the sorted keys, keeping the previous and the next element too.
In C language, something like this:
struct element {
    uint32_t id;
    /* some other fields */
    struct element *prev;
    struct element *next;
};

struct node {
    struct element *elt;
    struct node *left;
    struct node *right;
};
static struct element* current;
Another idea may be to use a hash map, but then I would need to find the right hash function. I'm not completely sure it always beats the AVL in practice for this number of elements; it depends on the hash function anyway.
Is the AVL a good idea or should I consider something else for this case?
Thanks !
PS: I'm not a student trying to make you do my homework, I'm just trying to develop a simple window manager (just for fun).
You are looking for some variation of what's called a LinkedHashMap in Java.
This is basically a combination of a hash-table and a (bi-directional) linked list.
The linked list holds all the elements in the desired order. Inserting an element at a known location (assuming you have a pointer to the correct location) is done in O(1). Same goes for deletion.
The second data structure is the hash map (or tree map). This data structure maps from a key (your unique id) to a POINTER into the linked list. This way, given an id, you can quickly find its location in the linked list, and from there you can easily access the next and previous elements.
high level pseudo code for insertion:
insert(x, v, y):   // insert key=x, value=v, after the element with key=y
    if x is in hash-table:
        abort
    p = find(hash-table, y)              // p is a pointer to y's list node
    q = insert_to_list_after(x, v, p)    // insert key=x, value=v right after p; q is the new list node
    add(hash-table, x, q)                // map x to its own list node
high level pseudo code for search:
search(x):
    if x is not in hash-table:
        abort
    p = find(hash-table, x)
    return p->value
Deletion should be very similar to insertion (and with the same time complexity).
Note that it is also fairly easy to find the element that comes after x:
p = find(hash-table, x)
if (p != NULL && p->next != NULL):
    return p->next->value
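In C, with at most around 1000 live ids, even a small fixed-size chained hash table can serve as the map part. A sketch (sizes and names are illustrative, and the order list is kept non-circular with NULL ends for brevity):
#include <stdint.h>
#include <stdlib.h>

#define BUCKETS 2048                     /* illustrative; comfortably above the expected count */

struct element {
    uint32_t id;
    /* some other fields */
    struct element *prev, *next;         /* creation-order list */
    struct element *chain;               /* hash bucket collision chain */
};

static struct element *buckets[BUCKETS];
static struct element *current;          /* insertion happens right after this element */

static struct element *lookup(uint32_t id)
{
    for (struct element *e = buckets[id % BUCKETS]; e != NULL; e = e->chain)
        if (e->id == id)
            return e;
    return NULL;
}

/* create an element for a new id and splice it in right after 'current' */
static struct element *insert(uint32_t id)
{
    struct element *e = calloc(1, sizeof *e);
    if (e == NULL)
        return NULL;
    e->id = id;
    e->chain = buckets[id % BUCKETS];    /* hook into the map */
    buckets[id % BUCKETS] = e;
    if (current != NULL) {               /* hook into the creation-order list */
        e->next = current->next;
        e->prev = current;
        if (current->next != NULL)
            current->next->prev = e;
        current->next = e;
    }
    current = e;
    return e;
}
Deletion would unlink from both the bucket chain and the order list, which is also O(1) expected time once the element is found.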
My suggestion is that you use a combination of two data structures: a list to store the elements in the order they are inserted, and a hash map or binary search tree to implement an associative array (map) between the id and the list node. You will perform the search using the associative array and will be able to access neighboring elements using the list. Deletion is also relatively easy, but you need to delete from both structures.
Complexity of find/insert/delete will be O(log n) if you use a binary search tree, and expected complexity is constant if you use a hash table.
You should definitely consider the Skip List data structure.
It seems perfect for your case, because it has an expected O(log(n)) insert / search / delete and if you have a pointer to a node, you can find the previous and the next element in O(1) by just moving that pointer.
The conclusion is that if you've just created a node, you have a pointer to it, and you can find the prev/next element in O(1) time.

Arranging elements in C array so there are no gaps

I have a regular array of structs in C, in a program that runs every second and updates all the data in the structs. When a condition is met, one of the elements gets cleared and is reused as a free slot for a new element (in this case a timer) that might come in at any point.
What I do now is just scan all the elements of the array looking for active elements requiring updates. But even if the number of elements is small (<2000), I feel this wastes time going through the inactive ones. Is there a way I can keep the array gap-free so I only need to iterate through the currently allocated elements?
Assuming the specific order of the elements does not matter, it can be done very easily.
If you have your array A and the number of active elements N, you can then add an element E like this:
A[N++] = E;
and remove the element at index I like this:
A[I] = A[--N];
So how does this work? Well, it's fairly simple. We want the array to only store active elements, so we may assume that the array is like that when we start doing either of these things.
Adding an element will always put it at the end, and since all elements currently in the array, as well as the newly added element, will be active, we can safely add one to the end.
Removing an element is done by moving the last element to take over the array index of the element we want to remove. Thus, A[0..I-1] is active, as well as A[I+1..N-1], and by moving the last element A[N-1] to A[I] (and decreasing N by 1, since that last slot no longer exists), the entire range A[0..N-1] is active again.
If you're removing elements while iterating over them to update them, note that you can only increment your loop counter after processing an element which doesn't get removed, since otherwise, you would never process the moved elements.
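Here is a sketch of how this combines with the per-second update loop; struct timer, update() and expired() are placeholders for whatever your program actually does:
/* update all active timers; expired ones are removed by swapping in the last active one */
void update_timers(struct timer *timers, int *active_count)
{
    int i = 0;
    while (i < *active_count) {
        update(&timers[i]);                           /* hypothetical per-element work */
        if (expired(&timers[i])) {
            timers[i] = timers[--*active_count];      /* overwrite with the last active element */
            /* do not increment i: the swapped-in element still needs processing */
        } else {
            i++;
        }
    }
}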
Traversing 2,000 entries per second is negligible. It's really not worth optimizing. If you really feel you must, swap the inactive entry for the last active entry.
It doesn't sound like you have a great reason for not using a linked list. If you do the implementation well, you'll get O(1) inserts, O(1) removals, and you'll only ever need to keep (and iterate over) active structs. There'd be some memory overhead, but for even moderately sized structs, even a doubly-linked list would be pretty efficient. The nice thing about this approach is that you can keep elements in their insertion order without extra computational overhead.
A relatively simple way to accomplish this:
void remove(struct foo *foo_array, int *n)
{
    struct foo *src = foo_array, *dst = foo_array;
    int num_removed = 0;
    for (int i = 0; i < *n; ++i)
    {
        // Do we want to remove this? (should_remove() left as exercise for reader.)
        if (should_remove(src))
        {
            // yes, remove; advance src without advancing dst
            ++src;
            ++num_removed;
        }
        else if (src != dst)
        {
            // advance src and dst (with copy)
            *dst++ = *src++;
        }
        else
        {
            // advance both pointers (no copy)
            ++src;
            ++dst;
        }
    }
    // update size of array
    *n -= num_removed;
}
The idea is that you keep track of how many elements of the array are valid (*n here), and pass its pointer as an "in/out parameter". remove() decides which elements to remove and copies the ones that are out of place. Notice that this is O(n) regardless of how many elements are decided to be removed.
A few alternatives come to mind, choose according to your needs:
1) Leave it as it is unless you are having some performance issues or need to scale up.
2) Add a "next" pointer to each struct so it can be used as an element in a linked list. Keep two lists, one for the active structs and one for the unused ones. Depending on how you use the structs, also consider making the lists doubly linked. (You can also still keep the elements in an array if you need to index the structs, or you can stop using the array if not.)
3) If you don't need indices (or order) of the structs in the array to be constant, move unused entries to the end of the array. Then when you iterate through the array from the beginning, you can stop whenever you reach the first unused one. (You can store the index of the last active struct so that whenever a struct is deactivated you can just have it switch places with the last active one, and then decrement the index of the last active struct.)
How about adding a linked list behavior in your struct, i.e. a pointer member pointing to the next active element?
You would have to update these pointers on element activation and deactivation.
EDIT: This method is not suitable for dynamically resized arrays, because resizing may change the array's address in memory, invalidating the pointers used by the list.
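If the array may be reallocated, the same idea still works with indices instead of pointers. A sketch (field and function names are made up):
struct timer {
    /* ... timer data ... */
    int active;        /* nonzero if this slot is in use */
    int next_active;   /* index of the next active slot, or -1 at the end */
};

static struct timer *timers;      /* possibly realloc'ed array */
static int first_active = -1;     /* head of the active chain */

/* walk only the active slots, regardless of gaps in the array */
void update_all(void)
{
    for (int i = first_active; i != -1; i = timers[i].next_active)
        update(&timers[i]);       /* hypothetical per-element work */
}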

Sorting a list with qsort?

I'm writing a program in which you enter words via the keyboard or file and then they come out sorted by length. I was told I should use linked lists, because the length of the words and their number aren't fixed.
should I use linked lists to represent words?
struct node{
char c;
struct node *next;
};
And then how can I use qsort to sort the words by length? Doesn't qsort work with arrays?
I'm pretty new to programming.
Thank you.
I think there are bigger issues here than which sorting algorithm you should pick. The first of these is that the struct you're defining is not going to hold a list of words, but rather a list of single characters (i.e., a single word). Strings in C are represented as null-terminated arrays of characters, laid out like so:
| A | n | t | h | o | n | y | \0 |
This array would ideally be declared as char[8] - one slot for each letter, plus one slot for the null byte (literally one byte of zeros in memory.)
Now I'm aware you probably know this, but it's worth pointing this out for clarity. When you operate on arrays, you can look at multiple bytes at a time and speed things up. With a linked list, you can only look at things in truly linear time: step from one character to the next. This is important when you're trying to do something quickly on strings.
The more appropriate way to hold this information is in a style that is very C like, and used in C++ as vectors: automatically-resized blocks of contiguous memory using malloc and realloc.
First, we setup a struct like this:
struct sstring {
    char *data;
    int logLen;
    int allocLen;
};
typedef struct sstring string;
And we provide some functions for these:
// mallocs a block of memory and records its length in allocLen
void string_create(string *input);
// appends a character and moves up the null terminator;
// if running out of space (logLen == allocLen), realloc twice as much
void string_addchar(string *input, char c);
void string_delete(string *input);
Now, this isn't great because you can't just read into an easy buffer using scanf, but you can use a getchar()-like function to read single characters and place them into the string using string_addchar(), avoiding a linked list. The string avoids reallocation as much as possible, reallocating only when the length crosses a power of two, and you can still use the functions from the C string library on it! This helps a LOT with implementing your sorts.
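A possible sketch of the growth step, assuming string_create() allocated at least one byte and initialized logLen and allocLen (error handling omitted):
#include <stdlib.h>

void string_addchar(string *input, char c)
{
    if (input->logLen + 1 >= input->allocLen) {   /* keep room for the '\0' */
        input->allocLen *= 2;
        input->data = realloc(input->data, input->allocLen);
    }
    input->data[input->logLen++] = c;
    input->data[input->logLen] = '\0';            /* keep it a valid C string */
}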
So now how do I actually implement a sort with this? You can create a similar type intended to hold entire strings in a similar manner, growing as necessary, to hold the input strings from the console. Either way, all your data now lives in contiguous blocks of memory that can be accessed as an array - because it is an array! For example, say we've got this:
struct stringarray {
    string *data;
    int logLen;
    int allocLen;
};
typedef struct stringarray cVector;
cVector myData;
And similar functions as before: create, delete, insert.
The key here is that you can implement your sort functions using strcmp() on the string.data element since it's JUST a C string. Since we've got a built-in implementation of qsort that uses a function pointer, all we have to do is wrap strcmp() for use with these types and pass the address in.
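For the original goal of sorting by length, the comparator could compare lengths instead of calling strcmp(); here is a sketch using the types above:
#include <stdlib.h>

/* compare two string structs by the length of their contents, for qsort() */
static int by_length(const void *a, const void *b)
{
    const string *sa = a, *sb = b;
    return sa->logLen - sb->logLen;   /* use strcmp(sa->data, sb->data) for lexicographic order */
}

/* sort everything collected so far: */
/* qsort(myData.data, myData.logLen, sizeof(string), by_length); */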
If you know how you want the items sorted, you could use an insertion sort while reading the data, so that once all the input has been entered, all you have to do is write the output. Using a linked list would be OK, though you'll find that it has O(N^2) performance. If you store the input in a binary tree ordered by length (a balanced tree would be best), then your algorithm will have O(N log N) performance. If you're only going to do it once, then go for simplicity of implementation over efficiency.
Pseudocode:
list = new list
read line
while not end of file
    len = length(line)
    elem = head(list)
    while (elem != null and len > length(elem->value))
        elem = elem->next
    end
    insert line in list before elem   // or at the end if elem is null
    read line
end
// at this point the list's elements are sorted from shortest to longest,
// so just write it out in order
elem = head(list)
while (elem != null)
    output elem->value
    elem = elem->next
end
Yes, the classic C library function qsort() only works on an array, that is, a contiguous collection of values in memory.
Tvanfosson's advice is pretty good: as you build the linked list, you can insert each element at the correct position. That way, the list is always sorted.
I think the comment you made that you were told to use a linked list is interesting. Indeed, a list can be a good data structure in many instances, but it does have drawbacks; for example, it must be traversed to find elements.
Depending on your application, you may want to use a hash table. In C++ you could use a hash_set or a hash_map.
I would recommend you spend some time studying basic data structures. Time spent here will serve you well and better put you in a position to evaluate advice such as "use a linked list".
There are lots of ways to handle it... You can use arrays, via dynamic memory allocation, with realloc, if you feel brave enough to try.
The standard implementation of qsort, though, needs each element to be a fixed length, which would mean having an array-of-pointers-to-strings.
Implementing a linked list, though, should be easy, compared to using pointers to pointers.
I think what you were told to do was not to store each string as a list of characters, but to store the strings in a linked list:
struct node {
    char *string;
    struct node *next;
};
Then, all you have to do is, every time you read a string, add a new node into the list at its ordered place. (Walk the list until the current node's string length is greater than that of the string you just read.)
The problem of words not being a fixed length is common, and it's usually handled by storing the word temporarily in a buffer, and then copying it into an array of the proper length (dynamically allocated, of course).
Edit:
In pseudo code:
array = malloc(sizeof(char *))
array_size = 1
array_count = 0
while (buffer = read != EOF):
    if (array_count == array_size)
        realloc(array, array_size * 2)
        array_size = array_size * 2
    string_temp = malloc(strlen(buffer) + 1)
    copy buffer into string_temp
    array[array_count] = string_temp
    array_count++
qsort(array, array_count, sizeof(char *), comparison)
print array
Of course, that needs a TON of polishing. Remember that array is of type char **array, i.e. "a pointer to a pointer to char" (which you handle as an array of pointers); since you're passing pointers around, you can't just store the read buffer itself in the array, so each word needs its own copy.
You qsort a linked list by allocating an array of pointers, one per list element.
You then sort that array, where in the compare function you are of course receiving pointers to your list elements.
This then gives you a sorted list of pointers.
You then traverse your list by walking the array of pointers and adjusting each element in turn, rearranging its order in the list to match the order of your array of pointers.
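A sketch of that approach for the node struct shown above, sorting by string length (error handling kept minimal):
#include <stdlib.h>
#include <string.h>

/* qsort comparator: a and b point to struct node * elements of the pointer array */
static int compare_length(const void *a, const void *b)
{
    const struct node *na = *(const struct node *const *)a;
    const struct node *nb = *(const struct node *const *)b;
    return (int)strlen(na->string) - (int)strlen(nb->string);
}

/* sort a list of 'count' nodes by building, sorting, and re-threading a pointer array */
struct node *sort_list(struct node *head, size_t count)
{
    if (count < 2)
        return head;
    struct node **arr = malloc(count * sizeof *arr);
    if (arr == NULL)
        return head;                      /* allocation failed; list left untouched */
    size_t i = 0;
    for (struct node *p = head; p != NULL; p = p->next)
        arr[i++] = p;
    qsort(arr, count, sizeof *arr, compare_length);
    for (i = 0; i + 1 < count; i++)       /* re-link the nodes in sorted order */
        arr[i]->next = arr[i + 1];
    arr[count - 1]->next = NULL;
    head = arr[0];
    free(arr);
    return head;
}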
