How would I convert this recursive function to an iterative one? - c

I'm having trouble converting this recursive function, find_reachable(..), to its iterative equivalent. I have looked around and saw suggestions to use a stack, but I cannot figure it out. I also recognize that this function is tail recursive, but I don't know what to do with this information. The if statement has me particularly stumped. Any help is appreciated, thanks.
void find_reachable(struct person *current, int steps_remaining,
bool *reachable){
// mark current root person as reachable
reachable[person_get_index(current)] = true;
// now deal with this person's acquaintances
if (steps_remaining > 0){
int num_known = person_get_num_known(current);
for (int i = 0; i < num_known; i++){
struct person *acquaintance = person_get_acquaintance(current, i);
find_reachable(acquaintance, steps_remaining - 1, reachable);
}
}
}
.....
struct person {
int person_index; // each person has an index 0..(#people-1)
struct person ** known_people;
int number_of_known_people;
};
// return the person's unique index between zero and #people-1
int person_get_index(struct person * p) {
return p->person_index;
}
// return the number of people known by a person
int person_get_num_known(struct person * p) {
return p->number_of_known_people;
}
// get the index'th person known by p
struct person * person_get_acquaintance(struct person * p, int index) {
//fprintf(stderr, "index %d, num_known %d\n", index, p->number_of_known_people);
assert( (index >= 0) && (index < p->number_of_known_people) );
return p->known_people[index];
}

This looks like a depth-first search: it examines each person by examining that person's first acquaintance, then examining that acquaintance's first acquaintance, and so on before eventually backtracking. Recursion is generally a pretty good strategy for the depth-first search, and most iterative implementations do in fact use a stack to record the addresses of nodes higher up in the tree in order to backtrack to them later. The iterative depth-first search goes like this:
1. Let S be a stack, initially containing only the root of the graph being searched.
2. Pop from the stack into variable v.
3. Push to S all children (or, as they are called in this case, "acquaintances") of v.
4. If the stack is empty, terminate; otherwise, go to step 2.
The simplest way to implement stacks in C is to use a singly linked list, like so:
struct person_stack;
struct person_stack {
struct person *who;
struct person_stack *next;
};
struct person *person_stack_pop(struct person_stack **s) {
struct person_stack *old_top = *s;
struct person *who = old_top->who;
*s = old_top->next; /* note: *s->next would parse as *(s->next) */
free(old_top);
return who;
}
struct person_stack *person_stack_push(struct person_stack **s, struct person *p) {
struct person_stack *new_top = malloc(sizeof (struct person_stack));
new_top->next = *s;
new_top->who = p;
*s = new_top;
return *s;
}
There is one complication here, though! Your function only searches to a given depth. This is the reason why that if statement is there in the first place: to terminate the recursion when the search has gone deep enough. The regular DFS backtracks only when it runs out of children to search, so you'll have to add some extra logic to make it aware of its distance from the root.
You may also want to make sure that an acquaintance is only pushed into the stack if it is not already in the stack. This will save you from redundant iterations—think about what would happen if many of these people have mutual acquaintances.
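Putting the pieces above together, here is a minimal sketch of a depth-limited iterative version. It assumes each stack entry stores the person together with the steps still allowed from that person; the names struct frame, push, pop and find_reachable_iter are made up for this example, and the struct fields are accessed directly rather than through the question's getter functions. It deliberately does no duplicate checking, so it visits exactly what the recursive version visits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct person {
    int person_index;
    struct person **known_people;
    int number_of_known_people;
};

/* One pending "call": a person plus the steps still allowed from there.
   (Hypothetical helper type, not part of the original code.) */
struct frame {
    struct person *who;
    int steps_remaining;
    struct frame *next;
};

static void push(struct frame **top, struct person *who, int steps) {
    struct frame *f = malloc(sizeof *f);
    if (f == NULL)
        abort();                 /* out of memory: give up in this sketch */
    f->who = who;
    f->steps_remaining = steps;
    f->next = *top;
    *top = f;
}

static struct frame pop(struct frame **top) {
    assert(*top != NULL);
    struct frame *old = *top;
    struct frame copy = *old;    /* copy the payload before freeing the node */
    *top = old->next;
    free(old);
    return copy;
}

void find_reachable_iter(struct person *start, int steps_remaining,
                         bool *reachable) {
    struct frame *stack = NULL;
    push(&stack, start, steps_remaining);
    while (stack != NULL) {
        struct frame f = pop(&stack);
        reachable[f.who->person_index] = true;
        /* Same test as the recursive version: only descend while steps remain. */
        if (f.steps_remaining > 0) {
            for (int i = 0; i < f.who->number_of_known_people; i++)
                push(&stack, f.who->known_people[i], f.steps_remaining - 1);
        }
    }
}
```

If you add the "skip people already seen" optimization, be careful: with a depth limit, a person first reached with few steps left may later be reached with more steps left, so a plain visited flag can cut the search short. Remembering the largest steps_remaining seen per person is one way around that.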

Full transparency, I don't know C too well, so I used pseudocode where I didn't know the syntax, but something like this might work.
The general idea of converting recursion to iteration often requires adding objects (in this case, people) that need processing to the stack. You add these objects on some condition (in this case, if they're not too deep in the web of acquaintances).
You then iterate as long as the stack is not empty, popping off the top element and processing it. The 'processing' step usually also consists of adding more elements to the stack.
You are effectively mimicking the 'call stack' which results from a recursive function.
There's more involved in this if you care about the order in which elements are processed, but in this case, it doesn't seem like you do since they're only marked as 'reachable'.
void find_reachable(struct person *current, int steps_remaining, bool *reachable){
    stack = /* create a new stack which holds (person, int) tuples or structs */
    stack.push(/* current & steps_remaining */)
    while (/* stack is not empty */) {
        currentStruct = /* pop the top of stack */
        current = currentStruct.person
        depth = currentStruct.depth   // steps still allowed from this person
        reachable[person_get_index(current)] = true;
        if (depth <= 0) {
            // no steps left, so don't add this person's acquaintances
            continue;
        }
        int num_known = person_get_num_known(current);
        for (int i = 0; i < num_known; i++){
            struct person *acquaintance = person_get_acquaintance(current, i);
            stack.push(/* acquaintance & depth - 1 */)
        }
    }
}
EDIT: Improved code

Related

compact multiple-array implementation of doubly linked list with O(1) insertion and deletion

I am confused about my solution to an exercise (10.3-4) in CLRS (Cormen Intro to Algorithms 3ed). My implementation seems to be able to perform deletion + de-allocation in O(1) time, while two solutions I have found online both require O(n) time for these operations, and I want to know who is correct.
Here's the text of the exercise:
It is often desirable to keep all elements of a doubly linked list compact in storage, using, for example, the first m index locations in the multiple-array representation. (This is the case in a paged, virtual-memory computing environment.) Explain how to implement the procedures ALLOCATE OBJECT and FREE OBJECT so that the representation is compact. Assume that there are no pointers to elements of the linked list outside the list itself. (Hint: Use the array implementation of a stack.)
By "multiple-array representation", they are referring to an implementation of a linked list using next, prev, and key arrays, with indices acting as pointers stored in the arrays rather than objects with members pointing to next and prev. That particular implementation was discussed in the text of Section 10.3 of CLRS, while this particular exercise seems to be simply imposing the additional condition of having the elements be "compact", or, as I understand it, packed into the beginning of the arrays, without any gaps or holes with "inactive" elements.
There was a previous thread on the same exercise here, but I couldn't figure out what I want to know from that thread.
The two solutions I found online are the first one here and the second one here, on page 6 of the PDF. Both solutions say to shift all elements after a gap down by one in order to fill the gap, taking O(n) time. My own implementation instead simply takes the last "valid" element in the array and uses it to fill any gap that is created, which happens only when elements are deleted. This maintains the "compactness" property. Of course, the appropriate prev and next "pointers" are updated, and this is O(1) time. Additionally, the ordinary implementation from Sec. 10.3 in the book, which does not require compactness, had a variable named "free" which pointed to the beginning of a second linked list containing all the "non-valid" elements, which are available to be written over. For my implementation, since any insertion must be done at the earliest available (i.e. non-valid) array slot, I simply had my variable "free" act more like the variable "top" in a stack. This seemed so obvious that I'm not sure why both of those solutions called for an O(n) "shift down everything after the gap" method. So which one is it?
Here is my C implementation. As far as I know, everything works and takes O(1) time.
typedef struct {
int *key, *prev, *next, head, free, size;
} List;
const int nil = -1;
List *new_list(size_t size){
List *l = malloc(sizeof(List));
l->key = malloc(size*sizeof(int));
l->prev = malloc(size*sizeof(int));
l->next = malloc(size*sizeof(int));
l->head = nil;
l->free = 0;
l->size = size;
return l;
}
void destroy_list(List *l){
free(l->key);
free(l->prev);
free(l->next);
free(l);
}
int allocate_object(List *l){
if(l->free == l->size){
printf("list overflow\n");
exit(1);
}
int i = l->free;
l->free++;
return i;
}
void insert(List *l, int k){
int i = allocate_object(l);
l->key[i] = k;
l->next[i] = l->head;
if(l->head != nil){
l->prev[l->head] = i;
}
l->prev[i] = nil;
l->head = i;
}
void free_object(List *l, int i){
if(i != l->free-1){
l->next[i] = l->next[l->free-1];
l->prev[i] = l->prev[l->free-1];
l->key[i] = l->key[l->free-1];
if(l->head == l->free-1){
l->head = i;
}else{
l->next[l->prev[l->free-1]] = i;
}
if(l->next[l->free-1] != nil){
l->prev[l->next[l->free-1]] = i;
}
}
l->free--;
}
void delete(List *l, int i){
if(l->prev[i] != nil){
l->next[l->prev[i]] = l->next[i];
}else{
l->head = l->next[i];
}
if(l->next[i] != nil){
l->prev[l->next[i]] = l->prev[i];
}
free_object(l, i);
}
Your approach is correct.
The O(n) "shift-everything-down" solution is also correct in the sense that it meets the specification of the problem, but clearly your approach is preferable from a runtime perspective.

C - Expanding an array of structs with pointers already existing inside it

I have an array of structs that is declared as such
typedef struct bucket{
char * value;
char * key;
}BUCKET;
typedef struct item{
struct bucket * data;
struct item * next;
struct item * prev;
}ITEM;
typedef struct base{
struct item * first;
}BASE;
typedef BASE *SPACE;
It works perfectly for everything that I had to do with it. Basically I have to do an implementation of a hashmap in C. I managed to do it, but I am completely stuck on this one task. I need to make the hashmap resizable by the user.
If I want a hashmap of size 5, I do so:
SPACE *hashmap = malloc(sizeof(SPACE *) * 5);
and it works perfectly for the purpose of the program.
However, if I try to resize it using the following block of code:
void expandHashspace(SPACE *hashmap){
printf("Please enter how large you want the hashspace to be.\n");
printf("Enter a number between %d and 100. Enter any other number to exit.\n>",hashSpaceSize);
int temp = 0;
scanf("%d",&temp);
if(temp>100 || temp<hashSpaceSize){
printf("Exiting...\n");
}
else {
SPACE *nw = NULL;
nw = realloc(hashmap, sizeof(SPACE *) * temp);
hashmap = nw;
hashSpaceSize = temp;
printf("Your hashspace is now %d rows long.\n", hashSpaceSize);
}
}
It also works properly. However, when I go to utilise the hashmap itself, it ends up with a segmentation fault. Or SIGSEGV Signal 11.
For example, I have the following display function.
void displayHashspace(SPACE *hashmap){
printf("\n");
int j = 0;
for(int i = 0; i < hashSpaceSize && hashmap; i++){
BASE *linkedList = hashmap[i];
if(linkedList) {
ITEM *node = linkedList->first;
printf("\n[HASH %d]\n", i);
while (node) {
printf("\t[BUCKET %d]\n\t[VALUE] : %s\n\t[KEY] : %s\n\n",j, node->data->value, node->data->key);
node = node->next;
j++;
etc...
Using CLion's debugging, I realised this:
Let's say the hashmap size is 3. That would mean that only hashmap[0-2] exist.
If I resize the hashmap to, let's say 10, it allows me to resize.
However, while displaying, the address of hashmap[3] is really weird.
Whereas every other address is pretty long, with almost 8 digits or more, the address of hashmap[3] is always 0x21.
After this, once it reaches ITEM *node = linkedList->first; with linkedList being hashmap[3], the segmentation fault occurs.
Here's another example. Here's my saving function:
void saveHash(SPACE *hashmap){
FILE *f = fopen("hashmap.hsh","w");
fprintf(f,"%d\n",hashSpaceSize);
for(int i = 0; i < hashSpaceSize;i++){
if(hashmap[i]){
ITEM *save = hashmap[i]->first;
do{
fprintf(f,"---\n%s\n%s\n",save->data->value,save->data->key);
save = save->next;
}while(save);
etc...
Here, the story is different. It can only reach hashmap[0] before crashing after the resizing. Using the debugger, I found that somehow, the save, which is set to hashmap[0]->first (which normally works before expanding), has a BUCKET whose VALUE variable is suddenly set to NULL for some reason, hence the crash.
I tried setting every "new" BASE after expansion to NULL, but the save function still breaks after using expandHashspace().
What am I doing wrong?
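Reallocating the memory inside expandHashspace() wasn't working because hashmap is a local variable in that function: the parameter is only a copy of the caller's pointer, so assigning the result of realloc to it never reaches the caller, which keeps using the old (possibly freed) block. Note also that realloc does not initialize the newly added slots, so they hold indeterminate values until they are set to NULL.
Returning the new hashmap pointer from the function, instead of returning nothing, and assigning it in the caller solved every problem.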
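As a rough illustration of that fix, here is a minimal sketch of a resize helper that hands the new pointer back to the caller. The function name resize_hashspace and its size parameters are invented for this example (the original code reads the size with scanf and keeps it in a global), and struct base is left opaque because only pointers to it are stored.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct base BASE;   /* full definition is in the question; opaque here */
typedef BASE *SPACE;

/* Hypothetical helper: grows the table from old_size to new_size slots.
   Returns the new block, or NULL on failure (the old block then stays valid). */
SPACE *resize_hashspace(SPACE *hashmap, size_t old_size, size_t new_size) {
    assert(new_size > 0 && new_size >= old_size);
    SPACE *nw = realloc(hashmap, new_size * sizeof *nw);
    if (nw == NULL)
        return NULL;
    /* realloc leaves the added slots uninitialized, so clear them. */
    for (size_t i = old_size; i < new_size; i++)
        nw[i] = NULL;
    return nw;
}
```

The caller would then write something like hashmap = resize_hashspace(hashmap, old_size, new_size); after checking the result for NULL, so that its own pointer is the one that gets updated.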

C Function returning pointer to garbage memory [duplicate]

This question already has answers here:
How to access a local variable from a different function using pointers?
I am writing a program that, given a set of inputs and outputs, figures out what the equation is. The way the program works is by randomly generating binary trees and putting them through a genetic algorithm to see which is the best.
All the functions I have written work individually, but there is either one or two that do not.
In the program I use two structs, one for a node in the binary tree and the other to keep track of how accurate each tree is given the data (its fitness):
struct node {
char value;
struct node *left, *right;
};
struct individual {
struct node *genome;
double fitness;
};
One function I use to randomly create trees is a subtree crossover function, which randomly merges two trees, returning two trees that are sort of a mixture of each other. The function is as follows:
struct node **subtree_crossover(struct node parent1, struct node parent2) {
struct node *xo_nodes[2];
for (int i = 0; i < 2; i++) {
struct node *parent = (i ? &parent2 : &parent1);
// Find the subtree at the crossover point
xo_nodes[i] = get_node_at_index(&parent, random_index);
}
else {
// Swap the nodes
struct node tmp = *xo_nodes[0];
*xo_nodes[0] = *xo_nodes[1];
*xo_nodes[1] = tmp;
}
struct node **parents = malloc(sizeof(struct node *) * 2);
parents[0] = &parent1;
parents[1] = &parent2;
return parents;
}
Another function I use takes two populations (lists of individuals) and selects the best from both, returning the next population. It is as follows:
struct individual *generational_replacement(struct individual *new_population,
int size, struct individual *old_population) {
int elite_size = 3;
struct individual *population = malloc(sizeof(struct individual) * (elite_size + size));
int i;
for (i = 0; i < size; i++) {
population[i] = new_population[i];
}
for (i; i < elite_size; i++) {
population[i] = old_population[i];
}
sort_population(population);
population = realloc(population, sizeof(struct individual) * size);
return population;
}
Then there is the function that essentially is the main part of the program. This functions loops through a population, randomly modifies them and chooses the best among them across multiple generations. From this, it selects the best individual (the highest fitness) and returns it. It is as follows:
struct individual *search_loop(struct individual *population) {
int pop_size = 10;
int tourn_size = 3;
int new_pop_i = 0;
int generation = 1;
struct individual *new_population = malloc(sizeof(struct individual) * pop_size);
while (generation < 10) {
while (new_pop_i < pop_size) {
// Insert code where random subtrees are chosen
struct node **nodes = subtree_crossover(random_subtree_1, random_subtree_2);
// Insert code to add the trees to new_population
}
population = generational_replacement(new_population, pop_size, population);
// Insert code to sort population by fitness value
}
return &population[0];
}
The issue I am having is that the search_loop function returns a pointer to an individual that is filled with garbage values. To narrow down the causes, I began to comment out code. By commenting out either subtree_crossover() or generational_replacement() the function returns a valid individual. Based on this, my guess is that the error is caused by either subtree_crossover() or generational_replacement().
Obviously, this is a heavily reduced version of the code I am using, but I believe it still will show the error that I am getting. If you would like to view the full source code, look in the development branch of this project: https://github.com/dyingpie1/pony_gp_c/tree/Development
Any help would be greatly appreciated. I have been trying to figure this out for multiple days.
Your subtree_crossover() function is taking two nodes as values. The function will receive copies, which will then live on the stack until the function exits, at which point they will become invalid. Unfortunately, the function later sticks their addresses into an array that it returns. Therefore, the result of subtree_crossover() is going to contain two invalid pointers to garbage data.
You could initialize parents as a struct node * instead of a struct node **, and make it twice the size of a struct node. Then, you could just copy the nodes into the array. This would avoid the issue. Alternatively, you could copy the nodes onto the heap, so that you could return a struct node **. You'd then have to remember to eventually free the copies, though.
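For what it's worth, the first suggestion might look roughly like the sketch below. The function name copy_pair is made up, and it takes the two nodes by pointer purely for the sake of the example; the point is only that the returned array owns its own copies on the heap instead of pointing at parameters that vanish on return.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    char value;
    struct node *left, *right;
};

/* Hypothetical helper: returns a malloc'd array holding copies of two nodes.
   The copies are shallow (children are shared); the caller must free() it. */
struct node *copy_pair(const struct node *a, const struct node *b) {
    assert(a != NULL && b != NULL);
    struct node *pair = malloc(2 * sizeof *pair);
    if (pair == NULL)
        return NULL;
    pair[0] = *a;   /* struct assignment copies the whole node */
    pair[1] = *b;
    return pair;
}
```

Because the array lives on the heap, it stays valid after the function returns, until the caller frees it.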

c - creating a linked list without malloc

In order to create a linked list (in which each node has a next and a previous pointer), I will be using pointers for the next and previous nodes. I was wondering if I could write the code without using malloc (allocating memory).
For example, instead of malloc-ing:
link *const l = (link *)malloc(sizeof(link));
if(l == NULL)
/* Handle allocation failure. */
...
l->data = d;
l->next = list->head;
head = l;
Can I simply create a new link variable with its attributes filled in (value, pointers to the next and previous links), and link the last link in my chain to this one?
My list is called b, for example.
link i;
i.data = d;
getlast(b).next = &i;
I apologize in advance for being new to C, and I will be more than glad to receive an honest solution :D
edit:
I tried using malloc to solve the matter. I will be glad if anyone could sort out the error in my code, as I cannot seem to find it.
#include <stdio.h>
#include <malloc.h>
struct Node{
int value;
struct Node * Next;
struct Node * Previous;
};
typedef struct Node Node;
struct List{
int Count;
int Total;
Node * First;
Node * Last;
};
typedef struct List List;
List Create();
void Add(List a,int value);
void Remove(List a,Node * b);
List Create()
{
List a;
a.Count=0;
return a;
}
void Add(List a,int value)
{
Node * b = (Node *)malloc(sizeof(Node));
if(b==NULL)
printf("Memory allocation error \n");
b->value=value;
if(a.Count==0)
{
b->Next=NULL;
b->Previous=NULL;
a.First=b;
}
else
{
b->Next=NULL;
b->Previous=a.Last;
a.Last->Next=b;
}
++a.Count;
a.Total+=value;
a.Last=b;
}
void Remove(List a,Node * b)
{
if(a.Count>1)
{
if(a.Last==b)
{
b->Previous->Next=NULL;
}
else
{
b->Previous->Next=b->Next;
b->Next->Previous=b->Previous;
}
}
free(b);
}
Yes - you can do that.
e.g.
link l1,l2;
l1.next = &l2;
l2.next = NULL;
Is a perfectly fine and valid linked list of 2 nodes.
You could also create a bunch of nodes, and link them together based on your needs, e.g. create a linked list of the argv:
int main(int argc, char *argv[])
{
    int i;
    link links[100];
    for (i = 0; i < argc && i < 100; i++) {
        //assuming the nodes can hold a char*
        links[i].data = argv[i];
        links[i].next = NULL;
        if (i > 0)
            links[i-1].next = &links[i];
    }
    /* links[0] is now the head of the list */
    return 0;
}
There are of course some drawbacks:
The number of nodes is determined at compile time in these examples. (In the last example, one could malloc a buffer for argc nodes instead of hardcoding 100, though.)
The lifetime of these nodes is the scope they are declared in; they no longer exist when that scope ends.
So you cannot do something like this:
void append_link(link *n, char *data)
{
link new_link;
n->next = &new_link;
new_link.next = NULL;
new_link.data = data;
}
That is invalid: when append_link returns, new_link is gone, and the passed-in n->next now points to a local variable that no longer exists. If new_link were malloc'ed instead, it would live beyond this function, and all would be OK.
Not really.
You could create a variable for each and every node in your list, but what happens when you want another node? Fifty more nodes? These variables also won't hang around after you've left the scope they were defined in, so any pointer to them is invalid once that scope ends. To avoid that, you'd either have to make everything global or use static storage and expose a pointer to it. These are both very ugly solutions.
If you don't understand what I mean by scope, here's a quick example:
int main() { /* Entering function scope. */
int x = 5;
{ /* Entering block scope. */
int y = 7;
printf("%d\n", y);
} /* Exiting block scope, all variables of this scope are gone. (y) */
printf("%d %d\n", x, y); /* Won't compile because y doesn't exist here. */
} /* Exiting function scope, all non-static storage variables are gone. (x) */
You could also create a global array, thinking that this gets around having a lot of different variables, but if your solution is to implement this using an array, why are you using a linked list and not an array? You've lost the benefits of a linked list by this point.
There are only two ways in C to create in-memory data structures that don't have a fixed-at-compile-time size:
with allocated storage duration, i.e. via malloc.
with automatic storage duration, which in terms of implementation, means "on the stack", either using variable-length arrays or recursion (so that you get a new instance at each level of recursion).
The latter (automatic storage) has the property that its lifetime ends when execution of the block where it's declared terminates, so it's difficult to use for long-lived data. There's also typically a bound on the amount of such storage you can obtain, and no way to detect when you've exceeded that bound (typically this results in a crash or memory corruption). So from a practical standpoint, malloc is the only way to make runtime-dynamic-sized data structures.
Note that in cases where your linked list does not need to have dynamic size (i.e. it's of fixed or bounded size) you can use static storage duration for it, too.
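To illustrate the recursion option mentioned above, here is a small sketch in which every list node is a local variable in its own stack frame. The names build_and_sum and sum_list are invented for this example. The list only exists while the recursion is active, which is exactly the lifetime limitation described above.

```c
#include <assert.h>
#include <stddef.h>

struct link {
    int data;
    const struct link *next;
};

/* Walks a list and adds up the data fields. */
static int sum_list(const struct link *head) {
    int total = 0;
    for (; head != NULL; head = head->next)
        total += head->data;
    return total;
}

/* Builds the list 1 -> 2 -> ... -> n with one node per stack frame,
   then uses it at the deepest level, while all nodes are still alive. */
int build_and_sum(int n, const struct link *tail) {
    assert(n >= 0);
    if (n == 0)
        return sum_list(tail);       /* every node is still in scope here */
    struct link node = { n, tail };  /* automatic storage, no malloc */
    return build_and_sum(n - 1, &node);
}
```

All work on the list has to happen inside the recursion; returning a pointer to any of these nodes to the original caller would leave it dangling.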
Memory for new nodes has to come from somewhere. You can certainly create individual variables and link them manually:
link a, b, c;
...
a.next = &b;
b.next = &c;
c.next = NULL;
As you can imagine, this approach doesn't scale; if you want more than 3 elements in your list, you'd have to allocate more than 3 link variables. Note that the following won't work:
void addToList( link *b )
{
link new;
...
b->next = &new;
}
because new ceases to exist when addToList exits, so that pointer is no longer meaningful [1].
What you can do is use an array of link as your "heap", and allocate from that array. You'll need to keep track of which elements are available for use; an easy way of doing that is initializing the array so that each a[i] points to a[i+1] (except for the last element, which points to NULL), then have a pointer which points to the first available element. Something like the following:
// You really want your "heap" to have static storage duration
static link a[HEAP_SIZE];
// Initialize the "heap"
for ( size_t i = 0; i < HEAP_SIZE - 1; i++ )
a[i].next = &a[i+1];
a[HEAP_SIZE - 1].next = NULL;
// Set up the freeList pointer; points to the first available element in a
link *freeList = &a[0];
// Get an element from the "heap"
link *newNode = freeList;
freeList = freeList->next;
newNode->next = NULL;
// Add a node back to the "heap" when you're done with it:
deletedNode->next = freeList;
freeList = deletedNode;
Again, you're limited in how many list nodes you can create, but this way you can create a large enough "heap" to satisfy your requirements.
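The fragments above can be collected into a small self-contained sketch. The function names pool_init, pool_alloc and pool_free, the POOL_SIZE value, and the int data field are assumptions made for this example; the free-list idea itself is the one described above.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4   /* hypothetical capacity; pick what your program needs */

struct link {
    int data;
    struct link *next;
};

static struct link pool[POOL_SIZE];    /* the "heap", static storage duration */
static struct link *free_list = NULL;  /* first available element, or NULL */

/* Chain every element to the next one; the last one ends the chain. */
void pool_init(void) {
    for (size_t i = 0; i + 1 < POOL_SIZE; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_SIZE - 1].next = NULL;
    free_list = &pool[0];
}

/* Take the first free element, or return NULL if the pool is exhausted. */
struct link *pool_alloc(void) {
    struct link *n = free_list;
    if (n == NULL)
        return NULL;
    free_list = n->next;
    n->next = NULL;
    return n;
}

/* Put an element back at the front of the free list. */
void pool_free(struct link *n) {
    assert(n != NULL);
    n->next = free_list;
    free_list = n;
}
```

Unlike the fragment above, pool_alloc checks for an empty free list, so running out of nodes shows up as a NULL return rather than a crash.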
[1] Obviously, the physical memory location that new occupied still exists, but it's now free to be reused (for example, by later function calls), so the value stored at that address will no longer be what you expect.

Hashing with large data sets and C implementation

I have a large number of values ranging from 0 - 5463458053. To each value, I wish to map a set containing strings so that the operation lookup, i. e. finding whether a string is present in that set takes the least amount of time. Note that this set of values may not contain all values from (0 - 5463458053), but yes, a large number of them.
My current solution is to hash those values (between 0 - 5463458053) and for each value, have a linked list of strings corresponding to that value. Every time, I want to check for a string in a given set, I hash the value(between 0 - 5463458053), get the linked list, and traverse it to find out whether it contains the aforementioned string or not.
While this might seem easier, it's a little time consuming. Can you think of a faster solution? Also, collisions will be dreadful. They'll lead to wrong results.
The other part is about implementing this in C. How would I go about doing this?
NOTE: Someone suggested using a database instead. I wonder if that'll be useful.
I'm a little worried about running out of RAM naturally. :-)
You could have a hash table of hash sets. The first hash table has your integers as keys. The values inside it are hash sets, i.e. hash tables whose keys are strings.
You could also have a single hash set whose keys are pairs of an integer and a string.
There are many libraries implementing such data structures (in C++, the standard library provides them: std::unordered_map and std::unordered_set are hash-based, while std::map and std::set are ordered trees). For C, I was thinking of GLib from GTK.
With hashing techniques, memory use is proportional to the size of the considered sets (or relations). For instance, you could accept 30% emptiness rate.
Large number of strings + fast lookup + limited memory ----> you want a prefix trie, crit-bit tree, or anything of that family (many different names for very similar things, e.g. PATRICIA... Judy is one such thing too). See for example this.
These data structures allow for prefix compression, so they are able to store a lot of strings (which somehow necessarily will have common prefixes) very efficiently. Also, lookup is very fast. Due to caching and paging effects that the common big-O notation does not account for, they can be as fast as, or even faster than, a hash, at a fraction of the memory (even though according to big-O, nothing except maybe an array can beat a hash).
A Judy Array, with the C library that implements it, might be exactly the base of what you need. Here's a quote that describes it:
Judy is a C library that provides a state-of-the-art core technology
that implements a sparse dynamic array. Judy arrays are declared
simply with a null pointer. A Judy array consumes memory only when it
is populated, yet can grow to take advantage of all available memory
if desired. Judy's key benefits are scalability, high performance, and
memory efficiency. A Judy array is extensible and can scale up to a
very large number of elements, bounded only by machine memory. Since
Judy is designed as an unbounded array, the size of a Judy array is
not pre-allocated but grows and shrinks dynamically with the array
population. Judy combines scalability with ease of use. The Judy API
is accessed with simple insert, retrieve, and delete calls that do not
require extensive programming. Tuning and configuring are not required
(in fact not even possible). In addition, sort, search, count, and
sequential access capabilities are built into Judy's design.
Judy can be used whenever a developer needs dynamically sized arrays,
associative arrays or a simple-to-use interface that requires no
rework for expansion or contraction.
Judy can replace many common data structures, such as arrays, sparse
arrays, hash tables, B-trees, binary trees, linear lists, skiplists,
other sort and search algorithms, and counting functions.
If the entries are from 0 to N and consecutive: use an array. (Is indexing fast enough for you?)
EDIT: the numbers do not seem to be consecutive. There is a large number of {key,value} pairs, where the key is a big number (>32 bits but < 64 bits) and the value is a bunch of strings.
If memory is available, a hash table is easy. If the bunch of strings is not too large, you can inspect them sequentially. If the same strings occur (much) more than once, you could enumerate the strings: put pointers to them in a char * array[] and use the index into that array instead. (Finding the index given a string probably involves another hash table.)
For the "master" hashtable an entry would probably be:
struct entry {
struct entry *next; /* for overflow chain */
unsigned long long key; /* the 33bits number */
struct list *payload;
} entries[big_enough_for_all] ; /* if size is known in advance
, preallocation avoids a lot of malloc overhead */
If you have enough memory to store a heads array, you could certainly do that:
struct entry *heads[SOME_SIZE] = {NULL, };
Otherwise, you can combine the heads array with the array of entries (like I did in "Lookups on known set of integer keys" here).
Handling collisions is easy: as you walk the overflow chain, just compare your key with the key in the entry. If they are unequal: walk on. If they are equal: found; now go walking the strings.
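A minimal sketch of that chain walk, assuming a fixed number of heads and a trivial modulo hash; NUM_HEADS, slot_for, table_insert and table_find are names made up for this example, and the payload is reduced to a single string pointer to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_HEADS 1024u   /* hypothetical table size */

struct entry {
    struct entry *next;    /* overflow chain */
    uint64_t key;          /* the 33-bit number */
    const char *payload;   /* stand-in for the list of strings */
};

static struct entry *heads[NUM_HEADS];

/* Trivial hash for illustration only; a real table may want better mixing. */
static size_t slot_for(uint64_t key) {
    return (size_t)(key % NUM_HEADS);
}

/* Link a caller-owned entry at the front of its chain. */
void table_insert(struct entry *e) {
    assert(e != NULL);
    size_t s = slot_for(e->key);
    e->next = heads[s];
    heads[s] = e;
}

/* Walk the chain; comparing the full key is what makes collisions harmless. */
struct entry *table_find(uint64_t key) {
    for (struct entry *e = heads[slot_for(key)]; e != NULL; e = e->next) {
        if (e->key == key)
            return e;
    }
    return NULL;
}
```

Two keys that land in the same slot simply share a chain, and because the lookup compares the stored key, a collision costs a few extra comparisons instead of producing a wrong result.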
You can use a single binary search tree (AVL/Red-black/...) to contain all the strings, from all sets, by keying them lexicographically as (set_number, string). You don't need to store sets explicitly anywhere. For example, the comparator defining the order of nodes for the tree could look like:
function compare_nodes (node1, node2) {
if (node1.set_number < node2.set_number) return LESS;
if (node1.set_number > node2.set_number) return GREATER;
if (node1.string < node2.string) return LESS;
if (node1.string > node2.string) return GREATER;
return EQUAL;
}
With such a structure, some common operations are possible (but maybe not straightforward).
To find whether a string s exists in the set set_number, simply lookup (set_number, s) in the tree, for an exact match.
To find all strings in the set set_number:
function iterate_all_strings_in_set (set_number) {
// Traverse the tree from root downwards, looking for the given key. Return
// wherever the search ends up, whether it found the value or not.
node = lookup_tree_weak(set_number, "");
// tree empty?
if (node == null) {
return;
}
// We may have gotten the greatest node from the previous set,
// instead of the first node from the set we're interested in.
if (node.set_number != set_number) {
node = successor(node);
}
while (node != null && node.set_number == set_number) {
do_something_with(node.string);
node = successor(node);
}
}
The above requires O((k+1)*log(n)) time, where k is the number of strings in set_number, and n is the number of all strings.
To find all set numbers with at least one string associated:
function iterate_all_sets ()
{
node = first_node_in_tree();
while (node != null) {
current_set = node.set_number;
do_something_with(current_set);
if (cannot increment current_set) {
return;
}
node = lookup_tree_weak(current_set + 1, "");
if (node.set_number == current_set) {
node = successor(node);
}
}
}
The above requires O((k+1)*log(n)) time, where k is the number of sets with at least one string, and n is the number of all strings.
Note that the above code assumes that the tree is not modified in the "do_something" calls; it may crash if nodes are removed.
Additionally, here's some real C code which demonstrates this, using my own generic AVL tree implementation. To compile it, it's enough to copy the misc/ and structure/ folders from the BadVPN source somewhere and add an include path there.
Note how my AVL tree does not contain any "data" in its nodes, and how it doesn't do any of its own memory allocation. This comes handy when you have a lot of data to work with. To make it clear: the program below does only a single malloc(), which is the one that allocates the nodes array.
#include <stdlib.h>
#include <stdio.h>
#include <string.h> // strcmp
#include <inttypes.h>
#include <assert.h>
#include <structure/BAVL.h>
#include <misc/offset.h>

struct value {
    uint32_t set_no;
    char str[3];
};

struct node {
    uint8_t is_used;
    struct value val;
    BAVLNode tree_node;
};

BAVL tree;

// orders values by set number first, then by string
static int value_comparator (void *unused, void *vv1, void *vv2)
{
    struct value *v1 = vv1;
    struct value *v2 = vv2;
    if (v1->set_no < v2->set_no) {
        return -1;
    }
    if (v1->set_no > v2->set_no) {
        return 1;
    }
    int c = strcmp(v1->str, v2->str);
    if (c < 0) {
        return -1;
    }
    if (c > 0) {
        return 1;
    }
    return 0;
}

static void random_bytes (unsigned char *out, size_t n)
{
    while (n > 0) {
        *out = rand();
        out++;
        n--;
    }
}

static void random_value (struct value *out)
{
    random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no));
    for (size_t i = 0; i < sizeof(out->str) - 1; i++) {
        out->str[i] = (uint8_t)32 + (rand() % 94);
    }
    out->str[sizeof(out->str) - 1] = '\0';
}

static struct node * find_node (const struct value *val)
{
    // find AVL tree node with an equal value
    BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val);
    if (!tn) {
        return NULL;
    }
    // get node pointer from pointer to its value (same as container_of() in Linux kernel)
    struct node *n = UPPER_OBJECT(tn, struct node, tree_node);
    assert(n->val.set_no == val->set_no);
    assert(!strcmp(n->val.str, val->str));
    return n;
}

static struct node * lookup_weak (const struct value *v)
{
    BAVLNode *tn = BAVL_Lookup(&tree, (void *)v);
    if (!tn) {
        return NULL;
    }
    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * first_node (void)
{
    BAVLNode *tn = BAVL_GetFirst(&tree);
    if (!tn) {
        return NULL;
    }
    return UPPER_OBJECT(tn, struct node, tree_node);
}

static struct node * next_node (struct node *node)
{
    BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node);
    if (!tn) {
        return NULL;
    }
    return UPPER_OBJECT(tn, struct node, tree_node);
}

size_t num_found;

static void iterate_all_strings_in_set (uint32_t set_no)
{
    struct value v;
    v.set_no = set_no;
    v.str[0] = '\0';
    struct node *n = lookup_weak(&v);
    if (!n) {
        return;
    }
    if (n->val.set_no != set_no) {
        n = next_node(n);
    }
    while (n && n->val.set_no == set_no) {
        num_found++; // "do_something_with_string"
        n = next_node(n);
    }
}

static void iterate_all_sets (void)
{
    struct node *node = first_node();
    while (node) {
        uint32_t current_set = node->val.set_no;
        iterate_all_strings_in_set(current_set); // "do_something_with_set"
        if (current_set == UINT32_MAX) {
            return;
        }
        struct value v;
        v.set_no = current_set + 1;
        v.str[0] = '\0';
        node = lookup_weak(&v);
        if (node->val.set_no == current_set) {
            node = next_node(node);
        }
    }
}

int main (int argc, char *argv[])
{
    size_t num_nodes = 10000000;

    // init AVL tree, using:
    // key=(struct node).val,
    // comparator=value_comparator
    BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL);

    printf("Allocating...\n");
    // allocate nodes (missing overflow check...)
    struct node *nodes = malloc(num_nodes * sizeof(nodes[0]));
    if (!nodes) {
        printf("malloc failed!\n");
        return 1;
    }

    printf("Inserting %zu nodes...\n", num_nodes);
    size_t num_inserted = 0;
    // insert nodes, giving them random values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        // choose random set number and string
        random_value(&n->val);
        // try inserting into AVL tree
        if (!BAVL_Insert(&tree, &n->tree_node, NULL)) {
            printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str);
            n->is_used = 0;
            continue;
        }
        n->is_used = 1;
        num_inserted++;
    }

    printf("Looking up...\n");
    // lookup all those values
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        struct node *lookup_n = find_node(&n->val);
        if (n->is_used) { // this node is the only one with this value
            ASSERT(lookup_n == n)
        } else { // this node was an insert collision; some other
                 // node must have this value
            ASSERT(lookup_n != NULL)
            ASSERT(lookup_n != n)
        }
    }

    printf("Iterating by sets...\n");
    num_found = 0;
    iterate_all_sets();
    ASSERT(num_found == num_inserted)

    printf("Removing all strings...\n");
    for (size_t i = 0; i < num_nodes; i++) {
        struct node *n = &nodes[i];
        if (!n->is_used) { // must not remove it if it wasn't inserted
            continue;
        }
        BAVL_Remove(&tree, &n->tree_node);
    }

    return 0;
}
