Very simple map implementation in C (for caching purposes)?

I have a program that reads URLs from a file and calls gethostbyname() on each URL's host. This call is quite time-consuming, so I want to cache the results.
Is there a very simple map-based code snippet in C out there that I could use for the caching? (I just don't want to reinvent the wheel.)
It has to meet the following requirements:
Open source with a permissive license (think BSD or public domain).
Very simple: ideally less than 100 LOC.
Keys are char* and values are void*. No need to copy them.
No real need to implement remove(), but either contains() is needed or put() should replace the value.
PS: I tagged it homework, since it could be. I'm just being very lazy and want to avoid all the common pitfalls I could encounter while reimplementing.

Here's a very simple and naive one:
Fixed number of buckets
No delete operation
insert() replaces the key and value, and can optionally free the old ones
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define NR_BUCKETS 1024
struct StrHashNode {
    char *key;
    void *value;
    struct StrHashNode *next;
};
struct StrHashTable {
    struct StrHashNode *buckets[NR_BUCKETS];
    void (*free_key)(char *);       /* optional, may be NULL */
    void (*free_value)(void *);     /* optional, may be NULL */
    unsigned int (*hash)(const char *key);
    int (*cmp)(const char *first, const char *second);
};
/* Returns the value stored under key, or NULL if the key is absent. */
void *get(struct StrHashTable *table, const char *key)
{
    unsigned int bucket = table->hash(key) % NR_BUCKETS;
    struct StrHashNode *node;
    node = table->buckets[bucket];
    while (node) {
        if (table->cmp(key, node->key) == 0)
            return node->value;
        node = node->next;
    }
    return NULL;
}
/* Inserts or replaces key/value. Returns 0 on success, -1 if malloc fails. */
int insert(struct StrHashTable *table, char *key, void *value)
{
    unsigned int bucket = table->hash(key) % NR_BUCKETS;
    struct StrHashNode **tmp;
    struct StrHashNode *node;
    /* Walk the chain; stop at a matching node or at the NULL tail link. */
    tmp = &table->buckets[bucket];
    while (*tmp) {
        if (table->cmp(key, (*tmp)->key) == 0)
            break;
        tmp = &(*tmp)->next;
    }
    if (*tmp) {
        /* Key already present: release the old key and value if requested. */
        if (table->free_key != NULL)
            table->free_key((*tmp)->key);
        if (table->free_value != NULL)
            table->free_value((*tmp)->value);
        node = *tmp;
    } else {
        /* Key not present: append a new node to the end of the chain. */
        node = malloc(sizeof *node);
        if (node == NULL)
            return -1;
        node->next = NULL;
        *tmp = node;
    }
    node->key = key;
    node->value = value;
    return 0;
}
unsigned int foo_strhash(const char *str)
{
    unsigned int hash = 0;
    for (; *str; str++)
        hash = 31 * hash + *str;
    return hash;
}
int main(int argc, char *argv[])
{
    struct StrHashTable tbl = {{0}, NULL, NULL, foo_strhash, strcmp};
    insert(&tbl, "Test", "TestValue");
    insert(&tbl, "Test2", "TestValue2");
    puts(get(&tbl, "Test"));
    insert(&tbl, "Test", "TestValueReplaced");
    puts(get(&tbl, "Test"));
    return 0;
}

Christopher Clark's hashtable implementation is very straightforward. It is more than 100 lines, but not by much.
Clark's code seems to have made its way into Google's Concurrency Library as a parallelization example.

std::map in C++ is a red-black tree under the hood; what about using an existing red-black tree implementation in C? The one I linked is more like 700 LOC, but it's pretty well commented and looks sane from the cursory glance I took at it. You can probably find others; this one was the first hit on Google for "C red-black tree".
If you're not picky about performance you could also use an unbalanced binary tree or a min-heap or something like that. With a balanced binary tree, you're guaranteed O(log n) lookup; with an unbalanced tree the worst case for lookup is O(n) (for the pathological case where nodes are inserted in-order, so you end up with one really long branch that acts like a linked-list), but (if my rusty memory is correct) the average case is still O(log n).
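The pathological case described above is easy to reproduce. The following is a minimal sketch of an unbalanced binary search tree with int keys; the names bst_insert and bst_height are illustrative and not taken from any library.

```c
#include <stdlib.h>

struct bst_node {
    int key;
    struct bst_node *left, *right;
};

/* Insert key into the tree rooted at *root; duplicates are ignored.
   Returns 0 on success, -1 if malloc fails. */
int bst_insert(struct bst_node **root, int key)
{
    while (*root) {
        if (key == (*root)->key)
            return 0;
        root = key < (*root)->key ? &(*root)->left : &(*root)->right;
    }
    *root = malloc(sizeof **root);
    if (*root == NULL)
        return -1;
    (*root)->key = key;
    (*root)->left = (*root)->right = NULL;
    return 0;
}

/* Number of nodes on the longest path from the root down to a leaf. */
int bst_height(const struct bst_node *n)
{
    if (n == NULL)
        return 0;
    int l = bst_height(n->left);
    int r = bst_height(n->right);
    return 1 + (l > r ? l : r);
}
```

Inserting the keys 1 to 100 in ascending order produces a tree of height 100 (effectively a linked list), while inserting 4, 2, 6, 1, 3, 5, 7 produces a tree of height 3.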

You can try using the following implementation:
clib

memcached?
Not a code snippet, but a high performance distributed caching engine.

Not lazy; it is deeply sensible to avoid writing this stuff.
How about this library? I've never used it myself, but it claims to do what you ask for.

Dave Hanson's C Interfaces and Implementations includes a nice hash table, as well as many other useful modules. The hash table clocks in at 150 lines, but that's including memory management, a higher-order mapping function, and conversion to array. The software is free, and the book is worth buying.

Found an implementation here: a .c file and a .h file that are fairly close to what you asked for. W3C license.

Related

compare two address, list C

If I have a char array[100] that stores a list of structs, and I want to add one more struct at the end, how do I check whether it exceeds the boundary or not?
For example,
static char arr[100];
typedef NODE* node_ptr;
typedef struct node
{
char a;
char b;
int size;
node_ptr next;
}NODE;
//arr already contains few node in it.
//size: the new node size, I want to add in the end
node_ptr add_node(node_ptr last, size_t size)
{
node_ptr new;
if(last+2*sizeof(NODE)+size<arr+100)
//add new node
return new;
}
How can I check whether the new node exceeds the array boundary?

this is COMMENT
do not upvote - but feel free to DV.
You ask a difficult question, and it is not easy to answer. You also need to consider some more complex cases like
or
or
A small array will become fragmented in a very short time. This is one of the reasons why microcontroller (uC) developers try to avoid this kind of memory allocation. Other techniques, such as memory pools, are used in embedded programming.
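For the boundary check itself, one option is to compare byte counts instead of pointers, since forming a pointer more than one element past the end of the array is undefined behavior. The sketch below assumes the caller tracks how many bytes of the array are already used; node_fits and ARR_SIZE are hypothetical names, and alignment of the next node is deliberately left out.

```c
#include <stddef.h>

#define ARR_SIZE 100   /* size of the backing char array in the question */

typedef struct node {
    char a;
    char b;
    int size;
    struct node *next;
} NODE;

/* Returns 1 if a NODE header followed by `size` payload bytes fits into an
   ARR_SIZE-byte array of which `used` bytes are already taken, else 0.
   All arithmetic is done on sizes, arranged so that it cannot overflow. */
int node_fits(size_t used, size_t size)
{
    if (used > ARR_SIZE)
        return 0;
    size_t avail = ARR_SIZE - used;
    if (avail < sizeof(NODE))
        return 0;
    return size <= avail - sizeof(NODE);
}
```

Given the last node, the used byte count could be derived as (char *)last - arr + sizeof(NODE) + last->size, provided last really points into arr.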

What is the fastest way to implement a list and queue in c?

Which stack and queue implementation is faster and more optimal, and why: one based on an array (dynamic or static) or one based on a list?
For example, I have these ways:
Dynamic array based:
typedef struct Stack {
char* values;
int avl_el;
int now_el;
int top;
}Stack;
void push(Stack* stack, char data) {
if (stack->now_el >= stack->avl_el) {
stack->avl_el += INCR;
stack->values = (char*)malloc(stack->avl_el * sizeof(char));
}
if (stack->top == -1) {
stack->top++;
stack->values[stack->top] = data;
stack->now_el++;
}else {
stack->top++;
stack->values[stack->top] = data;
stack->now_el++;
}
}
char pop(Stack* stack) {
char tmp = stack->values[stack->top];
stack->values[stack->top] = 0;
stack->top--;
stack->now_el--;
return tmp;
}
List based:
typedef struct Node {
char data; // in this case we save char symb
struct Node *next;
}Node;
typedef struct Stack {
struct Node* topElem;
}Stack;
void push(Stack* stack, char data) {
Node* tmp = (Node*)malloc(1 * sizeof(Node));
if(!tmp) {
printf("Can't push!\n");
return;
}
tmp->data = data;
tmp->next = stack->topElem;
stack->topElem = tmp; // making new top element
}
char pop(Stack* stack) {
Node* tmp = stack->topElem;
char del_data = stack->topElem->data;
stack->topElem = stack->topElem->next;
free(tmp);
return del_data;
}
Will there be any difference between a stack based on a dynamic array and one based on a static array?

Assuming you fix your bugs, let's discuss the principles. The biggest performance bug is growing the capacity by a constant INCR. With this bug, the complexity of inserting n elements is O(n^2). For better complexity, grow the capacity by a factor of 2 or 1.5; after this fix, inserting n elements takes O(n) in total, or amortized O(1) per insertion.
The two approaches have been tested extensively in C++: which is faster, std::vector (similar to your stack) or std::list (a doubly linked list)? Here is a list of resources:
Bjarne Stroustrup, the creator of C++, compared lists and vectors.
Stack Overflow: Relative performance of std::vector vs. std::list vs. std::slist?
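As a rough illustration of geometric growth, here is a minimal sketch of an array-based stack that doubles its capacity with realloc. The type CharStack, the function names, and the initial capacity of 16 are assumptions made for this example, not part of the question's code.

```c
#include <stdlib.h>

typedef struct {
    char *values;     /* heap buffer, NULL while the stack is empty and unallocated */
    size_t count;     /* elements currently stored */
    size_t capacity;  /* elements the buffer can hold */
} CharStack;

/* Push one element, doubling the buffer when it is full.
   Returns 0 on success, -1 if realloc fails (the stack is left unchanged). */
int stack_push(CharStack *s, char data)
{
    if (s->count == s->capacity) {
        size_t new_cap = s->capacity ? s->capacity * 2 : 16;
        char *p = realloc(s->values, new_cap);   /* realloc(NULL, n) acts like malloc(n) */
        if (p == NULL)
            return -1;
        s->values = p;
        s->capacity = new_cap;
    }
    s->values[s->count++] = data;
    return 0;
}

/* Pop the top element into *out. Returns 0 on success, -1 if the stack is empty. */
int stack_pop(CharStack *s, char *out)
{
    if (s->count == 0)
        return -1;
    *out = s->values[--s->count];
    return 0;
}
```

A stack starts out as CharStack s = {NULL, 0, 0}; and its buffer is released with free(s.values). Because realloc keeps the old contents, existing elements survive each growth step, unlike with the plain malloc call in the question.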
Lists are easier to implement, and have a better predictability (no resizing), but vectors are faster in the stack scenario on average, and more memory efficient.
Vectors (the stack in the question):
Size: No need to store pointers to the next element. So it's more efficient.
Speed: consecutive elements are near each other, resulting in better memory predictability, and higher cache efficiency.
Lists:
Size: no need to find one big block of memory (works better in a fragmented memory).
Speed: predictable - no need to copy big chunks of memory once in a while.

Managing duplicates in a binary tree with memory efficiency

I have a self balancing key-value binary tree (similar to Tarjan's Zip Tree) where there will be duplication of keys. To ensure O(log N) performance the only thing I can come up with is to maintain three pointers per node; a less than, a greater than, and an "equals". The equals pointer is a pointer to a linked-list of members having the same key.
This seems memory inefficient to me because I'll have an extra 8 bytes per node in the whole tree to handle the infrequent duplicate occurrences. Is there a better way that doesn't involve "cheats" like bit banging the left or right pointers for use as a flag?

When you have a collision insertion, allocate new buffer, copy new data.
Hash the new data pointer down to one or two bytes. You'll need a hash that only returns zero on zero input!
Store the hash value in your node. This field would be zero if there are no collision data, so you are O(log KeyCount) for all keys without extra data elements. Your worst case is log KeyCount plus whatever your hashing algorithm yields on lookups, which might be a constant close to 1 additional step until your table has to be resized.
Obviously, choice of hashing algorithm is critical here. Look for one that is good with pointer values on whatever architecture you are targeting. You may need different hashes for different architectures.
You can carry this even further by storing only a one-byte hash value that selects a hash table, and then using the key hash (which can be a larger integer) within that table to find the pointer to the additional data. When a hash table fills up, insert a new one into the parent table. I'll leave the math to you.
Regarding data locality. Since the node data are large, you already don't have good node record to actual data locality anyway. This scheme doesn't change that, except in the case where you have multiple data nodes for a particular key, in which case, you'd likely have cache miss getting to the correct index of a variable array embedded in the node. This scheme avoids having to reallocate the nodes on collisions, and probably won't have a severe impact on your cache miss rate.

I usually use this setup when I build a binary search tree; it skips the duplicate values in the array:
#include <stdio.h>
#include <stdlib.h>
#define SIZE 13
typedef struct Node
{
struct Node * right;
struct Node * left;
int value;
}TNode;
typedef TNode * Nodo;
void bst(int data, Nodo * p )
{
Nodo pp = *p;
if(pp == NULL)
{
pp = (Nodo)malloc(sizeof(struct Node));
pp->right = NULL;
pp->left = NULL;
pp->value = data;
*p = pp;
}
else if(data == pp->value)
{
return;
}
else if(data > pp->value)
{
bst(data, &pp->right);
}
else
{
bst(data, &pp->left);
}
}
void displayDesc(Nodo p)
{
if(p != NULL)
{
displayDesc(p->right);
printf("%d\n", p->value);
displayDesc(p->left);
}
}
void displayAsc(Nodo p)
{
if(p != NULL)
{
displayAsc(p->left);
printf("%d\n", p->value);
displayAsc(p->right);
}
}
int main()
{
int arr[SIZE] = {4,1,0,7,5,88,8,9,55,42,0,5,6};
Nodo head = NULL;
for(int i = 0; i < SIZE; i++)
{
bst(arr[i], &head);
}
displayAsc(head);
exit(0);
}

Any faster methods to find data?

This is an Interview question.
We are developing a k/v system, part of it has been developed, we need you to finish it.
Things already done -
1) Return a hash of any string, you can assume return value is always unique, no collision,
it's up to you to use it or not
int hash(char *string);
Things you have to finish -
int set(char *key, char *value);
char *get(char *key);
And my answer was
struct kv {
    int key;
    char *value;
    struct kv *next;
};
struct kv *top = NULL; /* first node of the list */
struct kv *end = NULL; /* last node, so appending is O(1) */
void set(char *key, char *value) {
    int k = hash(key);
    struct kv *i;
    /* Replace the value if the key is already stored. */
    for (i = top; i != NULL; i = i->next) {
        if (i->key == k) {
            i->value = value;
            return;
        }
    }
    /* Otherwise append a new node. */
    i = malloc(sizeof *i);
    if (i == NULL) {
        return;
    }
    i->key = k;
    i->value = value;
    i->next = NULL;
    if (end == NULL) {
        top = i;
    } else {
        end->next = i;
    }
    end = i;
}
char *get(char *key) {
    int k = hash(key);
    struct kv *i;
    for (i = top; i != NULL; i = i->next) {
        if (i->key == k) {
            return i->value;
        }
    }
    return NULL;
}
Q: - Is there any faster way to do it? What do you think is the fastest way?

What you have done is made a linked list to store the key value pairs. But as you can see, the search complexity is O(n). You can make it faster by creating a hash table. You already have a hash function with guaranteed 0 collisions.
What you can do is
char* hash_table[RANGE_OF_HASH] = {NULL}; // Your interviewer should provide you RANGE_OF_HASH
Then your set and get become -
void set(char* key, char* value) {
hash_table[hash(key)] = value; // Can do this because no collisions are guaranteed.
}
char* get(char* key) {
return hash_table[hash(key)];
}
In this case since you don't have to iterate over all the keys inserted, the get complexity is O(1) (also set).
But you need to be aware that this usually occupies more space than your approach.
Your method occupies O(n) space but this occupies O(RANGE_OF_HASH). Which might not be acceptable in situations where memory is a constraint.
If RANGE_OF_HASH is huge (like INT_MAX) and you don't have enough memory for hash_table, you can create a multi-level hash table.
For instance, your main hash_table will have only 256 slots. Each entry will point to another hash table of 256 entries, and so on. You will have to do some bit masking to get the hash value for each level. You can allocate each level on demand. This way you will minimize the memory usage.
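The multi-level idea can be sketched as follows, assuming the hash fits in 32 bits and is split into four 8-bit indices, one per level. The names ml_set, ml_get, and struct level are made up for this example, and the code relies on calloc's zeroed memory reading back as NULL pointers, which holds on mainstream platforms.

```c
#include <stdint.h>
#include <stdlib.h>

#define FANOUT 256   /* slots per level, indexed by 8 bits of the hash */
#define LEVELS 4     /* 4 levels x 8 bits cover a 32-bit hash */

struct level {
    void *slot[FANOUT];  /* child level, or the stored value at the last level */
};

static struct level root;  /* zero-initialized: all slots start out NULL */

/* Store value under hash h, allocating intermediate levels on demand.
   Returns 0 on success, -1 if an allocation fails. */
int ml_set(uint32_t h, char *value)
{
    struct level *lv = &root;
    for (int i = LEVELS - 1; i > 0; i--) {
        unsigned idx = (h >> (8 * i)) & 0xFF;   /* mask out this level's byte */
        if (lv->slot[idx] == NULL) {
            lv->slot[idx] = calloc(1, sizeof(struct level));
            if (lv->slot[idx] == NULL)
                return -1;
        }
        lv = lv->slot[idx];
    }
    lv->slot[h & 0xFF] = value;
    return 0;
}

/* Return the value stored under hash h, or NULL if there is none. */
char *ml_get(uint32_t h)
{
    const struct level *lv = &root;
    for (int i = LEVELS - 1; i > 0; i--) {
        lv = lv->slot[(h >> (8 * i)) & 0xFF];
        if (lv == NULL)
            return NULL;   /* this part of the table was never allocated */
    }
    return lv->slot[h & 0xFF];
}
```

With the interviewer's hash function, set(key, value) would reduce to ml_set((uint32_t)hash(key), value). Lookups cost a fixed four pointer hops, and memory is only spent on the 256-slot blocks that are actually reached.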

There are lots of great ways of doing this. Here's a small reading list; go through it. There are definitely more out there that I'm not aware of.
Sorted list with binary search - Depending on the usage patterns, can be fast or slow to build, but lookups are guaranteed to be O(log(N)).
Hash table - fast, close to O(1) on average, O(N) in worst case for all operations.
Binary tree - best case O(log(N)), worst case O(N).
AVL tree - guaranteed O(log(N)) for all operations.
Red-black tree - similar to AVL, but trades some lookup speed for faster insertion.
Trie - O(L) on all operations for a key of length L, independent of the number of stored keys, at the expense of more memory usage.
After this, take a break, brace yourself, and delve into this article about computer memory. This is already advanced stuff and will show you that sometimes a worse big-O measure can actually perform better in real-world scenarios. It all comes down to what kind of data there will be and what the usage patterns are.
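The first item on the list, a sorted array with binary search, needs very little code in C because the standard library already provides qsort and bsearch. This is a minimal sketch; struct kv, kv_sort, and kv_get are illustrative names, and it assumes all pairs are inserted before the lookups start.

```c
#include <stdlib.h>
#include <string.h>

struct kv {
    const char *key;
    const char *value;
};

/* Order two pairs by key; used by both qsort and bsearch. */
static int kv_cmp(const void *a, const void *b)
{
    const struct kv *x = a;
    const struct kv *y = b;
    return strcmp(x->key, y->key);
}

/* Sort the array once after all pairs have been added: O(n log n). */
void kv_sort(struct kv *items, size_t n)
{
    qsort(items, n, sizeof items[0], kv_cmp);
}

/* Binary search for key: O(log n). Returns the value, or NULL if absent. */
const char *kv_get(const struct kv *items, size_t n, const char *key)
{
    struct kv probe = { key, NULL };
    const struct kv *hit = bsearch(&probe, items, n, sizeof items[0], kv_cmp);
    return hit ? hit->value : NULL;
}
```

This layout is slow to build if insertions and lookups are interleaved, since each insertion would require re-sorting or shifting elements, but it is compact and cache-friendly for read-mostly data.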

Hashing with large data sets and C implementation

I have a large number of values ranging from 0 - 5463458053. To each value, I wish to map a set containing strings so that the lookup operation, i.e. finding whether a string is present in that set, takes the least amount of time. Note that this set of values may not contain all values from (0 - 5463458053), but yes, a large number of them.
My current solution is to hash those values (between 0 - 5463458053) and for each value, have a linked list of strings corresponding to that value. Every time, I want to check for a string in a given set, I hash the value(between 0 - 5463458053), get the linked list, and traverse it to find out whether it contains the aforementioned string or not.
While this might seem easier, it's a little time consuming. Can you think of a faster solution? Also, collisions will be dreadful. They'll lead to wrong results.
The other part is about implementing this in C. How would I go about doing this?
NOTE: Someone suggested using a database instead. I wonder if that'll be useful.
I'm a little worried about running out of RAM naturally. :-)

You could have a hash table of hash sets. The first hash table has your integers as keys. The values inside it are hash sets, i.e. hash tables whose keys are strings.
You could also have a single hashed set whose keys are pairs of an integer and a string.
There are many libraries implementing such data structures (in C++, the standard library provides them: std::unordered_map and std::unordered_set are hash-based, while std::map and std::set are ordered trees). For C, I was thinking of GLib from GTK.
With hashing techniques, memory use is proportional to the size of the considered sets (or relations). For instance, you could accept a 30% emptiness rate.
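For the second variant, a set keyed by (integer, string) pairs, the main missing piece is a hash over the pair. One possible sketch feeds the bytes of the integer and then the bytes of the string through 64-bit FNV-1a; the choice of FNV-1a and the name pair_hash are assumptions for illustration, not something the libraries above prescribe.

```c
#include <stdint.h>

/* Hash an (integer, string) pair with 64-bit FNV-1a.
   The 8 bytes of value are mixed in first, then each character of str. */
uint64_t pair_hash(uint64_t value, const char *str)
{
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a 64-bit offset basis */
    for (int i = 0; i < 8; i++) {
        h ^= (value >> (8 * i)) & 0xFF;
        h *= 0x100000001b3ULL;                 /* FNV-1a 64-bit prime */
    }
    for (; *str; str++) {
        h ^= (unsigned char)*str;
        h *= 0x100000001b3ULL;
    }
    return h;
}
```

The result would be reduced modulo the table size to pick a bucket. Since different pairs can still land in the same bucket, each stored entry must keep both the integer and the string so that a lookup can compare them exactly.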

Large number of strings + fast lookup + limited memory ----> you want a prefix trie, crit-bit tree, or anything of that family (many different names for very similar things, e.g. PATRICIA... Judy is one such thing too). See for example this.
These data structures allow for prefix compression, so they are able to store a lot of strings (which somehow necessarily will have common prefixes) very efficiently. Also, lookup is very fast. Due to caching and paging effects that the common big-O notation does not account for, they can be as fast as or even faster than a hash, at a fraction of the memory (even though according to big-O, nothing except maybe an array can beat a hash).

A Judy Array, with the C library that implements it, might be exactly the base of what you need. Here's a quote that describes it:
Judy is a C library that provides a state-of-the-art core technology
that implements a sparse dynamic array. Judy arrays are declared
simply with a null pointer. A Judy array consumes memory only when it
is populated, yet can grow to take advantage of all available memory
if desired. Judy's key benefits are scalability, high performance, and
memory efficiency. A Judy array is extensible and can scale up to a
very large number of elements, bounded only by machine memory. Since
Judy is designed as an unbounded array, the size of a Judy array is
not pre-allocated but grows and shrinks dynamically with the array
population. Judy combines scalability with ease of use. The Judy API
is accessed with simple insert, retrieve, and delete calls that do not
require extensive programming. Tuning and configuring are not required
(in fact not even possible). In addition, sort, search, count, and
sequential access capabilities are built into Judy's design.
Judy can be used whenever a developer needs dynamically sized arrays,
associative arrays or a simple-to-use interface that requires no
rework for expansion or contraction.
Judy can replace many common data structures, such as arrays, sparse
arrays, hash tables, B-trees, binary trees, linear lists, skiplists,
other sort and search algorithms, and counting functions.

If the entries are from 0 to N and consecutive: use an array. (Is indexing fast enough for you?)
EDIT: the numbers do not seem to be consecutive. There is a large number of {key,value} pairs, where the key is a big number (>32 bits but < 64 bits) and the value is a bunch of strings.
If memory is available, a hash table is easy. If the bunch of strings is not too large, you can inspect them sequentially. If the same strings occur (much) more than once, you could enumerate the strings: put pointers to them in a char *array[] and use the index into that array instead. (Finding the index given a string probably involves another hash table.)
For the "master" hashtable an entry would probably be:
struct entry {
    struct entry *next;      /* for overflow chain */
    unsigned long long key;  /* the 33-bit number */
    struct list *payload;
} entries[big_enough_for_all]; /* if the size is known in advance,
                                  preallocation avoids a lot of malloc overhead */
If you have enough memory to store a heads array, you should certainly do that:
struct entry *heads[SOME_SIZE] = {NULL, };
Otherwise you can combine the heads array with the array of entries (like I did in "Lookups on known set of integer keys").
Handling collisions is easy: as you walk the overflow chain, just compare your key with the key in the entry. If they are unequal: walk on. If they are equal: found; now go walking the strings.
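The chain walk described above can be sketched like this. The table sizes, the modulo used as the hash, and the functions table_find and table_insert are assumptions for the example; the payload is simplified to a single string pointer standing in for the list of strings.

```c
#include <stddef.h>

#define NUM_HEADS   1024u   /* hypothetical number of hash slots */
#define MAX_ENTRIES 4096u   /* hypothetical size of the preallocated pool */

struct entry {
    struct entry *next;       /* overflow chain */
    unsigned long long key;   /* the full key, kept for exact comparison */
    const char *payload;      /* stand-in for the list of strings */
};

static struct entry entries[MAX_ENTRIES];  /* preallocated, no malloc per entry */
static size_t num_entries;
static struct entry *heads[NUM_HEADS];

static size_t slot_of(unsigned long long key)
{
    return (size_t)(key % NUM_HEADS);
}

/* Walk the overflow chain and compare full keys, so two keys that share
   a slot can never be confused with each other. */
struct entry *table_find(unsigned long long key)
{
    for (struct entry *e = heads[slot_of(key)]; e != NULL; e = e->next)
        if (e->key == key)
            return e;
    return NULL;
}

/* Insert key or update its payload. Returns NULL when the pool is full. */
struct entry *table_insert(unsigned long long key, const char *payload)
{
    struct entry *e = table_find(key);
    if (e != NULL) {
        e->payload = payload;
        return e;
    }
    if (num_entries == MAX_ENTRIES)
        return NULL;
    e = &entries[num_entries++];
    e->key = key;
    e->payload = payload;
    e->next = heads[slot_of(key)];   /* push onto the front of the chain */
    heads[slot_of(key)] = e;
    return e;
}
```

Because the full key is stored and compared, a collision only costs extra chain steps; it cannot produce a wrong result.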

You can use a single binary search tree (AVL/Red-black/...) to contain all the strings, from all sets, by keying them lexicographically as (set_number, string). You don't need to store sets explicitly anywhere. For example, the comparator defining the order of nodes for the tree could look like:
function compare_nodes (node1, node2) {
if (node1.set_number < node2.set_number) return LESS;
if (node1.set_number > node2.set_number) return GREATER;
if (node1.string < node2.string) return LESS;
if (node1.string > node2.string) return GREATER;
return EQUAL;
}
With such a structure, some common operations are possible (but maybe not straightforward).
To find whether a string s exists in the set set_number, simply lookup (set_number, s) in the tree, for an exact match.
To find all strings in the set set_number:
function iterate_all_strings_in_set (set_number) {
// Traverse the tree from root downwards, looking for the given key. Return
// wherever the search ends up, whether it found the value or not.
node = lookup_tree_weak(set_number, "");
// tree empty?
if (node == null) {
return;
}
// We may have gotten the greatest node from the previous set,
// instead of the first node from the set we're interested in.
if (node.set_number != set_number) {
node = successor(node);
}
while (node != null && node.set_number == set_number) {
do_something_with(node.string);
node = successor(node);
}
}
The above requires O((k+1)*log(n)) time, where k is the number of strings in set_number, and n is the number of all strings.
To find all set numbers with at least one string associated:
function iterate_all_sets ()
{
node = first_node_in_tree();
while (node != null) {
current_set = node.set_number;
do_something_with(current_set);
if (cannot increment current_set) {
return;
}
node = lookup_tree_weak(current_set + 1, "");
if (node.set_number == current_set) {
node = successor(node);
}
}
}
The above requires O((k+1)*log(n)) time, where k is the number of sets with at least one string, and n is the number of all strings.
Note that the above code assumes that the tree is not modified in the "do_something" calls; it may crash if nodes are removed.
Additionally, here's some real C code which demonstrates this, using my own generic AVL tree implementation. To compile it, it's enough to copy the misc/ and structure/ folders from the BadVPN source somewhere and add an include path there.
Note how my AVL tree does not contain any "data" in its nodes, and how it doesn't do any of its own memory allocation. This comes in handy when you have a lot of data to work with. To make it clear: the program below does only a single malloc(), which is the one that allocates the nodes array.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <assert.h>
#include <structure/BAVL.h>
#include <misc/offset.h>
struct value {
uint32_t set_no;
char str[3];
};
struct node {
uint8_t is_used;
struct value val;
BAVLNode tree_node;
};
BAVL tree;
static int value_comparator (void *unused, void *vv1, void *vv2)
{
struct value *v1 = vv1;
struct value *v2 = vv2;
if (v1->set_no < v2->set_no) {
return -1;
}
if (v1->set_no > v2->set_no) {
return 1;
}
int c = strcmp(v1->str, v2->str);
if (c < 0) {
return -1;
}
if (c > 0) {
return 1;
}
return 0;
}
static void random_bytes (unsigned char *out, size_t n)
{
while (n > 0) {
*out = rand();
out++;
n--;
}
}
static void random_value (struct value *out)
{
random_bytes((unsigned char *)&out->set_no, sizeof(out->set_no));
for (size_t i = 0; i < sizeof(out->str) - 1; i++) {
out->str[i] = (uint8_t)32 + (rand() % 94);
}
out->str[sizeof(out->str) - 1] = '\0';
}
static struct node * find_node (const struct value *val)
{
// find AVL tree node with an equal value
BAVLNode *tn = BAVL_LookupExact(&tree, (void *)val);
if (!tn) {
return NULL;
}
// get node pointer from pointer to its value (same as container_of() in Linux kernel)
struct node *n = UPPER_OBJECT(tn, struct node, tree_node);
assert(n->val.set_no == val->set_no);
assert(!strcmp(n->val.str, val->str));
return n;
}
static struct node * lookup_weak (const struct value *v)
{
BAVLNode *tn = BAVL_Lookup(&tree, (void *)v);
if (!tn) {
return NULL;
}
return UPPER_OBJECT(tn, struct node, tree_node);
}
static struct node * first_node (void)
{
BAVLNode *tn = BAVL_GetFirst(&tree);
if (!tn) {
return NULL;
}
return UPPER_OBJECT(tn, struct node, tree_node);
}
static struct node * next_node (struct node *node)
{
BAVLNode *tn = BAVL_GetNext(&tree, &node->tree_node);
if (!tn) {
return NULL;
}
return UPPER_OBJECT(tn, struct node, tree_node);
}
size_t num_found;
static void iterate_all_strings_in_set (uint32_t set_no)
{
struct value v;
v.set_no = set_no;
v.str[0] = '\0';
struct node *n = lookup_weak(&v);
if (!n) {
return;
}
if (n->val.set_no != set_no) {
n = next_node(n);
}
while (n && n->val.set_no == set_no) {
num_found++; // "do_something_with_string"
n = next_node(n);
}
}
static void iterate_all_sets (void)
{
struct node *node = first_node();
while (node) {
uint32_t current_set = node->val.set_no;
iterate_all_strings_in_set(current_set); // "do_something_with_set"
if (current_set == UINT32_MAX) {
return;
}
struct value v;
v.set_no = current_set + 1;
v.str[0] = '\0';
node = lookup_weak(&v);
if (node->val.set_no == current_set) {
node = next_node(node);
}
}
}
int main (int argc, char *argv[])
{
size_t num_nodes = 10000000;
// init AVL tree, using:
// key=(struct node).val,
// comparator=value_comparator
BAVL_Init(&tree, OFFSET_DIFF(struct node, val, tree_node), value_comparator, NULL);
printf("Allocating...\n");
// allocate nodes (missing overflow check...)
struct node *nodes = malloc(num_nodes * sizeof(nodes[0]));
if (!nodes) {
printf("malloc failed!\n");
return 1;
}
printf("Inserting %zu nodes...\n", num_nodes);
size_t num_inserted = 0;
// insert nodes, giving them random values
for (size_t i = 0; i < num_nodes; i++) {
struct node *n = &nodes[i];
// choose random set number and string
random_value(&n->val);
// try inserting into AVL tree
if (!BAVL_Insert(&tree, &n->tree_node, NULL)) {
printf("Insert collision: (%"PRIu32", '%s') already exists!\n", n->val.set_no, n->val.str);
n->is_used = 0;
continue;
}
n->is_used = 1;
num_inserted++;
}
printf("Looking up...\n");
// lookup all those values
for (size_t i = 0; i < num_nodes; i++) {
struct node *n = &nodes[i];
struct node *lookup_n = find_node(&n->val);
if (n->is_used) { // this node is the only one with this value
ASSERT(lookup_n == n)
} else { // this node was an insert collision; some other
// node must have this value
ASSERT(lookup_n != NULL)
ASSERT(lookup_n != n)
}
}
printf("Iterating by sets...\n");
num_found = 0;
iterate_all_sets();
ASSERT(num_found == num_inserted)
printf("Removing all strings...\n");
for (size_t i = 0; i < num_nodes; i++) {
struct node *n = &nodes[i];
if (!n->is_used) { // must not remove it if it wasn't inserted
continue;
}
BAVL_Remove(&tree, &n->tree_node);
}
return 0;
}
