I want to create a hash table that relies on an independent vector data structure in C99. I can do this in C++ with the help of OO, but I'm unsure how to approach this using structs and unions.
I would prefer that any linked examples do not include hash table implementations that have highly complex hashing functions. I do not particularly care about collisions or efficiency of storage. I just want either advice as to how to proceed or a simple example that exemplifies the form rather than function of the respective data structures.
If I infer correctly that you want to implement growing hash tables in a fully generic way, then you'll need a lot of void pointers. A vector isn't too hard; it just takes a lot of typing:
#include <stdlib.h>

typedef struct {
    size_t capacity, nelems;
    void **contents;
} Vector;

enum { INITIAL_CAPACITY = 256 };

Vector *make_vector(void)
{
    Vector *v = malloc(sizeof(Vector));
    if (v == NULL)
        return NULL;
    v->capacity = INITIAL_CAPACITY;
    v->contents = malloc(sizeof(void *) * v->capacity);
    if (v->contents == NULL) {
        free(v);
        return NULL;
    }
    v->nelems = 0;
    return v;
}
// exercise for the reader
int vector_append(Vector *, void *);
void *vector_at(Vector const *, size_t);
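If you want something to start from, here is one way those two exercises could come out (a sketch, not part of the original answer; append doubles the capacity with realloc):
int vector_append(Vector *v, void *elem)
{
    if (v->nelems == v->capacity) {
        size_t newcap = v->capacity * 2;
        void **tmp = realloc(v->contents, sizeof(void *) * newcap);
        if (tmp == NULL)
            return -1;          /* out of memory, vector left unchanged */
        v->contents = tmp;
        v->capacity = newcap;
    }
    v->contents[v->nelems++] = elem;
    return 0;
}

void *vector_at(Vector const *v, size_t i)
{
    return i < v->nelems ? v->contents[i] : NULL;
}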
Keep in mind that a generic hash function would have prototype size_t hash(void const *, size_t), i.e. you need to pass in the size.
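For illustration, a minimal byte-wise hash with exactly that prototype (this sketch uses the well-known FNV-1a constants, but any mixing of the bytes will do if you don't care about collisions):
size_t hash(void const *key, size_t len)
{
    unsigned char const *p = key;
    size_t h = 14695981039346656037ull;    /* FNV offset basis (64-bit) */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ull;             /* FNV prime (64-bit) */
    }
    return h;
}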
(Side note: it's not C++'s OOP features that you're going to miss; it's templates, the type safety that they buy, and syntactic sugar such as operator overloading. Take a look at OpenBSD's ohash library for more examples.)
The following book has probably the best description of both linked lists and a hash table in C using structs:
http://en.wikipedia.org/wiki/The_C_Programming_Language_(book)
It implements a simple hashing algorithm as well.
Another simple, yet uniformly distributed hashing algorithm is the cdb algorithm as defined here:
http://cr.yp.to/cdb/cdb.txt
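For reference, the core of that hash is tiny (a sketch of it; check the spec for the authoritative definition):
#include <stddef.h>
#include <stdint.h>

uint32_t cdb_hash(unsigned char const *buf, size_t len)
{
    uint32_t h = 5381;                    /* starting value given in cdb.txt */
    for (size_t i = 0; i < len; i++)
        h = ((h << 5) + h) ^ buf[i];      /* h = h*33 XOR next byte */
    return h;
}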
I am taking advantage of polymorphism in C by using virtual tables as described in Polymorphism (in C) and it works great.
Unfortunately, the limitations of my current project do not allow me to use function pointers or references to structs in some parts of my code. As a consequence, I cannot use the original approach directly.
In the mentioned approach, the base "class/struct" has a member that points to the virtual table. In order to get rid of this pointer, I decided to replace it with an enum that acts as a key to access the virtual table.
It works, but I wonder if it is the best solution. Can you think of any alternative that fits better than my proposal?
/**
 * This example shows a common approach to achieve polymorphism in C and an
 * alternative that does NOT include a reference to a function pointer in the
 * base class.
 **/
#include <stdio.h>

// some functions to make use of polymorphism
void funBase1(void)
{
    printf("base 1 \n");
}

void funBase2(void)
{
    printf("base 2 \n");
}

void funDerived1(void)
{
    printf("derived 1 \n");
}

void funDerived2(void)
{
    printf("derived 2 \n");
}

// struct to host virtual tables
typedef struct vtable {
    void (*method1)(void);
    void (*method2)(void);
} sVtable;

// enum that acts as a key to access the virtual table
typedef enum { BASE, DERIVED } eTypes;

// global virtual table used for the alternative solution
const sVtable g_vtableBaseAlternative[] = {
    {funBase1, funBase2},
    {funDerived1, funDerived2},
};

// original approach that I cannot use
typedef struct base {
    const sVtable* vtable;
    int baseAttribute;
} sBase;

// alternative approach
typedef struct baseAlternative {
    const eTypes vtable_key;
    int baseAttribute;
} sBaseAlternative;

typedef struct derived {
    sBase base;
    int derivedAttribute;
} sDerived;

// original way to use
static inline void method1(sBase* base)
{
    base->vtable->method1();
}

const sVtable* getVtable(const int key, const sVtable* vTableDic)
{
    return &vTableDic[key];
}

// alternative way to get a reference to the virtual table
static inline void method1Alternative(sBaseAlternative* baseAlternative)
{
    const sVtable* vtable;
    vtable = getVtable(baseAlternative->vtable_key, g_vtableBaseAlternative);
    printf("alternative version: ");
    vtable->method1();
}

int main(void) {
    const sVtable vtableBase[] = { {funBase1, funBase2} };
    const sVtable vtableDerived[] = { {funDerived1, funDerived2} };
    sBase base = {vtableBase, 0 };
    sBase derived = {vtableDerived, 1 };
    sBaseAlternative baseAlternative = {DERIVED, 1 };

    method1(&base);
    method1(&derived);
    method1Alternative(&baseAlternative);
    return 0;
}
my current project does not allow me to use function pointer or reference to structs
You could use an array of T (any type you like) to represent a data type. For example, I tend to use arrays of unsigned char to serialise and deserialise my data structures for web transfer... Let's assume, for example, that you're using sprintf and sscanf for serialisation and deserialisation (which you shouldn't really do, but they're okay for demos)... Instead of struct arguments, you use char * arguments, and you use sscanf to read that data into local variables and sprintf to modify it... that covers the "no references to structs allowed" problem.
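A tiny sketch of that idea (the field names and format are made up for the demo; the "record" is just a formatted char buffer instead of a struct):
#include <stdio.h>

/* "record" layout: "<id> <price>" serialised into a caller-supplied buffer */
void record_set(char *buf, size_t bufsize, int id, double price)
{
    snprintf(buf, bufsize, "%d %f", id, price);
}

void record_print(const char *buf)
{
    int id;
    double price;
    if (sscanf(buf, "%d %lf", &id, &price) == 2)
        printf("id=%d price=%f\n", id, price);
}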
With regards to the function pointer problem, you could combine all of your functions into one which switches on... a tagged structure in string form... Here's a simple (yet incomplete) example involving two candidates for classes: a length-prefixed string which uses two bytes to encode the length and kind of derives from C-string behaviour, and a C string.
#include <string.h>

enum { fubar_is_string, fubar_is_length_prefixed_string };
typedef unsigned char non_struct_str_class;

size_t non_struct_strlen(non_struct_str_class *fubar) {
    size_t length = 0;
    switch (fubar++[0]) {
    case fubar_is_length_prefixed_string:
        length = fubar++[0];
        length <<= 8;
        length += fubar++[0];
        // carry through into the next case
        // to support strings longer than 64KB
    case fubar_is_string:
        if (!length)
            length = strlen((char *)fubar);
        /* handle fubar as string */
    }
    return length;
}
C is a Turing-complete programming language, so of course it can be used to mimic object-oriented polymorphism... but it's far better at mimicking procedural polymorphism, or in some cases even functional polymorphism... As an example, you could say qsort and bsearch use a primitive form of parametric polymorphism similar to that of map (even more similar to a filter idiom).
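For example, qsort needs only a comparator callback to sort elements of any type (standard C, shown here with a throwaway int array):
#include <stdlib.h>
#include <stdio.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids the overflow of plain x - y */
}

int main(void)
{
    int v[] = { 3, 1, 2 };
    qsort(v, sizeof v / sizeof v[0], sizeof v[0], cmp_int);
    printf("%d %d %d\n", v[0], v[1], v[2]);   /* prints 1 2 3 */
    return 0;
}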
You could also use _Generic with limited success, for example as the C11 standard does by providing a generic cbrt macro for all of the standard floating point types:
#define cbrt(X) _Generic((X), \
long double: cbrtl, \
default: cbrt, \
float: cbrtf \
)(X)
The preprocessor is particularly useful if you're going to go the route of mimicry... You might be interested in the C11 book by Klemens.
I am taking advantage of polymorphism in C
You do not, of course; you only create a prosthesis of it. IMO, simulating objects in C is the worst possible solution. If you prefer the OOP paradigm, use an OO language; in this case, C++.
To answer your question: you can't do it (in any sane way) without function pointers.
I discourage people from attempting OOP-like programming in procedural languages. It usually leads to less readable, error-prone, and very difficult to maintain programs.
Choose the correct tool (the language is the tool) for the task and the method.
It is like using a knife instead of a screwdriver. You can, but the screwdriver will definitely work much better.
I am porting some C++ code to C. What is a viable equivalent of std::map in C?
I know there is no equivalent in C.
This is what I am thinking of using:
In C++:
std::map< uint, sTexture > m_Textures;
In C:
typedef struct
{
    uint* intKey;
    sTexture* textureValue;
} sTMTextureMap;
Is that viable or am I simplifying map too much? Just in case you did not get the purpose, it's a Texture Map.
Many C implementations support tsearch(3) or hsearch(3). tsearch(3) is a binary tree and you can provide a comparator callback. I think that's about as close as you're going to get to a std::map.
Here's some C99 example code:
#include <search.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

typedef struct
{
    int key;
    char* value;
} intStrMap;

int compar(const void *l, const void *r)
{
    const intStrMap *lm = l;
    const intStrMap *lr = r;
    return lm->key - lr->key;
}

int main(int argc, char **argv)
{
    void *root = 0;

    intStrMap *a = malloc(sizeof(intStrMap));
    a->key = 2;
    a->value = strdup("two");
    tsearch(a, &root, compar); /* insert */

    intStrMap *find_a = malloc(sizeof(intStrMap));
    find_a->key = 2;
    void *r = tfind(find_a, &root, compar); /* read */
    if (r != NULL)
        printf("%s", (*(intStrMap**)r)->value);
    return 0;
}
Why don't you just wrap a C interface around std::map? I.e. write a few C++ functions in their own module:
#include <map>

typedef std::map<int, char*> Map;

extern "C" {

void* map_create() {
    return reinterpret_cast<void*>(new Map);
}

void map_put(void* map, int k, char* v) {
    Map* m = reinterpret_cast<Map*>(map);
    m->insert(std::pair<int, char*>(k, v));
}

// etc...

} // extern "C"
And then link into your C app.
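On the C side you then see nothing but opaque pointers; a hypothetical header and usage matching the two functions above might look like this:
/* map_wrapper.h -- the only part visible to the C code */
void* map_create(void);
void  map_put(void* map, int k, char* v);

/* usage from C */
void example(void)
{
    char one[] = "one";
    void* m = map_create();
    map_put(m, 1, one);
}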
That is certainly one possible implementation. You might want to consider how you'll implement the indexing and what performance impact that will have. For example, you could have the intKey list be a sorted list of the keys. Looking up a key would be O(log N) time, but inserting a new item would be O(N).
You could implement it as a tree (like std::map), and then you'd have O(log N) insertion and lookup.
Another alternative would be to implement it as a hash table, which would have better runtime performance, assuming a good hash function and a sparse enough intKey array.
You can implement it however you choose. If you use a linked-list approach your insertion will be O(1) but your retrieval and deletion will be O(n). If you use something more complex like a red-black tree you'll have much better average performance.
If you're implementing it yourself linked-list is probably the easiest, otherwise grabbing some appropriately licensed red-black or other type of tree from the internet would be the best option. Implementing your own red-black tree is not recommended... I've done this and would prefer not to do it again.
And to answer a question you didn't ask: maybe you should reexamine whether porting to C from C++ really provides all the benefits you wanted. Certainly there are situations where it could be necessary, but there aren't many.
I have tried implementing a map in C; it is based on void *:
https://github.com/davinash/cstl
It is a work in progress, but the map is complete.
https://github.com/davinash/cstl/blob/master/src/c_map.c
It is written based on a red-black tree.
There is no standard library in C that provides functionality analogous to a map. You will need to implement your own map-like functionality using some form of container that supports accessing elements via keys.
man dbopen
Provide NULL as the file argument and it'll be an in-memory only container for key/value data.
There are also various Berkeley database library interfaces with similar key/value functionality (man dbm, check out BerkeleyDB from Sleepycat, try some searches, etc).
I'm trying to create a generic hash table in C. I've read a few different implementations, and came across a couple of different approaches.
The first is to use macros like this: http://attractivechaos.awardspace.com/khash.h.html
And the second is to use a struct with 2 void pointers like this:
struct hashmap_entry
{
void *key;
void *value;
};
From what I can tell this approach isn't great because it means that each entry in the map requires at least 2 allocations: one for the key and one for the value, regardless of the data types being stored. (Is that right???)
I haven't been able to find a decent way of keeping it generic without going the macro route. Does anyone have any tips or examples that might help me out?
C does not provide what you need directly; nevertheless, you may want to do something like this:
Imagine that your hash table is a fixed-size array of doubly linked lists and that it is OK for items to always be allocated/destroyed in the application layer. These conditions will not work for every case, but in many cases they will. Then you will have these data structures and sketches of functions and prototypes:
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

typedef struct HashItemCore HashItemCore;
struct HashItemCore
{
    HashItemCore *m_prev;
    HashItemCore *m_next;
};

typedef struct HashTable
{
    HashItemCore m_data[256]; // This is actually an array of circular
                              // doubly linked lists.
    int (*GetHashValue)(HashItemCore *item);
    bool (*CompareItems)(HashItemCore *item1, HashItemCore *item2);
    void (*ReleaseItem)(HashItemCore *item);
} HashTable;

void InitHash(HashTable *table)
{
    // Ensure that the user provided the callbacks.
    assert(table->GetHashValue != NULL && table->CompareItems != NULL && table->ReleaseItem != NULL);
    // Init all doubly linked lists. Pointers of an empty list should point to themselves.
    for (int i = 0; i < 256; ++i)
        table->m_data[i].m_prev = table->m_data[i].m_next = table->m_data + i;
}
void AddToHash(HashTable *table, void *item);
void *GetFromHash(HashTable *table, void *item);
....
void *ClearHash(HashTable *table);
In these functions you need to implement the logic of the hash table. While working, they will call the user-defined callbacks to find out the index of the slot and whether items are identical or not.
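As an illustration, AddToHash could look roughly like this (a sketch under the assumptions above; the hash value is masked with & 255 because the table has 256 buckets):
void AddToHash(HashTable *table, void *item)
{
    HashItemCore *node = item;                   // item must start with HashItemCore
    int slot = table->GetHashValue(node) & 255;  // pick one of the 256 buckets
    HashItemCore *head = &table->m_data[slot];
    // splice the node in right after the bucket's sentinel
    node->m_next = head->m_next;
    node->m_prev = head;
    head->m_next->m_prev = node;
    head->m_next = node;
}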
The users of this table should define their own structures and callback functions for every pair of types that they want to use:
typedef struct HashItemK1V1
{
    HashItemCore m_core;
    K1 key;
    V1 value;
} HashItemK1V1;

int CalcHashK1V1(void *p)
{
    HashItemK1V1 *param = (HashItemK1V1*)p;
    // App code.
}

bool CompareK1V1(void *p1, void *p2)
{
    HashItemK1V1 *param1 = (HashItemK1V1*)p1;
    HashItemK1V1 *param2 = (HashItemK1V1*)p2;
    // App code.
}

void FreeK1V1(void *p)
{
    HashItemK1V1 *param = (HashItemK1V1*)p;
    // App code if needed.
    free(p);
}
This approach will not provide type safety, because items will be passed around as void pointers on the assumption that every application structure starts with a HashItemCore member. This is a sort of hand-made polymorphism. It is maybe not perfect, but it will work.
I implemented this approach in C++ using templates. But if you strip out all the fancies of C++, in a nutshell it is exactly what I described above. I used my table in multiple projects and it worked like a charm.
A generic hashtable in C is a bad idea.
A neat implementation will require function pointers, which are slow, since these functions cannot be inlined (the general case will need at least two function calls per hop: one to compute the hash value and one for the final compare).
To allow inlining of the functions you'll either have to write the code manually, use a code generator, or use macros, which can get messy.
IIRC, the Linux kernel uses macros to create and maintain (some of?) its hashtables.
C does not have generic data types, so what you want to do (no extra allocations and no void* casting) is not really possible. You can use macros to generate the right data functions/structs on the fly, but you're trying to avoid macros as well.
So you need to give up at least one of your ideas.
You could have a generic data structure without extra allocations by allocating something like:
size_t key_len;
size_t val_len;
char key[];
char val[];
in one go and then handing out either void pointers, or adding an API for each specific type.
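One concrete way to lay that out (a sketch; C99 allows only one flexible array member, so key and value can share a single byte array and be addressed by offset):
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t key_len;
    size_t val_len;
    unsigned char data[];     /* key bytes followed by value bytes */
} hashmap_entry;

hashmap_entry *entry_new(const void *key, size_t key_len,
                         const void *val, size_t val_len)
{
    hashmap_entry *e = malloc(sizeof *e + key_len + val_len);
    if (e == NULL)
        return NULL;
    e->key_len = key_len;
    e->val_len = val_len;
    memcpy(e->data, key, key_len);
    memcpy(e->data + key_len, val, val_len);
    return e;
}

void *entry_key(hashmap_entry *e) { return e->data; }
void *entry_val(hashmap_entry *e) { return e->data + e->key_len; }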
Alternatively, if you have a limited number of types you need to handle, you could also tag the value with the right one so now each entry contains:
size_t key_len;
size_t val_len;
int val_type;
char key[];
char val[];
but in the API at least you can verify that the requested type is the right one.
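The check itself can be small, e.g. (a sketch built on a tagged layout like the one above; the tag names and values are made up):
#include <string.h>

typedef struct {
    size_t key_len;
    size_t val_len;
    int val_type;             /* one of the tags below */
    unsigned char data[];     /* key bytes followed by value bytes */
} tagged_entry;

enum { VAL_INT, VAL_STR };    /* hypothetical type tags */

/* returns 0 on success, -1 if the stored value is not an int */
int entry_get_int(const tagged_entry *e, int *out)
{
    if (e->val_type != VAL_INT)
        return -1;
    memcpy(out, e->data + e->key_len, sizeof *out);
    return 0;
}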
Otherwise, to make everything generic, you're left with either macros, or changing the language.
Below is a program written to implement a List using a C array:
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define INITIAL_ARRAY_SIZE 10

typedef struct{
    int *a;
    int lastItem; //Location of latest element stored in array
    int size;     //Size of array
}List;

void newList(List **lptr, int size){
    *lptr = malloc(sizeof(List));
    (*lptr)->a = calloc(size, sizeof(int));
    (*lptr)->lastItem = -1;
    (*lptr)->size = size;
}

List* insertItem(List *lptr, int newItem){
    if(lptr->lastItem + 1 == lptr->size){
        List *newLptr = NULL;
        newList(&newLptr, 2*lptr->size);
        memcpy(newLptr->a, lptr->a, ((lptr->lastItem)+1) * sizeof(int));
        newLptr->lastItem = lptr->lastItem;
        newLptr->size = 2*lptr->size;
        newLptr->a[++(newLptr->lastItem)] = newItem;
        free(lptr->a);
        free(lptr);
        return newLptr;
    }
    lptr->a[++(lptr->lastItem)] = newItem;
    return lptr;
}

int main(void){
    List *lptr = NULL;
    newList(&lptr, INITIAL_ARRAY_SIZE);
    lptr = insertItem(lptr, 6);
    for(int i=0; i < INITIAL_ARRAY_SIZE;i++){
        printf("Item: %d\n", lptr->a[i]);
    }
    printf("last item value: %d\n", lptr->lastItem);
}
The above code is written to follow abstraction.
How do I ensure encapsulation and polymorphism in this code?
Encapsulation, abstraction, and polymorphism. Because these three things blur together, their meanings are fuzzy, and they're kinda difficult to do in C, here's how I'm defining them for this answer.
Encapsulation restricts, or in the case of C discourages, knowledge of how the underlying thing works. Ideally the data and methods are bundled together.
Abstraction hides complexity from the user, generally through a well defined interface applicable to multiple scenarios.
Polymorphism allows the same interface to be used for multiple types.
They build on each other. Very generally, encapsulation allows abstraction allows polymorphism.
First, let's start with an encapsulation violation.
for(int i=0; i < INITIAL_ARRAY_SIZE;i++){
printf("Item: %d\n", lptr->a[i]);
}
INITIAL_ARRAY_SIZE is not part of lptr. It's some external data. If you pass lptr around, INITIAL_ARRAY_SIZE won't go with it. So that loop violates encapsulation. Your list is not well encapsulated. The size should be a detail that is either part of the List struct or not necessary at all.
You could add the size to the struct and use that to iterate.
for(int i=0; i < lptr->size; i++){
printf("Item: %d\n", lptr->a[i]);
}
But this still has the user poking at struct details. To avoid this you could add an iterator and the user never knows about the size at all. This is like the C++ vector interface but more awkward because C lacks method calls.
ListIter iter;
int *value;
/* Associate the iterator with the List */
ListIterInit(&iter, lptr);
/* ListIterNext returns a pointer so it can use NULL to stop */
/* Otherwise you can't store 0 */
while( (value = ListIterNext(&iter)) != NULL ) {
    printf("Item: %d\n", *value);
}
Now the struct has full control over how things iterate and how it stores things.
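ListIter, ListIterInit and ListIterNext are not standard; a minimal sketch of how they might be written against the List struct from the question:
typedef struct {
    List *list;
    int pos;              /* next index to hand out */
} ListIter;

void ListIterInit(ListIter *iter, List *lptr)
{
    iter->list = lptr;
    iter->pos = 0;
}

int *ListIterNext(ListIter *iter)
{
    if (iter->pos > iter->list->lastItem)
        return NULL;                      /* past the last stored element */
    return &iter->list->a[iter->pos++];
}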
This iterator interface is inspired by Gnome Lib Hash Tables.
This iterator interface also provides abstraction. We've removed details about the struct. Now it's a thing you just iterate through. You don't need to know how the data is stored or how much there is or even if it's stored at all. It could be generated on the fly for all you know. This is the beauty of the iterator pattern.
...except we still need to know the type. This can be fixed in two ways. The first is by telling the list how big each element is. Rather than modifying yours, let's look at how Gnome Lib does it with their arrays.
GArray *garray = g_array_new (FALSE, FALSE, sizeof (gint));
for (i = 0; i < 10000; i++) {
    g_array_append_val (garray, i);
}
The array is told to store things sizeof(gint), which it remembers. Then all other array operations are encapsulated in a function. Even getting an element out is encapsulated.
gint value = g_array_index (garray, gint, 5);
This is done with a clever macro that does the typecasting for you.
The second option is to store all data as pointers. I'll leave it as an exercise for you to look at Gnome Lib's pointer arrays.
And will you look at that? We have polymorphism. A single array struct can now handle data of all types.
This isn't particularly easy to do in C. It involves some macro juggling. It's good to look at things like Gnome Lib to get, conceptually, how it's done. Maybe try to do it yourself for practice and understanding.
But for production just use things like Gnome Lib. There's a tremendous number of edge cases and little details that they've thought through.
I'm implementing a set of common yet not so trivial (or error-prone) data structures for C (here) and just came with an idea that got me thinking.
The question in short is, what is the best way to implement two structures that use similar algorithms but have different interfaces, without having to copy-paste/rewrite the algorithm? By best, I mean most maintainable and debug-able.
I think it is obvious why you wouldn't want to have two copies of the same algorithm.
Motivation
Say you have a structure (call it map) with a set of associated functions (map_*()). Since the map needs to map anything to anything, we would normally implement it taking a void *key and void *data. However, think of a map of int to int. In this case, you would need to store all the keys and data in another array and give their addresses to the map, which is not so convenient.
Now imagine if there was a similar structure (call it mapc, c for "copies") that during initialization takes sizeof(your_key_type) and sizeof(your_data_type) and given void *key and void *data on insert, it would use memcpy to copy the keys and data in the map instead of just keeping the pointers. An example of usage:
int i;
mapc m;
mapc_init(&m, sizeof(int), sizeof(int));
for (i = 0; i < n; ++i)
{
int j = rand(); /* whatever */
mapc_insert(&m, &i, &j);
}
which is quite nice, because I don't need to keep another array of i's and j's.
My ideas
In the example above, map and mapc are very closely related. If you think about it, map and set structures and functions are also very similar. I have thought of the following ways to implement their algorithm only once and use it for all of them. Neither of them however are quite satisfying to me.
Use macros. Write the function code in a header file, leaving the structure-dependent stuff as macros. For each structure, define the proper macros and include the file:
map_generic.h
#define INSERT(x) x##_insert

int INSERT(NAME)(NAME *m, PARAMS)
{
    // create node
    ASSIGN_KEY_AND_DATA(node)
    // get m->root
    // add to tree starting from root
    // rebalance from node to root
    // etc
}
map.c
#define NAME map
#define PARAMS void *key, void *data
#define ASSIGN_KEY_AND_DATA(node) \
do {\
node->key = key;\
node->data = data;\
} while (0)
#include "map_generic.h"
mapc.c
#define NAME mapc
#define PARAMS void *key, void *data
#define ASSIGN_KEY_AND_DATA(node) \
do {\
memcpy(node->key, key, m->key_size);\
memcpy(node->data, data, m->data_size);\
} while (0)
#include "map_generic.h"
This method is not half bad, but it's not so elegant.
Use function pointers. For each part that is dependent on the structure, pass a function pointer.
map_generic.c
int map_generic_insert(void *m, void *key, void *data,
                       void (*assign_key_and_data)(void *, void *, void *, void *),
                       void *(*get_root)(void *))
{
    // create node
    assign_key_and_data(m, node, key, data);
    root = get_root(m);
    // add to tree starting from root
    // rebalance from node to root
    // etc
}
map.c
static void assign_key_and_data(void *m, void *node, void *key, void *data)
{
    map_node *n = node;
    n->key = key;
    n->data = data;
}

static void *get_root(void *m)
{
    return ((map *)m)->root;
}

int map_insert(map *m, void *key, void *data)
{
    return map_generic_insert(m, key, data, assign_key_and_data, get_root);
}

mapc.c

static void assign_key_and_data(void *m, void *node, void *key, void *data)
{
    map_node *n = node;
    mapc *mc = m;
    memcpy(n->key, key, mc->key_size);
    memcpy(n->data, data, mc->data_size);
}

static void *get_root(void *m)
{
    return ((mapc *)m)->root;
}

int mapc_insert(mapc *m, void *key, void *data)
{
    return map_generic_insert(m, key, data, assign_key_and_data, get_root);
}
This method requires writing more functions that could have been avoided in the macro method (as you can see, the code here is longer) and doesn't allow optimizers to inline the functions (as they are not visible to map_generic.c file).
So, how would you go about implementing something like this?
Note: I wrote the code in the stack-overflow question form, so excuse me if there are minor errors.
Side question: Does anyone have a better idea for a suffix that says "this structure copies the data instead of the pointer"? I use c, which says "copies", but there could be a much better word for it in English that I don't know about.
Update:
I have come up with a third solution. In this solution, only one version of the map is written, the one that keeps a copy of data (mapc). This version would use memcpy to copy data. The other map is an interface to this, taking void *key and void *data pointers and sending &key and &data to mapc so that the address they contain would be copied (using memcpy).
This solution has the downside that a normal pointer assignment is done by memcpy, but it completely solves the issue otherwise and is very clean.
Alternatively, one can only implement the map and use an extra vectorc with mapc which first copies the data to vector and then gives the address to a map. This has the side effect that deletion from mapc would either be substantially slower, or leave garbage (or require other structures to reuse the garbage).
Update 2:
I came to the conclusion that careless users might use my library the way they write C++, copy after copy after copy. Therefore, I am abandoning this idea and accepting only pointers.
You roughly covered both possible solutions.
The preprocessor macros roughly correspond to C++ templates and have the same advantages and disadvantages:
They are hard to read.
Complex macros are often hard to use (consider type safety of parameters etc.)
They are just "generators" of more code, so in the compiled output a lot of duplicity is still there.
On other side, they allow compiler to optimize a lot of stuff.
The function pointers roughly correspond to C++ polymorphism and they are, IMHO, the cleaner and generally easier-to-use solution, but they bring some cost at runtime (in tight loops, a few extra function calls can be expensive).
I generally prefer the function calls, unless the performance is really critical.
There's also a third option that you haven't considered: you can create an external script (written in another language) to generate your code from a series of templates. This is similar to the macro method, but you can use a language like Perl or Python to generate the code. Since these languages are more powerful than the C pre-processor, you can avoid some of the potential problems inherent in doing templates via macros. I have used this method in cases where I was tempted to use complex macros like in your example #1. In the end, it turned out to be less error-prone than using the C preprocessor. The downside is that between writing the generator script and updating the makefiles, it's a little more difficult to get set up initially (but IMO worth it in the end).
What you're looking for is polymorphism. C++, C#, or other object-oriented languages are more suitable for this task. That said, many people have tried to implement polymorphic behavior in C.
The Code Project has some good articles/tutorials on the subject:
http://www.codeproject.com/Articles/10900/Polymorphism-in-C
http://www.codeproject.com/Articles/108830/Inheritance-and-Polymorphism-in-C