Very fast hash table lookup in C (e.g. by MPH) - c

I need a very fast hash table in C (or C++). Conditions are like this:
There exist N known keys which shall map to an object (with some state)
There exist more unknown keys which do not map to anything
Because all keys which map to an object are known at startup (but not at compile-time), it's okay if building the hashtable is expensive. However, it's required that lookup is (very) fast.
I thought about using cmph (for a minimal perfect hash function). The hash table would be built with the N keys and at runtime I would do a query like this:
const cmph_uint32 id = cmph_search(hash, &key, sizeof(key));
if (id >= N) {
return; // object not found
}
const MyState *state = &states[id];
if (state->key != key) {
return; // object not found
}
// object found
By storing the actual key in the state, it should be possible to detect an invalid collision. However, I'm not sure if calling cmph_search with an "unknown" key is undefined behavior (e.g. weird memory access or something).
Maybe someone has a better idea? Or maybe someone knows if calling cmph_search with an unknown key is fine?

Hard to tell just by looking at the (pretty much non-existent) documentation of CMPH. Digging into the source code seems simpler. The implementation of the internal hash function used by CMPH can be found in the hash() function, which ends up calling __jenkins_hash_vector(). This hash function was originally designed by Robert J. Jenkins Jr. in 1997 and can be found here. As far as the function is concerned, nothing weird happens with the key used by the function, so this hash function can be safely used even for invalid (non present) keys.
The cmph_search() function calls the correct *_search() function based on the algorithm you configured (CHD, BDZ, BMZ, and so on). Then hash() is called and the resulting values are used in different ways depending on the algorithm.
For simpler algorithms such as BMZ, BMZ8 and FCH I can see that the hash is simply used to index an internal array (mphf->data->g). All the accesses are performed modulo the size of this array (mphf->data->n) so this looks fine. Just from this, I would say if you are using these algorithms you are safe. For more complex algorithms like BDZ it's a bit harder to understand what is really going on and where and how the calculated hashes are actually used.
Taking a look at the tests implemented in the library source, (for example at this one), we can see that the author uses a logic similar to yours to detect whether a key is a duplicate or unknown:
/* ... */
cmph_uint32 siz = cmph_size(mphf);
hashtable = (cmph_uint8*)malloc(siz*sizeof(cmph_uint8));
memset(hashtable, 0, (size_t)siz);
//check all keys
for (i = 0; i < source->nkeys; ++i)
{
cmph_uint32 h;
char *buf;
cmph_uint32 buflen = 0;
source->read(source->data, &buf, &buflen);
h = cmph_search(mphf, buf, buflen);
if (!(h < siz))
{
fprintf(stderr, "Unknown key %*s in the input.\n", buflen, buf);
ret = 1;
} else if(hashtable[h])
{
fprintf(stderr, "Duplicated or unknown key %*s in the input\n", buflen, buf);
ret = 1;
} else hashtable[h] = 1;
if (verbosity)
{
printf("%s -> %u\n", buf, h);
}
source->dispose(source->data, buf, buflen);
}
/* ... */
The only thing that's missing from the above code is storing and comparing the keys like you are doing in your example. At the end of the day, it looks to me like calling cmph_search() with an unknown key is fine. Understanding whether the key is unknown or not given the resulting hash is then the job of the library user.
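To make the "store and compare the key" step concrete, here is a minimal sketch. It does not call CMPH at all: mph_fn and toy_hash are hypothetical stand-ins for whatever MPH query you use (such as cmph_search), and MyState is an assumed layout based on the question. The only point is the two checks after the hash: range first, then key equality.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed state layout: the known key is stored next to its payload. */
typedef struct {
    uint64_t key;
    int value;
} MyState;

/* Hypothetical stand-in for an MPH query. For unknown keys it may
   return any index, including one that is out of range. */
typedef uint32_t (*mph_fn)(uint64_t key);

/* Returns the state for key, or NULL if key is not one of the n known keys. */
const MyState *lookup(const MyState *states, uint32_t n, mph_fn hash, uint64_t key)
{
    assert(states != NULL && hash != NULL);
    uint32_t id = hash(key);
    if (id >= n)
        return NULL;                      /* index out of range: unknown key */
    const MyState *state = &states[id];
    return state->key == key ? state : NULL; /* reject colliding unknown keys */
}

/* Toy hash for the demo only: perfect for the keys 10, 20 and 30. */
static uint32_t toy_hash(uint64_t key)
{
    return (uint32_t)(key / 10 - 1);
}
```

With states holding the keys 10, 20 and 30, a query for 25 lands on the slot of 20 and is rejected by the key comparison, while a query for 45 is rejected by the range check.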


scanf inside function to return value (or other function)

so i was going to run a function in an infinite loop which takes a number input, but then I remembered I couldn't do
while (true) {
myfunc(scanf("%d"));
}
because I need to put the scanf input into a variable. I can't do scanf("%*d") because that doesn't return the value at all. I don't want to have to do
int temp;
while (true) {
scanf("%d", &temp);
myfunc(temp);
}
or include more libraries. Is there any standard single function like gets (I could do myfunc((int) strtol(gets(), (char**) NULL, 10)); but it's kinda messy sooo yeah)
sorry if I'm asking too much or being pedantic and I should do ^
btw unrelated question is there any way to declare a string as an int--- or even better, a single function for converting int to string? I usually use
//num is some number
char* str = (char*) malloc(12);
sprintf(str, "%d", num);
func(str);
but wouldn't func(str(num)); be easier?
For starters, the return value of scanf (and similar functions) is the number of conversions that took place. That return value is also used to signify if an error occurred.
In C you must manually manage these errors.
if ((retv = scanf("%d", &n)) != 1) {
/* Something went wrong. */
}
What you seem to be looking for are conveniences found in higher-level languages. Languages & runtimes that can hide the details from you with garbage collection strategies, exception nets (try .. catch), etc. C is not that kind of language, as by today's standards it is quite a low-level language. If you want "non-messy" functions, you will have to build them up from scratch, but you will have to decide what kinds of tradeoffs you can live with.
For example, perhaps you want a simple function that just gets an int from the user. A tradeoff you could make is that it simply returns 0 on any error whatsoever, in exchange for never knowing if this was an error, or the user actually input 0.
int getint(void) {
int n;
if (scanf("%d", &n) != 1)
return 0;
return n;
}
This means that if a user makes a mistake on input, you have no way of retrying, and the program must simply roll on ahead.
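If you would rather not give up the ability to tell an error from a real 0, one possible alternative is to report success separately from the value. The sketch below is only an illustration: parse_int is a hypothetical helper (not a standard function) that parses a string you have already read, for example with fgets, using strtol.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parses the whole string s as a decimal int.
   Returns true and stores the value in *out on success, false otherwise. */
bool parse_int(const char *s, int *out)
{
    assert(s != NULL && out != NULL);
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (end == s)
        return false;                     /* no digits at all */
    if (*end != '\0' && *end != '\n')
        return false;                     /* trailing junk such as "12x" */
    if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
        return false;                     /* does not fit in an int */
    *out = (int)v;
    return true;
}
```

The caller can now loop until parse_int returns true, so a typo no longer forces the program to roll on with a bogus 0. The price is exactly the "messiness" discussed here: an extra variable and an explicit check at every call site.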
This naive approach scales poorly with the fact that you must manually manage memory in C. It is up to you to free any memory you dynamically allocate.
You could certainly write a simple function like
char *itostr(int n) {
char *r = malloc(12);
if (r && sprintf(r, "%d", n) < 1) {
r[0] = '0';
r[1] = '\0';
}
return r;
}
which does the most minimal of error checking (Again, we don't know if "0" is an error, or a valid input).
The problem comes when you write something like func(itostr(51));, unless func is to be expected to free its argument (which would rule out passing non-dynamically allocated strings), you will constantly be leaking memory with this pattern.
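One common way around the leak is to let the caller own the storage. The following is a sketch under that assumption; int_to_str is a made-up name, and the buffer size is whatever the caller chooses to provide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats n into a caller-supplied buffer.
   Returns buf on success, or NULL if the buffer is too small. */
char *int_to_str(int n, char *buf, size_t size)
{
    assert(buf != NULL || size == 0);
    int len = snprintf(buf, size, "%d", n);
    if (len < 0 || (size_t)len >= size)
        return NULL;                      /* encoding error or truncated */
    return buf;
}
```

A call then looks like char buf[12]; func(int_to_str(51, buf, sizeof buf)); and nothing needs to be freed, because the buffer lives on the caller's stack. It is still not a one-liner, and func must cope with NULL if the buffer could be too small.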
So no there is no real "easy" way to do these things. You will have to get "messy" (handle errors, manage memory, etc.) if you want to build anything with complexity.

Should you check parameters passed into function before passing them, or check them in the function?

As a good practice, do you think one should verify passed parameters within a function to which the parameters are being passed, or simply make sure the function will always accept correct parameters?
Consider the following code:
Matrix * add_matrices(const Matrix * left, const Matrix * right)
{
assert(left->rowsCount == right->rowsCount
&& left->colsCount == right->colsCount);
int rowsCount = left->rowsCount;
int colsCount = left->colsCount;
Matrix * matOut = create_matrix(rowsCount, colsCount);
int i = 0;
int j = 0;
for (i = 0; i < rowsCount; ++i)
{
for (j = 0; j < colsCount; ++j) /* reset j for every row */
{
matOut->matrix[i][j] = left->matrix[i][j] + right->matrix[i][j];
}
}
return matOut;
}
Do you think I should check the parameters before passing them to the function or after, i.e. in the function? What is a better practice or is it programmer dependent?
Inside. The function can be viewed as an individual component.
Its author is best placed to define any preconditions and check them.
Checking them outside presupposes the caller knows the preconditions which may not be the case.
Also by placing them inside the function you're assured every call is checked.
You should also check any post-conditions before leaving the function.
For example if you have a function called int assertValid(const Matrix*matrix) that checks integrity of the object (e.g. the data is not a NULL pointer) you could call it on entry to all functions and before returning from functions that modify a Matrix.
Consistent use of pre- and post-condition integrity checks is an enormously effective way of ensuring quality and localising faults.
In practice zealous conformance to this rule usually results in unacceptable performance. The assert() macro or a similar conditional compilation construct is a great asset. See <assert.h>.
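As a rough illustration of checking on entry and before return, here is a sketch. The Matrix layout (a flat row-major array) and the helpers assert_valid, create_matrix and destroy_matrix are assumptions made for this example; the question does not show the real type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layout, not the asker's actual Matrix. */
typedef struct {
    int rowsCount;
    int colsCount;
    int *data;                 /* rowsCount * colsCount elements, row-major */
} Matrix;

/* Integrity check; compiled out entirely when NDEBUG is defined. */
static void assert_valid(const Matrix *m)
{
    assert(m != NULL);
    assert(m->rowsCount > 0 && m->colsCount > 0);
    assert(m->data != NULL);
}

Matrix *create_matrix(int rows, int cols)
{
    if (rows <= 0 || cols <= 0)
        return NULL;
    Matrix *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    m->rowsCount = rows;
    m->colsCount = cols;
    m->data = calloc((size_t)rows * (size_t)cols, sizeof *m->data);
    if (m->data == NULL) {
        free(m);
        return NULL;
    }
    return m;
}

void destroy_matrix(Matrix *m)
{
    if (m != NULL) {
        free(m->data);
        free(m);
    }
}

Matrix *add_matrices(const Matrix *left, const Matrix *right)
{
    /* Pre-conditions: both inputs are intact and have the same shape. */
    assert_valid(left);
    assert_valid(right);
    assert(left->rowsCount == right->rowsCount
        && left->colsCount == right->colsCount);

    Matrix *out = create_matrix(left->rowsCount, left->colsCount);
    if (out == NULL)
        return NULL;           /* allocation failure is a runtime error, not a bug */
    size_t n = (size_t)out->rowsCount * (size_t)out->colsCount;
    for (size_t i = 0; i < n; ++i)
        out->data[i] = left->data[i] + right->data[i];

    assert_valid(out);         /* post-condition */
    return out;
}
```

Note the split: programming errors (bad shape, corrupt object) go through assert and vanish in release builds, while a failed allocation is reported through the return value because it can happen in a correct program.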
Depends if the function is global in scope or local static.
A global function cannot control what calls it. Defensive coding will perform validation of the arguments received. But how much validation to do?
int my_abs(int x) {
assert(x >= -INT_MAX);
return abs(x);
}
The above example, in a debug build, checks to ensure the absolute value function will succeed, as abs(INT_MIN) may be a problem. Whether this check should also be in production builds is another question.
int some_string(char *s) {
assert(s != NULL);
...
}
In some_string() the test for NULL-ness may be dropped, as the function definition may state that s must be a string. NULL is not a C string, but it is only 1 of many bad pointers that could be passed which do not point to a string, so this test provides limited validation.
With static functions, the code is under local control. Argument validation could occur by the function, the caller, both or neither. That selection is code dependent.
A counter-example exists with user/file input. Basic data qualification should occur promptly.
int GetDriversAge(FILE *inf) {
int age;
if (fscanf(inf, "%d", &age) != 1) Handle_Error();
if (age < 16 || age > 122) Handle_Error();
return age;
}
In OP's example, parameter checking is done by the function, not the caller. Without the equivalence test, the function can easily fail in mysterious ways. The cost of this check here is a small fraction of the code's work. That makes it a good check as expensive checks (time, complexity) can cause more trouble than they solve. Note that if the calling code did this test and add_matrices() was called from N places, then that checking code is replicated N times in various, perhaps, inconsistent ways.
Matrix * add_matrices(const Matrix * left, const Matrix * right) {
assert(left->rowsCount == right->rowsCount
&& left->colsCount == right->colsCount);
Conclusion: more compelling reasons to check the parameters in the function than in the caller though exceptions exist.
What I do is to check the parameters inside the function and act accordingly (throw exceptions, return error messages, etc.). I suppose it's the function's job to check whether the passed parameters are of the correct data type and contain valid values.
The function should perform its task correctly; otherwise, it should throw an exception. The client/consuming code may or may not do a check, depending on the data source and how much you trust it. Either way, you should also enclose the function call in a try-catch block to catch an invalid argument exception.
EDIT:
Sorry, I confused C for C++. Instead of throwing an exception, you can return null. The client doesn't necessarily have to check the data before calling (depending on the data source and other factors like performance constraints), but must always check for null as a return value.

C Programming: how to avoid code duplication without losing clarity

edit: Thanks to all repliers. I should have mentioned in my original post that I am not allowed to change any of the specifications of these functions, so solutions using assertions and/or allowing to dereference NULL are out of the question.
With this in mind, I gather that it's either I go with a function pointer, or just leave the duplication as it is. For the sake of clarity I'd like to avoid function pointers this time.
original:
I am trying to avoid code duplication without losing clarity.
Often when working on a specific assignment (uni, undergrad) I recognize these patterns in what functions check and return, but I don't always find a "great-job" solution.
What would any of you suggest I should do (pointers to functions, macros, etc.) with these three C functions that check some of their arguments in the same way to make the checking more modular (it should be more modular, right?)?
BTW these are taken directly from a HW assignment, so the details of their functionality are not concerning my question, only the arguments checking at the function's top.
TeamResult teamIsDuplicateCoachName(Team team, bool* isDuplicate) {
TeamResult result = TEAM_SUCCESS;
if (!team || !isDuplicate) {
result = TEAM_NULL_ARGUMENT;
} else if (teamEmpty(team)) {
result = TEAM_IS_EMPTY;
} else {
for (int i = 0; i < team->currentFormations; ++i) {
if (teamIsPlayerInFormation(team->formations[i], team->coach)) {
*isDuplicate = true;
break;
}
}
}
return result;
}
TeamResult teamGetWinRate(Team team, double* winRate) {
TeamResult result = TEAM_SUCCESS;
if (!team || !winRate) {
result = TEAM_NULL_ARGUMENT;
} else {
int wins = 0, games = 0;
for (int i = 0; i < team->currentFormations; ++i) {
Formation formation = team->formations[i];
if (formationIsComplete(formation)) {
games += formation->timesPlayed;
wins += formation->timesWon;
}
}
double win = ( games == 0 ) ? 0 : (double) wins / games;
assert(win >= 0 && win <= 1);
*winRate = win;
}
return result;
}
TeamResult teamGetNextIncompleteFormation(Team team, Formation* formation,
int* index) {
TeamResult result = TEAM_SUCCESS;
if (!team || !formation || !index) {
result = TEAM_NULL_ARGUMENT;
} else {
*formation = NULL; /* default result, will be returned if there are no incomplete formations */
for (int i = 0; i < team->currentFormations; ++i) {
Formation formationPtr = team->formations[i];
if (!formationIsComplete(formationPtr)) {
*formation = formationPtr;
*index = i;
break;
}
}
}
return result;
}
Any advice on how (specifically) to avoid the code duplication would be appreciated.
Thanks for your time! :)
It looks like it's a coding mistake to pass nulls to these functions. There are three main ways to deal with this situation:
1) Handle the erroneous nulls and return an error value. This introduces extra code which checks the arguments to return error values, and extra code around every call site, which now has to handle the error return values. Probably none of this code is tested, since if you knew that code was mistakenly passing nulls you'd just fix it.
2) Use assert to check validity of arguments, resulting in a clean error message, clear to read preconditions, but some extra code.
3) Have no precondition checks, and debug segfaults when you dereference a NULL.
In my experience 3 is usually the best approach. It adds zero extra code, and a segfault is usually just as easy to debug as the clean error message you'd get from 2. However, you'll find many software engineers who would prefer 2, and it's a matter of taste.
Your code, which is pattern 1, has some significant downsides. First, it's adding extra code which can't be optimised away. Second, more code means more complexity. Third, it's unclear if the functions are supposed to be able to accept broken arguments, or if the code's just there to help debugging when things are wrong.
I would create a function to check the team object:
TeamResult TeamPtrCheck(Team team)
{
if (team == NULL)
return TEAM_NULL_ARGUMENT;
else if (teamEmpty(team))
return TEAM_IS_EMPTY;
else
return TEAM_SUCCESS;
}
And then call that, plus your other checks, at the top of each function, for example
TeamResult result = TeamPtrCheck(team);
if (result != TEAM_SUCCESS)
return result;
if (winRate == NULL)
return TEAM_NULL_ARGUMENT;
Otherwise, if each function is different then leave the checks as different!
If you are concerned about the duplication of the NULL checks at the start of each function, I wouldn't be. It makes it clear to the user that you are simply doing input validation prior to doing any work. No need to worry about the few lines.
In general, don't sweat the small stuff like this.
There are a few techniques to reduce the redundancy you perceive; which one is applicable depends heavily on the nature of the condition you are checking. In any case, I would advise against any (preprocessor) tricks to reduce duplication which hide what is actually happening.
If you have a condition that should not happen, one concise way to check for it is to use an assert. With an assert you basically say: This condition must be true, otherwise my code has a bug, please check if my assumption is true, and kill my program immediately if it's not. This is often used like this:
#include <assert.h>
void foo(int a, int b) {
assert((a < b) && "some error message that should appear when the assert fails (a failing assert prints its argument)");
//do some sensible stuff assuming a is really smaller than b
}
A special case is the question whether a pointer is null. Doing something like
void foo(int* bar) {
assert(bar);
*bar = 3;
}
is pretty pointless, because dereferencing a null pointer will securely segfault your program on any sane platform, so the following will just as securely stop your program:
void foo(int* bar) {
*bar = 3;
}
Language lawyers may not be happy with what I'm saying because, according to the standard, dereferencing a null pointer is undefined behaviour, and technically the compiler would be allowed to produce code that formats your harddrive. However, dereferencing a null pointer is such a common error that you can expect your compiler not to do stupid things with it, and you can expect your system to take special care to ensure that the hardware will scream if you try to do it. This hardware check comes for free, the assert takes a few cycles to check.
The assert (and segfaulting null pointers), however, is only suitable for checking for fatal conditions. If you are just checking for a condition that makes any further work inside a function pointless, I would not hesitate to use an early return. It is usually much more readable, especially since syntax highlighting readily reveals the return statements to the reader:
void foo(int a, int b) {
if(a >= b) return;
//do something sensible assuming a < b
}
With this paradigm, your first function would look like this:
TeamResult teamIsDuplicateCoachName(Team team, bool* isDuplicate) {
if(!team || !isDuplicate) return TEAM_NULL_ARGUMENT;
if(teamEmpty(team)) return TEAM_IS_EMPTY;
for (int i = 0; i < team->currentFormations; ++i) {
if (teamIsPlayerInFormation(team->formations[i], team->coach)) {
*isDuplicate = true;
break;
}
}
return TEAM_SUCCESS;
}
I believe this is much clearer and more concise than the version with the if around the body.
This is more or less a design question. If the functions above are all static functions (or only one is extern), then the whole "bundle of function" should check the condition - execution flow-wise - once for each object and let the implementation details of lower level functions assume that input data is valid.
For example, if you go back to wherever the team is created, allocated and initialized, and wherever the formation is created, allocated and initialized, and build rules there that ensure that every created team exists and that no duplicate exists, you will not have to validate the input because by definition/construction it will always be valid. These are examples of preconditions. Invariants would be the persistence of the truthfulness of these definitions (no function may alter invariant states upon return) and postconditions would be somewhat the opposite (for example when they are free'd but pointers still exist somewhere).
That being said, when manipulating "object-like" data in C, my personal preference is to create extern functions that create, return and destroy such objects. If the members are kept static within the .c files with minimal .h interface, you get something conceptually similar to object oriented programming (though you can never make members fully "private").
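A minimal sketch of that create/destroy style, using an opaque type, might look like the following. Everything here is hypothetical and deliberately tiny: this Team is a plain struct with one member, not the Team from the assignment, and the header and source parts are shown in one block only for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* --- what would go in team.h: the type is declared but not defined --- */
typedef struct Team Team;
Team *team_create(const char *coach);
void team_destroy(Team *team);
const char *team_coach(const Team *team);

/* --- what would go in team.c: only this file can see the members --- */
struct Team {
    char *coach;
};

/* The single place where validity is established. */
Team *team_create(const char *coach)
{
    if (coach == NULL)
        return NULL;
    Team *team = malloc(sizeof *team);
    if (team == NULL)
        return NULL;
    team->coach = malloc(strlen(coach) + 1);
    if (team->coach == NULL) {
        free(team);
        return NULL;
    }
    strcpy(team->coach, coach);
    return team;
}

void team_destroy(Team *team)
{
    if (team != NULL) {
        free(team->coach);
        free(team);
    }
}

/* Lower-level functions may assume a valid object by construction. */
const char *team_coach(const Team *team)
{
    assert(team != NULL);
    return team->coach;
}
```

Because callers cannot build a Team except through team_create, the validity checks live in one place and the remaining functions can rely on them.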
Thanks to all repliers. I should have mentioned in my original post that I am not allowed to change any of the specifications of these functions, so solutions using assertions and/or allowing to dereference NULL are out of the question, though I'll consider them for other occasions.
With this in mind, I gather that it's either I go with a function pointer, or just leave the duplication as it is. For the sake of clarity I'd like to avoid function pointers this time.

expand [x-y] and {a,b,c} in strings -- like glob(3) but without filename matching

I'm looking for a canned C routine that does what glob(3) does, except without matching the results against filenames, e.g.
input: "x[1-4]y"
output: "x1y", "x2y", "x3y", "x4y"
regardless of whether any files with those names happen to exist. EDIT: This doesn't need to produce the list all at once; in fact it would be better if it had an iterator-style "give me the next name now" API, as the list could be enormous.
Obviously this cannot support * and ?, but that's fine; I only need the [a-z] notation. Support for the {foo,bar,baz} notation would be nice too.
Best option is telling me the name of a routine that is already in everybody's C library that does this. Second best would be a pointer to a chunk of BSD-licensed (or more permissively) code. GPL code would be awkward, but I could live with it.
cURL (the command line tool, not the library) contains code that does this job, which is relatively easy to extract:
https://github.com/bagder/curl/blob/master/src/tool_urlglob.c
https://github.com/bagder/curl/blob/master/src/tool_urlglob.h
They'll have to be edited to remove some dependencies on the guts of cURL that are not part of the public library interface. The API is a little confusing, so here's some wrapper code I wrote:
#include "tool_urlglob.h"
struct url_iter
{
char **upats;
URLGlob *uglob;
int nglob;
};
static inline struct url_iter
url_prep(char **upats)
{
struct url_iter it;
it.upats = upats;
it.uglob = NULL;
it.nglob = -1;
return it;
}
static char *
url_next(struct url_iter *it)
{
char *url;
if (!it->uglob) {
for (;;) {
if (!*it->upats)
return 0;
if (!glob_url(&it->uglob, *it->upats, &it->nglob, stderr))
break;
it->upats++;
}
}
if (glob_next_url(&url, it->uglob))
abort();
if (--it->nglob == 0) {
glob_cleanup(it->uglob);
it->uglob = 0;
it->upats++;
}
return url;
}
Pass an array of strings to url_prep, call url_next on the result until it returns NULL. Strings returned from url_next must be deallocated with free when you're done with them.
Here's a sketch of how I'd write the iterator:
Count the instances of [ in the string. This will be the number of "dimensions" you iterate over.
For each dimension, establish a range of values based on the number of characters in the bracket expression.
Simply iterate an n-tuple of integers over these ranges, and use the resulting values as indices into the bracket expressions to expand the string based on the values.
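The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: BracketIter, bracket_iter_init and bracket_iter_next are made-up names, only the exact form [x-y] with single characters is accepted, at most MAX_DIMS brackets are supported, and the {foo,bar} notation is not handled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DIMS 8

typedef struct {
    const char *pattern;
    int ndims;                      /* number of [x-y] groups found */
    char lo[MAX_DIMS], hi[MAX_DIMS], cur[MAX_DIMS];
    bool done;
} BracketIter;

/* Scans the pattern once and records one range per bracket.
   Returns false for malformed patterns or too many brackets. */
bool bracket_iter_init(BracketIter *it, const char *pattern)
{
    assert(it != NULL && pattern != NULL);
    it->pattern = pattern;
    it->ndims = 0;
    it->done = false;
    for (const char *p = pattern; *p; ++p) {
        if (*p != '[')
            continue;
        if (p[1] == '\0' || p[2] != '-' || p[3] == '\0' || p[4] != ']')
            return false;           /* not exactly "[x-y]" */
        if (it->ndims == MAX_DIMS || p[1] > p[3])
            return false;
        it->lo[it->ndims] = p[1];
        it->hi[it->ndims] = p[3];
        it->cur[it->ndims] = p[1];
        it->ndims++;
        p += 4;                     /* now on ']', the loop steps past it */
    }
    return true;
}

/* Writes the next expansion into out. Returns false when the sequence
   is exhausted or when out (size bytes) is too small. */
bool bracket_iter_next(BracketIter *it, char *out, size_t size)
{
    if (it->done || size == 0)
        return false;
    size_t n = 0;
    int dim = 0;
    for (const char *p = it->pattern; *p; ++p) {
        char c = *p;
        if (c == '[') {
            c = it->cur[dim++];     /* substitute the current value */
            p += 4;
        }
        if (n + 1 >= size)
            return false;
        out[n++] = c;
    }
    out[n] = '\0';

    /* Advance like an odometer, rightmost bracket fastest. */
    int d = it->ndims - 1;
    while (d >= 0 && it->cur[d] == it->hi[d]) {
        it->cur[d] = it->lo[d];
        --d;
    }
    if (d < 0)
        it->done = true;
    else
        it->cur[d]++;
    return true;
}
```

For the pattern "x[1-4]y" successive calls produce x1y, x2y, x3y and x4y, and the next call returns false. Only one name exists at a time, so the total number of expansions can be enormous without costing memory.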

Designing Around a Large Number of Discrete Functions in C

Greetings and salutations,
I am looking for information regarding design patterns for working with a large number of functions in C99.
Background:
I am working on a complete G-Code interpreter for my pet project, a desktop CNC mill. Currently, commands are sent over a serial interface to an AVR microcontroller. These commands are then parsed and executed to make the milling head move. A typical example of a line might look like
N01 F5.0 G90 M48 G1 X1 Y2 Z3
where G90, M48, and G1 are "action" codes and F5.0, X1, Y2, Z3 are parameters (N01 is the optional line number and is ignored). Currently the parsing is coming along swimmingly, but now it is time to make the machine actually move.
For each of the G and M codes, a specific action needs to be taken. This ranges from controlled motion to coolant activation/deactivation, to performing canned cycles. To this end, my current design features a function that uses a switch to select the proper function and return a pointer to that function which can then be used to call the individual code's function at the proper time.
Questions:
1) Is there a better way to resolve an arbitrary code to its respective function than a switch statement? Note that this is being implemented on a microcontroller and memory is EXTREMELY tight (2K total). I have considered a lookup table but, unfortunately, the code distribution is sparse leading to a lot of wasted space. There are ~100 distinct codes and sub-codes.
2) How does one go about function pointers in C when the names (and possibly signatures) may change? If the function signatures are different, is this even possible?
3) Assuming the functions have the same signature (which is where I am leaning), is there a way to typedef a generic type of that signature to be passed around and called from?
My apologies for the scattered questioning. Thank you in advance for your assistance.
1) Perfect hashing may be used to map the keywords to token numbers (opcodes) , which can be used to index a table of function pointers. The number of required arguments can also be put in this table.
2) You don't want overloaded / heterogeneous functions. Optional arguments might be possible.
3) Your only choice is to use varargs, IMHO
I'm not an expert on embedded systems, but I have experience with VLSI. So sorry if I'm stating the obvious.
The function-pointer approach is probably the best way. But you'll need to either:
Arrange all your action codes to be consecutive in address.
Implement an action code decoder similar to an opcode decoder in a normal processor.
The first option is probably the better way (simple and small memory footprint). But if you can't control your action codes, you'll need to implement a decoder via another lookup table.
I'm not entirely sure on what you mean by "function signature". Function pointers should just be a number - which the compiler resolves.
EDIT:
Either way, I think two lookup tables (1 for function pointers, and one for decoder) is still going to be much smaller than a large switch statement. For varying parameters, use "dummy" parameters to make them all consistent. I'm not sure what the consequences of force casting everything to void-pointers to structs will be on an embedded processor.
EDIT 2:
Actually, a decoder can't be implemented with just a lookup table if the opcode space is too large. My mistake there. So 1 is really the only viable option.
Is there a better way ... than a switch statement?
Make a list of all valid action codes (a constant in program memory, so it doesn't use any of your scarce RAM), and sequentially compare each one with the received code. Perhaps reserve index "0" to mean "unknown action code".
For example:
// Warning: untested code.
typedef int (*ActionFunctionPointer)( int, int, char * );
struct parse_item{
const char action_letter;
const int action_number; // you might be able to get away with a single byte here, if none of your actions are above 255.
// alas, http://reprap.org/wiki/G-code mentions a "M501" code.
const ActionFunctionPointer action_function_pointer;
};
int m0_handler( int speed, int extrude_rate, char * message ){ // M0: Stop
speed_x = 0; speed_y = 0; speed_z = 0; speed_e = 0;
return 0;
}
int g4_handler ( int dwell_time, int extrude_rate, char * message ){ // G4: Dwell
delay(dwell_time);
return 0;
}
int unrecognized_action( int, int, char * ); // error handler, defined elsewhere
const struct parse_item parse_table[] = {
{ '\0', 0, unrecognized_action }, // index 0 is reserved for "unknown action code"
{ 'M', 0, m0_handler }, // M0: Stop
// ...
{ 'G', 4, g4_handler }, // G4: Dwell
{ '\0', 0, unrecognized_action } // end-of-table sentinel
};
ActionFunctionPointer get_action_function_pointer( char * buffer ){
char letter = get_letter( buffer );
int action_number = get_number( buffer );
int index = 0;
ActionFunctionPointer f = 0;
do{
index++;
if( (letter == parse_table[index].action_letter ) &&
(action_number == parse_table[index].action_number) ){
f = parse_table[index].action_function_pointer;
};
if('\0' == parse_table[index].action_letter ){
index = 0;
f = unrecognized_action;
};
}while(0 == f);
return f;
}
How does one go about function pointers in C when the names (and
possibly signatures) may change? If the function signatures are
different, is this even possible?
It's possible to create a function pointer in C that (at different times) points to functions with more or less parameters (different signatures) using varargs.
Alternatively, you can force all the functions that might possibly be pointed to by that function pointer to all have exactly the same parameters and return value (the same signature) by adding "dummy" parameters to the functions that require fewer parameters than the others.
In my experience, the "dummy parameters" approach seems to be easier to understand and use less memory than the varargs approach.
Is there a way to typedef a generic type of that signature
to be passed around and called from?
Yes.
Pretty much all the code I've ever seen that uses function pointers
also creates a typedef to refer to that particular type of function.
(Except, of course, for Obfuscated contest entries).
See the above example and Wikibooks: C programming: pointers to functions for details.
p.s.:
Is there some reason you are re-inventing the wheel?
Could one of the following pre-existing G-code interpreters for the AVR work for you, perhaps with a little tweaking?
FiveD,
Sprinter,
Marlin,
Teacup Firmware,
sjfw,
Makerbot,
or
Grbl?
(See http://reprap.org/wiki/Comparison_of_RepRap_Firmwares ).
