Get all keys from (search.h) hash search table - c

I'm using the search.h library to define a hash table through the hcreate function.
How can I go through all the keys in that table? hsearch always expects an entry to search for (or store).
This is the documentation for the three functions that manage the hash table (hcreate, hsearch and hdestroy), but there is no mention of how to iterate through the structure to obtain all the stored keys.
When storing an entry in the table, I malloc the key value and so would like to have an easy way to free those malloc'd values.
Can I avoid having to store those in a separate structure such as an array?
I wouldn't expect hdestroy to be doing this automatically for me, as it has no way of knowing if key points to dynamically allocated or static memory (or indeed if I haven't already freed that memory).
Switching to a different hash search table library is not an option. I have to work with this. I'm on CentOS and use GCC 4.1.2.

There is no standard way to iterate through the entries of the hash table. This question is addressed in the glibc manual (in the hdestroy section):
It is important to remember that the elements contained in the hashing
table at the time hdestroy is called are not freed by this function.
It is the responsibility of the program code to free those strings (if
necessary at all). Freeing all the element memory is not possible
without extra, separately kept information since there is no function
to iterate through all available elements in the hashing table. If it
is really necessary to free a table and all elements the programmer
has to keep a list of all table elements and before calling hdestroy
s/he has to free all element’s data using this list. This is a very
unpleasant mechanism and it also shows that this kind of hashing
tables is mainly meant for tables which are created once and used
until the end of the program run.

Without looking at the actual source of the library, I would say there is no way to walk the hash table after it has been created. You would be required to remember the pointers for your malloc'd memory in a separate structure.
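For illustration, a minimal sketch of that bookkeeping (the key_list array and the MAX_KEYS limit are my own additions, not part of search.h; declarations kept C89-style for an old GCC):

#include <search.h>
#include <stdlib.h>
#include <string.h>

#define MAX_KEYS 100

int main(void)
{
    char *key_list[MAX_KEYS];   /* side table of the malloc'd keys */
    size_t nkeys = 0;
    size_t i;
    ENTRY e;

    hcreate(MAX_KEYS);

    /* store an entry, remembering the key so it can be freed later */
    e.key = strdup("example");
    e.data = "some value";
    if (hsearch(e, ENTER) != NULL)
        key_list[nkeys++] = e.key;

    /* ... use the table via hsearch(e, FIND) ... */

    /* teardown: free the recorded keys, then the table itself */
    for (i = 0; i < nkeys; i++)
        free(key_list[i]);
    hdestroy();
    return 0;
}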
Frankly, I don't think I'd touch that library with a ten-foot pole. The API has numerous problems:
Atrocious documentation
The library can only support a single hash table (note that hcreate does not return a handle which is then passed to hsearch or hdestroy)
The inability to walk the table or retrieve the keys severely limits its uses.
Instead, since you're on CentOS, I'd take a good long look at glib, which supports a rich set of data structures (documentation home).
The docs for hash tables are here. That's for v2.42 of the library - they don't have a generic link for the "latest version".
glib is the core of GNOME (the desktop environment used by Ubuntu), but you don't need to use any of the gmainloop or event-pump-related features.
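To give a flavour of what that buys you, a rough sketch using glib's GHashTable, which supports iteration and can free malloc'd keys for you via destroy notifiers:

#include <glib.h>

int main(void)
{
    /* g_free is registered as the key destroy notifier, so the table
       frees the duplicated keys itself when destroyed */
    GHashTable *table = g_hash_table_new_full(g_str_hash, g_str_equal,
                                              g_free, NULL);
    g_hash_table_insert(table, g_strdup("example"), "some value");

    /* iteration is built in */
    GHashTableIter iter;
    gpointer key, value;
    g_hash_table_iter_init(&iter, table);
    while (g_hash_table_iter_next(&iter, &key, &value))
        g_print("%s -> %s\n", (char *)key, (char *)value);

    g_hash_table_destroy(table);
    return 0;
}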

Related

C - Ways to free() groups of elements in a hash table?

I'm currently fiddling with a program that's trying to solve a 2D Rubik's cube. The program uses a hash table as a memory of sorts, where it saves different categories of information, and it runs repeatedly. From run to run there are certain categories of information I'd like to free/remove instead of freeing the whole table at the end of each run (which is what I'm currently doing).
I've come up with two ways and I'm unsure which to use. Either I make one array/stack for each category, where I save pointers that I can later free, or I make separate hash tables for the different categories and free each one at my discretion.
Are there other options? Somewhere I read about a pointer pool, and I'm not sure what that might be. Any ideas or helpful comments would be great!
Do you have more memory or more time? If you use a hash table (including a separate one per category), then to free a category you have to check every element in the table, which costs a lot of time. I think the best way is to create a second structure that records each object as you store it. To free everything you then walk that simple array without any checks, and zero the table's memory to flush it. You need a little more memory, but the work is much more efficient.
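A sketch of that idea (all names invented): one plain array of pointers per category, so freeing a category is a straight walk over its array instead of a scan of the whole table.

#include <stdlib.h>

#define MAX_PER_CATEGORY 1024

struct category {
    void  *items[MAX_PER_CATEGORY];  /* pointers recorded at insertion time */
    size_t count;
};

static void track(struct category *c, void *p)
{
    c->items[c->count++] = p;
}

static void free_category(struct category *c)
{
    size_t i;
    for (i = 0; i < c->count; i++)
        free(c->items[i]);           /* no table scan needed */
    c->count = 0;
}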

How do I correctly use libsodium so that it is compatible between versions?

I'm planning on storing a bunch of records in a file, where each record is then signed with libsodium. However, I would like future versions of my program to be able to check signatures the current version has made, and ideally vice-versa.
For the current version of Sodium, signatures are made using the Ed25519 algorithm. I imagine that the default primitive can change in new versions of Sodium (otherwise libsodium wouldn't expose a way to choose a particular one, I think).
Should I...
1. Always use the default primitive (i.e. crypto_sign)
2. Use a specific primitive (i.e. crypto_sign_ed25519)
3. Do (1), but store the value of sodium_library_version_major() in the file (either in a dedicated 'sodium version' field or a general 'file format revision' field) and quit if the currently running version is lower
4. Do (3), but also store crypto_sign_primitive()
5. Do (4), but also store crypto_sign_bytes() and friends
...or should I do something else entirely?
My program will be written in C.
Let's first identify the set of possible problems and then try to solve it. We have some data (a record) and a signature. The signature can be computed with different algorithms. The program can evolve and change its behaviour, the libsodium can also (independently) evolve and change its behaviour. On the signature generation front we have:
crypto_sign(), which uses some default algorithm to produce signatures (at the moment of writing it just invokes crypto_sign_ed25519())
crypto_sign_ed25519(), which produces signatures based on the specific Ed25519 algorithm
I assume that for one particular algorithm given the same input data and the same key we'll always get the same result, as it's math and any deviation from this rule would make the library completely unusable.
Let's take a look at the two main options:
Using crypto_sign_ed25519() all the time and never changing this. Not that bad an option, because it's simple, and as long as crypto_sign_ed25519() exists in libsodium and is stable in its output you have nothing to worry about: a stable fixed-size signature and zero management overhead. Of course, in the future someone could discover some horrible problem with this algorithm, and if you're not prepared to change the algorithm that could mean a horrible problem for you.
Using crypto_sign(). With this we suddenly have a lot of problems, because the algorithm can change, so you must store some metadata along with the signature, which opens up a set of questions:
what to store?
should this metadata be record-level or file-level?
What do we have among the mentioned functions for the second approach?
sodium_library_version_major() is a function to tell us the library API version. It's not directly related to changes in supported/default algorithms so it's of little use for our problems.
crypto_sign_primitive() is a function that returns a string identifying the algorithm used in crypto_sign(). That's a perfect match for what we need, because supposedly its output will change at exactly the time when the algorithm would change.
crypto_sign_bytes() is a function that returns the size of the signature produced by crypto_sign() in bytes. That's useful for determining the amount of storage needed for the signature, but it can easily stay the same if the algorithm changes, so it's not the metadata we need to store explicitly.
Now that we know what to store there is a question of processing that stored data. You need to get the algorithm name and use that to invoke matching verification function. Unfortunately, from what I see, libsodium itself doesn't provide any simple way to get the proper function given the algorithm name (like EVP_get_cipherbyname() or EVP_get_digestbyname() in openssl), so you need to make one yourself (which of course should fail for unknown name). And if you have to make one yourself maybe it would be even easier to store some numeric identifier instead of the name from library (more code though).
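A sketch of such a mapping (verify_by_name() is my own helper, not part of libsodium; "ed25519" is the string crypto_sign_primitive() returns at the time of writing):

#include <sodium.h>
#include <string.h>

typedef int (*verify_fn)(const unsigned char *sig,
                         const unsigned char *m, unsigned long long mlen,
                         const unsigned char *pk);

/* map a stored algorithm name to its verification function;
   unknown names fail closed by returning NULL */
static verify_fn verify_by_name(const char *name)
{
    if (strcmp(name, "ed25519") == 0)
        return crypto_sign_ed25519_verify_detached;
    return NULL;
}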
Now let's get back to file-level vs record-level. To solve that there are two more questions to ask: can you generate new signatures for old records at any given time (is that technically possible, is that allowed by policy), and do you need to append new records to old files?
If you can't generate new signatures for old records or you need to append new records and don't want the performance penalty of signature regeneration, then you don't have much choice and you need to:
have dynamic-size field for your signature
store the algorithm (dynamic string field or internal (for your application) ID) used to generate the signature along with the signature itself
If you can generate new signatures, or especially if you don't need to append new records, then you can get away with the simpler file-level approach: store the algorithm used in a special file-level field and, if the signature algorithm changes, regenerate all signatures when saving the file (or use the old one when appending new records; that's also more of a compatibility-policy question).
Other options? Well, what's so special about crypto_sign()? It's that its behaviour is not under your control; libsodium's developers choose the algorithm for you (no doubt they choose a good one). But if you have any versioning information in your file structure (not signature-specific, I mean), nothing prevents you from making your own particular choice and using one algorithm with one file version and another with the next (with conversion code when needed, of course). Again, that's also based on the assumption that you can generate new signatures and that's allowed by policy.
Which brings us back to the original two choices, with the question of whether it's worth the trouble of doing all that compared to just using crypto_sign_ed25519(). That mostly depends on your program's life span. I'd probably say (just as an opinion) that if it's less than 5 years then it's easier to just use one particular algorithm. If it can easily be more than 10 years, then no, you really need to be able to survive algorithm (and probably even whole-crypto-library) changes.
Just use the high-level API.
Functions from the high-level API are not going to use a different algorithm without the major version of the library being bumped.
The only breaking change one can expect in libsodium 1.x.y is the removal of deprecated/undocumented functions (that don't even exist in current releases compiled with the --enable-minimal switch). Everything else will remain backward compatible.
New algorithms might be introduced in 1.x.y versions without high-level wrappers, and will be stabilized and exposed via a new high-level API in libsodium 2.
Therefore, do not bother calling crypto_sign_ed25519(). Just use crypto_sign().
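For completeness, a minimal sketch of that high-level usage with detached signatures (which suit per-record storage; error handling trimmed):

#include <sodium.h>

int main(void)
{
    if (sodium_init() < 0)
        return 1;

    unsigned char pk[crypto_sign_PUBLICKEYBYTES];
    unsigned char sk[crypto_sign_SECRETKEYBYTES];
    crypto_sign_keypair(pk, sk);

    const unsigned char record[] = "some record";
    unsigned char sig[crypto_sign_BYTES];

    /* sign; store sig (crypto_sign_bytes() long) next to the record */
    crypto_sign_detached(sig, NULL, record, sizeof record - 1, sk);

    /* verify; returns 0 only if the signature is valid */
    return crypto_sign_verify_detached(sig, record, sizeof record - 1, pk);
}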

Dictionary lookup - why use a hash table for fixed data?

In some (horrible 3rd party) code we're working with there is a dictionary lookup routine that scans through a table populated with "'name-string' -> function_pointer" pairs, basically copy-pasted from K&R Section 6.6.
I've had to extend this, and while reading the code was struck by the seemingly pointless inclusion of hashing routines that iterate through the source data structure and create a hash table.
Given that the source data structure is fixed at compile time (so will never be added to or changed when running), is there any point in having hashing routines in there?
I'm just having one of those moments when I can't tell if the author was doing something clever that I've missed, or was being lazy and not thinking (so far the latter has been the case more often than not).
Is there a reason to have a hash table for data that will never change?
Is there a reason to have a hash table for data that will never change?
Probably the hash table code was already there and working fine, and the programmer just wanted to get the job done (e.g. looking up a function pointer from a string). If this function is not performance critical I see no reason to change it.
If you want to change it, then I suggest taking a look at perfect hash tables.
These are hash tables where the hash function is created from a fixed set of predefined keys. The good thing about them: they are often faster than tree data structures.
GPERF is a tool that does just this. It creates C code from a set of strings: https://www.gnu.org/software/gperf/
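As a taste, a gperf input file along these lines (the command names and handlers here are invented for illustration):

%{
/* commands.gperf - declarations copied verbatim into the generated C file */
void handle_start(void);
void handle_stop(void);
%}
struct command { const char *name; void (*handler)(void); };
%%
start, handle_start
stop,  handle_stop

Running gperf -t commands.gperf emits C code with an in_word_set(str, len) function that returns the matching struct command entry (or NULL), using a perfect hash computed at generation time - no run-time table construction at all.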

C hashtable library natively supporting multiple values per key

If you want to store multiple values for a key, there's always the possibility of tucking a list in between the hashtable and the values. However, I figure that to be rather inefficient, as:
The hashtable has to resolve collisions, anyway, so does some kind of list walking. Instead of stopping when it found the first key in a bucket that matches the query key, it could just continue to walk the bucket, presumably giving better cache performance than walking yet another list after following yet another indirection.
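To make that concrete, here's a sketch of such a bucket walk for a chaining table (the types and the hash function are invented):

#include <stddef.h>
#include <string.h>

struct node { const char *key; void *value; struct node *next; };

size_t hash(const char *s);   /* hypothetical hash function */

/* collect every value whose key matches, continuing along the chain
   instead of stopping at the first hit */
size_t lookup_all(struct node *const *buckets, size_t nbuckets,
                  const char *key, void **out, size_t max)
{
    size_t n = 0;
    struct node *p;
    for (p = buckets[hash(key) % nbuckets]; p != NULL; p = p->next)
        if (n < max && strcmp(p->key, key) == 0)
            out[n++] = p->value;
    return n;
}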
Is anyone aware of library implementations that support this by default (and ideally are also otherwise shiny, fast, hashtables as well as BSD or similarly licensed)? I've looked through a couple of libraries but none did what I wanted, glib's datasets coming closest, though storing records, not lists.
So… something like a multimap?
Libgee, building off of GLib, provides a MultiMap. (It's written in Vala, but that is converted to plain C.)

Which one to use: linked list or static arrays?

I have a structure in C which resembles that of a database table record.
Now when I query the table using select, I do not know how many records I will get.
I want to store all the returned records from the select query in an array of my structure data type.
Which method is best?
Method 1: find array size and allocate
first get the count of records by doing select count(*) from table
allocate a static array
run select * from table and then store each records in my structure in a loop.
Method 2: use single linked list
while ( records returned )
{
create new node
store the record in node
}
Which implementation is best?
My requirement is that when I have all the records,
I will probably make copies of them or something.
But I do not need random access and I will not be doing any search of a particular record.
Thanks
And I forgot option #4. Allocate an array of fixed size. When that array is full, allocate another. You can keep track of the arrays by linking them in a linked list, or by having a higher-level array that keeps the pointers to the data arrays. This two-level scheme is great when you need random access; you just need to break your index into two parts.
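A sketch of that two-level scheme (chunk size and names invented, error checks omitted, record_struct standing in for the real record type): record i lives in chunk i / CHUNK at offset i % CHUNK, and existing chunks never move when the structure grows.

#include <stdlib.h>

#define CHUNK 256

typedef struct { int id; } record_struct;  /* stand-in for the real record */

struct chunked {
    record_struct **chunks;   /* top-level array of pointers to chunks */
    size_t nchunks;
    size_t count;             /* total records stored */
};

static record_struct *chunked_at(struct chunked *c, size_t i)
{
    return &c->chunks[i / CHUNK][i % CHUNK];   /* index split in two */
}

static void chunked_push(struct chunked *c, record_struct r)
{
    if (c->count == c->nchunks * CHUNK) {      /* last chunk is full */
        c->chunks = realloc(c->chunks, (c->nchunks + 1) * sizeof *c->chunks);
        c->chunks[c->nchunks++] = malloc(CHUNK * sizeof **c->chunks);
    }
    *chunked_at(c, c->count++) = r;
}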
A problem with 'select count(*)' is that the value might change between calls, so your "real" select will have a number of items different from the count you'd expect.
I think the best solution is your "2".
Instead of a linked list, I would personally allocate an array (reallocating as necessary). This is easier in languages that support growing arrays (e.g. std::vector<myrecord> in C++ and List<myrecord> in C#).
You forgot option 3. It's a little more complicated, but it might be best for your particular case. This is the way it's typically done in C++'s std::vector.
Allocate an array of any comfortable size. When that array is filled, allocate a new larger array of 1.5x to 2x the size of the filled one, then copy the filled array to this one. Free the original array and replace it with the new one. Lather, rinse, repeat.
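A sketch of that strategy (error handling omitted; record_struct standing in for the real record type; a struct vec is zero-initialized before first use):

#include <stdlib.h>
#include <string.h>

typedef struct { int id; } record_struct;  /* stand-in for the real record */

struct vec {
    record_struct *data;
    size_t count, capacity;
};

static void vec_push(struct vec *v, record_struct r)
{
    if (v->count == v->capacity) {
        /* double the capacity: copies happen, but only O(log n) times */
        size_t newcap = v->capacity ? v->capacity * 2 : 16;
        record_struct *bigger = malloc(newcap * sizeof *bigger);
        if (v->data != NULL)
            memcpy(bigger, v->data, v->count * sizeof *bigger);
        free(v->data);
        v->data = bigger;
        v->capacity = newcap;
    }
    v->data[v->count++] = r;
}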
There are a good many possible critiques that should be made.
You are not talking about a static array at all - a static array would be of pre-determined size fixed at compile time, and either local to a source file or local to a function. You are talking about a dynamically allocated array.
You do not give any indication of record size or record count, nor of how dynamic the database underneath is (that is, could any other process change any of the data while yours is running). The sizing information isn't dreadfully critical, but the other factor is. If you're doing a report of some sort, then fetching the data into memory is fine; you aren't going to modify the database and the data is an accurate snapshot. However, if other people could be modifying the records while you are modifying records, your outline solution is a major example of how to lose other people's updates. That is a BAD thing!
Why do you need all the data in memory at once? Ignoring size constraints, what exactly is the benefit of that compared with processing each relevant record once in the correct sequence? You see, DBMS put a lot of effort into being able to select the relevant records (WHERE clauses) and the relevant data (SELECT lists) and allow you to specify the sequence (ORDER BY clauses) and they have the best sort systems they can afford (better than the ones you or I are likely to produce).
Beware of quadratic behaviour if you grow your array in fixed-size increments. Each time you reallocate, there's a decent chance the old memory will have to be copied to the new location. This will fragment your memory (the old location will be available for reuse, but by definition will be too small to reuse). Mark Ransom points out a reasonable alternative - not the world's simplest scheme overall (but it avoids the quadratic behaviour I referred to). Of course, you can (and would) abstract that away behind a set of suitable functions.
Bulk fetching (also mentioned by Mark Ransom) is also useful. You would want to preallocate the array into which a bulk fetch fetches so that you don't have to do extra copying. This is just linear behaviour though, so it is less serious.
Create a data structure to represent your array or list. Pretend you're in an OO language and create accessors and constructors for everything you need. Inside that data structure, keep an array, and, as others have said, when the array is filled to capacity, allocate a new array 2x as large and copy into it. Access the structure only through your defined routines for accessing it.
This is the way Java and other languages do it. Internally, this is even how Perl's arrays are implemented in C.
I was going to say your best option is to look for a library that already does this ... maybe you can borrow Perl's C implementation of this kind of data structure. I'm sure it's more well tested than anything you or I could roll up from scratch. :)
record_struct *record;
record_struct *records_array = NULL;
size_t records = 0;

while ((record = get_record()) != NULL) {
    records++;
    records_array = realloc(records_array, sizeof(record_struct) * records);
    records_array[records - 1] = *record;   /* copy the record into the array */
}
This is strictly an example; in production you would grow the array geometrically and check realloc()'s return value rather than reallocating on every record.
The linked list is a nice, simple option. I'd go with that. If you prefer the growing array, you can find an implementation as part of Dave Hanson's C Interfaces and Implementations, which as a bonus also provides linked lists.
This looks to me like a design decision that is likely to change as your application evolves, so you should definitely hide the representation behind a suitable API. If you don't already know how to do this, Hanson's code will give you a number of nice examples.
