Avoiding database when checking for existing values

Is there a way, through hashes, bitwise operators, or another algorithm, to avoid using a database when simply checking whether a string or value has appeared before?
Assume there is no way to store the whole history of strings that appeared before; only a little information can be stored.

You may be interested in Bloom filters. They don't let you authoritatively say, "yes, this value is in the set of interest", but they do let you say "yes, this value is probably in the set" vs. "no, this value definitely is not in the set". For many situations, that's enough to be useful.
The way it works is:
You create an array of Boolean values (i.e. of bits). The larger you can afford to make this array, the better.
You create a bunch of different hash functions that each take an input string and map it to one element of the array. You want these hash functions to be independent, so that even if one hash function maps two strings to the same element, a different hash function will most likely map them to different elements.
To record that a string is in the set, you apply each of your hash functions to it in turn — giving you a set of elements in the array — and you set all of the mapped-to elements to TRUE.
To check whether a string is (probably) in the set, you do the same thing, except that now you just check the mapped-to elements to see whether they are TRUE. If all of them are TRUE, then the string is probably in the set; otherwise, it definitely isn't.
If you're interested in this approach, see https://en.wikipedia.org/wiki/Bloom_filter for detailed analysis that can help you tune the filter appropriately (choosing the right array-size and number of hash functions) to get useful probabilities.
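Here is a minimal sketch in C of the scheme above. The k "independent" hash functions are simulated by seeding a single FNV-1a hash k different ways, and the array size and hash count are illustrative rather than tuned:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define FILTER_BITS (1u << 20)   /* 1 Mi bits = 128 KiB of state */
#define NUM_HASHES  7

static uint8_t filter[FILTER_BITS / 8];

/* FNV-1a, seeded to act as a family of hash functions */
static uint32_t fnv1a(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

static void bloom_add(const char *s)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = fnv1a(s, i) % FILTER_BITS;
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));  /* set mapped-to bit */
    }
}

/* true: probably in the set; false: definitely not in the set */
static bool bloom_maybe_contains(const char *s)
{
    for (uint32_t i = 0; i < NUM_HASHES; i++) {
        uint32_t bit = fnv1a(s, i) % FILTER_BITS;
        if (!(filter[bit / 8] & (1u << (bit % 8))))
            return false;
    }
    return true;
}

int main(void)
{
    bloom_add("already-seen-string");
    printf("%d %d\n", bloom_maybe_contains("already-seen-string"),
                      bloom_maybe_contains("never-seen-string"));
    return 0;
}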

Related

How are HashTables in GLib Useful?

I'm familiar with the idea of a hash function but I'm unclear on how GLib's implementation is useful. I'll explain this with an example.
Suppose I have an expensive function that is recursive (somehow) on the positive real numbers in a weird way that depends on number theory (I'm a mathematician). Let's say I have an algorithm that needs to compute the function on some smallish-range of large numbers. Say [1000000000 - 1000999999].
I don't want to call my expensive function one million times, so I start memoizing values recursively. Then at each call I don't necessarily need to compute the whole function from scratch; I can hopefully reuse any values of the function on lower numbers (during my recursing) that I have already computed. Let's assume that the number of distinct calls at that first level of recursion is low, so that there are a lot of repeated values and memoizing actually saves a lot of time.
This is my cartoony way of understanding why a hash table data structure is useful. What I don't get is how to do this without knowing exactly what keys I'll need in advance.
Since the recursive function is number-theoretic, in general I don't know which values it will take over and over again. So I'd like to just throw these in a bucket (hash table) as they pop out of recursive calls to my function.
For GLib, it would seem that your (key, value) pairs are always pointers to data that you personally have to keep lying around somewhere. So if my function is computing for input x, I don't know how to tell whether I've seen the value x before; g_hash_table_contains(), for example, needs a pointer, not the value x. So what's the use!?
I'm still learning so be kind. I'm familiar with coding in C, but haven't yet used hash tables in this language and I'm trying to do so and be adept at it with GLib but I just don't get this.
Let me take a stab at explaining it.
First of all, if we are using a hashmap, then we certainly need a [key, value] pair as our input.
So as users of a hashmap, we have to be creative about choosing the key, and it varies depending upon the use case.
In your case, as far as I understood, you have a function that works on a range and gives you a result, and when calculating, it uses memoization so that the results of the small problems that constitute the bigger problem can be reused.
So in your case, you can use a string as your key: the key "1000999999" may use the result for "1000999998", which may further use the result for "1000999997", and so on. Whenever you do not find a result in the hashmap, you calculate it and save it in the hashmap.
In a nutshell, as users, we need to be creative about choosing keys.
A good analogy is how you would go about choosing the primary key of a database table.
Another example to think about is how you would solve fibonacci(n) using a hashmap.
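To make this concrete for GLib, here is a minimal sketch of memoizing an integer-keyed function, assuming the inputs fit in a gint (true for values around 10^9). The trick that answers the original question is that you never keep the value x lying around anywhere: you pack it into the key pointer with GINT_TO_POINTER, and g_direct_hash/g_direct_equal work on those packed bits directly. expensive() is a hypothetical placeholder, not the asker's real function:

#include <glib.h>

static GHashTable *cache;   /* key and value both packed into pointers */

static gint memo_f(gint x);

/* Hypothetical placeholder for the asker's expensive number-theoretic
 * function; it recurses through memo_f(), so sub-results get cached. */
static gint expensive(gint x)
{
    if (x < 2)
        return 1;
    return memo_f(x / 2) + memo_f(x / 3);
}

static gint memo_f(gint x)
{
    gpointer val;

    /* lookup_extended distinguishes "absent" from "present with value 0" */
    if (g_hash_table_lookup_extended(cache, GINT_TO_POINTER(x), NULL, &val))
        return GPOINTER_TO_INT(val);

    gint result = expensive(x);
    g_hash_table_insert(cache, GINT_TO_POINTER(x), GINT_TO_POINTER(result));
    return result;
}

int main(void)
{
    /* g_direct_hash/g_direct_equal hash and compare the pointer bits
     * themselves, which is exactly right for packed-integer keys. */
    cache = g_hash_table_new(g_direct_hash, g_direct_equal);
    g_print("%d\n", memo_f(1000000000));
    g_hash_table_destroy(cache);
    return 0;
}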

Is it more efficient to use a linked list and delete nodes, or use an array and do a small computation on a string to see if an element can be skipped?

I am writing a program in C that reads a file. Each line of the file is a string of characters on which a computation will be done. The result of the computation on a particular string may imply that strings later on in the file do not need any computation done to them. Also, if the reverse of the string comes alphabetically before the (current, non-reversed) string, then it does not need to be checked.
My question is: would it be better to put each string in a linked list and delete each node after finding that particular strings don't need to be checked, or to use an array and check the last few characters of a string, skipping it if it is alphabetically after the string in the previous element? Either way, the list or array only needs to be iterated through once.
A rule of thumb is that if you are dealing with small objects (< 32 bytes), std::vector is better than a linked list for most general operations.
But for larger objects (say, 1 KB), you generally need to consider lists.
There is an article detailing the comparison; the link is here:
http://www.baptiste-wicht.com/2012/11/cpp-benchmark-vector-vs-list/3/
Without further details about your needs, it is a bit difficult to tell you which one would fit your requirements better.
Arrays are easy to access, especially if you are going to do it in a non-sequential way, but they are hard to maintain if you need to perform deletions on them or if you don't have a good approximation of the final number of elements.
Lists are good if you plan to access them sequentially, but terrible if you need to jump between elements. Deletion, on the other hand, can be done in constant time if you are already at the node you want to delete.
I don't quite understand how you plan to access them, since you say that either one would be iterated over just once; if that is the case, then either structure would give you similar performance, since you are not really taking advantage of their key benefits.
It's really difficult to understand what you are trying to do, but it sounds like you should create an array of records, with each record holding one of your strings and a boolean flag to indicate whether it should be processed.
You set each record's flag to true as you load the array from the file.
You use one pointer to scan the array once, processing only the strings from records whose flags are still true.
For each record processed, you use a second pointer to scan from the first pointer + 1 to the end of the array, identify strings that won't need processing (in light of the current string), and set their flags to false.
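A rough sketch of that record-and-flag scan, where do_computation() and makes_redundant() are hypothetical placeholders for the question's computation and its skip rule:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct record {
    char line[64];    /* one string from the file */
    bool active;      /* true until ruled out */
};

/* Hypothetical placeholders for the question's computation and skip rule */
static void do_computation(const char *s) { printf("processing %s\n", s); }
static bool makes_redundant(const char *done, const char *later)
{
    return strcmp(done, later) < 0;   /* invented rule for the demo */
}

static void process_all(struct record *recs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!recs[i].active)
            continue;                 /* skip without any deletion */
        do_computation(recs[i].line);
        for (size_t j = i + 1; j < n; j++)   /* second pointer: i+1 .. end */
            if (recs[j].active && makes_redundant(recs[i].line, recs[j].line))
                recs[j].active = false;
    }
}

int main(void)
{
    struct record recs[] = {
        { "banana", true }, { "apple", true }, { "cherry", true },
    };
    process_all(recs, sizeof recs / sizeof *recs);
    return 0;
}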

Linking filenames or labels to numeric index

In a C99+SDL game, I have an array that contains sound effects (SDL_mixer chunk data and some extra flags and filename string) and is referenced by index such as "sounds[2].data".
I'd like to be able to call sounds by filename, but I don't want to strcmp my way through the whole array until a match is found. That way, as I add more sounds, change the order, or allow for player-defined sound mods, they can still be called with a common identifier (such as "SHOT01" or "EXPL04").
What would be the fastest approach for this? I heard about hashing, which would result in something similar to Lua's string indexes (such as table["field"]), but I don't know anything about the topic, and it seems fairly complicated.
Just in case it matters, I plan to have the filenames or labels be 6 to 8 character, all-caps names (such as "SHOT01.wav").
So to summarize: where can I learn about hashing short strings like that, or what would be the fastest way to keep track of something like sound effects so they can be called using arbitrary labels or identifiers?
I think in your case you can probably just keep all the sounds in a sorted data structure and use a fast search algorithm to find matches. Something like binary search is very simple to implement and gives good performance.
However, if you are interested in hash tables and hashing, the basics of it all are pretty simple. There is no place like Wikipedia to get the basics down and you can then tailor your searches better on Google to find more in depth articles.
The basics are that you start out with a fixed-size array and store everything in there. To figure out where to store something, you take the key (in your case the sound name) and perform some operation on it that gives you an exact location where the value can be found. The simplest case for string hashing is just adding up all the letters in the string as integer values, then taking that sum modulo the array size to give you an index:
position = SUM(string letters) % [array size]
Of course, naturally, multiple strings will have the same sum and thus give you the same position. This is called a collision, and collisions can be handled in many ways. The simplest way is to have an array of lists rather than an array of values, and simply append to the list every time there is a collision. When searching for a value, simply iterate the list and find the value you need.
Ideally, a good hash function will have few collisions and be quick to compute, thus providing a huge performance boost.
I hope this helps :)
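For the sorted-structure route suggested at the top, here is a sketch using the C standard library's bsearch(); the labels and indices below are invented for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Label-to-index pairs, kept sorted by label so bsearch() works */
struct sound_entry {
    const char *label;
    int         index;   /* index into your sounds[] array */
};

static const struct sound_entry table[] = {
    { "EXPL04", 3 },
    { "SHOT01", 0 },
    { "SHOT02", 1 },
};

static int cmp_entry(const void *a, const void *b)
{
    return strcmp(((const struct sound_entry *)a)->label,
                  ((const struct sound_entry *)b)->label);
}

static int lookup_sound(const char *label)
{
    struct sound_entry key = { label, 0 };
    const struct sound_entry *hit =
        bsearch(&key, table, sizeof table / sizeof *table,
                sizeof *table, cmp_entry);
    return hit ? hit->index : -1;    /* -1: no such label */
}

int main(void)
{
    printf("%d\n", lookup_sound("SHOT02"));   /* prints 1 */
    return 0;
}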
You are right: when it comes to mapping objects with a set of string keys, hash tables are often the way to go.
I think this Wikipedia article is a good starting point for understanding the hash table mechanism: http://en.wikipedia.org/wiki/Hash_table

C - How to implement Set data structure?

Is there any tricky way to implement a set data structure (a collection of unique values) in C? All elements of a set will be of the same type, and there is a huge amount of RAM.
As I know, for integers it can be done really fast'n'easy using value-indexed arrays. But I'd like to have a very general set data type. And it would be nice if a set could include itself.
There are multiple ways of implementing set (and map) functionality, for example:
tree-based approach (ordered traversal)
hash-based approach (unordered traversal)
Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.
Be aware of the respective advantages and disadvantages of hash-based vs. tree-based approaches.
You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:
all hashables in a bucket have the same hash value
a bucket can be implemented as a dynamic array or linked list of hashables
a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)
With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.
You would have to implement:
the hash function for the type being hashed
an equality function for the type being used to test whether two hashables are equal or not
the hash-set contains/insert/remove functionality.
You can also use open addressing as an alternative to maintaining and managing buckets.
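A minimal sketch of such a chained hash-set over void * elements, with caller-supplied hash and equality functions; the fixed bucket count stands in for the "generously sized" array (a real implementation would also resize and free nodes):

#include <stdlib.h>

typedef size_t (*hash_fn)(const void *);
typedef int    (*eq_fn)(const void *, const void *);

#define NBUCKETS 4096   /* sized generously to keep chains short */

struct node { const void *elem; struct node *next; };

struct hashset {
    struct node *buckets[NBUCKETS];
    hash_fn hash;   /* caller-supplied hash for the element type */
    eq_fn   eq;     /* caller-supplied equality test */
};

int hashset_contains(const struct hashset *s, const void *elem)
{
    for (const struct node *n = s->buckets[s->hash(elem) % NBUCKETS];
         n; n = n->next)
        if (s->eq(n->elem, elem))
            return 1;
    return 0;
}

int hashset_insert(struct hashset *s, const void *elem)
{
    if (hashset_contains(s, elem))
        return 0;                     /* already present: elements are unique */
    size_t b = s->hash(elem) % NBUCKETS;
    struct node *n = malloc(sizeof *n);
    n->elem = elem;
    n->next = s->buckets[b];
    s->buckets[b] = n;
    return 1;
}

/* Since elements are plain pointers, nothing stops you from inserting a
 * pointer to the hash-set into itself, as noted above. */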
Sets are usually implemented as some variety of binary tree. Red-black trees have good worst-case performance.
These can also be used to build a map to allow key/value lookups.
This approach requires some sort of ordering on the elements of the set and the key values in a map.
I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.
The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.
The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get into large enough sizes to compete with pointers this has a very small probability of failing.
Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.
(There are similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)
Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.
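A sketch of that tagged-value idea, with one tag each for opaque pointers, integers, and strings; equality is then defined per tag:

#include <string.h>

enum tag { TAG_PTR, TAG_INT, TAG_STR };

struct value {
    enum tag tag;
    union {
        void       *p;   /* any sort of object: compared by identity */
        long        i;   /* integer rvalue: compared by value */
        const char *s;   /* string: compared by contents */
    } u;
};

static int value_equal(const struct value *a, const struct value *b)
{
    if (a->tag != b->tag)
        return 0;
    switch (a->tag) {
    case TAG_PTR: return a->u.p == b->u.p;
    case TAG_INT: return a->u.i == b->u.i;
    case TAG_STR: return strcmp(a->u.s, b->u.s) == 0;
    }
    return 0;
}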
If the maximum number of elements in the set (the cardinality of the underlying data type) is small enough, you might want to consider using a plain old array of bits (or whatever you call them in your favourite language).
Then you have a simple set membership check: bit n is 1 if element n is in the set. You could even count 'ordinary' members from 1, and only make bit 0 equal to 1 if the set contains itself.
This approach will probably require some other data structure (or function) to translate from the member data type to the position in the bit array (and back), but it makes basic set operations (union, intersection, membership test, difference, insertion, removal, complement) very, very easy. It is only suitable for relatively small sets, though; you wouldn't want to use it for sets of 32-bit integers, I don't suppose.
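A sketch of the bit-array set for a small universe of N elements, assuming some mapping from member to an index in [0, N) already exists; the bulk operations become one bitwise operation per word:

#include <stdint.h>

#define N 1024   /* cardinality of the underlying type's universe */

static uint32_t bits[N / 32];

static void set_add(unsigned n)      { bits[n / 32] |=  1u << (n % 32); }
static void set_remove(unsigned n)   { bits[n / 32] &= ~(1u << (n % 32)); }
static int  set_contains(unsigned n) { return (bits[n / 32] >> (n % 32)) & 1u; }

/* Union, intersection, and difference of two such sets are just
 * a[i] | b[i], a[i] & b[i], and a[i] & ~b[i] over each word i. */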

The optimum* way to do a table-lookup-like function in C?

I have to do a table lookup to translate from input A to output A'. I have a function with input A which should return A'. Using databases or flat files are not possible for certain reasons. I have to hardcode the lookup in the program itself.
What would be the most optimal (*space-wise and time-wise, separately): using a hashmap with A as the key and A' as the value, or using switch-case statements in the function?
The table is a string to string lookup with a size of about 60 entries.
If speed is absolutely critical, then I would consider perfect hashing. Otherwise I'd use an array/vector of string-to-string pairs, created statically in sorted order, and use binary search. I'd also write a small test program to check that the speed and memory constraints are met.
I believe that the switch and the table look-up will be equivalent (although one should run some tests on the compiler being used): a modern C compiler will implement a big switch as a look-up table. The table look-up can be created more easily with a macro or a scripting language.
For both solutions, the input A must be an integer. If this is not the case, one solution would be to implement a huge if-else statement.
If you have strings, you can create two arrays, one for inputs and one for outputs (these must be the same size, with matching indices). Then you iterate over the contents of the input array to find a match; based on the index where you find it, you return the corresponding output string.
Make a key that is fast to calculate, and hash it.
If the table is pretty static and unlikely to change in the future, you could check whether adding up a few selected chars (at fixed indexes) of each "key" string yields a unique value K. If so, insert the "value" strings into a hash table using the pre-calculated K value for each "key" string.
Although a hash method is fast, there is still the possibility of collisions (two inputs generating the same hash value). The fastest method depends on the data type of the input.
For integral types, the fastest table lookup method is an array. Use the incoming datum as an index into the array. One of the problems with this method is that the array must account for the entire spectrum of values for the fastest speed. Otherwise execution is slowed down by translating the original index into an index for the array (kind of like a hashing method).
For string input types, a nested look up may be the fastest. One example is to break up tables by length. The first array returns pointers to the table to search based on length, e.g. char * sub_table = First_Array[5] for a string of length 5. These can be configured for specialized input data.
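A sketch of that length-first nested lookup; the keys, values, and table sizes here are invented for illustration:

#include <stdio.h>
#include <string.h>

struct pair { const char *key; const char *val; };

/* Hypothetical sub-tables, one per key length */
static const struct pair len4[] = { { "abcd", "out1" } };
static const struct pair len5[] = { { "alpha", "out2" }, { "gamma", "out3" } };

/* First level: indexed by strlen(); missing lengths stay NULL */
static const struct { const struct pair *tab; size_t n; } by_len[] = {
    [4] = { len4, sizeof len4 / sizeof *len4 },
    [5] = { len5, sizeof len5 / sizeof *len5 },
};

static const char *lookup(const char *key)
{
    size_t len = strlen(key);
    if (len >= sizeof by_len / sizeof *by_len || !by_len[len].tab)
        return NULL;
    for (size_t i = 0; i < by_len[len].n; i++)    /* search one sub-table */
        if (strcmp(by_len[len].tab[i].key, key) == 0)
            return by_len[len].tab[i].val;
    return NULL;
}

int main(void)
{
    printf("%s\n", lookup("gamma"));   /* prints "out3" */
    return 0;
}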
Another method is to use a B-Tree, which is a tree of "pages" (despite the name, not a binary tree); its behavior is similar to nested arrays.
If you let us know the input type, we can better answer your question.
