Redis list with Binary array values instead of strings? - database

When I have look at the Redis documentation, I learn that we can use lists for strings (https://redis.io/docs/data-types/lists/).
However, for my use case it would be advantageous, if I could have a list with binary array values stored instead.
I have a few questions regarding that:
Is such a thing possible?
If not, is there a workaround?
Also, say the different binary arrays have size x, does that restrict the number of elements that can be stored in the list if the size x increases?

I would suggest against that setup, if possible. Instead, I would:
put each (possibly large) binary into its own key, through the SET command, naming each key with a hash of the binary data (for example, file:<CRC32 hash value>);
use a list to store just the aforementioned hashes (or the key names).
Key values have a limit of 512MB; lists can have up to 2^32-1 items.

Related

Data Structure to do lookup on large number

I have a requirement to do a lookup based on a large number. The number could fall in the range 1 - 2^32. Based on the input, i need to return some other data structure. My question is that what data structure should i use to effectively hold this?
I would have used an array giving me O(1) lookup if the numbers were in the range say, 1 to 5000. But when my input number goes large, it becomes unrealistic to use an array as the memory requirements would be huge.
I am hence trying to look at a data structure that yields the result fast and is not very heavy.
Any clues anybody?
EDIT:
It would not make sense to use an array since i may have only 100 or 200 indices to store.
Abhishek
unordered_map or map, depending on what version of C++ you are using.
http://www.cplusplus.com/reference/unordered_map/unordered_map/
http://www.cplusplus.com/reference/map/map/
A simple solution in C, given you've stated at most 200 elements is just an array of structs with an index and a data pointer (or two arrays, one of indices and one of data pointers, where index[i] corresponds to data[i]). Linearly search the array looking for the index you want. With a small number of elements, (200), that will be very fast.
One possibility is a Judy Array, which is a sparse associative array. There is a C Implementation available. I don't have any direct experience of these, although they look interesting and could be worth experimenting with if you have the time.
Another (probably more orthodox) choice is a hash table. Hash tables are data structures which map keys to values, and provide fast lookup and insertion times (provided a good hash function is chosen). One thing they do not provide, however, is ordered traversal.
There are many C implementations. A quick Google search turned up uthash which appears to be suitable, particularly because it allows you to use any value type as the key (many implementations assume a string as the key). In your case you want to use an integer as the key.

Hash function for hash table with strings and integers as keys

i am in search for a good Hash function which i can use in Hash table implementation. The thing is that i want to give both strings and integers as parameters(keys) in my hash function.
i have a txt file with ~500 data and every one of them consists of integers and strings(max 15 chars). So, the thing that i want to do is to pick one of these ints/strings and use it as a key for my hash function in order to put my data in the "right" bucket.
Is there any good function to do this?
Thank you :)
Use the Integer value if that's present & reasonably well distributed, then hash the String if it's not. Integer hashcode is much cheaper to compute than String.
The algorithm has to be repeatable, obviously.
Your question is somewhat vague. It's unclear if your data set has 500 columns and you are trying to figure out which column to use for hashing, or if it has 500 items which you want to hash.
If you are looking for a decent general purpose hash that will produce well-distributed hash values, you may want to check out the Jenkins hash functions which have variants for strings and integers. But, to be frank, if your dataset has 500 fixed items you may want to look at a perfect hash function generator, like GNU gperf or even alternative data structures depending on your data.
Since you want to hash using two keys, I presume the distribution improves using two keys.
For string hashing, I have had good results with PJW algorithm. Just google for "PJW Hash String". One variation here
To augment the hash with an integer, see here

Understanding hash tables

I understand that some hash tables use "buckets", which is a linked list of "entries".
HashTable
-size //total possible buckets to use
-count // total buckets in use
-buckets //linked list of entries
Entry
-key //key identifier
-value // the object you are storing for reference
-next //the next entry
In order to get the bucket by index, you have to call:
myBucket = someHashTable[hashIntValue]
Then, you could iterate the linked list of entries until you find the one you are looking for or null.
Does the hash function always return a NUMBER % HashTable.size? That way, you stay within the limit? Is that how the hash function should work?
Mathematically speaking, a hash function is usually defined as a mapping from the universe of elements you want to store in the hash table to the range {0, 1, 2, .., numBuckets - 1}. This means that in theory, there's no requirement whatsoever that you use the mod operator to map some integer hash code into the range of valid bucket indices.
However, in practice, almost universally programmers will use a generic hash code that produces a uniformly-distributed integer value and then mod it down so that it fits in the range of the buckets. This allows hash codes to be developed independently of the number of buckets used in the hash table.
EDIT: Your description of a hash table is called a chained hash table and uses a technique called closed addressing. There are many other implementations of hash tables besides the one you've described. If you're curious - and I hope you are! :-) - you might want to check out the Wikipedia page on the subject.
what is hash table?
It is also known as hash map is a data structure used to implement an associative array.It is a structure that can map keys to values.
How it works?
A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found.
See the below diagram it clearly explains.
Advantages:
In a well-dimensioned hash table, the average cost for each lookup is independent of the number of elements stored in the table.
Many hash table designs also allow arbitrary insertions and deletions of key-value pairs.
In many situations, hash tables turn out to be more efficient than search trees or any other table lookup structure.
Disadvantages:
The hash tables are not effective when the number of entries is very small. (However, in some cases the high cost of computing the hash function can be mitigated by saving the hash value together with the key.)
Uses:
They are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches and sets.
There is no predefined rule for how a hash function should behave. You can have all of your values map to index 0 - a perfectly valid hash function (performs poorly, but works).
Of course, if your hash function returns a value outside of the range of indices in your associated array, it won't work correctly. Thats not to say however, that you need to use the formula (number % TABLE_SIZE)
No, the table is typically an array of entries. You don't iterate it until you found the same hash, you use the hash result (or usually hash modulo numBuckets) to directly index into the array of entries. That gives you the O(1) behaviour (iterating would be O(n)).
When you try to store two different objects with the same hash result (called a 'hash collision'), you have to find some way to make space. Different implementations vary in how they handle collisions. You can create a linked list of all the objects with same hash, or use some rehashing to store in a different entry of the table.

C - How to implement Set data structure?

Is there any tricky way to implement a set data structure (a collection of unique values) in C? All elements in a set will be of the same type and there is a huge RAM memory.
As I know, for integers it can be done really fast'N'easy using value-indexed arrays. But I'd like to have a very general Set data type. And it would be nice if a set could include itself.
There are multiple ways of implementing set (and map) functionality, for example:
tree-based approach (ordered traversal)
hash-based approach (unordered traversal)
Since you mentioned value-indexed arrays, let's try the hash-based approach which builds naturally on top of the value-indexed array technique.
Beware of the advantages and disadvantages of hash-based vs. tree-based approaches.
You can design a hash-set (a special case of hash-tables) of pointers to hashable PODs, with chaining, internally represented as a fixed-size array of buckets of hashables, where:
all hashables in a bucket have the same hash value
a bucket can be implemented as a dynamic array or linked list of hashables
a hashable's hash value is used to index into the array of buckets (hash-value-indexed array)
one or more of the hashables contained in the hash-set could be (a pointer to) another hash-set, or even to the hash-set itself (i.e. self-inclusion is possible)
With large amounts of memory at your disposal, you can size your array of buckets generously and, in combination with a good hash method, drastically reduce the probability of collision, achieving virtually constant-time performance.
You would have to implement:
the hash function for the type being hashed
an equality function for the type being used to test whether two hashables are equal or not
the hash-set contains/insert/remove functionality.
You can also use open addressing as an alternative to maintaining and managing buckets.
Sets are usually implemented as some variety of a binary tree. Red black trees have good worst case performance.
These can also be used to build an map to allow key / value lookups.
This approach requires some sort of ordering on the elements of the set and the key values in a map.
I'm not sure how you would manage a set that could possibly contain itself using binary trees if you limit set membership to well defined types in C ... comparison between such constructs could be problematic. You could do it easily enough in C++, though.
The way to get genericity in C is by void *, so you're going to be using pointers anyway, and pointers to different objects are unique. This means you need a hash map or binary tree containing pointers, and this will work for all data objects.
The downside of this is that you can't enter rvalues independently. You can't have a set containing the value 5; you have to assign 5 to a variable, which means it won't match a random 5. You could enter it as (void *) 5, and for practical purposes this is likely to work with small integers, but if your integers can get into large enough sizes to compete with pointers this has a very small probability of failing.
Nor does this work with string values. Given char a[] = "Hello, World!"; char b[] = "Hello, World!";, a set of pointers would find a and b to be different. You would probably want to hash the values, but if you're concerned about hash collisions you should save the string in the set and do a strncmp() to compare the stored string with the probing string.
(There's similar problems with floating-point numbers, but trying to represent floating-point numbers in sets is a bad idea in the first place.)
Therefore, you'd probably want a tagged value, one tag for any sort of object, one for integer value, and one for string value, and possibly more for different sorts of values. It's complicated, but doable.
If the maximum number of elements in the set (the cardinality of the underlying data type) is small enough, you might want to consider using a plain old array of bits (or whatever you call them in your favourite language).
Then you have a simple set membership check: bit n is 1 if element n is in the set. You could even count 'ordinary' members from 1, and only make bit 0 equal to 1 if the set contains itself.
This approach will probably require some sort of other data structure (or function) to translate from the member data type to the position in the bit array (and back), but it makes basic set operations (union, intersection, membership test, difference, insertion, removal,compelment) very very easy. And it is only suitable for relatively small sets, you wouldn't want to use it for sets of 32-bit integers I don't suppose.

The optimum* way to do a table-lookup-like function in C?

I have to do a table lookup to translate from input A to output A'. I have a function with input A which should return A'. Using databases or flat files are not possible for certain reasons. I have to hardcode the lookup in the program itself.
What would be the the most optimum (*space-wise and time-wise separately): Using a hashmap, with A as the key and A' as the value, or use switch case statements in the function?
The table is a string to string lookup with a size of about 60 entries.
If speed is ultra ultra necessary, then I would consider perfect hashing. Otherwise I'd use an array/vector of string to string pairs, created statically in sort order and use binary search. I'd also write a small test program to check the speed and memory constraints were met.
I believe that both the switch and the table-look up will be equivalent (although one should do some tests on the compiler being used). A modern C compiler will implement a big switch with a look-up table. The table look-up can be created more easily with a macro or a scripting language.
For both solutions the input A must be an integer. If this is not the case, one solution will be to implement a huge if-else statement.
If you have strings you can create two arrays - one for input and one for output (this will be inefficient if they aren't of the same size). Then you need to iterate the contents of the input array to find a match. Based on the index you find, you return the corresponding output string.
Make a key that is fast to calculate, and hash
If the table is pretty static, unlikely to change in future, you could have a look-see if adding a few selected chars (with fix indexes) in the "key" string could get unique values (value K). From those insert the "value" strings into a hash_table by using the pre-calculated "K" value for each "key" string.
Although a hash method is fast, there is still the possibility of collision (two inputs generating the same hash value). A fast method depends on the data type of the input.
For integral types, the fastest table lookup method is an array. Use the incoming datum as an index into the array. One of the problems with this method is that the array must account for the entire spectrum of values for the fastest speed. Otherwise execution is slowed down by translating the original index into an index for the array (kind of like a hashing method).
For string input types, a nested look up may be the fastest. One example is to break up tables by length. The first array returns pointers to the table to search based on length, e.g. char * sub_table = First_Array[5] for a string of length 5. These can be configured for specialized input data.
Another method is to use a B-Tree, which is a binary tree of "pages". Behavior is similar to nested arrays.
If you let us know the input type, we can better answer your question.

Resources