Good Hashing Function - C

I'm looking to make a hashtable to store some data that I need to access quickly, instead of iterating through a linked list, and I'm having problems defining a good hash function.
Consider S as the hashtable.
I initialize S[1,0] with labels (0,...,0) and S[1,w1] = (v11, v12),
then I have two loops, j=2 to N, a=0 to W.
N and W can be any positive integer.
In there, I do S[j,a] = addSomeDifferentStuff(S[j-1,a]), creating the node S[j,a].
I really can't find a hash function that doesn't create collisions; a friend of mine has suggested hash = j + a * W.
Any suggestions?
UPDATE:
OK, so, to clarify: this is an implementation of a solution to the bi-criteria 0-1 knapsack problem, based on a labeling algorithm that converts the knapsack problem into a shortest-path problem. W is my capacity, and N is the number of items. Consider wj the weight of item j.
Inside the loops, I check whether the item can be added; if it can, I set S[j,a] = S[j-1,a-wj] + (vj1, vj2), and otherwise I just copy S[j,a] = S[j-1,a]. But accessing the labels in S[j-1,a] or S[j-1,a-wj] is expensive with linked lists, since I need to iterate through every element until I find the one I want. That is the purpose of the hashtable.

N and W can be any positive integer.
Well that's surely going to present a computability problem. You seem to be asking how to construct a perfect hash function for objects consisting of pairs of integers drawn from the ranges 0 ... N and 0 ... W, respectively. Such a function must compute (N + 1) * (W + 1) distinct values, and the bounds on N and W affect the suitable data types and algorithms.
Note, too, that it is probably most useful to consider the keys to be integer pairs, not integer powers, because N and W don't need to get very large before the powers involved are too large to be represented by any built-in type offered by your implementation. The pairs will be easier to work with on several levels.
a friend of mine has suggested hash = j + a * W.
I suppose your friend meant hash(j,a) = j + a * (N + 1). Provided that it does not overflow, that will produce a different value for each pair (j, a) drawn from the ranges specified. Alternatively, you could also use hash(j,a) = j * (W + 1) + a, subject to the same proviso about overflow. If indeed you need a perfect hash function over the full domain you've described, then I don't see much room for improvement over that on the performance side, except possibly by replacing the multiplication with a suitably-large left shift.
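For concreteness, here is a minimal C sketch of such a pair hash (my own illustration, not code from the question; hash_pair is a made-up name, and it assumes j * (W + 1) + a fits in a size_t without overflowing):

#include <stddef.h>

/* Perfect over pairs (j, a) with 0 <= a <= W, provided the
   product does not overflow a size_t. */
static size_t hash_pair(size_t j, size_t a, size_t W)
{
    return j * (W + 1) + a;
}

To index a table of B buckets you would reduce this value modulo B, at which point the function is of course no longer perfect.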
The values of those functions do vary with a and j in a completely systematic way, however, and that would be an undesirable characteristic for some uses of such a function. Finding a perfect hash function that does not have that property is a difficult problem. One typically would use a program such as gperf for such a task but that's not amenable to dynamic adaptation to different values of N and W.
Note that although that answers the question that I think you actually asked, I'm not certain it's what you are really looking for. Inasmuch as you seem to have rejected my characterization of S as an array of hashtables, instead going back to it being a singular hashtable, I suspect that you mean something different by the term "hashtable" than I do. Nevertheless, I take the question to be about the hash function, and the use to which you put that function is a separate concern.

Maybe look at https://github.com/Cyan4973/xxHash for both xxHash and its list of competing hash functions.

Related

What is the algorithm to find K for finding the median in two sorted arrays on LeetCode

The solution for finding the median of two sorted arrays is awesome. However, I am still very confused about the code that calculates K:
var aMid = aLength * k / (aLength + bLength)
var bMid = k - aMid - 1
I guess this is the key part of the algorithm, but I really don't know why it is calculated like this. To explain more clearly what I mean: the core logic is divide and conquer, considering the fact that lists of different sizes should be divided differently. I wonder why this formula works perfectly.
Can someone give me some insight into it? I searched lots of online documents and it is very hard to find materials that explain this part well.
Many thanks in advance
The link shows two different ways of computing the comparison points in each array: one always uses k/2, even if the array doesn't have that many elements; the other (which you quote) tries to distribute the comparison points based on the size of the arrays.
As can be seen from these two examples, neither of which is optimal, it doesn't make much difference how you compute the comparison points, as long as the size of the two components is generally linear in k (using a fixed size of 5 for one of the comparison points won't work, for example).
The algorithm effectively reduces the problem size by either aMid or bMid on each iteration. Ideally, the problem size would be reduced by k/2, and that's the computation you should use if both arrays have at least k/2 members. If one has too few members, you can set the comparison point for that array to its last element, and compute the other comparison point so that the total is k - 1. If you end up discarding all of the elements from one array, you can then immediately return element k of the other array.
That strategy will usually perform fewer iterations than either of the proposals in your link, but it is still O(log k).
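For illustration, here is a minimal C sketch of that strategy (my own code, not the linked solution; the name kth_smallest and the 1-based k are my choices). It requires 1 <= k <= na + nb:

static int kth_smallest(const int *a, int na, const int *b, int nb, int k)
{
    for (;;) {
        if (na == 0) return b[k - 1];              /* a exhausted */
        if (nb == 0) return a[k - 1];              /* b exhausted */
        if (k == 1)  return a[0] < b[0] ? a[0] : b[0];

        /* Ideal comparison points take k/2 elements from each array;
           clamp to the shorter array and give the remainder to the
           other, so the two counts always sum to k. */
        int i = k / 2 < na ? k / 2 : na;
        int j = k - i < nb ? k - i : nb;
        i = k - j;

        if (a[i - 1] < b[j - 1]) {
            /* a[i-1] ranks at most i + j - 1 = k - 1 overall, so the
               first i elements of a cannot be the k-th: discard them. */
            a += i; na -= i; k -= i;
        } else {
            b += j; nb -= j; k -= j;
        }
    }
}

For the median, call it with k = (na + nb + 1) / 2, and for an even total also with k = (na + nb) / 2 + 1, averaging the two results.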

Count distinct array entries [with no add memory nor array changes]

The task is to count the unique numbers in a given array. I saw numerous similar questions on SO, but here we have additional requirements that weren't stated in the other questions:
The amount of allowed additional memory is O(1)
Changes to the array are prohibited
I was able to write a quadratic algorithm which satisfies the given constraints. But I keep wondering whether one could do better on such a problem. Thank you for your time.
Algorithm working in O(n^2):
def count(a):
    unique = len(a)
    ind = 0
    while ind < len(a):
        x = a[ind]
        i = ind + 1
        while i < len(a):
            if a[i] == x:
                unique -= 1
                break
            i += 1
        ind += 1
    print("Total uniques: ", unique)
This is a very similar problem to a follow-up question in chapter 1 (Arrays and Strings) from Cracking the Coding Interview:
Implement an algorithm to determine if a string has all unique characters. What if you cannot use additional data structures?
The answer (to the follow-up question) is that if you can't assume anything about the array (namely, it is not sorted, you don't know its size, etc.), then there is no algorithm better than what you showed.
That being said, you may think about relaxing the constraints a little bit, to make it more interesting. For example, if you have an upper bound on the array size, you could use a bit vector to keep track of which values you've read before while traversing the array, although this is not strictly an O(1) solution when it comes to memory usage (one could argue that by knowing the maximum array size, the memory usage is constant, and thus O(1), but that is a little bit of cheating). Similarly, if the array was sorted, you could also solve it in O(n) by going through each element at a time and check if its neighbors are different numbers.
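For the relaxed, sorted case, here is a minimal C sketch (count_distinct_sorted is an illustrative name): since equal values are adjacent in a sorted array, an element is a new distinct value exactly when it differs from its predecessor.

static int count_distinct_sorted(const int *a, int n)
{
    if (n == 0) return 0;
    int unique = 1;                 /* the first element starts a run */
    for (int i = 1; i < n; ++i)
        if (a[i] != a[i - 1])       /* a new run begins here */
            ++unique;
    return unique;
}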
Because there is no underlying structure in the array given (sorted, etc.) you are forced to brute force every value in the array...
There is a more complicated approach that I believe would work. It entails keeping your array of unique numbers sorted. This means that it would take more time when inserting into the array, but would allow you to look up values much more quickly. You should be able to insert into the array in log n time by looking at the value directly in the middle of the array and checking if it's larger or smaller. You'd then eliminate half the array as a valid insertion location and repeat. You would use a similar approach to look up values in the array. The only issue is that it requires more memory than the O(1) you are allowed.
That being said, I think the constraints on the task restrict the algorithm to O(n^2).

Minimal perfect hash for N number of unknown keys

I have two unsorted arrays of 32-bit unsigned integers, size N1 and N2, respectively. Each array may contain duplicates. I would like to map each value (2^32 possible keys) to a spot in a byte-array of size (N1 + N2) to record frequencies of each key. Duplicate key values should map to the same position in this array. Additionally, the frequency of each integer won't go above 100 (which is why I chose a byte-array to record each key's frequency to save space); if the max possible frequency were to go above this, I would simply change the byte-array to an array of shorts or something.
In the end, I need an array of size N1 + N2 -- not necessarily all entries will be used, as duplicates may have been encountered -- with frequencies of each unique key value. Worst case scenario, only one byte entry will be used (e.g. all values in both arrays are the same) leaving ((N1 + N2) - 1) entries unused. Best case scenario, all byte-entries are used.
From what I understand, I need to find a minimal perfect hash function to map a known number of unknown keys (N1 + N2; all ranging from 0 to 2^32) to a known number of spots (N1 + N2). I was able to find a couple of other posts, but both answers basically said to use gperf:
Is it possible to make a minimal perfect hash function in this situation?
Minimal perfect hash function
The second one (Minimal perfect hash function) is exactly what I'm attempting to do.
Rather than expecting source code from an answer (I'm using C by the way), I'd much prefer an explanation of how to go about creating a minimally perfect hashing function for N-number of any possible positive integers to N buckets. I could easily do this with a 4 GB array of direct mappings for every possible integer with lots of unused space, but I'd rather try to reduce this massive inefficiency of space. I'm also hoping to not use any external libraries, mostly for educational purposes to learn more about hashing, itself.
This is clearly impossible. If you have N numbers, there's no way to come up with a function which will hash them all to distinct values in the range [0, N) unless you know what those numbers are going to be beforehand. Otherwise, given any such function (with N < 2^32, of course), there will be at least one pair of integers such that both of those integers hash to the same value, so that function won't be perfect if those integers both show up in the input.
If you relax the conditions to allow the function to be created on the fly, this becomes possible, but only in a really trivial and useless way. Namely, a hash function could build itself up as it goes by recording each number that's fed into it and generating a new unique output for each one (say, counting up from 0). But such a function would need a hash table (or something equivalent) as part of its implementation, so it'd certainly be no use in implementing a hash table!
According to the Pigeonhole Principle, you will have "hash slots" occupied by more than one number. In other words: different numbers will "hash" to the same value.
Now, I wonder if you could benefit from a Bloom Filter. From Wikipedia:
False positive matches are possible, but false negatives are not; i.e.
a query returns either "possibly in set" or "definitely not in set".
If something is "definitely" not in the set of keys, you can move on (its frequency is one), and if it possibly is in the set, then you process it further to accumulate its actual statistic.
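To illustrate, here is a minimal Bloom filter sketch in C; the bit-array size and the two mixing hashes are arbitrary choices for the example, not tuned parameters:

#include <stdint.h>

#define BLOOM_BITS (1u << 20)   /* 2^20 bits = 128 KiB of state */

static uint8_t bloom[BLOOM_BITS / 8];

/* Two cheap, roughly independent mixing hashes over 32-bit keys. */
static uint32_t h1(uint32_t x) { return (x * 2654435761u) % BLOOM_BITS; }
static uint32_t h2(uint32_t x) { x ^= x >> 16; x *= 0x45d9f3bu; x ^= x >> 16; return x % BLOOM_BITS; }

static void bloom_add(uint32_t key)
{
    bloom[h1(key) / 8] |= (uint8_t)(1u << (h1(key) % 8));
    bloom[h2(key) / 8] |= (uint8_t)(1u << (h2(key) % 8));
}

/* 0 means "definitely not added before"; 1 means "possibly added"
   (false positives happen, false negatives never do). */
static int bloom_maybe_contains(uint32_t key)
{
    return ((bloom[h1(key) / 8] >> (h1(key) % 8)) & 1)
        && ((bloom[h2(key) / 8] >> (h2(key) % 8)) & 1);
}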

Infinity as sentinel in mergesort?

I am currently reading Cormen's "Introduction to Algorithms" and I found something called a sentinel.
It's used in the mergesort algorithm as a tool to decide when one of the two merging lists is exhausted. Cormen uses the infinity symbol for the sentinels in his pseudocode and I would like to know how such an infinite value can be implemented in C.
A sentinel is just a dummy value. For strings, you might use a NULL pointer, since that's not a sensible thing to have in a list. For integers, you might use a value unlikely to occur in your data set; e.g., if you are dealing with a list of ages, you can use the age -1 to mark the end of the list.
You can get an "infinite value" for floats, but it's not the best idea. For arrays, pass the size explicitly; for lists, use a null pointer sentinel.
In C, when sorting an array, you usually know the size, so you could actually sort a range [begin, end) in which end is one past the end of the array. E.g. int a[n] could be sorted as sort(a, a + n).
This allows you to do two things:
call your sort recursively with the part of the array you haven't sorted yet (merge sort is a recursive algorithm)
use end as a sentinel.
If the elements in your list can range over the full span of the given data type, from the smallest to the largest possible value, the code you are looking at won't work, and you'll have to come up with something else (which I am sure can be done). I have that book in front of me right now, and I have a solution that will work for you if the values range from the smallest value of the data type to, at most, the largest minus one.

Open the book back up to page 31 and take a look at the Merge function. The lines causing you problems are lines 8 and 9, where the sentinel value of infinity is used. We know the two sub-arrays are each sorted already and just need to be merged into the array that is twice as big. This means the largest element of each half is at the end of its sub-array, and the larger of those two is the largest element of the merged array we will have once the merge completes.

All we need to do is determine the larger of those two values, increment it by one, and use that as our sentinel. So, lines 8 and 9 of the code should be replaced by the following 6 lines of code:
if L[n1] < R[n2]
    largest = R[n2]
else
    largest = L[n1]
L[n1 + 1] = largest + 1
R[n2 + 1] = largest + 1
That should work for you. I have a test tomorrow in my algorithms course on this stuff, and I came across your post and thought I'd help you out. The authors' use of sentinels in this book has always bugged me, and I absolutely cannot stand how much they are in love with recursion. Iteration is faster and, in my opinion, usually easier to come up with and grasp.
The trick is that you don't have to check array bounds when incrementing the index of whichever list you take from in the inner while loops. Hence you need sentinels that are larger than all other elements. In C++ I usually use std::numeric_limits<TYPE>::max().
The C equivalents are macros like INT_MAX, UINT_MAX, LONG_MAX, etc. Those are good sentinels. If you need two different sentinels, use ..._MAX and ..._MAX - 1.
This is all assuming you're merging two lists that are ordered ascending.
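Putting that together, here is a minimal C sketch of a CLRS-style merge with INT_MAX sentinels (my own code, not from the book; it assumes C99 variable-length arrays, ascending order, and that no real element equals INT_MAX):

#include <limits.h>

/* Merge the sorted runs a[lo..mid] and a[mid+1..hi] in place. */
static void merge(int a[], int lo, int mid, int hi)
{
    int n1 = mid - lo + 1, n2 = hi - mid;
    int L[n1 + 1], R[n2 + 1];            /* one extra slot per sentinel */

    for (int i = 0; i < n1; ++i) L[i] = a[lo + i];
    for (int j = 0; j < n2; ++j) R[j] = a[mid + 1 + j];
    L[n1] = R[n2] = INT_MAX;             /* sentinels never win a comparison */

    for (int k = lo, i = 0, j = 0; k <= hi; ++k)
        a[k] = (L[i] <= R[j]) ? L[i++] : R[j++];
}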

How to calculate difference between two sets in C?

I have two arrays, say A and B with |A|=8 and |B|=4. I want to calculate the set difference A-B. How do I proceed? Please note that there are no repeated elements in either of the sets.
Edit: Thank you so much, everybody, for a myriad of elegant solutions. Since I am in the prototyping stage of my project, for now I implemented the simplest solution, as told by Brian and Owen. But I do appreciate the clever use of data structures suggested by the rest of you, even though I am not a computer scientist but an engineer, and never studied data structures as a course. Looks like it's about time I really start reading CLRS, which I have been procrastinating on for quite a while :) Thanks again!
sort arrays A and B
the result will be in C
let a be the first elem of A
let b be the first elem of B
then (a C sketch of these steps follows the list):
1) while a < b: insert a into C and a = next elem of A
2) while a > b: b = next elem of B
3) if a = b: a = next elem of A and b = next elem of B
4) if b goes to end: insert rest of A into C and stop
5) if a goes to end: stop
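Here is that sketch for sorted int arrays (my own code; the names A, B, C match the steps above, and it assumes ascending order, no duplicates within each set, and that C has room for na elements):

/* Writes A - B into C; returns the number of elements written. */
static int set_difference(const int *A, int na, const int *B, int nb, int *C)
{
    int i = 0, j = 0, nc = 0;
    while (i < na && j < nb) {
        if (A[i] < B[j])      C[nc++] = A[i++];   /* step 1 */
        else if (A[i] > B[j]) j++;                /* step 2 */
        else                  { i++; j++; }       /* step 3 */
    }
    while (i < na) C[nc++] = A[i++];              /* steps 4 and 5 */
    return nc;
}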
Iterate over each element of A; if an element is not in B, add it to a new set C.
It depends on how you want to represent your sets, but if they are just packed bits then you can use bitwise operators, e.g. D = A & ~B; would give you the set difference A-B if the sets fit into an integer type. For larger sets you might use arrays of integer types and iterate, e.g.
for (i = 0; i < N; ++i)
{
    D[i] = A[i] & ~B[i];
}
The following assumes the sets are stored as a sorted container (as std::set does).
There's a common algorithm for merging two ordered lists to produce a third. The idea is that when you look at the heads of the two lists, you can determine which is the lower, extract that, and add it to the tail of the output, then repeat.
There are variants which detect the case where the two heads are equal, and treat this specially. Set intersections and unions are examples of this.
With a set asymmetric difference, the key point is that for A-B, when you extract the head of B, you discard it. When you extract the head of A, you add it to the output unless the head of B is equal, in which case you extract that too and discard both.
Although this approach is designed for sequential-access data structures (and tape storage etc), it's sometimes very useful to do the same thing for a random-access data structure so long as it's reasonably efficient to access it sequentially anyway. And you don't necessarily have to extract things for real - you can do copying and step instead.
The key point is that you step through the inputs sequentially, always looking at the lowest remaining value next, so that (if the inputs have no duplicates) you will see matched items together. You therefore always know whether your next-lowest value to handle is an item from A with no match in B, an item from B with no match in A, or an item that appears in both A and B.
More generally, the algorithm for the set difference depends on the representation of the set. For example, if the set is represented as a bit-vector, the above would be overcomplex and slow - you'd just loop through the vectors doing bitwise operations. If the set is represented as a hashtable (as in the tr1 unordered_set) the above is wrong as it requires ordered inputs.
If you have your own binary tree code that you're using for the sets, one good option is to convert both trees into linked lists, work on the lists, then convert the resulting list to a perfectly balanced tree. The linked-list set-difference is very simple, and the two conversions are re-usable for other similar operations.
EDIT
On the complexity - using these ordered merge-like algorithms is O(n) provided you can do the in-order traversals in O(n). Converting to a list and back is also O(n) as each of the three steps is O(n) - tree-to-list, set-difference and list-to-tree.
Tree-to-list basically does a depth-first traversal, deconstructing the tree as it goes. There's a trick for making this iterative: store the "stack" in the part-handled nodes, changing a left-child pointer into a parent pointer just before you step to the left child. This is a good idea if the tree may be large and unbalanced.
Converting a list to a tree basically involves a depth-first traversal of an imaginary tree (based on the size, known from the start) building it for real as you go. If a tree has 5 nodes, for instance, you can say that the root will be node 3. You recurse to build a two-node left subtree, then grab the next item from the list for that root, then recurse to build a two-node right subtree.
The list-to-tree conversion shouldn't need to be implemented iteratively - recursive is fine as the result is always perfectly balanced. If you can't handle the log n recursion depth, you almost certainly can't handle the full tree anyway.
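As a minimal C sketch of that list-to-tree conversion (my own illustration; it assumes the sorted list is threaded through the nodes' right pointers, and list_to_tree is a made-up name):

struct node {
    int value;
    struct node *left;
    struct node *right;   /* doubles as the list's "next" link */
};

/* Build a perfectly balanced BST from the first n list nodes at
   *listp, advancing *listp past them; an in-order construction. */
static struct node *list_to_tree(struct node **listp, int n)
{
    if (n == 0)
        return NULL;
    struct node *left = list_to_tree(listp, n / 2);   /* left subtree */
    struct node *root = *listp;                       /* grab next list node */
    *listp = root->right;                             /* advance the list */
    root->left = left;
    root->right = list_to_tree(listp, n - n / 2 - 1); /* right subtree */
    return root;
}

For five nodes this makes node 3 the root, with two-node subtrees on either side, matching the description above.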
Implement a set object in C. You can do it using a hash table for the underlying storage. This is obviously a non-trivial exercise, but a few Open Source solutions exist. Then you simply need to add all the elements of A, iterate over B, and remove any that are elements of your set.
The key point is to use the right data structure for the job.
For larger sets I'd suggest sorting the numbers and iterating through them by emulating the code at http://www.cplusplus.com/reference/algorithm/set_difference/ which would be O(N log N); but since the set sizes are so small, the solution given by Brian seems fine, even though it's theoretically slower at O(N^2).
