I need some ideas for developing a good hash function for my assignment. I have a list of all the countries in the world (around 190 in total). The name of each country is the key for the hash function. Is there a specific kind of hash function anyone would recommend for storing this data in a hash table without many collisions? Also, can you perhaps give an example of how to implement it?
Use GNU gperf. For inputs like yours, it will generate C code for you which implements a perfect hash function (for the given inputs). No collisions, no worries.
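For illustration, a rough sketch of what the input might look like (the file name countries.gperf is made up here, and the exact layout and flags are documented in the gperf manual):

%%
Afghanistan
Albania
Algeria
Andorra
Angola

Running something along the lines of gperf -L ANSI-C countries.gperf > countries.c should emit a perfect hash() plus an in_word_set() lookup function that you call with a country name and its length.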
You can use a generated perfect hash for that (GNU gperf).
Or, if the set of strings is dynamic, you can use a ternary trie.
For N unique strings it will give you a unique number in [1..N]. For your case it will be faster than a hash table.
Here is my implementation of such thing:
http://code.google.com/p/tiscript/source/browse/trunk/tool/tl_ternary_tree.h
The simplest approach I can think of is, for each country's name, to compute the sum of the ASCII values of its characters and use that as the hash value:
int hash(const char *s)
{
    int h = 0;
    while (s && *s)
        h += *s++;
    return h;
}
If your hash map has size N, you store country names with map[hash(my_country) % N] = my_country. Conceptually.
Just try this approach and see whether the resulting hash values are sufficiently uniformly distributed. Note that the quality of the distribution may also depend on N.
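For instance, a throwaway test along these lines (the sample names and the table size N = 389 are just placeholders) prints which bucket each key lands in, so you can eyeball how even the spread is:

#include <stdio.h>

#define N 389   /* illustrative table size; pick something near your real size */

int hash(const char *s)   /* same ASCII-sum hash as above */
{
    int h = 0;
    while (s && *s)
        h += *s++;
    return h;
}

int main(void)
{
    const char *samples[] = { "France", "Germany", "Japan", "Brazil", "Chad" };
    for (int i = 0; i < 5; i++)
        printf("%-10s -> bucket %d\n", samples[i], hash(samples[i]) % N);
    return 0;
}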
Related
I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.
Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).
If X ∈ {A..Z} and N ∈ {0..9}, that gives 26^3 * 10^3 = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.
Here you go: a Python script to generate all 17 million possible keys:
import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print("%s%03d" % (''.join(prefix), numeric))
and a simple C bitset filter:
#include <limits.h>

/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}

/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);
    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;
    return byte & mask;
}
which for your purposes you'd use like so:
#include <stdlib.h>

/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}

/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}
#include <stdio.h>

int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */
    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}
This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:
useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt
real 0m0.474s
user 0m0.436s
sys 0m0.036s
useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000
real 0m0.467s
user 0m0.452s
sys 0m0.016s
Admittedly, the input file is cached in memory for these runs.
The reasonable answer is to use the algorithm with the smallest complexity. I encourage you to use a hash table to keep track of inserted elements; the overall complexity is O(n), because lookup in a hash table is O(1) on average. In your case, I suggest running the check while reading the file.
public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;
        hashTable.Add(input, string.Empty);
    }
    return false;
}
A fast but memory-inefficient solution would use

// Entries are XXXNNN, interpreted as 6 base-36 digits
static char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 }; // file-scope/static so it is not on the stack; or calloc() this
char buffer[100];

while (fgets(buffer, sizeof buffer, istream)) {
    unsigned long index = strtoul(buffer, NULL, 36);
    if (found[index]++) {
        Dupe_found();
        break;
    }
}
The trouble with the post is that it wants the "fastest algorithm", but does not detail memory concerns and their importance relative to speed. So speed must be king, and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.
Depending on how many different values there can be, you have some options:
Sort the whole array and then look for a repeating element; complexity is O(n log n), but it can be done in place, so memory is O(1). A sketch of this follows below.
Build a set of all elements. Depending on the chosen set implementation this can be O(n) (hash set) or O(n log n) (binary tree), but it costs some extra memory.
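A minimal sketch of the first option, assuming the lines have already been read into an array of fixed 8-byte records holding "XXXNNN" plus a terminator (has_dupe and the record size are just for illustration):

#include <stdlib.h>
#include <string.h>

static int cmp_keys(const void *a, const void *b)
{
    return memcmp(a, b, 7);   /* 6 significant chars + terminator */
}

/* returns 1 if at least one duplicate exists, 0 otherwise */
int has_dupe(char keys[][8], size_t n)
{
    qsort(keys, n, sizeof keys[0], cmp_keys);
    for (size_t i = 1; i < n; i++)
        if (memcmp(keys[i - 1], keys[i], 7) == 0)
            return 1;
    return 0;
}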
The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).
The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.
However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "(min - max) * N / CPUs" to "(min - max) * (N+1) / CPUs"). This means that all CPUs read from the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.
The next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can/would be done using SIMD, which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string, and to pack more into each cache line.
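A rough pthreads sketch of the range-partitioning idea (input loading is omitted; NCPUS, encode, worker and the plain int flag are simplifications for illustration, and a production version would use proper atomics for the early-exit flag):

#include <pthread.h>
#include <stdlib.h>

#define NKEYS (26 * 26 * 26 * 1000)   /* all possible XXXNNN keys */
#define NCPUS 4

static unsigned char bitmap[(NKEYS + 7) / 8];
static char (*keys)[8];               /* filled in from the file elsewhere */
static size_t nkeys;
static int dupe_found;                /* simplification: not a real atomic */

static int encode(const char *s)      /* same packing as testAndSetMyKey above */
{
    int alpha   = s[0] - 'A' + 26 * (s[1] - 'A' + 26 * (s[2] - 'A'));
    int numeric = s[3] - '0' + 10 * (s[4] - '0' + 10 * (s[5] - '0'));
    return numeric + 1000 * alpha;
}

struct range { int lo, hi; };         /* each CPU owns keys in [lo, hi) */

static void *worker(void *arg)
{
    const struct range *r = arg;
    for (size_t i = 0; i < nkeys && !dupe_found; i++) {
        int k = encode(keys[i]);
        if (k < r->lo || k >= r->hi)
            continue;                          /* another CPU's slice */
        unsigned char mask = 1u << (k % 8);
        if (bitmap[k / 8] & mask) { dupe_found = 1; break; }
        bitmap[k / 8] |= mask;                 /* slices are disjoint: no atomics needed */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NCPUS];
    struct range ranges[NCPUS];
    /* ... read the file into keys[]/nkeys here ... */
    for (int c = 0; c < NCPUS; c++) {
        ranges[c].lo = (int)((long long)NKEYS * c / NCPUS);
        ranges[c].hi = (int)((long long)NKEYS * (c + 1) / NCPUS);
        pthread_create(&tid[c], NULL, worker, &ranges[c]);
    }
    for (int c = 0; c < NCPUS; c++)
        pthread_join(tid[c], NULL);
    return dupe_found;
}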
Since you have several million entries I think the best algorithm would be counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times every element occurs. So you could write a function that does a counting-sort-style pass over the array:
#include <stdlib.h>

/* returns 1 as soon as a duplicate is found, 0 otherwise; max is the largest value in a[] */
int counting_sort(int a[], int n, int max)
{
    int *count = calloc(max + 1, sizeof *count);   /* a VLA cannot be initialised with {0} */
    for (int i = 0; i < n; ++i) {
        if (++count[a[i]] >= 2) {
            free(count);
            return 1;
        }
    }
    free(count);
    return 0;
}
You should first find the max element (in O(n)). The asymptotic time complexity of counting sort is O(max(n, M)), where M is the largest value found in the array. Because you have several million entries, if M is of the order of a few million this will run in O(n) (finding M already costs O(n) anyway). If you also know that there is no way M can be greater than some millions, then you can be sure this gives O(n) and not just O(max(n, M)).
You can see counting sort visualization to understand it better, here:
https://www.cs.usfca.edu/~galles/visualization/CountingSort.html
Note that the above function doesn't implement counting sort exactly; it stops as soon as it finds a duplicate, which is even more efficient, since you only want to know whether a duplicate exists.
I'd like a suggestion in c language for the following problem:
I need an association between strings and integers like this:
"foo" => 45,
"bar" => 1023,
etc...
and be able to find the string using the associated integer and the integer using the associated string.
For string to integer I can use hash tables, but I'll lose the way back.
The simple solution that I'm using but which is very slow is to create a table:
static param_t params [] = {
{ "foo", 45 },
{ "bar", 1023 },
...
};
and using two functions that compare each entry (string or integer) to get the string or the integer.
This works perfectly, but it is a linear search, which is very slow.
What could I use to have a search algorithm in O(1) to find a string and O(size of string) to find the integer?
Any ideas?
The easiest way is to implement lookup tables, preferably sorted by the integer value ("primary key").
typedef enum
{
FOO_INDEX,
BAR_INDEX,
...
N
} some_t;
const int int_array [] = // values sorted in ascending order, smallest first
{
45,
1023,
...
};
const char* str_array [] =
{
"foo",
"bar",
...
};
Now you can use int_array[FOO_INDEX] and str_array[FOO_INDEX] to get the desired data.
Since these are constant tables set at compile-time, you can sort the data. All lookups can then be done with binary search, O(log n). If you have the integer value but need to know the index, perform a binary search on the int_array. And once you have found the index, you get instant lookup from there.
For this to work, both arrays must have the exact size N. To ensure array sizes and data integrity inside those arrays, use a compile-time assert:
static_assert(sizeof(int_array)/sizeof(*int_array) == N, "Bad int_array");
static_assert(sizeof(str_array)/sizeof(*str_array) == N, "Bad str_array");
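For example, the reverse lookup (integer to string) could look roughly like this, built on the int_array/str_array/N declarations above (lookup_str and cmp_int are just illustrative names):

#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* returns the string paired with 'value', or NULL if it is not in the table */
const char *lookup_str(int value)
{
    const int *p = bsearch(&value, int_array, N, sizeof *int_array, cmp_int);
    return p ? str_array[p - int_array] : NULL;
}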
Sort your list with qsort first, and then use bsearch to find items. It's not O(1), but at least it is O(log(n)).
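As a sketch, using the param_t table from the question (the struct layout is assumed, and nparams, sort_params and find_param are illustrative names):

#include <stdlib.h>
#include <string.h>

typedef struct { const char *name; int value; } param_t;   /* assumed layout */

static int cmp_by_name(const void *a, const void *b)
{
    return strcmp(((const param_t *)a)->name, ((const param_t *)b)->name);
}

/* sort the table once at startup ... */
void sort_params(param_t *params, size_t nparams)
{
    qsort(params, nparams, sizeof params[0], cmp_by_name);
}

/* ... then each lookup by name is O(log n) */
const param_t *find_param(const param_t *params, size_t nparams, const char *name)
{
    param_t key = { name, 0 };
    return bsearch(&key, params, nparams, sizeof params[0], cmp_by_name);
}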
Use two hashmaps. One for the association from integer to string and another one for the association from string to integer.
An inefficient way would be to convert the string to a number in base 256: the first letter's ASCII value times 256 to the power of 0 (i.e. 1), plus the second letter's ASCII value times 256 to the power of 1, and so on. Very inefficient, because a long won't be enough, so you'd either have to keep the number in another string or use an arbitrary-precision math library. I know there are hashes in Ruby and Perl; they're basically arrays you index with a key (which can be a string), but I don't know how they work internally.
I am taking the cs50 course on edx and am doing the hacker edition of pset3 (in essence it is the advanced version).
Basically the program takes a value to be searched for as the command-line argument, and then asks for a bunch of numbers to be used in an array.
Then it sorts and searches that array for the value entered at the command-line.
The way the program is implemented, it uses a pseudo-random number generator to feed the numbers for the array.
The task is to write the search and sorting functions.
I already have the searching function, but the sorting function is supposed to be O(n).
In the regular version you were supposed to use a O(n ^ 2) algorithm which wasn't a problem to implement. Also using a log n algorithm wouldn't be an issue either.
But the problem set specifically asks for an O(n) algorithm.
It gives a hint by saying that no number in the array is going to be negative, and none greater than LIMIT (the numbers output by the generator are taken modulo a limit so they are not greater than 65000). But how does that help in getting the algorithm to be O(n)?
But the counting sort algorithm, which purports to be an acceptable solution, returns a new sorted array rather than actually sorting the original one, and that contradicts the pset specification, which reads: 'As this return type of void implies, this function must not return a sorted array; it must instead "destructively" sort the actual array that it's passed by moving around the values therein.'
Also, if we decide to copy the sorted array onto the original one using another loop, with so many consecutive loops I'm not sure the sorting function can still be considered to run in O(n). Here is the actual pset; the question is about the sorting part.
Any ideas on how to implement such an algorithm would be greatly appreciated. It's not necessary to provide actual code, just the logic of how you can create an O(n) algorithm under the conditions provided.
It gives a hint in saying that, since no number in the array is going to be negative, and none greater than LIMIT (the numbers outputted by the generator are modulo'd to not be higher than 65000). But how does that help in getting the algorithm to be O(n)?
That hint directly seems to point towards counting sort.
You create 65000 buckets and use them to count the number of occurrences of each number.
Then, you just revisit the buckets and you have the sorted result.
It takes advantage of the fact that:
They are integers.
They have a limited range.
Its complexity is O(n), and as this is not a comparison-based sort, the O(n log n) lower bound on sorting does not apply. A very good visualization is here.
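A sketch of a destructive, in-place version that matches the pset's void sort(int values[], int n) signature (LIMIT 65536 here is an assumption; adjust it to whatever bound the pset actually guarantees):

#include <string.h>

#define LIMIT 65536   /* assumed upper bound on the values */

void sort(int values[], int n)
{
    static int count[LIMIT];
    memset(count, 0, sizeof count);       /* O(LIMIT): reset the buckets */

    for (int i = 0; i < n; i++)           /* O(n): tally every value */
        count[values[i]]++;

    int k = 0;                            /* O(n + LIMIT): overwrite the original array */
    for (int v = 0; v < LIMIT; v++)
        while (count[v]-- > 0)
            values[k++] = v;
}

Because the original array is overwritten in place, no separate copy-back loop is needed, and the total work stays O(n + LIMIT).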
As #DarkCthulhu said, counting sort is clearly what they were urging you to use. But you could also use a radix sort.
Here is a particularly concise radix-2 sort that exploits a nice connection to Gray codes. In your application it would require 16 passes over the input, one per data bit. For big inputs, the counting sort is likely to be faster. For small ones, the radix sort ought to be faster because you avoid initializing 256K bytes or more of counters.
See this article for explanation.
/* safe_malloc is assumed to be a checked malloc defined in the linked article */
void sort(unsigned short *a, int len)
{
    unsigned short bit, *s = a, *d = safe_malloc(len * sizeof *d), *t;
    unsigned is, id0, id1;

    for (bit = 1; bit; bit <<= 1, t = s, s = d, d = t)
        for (is = id0 = 0, id1 = len; is < (unsigned)len; ++is)
            if (((s[is] >> 1) ^ s[is]) & bit)
                d[--id1] = s[is];
            else
                d[id0++] = s[is];
    free(d);
}
I won't go into details, but I'm attempting to implement an algorithm similar to the Boyer-Moore-Horspool algorithm, only using hex color values instead of characters (i.e., there is a much greater range).
Following the example on Wikipedia, I originally had this:
size_t jump_table[0xFFFFFF + 1];
memset(jump_table, default_value, sizeof(jump_table));
However, 0xFFFFFF is obviously a huge number and this quickly causes C to seg-fault (but not stack-overflow, disappointingly).
Basically, what I need is an efficient associative array mapping integers to integers. I was considering using a hash table, but having a malloc'd struct for each entry just seems overkill to me (I also do not need hashes generated, as each key is a unique integer and there can be no duplicate entries).
Does anyone have any alternatives to suggest? Am I being overly pragmatic about this?
Update
For those interested, I ended up using a hash table via the uthash library.
0xffffff is rather too large to put on the stack on most systems, but you absolutely can malloc a buffer of that size (at least on current computers; not so much on a smartphone). Whether or not you should do it for this task is a separate issue.
Edit: Based on the comment, if you expect the common case to have a relatively small number of entries other than the "this color doesn't appear in the input" skip value, you should probably just go ahead and use a hash map (obviously only storing values that actually appear in the input).
(ignore earlier discussion of other data structures, which was based on an incorrect recollection of the algorithm under discussion -- you want to use a hash table)
If the array you were going to make (of size 0xFFFFFF) was going to be sparse you could try making a smaller array to act as a simple hash table, with the size being 0xFFFFFF / N and the hash function being hexValue / N (or hexValue % (0xFFFFFF / N)). You'll have to be creative to handle collisions though.
This is the only way I can foresee getting out of mallocing structs.
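One way to flesh that out, using a plain static table and linear probing for the collisions (N = 16 and all the names here are arbitrary choices for the sketch):

#include <stddef.h>

#define N          16
#define TABLE_SIZE ((0xFFFFFF + 1) / N)

typedef struct { int color; size_t skip; } entry_t;
static entry_t table[TABLE_SIZE];                 /* color == -1 marks an empty slot */

void table_init(void)
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        table[i].color = -1;
}

/* linear probing; assumes the table never becomes completely full */
static entry_t *probe(int color)
{
    size_t i = (size_t)color % TABLE_SIZE;
    while (table[i].color != -1 && table[i].color != color)
        i = (i + 1) % TABLE_SIZE;
    return &table[i];
}

void set_skip(int color, size_t skip)
{
    entry_t *e = probe(color);
    e->color = color;
    e->skip  = skip;
}

size_t get_skip(int color, size_t default_skip)   /* unseen colors get the default */
{
    entry_t *e = probe(color);
    return e->color == color ? e->skip : default_skip;
}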
You can malloc(3) 0xFFFFFF blocks of size_t on the heap (for simplicity), and address them as you do with an array.
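For instance (the size here is the full 24-bit colour range, and error handling is kept minimal):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* roughly 128 MB on a 64-bit system: one size_t per possible colour */
    size_t *jump_table = calloc(0xFFFFFF + 1, sizeof *jump_table);
    if (!jump_table) {
        perror("calloc");
        return 1;
    }
    /* fill with the default skip value, then index it like the original array */
    free(jump_table);
    return 0;
}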
As for the stack overflow: basically the program receives a SIGSEGV, which can be the result of a stack overflow, accessing illegal memory, writing to a read-only segment, etc. They are all abstracted under the same error message, "Segmentation fault".
But why don't you use a higher-level language like Python that supports associative arrays?
At possibly the cost of some speed, you could try modifying the algorithm to find only matches that are aligned to some boundary (every three or four symbols), then perform the search at byte level.
You could create a sparse array of sorts which has "pages" like this (this example uses 256 "pages", so the uppermost byte is the page number):

#include <stdlib.h>

int *pages[256];

/* call this first to make sure all of the pages start out NULL! */
void init_pages(void) {
    for (int i = 0; i < 256; ++i) {
        pages[i] = NULL;
    }
}

int get_value(int index) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so it will zero it out */
    }
    return pages[index / 0x10000][index % 0x10000];
}

void set_value(int index, int value) {
    if (pages[index / 0x10000] == NULL) {
        pages[index / 0x10000] = calloc(0x10000, sizeof(int)); /* calloc so it will zero it out */
    }
    pages[index / 0x10000][index % 0x10000] = value;
}

This will allocate a page the first time it is touched, whether by a read or a write.
To avoid the overhead of malloc you can use a hash table where the entries in the table are your structs, assuming they are small. In your case a pair of integers should suffice, with a special value to indicate emptiness of the slot in the table.
How many values are there in your output space, i.e. how many different values do you map to in the range 0-0xFFFFFF?
Using randomized universal hashing you can come up with a collision-free hash function with a table no bigger than 2 times the number of values in your output space (for a static table).
Update: Please file this under bad ideas. You don't get anything for free in life and here is certainly proof. A simple idea gone bad. It is definitely something to learn from however.
Lazy programming challenge: if I pass a function that returns true or false 50-50 as qsort's comparison function, I think I can effectively unsort an array of structures by writing 3 lines of code.
int main(int argc, char **argv)
{
    srand(time(NULL));                                  /* 1 */
    ...
    /* qsort(....) */                                   /* 2 */
}

...

int comp_nums(const int *num1, const int *num2)
{
    float frand =
        (float) (rand()) / ((float) (RAND_MAX + 1.0));  /* 3 */
    if (frand >= 0.5f)
        return GREATER_THAN;
    return LESS_THAN;
}
Any pitfalls I need to look for? Is it possible in fewer lines through swapping, or is this the cleanest I get for 3 non-trivial lines?
Bad idea. I mean really bad.
Your solution gives an unpredictable result, not a random result and there is a big difference. You have no real idea of what a qsort with a random comparison will do and whether all combinations are equally likely. This is the most important criterion for a shuffle: all combinations must be equally likely. Biased results equal big trouble. There's no way to prove that in your example.
You should implement the Fisher-Yates shuffle (otherwise known as the Knuth shuffle).
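A minimal C sketch of the Fisher-Yates shuffle (using rand() for brevity; a real implementation would use a better generator and avoid the modulo bias):

#include <stdlib.h>

void shuffle(int *a, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;   /* 0 <= j <= i-1; note the modulo bias */
        int tmp = a[i - 1];
        a[i - 1] = a[j];
        a[j] = tmp;
    }
}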
In addition to the other answers, this is worse than a simple Fisher-Yates shuffle because it is too slow. The qsort algorithm is O(n*log(n)), the Fisher-Yates is O(n).
Some more detail is available in Wikipedia on why this kind of "shuffle" does not generally work as well as the Fisher-Yates method:
Comparison with other shuffling algorithms

The Fisher-Yates shuffle is quite efficient; indeed, its asymptotic time and space complexity are optimal. Combined with a high-quality unbiased random number source, it is also guaranteed to produce unbiased results. Compared to some other solutions, it also has the advantage that, if only part of the resulting permutation is needed, it can be stopped halfway through, or even stopped and restarted repeatedly, generating the permutation incrementally as needed. In high-level programming languages with a fast built-in sorting algorithm, an alternative method, where each element of the set to be shuffled is assigned a random number and the set is then sorted according to these numbers, may be faster in practice[citation needed], despite having worse asymptotic time complexity (O(n log n) vs. O(n)). Like the Fisher-Yates shuffle, this method will also produce unbiased results if correctly implemented, and may be more tolerant of certain kinds of bias in the random numbers. However, care must be taken to ensure that the assigned random numbers are never duplicated, since sorting algorithms in general won't order elements randomly in case of a tie. A variant of the above method that has seen some use in languages that support sorting with user-specified comparison functions is to shuffle a list by sorting it with a comparison function that returns random values. However, this does not always work: with a number of commonly used sorting algorithms, the results end up biased due to internal asymmetries in the sorting implementation.[7]
This links to here:
just one more thing: While writing this article I experimented with various versions of the methods and discovered one more flaw in the original version (renamed by me to shuffle_sort). I was wrong when I said "it returns a nicely shuffled array every time it is called."

The results are not nicely shuffled at all. They are biased. Badly. That means that some permutations (i.e. orderings) of elements are more likely than others. Here's another snippet of code to prove it, again borrowed from the newsgroup discussion:
N = 100000
A = %w(a b c)
Score = Hash.new { |h, k| h[k] = 0 }

N.times do
  sorted = A.shuffle
  Score[sorted.join("")] += 1
end

Score.keys.sort.each do |key|
  puts "#{key}: #{Score[key]}"
end
This code shuffles an array of three elements (a, b, c) 100,000 times and records how many times each possible result was achieved. In this case, there are only six possible orderings and we should get each one about 16666.66 times. If we try an unbiased version of shuffle (shuffle or shuffle_sort_by), the results are as expected:
abc: 16517
acb: 16893
bac: 16584
bca: 16568
cab: 16476
cba: 16962
Of course, there are some deviations, but they shouldn't exceed a few percent of the expected value and they should be different each time we run this code. We can say that the distribution is even.

OK, what happens if we use the shuffle_sort method?
abc: 44278
acb: 7462
bac: 7538
bca: 3710
cab: 3698
cba: 33314
This is not an even distribution at all. Again?
It shows how the sort method is biased and goes into detail about why this is so. Finally, he links to Coding Horror:
Let's take a look at the correct Knuth-Fisher-Yates shuffle algorithm.
for (int i = cards.Length - 1; i > 0; i--)
{
    int n = rand.Next(i + 1);
    Swap(ref cards[i], ref cards[n]);
}
Do you see the difference? I missed it the first time. Compare the swaps for a 3 card deck:

Naïve shuffle            Knuth-Fisher-Yates shuffle
rand.Next(3);            rand.Next(3);
rand.Next(3);            rand.Next(2);
rand.Next(3);
The naive shuffle results in 3^3 (27) possible deck combinations. That's odd, because the mathematics tell us that there are really only 3! or 6 possible combinations of a 3 card deck. In the KFY shuffle, we start with an initial order, swap from the third position with any of the three cards, then swap again from the second position with the remaining two cards.
No, this won't properly shuffle the array; it will barely move elements away from their original locations, with an exponential distribution.
The comparison function isn't supposed to return a boolean type; it's supposed to return a negative number, a positive number, or zero, which qsort() uses to determine which argument is greater than the other.
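For comparison, a conforming comparator has this shape (comp_ints is just an example name):

/* negative, zero, or positive, as qsort() expects */
int comp_ints(const void *p1, const void *p2)
{
    int a = *(const int *)p1, b = *(const int *)p2;
    return (a > b) - (a < b);
}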
The Old New Thing takes on this one
I think the basic idea of randomly partitioning the set recursively on the way down and concatenating the results on the way up will work (it averages O(n log n) binary decisions, which is darn close to log2(n!)), but qsort will not be sure to do that with a random predicate.
BTW, I think the same argument and issues apply to any O(n log n) sort strategy.
rand() isn't the most random thing out there... If you want to shuffle cards or something, this isn't the best. Also, a Knuth shuffle would be quicker, but your solution is OK if it doesn't loop forever.