Fast string comparison in C

I currently have this kind of loop
while (1)
{
    generate_string(&buffer);
    for (int i = 0; i < filelines; i++)
    {
        if (strcmp(buffer, line[i]) == 0)
        {
            /* do something */
        }
    }
}
I have a file with a few million strings (which hopefully should be cut by half sometime soon); the number of all these strings is stored in filelines.
line[i] is basically where the string itself is stored.
Currently, because of the comparison against these million strings, the function generate_string(&buffer) is executed only around 42 times per second.
Is there a faster way to do string comparison in C?

strcmp is usually optimized by all vendors. However, if you're not satisfied with this you can try:
Look up burst tries.
Use a suffix tree for fast string comparison -- see this article.
Depending on the size of the strings in your application you can write a custom string comparator. E.g.: GNU libc used to have this optimization for small strings where it tested strings shorter than five bytes as integers. MS cl also has some optimizations for small strings (do look it up).
But more importantly make sure strcmp is your real bottleneck.
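For illustration, here is a minimal sketch of the small-string-as-integer idea mentioned above (my own sketch, assuming both strings live in buffers of at least 4 bytes, are NUL-padded, and that only equality matters, not ordering):

#include <stdint.h>
#include <string.h>

/* Sketch: compare two tiny strings as one 32-bit load each.
 * Assumes both buffers are at least 4 bytes and NUL-padded. */
static int tiny_str_eq(const char *a, const char *b)
{
    uint32_t wa, wb;
    memcpy(&wa, a, sizeof wa);   /* memcpy avoids alignment/aliasing issues */
    memcpy(&wb, b, sizeof wb);
    return wa == wb;
}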

I can assure you, the function strcmp is ABSOLUTELY NOT the bottleneck. Typically, strcmp is well optimized and can do 32 or 64 bit comparisons for strings longer than 4/8 bytes depending on architecture. Both newlib and GNU libc do this. But even if you were to look at each byte in both strings 20 times, it doesn't matter as much as the algo & data structure choices made here.
The real bottleneck is the O(N) search algorithm. A single O(N log N) pass over the file could be used to build an appropriate data structure (whether it's a normal BST, a trie, or just a simple sorted array) for doing O(log N) lookups.
Bear with me here--a lot of math follows. But I think this is a good opportunity to illustrate why choice of algorithm & data structure are sometimes FAR more important than method of string comparison. Steve touches on this, but I wanted to explain it in a little more depth.
With N=1e6, log(1e6, 2) = 19.9, so round up to 20 comparisons on an ideal data structure.
Currently you're doing a worst-case search of O(N), or 1e6 operations.
So say you just build a red-black tree with O(log N) insertion time, and you insert N items, that's O(N log N) time to build the tree. So that's 1e6 x 20 or 20e6 operations necessary to build your tree.
In your current approach, building the data structure is O(N), or 1e6 operations, but your worst case search time is O(N) as well. So by the time you read the file and do just 20 search operations, you're up to a theoretical worst case of 21,000,000 operations. By comparison, your worst case with a red-black tree and 20 searches is 20,000,400 operations, or 999,600 operations BETTER than the O(N) search on an unsorted array. So at 20 searches, you're at the first point where a more sophisticated data structure really pays off. But look at what happens at 1000 searches:
Unsorted array = initialization + 1000 x search time = O(N) + 1000 * O(N) = 1,000,000 + 2,000,000,000 = 2,001,000,000 operations.
Red-black = initialization + 1000 x search time = O(N log N) + 1000 * O(log N) = 20,000,000 + 20,000 = 20,020,000 operations.
2,001,000,000 / 20,020,000 ~= 100x as many operations for the O(N) search.
At 1e6 searches, that's (1e6 + 1e6 * 1e6) / (20e6 + 1e6 * 20 ) = 25,000x as many operations.
Assume your computer can handle the 40e6 'operations' it takes to do the log N searches in 1 minute. It would take 25,000 minutes, or 17 DAYS, to do the same work with your current algorithm. Or another way to look at it is that the O(N) search algorithm can only handle 39 searches in the time the O(log N) algorithm can do 1,000,000. And the more searches you do, the uglier it gets.
See responses from Steve and dirkgently for several better choices of data structures & algorithms. My only additional caution would be that qsort() suggested by Steve might have a worst-case complexity of O(N*N), which is far, far, worse than the O(N log N) you get with a heapsort or various tree-like structures.

Optimization of Computer Programs in C
You can save a little time by checking the first characters of the strings in question before doing the call. Obviously, if the first characters differ, there's no reason to call strcmp to check the rest. Because of the non-uniform distribution of letters in natural languages, the payoff is not 26:1 but more like 15:1 for uppercase data.
#define QUICKIE_STRCMP(a, b)  (*(a) != *(b) ?                 \
                               (int) ((unsigned char) *(a) -  \
                                      (unsigned char) *(b)) : \
                               strcmp((a), (b)))
If the dictionary of words you are using is well defined (meaning you don't care about the return value of strcmp beyond 0 == equal), for example a set of command-line arguments that all begin with the same prefix, e.g. tcp-accept, tcp-reject, then you can rewrite the macro and do some pointer arithmetic to compare not the 1st but the Nth character, in this case the 4th, e.g.:
#define QUICKIE_STRCMP(a, b, offset) \
    (*((a)+(offset)) != *((b)+(offset)) ? -1 : strcmp((a), (b)))
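A possible usage, where cmd and get_next_command() are hypothetical stand-ins for however the input string arrives:

const char *cmd = get_next_command();          /* hypothetical source of input */
if (QUICKIE_STRCMP(cmd, "tcp-accept", 4) == 0) {
    /* handle tcp-accept */
}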

If I get your question correctly, you need to check whether a string is among all the lines read so far. I would propose using a trie, or even better a Patricia tree, built from the file lines. This way, instead of going over all the lines, you can check in time proportional to the length of your string whether it is present (and, with a little more effort, where).

You can try something 'cheap' like screening based on the first char. If the first chars don't match, the strings cannot be equal. If they match, then call strcmp to compare the entire string. You may wish to consider a better algorithm if that is appropriate for your situation; examples would be sorting the file/lines and doing a binary search, using a hash table, or similar string table techniques.

You're already compiling with optimization, right?
If you have a Trie or hashtable data structure lying around the place, ready to use, then you should.
Failing that, a fairly easy change that will probably speed things up is to sort your array line once, before you start generating strings to search for. Then binary search for buffer in the sorted array. It's easy because the two functions you need are standard -- qsort and bsearch.
A binary search into a sorted array only needs to do about log2(filelines) string comparisons, instead of about filelines. So in your case that's 20-something string comparisons per call to generate_string instead of a few million. From the figures you've given, I think you can reasonably expect it to go 20-25 times faster, although I promise nothing.
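For illustration, a minimal sketch of that approach (my own sketch, assuming line is declared as char *line[] and filelines holds its length):

#include <stdlib.h>
#include <string.h>

/* qsort/bsearch hand the comparator pointers to the array elements, and the
 * elements here are themselves char pointers, hence the double indirection. */
static int cmp_str_ptr(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Sort once after loading the file: qsort(line, filelines, sizeof line[0], cmp_str_ptr);
 * then each generated string is a single O(log N) lookup: */
static int line_exists(const char *buffer, char *line[], size_t filelines)
{
    const char *key = buffer;
    return bsearch(&key, line, filelines, sizeof line[0], cmp_str_ptr) != NULL;
}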

You can use a byte-wise comparator macro instead of strcmp() to achieve a very fast string comparison (of standard 8-bit char) if you know the string length beforehand. I benchmarked the byte-wise comparator macro against glibc's strcmp(), and the macro version significantly outperformed the strcmp() implementation; it takes advantage of the CPU's vector processor.
Example:
#include <stdbool.h>

#define str3_cmp(x, y0, y1, y2, y3) ((x)[0] == (y0) && (x)[1] == (y1) && (x)[2] == (y2) && (x)[3] == (y3))

static inline bool str3_cmp_helper(const char *x, const char *y) {
    return str3_cmp(x, *y, *(y + 1), *(y + 2), *(y + 3));
}
const char *i = "hola"; // dynamically generated (eg: received over a network)

if (str3_cmp_helper(i, "hola")) {
    /* do something */
} else {
    /* do something else */
}
However, writing such a macro is tiresome, so I have included a PHP script to generate the macro. This script takes two arguments, (1) the string length to be compared (this argument is variadic so write as many macros as you want), and (2) the output filename.
#!/usr/bin/php
<?php
function generate_macro($num) : string {
    $returner = "#define str".$num."cmp_macro(ptr, ";
    for($x = 0; $x < $num; $x++){
        $returner .= "c".$x;
        if($x != $num-1){ $returner .= ", "; }
    }
    $returner .= ") ";
    for($x = 0; $x < $num; $x++){
        $returner .= "*(ptr+".$x.") == c".$x;
        if($x != $num-1){ $returner .= " && "; }
    }
    return $returner;
}

function generate_static_inline_fn(&$generated_macro, $num) : string {
    $generated_macro .= "static inline bool str".$num."cmp(const char* ptr, const char* cmp)".
        "{\n\t\treturn str".$num."cmp_macro(ptr, ";
    for($x = 0; $x < $num; $x++){
        $generated_macro .= " *(cmp+".$x.")";
        if($x != $num-1){ $generated_macro .= ", "; }
    }
    $generated_macro .= ");\n}\n";
    return $generated_macro;
}

function handle_generation($argc, $argv) : void {
    $out_filename = $argv[$argc-1];
    $gen_macro = "";
    for($x = 0; $x < $argc-2; $x++){
        $macro = generate_macro($argv[$x+1])."\n";
        $gen_macro .= generate_static_inline_fn($macro, $argv[$x+1]);
    }
    file_put_contents($out_filename, $gen_macro);
}

handle_generation($argc, $argv);
?>
Script example: $ ./gen_faststrcmp.php 3 5 fast_strcmp.h.
This generates fast_strcmp.h with macros for comparing strings of length 3 and 5:
#define str3cmp_macro(ptr, c0, c1, c2) *(ptr+0) == c0 && *(ptr+1) == c1 && *(ptr+2) == c2
static inline bool str3cmp(const char* ptr, const char* cmp){
    return str3cmp_macro(ptr, *(cmp+0), *(cmp+1), *(cmp+2));
}

#define str5cmp_macro(ptr, c0, c1, c2, c3, c4) *(ptr+0) == c0 && *(ptr+1) == c1 && *(ptr+2) == c2 && *(ptr+3) == c3 && *(ptr+4) == c4
static inline bool str5cmp(const char* ptr, const char* cmp){
    return str5cmp_macro(ptr, *(cmp+0), *(cmp+1), *(cmp+2), *(cmp+3), *(cmp+4));
}
You can use the macro like so:
const char* compare_me = "Hello";
if(str5cmp(compare_me, "Hello")) { /* code goes here */ }

I don't know that there's a faster way than calling strcmp to do string comparisons, but you can perhaps avoid calling strcmp so much. Use a hash table to store your strings and then you can check whether the string in buffer is in the hash table. If the index of a hit is important when you "do something", the table can map strings to indexes.
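On POSIX systems, one ready-made option is hcreate/hsearch from <search.h>. A minimal sketch, reusing line and filelines from the question (storing the index as the mapped value is just an illustration):

#include <search.h>   /* POSIX hcreate/hsearch */
#include <stdint.h>
#include <stddef.h>

/* Load all lines into the (single, global) POSIX hash table once. */
void build_table(char *line[], size_t filelines)
{
    hcreate(filelines * 2);                 /* headroom keeps the collision rate down */
    for (size_t i = 0; i < filelines; i++) {
        ENTRY e = { .key = line[i], .data = (void *)(uintptr_t)i };
        hsearch(e, ENTER);                  /* keys are not copied; line[] must stay alive */
    }
}

/* Each generated string becomes one average-case O(1) lookup. */
int lookup_index(const char *buffer)
{
    ENTRY query = { .key = (char *)buffer, .data = NULL };
    ENTRY *hit = hsearch(query, FIND);
    return hit ? (int)(uintptr_t)hit->data : -1;   /* index, or -1 if absent */
}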

You may be able to get by with a binary comparison in this case, because your program does not actually sort, it only compares for equality.
You can also improve comparison speed by determining the lengths in advance (provided, of course, they vary enough). When the lengths do not match, the "do something" cannot happen; a sketch of this is shown below.
Of course, hashing would be another consideration here, depending on how many times you read the hashed value.
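A minimal sketch of the length pre-check, assuming a hypothetical lengths[] array filled once while loading the file (lengths[i] = strlen(line[i])):

size_t buflen = strlen(buffer);
for (int i = 0; i < filelines; i++) {
    /* cheap integer test first; strcmp only runs when the lengths agree */
    if (lengths[i] == buflen && strcmp(buffer, line[i]) == 0) {
        /* do something */
    }
}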

It depends on the length of the string.
If it's not too long, you can try to compare byte by byte:
str[0] == str2[0] && str[1] == str2[1] && str[2] == str2[2]
Otherwise, use memcmp(), it compares chunks of memory.

Use strcmp for regular strings. But if the string is really long you can use memcmp. It will compare chunks of memory.

Related

Fastest algorithm to figure out if an array has at least one duplicate

I have a quite peculiar case here. I have a file containing several million entries and want to find out if there exists at least one duplicate. The language here isn't of great importance, but C seems like a reasonable choice for speed. Now, what I want to know is what kind of approach to take to this? Speed is the primary goal here. Naturally, we want to stop looking as soon as one duplicate is found, that's clear, but when the data comes in, I don't know anything about how it's sorted. I just know it's a file of strings, separated by newline. Now keep in mind, all I want to find out is if a duplicate exists. Now, I have found a lot of SO questions regarding finding all duplicates in an array, but most of them go the easy and comprehensive way, rather than the fastest.
Hence, I'm wondering: what is the fastest way to find out if an array contains at least one duplicate? So far, the closest I've been able to find on SO is this: Finding out the duplicate element in an array. The language chosen isn't important, but since it is, after all, programming, multi-threading would be a possibility (I'm just not sure if that's a feasible way to go about it).
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Please note that this is not strictly theoretical. It will be tested on a machine (Intel i7 with 8GB RAM), so I do have to take into consideration the time of making a string comparison etc. Which is why I'm also wondering if it could be faster to split the strings in two, and first compare the integer part, as an int comparison will be quicker, and then the string part? Of course, that will also require me to split the string and cast the second half to an int, which might be slower...
Finally, the strings have a format of XXXNNN (3 characters and 3 integers).
Knowing your key domain is essential to this sort of problem, so this allows us to massively simplify the solution (and this answer).
If X ∈ {A..Z} and N ∈ {0..9}, that gives 26^3 * 10^3 = 17,576,000 possible values ... a bitset (essentially a trivial, perfect Bloom filter with no false positives) would take ~2 MB for this.
Here you go: a python script to generate all possible 17 million keys:
import itertools
from string import ascii_uppercase

for prefix in itertools.product(ascii_uppercase, repeat=3):
    for numeric in range(1000):
        print "%s%03d" % (''.join(prefix), numeric)
and a simple C bitset filter:
#include <limits.h>

/* convert number of bits into number of bytes */
int filterByteSize(int max) {
    return (max + CHAR_BIT - 1) / CHAR_BIT;
}

/* set bit #value in the filter, returning non-zero if it was already set */
int filterTestAndSet(unsigned char *filter, int value) {
    int byteIndex = value / CHAR_BIT;
    unsigned char mask = 1 << (value % CHAR_BIT);

    unsigned char byte = filter[byteIndex];
    filter[byteIndex] = byte | mask;

    return byte & mask;
}
which for your purposes you'd use like so:
#include <stdlib.h>

/* allocate filter suitable for this question */
unsigned char *allocMyFilter() {
    int maxKey = 26 * 26 * 26 * 10 * 10 * 10;
    return calloc(filterByteSize(maxKey), 1);
}

/* key conversion - yes, it's horrible */
int testAndSetMyKey(unsigned char *filter, char *s) {
    int alpha   = s[0]-'A' + 26*(s[1]-'A' + 26*(s[2]-'A'));
    int numeric = s[3]-'0' + 10*(s[4]-'0' + 10*(s[5]-'0'));
    int key = numeric + 1000 * alpha;
    return filterTestAndSet(filter, key);
}
#include <stdio.h>

int main() {
    unsigned char *filter = allocMyFilter();
    char key[8]; /* 6 chars + newline + nul */

    while (fgets(key, sizeof(key), stdin)) {
        if (testAndSetMyKey(filter, key)) {
            printf("collision: %s\n", key);
            return 1;
        }
    }
    return 0;
}
This is linear, although there's obviously scope to optimise the key conversion and file input. Anyway, sample run:
useless:~/Source/40044744 $ python filter_test.py > filter_ok.txt
useless:~/Source/40044744 $ time ./filter < filter_ok.txt
real 0m0.474s
user 0m0.436s
sys 0m0.036s
useless:~/Source/40044744 $ cat filter_ok.txt filter_ok.txt > filter_fail.txt
useless:~/Source/40044744 $ time ./filter < filter_fail.txt
collision: AAA000
real 0m0.467s
user 0m0.452s
sys 0m0.016s
admittedly the input file is cached in memory for these runs.
The reasonable answer is to keep the algorithm with the smallest complexity. I encourage you to use a hash table to keep track of inserted elements; the final algorithm complexity is O(n), because a search in a hash table is O(1) in theory. In your case I suggest running the algorithm while reading the file.
public static bool ThereAreDuplicates(string[] inputs)
{
    var hashTable = new Hashtable();
    foreach (var input in inputs)
    {
        if (hashTable[input] != null)
            return true;

        hashTable.Add(input, string.Empty);
    }
    return false;
}
A fast but memory-inefficient solution would use
// Entries are AAA####
char found[(size_t)36*36*36*36*36*36 /* 2,176,782,336 */] = { 0 };  // or calloc() this
char buffer[100];
while (fgets(buffer, sizeof buffer, istream)) {
    unsigned long index = strtoul(buffer, NULL, 36);
    if (found[index]++) {
        Dupe_found();
        break;
    }
}
The trouble with the post is that it wants the "fastest algorithm", but does not detail memory concerns and their relative importance to speed. So speed must be king and the above wastes little time. It does meet the "stop looking as soon as one duplicate is found" requirement.
Depending on how many different things there can be you have some options:
Sort the whole array and then look for a repeating element; complexity O(n log n), but it can be done in place, so memory will be O(1). (A sketch of this option follows the list.)
Build a set of all elements. Depending on the chosen set implementation this can be O(n) (hash set) or O(n log n) (binary tree), but it will cost you some memory.
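A minimal sketch of the first option (after sorting, any duplicate must sit next to its twin); entries and n are assumed to be the loaded string array and its length:

#include <stdlib.h>
#include <string.h>

static int cmp_str_ptr(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

int has_duplicate(char *entries[], size_t n)
{
    qsort(entries, n, sizeof entries[0], cmp_str_ptr);   /* O(n log n), in place */
    for (size_t i = 1; i < n; i++)
        if (strcmp(entries[i - 1], entries[i]) == 0)
            return 1;                                    /* stop at the first duplicate */
    return 0;
}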
The fastest way to find out if an array contains at least one duplicate is to use a bitmap, multiple CPUs and an (atomic or not) "test and set bit" instruction (e.g. lock bts on 80x86).
The general idea is to divide the array into "total elements / number of CPUs" sized pieces and give each piece to a different CPU. Each CPU processes its piece of the array by calculating an integer and doing the atomic "test and set bit" for the bit corresponding to that integer.
However, the problem with this approach is that you're modifying something that all CPUs are using (the bitmap). A better idea is to give each CPU a range of integers (e.g. CPU number N does all integers from "(min - max) * N / CPUs" to "(min - max) * (N+1) / CPUs"). This means that all CPUs read from the entire array, but each CPU only modifies its own private piece of the bitmap. This avoids some performance problems involved with cache coherency protocols ("read for ownership of cache line") and also avoids the need for atomic instructions.
The next step beyond that is to look at how you're converting your "3 characters and 3 digits" strings into an integer. Ideally, this can/would be done using SIMD; which would require that the array is in "structure of arrays" format (and not the more likely "array of structures" format). Also note that you can convert the strings to integers first (in an "each CPU does a subset of the strings" way) to avoid the need for each CPU to convert each string and pack more into each cache line.
Since you have several million entries I think the best algorithm would be counting sort. Counting sort does exactly what you asked: it sorts an array by counting how many times every element occurs. So you could write a function that applies the counting-sort idea to the array:
#include <stdlib.h>

int counting_sort(int a[], int n, int max)
{
    int i;
    int *count = calloc(max + 1, sizeof *count);  /* zero-initialized; a VLA cannot be initialized with = {0} */

    for (i = 0; i < n; ++i) {
        count[a[i]]++;
        if (count[a[i]] >= 2) {
            free(count);
            return 1;   /* duplicate found */
        }
    }
    free(count);
    return 0;
}
You should first find the max element M (in O(n)). The asymptotic time complexity of counting sort is O(max(n, M)), where M is the maximum value found in the array. Because you have several million entries, if M is on the order of a few million this works in O(n) (the check itself can stop even earlier, but finding M already costs O(n)). If you also know that there is no way M can be greater than a few million, then you can be sure this gives O(n) and not just O(max(n, M)).
You can see counting sort visualization to understand it better, here:
https://www.cs.usfca.edu/~galles/visualization/CountingSort.html
Note that in the above function we don't implement counting sort exactly; we stop as soon as we find a duplicate, which is even more efficient, since you only want to know whether a duplicate exists.

Find longest suffix of string in given array

Given a string and array of strings find the longest suffix of string in array.
for example
string = google.com.tr
array = tr, nic.tr, gov.nic.tr, org.tr, com.tr
returns com.tr
I have tried to use binary search with specific comparator, but failed.
C-code would be welcome.
Edit:
I should have said that I'm looking for a solution where I can do as much work as possible in a preparation step (when I only have the array of suffixes, and I can sort it in every way possible, build any data structure around it, etc.), and then, for a given string, find its suffix in this array as fast as possible. Also, I know that I can build a trie out of this array, and this will probably give me the best performance possible, BUT I'm very lazy and keeping a trie in raw C in a huge piece of tangled enterprise code is no fun at all. So some binsearch-like approach would be very welcome.
Assuming constant time addressing of characters within strings this problem is isomorphic to finding the largest prefix.
Let i = 0.
Let S = null
Let c = prefix[i]
Remove each string a from A if a[i] != c. Replace S with a if a.Length == i + 1.
Increment i.
Go to step 3.
Is that what you're looking for?
Example:
prefix = rt.moc.elgoog
array = rt.moc, rt.org, rt.cin.vof, rt.cin, rt
Pass 0: prefix[0] is 'r' and array[j][0] == 'r' for all j so nothing is removed from the array. i + 1 -> 0 + 1 -> 1 is our target length, but none of the strings have a length of 1, so S remains null.
Pass 1: prefix[1] is 't' and array[j][1] == 't' for all j so nothing is removed from the array. However, there is a string that has length 2, so S becomes rt.
Pass 2: prefix[2] is '.' and array[j][2] == '.' for the remaining strings so nothing changes.
Pass 3: prefix[3] is 'm' and array[j][3] != 'm' for rt.org, rt.cin.vof, and rt.cin so those strings are removed.
etc.
Another naïve, pseudo-answer.
Set boolean "found" to false. While "found" is false, iterate over the array comparing the source string to the strings in the array. If there's a match, set "found" to true and break. If there's no match, use something like strchr() to get to the segment of the string following the first period. Iterate over the array again. Continue until there's a match, or until the last segment of the source string has been compared to all the strings in the array and failed to match.
Not very efficient....
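A hedged sketch of that segment-by-segment idea, combined with bsearch over a pre-sorted array (function and variable names are mine, not from the question):

#include <stdlib.h>
#include <string.h>

static int cmp_str_ptr(const void *a, const void *b)
{
    return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* suffixes[] must already be sorted: qsort(suffixes, n, sizeof suffixes[0], cmp_str_ptr).
 * Walks the source string one dot-separated segment at a time; the first hit is
 * the longest suffix present in the array. Returns NULL if none matches. */
const char *longest_suffix(const char *s, char *suffixes[], size_t n)
{
    while (s) {
        const char *key = s;
        char **hit = bsearch(&key, suffixes, n, sizeof suffixes[0], cmp_str_ptr);
        if (hit)
            return *hit;
        s = strchr(s, '.');
        if (s)
            s++;            /* skip past the '.' to the next, shorter suffix */
    }
    return NULL;
}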
Naive, pseudo-answer:
Sort array of suffixes by length (yes, there may be strings of same length, which is a problem with the question you are asking I think)
Iterate over array and see if suffix is in given string
If it is, exit the loop because you are done! If not, continue.
Alternatively, you could skip the sorting and just iterate, assigning the biggestString if the currentString is bigger than the biggestString that has matched.
Edit 0:
Maybe you could improve this by looking at your array before hand and considering "minimal" elements that need to be checked.
For instance, if .com appears in 20 members you could just check .com against the given string to potentially eliminate 20 candidates.
Edit 1:
On second thought, in order to compare elements in the array you will need to use a string comparison. My feeling is that any gain you get out of an attempt at optimizing the list of strings for comparison might be negated by the expense of comparing them before doing so, if that makes sense. Would appreciate if a CS type could correct me here...
If your array of strings is something along the following:
char string[STRINGS][MAX_STRING_LENGTH];
strcpy(string[0], "google.com.tr");
strcpy(string[1], "nic.tr");
etc, then you can simply do this:
int x, max = 0, longest = 0;
for (x = 0; x < STRINGS; x++) {
    if ((int)strlen(string[x]) > max) {
        max = (int)strlen(string[x]);
        longest = x;                     /* remember which string is longest, not just how long */
    }
}

x = 0;
while (string[longest][x] != '.') {      /* find the first '.' */
    x++;
}

char output[MAX_STRING_LENGTH];
int y = 0;
while (string[longest][x] != '\0') {     /* copy everything after the first '.' */
    output[y++] = string[longest][++x];
}
(The above code may not actually work (errors, etc.), but you should get the general idea.)
Why don't you use suffix arrays? They work well when you have a large number of suffixes.
Construction complexity is O(n (log n)^2); there are O(n log n) versions too.
There is an implementation in C here. You can also try googling suffix arrays.

Simple hash functions

I'm trying to write a C program that uses a hash table to store different words and I could use some help.
Firstly, I create a hash table with the size of a prime number which is closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
I started with the simplest function, adding the letters together, which ended up with 88% collision.
Then I started experimenting with the function and found out that whatever I change it to, the collisions don't get lower than 35%.
Right now I'm using
unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int counter, hashAddress = 0;
    for (counter = 0; word[counter] != '\0'; counter++){
        hashAddress = hashAddress*word[counter] + word[counter] + counter;
    }
    return (hashAddress % hashTableSize);
}
which is just a random function that I came up with, but it gives me the best results - around 35% collision.
I've been reading articles on hash functions for the past few hours and I tried to use a few simple ones, such as djb2, but all of them gave me even worse results. (djb2 resulted in 37% collisions, which isn't much worse, but I was expecting something better rather than worse.)
I also don't know how to use some of the other, more complex ones, such as the murmur2, because I don't know what the parameters (key, len, seed) they take in are.
Is it normal to get more than 35% collisions, even with using the djb2, or am I doing something wrong?
What are the key, len and seed values?
Try sdbm:
hashAddress = 0;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = word[counter] + (hashAddress << 6) + (hashAddress << 16) - hashAddress;
}
Or djb2:
hashAddress = 5381;
for (counter = 0; word[counter] != '\0'; counter++){
    hashAddress = ((hashAddress << 5) + hashAddress) + word[counter];
}
Or Adler32:
uint32_t adler32(const void *buf, size_t buflength) {
    const uint8_t *buffer = (const uint8_t*)buf;

    uint32_t s1 = 1;
    uint32_t s2 = 0;

    for (size_t n = 0; n < buflength; n++) {
        s1 = (s1 + buffer[n]) % 65521;
        s2 = (s2 + s1) % 65521;
    }
    return (s2 << 16) | s1;
}

// ...

hashAddress = adler32(word, strlen(word));
None of these are really great, though. If you really want good hashes, you need something more complex like lookup3, murmur3, or CityHash for example.
Note that a hashtable is expected to have plenty of collisions as soon as it is filled by more than 70-80%. This is perfectly normal and will even happen if you use a very good hash algorithm. That's why most hashtable implementations increase the capacity of the hashtable (e.g. capacity * 1.5 or even capacity * 2) as soon as you are adding something to the hashtable and the ratio size / capacity is already above 0.7 to 0.8. Increasing the capacity means a new hashtable is created with a higher capacity, all values from the current one are added to the new one (therefore they must all be rehashed, as their new index will be different in most cases), the new hashtable array replaces the old one and the old one is released/freed. If you plan on hashing 1000 words, a hashtable capacity of at least 1250 is recommended, better 1400 or even 1500.
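As a minimal sketch of that growth rule (the field and function names here are hypothetical, not from any particular library):

/* Sketch: grow once the load factor crosses ~0.75.
 * table->size, table->capacity and rehash() are hypothetical names. */
if ((double)table->size / table->capacity > 0.75) {
    rehash(table, table->capacity * 2);   /* new, bigger array; every key is re-hashed and re-inserted */
}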
Hashtables are not supposed to be "filled to brim", at least not if they shall be fast and efficient (thus they always should have spare capacity). That's the downside of hashtables, they are fast (O(1)), yet they will usually waste more space than would be necessary for storing the same data in another structure (when you store them as a sorted array, you will only need a capacity of 1000 for 1000 words; the downside is that the lookup cannot be faster than O(log n) in that case). A collision free hashtable is not possible in most cases either way. Pretty much all hashtable implementations expect collisions to happen and usually have some kind of way to deal with them (usually collisions make the lookup somewhat slower, but the hashtable will still work and still beat other data structures in many cases).
Also note that if you are using a pretty good hash function, there is no requirement, nor even an advantage, for the hashtable to have a power-of-2 capacity if you are cropping hash values using modulo (%) in the end. The reason why many hashtable implementations always use power-of-2 capacities is that they do not use modulo; instead they use AND (&) for cropping, because an AND operation is among the fastest operations you will find on most CPUs (modulo is never faster than AND; in the best case it would be equally fast, in most cases it is a lot slower). If your hashtable uses power-of-2 sizes, you can replace any modulo with an AND operation:
x % 4 == x & 3
x % 8 == x & 7
x % 16 == x & 15
x % 32 == x & 31
...
This only works for power-of-2 sizes, though. If you use modulo, power-of-2 sizes can only buy you something if the hash is a very bad hash with a very bad "bit distribution". A bad bit distribution is usually caused by hashes that do not use any kind of bit shifting (>> or <<) or any other operations that would have a similar effect as bit shifting.
I created a stripped down lookup3 implementation for you:
#include <stdint.h>
#include <stdlib.h>

#define rot(x,k) (((x)<<(k)) | ((x)>>(32-(k))))

#define mix(a,b,c) \
{ \
  a -= c;  a ^= rot(c, 4);  c += b; \
  b -= a;  b ^= rot(a, 6);  a += c; \
  c -= b;  c ^= rot(b, 8);  b += a; \
  a -= c;  a ^= rot(c,16);  c += b; \
  b -= a;  b ^= rot(a,19);  a += c; \
  c -= b;  c ^= rot(b, 4);  b += a; \
}

#define final(a,b,c) \
{ \
  c ^= b; c -= rot(b,14); \
  a ^= c; a -= rot(c,11); \
  b ^= a; b -= rot(a,25); \
  c ^= b; c -= rot(b,16); \
  a ^= c; a -= rot(c,4);  \
  b ^= a; b -= rot(a,14); \
  c ^= b; c -= rot(b,24); \
}

uint32_t lookup3 (
  const void *key,
  size_t      length,
  uint32_t    initval
) {
  uint32_t  a,b,c;
  const uint8_t  *k;
  const uint32_t *data32Bit;

  data32Bit = key;
  a = b = c = 0xdeadbeef + (((uint32_t)length)<<2) + initval;

  while (length > 12) {
    a += *(data32Bit++);
    b += *(data32Bit++);
    c += *(data32Bit++);
    mix(a,b,c);
    length -= 12;
  }

  k = (const uint8_t *)data32Bit;
  switch (length) {
    case 12: c += ((uint32_t)k[11])<<24;
    case 11: c += ((uint32_t)k[10])<<16;
    case 10: c += ((uint32_t)k[9])<<8;
    case 9 : c += k[8];
    case 8 : b += ((uint32_t)k[7])<<24;
    case 7 : b += ((uint32_t)k[6])<<16;
    case 6 : b += ((uint32_t)k[5])<<8;
    case 5 : b += k[4];
    case 4 : a += ((uint32_t)k[3])<<24;
    case 3 : a += ((uint32_t)k[2])<<16;
    case 2 : a += ((uint32_t)k[1])<<8;
    case 1 : a += k[0];
             break;
    case 0 : return c;
  }
  final(a,b,c);
  return c;
}
This code is not as highly optimized for performance as the original code, therefore it is a lot simpler. It is also not as portable as the original code, but it is portable to all major consumer platforms in use today. It also completely ignores CPU endianness, yet that is not really an issue: it will work on big and little endian CPUs. Just keep in mind that it will not calculate the same hash for the same data on big and little endian CPUs, but that is no requirement; it will calculate a good hash on both kinds of CPUs, and it is only important that it always calculates the same hash for the same input data on a single machine.
You would use this function as follows:
unsigned int stringToHash(char *word, unsigned int hashTableSize){
    unsigned int initval;
    unsigned int hashAddress;

    initval = 12345;
    hashAddress = lookup3(word, strlen(word), initval);
    return (hashAddress % hashTableSize);
    // If the hashtable is guaranteed to always have a size that is a power of 2,
    // replace the line above with the following more effective line:
    //     return (hashAddress & (hashTableSize - 1));
}
You may wonder what initval is. Well, it is whatever you want it to be. You could call it a salt. It will influence the hash values, yet the hash values will not get better or worse in quality because of this (at least not in the average case; it may lead to more or fewer collisions for very specific data, though). E.g. you can use different initval values if you want to hash the same data twice yet each time produce a different hash value (there is no guarantee it will, but it is rather likely if initval is different; if it creates the same value, this would be a very unlucky coincidence that you must treat as a kind of collision). It is not advisable to use different initval values when hashing data for the same hashtable (this will rather cause more collisions on average). Another use for initval is if you want to combine a hash with some other data, in which case the already existing hash becomes initval when hashing the other data (so both the other data and the previous hash influence the outcome of the hash function). You may even set initval to 0 if you like, or pick a random value when the hashtable is created (and always use this random value for this instance of hashtable, yet each hashtable has its own random value).
A note on collisions:
Collisions are usually not such a huge problem in practice, it usually does not pay off to waste tons of memory just to avoid them. The question is rather how you are going to deal with them in an efficient way.
You said you are currently dealing with 9000 words. If you were using an unsorted array, finding a word in the array will need 4500 comparisons on average. On my system, 4500 string comparisons (assuming that words are between 3 and 20 characters long) need 38 microseconds (0.000038 seconds). So even such a simple, inefficient algorithm is fast enough for most purposes. Assuming that you are sorting the word list and use a binary search, finding a word in the array will need only 13 comparisons on average. 13 comparisons are close to nothing in terms of time, it's too little to even benchmark reliably. So if finding a word in a hashtable needs 2 to 4 comparisons, I wouldn't even waste a single second on the question whether that may be a huge performance problem.
In your case, a sorted list with binary search may even beat a hashtable by far. Sure, 13 comparisons need more time than 2-4 comparisons, however, in case of a hashtable you must first hash the input data to perform a lookup. Hashing alone may already take longer than 13 comparisons! The better the hash, the longer it will take for the same amount of data to be hashed. So a hashtable only pays off performance-wise if you have a really huge amount of data or if you must update the data frequently (e.g. constantly adding/removing words to/from the table, since these operations are less costly for a hashtable than they are for a sorted list). The fact that a hashtable is O(1) only means that regardless how big it is, a lookup will approximately always need the same amount of time. O(log n) only means that the lookup grows logarithmically with the number of words; that means more words, slower lookup. Yet the Big-O notation says nothing about absolute speed! This is a big misunderstanding. It is not said that an O(1) algorithm always performs faster than an O(log n) one. The Big-O notation only tells you that if the O(log n) algorithm is faster for a certain number of values and you keep increasing the number of values, the O(1) algorithm will certainly overtake the O(log n) algorithm at some point of time, but your current word count may be far below that point. Without benchmarking both approaches, you cannot say which one is faster by just looking at the Big-O notation.
Back to collisions. What should you do if you run into a collision? If the number of collisions is small, and here I don't mean the overall number of collisions (the number of words that are colliding in the hashtable) but the per index one (the number of words stored at the same hashtable index, so in your case maybe 2-4), the simplest approach is to store them as a linked list. If there was no collision so far for this table index, there is just a single key/value pair. If there was a collision, there is a linked list of key/value pairs. In that case your code must iterate over the linked list and verify each of the keys and return the value if it matches. Going by your numbers, this linked list won't have more than 4 entries and doing 4 comparisons is insignificant in terms of performance. So finding the index is O(1), finding the value (or detecting that this key is not in the table) is O(n), but here n is only the number of linked list entries (so it is 4 at most).
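A minimal sketch of that chaining layout (TABLE_SIZE and the field names are hypothetical):

#define TABLE_SIZE 16384   /* hypothetical capacity */

/* Each slot of the table is the head of a (usually very short) linked list
 * of the key/value pairs that hashed to that index. */
struct entry {
    char         *key;
    void         *value;
    struct entry *next;    /* next pair with the same hash index, or NULL */
};

struct entry *buckets[TABLE_SIZE];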
If the number of collisions rises, a linked list can become too slow, and you may instead store a dynamically sized, sorted array of key/value pairs, which allows lookups of O(log n); and again, n is only the number of keys in that array, not of all keys in the hashtable. Even if there were 100 collisions at one index, finding the right key/value pair takes at most 7 comparisons. That's still close to nothing. Despite the fact that if you really have 100 collisions at one index, either your hash algorithm is unsuited for your key data or the hashtable is far too small in capacity. The disadvantage of a dynamically sized, sorted array is that adding/removing keys is somewhat more work than in case of a linked list (code-wise, not necessarily performance-wise). So using a linked list is usually sufficient if you keep the number of collisions low enough, and it is almost trivial to implement such a linked list yourself in C and add it to an existing hashtable implementation.
Most hashtable implementations I have seen use such a "fallback to an alternate data structure" to deal with collisions. The disadvantage is that these require a little bit of extra memory to store the alternative data structure and a bit more code to also search for keys in that structure. There are also solutions that store collisions inside the hashtable itself and that don't require any additional memory. However, these solutions have a couple of drawbacks. The first drawback is that every collision increases the chances for even more collisions as more data is added. The second drawback is that while lookup times for keys degrade linearly with the number of collisions so far (and as I said before, every collision leads to even more collisions as data is added), lookup times for keys not in the hashtable degrade even worse, and in the end, if you perform a lookup for a key that is not in the hashtable (yet you cannot know without performing the lookup), the lookup may take as long as a linear search over the whole hashtable (YUCK!!!). So if you can spare the extra memory, go for an alternate structure to handle collisions.
Firstly, I create a hash table with the size of a prime number which is closest to the number of the words I have to store, and then I use a hash function to find an address for each word.
...
return (hashAddress%hashTableSize);
Since the number of different hashes is comparable to the number of words you cannot expect to have much lower collisions.
I made a simple statistical test with a random hash (which is the best you could achieve) and found that 26% is the limiting collision rate if you have #words == #different hashes.

C - Returning the most repeated/occurring string in an array of char pointers

I have almost completed the code for this problem, which I shall state as under:
Given:
Array of length 'n' (say n = 10000) declared as below,
char **records = malloc(10000*sizeof(*records));
Each record[i] is a char pointer and points to a non-empty string.
records[i] = malloc(11);
The strings are of fixed length (10 chars + '\0').
Requirement:
Return the most frequently occurring string in the above array.
But now, I am interested in obtaining a slightly less brutal algorithm than the primitive one which I have currently, which is to sift through the entire array in two for loops :(, storing strings encountered by the two loops in a temporary array of similar size ('n' - in case all are unique strings) for comparison with the next strings. The inner loop iterates from 'outer loop position + 1' to 'n'. At the same time, I have an integer array, of similar size - 'n', for counting repeat occurrences, with each i th element corresponding to the i th (unique) string in the comparison array. Then find the largest integer and use its index in the comparison array to return the most frequently occurring string.
I hope I am clear enough. I am quite ashamed of the algo myself, but it had to be done. I am sure there is a much smarter way to do this in C.
Have a great Sunday,
Cheers!
Without being good at nice algorithms (Google, Wikipedia and Stackoverflow are good enough for me), one solution that comes out at the top of my head is to sort the array, then use a single loop to go through the entries. As long as the current string is the same as the previous, increase a counter for that string. When done you have a "list" of strings and their occurrence, which can then be sorted if needed.
In most languages, the usual approach would be to construct a hashtable, mapping strings to counts. This has O(N) complexity.
For example, in Python (although usually you would use collections.Counter for this, and even this code can be made more concise using more specialised Python knowledge, but I've made it explicit for demonstration).
def most_common(strings):
    counts = {}
    for s in strings:
        if s not in counts:
            counts[s] = 0
        counts[s] += 1
    return max(counts, key=counts.get)
But in C, you don't have a hashtable in the standard library (although in C++ you can use hash_map from the STL), so a sort and scan can be done instead. It's O(N.log(N)) complexity, which is worse than optimal, but quite practical.
Here's some C (actually C99) code that implements this.
// Comparator for qsort: the array elements are char pointers, so qsort hands
// us pointers to those pointers.
int compare_string_ptrs(const void *p0, const void *p1) {
    return strcmp(*(const char * const *)p0, *(const char * const *)p1);
}

const char *most_common(const char **records, size_t n) {
    qsort(records, n, sizeof(records[0]), compare_string_ptrs);
    const char *best = 0; // The most common string found so far.
    size_t max = 0;       // The longest run found.
    size_t run = 0;       // The length of the current run.
    for (size_t i = 0; i < n; i++) {
        if (!strcmp(records[i], records[i - run])) {
            run += 1;
        } else {
            run = 1;
        }
        if (run > max) {
            best = records[i];
            max = run;
        }
    }
    return best;
}
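A possible usage with the records array from the question (assuming it holds 10000 entries as described; note that most_common reorders the array):

#include <stdio.h>

/* ... after filling records as described above ... */
const char *winner = most_common((const char **)records, 10000);
printf("most frequent: %s\n", winner ? winner : "(none)");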

Generating all combinations with repeating digits

I've been reading this site long enough to know not to hide that this is a homework assignment. But I am trying to write a code that can generate all possible combinations of a string of only 0's and 1's. If the length of the string is n^2, then there will be n 1's and the rest will be 0's. The length is always a perfect square. I am coding in C and I have been trying to do it in nested loops but it seems like it could be done more easily in a recursive manner, I'm just not sure how to get that set up. Any tips or advice would be greatly appreciated.
pseudocode:
myfun(pos, length, ones)
    if (length == 0)
        pos = '\0'
        # print, collect, whatever...
        return
    if (length > ones)          # still room for a '0'
        pos = '0'
        myfun(pos+1, length-1, ones)
    if (ones > 0)               # still a '1' left to place
        pos = '1'
        myfun(pos+1, length-1, ones-1)

task(n)
    # allocate an n^2 (+1 for the terminator) buffer
    myfun(buffer, n*n, n)
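A hedged C sketch of the pseudocode above, writing into a single buffer in place (function names are mine):

#include <stdio.h>
#include <stdlib.h>

/* Fill buf in place, recursing on the remaining length and the number of 1s
 * still to place; every complete string is printed. */
static void gen(char *buf, char *pos, int length, int ones)
{
    if (length == 0) {
        *pos = '\0';
        puts(buf);                       /* print, collect, whatever... */
        return;
    }
    if (length > ones) {                 /* room left for a '0' */
        *pos = '0';
        gen(buf, pos + 1, length - 1, ones);
    }
    if (ones > 0) {                      /* still have a '1' to place */
        *pos = '1';
        gen(buf, pos + 1, length - 1, ones - 1);
    }
}

void task(int n)
{
    char *buffer = malloc((size_t)n * n + 1);   /* n^2 chars + NUL */
    gen(buffer, buffer, n * n, n);
    free(buffer);
}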
I'm not sure that this problem lends itself to recursion. In C (and most languages), every time you call a function you create a stack frame and use a few processor cycles and a chunk of stack memory. Any recursive solution to this problem will create n^2 stack frames, even though the recursion itself is only adding one bit of information.
A really bad solution is outlined below. What it doesn't do:
Exploit the fact that n is always a perfect square.
Use memory in a very intelligent way
Free any of the memory it uses
Might not even work. ;)
...but it might give you an idea of the basic pattern.
void foo(int zeros_left, int length_left, char *s)
{
    if (length_left == 0)
        printf("%s\n", s);
    else
    {
        if (zeros_left > 0)
        {
            char *next = malloc(strlen(s) + 2);
            strcpy(next, s);
            strcat(next, "0");
            foo(zeros_left - 1, length_left - 1, next);
        }
        if (zeros_left != length_left)
        {
            char *next = malloc(strlen(s) + 2);
            strcpy(next, s);
            strcat(next, "1");
            foo(zeros_left, length_left - 1, next);
        }
    }
}
The key to modelling a problem recursively is to break a larger version of the problem into a simple calculation combined with a smaller version of the same problem, and a trivial case that terminates the recursion.
In this case, the problem is:
For non-negative M and N, output all strings of length M that contain exactly N 1s.
You can break this down into:
If M = 0, then output an empty string; otherwise
Output all strings of length M-1 that contain exactly N 1s, prefixing each with a 0; and
If N > 0, then output all strings of length M-1 that contain exactly N-1 1s, prefixing each with a 1.
Here, M = 0 is the trivial case that terminates the recursion. Turning the above into code is reasonably simple.
