Making a perfect hash (all consecutive buckets full), gperf or alternatives? - c

Let's say I want to build a perfect hash table for looking up an array where the predefined keys are 12 Months, thus I would want
hash("January")==0
hash("December")==11
I ran my month names through gperf and got a nice hash function, but it appears to give 16 buckets (or rather, the range is 16)!
#define MIN_HASH_VALUE 3
#define MAX_HASH_VALUE 18
/* maximum key range = 16, duplicates = 0 */
Looking at the generated gperf code, its hash function simply returns the key length plus per-character values looked up in a 256-entry table. Somehow, in my head I imagined a fancier-looking function... :)
What if I want exactly 12 buckets (that is, I do not want to skip over unused buckets)? For a small set like this it really doesn't matter, but what about when I have 1000 predefined keys and want exactly 1000 buckets in a row?
Can one find a deterministic way to do this?
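To make the question concrete, here is a hand-rolled sketch of what "exactly 12 buckets" would look like. It is not gperf output: the character positions and the modulus were picked by hand for the month names, and the sparse hash range is then collapsed onto dense indices 0..11 with a small remap table.

#include <stdio.h>
#include <string.h>

/* gperf-style character hash: (str[1] + str[2]) % 17 happens to be
 * collision-free for the 12 month names, giving sparse values in 3..16. */
static unsigned month_hash(const char *s)
{
    return ((unsigned char)s[1] + (unsigned char)s[2]) % 17;
}

/* remap[h] collapses the sparse hash value onto a dense index 0..11
 * (January..December); unused slots hold -1. */
static const int remap[17] = {
    -1, -1, -1,
     0,   /* 3:  January   */
     6,   /* 4:  July      */
     3,   /* 5:  April     */
     5,   /* 6:  June      */
     2,   /* 7:  March     */
    10,   /* 8:  November  */
     8,   /* 9:  September */
    -1,   /* 10: unused    */
     9,   /* 11: October   */
     1,   /* 12: February  */
    11,   /* 13: December  */
     4,   /* 14: May       */
    -1,   /* 15: unused    */
     7,   /* 16: August    */
};

static const char *months[12] = {
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
};

/* Returns 0..11 for a month name, -1 for anything else. */
static int month_index(const char *s)
{
    if (strlen(s) < 3)
        return -1;
    int idx = remap[month_hash(s)];
    /* validate: non-keys can land in a used slot too */
    return (idx >= 0 && strcmp(s, months[idx]) == 0) ? idx : -1;
}

int main(void)
{
    printf("%d %d %d\n", month_index("January"),
           month_index("December"), month_index("Banana"));  /* 0 11 -1 */
    return 0;
}

The remap array is exactly the extra indirection gperf avoids by accepting a sparse range; a true minimal perfect hash generator folds that step into the hash function itself.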

I was interested in the answer to this question & came to it via a search for gperf. I tried gperf, but it was very slow on a large input file and thus did not seem suitable. I tried cmph but I wasn't happy with it. It requires building a file which is then loaded into the C program at run time. Further, the program is so fragile (crashes with "segmentation fault" on any kind of mistaken input) that I did not trust it. A further Google search led me to this page, and onward to mph. I downloaded mph and found it is very nice. It has an optional program to generate a C file, called "emitc", and using it like
mph < systemdictionaryfile | emitc > output.c
worked almost instantly (a few seconds with a dictionary of about 200,000 words) and created a working C file which compiles with no problems. My tests also indicate that it works. I haven't tested the performance of the hashing algorithm yet though.

The only alternative to gperf I know is cmph: http://cmph.sourceforge.net/ but, as Jerome said in the comment, having 16 buckets gives you some speed benefit.
When I first looked at minimal perfect hashing I found very interesting reading on CiteSeerX, but I resisted the temptation to try coding one of those solutions myself. I know I would end up with an inferior solution with respect to gperf or cmph, or, even assuming the solution was comparable, I would have to spend a lot of time on it.

There are many MPH solutions and algorithms. gperf doesn't yet do MPHs, but I'm working on it, esp. for large sets. See https://gitlab.com/rurban/gperf/-/tree/hashfuncs
The classic cmph has a lot of constant overhead and is only recommended for huge key sets.
There's the NetBSD nbperf and my improved variant: https://github.com/rurban/nbperf
which does CHM, CHM3 and BDZ, with integer key support, optimizations for smaller key sets and alternate hash functions.
There's Bob Jenkins's generator, and Taj Khattra's mph-1.2.
There are also two Perl libraries to generate C lookups, one in PostgreSQL (PerfectHash.pm) and one for late Perl 5 Unicode lookups (regen/mph.pl), and a tool to compare various generators: https://github.com/rurban/Perfect-Hash

Related

comparing 2 large unsorted data files: algorithm with implementation

This is not the actual problem I have, but I want to test the boundaries in order to choose the best, or at least a good, strategy.
The theoretical problem is this:
I have 2 files of keys, 2 TB each (make it 4 TB if that helps), one string per line, produced by a strong encoder, which makes each entry very different.
I need to know whether the 2 files share some of these keys. We can assume low overlap, so the output is not a storage issue. In other words: I want to compare 2 large files.
In a naive approach, I would split file1 into chunks as big as my RAM and scan file2 once per chunk to find matches. That means cycling through a lot of I/O.
We might also assume that sorting the files "in place" is not available up front; it is an option only as part of the algorithm.
I assume that MapReduce/Hadoop might yield a kind of solution, and the first obvious one is sort-based, since in any case the keys are made of characters. We would need to map onto about 25 alphabetic characters plus a few alphanumeric ones.
But is there a better approach?
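A minimal sketch of the hash-partitioning alternative to a full sort: stream each file once and scatter its lines into N smaller bucket files chosen by a hash of the line. Shared keys necessarily land in bucket files with the same index, so afterwards only same-index bucket pairs need to be compared, each small enough to fit in RAM (e.g. load one into a hash set and stream the other; increase NBUCKETS if they are still too big). File names, the bucket count and the maximum line length below are illustrative assumptions.

#include <stdio.h>
#include <string.h>

#define NBUCKETS 256
#define MAXLINE  4096   /* assumes no key is longer than this */

/* FNV-1a hash of the line's bytes */
static unsigned long long fnv1a(const char *s)
{
    unsigned long long h = 1469598103934665603ULL;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Scatter every line of `input` into one of NBUCKETS files named <prefix>.NNN. */
static int partition(const char *input, const char *prefix)
{
    FILE *in = fopen(input, "r");
    FILE *out[NBUCKETS];
    char name[256], line[MAXLINE];

    if (!in) { perror(input); return -1; }
    for (int i = 0; i < NBUCKETS; i++) {
        snprintf(name, sizeof name, "%s.%03d", prefix, i);
        out[i] = fopen(name, "w");
        if (!out[i]) { perror(name); return -1; }
    }
    while (fgets(line, sizeof line, in))
        fputs(line, out[fnv1a(line) % NBUCKETS]);   /* line keeps its '\n' */
    fclose(in);
    for (int i = 0; i < NBUCKETS; i++)
        fclose(out[i]);
    return 0;
}

int main(void)
{
    /* After both runs, compare file1.buckets.i against file2.buckets.i only. */
    if (partition("file1.keys", "file1.buckets") != 0) return 1;
    if (partition("file2.keys", "file2.buckets") != 0) return 1;
    return 0;
}

With this, two sequential passes over the 2 TB inputs plus one pass over each bucket pair is all the I/O needed, and no global sort is required.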

MD5 hashes and Regular Expressions

I received an MD5 hash and a regular expression which have the same plaintext.
How do I use the Regular Expression to crack the MD5 hash and find the text behind the MD5?
b89e49cab317f2681be60fb3d1c0f8f8
[(a|c|d)n-t\|]{8}
The idea would be to use the regex as a template and generate inputs that satisfy it.
You can search for a regex visualizer to see this, but what that one says is: any of the characters ()acd|, or any character between n and t (inclusive), in any order, repeated eight times. I tested this in hashcat, and the regex is correct despite it looking like it means something else. A shorter way to write it would be [acd|()n-t]{8}.
So you start generating 8-character strings with those values and taking the MD5 of them. You can do this in almost any programming language, but Python is a good choice. Look up the hashlib library; it has an md5 function. You'll call hexdigest on the result and compare it to the provided hash.
>>> import hashlib
>>> hashlib.md5(b'cybering').hexdigest()
'61e4feebe66ad22349e292d1462afd3a'
Additionally, if you want to use cracking software, look up John the Ripper or hashcat. You should be able to provide them a dictionary and have them attempt to break the hash. I was able to solve this with hashcat on my 980 Ti in ~5 seconds. This tutorial helped me set up the custom charset and mask to perform the attack.
Have fun!
One approach would be to generate all possible eight-character combinations (with repetition) of the 13 characters allowed by the regex. Test each combination by computing its MD5 hash and comparing it to the one you were given.
That would be 13^8 = 815,730,721 possible combinations to check. The answer will likely be found before checking all of them.
I was able to whip out a little Node.js program on my laptop that found the solution in about 4 minutes (I split the problem up using workers to take advantage of multiple CPU cores).
Edit: I thought the regex had n-z instead of n-t so the search space was actually much smaller.
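For reference, a single-threaded C sketch of that brute-force search (the answer above used Node.js workers). It assumes OpenSSL's legacy MD5() is available and the program is linked with -lcrypto; without GPU help expect minutes rather than seconds.

#include <openssl/md5.h>
#include <stdio.h>
#include <string.h>

static const char alphabet[] = "acd|()nopqrst";   /* the 13 characters the regex allows */
static const char target[]   = "b89e49cab317f2681be60fb3d1c0f8f8";

int main(void)
{
    int idx[8] = {0};
    char candidate[9] = {0};
    unsigned char digest[MD5_DIGEST_LENGTH];
    char hex[2 * MD5_DIGEST_LENGTH + 1];

    for (;;) {
        for (int i = 0; i < 8; i++)
            candidate[i] = alphabet[idx[i]];

        MD5((const unsigned char *)candidate, 8, digest);
        for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
            sprintf(hex + 2 * i, "%02x", digest[i]);

        if (strcmp(hex, target) == 0) {
            printf("match: %s\n", candidate);
            return 0;
        }

        /* odometer-style increment over the 13^8 candidates */
        int pos = 7;
        while (pos >= 0 && ++idx[pos] == 13) {
            idx[pos] = 0;
            pos--;
        }
        if (pos < 0)
            break;                    /* search space exhausted */
    }
    puts("no match found");
    return 0;
}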
You can't crack the MD5 hash value; it was produced by a one-way hashing algorithm.

strstr vs regex in c

Let's say, for example, I have a list of user IDs, access times, program names, and version numbers as a list of CSV strings, like this:
1,1342995305,Some Program,0.98
1,1342995315,Some Program,1.20
2,1342985305,Another Program,15.8.3
1,1342995443,Bob's favorite game,0.98
3,1238543846,Something else,
...
Assume this list is not a file, but is an in-memory list of strings.
Now let's say I want to find out how often certain programs have been accessed, broken down by version number. (e.g. "Some Program version 1.20" was accessed 193 times, "Some Program version 0.98" was accessed 876 times, and "Some Program 1.0.1" was accessed 1,932 times)
Would it be better to build a regular expression and then use regexec() to find the matches and pull out the version numbers, or strstr() to match the program name plus comma, and then just read the following part of the string as the version number? If it makes a difference, assume I am using GCC on Linux.
Is there a performance difference? Is one method "better" or "more proper" than the other? Does it matter at all?
Go with strstr(). Using a regex to count occurrences is not a good idea, since you would need a loop anyway, so I would suggest a simple loop that searches for the position of the substring, increments a counter, and advances the starting search position after each match.
strchr/memcmp is how most libc versions implemented strstr. Hardware-dependent implementations of strstr in glibc do better. Both SSE2 and SSE4.2 (x86) instruction sets can do way better than scanning byte-by-byte. If you want to see how, I posted a couple blog articles a while back --- SSE2 and strstr and SSE2 and BNDM search --- that you might find interesting.
I'd do neither: I'm betting it would be faster to use strchr() to find the commas, and strcmp() to check the program name.
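A rough sketch of that strchr()/strncmp() idea, using the record format from the question; the count_accesses() interface and the sample data are only illustrative.

#include <stdio.h>
#include <string.h>

/* Count records whose 3rd field equals `program` and whose 4th field equals `version`. */
static int count_accesses(const char *records[], size_t n,
                          const char *program, const char *version)
{
    size_t plen = strlen(program);
    size_t vlen = strlen(version);
    int count = 0;

    for (size_t i = 0; i < n; i++) {
        const char *p = strchr(records[i], ',');      /* skip user id */
        if (!p) continue;
        p = strchr(p + 1, ',');                       /* skip access time */
        if (!p) continue;
        p++;                                          /* start of program name */
        if (strncmp(p, program, plen) != 0 || p[plen] != ',')
            continue;
        p += plen + 1;                                /* start of version */
        if (strncmp(p, version, vlen) == 0 &&
            (p[vlen] == '\0' || p[vlen] == ','))
            count++;
    }
    return count;
}

int main(void)
{
    const char *records[] = {
        "1,1342995305,Some Program,0.98",
        "1,1342995315,Some Program,1.20",
        "2,1342985305,Another Program,15.8.3",
        "1,1342995443,Bob's favorite game,0.98",
    };
    printf("%d\n", count_accesses(records, 4, "Some Program", "0.98"));  /* prints 1 */
    return 0;
}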
As for performance, I expect the string functions (strtok/strstr/strchr/strcmp...) to all run more or less at the same speed (i.e. really, really fast), and regex to run appreciably slower, albeit still quite fast.
The real performance benefit would come from properly designing the search though: how many times it must run, is the number of programs fixed...?
For example, a single scan whereby you get ALL the frequency data for all the programs would be much slower than a single scan seeking for a given program. But properly designed, all subsequent queries for other programs would run way faster.
Use strtok(), and break the data up into something more structured (like a list of structs).

Can Git detect if two source files are essentially copies of each other?

Sorry if this is off-topic, but here is your chance to reduce the amount of "homework" questions on this site :-)
I'm teaching a class of C programming where the students work on a small library of numeric routines in C. This year, the source files from several groups of students had significant amounts of code duplication in them.
(Down to identically misspelled printf debug statements. I mean, how dumb can you be.)
I know that Git can detect when two source files are similar to each other beyond a certain threshold, but I never managed to get that to work on two source files that are not in a Git repository.
Keep in mind that these are not particularly sophisticated students. It is unlikely that they would go to the trouble of changing variable/function names.
Is there a way I can use Git to detect significant and literal code duplication, a.k.a. plagiarism? Or is there some other tool you could recommend for that?
Why use git at all? A simple but effective technique would be to compare the sizes of the diffs between all of the different submissions, and then to manually inspect and compare those with the smallest differences.
Moss is a tool that was developed by a Stanford CS prof. I think they use it there as well. It's like diff for source code.
Adding to the other answers, you could use diff -- but I don't think the answers will be that useful by themselves. What you want is the number of lines that match, minus the number of non-blank lines, and to get that automatically you need to do a fair bit of magic with wc -l and grep to compute the sum of the lengths of the files, minus the length of the diff file, minus the number of blank lines that diff included as matching. And even then you'll miss some cases where diff decided that identical lines didn't match because of different things inserted before them.
A much better option is one of the suggestions listed in https://stackoverflow.com/questions/5294447/how-can-i-find-source-code-copying (or in https://stackoverflow.com/questions/4131900/how-to-detect-plagiarized-code, though the answers seem to duplicate).
You could use diff and check whether the two files seem similar:
diff -iEZbwB -U 0 file1.cpp file2.cpp
Those options tell diff to ignore whitespace changes and make a git-like diff file. Try it out on two samples.
Using diff is absolutely not a good idea unless you want to venture into the realm of combinatorial hell:
If you have 2 submissions, you have to perform 1 diff to check for plagiarism,
If you have 3 submissions, you have to perform 3 diffs to check for plagiarism,
If you have 4 submissions, you have to perform 6 diffs to check for plagiarism,
...
If you have n submissions, you have to perform n(n-1)/2 diffs!
On the other hand, Moss, already suggested in another answer, uses a completely different algorithm. Basically, it computes a set of fingerprints for significant k-grams of each document. The fingerprint is in fact a hash used to classify documents, and possible plagiarism is detected when two documents end up being sorted into the same bucket.
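For illustration only, a toy sketch of that k-gram fingerprinting idea; the real Moss additionally uses winnowing to select which k-gram hashes to keep, whereas this sketch keeps them all and simply prints them so two submissions' outputs could be intersected.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define K 8   /* k-gram length */

/* djb2 over one k-gram */
static unsigned long kgram_hash(const char *s)
{
    unsigned long h = 5381;
    for (int i = 0; i < K; i++)
        h = h * 33 + (unsigned char)s[i];
    return h;
}

/* Crude normalization (drop whitespace), then one fingerprint per k-gram. */
static void fingerprints(const char *src)
{
    char norm[1024];
    size_t n = 0;

    for (const char *p = src; *p && n < sizeof norm - 1; p++)
        if (!isspace((unsigned char)*p))
            norm[n++] = *p;
    norm[n] = '\0';

    for (size_t i = 0; i + K <= n; i++)
        printf("%lu\n", kgram_hash(norm + i));
}

int main(void)
{
    fingerprints("printf(\"debug: enterring loop\\n\");");
    return 0;
}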

Speeding up large switches and if-elses

What can I do to manually speed up large switches and if-elses? I will probably need some kind of hash or lookup table.
I'm working with gcc and C code; I doubt that gcc has any built-in optimizations for this.
Edit:
My switch code is what every switch looks like: do something based on whether a particular int has some value.
My if-elses look like this:
if( !strcmp( "val1", str ) )
foo();
else if( !strcmp( "val2", str ) )
foo2();
...
I also have ifs that do this
if( mystruct.member1 != NULL )
foo();
if( mystruct.member2 != NULL )
foo2();
EDIT2:
Thank you everyone. I'm not sure which one I should pick as an answer, because a lot of these answers have valid points and valuable insights. Unfortunately, I have to pick just one. But thanks all!
In the end, using a perfect hash table seems the best way to get O(1) access time for both the ifs and the switches.
To use a hash table:
Pick a hash function. This one is a biggie. There are tradeoffs between speed, the quality of the hash, and the size of the output. Encryption algorithms can make good hash functions. The hash function performs some computation using all the bits of your input value to return some output value with a smaller number of bits.
So the hash function takes a string and returns an integer between 0 and N.
Now you can look up a pointer to a function in a table of size N.
Each entry in the table will be a linked list (or some other searchable data structure) because of the chance of collision, that is two strings that map to the same hash value.
E.g., let's say hash(char*) returns a value between 0 and 3.
hash("val1") returns 2
hash("val2") returns 0
hash("val3") also returns 0
hash("val4") returns 1
Now your hash table looks something like:
table[0] ("val2",foo2) ("val3", foo3)
table[1] ("val4",foo4)
table[2] ("val1",foo1)
table[3] <empty>
I hope you can see how the cost of matching with a hash table is bounded by the time it takes to calculate the hash function plus the small time it takes to find the entry in the hash table. If the hash table is large enough, most hash table entries will have very few items.
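A compact sketch of that chained table in C, using the val1/val2 handlers from the question's if-else chain. The hash function here is just an arbitrary simple one (djb2 reduced modulo the table size), not the example bucket values above.

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 8

typedef void (*handler_fn)(void);

static void foo(void)  { puts("foo");  }
static void foo2(void) { puts("foo2"); }

struct entry {
    const char   *key;
    handler_fn    fn;
    struct entry *next;              /* chain for keys that collide */
};

static struct entry *table[TABLE_SIZE];

static unsigned hash(const char *s)  /* djb2 */
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

static void insert(struct entry *e)
{
    unsigned b = hash(e->key);
    e->next = table[b];
    table[b] = e;
}

static handler_fn lookup(const char *key)
{
    for (struct entry *e = table[hash(key)]; e; e = e->next)
        if (strcmp(e->key, key) == 0)   /* strcmp only within one bucket */
            return e->fn;
    return NULL;
}

int main(void)
{
    static struct entry e1 = { "val1", foo,  NULL };
    static struct entry e2 = { "val2", foo2, NULL };
    insert(&e1);
    insert(&e2);

    handler_fn fn = lookup("val2");
    if (fn)
        fn();                           /* prints "foo2" */
    return 0;
}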
For strings, if you have a small finite number of possible strings, use a perfect hash and switch on the result. With only 30 or so strings, finding a perfect hash should be pretty easy. If you also need to validate the input, you'll have to do a single strcmp in each case, but that's pretty cheap.
Beyond that, just let the compiler optimize your switches. Only do anything fancier if you've done sufficient testing to know the time spent here is performance-critical.
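A sketch of that "perfect hash plus switch" shape for the strcmp chain in the question. For keys like "val1"/"val2" the trailing digit already is a perfect hash; for arbitrary strings you would generate the hash function (e.g. with gperf) instead.

#include <stdio.h>
#include <string.h>

static void foo(void)  { puts("foo");  }
static void foo2(void) { puts("foo2"); }

static void dispatch(const char *str)
{
    if (strlen(str) != 4)            /* all keys happen to have length 4 */
        return;

    switch (str[3] - '0') {          /* the "perfect hash" */
    case 1:
        if (strcmp(str, "val1") == 0) foo();    /* one strcmp validates the input */
        break;
    case 2:
        if (strcmp(str, "val2") == 0) foo2();
        break;
    default:
        break;                       /* not a known key */
    }
}

int main(void)
{
    dispatch("val2");                /* prints "foo2" */
    dispatch("xyz2");                /* rejected by the strcmp */
    return 0;
}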
I'm not sure what you are looking for, but branch prediction with gcc is discussed in this question.
It has. Just see the generated code. At least it optimizes switches.
You may use a hash table to optimize your code, but I'm sure that GCC does the same for you.
Another thing is if-elses, when they contain some complex boolean expressions. I will not answer that part of the question here.
It really depends on the code base you are working with and whether it is open to further/better modularization. Otherwise, if nothing else, I can recommend this:
If there are more common cases than others (one or two things happen more than the rest), place them at the beginning of the switch/if/else, that way in the more common cases your program will only make that first one or two comparisons and short circuit its path. Generally a good idea on its own for any code IMHO.
It depends a lot on the strings that you are comparing. You could do a switch on some characteristic of the strings:
If you know that they differ pretty well in the 4th position, you could do a switch on str[3] and only then do the strcmp.
Or look at some sort of checksum and switch on that.
But all of this is quite handcrafted; you should definitely check the assembly that gcc produces.
A hash table would be ideal for speeding up a bunch of string compares.
You might want to look into a string library that does not use nul-terminated strings the way the C stdlib does. Lots of string manipulation in C involves a lot of "look through the string for the nul, then do your operation".
A string library like SafeStr keeps information about the length of the strings, so there's no need to burn time scanning for nuls, especially for strings of unequal lengths.
(I'm quoting some of this from prior research I've written on the topic.)
The SPECint2006 benchmark 458.sjeng, which implements a chess simulator, uses many switch statements to process the different chess pieces. Each statement is of a form like:
switch (board[from]) {
case (wpawn): ...
case (wknight): ...
The compiler (gcc) generates an instruction sequence similar to the following:
40752b: mov -0x28(%rbp),%eax
40752e: mov 0x4238(,%rax,8),%rax
407536: jmpq *%rax
This assembly acts as a lookup table. You can speed up the compiled code further by splitting your switch ... case into multiple switch statements. You'll want to keep the case values consecutive and put the most frequent cases into different switch statements. This particularly improves the indirect branch prediction.
I'll leave the remainder of your questions to others.
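A hedged sketch of that splitting advice: one sparse switch becomes two dense ones, each over consecutive case values, so gcc can emit a separate jump table for each (in practice gcc wants a handful of consecutive cases before it chooses a table). The opcode values and handler names are made up.

#include <stdio.h>

static void handle_a(void) { puts("a"); }
static void handle_b(void) { puts("b"); }
static void handle_c(void) { puts("c"); }
static void handle_d(void) { puts("d"); }
static void handle_x(void) { puts("x"); }
static void handle_y(void) { puts("y"); }
static void handle_z(void) { puts("z"); }

static void dispatch(int op)
{
    if (op < 100) {
        switch (op) {                /* dense range 0..3 */
        case 0: handle_a(); break;
        case 1: handle_b(); break;
        case 2: handle_c(); break;
        case 3: handle_d(); break;
        default: break;
        }
    } else {
        switch (op) {                /* dense range 100..102 */
        case 100: handle_x(); break;
        case 101: handle_y(); break;
        case 102: handle_z(); break;
        default: break;
        }
    }
}

int main(void)
{
    dispatch(1);                     /* prints "b" */
    dispatch(101);                   /* prints "y" */
    return 0;
}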
Other answers have already suggested using a hash table. I'd recommend generating a perfect hash function using gperf (or a minimal perfect hash function; see the Wikipedia page for a few links).
