Hash function for short strings - c

I want to send function names from a resource-constrained embedded system to the host computer for debugging purposes. Since the two are connected by RS232, which is short on bandwidth, I don't want to send the function names literally. Some function names are about 15 characters long, and I sometimes want to send them at a fairly high rate.
The solution I thought of was to find a hash function that would hash those function names to a single byte, and send only that byte. The host computer would scan all the functions in the source, compute their hashes using the same function, and then translate each hash back to the original string.
The hash function must be:
Collision-free for short strings.
Simple (since I don't want too much code in my embedded system).
Able to fit its output in a single byte.
Obviously, it does not need to be secure in any way, only collision-free, so I don't think cryptography-related hash functions are worth their complexity.
An example code:
int myfunc() {
    sendToHost(hash("myfunc"));
}
The host would then be able to present me with a list of the times at which the myfunc function was executed.
Is there some known hash function which holds the above conditions?
Edit:
I assume I will use far fewer than 256 function names.
I can use more than a single byte; two bytes would have me pretty well covered.
I prefer to use a hash function instead of using the same function-to-byte map on the client and the server, because (1) I have no map implementation on the client, and I'm not sure I want to add one just for debugging purposes, and (2) it would require another tool in my build chain to inject the function-name table into my embedded system code. A hash is better in this regard, even if it means I'll get a collision once in a while.

Try minimal perfect hashing:
Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.
C code is included.
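For a small, fixed set of names you can get the same effect by brute force: pick a simple parameterized hash and search for a seed that maps every key into 0..n-1 with no collisions. Below is a minimal sketch, assuming an FNV-1a-style mixing function and a hypothetical key list; the host runs the search once, and the embedded side only needs hash() plus the found seed. This only scales to small key sets; for larger ones, a generator such as gperf is the usual tool.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical key set: the function names extracted from the source. */
static const char *keys[] = { "main", "myfunc", "setled", "soundbuzzer" };
#define NKEYS (sizeof keys / sizeof keys[0])

/* FNV-1a-style hash parameterized by a seed, reduced into 0..n-1. */
static uint8_t hash(const char *s, uint32_t seed)
{
    uint32_t h = seed;
    while (*s)
        h = (h ^ (uint8_t)*s++) * 16777619u;
    return (uint8_t)(h % NKEYS);
}

int main(void)
{
    /* Brute-force a seed that maps all keys to distinct slots. */
    for (uint32_t seed = 0; seed < 1000000; seed++) {
        uint8_t used[NKEYS] = { 0 };
        size_t i;
        for (i = 0; i < NKEYS; i++) {
            uint8_t h = hash(keys[i], seed);
            if (used[h])
                break;              /* collision: try the next seed */
            used[h] = 1;
        }
        if (i == NKEYS) {           /* all keys landed in distinct slots */
            printf("seed %u works\n", seed);
            return 0;
        }
    }
    return 1;                       /* no seed found in the search range */
}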

Hmm, with only 256 possible values, and since you will parse your source code to know all possible functions, maybe the best way to do it would be to assign a number to each of your functions?
A real hash function probably won't work, because you have only 256 possible hashes but want to map at least 26^15 possible values (assuming letter-only, case-insensitive function names).
Even if you restricted the number of possible strings (by applying some mandatory formatting), you would be hard pressed to get both meaningful names and a valid hash function.

You could use a Huffman tree to abbreviate your function names according to the frequency with which they are used in your program. The most common function could be abbreviated to 1 bit, less common ones to 4-5 bits, very rare functions to 10-15 bits, and so on. A Huffman tree is not very hard to implement, but you will have to do something about the bit alignment.
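The alignment problem is just bit-packing. Here is a minimal sketch of the sender side, assuming a hypothetical table of precomputed codes (built offline on the host from profiled call frequencies) and a hypothetical sendByteToHost() transport hook:

#include <stdint.h>

/* Hypothetical table of precomputed Huffman codes, one per function ID.
 * Each entry stores a prefix-free code in the low 'len' bits of 'bits'. */
struct hcode { uint16_t bits; uint8_t len; };
static const struct hcode codes[] = {
    { 0x0, 1 },     /* most frequent function:  "0"   */
    { 0x2, 2 },     /* next most frequent:      "10"  */
    { 0x6, 3 },     /* and so on:               "110" */
};

extern void sendByteToHost(uint8_t b);  /* assumed transport hook */

static uint32_t acc;    /* bit accumulator */
static uint8_t nbits;   /* number of valid bits in acc */

void sendFuncCode(unsigned id)
{
    acc = (acc << codes[id].len) | codes[id].bits;
    nbits += codes[id].len;
    while (nbits >= 8) {            /* emit every completed byte */
        nbits -= 8;
        sendByteToHost((uint8_t)(acc >> nbits));
    }
}

The host decodes the bit stream with the same tree. You would also need a flush routine that pads out the final partial byte.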

No, there isn't.
You can't make a collision-free hash code, or even close to it, with just an eight-bit hash. If you allow strings longer than one character, you have more possible strings than possible hash codes.
Why not just extract the function names and give each function name an id? Then you only need a lookup table on each side of the wire.
(As others have shown, you can generate a hash algorithm without collisions if you already have all the function names, but then it's easier to just assign a number to each name and build a lookup table...)

If you have a way to track the functions within your code (e.g., a text file generated at build time, such as a linker map), you can just use the memory location of each function. Not exactly a byte, but smaller than the entire name and guaranteed to be unique. This has the added benefit of low overhead. All you need to 'decode' an address is the text file that maps addresses to actual names; this could be sent to the remote location or, as mentioned, stored on the local machine.
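A minimal sketch of the sending side, assuming a hypothetical sendBytesToHost() transport and a platform where a function pointer fits in a uintptr_t; the host resolves each address against the address-to-name file:

#include <stdint.h>

extern void sendBytesToHost(const void *p, int len);  /* assumed transport */

int myfunc(void)
{
    /* Send this function's own address; the host maps it back to
     * "myfunc" using the address-to-name text file. */
    uintptr_t addr = (uintptr_t)&myfunc;
    sendBytesToHost(&addr, sizeof addr);
    return 0;
}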

In this case you could just use an enum to identify functions. Declare function IDs in some header file:
typedef enum
{
    FUNC_ID_main,
    FUNC_ID_myfunc,
    FUNC_ID_setled,
    FUNC_ID_soundbuzzer
} FUNC_ID_t;
Then in functions:
int myfunc(void)
{
    sendFuncIDToHost(FUNC_ID_myfunc);
    ...
}

If the sender and receiver share the same set of function names, they can build identical hash tables from them. You can then identify an entry by the path taken to reach it in the table, e.g. {starting position + number of hops}, which would take 2 bytes of bandwidth. For a fixed-size table with linear probing, only the final index is needed to address an entry.
NOTE: when building the two "synchronous" hash tables, the order of insertion is important ;-)
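A minimal sketch of the shared insert routine, assuming a djb2-style string hash and a 256-slot table so that the slot index itself is the single byte sent over the wire; both sides must insert the names in the same order:

#include <stdint.h>
#include <stddef.h>

#define TABLE_SIZE 256              /* slot index fits in one byte */
static const char *table[TABLE_SIZE];

/* djb2-style string hash; both sides must use the same function. */
static uint8_t hash(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (uint8_t)*s++;
    return (uint8_t)h;
}

/* Linear-probing insert; the returned slot index is the wire ID.
 * Assumes the table never fills completely. */
static int insert(const char *name)
{
    uint8_t i = hash(name);
    while (table[i] != NULL)
        i++;                        /* wraps around via uint8_t overflow */
    table[i] = name;
    return i;
}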

Described here is a simple way of implementing it yourself: http://www.devcodenote.com/2015/04/collision-free-string-hashing.html
Here is a snippet from the post:
It derives its inspiration from the way binary numbers are decoded and converted to decimal number format. Each binary string representation uniquely maps to a number in the decimal format.
If, say, we have a character set of the capital English letters, then the length of the character set is 26, where A can be represented by the number 0, B by the number 1, C by the number 2, and so on up to Z by 25. Now, whenever we want to map a string over this character set to a unique number, we perform the same conversion as we did in the binary case.
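A minimal sketch of that positional encoding, with one adjustment: it uses digits 1..26 in base 27 rather than 0..25 in base 26, because with a digit value of 0 for 'A', the strings "A" and "AA" would both encode to 0. The mapping is only collision-free while the result fits the integer type, so it suits short strings only:

#include <stdint.h>

/* Treat an uppercase-letter string as a base-27 number with digit
 * values 1..26 for 'A'..'Z'; distinct short strings get distinct values. */
uint64_t encode(const char *s)
{
    uint64_t v = 0;
    while (*s)
        v = v * 27 + (uint64_t)(*s++ - 'A' + 1);
    return v;
}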

Related

Tcl String function that does "shimmering" ruined my customized tcl type defined in c

I have defined a customized Tcl type using the Tcl library in C/C++. I basically make the Tcl_Obj.internalRep.otherValuePtr point to my own data structure. The problem happens when calling [string length myVar] or other similar string functions, which perform the so-called shimmering behaviour that replaces my internalRep with the string type's own structure. After such a string function runs, myVar cannot be converted back, because my complicated data structure cannot be reconstructed from the Tcl_Obj.bytes representation, and the type is no longer my customized type. How can I avoid that?
The string length command converts the internal representation of the values it is given to the special string type, which records information to allow many string operations to be performed rapidly. Apart from most of the string command's various subcommands, the regexp and regsub commands are the main ones that do this (for their string-to-match-the-RE-against argument). If you have a precious internal representation of your own and do not wish to lose it, you should avoid those commands; there are some operations that avoid the trouble. (Tcl mostly assumes that internal representations are not fragile, and therefore that they can be regenerated on demand. Beware when relying on fragile ones!)
The key operations that are mostly safe (as in they generate the bytes/length rep through calling the updateStringProc if needed, but don't clear the internal rep) are:
substitution into a string; the substituted value won't have the internal rep, but it will still be in the original object.
comparison with the eq and ne expression operators. This is particularly relevant for checks to see if the value is the empty string.
Be aware that there are many other operations that spoil the internal representation in other ways, but most don't catch people out so much.
[EDIT — far too long for a comment]: There are a number of relatively well-known extensions that work this way (e.g., TCOM and Tcl/Java both do this). The only thing you can really do is “be careful” as the values really are fragile. For example, put them in an array and then pass the indexes into the array around instead, as those need not be fragile. Or keep things as elements in a list (probably in a global variable) and pass around the list indices; those are just plain old numbers.
The traditional, robust approach is to put a map (e.g., a Tcl_HashTable or std::map) in your C or C++ code and have the indices into that be short strings with not too much meaning (I like to use the name of the type of value followed by either a sequence number or a serialisation of the pointer, such as you might get with the %p conversion in sprintf(); the printed pointer reveals more of the implementation details, is a little more helpful if you're debugging, and generally doesn't actually make that much difference in practice). You then have the removal of things from the map be an explicit deletion operation, and it is also easy to provide operations like listing all the known current values. This is safe, but prone to “leaking” (though it's not formally a memory leak if you provide the listing operation). It can be accelerated by caching the lookup in a Tcl_Obj*'s internal representation (a cheap way to handle deletion is to use a sequence number that you increment when you delete something, and only bypass the map lookup if the sequence number that you cache in the intrep is equal to the main sequence number) but it's not usually a big deal; only go to that sort of thing if you've measured a bottleneck in the lookups.
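A minimal sketch of that handle-map approach using the Tcl C API's Tcl_HashTable; the "myval" naming scheme and the plain void* value type are illustrative, not a fixed convention:

#include <tcl.h>
#include <stdio.h>

static Tcl_HashTable handles;       /* handle string -> value pointer */
static unsigned long handleSeq;     /* sequence number for handle names */

void InitHandles(void)
{
    Tcl_InitHashTable(&handles, TCL_STRING_KEYS);
}

/* Register a value and return a short handle such as "myval12"; Tcl
 * code can pass the handle around safely, as it is a plain string. */
const char *RegisterValue(void *value)
{
    static char name[32];           /* returned handle; copy if kept */
    Tcl_HashEntry *entry;
    int isNew;

    snprintf(name, sizeof name, "myval%lu", ++handleSeq);
    entry = Tcl_CreateHashEntry(&handles, name, &isNew);
    Tcl_SetHashValue(entry, value);
    return name;
}

void *LookupValue(const char *name)
{
    Tcl_HashEntry *entry = Tcl_FindHashEntry(&handles, name);
    return entry ? Tcl_GetHashValue(entry) : NULL;
}

/* Explicit deletion, as described above. */
void DeleteValue(const char *name)
{
    Tcl_HashEntry *entry = Tcl_FindHashEntry(&handles, name);
    if (entry)
        Tcl_DeleteHashEntry(entry);
}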
But I'd probably just live with fragility in my own code, and would just take care to ensure that I never bust the assumptions. The problem is really that you're being incautious about how you use the values; the Tcl code should just pass them around and nothing else really. Also, I've experimented a fair bit with wrapping such things up inside a TclOO object; it's far too heavyweight (by the design of TclOO) for values that you're making a lot of, but if you've only got a few of them and you're wanting to treat them as objects with methods, this can work very well indeed (and gives many more options for automatic cleanup).

PBC: get hash from an element

In the PBC library there is a function to get an element from a hash:
void element_from_hash(element_t e, void *data, int len)
Generate an element e deterministically from the len bytes stored in the buffer data.
PBC manuals: Converting elements
Is there a way to go the other way around, i.e. recover the corresponding hash from an element?
Alternatively, is there a consistent way to translate strings back and forth to elements?
Do element_to_bytes / element_from_bytes work, or do these bytes have to be in a particular form to be interpreted as a PBC element (as I strongly suspect)?
I am building a cryptosystem, and clearly I need to recover the cleartext after the decryption ;).
To explain my problem better: I need a way to encode a string as an element and decode an element back to a string.
I studied the manual and looked up some theory. I am not an expert in cryptography and have forgotten much about elliptic rings and groups.
The basic purpose of Pairing-Based Cryptography is to be able to safely exchange or construct a shared key for use in further encryption or authentication, knowing that someone may be eavesdropping on the communications used in the exchange/construction.
The hashing the manual and theory talks about means to generate a hash from the message. That hash will in turn be used to obtain a component of a key from the ring using the functions of the PBC library.
Hashing by definition means mapping from something larger onto something smaller, for example generating a 16-bit integer hash from a 1024-character message. As a consequence, it is also by definition not possible to obtain the original from a hash: there are an infinite number of messages that would qualify, as they all generate the same hash.
However, using the key obtained from the exchange and using the library's functions lets you decrypt the message.

Precalculating MD5

I've got the MD5 hash of a password that is one million characters long, and I know the first 999,992 characters. I need to brute-force the last 8 digits. Can I precompute the hash state for the known prefix (let's call it the base hash) and then just brute-force the 8-character tail, combining each candidate with the base hash to find the right password faster? What algorithm should I use, or what software could help me?
Yes, that's possible. MD5 is based on the Merkle-Damgård construction, which performs the hashing in blocks. You can hash a number of blocks, then save the state of the hash function and use it as the starting point to try different possibilities for the remaining blocks.
Based on the documentation (I haven't tested), I think calling clone() on a Java MessageDigest will copy the current state of the hash function. You could use that to build your partial hash from the known characters, then create a clone for each guess. That's assuming that the MD5 implementation actually supports cloning. There's a chance (depending on what language and library you use) that you might have to write your own MD5 implementation.
Note that MD5's block size is 512 bits (64 characters), and the length of your password (one million) is a whole multiple of that. That means your password characters will completely fill up the last block of data, and the hash function will need an additional block for padding. So you'll precompute the partial hash of the first 999,936 characters that you know, then produce the final data block from the remaining 56 characters that you know plus the 8 that you're guessing, then append the padding block after that.
An implementation like Java's MessageDigest should take care of the details of dividing things into blocks, though. You can probably (again, I haven't tested) just create a MessageDigest, call update(byte[]) with your 999,992 known bytes, and then call clone() for each guess.
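In C, the analogue of clone() is simply copying the context struct. A minimal sketch using OpenSSL's legacy MD5_* API (deprecated in OpenSSL 3 but still available), where MD5_CTX is a plain struct that can be snapshotted by assignment:

#include <openssl/md5.h>
#include <stdio.h>
#include <string.h>

void bruteforce(const unsigned char *known, size_t known_len,
                const unsigned char target[MD5_DIGEST_LENGTH])
{
    MD5_CTX base;
    MD5_Init(&base);
    MD5_Update(&base, known, known_len);  /* the 999,992 known bytes */

    for (unsigned long n = 0; n < 100000000UL; n++) {  /* 8-digit tails */
        char guess[9];
        snprintf(guess, sizeof guess, "%08lu", n);

        MD5_CTX ctx = base;               /* resume from the snapshot */
        unsigned char digest[MD5_DIGEST_LENGTH];
        MD5_Update(&ctx, guess, 8);
        MD5_Final(digest, &ctx);

        if (memcmp(digest, target, MD5_DIGEST_LENGTH) == 0) {
            printf("found: %s\n", guess);
            return;
        }
    }
}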

Fast string search using bitwise operators

What is the fastest (parallel?) way to find a substring in a very long string using bitwise operators?
e.g. find all positions of "GCAGCTGAAAACA" sequence in a human genome http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit (770MB)
*the alphabet consists of 4 symbols ('G', 'C', 'T', 'A'), represented using 2 bits:
'G':00, 'A':01, 'T':10, 'C':11
*you can assume the query string (the shorter one) is fixed in length, e.g. 127 characters
*by fastest I mean not including any pre-processing/indexing time
*the file is going to be loaded into memory after pre-processing, basically there will be billions of short strings to be searched for in a larger string, all in-memory.
*bitwise because I'm looking for the simplest, fastest way to search for a bit pattern in a large bit array and stay as close as possible to the silicon.
*KMP wouldn't work well as the alphabet is small
*C code, x86 machine code would all be interesting.
Input format description (.2bit): http://jcomeau.freeshell.org/www/genome/2bitformat.html
Related:
Fastest way to scan for bit pattern in a stream of bits
Algorithm help! Fast algorithm in searching for a string with its partner
http://www.arstdesign.com/articles/fastsearch.html
http://en.wikipedia.org/wiki/Bitap_algorithm
If you're just looking through a file, you're pretty much guaranteed to be I/O-bound. Use of a large buffer (~16K) and strstr() should be all you need. If the file is encoded in ASCII, search just for "gcagctgaaaaca". If it actually is encoded in bits, just permute the possible accepted byte strings (there should be ~8; lop off the first byte) and use memmem() plus a tiny overlapping-bit check.
I'll note here that glibc strstr and memmem already use Knuth-Morris-Pratt to search in linear time, so test that performance first. It may surprise you.
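For reference, a minimal sketch of scanning a buffer for all (possibly overlapping) matches with memmem(), which is a GNU extension:

#define _GNU_SOURCE
#include <string.h>
#include <stdio.h>

/* Report every occurrence of the query 'q' in 'buf'. */
void find_all(const char *buf, size_t n, const char *q, size_t qlen)
{
    const char *p = buf, *end = buf + n;
    while ((p = memmem(p, (size_t)(end - p), q, qlen)) != NULL) {
        printf("match at offset %td\n", p - buf);
        p++;                    /* restart just past the match start */
    }
}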
If you first encode/compress the DNA string with a lossless coding method (e.g., Huffman, exponential Golomb, etc.), then you get a ranked probability table ("coding tree") for DNA tokens of various combinations of nucleotides (e.g., A, AA, CA, etc.).
What this means is that, once you compress your DNA:
You'll probably be using fewer bits to store GCAGCTGAAAACA and other subsequences, than the "unencoded" approach of always using two bits per base.
You can walk through the coding tree or table to build an encoded search string, which will usually be shorter than the unencoded search string.
You can apply the same family of exact search algorithms (e.g. Boyer-Moore) to locate this shorter, encoded search string.
As for a parallelized approach, split the encoded target string up into N chunks and run the search algorithm on each chunk, using the shortened, encoded search string. By keeping track of the bit offsets of each chunk, you should be able to generate match positions.
Overall, this compression approach would be useful if you plan on doing millions of searches on sequence data that won't change. You'd be searching fewer bits — potentially many fewer, in aggregate.
Boyer-Moore is a technique used to search for substrings in plain strings. The basic idea is that if your substring is, say, 10 characters long, you can look at the character at position 9 in the string being searched. If that character is not part of your search string, you can simply start the search after that character. (If that character is, indeed, in your string, the Boyer-Moore algorithm uses a look-up table to skip the optimal number of characters forward.)
It might be possible to reuse this idea for your packed representation of the genome string. After all, there are only 256 different bytes, so you can safely pre-calculate the skip table.
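A minimal sketch of that idea in the Boyer-Moore-Horspool style: a 256-entry bad-character table over the packed bytes, plus the search loop that uses it:

#include <stdint.h>
#include <stddef.h>

/* Bad-character table: how far to shift when a given text byte is
 * aligned with the last position of the pattern. */
static void build_skip(const uint8_t *pat, size_t m, size_t skip[256])
{
    for (int b = 0; b < 256; b++)
        skip[b] = m;                /* byte absent from the pattern */
    for (size_t i = 0; i + 1 < m; i++)
        skip[pat[i]] = m - 1 - i;   /* distance from rightmost use */
}

/* Horspool search: returns the first match offset, or (size_t)-1. */
size_t search(const uint8_t *text, size_t n, const uint8_t *pat, size_t m)
{
    size_t skip[256];
    build_skip(pat, m, skip);
    for (size_t i = 0; i + m <= n; i += skip[text[i + m - 1]]) {
        size_t j = 0;
        while (j < m && text[i + j] == pat[j])
            j++;
        if (j == m)
            return i;
    }
    return (size_t)-1;
}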
The benefit of encoding the alphabet into bit fields is compactness: one byte holds the equivalent of four characters. This is similar to some of the optimizations Google performs searching for words.
This suggests four parallel executions, each with the (transformed) search string offset by one character (two bits). A quick-and-dirty approach might be to just look for the first or second byte of the search string and then check extra bytes before and after matching the rest of the string, masking off the ends if necessary. The first search is handily done by the x86 instruction scasb. Subsequent byte matches can build upon the register values with cmpb.
You could create a state machine. In the topic Fast algorithm to extract thousands of simple patterns out of large amounts of text, I used [f]lex to create the state machine for me. It would require some hackery to use the 4-letter (i.e., two-bit) alphabet, but it can be done using the same tables as generated by [f]lex. (You could even create your own fgetc()-like function that extracts two bits at a time from the input stream and keeps the other six bits for consecutive calls; a sketch follows. Pushback will be a bit harder, but not undoable.)
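A minimal sketch of such a two-bit fgetc()-like helper:

#include <stdio.h>

/* Return one 2-bit symbol (0..3) per call, high bits first, buffering
 * the remaining bits of the current byte between calls. */
int fgetc2(FILE *fp)
{
    static int buf, left;
    if (left == 0) {
        buf = fgetc(fp);
        if (buf == EOF)
            return EOF;
        left = 8;
    }
    left -= 2;
    return (buf >> left) & 0x3;
}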
BTW: I seriously doubt if there is any gain in compressing the data to two bits per nucleotide, but that is a different matter.
Okay, given your parameters, the problem isn't that hard, just not one you'd approach like a traditional string search problem. It more resembles a database table-join problem, where the tables are much larger than RAM.
1. Select a good rolling hash function, aka buzhash. If you have billions of strings, you're looking for a hash with 64-bit values.
2. Create a hash table based on each 127-element search string. The table in memory only needs to store (hash, string-id) pairs, not the whole strings.
3. Scan your very large target string, computing the rolling hash and looking up each value of the hash in your table. Whenever there's a match, write the (string-id, target-offset) pair to a stream, possibly a file.
4. Reread your target string and the pair stream, loading search strings as needed to compare them against the target at each offset.
I am assuming that loading all pattern strings into memory at once is prohibitive. There are ways to segment the hash table into something that is larger than RAM but not a traditional random-access hash file; if you're interested, search for "hybrid hash" and "grace hash", which are more common in the database world.
I don't know if it's worth your while, but your pair stream gives you the perfect predictive input to manage a cache of pattern strings -- Belady's classic VM page replacement algorithm.
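A minimal sketch of the rolling scan (steps 1 and 3 above), assuming the target has already been unpacked to one 2-bit symbol per byte, and hypothetical lookup() and report() hooks standing in for the hash-table probe and the pair-stream write:

#include <stdint.h>
#include <stddef.h>

#define QLEN 127
#define MUL  0x9E3779B97F4A7C15ULL  /* arbitrary odd 64-bit multiplier */

extern int lookup(uint64_t h);              /* assumed: -1 if no match  */
extern void report(int id, size_t offset);  /* assumed pair-stream hook */

void scan(const uint8_t *symbols, size_t n)
{
    uint64_t pow_out = 1, h = 0;
    for (int i = 0; i < QLEN; i++)
        pow_out *= MUL;             /* MUL^QLEN, removes outgoing symbol */

    for (size_t i = 0; i < n; i++) {
        h = h * MUL + symbols[i];              /* roll the new symbol in */
        if (i >= QLEN)
            h -= pow_out * symbols[i - QLEN];  /* roll the old one out */
        if (i + 1 >= QLEN) {
            int id = lookup(h);
            if (id >= 0)
                report(id, i + 1 - QLEN);
        }
    }
}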

Hash a byte string

I'm working on a personal project, a file compression program, and am having trouble with my symbol dictionary. I need to store previously encountered byte strings into a structure in such a way that I can quickly check for their existence and retrieve them. I've been operating under the assumption that a hash table would be best suited for this purpose so my question will be pertaining to hash functions. However, if someone can suggest a better alternative to a hash table, I'm all ears.
All right. So the problem is that I can't come up with a good hashing key for these byte strings. Everything I think of either has a very uneven distribution or takes too long. Here is a list of the constraints I'm working with:
All byte strings will be at least two bytes in length.
The hash table will have a maximum size of 3839, and it is very likely it will fill.
Testing has shown that, with any given byte, the highest order bit is significantly less likely to be set, as compared to the lower seven bits.
Otherwise, bytes in the string can be any value from 0 to 255 (I'm working with raw byte data of any format).
I'm working with the C language in a UNIX environment. I'd prefer to stick with standard libraries, but it doesn't need to be portable to other OSs (i.e., unistd.h is fine).
Security is of NO concern.
Speed is of a HIGH concern.
The size isn't of intense concern, as it will NOT be written to file. However, considering the potential size of the byte strings being stored, memory space could become an issue during the compression.
A trie is better suited to this kind of thing because it lets you store your symbols as a tree and quickly parse it to match values (or reject them).
And as a bonus, you don't need a hash at all. You're storing/retrieving/comparing the entire sequence at once, while still only holding a minimal amount of memory.
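A minimal byte-trie sketch under those constraints; each node burns 256 child pointers, so a real implementation would compress the child arrays (error handling omitted):

#include <stdlib.h>

struct trie {
    struct trie *child[256];
    int is_end;                 /* nonzero if a stored string ends here */
};

struct trie *trie_new(void)
{
    return calloc(1, sizeof(struct trie));
}

void trie_insert(struct trie *t, const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!t->child[s[i]])
            t->child[s[i]] = trie_new();
        t = t->child[s[i]];
    }
    t->is_end = 1;
}

int trie_contains(const struct trie *t, const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        t = t->child[s[i]];
        if (!t)
            return 0;
    }
    return t->is_end;
}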
Edit: And as an additional bonus, with only a second parse, you can look up sequences that are "close" to your current sequence, so you can drop one sequence and use the previous one for both, with some internal notation to hold the differences. That will help you compress files better, because:
a smaller dictionary means smaller files, since you have to write the dictionary to your file;
a smaller number of items can free up space to hold other, rarer sequences, if you add a population cap and hit it with a large file.
