64-bit multiplicative hashing - c

I'm working on a fast 64-bit hash. Many existing secure hash functions are way too slow, and some non-cryptographic hash functions like FNV are just bad.
Well, I came up with a FNV-like hash:
UINT64 hash=0;
// for each input byte
Main question is about HASH_PRIME. Often, we may see a "golden ratio" term for multiplicative hashing.
For 64-bit hash, golden ratio is 0x9e3779b97f4a7c13.
I tested the 32-bit golden ratio for period in PRNG:
DWORD hash=0;
// loop
A good value here may produce the period of 0xFFFFFFFF - i.e. max possible. This golden ratio produces notably smaller period.
or just
DWORD hash=~0;
// loop
And again, a good enough multiplier can produce period of 0x3FFFFFFF bytes. Golden ratio here produces again much shorter period.
Never tested the 64-bit primes - too computationally expensive.
Is period important for my hash? And where to find a good 64-bit HASH_PRIMES and how to test such stuff?

Are you doing this as an exercise? Otherwise I would advise having a look at well-known hash functions such as Bob Jenkins' lookup8 and lookup family (http://burtleburtle.net/bob/hash/) and Austin Appleby's murmur (http://code.google.com/p/smhasher/), a speed killer and my favorite. Good hash functions are hard to build... and if you are after a rolling type of hash, Rabin fingerprints are hard to beat.
And if you really want to roll your own, make sure that your hashes are decent by running the hash tests by Appleby (smhasher) or Jenkins (torture).

Not sure about the first two examples. But in the third, to get a full period out of the code you need to add an odd number. Otherwise this will have a maximum period of 65537; it could be as low as 3. There may even be a fixed point.
Wherever you got the 0x3FFFFFFF for a good period is not correct. One of the Knuth volumes discusses this in excessive detail.
The multiplier must be of the form 4n+1 and there must be an odd addend
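The rule above (multiplier of the form 4n+1, odd addend) can be checked by brute force on a small modulus. The sketch below is only an illustration: it uses a 16-bit state so the loop finishes instantly, and the function name `lcg16_period` is made up for this example. The multiplier must be odd so that the step is a bijection and the seed lies on a cycle.

```c
#include <stdint.h>

/* Cycle length of x -> a*x + c (mod 2^16), starting from seed.
 * Assumes a is odd, so the map is a bijection and the walk returns to seed. */
uint32_t lcg16_period(uint16_t a, uint16_t c, uint16_t seed)
{
    uint16_t x = seed;
    uint32_t steps = 0;
    do {
        /* The product fits in 32 bits; the cast reduces it mod 2^16. */
        x = (uint16_t)((uint32_t)a * x + c);
        steps++;
    } while (x != seed);
    return steps;
}
```

With a = 5 (that is, 4*1+1) and c = 1 the walk visits all 65536 states. With c = 0 and seed 1 the same multiplier only reaches a quarter of them, which is the best a purely multiplicative generator can do with a power-of-two modulus.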


64-bit hash/digest in C

I am trying to find out if there is any API in C for calculating a 64 bit hash.
I found out that some people use top 64 bits of MD5/SHA1 etc. Is it a good approach?
You could try SipHash in its form as a MAC (which requires key management, though). It is particularly well-suited for short input messages and aims at cryptographic strength. A C implementation is also available.
But if you really care about someone actively messing around with your files, you shouldn't restrict yourself to 64 bits of security. 64 bits can be broken even by brute force today, given enough time and resources. You should use SHA-256 or stronger for that. Or let me state it the other way round, blacklisting broken options: don't use MD5 (or MD-anything for that matter). Use SHA-1 only if you can't use SHA-256 for some reason.
Using a hash also has the advantage that you don't need to manage any keys (opposed to using a MAC). You should just keep the hashes you compute in a different place than the files you are about to monitor - otherwise somebody tampering with your files can easily tamper with the checksum, too.
Regarding whether truncating hashes is good or bad
In theory, I can't see why it should be wrong to truncate a let's say 160 bit hash value down to 64 bit, regardless of whether you take the most significant bits or the least significant bits or pick them using any arbitrary pattern. The only reason why this isn't done more often that I can think of is efficiency - why bring the big guns if there are more efficient algorithms for handling the smaller problems.
In what follows, I assume a cryptographically secure hash for this purpose, general-purpose hashes are quite a different topic - they might expose attack surfaces when truncated for all I know.
But, for a cryptographically secure hash, unless the algorithm is broken, we can assume that its output is indistinguishable from that of a uniformly distributed random variable.
If we truncate this value now, we don't offer any further insight into the inner workings of the algorithm. Still, we do weaken the security by the simple fact that brute-forcing (be it collisions or finding pre-images) now takes less time by laws of probability.
For example, finding a collision for a 64 bit hash takes roughly 2^32 attempts on average, according to the birthday paradox. If you truncate your output down to the least significant 32 bits of the original 64 bit hash, then you will find collisions in time roughly 2^16, because you simply ignore the most significant 32 bits and the de-facto uniform distribution does the rest - it's like you started searching for collisions with a 32 bit value in the first place.
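To put slightly more precise numbers on that estimate: for an ideal n-bit hash with uniformly distributed output, the expected number of inputs tried before the first collision is commonly approximated as follows (a standard birthday-bound approximation, not something specific to any one hash function).

```latex
\mathbb{E}[\text{attempts}] \approx \sqrt{\tfrac{\pi}{2}\, 2^{n}} \approx 1.25 \cdot 2^{n/2}
\qquad
n = 64:\ \approx 1.25 \cdot 2^{32} \approx 5.4 \times 10^{9},
\qquad
n = 32:\ \approx 1.25 \cdot 2^{16} \approx 8.2 \times 10^{4}
```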
It's a bad idea. Hash function values are always meant to be taken as a whole.
For the implied question of "how to calculate a 64 bit hash": what's your intended use? Remember that 64 bits are too few for a crypto-strength hash function.
Use CRC to protect against random changes.
Use HMAC to protect against an attacker changing your files. HMAC uses a secret key that is necessary to generate and verify the tags. The result of an HMAC is as long as the output of the underlying hash function (e.g. 20 bytes for an HMAC-SHA1), but it is frequently truncated. For example, according to NIST SP 800-107 (p. 14), 64-96 bits should be enough for most applications.
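For the CRC option, a minimal bitwise CRC-32 sketch is shown below. It assumes the common reflected IEEE 802.3 polynomial (0xEDB88320), the variant used by zlib and PNG; the function name `crc32_ieee` is chosen for this example. Real code would normally use a table-driven or hardware-assisted implementation for speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). Slow but short. */
uint32_t crc32_ieee(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int k = 0; k < 8; k++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

The usual check value for this variant is `crc32_ieee("123456789", 9) == 0xCBF43926`.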
64 bits is small for a hash and usually, hashes are meant to be taken as a whole.
Now, what do you need these 64 bits for? The answer will depend on the expected usage.
Keep in mind that md5 is quite broken nowadays and 64 bits is very low security.
If you just need integrity checking against random changes, then a simple checksum as given in the other answers may be enough.
If you need cryptographic strength to ensure the original content, then 64 bit is too weak. Better use the full value of an unbroken algorithm, i.e. not MD5. SHA1 is still okay, but for longer term security better use SHA256. Or even go further with HMAC, as mentioned in the other answer.
There is nothing wrong with using the truncated value of a cryptographic hash. In fact, SHA224/384 are calculated by calculating a SHA256/512 hash with a different initialization vector and then truncating the result. However, this is only valid for cryptographic hashes. It may be a bad idea for normal checksums and table hashes.
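As a sketch of what truncation looks like in practice, the helper below keeps the first 8 bytes of an already computed digest as a big-endian 64-bit value. The digest itself is assumed to come from a real cryptographic library; `truncate_digest64` is a hypothetical name for this example, and picking the leading bytes is just one of the equally valid choices discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Keep the first 8 bytes of a digest (e.g. a 20-byte SHA-1 or a 32-byte
 * SHA-256 output) as a big-endian 64-bit integer. */
uint64_t truncate_digest64(const unsigned char *digest, size_t digest_len)
{
    uint64_t out = 0;
    size_t n = digest_len < 8 ? digest_len : 8;
    for (size_t i = 0; i < n; i++) {
        out = (out << 8) | digest[i];
    }
    return out;
}
```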
Use OpenSSL's API for the calculations (www.openssl.org).

32-bit checksum algorithm better quality than CRC32?

Is there any 32-bit checksum algorithm with either:
Smaller hash collision probability for input data sizes < 1 KB ?
Collision hits with more uniform distribution.
These relative to CRC32. I'm practically not counting on first property, because of limitation of storage space of 32 bits. But for the second ... seems there could be improvements.
Any ideas ? Thanks. (I need concrete implementation, better in C, but C++/ C# or anything to start with is also OK).
How about MurmurHash? It is said, that this hash has good distribution (passes chi-square tests) and good avalanche effect. Also very good computing speed.
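To illustrate the avalanche effect mentioned here, this is the 32-bit finalizer step of MurmurHash3 as I recall it from the smhasher source; treat the constants as quoted from memory and check them against upstream before relying on them. `bit_diff` is a hypothetical helper added only to count how many output bits differ between two values.

```c
#include <stdint.h>

/* MurmurHash3 32-bit finalizer (fmix32): mixes the bits of h so that
 * small input differences spread over the whole word. */
uint32_t fmix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Number of bit positions in which a and b differ. */
unsigned bit_diff(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    unsigned n = 0;
    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}
```

Calling `bit_diff(fmix32(x), fmix32(x ^ 1u))` over many values of x gives a rough feel for avalanche; for a good mixer the average should be close to 16 of the 32 bits.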
Not for the first criterion. Any well designed hash function with a 32 bit output has a 1 in 2^32 chance of a collision for any pair of inputs. The second criterion is not very well defined, although there are surely some statistical tests that could be used, and I'm sure someone has done it (chi-square for collision intervals?). As for needing an implementation, I strongly recommend that you not accept any proposed code for a hash function that is not an implementation of a well known hash, as there is a high risk of security problems or poor performance when rolling your own hash or encryption. A well known but bad hash function is better than one you designed yourself, even if the latter one tests well and has a 'good' collision distribution, simply because the former has more eyeballs on it.

How high do I have to count before I hit an MD5 hash collision?

Never mind why I'm doing this -- this is mainly theoretical.
If I were MD5 hashing string representations of integers, how high would I have to count before two of the hashes collide?
This problem (in the generic case) is known as the birthday paradox.
The probability of collision in generic case can be computed easily. However, in your particular case, you have to actually compute (and store!) each MD5.
EDIT @Scott: not really. The pigeonhole principle (being just a particular case of the birthday problem) would say that, with 2^128 possible MD5 values, we will surely have a collision after 1 + 2^128 tries. The birthday paradox says that the probability of a collision will be greater than 0.5 after about 2^64 MD5 values.
With these estimates for storage requirements, it's up to you to decide if the problem is worth it. To me, it is not.
Apparently, one can base a thesis on this very thing (or similar problems, anyway). I haven't read it, but maybe something in Stevens' thesis will help you (it's apparently linked from the Wikipedia article).
In a perfect world, up to 1 + 2^128. But I doubt MD5 is perfect. I can't give you a number, but it is guaranteed to be <= 1 + 2^128.
Here is a scientific way to find out an estimate of how high you would have to count.
Make MD5 hash that is cut down to say 4 bits. Calculate that (make sure you calculate until you reach say 100 collisions so you get a good average)
Then make the same thing at 8 bits (again, wait for many collisions so you can calculate an average).
Do it again and again until you have averages for 4, 8, 12, 16 bits and then see if you can find a trend. Follow that trend up to 128 bits
You may want to xor all 128 bits to come up with your shorter version. Taking the first or last part may not be the best test.
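The experiment described above could be sketched roughly as follows. To keep the example self-contained it does NOT use MD5: `mix64` is a stand-in 64-bit mixer (the SplitMix64 finalizer), and you would replace it with a real MD5 call folded down to the desired width. The names `count_until_collision` and `average_count` are invented for this sketch, and only widths up to 16 bits are supported.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for MD5: SplitMix64 finalizer. Replace with a real digest. */
static uint64_t mix64(uint64_t z)
{
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Hash start, start+1, ... truncated to `bits` bits and return how many
 * inputs were hashed when the first repeated value appeared.
 * Returns 0 if bits is outside 1..16. */
unsigned count_until_collision(unsigned bits, uint64_t start)
{
    static unsigned char seen[1u << 16];
    if (bits == 0 || bits > 16) return 0;
    memset(seen, 0, sizeof seen);
    uint64_t mask = (1ULL << bits) - 1;
    unsigned count = 0;
    for (uint64_t x = start;; x++) {
        uint64_t h = mix64(x) & mask;
        count++;
        if (seen[h]) return count;  /* pigeonhole guarantees termination */
        seen[h] = 1;
    }
}

/* Average over several runs, each starting at a different integer. */
double average_count(unsigned bits, unsigned trials)
{
    double total = 0.0;
    for (unsigned t = 0; t < trials; t++) {
        total += count_until_collision(bits, (uint64_t)t * 1000003u);
    }
    return trials ? total / trials : 0.0;
}
```

Printing `average_count(bits, 100)` for bits = 4, 8, 12, 16 should show the average growing by roughly a factor of 4 per step, i.e. proportional to 2^(bits/2), which is the trend to extrapolate.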

What is the fastest substring search algorithm?

OK, so I don't sound like an idiot I'm going to state the problem/requirements more explicitly:
Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
Function should return a pointer to the first match, or NULL if no match is found.
Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
...as well as what I mean by "fastest":
Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results).
Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".
My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way.
Update: My current optimal algorithm is as follows:
For needles of length 1, use strchr.
For needles of length 2-4, use machine words to compare 2-4 bytes at once as follows: Preload needle in a 16- or 32-bit integer with bitshifts and cycle old byte out/new bytes in from the haystack at each iteration. Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison.
For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
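The machine-word idea for short needles can be sketched for the 2-byte case as below. This is an illustration of the technique as described, not the asker's actual code; `find2` is a hypothetical name, and the needle is assumed to be exactly two non-zero bytes long. Each haystack byte is read once and shifted into a 16-bit window.

```c
#include <stddef.h>
#include <stdint.h>

/* Find a needle of exactly two bytes in a NUL-terminated haystack by
 * keeping the last two haystack bytes in a 16-bit window. */
const char *find2(const char *haystack, const char *needle)
{
    const unsigned char *h = (const unsigned char *)haystack;
    const unsigned char *n = (const unsigned char *)needle;
    if (h[0] == 0) return NULL;
    uint16_t nw = (uint16_t)((n[0] << 8) | n[1]);
    uint16_t hw = (uint16_t)((h[0] << 8) | h[1]);
    h++;                          /* h points at the newest byte in hw */
    while (*h) {
        if (hw == nw) return (const char *)(h - 1);
        h++;
        hw = (uint16_t)((hw << 8) | *h);  /* old byte out, new byte in */
    }
    return NULL;                  /* a window holding the NUL never matches */
}
```

The 3- and 4-byte cases follow the same pattern with a 32-bit window (masking off the top byte for length 3).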
The big questions left in my mind are:
Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the O(m) (where m is needle length) in space algorithms could be used for m<100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.
Bonus points for:
Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)
Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.
Boyer-Moore uses a bad character table with a good suffix table.
Boyer-Moore-Horspool uses a bad character table.
Knuth-Morris-Pratt uses a partial match table.
Rabin-Karp uses running hashes.
They all trade overhead for reduced comparisons to a different degree, so the real world performance will depend on the average lengths of both the needle and haystack. The more initial overhead, the better with longer inputs. With very short needles, brute force may win.
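As a concrete instance of the bad character table, here is a minimal Boyer-Moore-Horspool sketch. It takes explicit lengths for simplicity (unlike the NUL-terminated interface the question asks for), the name `horspool` is chosen for this example, and the 256-entry table is exactly the preprocessing overhead that makes it a poor fit for very short needles.

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: after each failed window, skip ahead by a
 * distance that depends only on the last byte of the current window. */
const char *horspool(const char *hay, size_t n, const char *needle, size_t m)
{
    size_t shift[256];
    if (m == 0) return hay;
    if (n < m) return NULL;
    /* Bytes not in the needle allow a full-needle-length shift. */
    for (size_t i = 0; i < 256; i++) shift[i] = m;
    /* For every needle byte except the last: distance to the needle end. */
    for (size_t i = 0; i + 1 < m; i++) {
        shift[(unsigned char)needle[i]] = m - 1 - i;
    }
    size_t pos = 0;
    while (pos <= n - m) {
        if (memcmp(hay + pos, needle, m) == 0) return hay + pos;
        pos += shift[(unsigned char)hay[pos + m - 1]];
    }
    return NULL;
}
```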
A different algorithm might be best for finding base pairs, English phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.
Think about the following little table. Each question mark might have a different best search algorithm.
                  short needle    long needle
short haystack         ?               ?
long haystack          ?               ?
This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.
If I needed a sample set, I think I would scrape a site like google or wikipedia, then strip the html from all the result pages. For a search site, type in a word then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real world samples if possible.
I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under 2^20, and long as up to 2^30 characters.
Published in 2011, I believe it may very well be the "Simple Real-Time Constant-Space String Matching" algorithm by Dany Breslauer, Roberto Grossi, and Filippo Mignosi.
In 2014 the authors published this improvement: Towards optimal packed string matching.
I was surprised to see our tech report cited in this discussion; I am one of the authors of the algorithm that was named Sustik-Moore above. (We did not use that term in our paper.)
I wanted here to emphasize that for me the most interesting feature of the algorithm is that it is quite simple to prove that each letter is examined at most once. For earlier Boyer-Moore versions they proved that each letter is examined at most 3 and later 2 times at most, and those proofs were more involved (see cites in paper). Therefore I also see a didactical value in presenting/studying this variant.
In the paper we also describe further variations that are geared toward efficiency while relaxing the theoretical guarantees. It is a short paper and the material should be understandable to an average high school graduate in my opinion.
Our main goal was to bring this version to the attention of others who can further improve on it. String searching has so many variations and we alone cannot possibly think of all where this idea could bring benefits. (Fixed text and changing pattern, fixed pattern different text, preprocessing possible/not possible, parallel execution, finding matching subsets in large texts, allow errors, near matches etc., etc.)
The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.
Solutions to most search problems involve trade-offs with respect to pre-processing overhead, time and space requirements. No single algorithm will be optimal or practical in all cases.
If your objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say. If you want to develop a generalized string searching service routine, then try the following:
Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades poorly. For example, brute force is probably the best for needles of length 1 but quickly degrades as needle length increases, whereupon the Sustik-Moore algorithm may become more efficient (over small alphabets), then for longer needles and larger alphabets, the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy.
The multiple algorithm approach is not a new idea. I believe it has been employed by a few commercial Sort/Search packages (e.g. SYNCSORT, commonly used on mainframes, implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs).
Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.
Benchmark your service to categorize the areas where additional search strategies are needed or to more effectively tune your selector function. This approach is not quick or easy but if done well can produce very good results.
The fastest substring search algorithm is going to depend on the context:
the alphabet size (e.g. DNA vs English)
the needle length
The 2010 paper "The Exact String Matching Problem: a Comprehensive Experimental Evaluation" gives tables with runtimes for 51 algorithms (with different alphabet sizes and needle lengths), so you can pick the best algorithm for your context.
All of those algorithms have C implementations, as well as a test suite, here:
A really good question. Just add some tiny bits...
Someone was talking about DNA sequence matching. But for DNA sequences, what we usually do is build a data structure (e.g. suffix array, suffix tree or FM-index) for the haystack and match many needles against it. This is a different question.
It would be really great if someone would like to benchmark various algorithms. There are very good benchmarks on compression and the construction of suffix arrays, but I have not seen a benchmark on string matching. Potential haystack candidates could be from the SACA benchmark.
A few days ago I was testing the Boyer-Moore implementation from the page you recommended (EDIT: I need a function call like memmem(), but it is not a standard function, so I decided to implement it). My benchmarking program uses random haystack. It seems that the Boyer-Moore implementation in that page is times faster than glibc's memmem() and Mac's strnstr(). In case you are interested, the implementation is here and the benchmarking code is here. This is definitely not a realistic benchmark, but it is a start.
A faster "Search for a single matching character" (ala strchr) algorithm.
Important notes:
These functions use a "number / count of (leading|trailing) zeros" gcc compiler intrinsic- __builtin_ctz. These functions are likely to only be fast on machines that have an instruction(s) that perform this operation (i.e., x86, ppc, arm).
These functions assume the target architecture can perform 32 and 64 bit unaligned loads. If your target architecture does not support this, you will need to add some start up logic to properly align the reads.
These functions are processor neutral. If the target CPU has vector instructions, you might be able to do (much) better. For example, the strlen function below uses SSE3 and can be trivially modified to XOR the bytes scanned to look for a byte other than 0. Benchmarks performed on a 2.66GHz Core 2 laptop running Mac OS X 10.6 (x86_64):
843.433 MB/s for strchr
2656.742 MB/s for findFirstByte64
13094.479 MB/s for strlen
... a 32-bit version:
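#include <stdint.h>
/* Returns the 1-based index (in memory order) of the first zero byte in x, or 0 if there is none. */
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu); (_x == 0u) ? 0 : (__builtin_clz(_x) >> 3) + 1; })
#else
/* __builtin_ctz(0) is undefined, so the no-match case is tested explicitly. */
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu); (_x == 0u) ? 0 : (__builtin_ctz(_x) + 1) >> 3; })
#endif
/* Scans 4 bytes at a time (unaligned loads) until a word containing byte is found; there is no length bound or terminator check. */
unsigned char *findFirstByte32(unsigned char *ptr, unsigned char byte) {
    uint32_t *ptr32 = (uint32_t *)ptr, firstByte32 = 0u, byteMask32 = (byte) | (byte << 8);
    byteMask32 |= byteMask32 << 16;
    while((firstByte32 = findFirstZeroByte32((*ptr32) ^ byteMask32)) == 0) { ptr32++; }
    return(ptr + ((((unsigned char *)ptr32) - ptr) + firstByte32 - 1));
}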
... and a 64-bit version:
/* Same idea with 64-bit words: 1-based index of the first zero byte in x, or 0 if there is none. */
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7f7f7f7full) + 0x7F7F7F7F7f7f7f7full) | _x | 0x7F7F7F7F7f7f7f7full); (_x == 0ull) ? 0 : (__builtin_clzll(_x) >> 3) + 1; })
#else
/* __builtin_ctzll(0) is undefined, so the no-match case is tested explicitly. */
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7f7f7f7full) + 0x7F7F7F7F7f7f7f7full) | _x | 0x7F7F7F7F7f7f7f7full); (_x == 0ull) ? 0 : (__builtin_ctzll(_x) + 1) >> 3; })
#endif
/* Scans 8 bytes at a time (unaligned loads) until a word containing byte is found; there is no length bound or terminator check. */
unsigned char *findFirstByte64(unsigned char *ptr, unsigned char byte) {
    uint64_t *ptr64 = (uint64_t *)ptr, firstByte64 = 0u, byteMask64 = (byte) | (byte << 8);
    byteMask64 |= byteMask64 << 16;
    byteMask64 |= byteMask64 << 32;
    while((firstByte64 = findFirstZeroByte64((*ptr64) ^ byteMask64)) == 0) { ptr64++; }
    return(ptr + ((((unsigned char *)ptr64) - ptr) + firstByte64 - 1));
}
Edit 2011/06/04 The OP points out in the comments that this solution has an "insurmountable bug":
it can read past the sought byte or null terminator, which could access an unmapped page or page without read permission. You simply cannot use large reads in string functions unless they're aligned.
This is technically true, but applies to virtually any algorithm that operates on chunks that are larger than a single byte, including the method suggested by the OP in the comments:
A typical strchr implementation is not naive, but quite a bit more efficient than what you gave. See the end of this for the most widely used algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
It also really has nothing to do with alignment per-se. True, this could potentially cause the behavior discussed on the majority of common architectures in use, but this has more to do with microarchitecture implementation details- if the unaligned read straddles a 4K boundary (again, typical), then that read will cause a program terminating fault if the next 4K page boundary is unmapped.
But this isn't a "bug" in the algorithm given in the answer- that behavior is because functions like strchr and strlen do not accept a length argument to bound the size of the search. Searching char bytes[1] = {0x55};, which for the purposes of our discussion just so happens to be placed at the very end of a 4K VM page boundary and the next page is unmapped, with strchr(bytes, 0xAA) (where strchr is a byte-at-a-time implementation) will crash exactly the same way. Ditto for strchr related cousin strlen.
Without a length argument, there is no way to tell when you should switch out of the high speed algorithm and back to a byte-by-byte algorithm. A much more likely "bug" would be to read "past the size of the allocation", which technically results in undefined behavior according to the various C language standards, and would be flagged as an error by something like valgrind.
In summary, anything that operates on larger than byte chunks to go faster, as this answer's code does and the code pointed out by the OP, but must have byte-accurate read semantics is likely to be "buggy" if there is no length argument to control the corner case(s) of "the last read".
The code in this answer is a kernel for being able to find the first byte in a natural CPU word size chunk quickly if the target CPU has a fast ctz-like instruction. It is trivial to add things like making sure it only operates on correctly aligned natural boundaries, or some form of length bound, which would allow you to switch out of the high speed kernel and into a slower byte-by-byte check.
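One way the length-bounded variant might look is sketched below. This is an assumption about how to combine the pieces, not code from the answer: `find_byte_bounded` is a hypothetical name, the word test is the zero-in-word trick from the bithacks page linked earlier, and `memcpy` is used for the 8-byte loads so that no unaligned pointer is dereferenced. The word loop never reads past `len`, and the final bytes are checked one at a time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES  0x0101010101010101ULL
#define HIGHS 0x8080808080808080ULL

/* memchr-like search: word-at-a-time while at least 8 bytes remain,
 * then byte-by-byte for the word that matched and for the tail. */
const unsigned char *find_byte_bounded(const unsigned char *p, size_t len,
                                       unsigned char byte)
{
    uint64_t mask = ONES * byte;      /* byte replicated into all 8 lanes */
    while (len >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);             /* well-defined unaligned load */
        w ^= mask;                    /* matching bytes become zero */
        if ((w - ONES) & ~w & HIGHS)  /* nonzero iff some byte is zero */
            break;
        p += 8;
        len -= 8;
    }
    while (len--) {
        if (*p == byte) return p;
        p++;
    }
    return NULL;
}
```

Because the word test only answers "is there a match somewhere in these 8 bytes", the sketch stays endian-neutral and does not need ctz at all; adding ctz back would only speed up locating the byte inside the matching word.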
The OP also states in the comments:
As for your ctz optimization, it only makes a difference for the O(1) tail operation. It could improve performance with tiny strings (e.g. strchr("abc", 'a'); but certainly not with strings of any major size.
Whether or not this statement is true depends a great deal on the microarchitecture in question. Using the canonical 4 stage RISC pipeline model, then it is almost certainly true. But it is extremely hard to tell if it is true for a contemporary out-of-order super scalar CPU where the core speed can utterly dwarf the memory streaming speed. In this case, it is not only plausible, but quite common, for there to be a large gap in "the number of instructions that can be retired" relative to "the number of bytes that can be streamed" so that you have "the number of instructions that can be retired for each byte that can be streamed". If this is large enough, the ctz + shift instruction can be done "for free".
I know it's an old question, but most bad shift tables are single character. If it makes sense for your dataset (eg especially if it's written words), and if you have the space available, you can get a dramatic speedup by using a bad shift table made of n-grams rather than single characters.
Here's Python's search implementation, used throughout the core. The comments indicate it uses a compressed Boyer-Moore delta 1 table.
I have done some pretty extensive experimentation with string searching myself, but it was for multiple search strings. Assembly implementations of Horspool and Bitap can often hold their own against algorithms like Aho-Corasick for low pattern counts.
Just search for "fastest strstr", and if you see something of interest just ask me.
In my view you impose too many restrictions on yourself (yes we all want sub-linear linear at max searcher), however it takes a real programmer to step in, until then I think that the hash approach is simply a nifty-limbo solution (well reinforced by BNDM for shorter 2..16 patterns).
Just a quick example:
Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 3041%, 6801754 skips/iterations
Railgun_Quadruplet_7Hasherezade_hits/Railgun_Quadruplet_7Hasherezade_clocks: 0/58
Railgun_Quadruplet_7Hasherezade performance: 3483KB/clock
Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 1554%, 13307181 skips/iterations
Boyer_Moore_Flensburg_hits/Boyer_Moore_Flensburg_clocks: 0/83
Boyer_Moore_Flensburg performance: 2434KB/clock
Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 129%, 160239051 skips/iterations
Two-Way_hits/Two-Way_clocks: 0/816
Two-Way performance: 247KB/clock
The Two-Way Algorithm that you mention in your question (which by the way is incredible!) has recently been improved to work efficiently on multibyte words at a time: Optimal Packed String Matching.
I haven't read the whole paper, but it seems they rely on a couple of new, special CPU instructions (included in e.g. SSE 4.2) being O(1) for their time complexity claim, though if they aren't available they can simulate them in O(log log w) time for w-bit words which doesn't sound too bad.
You could implement, say, 4 different algorithms. Every M minutes (to be determined empirically) run all 4 on current real data. Accumulate statistics over N runs (also TBD). Then use only the winner for the next M minutes.
Log stats on Wins so that you can replace algorithms that never win with new ones. Concentrate optimization efforts on the winningest routine. Pay special attention to the stats after any changes to the hardware, database, or data source. Include that info in the stats log if possible, so you won't have to figure it out from the log date/time-stamp.
I recently discovered a nice tool to measure the performance of the various available algos:
You might find it useful.
Also, if I have to take a quick call on substring search algorithm, I would go with Knuth-Morris-Pratt.
The fastest is currently EPSM, by S. Faro and O. M. Kulekci.
See https://smart-tool.github.io/smart/
"Exact Packed String Matching" optimized for SIMD SSE4.2 (x86_64 and aarch64). It performs stable and best on all sizes.
The site I linked to compares 199 fast string search algorithms, with the usual ones (BM, KMP, BMH) being pretty slow. EPSM outperforms all the others being mentioned here on these platforms. It's also the latest.
Update 2020: EPSM was recently optimized for AVX and is still the fastest.
You might also want to have diverse benchmarks with several types of strings, as this may have a great impact on performance. The algorithms will perform differently depending on whether you search natural language (and even here there might still be fine-grained distinctions because of the different morphologies), DNA strings, random strings, etc.
Alphabet size will play a role in many algorithms, as will needle size. For instance, Horspool does well on English text but badly on DNA because of the different alphabet size, which makes life hard for the bad-character rule. Introducing the good-suffix rule alleviates this greatly.
Use stdlib strstr:
char *foundit = strstr(haystack, needle);
It was very fast, only took me about 5 seconds to type.
I don't know if it's the absolute best, but I've had good experience with Boyer-Moore.

Finding prime factors to large numbers using specially-crafted CPUs

My understanding is that many public key cryptographic algorithms these days depend on large prime numbers to make up the keys, and it is the difficulty in factoring the product of two primes that makes the encryption hard to break. It is also my understanding that one of the reasons that factoring such large numbers is so difficult, is that the sheer size of the numbers used means that no CPU can efficiently operate on the numbers, since our minuscule 32 and 64 bit CPUs are no match for 1024, 2048 or even 4096 bit numbers. Specialized Big Integer math libraries must be used in order to process those numbers, and those libraries are inherently slow since a CPU can only hold (and process) small chunks (like 32 or 64 bits) at one time.
Why can't you build a highly specialized custom chip with 2048 bit registers, and giant arithmetic circuits, much in the same way that we scaled from 8 to 16 to 32 to 64-bit CPUs, just build one a LOT larger? This chip wouldn't need most of the circuitry on conventional CPUs, after all it wouldn't need to handle things like virtual memory, multithreading or I/O. It wouldn't even need to be a general-purpose processor supporting stored instructions. Just the bare minimum to perform the necessary arithmetical calculations on ginormous numbers.
I don't know a whole lot about IC design, but I do remember learning about how logic gates work, how to build a half adder, full adder, then link together a bunch of adders to do multi-bit arithmetic. Just scale up. A lot.
Now, I'm fairly certain that there is a very good reason (or 17) that the above won't work (since otherwise one of the many people smarter than I am would have already done it) but I am interested in knowing why it won't work.
(Note: This question may need some re-working, as I'm not even sure yet if the question makes sense)
What @cube said, and the fact that a giant arithmetic logic unit would take more time for the logic signals to stabilize, and include other complications in digital design. Digital logic design includes something that you take for granted in software, namely that signals through combinational logic take a small but nonzero time to propagate and settle. A 32x32 multiplier needs to be designed carefully. A 1024x1024 multiplier would not only take a huge amount of physical resources in a chip, but it also would be slower than a 32x32 multiplier (though perhaps faster than a 32x32 multiplier computing all the partial products needed to perform a 1024x1024 multiply). Plus it's not only the multiplier that's the bottleneck: you've got memory pathways. You'd have to spend a bunch of time gathering the 1024 bits from a memory circuit that's only 32 bits wide, and storing the resulting 2048 bits back into the memory circuit.
Almost certainly it's better to get a bunch of "conventional" 32-bit or 64-bit systems working in parallel: you get the speedup w/o the hardware design complexity.
edit: if anyone has ACM access (I don't), perhaps take a look at this paper to see what it says.
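To give a feel for "all the partial products needed to perform a 1024x1024 multiply", here is a sketch of plain schoolbook multiplication on 32-bit limbs that also counts the 32x32 multiplies it performs. The name `bigint_mul` and the limb layout (least significant limb first) are assumptions for this example. Real big-integer libraries use faster algorithms for large operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Schoolbook multiplication: product (2n limbs) = a (n limbs) * b (n limbs).
 * Limbs are 32 bits wide, least significant first. product must not
 * overlap a or b. Returns the number of 32x32 partial products computed.
 * (Hypothetical helper, not a real library function.) */
size_t bigint_mul(uint32_t *product, const uint32_t *a, const uint32_t *b, size_t n)
{
    size_t i, j, partial_products = 0;

    assert(product != NULL && a != NULL && b != NULL);
    for (i = 0; i < 2 * n; i++)
        product[i] = 0;

    for (i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (j = 0; j < n; j++) {
            /* 32x32 -> 64-bit multiply, plus the accumulated limb and carry;
             * the total always fits in 64 bits. */
            uint64_t t = (uint64_t)a[i] * b[j] + product[i + j] + carry;
            product[i + j] = (uint32_t)t;
            carry = t >> 32;
            partial_products++;
        }
        product[i + n] = (uint32_t)carry;
    }
    return partial_products;
}
```

For 1024-bit operands `n` is 32, so this approach performs 32 * 32 = 1024 partial products, and the 2048-bit result occupies 64 limbs that have to be written back to memory.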
It's because this speedup would only be in O(n), but the complexity of factoring the number is something like O(2^n) (with respect to the number of bits). So if you made this überprocessor and factored the numbers 1000 times faster, I would only have to make the numbers 10 bits larger and we would be back at the start again.
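The "1000 times faster costs only 10 bits" arithmetic can be checked with a few lines of C. This takes the simplified 2^n cost model above at face value (the best known classical algorithms are sub-exponential, so real numbers differ, but the principle is the same). The function name `extra_bits_to_cancel` is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Under the assumed cost model work(n) = 2^n, returns the smallest k with
 * 2^k >= speedup, i.e. how many extra bits cancel a given hardware speedup.
 * (Hypothetical helper; the cost model is a simplification.) */
unsigned extra_bits_to_cancel(uint64_t speedup)
{
    unsigned k = 0;
    uint64_t work = 1;

    assert(speedup >= 1 && speedup <= (UINT64_C(1) << 63));
    while (work < speedup) {
        work <<= 1;  /* each extra bit doubles the work */
        k++;
    }
    return k;
}
```

Since 2^10 = 1024, `extra_bits_to_cancel(1000)` gives 10, and even a million-fold speedup is cancelled by 20 extra bits under this model.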
As indicated above, the primary problem is simply how many possibilities you have to go through to factor a number. That being said, specialized computers do exist to do this sort of thing.
The real progress for this sort of cryptography is improvements in number factoring algorithms. Currently, the fastest known general algorithm is the general number field sieve.
Historically, we seem to be able to factor numbers twice as large each decade. Part of that is faster hardware, and part of it is simply a better understanding of mathematics and how to perform factoring.
I can't comment on the feasibility of an approach exactly like the one you described, but people do similar things very frequently using FPGAs:
Crack DES keys
Crack GSM conversations
Open source graphics card
Shamir & Tromer suggest a similar approach, using a kind of grid computing:
This article discusses a new design for a custom hardware implementation of the sieving step, which reduces [the cost of sieving, relative to TWINKLE,] to about $10M. The new device, called TWIRL, can be seen as an extension of the TWINKLE device. However, unlike TWINKLE it does not have optoelectronic components, and can thus be manufactured using standard VLSI technology on silicon wafers. The underlying idea is to use a single copy of the input to solve many subproblems in parallel. Since input storage dominates cost, if the parallelization overhead is kept low then the resulting speedup is obtained essentially for free. Indeed, the main challenge lies in achieving this parallelism efficiently while allowing compact storage of the input. Addressing this involves myriad considerations, ranging from number theory to VLSI technology.
Why don't you try building an uber-quantum computer and run Shor's algorithm on it?
"... If a quantum computer with a sufficient number of qubits were to be constructed, Shor's algorithm could be used to break public-key cryptography schemes such as the widely used RSA scheme. RSA is based on the assumption that factoring large numbers is computationally infeasible. So far as is known, this assumption is valid for classical (non-quantum) computers; no classical algorithm is known that can factor in polynomial time. However, Shor's algorithm shows that factoring is efficient on a quantum computer, so a sufficiently large quantum computer can break RSA. ..." -Wikipedia
