strstr faster than dedicated search algorithms? (C)

I have a file that's 21056 bytes.
I've written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file for a token that's 82 chars.
I've used the implementations of the algorithms from the “Exact String Matching Algorithms” page: KMP, BM, TBM, and Horspool. Then I benchmarked each of them against strstr.
What puzzles me is that strstr outperforms all the other algorithms every time; the only one that is sometimes faster is BM.
Shouldn't strstr be the slowest?
Here's my benchmark code with an example of benchmarking BM:
double get_time()
{
    LARGE_INTEGER t, f;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&f);
    return (double)t.QuadPart / (double)f.QuadPart;
}

before = get_time();
BM(token, strlen(token), buffer, len);
after = get_time();
printf("Time: %f\n\n", after - before);
Could someone explain to me why strstr is outperforming the other search algorithms? I'll post more code on request if needed.

Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it's quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don't stand a chance of out-performing it in C for such small benchmarks.
(The reason I think this is likely is that programmers love to implement such things.)

Horspool, KMP et al are optimal at minimizing the number of byte-comparisons.
However, that's not the bottleneck on a modern processor. On an x86/64 processor, your string is being loaded into L1 cache in cache-line-width chunks (typically 64 bytes). No matter how clever your algorithm is, unless it gives you strides that are larger than that, you gain nothing; and the more complicated Horspool inner loop (at least one table lookup per character examined) can't compete.
Furthermore, you're stuck with the "C" string constraint of null-termination: SOMEWHERE the code has to examine every byte.
strstr() is expected to be optimal for a wide range of cases; e.g. searching for tiny strings like "\r\n" in a short string, as well as much longer ones where some smarter algorithm might have a hope. The basic strchr/memcmp loop is pretty hard to beat across the whole range of likely inputs.
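As a concrete picture of that basic loop, here is a minimal sketch of my own (not any particular libc's code); strncmp stands in for memcmp so the verification cannot read past the haystack's terminator:
#include <string.h>

const char *basic_strstr(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);
    if (nlen == 0)
        return haystack;
    /* strchr skips ahead to each candidate first byte... */
    for (const char *p = haystack; (p = strchr(p, needle[0])) != NULL; p++) {
        /* ...then the candidate is verified in one shot */
        if (strncmp(p, needle, nlen) == 0)
            return p;
    }
    return NULL;
}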
Pretty much all x86-compatible processors since 2003 have supported SSE2. If you disassemble glibc's strlen() for x86, you may notice that it uses the SSE2 PCMPEQB and PMOVMSKB operations to search for the null terminator 16 bytes at a time. The solution is so efficient that it beats the obvious super-simple loop for anything longer than the empty string.
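For illustration, here is a minimal sketch of that PCMPEQB/PMOVMSKB idea (not glibc's actual code); it assumes the pointer is 16-byte aligned, which a real implementation guarantees by handling the unaligned head separately, and it uses the GCC/Clang intrinsic __builtin_ctz:
#include <emmintrin.h>
#include <stddef.h>

size_t sse2_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;                       /* assumed 16-byte aligned */
    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p);
        /* PCMPEQB: bytes equal to 0 become 0xFF; PMOVMSKB: one bit per byte */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}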
I took that idea and came up with a strstr() that beats glibc's strstr() for all cases greater than 1 byte --- where the relative difference is pretty much moot. If you're interested, check out:
Convergence SSE2 and strstr()
A better strstr() with no ASM code
If you care to see a non-SSE2 solution that dominates strstr() for target strings of more than 15 bytes, check out:
which makes use of multibyte comparisons rather than strchr(), to find a point at which to do a memcmp.
BTW you've probably figured out by now that the x86 REP SCASB/REP CMPSB ops fall on their ass for anything longer than 32 bytes, and are not much improvement for shorter strings. Wish Intel had devoted a little more attention to that, than to adding SSE4.2 "string" ops.
For strings large enough to matter, my perf tests show that BNDM is better than Horspool, across the board. BNDM is more tolerant of "pathological" cases, such as targets that heavily repeat the last byte of a pattern. BNDM can also make use of SSE2 (128bit registers) in a way that competes with 32bit registers in efficiency and start-up cost. Source code here.

Without seeing your code, it's hard to say exactly. strstr is heavily optimized, and usually written in assembly language. It does things like reading data 4 bytes at a time and comparing them (bit-twiddling if necessary if the alignment isn't right) to minimize memory latency. It can also take advantage of things like SSE to load 16 bytes at a time. If your code is only loading one byte at a time, it's probably getting killed by memory latency.
Use your debugger and step through the disassembly of strstr -- you'll probably find some interesting things in there.
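For reference, the portable core of the "4 bytes at a time" approach mentioned above is the well-known zero-in-word bit trick (the same one behind the bithacks link cited further down this page). This sketch is my own illustration, with hypothetical names, and it glosses over the alignment/page-crossing caveat discussed later:
#include <stdint.h>
#include <string.h>

/* True if any byte of the 32-bit word v is zero. */
static int has_zero_byte(uint32_t v)
{
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

size_t word_strlen(const char *s)
{
    size_t i = 0;
    uint32_t w;
    for (;;) {
        memcpy(&w, s + i, 4);     /* caveat: may read a few bytes past the terminator */
        if (has_zero_byte(w))
            break;
        i += 4;
    }
    while (s[i] != '\0')          /* locate the exact byte within the word */
        i++;
    return i;
}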

Imagine you want to get something cleaned. You could just clean it yourself, or you could hire ten professional cleaners to clean it. If the cleaning job is an office building, the latter solution would be preferable. If the cleaning job was one window, the former would be preferable.
You never get any payback for the time spent setting up to do the job efficiently because the job doesn't take very long.

Related

Is strncmp faster than strcmp

Is strcmp slower than strncmp, given that one can pass a pre-calculated string length to strncmp but strcmp receives no such information?
I am writing an interpreter. I am aware that these functions are both optimized. I wonder which will be the better approach (in terms of performance), as I will be scanning the input anyway and will know the offsets, and hence the lengths.
They do different things, so comparing them directly does not make sense. strncmp compares at most the first n characters of two strings (or fewer, if a string ends sooner); strcmp compares whole strings. If n is large enough that strncmp ends up comparing the whole strings (so the behavior becomes effectively the same as strcmp), then strncmp is likely to be moderately slower because it also has to keep track of a counter, but the difference might or might not be measurable, or even present, in a given implementation. For example, an implementation of strcmp could just pass SIZE_MAX as the value of n to strncmp.
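To make that last point concrete, a conforming strcmp could be written as nothing more than a strncmp call with an effectively unbounded count (a hypothetical wrapper of my own, not any particular library's implementation):
#include <stdint.h>
#include <string.h>

int strcmp_via_strncmp(const char *a, const char *b)
{
    /* SIZE_MAX is never reached: both functions stop at the first
       difference or at the terminating null byte. */
    return strncmp(a, b, SIZE_MAX);
}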
There is only one way to know: benchmark it. Speculation is of no use.
Be sure to do that with a sufficiently large number of strings and in representative conditions (statistical distribution of string lengths and statistical distribution of matching prefix lengths).
My bet is that there will be no significant difference.
You state that performance is a problem, so let's concentrate on that.
Implementations of library functions vary from compiler vendor to compiler vendor, and also across versions of the same compiler or development environment. Thus, Yves Daoust is correct when he says "there is only one way to know: benchmark it."
I would go further and suggest that if you haven't profiled your code, you start by doing that. The bottlenecks are all too often in surprising places you'd not expect.
It may do some good, however, to compare the implementations of strcmp() and strncmp() if you have the source code.
I once found myself in very nearly the same situation you are in. (Writing a front end information display that used multiple character based terminal backends to do its job. It required repeated near-real-time parsing of several text buffers.) The Borland compiler we were using at the time had an inefficient strncmp(). Since the processor had a length-limited instruction for comparing character buffers, I wrote a specialized variant of strncmp using assembler. "Before and after" benchmarks and profiling revealed we'd removed the primary bottleneck.
Several years later when folks went back to improve and modernize that system, the compiler and its library had changed (and the processors upgraded): there was no longer any real need for the (now obsolete) special version. New benchmarks also revealed that the bottlenecks had moved due to changing compilers, necessitating different optimizations.

optimized code for reversing a string in c

I have always implemented the C code for reversing a string as:
looping i up to the length of the string, or up to half of its length,
placing pointers at the end and at the beginning of the string,
swapping the characters one by one.
But I want optimized code that reduces the time complexity of this problem, beyond the approach I mentioned. I tried a Google search but did not find any relevant solution.
If by "time complexity" you're referring to the big-O notation which excludes coefficients and lower-order terms, you will not be able to beat a simple O(n) algorithm for reversing a C string.
If you're referring to the time it takes for a specific machine (or class of machines) to execute the operation, there is a number of approaches to optimize the reversal. Typical optimizations include loop unrolling, consuming the characters machine word by machine word instead of character by character, and a smart search for the terminating NUL character. The freely available GNU libc contains examples of such optimizations.
Some of the above optimizations, such as loop unrolling, may be applied automatically by optimizing compilers. Others may be counter-productive on some platforms, or their speedup may depend on the size of the string. In some cases hand-written optimization can hinder the compiler's own effort to optimize the code. The only way to be sure you're not making things worse is to develop a benchmark that covers your intended usage and to measure meticulously as you progress.
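As an illustration of the word-by-word idea, here is a sketch of my own that reverses the bulk of the string four bytes at a time and finishes byte by byte; it assumes a GCC/Clang compiler for __builtin_bswap32, and memcpy keeps the unaligned accesses well-defined:
#include <stdint.h>
#include <string.h>

void reverse_wordwise(char *s)
{
    size_t len = strlen(s);
    size_t i = 0, j = len;
    while (j - i >= 8) {              /* at least two whole words remain */
        uint32_t a, b;
        memcpy(&a, s + i, 4);
        memcpy(&b, s + j - 4, 4);
        a = __builtin_bswap32(a);     /* reverse the bytes within each word */
        b = __builtin_bswap32(b);
        memcpy(s + i, &b, 4);
        memcpy(s + j - 4, &a, 4);
        i += 4;
        j -= 4;
    }
    while (j - i >= 2) {              /* plain swap for the middle remainder */
        char t = s[i];
        s[i++] = s[--j];
        s[j] = t;
    }
}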
#include <string.h>

void reverse(char *str)
{
    size_t fctr, bctr, len = strlen(str);
    char temp;
    for (fctr = 0, bctr = len - 1; fctr < len / 2; fctr++, bctr--) {
        temp = str[fctr];
        str[fctr] = str[bctr];
        str[bctr] = temp;
    }
}
this may work fast!

Faster to use Integers as Booleans?

From a memory access standpoint... is it worth attempting an optimization like this?
int boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
Instead of
unsigned char boolean_value = 0;
//magical code happens and boolean_value could be 0 or 1
if(boolean_value)
{
//do something
}
The unsigned char of course takes up only 1 byte as opposed to the int's 4 (assuming a 32-bit platform here), but my understanding is that it would be faster for a processor to read the integer value from memory.
It may or may not be faster, and the speed depends on so many things that a generic answer is impossible. For example: hardware architecture, compiler, compiler options, amount of data (does it fit into L1 cache?), other things competing for the CPU, etc.
The correct answer, therefore, is: try both ways and measure for your particular case.
If measurement does not indicate that one method is significantly faster than the other, opt for the one that is clearer.
From a memory access standpoint... is it worth attempting an optimization like this?
Probably not. In almost all modern processors, memory will get fetched based on the word size of the processor. In your case, even to get one byte of memory out, your processor probably fetches the entire 32-bit word or more based on the caching of that processor. Your architecture may vary, so you will want to understand how your CPU works to gauge.
But as others have said, it doesn't hurt to try it and measure it.
This is almost never a good idea. Many systems can only read word-sized chunks from memory at once, so reading a byte then masking or shifting will actually take more code space and just as much (data) memory. If you're using an obscure tiny system, measure, but in general this will actually slow down and bloat your code.
Asking how much memory unsigned char takes versus int is only meaningful when it's in an array (or possibly a structure, if you're careful to order the elements to take care of alignment). As a lone variable, it's very unlikely that you save any memory at all, and the compiler is likely to generate larger code to truncate the upper bits of registers.
As a general policy, never use smaller-than-int types except in arrays unless you have a really good reason other than trying to save space.
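To put a number on the array case, here is a trivial standalone example (sizes assume a typical platform with a 4-byte int):
#include <stdio.h>

int main(void)
{
    int           int_flags[1024];   /* typically 4096 bytes */
    unsigned char byte_flags[1024];  /* 1024 bytes */

    printf("int array: %zu bytes, unsigned char array: %zu bytes\n",
           sizeof int_flags, sizeof byte_flags);
    return 0;
}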
Follow the standard rules of optimization. First, don't optimize. Then test if your code needs it at some point. Then optimize that point. This link provides an excellent intro to the topic of optimization.
http://www.catb.org/~esr/writings/taoup/html/optimizationchapter.html

What is the fastest substring search algorithm?

OK, so that I don't sound like an idiot, I'm going to state the problem/requirements more explicitly:
Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
Function should return a pointer to the first match, or NULL if no match is found.
Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need a fallback for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
...as well as what I mean by "fastest":
Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results).
Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".
My current implementation runs roughly between 10% slower and 8 times faster than glibc's implementation of Two-Way, depending on the input.
Update: My current optimal algorithm is as follows:
For needles of length 1, use strchr.
For needles of length 2-4, use machine words to compare 2-4 bytes at once as follows: preload the needle in a 16- or 32-bit integer with bitshifts, and cycle the old byte out / new byte in from the haystack at each iteration. Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison. (A sketch of the two-byte case follows after this list.)
For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
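Here is a sketch of the length-2 case of that rolling-word technique, written from the description above rather than copied from any actual implementation; it assumes the caller has already verified that the haystack is at least two bytes long and that the needle bytes are non-zero:
#include <stddef.h>
#include <stdint.h>

static const char *twobyte_search(const unsigned char *h, const unsigned char *n)
{
    uint16_t nw = (uint16_t)(n[0] << 8 | n[1]);   /* preloaded needle word */
    uint16_t hw = (uint16_t)(h[0] << 8 | h[1]);   /* rolling haystack word */
    /* one load, one zero check and one 16-bit compare per haystack byte */
    for (h++; *h && hw != nw; hw = (uint16_t)((hw << 8) | *++h))
        ;
    return hw == nw ? (const char *)(h - 1) : NULL;
}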
The big questions left in my mind are:
Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the O(m) (where m is needle length) in space algorithms could be used for m<100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.
Bonus points for:
Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)
Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.
Boyer-Moore uses a bad character table with a good suffix table.
Boyer-Moore-Horspool uses a bad character table.
Knuth-Morris-Pratt uses a partial match table.
Rabin-Karp uses running hashes.
They all trade overhead for reduced comparisons to a different degree, so the real world performance will depend on the average lengths of both the needle and haystack. The more initial overhead, the better with longer inputs. With very short needles, brute force may win.
Edit:
A different algorithm might be best for finding base pairs, english phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.
Think about the following little table. Each question mark might have a different best search algorithm.
                   short needle    long needle
short haystack          ?               ?
long haystack           ?               ?
This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.
If I needed a sample set, I think I would scrape a site like google or wikipedia, then strip the html from all the result pages. For a search site, type in a word then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real world samples if possible.
I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10 characters, medium as under 2^20, and long as up to 2^30 characters.
Published in 2011, I believe it may very well be the "Simple Real-Time Constant-Space String Matching" algorithm by Dany Breslauer, Roberto Grossi, and Filippo Mignosi.
Update:
In 2014 the authors published this improvement: Towards optimal packed string matching.
I was surprised to see our tech report cited in this discussion; I am one of the authors of the algorithm that was named Sustik-Moore above. (We did not use that term in our paper.)
I wanted to emphasize here that, for me, the most interesting feature of the algorithm is that it is quite simple to prove that each letter is examined at most once. For earlier Boyer-Moore versions it was proved that each letter is examined at most 3 times, and later at most 2 times, and those proofs were more involved (see the citations in the paper). Therefore I also see a didactic value in presenting/studying this variant.
In the paper we also describe further variations that are geared toward efficiency while relaxing the theoretical guarantees. It is a short paper and the material should be understandable to an average high school graduate in my opinion.
Our main goal was to bring this version to the attention of others who can further improve on it. String searching has so many variations and we alone cannot possibly think of all where this idea could bring benefits. (Fixed text and changing pattern, fixed pattern different text, preprocessing possible/not possible, parallel execution, finding matching subsets in large texts, allow errors, near matches etc., etc.)
The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.
Solutions to most search problems involve trade-offs with respect to pre-processing overhead and time and space requirements. No single algorithm will be optimal or practical in all cases.
If your objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say. If you want to develop a generalized string-searching service routine, then try the following:
Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then build a front-end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades poorly. For example, brute force is probably the best for needles of length 1, but quickly degrades as needle length increases, whereupon the Sustik-Moore algorithm may become more efficient (over small alphabets); then, for longer needles and larger alphabets, the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy. (A sketch of such a selector appears at the end of this answer.)
The multiple-algorithm approach is not a new idea. I believe it has been employed by a few commercial sort/search packages (e.g. SYNCSORT, commonly used on mainframes, implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs).
Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.
Benchmark your service to categorize the areas where additional search strategies are needed, or to more effectively tune your selector function. This approach is not quick or easy, but if done well it can produce very good results.
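The sketch below illustrates such a front-end selector. The length cut-offs are placeholders to be tuned by benchmarking, and the per-algorithm routines are stand-in stubs (all names here are hypothetical), so the code compiles but does not implement the real algorithms:
#include <string.h>

/* Stand-ins: each would wrap one of the algorithms discussed above. */
static char *bruteforce_search(const char *h, const char *n)   { return strstr(h, n); }
static char *sustik_moore_search(const char *h, const char *n) { return strstr(h, n); }
static char *two_way_search(const char *h, const char *n)      { return strstr(h, n); }

/* Classifier: route each query to the routine expected to do best. */
char *selector_search(const char *haystack, const char *needle)
{
    size_t nlen = strlen(needle);

    if (nlen == 0)  return (char *)haystack;
    if (nlen == 1)  return strchr(haystack, needle[0]);
    if (nlen <= 4)  return bruteforce_search(haystack, needle);
    if (nlen <= 32) return sustik_moore_search(haystack, needle);
    return two_way_search(haystack, needle);
}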
The fastest substring search algorithm is going to depend on the context:
the alphabet size (e.g. DNA vs English)
the needle length
The 2010 paper "The Exact String Matching Problem: a Comprehensive Experimental Evaluation" gives tables with runtimes for 51 algorithms (with different alphabet sizes and needle lengths), so you can pick the best algorithm for your context.
All of those algorithms have C implementations, as well as a test suite, here:
http://www.dmi.unict.it/~faro/smart/algorithms.php
A really good question. Just to add some tiny bits...
Someone was talking about DNA sequence matching. But for DNA sequences, what we usually do is build a data structure (e.g. a suffix array, a suffix tree or an FM-index) for the haystack and match many needles against it. This is a different question.
It would be really great if someone would benchmark various algorithms. There are very good benchmarks on compression and on the construction of suffix arrays, but I have not seen a benchmark on string matching. Potential haystack candidates could come from the SACA benchmark.
A few days ago I was testing the Boyer-Moore implementation from the page you recommended (EDIT: I needed a function call like memmem(), but it is not a standard function, so I decided to implement it). My benchmarking program uses random haystacks. It seems that the Boyer-Moore implementation on that page is several times faster than glibc's memmem() and Mac's strnstr(). In case you are interested, the implementation is here and the benchmarking code is here. This is definitely not a realistic benchmark, but it is a start.
A faster "Search for a single matching character" (ala strchr) algorithm.
Important notes:
These functions use a "number / count of (leading|trailing) zeros" GCC compiler intrinsic, __builtin_ctz. These functions are likely to be fast only on machines that have an instruction that performs this operation (i.e., x86, ppc, arm).
These functions assume the target architecture can perform 32 and 64 bit unaligned loads. If your target architecture does not support this, you will need to add some start up logic to properly align the reads.
These functions are processor neutral. If the target CPU has vector instructions, you might be able to do (much) better. For example, The strlen function below uses SSE3 and can be trivially modified to XOR the bytes scanned to look for a byte other than 0. Benchmarks performed on a 2.66GHz Core 2 laptop running Mac OS X 10.6 (x86_64) :
843.433 MB/s for strchr
2656.742 MB/s for findFirstByte64
13094.479 MB/s for strlen
... a 32-bit version:
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu); (_x == 0u) ? 0 : (__builtin_clz(_x) >> 3) + 1; })
#else
#define findFirstZeroByte32(x) ({ uint32_t _x = (x); _x = ~(((_x & 0x7F7F7F7Fu) + 0x7F7F7F7Fu) | _x | 0x7F7F7F7Fu); (__builtin_ctz(_x) + 1) >> 3; })
#endif
unsigned char *findFirstByte32(unsigned char *ptr, unsigned char byte) {
    uint32_t *ptr32 = (uint32_t *)ptr, firstByte32 = 0u, byteMask32 = (byte) | (byte << 8);
    byteMask32 |= byteMask32 << 16;
    while ((firstByte32 = findFirstZeroByte32((*ptr32) ^ byteMask32)) == 0) { ptr32++; }
    return (ptr + ((((unsigned char *)ptr32) - ptr) + firstByte32 - 1));
}
... and a 64-bit version:
#ifdef __BIG_ENDIAN__
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7f7f7f7full) + 0x7F7F7F7F7f7f7f7full) | _x | 0x7F7F7F7F7f7f7f7full); (_x == 0ull) ? 0 : (__builtin_clzll(_x) >> 3) + 1; })
#else
#define findFirstZeroByte64(x) ({ uint64_t _x = (x); _x = ~(((_x & 0x7F7F7F7F7f7f7f7full) + 0x7F7F7F7F7f7f7f7full) | _x | 0x7F7F7F7F7f7f7f7full); (__builtin_ctzll(_x) + 1) >> 3; })
#endif
unsigned char *findFirstByte64(unsigned char *ptr, unsigned char byte) {
    uint64_t *ptr64 = (uint64_t *)ptr, firstByte64 = 0u, byteMask64 = (byte) | (byte << 8);
    byteMask64 |= byteMask64 << 16;
    byteMask64 |= byteMask64 << 32;
    while ((firstByte64 = findFirstZeroByte64((*ptr64) ^ byteMask64)) == 0) { ptr64++; }
    return (ptr + ((((unsigned char *)ptr64) - ptr) + firstByte64 - 1));
}
Edit 2011/06/04: The OP points out in the comments that this solution has an "insurmountable bug":
it can read past the sought byte or null terminator, which could access an unmapped page or page without read permission. You simply cannot use large reads in string functions unless they're aligned.
This is technically true, but applies to virtually any algorithm that operates on chunks that are larger than a single byte, including the method suggested by the OP in the comments:
A typical strchr implementation is not naive, but quite a bit more efficient than what you gave. See the end of this for the most widely used algorithm: http://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
It also really has nothing to do with alignment per se. True, this could potentially cause the behavior discussed on the majority of common architectures in use, but this has more to do with microarchitecture implementation details: if the unaligned read straddles a 4K boundary (again, typical), then that read will cause a program-terminating fault if the next 4K page is unmapped.
But this isn't a "bug" in the algorithm given in the answer- that behavior is because functions like strchr and strlen do not accept a length argument to bound the size of the search. Searching char bytes[1] = {0x55};, which for the purposes of our discussion just so happens to be placed at the very end of a 4K VM page boundary and the next page is unmapped, with strchr(bytes, 0xAA) (where strchr is a byte-at-a-time implementation) will crash exactly the same way. Ditto for strchr related cousin strlen.
Without a length argument, there is no way to tell when you should switch out of the high speed algorithm and back to a byte-by-byte algorithm. A much more likely "bug" would be to read "past the size of the allocation", which technically results in undefined behavior according to the various C language standards, and would be flagged as an error by something like valgrind.
In summary, anything that operates on chunks larger than a byte in order to go faster, as this answer's code does and as the code pointed out by the OP does, but that must have byte-accurate read semantics, is likely to be "buggy" if there is no length argument to control the corner case(s) of "the last read".
The code in this answer is a kernel for being able to find the first byte in a natural CPU word size chunk quickly if the target CPU has a fast ctz like instruction. It is trivial to add things like making sure it only operates on correctly aligned natural boundaries, or some form of length bound, which would allow you to switch out of the high speed kernel and in to a slower byte-by-byte check.
The OP also states in the comments:
As for your ctz optimization, it only makes a difference for the O(1) tail operation. It could improve performance with tiny strings (e.g. strchr("abc", 'a'); but certainly not with strings of any major size.
Whether or not this statement is true depends a great deal on the microarchitecture in question. Using the canonical 4-stage RISC pipeline model, it is almost certainly true. But it is extremely hard to tell whether it is true for a contemporary out-of-order superscalar CPU, where the core speed can utterly dwarf the memory streaming speed. In this case, it is not only plausible but quite common for there to be a large gap between "the number of instructions that can be retired" and "the number of bytes that can be streamed", so that you have many instructions available for each byte that can be streamed. If this gap is large enough, the ctz + shift instructions can be done "for free".
I know it's an old question, but most bad shift tables are single character. If it makes sense for your dataset (eg especially if it's written words), and if you have the space available, you can get a dramatic speedup by using a bad shift table made of n-grams rather than single characters.
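Here is a sketch of that idea for 2-grams, written as my own illustration of a Horspool-style search whose bad-character table is indexed by the last two bytes of the window; the 64K-entry table is the space cost being traded for the larger shifts, and the pattern must be at least two bytes long:
#include <stdlib.h>
#include <string.h>

long bigram_horspool(const unsigned char *text, size_t n,
                     const unsigned char *pat, size_t m)
{
    size_t *shift = malloc(65536 * sizeof *shift);
    if (!shift || m < 2 || m > n) { free(shift); return -1; }

    for (size_t k = 0; k < 65536; k++)
        shift[k] = m - 1;                      /* bigram absent: shift by m - 1 */
    for (size_t i = 0; i + 2 < m; i++)         /* rightmost occurrence wins */
        shift[(size_t)pat[i] << 8 | pat[i + 1]] = m - 2 - i;

    size_t j = 0;
    long hit = -1;
    while (j + m <= n) {
        if (memcmp(text + j, pat, m) == 0) { hit = (long)j; break; }
        /* index the table by the last two bytes of the current window */
        j += shift[(size_t)text[j + m - 2] << 8 | text[j + m - 1]];
    }
    free(shift);
    return hit;
}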
Here's Python's search implementation, used throughout the core. The comments indicate it uses a compressed Boyer-Moore delta-1 table.
I have done some pretty extensive experimentation with string searching myself, but it was for multiple search strings. Assembly implementations of Horspool and Bitap can often hold their own against algorithms like Aho-Corasick for low pattern counts.
Just search for "fastest strstr", and if you see something of interest just ask me.
In my view you impose too many restrictions on yourself (yes, we all want a searcher that is sub-linear on average and at most linear in the worst case); however, it takes a real programmer to step in, and until then I think that the hash approach is simply a nifty limbo solution (well reinforced by BNDM for shorter 2..16 byte patterns).
Just a quick example:
Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 3041%, 6801754 skips/iterations
Railgun_Quadruplet_7Hasherezade_hits/Railgun_Quadruplet_7Hasherezade_clocks: 0/58
Railgun_Quadruplet_7Hasherezade performance: 3483KB/clock
Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 1554%, 13307181 skips/iterations
Boyer_Moore_Flensburg_hits/Boyer_Moore_Flensburg_clocks: 0/83
Boyer_Moore_Flensburg performance: 2434KB/clock
Doing Search for Pattern(32bytes) into String(206908949bytes) as-one-line ...
Skip-Performance(bigger-the-better): 129%, 160239051 skips/iterations
Two-Way_hits/Two-Way_clocks: 0/816
Two-Way performance: 247KB/clock
Regards,
Sanmayce
The Two-Way Algorithm that you mention in your question (which by the way is incredible!) has recently been improved to work efficiently on multibyte words at a time: Optimal Packed String Matching.
I haven't read the whole paper, but it seems they rely on a couple of new, special CPU instructions (included in e.g. SSE 4.2) being O(1) for their time complexity claim, though if they aren't available they can simulate them in O(log log w) time for w-bit words which doesn't sound too bad.
You could implement, say, 4 different algorithms. Every M minutes (to be determined empirically) run all 4 on current real data. Accumulate statistics over N runs (also TBD). Then use only the winner for the next M minutes.
Log stats on Wins so that you can replace algorithms that never win with new ones. Concentrate optimization efforts on the winningest routine. Pay special attention to the stats after any changes to the hardware, database, or data source. Include that info in the stats log if possible, so you won't have to figure it out from the log date/time-stamp.
I recently discovered a nice tool to measure the performance of the various available algos:
http://www.dmi.unict.it/~faro/smart/index.php
You might find it useful.
Also, if I have to take a quick call on substring search algorithm, I would go with Knuth-Morris-Pratt.
The fastest is currently EPSM, by S. Faro and O. M. Kulekci.
See https://smart-tool.github.io/smart/
"Exact Packed String Matching" optimized for SIMD SSE4.2 (x86_64 and aarch64). It performs stable and best on all sizes.
The site I linked to compares 199 fast string search algorithms, with the usual ones (BM, KMP, BMH) being pretty slow. EPSM outperforms all the others being mentioned here on these platforms. It's also the latest.
Update 2020: EPSM was recently optimized for AVX and is still the fastest.
You might also want to have diverse benchmarks with several types of strings, as this may have a great impact on performance. The algorithms will perform differently depending on whether you are searching natural language (and even here there may still be fine-grained distinctions because of different morphologies), DNA strings, random strings, etc.
Alphabet size will play a role in many algorithms, as will needle size. For instance, Horspool does well on English text but badly on DNA because of the smaller alphabet, which makes life hard for the bad-character rule. Introducing the good-suffix rule alleviates this greatly.
Use stdlib strstr:
char *foundit = strstr(haystack, needle);
It was very fast, only took me about 5 seconds to type.
I don't know if it's the absolute best, but I've had good experience with Boyer-Moore.

Can I use Duff's Device on an array in C?

I have a loop here and I want to make it run faster. I am passing in a large array. I recently heard of Duff's Device; can it be applied to this for loop? Any ideas?
for (i = 0; i < dim; i++) {
    for (j = 0; j < dim; j++) {
        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
}
Please, please don't use Duff's device. A thousand maintenance programmers will thank you. I used to work for a training company where someone thought it funny to introduce the device in the first ten pages of their C programming course. As an instructor it was impossible to deal with, unless (as the guy who wrote that bit of the course apparently did) you believe in "kewl" coding.
Needless to say, I had the thing expunged from the course, ASAP.
Why do you want to make it run faster?
Is there an actual performance problem?
If so, have you profiled and found that this is executing often enough, and hence worth optimizing?
If so, you may want to write it in two ways, the straightforward way you have now and with Duff's Device, or any other method you like.
At that point, you test the performance. You may be surprised. Modern optimizers are quite good, and modern CPUs are really complicated, so source-level optimization is often counterproductive. (I once did this in a loop that was taking a whole lot of time, and found that tightening up the loop, even while introducing some indirection, improved performance. Your mileage is almost certainly going to vary.)
Finally, if Duff's Device is indeed faster, you have to decide whether the performance improvement is worth taking this straightforward and optimizable code and substituting a maintenance problem that may not improve performance at all in the next compiler version.
You should never unroll loops by hand. It would only give you a very platform-specific advantage, if any. All good compilers can unroll loops, but it's not even guaranteed to make the code faster, because it takes up more memory bandwidth to read a longer program from main memory.
If you want the loop to run fast, you should make sure that whatever RIDX computes, dst is accessed sequentially, so you minimize the number of cache misses. Other than that I can't see how you could make the loop faster.
Duff's Device is simply a technique for loop unrolling. And since any loop can be unrolled, you can use Duff's Device.
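For reference, this is the canonical shape of the device applied to copying an array (a sketch with names of my own; as other answers note, a modern compiler's own unrolling is usually at least as good):
#include <stddef.h>

void duff_copy(char *dst, const char *src, size_t count)
{
    if (count == 0)
        return;
    size_t n = (count + 7) / 8;        /* number of passes through the loop */
    switch (count % 8) {               /* jump into the middle of the unrolled body */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}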
Even if you were able to figure this out and get a gain, it would be a pittance and would in no way justify the complexity.
You would be better served spending your energies a level up--reconsidering your entire solution. Perhaps rather than copying values you could create a translation array and spend a little more time looking up answers indirectly when you need them (not really a good idea for building images--just trying to give you a different way to look at it).
Or maybe there is some completely different approach--look at your entire problem and try completely throwing away your current approaches and concepts and just see if there is something you haven't considered because you are too tied to this implementation.
Could your graphics card do some of this work?
Rethinking the problem at a high level works a lot more often than you might think.
Edit:
Looking at your sample more, it looks like you are taking a block of your image and copying it, pixel for pixel, to another image. If so, there are almost certainly ways to do it getting rid of the macro and copying byte for byte instead, or even using a block move assembly function then tweaking the edges of the result to match.
Or I may have guessed wrong, but chances are that looking at it on a larger scale than pixel for pixel might help you a lot more than unrolling loops.
The number of instruction cycles to implement the statement
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
will far outweigh the loop overhead, so unrolling the loop will be very little help on a percentage basis.
I believe this is a candidate for Duff's Device, depending on what the RIDX() function does. But I hope you don't expect someone to write the code for you... Also, you might want to format your code properly so it's actually readable.
Probably, as long as dim is a power of 2 or you have fast modulus on your target system. Learned something new today. I independently discovered that construct 5 years back and dropped it into our memCopy() routine. Who knew :)
Pedantically, no. Duff's Device was for writing to a hardware register (thus the target of the copy was always the same address).
You can implement something very much like Duff's Device for a copy like this, but there will be a clarity and maintenance cost. I'd first profile to make sure it's a problem. I'd also look into whether you can simplify the indexing, as that may enable the compiler to do the dirty work of unrolling the loop.
If you use it, make sure you measure to determine that the improvement is real, significant, and necessary in terms of your performance requirements. I doubt it will be.
For large loops, the remainder dealt with by Duff's device will be an insignificant proportion of the operation, and for small loops where the remainder is significant you will only see a benefit if you have many such loops (themselves in a loop), because small loops by definition don't take that long! Even then the compiler's optimiser is likely to do as well or better without rendering your code unreadable. It is also possible that applying Duff's device will prevent the optimiser from applying other, perhaps more effective, optimisations, which is why, if you use it, you need to measure it.
All the time you are likely to save on this (if any) you have probably wasted several times over reading responses to this question.
Duff's device may not be the optimized solution in an unrolled loop.
I had a function that sent a bit to a port, followed by a clock pulse to another port. For each bit, the functions were:
if (bit == 1)
{
    write to the set port.
}
else
{
    write to the clear port.
}
write high clock bit.
write low clock bit.
This was put into a Duff's device loop, along with bit shifting and bit count incrementing.
I improved the efficiency of the loop by using nibble values instead of bits (a nibble being 4 bits). The switch statement was based on the nibble value. This allowed 4 bits to be processed without any if statements, improving the flow through instruction cache (pipeline).
There are times when Duff's device may not be the optimal solution; but can be the foundation for a more efficient solution.
Modern compilers already do loop unrolling for you when optimizations are turned on, which renders Duff's device obsolete. The compiler knows better than you do the optimal level of unrolling for your compilation target, and you don't have to write any extra code to do it. It was a neat hack at the time, but these days Duff's device is just a historical curiosity, not a good programming practice.
In the end, whoever makes the call on optimization, everyone involved needs to be sure it is well documented and written in a style that is as self-documenting as possible, using correctly spelled, meaningful names for variables, functions, etc., so it is obvious if the comments and the code get out of sync.
The need for optimization will never end. I was talking with a grad student whose malloc()/free() had broken down while working on the largest file of genetic data ever attempted in one pass. After a while the heap became too fragmented for malloc to find a block of contiguous RAM to allocate to the calling function. He had to switch to a library in which malloc only issued blocks of memory on 32k boundaries. It took 160% more memory than the old library and ran a good deal slower, but it finished the job.
You must be careful when using Duff's Device and many other optimizations, to be sure the compiler doesn't optimize your optimization into obscure broken object code. As we enter an environment using automatic parallelizing tools, this will become more of a problem.
I expect that the lower the level of the optimization, the more likely future optimizations are to break the code. I can see that my habit of discarding line feeds in code designed to run on multiple platforms, and putting the line feeds back in in the print and write functions on each platform, will run into problems with several of the things discussed in this thread.
-gcouger
