How to implement a suffix array in C - suffix-array

However, it's easy to do in C++, since there's a built-in sort() function in the algorithm header.
I have gone through both the naive method and the O(n log n) method of forming the array. In both cases the sort() function is used for sorting the suffixes.
Is there any good method in C?

Are you saying you googled "sort c" and found NOTHING? I see several helpful links when I do that. For example, have a look at this question and its answers: C library function to do sort. Also, the Wikipedia article on suffix arrays gives a good overview of methods for constructing suffix arrays: the O(N) method of constructing a suffix tree and then the suffix array, the O(N^2 log N) method of sorting the suffixes (sorting requires O(N log N) comparisons, and each comparison is O(N), so the total time is O(N^2 log N)), and other advanced methods. The Wikipedia article also points to a few implementations in Java, C/C++, etc.

Related

Running Time of built in functions

If I use built-in functions from Java, do I have to take their running time into consideration, or should I count them as constant time? What will be the time complexity of the following function?
static int findMax(int[] a)
{
    java.util.Arrays.sort(a);
    int n = a.length;
    return a[n - 1];
}
Nearly all the work here is being done by sort(). For sorting arrays of primitives, Java uses a dual-pivot Quicksort, which has O(n log n) average and O(n^2) worst-case performance.
From the Java documentation on sort():
Implementation note: The sorting algorithm is a Dual-Pivot Quicksort by Vladimir Yaroslavskiy, Jon Bentley, and Joshua Bloch. This algorithm offers O(n log(n)) performance on many data sets that cause other quicksorts to degrade to quadratic performance, and is typically faster than traditional (one-pivot) Quicksort implementations.
Given your question lacks any use case whatsoever, and your comment on my answer, I feel like I have to point out that this is a classic example of premature optimization. You're looking for the mathematical complexity of a trivial method without any indication that this method will account for any significant portion of the CPU-time used by your program. This is especially true given that your implementation is incredibly inefficient: iterating through the array and storing the highest value would execute in O(n).

How do different languages implement sorting in their standard libraries? [closed]

From what I have (briefly) read, Java and Python both look like they make use of Timsort in their standard libraries, while the sorting function in C's stdlib is called qsort because it was once quicksort.
What algorithms do typical languages implement in their standard libraries today, and why did they choose those algorithms? Also, has C deviated from quicksort?
I know this question lacks an "actual problem(s) that [I] face" and may seem open-ended to some, but knowing how and why certain algorithms are chosen as standard seems pretty useful yet relatively untaught. I also feel as though an in-depth answer addressing concerns that are language-specific (data types?) and machine-specific (cache hits?) would provide more insight into how different languages and algorithms work than university courses care to explain.
In musl, we use Smooth Sort. Conceptually it's a variant of heap sort (and likewise in-place and O(n log n) time), but it has the nice property that the worst-case performance approaches O(n) for already-sorted or near-sorted input. I'm not convinced it's the best possible choice, but it appears very difficult to do better with an in-place algorithm with O(n log n) worst-case.
Being a little-known invention of Dijkstra's also makes it kind of cool. :-)
C does not specify the algorithm to be used by qsort.
On current glibc (2.17), qsort allocates memory (with malloc, or on the stack with alloca when the requirement is very small) and uses a merge sort algorithm. If the memory requirement is too high, or if malloc fails, it falls back to a quicksort algorithm.
My machine's C library provides qsort, heapsort, and mergesort, saying in the man page:
The qsort() and qsort_r() functions are an implementation of C.A.R. Hoare's "quicksort" algorithm, a variant of partition-exchange sorting; in particular, see D.E. Knuth's Algorithm Q. Quicksort takes O(n lg n) average time. This implementation uses median selection to avoid its O(n^2) worst-case behavior.
The heapsort() function is an implementation of J.W.J. Williams' "heapsort" algorithm, a variant of selection sorting; in particular, see D.E. Knuth's Algorithm H. Heapsort takes O(n lg n) worst-case time. Its only advantage over qsort() is that it uses almost no additional memory; while qsort() does not allocate memory, it is implemented using recursion.
The function mergesort() requires additional memory of size nel * width bytes; it should be used only when space is not at a premium. The mergesort() function is optimized for data with pre-existing order; its worst case time is O(n lg n); its best case is O(n).
Normally, qsort() is faster than mergesort() which is faster than heapsort(). Memory availability and pre-existing order in the data can make this untrue.
There are plenty of open source C libraries for you to look at if you want to see specific details of the implementation.
As far as 'why did system X choose algorithm Y', that's a pretty tough question to answer meaningfully - if you're not lucky enough to find a rationale in the documentation, you'd have to ask the designers directly.
I did a quick scan of the C11 standard for qsort() and I couldn't find any reference to how qsort() should be implemented, or to the expected time/space complexity of the algorithm. All it says concerns certain conditions on the comparator function.
What that means is that an implementation can choose any comparison-based algorithm for qsort(). For example, an implementation could use a naive algorithm such as bubble sort, which is far less efficient than real quicksort. The bottom line is that it's up to the implementation to decide on the actual algorithm.

what is the running time of Extracting from a Max Heap?

I got this homework question:
"James claims that he succeeded in implementing extraction from a maximum heap (ExtractMax) in O((log n)^0.5). Explain why James is wrong."
I know that extracting from a maximum heap takes O(log n), but how can I prove that James is wrong?
As can be seen here, building a heap can be done in O(n). Now if extracting the maximum could be done in O((log n)^0.5), then it would be possible to sort the entire set in n * O((log n)^0.5) by repeatedly extracting the largest element. This, however, is impossible, because the lower bound for comparison-based sorting is Ω(n log n).
Therefore, James's algorithm cannot exist.
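Spelled out (assuming the standard Ω(n log n) comparison-sort lower bound), the total cost of this hypothetical sort would be:

```latex
\underbrace{O(n)}_{\text{build heap}}
+ \underbrace{n \cdot O\!\left((\log n)^{1/2}\right)}_{n \text{ extractions}}
= O\!\left(n (\log n)^{1/2}\right)
= o(n \log n),
```

which is strictly smaller than the Ω(n log n) lower bound for comparison-based sorting, a contradiction.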
@Duh's solution of converting your extraction problem into a sorting problem is actually very creative. It shouldn't be too hard to find a proof that comparison-based sorting is Ω(n log n), and it's very common in the study of algorithms to convert one problem into a different one (for example, all NP-complete problems are reductions of one another; that's how you prove they are NP-complete). That said, I think there's a much simpler solution.
You stated it directly in your question: extracting from a binary heap is O(log n). Think about why it is O(log n). What is the structure of a binary heap? What actions are required to extract from a binary heap? Why is the worst case log n operations? Are these limits influenced at all by implementation?
Now, remember that there are two parts to James' claim:
He can extract in O((log n)^0.5)
He is using a binary heap.
Given what you know about binary heaps, can both these claims be true? Why or why not? Is there a contradiction? If so, why is there a contradiction? Finally, think about what this means for James.

Regarding in-place merge in an array

I came across the following question.
Given an array of n elements and an integer k where k < n. Elements {a0...ak} and
{ak+1...an} are already sorted. Give an algorithm to sort in O(n) time and O(1) space.
It does not seem to me like it can be done in O(n) time and O(1) space. The problem really seems to be asking how to do the merge step of mergesort but in-place. If it was possible, wouldn't mergesort be implemented that way? I am unable to convince myself though and need some opinion.
This seems to indicate that it is possible to do in O(lg^2 n) space. I cannot see how to prove that it is impossible to merge in constant space, but I cannot see how to do it either.
Edit:
Chasing references, Knuth Vol 3 - Exercise 5.5.3 says "A considerably more complicated algorithm of L. Trabb-Pardo provides the best possible answer to this problem: It is possible to do stable merging in O(n) time and stable sorting in O(n lg n) time, using only O(lg n) bits of auxiliary memory for a fixed number of index variables."
More references that I have not read. Thanks for an interesting problem.
Further edit:
This article claims that the article by Huang and Langston have an algorithm that merges two lists of size m and n in time O(m + n), so the answer to your question would seem to be yes. Unfortunately I do not have access to the article, so I must trust the second hand information. I'm not sure how to reconcile this with Knuth's pronouncement that the Trabb-Pardo algorithm is optimal. If my life depended on it, I'd go with Knuth.
I now see that this has been asked as an earlier Stack Overflow question a number of times. I don't have the heart to flag it as a duplicate.
Huang B.-C. and Langston M. A., Practical in-place merging, Comm. ACM 31 (1988) 348-352
There are several algorithms for doing this, none of which are very easy to intuit. The key idea is to use a part of the arrays to merge as a buffer, then doing a standard merge using this buffer for auxiliary space. If you can then reposition the elements so that the buffer elements are in the right place, you're golden.
I have written up an implementation of one of these algorithms on my personal site if you're interested in looking at it. It's based on the paper "Practical In-Place Merging" by Huang and Langston. You probably will want to look over that paper for some insight.
I've also heard that there are good adaptive algorithms for this, which use some fixed-size buffer of your choosing (which could be O(1) if you wanted), but then scale elegantly with the buffer size. I don't know any of these off the top of my head, but I'm sure a quick search for "adaptive merge" might turn something up.
No it isn't possible, although my job would be much easier if it was :).
You have an O(log n) factor which you can't avoid. You can choose to pay it in time or in space, but the only way to avoid it is not to sort. With O(log n) space you can build a list of continuations that keep track of where you stashed the elements that didn't quite fit. With recursion this can be made to fit in O(1) heap, but only by using O(log n) stack frames instead.
Here is the progress of merge-sorting odds and evens from 1-9. Notice how you require log-space accounting to track the order inversions caused by the twin constraints of constant space and linear swaps.
. -
135792468
. -
135792468
: .-
125793468
: .-
123795468
#.:-
123495768
:.-
123459768
.:-
123456798
.-
123456789
123456789
There are some delicate boundary conditions, slightly harder than binary search to get right even in this (possibly achievable) form, which makes it a bad homework problem but a really good mental exercise.
Update
Apparently I am mistaken and there is an algorithm that provides O(n) time and O(1) space. I have downloaded the papers to enlighten myself, and withdraw this answer as incorrect.

What are good test cases for benchmarking & stress testing substring search algorithms?

I'm trying to evaluate different substring search (ala strstr) algorithms and implementations and looking for some well-crafted needle and haystack strings that will catch worst-case performance and possible corner-case bugs. I suppose I could work them out myself but I figure someone has to have a good collection of test cases sitting around somewhere...
Some thoughts and a partial answer to myself:
Worst case for brute force algorithm:
a^(n+1) b in (a^n b)^m
e.g. aaab in aabaabaabaabaabaabaab
Worst case for SMOA:
Something like yxyxyxxyxyxyxx in (yxyxyxxyxyxyxy)^n. Needs further refinement. I'm trying to ensure that each advancement is only half the length of the partial match, and that maximal suffix computation requires the maximal amount of backtracking. I'm pretty sure I'm on the right track because this type of case is the only way I've found so far to make my implementation of SMOA (which is asymptotically 6n+5) run slower than glibc's Two-Way (which is asymptotically 2n-m but has moderately painful preprocessing overhead).
Worst case for anything rolling-hash based:
Whatever sequence of bytes causes hash collisions with the hash of the needle. For any reasonably-fast hash and a given needle, it should be easy to construct a haystack whose hash collides with the needle's hash at every point. However, it seems difficult to simultaneously create long partial matches, which are the only way to get the worst-case behavior. Naturally for worst-case behavior the needle must have some periodicity, and a way of emulating the hash by adjusting just the final characters.
Worst case for Two-Way:
Seems to be very short needle with nontrivial MS decomposition - something like bac - where the haystack contains repeated false positives in the right-half component of the needle - something like dacdacdacdacdacdacdac. The only way this algorithm can be slow (other than by glibc authors implementing it poorly...) is by making the outer loop iterate many times and repeatedly incur that overhead (and making the setup overhead significant).
Other algorithms:
I'm really only interested in algorithms that are O(1) in space and have low preprocessing overhead, so I haven't looked at their worst cases so much. At least Boyer-Moore (without the modifications to make it O(n)) has a nontrivial worst-case where it becomes O(nm).
Doesn't answer your question directly, but you may find the algorithms in the book - Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology - interesting (has many novel algorithms on sub-string search). Additionally, it is also a good source of special and complex cases.
A procedure that might give interesting statistics, though I have no time to test right now:
Randomize over string length,
then randomize over string contents of that length,
then randomize over offset/length of a substring (possibly something not in the string),
then randomly clobber the substring (possibly not at all),
repeat.
You can generate container strings (resp., contained test values) recursively by:
Starting with the empty string, generate all strings given by augmenting a string currently in the set with a character from an alphabet, added to the left or to the right (both).
The alphabet for generating container strings is chosen by you.
You test 2 alphabets for contained strings. One is the one that makes up container strings, the other is its complement.
