Most Common K length Sequences - c

I saw this coding challenge posted somewhere on an Elixir forum and have not quite figured out how to solve it. I have generalized the problem to make it more understandable.
Given an input of a random sequence of numbers, compute the M most common K-length sequences. M and K are constants. For example, compute the 10 most common 3-number sequences from the input.
The input could be potentially very large, so the solution should scale to any size.
I know that storing the sequences in a hash table in a higher-level language is potentially the simplest, most efficient solution, but I’d like to find another solution that can be done in C without any hash functions.
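One hash-free possibility, offered only as a sketch and not as an answer from the original thread, is to sort the starting positions of all K-length windows lexicographically, count runs of equal windows, and then sort the runs by count. The names seq_count and most_common_sequences are made up for this illustration, and the whole input is assumed to fit in memory; for larger inputs the same idea would need an external merge sort.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record: where one occurrence starts and how often it occurs. */
typedef struct { size_t start; size_t count; } seq_count;

/* qsort has no context argument, so the comparator reads these file-scope values. */
static const int *cmp_data;
static size_t cmp_k;

/* Lexicographic comparison of two K-length windows given by their start indices. */
static int cmp_windows(const void *pa, const void *pb)
{
    const int *a = cmp_data + *(const size_t *)pa;
    const int *b = cmp_data + *(const size_t *)pb;
    for (size_t i = 0; i < cmp_k; i++)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}

/* Orders records by count, largest first. Ties end up in unspecified order. */
static int cmp_counts_desc(const void *pa, const void *pb)
{
    const seq_count *a = pa, *b = pb;
    if (a->count != b->count)
        return a->count > b->count ? -1 : 1;
    return 0;
}

/* Writes up to m records into out and returns how many were written
   (0 on bad arguments or allocation failure). */
size_t most_common_sequences(const int *data, size_t n, size_t k, size_t m,
                             seq_count *out)
{
    if (k == 0 || n < k || m == 0)
        return 0;
    size_t windows = n - k + 1;
    size_t *idx = malloc(windows * sizeof *idx);
    seq_count *runs = malloc(windows * sizeof *runs);
    if (!idx || !runs) { free(idx); free(runs); return 0; }

    for (size_t i = 0; i < windows; i++)
        idx[i] = i;
    cmp_data = data;
    cmp_k = k;
    qsort(idx, windows, sizeof *idx, cmp_windows);   /* equal windows become adjacent */

    size_t nruns = 0;
    for (size_t i = 0; i < windows; i++) {
        if (nruns > 0 && cmp_windows(&runs[nruns - 1].start, &idx[i]) == 0) {
            runs[nruns - 1].count++;                 /* same sequence as previous run */
        } else {
            runs[nruns].start = idx[i];              /* new distinct sequence */
            runs[nruns].count = 1;
            nruns++;
        }
    }
    qsort(runs, nruns, sizeof *runs, cmp_counts_desc);

    size_t written = nruns < m ? nruns : m;
    memcpy(out, runs, written * sizeof *out);
    free(idx);
    free(runs);
    return written;
}
```

For the input 1 2 1 2 1 2 3 with K = 2 and M = 2, this reports the sequence 1 2 (three occurrences) followed by 2 1 (two occurrences). The cost is O(N log N) window comparisons of up to K elements each, which is usually worse than a hash table but needs no hash function.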

Related

How expressive can we be with arrays in Z3(Py)? An example

My first question is whether I can express the following formula in Z3Py:
Exists i::Integer s.t. (0<=i<|arr|) & (avg(arr)+t<arr[i])
This means: whether there is a position i with 0<=i<|arr| in the array whose value arr[i] is greater than the average of the array avg(arr) plus a given threshold t.
I know this kind of expression can be queried in Dafny and (since Dafny uses Z3 underneath) I guess this can be done in Z3Py.
My second question is: how expressive is the decidable fragment involving arrays in Z3?
I read this paper on how the full theory of arrays is not decidable (http://theory.stanford.edu/~arbrad/papers/arrays.pdf), but only a concrete fragment, the array property fragment.
Is there any interesting paper/tutorial on what can and cannot be done with arrays+quantifiers+functions in Z3?
You found the best paper to read regarding reasoning with arrays, so I doubt there's a better resource or a tutorial out there for you.
I think the sequence logic (not yet officially supported by SMTLib, but z3 supports it) is the right logic to use for reasoning about these sorts of problems, see: https://microsoft.github.io/z3guide/docs/theories/Sequences/
Having said that, most properties about arrays/sequences of "arbitrary size" require inductive proofs. This is because most interesting functions on them are essentially recursive (or iterative), and induction is the only way to prove properties for such programs. While SMT solvers have improved significantly regarding support for recursive definitions and induction, they still don't perform anywhere near as well as a traditional theorem prover. (This is, of course, to be expected.)
I'd recommend looking at the sequence logic, and playing around with recursive definitions. You might get some mileage out of that, though don't expect proofs for anything that requires induction, especially if the inductive hypothesis needs some clever invariant to be specified.
Note that if you know the length of your array concretely (i.e., 10, 15, or some other hopefully not too large a number), then it's best to allocate the elements symbolically yourself, and not use arrays/sequences at all. (And you can repeat your proof for lengths 0, 1, 2, .. up to some fixed number.) But if you want proofs that work for arbitrary lengths, your best bet is to use sequences in z3, not arrays; with all the caveats I mentioned above.

Determine if a given integer number is element of the Fibonacci sequence in C without using float

I recently had an interview, which I failed; in the end I was told that I did not have enough experience to work for them.
The position was embedded C software developer. The target platform was some kind of very simple 32-bit architecture whose processor does not support floating-point numbers or operations on them. Therefore double and float cannot be used.
The task was to develop a C routine for this architecture that takes one integer and returns whether or not it is a Fibonacci number. However, only an additional 1K of temporary memory may be used during execution. That means: even if I simulate very large integers, I can't just build up the sequence and iterate through it.
As far as I know, a positive integer n is a Fibonacci number exactly when one of
5n² + 4
or
5n² − 4
is a perfect square. Therefore I answered: it is simple, since the routine only has to determine whether or not that is the case.
They then responded: on the current target architecture no floating-point-like operations are supported, therefore square roots cannot be obtained with the stdlib's sqrt function. It was also mentioned that basic operations like division and modulus may also not work because of the architecture's limitations.
Then I said, okay, we may build an array with the square numbers up to 256. Then we could iterate through it and compare the entries to the numbers given by the formulas (see above). They said: this is a bad approach, even if it would work. Therefore they did not accept that answer.
Finally I gave up, since I had no other ideas. I asked what the solution would be: they said they would not tell me, but advised me to look for it myself. My first approach (the two formulas) should be the key, but the square root may be computed in some alternative way.
I googled a lot at home, but never found any "alternative" square root algorithms. Everywhere, the use of floating-point numbers was permitted.
For operations like division and modulus, the so-called "integer-division" may be used. But what is to be used for square root?
Even if I failed the interview test, this is a very interesting topic for me, to work on architectures where no floating-point operations are allowed.
Therefore my questions:
How can floating-point numbers be simulated (if only integers are allowed)?
What would be a possible solution in C for the problem mentioned? Code examples are welcome.
The point of this type of interview is to see how you approach new problems. If you happen to already know the answer, that is undoubtedly to your credit but it doesn't really answer the question. What's interesting to the interviewer is watching you grapple with the issues.
For this reason, it is common that an interviewer will add additional constraints, trying to take you out of your comfort zone and seeing how you cope.
I think it's great that you knew that fact about recognising Fibonacci numbers. I wouldn't have known it without consulting Wikipedia. It's an interesting fact but does it actually help solve the problem?
Apparently, it would be necessary to compute 5n²±4, compute the square roots, and then verify that one of them is an integer. With access to a floating point implementation with sufficient precision, this would not be too complicated. But how much precision is that? If n can be an arbitrary 32-bit signed number, then n² is obviously not going to fit into 32 bits. In fact, 5n²+4 could be as big as 65 bits, not including a sign bit. That's far beyond the precision of a double (normally 53 bits of significand) and even of a long double, if available. So computing the precise square root will be problematic.
Of course, we don't actually need a precise computation. We can start with an approximation, square it, and see if it is either four more or four less than 5n². And it's easy to see how to compute a good guess: it will be very close to n×√5. By using a good precomputed approximation of √5, we can easily do this computation without the need for floating point, without division, and without a sqrt function. (If the approximation isn't accurate, we might need to adjust the result up or down, but that's easy to do using the identity (n+1)² = n²+2n+1; once we have n², we can compute (n+1)² with only addition.)
We still need to solve the problem of precision, so we'll need some way of dealing with 66-bit integers. But we only need to implement addition and multiplication of positive integers, which is considerably simpler than a full-fledged bignum package. Indeed, if we can prove that our square root estimation is close enough, we could safely do the verification modulo 2³¹.
So the analytic solution can be made to work, but before diving into it, we should ask whether it's the best solution. One very common category of suboptimal programming is clinging desperately to the first idea you come up with even as its complications become increasingly evident. That will be one of the things the interviewer wants to know about you: how flexible are you when presented with new information or new requirements.
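As a rough illustration of the analytic route, here is a sketch that swaps in a different square-root technique from the √5 approximation described above: the classic bit-by-bit integer square root, which needs only shifts, additions, subtractions and comparisons. It assumes the compiler provides a (software-emulated) uint64_t, and it sidesteps the 65-bit problem by refusing inputs above a conservative bound instead of implementing wider arithmetic. The function names are invented for this example.

```c
#include <stdint.h>

/* floor(sqrt(x)) using only shifts, additions, subtractions and comparisons. */
static uint64_t isqrt_u64(uint64_t x)
{
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* largest power of four that fits */

    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= root + bit) {
            x -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return root;
}

/* A perfect square is a value whose integer square root squares back to it. */
static int is_square_u64(uint64_t x)
{
    uint64_t r = isqrt_u64(x);
    return r * r == x;
}

/* Returns 1 if n is a Fibonacci number, 0 if it is not, and -1 if n is
   outside the range this sketch supports. */
int is_fibonacci_by_squares(uint32_t n)
{
    if (n > 1900000000u)
        return -1;                      /* conservative bound keeping 5n^2 + 4 within 64 bits */
    uint64_t t = 5 * (uint64_t)n * n;
    /* The t >= 4 guard avoids unsigned wrap-around for n == 0. */
    return is_square_u64(t + 4) || (t >= 4 && is_square_u64(t - 4));
}
```

Covering the remaining 32-bit inputs would need the wider addition and multiplication helpers mentioned above, which this sketch leaves out.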
So what other ways are there to know if n is a Fibonacci number? One interesting fact is that if n is Fib(k), then k is the floor of logφ(n×√5 + 0.5). Since logφ is easily computed from log2, which in turn can be approximated by a simple bitwise operation, we could try finding an approximation of k and verifying it using the classic O(log k) recursion for computing Fib(k). None of the above involves numbers bigger than the capacity of a 32-bit signed type.
Even more simply, we could just run through the Fibonacci series in a loop, checking to see if we hit the target number. Only 47 loops are necessary. Alternatively, these 47 numbers could be precalculated and searched with binary search, using far less than the 1k bytes you are allowed.
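The loop variant could look roughly like the sketch below. It uses an unsigned 32-bit input (an assumption; the discussion above talks about signed values) and only additions and comparisons, with an explicit guard so the running term never wraps around. The function name is made up for the example.

```c
#include <stdint.h>

/* Returns 1 if n is a Fibonacci number, 0 otherwise.
   Walks the series upward using only addition and comparison. */
int is_fibonacci_by_loop(uint32_t n)
{
    uint32_t prev = 0, cur = 1;         /* F(0) and F(1) */

    if (n == 0)
        return 1;
    while (cur < n) {
        if (cur > UINT32_MAX - prev)
            return 0;                   /* next term exceeds 32 bits, so n is never reached */
        uint32_t next = prev + cur;
        prev = cur;
        cur = next;
    }
    return cur == n;
}
```

The working storage is two 32-bit variables, far below the 1K limit, and the loop runs at most a few dozen times.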
It is unlikely an interviewer for a programming position would be testing for knowledge of a specific property of the Fibonacci sequence. Thus, unless they present the property to be tested, they are examining the candidate’s approaches to problems of this nature and their general knowledge of algorithms. Notably, the notion to iterate through a table of squares is a poor response on several fronts:
At a minimum, binary search should be the first thought for table look-up. Some calculated look-up approaches could also be proposed for discussion, such as using find-first-set-bit instruction to index into a table.
Hashing might be another idea worth considering, especially since an efficient customized hash might be constructed.
Once we have decided to use a table, it is likely a direct table of Fibonacci numbers would be more useful than a table of squares.

Is it possible to allow mismatches in KMP algorithm?

I am looking for an efficient algorithm to allow mismatches (at most 3) when comparing a pattern with a text. The original KMP does exact matching efficiently on my data, but I was considering extending the algorithm to accommodate mismatches.
For my case: GACCCT is considered a match with GGGGGAGGTTTTTT with start position 4 in the second sequence.
I need to do pairwise comparison between two files. Each contains approximately 500,000 sequences. Sequences in one file are relatively short (~50 bases) while those in the other are longer (~200).
I tried the Regex package in Python, the Levenshtein algorithm and edit distances. But they are slow and I would have to wait a couple of weeks to get the work done.
I think your data isn't too large, so maybe this will work:
I think you should create a suffix tree for your data. Once you do this, finding substrings will be very easy, whether or not you want to count mismatches: you just traverse the tree with the characters you're looking for, until you've either found a substring, or hit the maximum number of mismatches you can tolerate.
If you want at most three mismatches, there's a simple but kind of daft algorithm that'll work on most real cases. Break your pattern into four contiguous parts arbitrarily. (It is probably useful for them to match a random text location with roughly the same probability.) Find all matches in the text of your four contiguous parts. See which of those completes to an at-most-three-mismatches match by brute force.
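A minimal sketch of that split-and-verify idea, generalized to k mismatches and k+1 parts, might look like this. The exact search for each part is done with a plain memcmp scan purely to keep the example short; a real implementation would presumably plug KMP in at that step. The function names are invented for illustration, and positions are 0-based.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if a and b differ in at most `limit` of their first `len` characters. */
static int within_mismatches(const char *a, const char *b, size_t len, size_t limit)
{
    size_t misses = 0;
    for (size_t i = 0; i < len; i++)
        if (a[i] != b[i] && ++misses > limit)
            return 0;
    return 1;
}

/* Lowest 0-based start of a match with at most k mismatches, or -1 if none. */
long find_with_mismatches(const char *text, const char *pat, size_t k)
{
    size_t n = strlen(text), m = strlen(pat);
    long best = -1;

    if (m == 0 || m > n)
        return -1;
    if (m <= k)
        return 0;                       /* every character is allowed to mismatch */

    size_t parts = k + 1;               /* with k mismatches, at least one part is exact */
    for (size_t p = 0; p < parts; p++) {
        size_t off = p * m / parts;
        size_t len = (p + 1) * m / parts - off;
        /* Plain scan for exact occurrences of this part (KMP could be used here). */
        for (size_t s = 0; s + m <= n; s++) {
            if (best >= 0 && s >= (size_t)best)
                break;                  /* only earlier starts can improve the result */
            if (memcmp(text + s + off, pat + off, len) == 0 &&
                within_mismatches(text + s, pat, m, k))
                best = (long)s;         /* candidate confirmed by brute force */
        }
    }
    return best;
}
```

With the example from the question, find_with_mismatches("GGGGGAGGTTTTTT", "GACCCT", 3) returns 4.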
Mehrdad's solution of using a suffix tree is better in general, but it requires more programming effort.

traversing numbers in an interval wisely

I want to scan the numbers in a big interval wisely until I find the one I need.
But I don't have any clue where this number might be, and I will not have any clue during the search.
Let me give an example to make it easy to state my question.
Assume I am searching for a number between 100000000000000 and 999999999999999.
A naive approach would be to start from 100000000000000 and count up to 99... one by one.
But this is not wise, because the number can be at the far end if I am not lucky.
So, what is the best approach to this problem? I am not looking for the mathematically best one; I need a technique which is easy to implement in the C programming language.
Thanks in advance.
There is no solution to your problem other than knowledge. If you don't know anything about the number, any strategy to enumerate the candidates is equally good (or bad).
If you suppose that you are fighting against an adversary that is trying to hide the number from you, a strategy would be to make your next move unguessable. That would be to randomly pick numbers in the range and ask for them. (To avoid repetitions, you'd have to use a random permutation of your numbers.) That way you'd find your number after an expected number of queries of about half the total, that is, you'd gain a factor of two over the worst case. But as said, all of that depends on the assumptions that you can make.
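A true random permutation of a range this large cannot be stored, so the sketch below substitutes something much cheaper: it walks the range with a fixed stride that is coprime to the range size, which visits every number exactly once in a scattered order. This is not unpredictable in the adversarial sense described above; it only avoids repetitions and avoids scanning from one end. The callback type, the function names and the example predicate are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical callback: returns nonzero when `value` is the number sought. */
typedef int (*candidate_test)(uint64_t value, void *ctx);

static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Example predicate for demonstration: compares against a target stored in ctx. */
static int equals_target(uint64_t value, void *ctx)
{
    return value == *(const uint64_t *)ctx;
}

/* Visits every number in [lo, hi] exactly once in a scattered order.
   Assumes lo <= hi and a range of at most 2^63 numbers.
   Returns 1 and stores the hit in *found, or 0 if nothing matched. */
int scan_scattered(uint64_t lo, uint64_t hi, uint64_t step,
                   candidate_test test, void *ctx, uint64_t *found)
{
    uint64_t count = hi - lo + 1;
    uint64_t offset = 0;

    step %= count;
    if (step == 0)
        step = 1;
    while (gcd_u64(step, count) != 1)
        step++;                         /* a coprime stride cycles through all offsets */

    for (uint64_t i = 0; i < count; i++) {
        if (test(lo + offset, ctx)) {
            *found = lo + offset;
            return 1;
        }
        offset += step;
        if (offset >= count)
            offset -= count;            /* wrap around without using a wide multiply */
    }
    return 0;
}
```

For the range in the question one would call it with lo = 100000000000000, hi = 999999999999999 and some large odd step; the expected number of probes is still about half the range, as stated above.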
Use bisection search. First see if your number is above or below the middle of the range. Depending on the answer, repeat the process for the upper or lower half of the range, respectively.
As you already know there is no strategy to improve search speed. All you can do is to speed up the search itself by using multithreading. So the technically best approach might be to try to implement the algorithm in OpenCL (which is fairly similar to C and which can be used through a C library) and run several hundred tests in parallel, depending on your hardware (GPU).

What are good test cases for benchmarking & stress testing substring search algorithms?

I'm trying to evaluate different substring search (à la strstr) algorithms and implementations and am looking for some well-crafted needle and haystack strings that will catch worst-case performance and possible corner-case bugs. I suppose I could work them out myself but I figure someone has to have a good collection of test cases sitting around somewhere...
Some thoughts and a partial answer to myself:
Worst case for brute force algorithm:
a^(n+1) b in (a^n b)^m
e.g. aaab in aabaabaabaabaabaabaab
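For benchmarking, the brute-force worst case above can be generated for any n and m with a couple of small helpers. This is only a sketch; the function names are invented, and the caller is expected to free the returned strings.

```c
#include <stdlib.h>
#include <string.h>

/* Builds the needle a^(n+1) b. Returns NULL on allocation failure. */
char *make_worst_needle(size_t n)
{
    char *s = malloc(n + 3);
    if (!s)
        return NULL;
    memset(s, 'a', n + 1);
    s[n + 1] = 'b';
    s[n + 2] = '\0';
    return s;
}

/* Builds the haystack (a^n b)^m. Returns NULL on allocation failure. */
char *make_worst_haystack(size_t n, size_t m)
{
    size_t block = n + 1;               /* length of one a^n b unit */
    char *s = malloc(block * m + 1);
    if (!s)
        return NULL;
    for (size_t i = 0; i < m; i++) {
        memset(s + i * block, 'a', n);
        s[i * block + n] = 'b';
    }
    s[block * m] = '\0';
    return s;
}
```

With n = 2 and m = 7 these produce exactly the aaab and aabaabaabaabaabaabaab pair shown above; the needle never occurs, but every alignment matches a long prefix before failing.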
Worst case for SMOA:
Something like yxyxyxxyxyxyxx in (yxyxyxxyxyxyxy)^n. Needs further refinement. I'm trying to ensure that each advancement is only half the length of the partial match, and that maximal suffix computation requires the maximal amount of backtracking. I'm pretty sure I'm on the right track because this type of case is the only way I've found so far to make my implementation of SMOA (which is asymptotically 6n+5) run slower than glibc's Two-Way (which is asymptotically 2n-m but has moderately painful preprocessing overhead).
Worst case for anything rolling-hash based:
Whatever sequence of bytes causes hash collisions with the hash of the needle. For any reasonably-fast hash and a given needle, it should be easy to construct a haystack whose hash collides with the needle's hash at every point. However, it seems difficult to simultaneously create long partial matches, which are the only way to get the worst-case behavior. Naturally for worst-case behavior the needle must have some periodicity, and a way of emulating the hash by adjusting just the final characters.
Worst case for Two-Way:
Seems to be very short needle with nontrivial MS decomposition - something like bac - where the haystack contains repeated false positives in the right-half component of the needle - something like dacdacdacdacdacdacdac. The only way this algorithm can be slow (other than by glibc authors implementing it poorly...) is by making the outer loop iterate many times and repeatedly incur that overhead (and making the setup overhead significant).
Other algorithms:
I'm really only interested in algorithms that are O(1) in space and have low preprocessing overhead, so I haven't looked at their worst cases so much. At least Boyer-Moore (without the modifications to make it O(n)) has a nontrivial worst-case where it becomes O(nm).
Doesn't answer your question directly, but you may find the algorithms in the book - Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology - interesting (has many novel algorithms on sub-string search). Additionally, it is also a good source of special and complex cases.
A procedure that might give interesting statistics, though I have no time to test right now:
Randomize over string length,
then randomize over string contents of that length,
then randomize over offset/length of a substring (possibly something not in the string),
then randomly clobber the substring (possibly not at all),
repeat.
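The procedure above could be turned into a small differential test along these lines. The sketch compares a candidate implementation against a naive reference search; the small three-letter alphabet, the maximum length of 64 and all names are arbitrary choices made for the example, and the needle is always cut from the haystack before being optionally clobbered.

```c
#include <stdlib.h>
#include <string.h>

/* Same shape as strstr, so strstr itself can be passed as a candidate. */
typedef char *(*search_fn)(const char *haystack, const char *needle);

/* Reference implementation: naive quadratic search. */
static char *naive_search(const char *h, const char *n)
{
    size_t nl = strlen(n);
    for (;; h++) {
        if (strncmp(h, n, nl) == 0)
            return (char *)h;
        if (*h == '\0')
            return NULL;
    }
}

/* Runs `rounds` random cases and returns how many times the candidate
   disagreed with the reference. */
int fuzz_search(search_fn candidate, unsigned seed, int rounds)
{
    enum { MAX_LEN = 64 };
    char hay[MAX_LEN + 1], needle[MAX_LEN + 1];
    int failures = 0;

    srand(seed);
    for (int r = 0; r < rounds; r++) {
        /* Random length, then random contents over the alphabet {a, b, c}. */
        size_t len = (size_t)rand() % (MAX_LEN + 1);
        for (size_t i = 0; i < len; i++)
            hay[i] = (char)('a' + rand() % 3);
        hay[len] = '\0';

        /* Random offset and length of a substring of the haystack. */
        size_t off = (size_t)rand() % (len + 1);
        size_t sub = (size_t)rand() % (len - off + 1);
        memcpy(needle, hay + off, sub);
        needle[sub] = '\0';

        /* Half of the time, clobber one character; 'd' is outside the alphabet. */
        if (sub > 0 && rand() % 2)
            needle[(size_t)rand() % sub] = (char)('a' + rand() % 4);

        if (candidate(hay, needle) != naive_search(hay, needle))
            failures++;
    }
    return failures;
}
```

A call such as fuzz_search(strstr, 12345u, 2000) should return 0 for a correct implementation; a small alphabet is used on purpose so that partial matches are frequent.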
You can generate container strings (resp., contained test values) recursively by:
Starting with the empty string, generate all strings given by the augmentation of a string currently in the set by adding a character from an alphabet to the left or the right (both).
The alphabet for generating container strings is chosen by you.
You test 2 alphabets for contained strings. One is the one that makes up container strings, the other is its complement.

Resources