traversing numbers in an interval wisely - c

I want to scan the numbers in a big interval wisely until I find the one I need.
But, I don't have any clue where this number might be and I will not have any clue during searching process.
Let me give an example to make it easy to state my question
Assume I am searching a number between 100000000000000 and 999999999999999
Naive approach would be starting from 100000000000000 and counting to 99... one by one.
but this is not wise because number can be on the far end If I am not lucky.
so, what is the best approach to this problem. I am not looking for mathematically best, I need a technique which is easy to implement in C programming Language.
thanks in advance.

There is no solution to your problem, but knowledge. If you don't know anything about the number, any strategy to enumerate them is equally good (or bad).
If you suppose that you are fighting against an adversary that is trying to hide the number for you, a strategy would be to make your next move unguessable. That would be to randomly pick numbers in the range and ask for them. (to avoid repetitions, you'd have to use a random permutation of your numbers.) By that you'd then find your number with an expected number of about half the total number, that is you'd gain a factor of two from the worst case. But as said all of that depends on the assumption that you can make.

Use bisection search. First see if your number is above or below the middle of the range. Depending on the answer, repeat the process for the upper or lower half of the range, respectively.

As you already know there is no strategy to improve search speed. All you can do is to speed up the search itself by using multithreading. So the technically best approach might be to try to implement the algorithm in OpenCL (which is fairly similar to C and which can be used through a C library) and run several hundred tests in parallel, depending on your hardware (GPU).

Related

Determine if a given integer number is element of the Fibonacci sequence in C without using float

I had recently an interview, where I failed and was finally told having not enough experience to work for them.
The position was embedded C software developer. Target platform was some kind of very simple 32-bit architecture, those processor does not support floating-point numbers and their operations. Therefore double and float numbers cannot be used.
The task was to develop a C routine for this architecture. This takes one integer and returns whether or not that is a Fibonacci number. However, from the memory only an additional 1K temporary space is allowed to use during the execution. That means: even if I simulate very great integers, I can't just build up the sequence and interate through.
As far as I know, a positive integer is a exactly then a Fibonacci number if one of
(5n ^ 2) + 4
or
(5n ^ 2) − 4
is a perfect square. Therefore I responded the question: it is simple, since the routine must determine whether or not that is the case.
They responded then: on the current target architecture no floating-point-like operations are supported, therefore no square root numbers can be retrieved by using the stdlib's sqrt function. It was also mentioned that basic operations like division and modulus may also not work because of the architecture's limitations.
Then I said, okay, we may build an array with the square numbers till 256. Then we could iterate through and compare them to the numbers given by the formulas (see above). They said: this is a bad approach, even if it would work. Therefore they did not accept that answer.
Finally I gave up. Since I had no other ideas. I asked, what would be the solution: they said, it won't be told; but advised me to try to look for it myself. My first approach (the 2 formula) should be the key, but the square root may be done alternatively.
I googled at home a lot, but never found any "alternative" square root counter algorithms. Everywhere was permitted to use floating numbers.
For operations like division and modulus, the so-called "integer-division" may be used. But what is to be used for square root?
Even if I failed the interview test, this is a very interesting topic for me, to work on architectures where no floating-point operations are allowed.
Therefore my questions:
How can floating numbers simulated (if only integers are allowed to use)?
What would be a possible soultion in C for that mentioned problem? Code examples are welcome.
The point of this type of interview is to see how you approach new problems. If you happen to already know the answer, that is undoubtedly to your credit but it doesn't really answer the question. What's interesting to the interviewer is watching you grapple with the issues.
For this reason, it is common that an interviewer will add additional constraints, trying to take you out of your comfort zone and seeing how you cope.
I think it's great that you knew that fact about recognising Fibonacci numbers. I wouldn't have known it without consulting Wikipedia. It's an interesting fact but does it actually help solve the problem?
Apparently, it would be necessary to compute 5n²±4, compute the square roots, and then verify that one of them is an integer. With access to a floating point implementation with sufficient precision, this would not be too complicated. But how much precision is that? If n can be an arbitrary 32-bit signed number, then n² is obviously not going to fit into 32 bits. In fact, 5n²+4 could be as big as 65 bits, not including a sign bit. That's far beyond the precision of a double (normally 52 bits) and even of a long double, if available. So computing the precise square root will be problematic.
Of course, we don't actually need a precise computation. We can start with an approximation, square it, and see if it is either four more or four less than 5n². And it's easy to see how to compute a good guess: it will very close to n×√5. By using a good precomputed approximation of √5, we can easily do this computation without the need for floating point, without division, and without a sqrt function. (If the approximation isn't accurate, we might need to adjust the result up or down, but that's easy to do using the identity (n+1)² = n²+2n+1; once we have n², we can compute (n+1)² with only addition.
We still need to solve the problem of precision, so we'll need some way of dealing with 66-bit integers. But we only need to implement addition and multiplication of positive integers, is considerably simpler than a full-fledged bignum package. Indeed, if we can prove that our square root estimation is close enough, we could safely do the verification modulo 2³¹.
So the analytic solution can be made to work, but before diving into it, we should ask whether it's the best solution. One very common caregory of suboptimal programming is clinging desperately to the first idea you come up with even when as its complications become increasingly evident. That will be one of the things the interviewer wants to know about you: how flexible are you when presented with new information or new requirements.
So what other ways are there to know if n is a Fibonacci number. One interesting fact is that if n is Fib(k), then k is the floor of logφ(k×√5 + 0.5). Since logφ is easily computed from log2, which in turn can be approximated by a simple bitwise operation, we could try finding an approximation of k and verifying it using the classic O(log k) recursion for computing Fib(k). None of the above involved numbers bigger than the capacity of a 32-bit signed type.
Even more simply, we could just run through the Fibonacci series in a loop, checking to see if we hit the target number. Only 47 loops are necessary. Alternatively, these 47 numbers could be precalculated and searched with binary search, using far less than the 1k bytes you are allowed.
It is unlikely an interviewer for a programming position would be testing for knowledge of a specific property of the Fibonacci sequence. Thus, unless they present the property to be tested, they are examining the candidate’s approaches to problems of this nature and their general knowledge of algorithms. Notably, the notion to iterate through a table of squares is a poor response on several fronts:
At a minimum, binary search should be the first thought for table look-up. Some calculated look-up approaches could also be proposed for discussion, such as using find-first-set-bit instruction to index into a table.
Hashing might be another idea worth considering, especially since an efficient customized hash might be constructed.
Once we have decided to use a table, it is likely a direct table of Fibonacci numbers would be more useful than a table of squares.

What is the logical difference between loops and recursive functions?

I came across this video which is discussing how most recursive functions can be written with for loops but when I thought about it, I couldn't see the logical difference between the two. I found this topic here but it only focuses on the practical difference as do many other similar topics on the web so what is the logical difference in the way a loop and a recursion are handled?
Bottom line up front -- recursion is more versatile but in practice is generally less efficient than looping.
A loop could in principle always be implemented as a recursion if you wished to do so. In practice the limits of stack resources put serious constraints on the size of the problems you can address. I can and have built loops that iterate a billion times, something I'd never try with recursion unless I was certain the compiler could and would convert the recursion into a loop. Because of the stack limits and efficiency, people often try to find a looping equivalent for recursions.
Tail recursions can always be converted to loops. However, there are recursions that can't be converted. As an example, I work with statistical design of experiments. Sometimes a large design is constructed by "crossing" several smaller sub-designs. Crossing is where you concatenate every row of a second design to each row of the first. For two sub-designs, all this needs is simple nested looping, but for three or more designs you need to increase the level of nesting, adding one level of nesting for each additional sub-design. So while this is nested looping in principle, in practice the amount of nesting is variable. If you tried to implement it with looping you'd have to revise your program to add/subtract nested loops every time you were dealing with a different number of sub-designs to be crossed, so you can't write an immutable loop-based version. This can easily be implemented with recursion. In this case, I'm happy to trade a slight amount of efficiency, because I wrote and debugged the code 6 years ago and haven't had to revise it since, despite creating lots of crossed designs of varying complexity since then.
One way to think through this is that the choice for recursion or iteration depends on how you think about the problem being solved. Certain "ways of thinking" lead more naturally to recursive solutions, and other ways of thinking lead to more iterative solutions. For any problem, you can in principle think in a way that gives you a recursive solution or a way that gives you an iterative solution. (Sometimes the iterative solution will just end up simulating a recursion stack, but there is no actual recursion there.)
Here's an example. You have an array of integers (positive or negative), and you want to find the maximum segment sum. A segment is a piece of the array that is contiguous. So in the array [3, -4, 2, 1, -2, 4], the maximum segment sum is 5, and you get that from the segment [2, 1, -2, 4]; its sum is 5.
OK - so how might we solve this problem? One thing you might do is reason like this: "if I knew the maximum segment sum in the left half, and the maximum segment sum in the right half, then maybe I could somehow jam those together and figure out the maximum segment sum overall". This idea would require you to find the maximum segment sum on the two subhalves, and this is a smaller instance of the original problem. This is recursion, and a direct translation of this idea into code would therefore be recursive.
But the maximum segment sum problem isn't "recursive" or "iterative" -- it can be both, depending on how you think about the solution. I gave a recursive thought process above. Here is an iterative process: "well, if I add up the elements in each of the segments that start at some index i and end at some index j, I can just take the maximum of these to solve the problem". And directly trying to code this approach would give you triply nested loops (and a bad mark on an assignment because it's horribly inefficient!).
So, the same problem, depending on how the problem is conceptualized, can lead to a recursive or iterative solution. Now, I happened to choose a problem where there are many ways of solving it, and where there are reasonable recursive and iterative solutions. Some problems, however, admit only one type of solution, and that solution may be most naturally implemented using recursion or iteration. For example, if I asked you to write a function that keeps asking the user to enter a letter until they enter y or n, you might start thinking: "keep repeating the prompt and asking for input..." and before you know it you have some iterative code. Perhaps you might instead think recursively: "if the user enters y or n, I am done; otherwise ask the user for y or n"... in which case you'd generate a recursive algorithm. But the recursion here doesn't give you much: it unnecessarily uses a stack and doesn't make the program any faster. (Recursion sometimes makes it easier to prove correctness, in which case you might present something recursively even though you could alternately give a reasonable iterative solution.)

Is it possible to allow mismatches in KMP algorithm?

I am looking for an efficient algorithm to allow mismatches (at most 3) when comparing a pattern with a text. Original KMP does this job efficiently on my data but was considering this to extend this algo to accommodate for mismatches.
For my case: GACCCT is considered a match with GGGGGAGGTTTTTT with start position 4 in second sequence
I need to do pairwise comparison between two files. Each contains approximately 500,000 sequences. Sequences in one file is relatively short (~50 bases) while in other is longer (~200)
I tried Regex package in python, Levenshtein algorithm and edit distances. But they are slow and I will have to wait for couple of weeks to get the work done.
I think your data isn't too large, so maybe this will work:
I think you should create a suffix tree for your data. Once you do this, finding substrings will be very easy, whether or not you want to count mismatches: you just traverse the tree with the characters you're looking for, until you've either found a substring, or hit the most number of mismatches you can tolerate.
If you want at most three mismatches, there's a simple but kind of daft algorithm that'll work on most real cases. Break your pattern into four contiguous parts arbitrarily. (It is probably useful for them to match a random text location with roughly the same probability.) Find all matches in the text of your four contiguous parts. See which of those completes to an at-most-three-mismatches match by brute force.
Mehrdad's solution of using a suffix tree is better in general, but it requires more programming effort.

Randomness in Artificial Intelligence & Machine Learning

This question came to my mind while working on 2 projects in AI and ML. What If I'm building a model (e.g. Classification Neural Network,K-NN, .. etc) and this model uses some function that includes randomness. If I don't fix the seed, then I'm going to get different accuracy results every time I run the algorithm on the same training data. However, If I fix it then some other setting might give better results.
Is averaging a set of accuracies enough to say that the accuracy of this model is xx % ?
I'm not sure If this is the right place to ask such a question/open such a discussion.
Simple answer, yes, you randomize it and use statistics to show the accuracy. However, it's not sufficient to just average a handful of runs. You need, at a minimum, some notion of the variability as well. It's important to know whether "70%" accurate means "70% accurate for each of 100 runs" or "100% accurate once and 40% accurate once".
If you're just trying to play around a bit and convince yourself that some algorithm works, then you can just run it 30 or so times and look at the mean and standard deviation and call it a day. If you're going to convince anyone else that it works, you need to look into how to do more formal hypothesis testing.
There are models which are naturally dependent on randomness (e.g., random forests) and models which only use randomness as part of exploring the space (e.g., initialisation of values for neural networks), but actually have a well-defined, deterministic, objective function.
For the first case, you will want to use multiple seeds and report average accuracy, std. deviation, and the minimum you obtained. It is often good if you have a way to reproduce this, so just use multiple fixed seeds.
For the second case, you can always tell, just on the training data, which run is best (although it might actually not be the one which gives you the best test accuracy!). Thus, if you have the time, it is good to do say, 10 runs, and then evaluate on the one with the best training error (or validation error, just never evaluate on testing for this decision). You can go a level up and do multiple multiple runs and get a standard deviation too. However, if you find that this is significant, it probably means you weren't trying enough initialisations or that you are not using the right model for your data.
Stochastic techniques are typically used to search very large solution spaces where exhaustive search is not feasible. So it's almost inevitable that you will be trying to iterate over a large number of sample points with as even a distribution as possible. As mentioned elsewhere, basic statistical techniques will help you determine when your sample is big enough to be representative of the space as a whole.
To test accuracy, it is a good idea to set aside a portion of your input patterns and avoid training against those patterns (assuming you are learning from a data set). Then you can use the set to test whether your algorithm is learning the underlying pattern correctly, or whether it's simply memorizing the examples.
Another thing to think about is the randomness of your random number generator. Standard random number generators (such as rand from <stdlib.h>) may not make the grade in many cases so look around for a more robust algorithm.
I generalize the answer from what i get of your question,
I suppose Accuracy is always average accuracy of multiple runs and the standard deviation. So if you are considering accuracy you get using different seeds to the random generator, are you not actually considering a greater range of input (which should be a good thing). But you have to consider the Standard deviation to consider the accuracy. Or did i get your question it totally wrong ?
I believe cross-validation may give you what you ask about: an averaged, and therefore more reliable, estimate of classification performance. It contains no randomness, except in permuting the data set initially. The variation comes from choosing different train/test splits.

The limits of parallelism (job-interview question)

Is it possible to solve a problem of O(n!) complexity within a reasonable time given infinite number of processing units and infinite space?
The typical example of O(n!) problem is brute-force search: trying all permutations (ordered combinations).
It sure is. Consider the Traveling Salesman Problem in it's strict NP form: given this list of costs for traveling from each point to each other point, can you put together a tour with cost less than K? With the new infinite-core CPU from Intel, you just assign one core to each possible permutation, and add up the costs (this is fast), and see if any core flags a success.
More generally, a problem in NP is a decision problem such that a potential solution can be verified in polynomial time (i.e., efficiently), and so (since the potential solutions are enumerable) any such problem can be efficiently solved with sufficiently many CPUs.
It sounds like what you're really asking is whether a problem of O(n!) complexity can be reduced to O(n^a) on a non-deterministic machine; in other words, whether Not-P = NP. The answer to that question is no, there are some Not-P problems that are not NP. For example, a limited halting problem (that asks if a program halts in at most n! steps).
The problem would be distributing the work and collecting the results.
If all the CPUs can read the same piece of memory at once, and if each one has a unique CPU-ID that is known to it, then the ID may be used to select a permutation, and the distribution problem is solveable in constant time.
Gathering the results would be tricky, though. Each CPU could compare with its (numerical) neighbor, and then that result compared to the result of the two closest neighbors, etc. This will be a O(log(n!)) process. I don't know for sure, but I suspect that O(log(n!)) is hyperpolynomial, so I don't think that's a solution.
No, N! is even higher than NP. Thinking unlimited parallelism could solve NP problem in polynomial time, which is usually considered as a "reasonable" time complexity, N! problem is still higher than polynomial on such a setup.
You mentioned search as a "typical" problem, but were you actually asked specifically about a search problem? If so, then yes, search is typically parallelizable, but as far as I can tell O(n!) in principle does not imply the degree of concurrency available, does it? You could have a completely serial O(n!) problem, which means infinite computers won't help. I once had an unusual O(n^4) problem that actually was completely serial.
So, available concurrency is the first thing, and IMHO you should get points for bringing up Amdahl's law in an interview. Next potential pitfall is inter-processor communication, and in general the nature of the algorithm. Consider, for example, this list of application classes: http://view.eecs.berkeley.edu/wiki/Dwarf_Mine. FWIW the O(n^4) code I mentioned earlier sort of falls into the FSM category.
Another somewhat related anecdote: I've heard an engineer from a supercomputer vendor claim that if 10% of their CPU time were being spent in MPI libraries, they consider the parallelization a solid success (though that may have just been limited to codes in the computational chemistry domain).
If the problem is one of checking permutations/answers to a problem of complexity O(n!), then of course you can do it efficiently with an infinite number of processors.
The reason is that you can easily distribute atomic pieces of the problem (an atomic piece of the problem might, say, be one of the permutations to check) with logarithmic efficiency.
As a simple example, you could set up the processors as a 'binary tree', so to speak. You could be at the root, and have the processors deliver permutations of the problem (or whatever the smallest pieces of the problem might be) to the leaf processors to solve, and you'd end up solving the problem in log(n!) time.
Remember it's the delivery of the permutations to the processors that takes a long time. Each part of the problem itself will actually be solved instantly.
Edit: Fixed my post according to the comments below.
Sometimes the correct answer is, "How many times does this come up with your code base?" but in this case, there is a real answer.
The correct answer is no, because not all problems can be solved using perfect parallel processing. For example, a travelling salesman-like problem must commit to one path for the second leg of the journey to be considered.
Assuming a fully connected matrix of cities, should you want to display all possible non-cyclic routes for our weary salesman, you're stuck with a O(n!) problem, which can be decomposed to an O(n)*O((n-1)!) problem. The issue is that you need to commit to one path (on the O(n) side of the equation) before you can consider the remaining paths (on the O((n-1)!) side of the equation).
Since some of the computations must be performed prior to other computations, then there is no way to scatter the results perfectly in a single scatter / gather pass. That means the solution will be waiting on the results of calculations which must come before the "next" step can be started. This is the key, as the need for prior partial solutions provide a "bottle neck" in the ability to proceed with the computation.
Since we've proven we can make a number of these infinitely fast, infinitely numerous, CPUs wait (even if they are waiting on themselves), we know that the runtime cannot be O(1), and we only need to pick a very large N to guarantee an "unacceptable" run time.
This is like asking if an infinite number of monkeys typing on a monkey-destruction proof computer with a word-processor can come up with all the works of Shakespeare; given an infinite amount of time. The realist would say not since the conditions are no physically possible. The idealist will say yes; in theory it can happen. Since Software Engineering (Software Engineering, not Computer Science) focuses on real system we can see and touch, then the answer is no. If you doubt me, then go build it and prove me wrong! IMHO.
Disregarding the cost of setup (whatever that might be...assigning a range of values to a processing unit, for instance), then yes. In such a case, any value less than infinity could be solved in one concurrent iteration across an equal number of processing units.
Setup, however, is something significant to disregard.
Each problem could be solved by one CPU, but who would deliver these jobs to all infinite CPU's? In general, this task is centralized, so if we have infinite jobs to deliver to all infinite CPU's, we could take infinite time to do so.

Resources