Minimum Number of Operations needed

Minimum Number of Operations needed - c

I have a problem, suppose I have a given string: "best", the target string is suppose: "beast". Then I have to determine the number of operations to convert the given string to the target string, however the operations allowed are:
1. add a character to string.
2. delete a character.
3. swap two char positions. (should be used wisely, we have only one chance to swap.)
In above case it is 1.
How do we solve such kind of problem, and what kind of problem is it?
I am a newbie learner.

One widely-used measure of this kind of thing is called the Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance
The WP page also mentions/links to other similar concepts. It is essentially a metric of the number of edits needed to turn one word into another.

Levenshtein distance

Related

Find linearity in a graph

I am doing automation for a project and the results I get is in the form of a graph wherein I take the performance results.
Now the performance results which I take is generally at a straight line from the graph.
For example lets say the results from the graph in a List could be like this:
10, 30,90,100, 150,200,250,300,350,400,450,800,1000,1500,2000,2010,2006,2004,2000,1900,1800,1700, 1600,1000,500,400,0.
As you see the performance of the device starts increasing and then at a certain point it remains linear and with failures it starts dropping.
The point I want to take is the linear line.
As you can see in the list of numbers we see that from (2000,2010,2006,2004,2000) there is some kind of a linear line.
I am not asking for any code or Algorithm to solve this....I do not need an answer. If anyone can just give me a hint or a little clue I will try to do the rest.

Do you mean constant or linear?
If you mean linear:
Why not take the differences of adjacent values and search for a sequence that stays close to constant?
If you mean constant:
Why not take the differences of adjacent values and search for a sequence that stays close to 0?

First decide on the absolute or relative tolerance you can handle, that decides what is a straight line.
Then iterate trough the array checking the value of a point with the next point, if they are within tolerance, continue iterating until you get a point that is not and store those points. They represent a straight line.
This solution is very simple, not perfect and takes O(n) time.

How to find that two words differ by how much distance>> Is there any shortest way for this

I have read about Levenshtein distance about the calculation of the distance between the two distinct words.
I have one source string and i have to match it with all 10,000 target words. The closest word should be returned.
The problem is I have given a list of 10,000 target words, and input source words is also huge.... So what shortest and efficient algorithm to apply here. Levenshtein distance calculation for each n every combination(brute force logic) would be very time consuming.
Any hints, or ideas are most welcome.

I guess it depends a little on how the words are structured. For example this guy improved the implementation based on the fact that he processes his words in order and does not repeat calculations for common prefixes. But if all your 10,000 words are totally different that won't do you much good. It's written in python so might be a bit of work involved to port to C.
There are also some kinda homebrew algorithms out there (with which I mean there is no official paper written about it) but that might do the trick.

There's two common approaches for this, and I've blogged about both. The simpler one to implement is BK-Trees - a tree datastructure that speeds lookup based on levenshtein distance by only searching relevant parts of the tree. They'll probably be perfectly sufficient for your use-case.
A more complicated but more efficient approach is Levenshtein Automata. This works by constructing an NFA that recognizes all words within levenshtein distance x of your target string, then iterating through it and the dictionary in lockstep, effectively performing a merge join on them.

finding a number appearing again among numbers stored in a file

Say, i have 10 billions of numbers stored in a file. How would i find the number that has already appeared once previously?
Well i can't just populate billions of number at a stretch in array and then keep a simple nested loop to check if the number has appeared previously.
How would you approach this problem?
Thanks in advance :)

I had this as an interview question once.
Here is an algorithm that is O(N)
Use a hash table. Sequentially store pointers to the numbers, where the hash key is computed from the number value. Once you have a collision, you have found your duplicate.
Author Edit:
Below, #Phimuemue makes the excellent point that 4-byte integers have a fixed bound before a collision is guaranteed; that is 2^32, or approx. 4 GB. When considered in the conversation accompanying this answer, worst-case memory consumption by this algorithm is dramatically reduced.
Furthermore, using the bit array as described below can reduce memory consumption to 1/8th, 512mb. On many machines, this computation is now possible without considering either a persistent hash, or the less-performant sort-first strategy.
Now, longer numbers or double-precision numbers are less-effective scenarios for the bit array strategy.
Phimuemue Edit:
Of course one needs to take a bit "special" hash table:
Take a hashtable consisting of 2^32 bits. Since the question asks about 4-byte-integers, there are at most 2^32 different of them, i.e. one bit for each number. 2^32 bit = 512mb.
So now one has just to determine the location of the corresponding bit in the hashmap and set it. If one encounters a bit which already is set, the number occured in the sequence already.

The important question is whether you want to solve this problem efficiently, or whether you want accurately.
If you truly have 10 billion numbers and just one single duplicate, then you are in a "needle in the haystack" type of situation. Intuitively, short of very grimy and unstable solution, there is no hope of solving this without storing a significant amount of the numbers.
Instead, turn to probabilistic solutions, which have been used in most any practical application of this problem (in network analysis, what you are trying to do is look for mice, i.e., elements which appear very infrequently in a large data set).
A possible solution, which can be made to find exact results: use a sufficiently high-resolution Bloom filter. Either use the filter to determine if an element has already been seen, or, if you want perfect accuracy, use (as kbrimington suggested you use a standard hash table) the filter to, eh, filter out elements which you can't possibly have seen and, on a second pass, determine the elements you actually see twice.
And if your problem is slightly different---for instance, you know that you have at least 0.001% elements which repeat themselves twice, and you would like to find out how many there are approximately, or you would like to get a random sample of such elements---then a whole score of probabilistic streaming algorithms, in the vein of Flajolet & Martin, Alon et al., exist and are very interesting (not to mention highly efficient).

Read the file once, create a hashtable storing the number of times you encounter each item. But wait! Instead of using the item itself as a key, you use a hash of the item iself, for example the least significant digits, let's say 20 digits (1M items).
After the first pass, all items that have counter > 1 may point to a duplicated item, or be a false positive. Rescan the file, consider only items that may lead to a duplicate (looking up each item in table one), build a new hashtable using real values as keys now and storing the count again.
After the second pass, items with count > 1 in the second table are your duplicates.
This is still O(n), just twice as slow as a single pass.

How about:
Sort input by using some algorith which allows only portion of input to be in RAM. Examples are there
Seek duplicates in output of 1st step -- you'll need space for just 2 elements of input in RAM at a time to detect repetitions.

Finding duplicates
Noting that its a 32bit integer means that you're going to have a large number of duplicates, since a 32 bit int can only represent 4.3ish billion different numbers and you have "10 billions".
If you were to use a tightly packed set you could represent whether all the possibilities are in 512 MB, which can easily fit into current RAM values. This as a start pretty easily allows you to recognise the fact if a number is duplicated or not.
Counting Duplicates
If you need to know how many times a number is duplicated you're getting into having a hashmap that contains only duplicates (using the first 500MB of the ram to tell efficiently IF it should be in the map or not). At a worst case scenario with a large spread you're not going to be able fit that into ram.
Another approach if the numbers will have an even amount of duplicates is to use a tightly packed array with 2-8 bits per value, taking about 1-4GB of RAM allowing you to count up to 255 occurrances of each number.
Its going to be a hack, but its doable.

You need to implement some sort of looping construct to read the numbers one at a time since you can't have them in memory all at once.
How? Oh, what language are you using?

You have to read each number and store it into a hashmap, so that if a number occurs again, it will automatically get discarded.

If possible range of numbers in file is not too large then you can use some bit array to indicate if some of the number in range appeared.

If the range of the numbers is small enough, you can use a bit field to store if it is in there - initialize that with a single scan through the file. Takes one bit per possible number.
With large range (like int) you need to read through the file every time. File layout may allow for more efficient lookups (i.e. binary search in case of sorted array).

If time is not an issue and RAM is, you could read each number and then compare it to each subsequent number by reading from the file without storing it in RAM. It will take an incredible amount of time but you will not run out of memory.

I have to agree with kbrimington and his idea of a hash table, but first of all, I would like to know the range of the numbers that you're looking for. Basically, if you're looking for 32-bit numbers, you would need a single array of 4.294.967.296 bits. You start by setting all bits to 0 and every number in the file will set a specific bit. If the bit is already set then you've found a number that has occurred before. Do you also need to know how often they occur?Still, it would need 536.870.912 bytes at least. (512 MB.) It's a lot and would require some crafty programming skills. Depending on your programming language and personal experience, there would be hundreds of solutions to solve it this way.

Had to do this a long time ago.
What i did... i sorted the numbers as much as i could (had a time-constraint limit) and arranged them like this while sorting:
1 to 10, 12, 16, 20 to 50, 52 would become..
[1,10], 12, 16, [20,50], 52, ...
Since in my case i had hundreds of numbers that were very "close" ($a-$b=1), from a few million sets i had a very low memory useage
p.s. another way to store them
1, -9, 12, 16, 20, -30, 52,
when i had no numbers lower than zero
After that i applied various algorithms (described by other posters) here on the reduced data set

#include <stdio.h>
#include <stdlib.h>
/* Macro is overly general but I left it 'cos it's convenient */
#define BITOP(a,b,op) \
((a)[(size_t)(b)/(8*sizeof *(a))] op (size_t)1<<((size_t)(b)%(8*sizeof *(a))))
int main(void)
{
unsigned x=0;
size_t *seen = malloc(1<<8*sizeof(unsigned)-3);
while (scanf("%u", &x)>0 && !BITOP(seen,x,&)) BITOP(seen,x,|=);
if (BITOP(seen,x,&)) printf("duplicate is %u\n", x);
else printf("no duplicate\n");
return 0;
}

This is a simple problem that can be solved very easily (several lines of code) and very fast (several minutes of execution) with the right tools
my personal approach would be in using MapReduce
MapReduce: Simplified Data Processing on Large Clusters
i'm sorry for not going into more details but once getting familiar with the concept of MapReduce it is going to be very clear on how to target the solution
basicly we are going to implement two simple functions
Map(key, value)
Reduce(key, values[])
so all in all:
open file and iterate through the data
for each number -> Map(number, line_index)
in the reduce we will get the number as the key and the total occurrences as the number of values (including their positions in the file)
so in Reduce(key, values[]) if number of values > 1 than its a duplicate number
print the duplicates : number, line_index1, line_index2,...
again this approach can result in a very fast execution depending on how your MapReduce framework is set, highly scalable and very reliable, there are many diffrent implementations for MapReduce in many languages
there are several top companies presenting already built up cloud computing environments like Google, Microsoft azure, Amazon AWS, ...
or you can build your own and set a cluster with any providers offering virtual computing environments paying very low costs by the hour
good luck :)
Another more simple approach could be in using bloom filters
AdamT

Implement a BitArray such that ith index of this array will correspond to the numbers 8*i +1 to 8*(i+1) -1. ie first bit of ith number is 1 if we already had seen 8*i+1. Second bit of ith number is 1 if we already have seen 8*i + 2 and so on.
Initialize this bit array with size Integer.Max/8 and whenever you saw a number k, Set the k%8 bit of k/8 index as 1 if this bit is already 1 means you have seen this number already.

number combination algorithm

Write a function that given a string of digits and a target value, prints where to put +'s and *'s between the digits so they combine exactly to the target value. Note there may be more than one answer, it doesn't matter which one you print.
Examples:
"1231231234",11353 -> "12*3+1+23*123*4"
"3456237490",1185 -> "3*4*56+2+3*7+490"
"3456237490",9191 -> "no solution"

If you have an N digit value, there are N-1 possible slots for the + or * operators. So brute force, there are 3^(N-1) possibilities. Testing all of these are inefficient.
BUT
Your examples are all 10 digits. 3^9 = 19683, so brute force is FINE! No need to get any fancier.
So all you need to do is iterate through all 19683 cases, each time building a string for that case, and evaluating the expression. Evaluating the expression is a straightforward task. Iterating is straightforward (just use an incrementing value, you can read the state of the first slot by (i%3), which gives you "no operator" "+" or "*", the state of the second slot is (i/3)%3, the state of the third slot is (i/9)%3 and so on.)
Even with crude parsing code, CPUs are fast.
The brute force option starts becoming ugly after about 20 digits, and you'd have to switch to be more clever.

If this is for the gaming programmer position, do not use the brute force approach. I did that but failed this a couple of years ago. Later heard from someone inside that dynamic programming approach is the one that gets the job.

This can be solved either by backtracking or by dynamic programming.

The "cleverer" approach (using dynamic programming) is basically this:
For each substring of the original string, figure out all possible values it can create. (e.g. in your first example "12" can become either 1+2=3 or 1*2=2)
There may be a lot of different combinations, but many of them will be duplicates. (Also, you should ignore all combinations that are greater than the target).
Thus, when you add a "+" or a "*", you can envision it as combining two substrings of the string. (and since you have the possible values for each substring, you can see if such a combination is possible)
These values can be generated similarly: try splitting the substring in all possible ways, and combine the different values in each half of the substring.
The total number of "states", then, is something like |S|^2 * target - for your example case, it's worse than the brute-force method. But if you had a string of length 1000 and a target of say 5000, then the problem would be solvable with dynamic programming.

Google Code Jam had an extended version of this problem last year (in Round 1C), called Ugly Numbers. You can visit that link and click "Contest Analysis" for some approaches to that problem, when extended to large numbers of digits.

Similarity between line strings

I have a number of tracks recorded by a GPS, which more formally can be described as a number of line strings.
Now, some of the recorded tracks might be recordings of the same route, but because of inaccurasies in the GPS system, the fact that the recordings were made on separate occasions and that they might have been recorded travelling at different speeds, they won't match up perfectly, but still look close enough when viewed on a map by a human to determine that it's actually the same route that has been recorded.
I want to find an algorithm that calculates the similarity between two line strings. I have come up with some home grown methods to do this, but would like to know if this is a problem that's already has good algorithms to solve it.
How would you calculate the similarity, given that similar means represents the same path on a map?
Edit: For those unsure of what I'm talking about, please look at this link for a definition of what a line string is: http://msdn.microsoft.com/en-us/library/bb895372.aspx - I'm not asking about character strings.

Compute the Fréchet distance on each pair of tracks. The distance can be used to gauge the similarity of your tracks.
Math alert: Fréchet was a pioneer in the field of metric space which is relevant to your problem.

I would add a buffer around the first line based on the estimated probable error, and then determine if the second line fits entirely within the buffer.

To determine "same route," create the minimal set of normalized path vectors, calculate the total power differences and compare the total to a quality measure.
Normalize the GPS waypoints on total path length,
walk the vectors of the paths together, creating a new set of path vectors for each path based upon the shortest vector at each waypoint,
calculate the total power differences between endpoints of each vector in the normalized paths weighting for vector length, and
compare against a quality measure.
Tune the power of the differences (start with, say, squared differences) and the quality measure (say as a percent of the total power differences) visually. This algorithm produces a continuous quality measure of the path match as well as a binary result (Are the paths the same?)
Paul Tomblin said: I would add a buffer
around the first line based on the
estimated probable error, and then
determine if the second line fits
entirely within the buffer.
You could modify the algorithm as the normalized vector endpoints are compared. You could determine if any endpoint difference was above a certain size (implementing Paul's buffer idea) or perhaps, if the endpoints were outside the "buffer," use that fact to ignore that endpoint difference, allowing a comparison ignoring side trips.

You could walk along each point (Pa) of LineString A and measure the distance from Pa to the nearest line-segment of LineString B, averaging each of these distances.
This is not a quick or perfect method, but should be able to give use a useful number and is pretty quick to implement.
Do the line strings start and finish at similar points, or are they of very different extents?

If you consider a single line string to be a sequence of [x,y] points (or [x,y,z] points), then you could compute the similarity between each pair of line strings using the Needleman-Wunsch algorithm. As described in the referenced Wikipedia article, the Needleman-Wunsch algorithm requires a "similarity matrix" which defines the distance between a pair of points. However, it would be easy to use a function instead of a matrix. In your case you could simply use the 2D Euclidean distance function (or a 3D Euclidean function if your points have elevation) to provide the distance between each pair of points.

I actually side with the person (Aaron F) who said that you might be interested in the Levenshtein distance problem (and cited this). His answer seems to me to be the best so far.
More specifically, Levenshtein distance (also called edit distance), does not measure strictly the character-by-character distance, but also allows you to perform insertions and deletions. The best algorithm for this distance measure can be computed in quadratic time (pretty slow if your strings are long), but the computational biologists have pretty good heuristics for this, that might be of interest to you on their own. Check out BLAST and FASTA.
In your problem, it seems that you are dealing with differences between strings of numbers, and you care about the numbers. If you give more information, I might be able to direct you to the right variant of BLAST/FASTA/etc for your purposes. In any case, you might consider adapting BLAST and FASTA for your needs. They're quite simple.
1: http://en.wikipedia.org/wiki/Levenshtein_distance, http://www.nist.gov/dads/HTML/Levenshtein.html