How should I write this generic algorithm [closed] - c

Suppose you have two strings. Each string has lines, separated by a newline character. Now you want to compare both strings and find the best method (the shortest number of steps), using only additions or deletions of lines in one string, to transform the second string into the first string.
For example,
string #2:
abc
def
efg
hello
123
and string #1:
abc
def
efg
adc
123
The best (fewest-steps) solution to transform string #2 into string #1 would be:
remove the line at position 3 ('hello')
add 'adc' at position 3
How would one write a generic algorithm to find the quickest, least-steps solution for transforming one string into another, given that you can only add or remove lines?

This is a classic problem.
For a given set of allowed operations the edit distance between two strings is the minimal number of operations required to transform one into the other.
When the set of allowed operations consists of insertion and deletion only, it is known as the longest common subsequence edit distance.
You'll find everything you need to compute this distance in Longest common subsequence problem.
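For concreteness, here is a minimal Python sketch of that approach: it computes the LCS of the two line sequences with the standard dynamic program and backtracks to produce an add/delete script. The function names and output format are only illustrative.

```python
def lcs_table(a, b):
    """DP table where dp[i][j] = length of LCS of a[:i] and b[:j]."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp

def edit_script(src, dst):
    """Backtrack through the table to produce delete/add steps on lines."""
    a, b = src.splitlines(), dst.splitlines()
    dp = lcs_table(a, b)
    ops, i, j = [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            i, j = i - 1, j - 1                      # line is common, keep it
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            ops.append(("add", j - 1, b[j - 1]))     # line only in the target
            j -= 1
        else:
            ops.append(("delete", i - 1, a[i - 1]))  # line only in the source
            i -= 1
    return list(reversed(ops))

s2 = "abc\ndef\nefg\nhello\n123"
s1 = "abc\ndef\nefg\nadc\n123"
print(edit_script(s2, s1))
# -> [('delete', 3, 'hello'), ('add', 3, 'adc')]
```

Every source line not in the LCS is deleted and every target line not in the LCS is added, so the script length is len(src) + len(dst) - 2 * LCS, which is minimal for insert/delete-only editing.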

Note that to answer this question fully, one would have to thoroughly cover the huge subject of graph similarity search / graph edit distance, which I will not do here. I will, however, point you in directions where you can study the problem more thoroughly on your own.
... to find the quickest, least-steps solution for transforming one string into another ...
This is a quite common problem known as the (minimum) edit distance problem (originally 'The String-to-String Correction Problem' by R. Wagner and M. Fischer). Finding the optimal (minimum, i.e. fewest-steps) edit distance, which is what you ask for in your question, is a non-trivial problem.
See e.g.:
https://en.wikipedia.org/wiki/Edit_distance
https://web.stanford.edu/class/cs124/lec/med.pdf
The minimum edit distance problem for string similarity is in itself a subclass of the more general minimum graph edit distance problem, or graph similarity search (since any string or even sequenced object, as you have noted yourself, can be represented as a graph), see e.g. A survey on graph edit distance.
For details regarding this problem here on SO, refer to e.g. Edit Distance Algorithm and Faster edit distance algorithm.
This should get you started.
I'd tag this rather as a math problem (algorithmic instructions) than a language-specific one, unless someone can point you to an existing C library for solving edit distance problems.

The fastest way would be to remove all sub-strings, then append (not insert) all new sub-strings; and to do "all sub-strings at once" if you can (possibly leading to a destPointer = sourcePointer approach).
The overhead of minimising the number of sub-strings removed and inserted will be higher than removing and inserting/appending without checking if it's necessary. It's like spending $100 to pay a consultant to determine if you should spend $5.

Related

How do programs that can find dictionary words within anagrams work? [closed]

Examples include http://www.thewordfinder.com/, http://www.anagram-solver.org/, or the various applications for "cheating" at anagram based games such as Words With Friends.
Where would one begin if they were looking to create an application with similar functionality using Swift?
Since this seems to be getting down-voted, can someone tell me where the best place to ask this question is?
Converting a word list into a prefix tree (especially one that uses extra memory to avoid sparse arrays) should allow one to efficiently check all permutations of an input word, since entire branches of the search can be eliminated quickly.
Let's consider a simplified example. Assume our dictionary contains 3 words: aab, abb, and abc. If we convert it to a prefix tree, that would look like:
a
  ab
  b
    b
    c
Where indentation indicates children of the row above. All the words in our dictionary start with an a, followed by either ab or b, and in the latter case followed by either b or c. As you can see, we don't branch on every letter, but rather on alternatives. This helps keep the size of the tree down.
Now, if we iterate over all permutations of our input word cba:
abc
acb
bac
bca
cab
cba
All permutations that start with a letter other than a can be eliminated quickly since our prefix tree has no matching entry at the root level. acb can be eliminated on the second check, meaning only abc would find a match. If you integrate the check into your algorithm for generating permutations, you could match incrementally and avoid generating partial permutations that don't match the current branch of the tree you're exploring.
The comment about avoiding sparse arrays is to avoid having to use binary search to match children of the current branch. Rather, an array of 26 elements, indexed by each letter's offset from 'a' (pick upper or lower case and be consistent), should allow very quick lookups, at the cost of additional memory.
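A rough Python sketch of the idea (the question mentions Swift, but the approach is language-agnostic). It uses a plain per-letter dict-of-dicts trie rather than the compressed tree drawn above, and WORDS is a tiny stand-in for a real word list.

```python
WORDS = ["aab", "abb", "abc"]                # stand-in for a real dictionary file

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True                     # marks a valid end of word
    return root

def anagrams(letters, node, prefix="", found=None):
    """Explore permutations and the trie together, pruning dead branches."""
    if found is None:
        found = set()
    if "$" in node:                          # found a word (add `and not letters`
        found.add(prefix)                    # if only full-length anagrams count)
    for ch in sorted(set(letters)):
        if ch in node:                       # prune: the prefix must exist in the trie
            rest = list(letters)
            rest.remove(ch)
            anagrams(rest, node[ch], prefix + ch, found)
    return found

trie = build_trie(WORDS)
print(anagrams(list("cba"), trie))           # -> {'abc'}
```

The pruning happens in the `if ch in node` check: an entire branch of permutations is abandoned as soon as its prefix is missing from the tree, which is exactly what makes this faster than generating all permutations first.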

When to use quicksort, mergesort, and radix sort? (with regard to quantities rather than paradigms) [closed]

I've seen plenty of questions asking whether quicksort or mergesort is 'better', and when to use each of them, but what I'd like to see is some input on when to use them with regard to the size of the data being sorted.
Let's say I have a number of items, whether they be ints or custom objects. I sort these items.
I see mergesort, in a way, as being the optimal case of quicksort (picking the median as the pivot) at every step, but with some overhead. So at a certain size, when the overhead is negligible in comparison to the consistent optimal nature of mergesort, it would make sense to use it in favor of quicksort.
Radix sort has a 'linear' runtime, provided the number of digits in the keys being sorted on does not approach the number of separate items being sorted. However, to my knowledge radix sort also has a relatively large constant factor on its runtime.
If I recall from some testing in the past, it made sense to use mergesort when the number of items being sorted began to number in the millions, and radix in the high millions/billion range.
Am I reasonably accurate in these assessments? Can someone confirm, deny, or correct them to some extent?
(I'm talking about rather 'simple' implementations of each sort. Also, in the case of radix sort, let's say that the largest single key is no larger than twice the number of items being sorted, e.g. when sorting 4,000,000 items, the largest possible key is 8,000,000.)
edit - I would like some input on the general number ranges in which each of the given sorts is fastest. I provided some in the question, and that may have been a mistake. What I'd like to see in an answer is an opinion on the number ranges. I know quicksort tends to be the default since it's usually 'good enough', doesn't have the space complexity of mergesort, and doesn't come with the worry of malicious data purposefully made with obscenely large keys (radix).
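Since the crossover points depend heavily on hardware, language, and implementation (Python-level constant factors in particular are nothing like C's), any specific number range should be measured rather than trusted. A rough timing harness with deliberately 'simple' implementations of all three sorts, following the question's key-range setup; the sizes are just placeholders:

```python
import random
import time

def quicksort(a):
    """Simple (not in-place) quicksort with a random pivot."""
    if len(a) <= 1:
        return a
    pivot = a[random.randrange(len(a))]
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

def mergesort(a):
    """Simple top-down mergesort."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = mergesort(a[:mid]), mergesort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def radix_sort(a, base=256):
    """Simple LSD radix sort for non-negative integers."""
    if not a:
        return a
    max_key = max(a)
    shift = 1
    while shift <= max_key:
        buckets = [[] for _ in range(base)]
        for x in a:
            buckets[(x // shift) % base].append(x)
        a = [x for bucket in buckets for x in bucket]
        shift *= base
    return a

if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):                  # placeholder sizes
        data = [random.randrange(2 * n) for _ in range(n)]  # keys <= 2n, as in the question
        for name, fn in (("quicksort", quicksort),
                         ("mergesort", mergesort),
                         ("radix", radix_sort)):
            t0 = time.perf_counter()
            result = fn(list(data))
            elapsed = time.perf_counter() - t0
            assert result == sorted(data)
            print(f"n={n:>9,} {name:<10} {elapsed:.3f}s")
```

Treat the printed numbers as machine- and implementation-specific; the point of the harness is the shape of the comparison, not universal thresholds.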

A/B testing sorting algorithm [closed]

I want to make an algorithm that will enable conducting A/B testing over a variable number of subjects with a variable number of properties per subject.
For example, I have 1000 people with the following properties: they come from two departments, some are managers, some are women, etc. These properties may increase or decrease in number according to the situation.
I want to make an algorithm that will split the population in two with the best possible representation of all the properties in both A and B. So I want two groups of 500 people with an equal number from each department in both, an equal number of managers, and an equal number of women. More specifically, I would like to maintain the ratio of each property in both A and B. So if we have 10% managers, I want 10% of sample A and 10% of sample B to be managers.
Any pointers on where to begin? I am pretty sure that such an algorithm exists. I have a gut feeling that this may be unsolvable in some cases as there may be an odd number of managers AND women AND Dept. 1.
Make a list of all combinations of the a/b variables.
Dept1,Manager,Male
Dept1,Manager,Female
Dept1,Junior,Male
...
Dept2,Junior,Female
Go through all the people and assign them to their respective combination. Maybe randomise the order of the people first, just to be sure there is no bias in the order they are added to each combination.
Dept1,Manager,Male-> Person1, Person16, Person143...
Dept1,Manager,Female-> Person7, Person10, Person83...
Have a second process that goes through each combination and assigns half the people to one test group and half to the other. You will need to account for odd numbers of people in a group, but that should be fairly easy to factor in; obviously a larger sample size will reduce the impact of these odd leftovers on the final results.
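A minimal Python sketch of this grouping-and-splitting idea. The property names (dept, manager, female) and the generated people are made up for illustration.

```python
import random
from collections import defaultdict

def ab_split(people, keys, seed=None):
    """Balance every combination of `keys` across groups A and B."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in people:
        strata[tuple(person[k] for k in keys)].append(person)
    group_a, group_b = [], []
    for members in strata.values():
        rng.shuffle(members)                 # remove any ordering bias
        half = len(members) // 2
        group_a.extend(members[:half])       # odd leftovers go to B; alternating them
        group_b.extend(members[half:])       # across strata would balance sizes better
    return group_a, group_b

rng = random.Random(0)
people = [{"id": i,
           "dept": rng.choice(["Dept1", "Dept2"]),
           "manager": rng.random() < 0.10,
           "female": rng.random() < 0.50}
          for i in range(1000)]

a, b = ab_split(people, keys=("dept", "manager", "female"), seed=42)
print(len(a), len(b))
print(sum(p["manager"] for p in a), sum(p["manager"] for p in b))   # ~10% in each group
```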
The algorithm for splitting the groups is simple: take each group of people who have all dimensions in common and assign half to the treatment and half to the control. You don't need to worry about odd numbers of people; whatever statistical test you are using will account for that. If some dimension is heavily skewed (e.g., there are only 2 women in your entire sample), it may be wise to throw the dimension out.
Simple A/B tests usually use a t-test or g-test, but in your case you'd be better off using an ANOVA to determine the significance of the treatment on each of the individual dimensions.

How to understand Locality Sensitive Hashing? [closed]

I noticed that LSH seems like a good way to find similar items with high-dimensional properties.
After reading the paper http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf, I'm still confused by those formulas.
Does anyone know a blog or article that explains it in an easier way?
The best tutorial I have seen for LSH is in the book: Mining of Massive Datasets.
Check Chapter 3 - Finding Similar Items
http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
I also recommend the slides below:
http://www.cs.jhu.edu/%7Evandurme/papers/VanDurmeLallACL10-slides.pdf
The example in the slides helped me a lot in understanding hashing for cosine similarity.
I borrow two slides from Benjamin Van Durme & Ashwin Lall, ACL 2010, and try to explain the intuition behind LSH families for cosine distance a bit.
In the figure, there are two circles, one red and one yellow, representing two two-dimensional data points. We are trying to find their cosine similarity using LSH.
The gray lines are some uniformly randomly picked planes.
Depending on whether a data point lies above or below a gray line, we mark this relation as 0/1.
In the upper-left corner, there are two rows of white/black squares, representing the signatures of the two data points. Each square corresponds to a bit: 0 (white) or 1 (black).
So once you have a pool of planes, you can encode each data point by its position relative to the planes. The more planes in the pool, the more closely the angular difference encoded in the signatures approximates the actual difference, because only the planes that lie between the two points give them different bit values.
Now we look at the signatures of the two data points. As in the example, we use only 6 bits (squares) to represent each data point. This is the LSH hash for the original data.
The Hamming distance between the two hashed values is 1, because their signatures differ by only one bit.
Given the length of the signature, we can calculate their angular similarity as shown in the graph.
I have some sample code (just 50 lines) in Python here which uses cosine similarity.
https://gist.github.com/94a3d425009be0f94751
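Separately from the linked gist, here is a minimal sketch of the random-hyperplane signatures described above, assuming numpy is available. The planes pass through the origin, so the sign of a dot product tells us which side a point falls on.

```python
import numpy as np

def signatures(points, n_planes, seed=0):
    """One bit per plane: 1 if the point is above the plane, 0 if below."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_planes, points.shape[1]))   # random plane normals
    return (points @ planes.T > 0).astype(int)

def estimated_angle(sig_a, sig_b):
    """P(bit differs) = angle / pi, so Hamming distance / bits estimates the angle."""
    return np.mean(sig_a != sig_b) * np.pi

a = np.array([1.0, 0.2])    # the "red" point
b = np.array([0.8, 0.6])    # the "yellow" point
sigs = signatures(np.vstack([a, b]), n_planes=1000)         # more planes -> better estimate
true_angle = np.arccos(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(true_angle, 3), round(estimated_angle(sigs[0], sigs[1]), 3))
```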
Tweets in vector space can be a great example of high dimensional data.
Check out my blog post on applying Locality Sensitive Hashing to tweets to find similar ones.
http://micvog.com/2013/09/08/storm-first-story-detection/
And because a picture is worth a thousand words, check the picture below:
http://micvog.files.wordpress.com/2013/08/lsh1.png
Hope it helps.
#mvogiatzis
Here's a presentation from Stanford that explains it. It made a big difference for me. Part two is more about LSH, but part one covers it as well.
A picture of the overview (there is much more in the slides):
Near Neighbor Search in High Dimensional Data - Part1:
http://www.stanford.edu/class/cs345a/slides/04-highdim.pdf
Near Neighbor Search in High Dimensional Data - Part2:
http://www.stanford.edu/class/cs345a/slides/05-LSH.pdf
LSH is a procedure that takes as input a set of documents/images/objects and outputs a kind of hash table.
The indexes of this table contain the documents, such that documents that land on the same index are considered similar and those on different indexes are "dissimilar".
What counts as similar depends on the metric and on a similarity threshold s, which acts as a global parameter of LSH.
It is up to you to define what the adequate threshold s is for your problem.
It is important to underline that different similarity measures have different implementations of LSH.
In my blog, I tried to thoroughly explain LSH for the cases of minHashing (Jaccard similarity measure) and simHashing (cosine distance measure). I hope you find it useful:
https://aerodatablog.wordpress.com/2017/11/29/locality-sensitive-hashing-lsh/
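As a quick, hedged sketch of the minHashing case only: it uses Python's built-in hash with integer "salts" as a stand-in for a proper family of hash functions (consistent within a single run), plus a toy banding step to show where the hash-table buckets come from.

```python
def minhash_signature(items, n_hashes=100):
    """For each hash function (salt), keep the minimum hash value over the set."""
    return [min(hash((salt, x)) for x in items) for salt in range(n_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates |A intersect B| / |A union B|."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def band_keys(sig, n_bands=20):
    """Each band of the signature becomes a bucket key in the LSH hash table."""
    rows = len(sig) // n_bands
    return {(i, tuple(sig[i * rows:(i + 1) * rows])) for i in range(n_bands)}

doc_a = set("locality sensitive hashing for near duplicate detection".split())
doc_b = set("locality sensitive hashing for similar item detection".split())
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)

print(len(doc_a & doc_b) / len(doc_a | doc_b))    # true Jaccard similarity
print(estimated_jaccard(sig_a, sig_b))            # estimate from the signatures
print(bool(band_keys(sig_a) & band_keys(sig_b)))  # candidate pair if any band matches
```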
I am a visual person. Here is what works for me as an intuition.
Say each of the things you want to search for approximately are physical objects such as an apple, a cube, a chair.
My intuition for an LSH is that it is similar to taking the shadows of these objects. If you take the shadow of a 3D cube you get a square-like 2D shape on a piece of paper, and a 3D sphere will give you a circle-like shadow on a piece of paper.
Eventually, there are many more than three dimensions in a search problem (where each word in a text could be one dimension) but the shadow analogy is still very useful to me.
Now we can efficiently compare strings of bits in software. A fixed length bit string is kinda, more or less, like a line in a single dimension.
So with an LSH, I project the shadows of objects eventually as points (0 or 1) on a single fixed length line/bit string.
The whole trick is to take the shadows such that they still make sense in the lower dimension e.g. they resemble the original object in a good enough way that can be recognized.
A 2D drawing of a cube in perspective tells me this is a cube. But without perspective I cannot easily distinguish a 2D square from the shadow of a 3D cube: they both look like a square to me.
How I present my object to the light will determine if I get a good recognizable shadow or not. So I think of a "good" LSH as the one that will turn my objects in front of a light such that their shadow is best recognizable as representing my object.
So to recap: I think of things to index with an LSH as physical objects like a cube, a table, or chair, and I project their shadows in 2D and eventually along a line (a bit string). And a "good" LSH "function" is how I present my objects in front of a light to get an approximately distinguishable shape in the 2D flatland and later my bit string.
Finally, when I want to check whether an object I have is similar to some objects that I indexed, I take the shadows of this "query" object in the same way I presented my other objects in front of the light (eventually ending up with a bit string too). And now I can compare how similar that bit string is to all my other indexed bit strings, which is a proxy for searching over my whole objects, provided I found a good and recognizable way to present my objects to my light.
As a very short, tldr answer:
An example of locality sensitive hashing could be to first set planes randomly (with a rotation and offset) in your space of inputs to hash, then drop the points to hash into that space, and for each plane record whether the point is above or below it (e.g. 0 or 1); the resulting sequence of bits is the hash. Points that are close in the space (under cosine distance) will then tend to have similar hashes.
You could read this example using scikit-learn: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer

How do spell checkers work? [closed]

I need to implement a spell checker in C. Basically, I need all the standard operations... I need to be able to spell check a block of text, make word suggestions and dynamically add new words to the index.
I'd kind of like to write this myself, though I really don't know where to begin.
Read up on Tree Traversal. The basic concept is as follows:
Read a dictionary file into memory (this file contains the entire list of correctly spelled words that are possible/common for a given language). You can download free dictionary files online, such as Oracle's example dictionary.
Parse this dictionary file into a search tree to make the actual text search as efficient as possible. I won't describe all of the dirty details of this type of tree structure, but the tree will be made up of nodes which have (up to) 26 links to child nodes (one for each letter), plus a flag to indicate whether or not the current node is the end of a valid word.
Loop through all of the words in your document, and check each one against the search tree. If you reach a node in the tree where the next letter in the word is not a valid child of the current node, the word is not in the dictionary. Also, if you reach the end of your word, and the "valid end of word" flag is not set on that node, the word is not in the dictionary.
If a word is not found in the dictionary, inform the user. At this stage, you can also suggest alternate spellings, but that gets a tad more complicated. You will have to loop through each character in the word, substituting alternate characters and test each of them against the search tree. There are probably more efficient algorithms for finding the recommended words, but I don't know what they are.
A really short example:
Dictionary: apex apple appoint appointed
Tree: (* indicates valid end of word)
A -> P -> E -> X*
     \-> P -> L -> E*
          \-> O -> I -> N -> T* -> E -> D*
update: Thank you to Curt Sampson for pointing out that this data structure is called a Patricia Tree.
Document: apple appint ape
Results:
"apple" will be found in the tree, so it is considered correct.
"appint" will be flagged as incorrect. Traversing the tree, you will follow A -> P -> P, but the second P does not have an I child node, so the search fails.
"ape" will also fail, since the E node in A -> P -> E does not have the "valid end of word" flag set.
edit: For more details on spelling suggestions, look into Levenshtein Distance, which measures the smallest number of changes that must be made to convert one string into another. The best suggestions would be the dictionary words with the smallest Levenshtein Distance to the incorrectly spelled word.
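A minimal Python sketch of the lookup described above, using a plain per-letter trie rather than the compressed (Patricia-style) tree drawn in the example, with the same four-word dictionary and test document.

```python
DICTIONARY = ["apex", "apple", "appoint", "appointed"]

def build_tree(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["end"] = True                 # the "valid end of word" flag
    return root

def is_correct(word, root):
    node = root
    for ch in word:
        if ch not in node:                 # no child for the next letter
            return False
        node = node[ch]
    return node.get("end", False)          # must stop on a flagged node

tree = build_tree(DICTIONARY)
for word in "apple appint ape".split():
    print(word, is_correct(word, tree))
# apple True; appint False (no 'i' after "app"); ape False (no end-of-word flag)
```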
Given you don't know where to begin, I'd suggest using an existing solution. See, for example, aspell
(LGPL licensed). If you really have to implement it yourself, please tell us why.
One should look at prefixes and suffixes.
suddenly = sudden + ly.
By removing the 'ly' you can get away with storing just the root word.
Likewise preallocate = pre + allocate.
And lovingly = love + ing + ly gets a bit more complex, as the English rules for 'ing' get invoked.
There is also the possibility of using some sort of hashing function to map a root word to a specific bit in a large bit map, as a constant-time method of determining whether the root word is spelled correctly.
You can get even more complex by trying to provide an alternate list of possible correct spellings to a misspelled word. You might research the soundex algorithm to get some ideas.
I would advise prototyping with a small set of words. Do a lot of testing, then scale up.
It is a wonderful educational problem.
Splitting a word into root and suffix is known as stemming; the "Porter Stemming Algorithm" is a good way of fitting an English dictionary into an amazingly small amount of memory.
It's also useful for search, so "spell checker" will also find "spelling check" and "spell checking".
I've done this in class
You should consider the Python Natural Language Toolkit (NLTK), which is made specifically to handle this.
It also allows you to create text interpreters such as chatbots.
The OpenOffice spell checker Hunspell can be a good starting point. Here is the homepage:
Hunspell at Sourceforge
E James gives a great answer for how to tell if a word is valid. How likely misspellings are determined probably depends on the particular spell checker.
One such method, and the one that I would use, is Levenshtein string similarity, which looks at how many letters must be added, removed, or swapped in a word in order to make another word.
Say you spelled "Country" as "Contry". The Levenshtein distance would be 1, since you only have to add one letter to transform "contry" into "country".
You could then loop through all possible correct spellings of words (only about 171,000 English words, and 3,000 of those account for 95% of text), determine those with the lowest Levenshtein distance, and then return the top X words that are most similar to the misspelled word.
There's a great Python package called FuzzyWuzzy which implements this efficiently and generates a % similarity between two words or sentences based on this formula.
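For illustration, a minimal Python sketch of Levenshtein distance and suggestion ranking; WORD_LIST is a tiny stand-in for a real dictionary, and the FuzzyWuzzy package mentioned above would give you similarity ratios out of the box.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute (free if equal)
        prev = cur
    return prev[-1]

def suggest(word, word_list, top_n=3):
    """Dictionary words closest to the misspelled word come first."""
    return sorted(word_list, key=lambda w: levenshtein(word.lower(), w))[:top_n]

WORD_LIST = ["country", "county", "contrary", "courtesy", "century"]
print(levenshtein("contry", "country"))   # -> 1
print(suggest("Contry", WORD_LIST))
```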
