Generating a string hash to compare two substrings - arrays

I have a string "abcdbca" and I'm instructed to take two slices, say [0:3] and [4:7], which gives the strings "abc" and "bca". I have to find out whether the two substrings are similar (same elements, max_allowed_mismatch_error = 1).
I tried counting sort, but it's not much of an optimization. So I thought the next, more optimized method could be hashing, but I can't figure out a hash function that solves the problem accurately. I need to perform the operation several times.

Hashing is no good.
There are two solutions: the simple one, which is to insist that the substrings be of equal length and count equal characters, and the complex one, which is to use an alignment algorithm like Needleman-Wunsch. The latter will give a much more robust idea of string similarity.
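The simple approach could be sketched as follows in Python. The function name `similar_substrings` and the exact definition of a "mismatch" (a character of the first slice that has no partner in the second) are assumptions made for illustration; the question does not pin them down.

```python
from collections import Counter

def similar_substrings(s, start1, end1, start2, end2, max_mismatch=1):
    """Return True if both slices hold the same characters, up to max_mismatch."""
    a, b = s[start1:end1], s[start2:end2]
    if len(a) != len(b):
        return False  # the simple approach insists on equal lengths
    # Counter subtraction keeps only positive counts, i.e. the characters
    # of a that are left unmatched by b.
    unmatched = sum((Counter(a) - Counter(b)).values())
    return unmatched <= max_mismatch

print(similar_substrings("abcdbca", 0, 3, 4, 7))  # compares "abc" with "bca"
```

If the same string is queried many times, the per-slice counts could presumably be derived from precomputed prefix counts instead of being rebuilt for every query.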

Related

Algorithm - What is the best algorithm for detecting duplicate numbers in small array?

What is the best algorithm for detecting duplicate numbers in an array, the best in speed, memory and avoiding overhead?
A small array like [5,9,13,3,2,5,6,7,1]. Note that 5 is a duplicate.
After searching and reading about sorting algorithms, I realized that I will use one of these algorithms, Quick Sort, Insertion Sort or Merge Sort.
But actually I am really confused about what to use in my case which is a small array.
Thanks in advance.
To be honest, with that size of array, you may as well choose the O(n^2) solution (checking every element against every other element).
You'll generally only need to worry about performance if/when the array gets larger. For small data sets like this, you could well have found the duplicate with an 'inefficient' solution before the sort phase of an efficient solution will have finished :-)
In other words, you can use something like (pseudo-code):
for idx1 = 0 to nums.len - 2 inclusive:
    for idx2 = idx1 + 1 to nums.len - 1 inclusive:
        if nums[idx1] == nums[idx2]:
            return nums[idx1]
return no dups found
This finds the first value in the array which has a duplicate.
If you want an exhaustive list of duplicates, then just add the duplicate value to another (initially empty) array (once only per value) and keep going.
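For reference, a minimal Python sketch of the brute-force check, plus the exhaustive variant described above. The function names are illustrative, not part of the original answer.

```python
def first_duplicate(nums):
    """Return the first value that appears again later in nums, or None."""
    for i in range(len(nums) - 1):
        for j in range(i + 1, len(nums)):
            if nums[i] == nums[j]:
                return nums[i]
    return None

def all_duplicates(nums):
    """Return every duplicated value once, in order of first appearance."""
    dups = []
    for i in range(len(nums) - 1):
        if nums[i] in dups:
            continue  # already recorded, add each value once only
        for j in range(i + 1, len(nums)):
            if nums[i] == nums[j]:
                dups.append(nums[i])
                break
    return dups

print(first_duplicate([5, 9, 13, 3, 2, 5, 6, 7, 1]))
```

Neither function allocates anything beyond the (small) result list, which is the main appeal for tiny inputs.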
You can also sort it using any half-decent algorithm; for a data set of the size you're discussing, even a bubble sort would probably be adequate. Then you just process the sorted items sequentially, looking for runs of values, but it's probably overkill in your case.
Two good approaches exist; which one to use depends on whether or not you know the range from which the numbers are drawn.
Case 1: the range is known.
Suppose you know that all numbers are in the range [a, b[, thus the length of the range is l=b-a.
You can create an array A of length l filled with 0s, then iterate over the original array and, for each element e, increment the value of A[e-a] (here we are actually mapping the range onto [0,l[).
Once finished, you can iterate over A and find the duplicate numbers. In fact, if there exists i such that A[i] is greater than 1, it implies that i+a is a repeated number.
The same idea is behind counting sort, and it works fine also for your problem.
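A minimal sketch of Case 1, assuming the caller supplies the half-open range [a, b[ and that every element really lies inside it. The function name is made up for this example.

```python
def duplicates_in_range(nums, a, b):
    """Find repeated values when all of nums are known to lie in [a, b)."""
    counts = [0] * (b - a)          # one counter per possible value
    for e in nums:
        counts[e - a] += 1          # map the value e onto index e - a
    # Index i with a count above 1 means the value i + a is repeated.
    return [i + a for i, c in enumerate(counts) if c > 1]

print(duplicates_in_range([5, 9, 13, 3, 2, 5, 6, 7, 1], 1, 14))
```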
Case 2: the range is not known.
Quite simple. Slightly modify the approach mentioned above: instead of an array, use a map where the keys are the numbers from your original array and the values are the number of times you find them. At the end, iterate over the set of keys and look for those that have been found more than once.
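A minimal sketch of Case 2, using Python's `collections.Counter` as the map; the function name is illustrative.

```python
from collections import Counter

def duplicates_unknown_range(nums):
    """Find repeated values without knowing the range in advance."""
    counts = Counter(nums)  # maps each value to the number of times it occurs
    return [value for value, c in counts.items() if c > 1]

print(duplicates_unknown_range([5, 9, 13, 3, 2, 5, 6, 7, 1]))
```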
Note.
In both cases mentioned above, the complexity should be O(N) and you cannot do better, for you have at least to visit all the stored values.
Look at the first example: we iterate over two arrays, the lengths of which are N and l, thus the complexity is O(N + l), which is O(N) as long as l is at most proportional to N.
The second example is indeed a bit more complex and dependent on the implementation of the map, but for the sake of simplicity we can safely assume that it is O(N).
In memory, you are constructing data structures the sizes of which are proportional to the number of different values contained in the original array.
As usually happens, memory occupancy and performance are the keys to your choice: the greater the former, the better the latter, and vice versa. As suggested in another response, if you know that the array is small, you can safely rely on an algorithm whose complexity is O(N^2) but which requires no extra memory at all.
Which is the best choice? Well, it depends on your problem, we cannot say.

fast indexing for slow cpu?

I have a large document that I want to build an index of for word searching. (I hear this type of array is really called a concordance.) Currently it takes about 10 minutes. Is there a fast way to do it? Currently I iterate through each paragraph and, if I find a word I have not encountered before, I add it to my word array, along with the paragraph number in a subsidiary array; any time I encounter that same word again, I add the paragraph number to the index:
associativeArray={chocolate:[10,30,35,200,50001],parsnips:[5,500,100403]}
This takes forever, well, 5 minutes or so. I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway.
Is there a faster way to build a text index other than linear brute force? I'm not looking for a product that will do the index for me, just the fastest known algorithm. The index should be accurate, not fuzzy, and there will be no need for partial searches.
I think the best idea is to build a trie, adding one word of your text at a time, and keeping for each leaf a list of the locations where you can find that word.
This would not only save you some space, since storing word with similar prefixes will require way less space, but the search will be faster too. Search time is O(M) where M is the maximum string length, and insert time is O(n) where n is the length of the key you are inserting.
Since the obvious alternative is a hash table, here you can find some more comparison between the two.
I would use a HashMap<String, List<Occurrency>>. This way you can check if a word is already in your index in about O(1).
At the end, when you have all the words collected and want to search them very often, you might try to find a hash function that has no or nearly no collisions. This way you can guarantee O(1) time for the search (or nearly O(1) if you still have some collisions).
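The trie idea might look roughly like this in Python. The class and method names are hypothetical, and one detail differs from the wording above: the location list is kept on whichever node ends a word, since a word such as "choc" can end on an inner node when "chocolate" is also present.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the next TrieNode
        self.locations = []  # paragraph numbers of the word ending at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, location):
        """Walk or create one node per character, then record the location."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.locations.append(location)

    def search(self, word):
        """Return the recorded locations of word, or an empty list."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.locations

index = Trie()
index.insert("chocolate", 10)
index.insert("chocolate", 30)
print(index.search("chocolate"))
```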
Well, apart from going along with MrSmith42's suggestion of using the built in HashMap, I also wonder how much time you are spending tracking the paragraph number?
Would it be faster to change things to track line numbers instead? (Especially if you are reading the input line-by-line).
There are a few things unclear in your question, like what you mean by "I tried converting this array to a string, but it is so large it won't work to include in a program file, even after removing stop words, and would take a while to convert back to an array anyway." What array: is your input in the form of an array of paragraphs, or do you mean the concordance entries per word, or what?
It is also unclear why your program is so slow; probably there is something inefficient there. I suspect it is your check "if I find a word I have not encountered before": I presume you look up the word in the dictionary and then iterate through the array of occurrences to see if the paragraph number is there? That's a slow linear search; you will be better served by a set there (think hash/dictionary where you care only about the keys), something like
concord = {
    'chocolate': {10:1, 30:1, 35:1, 200:1, 50001:1},
    'parsnips': {5:1, 500:1, 100403:1}
}
and your check then becomes if paraNum in concord[word]: ... instead of a loop or binary search.
PS. Actually, assuming you are keeping the list of occurrences in an array AND scanning the text from the first to the last paragraph, the arrays will come out sorted, so you only need to check the very last element: if word in concord and paraNum == concord[word][-1]:. (Examples are in pseudocode/Python but you can translate to your language.)
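Putting the last-element check together, a single pass over the text could be sketched like this. The tokenizing regular expression and the name `build_concordance` are assumptions for the example; real text may need a different notion of "word".

```python
import re
from collections import defaultdict

def build_concordance(paragraphs):
    """Map each word to the sorted list of paragraph numbers containing it."""
    concord = defaultdict(list)
    for para_num, text in enumerate(paragraphs):
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            occurrences = concord[word]
            # Paragraphs are scanned in order, so a repeat within the same
            # paragraph can only be the last element of the list.
            if not occurrences or occurrences[-1] != para_num:
                occurrences.append(para_num)
    return concord

concord = build_concordance(["Chocolate and parsnips", "More chocolate, chocolate!"])
print(concord["chocolate"])
```

Each word costs one dictionary lookup and one comparison, so the whole build stays linear in the size of the text.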

How to write a Matlab function that when given a single cell array of strings, returns a structure array with the same strings in alphabetical order?

I'm sure there may be a Matlab function to do this, but I'm required to write my own. As the title says, I need to write a function which, when given a single cell array of strings, returns a structure array containing the same strings but in alphabetical order. Furthermore, the 'count' field must contain the number of times that particular string has occurred, e.g.
z=myfunction({'bag','dig','bag'})
ans =
    str: 'bag'
    count: 2
Ideally, the method should have an expected number of comparisons for n strings of O(n log n)
Assuming you don't want to use standard functions like sort or unique, this is not an easy question. Furthermore, it is more about math than about programming.
If you just want to practice programming, try implementing something simple like bubble sort.
This will not achieve O(n log n), however; if you really want that, look into merge sort, for example.
Several options are explained roughly here, and with a bit of searching it should not be hard to find what you need.
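To illustrate the merge sort route, here is a rough sketch written in Python rather than Matlab, so it would still need translating. A list of dictionaries stands in for the structure array, and the function names are made up for the example.

```python
def merge_sort(items):
    """Return a new ascending list using O(n log n) comparisons."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged

def sorted_counts(strings):
    """Sort the strings, then collapse runs of equal strings into counts."""
    result = []
    for s in merge_sort(strings):
        if result and result[-1]["str"] == s:
            result[-1]["count"] += 1
        else:
            result.append({"str": s, "count": 1})
    return result

print(sorted_counts(["bag", "dig", "bag"]))
```

Counting after sorting only needs one extra linear pass, because equal strings end up next to each other.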

Infinity as sentinel in mergesort?

I am currently reading Cormen's "Introduction to Algorithms" and I found something called a sentinel.
It's used in the mergesort algorithm as a tool to decide when one of the two merging lists is exhausted. Cormen uses the infinity symbol for the sentinels in his pseudocode and I would like to know how such an infinite value can be implemented in C.
A sentinel is just a dummy value. For strings, you might use a NULL pointer since that's not a sensible thing to have in a list. For integers, you might use a value unlikely to occur in your data set; e.g. if you are dealing with a list of ages, then you can use the age -1 to denote the end of the list.
You can get an "infinite value" for floats, but it's not the best idea. For arrays, pass the size explicitly; for lists, use a null pointer sentinel.
In C, when sorting an array, you usually know the size, so you could actually sort a range [begin, end) in which end is one past the end of the array. E.g. int a[n] could be sorted as sort(a, a + n).
This allows you to do two things:
call your sort recursively with the part of the array you haven't sorted yet (merge sort is a recursive algorithm)
use end as a sentinel.
If you know the elements in your list will range from the smallest to the highest possible values for the given data type, the code you are looking at won't work. You'll have to come up with something else, which I am sure can be done. I have that book in front of me right now, and I have a solution that will work for you if you know the values range from the smallest for the given data type to the largest minus one at most.
Open that book back up to page 31 and take a look at the Merge function. The lines causing you problems are lines 8 and 9, where the sentinel value of infinity is being used. Now, we know the two arrays are each already sorted and that we just need to merge them to get the array that is twice as big and in sorted order. This means that the largest element in each half is at the end of its sub-array, and that the larger of the two is the largest in the array that is twice as big and that will be sorted once the merge function has completed.
All we need to do is determine the larger of those two values, increment that value by one, and use that as our sentinel. So, lines 8 and 9 of the code should be replaced by the following 6 lines of code:
if L[n1] < R[n2]
    largest = R[n2]
else
    largest = L[n1]
L[n1 + 1] = largest + 1
R[n2 + 1] = largest + 1
That should work for you. I have a test tomorrow in my algorithms course on this stuff and I came across your post here and thought I'd help you out. The authors' use of sentinels in this book is something that has always bugged me, and I absolutely can not stand how much they are in love with recursion. Iteration is faster and in my opinion usually easier to come up with and grasp.
The trick is that you don't have to check array bounds when incrementing the index in only one of the lists in the inner while loops. Hence you need sentinels that are larger than all other elements. In C++ I usually use std::numeric_limits<TYPE>::max().
The C equivalent should be the macros from <limits.h>, like INT_MAX, UINT_MAX, LONG_MAX etc. Those are good sentinels. If you need two different sentinels, use ..._MAX and ..._MAX - 1.
This is all assuming you're merging two lists that are ordered ascending.
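To make the role of the sentinel concrete, here is a small sketch of the merge step in Python, where `math.inf` plays the part that INT_MAX would play for int data in C. It assumes both inputs are sorted ascending and that the sentinel value itself never occurs in the data; the function name is illustrative.

```python
import math

def merge_with_sentinels(left, right):
    """Merge two ascending lists, using an infinite sentinel to end each one."""
    L = left + [math.inf]    # sentinel: larger than every real element
    R = right + [math.inf]
    merged = []
    i = j = 0
    # Exactly len(left) + len(right) elements are copied, so the sentinels
    # are compared against but never copied, and no bounds check is needed.
    for _ in range(len(left) + len(right)):
        if L[i] <= R[j]:
            merged.append(L[i])
            i += 1
        else:
            merged.append(R[j])
            j += 1
    return merged

print(merge_with_sentinels([1, 4, 9], [2, 3, 10, 11]))
```

Once one list is exhausted, its sentinel loses every remaining comparison, so the rest of the other list is copied without any special case.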

How do I find common elements from n arrays

I am thinking of sorting and then doing binary search. Is that the best way?
I advocate for hashes in such cases: you'll have time proportional to the combined size of both arrays.
Since most major languages offer a hash table in their standard libraries, I hardly need to show you how to implement such a solution.
Iterate through each one and use a hash table to store counts. The key is the value of the integer and the value is the count of appearances.
It depends. If one set is substantially smaller than the other, or for some other reason you expect the intersection to be quite sparse, then a binary search may be justified. Otherwise, it's probably easiest to step through both at once. If the current element in one is smaller than in the other, advance to the next item in that array. When/if you get to equal elements, you send that as output, and advance to the next item in both arrays. (This assumes that, as you suggested, you've already sorted both, of course.)
This is an O(N+M) operation, where N is the size of one array, and M the size of the other. Using a binary search, you get O(N log2 M) instead, which can be lower complexity if one array is a lot smaller than the other, but is likely to be a net loss if they're close to the same size.
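The step-through-both approach might be sketched like this, assuming both inputs are already sorted ascending; the function name is made up for the example.

```python
def intersect_sorted(a, b):
    """Return the elements common to two ascending lists in O(N+M) time."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1            # a is behind, advance it
        elif a[i] > b[j]:
            j += 1            # b is behind, advance it
        else:
            common.append(a[i])
            i += 1            # equal: output once, advance both
            j += 1
    return common

print(intersect_sorted([1, 3, 5, 7], [3, 4, 5, 8]))
```

A value repeated in both inputs is reported once per matching pair, so this sketch behaves like a multiset intersection.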
Depending on what you need/want, the versions that attempt to just count occurrences can cause a pretty substantial problem: if there are multiple occurrences of a single item in one array, they will still count that as two occurrences of that item, indicating an intersection that doesn't really exist. You can prevent this, but doing so renders the job somewhat less trivial -- you insert items from one array into your hash table, but always set the count to 1. When that's finished, you process the second array by setting the count to 2 if and only if the item is already present in the table.
Define "best".
If you want to do it fast, you can do it in O(n) by iterating through each array and keeping a count for each unique element. Details of how to count the unique elements depend on the alphabet of things that can be in the array, e.g., is it sparse or dense?
Note that this is O(n) in the number of arrays, but O(nm) for arrays of length m.
The best way is probably to hash all the values and keep a count of occurrences, culling all that have not occurred i times when you examine array i where i = {1, 2, ..., n}. Unfortunately, no deterministic algorithm can get you less than an O(n*m) running time, since it's impossible to do this without examining all the values in all the arrays if they're unsorted.
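A sketch of the count-and-cull idea for n arrays, using a Python dict as the hash table. Each array is deduplicated with `set()` first, an added assumption so that repeats inside a single array cannot inflate a count; the function name is illustrative.

```python
def common_elements(arrays):
    """Return the set of values that appear in every one of the arrays."""
    counts = {}
    for i, arr in enumerate(arrays, start=1):
        for value in set(arr):  # repeats within one array count once
            # Only values seen in all i - 1 previous arrays can still qualify.
            if counts.get(value, 0) == i - 1:
                counts[value] = i
        # Cull every value that has not occurred i times after array i.
        counts = {v: c for v, c in counts.items() if c == i}
    return set(counts)

print(common_elements([[1, 2, 3, 2], [2, 3, 4], [3, 2, 2, 9]]))
```

Because of the culling, the table never grows beyond the number of distinct values in the first array.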
A faster algorithm would need to either be probabilistic with an acceptable error rate (Monte Carlo), or rely on some known condition of the lists to examine only a subset of elements (i.e. you only care about elements that have occurred in all i-1 previous lists when considering the ith list, but in an unsorted list it's non-trivial to search for elements).

Resources