The task is this: find the longest substring common to two strings. The peculiarity of the problem is that these strings are very long (each is the contents of a file, up to 400,000 characters), while the alphabet they are built from is small: 4 characters.
The strings can be of different lengths.
I devised and implemented the following algorithm:
Read the contents of the first file into a string str1, removing the line breaks.
Read the contents of the second file into a string str2, removing the line breaks.
Consider all substrings of str1, from the longest to the shortest. To do this, set up a loop while (i>0) that, after its main body, decreases the substring length i by one on each iteration, down to substrings of length 1.
Inside the while loop: all substrings of a given length differ only in their starting position.
Take a string of length N:
It has one substring of length N, starting at position 0.
It has two substrings of length N-1, starting at positions 0 and 1.
It has three substrings of length N-2, starting at positions 0, 1, and 2.
...
It has K+1 substrings of length N-K, starting at positions 0, 1, ..., K.
The starting position is iterated in a for loop, for (z=0; z<=g-i; z++), inside which the function getSubstring is called to extract the substring. The standard function strstr is then run to search for this substring in str2.
But this algorithm takes far too long. Is there a way to make it faster?
P.S. I am writing in C.
There are at least two classical ways to solve longest common substring efficiently:
Build a generalized suffix array or suffix tree of the two strings. One can show that the longest common substring is the longest common prefix of two adjacent suffixes in the suffix array that have different colors (that is, belong to different strings). I once wrote an answer that describes a simple O(n log n) suffix array construction algorithm.
Build a suffix automaton of one string and feed the other string into it. At every point, check how "deep" you are in the automaton and report the maximum over all those depths. You can find a C++ implementation in my GitHub.
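The second option can be sketched in C roughly as follows. This is a minimal illustration, not the implementation referred to above: the function names longest_common_substring and code are made up for the example, and the mapping in code assumes the 4-character alphabet is "ACGT", which the question does not state.

```c
#include <stdlib.h>
#include <string.h>

#define SIGMA 4                 /* assumed alphabet size */

typedef struct {
    int len;                    /* length of the longest string in this state */
    int link;                   /* suffix link, -1 for the initial state */
    int next[SIGMA];            /* transitions, -1 if absent */
} State;

/* Hypothetical mapping: assumes the alphabet is "ACGT";
   any other character is treated like 'T'. */
static int code(char c)
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    default:  return 3;
    }
}

/* Returns the length of the longest common substring of s and t.
   *end_in_t receives the index in t just past one such substring.
   Returns 0 if memory cannot be allocated. */
size_t longest_common_substring(const char *s, const char *t, size_t *end_in_t)
{
    size_t n = strlen(s), best = 0;
    State *st = malloc((2 * n + 2) * sizeof *st);  /* at most 2n states */
    int sz = 1, last = 0, v = 0, l = 0;

    *end_in_t = 0;
    if (st == NULL)
        return 0;
    st[0].len = 0;
    st[0].link = -1;
    memset(st[0].next, -1, sizeof st[0].next);

    /* Build the suffix automaton of s, one character at a time. */
    for (size_t i = 0; i < n; i++) {
        int c = code(s[i]), cur = sz++, p = last;
        st[cur].len = st[last].len + 1;
        memset(st[cur].next, -1, sizeof st[cur].next);
        while (p != -1 && st[p].next[c] == -1) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                int clone = sz++;           /* split state q */
                st[clone] = st[q];
                st[clone].len = st[p].len + 1;
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }

    /* Feed t through the automaton, tracking the current match length l. */
    for (size_t i = 0; t[i] != '\0'; i++) {
        int c = code(t[i]);
        while (v != 0 && st[v].next[c] == -1) {
            v = st[v].link;                 /* drop to a shorter suffix */
            l = st[v].len;
        }
        if (st[v].next[c] != -1) {
            v = st[v].next[c];
            l++;
        }
        if ((size_t)l > best) {
            best = (size_t)l;
            *end_in_t = i + 1;
        }
    }
    free(st);
    return best;
}
```

The common substring itself is t[*end_in_t - length .. *end_in_t - 1]. Both phases run in time linear in the string lengths for a fixed alphabet, and with 4 transitions per state the table stays small even for 400,000-character inputs.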
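#include <stdio.h>
#include <string.h>

int main(void)
{
    char array[2][10] = {"3234", "5"};
    int n = strcmp(array[0], array[1]); /* pass the rows themselves, not array+0 */
    printf("%d", n);
    return 0;
}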
The code above prints -1, even though 3234 > 5.
It prints 1, however, if the first number is 5234, 6234, etc.
What is the logic behind the results? (Are only the first digits taken into account?)
So, is there a way to compare two number strings without comparing their actual integer values?
What is the logic behind the results?
strcmp compares characters in the strings, using their values as unsigned char. First, it compares the first character of one string with the first character of the other. If they differ, it reports the first string is “lesser than” the second if its character is lesser than the other’s and it reports “greater than” if the first string’s character is greater. If the characters are equal, then strcmp compares the second characters of the strings, then the third, and so on. (If one string is shorter than the other but is identical up to its end, the null character that terminates it will cause it to be lesser than the other string.)
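The comparison described above can be sketched as a small function. This is only an illustration of the logic, not the library's actual source, and the name my_strcmp is made up for the example.

```c
/* Illustrative re-implementation of the comparison strcmp performs. */
int my_strcmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    /* Skip over the common prefix. */
    while (*p != '\0' && *p == *q) {
        p++;
        q++;
    }
    /* The first differing characters (or a terminating null) decide. */
    return (*p > *q) - (*p < *q);
}
```

With "3234" and "5", the very first characters differ and '3' is less than '5', so the result is negative no matter what follows.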
So, is there a way to compare two number strings without comparing their actual integer values?
There is no standard library routine for this. You could write a routine for it.
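One possible routine is sketched below. It assumes both strings contain only decimal digits (no sign, no whitespace, no decimal point); the name compare_numeric_strings is made up for the example.

```c
#include <string.h>

/* Compares two strings of decimal digits by numeric value without
   converting them to integers. Assumes digits only, no sign.
   Returns -1, 0 or 1. */
int compare_numeric_strings(const char *a, const char *b)
{
    size_t la, lb;
    int c;

    /* Skip leading zeros, but keep the last digit. */
    while (a[0] == '0' && a[1] != '\0')
        a++;
    while (b[0] == '0' && b[1] != '\0')
        b++;

    /* Without leading zeros, the longer number is the larger one. */
    la = strlen(a);
    lb = strlen(b);
    if (la != lb)
        return la < lb ? -1 : 1;

    /* Same length: character order matches numeric order. */
    c = strcmp(a, b);
    return (c > 0) - (c < 0);
}
```

Because it never converts to an integer type, this also works for numbers too large to fit in one.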
We have two arrays
char A[]="ABABABABBBABAB";
And the other is
char B[]="BABA";
How can I find B in A, and where it starts and ends for every occurrence?
For example, for this one:
Between 2-5
Between 4-7
Between 10-13
Yes, you can do this using the strstr function.
This function returns a pointer to the first occurrence in haystack of the entire sequence of characters specified in needle, or a null pointer if the sequence is not present in haystack.
So you get a pointer to the beginning of the match. To find the next occurrence, call it again with the first parameter advanced past the start of the occurrence just found. A simple illustration:
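#include <stdio.h>
#include <string.h>

int main(void)
{
    char haystack[] = "abismyabnameab";
    char needle[] = "ab";
    char *ret = strstr(haystack, needle);

    while (ret != NULL) {
        /* print the rest of the text and the 0-based start and end indices */
        size_t start = (size_t)(ret - haystack);
        printf("%s (%zu,%zu)\n", ret, start, start + strlen(needle) - 1);
        /* resume one character later so overlapping matches are found too */
        ret = strstr(ret + 1, needle);
    }
    return 0;
}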
The indices follow from two facts: strstr tells you where each occurrence starts (ret - haystack for each instance of needle in haystack), and the length of the needle is known, so the end index is the start plus strlen(needle) - 1.
Note that a needle can overlap with itself. For example, when BB is searched for in BBBBB, an occurrence starts at every position from 0 to 3. Resuming the search strlen(needle) characters after each match would skip the second occurrence; resuming one character after the start of the previous match, as the code does, finds them all.
A better solution is to use KMP with its failure function. That gives O(n+m) complexity, whereas the strstr loop can take O(n*m) in the worst case.
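A KMP-based search might look like the sketch below. The function name kmp_find_all is made up for the example; it prints 1-based positions in the same "Between start-end" form the question uses and returns the number of matches.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Prints every (possibly overlapping) occurrence of needle in haystack
   as 1-based "Between start-end" and returns the number of matches,
   or -1 if memory cannot be allocated. */
long kmp_find_all(const char *haystack, const char *needle)
{
    size_t n = strlen(haystack), m = strlen(needle);
    size_t *fail;
    long count = 0;

    if (m == 0)
        return 0;
    fail = malloc(m * sizeof *fail);
    if (fail == NULL)
        return -1;

    /* fail[i] = length of the longest proper prefix of needle[0..i]
       that is also a suffix of it. */
    fail[0] = 0;
    for (size_t i = 1, k = 0; i < m; i++) {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            k++;
        fail[i] = k;
    }

    /* Scan the haystack once; k is the number of needle characters matched. */
    for (size_t i = 0, k = 0; i < n; i++) {
        while (k > 0 && haystack[i] != needle[k])
            k = fail[k - 1];
        if (haystack[i] == needle[k])
            k++;
        if (k == m) {
            printf("Between %zu-%zu\n", i + 2 - m, i + 1);
            count++;
            k = fail[k - 1];    /* keep going, allowing overlaps */
        }
    }
    free(fail);
    return count;
}
```

Called with the arrays from the question, kmp_find_all(A, B) reports the three occurrences at 2-5, 4-7 and 10-13.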
If we traverse the string and compare every character with all the others, we will find the duplicates, but that is O(n^2).
I need some idea for doing it in less than O(n^2).
Let's say the input string is: nice book, then
the output will be: o
Iterate over all characters and store them in a HashMap whose key is the character itself and whose value is anything (the character, true, an integer, ...). Before adding a character to the hash map, check whether it already exists: if it does, it is a duplicate; if not, insert it.
Here is some pseudocode:
for character char in String
    if charactersMap.get(char) == null
        charactersMap.put(char, char)
    else
        print char
This solution is O(n), as it iterates over the characters once and looking up a key in a hash map takes constant time on average.
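In C, where there is no built-in hash map, the same idea can be sketched with a small table indexed by character value. The name print_duplicates is made up for the example, and unlike the pseudocode it reports each duplicated character only once.

```c
#include <limits.h>
#include <stdio.h>

/* Prints each character that occurs more than once in s (once per
   character) and returns how many distinct characters were duplicated. */
int print_duplicates(const char *s)
{
    /* 0 = not seen yet, 1 = seen once, 2 = already reported */
    unsigned char seen[UCHAR_MAX + 1] = {0};
    int count = 0;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (seen[c] == 0) {
            seen[c] = 1;
        } else if (seen[c] == 1) {
            seen[c] = 2;
            printf("%c\n", c);
            count++;
        }
    }
    return count;
}
```

For the input nice book this prints o. The table plays the role of the map: one array lookup per character, so the whole pass is O(n).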
I am trying to create a programme that is able to determine whether two inputted words are anagrams of each other.
The way in which I have been told to go by my tutor is to count how many of the first letter of one word there is, then compare to the other, then repeat for the rest of the letters. Therefore if the word gets to the end, then it considers them anagrams. However that is all he has helped me with, and I am really struggling with the problem.
The programme is required to print whether or not they are anagrams like so,
Success! "Carthorse" and "Orchestra" are anagrams!
Edit: Thanks guys for all of your responses. Whilst I understand the whole idea behind them, I am finding it very difficult to put them into code. Would anyone be able to simply write the annotated code for me? It's not for homework or anything, it's simply a personal project.
It sounds like you're new to C! Welcome :)
Tasks like that can seem complex, so the first step I'd do here is break it down into steps that you can google for how to do. So:
"count how many of the first letter of one word there is, then compare to the other, then repeat for the rest of the letter"
Read in the words/create variables of them
Create an array of length 26, to store each letter of the alphabet
Loop through the first word and for each letter, add one to the correct array index (a = 0, m = 12, etc)
e.g.
int index = string[i] - 'a'; // Subtracts the ASCII value of 'a' from the letter, so a = 0, b = 1, etc.
letterCounts[index]++; // or letterCounts[index]--; for the second word
Loop through the second word, and for each letter, subtract one from the corresponding array element
If at the end any element is not 0, the words are not anagrams.
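Put together, the steps above might look like the following sketch. The name is_anagram is made up for the example; it assumes an ASCII-compatible character set, lowercases letters itself, and ignores anything that is not a letter.

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if a and b contain the same letters the same number
   of times, ignoring case and non-letter characters. */
bool is_anagram(const char *a, const char *b)
{
    int letterCounts[26] = {0};

    /* First word: add one for each letter. */
    for (; *a != '\0'; a++) {
        int c = tolower((unsigned char)*a);
        if (c >= 'a' && c <= 'z')
            letterCounts[c - 'a']++;
    }
    /* Second word: subtract one for each letter. */
    for (; *b != '\0'; b++) {
        int c = tolower((unsigned char)*b);
        if (c >= 'a' && c <= 'z')
            letterCounts[c - 'a']--;
    }
    /* Any non-zero count means the letters did not match up. */
    for (int i = 0; i < 26; i++) {
        if (letterCounts[i] != 0)
            return false;
    }
    return true;
}
```

The required message can then be printed with something like if (is_anagram(w1, w2)) printf("Success! \"%s\" and \"%s\" are anagrams!\n", w1, w2);, where w1 and w2 are the two inputted words.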
Convert both strings to lowercase characters.
Create two arrays of 26 integers, one element for each letter of the alphabet.
Run through each string counting the letters and incrementing the appropriate element in the alphabet arrays.
Then compare the two alphabet arrays and if they are equal for each character, your strings are anagrams.
1) Convert both strings to lowercase as necessary (use tolower from ctype.h).
2) Sort each string, e.g., by using qsort from stdlib.h:
static int cmp(const void *a, const void *b) { return *(const char *)a - *(const char *)b; }
qsort(str1, strlen(str1), 1, cmp);
qsort(str2, strlen(str2), 1, cmp);
3) Compare the sorted strings with strcmp from string.h - if equal, they are anagrams, otherwise not.
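The three steps can be combined into one function, sketched below. The names is_anagram_sorted, cmp_char and lower_in_place are made up for the example. Note that it modifies both strings in place, so they must be writable arrays rather than string literals.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Orders two characters for qsort. */
static int cmp_char(const void *a, const void *b)
{
    return *(const char *)a - *(const char *)b;
}

/* Converts s to lowercase in place. */
static void lower_in_place(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}

/* Returns true if str1 and str2 are anagrams, ignoring case.
   Both strings are lowercased and sorted in place. */
bool is_anagram_sorted(char *str1, char *str2)
{
    lower_in_place(str1);                       /* step 1 */
    lower_in_place(str2);
    qsort(str1, strlen(str1), 1, cmp_char);     /* step 2 */
    qsort(str2, strlen(str2), 1, cmp_char);
    return strcmp(str1, str2) == 0;             /* step 3 */
}
```

Sorting costs O(n log n) rather than the O(n) of counting letters, but for words of ordinary length the difference is negligible.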
Given a string like "geekthegeertheregeers", we have to find the longest substring that occurs more than once within the string itself.
In this case, "geer" would be the longest such substring.
My question is: which algorithm applies here? Can LCS be modified to solve this problem?
Is the question "find the longest substring that occurs more than once in the set of substrings"?
Shouldn't the result for "geekthegeertheregeers" be "egeer"?
If so, you can build a suffix array for the input string and construct the LCP (Longest Common Prefix) array for that suffix array. Both can be done in O(N), where N is the length of the input string. The longest repeated substring then corresponds to the maximum value in the LCP array.
Reference:
Suffix Array (http://en.wikipedia.org/wiki/Suffix_array )
LCP array (http://en.wikipedia.org/wiki/LCP_array )
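To illustrate the idea, here is a deliberately simple sketch. It does not use the O(N) constructions mentioned above: it sorts suffix pointers with qsort and compares neighbours directly, which is easy to follow but can be much slower than linear on long, repetitive inputs. The names longest_repeated_substring and cmp_suffix are made up for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Orders two suffixes (pointers into the same string) lexicographically. */
static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Returns the length of the longest substring that occurs at least
   twice in s (occurrences may overlap). *start receives a pointer to
   one occurrence. Returns 0 if memory cannot be allocated. */
size_t longest_repeated_substring(const char *s, const char **start)
{
    size_t n = strlen(s), best = 0;
    const char **suffix;

    *start = s;
    if (n < 2)
        return 0;
    suffix = malloc(n * sizeof *suffix);
    if (suffix == NULL)
        return 0;

    /* Naive suffix array: one pointer per suffix, sorted. */
    for (size_t i = 0; i < n; i++)
        suffix[i] = s + i;
    qsort(suffix, n, sizeof *suffix, cmp_suffix);

    /* Naive LCP: common prefix length of each pair of neighbours. */
    for (size_t i = 1; i < n; i++) {
        size_t l = 0;
        while (suffix[i - 1][l] != '\0' && suffix[i - 1][l] == suffix[i][l])
            l++;
        if (l > best) {
            best = l;
            *start = suffix[i];
        }
    }
    free(suffix);
    return best;
}
```

For "geekthegeertheregeers" this yields a length of 5, with *start pointing at an occurrence of "egeer".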