atmost K mismatch substrings?

I'm trying to solve this problem. I was able to solve it using brute force, but
the following optimised algorithm gives incorrect results for some of the test cases. I tried but couldn't find the problem with the code. Can anybody help me?
Problem :
Given a string S and an integer K, find the integer C which equals the number of pairs of substrings (S1,S2) such that S1 and S2 have equal length and Mismatch(S1, S2) <= K where the mismatch function is defined below.
The Mismatch Function
Mismatch(s1,s2) is the number of positions at which the characters in S1 and S2 differ. For example mismatch(bag,boy) = 2 (there is a mismatch in the second and third position), mismatch(cat,cow) = 2 (again, there is a mismatch in the second and third position), Mismatch(London,Mumbai) = 6 (since the character at every position is different in the two strings). The first character in London is ‘L’ whereas it is ‘M’ in Mumbai, the second character in London is ‘o’ whereas it is ‘u’ in Mumbai - and so on.
int main() {
int k;
char str[6000];
cin>>k;
cin>>str;
int len=strlen(str);
int i,j,x,l,m,mismatch,count,r;
count=0;
for(i=0;i<len-1;i++)
for(j=i+1;j<len;j++)
{ mismatch=0;
for(r=0;r<len-j+i;r++)
{
if(str[i+r]!=str[j+r])
{ ++mismatch;
if(mismatch>=k)break;
}
if(mismatch<=k)++count;
}
}
cout<<count;
return 0;
}
Sample test cases
Test case (passing for above code)
**input**
0
abab
**output**
3
Test case (failing for above code)
**input**
3
hjdiaceidjafcchdhjacdjjhadjigfhgchadjjjbhcdgffibeh
**expected output**
4034
**my output**
4335

You have two errors. First,
for(r=1;r<len;r++)
should be
for(r=1;r<=len-j;r++)
since otherwise,
str[j+r]
would at some point begin comparing characters past the null-terminator (i.e. beyond the end of the string). The greatest r can be is the remaining number of characters from the jth index to the last character.
Second, writing
str[i+r]
and
str[j+r]
skips the comparison of the ith and jth characters since r is always at least 1. You should write
for(r=0;r<len-j;r++)

You have two basic errors. You are quitting when mismatches>=k instead of mismatches>k (mismatches==k is an acceptable number) and you are letting r get too large. These skew the final count in opposite directions but, as you see, the second error "wins".
The real inner loop should be:
for (r=0; r<len-j; ++r)
{
if (str[i+r] != str[j+r])
{
++mismatch;
if (mismatch > k)
break;
}
++count;
}
r is an index into the substring, and j+r MUST be less than len to be valid for the right substring. Since i<j, if str[j+r] is valid, then so is str[i+r], so there's no need to have i involved in the upper limit calculation.
Also, you want to break on mismatch>k, not on >=k, since k mismatches are allowed.
Next, if you test for too many mismatches after incrementing mismatch, you don't have to test it again before counting.
Finally, the upper limit of r<len-j (instead of <=) means that the trailing '\0' character won't be compared as part of the str[j+r] substring. You were comparing that and more when j+r >= len, but mismatches was less than k when that first happened.
Note: You asked about a faster method. There is one, but the coding is more involved. Make the outer loop on the difference delta between starting index values. (0<delta<len) Then, count all acceptable matches with something like:
count = 0;
for delta = 1 to len-1
set i=0; j=delta; mismatches=0; r=0;
while j < len
.. find k'th mismatch, or end of str:
while mismatches < k and j+r < len
if str[i+r] != str[j+r] then mismatches=mismatches+1
r = r+1
end while
.. extend r to cover any trailing matches:
while j+r<len and str[i+r]==str[j+r]
r = r+1
end while
.. arrive here with r being the longest string pair starting at str[i]
.. and str[j] with no more than k mismatches. This loop will add (r)
.. to the count and advance i,j one space to the right without recounting
.. the character mismatches inside. Rather, if a mismatch is dropped off
.. the front, then mismatches is decremented by 1.
repeat
count = count + r
if str[i] != str[j] then mismatches=mismatches-1
i = i+1, j = j+1, r = r-1
until mismatches < k
end while
That's pseudocode, and also pseudocorrect. The general idea is to compare all substrings with starting indices differing by (delta) in one pass, starting at the left, and increasing the substring length r until the end of the source string is reached or k+1 mismatches have been seen. That is, str[j+r] is either the end of the string, or the camel's-back-breaking mismatch position in the right substring. That makes r substrings that had k or fewer mismatches starting at str[i] and str[j].
So count those r substrings and move to the next positions i=i+1,j=j+1 and new length r=r-1, reducing the mismatch count if unequal characters were dropped off the left side.
It should be pretty easy to see that on each loop either r increases by 1 or j increases by 1 and (j+r) stays the same. Both j and (j+r) will reach len in O(n) time, so the whole thing is O(n^2).
Edit: I fixed the handling of r, so the above should be even more pseudocorrect. The improvement to O(n^2) runtime might help.
Re-edit: Fixed comment bugs.
Re-re-edit: More typos in algorithm, mostly mismatches misspelled and incremented by 2 instead of 1.
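As one way to make the pseudocode above concrete, here is a minimal C sketch of the same O(n^2) idea. It is not a line-by-line translation: for each delta it keeps a window [left, right) of aligned positions with at most k mismatches and slides it to the right. The name count_pairs_window and its signature are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Counts pairs of equal-length substrings (one starting at i, the other at
 * j > i) with at most k mismatches. Hypothetical helper, O(n^2) overall. */
long long count_pairs_window(const char *s, int k)
{
    assert(s != NULL);
    int n = (int)strlen(s);
    long long count = 0;

    for (int delta = 1; delta < n; ++delta) {
        int limit = n - delta;  /* aligned positions p compare s[p] with s[p+delta] */
        int right = 0;          /* current window is [left, right) */
        int mm = 0;             /* mismatches inside the window */

        for (int left = 0; left < limit; ++left) {
            if (right < left) { /* window was empty and left moved past it */
                right = left;
                mm = 0;
            }
            /* grow the window while it still holds at most k mismatches */
            while (right < limit) {
                int diff = (s[right] != s[right + delta]);
                if (mm + diff > k)
                    break;
                mm += diff;
                ++right;
            }
            /* every length 1..(right-left) starting at left is acceptable */
            count += right - left;
            /* drop the leftmost position before advancing left */
            if (right > left && s[left] != s[left + delta])
                --mm;
        }
    }
    return count;
}
```

For the passing sample from the question, count_pairs_window("abab", 0) gives 3.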

@Mike I have made some modifications to your logic, and here is the correct code for it...
#include<iostream>
#include<string>
using namespace std;
int main()
{
long long int k,c=0;
string s;
cin>>k>>s;
int len = s.length();
for(int gap = 1 ; gap < len; gap ++)
{
int i=0,j=gap,mm=0,tmp_len=0;
while (mm <=k && (j+tmp_len)<len)
{
if (s[i+tmp_len] != s[j+tmp_len])
mm++;
tmp_len++;
}
// while (((j+tmp_len)<len) && (s[i+tmp_len]==s[j+tmp_len]))
// tmp_len++;
if(mm>k){tmp_len--;mm--;}
do{
c = c + tmp_len ;
if (s[i] != s[j]) mm--;
i++;
j++;
tmp_len--;
while (mm <=k && (j+tmp_len)<len)
{
if (s[i+tmp_len] != s[j+tmp_len])
mm++;
tmp_len++;
}
if(mm>k){tmp_len--;mm--;}
}while(tmp_len>0);
}
cout<<c<<endl;
return 0;
}

Related

intermix command line strings in C

This is homework, so I am not looking for a direct answer; I am more so looking for the logic behind this. I do not believe the question is stated very well for novice C devs, and I cannot find any resources to help me out here. I am new to C, much more a Java guy, so this may seem totally and utterly noobish. The instructions are below.
$ ./mixedupecho HELLO!
.H/EmLiLxOe!dHuEpLeLcOh!oH
*
For this program, you can ignore any command-line arguments beyond the first two (including the program name itself):
$ ./mixedupecho HELLO! morestuff lalala
.H/EmLiLxOe!dHuEpLeLcOh!oH
*
Notice how "HELLO!" is shorter than "./mixedupecho", and so the program "wraps around"
and starts over again at 'H' whenever it reaches the end of the string.
*
How can you implement that? The modulo % operator is your friend here.
Specifically, note that "HELLO!"[5] yields '!', and "HELLO!"[6] is beyond the bounds of the array.
But "HELLO!"[6 % 6] evaluates to "HELLO!"[0], which yields 'H'.
And "HELLO!"[7 % 6] evaluates to "HELLO!"[1] ...
Below is the code I have so far. This iterates through every character of each argv string I get. What I don't get is how to print it so that instead of the sequence [0][0], [0][1], [0][2]... I get [0][0], [1][0], [0][1]... etc.
Can someone take a crack at explaining this to me?
int main(int argc, string argv[])
{
for(int i = 0; i < argc; i++)
{
for(int j = 0, n = strlen(argv[i]); j < n; j++)
{
printf("%c", argv[i][j]);
}
}
printf("\n");
}
THANKS SO MUCH! THIS IS DRIVING ME INSANE!

You want to keep the index i incrementing until it is equal to the index of the null terminator in the longest string. Meanwhile, you'll use the % operator to ensure that i stays within the boundaries of the shorter string.
Here's how I'd do it:
Set the initial (unsigned) lengths to -1U to avoid calculating lengths unnecessarily. I'll use LIMIT for the rest of this example as if I did #define LIMIT -1U.
Iterate through the strings, checking to ensure that argv[N][i % len[N]] is not a null terminator. If it is a null terminator and len[N] == LIMIT, set len[N] = i.
When the expression len[0] != LIMIT && len[1] != LIMIT is true, the loop ends since both strings will have the correct length, meaning all characters in each string have been enumerated.
The only thing left is printing the character for each string, which I'm sure you can handle. I would have used 0 as the initial length, except that complicates things since you can't do x % 0. The reason for unsigned length is that -1U results in an unsigned int value (e.g. 4294967295 or 65535); plain -1 results in x % -1, which makes no sense because dividing by -1 yields no remainder.
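For illustration, here is a minimal C sketch of the wrap-around interleaving. It is a simplification of the approach described above, not the same method: it measures both strings with strlen up front instead of discovering the lengths lazily. The function name mix and its buffer-based signature are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Interleaves a and b character by character, wrapping the shorter string
 * around with %, until the longer one is exhausted. Writes the result to
 * out and returns the number of characters written. Returns 0 (and an
 * empty string) if either input is empty or out is too small. */
size_t mix(const char *a, const char *b, char *out, size_t out_size)
{
    assert(a != NULL && b != NULL && out != NULL);
    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t longest = la > lb ? la : lb;

    if (la == 0 || lb == 0 || out_size < 2 * longest + 1) {
        if (out_size > 0)
            out[0] = '\0';
        return 0;
    }
    for (size_t i = 0; i < longest; ++i) {
        out[2 * i]     = a[i % la];  /* i % la stays inside a */
        out[2 * i + 1] = b[i % lb];  /* i % lb stays inside b */
    }
    out[2 * longest] = '\0';
    return 2 * longest;
}
```

In the assignment's setting, a would be argv[0] and b would be argv[1]; with "./mixedupecho" and "HELLO!" the buffer ends up holding .H/EmLiLxOe!dHuEpLeLcOh!oH.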

How should I generate the n-th digit of this sequence in logarithmic time complexity?

I have the following problem:
The point (a) was easy, here is my solution:
#include <stdio.h>
#include <string.h>
#define MAX_DIGITS 1000000
char conjugateDigit(char digit)
{
if(digit == '1')
return '2';
else
return '1';
}
void conjugateChunk(char* chunk, char* result, int size)
{
int i = 0;
for(; i < size; ++i)
{
result[i] = conjugateDigit(chunk[i]);
}
result[i] = '\0';
}
void displaySequence(int n)
{
// +1 for '\0'
char result[MAX_DIGITS + 1];
// In this variable I temporarily store the conjugates at each iteration.
// Since every component of the sequence is 1/4 the size of the sequence
// the length of `tmp` will be MAX_DIGITS / 4 + the string terminator.
char tmp[(MAX_DIGITS / 4) + 1];
// Here I assign the basic value to the sequence
strcpy(result, "1221");
// The initial value of k will be 4, since the base sequence has the length
// 4. We can see that at each step the size of the sequence is 4 times bigger
// than the previous one.
for(int k = 4; k < n; k *= 4)
{
// We conjugate the first part of the sequence.
conjugateChunk(result, tmp, k);
// We will concatenate the conjugate 2 time to the original sequence
strcat(result, tmp);
strcat(result, tmp);
// Now we conjugate the conjugate in order to get the first part.
conjugateChunk(tmp, tmp, k);
strcat(result, tmp);
}
for(int i = 0; i < n; ++i)
{
printf("%c", result[i]);
}
printf("\n");
}
int main()
{
int n;
printf("Insert n: ");
scanf("%d", &n);
printf("The result is: ");
displaySequence(n);
return 0;
}
But for point (b) I have to generate the n-th digit in logarithmic time. I have no idea how to do it. I have tried to find a mathematical property of that sequence, but I failed. Can you help me please? It is not the solution itself that really matters, but how you tackle this kind of problem in a short amount of time.
This problem was given last year (in 2014) at the admission exam at the Faculty of Mathematics and Computer Science at the University of Bucharest.

Suppose you define d_ij as the value of the ith digit in s_j.
Note that for a fixed i, d_ij is defined only for large enough values of j (at first, s_j is not large enough).
Now you should be able to prove to yourself the two following things:
once d_ij is defined for some j, it will never change as j increases (hint: induction).
For a fixed i, d_ij is defined for j logarithmic in i (hint: how does the length of s_j increase as a function of j?).
Combining this with the first item, which you solved, should give you the result along with the complexity proof.

There is a simple programming solution, the key is to use recursion.
First determine the minimal k such that the length of s_k is greater than n, so that the n-th digit exists in s_k. By definition, s_k can be split into 4 equal-length parts. You can easily determine which part the n-th symbol falls into, and its position within that part: say the n-th symbol of the whole string is the n'-th symbol of that part. This part is either s_{k-1} or inv(s_{k-1}). In either case you recursively determine the n'-th symbol of s_{k-1} and then, if needed, invert it.
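A minimal sketch of this recursion in C, under two assumptions that the answer does not state: positions are 0-based, and the base block is the single digit 1, so that the next block is 1221 as in the question's code (each block is built as s, inv(s), inv(s), s). The names nth_digit_recursive and digit_in_block are illustrative.

```c
#include <assert.h>

/* Digit at 0-based position n inside a block of length len, where len is a
 * power of 4 and len > n. Recursion depth is log4(len). */
static int digit_in_block(unsigned long long n, unsigned long long len)
{
    if (len == 1)
        return 1;                           /* assumed base block "1" */
    unsigned long long quarter = len / 4;
    unsigned long long part = n / quarter;  /* which of the 4 parts: 0..3 */
    int d = digit_in_block(n % quarter, quarter);
    /* parts 1 and 2 are the inverted copies: 1 <-> 2 */
    return (part == 1 || part == 2) ? 3 - d : d;
}

/* Returns the digit (1 or 2) at 0-based position n of the sequence. */
int nth_digit_recursive(unsigned long long n)
{
    assert(n < (1ULL << 62));  /* keeps len below overflow */
    unsigned long long len = 1;
    while (len <= n)           /* smallest power of 4 greater than n */
        len *= 4;
    return digit_in_block(n, len);
}
```

Each call divides the block length by 4, so the number of steps is logarithmic in n.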

The digits up to 4^k are used to determine the digits up to 4^(k+1). This suggests writing n in base 4.
Consider the binary expansion of n where we pair digits together, or equivalently the base 4 expansion where we write 0=(00), 1=(01), 2=(10), and 3=(11).
Let f(n) = +1 if the nth digit is 1, and -1 if the nth digit is 2, where the sequence starts at index 0 so f(0)=1, f(1)=-1, f(2)=-1, f(3)=1. This index is one lower than the index starting from 1 used to compute the examples in the question. The 0-based nth digit is (3-f(n))/2. If you start the indices at 1, the nth digit is (3-f(n-1))/2.
f((00)n) = f(n).
f((01)n) = -f(n).
f((10)n) = -f(n).
f((11)n) = f(n).
You can use these to compute f recursively, but since it is a back-recursion you might as well compute f iteratively. f(n) is (-1)^(binary weight of n) = (-1)^(sum of the binary digits of n).
See the Thue-Morse sequence.
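The closed form above can be sketched in C as a parity computation over the bits of n. The function name nth_digit_parity is made up for illustration, and n is 0-based as in the answer.

```c
#include <assert.h>

/* Returns the digit (1 or 2) at 0-based position n:
 * 1 if n has an even number of 1 bits, 2 if it has an odd number.
 * The loop runs once per bit of n, so it is logarithmic in n. */
int nth_digit_parity(unsigned long long n)
{
    int parity = 0;
    while (n != 0) {
        parity ^= (int)(n & 1ULL);  /* fold in the lowest bit */
        n >>= 1;
    }
    assert(parity == 0 || parity == 1);
    return parity ? 2 : 1;
}
```

With 1-based indices as in the question, the n-th digit would be nth_digit_parity(n - 1).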

Non-recursive combination algorithm to generate distinct character strings

This problem has been irritating me for too long. I need a non-recursive algorithm in C to generate non-distinct character strings. For instance, if a given character set is 26 characters long, and the string is of length 2, then there are 26^2 non-distinct strings.
Please note that these are distinct combinations: aab is not the same as baa or aba. I've searched S.O., and most solutions produce non-distinct combinations. Also, I do not need permutations.
The algorithm can't rely on any libraries. I'm going to translate this C code into CUDA, where standard C libraries don't work (at least not efficiently).
Before I show you what I started, let me explain an aspect of the program. It is multithreaded on a GPU, so I initialize the beginning string with a few characters, aa in this case. To create a combination, I add one or more characters depending on the desired length.
Here's one method that I have attempted:
int main(void){
//Declarations
char final[12] = {0};
char b[3] = "aa";
char charSet[27] = "abcdefghijklmnopqrstuvwxyz";
int max = 4; //Set for demonstration purposes
int ul = 1;
int k,i;
//This program is multithreaded on a GPU. Each thread is initialized
//to a starting value for the string. In this case, it is aa
//Set final with a starting prefix
int pref = strlen(b);
memcpy(final, b, pref+1);
//Determine the number of non-distinct combinations
for(int j = 0; j < length; j++) ul *= strlen(charSet);
//Start concatenating characters to the current character string
for(k = 0; k < ul; k++)
{
final[pref+1] = charSet[k];
//Do some work with the string
}
...
It should be obvious that this program does nothing useful, except if I'm only appending one character from charSet.
My professor suggested that I try using a mapping (this isn't homework; I asked him about possible ways to generate distinct combinations without recursion).
His suggestion is similar to what I started above. Using the number of combinations calculated, he suggested to decompose it according to mod 10. However, I realized it wouldn't work.
For example, say I need to append two characters. This gives me 676 combinations using the character set above. If I am on the 523rd combination, the decomposition he demonstrated would yield
523 % 10 = 3
52 % 10 = 2
5 % 10 = 5
It should be obvious that this doesn't work. For one, it yields three characters, and two, if my character set is larger than 10 characters, the mapping ignores those above index 9.
Still, I believe a mapping is key to the solution.
The other method I explored utilized for loops:
//Pseudocode
c = charset;
for(i = 0; i <length(charset); i++){
concat string
for(j = 0; j <length(charset); j++){
concat string
for...
However, this hardcodes the length of the string I want to compute. I could use an if statement with a goto to break it, but I would like to avoid this method.
Any constructive input is appreciated.

Given a string, to find the next possible string in the sequence:
1. Find the last character in the string which is not the last character in the alphabet.
2. Replace it with the next character in the alphabet.
3. Replace every character to the right of that character with the first character in the alphabet.
Start with a string which is a repetition of the first character of the alphabet. When step 1 fails (because the string is all the last character of the alphabet) then you're done.
Example: the alphabet is "ajxz".
Start with aaaa.
First iteration: the rightmost character which is not z is the last one. Change it to the next character: aaaj
Second iteration. Ditto. aaax
Third iteration: Again. aaaz
Fourth iteration: Now the rightmost non-z character is the second last one. Advance it and change all characters to the right to a: aaja
Etc.
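The procedure above can be sketched in C as a single "advance to the next string" function. This is only an illustration: the names next_string and index_of are made up, the alphabet is passed with an explicit length so no string library is needed, and every character of the string is assumed to come from the alphabet.

```c
#include <assert.h>

/* Position of c in alphabet, or -1 if it does not occur. */
static int index_of(const char *alphabet, int alpha_len, char c)
{
    for (int i = 0; i < alpha_len; ++i)
        if (alphabet[i] == c)
            return i;
    return -1;
}

/* Advances s (len characters) to the next string in sequence.
 * Returns 1 on success, 0 if s was already the last string
 * (every position holds the last character of the alphabet). */
int next_string(char *s, int len, const char *alphabet, int alpha_len)
{
    assert(s != 0 && alphabet != 0 && alpha_len > 0);
    for (int pos = len - 1; pos >= 0; --pos) {
        int idx = index_of(alphabet, alpha_len, s[pos]);
        assert(idx >= 0);                 /* assumption: s uses only the alphabet */
        if (idx < alpha_len - 1) {
            s[pos] = alphabet[idx + 1];   /* bump this position */
            for (int t = pos + 1; t < len; ++t)
                s[t] = alphabet[0];       /* reset everything to its right */
            return 1;
        }
    }
    return 0;
}
```

Starting from aaaa with the alphabet "ajxz", repeated calls produce aaaj, aaax, aaaz, aaja, and so on. The assert calls are only sanity checks and can be dropped where assert is unavailable.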

First, thanks for everyone's input; it was helpful. Being that I am translating this algorithm into CUDA, I need it to be as efficient as possible on a GPU. The methods proposed certainly work, but are not necessarily optimal for GPU architecture. I came up with a different solution using modular arithmetic that takes advantage of the base of my character set. Here's an example program, primarily in C with a mix of C++ for output, and it's fairly fast.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <iostream>
using namespace std;
typedef unsigned long long ull;
int main(void){
//Declarations
int init = 2;
char final[12] = {'a', 'a'};
char charSet[27] = "abcdefghijklmnopqrstuvwxyz";
ull max = 2; //Modify as need be
int base = strlen(charSet);
int placeHolder; //Maps to character in charset (result of %)
ull quotient; //Quotient after division by base
ull nComb = 1;
char comb[max+1]; //Array to hold combinations
int c = 0;
ull i,j;
//Compute the number of distinct combinations ((size of charset)^length)
for(j = 0; j < max; j++) nComb *= strlen(charSet);
//Begin computing combinations
for(i = 0; i < nComb; i++){
quotient = i;
for(j = 0; j < max; j++){ //No need to check whether the quotient is zero
placeHolder = quotient % base;
final[init+j] = charSet[placeHolder]; //Copy the indicated character
quotient /= base; //Divide the number by its base to calculate the next character
}
string str(final);
c++;
//Print combinations
cout << final << "\n";
}
cout << "\n\n" << c << " combinations calculated";
getchar();
}

find the longest non decreasing sub sequence

Given a string that consists only of 0s and 1s, say 10101,
how do I find the length of the longest non-decreasing subsequence?
for example,
for the string,
10101
the longest non decreasing sub sequences are
111
001
so you should output 3
for the string
101001
the longest non decreasing sub sequence is
0001
so you should output 4
How do I find this?
And how can this be done when limits are given, i.e. for the part of the sequence between the limits?
for example
101001
limits [3,6]
the longest non decreasing sub sequence is
001
so you should output 3
Can this be achieved in O(strlen)?

Can this be achieved in O(strlen)?
Yes. Observe that the non-decreasing subsequences would have one of these three forms:
0........0 // Only zeros
1........1 // Only ones
0...01...1 // Some zeros followed by some ones
The first two forms can be easily checked in O(n) by counting all zeros and by counting all ones.
The last one is a bit harder: you need to go through the string keeping the counter of zeros that you've seen so far, along with the length of the longest string of 0...01...1 form that you have discovered so far. At each step where you see 1 in the string, the length of the longest subsequence of the third form is the larger of the number of zeros plus one or the longest 0...01...1 sequence that you've seen so far plus one.
Here is the implementation of the above approach in C:
char *str = "10101001";
int longest0=0, longest1=0;
for (char *p = str ; *p ; p++) {
if (*p == '0') {
longest0++;
} else { // *p must be 1
longest1 = max(longest0, longest1)+1;
}
}
printf("%d\n", max(longest0, longest1));
max is defined as follows:
#define max( a, b ) ( ((a) > (b)) ? (a) : (b) )
Here is a link to a demo on ideone.

Use dynamic programming. Run through the string from left to right, and keep track of two variables:
zero: length of longest subsequence ending in 0
one: length of longest subsequence ending in 1
If we see a 0, we can append it to any subsequence that ends in 0, so we increase zero. If we see a 1, we can append it to the longest subsequence ending in either 0 or 1, so we set one to the longer of the two plus one. In C99:
int max(int a, int b) {
return a > b ? a : b;
}
int longest(char *string) {
int zero = 0;
int one = 0;
for (; *string; ++string) {
switch (*string) {
case '0':
++zero;
break;
case '1':
one = max(zero, one) + 1;
break;
}
}
return max(zero, one);
}
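The question also asks about limits. As a hedged sketch, the same two counters can simply be run over the requested range; the function name longest_in_range is made up, and the limits are assumed to be 1-based and inclusive, as in the question's [3,6] example.

```c
#include <assert.h>
#include <string.h>

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Length of the longest non-decreasing subsequence of s restricted to
 * positions first..last (1-based, inclusive). Runs in O(last - first). */
int longest_in_range(const char *s, int first, int last)
{
    assert(s != NULL);
    int len = (int)strlen(s);
    assert(1 <= first && first <= last && last <= len);

    int zero = 0;  /* longest subsequence so far consisting only of 0s */
    int one = 0;   /* longest subsequence so far that ends in 1 */
    for (int i = first - 1; i < last; ++i) {
        if (s[i] == '0')
            ++zero;
        else if (s[i] == '1')
            one = max_int(zero, one) + 1;
    }
    return max_int(zero, one);
}
```

For the question's examples, longest_in_range("101001", 1, 6) gives 4 and longest_in_range("101001", 3, 6) gives 3.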

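int i = 0, count = 0, max = 0;
int prev = array[0]; // assumes length >= 1
do {
if (array[i] < prev)
count = 0; // the run is broken; array[i] starts a new one
count++;
if (count > max)
max = count;
prev = array[i];
} while (++i < length);
Single pass. Will even work on any numbers, not just 1s and 0s. Note that it measures the longest contiguous non-decreasing run; a subsequence that skips elements needs the counting approach from the other answers.
For limits - set i to the starting index and use the ending index instead of the array length.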

Algorithm for processing the string

I really don't know how to implement this function:
The function should take a pointer to an integer, a pointer to an array of strings, and a string for processing. The function should write to array all variations of exchange 'ch' combination to '#' symbol and change the integer to the size of this array. Here is an example of processing:
choker => {"choker","#oker"}
chocho => {"chocho","#ocho","cho#o","#o#o"}
chachacha => {"chachacha","#achacha","cha#acha","chacha#a","#a#acha","cha#a#a","#acha#a","#a#a#a"}
I am writing this in C standard 99. So this is sketch:
int n;
char **arr;
char *string = "chacha";
func(&n,&arr,string);
And function sketch:
int func(int *n,char ***arr, char *string) {
}
So I think I need to create another function, which counts the number of 'ch' combinations and allocates memory for this one. I'll be glad to hear any ideas about this algorithm.

You can count the number of combinations pretty easily:
char * tmp = string;
int i;
for(i = 0; *tmp != '\0'; i++){
if(!(tmp = strstr(tmp, "ch")))
break;
tmp += 2; // Skip past the 2 characters "ch"
}
// i contains the number of times ch appears in the string.
int num_combinations = 1 << i;
// num_combinations contains the number of combinations. Since this is 2 to the power of the number of occurrences of "ch"

First, I'd create a helper function, e.g. countChs that would just iterate over the string and return the number of 'ch'-s. That should be easy, as no string overlapping is involved.
When you have the number of occurrences, you need to allocate space for 2^count strings. No variant is longer than the original, so strlen(original) + 1 bytes per string (including the terminator) is always enough. You also set your n variable to that 2^count.
After you have your space allocated, just iterate over all indices in your new table and fill them with copies of the original string (strcpy() or strncpy() to copy), then replace 'ch' with '#' in them (there are loads of ready snippets online, just look for "C string replace").
Finally make your arr pointer point to the new table. Be careful though - if it pointed to some other data before, you should think about freeing it or you'll end up having memory leaks.

If you would like to have all variations of replaced string, array size will have 2^n elements. Where n - number of "ch" substrings. So, calculating this will be:
int i = 0;
int n = 0;
while(string[i] != '\0')
{
if(string[i] == 'c' && string[i + 1] == 'h')
n++;
i++;
}
Then we can use the binary representation of a number. As an integer i runs from 0 to 2^n - 1, the binary representation of i tells us which "ch" occurrences to change. So:
for(unsigned long long i = 0; i < (1ULL << n); i++)
{
unsigned long long number = i;
int k = 0;
while(number > 0)
{
if(number % 2 == 1)
{
// Replace k-th occurrence of "ch"
}
number /= 2;
k++;
}
// Add replaced string to array
}
This code checks every bit in the binary representation of number and changes the k-th occurrence if the k-th bit is 1. Changing the k-th "ch" is pretty easy, and I leave it to you.
This code is useful only for fewer than 64 occurrences, because unsigned long long typically holds 64 bits, so the shift 1ULL << n is only defined for n < 64.
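To show how the pieces could fit together, here is a minimal C sketch of the bitmask approach behind the question's func signature. Several details are assumptions made for illustration: the helper count_ch, the return convention (0 on success, -1 on failure), the cap of 30 occurrences, and the output order, which follows the mask value rather than the order shown in the question's chachacha example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts occurrences of "ch" in s (they cannot overlap). */
static int count_ch(const char *s)
{
    int n = 0;
    for (const char *p = strstr(s, "ch"); p != NULL; p = strstr(p + 2, "ch"))
        ++n;
    return n;
}

/* Fills *arr with all variants of string and stores their number in *n.
 * Bit k of the mask decides whether the k-th "ch" becomes '#'.
 * The caller frees every (*arr)[i] and then *arr. */
int func(int *n, char ***arr, const char *string)
{
    assert(n != NULL && arr != NULL && string != NULL);
    int occurrences = count_ch(string);
    if (occurrences >= 30)
        return -1;                     /* keep 1 << occurrences inside int */
    int total = 1 << occurrences;
    size_t len = strlen(string);

    char **out = malloc((size_t)total * sizeof *out);
    if (out == NULL)
        return -1;
    for (int mask = 0; mask < total; ++mask) {
        char *dst = malloc(len + 1);   /* no variant is longer than the input */
        if (dst == NULL) {
            while (mask-- > 0)
                free(out[mask]);
            free(out);
            return -1;
        }
        size_t w = 0;
        int k = 0;                     /* index of the current "ch" */
        for (size_t r = 0; r < len; ) {
            if (string[r] == 'c' && string[r + 1] == 'h') {
                if (mask & (1 << k)) {
                    dst[w++] = '#';
                } else {
                    dst[w++] = 'c';
                    dst[w++] = 'h';
                }
                ++k;
                r += 2;
            } else {
                dst[w++] = string[r++];
            }
        }
        dst[w] = '\0';
        out[mask] = dst;
    }
    *n = total;
    *arr = out;
    return 0;
}
```

Called as func(&n, &arr, "chocho"), this yields n == 4 and the strings chocho, #ocho, cho#o, #o#o.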

There are two sub-problems that you need to solve for your original problem:
allocating space for the array of variations
calculating the variations
For the first problem, you need to find the mathematical function f that takes the number of "ch" occurrences in the input string and returns the number of total variations.
Based on your examples: f(1) = 2, f(2) = 4 and f(3) = 8. This should give you a good idea of where to start, but it is important to prove that your function is correct. Induction is a good way to make that proof.
Since your replace process ensures that the results have either the same or a lower length than the original, you can allocate space for each individual result equal to the length of the original (plus one for the terminator).
As for the second problem, the simplest way is to use recursion, like in the example provided by nightlytrails.
You'll need another function which take the array you allocated for the results, a count of results, the current state of the string and an index in the current string.
When called, if there are no further occurrences of "ch" beyond the index then you save the result in the array at position count and increment count (so the next time you don't overwrite the previous result).
If there are any "ch" beyond index then call this function twice (the recurrence part). One of the calls uses a copy of the current string and only increments the index to just beyond the "ch". The other call uses a copy of the current string with the "ch" replaced by "#" and increments the index to beyond the "#".
Make sure there are no memory leaks. No malloc without a matching free.
After you make this solution work you might notice that it plays loose with memory. It is using more than it should. Improving the algorithm is an exercise for the reader.
