Match sub-string within a string with tolerance of 1 character mismatch


I was going through some Amazon interview questions on CareerCup.com and came across this interesting question, which I haven't been able to figure out. I have been thinking about it for 2 days. Either I am taking a way-off approach, or it's a genuinely hard function to write.
The question is as follows:
Write a function in C that can find if a string is a sub-string of another. Note that a mismatch of one character should be ignored.
A mismatch can be an extra character: 'dog' matches 'xxxdoogyyyy'
A mismatch can be a missing character: 'dog' matches 'xxxdgyyyy'
A mismatch can be a different character: 'dog' matches 'xxxdigyyyy'
The return value wasn't mentioned in the question, so I assume the signature of the function can be something like this:
char * MatchWithTolerance(const char * str, const char * substr);
If there is a match with the given rules, return the pointer to the beginning of matched substring within the string. Else return null.
Bonus
If someone can also figure out a generic way of making the tolerance to n instead of 1, then that would be just brilliant.
In that case the signature would be:
char * MatchWithTolerance(const char * str, const char * substr, unsigned int tolerance = 1);

This seems to work, let me know if you find any errors and I'll try to fix them:
#include <stdio.h>
int findHelper(const char *str, const char *substr, int mustMatch) /* no default value: default arguments are not valid C */
{
if ( *substr == '\0' )
return 1;
if ( *str == '\0' )
return 0;
if ( *str == *substr )
return findHelper(str + 1, substr + 1, mustMatch);
else
{
if ( mustMatch )
return 0;
if ( *(str + 1) == *substr )
return findHelper(str + 1, substr, 1);
else if ( *str == *(substr + 1) )
return findHelper(str, substr + 1, 1);
else if ( *(str + 1) == *(substr + 1) )
return findHelper(str + 1, substr + 1, 1);
else if ( *(substr + 1) == '\0' )
return 1;
else
return 0;
}
}
int find(const char *str, const char *substr)
{
int ok = 0;
while ( *str != '\0' )
ok |= findHelper(str++, substr, 0);
return ok;
}
int main()
{
printf("%d\n", find("xxxdoogyyyy", "dog"));
printf("%d\n", find("xxxdgyyyy", "dog"));
printf("%d\n", find("xxxdigyyyy", "dog"));
}
Basically, I make sure only one character can differ, and run the function that does this for every suffix of the haystack.

This is related to a classical problem of IT, referred to as Levenshtein distance.
See Wikibooks for a bunch of implementations in different languages.
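As a rough sketch of how the Levenshtein idea carries over to substring search: the usual dynamic-programming table can be adapted so that a match may start anywhere in the text, by keeping the cost of the empty pattern prefix at zero in every column (a variant often attributed to Sellers). The function name `contains_within_distance` is made up for this illustration, and unlike the signature in the question it only reports whether a match exists, not where it starts.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if some substring of text is within `tolerance` edits
 * (extra, missing, or different character) of pattern, 0 if not,
 * and -1 if memory allocation fails. */
int contains_within_distance(const char *text, const char *pattern,
                             unsigned int tolerance)
{
    size_t m = strlen(pattern);
    size_t i, j;
    int found = 0;
    /* col[i] = smallest edit distance between the first i pattern
     * characters and any substring of text ending at the current position. */
    size_t *col = malloc((m + 1) * sizeof *col);
    if (col == NULL)
        return -1;

    for (i = 0; i <= m; i++)
        col[i] = i;                  /* empty text: i characters are missing */
    if (col[m] <= tolerance)
        found = 1;

    for (j = 0; text[j] != '\0' && !found; j++) {
        size_t diag = col[0];        /* col[i-1] from the previous column */
        col[0] = 0;                  /* a match may start anywhere in text */
        for (i = 1; i <= m; i++) {
            size_t cost = (pattern[i - 1] == text[j]) ? 0 : 1;
            size_t best = diag + cost;           /* same or different character */
            if (col[i] + 1 < best)
                best = col[i] + 1;               /* extra character in text */
            if (col[i - 1] + 1 < best)
                best = col[i - 1] + 1;           /* missing character in text */
            diag = col[i];
            col[i] = best;
        }
        if (col[m] <= tolerance)
            found = 1;
    }
    free(col);
    return found;
}
```

This runs in O(n*m) time and O(m) extra space for any tolerance. Note that with a tolerance of 1, a text containing only 'do' already counts as a match for 'dog', since the final character may be the missing one.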

This is slightly different than the earlier solution, but I was intrigued by the problem and wanted to give it a shot. Obviously optimize if desired, I just wanted a solution.
char *match(char *str, char *substr, int tolerance)
{
if (! *substr) return str;
if (! *str) return NULL;
while (*str)
{
char *str_p;
char *substr_p;
char *matches_missing;
char *matches_mismatched;
str_p = str;
substr_p = substr;
while (*str_p && *substr_p && *str_p == *substr_p)
{
str_p++;
substr_p++;
}
if (! *substr_p) return str;
if (! tolerance)
{
str++;
continue;
}
if (strlen(substr_p) <= tolerance) return str;
/* missed due to a missing letter */
matches_missing = match(str_p, substr_p + 1, tolerance - 1);
if (matches_missing == str_p) return str;
/* missed due to a mismatch of letters */
matches_mismatched = match(str_p + 1, substr_p + 1, tolerance - 1);
if (matches_mismatched == str_p + 1) return str;
str++;
}
return NULL;
}

Is the problem to do this efficiently?
The naive solution is to loop over every substring of str that has the same length as substr, from left to right, and return true if the current substring differs from substr in at most one character.
Let n = size of str
Let m = size of substr
There are O(n) substrings in str, and the matching step takes time O(m). Ergo, the naive solution runs in time
O(n*m)
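A minimal sketch of that naive scan, assuming only the "different character" kind of mismatch needs to be handled (extra or missing characters would need more work). The name `match_one_substitution` is invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Naive O(n*m) scan: returns a pointer to the first window of str that
 * differs from substr in at most one position, or NULL if there is none.
 * Only substitutions are handled, not extra or missing characters. */
const char *match_one_substitution(const char *str, const char *substr)
{
    size_t n = strlen(str);
    size_t m = strlen(substr);
    size_t start, i;

    if (m > n)
        return NULL;
    for (start = 0; start + m <= n; start++) {        /* O(n) windows */
        size_t mismatches = 0;
        for (i = 0; i < m && mismatches <= 1; i++) {  /* O(m) per window */
            if (str[start + i] != substr[i])
                mismatches++;
        }
        if (mismatches <= 1)
            return str + start;
    }
    return NULL;
}
```

For 'xxxdigyyyy' and 'dog' this returns a pointer to the 'd'; for 'xxxdgyyyy' it returns NULL, because a missing character shifts the rest of the window.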

With an arbitrary number of tolerance levels.
Worked for all the test cases I could think of. Loosely based on |/|ad's solution.
#include<stdio.h>
#include<string.h>
void report (int x, char* str, char* sstr, int t) {
    if ( x )
        printf( "%s is a substring of %s for a tolerance[%d]\n",sstr,str,t );
    else
        printf ( "%s is NOT a substring of %s for a tolerance[%d]\n",sstr,str,t );
}
int find_with_tolerance (char *str, char *sstr, int tol) {
if ( (*sstr) == '\0' ) //end of substring, and match
return 1;
if ( (*str) == '\0' ) //end of string
if ( tol >= strlen(sstr) ) //but tol saves the day
return 1;
else //there's nothing even the poor tol can do
return 0;
if ( *sstr == *str ) { //current char match, smooth
return find_with_tolerance ( str+1, sstr+1, tol );
} else {
if ( tol <= 0 ) //that's it. no more patience
return 0;
for(int i=1; i<=tol; i++) {
if ( *(str+i) == *sstr ) //insertion of a foreign character
return find_with_tolerance ( str+i+1, sstr+1, tol-i );
if ( *str == *(sstr+i) ) //deal with deletion
return find_with_tolerance ( str+1, sstr+i+1, tol-i );
if ( *(str+i) == *(sstr+i) ) //deal with replacement
return find_with_tolerance ( str+i+1, sstr+i+1, tol-i );
if ( *(sstr+i) == '\0' ) //substr ends, thanks to tol & this loop
return 1;
}
return 0; //when all fails
}
}
int find (char *str, char *sstr, int tol ) {
int w = 0;
while (*str!='\0')
w |= find_with_tolerance ( str++, sstr, tol );
return (w) ? 1 : 0;
}
int main() {
enum { n = 3 }; //no of test cases (a constant expression, so the array below can have an initializer in C)
char *sstr = "dog"; //the substr
char *str[n] = { "doox", //those cases
"xxxxxd",
"xxdogxx" };
int t[] = {1,1,0}; //tolerance levels for those cases
for(int i = 0; i < n; i++) {
report( find ( *(str+i), sstr, t[i] ), *(str+i), sstr, t[i] );
}
return 0;
}

Related

using binary search to find the first capital letter in a sorted string [closed]

I wrote the following code to find the first capital letter in a string using binary search:
char first_capital(const char str[], int n)
{
int begin = 0;
int end = n - 1;
int mid;
while (begin <= end)
{
mid = (begin + end) / 2;
if (mid == 0 && isupper(str[mid]))
{
return mid;
}
else if (mid > 0 && isupper(str[mid]) && islower(str[mid - 1]))
{
return mid;
}
if (islower(str[mid]))
{
begin = mid + 1;
}
else
{
end = mid - 1;
}
}
return 0;
}
Currently my code isn't working as expected while testing it. If anyone can mention where I went wrong it would help a lot.
NOTE: The input string will be already sorted (all lower case letters appear before upper case letters). const char str[] is the string and int n is the length of the string.
EDIT: for example: first_capital("abcBC", 5) should return 'B'.

Your logic is completely right, but you returned the wrong value:
char first_capital(const char str[], int n)
{
int begin = 0;
int end = n - 1;
int mid;
while (begin <= end)
{
mid = (begin + end) / 2;
if(mid == 0 && isupper(str[mid]))
{
return mid; // Here the index is returned not the character
}
else if (mid > 0 && isupper(str[mid]) && islower(str[mid-1]))
{
return mid; // Same goes here
}
if(islower(str[mid]))
{
begin = mid+1;
}
else
{
end = mid - 1;
}
}
return 0;
}
The driver code
int main(){
printf("%d\n", first_capital("abcabcabcabcabcZ", 16));
}
will give 15 as the answer, which is the index of the character Z.
If you want the character to be returned, replace return mid with return str[mid], and 'Z' will be returned.

#include <stdio.h>
#include <string.h> /* strlen */
#include <ctype.h> /* isupper */
/* This will find and return the first UPPERCASE character in txt
* provided that txt is zero-or-more lowercase letters,
* followed by zero-or-more uppercase letters.
* If it is all lower-case letters, it will return \0 (end of string)
* If it is all upper-case letters, it will return the first letter (txt[0])
* If there are non-alpha characters in the string, all bets are off.
*/
char findFirstUpper(const char* txt)
{
size_t lo = 0;
size_t hi = strlen(txt);
while(hi-lo > 1)
{
size_t mid = lo + (hi-lo)/2;
*(isupper(txt[mid])? &hi : &lo) = mid;
}
return isupper(txt[lo])? txt[lo] : txt[hi];
}
int main(void)
{
char answer = findFirstUpper("abcBC");
printf("Found char %c\n", answer);
return 0;
}

If the function deals with strings then the second parameter should be removed.
The function should return a pointer to the first upper case letter or a null pointer if such a letter is not present in the string. That is the function declaration and behavior should be similar to the declaration and behavior of the standard string function strchr. The only difference is that your function does not require a second parameter of the type char because the searched character is implicitly defined by the condition to be an upper case character.
On the other hand, though your function has the return type char it returns an integer that specifies the position of the found character. Also your function does not make a difference between the situations when an upper case character is not found and when a string contains an upper case character in its first position.
Also your function has too many if-else statements.
The function can be declared and defined the following way as it is shown in the demonstrative program below.
#include <stdio.h>
#include <string.h>
#include <ctype.h>
char * first_capital( const char s[] )
{
const char *first = s;
const char *last = s + strlen( s );
while ( first < last )
{
const char *middle = first + ( last - first ) / 2;
if ( islower( ( unsigned char )*middle ) )
{
first = middle + 1;
}
else
{
last = middle;
}
}
return ( char * )( isupper( ( unsigned char )*first ) ? first : NULL );
}
int main(void)
{
const char *s = "";
char *result = first_capital( s );
if ( result )
{
printf( "%c at %zu\n", *result, ( size_t )( result - s ) );
}
else
{
printf( "The string \"%s\" does not contain an upper case letter.\n", s );
}
s = "a";
result = first_capital( s );
if ( result )
{
printf( "%c at %zu\n", *result, ( size_t )( result - s ) );
}
else
{
printf( "The string \"%s\" does not contain an upper case letter.\n", s );
}
s = "A";
result = first_capital( s );
if ( result )
{
printf( "%c at %zu\n", *result, ( size_t )( result - s ) );
}
else
{
printf( "The string \"%s\" does not contain an upper case letter.\n", s );
}
s = "abcdefA";
result = first_capital( s );
if ( result )
{
printf( "%c at %zu\n", *result, ( size_t )( result - s ) );
}
else
{
printf( "The string \"%s\" does not contain an upper case letter.\n", s );
}
s = "abAB";
result = first_capital( s );
if ( result )
{
printf( "%c at %zu\n", *result, ( size_t )( result - s ) );
}
else
{
printf( "The string \"%s\" does not contain an upper case letter.\n", s );
}
return 0;
}
The program output is
The string "" does not contain an upper case letter.
The string "a" does not contain an upper case letter.
A at 0
A at 6
A at 2

How can I go about finding balance in a string in C?

I want this program to recursively solve this using a stack implementation with push and pop. I have the push and pop done, as well as these functions:
A string the users enter can only be made up of these characters. Any other characters and it returns unbalanced.
'(', ')', '{', '}', '[', ']'
An example of a balanced string is like this
()
(())
()()
{()()}
[]
[()[]{}]()
etc..
An unbalanced string looks like this:
{}}
()[}
[()}
etc..
This is the recursive definition of a balanced string:
(BASIS)The empty string is balanced
(NESTING) If s is a balanced string, then (s), [s], and {s} are balanced.
(CONCATENATION) If A and B are both balanced strings, then AB is also balanced.
I do not know what my base case would be or how to implement this with recursion. I can do it without recursion, but I want to learn recursion. Any help?

I think you want to solve the "balanced parentheses" problem.
You can solve it easily with a stack, without any recursion.
You can follow this (a C++-style sketch):
// stk is a stack of char
// s is a string
// ok is the flag: it stays true only while every closer matches
bool ok = true;
for(int i=0; i<s.size() && ok; i++)
{
    if(s[i]=='('||s[i]=='['||s[i]=='{')
        stk.push(s[i]);
    else if(s[i]==')' && !stk.empty() && stk.top()=='(')
        stk.pop();
    else if(s[i]==']' && !stk.empty() && stk.top()=='[')
        stk.pop();
    else if(s[i]=='}' && !stk.empty() && stk.top()=='{')
        stk.pop();
    else
        ok = false;
}
The string is balanced if the flag is still set and the stack is empty at the end.
A similar question (Basic Recursion, Check Balanced Parenthesis) may also help.

Well, having double as your stack's element type is rather wasteful, but I'll play along:
int is_balanced(char *ins) {
SPointer st = stk_create();
int rval = 1;
for (int i = 0; i < strlen(ins); i += 1) {
int c = ins[i];
if ('(' == c) stk_push(st, (ElemType)')');
else if ('[' == c) stk_push(st, (ElemType)']');
else if ('{' == c) stk_push(st, (ElemType)'}');
else if (')' == c || ']' == c || '}' == c) {
if (stk_empty(st) || c != stk_pop(st)) {
rval = 0;
break;
}
} else {
rval = 0;
break;
}
}
if (! stk_empty(st)) rval = 0;
stk_free(st);
return rval;
}

Recursively done...
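const char* balanced_r(const char* s, int* r)
{
    /* each closer sits 4 positions after its opener */
    const char* brackets= "([{\0)]}";
    const char *b = brackets;
    if (s == 0) return s;
    if (*s == 0) return s;
    while (*b && *b != *s) b++;
    if (*b) /* *s is an opening bracket */
    {
        s = balanced_r(s+1, r);
        if (*s != *(b+4)) { *r = 0; return s; } /* do not step past the terminator */
        return balanced_r(s + 1, r);
    }
    return s;
}
int balanced(const char* s)
{
    int r = 1;
    const char* end = balanced_r(s, &r);
    if (end && *end) r = 0; /* leftover closer or invalid character */
    return r;
}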
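The recursive definition from the question can also be mapped onto code fairly directly, with the call stack standing in for the explicit stack. The following is only a sketch under that assumption; the names `closer_for`, `skip_balanced`, and `is_balanced_string` are invented for this example.

```c
/* Returns the closing bracket for an opening one, or '\0' if c is not an opener. */
static char closer_for(char c)
{
    switch (c) {
    case '(': return ')';
    case '[': return ']';
    case '{': return '}';
    default:  return '\0';
    }
}

/* Consumes a balanced sequence starting at s and returns a pointer to the
 * first character that is not part of it (a closer, an invalid character,
 * or the terminator). Sets *ok to 0 if an opener lacks its closer. */
static const char *skip_balanced(const char *s, int *ok)
{
    while (*ok && closer_for(*s) != '\0') {   /* CONCATENATION: one group after another */
        char expected = closer_for(*s);
        s = skip_balanced(s + 1, ok);         /* NESTING: balanced content inside */
        if (!*ok)
            return s;
        if (*s != expected) {
            *ok = 0;
            return s;
        }
        s++;                                  /* step over the matching closer */
    }
    return s;                                 /* BASIS: nothing (more) to consume */
}

/* Returns 1 if the whole string is balanced, 0 otherwise. */
int is_balanced_string(const char *s)
{
    int ok = 1;
    const char *end = skip_balanced(s, &ok);
    return ok && *end == '\0';
}
```

The base case is simply "the next character is not an opener", which covers the empty string; anything left over after the outermost call means the input was unbalanced or contained a character outside the allowed set.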

Here is a demonstrative program written in C++ that you can use as an algorithm and rewrite it in C
#include <iostream>
#include <iomanip>
#include <stack>
#include <cstring>
bool balance( const char *s, std::stack<char> &st )
{
const char *open = "({[<";
const char *close = ")}]>";
if ( *s == '\0' )
{
return st.empty();
}
const char *p;
if ( ( p = std::strchr( open, *s ) ) != nullptr )
{
st.push( *s );
return balance( s + 1, st );
}
else if ( ( p = std::strchr( close, *s ) ) != nullptr )
{
if ( !st.empty() && st.top() == open[p-close] )
{
st.pop();
return balance( s + 1, st );
}
else
{
return false;
}
}
else
{
return false;
}
}
int main()
{
for ( const char *s : {
"()", "(())", "()()", "{()()}", "[]", "[()[]{}]()",
"{}}", "()[}", "[()}"
} )
{
std::stack<char> st;
std::cout <<'\"' << s << "\" is balanced - "
<< std::boolalpha << balance( s, st )
<< std::endl;
}
return 0;
}
The program output is
"()" is balanced - true
"(())" is balanced - true
"()()" is balanced - true
"{()()}" is balanced - true
"[]" is balanced - true
"[()[]{}]()" is balanced - true
"{}}" is balanced - false
"()[}" is balanced - false
"[()}" is balanced - false

Match exact string with strstr

Suppose I have the following string:
in the interior of the inside is an inner inn
and I want to search, say, for the occurrences of "in" (how often "in" appears).
In my program, I've used strstr to do so, but it returns false positives. It will return:
- in the interior of the inside is an inner inn
- interior of the inside is an inner inn
- inside is an inner inn
- inner inn
- inn
Thus thinking "in" appears 5 times, which is obviously not true.
How should I proceed in order to search exclusively for the word "in"?

Try the following
#include <stdio.h>
#include <string.h>
#include <ctype.h>
int main(void)
{
char *s = "in the interior of the inside is an inner inn";
char *t = "in";
size_t n = strlen( t );
size_t count = 0;
char *p = s;
while ( ( p = strstr( p, t ) ) != NULL )
{
char *q = p + n;
if ( p == s || isblank( ( unsigned char ) *( p - 1 ) ) )
{
if ( *q == '\0' || isblank( ( unsigned char ) *q ) ) ++count;
}
p = q;
}
printf( "There are %zu string \"%s\"\n", count, t );
return 0;
}
The output is
There are 1 string "in"
You can also add a check for ispunct if the source string can contain punctuation.

Search for " in "; note the spaces. Then consider the edge cases of a sentence starting with "in " and ending with " in".

One more way to do it is:
Use strtok() on your whole sentence with space as delimiter.
So now you can check your token against "in"
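A small sketch of the strtok() approach, assuming words are separated only by spaces. The helper name `count_word` is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Counts tokens in `sentence` that are exactly equal to `word`.
 * strtok writes '\0' into the buffer, so `sentence` must be a
 * modifiable array, not a string literal. */
size_t count_word(char *sentence, const char *word)
{
    size_t count = 0;
    char *token = strtok(sentence, " ");
    while (token != NULL) {
        if (strcmp(token, word) == 0)
            count++;
        token = strtok(NULL, " ");
    }
    return count;
}
```

Called on a char array holding the sentence from the question, this counts "in" once. Because strtok() overwrites the delimiters, work on a copy if the original sentence is needed afterwards.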

Add an isdelimiter() to check the characters before and after the result of strstr().
#include <stdio.h>
#include <string.h>
// Adjust as needed.
int isdelimiter(char ch) {
return (ch == ' ') || (ch == '\0');
}
int MatchAlex(const char *haystack, const char *needle) {
int match = 0;
const char *h = haystack;
const char *m;
size_t len = strlen(needle);
while ((m = strstr(h, needle)) != NULL) {
if ((m == haystack || isdelimiter(m[-1])) && isdelimiter(m[len])) {
// printf("'%s'",m);
match++;
h = m + len; // resume after this match so it is not counted again
} else {
h = m + 1; // skip past the rejected hit
}
}
return match;
}
int main(void) {
printf("%d\n",
MatchAlex("in the interior of the inside is an inner inn xxin", "in"));
return 0;
}

C general programming with strings

If I am not allowed to use the <string.h> library, how can I easily compare string values? I have a data file with 6 possible values for one member of a structure. All I need to do is create a loop to count how many of each value are present in an array of structs. The problem is, I cannot figure out how to compare the values and thus when to increment the counter.
for (i = 0; i < datasize; i++){
if (struct.membervalue == given)
givencount++;
if (struct.membervalue == given2) // But I can't compare them with the ==
givencount2++ ; // because they are strings.
}
EDIT: predefined enum that I MUST USE
typedef enum {
penny = 1,
nickel = 5,
dime = 10,
quarter = 25
}changeT;
I have the value "penny" how do I compare to this or relate it?

#include <stdbool.h>
bool isEqual(const char *string1, const char *string2)
{
do
{
if (*string1 != *string2) return false;
if (*string1 == 0) return true;
++string1;
++string2;
} while (1);
}
Update: The enum doesn't change anything. You still have to identify the string "penny" before you can assign it the value for a penny.

You can try the following function:
int str_cmp(const unsigned char* str1, const unsigned char* str2)
{
    /* advance while both strings agree and neither has ended */
    while (*str1 && *str1 == *str2) {
        str1++;
        str2++;
    }
    return (int)*str1 - (int)*str2;
}
The result is positive if str1>str2, negative if str1<str2, and zero if they are equal.

Fastest one:
int strcmp(const char *s1, const char *s2) {
int ret = 0;
while (!(ret = *(unsigned char *) s1 - *(unsigned char *) s2) && *s2)
++s1, ++s2;
if (ret < 0) {
ret = -1;
}
else if (ret > 0) {
ret = 1 ;
}
return ret;
}

/*These variants could point to invalid memory, but don't dereference it.*/
int isEqual(const char *string1, const char *string2)
{
while (*string1 == *string2++)
if ( 0 == *string1++ ) return 1;
return 0;
}
/* This variant is NULL-resistant. If both are NULL, it returns true.*/
int isEqual(const char *string1, const char *string2)
{
if ( !string1 || !string2 ) return string1 == string2 ;
while (*string1 == *string2++)
if ( 0 == *string1++ ) return 1;
return 0;
}
These are only functions to compare strings. To help more, we need to see the code you are trying. It could be something like:
if (isEqual(data.membervalue, "penny")) pennycount++;
else
if (isEqual(data.membervalue, "nickel")) nickelcount++;
And the enum you provided is not of great help for counting. It is useful for calculating the "monetary" total.
int Total= penny * pennycount + nickel * nickelcount ... ;
If all you need is the total, things get simpler:
if (isEqual(data.membervalue, "penny")) Total += penny;
else
if (isEqual(data.membervalue, "nickel")) Total += nickel;
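One way to avoid a long if/else chain is a small lookup table that relates each name to its changeT value. This is only a sketch: `is_equal` and `coin_from_name` are hypothetical helpers, and the table covers just the four coins in the enum from the question.

```c
typedef enum {
    penny = 1,
    nickel = 5,
    dime = 10,
    quarter = 25
} changeT;

/* Compares two strings without <string.h>; returns 1 if equal, 0 otherwise. */
static int is_equal(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return *a == *b;
}

/* Looks up the enum value for a coin name. Returns 1 and stores the value
 * in *out on success, or returns 0 for an unknown name. */
int coin_from_name(const char *name, changeT *out)
{
    static const struct { const char *name; changeT value; } table[] = {
        { "penny", penny }, { "nickel", nickel },
        { "dime", dime },   { "quarter", quarter }
    };
    unsigned i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (is_equal(name, table[i].name)) {
            *out = table[i].value;
            return 1;
        }
    }
    return 0;
}
```

With the value in hand, a running total is just `total += value`, and per-coin counters can be kept in an array indexed by the table position.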

ANSI C splitting string

Hey there!
I'm stuck on an ANSI C problem which I think should be pretty trivial (it is at least in any modern language :/).
The (temporary) goal of my script is to split a string (array of char) of 6 characters ("123:45") which represents a timestamp minutes:seconds (for audio files so it's ok to have 120 minutes) into just the minutes and just the seconds.
I tried several approaches - a general one with looking for the ":" and a hardcoded one just splitting the string by indices but none seem to work.
void _splitstr ( char *instr, int index, char *outstr ) {
char temp[3];
int i;
int strl = strlen ( instr );
if ( index == 0 ) {
for ( i = 0; i < 3; ++i ) {
if ( temp[i] != '\0' ) {
temp[i] = instr[i];
}
}
} else if ( index == 1 ) {
for ( i = 6; i > 3; i-- ) {
temp[i] = instr[i];
}
}
strcpy ( outstr, temp );
}
Another "funny" thing is that the string length of a char[3] is 6 or 9 and never actually 3. What's wrong with that?

How about using sscanf()? It's as simple as it can get.
char time[] = "123:45";
int minutes, seconds;
sscanf(time, "%d:%d", &minutes, &seconds);
This works best if you can be sure that the time string syntax is always valid. Otherwise you must add a check for that. On success, sscanf returns the number of items successfully read, so it's pretty easy to detect errors too.
Working example: http://ideone.com/vVoBI
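A sketch of what such a check could look like, using the return value of sscanf() plus the standard %n directive to detect trailing characters. The function name `parse_timestamp` and the rule that seconds must be below 60 are assumptions for this example.

```c
#include <stdio.h>

/* Parses "minutes:seconds" (for example "123:45").
 * Returns 1 on success, 0 if the text does not have that shape. */
int parse_timestamp(const char *text, int *minutes, int *seconds)
{
    int consumed = 0;
    /* %n stores how many characters were read so far; it is not
     * counted in sscanf's return value. */
    if (sscanf(text, "%d:%d%n", minutes, seconds, &consumed) != 2)
        return 0;
    if (text[consumed] != '\0')      /* trailing junk after the seconds */
        return 0;
    if (*minutes < 0 || *seconds < 0 || *seconds > 59)
        return 0;
    return 1;
}
```

Note that %d skips leading whitespace and accepts a sign, so a stricter format would need extra checks on the raw characters.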

How about...
int seconds, minutes;
minutes = atoi(instr);
while(*instr != ':' && *++instr != '\0');
if(*instr == ':') ++instr; /* step past the ':' so atoi sees the digits */
seconds = atoi(instr);
Should be pretty fast.

You have basically three options:
- change the input string (can't be a string literal)
- copy data to output strings (input can be a literal)
- transform sequences of characters to numbers
Changing the input string implies transforming "123:45" to "123\0" "45" with an embedded null.
Copying data implies managing storage for the copy.
Transforming sequences of characters implies using, for example, strtol.
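For the third option, a minimal sketch with strtol, which reports where each number ended so the ':' can be checked explicitly. The name `split_with_strtol` is invented here, and range errors (overflow reported through errno) are not handled.

```c
#include <stdlib.h>

/* Converts "minutes:seconds" with strtol.
 * Returns 1 on success, 0 on malformed input. */
int split_with_strtol(const char *text, long *minutes, long *seconds)
{
    char *end;

    *minutes = strtol(text, &end, 10);
    if (end == text || *end != ':')      /* no digits, or no separator */
        return 0;

    text = end + 1;                      /* skip the ':' */
    *seconds = strtol(text, &end, 10);
    if (end == text || *end != '\0')     /* no digits, or trailing junk */
        return 0;
    return 1;
}
```

This neither modifies the input nor needs storage for copies, so it also works on string literals.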

You aren't putting a terminating null on your string in temp[], so when you do a strlen(temp), you are accessing arbitrary memory.
Using your known lengths, you can use something like this:
char temp[4];
if (index==0)
{
strncpy(temp, instr, 3);
temp[3] = 0;
}
else if (index==1)
{
strncpy(temp, instr+4, 2);
temp[2] = 0;
}
strcpy(outstr, temp);
But, I'll caution that I've skipped all sorts of checking for valid lengths in instr and outstr.

You can try something like this:
void break_string(const char* input, char* hours, char* minutes)
{
if(input == 0 || hours == 0 || minutes == 0)
return;
while(*input && *input != ':') /* also stop at the terminator if there is no ':' */
*hours++ = *input++;
*hours = '\0';
if(*input == ':')
++input;
while((*minutes++ = *input++));
return;
}
Here is the same function a bit simplified:
void break_string(const char* input, char* hours, char* minutes)
{
if(input == 0 || hours == 0 || minutes == 0)
return;
while(*input && *input != ':') /* also stop at the terminator if there is no ':' */
{
*hours = *input;
++hours;
++input;
}
*hours = '\0';
if(*input == ':')
++input; //ignore the ':' part
while(*input)
{
*minutes = *input;
++minutes;
++input;
}
*minutes = '\0';
return;
}
#include <stdio.h>
#include <stdlib.h>
int main()
{
char* test = "123:45";
char* minutes = malloc( sizeof(char) * 12 );
char* hours = malloc( sizeof(char) * 12 );
break_string( test , hours , minutes );
printf("%s , %s \n" , hours , minutes );
//...
free( minutes );
free( hours ) ;
}

This?
char *mins, *secs;
mins = str;
while(*(++str) != ':');
str[0] = '\0';
secs = str + 1;

Here's one way; I have ignored the "index" argument above:
#include <stdio.h>
#include <string.h>
void _splitstr ( char *instr, char *outstr ) {
char temp[10];
char* end = strchr(instr, ':');
int i = 0;
if(end != 0) {
while(instr != end)
temp[i++] = *instr++;
temp[i] = '\0';
strcpy(outstr, temp);
} else {
*outstr = '\0'; /* write an empty string; assigning to outstr itself would only change the local pointer */
}
}
