String similarity in C

For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.
Calculate the sum of similarities of a string S with each of its suffixes
Here is my solution...
#include<stdio.h>
#include<string.h>
int getSim(char str[],int subindex)
{
int l2=subindex;
int i=0;
int count=0;
for(i=0;i<l2;i++)
if(str[i]==str[subindex])
{
count++;
subindex++;
}
else
break;
return count;
}
int main()
{
int testcase=0;
int len=0;
int sum=0;
int i=0;
char s[100000];
scanf("%d",&testcase);
while(testcase--)
{
sum=0;
scanf("%s",s);
for(i=0;i<strlen(s);i++)
if(s[i]==s[0])
{
sum=sum+getSim(s,i);
}
printf("%d\n",sum);
}
}
How can we solve this problem using a suffix array?

I'm not sure if it is the best algorithm, but here is a solution.
First of all, build a suffix array. The naive algorithm (putting all suffixes into an array and then sorting it) is quite slow, O(n^2 * log(n)) operations, but there are several algorithms that do this in O(n log n) time.
I'm assuming that strings are 0-indexed.
Now, take the first letter l in the string s, and use one binary search to find the index i of the first string in the suffix array which starts with l, and another binary search to find the index j of the first string in the range [i..n] which doesn't start with l. Then all strings in the range [i..j-1] start with the same letter l, so the answer to the problem is at least j-i.
Then apply a similar procedure to the strings in the range [i..j). Take the second letter l2 and find the indexes i2 and j2 of the first string s[i2] such that s[i2][1] == l2 and the first string s[j2] after it such that s[j2][1] != l2. Add j2-i2 to the answer.
Repeat this procedure n times, until you run out of letters in the original string. The answer to the problem is (j1-i1) + (j2-i2) + ... + (jn-in).
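The narrowing procedure above can be sketched roughly as follows. This is only an illustration: the function name similarity_sum is made up, the suffix array is built with the naive qsort approach (so the build is the slow O(n^2 * log(n)) variant, not one of the O(n log n) algorithms), and the range is kept half-open as [lo, hi).

```c
#include <stdlib.h>
#include <string.h>

static const char *g_text;   /* text being sorted, used by the comparator */

/* Orders two suffix start positions by comparing the suffixes themselves. */
static int cmp_suffix(const void *a, const void *b)
{
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

/* Hypothetical helper: sum of similarities of s with all of its suffixes.
   Returns -1 if memory cannot be allocated. */
long long similarity_sum(const char *s)
{
    int n = (int)strlen(s);
    if (n == 0)
        return 0;

    int *sa = malloc((size_t)n * sizeof *sa);
    if (sa == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_text = s;
    qsort(sa, (size_t)n, sizeof *sa, cmp_suffix);   /* naive suffix array */

    long long total = 0;
    int lo = 0, hi = n;   /* suffixes sa[lo..hi) share the first k letters of s */
    for (int k = 0; k < n && lo < hi; k++) {
        unsigned char c = (unsigned char)s[k];

        /* first suffix in the range whose k-th letter is >= c */
        int a = lo, b = hi;
        while (a < b) {
            int m = a + (b - a) / 2;
            if ((unsigned char)s[sa[m] + k] < c) a = m + 1; else b = m;
        }
        int new_lo = a;

        /* first suffix in the range whose k-th letter is > c */
        b = hi;
        while (a < b) {
            int m = a + (b - a) / 2;
            if ((unsigned char)s[sa[m] + k] <= c) a = m + 1; else b = m;
        }

        lo = new_lo;
        hi = a;
        total += hi - lo;   /* this is the (j - i) term for letter k */
    }

    free(sa);
    return total;
}
```

Reading s[sa[m] + k] is safe because every suffix still in the range matches the first k letters of s, so it is at least k characters long and the read hits the terminating '\0' at worst.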

You mention in the comments that it is correct, but it's very slow.
In Java, you can get the length of a String with s.length(). This value is cached in the object, so getting it is O(1).
But in C, you get the length of a string with strlen(s), which walks the string to find the terminator and so costs O(n) on each call. Putting it in the loop condition means it may be called on every iteration, so a loop whose own bookkeeping should cost O(n) ends up costing O(n^2).
To get around this, compute the length once before the loop and reuse it. That brings the loop overhead back to linear time; the cost of getSim itself is unchanged.
Bad:
scanf("%s",s);
for(i=0;i<strlen(s);i++)
if(s[i]==s[0])
{
sum=sum+getSim(s,i);
}
Good:
scanf("%s",s);
len = strlen(s); /* len is the int already declared in main; do not name the variable strlen, that would hide the function */
for(i=0;i<len;i++) /* the loop condition is now a constant-time comparison */
if(s[i]==s[0])
{
sum=sum+getSim(s,i);
}
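For reference, a minimal sketch of the brute-force approach with the length computed once. The name similarity_sum_naive is made up for this example, the prefix comparison is written inline instead of going through getSim, and the whole string is counted as one of its own suffixes, as the problem statement implies.

```c
#include <string.h>

/* Hypothetical helper: brute-force sum of similarities, O(n^2) in the worst case. */
long long similarity_sum_naive(const char *s)
{
    size_t n = strlen(s);            /* computed once, not on every iteration */
    long long sum = 0;

    for (size_t i = 0; i < n; i++) {
        size_t k = 0;                /* length of the common prefix of s and s+i */
        while (i + k < n && s[k] == s[i + k])
            k++;
        sum += (long long)k;
    }
    return sum;
}
```

The worst case (a string of identical letters) is still quadratic because of the inner comparison, so caching the length only removes the extra strlen cost.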

Related

How to extract words from a string array in c language and store it in another 1-d array? I don't want to use pointers

I wrote this code in C, where I wanted to extract words from str, then store them in char word[] (one by one), and send it to another function- palindrome. However, the words are not being formed properly. I'm new to this language so I don't want to use pointers or something else. I want to do it in the most simple way possible. Could you please suggest modifications to the code so that the words get formed properly?
int main()
{
char str[100];
int a=0, l, p=0;
printf("Enter the text \n");
gets(str);
l=strlen(str);
for(int i=0;i<l;i++)
{
char word[100]; int a=0;
if(str[i]==' '||str[i]=='\0')
{
for(int j=p;j<i;j++)
{
word[a]=str[j];
a++;
}
printf("This \n");
puts(word);
palindrome(a,word);
}
}
return 0;
}

In C, basic does not mean easy to understand. I advise you to understand this code well, so that you see why relying on high-level functions as a beginner is not necessarily a good thing.
#include <unistd.h>
int main()
{
char buffer;
while(read(0, &buffer, 1) > 0)
write(1, &buffer, 1);
return 0;
}

Since you are an enthusiastic learner and have already tried to solve a problem, the best way to help you is to provide you the guidelines which will assure that you will end up solving your issue successfully.
Pointers
You do not want to use pointers. But what are pointers? Nothing more than variables that point to a certain position in memory and are aware of the type stored there. You already use arrays, and an array is a section of memory with a type and a size. That section has a start address, so your array is actually not very different from a pointer. So the intent to avoid pointers was already breached by using arrays, but I think I understand what you want: you want to minimize pointer arithmetic.
Understandable, implementable solution
You can iterate over the characters of your array, remember the index of the first non-space character since the last space you encountered, and find the next space. At that point, you can create a new array whose size matches the number of characters in that interval and copy the characters from the interval in question into that array. Since you are learning C, I will provide you an algorithm that you can implement (this is not actual program code).
wordStart <- -1
lastSpace <- -1
index <- 0
while (index < length(input)) do
    if (input[index] != ' ') then
        if (wordStart == lastSpace) then
            wordStart <- index
        end if
    else
        if (wordStart == lastSpace) then
            wordStart <- index
            lastSpace <- index
        else
            lastSpace <- index
            //create array of characters, having lastSpace - wordStart elements (plus one for the terminating '\0')
            //copy the elements from index wordStart to lastSpace - 1 into your new array
            //pass that new array into your call to palindrome
            wordStart <- index //no word is in progress any more
        end if
    end if
    index <- index + 1
end while
//if wordStart != lastSpace here, the last word runs from wordStart to the end of input and is handled the same way
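A possible C version of this idea, using only array indexing. It is a sketch under a few assumptions: the names split_words and MAX_WORD are made up, words are separated by spaces only, and the words are collected into a two-dimensional array instead of being passed to palindrome one by one.

```c
#include <string.h>

#define MAX_WORD 100   /* assumed maximum word length, including the '\0' */

/* Hypothetical helper: copies each space-separated word of str into words[]
   and returns how many words were stored. */
int split_words(const char str[], char words[][MAX_WORD], int max_words)
{
    int count = 0;
    int start = -1;                    /* index where the current word began, -1 if none */
    int len = (int)strlen(str);

    for (int i = 0; i <= len; i++) {   /* i == len visits the terminating '\0' */
        int is_sep = (str[i] == ' ' || str[i] == '\0');

        if (!is_sep && start < 0) {
            start = i;                 /* a new word begins here */
        } else if (is_sep && start >= 0) {
            int n = i - start;         /* the word is str[start..i-1] */
            if (count < max_words && n < MAX_WORD) {
                for (int j = 0; j < n; j++)
                    words[count][j] = str[start + j];
                words[count][n] = '\0';   /* puts() and friends need the terminator */
                count++;
            }
            start = -1;                /* no word in progress any more */
        }
    }
    return count;
}
```

Running the loop up to and including the '\0' is what makes the last word come out without a special case after the loop. Each stored word can then be passed to palindrome together with its strlen.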

Performing a sum between two arrays of digits

I had an interview today and was asked the following question: given two arrays of chars, arr1 and arr2, that contain only digits and one dot, and a value m, sum them into one array of chars that contains m digits after the dot. The program should be written in C. The algorithm was not important to them; they just gave me a compiler and 20 minutes to pass their tests.
First of all I thought of finding the maximum length, iterating through the arrays from the end, and summing the values while keeping the carry:
int length = (firstLength < secondLength) ? secondLength : firstLength;
char result[length];
for (int i = length - 1; i >= 0; i--) {
// TODO: add code
}
The problem is that I'm not sure what the right way is to perform that sum while keeping track of the dot. This loop should only perform the addition and not count the digits after the dot. I mean that at this point I thought of just adding the values, and at the end I'd add another loop that prints the required number of digits after the dot.
My question is what the first loop I mentioned (the one that actually sums) should look like. I really got stuck on it.

The algorithm was not important
Ok, I'll let libc do it for me in that case (obviously error handling is missing):
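#include <stdio.h>
void sum(char *as, char *bs, char *out, int precision)
{
    double a, b; /* double keeps about 15 significant digits, float only about 7 */
    sscanf(as, "%lf", &a);
    sscanf(bs, "%lf", &b);
    a += b;
    sprintf(out, "%.*f", precision, a); /* out must be large enough for the result */
}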

It actually took me a lot longer than 20 mins to do this. The code is fairly long too so I don't plan on posting it here. In a nutshell, the code does:
normalize the 2 numbers into 2 new strings so they have the same number of decimal digits
allocate a new string with length of longer of the 2 strings above + 1
add the 2 strings together from the right, one digit from each string at a time, with carry
it is not clear if the final answer needs to be rounded. If not, just expand/truncate the decimals to m digits. Remove any leading zeros if needed.
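The steps above could look roughly like this. It is a sketch, not the code the answer refers to: the names are made up, the inputs are assumed to be non-negative and well formed (digits with at most one dot), the result is truncated rather than rounded, and instead of building two normalized strings it pads with zeros on the fly.

```c
#include <stdlib.h>
#include <string.h>

/* Splits s at the dot: *ilen = digits before it, *flen = digits after it. */
static void split_lengths(const char *s, size_t *ilen, size_t *flen)
{
    const char *dot = strchr(s, '.');
    *ilen = dot ? (size_t)(dot - s) : strlen(s);
    *flen = dot ? strlen(dot + 1) : 0;
}

/* Digit at position p of s once it is padded to I integer digits
   (zeros on the left) and as many fraction digits as needed (zeros on the right). */
static int digit_at(const char *s, size_t ilen, size_t flen, size_t I, size_t p)
{
    if (p < I) {
        size_t pad = I - ilen;
        return p < pad ? 0 : s[p - pad] - '0';
    }
    size_t q = p - I;
    return q < flen ? s[ilen + 1 + q] - '0' : 0;
}

/* Hypothetical helper: writes a + b with exactly m digits after the dot
   into out. Returns 0 on success, -1 on failure. */
int add_decimal_strings(const char *a, const char *b, size_t m,
                        char *out, size_t out_size)
{
    size_t ia, fa, ib, fb;
    split_lengths(a, &ia, &fa);
    split_lengths(b, &ib, &fb);
    size_t I = ia > ib ? ia : ib;      /* integer digits after normalizing */
    size_t F = fa > fb ? fa : fb;      /* fraction digits after normalizing */
    size_t n = I + F;

    char *sum = malloc(n + 1);         /* sum[0] holds the final carry */
    if (sum == NULL)
        return -1;

    int carry = 0;
    for (size_t p = n; p-- > 0; ) {    /* right to left */
        int d = digit_at(a, ia, fa, I, p) + digit_at(b, ib, fb, I, p) + carry;
        sum[p + 1] = (char)('0' + d % 10);
        carry = d / 10;
    }
    sum[0] = (char)('0' + carry);

    size_t first = 0;                  /* skip leading zeros, keep one integer digit */
    while (first < I && sum[first] == '0')
        first++;
    size_t int_digits = I + 1 - first;

    if (int_digits + (m ? m + 1 : 0) + 1 > out_size) {
        free(sum);
        return -1;
    }
    memcpy(out, sum + first, int_digits);
    size_t pos = int_digits;
    if (m > 0) {
        out[pos++] = '.';
        for (size_t k = 0; k < m; k++) /* truncate or pad with zeros to m digits */
            out[pos++] = k < F ? sum[I + 1 + k] : '0';
    }
    out[pos] = '\0';
    free(sum);
    return 0;
}
```

Working on the digits directly avoids the precision limits of float and double, at the cost of more code.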

I am not sure whether this is the best solution or not but here's a solution and I hope it helps.
#include<stdio.h>
#include<math.h>
double convertNumber(char *arr){
int i;
int flag_d=0; //To check whether we are reading digits before or after decimal
double a=0;
int j=1;
for(i=0;arr[i]!='\0';i++){
if(arr[i] !='.'){
if(flag_d==0)
a = a*10 + arr[i]-48;
else{
a = a + (arr[i]-48.0)/pow(10, j);
j++;
}
}else{
flag_d=1;
}
}
return a;
}
int main() {
char num1[] = "23.20";
char num2[] = "20.2";
printf("%.6lf", convertNumber(num1) + convertNumber(num2));
}

Code that concatenates multiple strings: I have some doubts about the two-dimensional array and the for loop. Thank you very much!

/* link many strings*/
#include<stdio.h>
char *mystrcat(char * strDest, char * strSrc);
int main(void)
{
int n;
while(scanf("%d",&n))// read the number of strings to concatenate
{
if(n==0) break;// entering 0 ends the program
else
{
char words[n][100];
int i=0;
int j;
for(i=0;i<=n;i++)
{
while(fgets(words[i],100,stdin)!=NULL)
{
j=0;
while(words[i][j]!='\n')
j++;
if(words[i][j]=='\n') words[i][j]='\0';break;
}
}// read the strings
for(i=n;i>0;i--)
{
mystrcat(words[i-1],words[i]);
}// concatenate the strings
fputs(words[0],stdout);// output the string
printf("\n");
}
}
return 0;
}
// strcat-style function
char *mystrcat(char * strDest,char * strSrc)
{
char *res=strDest;
while(*strDest)strDest++;
while(*strDest=*strSrc)
{
strDest++;
strSrc++;
}
return res;
}
This code correctly concatenates multiple strings. But I think n should be n-1 in the two for loops. Yet if I change n to n-1, I can only enter n-1 strings, one fewer than I expect. Can you tell me where my idea is wrong?

for(i=0;i<=n;i++)
When i=n, this accesses the array out of bounds, which is undefined behavior. Array indexing in C starts from 0, so an array of n elements has the indices 0 to n-1, and the loop should run from 0 to n-1 (or from n-1 down to 0).
So the correction would be
for(i=0;i<=n-1;i++)
You read into the n locations of the array with indices 0 to n-1, then concatenate them one by one, so that in the end all the strings are joined in words[0], which you print.
The second loop would be
for(i=n-1;i>0;i--)
{
mystrcat(words[i-1],words[i]);
}
The idea is that, no matter what, you must never access an array index out of bounds. Written as in the second case, all the indices used are guaranteed to come from {0,1,2,3,...,n-1}.
First determine what you want to do. If you want to read n strings and then concatenate them, then yes, you can, and that is what is being done here. But a much cleaner way would be to keep a separate result string onto which you concatenate the n strings. That will not overwrite or change the strings already entered.
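One likely reason the i<=n version seemed to read the right number of strings is that scanf("%d") leaves the newline after the number in the input, so the first fgets call returns an empty line. A sketch that discards the rest of that line, reads exactly n strings, and appends them to a separate result string (the name read_and_concat and the FILE * parameter are assumptions made for this example):

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads a count n from in, then n lines, and appends
   them to out. Returns n, or -1 on bad input or if out is too small. */
int read_and_concat(FILE *in, char *out, size_t out_size)
{
    int n;
    char line[100];

    if (out_size == 0)
        return -1;
    out[0] = '\0';
    if (fscanf(in, "%d", &n) != 1 || n < 0)
        return -1;

    /* discard the rest of the line holding the count, including its '\n' */
    int c;
    while ((c = fgetc(in)) != '\n' && c != EOF) {
    }

    for (int i = 0; i < n; i++) {               /* indices 0 .. n-1 only */
        if (fgets(line, sizeof line, in) == NULL)
            return -1;
        line[strcspn(line, "\n")] = '\0';        /* strip the trailing newline */
        if (strlen(out) + strlen(line) + 1 > out_size)
            return -1;
        strcat(out, line);
    }
    return n;
}
```

Called as read_and_concat(stdin, result, sizeof result), this reads n strings without ever touching index n.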

Algorithm for processing the string

I really don't know how to implement this function:
The function should take a pointer to an integer, a pointer to an array of strings, and a string to process. It should write to the array all variations obtained by replacing occurrences of the combination 'ch' with the '#' symbol, and set the integer to the size of this array. Here is an example of the processing:
choker => {"choker","#oker"}
chocho => {"chocho","#ocho","cho#o","#o#o"}
chachacha => {"chachacha","#achacha","cha#acha","chacha#a","#a#acha","cha#a#a","#acha#a","#a#a#a"}
I am writing this in C99. Here is a sketch:
int n;
char **arr;
char *string = "chacha";
func(&n,&arr,string);
And function sketch:
int func(int *n,char ***arr, char *string) {
}
So I think I need to create another function, which counts the number of 'ch' combinations and allocates memory for this one. I'll be glad to hear any ideas about this algorithm.

You can count the number of combinations pretty easily:
char * tmp = string;
int i;
for(i = 0; *tmp != '\0'; i++){
if(!(tmp = strstr(tmp, "ch")))
break;
tmp += 2; // Skip past the 2 characters "ch"
}
// i contains the number of times ch appears in the string.
int num_combinations = 1 << i;
// num_combinations contains the number of combinations. Since this is 2 to the power of the number of occurrences of "ch"

First, I'd create a helper function, e.g. countChs that would just iterate over the string and return the number of 'ch'-s. That should be easy, as no string overlapping is involved.
When you have the number of occurrences, you need to allocate space for 2^count strings. Every replacement shortens the string by one character, so no variation is longer than the original, and strlen(original) + 1 bytes per string (including the terminating '\0') is always enough. You also set your n variable to that 2^count.
After you have your space allocated, just iterate over all indices in your new table and fill them with copies of the original string (strcpy() or strncpy() to copy), then replace 'ch' with '#' in them (there are loads of ready snippets online, just look for "C string replace").
Finally make your arr pointer point to the new table. Be careful though - if it pointed to some other data before, you should think about freeing it or you'll end up having memory leaks.

If you would like to have all variations of replaced string, array size will have 2^n elements. Where n - number of "ch" substrings. So, calculating this will be:
int i = 0;
int n = 0;
while(string[i] != '\0')
{
if(string[i] == 'c' && string[i + 1] == 'h')
n++;
i++;
}
Then we can use the binary representation of a number. Note that when incrementing an integer i from 0 to 2^n - 1, the binary representation of i tells us which "ch" occurrences to change. So:
for(unsigned long long i = 0; i < (1ULL << n); i++)
{
    unsigned long long number = i;
    int k = 0;
    while(number > 0)
    {
        if(number % 2 == 1)
        {
            // Replace k-th occurrence of "ch"
        }
        number /= 2;
        k++;
    }
    // Add replaced string to array
}
This code checks every bit in the binary representation of the number and changes the k-th occurrence if the k-th bit is 1. Changing the k-th "ch" is pretty easy, and I leave it to you.
This code is useful only for fewer than 64 occurrences, because unsigned long long is typically 64 bits wide and shifting by the full width is undefined.
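Filling in the two comments, the bitmask idea might be completed like this. It is a sketch with some assumptions: the signature follows the question's func but takes a const char *, the cap of 20 occurrences is arbitrary, and the variations come out in mask order, which matches the "chocho" example but not the order listed for "chachacha".

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: stores all 2^count variations in *arr and their number in *n.
   Returns 0 on success, -1 on failure. The caller frees each string and the array. */
int func(int *n, char ***arr, const char *string)
{
    size_t len = strlen(string);
    int count = 0;

    for (size_t i = 0; i < len; i++)           /* count the "ch" occurrences */
        if (string[i] == 'c' && string[i + 1] == 'h')
            count++;
    if (count > 20)                             /* arbitrary cap for this sketch */
        return -1;

    int total = 1 << count;
    char **out = malloc((size_t)total * sizeof *out);
    if (out == NULL)
        return -1;

    for (int mask = 0; mask < total; mask++) {
        char *v = malloc(len + 1);              /* a variation is never longer */
        if (v == NULL) {
            while (mask-- > 0)
                free(out[mask]);
            free(out);
            return -1;
        }
        size_t w = 0;                           /* write position in v */
        int k = 0;                              /* index of the current "ch" */
        for (size_t i = 0; i < len; i++) {
            if (string[i] == 'c' && string[i + 1] == 'h') {
                if (mask & (1 << k)) {
                    v[w++] = '#';               /* bit k set: replace this one */
                } else {
                    v[w++] = 'c';
                    v[w++] = 'h';
                }
                i++;                            /* skip the 'h' */
                k++;
            } else {
                v[w++] = string[i];
            }
        }
        v[w] = '\0';
        out[mask] = v;
    }

    *arr = out;
    *n = total;
    return 0;
}
```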

There are two sub-problems that you need to solve for your original problem:
allocating space for the array of variations
calculating the variations
For the first problem, you need to find the mathematical function f that takes the number of "ch" occurrences in the input string and returns the total number of variations.
Based on your examples: f(1) = 2, f(2) = 4 and f(3) = 8. This should give you a good idea of where to start, but it is important to prove that your function is correct. Induction is a good way to make that proof.
Since your replace process ensures that the results have either the same or a lower length than the original, you can allocate space for each individual result equal to the length of the original, plus one for the terminating '\0'.
As for the second problem, the simplest way is to use recursion, like in the example provided by nightlytrails.
You'll need another function which take the array you allocated for the results, a count of results, the current state of the string and an index in the current string.
When called, if there are no further occurrences of "ch" beyond the index then you save the result in the array at position count and increment count (so the next time you don't overwrite the previous result).
If there are any "ch" beyond index then call this function twice (the recurrence part). One of the calls uses a copy of the current string and only increments the index to just beyond the "ch". The other call uses a copy of the current string with the "ch" replaced by "#" and increments the index to beyond the "#".
Make sure there are no memory leaks. No malloc without a matching free.
After you make this solution work you might notice that it plays loose with memory. It is using more than it should. Improving the algorithm is an exercise for the reader.
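A rough sketch of the recursive approach described above. The names expand and ch_variations are made up, the cap of 20 occurrences is arbitrary, and as a deliberate deviation both branches share one working buffer instead of copying the current string: each branch only writes from its own position onwards, so the shared prefix stays valid.

```c
#include <stdlib.h>
#include <string.h>

/* src: original string, pos: read index in src,
   cur: working buffer, w: write index in cur,
   results/count: output array and number of results stored so far. */
static int expand(const char *src, size_t pos, char *cur, size_t w,
                  char **results, int *count)
{
    /* copy plain characters up to the next "ch" or the end of the string */
    while (src[pos] != '\0' && !(src[pos] == 'c' && src[pos + 1] == 'h'))
        cur[w++] = src[pos++];

    if (src[pos] == '\0') {                 /* no further "ch": save the result */
        cur[w] = '\0';
        char *copy = malloc(w + 1);
        if (copy == NULL)
            return -1;
        memcpy(copy, cur, w + 1);
        results[(*count)++] = copy;
        return 0;
    }

    cur[w] = 'c';                           /* branch 1: keep this "ch" */
    cur[w + 1] = 'h';
    if (expand(src, pos + 2, cur, w + 2, results, count) != 0)
        return -1;

    cur[w] = '#';                           /* branch 2: replace it with '#' */
    return expand(src, pos + 2, cur, w + 1, results, count);
}

/* Hypothetical wrapper: returns a malloc'ed array of *n variations, or NULL. */
char **ch_variations(const char *s, int *n)
{
    int occurrences = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if (s[i] == 'c' && s[i + 1] == 'h')
            occurrences++;
    if (occurrences > 20)                   /* arbitrary cap for this sketch */
        return NULL;

    char **results = malloc(((size_t)1 << occurrences) * sizeof *results);
    char *cur = malloc(strlen(s) + 1);
    *n = 0;
    if (results == NULL || cur == NULL ||
        expand(s, 0, cur, 0, results, n) != 0) {
        for (int i = 0; i < *n; i++)
            free(results[i]);
        free(results);
        free(cur);
        *n = 0;
        return NULL;
    }
    free(cur);
    return results;
}
```

With this branching order the variations for "chocho" come out as "chocho", "cho#o", "#ocho", "#o#o". The caller frees each string and then the array.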

Fastest way to count the number of occurrences of a string

I was wondering what is the fastest way to count the number of occurrences of a string (needle) within another string (haystack). The way I'm doing it is:
int findWord(char * file, char * word){
char *fptr;
char * current = strtok_r(file, " ,.\n", &fptr);
int sum = 0;
while (current != NULL){
//printf("%s\n", current);
if(strcmp(current, word) == 0)
sum+=1;
current = strtok_r(NULL, " ,.\n", &fptr);
}
return sum;
}
Would it be faster to use a more complex algorithm (Boyer-Moore)?
Thanks

Currently, if your program is counting the word "blah" and encounters a token that is "blahblah", your algorithm counts it as zero occurrences. If it needed to count it as two, you could benefit from a more advanced approach.
If your program does what you want, you are already processing as fast as you can asymptotically: it is linear in the number of characters of the longer string (the text being searched), so you cannot speed it up further.
An even more interesting solution would be required to count matches that overlap each other: for example, counting "aa" inside the string "aaaa". If you needed to return 3 for this situation, you'd need a considerably more advanced algorithm.

Would it be faster to use a more complex algorithm (Boyer-Moore)?
In your algorithm, the unit of comparison is a word rather than a character. This enables the algorithm to ignore matches that straddle a word boundary, and thus makes it run in O(n) time.
I doubt you'd be able to beat that asymptotically.
As far as lowering the multiplicative constant, right now your algorithm looks at every character in file twice. You can eliminate that redundancy by rewriting the code to use a pair of pointers and a single for loop (figuring out the details is left as an exercise for the reader :))
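One way the single-loop version could look. This is a sketch: the name count_word is made up and the delimiter set is copied from the question. It scans the text once, and only tokens whose length already matches the word are read a second time by memcmp.

```c
#include <string.h>

/* Hypothetical helper: counts tokens of text equal to word.
   Unlike the strtok_r version, it does not modify text. */
int count_word(const char *text, const char *word)
{
    const char *delims = " ,.\n";
    size_t wlen = strlen(word);
    int sum = 0;
    const char *start = text;                 /* start of the current token */

    for (const char *p = text; ; p++) {
        if (*p == '\0' || strchr(delims, *p) != NULL) {
            size_t tlen = (size_t)(p - start); /* token is start[0..tlen-1] */
            if (tlen > 0 && tlen == wlen && memcmp(start, word, wlen) == 0)
                sum++;
            if (*p == '\0')
                break;
            start = p + 1;                    /* next token starts after the delimiter */
        }
    }
    return sum;
}
```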

Unless your system has a bad implementation of string functions, this should be roughly the fastest:
const char *s, *t;
size_t cnt;
for (cnt=0, s=haystack; (t=strstr(s, needle)) != NULL; s=t+1, cnt++);
Adjust it a bit (+strlen(needle) rather than +1) if you don't want to count overlapping matches.
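Wrapped up as a function, with the overlapping/non-overlapping choice as a parameter. This is a sketch: the name count_occurrences and the allow_overlap flag are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts occurrences of needle in haystack.
   With allow_overlap != 0, "aa" is found 3 times in "aaaa", otherwise 2 times. */
size_t count_occurrences(const char *haystack, const char *needle, int allow_overlap)
{
    size_t len = strlen(needle);
    if (len == 0)
        return 0;                       /* an empty needle would never advance */

    size_t step = allow_overlap ? 1 : len;
    size_t cnt = 0;

    for (const char *s = haystack; (s = strstr(s, needle)) != NULL; s += step)
        cnt++;
    return cnt;
}
```

Note that this counts substrings, so "blah" is found twice in "blahblah", unlike the token-based version in the question.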
