We have two arrays
char A[]="ABABABABBBABAB";
And the other is
char B[]="BABA";
How can I find B in A, and where each occurrence starts and ends?
For example for this one
Between 2-5
Between 4-7
Between 10-13
Yes you can do this using strstr function.
This function returns a pointer to the first occurrence in haystack of the entire sequence of characters specified in needle, or a null pointer if the sequence is not present in haystack.
So the first call gives you a pointer to the beginning of the first match. To find the next occurrence, call strstr again with the first argument moved past the start of the previous match. A simple illustration:
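char haystack[]="abismyabnameab";
char needle[]="ab";
char *ret;
ret = strstr(haystack, needle);
while (ret != NULL) {
    /* do work: print the rest of the string and the 0-based start/end of the match */
    size_t start = (size_t)(ret - haystack);
    printf("%s (%zu,%zu)\n", ret, start, start + strlen(needle) - 1);
    ret = strstr(ret + 1, needle);  /* resume one character after this match */
}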
The indices follow from two things you already know: strstr tells you where each match starts (ret - haystack is its 0-based offset), and the length of the needle tells you where it ends (start + strlen(needle) - 1).
Note that the search resumes one character after the start of the previous match, not after its end. This matters when the needle can overlap itself. For example, BB occurs in BBBBB at positions 0, 1, 2 and 3. Resuming after the whole match would skip the occurrences at positions 1 and 3; resuming one character later finds all four.
A better solution is KMP, which precomputes a failure function for the needle. That gives O(n+m) complexity (n and m being the lengths of the haystack and the needle), whereas the repeated-strstr approach above can take O(n*m) in the worst case.
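Here is a minimal sketch of the KMP idea, assuming both inputs are null-terminated strings. The name kmp_find_all and its output-array interface are made up for this illustration; treat it as a starting point, not a tuned implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: stores the 0-based start offset of every occurrence
 * of needle in haystack (overlapping ones included) into out[], writing at
 * most max_out entries. Returns the total number of occurrences found,
 * or 0 if needle is empty or memory cannot be allocated. */
size_t kmp_find_all(const char *haystack, const char *needle,
                    size_t *out, size_t max_out)
{
    assert(haystack != NULL && needle != NULL);
    size_t n = strlen(haystack), m = strlen(needle);
    size_t count = 0, i, k;
    size_t *fail;

    if (m == 0 || m > n)
        return 0;
    fail = malloc(m * sizeof *fail);
    if (fail == NULL)
        return 0;

    /* Failure function: fail[i] is the length of the longest proper
     * prefix of needle[0..i] that is also a suffix of it. */
    fail[0] = 0;
    for (i = 1, k = 0; i < m; ++i) {
        while (k > 0 && needle[i] != needle[k])
            k = fail[k - 1];
        if (needle[i] == needle[k])
            ++k;
        fail[i] = k;
    }

    /* Scan the haystack once; k counts the needle characters matched so far. */
    for (i = 0, k = 0; i < n; ++i) {
        while (k > 0 && haystack[i] != needle[k])
            k = fail[k - 1];
        if (haystack[i] == needle[k])
            ++k;
        if (k == m) {
            if (count < max_out)
                out[count] = i + 1 - m;
            ++count;
            k = fail[k - 1];  /* fall back so overlapping matches are found */
        }
    }
    free(fail);
    return count;
}
```

For the arrays in the question this reports the 0-based offsets 1, 3 and 9, which are the 1-based ranges 2-5, 4-7 and 10-13.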
I need to check whether the content of one binary file is contained in another binary file.
I've tried copying both files' contents into char arrays with fread and checking them with strstr, but strstr always returns NULL, even when the content should be found in the other file.
Any ideas?
Thanks.
Since strstr won't work here for arbitrary binary data (it works only on strings terminated by \0, so it stops at the first zero byte), I can see three approaches:
1) Naive approach: iterate over one array of bytes and use memcmp against the other array at each starting position. Easy, but takes O(k*n) time (k and n being the sizes of the data).
2) Use the KMP algorithm. It takes some work to understand and code, but gives the best time complexity, O(k+n).
3) If performance is not important and you don't want to deal with ANY non-trivial algorithm:
-- Convert your binary data to strings, representing each byte by its two-digit hex value.
-- Use strstr.
Update: After thinking a little more about the third approach, there is a case where it won't work right. Suppose you want to find the data AA AA inside 1A AA A1. It shouldn't be found, since it is not there. But if you represent the data as concatenated characters without delimiters, you end up searching for AAAA in 1AAAA1, which succeeds. So adding a delimiter between bytes would be a good idea here.
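To make the delimiter point concrete, here is a rough sketch of the third approach, assuming a space after every byte is used as the delimiter. hex_encode and hex_find are hypothetical names invented for this example, and the approach triples memory use, so it is only reasonable for small inputs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: encodes len bytes as "XX " triples (two uppercase
 * hex digits plus a space delimiter). Returns a malloc'd string or NULL. */
static char *hex_encode(const unsigned char *data, size_t len)
{
    char *out = malloc(3 * len + 1);
    size_t i;

    if (out == NULL)
        return NULL;
    for (i = 0; i < len; ++i)
        sprintf(out + 3 * i, "%02X ", (unsigned)data[i]);
    out[3 * len] = '\0';
    return out;
}

/* Hypothetical helper: byte offset of sub inside data, or -1 if it is
 * absent or memory cannot be allocated. */
long hex_find(const unsigned char *data, size_t len,
              const unsigned char *sub, size_t sublen)
{
    assert(data != NULL && sub != NULL);
    char *h = hex_encode(data, len);
    char *n = hex_encode(sub, sublen);
    char *hit = (h != NULL && n != NULL) ? strstr(h, n) : NULL;
    /* Every byte occupies exactly 3 characters, so divide to get bytes. */
    long pos = (hit != NULL) ? (long)(hit - h) / 3 : -1;

    free(h);
    free(n);
    return pos;
}
```

Because each encoded byte ends with a space, a match can only begin on a byte boundary, so AA AA is no longer found inside 1A AA A1.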
Do it yourself (notify me if there's a bug):
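/* Returns the offset of the first occurrence of subdata in data,
 * or -1 if it is not found. ssize_t is defined by POSIX (<sys/types.h>). */
ssize_t bin_strstr(const void *data, size_t len, const void *subdata, size_t sublen) {
    const unsigned char *p = data;  /* arithmetic on void* is not standard C */
    size_t i;
    if (sublen > len)               /* also avoids size_t underflow */
        return -1;
    for (i = 0; i + sublen <= len; ++i)
        if (memcmp(p + i, subdata, sublen) == 0)
            return (ssize_t)i;
    return -1;
}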
The task is this: find the longest substring common to two strings. The peculiarity of the problem is that the strings are very long (each is the contents of a file, up to 400,000 characters), while the alphabet they are built from is small: 4 characters.
The strings can be of different lengths.
I came up with and implemented the following algorithm:
Read the contents of the first file into a string str1, removing the line breaks.
Read the contents of the second file into a string str2, removing the line breaks.
Consider all substrings of str1, from the longest to the shortest. To do this, run a loop while (i>0) that, after its main body, decreases the substring length i by one on each iteration, down to substrings of length 1.
Inside the while loop: all substrings of a given length differ only in their starting position.
Take a string of length N:
There is one substring of length N, starting at position 0.
There are two substrings of length N-1, starting at positions 0 and 1.
There are three substrings of length N-2, starting at positions 0, 1 and 2.
...
There are K+1 substrings of length N-K, starting at positions 0, 1, ..., K.
The starting position is iterated in the for loop (z=0; z<=g-i; z++), inside which the function getSubstring is called to extract the substring. The standard function strstr is then used to search for that substring in str2.
But this algorithm takes too long. Is there a way to make it faster?
P.S. I am writing in C.
There are at least two classical options to solve longest common substring efficiently:
Build a generalized suffix array or suffix tree of the two strings. One can show that the LCS is a prefix of two adjacent suffixes in the suffix array that have different colors (belong to the different strings). I once wrote an answer that describes a simple O(n log n) suffix array construction algorithm.
Build a suffix automaton of one string and feed the other string into it. At every point check how "deep" you are in the automaton and report the maximum over all those depths. You can find a C++ implementation in my GitHub.
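Since the question asks for C, here is a sketch of the suffix automaton option. It is not the C++ implementation mentioned above; it is a separate illustration that assumes the first string uses at most 4 distinct characters (the ALPHA constant), and lcs_length, sa_state and sa_init are names invented here. It returns only the length of the longest common substring.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define ALPHA 4  /* assumption: the first string has at most 4 distinct characters */

struct sa_state { int len, link, next[ALPHA]; };

static void sa_init(struct sa_state *s, int len)
{
    int j;
    s->len = len;
    s->link = -1;
    for (j = 0; j < ALPHA; ++j)
        s->next[j] = -1;
}

/* Hypothetical helper: length of the longest common substring of a and b,
 * or -1 on allocation failure or if a uses more than ALPHA characters. */
long lcs_length(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t n = strlen(a), i;
    int code[256], ncodes = 0, size = 1, last = 0, v = 0, c, j;
    long best = 0, cur_len = 0;
    struct sa_state *st = malloc((2 * n + 2) * sizeof *st);

    if (st == NULL)
        return -1;
    for (j = 0; j < 256; ++j)
        code[j] = -1;
    sa_init(&st[0], 0);

    /* Build the suffix automaton of a, one character at a time. */
    for (i = 0; i < n; ++i) {
        unsigned char ch = (unsigned char)a[i];
        int cur, p, q, clone;
        if (code[ch] < 0) {
            if (ncodes == ALPHA) { free(st); return -1; }
            code[ch] = ncodes++;
        }
        c = code[ch];
        cur = size++;
        sa_init(&st[cur], st[last].len + 1);
        for (p = last; p != -1 && st[p].next[c] == -1; p = st[p].link)
            st[p].next[c] = cur;
        if (p == -1) {
            st[cur].link = 0;
        } else {
            q = st[p].next[c];
            if (st[q].len == st[p].len + 1) {
                st[cur].link = q;
            } else {
                clone = size++;
                st[clone] = st[q];            /* copy transitions and link */
                st[clone].len = st[p].len + 1;
                for (; p != -1 && st[p].next[c] == q; p = st[p].link)
                    st[p].next[c] = clone;
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }

    /* Feed b through the automaton; cur_len is the current match "depth". */
    for (i = 0; b[i] != '\0'; ++i) {
        c = code[(unsigned char)b[i]];
        if (c < 0) { v = 0; cur_len = 0; continue; }  /* character not in a */
        while (v != 0 && st[v].next[c] == -1) {
            v = st[v].link;
            cur_len = st[v].len;
        }
        if (st[v].next[c] != -1) {
            v = st[v].next[c];
            ++cur_len;
        }
        if (cur_len > best)
            best = cur_len;
    }
    free(st);
    return best;
}
```

Both phases are linear in the string lengths, and with 6 ints per state the automaton for a 400,000-character string needs roughly 19 MB.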
I wanted to find the length of a part of a string after searching for it within a bigger string.
I cannot use strlen since I am dealing with binary data.
char *temp= "this is some random text";
char *temp1 = strstr(temp,"some");
int len = strlen(temp);
int len1 =0;
len1 = temp+len - temp1;
to get length of "some random text"
len1 returns negative value (even the positive value of it is wrong)
If your data is not null-terminated, then you cannot call strstr() on it for the same reason you can't call strlen(). If you do that, you can end up scanning past the end of your data. If you find a match there (which is quite possible; reading past the end of arrays is not guaranteed to crash the program), then your pointer arithmetic is going to give you a negative value, because you're subtracting a larger address from a smaller one.
On the other hand, if your data is actually properly null-terminated, then your problem is probably that strstr() doesn't find the substring and thus returns NULL. Are you checking for NULL? Otherwise, what you end up doing is:
len1 = temp + len - (char*)NULL;
Final answer:
You're looking for len - (temp1 - temp). The length of the first part is temp1 - temp. Subtract it from the length of the entire string to get the length of the remaining part.
Longer answer:
Since strlen (which is what you have used in your example, even if it only works for proper text) scans until it finds a \0 character, you can simply use strlen(temp1) for the length of the last part of the input. If you are really concerned that calling strlen twice will harm your performance (really?) then you can use len - (temp1 - temp).
You only need pointer subtraction if you are interested in the length of the first part of the input.
If you want to work with binary arrays that contain \0 at non-terminal positions, you cannot use strlen at all in your code. However, you have to have a way to specify the length of the entire input. Either you have it in an integer variable, or you have a specific delimiter and a length-computing function.
If you have the integer variable for the length then, since the length of the first part of the input is obtained by pointer subtraction, you only have to compute len - (temp1 - temp).
If you have a length-computing function, simply call it with temp1 as argument.
PS: Don't forget to check whether strstr returns NULL (by the way, you cannot use strstr if you have binary data with \0 inside the buffer).
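As a small worked example of len - (temp1 - temp) for the case where the total length is kept in a variable: the helper name remaining_length is made up for this sketch, and it assumes the match pointer was obtained by some search that stays inside the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes from match to the end of a buffer
 * that starts at buf and holds len bytes. match must point into the
 * buffer (or one past its end); it must not be NULL. */
size_t remaining_length(const char *buf, size_t len, const char *match)
{
    assert(buf != NULL && match != NULL);
    assert(match >= buf && match <= buf + len);
    /* match - buf is the length of the first part; the rest is what remains. */
    return len - (size_t)(match - buf);
}
```

With the 24-character string from the question and a match at offset 8 (the start of "some"), this gives 24 - 8 = 16, the length of "some random text".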
I have a character buffer which will contain text in this format.
somecontent...boundary="abc_is_the_boundary"
content-length=1234
--abc_is_the_boundary
somecontent
--abc_is_the_boundary
This buffer is stored in char * buf;
Now my objective is to identify the boundary value, which is abc_is_the_boundary in this case, pass all the contents of the buffer under that boundary to a function, and get back a new string that will replace them. The --abc_is_the_boundary lines are sent to the function too.
So in this case the buffer passed to the function will be
--abc_is_the_boundary
somecontent
--abc_is_the_boundary
After processing, say it returns xyz.
The content-length has changed to 3 and now the resulting buffer must look like this
somecontent...boundary="abc_is_the_boundary"
content-length=3
xyz
I can identify the boundary value using strstr. But how do I find the first and the last instance of the boundary? The boundary can appear multiple times, but only the first and the last have to be found. The content-length can be modified by using strstr again to go to the specific location and change it there. Is that the best way?
I hope you have understood
You can use simple pointer arithmetic for finding the first and the last occurrence of the pattern. Think about it this way: For the first appearance of the pattern you use the first result of strstr, since this is exactly what this function was designed for. Then you ask yourself "is there another occurrence of the pattern after the first one" and use strstr again for this. You repeat this until you find no further occurrence. The last one you found must then be the last one in the whole buffer.
It would then look somewhat like this. The code below is neither compiled, nor tested, but the idea should be clear:
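char *buf, *pattern, *firstOcc, *lastOcc, *temp;
// ... extract pattern from buffer
firstOcc = strstr(buf, pattern);
lastOcc = firstOcc;
if (firstOcc != NULL) {  // without this check, strstr(NULL + 1, ...) would be undefined
    // keep searching after each match until strstr finds nothing more
    while ((temp = strstr(lastOcc + 1, pattern)) != NULL)
        lastOcc = temp;
}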
By searching from the last found location + 1 you exclude that location itself, so strstr returns the next occurrence after it, or NULL if there is none.
I tried to make a function that replaces all occurrences of str1 in a text t with str2 but I keep getting a "buffer overflow" error message. Can you please tell me what is wrong with my function?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
//replace all *str1 in *t with *str2, put the result in *x, return *x
char * result(char *str1,char *str2,char *t)
{
char *x=NULL,*p=t,*r=t;
x=malloc(400*sizeof(char));
assert(x!=NULL);
x[0]='\0';
r=strstr(t,str1); //r is at the first occurrence of str1 in t, p is at the beginning of t
while(r!=NULL)
{
strncat(x,p,r-p); //copy r-p chars from p to x
strcat(x,str2); //copy str2 to x
p=r+strlen(str1); //p will be at the first char after the last occurrence of str1 in t
r=strstr(r+strlen(str1),str1); //r goes to the next occurrence of str1 in t
}
strcat(x,p);
return x;
}
I did not use the gets() function to read any char array.
My compiler is gcc version 4.6.3
I updated the code; it works, but the result is not as expected.
main() function:
int main(void)
{
char *sir="ab",*sir2="xyz",*text="cabwnab4jkab",*final;
final=result(sir,sir2,text);
puts(final);
free(final);
return 0;
}
printed string:
b
I expected cxyzwnxyz4jkxyz
It looks like you've got your strncpy arguments mixed up: the second argument is the source string, not the limit on the number of chars to copy, which should be the third argument:
strncpy(x, p, r - p); // copy r - p chars from p to x
Furthermore, you want to use strcat instead of strcpy. Using strcpy, you'll just overwrite the contents of the result with the replacement string, every time. Using strcat, be sure to initialize the result with \0 before starting.
Finally, you're returning a reference to a local variable x from your function: you can't do this as the memory isn't usable after the function returns.
Your code contains quite a few weird bugs.
Firstly, x is a pointer to your destination buffer. For some reason you are doing all your copying directly to x, i.e. everything is copied to the very beginning of the buffer, overwriting previously copied data. This doesn't make any sense at all. Why are you doing this? You need to create a dedicated pointer that keeps the current destination position in x and write data to that position (instead of writing it to x).
I see that you edited your code and replaced copying with concatenation. Well... Even though it might fix the problem, this is still bad design. strcat/strncat functions have no place in good C code. Anyway, your code is still broken, since you are trying to use strcat functions on uninitialized buffer x. You need to initialize x as an empty string first.
Secondly, there's a more subtle problem with your search for replacement string. At the end of the cycle you continue the search from the next symbol r=strstr(r+1,str1), i.e. you increment the search position by only 1. I'm not sure this is what you want.
Consider aaaa as input text, and the request to replace aa with bc. How many replacements do you want to do in this case? How many occurrences of aa are there in aaaa? 2 or 3? If you want to get bcbc as the result (2 replacements), you have to increase r by strlen(str1), not by 1.
In fact, in the current implementation you set p=r+strlen(str1), but continue the search from the r+1 position. This will lead to completely meaningless results with overlapping occurrences of the search string, as in my example. Try this:
char *str1="aa",*str2="xyz",*text="aaaa",*final;
final=result(str1,str2,text);
and see what happens.
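Putting the advice from this thread together (a dedicated destination pointer, no strcat, advancing by strlen(str1) after each match), one possible shape for the function is sketched below. The name replace_all and the two-pass sizing are my own choices for illustration, not the original poster's code. Counting the matches first lets the buffer be sized exactly instead of relying on a fixed 400 bytes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of text with every
 * non-overlapping occurrence of from replaced by to, or NULL on allocation
 * failure or if from is empty. The caller frees the result. */
char *replace_all(const char *text, const char *from, const char *to)
{
    assert(text != NULL && from != NULL && to != NULL);
    size_t from_len = strlen(from), to_len = strlen(to);
    size_t text_len = strlen(text), count = 0;
    const char *p, *r;
    char *out, *dst;

    if (from_len == 0)
        return NULL;

    /* First pass: count matches so the buffer can be sized exactly. */
    for (p = text; (r = strstr(p, from)) != NULL; p = r + from_len)
        ++count;

    out = malloc(text_len - count * from_len + count * to_len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: dst always points at the next free byte of out. */
    dst = out;
    for (p = text; (r = strstr(p, from)) != NULL; p = r + from_len) {
        memcpy(dst, p, (size_t)(r - p));  /* text before the match */
        dst += r - p;
        memcpy(dst, to, to_len);          /* the replacement */
        dst += to_len;
    }
    strcpy(dst, p);  /* copy the tail, including the terminator */
    return out;
}
```

With the inputs from the question, replace_all("cabwnab4jkab", "ab", "xyz") produces cxyzwnxyz4jkxyz, and the aaaa example with aa replaced by bc gives bcbc (2 replacements).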