String tokenizer in c

The following code breaks the string command at every space (" ") and at every full stop ("."):
char *token2 = strtok(command, " .");
What if I want to break command only where a space and a full stop occur together, and not at each one by itself? For example, a command like 'hello .how are you today' should be broken into these pieces (ignoring the quotes):
[hello]
[how are you today]

You can do it pretty easily with strstr:
char *strstrtok(char *str, char *delim)
{
    static char *prev;
    if (!str) str = prev;
    if (str) {
        char *end = strstr(str, delim);
        if (end) {
            prev = end + strlen(delim);
            *end = 0;
        } else {
            prev = 0;
        }
    }
    return str;
}
This is pretty much exactly the same as the implementation of strtok, just calling strstr and strlen instead of strcspn and strspn. It also might return empty tokens (if there are two consecutive delimiters or a delimiter at either end); you can arrange to ignore those if you would prefer.
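If the static state or the empty tokens get in the way, a reentrant variant is one possible direction. This is only a sketch: the name substrtok_r and the save parameter are illustrative choices modeled on strtok_r, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split str on every occurrence of the whole substring delim.
   The caller owns the save pointer, and empty tokens are skipped.
   Pass the string on the first call and NULL afterwards. */
char *substrtok_r(char *str, const char *delim, char **save)
{
    assert(delim != NULL && *delim != '\0' && save != NULL);
    size_t dlen = strlen(delim);

    if (!str)
        str = *save;
    while (str) {
        char *end = strstr(str, delim);
        if (end) {
            *save = end + dlen;   /* next search starts after the delimiter */
            *end = '\0';          /* terminate the current token */
        } else {
            *save = NULL;         /* this was the last token */
        }
        if (*str)                 /* non-empty token: hand it back */
            return str;
        str = *save;              /* empty token: move on to the next one */
    }
    return NULL;
}
```

Called in a loop on a writable copy of 'hello .how are you today' with " ." as the delimiter, it should hand back hello and then how are you today.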

Your best bet might just be to crawl your input with strstr, which finds occurrences of a substring, and manually tokenize on those.
It's a common question you ask, but I've yet to see a particularly elegant solution. The above is straightforward and workable, however.

Related

skip strtok's null terminators safely

I want to use strtok and then return the string after the null terminator that strtok has placed.
char *foo(char *bar)
{
    strtok(bar, " ");
    return after_strtok_null(bar);
}
/*
examples:
foo("hello world") = "world"
foo("remove only the first") = "only the first"
*/
My goal is not just to skip the first word (I know a simple while loop will do that); I want to call strtok once and then return the part that was not tokenized.
I will describe what I am trying to do at the end of the question, although I don't think it's really necessary.
One solution that came to mind was to simply skip all the null terminators until I reach a non-null character:
char *foo(char *bar)
{
    bar = strtok(bar, " ");
    while(!(*(bar++)));
    return bar;
}
This works fine for the examples shown above, but with a single word I may mistake the string's own null terminator for the one strtok placed, and then access unallocated memory.
For example, if I call foo("demo" /* '\0' */), the result of strtok will be "demo" /* '\0' */, and if I then run the while loop I will access the memory after the string demo. Another solution I tried uses strlen, but it has exactly the same problem.
I am trying to create a function that takes a sentence. In some of the sentences the first word is terminated with a colon, although not necessarily. If the first word is terminated with a colon, the function needs to insert that word (without the colon) into a global table and return the sentence without that word and without the spaces that follow it. Otherwise, it should just return the sentence without its leading spaces.

You could use strcspn and strspn instead:
char *foo(char *bar) {
    size_t pos = strcspn(bar, " ");     // length of the first word
    pos = strspn((bar += pos), " ");    // number of spaces that follow it
    // *bar = '\0'; // uncomment to mimic strtok
    return bar + pos;
}
You will get the expected substring, or an empty string if nothing follows the first word.
A nice side effect is that you can avoid changing the original string, even though mimicking strtok is trivial.
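The same two calls also cover the use case described at the end of the question. The following is a rough sketch under a few assumptions: the label ends with a single ':' character, only spaces separate words, and the caller does the table insertion. The name split_label and its label out-parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* If the first word of line ends with ':', store the word (without the
   colon) in *label and return the text after the spaces that follow it.
   Otherwise set *label to NULL and return line without leading spaces.
   The line is modified only when a label is found. */
char *split_label(char *line, char **label)
{
    assert(line != NULL && label != NULL);

    line += strspn(line, " ");           /* skip leading spaces */
    size_t len = strcspn(line, " ");     /* length of the first word */
    *label = NULL;

    if (len > 1 && line[len - 1] == ':') {
        char *rest = line + len;
        rest += strspn(rest, " ");       /* skip spaces after the label */
        line[len - 1] = '\0';            /* cut the colon off the label */
        *label = line;
        return rest;
    }
    return line;
}
```

Because the lengths come from strcspn and strspn, a single word such as "demo" is never read past its own terminator.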

Replacing a whole word and not substrings in a string in C

I am trying to replace a whole word in a C array of characters while skipping substrings. I did some research and only found really complicated solutions, while I think I have a better idea if someone can give me a hand.
Let's say I have the string:
char sentence[100]= "apple tree house";
And I would like to replace tree with the number 12:
"apple 12 house"
I know that the words are delimited by spaces, so my idea is to:
1. Tokenize the string with white space as the delimiter.
2. In the while loop, check with the library function strcmp whether the search string is equal to the token, and if it is, replace it.
My problem is the replacement step itself, which I couldn't get to work.
void wordreplace(char string[], char search[], char replace[]) {
    // Tokenize
    char * token = strtok(string, " ");
    while (token != NULL) {
        if (strcmp(search, token) == 0) {
            /* TODO: replace search string with replace string */
        }
        token = strtok(NULL, " ");
    }
    printf("Sentence : %s", string);
}
Any suggestions on what I can use? I guess it might be really simple, but I am a beginner. Much appreciated :)
[EDIT]: Spaces are the only delimiters and usually the string to be replaced is not longer than the original.

I would avoid strtok in this case (because it will modify the string as a side effect of tokenizing it), and approach this by looking at the string essentially character-by-character and maintaining a "read" and "write" index. Because the output can never be longer than the input, the write index will never get ahead of the read one, and you can "write-back" and make the change within the same string.
To visualize this, I find it useful to write out the input in boxes and draw arrows to current read and write indexes and track through the process so you can verify that you have a system that will do what you want it to do and that your loops and indexes all work like you expect.
Here is one implementation that matches how my own mind tends to approach this sort of algorithm. It walks the string and looks ahead to try matching from the current character. If it finds a match, it copies the replace onto the current spot, and increments both indexes accordingly.
#include <assert.h>
#include <string.h>

void wordreplace(char * string, const char * search, const char * replace) {
    // This is required to be true since we're going to do the replace
    // in-place:
    assert(strlen(replace) <= strlen(search));
    // Get ourselves set up
    size_t r = 0, w = 0;
    size_t str_len = strlen(string);
    size_t search_len = strlen(search);
    size_t replace_len = strlen(replace);
    int word_start = 1; // true while the read index sits at the start of a word
    // Walk through the input character by character.
    while (r < str_len) {
        // Is this character the start of a matching token? It is
        // if we are at the start of a word and see the search string
        // followed by a space or end of string.
        if (word_start && search_len > 0 &&
            strncmp(&string[r], search, search_len) == 0 &&
            (string[r+search_len] == ' ' || string[r+search_len] == '\0')) {
            // We matched the search token. Copy the replace token.
            memcpy(&string[w], replace, replace_len);
            // Update our indexes.
            w += replace_len;
            r += search_len;
            word_start = 0;
        } else {
            // Otherwise just copy this character. The next character
            // starts a word only if this one is a space.
            word_start = (string[r] == ' ');
            string[w++] = string[r++];
        }
    }
    // Be sure to terminate the final version of the string.
    string[w] = '\0';
}
(Note that I tweaked your function signature to use the more idiomatic pointer notation rather than char arrays, and per flu's comment below, I marked the search and replace tokens as "const" which is a way of the function advertising that it will not modify those strings.)

To do what you want to do becomes a little more involved because you need to handle the scenarios where:
replacement is shorter than original -- so you will need to move the remainder of line to follow the replacement text to avoid leaving empty space;
replacement is same length as original -- trivial case, just overwrite original with replacement; and finally
replacement is longer than original -- where you must validate the original string plus the replacement length difference will still fit in the storage for the original string, you must copy the end of line to a temporary buffer before making the replacement, and then add the rest of the line in the temporary buffer to the end.
strtok has some disadvantages here because it changes the original string during the tokenizing process (you can just make a copy, but if you want an in-place replacement, you need to look further). A combination of strstr and strcspn allows you to operate on the original string in a more efficient manner when looking for a specific search string within it.
strcspn can be used like strtok, with the set of delimiters, to provide the length of the current token found (to ensure strstr didn't match your search term as a lesser-included substring of a longer word, like tree in trees). Then it becomes a simple matter of looping with strstr, validating the length of the token with strcspn, and applying one of the three cases above.
A short example implementation with comments included in-line to help you follow along could be:
#include <stdio.h>
#include <string.h>

#define MAXLIN 100

void wordreplace (char *str, const char *srch,
                  const char *repl, const char *delim)
{
    char *p = str;                          /* pointer to str */
    size_t lenword,                         /* length of word found */
           lenstr  = strlen (str),          /* length of total string */
           lensrch = strlen (srch),         /* length of search word */
           lenrepl = strlen (repl);         /* length of replace word */

    if (!lensrch)                           /* nothing to search for */
        return;

    while ((p = strstr (p, srch))) {        /* srch exist in rest of string? */
        lenword = strcspn (p, delim);       /* get length of word found */
        /* whole words only: same length as srch, and the match must
           start the string or follow a delimiter */
        if (lenword != lensrch || (p != str && !strchr (delim, p[-1]))) {
            p++;                            /* resume search one char on */
            continue;
        }
        if (lenrepl == lensrch)             /* if replace is same len */
            memcpy (p, repl, lenrepl);      /* just copy over */
        else if (lenrepl > lensrch) {       /* if replace is longer */
            /* check that additional length will fit in str */
            if (lenstr + lenrepl - lensrch > MAXLIN - 1) {
                fputs ("error: replaced length would exceed size.\n",
                       stderr);
                return;
            }
            if (!p[lenword]) {              /* if no following char */
                memcpy (p, repl, lenrepl);  /* just copy replace */
                p[lenrepl] = 0;             /* and nul-terminate */
            }
            else {  /* store rest of line in buffer, replace, add end */
                char endbuf[MAXLIN];                        /* temp buffer for end */
                size_t lenend = strlen (p + lensrch);       /* end length */
                memcpy (endbuf, p + lensrch, lenend + 1);   /* copy end */
                memcpy (p, repl, lenrepl);                  /* make replacement */
                memcpy (p + lenrepl, endbuf, lenend + 1);   /* add end and nul */
            }
        }
        else {      /* otherwise replace is shorter than search */
            size_t lenend = strlen (p + lenword);   /* get end length */
            memcpy (p, repl, lenrepl);              /* copy replace */
            /* move end to after replace */
            memmove (p + lenrepl, p + lenword, lenend + 1);
        }
        lenstr = lenstr + lenrepl - lensrch;    /* keep total length current */
        p += lenrepl;                           /* continue after replacement */
    }
}

int main (int argc, char **argv) {
    char str[MAXLIN] = "apple tree house in the elm tree";
    const char *search = argc > 1 ? argv[1] : "tree",
               *replace = argc > 2 ? argv[2] : "12",
               *delim = " \t\n";

    wordreplace (str, search, replace, delim);
    printf ("str: %s\n", str);
}
Example Use/Output
Your replace "tree" with "12" example in "apple tree house in the elm tree":
$ ./bin/wordrepl_strstr_strcspn
str: apple 12 house in the elm 12
A simple same-length replacement of "tree" with "core", e.g.
$ ./bin/wordrepl_strstr_strcspn tree core
str: apple core house in the elm core
The "longer than" replacement of "tree" with "bobbing":
$ ./bin/wordrepl_strstr_strcspn tree bobbing
str: apple bobbing house in the elm bobbing
There are many different ways you can approach this problem, so no one way is the right way. The key is to make it understandable and reasonably efficient. Look things over and let me know if you have further questions.
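As one more of those ways, the replacement can also be built into a separate output buffer, which sidesteps the in-place length cases entirely. The sketch below assumes spaces are the only delimiters (as stated in the question) and that source and destination do not overlap; the function name wordreplace_copy and its return convention are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, replacing every space-delimited word equal to search
   with repl. Returns 0 on success, or -1 if dst (dstsize bytes) is too
   small; in that case the contents of dst are unspecified. */
int wordreplace_copy(const char *src, const char *search, const char *repl,
                     char *dst, size_t dstsize)
{
    assert(src && search && repl && dst);
    assert(*search != '\0');
    size_t search_len = strlen(search), repl_len = strlen(repl), w = 0;

    while (*src) {
        size_t gap = strspn(src, " ");            /* run of spaces */
        size_t word = strcspn(src + gap, " ");    /* word that follows */
        const char *out = src + gap;              /* text to emit */
        size_t out_len = word;

        if (word == search_len && memcmp(out, search, word) == 0) {
            out = repl;                           /* whole-word match */
            out_len = repl_len;
        }
        if (w + gap + out_len + 1 > dstsize)      /* keep room for the nul */
            return -1;
        memcpy(dst + w, src, gap);                /* copy the spaces as-is */
        memcpy(dst + w + gap, out, out_len);
        w += gap + out_len;
        src += gap + word;
    }
    if (dstsize == 0)
        return -1;
    dst[w] = '\0';
    return 0;
}
```

Comparing the word length before calling memcmp is what keeps tree from matching inside trees.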

String split in C with strtok function

I'm trying to split some strings on the {white_space} symbol.
However, there is a problem with some of the splits: I want to split on {white_space}, but keep quoted sub-strings together.
For example,
char *pch;
char str[] = "hello \"Stack Overflow\" good luck!";
pch = strtok(str," ");
while (pch != NULL)
{
    printf ("%s\n",pch);
    pch = strtok(NULL, " ");
}
This will give me
hello
"Stack
Overflow"
good
luck!
But what I want is:
hello
Stack Overflow
good
luck!
Any suggestion or idea please?

You'll need to tokenize twice. The program flow you currently have is as follows:
1) Search for a space
2) Print all characters prior to the space
3) Search for the next space
4) Print all characters between the last space and this one.
You'll need to start thinking in a different manner, with two layers of tokenization:
1) Search for quotation marks and split the string on them
2) On odd-numbered pieces, perform your original program (search for spaces)
3) On even-numbered pieces, print blindly
In this case, even-numbered pieces are (ideally) within quotes. ab"cd"ef would result in ab being odd, cd being even, and so on.
Put another way, what you are actually looking for (as a regex) is either "[a-zA-Z0-9 \t\n]*" (a quoted string, quotes included) or [a-zA-Z0-9]+ (a bare word). The only difference between the two alternatives is whether the text is enclosed in quotes. So split on quotes first, and classify from there.

Try altering your strategy.
Look at the non-white-space things; when you find a quoted string, you can put it in one string value.
So you need a function that examines the characters between white space. When it finds '"' it can change the rules and hoover up everything up to the matching '"'. If this function returns a token type and a value (the string matched), then its caller can decide how to produce the correct output. At that point you have written a tokeniser. Tokenisers are also called "lexers", and tools exist to generate them (lex, for example), as they are widely used to implement programming languages and config files.
Assuming nextc() reads the next char from the string, begun by firstc(str):
int c;
for (firstc(str); (c = nextc()) != '\0';) {
    if (isspace(c))
        continue;
    else if (c == '"')
        return readQuote();  /* Handle quoted string */
    else
        return readWord();   /* Terminated by space or '"' */
}
return EOS;
You'll need to define return values for EOS, QUOTE and WORD, and a way to get the text in each Quote or Word.
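A compact, runnable take on that idea could look like the sketch below. It folds the token type away and just returns the text, it modifies the string in place like strtok does, and it treats a quote that appears in the middle of a plain word as an ordinary character. Those are simplifying assumptions, and the name next_token and its cursor parameter are illustrative.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return the next token from *cursor, or NULL at the end of the string.
   A token is either the text between a pair of double quotes or a run of
   non-space characters. Each token is nul-terminated in place and *cursor
   is advanced past it. */
char *next_token(char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    char *s = *cursor;

    while (isspace((unsigned char)*s))      /* skip leading white space */
        s++;
    if (*s == '\0') {                       /* nothing left */
        *cursor = s;
        return NULL;
    }

    char *start;
    if (*s == '"') {                        /* quoted token */
        start = ++s;
        while (*s && *s != '"')             /* up to the closing quote */
            s++;
    } else {                                /* plain word */
        start = s;
        while (*s && !isspace((unsigned char)*s))
            s++;
    }
    if (*s)                                 /* terminate and step over */
        *s++ = '\0';
    *cursor = s;
    return start;
}
```

Calling it in a loop on a writable copy of the example string, with cursor initially pointing at the first character, should yield hello, Stack Overflow, good and luck! in that order. An unterminated quote simply runs to the end of the string.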

Here's code that works, in C.
The idea is that you first tokenize on the quote character, since that takes priority (if a string is inside quotes then we don't tokenize it, we just print it). Each of the resulting pieces is then tokenized on the space character, but only every other piece, because the pieces alternate between being outside and inside the quotes. Note that this assumes the input does not begin with a quote, since strtok_r skips leading delimiters.
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

int main(void) {
    char *pch1, *pch2, *save_ptr1, *save_ptr2;
    char str[] = "hello \"Stack Overflow\" good luck!";
    pch1 = strtok_r(str,"\"", &save_ptr1);
    bool in = false;    /* true while pch1 is a piece from inside quotes */
    while (pch1 != NULL) {
        if(in) {
            /* quoted piece: print it whole */
            printf ("%s\n", pch1);
            pch1 = strtok_r(NULL, "\"", &save_ptr1);
            in = false;
            continue;
        }
        /* unquoted piece: split it on spaces */
        pch2 = strtok_r(pch1, " ", &save_ptr2);
        while (pch2 != NULL) {
            printf ("%s\n",pch2);
            pch2 = strtok_r(NULL, " ", &save_ptr2);
        }
        pch1 = strtok_r(NULL, "\"", &save_ptr1);
        in = true;
    }
    return 0;
}
References
Tokenizing multiple strings simultaneously
http://linux.die.net/man/3/strtok_r
http://www.cplusplus.com/reference/cstring/strtok/

Set a string to a substring in C

I have a long string that I want to strip off the end. I want to get rid of everything after the character "<" (inclusively). Here is the code that works:
char *end;
end = strchr(mystring, '<');
mystring[strlen(mystring) - strlen(end)] = '\0';
So if mystring was
"asdfjk234klsjadflnwer023jokmnasdf</tag>alskjdflk23<tag2>akjsldfjsdf</tag2>blabla"
this code would return
"asdfjk234klsjadflnwer023jokmnasdf"
I'm wondering if this can be done in an easier way. I know I can increment a counter over each character in mystring until I find "<" and then use that int as the index, but that seems equally troublesome. None of the other built-in string functions seem useful, but I'm sure I'm just looking at this in the wrong way. I haven't used C for years.
Any help is appreciated!

Sure. This is the idiomatic way to do it:
char *end;
end = strchr(mystring, '<');
if (end)
    *end = '\0';

How about *end = '\0'; rather than the mystring[strlen... part?

char *end;
end = strchr(mystring, '<');
if (end != NULL)
    *end = '\0';

strchr() returns a pointer to the character (if found), so:
if(end)
    *end = '\0';
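If the NULL check feels like noise, strcspn offers another route: it returns the length of the leading part that contains none of the given characters, which is the full length when nothing matches. A small sketch follows; the wrapper name truncate_at_any is only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Cut s at the first character that appears in stops. If none of them
   occurs, strcspn returns strlen(s) and the string is left unchanged. */
char *truncate_at_any(char *s, const char *stops)
{
    assert(s != NULL && stops != NULL);
    s[strcspn(s, stops)] = '\0';
    return s;
}
```

For the question's case the call would be truncate_at_any(mystring, "<"), with mystring being a writable array rather than a string literal.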

Remove the first part of a C String

I'm having a lot of trouble figuring this out. I have a C string, and I want to remove the first part of it. Let's say it's: "Food,Amount,Calories". I want to copy out each one of those values, but not the commas. I find the comma, and return the position of the comma to my method. Then I use
strncpy(aLine.field[i], theLine, end);
To copy "theLine" to my array at position "i", with only the first "end" characters (for the first time, "end" would be 4, because that is where the first comma is). But then, because it's in a Loop, I want to remove "Food," from the array, and do the process over again. However, I cannot see how I can remove the first part (or move the array pointer forward?) and keep the rest of it. Any help would be useful!

What you need is to chop up the string using the comma as your delimiter.
You need strtok to do this. Here's some example code:
#include <stdio.h>
#include <string.h>

int main (int argc, const char * argv[]) {
    const char *s = "asdf,1234,qwer";
    char str[15];                   /* 14 characters plus the terminator */
    strcpy(str, s);                 /* strtok needs a writable copy */
    printf("\nstr: %s", str);
    char *tok = strtok(str, ",");
    printf("\ntok: %s", tok);
    tok = strtok(NULL, ",");
    printf("\ntok: %s", tok);
    tok = strtok(NULL, ",");
    printf("\ntok: %s", tok);
    return 0;
}
This will give you the following output:
str: asdf,1234,qwer
tok: asdf
tok: 1234
tok: qwer

If you have to keep the original string, work on a copy, because strtok overwrites the separators. If modifying the string is fine, you can also replace each separator with '\0' yourself and use the resulting strings directly:
char s_RO[] = "abc,123,xxxx", *s = s_RO;
while (s) {
    char *old_str = s;
    s = strchr(s, ',');
    if (s) {
        *s = '\0';
        s++;
    }
    printf("found string %s\n", old_str);
}

The function you might want to use is strtok()
Here is a nice example - http://www.cplusplus.com/reference/clibrary/cstring/strtok/

Personally, I would use strtok().
I would not recommend removing extracted tokens from the string. Removing part of a string requires copying the remaining characters, which is not very efficient.
Instead, you should keep track of your positions and just copy the sections you want to the new string.
But, again, I would use strtok().
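For comparison, the "keep track of your positions and copy" approach might be sketched as follows. The limits MAX_FIELDS and FIELD_LEN and the function name split_fields are assumptions made for this example; the question does not show how aLine.field is declared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 8     /* assumed limit, not from the question */
#define FIELD_LEN  32    /* assumed size of each field buffer */

/* Copy the comma-separated fields of line into fields[] without modifying
   line. Returns the number of fields stored. Fields longer than
   FIELD_LEN - 1 characters are truncated. */
size_t split_fields(const char *line, char fields[MAX_FIELDS][FIELD_LEN])
{
    assert(line != NULL && fields != NULL);
    size_t count = 0;
    const char *start = line;

    while (count < MAX_FIELDS) {
        size_t len = strcspn(start, ",");    /* up to the comma or the end */
        size_t n = len < FIELD_LEN - 1 ? len : FIELD_LEN - 1;
        memcpy(fields[count], start, n);
        fields[count][n] = '\0';
        count++;
        if (start[len] == '\0')              /* that was the last field */
            break;
        start += len + 1;                    /* step over the comma */
    }
    return count;
}
```

Advancing the start pointer past each comma is the "move the array pointer forward" step the question asks about. Unlike strtok, this keeps empty fields, so "a,,b" yields three fields.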

If you know where the comma is, you can just keep reading the string from that point on.
For example:
void readTheString(const char *theLine)
{
    const char *wordStart = theLine;
    const char *wordEnd;
    while (*wordStart) // while we haven't reached the null termination character
    {
        wordEnd = wordStart;
        while (*wordEnd && *wordEnd != ',') // stop at a comma or at the end of the string
            wordEnd++;
        // ... copy the substring ranging from wordStart to wordEnd
        wordStart = *wordEnd ? wordEnd + 1 : wordEnd; // start the next word
    }
}
or something like that.
The inner loop also stops at the null terminator, so a string that does not end with a ',' is handled too.
Anyway, using strtok would probably be a better idea.
