Tokenize a string in C

I need to tokenize a string in C. Suppose I have a string like this:
"product=c,author=dennis,category=programming".
I want to extract only the values from these key-value pairs, like
[c,dennis,programming].
I have used the strtok function, which tokenizes on "=", and I get the values
[product,c,author,dennis,category,programming].
Is there any built-in function that can generate only the values, as shown above?

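A simple scanf loop is enough:
#include <stdio.h>
int main(void)
{
    char token[20] = { 0 };
    int i = 0;
    /* %19[a-z] reads one lowercase word (at most 19 chars);
       %*[^a-z] discards the separators that follow it */
    while (scanf("%19[a-z]%*[^a-z]", token) == 1) {
        i++;
        if (i % 2 == 0) /* every second word is a value */
            printf("[%s]\n", token);
    }
    return 0;
}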
./a.out
product=c,author=dennis,category=programming,
[c]
[dennis]
[programming]
Ctrl+D
Note: I added a trailing comma to the input string.

You could simply skip every second token, like this:
#include <stdio.h>
#include <string.h>
int main(void) {
char str[] = "product=c,author=dennis,category=programming";
char* p = strtok(str, ",=");
while (p != NULL) {
p = strtok(NULL, ",=");
if (p != NULL) {
printf("%s\n", p);
strtok(NULL, ",="); // skip this
}
}
return 0;
}

I can think of a couple of ways:
First tokenize on ,, then split each part on the =.
Find the first =, then the , after it, and get the word in between. Repeat.
If there are always three values, you can use sscanf to read the values.
You can use a regex library to parse the string.
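As a rough sketch of the second idea (find the =, then the , after it), something like the following could work. The helper name nth_value and its parameters are made up for illustration; it only reads the input, so it also works on a string literal.

```c
#include <string.h>

/* Hypothetical helper: copies the n-th value (0-based) of a
 * "key=value,key=value" string into out (at most outsz - 1 chars).
 * Returns 1 on success, 0 if there is no such pair. */
int nth_value(const char *s, size_t n, char *out, size_t outsz)
{
    if (outsz == 0)
        return 0;
    while (s != NULL) {
        const char *eq = strchr(s, '=');      /* first '=' in this pair */
        if (eq == NULL)
            return 0;
        const char *val = eq + 1;
        size_t len = strcspn(val, ",");       /* up to the next ',' or the end */
        if (n == 0) {
            if (len >= outsz)
                len = outsz - 1;              /* truncate to fit */
            memcpy(out, val, len);
            out[len] = '\0';
            return 1;
        }
        n--;
        /* move to the next pair, or stop if this was the last one */
        s = (val[len] == ',') ? val + len + 1 : NULL;
    }
    return 0;
}
```

Calling nth_value(str, 0, buf, sizeof buf), then with 1 and 2, should yield c, dennis and programming for the string in the question; a fourth call returns 0.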

You can first tokenize on ,, splitting the contents into 3 different strings, then tokenize on '=' for each of those strings:
char *kvpair[N] = {NULL}; // where N is large enough for the expected
// number of key-value pairs
char *tok = strtok(input, ",");
size_t kvcount = 0;
while (tok != NULL && kvcount < N)
{
kvpair[kvcount++] = tok;
tok = strtok(NULL, ",");
}
...
char delim = '['; // printed before the first value; ',' before the rest
for (i = 0; i < kvcount; i++)
{
char *key = strtok(kvpair[i], "=");
char *val = strtok(NULL, "=");
printf("%c%s", delim, val);
delim = ',';
}
putchar(']');
This is just a rough sketch; it assumes that the maximum number of key-value pairs is known ahead of time, it doesn't attempt to handle empty keys or values, or really do any sort of error handling at all. But it should point you in the right direction.
Remember that strtok modifies its input; if your original data is a string literal or if you need to preserve the original data, you'll need to make a copy and work on that copy.
Note that, because of how strtok works, you can't "nest" calls; that is, you can't tokenize the first key-value pair, then split it into key and value tokens, then get the next key-value pair. You'll have to tokenize all the key-value pairs first, then process each one in turn.
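If POSIX is available, strtok_r lifts the nesting restriction, because each tokenizing session keeps its state in a pointer supplied by the caller. The sketch below assumes a POSIX system; extract_values and its parameters are illustrative names, not part of any library.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask for the strtok_r declaration */
#endif
#include <string.h>

/* Hypothetical helper: splits a modifiable "k1=v1,k2=v2" string and
 * stores a pointer to each value in vals[] (at most max of them).
 * The pointers refer into input. Returns the number of values stored. */
size_t extract_values(char *input, char *vals[], size_t max)
{
    char *outer_state, *inner_state;
    size_t n = 0;

    for (char *pair = strtok_r(input, ",", &outer_state);
         pair != NULL && n < max;
         pair = strtok_r(NULL, ",", &outer_state)) {
        /* The inner session has its own state variable, so it does
         * not disturb the outer one. */
        char *key = strtok_r(pair, "=", &inner_state);
        char *val = strtok_r(NULL, "=", &inner_state);
        if (key != NULL && val != NULL)
            vals[n++] = val;      /* pairs without a value are skipped */
    }
    return n;
}
```

With the string from the question copied into a char array, this should fill vals with c, dennis and programming in a single pass.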

Related

C string nested splitting

I'm a beginner at C and I'm stuck on a simple problem. Here it goes:
I have a string formatted like this: "first1:second1\nsecond2\nfirst3:second3" ... and so on.
As you can see from the example, the first field is optional ([firstx:]secondx).
I need to get a resulting string which contains only the second field. Like this: "second1\nsecond2\nsecond3".
I did some research here on Stack Overflow (string splitting in C) and found that there are two main functions in C for string splitting: strtok (which I read is obsolete) and strsep.
I tried to write the code using both functions (plus strdup) without success. Most of the time I get some unpredictable result.
Better ideas?
Thanks in advance
EDIT:
This was my first try
int main(int argc, char** argv){
char * stri = "ciao:come\nva\nquialla:grande\n";
char * strcopy = strdup(stri); // since strsep and strtok both modify the input string
char * token;
while((token = strsep(&strcopy, "\n"))){
if(token[0] != '\0'){ // I don't want the last match of '\n'
char * sub_copy = strdup(token);
char * sub_token = strtok(sub_copy, ":");
sub_token = strtok(NULL, ":");
if(sub_token[0] != '\0'){
printf("%s\n", sub_token);
}
}
free(sub_copy);
}
free(strcopy);
}
Expected output: "come", "va", "grande"
Here's a solution with strcspn:
#include <stdio.h>
#include <string.h>
int main(void) {
const char *str = "ciao:come\nva\nquialla:grande\n";
const char *p = str;
while (*p) {
size_t n = strcspn(p, ":\n");
if (p[n] == ':') {
p += n + 1;
n = strcspn(p , "\n");
}
if (p[n] == '\n') {
n++;
}
fwrite(p, 1, n, stdout);
p += n;
}
return 0;
}
We compute the size of the initial segment not containing : or \n. If it's followed by a :, we skip over it and get the next segment that doesn't contain \n.
If it's followed by \n, we include the newline character in the segment. Then we just need to output the current segment and update p to continue processing the rest of the string in the same way.
We stop when *p is '\0', i.e. when the end of the string is reached.

How to save string tokens to an array, then use those contents for further comparison

/* I am unsure whether my code for saving the tokens in an array is correct.
Whenever I run my program, the code that compares token[0] with my variable
produces no output and does not perform the assigned function,
so I suspect something is wrong with my code. */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
int main()
{
//variable declarations
const char *array[] = {"ax","bo","cf"};
char delim[]=" \n";
char* myline;
size_t max = 500;
char* token1;
char* token2[max];
int n = 0;
while(1) //loop always
{
printf("Enter an argument\n"); //asks for an input
getline (&myline, &max, stdin); //read the input/line
//for loop -- splits up the line into tokens
for(token1 = strtok(myline, " "); token1 != NULL; token1 = strtok(NULL, delim))
{
token2[n] = malloc(strlen(token1)+1); //allocate some space/memory to token2[n]
//save the token in an array by copying from token1 to token2
strcpy(token2[n],token1);
int m;
for(m = 0; m<sizeof(array);m++) //loop through the array elements for comparison
{
//compare array at index m with token at index 0 -- compare only first token with a specific variable
if(strcmp(token2[0], array[m]) == 0)
{
printf("equal");
}
}
}
free(token2[n]); //deallocate assigned memory
}
return(0);
}
I think you should try a vector of strings (this requires C++ rather than C), like
vector < string > str = { "ax","bo","cf" };
There seem to be a few issues in your current code:
for(m = 0; m<strlen;m++) is not correct. strlen() is a <string.h> function used to obtain the length of a C string. Since you index array[m], the loop guard must use the number of elements in array. Note that sizeof(array) on its own gives the size in bytes, not the element count; to get the count, use sizeof(array)/sizeof(array[0]). It would be good to put this in a macro:
#define ARRAYSIZE(x) (sizeof(x) / sizeof((x)[0]))
Then your loop can be:
size_t m;
for(m = 0; m<ARRAYSIZE(array); m++)
You need to check the return value of malloc(), as it can return NULL if it fails to allocate space. Here is a way to check this:
token2[n] = malloc(strlen(token1)+1);
if (token2[n] == NULL) {
/* handle error */
}
It is possible to skip the malloc()/strcpy() step by simply using strdup.
getline() returns -1 on failure to read a line, so it's good to check this. It also leaves the \n character at the end of the buffer, so you need to remove it. Otherwise, strcmp will never find equal strings, as you will be comparing strcmp("string\n", "string"). You need to find the \n character in your buffer and replace it with a \0 null terminator.
You can achieve this like:
size_t slen = strlen(myline);
if (slen > 0 && myline[slen-1] == '\n') {
myline[slen-1] = '\0';
}
You also need to free() all of the char* pointers in token2[].
Since you are using the same delimiter for strtok(), it's better to make this const. So const char *delim = " \n"; instead.
I suggested many of the fixes in the comments, so I have not repeated them here, since you seem to have updated your code with those suggestions.
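Putting those points together, a minimal sketch might look like this. It works on a line that has already been read into a buffer, and the helper names save_tokens and find_keyword are made up for illustration. Because "\n" is part of the delimiter set, the trailing newline never ends up inside a token.

```c
#include <stdlib.h>
#include <string.h>

#define ARRAYSIZE(x) (sizeof(x) / sizeof((x)[0]))

/* Hypothetical helper: splits line (modified in place) on spaces and
 * newlines and stores a heap copy of each token in tokens[].
 * Returns the number of tokens stored, or -1 if an allocation fails. */
int save_tokens(char *line, char *tokens[], size_t max)
{
    const char *delim = " \n";
    size_t n = 0;

    for (char *t = strtok(line, delim); t != NULL && n < max;
         t = strtok(NULL, delim)) {
        size_t len = strlen(t) + 1;
        tokens[n] = malloc(len);
        if (tokens[n] == NULL) {
            while (n > 0)                 /* undo the earlier allocations */
                free(tokens[--n]);
            return -1;
        }
        memcpy(tokens[n++], t, len);      /* copies the '\0' as well */
    }
    return (int)n;
}

/* Returns the index of word in the keyword list, or -1 if it is absent. */
int find_keyword(const char *word)
{
    static const char *keywords[] = { "ax", "bo", "cf" };

    for (size_t m = 0; m < ARRAYSIZE(keywords); m++)
        if (strcmp(word, keywords[m]) == 0)
            return (int)m;
    return -1;
}
```

After save_tokens returns a positive count, find_keyword(tokens[0]) performs the comparison the question is after. The caller is expected to free() every stored token once it is done with them.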

How do you split a string in C?

If I have a string like:
const char* mystr = "Test Test Bla Bla \n Bla Bla Test \n Test Test \n";
How would I use the newline character '\n', to split the string into an array of strings?
I'm trying to accomplish in C what string.Split() does in C# or what Boost's string algorithm split does in C++.
Try to use the strtok function. Be aware that it modifies the source memory so you can't use it with a string literal.
char *copy = strdup(mystr);
char *tok;
tok = strtok(copy, "\n");
/* Do something with tok. */
while (tok) {
tok = strtok(NULL, "\n");
/* ... */
}
free(copy);
The simplest way to split a string in C is to use strtok(); however, it comes with an arm's-length list of caveats on its usage:
It's destructive (it overwrites parts of the input string), so you couldn't use it on the string you have above, which is a string literal.
It's not reentrant (it keeps its state between calls, so you can only use it to tokenize one string at a time, let alone use it with threads). Some systems provide a reentrant version, e.g. strtok_r(). Your example might be split up like:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (void) {
char mystr[] = "Test Test Bla Bla \n Bla Bla Test \n Test Test \n";
char *word = strtok(mystr, " \n");
while (word) {
printf("word: %s\n", word);
word = strtok(NULL, " \n");
}
return 0;
}
Note the important change of your string declaration -- it's now an array and can be modified. It's possible to tokenize a string without destroying it, of course, but C does not provide a simple solution for doing so as part of the standard library.
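For the non-destructive route, one option is to combine strspn and strcspn so that each token is reported as a pointer plus a length and nothing is written to the string. This is only a sketch; next_token and its parameters are illustrative names.

```c
#include <string.h>

/* Hypothetical helper: finds the next token at or after *cursor without
 * writing to the string. On success it stores the token length in *len,
 * advances *cursor past the token, and returns a pointer to the token's
 * first character (the token is NOT '\0'-terminated).
 * Returns NULL when no tokens remain. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    const char *start = *cursor + strspn(*cursor, delims); /* skip delimiters */

    if (*start == '\0')
        return NULL;
    *len = strcspn(start, delims);     /* length of the run of non-delimiters */
    *cursor = start + *len;
    return start;
}
```

Since the tokens are not terminated, print them with a precision, e.g. printf("%.*s\n", (int)len, tok). This works directly on the const char* string literal from the question.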
Remember that C makes you do all the memory allocation by hand. Remember also that C doesn't really have strings, only arrays of characters. Also, string literals are immutable, so you're going to need to copy it. It will be easier to copy the whole thing first.
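So, something like this (wholly untested; xstrdup and xmalloc are not standard C and are assumed here to be wrappers around strdup and malloc that abort on failure):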
char *copy = xstrdup(mystr);
char *p;
char **arry;
size_t count = 0;
size_t i;
for (p = copy; *p; p++)
if (*p == '\n')
count++;
arry = xmalloc((count + 1) * sizeof(char *));
i = 0;
p = copy;
arry[i] = p;
while (*p)
{
if (*p == '\n')
{
*p = '\0';
arry[++i] = p+1; /* pre-increment, so arry[0] is not overwritten */
}
p++;
}
return arry; /* deallocating arry and arry[0] is
the responsibility of the caller */
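A self-contained variant of the same idea, using only standard functions, could look like the sketch below. The name split_lines and the NULL-terminated result array are choices made for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits a private copy of text at each '\n'.
 * Returns a NULL-terminated array of pointers into that copy and stores
 * the number of pieces in *count; returns NULL on allocation failure.
 * The caller frees arr[0] (the copy) and then arr itself. */
char **split_lines(const char *text, size_t *count)
{
    size_t len = strlen(text), pieces = 1, i = 0;
    char *copy = malloc(len + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, text, len + 1);

    for (const char *p = copy; *p; p++)   /* one extra piece per separator */
        if (*p == '\n')
            pieces++;

    char **arr = malloc((pieces + 1) * sizeof *arr);
    if (arr == NULL) {
        free(copy);
        return NULL;
    }
    arr[i++] = copy;                      /* first piece starts the copy */
    for (char *p = copy; *p; p++) {
        if (*p == '\n') {
            *p = '\0';                    /* end the current piece */
            arr[i++] = p + 1;             /* the next piece starts here */
        }
    }
    arr[i] = NULL;
    *count = pieces;
    return arr;
}
```

Unlike strtok, this keeps empty pieces, so a trailing '\n' produces a final empty string, which is closer to what string.Split() does.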
In the answers above I see only while(){} loops, where IMHO for(){} loops are more compact.
cnicutar:
for (tok = strtok(copy, "\n"); tok; tok = strtok(NULL, "\n")) {
    /* ... */
}
FatalError:
char *word;
for (word = strtok(mystr, " \n"); word; word = strtok(NULL, " \n")) {
    printf("word: %s\n", word);
}
Zack:
for (arry[i=0]=p=copy; *p; p++)
{
    if (*p == '\n')
    {
        *p = '\0';
        arry[++i] = p+1; /* pre-increment, so arry[0] is not overwritten */
    }
}
[the clarity of this last example is disputable]
You can use the library below (Boost.Tokenizer, which is C++ rather than C). It has many other useful functions.
http://www.boost.org/doc/libs/1_48_0/libs/tokenizer/index.html
Or you can use strtok function.

Working with tokenizing in c

I am trying to tokenize a line and put it into a two-dimensional array. So far I have come up with this, but I feel I am far off:
/**
* Function to tokenize an input line into separate tokens
*
* The first arg is the line to be tokenized and the second arg points to
* a 2-dimensional string array. The number of rows of this array should be
* at least MAX_TOKENS_PER_LINE, and the number of columns (i.e., the length
* of each string) should be at least MAX_TOKEN_SIZE
*
* Returns 0 on success and negative number on failure
*/
int __tokenize(char *line, char tokens[][MAX_TOKEN_SIZE], int *num_tokens){
char *tokenPtr;
tokenPtr = strtok(line, " \t");
for(int j =0; j<MAX_TOKEN_SIZE; j++){
while(tokenPtr != NULL){
if(!(tokens[][j] = tokenPtr)){return -1;}
num_tokens++;
tokenPtr = strtok(NULL, " \t");
}
}
return 0;
}
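int __tokenize(char *line, char tokens[][MAX_TOKEN_SIZE], int *num_tokens)
{
    char *tokenPtr = strtok(line, " \t");
    int i;
    for (i = 0; tokenPtr != NULL && i < MAX_TOKENS_PER_LINE; i++)
    {
        /* tokens[i] is an array, so copy the characters instead of assigning the pointer */
        strncpy(tokens[i], tokenPtr, MAX_TOKEN_SIZE - 1);
        tokens[i][MAX_TOKEN_SIZE - 1] = '\0';
        tokenPtr = strtok(NULL, " \t");
    }
    *num_tokens = i;
    return 0;
}
Hopefully this works.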
You should implement a finite state machine. I've just finished my shell command lexer/parser (LL).
See: How to write a (shell) lexer by hand
tokenPtr is not initialized - it may or may not be NULL the first time through the loop.
strtok takes 2 arguments. If you want to split on multiple chars, include them all in the 2nd string.
After the strtok call, the token pointer points to the string you want. Now what? You need somewhere to store it. Perhaps an array of char*? Or a 2-D array of characters, as in your edited prototype.
tokens[i] is storage for MAX_TOKEN_SIZE characters. strtok() returns a pointer to a string (a sequence of one or more characters). You need to copy one into the other.
What is the inner loop accomplishing?
Note that char tokens[][MAX] is usually referred to as a 2-D array of characters. (or a 1-D array of fixed-length strings). A 2-D array of strings would be char* tokens[][MAX]
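To make the storage difference concrete, here is a sketch of both options: copying each token into a row of a 2-D char array, and storing only pointers in a 1-D array of char*. The two constants have assumed values, and tokenize_copy and tokenize_ptrs are illustrative names.

```c
#include <string.h>

#define MAX_TOKENS_PER_LINE 8   /* assumed values, for illustration only */
#define MAX_TOKEN_SIZE 16

/* Copies each token into its own row of a 2-D char array. Returns 0 on
 * success, -1 if there are too many tokens or a token is too long. */
int tokenize_copy(char *line, char tokens[][MAX_TOKEN_SIZE], int *num_tokens)
{
    int n = 0;

    for (char *t = strtok(line, " \t"); t != NULL; t = strtok(NULL, " \t")) {
        if (n == MAX_TOKENS_PER_LINE || strlen(t) >= MAX_TOKEN_SIZE)
            return -1;
        strcpy(tokens[n++], t);   /* safe: the length was checked above */
    }
    *num_tokens = n;
    return 0;
}

/* Stores pointers into line instead. Nothing is copied, so the tokens
 * are valid only for as long as line is. Returns the token count. */
int tokenize_ptrs(char *line, char *tokens[], int max)
{
    int n = 0;

    for (char *t = strtok(line, " \t"); t != NULL && n < max;
         t = strtok(NULL, " \t"))
        tokens[n++] = t;
    return n;
}
```

The copying version costs more memory but leaves the caller free to reuse the line buffer; the pointer version is cheaper but ties the lifetime of the tokens to the buffer.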

How does strtok() split the string into tokens in C?

Please explain to me the working of strtok() function. The manual says it breaks the string into tokens. I am unable to understand from the manual what it actually does.
I added watches on str and *pch to check how it works. When the first iteration of the while loop ran, the contents of str were only "this". How was the output shown below printed on the screen?
/* strtok example */
#include <stdio.h>
#include <string.h>
int main ()
{
char str[] ="- This, a sample string.";
char * pch;
printf ("Splitting string \"%s\" into tokens:\n",str);
pch = strtok (str," ,.-");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok (NULL, " ,.-");
}
return 0;
}
Output:
Splitting string "- This, a sample string." into tokens:
This
a
sample
string
The strtok runtime function works like this.
The first time you call strtok, you provide the string that you want to tokenize:
char s[] = "this is a string";
In the above string, the space seems to be a good delimiter between words, so let's use that:
char* p = strtok(s, " ");
What happens now is that s is searched until the space character is found; the first token ("this") is returned, and p points to that token (string).
To get the next token and continue with the same string, NULL is passed as the first argument, since strtok maintains a static pointer to the string you passed previously:
p = strtok(NULL," ");
p now points to "is",
and so on until no more spaces can be found; then the last string is returned as the last token, "string".
More conveniently, you could write it like this instead to print out all the tokens:
for (char *p = strtok(s," "); p != NULL; p = strtok(NULL, " "))
{
puts(p);
}
EDIT:
If you want to store the returned values from strtok you need to copy the token to another buffer e.g. strdup(p); since the original string (pointed to by the static pointer inside strtok) is modified between iterations in order to return the token.
strtok() divides the string into tokens: each token is a run of characters that contains none of the delimiters. In your case, strtok() first skips the leading "-" and " ", and the first token, "This", ends at the ",". The search then resumes after that delimiter, skips the following space, and returns "a", which ends at the next space. The rest of the string is split the same way, and the last token, "string", ends at the ".".
strtok maintains a static, internal reference pointing to the next available token in the string; if you pass it a NULL pointer, it will work from that internal reference.
This is the reason strtok isn't re-entrant; as soon as you pass it a new pointer, that old internal reference gets clobbered.
strtok doesn't change the parameter itself (str). It stores that pointer (in a local static variable). It can then change what that parameter points to in subsequent calls without having the parameter passed back. (And it can advance that pointer it has kept however it needs to perform its operations.)
From the POSIX strtok page:
This function uses static storage to keep track of the current string position between calls.
There is a thread-safe variant (strtok_r) that doesn't do this type of magic.
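To show what that kept position amounts to, here is a sketch of a strtok-style tokenizer in which the caller holds the state, much as strtok_r does. It is not the library implementation; my_tok and its save parameter are illustrative names.

```c
#include <string.h>

/* A sketch of a strtok-style tokenizer. The position that strtok keeps
 * in hidden static storage is held here in *save, owned by the caller. */
char *my_tok(char *str, const char *delims, char **save)
{
    if (str == NULL)
        str = *save;                  /* continue where the last call stopped */
    if (str == NULL)
        return NULL;                  /* the previous call hit the end */

    str += strspn(str, delims);       /* skip leading delimiters */
    if (*str == '\0') {
        *save = NULL;
        return NULL;                  /* nothing but delimiters was left */
    }

    char *end = str + strcspn(str, delims);  /* first delimiter after token */
    if (*end != '\0') {
        *end = '\0';                  /* terminate the token in place */
        *save = end + 1;              /* resume just past it next time */
    } else {
        *save = NULL;                 /* the token ran to the end */
    }
    return str;
}
```

Replacing *save with a static variable inside the function gives the classic strtok behaviour, including its restriction to one tokenizing session at a time.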
strtok will tokenize a string i.e. convert it into a series of substrings.
It does that by searching for delimiters that separate these tokens (or substrings). And you specify the delimiters. In your case, you want ' ' or ',' or '.' or '-' to be the delimiter.
The programming model for extracting these tokens is that you hand strtok your main string and the set of delimiters. Then you call it repeatedly, and each time strtok returns the next token it finds, until it reaches the end of the main string, when it returns a null pointer. Another rule is that you pass the string in only the first time, and NULL on subsequent calls. This tells strtok whether you are starting a new tokenizing session with a new string or retrieving tokens from the previous session. Note that strtok remembers its state for the tokenizing session, and for this reason it is not reentrant or thread-safe (use strtok_r instead where that matters). Another thing to know is that it actually modifies the original string: it writes '\0' over the delimiter that ends each token.
One way to invoke strtok, succinctly, is as follows:
char str[] = "this, is the string - I want to parse";
char delim[] = " ,-";
char* token;
for (token = strtok(str, delim); token; token = strtok(NULL, delim))
{
printf("token=%s\n", token);
}
Result:
token=this
token=is
token=the
token=string
token=I
token=want
token=to
token=parse
The first time you call it, you provide the string to tokenize to strtok. And then, to get the following tokens, you just give NULL to that function, as long as it returns a non NULL pointer.
The strtok function records the string you provided on the first call (which is really dangerous in multithreaded applications).
strtok modifies its input string. It places null characters ('\0') in it so that it will return bits of the original string as tokens. In fact strtok does not allocate memory. You may understand it better if you draw the string as a sequence of boxes.
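The "boxes" picture can also be produced by the program itself. The sketch below tokenizes a buffer and then renders its bytes with every '\0' that strtok wrote shown as '|'. The helper name show_after_strtok and the choice of '|' are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: runs strtok over buf, then renders the first n
 * bytes of buf into out (which must hold n + 1 bytes), showing each
 * '\0' that strtok wrote as '|'. */
void show_after_strtok(char *buf, size_t n, const char *delims, char *out)
{
    for (char *t = strtok(buf, delims); t != NULL; t = strtok(NULL, delims))
        ;                             /* only the side effect matters here */

    for (size_t i = 0; i < n; i++)
        out[i] = (buf[i] == '\0') ? '|' : buf[i];
    out[n] = '\0';
}
```

For "30/4/2021" with "/" as the delimiter the rendering is 30|4|2021. For "a, b" with ", " it is "a| b": only the delimiter that ends a token is overwritten, and the space that follows is merely skipped on the next call.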
To understand how strtok() works, one first needs to know what a static variable is. This link explains it quite well....
The key to the operation of strtok() is preserving the location of the last separator between successive calls (that's why strtok() continues to parse the original string that was passed to it when it is invoked with a null pointer in successive calls).
Have a look at my own strtok() implementation, called zStrtok(), which has slightly different functionality from that of strtok():
char *zStrtok(char *str, const char *delim) {
static char *static_str=0; /* var to store last address */
int index=0, strlength=0; /* integers for indexes */
int found = 0; /* check if delim is found */
/* delimiter cannot be NULL
* if no more char left, return NULL as well
*/
if (delim==0 || (str == 0 && static_str == 0))
return 0;
if (str == 0)
str = static_str;
/* get length of string */
while(str[strlength])
strlength++;
/* find the first occurrence of delim */
for (index=0;index<strlength;index++)
if (str[index]==delim[0]) {
found=1;
break;
}
/* if delim is not contained in str, return str */
if (!found) {
static_str = 0;
return str;
}
/* check for consecutive delimiters
*if first char is delim, return delim
*/
if (str[0]==delim[0]) {
static_str = (str + 1);
return (char *)delim;
}
/* terminate the string
* this assignment requires a modifiable array, so str has to
* be a char[] rather than a pointer to a string literal
*/
str[index] = '\0';
/* save the rest of the string */
if ((str + index + 1)!=0)
static_str = (str + index + 1);
else
static_str = 0;
return str;
}
And here is an example usage:
char str[] = "A,B,,,C";
printf("1 %s\n",zStrtok(str,","));
printf("2 %s\n",zStrtok(NULL,","));
printf("3 %s\n",zStrtok(NULL,","));
printf("4 %s\n",zStrtok(NULL,","));
printf("5 %s\n",zStrtok(NULL,","));
printf("6 %s\n",zStrtok(NULL,","));
Example Output
1 A
2 B
3 ,
4 ,
5 C
6 (null)
The code is from a string processing library I maintain on Github, called zString. Have a look at the code, or even contribute :)
https://github.com/fnoyanisi/zString
This is how I implemented strtok. It's not that great, but after working on it for 2 hours I finally got it to work. It supports multiple delimiters.
#include "stdafx.h"
#include <iostream>
using namespace std;
char* mystrtok(char str[],char filter[])
{
if(filter == NULL) {
return str;
}
static char *ptr = str;
static int flag = 0;
if(flag == 1) {
return NULL;
}
char* ptrReturn = ptr;
for(int j = 0; ; j++) { // the inner loop returns once the end of the string is reached
for(int i=0 ; filter[i] != '\0' ; i++) {
if(ptr[j] == '\0') {
flag = 1;
return ptrReturn;
}
if( ptr[j] == filter[i]) {
ptr[j] = '\0';
ptr+=j+1;
return ptrReturn;
}
}
}
return NULL;
}
int _tmain(int argc, _TCHAR* argv[])
{
char str[200] = "This,is my,string.test";
char *ppt = mystrtok(str,", .");
while(ppt != NULL ) {
cout<< ppt << endl;
ppt = mystrtok(NULL,", .");
}
return 0;
}
For those who are still having hard time understanding this strtok() function, take a look at this pythontutor example, it is a great tool to visualize your C (or C++, Python ...) code.
In case the link got broken, paste in:
#include <stdio.h>
#include <string.h>
int main()
{
char s[] = "Hello, my name is? Matthew! Hey.";
for (char *p = strtok(s," ,?!."); p != NULL; p = strtok(NULL, " ,?!.")) {
puts(p);
}
return 0;
}
Credits go to Anders K.
Here is my implementation, which uses a hash table (a 256-entry lookup table) for the delimiters, which means it is O(n) instead of O(n^2):
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#define DICT_LEN 256
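int *create_delim_dict(const char *delim)
{
/* calloc zero-fills the table; NULL is returned if the allocation fails */
int *d = calloc(DICT_LEN, sizeof(int));
if(!d) {
return NULL;
}
for(size_t i=0; delim[i] != '\0'; i++) {
d[(unsigned char)delim[i]] = 1; /* the cast avoids a negative index */
}
return d;
}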
char *my_strtok(char *str, char *delim)
{
static char *last, *to_free;
int *deli_dict = create_delim_dict(delim);
if(!deli_dict) {
/*this check if we allocate and fail the second time with entering this function */
if(to_free) {
free(to_free);
}
return NULL;
}
if(str) {
last = (char*)malloc(strlen(str)+1);
if(!last) {
free(deli_dict);
return NULL;
}
to_free = last;
strcpy(last, str);
}
while(*last != '\0' && deli_dict[(unsigned char)*last]) {
last++;
}
str = last;
if(*last == '\0') {
free(deli_dict);
free(to_free);
deli_dict = NULL;
to_free = NULL;
return NULL;
}
while (*last != '\0' && !deli_dict[(unsigned char)*last]) {
last++;
}
if (*last != '\0') { /* do not step past the terminating '\0' */
*last = '\0';
last++;
}
free(deli_dict);
return str;
}
int main()
{
char * str = "- This, a sample string.";
char *del = " ,.-";
char *s = my_strtok(str, del);
while(s) {
printf("%s\n", s);
s = my_strtok(NULL, del);
}
return 0;
}
strtok() stores, in a static variable, a pointer to where it left off last time, so on the second call, when we pass NULL, strtok() takes the pointer from that static variable.
If you pass the same string again, it starts again from the beginning.
Moreover, strtok() is destructive, i.e. it changes the original string, so make sure you always keep a copy of the original.
One more problem with strtok() is that, because it stores the address in a static variable, calling it from more than one thread in a multithreaded program can cause errors. For this, use strtok_r().
strtok replaces each delimiter it finds at the end of a token (a character from the second argument) with a null character ('\0'), and a null character also marks the end of a string.
http://www.cplusplus.com/reference/clibrary/cstring/strtok/
You can scan the char array looking for the delimiter: if you find it, print a newline; otherwise, print the character.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
char *s;
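s = malloc(1024 * sizeof(char));
/* bound the read to the buffer size; stop on allocation failure or empty input */
if(s == NULL || scanf("%1023[^\n]", s) != 1) {
free(s);
return 1;
}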
s = realloc(s, strlen(s) + 1);
int len = strlen(s);
char delim =' ';
for(int i = 0; i < len; i++) {
if(s[i] == delim) {
printf("\n");
}
else {
printf("%c", s[i]);
}
}
free(s);
return 0;
}
Here is a code snippet to help you better understand this topic.
Printing Tokens
Task: Given a sentence, s, print each word of the sentence on a new line.
char *s;
s = malloc(1024 * sizeof(char));
scanf("%[^\n]", s);
s = realloc(s, strlen(s) + 1);
//logic to print the tokens of the sentence.
for (char *p = strtok(s," "); p != NULL; p = strtok(NULL, " "))
{
printf("%s\n",p);
}
Input: How is that
Result:
How
is
that
Explanation: Here the strtok() function is called in a for loop to print the tokens on separate lines.
The function takes the string and the delimiters (the "break points") as parameters and breaks the string into tokens at those delimiters. Each token in turn is stored in p and then printed.
strtok replaces each delimiter that ends a token with the '\0' (null) character in the given string.
CODE
#include<iostream>
#include<cstring>
int main()
{
char s[]="30/4/2021";
std::cout<<(void*)s<<"\n"; // address of s, e.g. 0x70fdf0 (varies between runs)
char *p1=s; // use s itself rather than a hard-coded address
std::cout<<p1<<"\n";
char *p2=strtok(s,"/");
std::cout<<(void*)p2<<"\n";
std::cout<<p2<<"\n";
char *p3=s;
std::cout<<p3<<"\n";
for(int i=0;i<=9;i++)
{
std::cout<<*p1;
p1++;
}
}
OUTPUT
0x70fdf0 // 1. address of string s
30/4/2021 // 2. print string s through ptr p1
0x70fdf0 // 3. this address is return by strtok to ptr p2
30 // 4. print string which pointed by p2
30 // 5. again assign address of string s to ptr p3 try to print string
30 4/2021 // 6. print characters of string s one by one using loop
Before tokenizing the string
I assigned the address of string s to a pointer (p1) and printed the string through it; the whole string was printed.
After tokenizing
strtok returned the address of string s to the pointer p2, but printing through p2 gave only "30", not the whole string. So strtok does not just return an address: it also places a '\0' character where the delimiter was.
Cross-check
1.
I again assigned the address of string s to a pointer (p3) and printed the string; it printed "30", because tokenizing had overwritten the delimiter with '\0'.
2.
Printing string s character by character in a loop shows that the first delimiter has been replaced by '\0', so a blank is printed rather than '/'.

Resources