I am trying to read a string word by word in C using the strsep() function, which can also be done with strtok(). When there are consecutive delimiters (in my case, spaces), the function does not skip them. I am expected to use strsep() and couldn't figure out a solution. I'd appreciate it if one of you could help me.
#include <stdio.h>
#include <string.h>
int main(){
    char newLine[256] = "scalar  i";   // consecutive spaces between the words
    char *q;
    char *token;
    q = strdup(newLine);
    const char delim[] = " ";
    token = strsep(&q, delim);
    printf("The token is: \"%s\"\n", token);
    token = strsep(&q, delim);
    printf("The token is: \"%s\"\n", token);
    return 0;
}
Actual output is:
The token is: "scalar"
The token is: ""
What I expected is:
The token is: "scalar"
The token is: "i"
To do that, I also tried to write a while loop so that I could continue until the token is non-empty.
But I cannot compare the token with "", " ", NULL or "\n"; somehow the token is never equal to any of these.
First note that strsep(), while convenient, is not in the standard C library; it is only available on systems with 4.4BSD-style C library support. That's most Unix-ish systems today, but still.
Anyway, strsep() supports empty fields. That means that if your string has consecutive delimiters, it will find an empty, length-0 token between each pair of them. For example, the tokens for the string "ab  cd" (with two spaces) will be:
"ab"
""
"cd"
2 delimiters -> 3 tokens.
Now, you also said:
I cannot equate tokens with "", " ", NULL or "\n". Somehow the token is not equal to any of these.
I am guessing that what you were trying is a plain comparison, e.g. if (my_token == "") { ... }. That won't work, because it compares pointers, not the strings' contents. Two strings may hold identical characters at different places in memory, and that is particularly likely in the example I just gave, since my_token points into dynamically allocated memory and will never point to the static-storage-duration string "" used in the comparison.
Instead, you will need to use strcmp(my_token,""), or better yet, just check manually for the first char being '\0'.
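If you have to use strsep(), the simplest fix is to skip the empty tokens yourself. A minimal sketch of that idea (assuming a BSD/glibc environment where strsep() is available; the input string is just an illustration with consecutive spaces):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[] = "scalar   i";   /* consecutive spaces on purpose */
    char *rest = line;
    char *token;
    while ((token = strsep(&rest, " ")) != NULL) {
        if (token[0] == '\0')     /* empty token between consecutive delimiters */
            continue;
        printf("The token is: \"%s\"\n", token);
    }
    return 0;
}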
Related
I tried to split a char array in standard C. The problem is that it does not print the full characters (it only returns \n发\n办法 in this example) when the input is Chinese characters separated by ……, such as 印发……办法. However, it works if the input is abc……def or 印发...办法. Why, and how can I solve this problem?
#pragma warning(disable:4996)
#include <string.h>
#include <stdio.h>
void split(char* str)
{
    char* token;
    const char delim[] = "……";
    token = strtok(str, delim); // strtok from the C standard library <string.h>
    while (token != NULL)
    {
        printf("%s\n", token);
        token = strtok(NULL, delim);
    }
}
int main()
{
    char ipt1[] = "印发……办法";
    split(ipt1);
}
UTF-8 and other multibyte encodings represent ideographs (ideograms) as sequences of multiple bytes. A single Chinese ideograph consists of several chars in UTF-8.
strtok doesn't know anything about multibyte ideographs. It recognizes delimiters as single chars. The second parameter to strtok is a character string, and every individual char value in it gets recognized as a delimiter.
The character …, encoded in UTF-8, is three chars:
E2 80 A6
Any one of those individual chars will be recognized by strtok as a valid delimiter in the string being tokenized. Those same byte values also occur as part of other Chinese ideographs, so strtok makes mincemeat of the string that gets passed in for tokenization. strtok does not work with multibyte encodings.
If you need to implement this kind of tokenization using basic functions from the C library then the closest match would be strstr, which works in a completely different way. You'll need to reimplement this tokenization algorithm based on strstr.
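As a rough sketch of that strstr-based approach (not a drop-in replacement, just an illustration; it assumes the source file and the string literals use the same multibyte encoding, e.g. UTF-8):

#include <stdio.h>
#include <string.h>

/* Split str on the whole delimiter string, not on its individual bytes. */
void split(char* str, const char* delim)
{
    size_t dlen = strlen(delim);
    char* start = str;
    char* hit;
    while ((hit = strstr(start, delim)) != NULL) {
        *hit = '\0';               /* terminate the current token */
        if (*start != '\0')
            printf("%s\n", start);
        start = hit + dlen;        /* continue right after the delimiter */
    }
    if (*start != '\0')
        printf("%s\n", start);     /* last token, if any */
}

int main(void)
{
    char ipt1[] = "印发……办法";
    split(ipt1, "……");
    return 0;
}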
I'm having a hard time splitting a sentence read from a file in C via the strtok function. I scanned it from a file and stored it in the variable info, from which I need to separate the words. I tried many things and eventually copied some code from the net and changed it a little. The code separates the first token nicely, but then it prints some nonsense.
#include <stdio.h>
#include <string.h>
void main()
{
    //int i; // counter
    char info[]=""; // all the information; later it should go into a struct
    FILE *pok;
    pok = fopen("C:/Users/Trajkovici/Desktop/OsobeFajl.txt","r");
    if(pok==NULL)
    {
        printf("Greška prilikom otvaranja datoteke!"); // "Error opening the file!"
    }
    fscanf(pok,"%[^\n]",&info);
    puts("INFO: ");
    puts(info);
    //fclose(pok);
    char * token = strtok(info, " ");
    // loop through the string to extract all other tokens
    while( token != NULL )
    {
        puts("\nTOKEN:");
        printf( " %s\n", token ); // printing each token
        token = strtok(NULL, " ");
    }
}
This is the file and the result (screenshots in the original post: "The file", "The result").
By the way, I wrote the same code without reading the sentence from a file, declaring the string manually instead. It works perfectly fine.
#include<stdio.h>
#include <string.h>
int main() {
    char string[] = "Sladjan Jankovic 46 Vranje";
    // Extract the first token
    puts(string);
    char * token = strtok(string, " ");
    // loop through the string to extract all other tokens
    while( token != NULL )
    {
        printf( " %s\n", token ); // printing each token
        token = strtok(NULL, " ");
    }
    return 0;
}
And this is the result of the above code (screenshot in the original post): each word is printed on its own line.
So, the problem is that I have two pieces of code with literally the same variables, but one of them splits into tokens fine while the other one doesn't. Any help with the first one?
P.S. Sorry for possible bad indentation, this is my first time posting on Stack Overflow. Also, some comments and lines from the file are in Serbian.
char info[]="";
will allocate only one element. Using it in
fscanf(pok,"%[^\n]",&info);
is dangerous because it will write out of bounds whenever a string with positive length is read (even a one-character string is too long, because there must also be room for the terminating null character).
Allocate enough elements like (for example):
char info[102400]="";
and specify the maximum length to read (the limit has to be at most the size of the buffer minus one, to leave room for the terminating null character) to prevent a buffer overrun, like this:
fscanf(pok,"%102399[^\n]",info);
Also note that you should remove the & before info. Arrays in expressions (with a few exceptions) are automatically converted to pointers to their first elements. Adding & makes you pass a pointer to an array, while %[ expects a pointer to a character. Passing data of the wrong type to fscanf() invokes undefined behavior.
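Putting those fixes together, a corrected sketch of the read-and-tokenize part could look like the following (the buffer size of 1024 and the relative file name are just examples):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char info[1024] = "";
    FILE *pok = fopen("OsobeFajl.txt", "r");
    if (pok == NULL) {
        printf("Error opening the file!\n");
        return 1;
    }
    if (fscanf(pok, "%1023[^\n]", info) != 1) {  /* limit = buffer size - 1, and no & */
        printf("Could not read a line from the file.\n");
        fclose(pok);
        return 1;
    }
    fclose(pok);
    puts("INFO: ");
    puts(info);
    for (char *token = strtok(info, " "); token != NULL; token = strtok(NULL, " "))
        printf(" %s\n", token);
    return 0;
}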
I need to create a postfix calculator using a stack, where the user writes the operators as words.
Like:
9.5 2.3 add =
or
5 3 5 sub div =
My problem is that I can't figure out which function I should use to scan the input, because it's a mix of numbers, words and a char (=).
What you want to do is essentially to write a parser.
First, use fgets to read a complete line. Then use strtok to get tokens separated by whitespace.
After that, check whether the token is a number. You can do that with sscanf: its return value tells you whether the conversion to a number was successful. If it was not, check whether the string is equal to "add", "sub", "=" etc. If the token is neither a number nor one of the approved operations, report an error. You don't have to treat strings of length 1 (i.e. single chars) differently.
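A minimal sketch of that fgets/strtok/sscanf approach (the recognized operator names here are only the ones from your examples; extend as needed):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    if (fgets(line, sizeof line, stdin) == NULL)
        return 1;
    for (char *tok = strtok(line, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n")) {
        double num;
        if (sscanf(tok, "%lf", &num) == 1)
            printf("number:   %g\n", num);   /* push it on the stack here */
        else if (strcmp(tok, "add") == 0 || strcmp(tok, "sub") == 0 ||
                 strcmp(tok, "div") == 0)
            printf("operator: %s\n", tok);   /* pop two operands, push the result */
        else if (strcmp(tok, "=") == 0)
            printf("end of expression\n");   /* pop and print the final result */
        else
            printf("error: unknown token '%s'\n", tok);
    }
    return 0;
}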
My problem is that I can't figure out which function I should use to scan the input, because it's a mix of numbers, words and a char (=).
But all of these are separated by whitespace. You could tokenize based on that and then build up the parsing manually with strcmp and strtol, or simply by comparing the first character of the token (assuming that keywords cannot start with a digit and there are no variables).
See strtok() (and its reentrant variant strtok_r()). The "Example" section of the man page explains how to use it in depth, but as an extract without error handling and corner cases:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
int main(void)
{
    char eq[] = "5 3 5 sub div =";
    for (char *tok = strtok(eq, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (isdigit(tok[0]))
            printf("token-num: %s\n", tok);
        else if (tok[0] == '=')
            printf("token-eq: =\n");
        else
            printf("token-op: %s\n", tok);
    }
    return EXIT_SUCCESS;
}
I am trying to tokenize a string when encountered a newline.
rest = strdup(value);
while ((token = strtok_r(rest, "\n", &rest))) {
    snprintf(new_value, MAX_BANNER_LEN + 1, "%s\n", token);
}
where 'value' is a string, say "This is an example\nHere is a newline".
But the code above does not seem to tokenize 'value'; the 'new_value' variable comes out unchanged, i.e. "This is an example\nHere is a newline".
Any suggestions to overcome this?
Thanks,
Poornima
Several things going on with your code:
strtok and strtok_r take the string to tokenize as their first parameter. Subsequent calls that continue tokenizing the same string should pass NULL. (It is okay to use different delimiters in those subsequent calls.)
The second parameter is a string of possible separator characters. In your case you should pass "\n". (strtok_r treats a stretch of separator characters as a single break, so tokenizing "a\n\n\nb" produces only two tokens.)
The third parameter to strtok_r is internal state for the function. It marks where the next tokenization should start, but you need not use it yourself; just define a char * and pass its address.
In particular, don't repurpose the source-string variable as that state. In your example you lose the handle to the strdup'ed string, so you cannot free it later, as you should.
It is not clear how you determine that your tokenization "doesn't work": you print each token into the same char buffer, so only the last one survives. Do you want to keep only the part after the last newline? In that case, use strrchr(str, '\n'). If the result isn't NULL, the text right after it is your "tail"; if it is NULL, the whole string is your tail.
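A minimal sketch of that tail-only variant, reusing the names from your snippet (it assumes new_value is at least MAX_BANNER_LEN + 1 bytes, as before):

const char *last = strrchr(value, '\n');
const char *tail = (last != NULL) ? last + 1 : value;   /* text after the last newline */
snprintf(new_value, MAX_BANNER_LEN + 1, "%s\n", tail);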
Here's how tokenizing a string could work:
char *rest = strdup(str);
char *state;
char *token = strtok_r(rest, "\n", &state);
while (token) {
    printf("'%s'\n", token);
    token = strtok_r(NULL, "\n", &state);
}
free(rest);
Here is my program (written in C, compiled and run on Omega, if it makes any difference):
#include <stdio.h>
#include <string.h>
int main (void)
{
    char string[] = " hello!how are you? I am fine.";
    char *token = strtok(string, "!?.");
    printf("Token points to '%c'.\n", *token);
    return 0;
}
This is the output I'm expecting:
"Token points to '!'."
But the output I'm getting is:
"Token points to ' '."
From trial and error, I know this is referring to the first character in the string: the space before "hello!".
Why am I not getting the output I'm expecting, and how can I fix it? I do understand from what I've read on here already that strtok is better off buried in a ditch, but let's assume that (if it's possible) I have to use it here, and I have to make it work.
As per the strtok man page description:
The strtok() function parses a string into a sequence of tokens. On the first call to strtok() the string to be parsed should be specified in str. In each subsequent call that should parse the same string, str should be NULL.
It parses the string based on the delimiters and returns the token, not the delimiter.
In your case the delimiters are "!?.".
char string[] = " hello!how are you? I am fine.";
The first occurrence of a delimiter, '!', comes after the substring " hello", so strtok returns " hello" as the first token. Your output is simply the first character, ' ', of that " hello" token.
Someone just posted an answer. It worked for me and now I can't find it. Reposting as best I remember in case someone else has the same question.
char *token = strtok(string,"!?.");
token = strtok(NULL, "!?."); //<--THIS
token now points to the first character after the first delimiter, which is at least something I can work with. Thank you, stranger!