So I have the following function:
void tokenize() {
    char *word;
    char text[] = "Some - text, from stdin. We'll see! what happens? 4ND 1F W3 H4V3 NUM83R5?!?";
    int nbr_words = 0;
    word = strtok(text, " ,.-!?()");
    while (word != NULL) {
        printf("%s\n", word);
        word = strtok(NULL, " ,.-!?()");
        nbr_words += 1;
    }
    printf("%d words\n", nbr_words);
}
And the output is:
Some
text
from
stdin
We'll
see
what
happens
4ND
1F
W3
H4V3
NUM83R5
13 words
Basically what I'm doing is tokenizing paragraphs of text into words for further analysis down the road. I have my text, and I have my delimiters. The only problem is treating digits as delimiters as well, alongside the rest. I know that I can use isdigit in ctype.h. However, I don't know how I can include it in the strtok call.
For example (obviously wrong): strtok(paragraph, " ,.-!?()isdigit()");
Something along those lines. But since I have each token (word) at this stage, is there some kind of post-processing if statement I could use to further tokenize each word, splitting at digits?
For example, the output would further degrade to:
ND
F
W
H
V
NUM
R
15 words // updated counter to include new tokens
strtok is very simple in this respect: just list all the digits as delimiters, one by one - like this:
strtok(paragraph, " ,.-!?()0123456789");
Note: strtok is an old, non-reentrant function: it keeps its position in hidden static state, so it is best avoided in new code. Where it is available, prefer strtok_r (POSIX), which has a similar interface but stores its state in a caller-supplied pointer, so it can be used in concurrent environments and other situations where you need reentrancy.
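A minimal sketch of the digits-as-delimiters idea using strtok_r, assuming a POSIX system. The function name count_words is made up for illustration; the delimiter list and sample text are the ones from the question.

```c
#define _POSIX_C_SOURCE 200809L  /* makes strtok_r visible on strict compilers */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print every word in text and return how many
   there were. Digits are listed as delimiters, so "H4V3" yields "H"
   and "V". text is modified in place. */
int count_words(char *text)
{
    const char *delims = " ,.-!?()0123456789";
    char *save = NULL;   /* strtok_r keeps its position here */
    int nbr_words = 0;

    for (char *word = strtok_r(text, delims, &save);
         word != NULL;
         word = strtok_r(NULL, delims, &save)) {
        printf("%s\n", word);
        nbr_words++;
    }
    return nbr_words;
}
```

Called on the sample paragraph from the question, this should report 15 words, matching the desired output.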
Why not just use
word = strtok(text, " ,.-!?()1234567890");
Related
I am attempting to split two words (or more) and put them into an array by splitting the input into tokens with strtok. My delimiters are " \t\n", as shown below in the code. For example, if I type in "cat program.c", it just prints the cat token and not the program.c token, and I have no idea why. Are my delimiters not correct, or am I not splitting the string correctly? Here is the code
char b[256];
int k = 0;
char *args[4];
char *tokens;
char delimiters[] = " \t\n";
printf("Please enter the command you want to use:\n");
scanf("%255s", b);
tokens = strtok(b, delimiters);
while (tokens != NULL){
args[k++] = tokens;
printf("%s\n",tokens);
tokens = strtok(NULL, delimiters);
}
The problem is not strtok(), but rather scanf(). An %s field directive scans a whitespace-delimited string, so when the input is cat program.c, only the "cat" ever makes it into array b in the first place. (The program.c remains waiting to be read.) If you want to read a whole line of input at a time, then I would recommend fgets(), instead.
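A rough sketch of the fgets() approach, keeping the delimiters from the question. The helper names split_args and read_args are hypothetical, and the limit of four arguments mirrors the args[4] array in the question.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split line in place on spaces, tabs and newlines.
   Stores at most max_args pointers in args and returns how many. */
int split_args(char *line, char *args[], int max_args)
{
    const char delimiters[] = " \t\n";
    int k = 0;

    for (char *tok = strtok(line, delimiters);
         tok != NULL && k < max_args;
         tok = strtok(NULL, delimiters)) {
        args[k++] = tok;
    }
    return k;
}

/* Hypothetical helper: read one whole line (spaces included) from in,
   then split it. Returns 0 on end of input or read error. */
int read_args(FILE *in, char *buf, int bufsize, char *args[], int max_args)
{
    if (fgets(buf, bufsize, in) == NULL) {
        return 0;
    }
    return split_args(buf, args, max_args);
}
```

With char b[256] and char *args[4], a call such as read_args(stdin, b, sizeof b, args, 4) would replace the scanf() call and the loop. Because "\n" is among the delimiters, the newline that fgets() keeps does not end up in the last token.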
I'm currently having trouble wrapping each token in equal signs after my string is split into tokens: the last token comes out with a newline where the closing equal sign should be. That leads me to the conclusion that I must remove or replace the newline character at some point. I looked through the string.h reference and tried using strstr to check whether there was a "\n" in the tokenized string, but ran into an infinite loop when I tried that. I also thought about replacing the newline character directly, since it should be at index string length minus 1. I admit I have low familiarity with C, but I am currently reading the library references. If you could take a look at my code and provide some feedback, I would greatly appreciate it. Thank you for your time.
// main method
int main(void){
// allocate memory
char string[256];
char *tokenizedString;
const char delimit[2] = " ";
const char *terminate = "\n";
do{
// prompt user for a string we will tokenize
do{
printf("Enter no more than 65 tokens:\n");
fgets(string, sizeof(string), stdin);
// verify input length
if(strlen(string) > 65 || strlen(string) <= 0) {
printf("Invalid input. Please try again\n"); }
} while(strlen(string) > 65);
// tokenize the string
tokenizedString = strtok(string, delimit);
while(tokenizedString != NULL){
printf("=%s=\n", tokenizedString);
tokenizedString = strtok(NULL, delimit);
}
// replace newline character implicitly made by enter, it seems to be adding my newline character at the end of output
} while(strcmp(string, "\n"));
return 0;
}// end of method main
OUTPUT:
Enter no more than 65 tokens:
i am very tired sadface
=i=
=am=
=very=
=tired=
=sadface
=
DESIRED OUTPUT
Enter no more than 65 tokens:
i am very tired sadface
=i=
=am=
=very=
=tired=
=sadface=
Since you are using strlen(), you can do this instead
size_t length = strlen(string);
// fgets() keeps the newline; drop it only if it is really there
if (length > 0 && string[length - 1] == '\n')
    string[length - 1] = '\0';
Advantages:
This way you would call strlen() only once. Calling it multiple times for the same string is inefficient anyway.
You always remove the trailing '\n' from the input string, so your tokenization will work as expected.
Note: strlen() would never return a value < 0, because what it does is count the number of characters in the string, which is only 0 for "" and > 0 otherwise.
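As a variation on the same idea, here is a small sketch that uses strcspn() to find the newline, so no separate length check is needed. The function name strip_newline is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: cut s at its first newline, if it has one.
   Returns the resulting length. */
size_t strip_newline(char *s)
{
    /* strcspn gives the index of the first '\n', or strlen(s) if none */
    size_t len = strcspn(s, "\n");
    s[len] = '\0';
    return len;
}
```

Calling strip_newline(string) right after fgets() would let the tokenizing loop run unchanged, and it is harmless when the line has no newline (for example when the input was longer than the buffer).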
You have two ways to do it. The simplest is to add \n to the token delimiter string:
const char delimit[] = " \n";
(You don't need to give an array size if you initialize a char array with a string literal.)
This way strtok discards the final \n that comes in with your input. Another way is to search for it after reading and remove it from the input string. You can use strtok(3) for this purpose also:
tokenizedString = strtok(string, "\n");
tokenizedString = strtok(tokenizedString, delimit);
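The first option can be sketched as follows. The helper wrap_tokens is a made-up name; it writes into a caller-supplied buffer instead of printing, purely so the result is easy to inspect.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append "=token=\n" to out for every token in line.
   line is modified in place. Returns the number of tokens written. */
int wrap_tokens(char *line, char *out, size_t outsize)
{
    const char delimit[] = " \n";   /* newline is a delimiter too */
    size_t used = 0;
    int count = 0;

    if (outsize == 0) {
        return 0;
    }
    out[0] = '\0';
    for (char *tok = strtok(line, delimit);
         tok != NULL;
         tok = strtok(NULL, delimit)) {
        int n = snprintf(out + used, outsize - used, "=%s=\n", tok);
        if (n < 0 || (size_t)n >= outsize - used) {
            out[used] = '\0';       /* no room: drop the partial token */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For the input line "i am very tired sadface\n" this should produce five wrapped tokens, with the last one reading =sadface= rather than being split across two lines.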
I am trying to use strtok to split up a text file into strings that I can pass to a spell check function, the text file includes characters such as '\n', ' ?!,.' etc...
I need to print any words that fail the spell check and the line number that they are on. Keeping track of the line is what I'm struggling with.
I have tried this so far but it only returns results for the first line of the text file:
char str[409377];
fread(str, noOfChars, 1, file);
fclose(file);
int lines=1;
char *token;
char *line;
char splitLine[] = "\n";
char delimiters[] = " ,.?!(){}*&^%$£_-+=";
line = strtok(str, splitLine);
while(line!=NULL){
token = strtok(line, delimiters);
while(token != NULL){
//print is just to test if I can loop through all the words
printf("%s", token);
//spellCheck function & logic here
token = strtok(NULL, delimiters);
}
line = strtok(NULL, splitLine);
lines++;
}
Is using the nested while loop and strtok possible? Is there a better way to keep track of the line number?
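The strtok function is not reentrant! It cannot be used to tokenize multiple strings simultaneously, because it keeps internal state about the string currently being tokenized. Your inner strtok(line, delimiters) call overwrites the state of the outer one.
If your platform provides one, use a reentrant variant instead: strtok_r on POSIX systems, or strtok_s (Microsoft's C runtime has a three-argument version; the optional C11 Annex K version takes four arguments). Otherwise you have to come up with another solution.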
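One possible shape for the line-tracking loop, assuming POSIX strtok_r is available. Lines are split by hand with strchr() here rather than with an outer tokenizer call, because strtok-style functions skip runs of delimiters and would therefore not count blank lines. The name words_with_lines and the output arrays are hypothetical, and the delimiter list is shortened to plain ASCII characters.

```c
#define _POSIX_C_SOURCE 200809L  /* makes strtok_r visible on strict compilers */
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: store each word of text together with its
   1-based line number. text is modified in place. Returns the number
   of words stored (at most max). */
int words_with_lines(char *text, const char *words[], int lines[], int max)
{
    const char *delims = " ,.?!(){}*&^%$_-+=\r";
    int count = 0;
    int line_no = 1;
    char *line = text;

    while (line != NULL && count < max) {
        /* Cut the current line off at its newline, remember the rest. */
        char *next = strchr(line, '\n');
        if (next != NULL) {
            *next++ = '\0';
        }

        char *save;  /* tokenizer state lives here, not in a hidden static */
        for (char *tok = strtok_r(line, delims, &save);
             tok != NULL && count < max;
             tok = strtok_r(NULL, delims, &save)) {
            words[count] = tok;
            lines[count] = line_no;
            count++;
        }

        line = next;
        line_no++;
    }
    return count;
}
```

A spell checker could then walk the two arrays and report lines[i] next to every words[i] that fails the check. Note that the buffer passed in must be null-terminated, which fread() alone does not guarantee.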
You can use strtok, but it's not very easy to use. It's a stupid function: all it really does is replace delimiters with nuls and return a pointer to the start of the sequence it has delimited. So it's destructive. It can't handle special cases like English words being allowed one apostrophe (we're is a word, we'r'e is not), and you have to make sure you list all the delimiters explicitly.
It's probably best to write mystrtok yourself, so you understand how it works. Then use that as the basis for your own word extractor.
The reason for your bug is that the inner strtok(line, delimiters) call makes strtok forget the rest of the text: the first line has been chopped off, and that is all that strtok sees on the subsequent calls.
I'm trying to split some strings on the {white_space} character.
However, there is a problem with some of the splits: I want to split on {white_space}, but quoted sub-strings should be kept together as a single token.
example,
char *pch;
char str[] = "hello \"Stack Overflow\" good luck!";
pch = strtok(str," ");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok(NULL, " ");
}
This will give me
hello
"Stack
Overflow"
good
luck!
But what I want is:
hello
Stack Overflow
good
luck!
Any suggestion or idea please?
You'll need to tokenize twice. The program flow you currently have is as follows:
1) Search for space
2) Print all characters prior to space
3) Search for next space
4) Print all characters between last space, and this one.
You'll need to start thinking in a different manner: two layers of tokenization.
Search for Quotation Mark
On odd-numbered strings, perform your original program (search for spaces)
On even-numbered strings, print blindly
In this case, even numbered strings are (ideally) within quotes. ab"cd"ef would result in ab being odd, cd being even... etc.
The other point is to be clear about what you are actually looking for. As a regex, a token is either "[a-zA-Z0-9 \t\n]*" (a quoted string, which may contain whitespace) or [a-zA-Z0-9]+ (a bare word). The only difference between the two is whether the text is enclosed in quotes, so split on quotes first and classify each piece from there.
Try altering your strategy.
Look at non-white space things, then when you find quoted string you can put it in one string value.
So, you need a function that examines the characters between white space. When you find '"' you can change the rules and hoover everything up to a matching '"'. If this function returns a token type and a value (the string matched), then its caller can decide what to output. At that point you have written a tokeniser, also called a "lexer". Tools exist to generate them, as they are widely used to implement programming languages and config-file parsers.
Assuming nextc reads next char from string, begun by firstc( str) :
for (firstc(str); (c = nextc()) != '\0';) {
if (isspace(c))
continue;
else if (c == '"')
return readQuote; /* Handle Quoted string */
else
return readWord; /* Terminated by space & '"' */
}
return EOS;
You'll need to define return values for EOS, QUOTE and WORD, and a way to get the text in each Quote or Word.
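A hand-written scanner along these lines might look like the sketch below. It is only one way to do it: next_token is a hypothetical name, the token type is left out for brevity, and a quote is only treated specially when it starts a token (so ab"cd"ef stays one word).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: return the next token at or after *cursor, or
   NULL when the string is exhausted. A token is either a run of
   non-space characters or everything between a pair of double quotes.
   The string is modified in place and *cursor is advanced. */
char *next_token(char **cursor)
{
    char *p = *cursor;
    char *start;

    /* Skip leading white space. */
    while (*p != '\0' && isspace((unsigned char)*p)) {
        p++;
    }
    if (*p == '\0') {
        *cursor = p;
        return NULL;
    }

    if (*p == '"') {
        /* Quoted string: take everything up to the matching quote
           (or to the end of the string if the quote is never closed). */
        start = ++p;
        while (*p != '\0' && *p != '"') {
            p++;
        }
    } else {
        /* Plain word: ends at the next white space. */
        start = p;
        while (*p != '\0' && !isspace((unsigned char)*p)) {
            p++;
        }
    }

    if (*p != '\0') {
        *p++ = '\0';   /* terminate the token and step past the delimiter */
    }
    *cursor = p;
    return start;
}
```

Usage would be to set a char *cursor to the start of the (writable) string and call next_token(&cursor) in a loop until it returns NULL, printing each token as it comes.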
Here's the code that works... in C
The idea is that you first tokenize on the quote character, since that takes priority (if a string is inside the quotes then we don't tokenize it, we just print it). Then, for each of those pieces, we tokenize on the space character, but only for alternate pieces, because the pieces alternate between being outside and inside the quotes.
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
int main() {
    char *pch1, *pch2, *save_ptr1, *save_ptr2;
    char str[] = "hello \"Stack Overflow\" good luck!";
    pch1 = strtok_r(str,"\"", &save_ptr1);
    bool in = false;
    while (pch1 != NULL) {
        if(in) {
            printf ("%s\n", pch1);
            pch1 = strtok_r(NULL, "\"", &save_ptr1);
            in = false;
            continue;
        }
        pch2 = strtok_r(pch1, " ", &save_ptr2);
        while (pch2 != NULL) {
            printf ("%s\n",pch2);
            pch2 = strtok_r(NULL, " ", &save_ptr2);
        }
        pch1 = strtok_r(NULL, "\"", &save_ptr1);
        in = true;
    }
}
References
Tokenizing multiple strings simultaneously
http://linux.die.net/man/3/strtok_r
http://www.cplusplus.com/reference/cstring/strtok/
I've got to parse a .txt file like this
autore: sempronio, caio; titolo: ; editore: ; luogo_pubblicazione: ; anno: 0; prestito: 0-1-1900; collocazione: ; descrizione_fisica: ; nota: ;
with fscanf in C code.
I tried with some formats in fscanf call, but none of them worked...
EDIT:
a = fscanf(fp, "autore: %s");
This is the first try I did; the patterns 'autore', 'titolo', 'editore', etc. must not be caught by fscanf().
Generally speaking, trying to parse input with fscanf is not a good idea, as it is difficult to recover gracefully if the input does not match expectations. It is generally better to read the input into an internal buffer (with fread or fgets), and parse it there (with sscanf, strtok, strtol etc.). Details on which functions are best depend on the definition of the input format (which you did not give us; example input is no replacement for a formal specification).
The following shows how to use strtok:
char* item;
char* input; // fill it with fgets
for (item = strtok(input, ";"); item != NULL; item = strtok(NULL, ";"))
{
// item loops through the following:
// "autore: sempronio, caio"
// " titolo: "
// " editore: "
// ...
}
The following shows how to use sscanf:
char tag[20];
int chars = -1;
if (sscanf(item, " %19[^:]: %n", tag, &chars) == 1 && chars >= 0)
{
printf("%s is %s\n", tag, item + chars);
}
Here, the format string consists of the following:
(space) - tells the parser to discard whitespace
19 - maximum number of bytes/chars in the tag
[^:] - tells the parser to read until it meets the colon character
: - tells the parser to discard the colon character
(whitespace) - as above
%n - tells the parser to report the number of bytes it read (check &chars)
If there was an unexpected input, the number of chars is not updated, so you have to set it to -1 before parsing each item.
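Putting the two snippets together, a sketch of a record parser could look like this. The struct field type and the function parse_fields are hypothetical names, and the 19-character tag limit is carried over from the format string above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type: tag is copied, value points into the record. */
struct field {
    char tag[20];
    const char *value;
};

/* Hypothetical helper: split record on ';' and extract "tag: value"
   from each item. record is modified in place. Returns the number of
   fields stored (at most max). */
int parse_fields(char *record, struct field out[], int max)
{
    int count = 0;

    for (char *item = strtok(record, ";");
         item != NULL && count < max;
         item = strtok(NULL, ";")) {
        int chars = -1;   /* reset before every sscanf call */
        if (sscanf(item, " %19[^:]: %n", out[count].tag, &chars) == 1
            && chars >= 0) {
            out[count].value = item + chars;  /* text after "tag: " */
            count++;
        }
    }
    return count;
}
```

After reading a line with fgets(), a call such as parse_fields(line, fields, max) would fill the array; an empty value such as the one after titolo: comes back as an empty string. Items that do not match the pattern (for instance a trailing newline after the last semicolon) are skipped.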