I'm writing a program where I need to read in token by token and detect certain keywords. One of these keywords is "gt" which stands for greater than.
I split the text file into tokens by tabs, newlines, spaces, and returns. Buffer is simply a large char array.
char* word = strtok(buffer, " \n\t\r");
I then have several cases to check for the possible words. The gt is as follows. Weirdly enough, this works for other keywords and sometimes even other occurrences of 'gt'.
//gt
if(strcmp("gt", word) == 0){
type = GT;
literal_value = 0;
}
However, it isn't getting reached despite a 'gt' being input. I noticed that when I print, this happens
printf("WORD is %s!\n", word);
PRINTS "!ORD is gt"
Which clearly isn't right. If the answer is something obvious please let me know- this bug has been evading me for a long time!
updated fragment:
char * word = strtok(buffer, " \n\t\r");
while (word != NULL){
printf("word is %s!\n", sections); //PRINTS "!ORD is gt"
if(sections[0] == ';'){
break; //comment indicated by ';'
}
//gt
if(strcmp("gt", word) == 0){
type = GT;
literal_value = 0;
}
//...............
//other comparisons for less than, equal to
process(&curr, output_file); //function to process current token
word = strtok(NULL, " \n\t\r");
}
Partial answer.
The reason you get that output is that your .txt file has MS-DOS (or Windows) line endings, where each line ends with two characters: a carriage return ('\r') followed by a line feed ('\n'). You are catching the line feed but not the carriage return, so the string printed by %s ends with a carriage return. The carriage return moves the cursor back to the start of the line, which is why the ! overwrites the first character of the line.
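One way to confirm a stray carriage return is to print the token with its control characters spelled out. The helper below is only a debugging sketch; the name escape_controls and its signature are made up for illustration and are not part of the original program.

```c
#include <stddef.h>

/* Hypothetical debugging helper: copies src into dst, writing \r, \n and \t
   as two visible characters each. Output is truncated to fit dst_size. */
void escape_controls(const char *src, char *dst, size_t dst_size)
{
    size_t used = 0;

    if (dst_size == 0)
        return;

    for (; *src != '\0'; ++src) {
        const char *esc = NULL;

        switch (*src) {
        case '\r': esc = "\\r"; break;
        case '\n': esc = "\\n"; break;
        case '\t': esc = "\\t"; break;
        default: break;
        }

        if (esc != NULL) {
            if (used + 2 >= dst_size)   /* keep room for the terminator */
                break;
            dst[used++] = esc[0];
            dst[used++] = esc[1];
        } else {
            if (used + 1 >= dst_size)
                break;
            dst[used++] = *src;
        }
    }
    dst[used] = '\0';
}
```

Calling escape_controls(word, shown, sizeof shown) and then printing shown instead of word would display a token such as gt followed by a carriage return as gt\r, which makes the hidden character obvious.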
Related
Trying to tokenise using strtok the input file is
InputVector:0(0,3,4,2,40)
Trying to get the numbers in but I encountered something unexpected that I don't understand, my tokenising code looks like this.
#define INV_DELIM1 ":"
#define INV_DELIM2 "("
#define INV_DELIM3 ",)"
checkBuff = fgets(buff, sizeof(buff), (FILE*)file);
if(checkBuff == NULL)
{
printf("fgets failure\n");
return FALSE;
}
else if(buff[strlen(buff) - 1] != '\n')
{
printf("InputVector String too big or didn't end with a new line\n");
return FALSE;
}
else
{
buff[strlen(buff) - 1] = '\0';
}
token = strtok(buff, INV_DELIM1);
printf("token %s", token);
token = strtok(buff, INV_DELIM2);
printf("token %s", token);
while(token != NULL) {
token = strtok(NULL, INV_DELIM3);
printf("token %s\n", token);
if(token != NULL) {
number = strtol(token, &endptr, 10);
if((token == endptr || *endptr != '\0')) {
printf("A token is Not a number\n");
return FALSE;
}
else {
vector[i] = number;
i++;
}
}
}
output:
token InputVector
token 0
token 0
token 3
token 4
token 2
token 40
token
So the code first calls fgets and checks that the line is not longer than my buffer; if it isn't, it replaces the last character with '\0'.
Then I tokenise the first word and the number outside the brackets. The while loop tokenises the numbers inside the brackets, converts them with strtol, and puts them into an array. I'm trying to use strtol to detect whether the data inside the brackets is numerical, but it always reports an error because strtok reads that last token, which isn't in the input. How do I stop that last token from being read so that strtol doesn't pick it up? Or is there a better way to tokenise and check the values inside the brackets?
The input file will later contain more than one input vector, and I have to be able to check whether they're valid.
The most likely explanation is that your input line ends with the Windows newline sequence \r\n. If your program runs on Unix (or Linux) and the input file was created on Windows, the file will contain the two-character newline sequence, but the Unix program won't know that it needs to do line-end translation. (If you ran the program directly on the Windows system, the standard I/O library would deal with the newline sequence for you, by translating it to a single \n, as long as you don't open the file in binary mode.)
Since \r is not in your delimiter list, strtok will treat it as an ordinary character, so your last field will consist of the \r. Printing it out is not quite a no-op, but it's invisible, so it's easy to get fooled into thinking that an empty field is being printed. (The same would happen if the field consisted only of spaces.)
You could just add \r to your delimiter list. Indeed, you could add both \n and \r to the delimiter list in your strtok call, and then you wouldn't need to worry about trimming the input line. That will work because strtok treats any sequence of delimiter characters as a single delimiter.
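A small sketch of that behaviour, assuming a hypothetical helper count_tokens that simply counts what strtok returns for a given delimiter list:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the tokens strtok finds in line.
   Note that strtok modifies line in place. */
size_t count_tokens(char *line, const char *delims)
{
    size_t count = 0;

    for (char *tok = strtok(line, delims); tok != NULL;
         tok = strtok(NULL, delims))
        ++count;

    return count;
}
```

With the delimiter list ",\r\n", the line "0,3,4,2,40\r\n" yields five tokens and no trimming is needed. With only ",\n", the line "0,3,\r\n" yields three tokens, the last one being the lone \r. A line with a doubled comma, such as "0,3,,4\r\n", still yields three tokens with ",\r\n", because the two commas are treated as one delimiter.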
However, that may not really be what you want, since that will hide certain input errors. For example, if the input had two consecutive commas, strtok would treat them as a single comma, and you would never know that the field was skipped. You could solve that particular problem by using strspn instead of strtok, but I personally think the better solution is to not use strtok at all since strtol will tell you where the line ends.
eg. (For simplicity, I left out printing of error messages. It's not necessary to check whether the line ends with a newline before this code; if you feel it necessary to do that check, you can do it after you find the close parenthesis at the end of the loop.):
#include <ctype.h> /* For 'isspace' */
#include <stdbool.h> /* For 'false' */
#include <stdlib.h> /* For 'strtol' */
#include <string.h> /* For 'strchr' */
// ...
char* token = strchr(buff, ':'); /* Find the colon */
if (token == NULL) return false; /* No colon */
++token; /* Character after the colon */
char* endptr;
(void)strtol(token, &endptr, 10); /* Read and toss away a number */
if (endptr == token) return false; /* No number */
token = endptr; /* Character following number */
while (isspace((unsigned char)*token)) ++token; /* Skip spaces (maybe not necessary) */
if (*token != '(') return false; /* Wrong delimiter */
for (i = 0; i < n_vector; ++i) { /* Loop until vector is full or ')' is found */
++token;
vector[i] = strtol(token, &endptr, 10); /* Get another number */
if (endptr == token) return false; /* No number */
token = endptr; /* Character following number */
while (isspace((unsigned char)*token)) ++token; /* Skip spaces */
if (*token == ')') break; /* Found the close parenthesis */
if (*token != ',') return false; /* Not the right delimiter */
} /* Loop */
/* At this point, either we found the ')' or we read too many numbers */
if (*token != ')') return false; /* Too many numbers */
/* Could check to make sure the following characters are a newline sequence */
/* ... */
The code which calls strtol to get a number and then check what the delimiter is should be refactored, but I wrote it out like that for simplicity. I would normally use a function which reads a number and returns the delimiter (as with getchar()) or EOF if the end of the buffer is encountered. But it would depend on your precise needs.
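A rough sketch of such a helper follows. The name read_number, its signature, and the choice of return values are assumptions made for illustration: it returns the delimiter after the number, '\0' if the number is the last thing in the buffer, and EOF if no number could be read at all.

```c
#include <ctype.h>
#include <stdio.h>   /* For 'EOF' */
#include <stdlib.h>  /* For 'strtol' */

/* Hypothetical helper: parses one number at *cursor and stores it in *value.
   Skips white space after the number, then returns the delimiter character
   found there and advances *cursor past it. Returns '\0' (without advancing
   past it) if the buffer ends after the number, or EOF if there is no number. */
int read_number(const char **cursor, long *value)
{
    char *end;

    *value = strtol(*cursor, &end, 10);
    if (end == *cursor)
        return EOF;                       /* No digits were consumed */

    while (isspace((unsigned char)*end))
        ++end;                            /* Skip spaces before the delimiter */

    if (*end == '\0') {
        *cursor = end;                    /* Stay on the terminator */
        return '\0';
    }

    *cursor = end + 1;                    /* Step over the delimiter */
    return (unsigned char)*end;
}
```

The caller can then loop while the returned delimiter is ',' and stop when it sees ')', treating anything else as a format error.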
When you first call strtok() you split the string at the delimiter ":" and afterwards at "(". Take for example the line
InputVector:0(0,3,4,2,40)
When you apply strtok(buff, ":") you get only the first result, InputVector. You have to call strtok(NULL, ":") to get the rest of the split, 0(0,3,4,2,40). Calling strtok(buff, ...) a second time on the same buffer does not continue from there: strtok writes a '\0' at the end of each token, so a call that passes buff again restarts from the beginning and only ever sees the first part of the string. The simplest way to split this line is to use all the delimiters at once, ":(),", which splits the whole line like this:
InputVector
0
0
3
4
2
40
The changes you need to make are:
#define INV_DELIM1 ":(),\n"
token = strtok(buff,INV_DELIM1); //for the first call of strtok
token = strtok(NULL,INV_DELIM1); //for the rest of strtok call
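Put together, the single-delimiter-set approach might look like the sketch below. The function name split_fields and the fields array are illustrative, and \r is added to the delimiter list on the assumption that the file may have Windows line endings.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits line in place on any of ":(),\r\n" and stores
   up to max_fields token pointers in fields. Returns the number stored. */
size_t split_fields(char *line, char *fields[], size_t max_fields)
{
    static const char delims[] = ":(),\r\n";
    size_t n = 0;

    for (char *tok = strtok(line, delims);      /* first call passes the buffer */
         tok != NULL && n < max_fields;
         tok = strtok(NULL, delims))            /* later calls pass NULL */
        fields[n++] = tok;

    return n;
}
```

For the line InputVector:0(0,3,4,2,40) this stores seven pointers: InputVector, 0, 0, 3, 4, 2 and 40. Each numeric field can then be checked with strtol as in the original loop.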
I'm having trouble printing an equal sign before and after each token when I split my string. The last token is printed with a newline before the closing equal sign, which leads me to conclude that I must remove or replace the newline character at some point. I looked through the C string.h library reference for a way to do this; I tried using strstr to check whether there was already a "\n" in the tokenized string, but ran into an infinite loop. I also thought about replacing the newline character directly, since it should be at index string length minus 1. I have low familiarity with C and am currently reading the library reference, so if you could take a look at my code and provide some feedback, I would greatly appreciate it. Thank you for your time.
// main method
int main(void){
// allocate memory
char string[256];
char *tokenizedString;
const char delimit[2] = " ";
const char *terminate = "\n";
do{
// prompt user for a string we will tokenize
do{
printf("Enter no more than 65 tokens:\n");
fgets(string, sizeof(string), stdin);
// verify input length
if(strlen(string) > 65 || strlen(string) <= 0) {
printf("Invalid input. Please try again\n"); }
} while(strlen(string) > 65);
// tokenize the string
tokenizedString = strtok(string, delimit);
while(tokenizedString != NULL){
printf("=%s=\n", tokenizedString);
tokenizedString = strtok(NULL, delimit);
}
// replace newline character implicitly made by enter, it seems to be adding my newline character at the end of output
} while(strcmp(string, "\n"));
return 0;
}// end of method main
OUTPUT:
Enter no more than 65 tokens:
i am very tired sadface
=i=
=am=
=very=
=tired=
=sadface
=
DESIRED OUTPUT
Enter no more than 65 tokens:
i am very tired sadface
=i=
=am=
=very=
=tired=
=sadface=
Since you are using strlen(), you can do this instead
size_t length = strlen(string);
// Check that `length > 0'
string[length - 1] = '\0';
Advantages:
This way you would call strlen() only once. Calling it multiple times for the same string is inefficient anyway.
You always remove the trailing '\n' from the input string, so your tokenization will work as expected.
Note: strlen() would never return a value < 0, because what it does is count the number of characters in the string, which is only 0 for "" and > 0 otherwise.
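As a sketch, the same idea can be wrapped in a small function. The name chomp is made up here, and the extra check that the last character really is '\n' is an addition: fgets leaves no newline when the line is longer than the buffer or when the file ends without one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes one trailing '\n' from s, if present.
   Returns the resulting length so the caller need not call strlen again. */
size_t chomp(char *s)
{
    size_t length = strlen(s);

    if (length > 0 && s[length - 1] == '\n')
        s[--length] = '\0';

    return length;
}
```

In the question's loop, the length returned by chomp(string) could replace the repeated strlen(string) calls.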
Well, you have two ways to do it, the simplest is to add a \n to the token delimiter string
const char delimit[] = " \n";
(you don't need to use an array size if you are going to initialize a string array with a string literal)
so it eliminates the final \n that comes in with your input. Another way is to search for it on reading and eliminate it from the input string. You can use strtok(3) for this purpose also:
tokenizedString = strtok(string, "\n");
tokenizedString = strtok(tokenizedString, delimit);
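A commonly used alternative for stripping the line ending, shown here as a sketch with a made-up function name, relies on strcspn, which returns the index of the first character from the given set, or the string length if none is found:

```c
#include <string.h>

/* Hypothetical helper: cuts s at the first '\r' or '\n', if there is one.
   If neither is present, strcspn returns strlen(s) and the existing
   terminator is simply overwritten with itself. */
void strip_newline(char *s)
{
    s[strcspn(s, "\r\n")] = '\0';
}
```

Unlike the strtok(string, "\n") variant, this also behaves predictably when the user enters an empty line: the string simply becomes empty.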
I'm trying to split some strings on white space.
However, some of the splits are a problem: I want to split on white space, but keep quoted substrings together.
For example,
char *pch;
char str[] = "hello \"Stack Overflow\" good luck!";
pch = strtok(str," ");
while (pch != NULL)
{
printf ("%s\n",pch);
pch = strtok(NULL, " ");
}
This will give me
hello
"Stack
Overflow"
good
luck!
But what I want is
hello
Stack Overflow
good
luck!
Any suggestion or idea please?
You'll need to tokenize twice. The program flow you currently have is as follows:
1) Search for space
2) Print all characters prior to space
3) Search for next space
4) Print all characters between last space, and this one.
You'll need to start thinking in a different manner: two layers of tokenization.
Search for Quotation Mark
On odd-numbered strings, perform your original program (search for spaces)
On even-numbered strings, print blindly
In this case, even numbered strings are (ideally) within quotes. ab"cd"ef would result in ab being odd, cd being even... etc.
The other thing is to keep in mind what you're actually looking for. As a regex, that is either "[a-zA-Z0-9 \t\n]*" (a quoted string, quotes included) or [a-zA-Z0-9]+ (a bare word). The only difference between the two is whether the text is enclosed in quotes. So separate by quotes first, and identify the pieces from there.
Try altering your strategy.
Look at non-white space things, then when you find quoted string you can put it in one string value.
So, you need a function that examines the characters between white space. When you find '"' you can change the rules and hoover up everything to the matching '"'. If this function returns a TOKEN type and a value (the string matched), then whatever calls it can decide on the correct output. At that point you have written a tokeniser. Tools that generate tokenisers, called "lexers", already exist, as they are widely used to implement programming languages and config file parsers.
Assuming nextc reads next char from string, begun by firstc( str) :
for (firstc(str); (c = nextc()) != '\0';) {
if (isspace(c))
continue;
else if (c == '"')
return readQuote; /* Handle Quoted string */
else
return readWord; /* Terminated by space & '"' */
}
return EOS;
You'll need to define return values for EOS, QUOTE and WORD, and a way to get the text in each Quote or Word.
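A minimal sketch of such a tokeniser is shown below. Everything in it is an assumption made for illustration: the names TOK_EOS, TOK_WORD and TOK_QUOTE stand in for EOS, WORD and QUOTE, and the token text is returned as a pointer plus length so the input string is not modified. An unterminated quote is treated as running to the end of the string.

```c
#include <ctype.h>
#include <stddef.h>

enum token_kind { TOK_EOS, TOK_WORD, TOK_QUOTE };

/* Hypothetical scanner: finds the next token at *cursor, reports its start
   and length through text and len, and advances *cursor past it. */
enum token_kind next_token(const char **cursor, const char **text, size_t *len)
{
    const char *p = *cursor;
    const char *start;

    while (isspace((unsigned char)*p))
        ++p;                                  /* skip leading white space */

    if (*p == '\0') {
        *cursor = p;
        return TOK_EOS;
    }

    if (*p == '"') {                          /* quoted string */
        start = ++p;
        while (*p != '\0' && *p != '"')
            ++p;                              /* hoover up to closing quote */
        *text = start;
        *len = (size_t)(p - start);
        if (*p == '"')
            ++p;                              /* step over the closing quote */
        *cursor = p;
        return TOK_QUOTE;
    }

    start = p;                                /* bare word */
    while (*p != '\0' && *p != '"' && !isspace((unsigned char)*p))
        ++p;                                  /* ends at space, quote or end */
    *text = start;
    *len = (size_t)(p - start);
    *cursor = p;
    return TOK_WORD;
}
```

A caller would loop until TOK_EOS and print each token with printf("%.*s\n", (int)len, text), which for the example string gives hello, Stack Overflow, good and luck! on separate lines.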
Here's the code that works... in C
The idea is that you first tokenize on the quote character, since that takes priority (if a string is inside quotes then we don't tokenize it, we just print it). Then, for each of those tokenized strings, we tokenize on the space character, but only for alternate strings, because alternate strings will be inside and outside the quotes.
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
int main() {
char *pch1, *pch2, *save_ptr1, *save_ptr2;
char str[] = "hello \"Stack Overflow\" good luck!";
pch1 = strtok_r(str,"\"", &save_ptr1);
bool in = false;
while (pch1 != NULL) {
if(in) {
printf ("%s\n", pch1);
pch1 = strtok_r(NULL, "\"", &save_ptr1);
in = false;
continue;
}
pch2 = strtok_r(pch1, " ", &save_ptr2);
while (pch2 != NULL) {
printf ("%s\n",pch2);
pch2 = strtok_r(NULL, " ", &save_ptr2);
}
pch1 = strtok_r(NULL, "\"", &save_ptr1);
in = true;
}
}
References
Tokenizing multiple strings simultaneously
http://linux.die.net/man/3/strtok_r
http://www.cplusplus.com/reference/cstring/strtok/
In my code below I use strtok to parse a line of code from a file that looks like:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19
As the file can have lines of different lengths I don't use fscanf. The code below works except for one small glitch. It loops around one time too many and reads in a long empty string " " before looping again, recognizing the null token "" and exiting the while loop. I don't know why this could be.
Any help would be greatly appreciated.
fgets(line, sizeof(line), some_file);
while (line != OPC_NIL) {
token = strtok(line, "\t"); //Pull the string apart into tokens using the commas
input = op_prg_list_create();
while (token != NULL) {
test_token = strdup(token);
if (op_prg_list_size(input) == 0)
op_prg_list_insert(input,test_token,OPC_LISTPOS_HEAD);
else
op_prg_list_insert(input,test_token,OPC_LISTPOS_TAIL);
token = strtok (NULL, "\t");
}
fgets(line, sizeof(line), some_file);
}
You must use the correct list of delimiters. Your code contradicts comments:
token = strtok(line, "\t"); //Pull the string apart into tokens using the commas
If you want to separate tokens by commas, use "," instead of "\t". In addition, you certainly don't want the tokens to contain the newline character \n (which appears at the end of each line read from file by fgets). So add the newline character to the list of delimiters:
token = strtok(line, ",\n"); //Pull the string apart into tokens using the commas
...
token = strtok (NULL, ",\n");
You might want to add the space character to the list of delimiters too (is 863.19 1001.05 a single token or two tokens? Do you want to remove spaces at end of line?).
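As a sketch of where that leads, the function below treats commas, spaces, tabs and line-ending characters all as delimiters and converts each token with strtod. The name parse_doubles and the flat output array are assumptions for illustration; the original code stores string copies in a list instead.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: splits line in place and converts every token to a
   double. Returns the number of values stored, or -1 if a token is not a
   valid number or more than max_values tokens are found. */
int parse_doubles(char *line, double *out, int max_values)
{
    static const char delims[] = ", \t\r\n";
    int n = 0;

    for (char *tok = strtok(line, delims); tok != NULL;
         tok = strtok(NULL, delims)) {
        char *end;
        double value = strtod(tok, &end);

        if (end == tok || *end != '\0')   /* token is not entirely numeric */
            return -1;
        if (n == max_values)              /* no room left in out */
            return -1;
        out[n++] = value;
    }
    return n;
}
```

With these delimiters a line holding only a newline produces zero values rather than a phantom empty token.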
Your use of sizeof(line) tells me that line is a fixed size array living on the stack. In this case, (line != OPC_NIL) will never be false. However, fgets() will return NULL when the end of file is reached or some other error occurs. Your outer while loop should be rewritten as:
while(fgets(line, sizeof(line), some_file)) {
...
}
Your input file likely also has a newline character at the end of the last input line resulting in a single blank line at the end. This is the difference between this:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19↵
<blank line>
and this:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19
The first thing you should do in the while loop is check that the string is actually in the format you expect. If it's not then break:
while(fgets(line, sizeof(line), some_file)) {
if(line[0] == '\n') // blank line (fgets keeps the newline); or other checks such as "contains tab characters"
break;
...
}
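If blank lines may also contain stray spaces or a carriage return, a slightly more forgiving check could look like this sketch (is_blank_line is a made-up name):

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if s contains only white space
   (including '\r' and '\n') or is empty, 0 otherwise. */
int is_blank_line(const char *s)
{
    for (; *s != '\0'; ++s) {
        if (!isspace((unsigned char)*s))
            return 0;
    }
    return 1;
}
```

Inside the fgets loop, a blank line could then be skipped with continue, or used to stop reading with break, depending on what the file format allows.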
I've got to parse a .txt file like this
autore: sempronio, caio; titolo: ; editore: ; luogo_pubblicazione: ; anno: 0; prestito: 0-1-1900; collocazione: ; descrizione_fisica: ; nota: ;
with fscanf in C code.
I tried with some formats in fscanf call, but none of them worked...
EDIT:
a = fscanf(fp, "autore: %s");
This is the first try I did; the patterns 'autore', 'titolo', 'editore', etc. must not be caught by fscanf().
Generally speaking, trying to parse input with fscanf is not a good idea, as it is difficult to recover gracefully if the input does not match expectations. It is generally better to read the input into an internal buffer (with fread or fgets), and parse it there (with sscanf, strtok, strtol etc.). Details on which functions are best depend on the definition of the input format (which you did not give us; example input is no replacement for a formal specification).
The following shows how to use strtok:
char* item;
char* input; // fill it with fgets
for (item = strtok(input, ";"); item != NULL; item = strtok(NULL, ";"))
{
// item loops through the following:
// "autore: sempronio, caio"
// " titolo: "
// " editore: "
// ...
}
The following shows how to use sscanf:
char tag[20];
int chars = -1;
if (sscanf(item, " %19[^:]: %n", tag, &chars) == 1 && chars >= 0)
{
printf("%s is %s\n", tag, item + chars);
}
Here, the format string consists of the following:
(space) - tells the parser to discard whitespace
19 - maximum number of bytes/chars in the tag
[^:] - tells the parser to read until it meets the colon character
: - tells the parser to discard the colon character
(whitespace) - as above
%n - tells the parser to report the number of bytes it read (check &chars)
If there was an unexpected input, the number of chars is not updated, so you have to set it to -1 before parsing each item.
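The sscanf call above can be wrapped in a small function, sketched here under the assumption that tags never exceed 19 characters. The name split_tag and the value out-parameter are illustrative only.

```c
#include <stdio.h>

/* Hypothetical helper: splits an item such as " anno: 0" into its tag and
   the text after the colon. tag must have room for 20 chars. On success,
   *value points into item just past the colon and any following white space.
   Returns 1 on success, 0 if the item has no "tag:" prefix. */
int split_tag(const char *item, char tag[20], const char **value)
{
    int chars = -1;   /* stays -1 unless sscanf reaches the %n directive */

    if (sscanf(item, " %19[^:]: %n", tag, &chars) == 1 && chars >= 0) {
        *value = item + chars;
        return 1;
    }
    return 0;
}
```

Called on each item produced by the strtok loop, this gives the tag (for example anno) and a pointer to the value text (for example 0), and it rejects items that contain no colon.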