Strtok - reading empty string at end of line - c

In my code below I use strtok to parse a line of code from a file that looks like:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19
As the file can have lines of different lengths I don't use fscanf. The code below works except for one small glitch: it loops around one time too many and reads in a long empty string " " before looping again, recognizing the null token "" and exiting the while loop. I don't know why this happens.
Any help would be greatly appreciated.
fgets(line, sizeof(line), some_file);
while (line != OPC_NIL) {
token = strtok(line, "\t"); //Pull the string apart into tokens using the commas
input = op_prg_list_create();
while (token != NULL) {
test_token = strdup(token);
if (op_prg_list_size(input) == 0)
op_prg_list_insert(input,test_token,OPC_LISTPOS_HEAD);
else
op_prg_list_insert(input,test_token,OPC_LISTPOS_TAIL);
token = strtok (NULL, "\t");
}
fgets(line, sizeof(line), some_file);
}

You must use the correct list of delimiters. Your code contradicts its own comment:
token = strtok(line, "\t"); //Pull the string apart into tokens using the commas
If you want to separate tokens by commas, use "," instead of "\t". In addition, you certainly don't want the tokens to contain the newline character \n (which appears at the end of each line read from file by fgets). So add the newline character to the list of delimiters:
token = strtok(line, ",\n"); //Pull the string apart into tokens using the commas
...
token = strtok (NULL, ",\n");
You might want to add the space character to the list of delimiters too (is 863.19 1001.05 a single token or two tokens? Do you want to remove spaces at end of line?).
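The delimiter advice above can be sketched as follows. This is a minimal example, assuming that commas, spaces and the trailing newline should all separate tokens; the helper name split_fields is hypothetical and not part of the original code.

```c
#include <string.h>

/* Hypothetical helper: splits `line` in place on commas, spaces and newlines.
 * Stores up to `max` token pointers in `out` and returns how many were stored.
 * The pointers refer into `line`, which strtok modifies. */
static size_t split_fields(char *line, char *out[], size_t max)
{
    size_t n = 0;
    char *token = strtok(line, ", \n");
    while (token != NULL && n < max) {
        out[n++] = token;
        token = strtok(NULL, ", \n");
    }
    return n;
}
```

With this delimiter list the newline left by fgets never ends up inside a token, so no whitespace-only token appears at the end of the line.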

Your use of sizeof(line) tells me that line is a fixed-size array rather than a pointer. In that case, (line != OPC_NIL) will never be false, because the address of an array is never null. However, fgets() returns NULL when the end of file is reached or an error occurs, so your outer while loop should be rewritten as:
while(fgets(line, sizeof(line), some_file)) {
...
}
Your input file likely also has a newline character at the end of the last input line resulting in a single blank line at the end. This is the difference between this:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19↵
<blank line>
and this:
1023.89,863.19 1001.05,861.94 996.44,945.67 1019.28,946.92 1023.89,863.19
The first thing you should do in the while loop is check that the string is actually in the format you expect. If it's not then break:
while(fgets(line, sizeof(line), some_file)) {
if(line[0] == '\n') // fgets keeps the newline, so a blank line is "\n"; add other checks such as "contains tab characters"
break;
...
}
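A minimal sketch of such a read loop, assuming lines fit in a 256-byte buffer and that blank lines should simply be skipped rather than ending the loop. The helper name count_data_lines is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: counts the non-blank lines in `fp`.
 * Assumes each line fits in the buffer; a longer line would be
 * returned by fgets in several pieces and counted more than once. */
static int count_data_lines(FILE *fp)
{
    char line[256];
    int count = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';  /* drop the line ending, if any */
        if (line[0] == '\0')
            continue;                         /* nothing left: blank line */
        count++;                              /* a real program would tokenize here */
    }
    return count;
}
```

The loop condition tests the return value of fgets directly, so the body never runs on a stale buffer after end of file.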

Related

How can I make strtok include newlines at the end of a token?

In a program I am writing, I need to be able to tokenize an input text file into words, do some encoding, and then write to an output file. The problem is that I need to preserve the newlines.
The approach I was trying is to have strtok preserve the newlines at the end of a word, however, strtok will only include one newline character before moving on. If there is a following newline, it becomes its own token. How can I change this behavior so that tokens include all newlines before moving onto the next word?
int changeNewLine(char* p) {
p = p + (strlen(p)-1);
int newlines = 0;
while(*p == '\n') {
*p = '\0';
newlines++;
p--;
}
return newlines;
}
int main(int argc, char *argv[]) {
FILE *inputfile = fopen(argv[1],"r");
FILE *outputfile = fopen("output.txt","wb");
char buffer[128];
char *token;
char words[MAX_CODE][WORDLEN];
int i = 0;
unsigned short newlines[MAX_CODE];
while(fgets(buffer, 128, inputfile)){
token = strtok(buffer," ");
while(token != NULL) {
newlines[i] = changeNewLine(token);
strcpy(words[i], token);
i++;
token = strtok(NULL," ");
}
}
...
}
Above is a fragment of my code. The idea is to count the number of newlines in a token, and then write them back out later.
strtok already does include newlines in the token, since you are using a delimiter string that does not contain the newline. But in your program as it now is, you will never have more than one in a token because fgets reads (at most) one line at a time. That's its whole purpose. It will never give you a string containing two or more newlines, nor containing a newline anywhere other than the last character.
Your general alternatives are
(1) to look ahead at subsequent lines in order to spot additional newlines, or
(2) to retrospectively update the previous line's newline count when you encounter a line starting with a newline (and, therefore, containing nothing else).
Alternative (1) could include employing an altogether different approach to reading input, too, such as a block read with fread() or a character-at-a-time read with fgetc().
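Alternative (2) could be sketched roughly like this. It is only an illustration under stated assumptions: words are separated by spaces, lines fit in the 128-byte buffer, and the names read_words, MAX_WORDS and WORD_LEN are hypothetical stand-ins for the question's own definitions. Blank lines before the first word are dropped.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 64   /* hypothetical limits for this sketch */
#define WORD_LEN  32

/* Reads words from `fp` and records, for each word, how many newlines follow it.
 * A blank line adds one newline to the most recently stored word. */
static size_t read_words(FILE *fp, char words[][WORD_LEN],
                         unsigned newlines[], size_t max)
{
    char buffer[128];
    size_t n = 0;
    while (fgets(buffer, sizeof buffer, fp) != NULL) {
        int ended = strchr(buffer, '\n') != NULL;   /* did this read end a line? */
        for (char *tok = strtok(buffer, " \n"); tok != NULL && n < max;
             tok = strtok(NULL, " \n")) {
            snprintf(words[n], WORD_LEN, "%s", tok);  /* bounded copy */
            newlines[n] = 0;
            n++;
        }
        if (ended && n > 0)
            newlines[n - 1]++;   /* credit the newline to the last word seen */
    }
    return n;
}
```

Because the newline is counted per fgets call rather than per token, a run of empty lines accumulates on the word that preceded it.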

C tokenising using strtok is printing out unexpected values and is hindering my strtol validation

I'm trying to tokenise a line with strtok. The input file is:
InputVector:0(0,3,4,2,40)
Trying to get the numbers in but I encountered something unexpected that I don't understand, my tokenising code looks like this.
#define INV_DELIM1 ":"
#define INV_DELIM2 "("
#define INV_DELIM3 ",)"
checkBuff = fgets(buff, sizeof(buff), (FILE*)file);
if(checkBuff == NULL)
{
printf("fgets failure\n");
return FALSE;
}
else if(buff[strlen(buff) - 1] != '\n')
{
printf("InputVector String too big or didn't end with a new line\n");
return FALSE;
}
else
{
buff[strlen(buff) - 1] = '\0';
}
token = strtok(buff, INV_DELIM1);
printf("token %s", token);
token = strtok(buff, INV_DELIM2);
printf("token %s", token);
while(token != NULL) {
token = strtok(NULL, INV_DELIM3);
printf("token %s\n", token);
if(token != NULL) {
number = strtol(token, &endptr, 10);
if((token == endptr || *endptr != '\0')) {
printf("A token is Not a number\n");
return FALSE;
}
else {
vector[i] = number;
i++;
}
}
}
output:
token InputVector
token 0
token 0
token 3
token 4
token 2
token 40
token
The code first calls fgets and checks that the line fits in my buffer; if it does, it replaces the trailing newline with '\0'.
Then I tokenise the first word and the number outside the brackets. The while loop tokenises the numbers inside the brackets, converts them with strtol and puts them into an array. I'm trying to use strtol to detect whether the data inside the brackets is numeric, but it always reports an error because strtok reads that last token, which isn't in the input. How do I stop that last token from being read so that strtol doesn't pick it up? Or is there a better way to tokenise and check the values inside the brackets?
The input file will later on contain more than one input vectors and I have to be able to check if they're valid or not.
The most likely explanation is that your input line ends with the Windows newline sequence \r\n. If your program runs on Unix (or Linux) and the input was created on Windows, the file will contain the two-character newline sequence, but the Unix program won't know that it needs to do line-end translation. (If you ran the program directly on the Windows system, the standard I/O library would deal with the newline sequence for you, by translating it to a single \n, as long as you don't open the file in binary mode.)
Since \r is not in your delimiter list, strtok will treat it as an ordinary character, so your last field will consist of the \r. Printing it out is not quite a no-op, but it's invisible, so it's easy to get fooled into thinking that an empty field is being printed. (The same would happen if the field consisted only of spaces.)
You could just add \r to your delimiter list. Indeed, you could add both \n and \r to the delimiter list in your strtok call, and then you wouldn't need to worry about trimming the input line. That will work because strtok treats any sequence of delimiter characters as a single delimiter.
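A small sketch of both effects, using a hypothetical helper count_tokens that simply counts what strtok returns for a given delimiter list:

```c
#include <string.h>

/* Hypothetical helper: counts the tokens strtok finds in `s` (modified in place). */
static size_t count_tokens(char *s, const char *delims)
{
    size_t n = 0;
    for (char *t = strtok(s, delims); t != NULL; t = strtok(NULL, delims))
        n++;
    return n;
}
```

Calling it on "0,3,4)\r" with the delimiters ",)" gives 4 tokens, the last being the lone "\r"; with ",)\r\n" the same data gives 3. It also shows the collapsing behaviour: "0,,4)" with ",)" yields only 2 tokens, so the empty field between the commas goes unnoticed.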
However, that may not really be what you want, since that will hide certain input errors. For example, if the input had two consecutive commas, strtok would treat them as a single comma, and you would never know that the field was skipped. You could solve that particular problem by using strspn instead of strtok, but I personally think the better solution is to not use strtok at all since strtol will tell you where the line ends.
For example (for simplicity, I left out printing of error messages; it's not necessary to check whether the line ends with a newline before this code, and if you feel that check is needed, you can do it after you find the close parenthesis at the end of the loop):
#include <ctype.h> /* For 'isspace' */
#include <stdbool.h> /* For 'false' */
#include <stdlib.h> /* For 'strtol' */
#include <string.h> /* For 'strchr' */
// ...
char* token = strchr(buff, ':'); /* Find the colon */
if (token == NULL) return false; /* No colon */
++token; /* Character after the colon */
char* endptr;
(void)strtol(token, &endptr, 10); /* Read and toss away a number */
if (endptr == token) return false; /* No number */
token = endptr; /* Character following number */
while (isspace(*token)) ++token; /* Skip spaces (maybe not necessary) */
if (*token != '(') return false; /* Wrong delimiter */
for (i = 0; i < n_vector; ++i) { /* Loop until vector is full or ')' is found */
++token;
vector[i] = strtol(token, &endptr, 10); /* Get another number */
if (endptr == token) return false; /* No number */
token = endptr; /* Character following number */
while (isspace(*token)) ++token; /* Skip spaces */
if (*token == ')') break; /* Found the close parenthesis */
if (*token != ',') return false; /* Not the right delimiter */
} /* Loop */
/* At this point, either we found the ')' or we read too many numbers */
if (*token != ')') return false; /* Too many numbers */
/* Could check to make sure the following characters are a newline sequence */
/* ... */
The code which calls strtol to get a number and then check what the delimiter is should be refactored, but I wrote it out like that for simplicity. I would normally use a function which reads a number and returns the delimiter (as with getchar()) or EOF if the end of the buffer is encountered. But it would depend on your precise needs.
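One possible shape for such a function is sketched below. The name read_number and its exact contract are assumptions made for illustration, not the answerer's actual code.

```c
#include <ctype.h>   /* isspace */
#include <stdio.h>   /* EOF */
#include <stdlib.h>  /* strtol */

/* Hypothetical helper: parses one base-10 integer at *pos and stores it in *out.
 * Skips whitespace after the number, then returns the delimiter character that
 * follows and advances *pos past it. Returns EOF if no number is present or if
 * the buffer ends before a delimiter is found. */
static int read_number(const char **pos, long *out)
{
    char *end;
    long value = strtol(*pos, &end, 10);
    if (end == *pos)
        return EOF;                       /* no digits at this position */
    *out = value;
    while (isspace((unsigned char)*end))
        ++end;                            /* skip blanks after the number */
    if (*end == '\0')
        return EOF;                       /* buffer ended without a delimiter */
    *pos = end + 1;                       /* continue after the delimiter */
    return (unsigned char)*end;
}
```

The caller can then loop while the returned delimiter is ',' and stop when it sees ')', treating anything else, including EOF, as a malformed line.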
You first call strtok() to split the string at the delimiter ":" and then again at "(". Take, for example, the line
InputVector:0(0,3,4,2,40)
When you call strtok(buff, ":") you get only the first token, InputVector. You have to call strtok(NULL, ":") to get the rest of the string, 0(0,3,4,2,40). Calling strtok on the same buff again does not continue from where you left off: strtok writes a '\0' at the end of each token, so a second call with buff either loses your position or only scans the first part of the string. The simplest way to split this line is to use all the delimiters at once, ":(),", which splits the whole line like this:
InputVector
0
0
3
4
2
40
The changes you need to make are:
#define INV_DELIM1 ":(),\n"
token = strtok(buff,INV_DELIM1); //for the first call of strtok
token = strtok(NULL,INV_DELIM1); //for the rest of strtok call
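Put together with the strtol check from the question, the single-delimiter-list approach might look like the sketch below. The function name parse_input_vector, the -1 error convention and the added '\r' in the delimiter list are assumptions for illustration.

```c
#include <stdlib.h>  /* strtol */
#include <string.h>  /* strtok, strcmp */

/* Hypothetical helper: parses a line such as "InputVector:0(0,3,4,2,40)".
 * Stores the bracketed numbers in `vector` and returns how many were read,
 * or -1 if the line is malformed or holds more than `max` numbers. */
static int parse_input_vector(char *buff, long vector[], int max)
{
    static const char delims[] = ":(),\r\n";
    char *token = strtok(buff, delims);          /* "InputVector" */
    if (token == NULL || strcmp(token, "InputVector") != 0)
        return -1;
    token = strtok(NULL, delims);                /* number before '(' */
    if (token == NULL)
        return -1;
    int n = 0;
    while ((token = strtok(NULL, delims)) != NULL) {
        char *endptr;
        long number = strtol(token, &endptr, 10);
        if (endptr == token || *endptr != '\0' || n == max)
            return -1;                           /* not a number, or too many */
        vector[n++] = number;
    }
    return n;
}
```

Note that this still has the weakness of any strtok-based parser: consecutive delimiters are merged, so a line with an empty field is accepted without complaint.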

reading tokens misreading simple string - c

I'm writing a program where I need to read in token by token and detect certain keywords. One of these keywords is "gt" which stands for greater than.
I split the text file into tokens by tabs, newlines, spaces, and returns. Buffer is simply a large char array.
char* word = strtok(buffer, " \n\t\r");
I then have several cases to check for the possible words. The gt is as follows. Weirdly enough, this works for other keywords and sometimes even other occurrences of 'gt'.
//gt
if(strcmp("gt", word) == 0){
type = GT;
literal_value = 0;
}
However, it isn't getting reached despite a 'gt' being input. I noticed that when I print, this happens
printf("WORD is %s!\n", word);
PRINTS "!ORD is gt"
Which clearly isn't right. If the answer is something obvious please let me know- this bug has been evading me for a long time!
updated fragment:
char * word = strtok(buffer, " \n\t\r");
while (word != NULL){
printf("word is %s!\n", sections); //PRINTS "!ORD is gt"
if(sections[0] == ';'){
break; //comment indicated by ';'
}
//gt
if(strcmp("gt", word) == 0){
type = GT;
literal_value = 0;
}
//...............
//other comparisons for less than, equal to
process(&curr, output_file); //function to process current token
word = strtok(NULL, " \n\t\r");
}
Partial answer.
The reason you get that output is that you have an MS-DOS (or Windows) style .txt file, whose lines end with two characters: a carriage return ('\r') followed by a line feed ('\n'). You are catching the line feed but not the carriage return, so the string printed by %s ends with a carriage return. That moves the cursor back to the start of the line, which is why the ! overwrites the first character.
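If a stray carriage return really is reaching the comparison, stripping line-ending characters from the token before calling strcmp would make the match succeed. A minimal sketch, with the hypothetical helper name trim_line_ending:

```c
#include <string.h>

/* Hypothetical helper: removes any trailing '\r' and '\n' characters from `s`. */
static void trim_line_ending(char *s)
{
    size_t len = strlen(s);
    while (len > 0 && (s[len - 1] == '\n' || s[len - 1] == '\r'))
        s[--len] = '\0';
}
```

A token holding "gt\r" does not compare equal to "gt" until it has been passed through this function.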

strtok with empty string delimiter

I have seen the following piece of code in a library. What is the behavior of strtok when an empty string is passed as the delimiter? As far as I can see, whatever buf contains is stored in the token variable after the strtok call.
char buf[256] = {0};
char *token = NULL;
...
...
while (!feof(filePtr))
{
os_memset(buf, 0, sizeof(buf));
if (!fgets(buf, 256, filePtr))
{
token = strtok(buf, "");
...
...
}
}
strtok() starts by looking for the first character not in the delimiter list, to find the beginning of a token. Since the delimiter list is empty, no character is in it, so the first character of the string will be the beginning of the token.
Then it looks for the next character in the delimiter list, to find the end of the token. Since there are no delimiters, it will never find any of them, so it stops at the end of the string.
As a result, an empty delimiter list means the entire string will be parsed as a single token.
Why he wrote it like this is anyone's guess.
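The behaviour described above can be checked with a short sketch. The helper name whole_string_is_one_token is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if strtok with an empty delimiter list
 * yields `s` itself as the first token and nothing after it, else 0. */
static int whole_string_is_one_token(char *s)
{
    char *first = strtok(s, "");     /* no delimiters: token runs to the end */
    char *second = strtok(NULL, ""); /* nothing is left to tokenize */
    return first == s && second == NULL;
}
```

For any non-empty string this returns 1 and leaves the string unchanged. For an empty string strtok finds no token at all and returns NULL, so the helper returns 0.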

C - Nested loop using strtok

I am trying to use strtok to split up a text file into strings that I can pass to a spell check function, the text file includes characters such as '\n', ' ?!,.' etc...
I need to print any words that fail the spell check and the line number that they are on. Keeping track of the line is what I'm struggling with.
I have tried this so far but it only returns results for the first line of the text file:
char str[409377];
fread(str, noOfChars, 1, file);
fclose(file);
int lines=1;
char *token;
char *line;
char splitLine[] = "\n";
char delimiters[] = " ,.?!(){}*&^%$£_-+=";
line = strtok(str, splitLine);
while(line!=NULL){
token = strtok(line, delimiters);
while(token != NULL){
//print is just to test if I can loop through all the words
printf("%s", token);
//spellCheck function & logic here
token = strtok(NULL, delimiters);
}
line = strtok(NULL, splitLine);
lines++;
}
Is using the nested while loop and strtok possible? Is there a better way to keep track of the line number?
The strtok function is not reentrant! It can not be used to tokenize multiple strings simultaneously. It's because it keeps internal state about the string currently being tokenized.
If you have a modern compiler and standard library then you could use strtok_s instead (or strtok_r on POSIX systems). Otherwise you have to come up with another solution.
You can use strtok, but it's not very easy to use. It's a stupid function: all it really does is replace delimiters with nuls and return a pointer to the start of the sequence it has delimited. So it's destructive. It can't handle special cases like English words being allowed one apostrophe (we're is a word, we'r'e is not), and you have to make sure you list all the delimiters specifically.
It's probably best to write mystrtok yourself, so you understand how it works. Then use that as the basis for your own word extractor.
The reason for your bug is that the outer strtok call chops off the first line, and then that line is all that strtok sees on the subsequent calls.
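One way around the nesting problem, sketched under assumptions: split the lines by hand with strchr and keep strtok for the words only, so that only one strtok sequence is active at a time. The helper name index_words is hypothetical, and the text must be NUL-terminated, which a buffer filled by fread is not unless you add the terminator yourself.

```c
#include <string.h>

/* Hypothetical helper: records each word of `text` together with its 1-based
 * line number. `text` is modified in place. Stores at most `max` entries and
 * returns how many were stored. */
static size_t index_words(char *text, const char *delimiters,
                          char *words[], int lines[], size_t max)
{
    size_t n = 0;
    int line_no = 1;
    for (char *line = text; line != NULL; line_no++) {
        char *next = strchr(line, '\n');   /* find the end of this line */
        if (next != NULL)
            *next++ = '\0';                /* terminate it, step to the next */
        for (char *tok = strtok(line, delimiters); tok != NULL && n < max;
             tok = strtok(NULL, delimiters)) {
            words[n] = tok;
            lines[n] = line_no;
            n++;
        }
        line = next;
    }
    return n;
}
```

Splitting lines by hand also keeps the count right for empty lines, which an outer strtok on "\n" would silently merge.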

Resources