Parsing .txt file in C code - c

I've got to parse a .txt file like this
autore: sempronio, caio; titolo: ; editore: ; luogo_pubblicazione: ; anno: 0; prestito: 0-1-1900; collocazione: ; descrizione_fisica: ; nota: ;
with fscanf in C code.
I tried with some formats in fscanf call, but none of them worked...
EDIT:
a = fscanf(fp, "autore: %s");
This is the first try I did; the patterns 'autore', 'titolo', 'editore', etc. must not be caught by fscanf().

Generally speaking, trying to parse input with fscanf is not a good idea, as it is difficult to recover gracefully if the input does not match expectations. It is generally better to read the input into an internal buffer (with fread or fgets), and parse it there (with sscanf, strtok, strtol etc.). Details on which functions are best depend on the definition of the input format (which you did not give us; example input is no replacement for a formal specification).

The following shows how to use strtok:
char* item;
char* input; // fill it with fgets
for (item = strtok(input, ";"); item != NULL; item = strtok(NULL, ";"))
{
// item loops through the following:
// "autore: sempronio, caio"
// " titolo: "
// " editore: "
// ...
}
The following shows how to use sscanf:
char tag[20];
int chars = -1;
if (sscanf(item, " %19[^:]: %n", tag, &chars) == 1 && chars >= 0)
{
printf("%s is %s\n", tag, item + chars);
}
Here, the format string consists of the following:
(space) - tells the parser to discard whitespace
19 - maximum number of bytes/chars in the tag
[^:] - tells the parser to read until it meets the colon character
: - tells the parser to discard the colon character
(whitespace) - as above
%n - tells the parser to store the number of bytes consumed so far in chars (it does not count toward the value sscanf returns)
If there was an unexpected input, the number of chars is not updated, so you have to set it to -1 before parsing each item.
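The two snippets above can be combined into one routine. What follows is a minimal sketch, not a definitive parser: the struct field type, the parse_record name and the buffer sizes are made up for illustration, and it assumes the whole record has already been read into a writable buffer (for example with fgets).

```c
#include <stdio.h>
#include <string.h>

#define TAG_MAX 32     /* illustrative sizes, not taken from the question */
#define VALUE_MAX 128

/* Hypothetical container for one "tag: value" pair. */
struct field {
    char tag[TAG_MAX];
    char value[VALUE_MAX];
};

/* Splits `line` on ';' with strtok, then splits each item at the first ':'
 * with sscanf. `line` is modified in place. Returns the number of fields
 * stored in `out` (at most max_fields). */
size_t parse_record(char *line, struct field *out, size_t max_fields)
{
    size_t count = 0;
    for (char *item = strtok(line, ";");
         item != NULL && count < max_fields;
         item = strtok(NULL, ";"))
    {
        int chars = -1; /* stays -1 if the item does not match the format */
        if (sscanf(item, " %31[^:]: %n", out[count].tag, &chars) == 1 && chars >= 0)
        {
            /* item + chars points just past "tag: " */
            snprintf(out[count].value, sizeof out[count].value, "%s", item + chars);
            count++;
        }
    }
    return count;
}
```

Items that do not look like "tag: value" (such as the lone newline left after the last semicolon) are skipped, because sscanf either fails or never reaches %n.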

Related

Using strtok_s, the second item is always NULL, even though the first works right. How can I get both values?

I am working in C and the strtok_s function isn't working as expected. I want to separate two halves of user input, delimited by a space character between them. I've been reading the manual but I cannot figure it out. Below is the function I wrote. Its goal is to separate the first and second half of user input delimited by a space and return the values through 2 pointers. The print statement is only there for my debugging.
void argGetter(char* commandDesired, char** firstArg, char** secondArg) {
// this char holds the first part of the command before the " "
char* commandCleanDesired;
// this char array holds the part after the " "
char *nextToken;
char *argument;
commandCleanDesired = strtok_s(commandDesired, " ", &nextToken);
argument = strtok_s(NULL, " ", &nextToken);
printf("\n\nCMD 1 is %s\n\nCMD 2 is %s\n\n\n", commandCleanDesired, argument);
*firstArg = commandCleanDesired;
*secondArg = argument;
}
//this shows how argGetter is called.
void main() {
// these hold the return values from argGetter()
char* secondArg = NULL;
char* firstArg = NULL;
//This holds user input
char commandDesired[255];
//This line prints the prompt
printf("\n\tSanity$hell> ");
//Then we get user input
scanf_s("%s", commandDesired, 255);
//split the command from args using argGetter
argGetter(commandDesired, &firstArg, &secondArg);
printf("\n First Arg is %s\n", firstArg);
printf("\nYour second arg is %s\n\n", secondArg);
}
It gets commandCleanDesired fine, but the second variable, (named 'argument') is ALWAYS null.
I have tried the things below to get the value after the space and store it in argument (unsuccessfully). These little code snippets show how I modified the above code during my attempts to solve the issue.
commandCleanDesired = strtok_s(commandDesired, " ", &commandDesired);
argument = strtok_s(commandDesired, " ", &commandDesired);
//the above resulted in NULL for the second value argument as well.
// Below is the next thing i tried.
char * nextToken;
commandCleanDesired = strtok_s(commandDesired, " ", &nextToken);
argument = strtok_s(NULL, " ", &nextToken);
//both result in argument being NULL.
//I tried the above after reading the manual more.
I have been reading the manual at https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/strtok-s-strtok-s-l-wcstok-s-wcstok-s-l-mbstok-s-mbstok-s-l?view=msvc-170.
I used NULL for the string argument the second time because the above manual led me to believe that was necessary for all subsequent calls after the first call. An example input of commandDesired would be "cd C://"
For the above input, I would like this function to have commandCleanDesired = 'cd' and argument = 'C://'
Currently, for the above input, the function gives commandCleanDesired = 'cd' and argument = (NULL)
TL;DR: How am I misusing the strtok_s function in C, and how can I get the second value after the space stored in the "argument" pointer?
Thank you in advance.
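The issue is that I used scanf_s or scanf to get the user input in main. With the %s format, scanf stops reading at the first whitespace character, so commandDesired only ever held the first word and strtok_s had nothing left to return. That is not what I want.
If you want to read a whole line, use fgets. When I use fgets instead, the issue is solved!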
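As a rough sketch of that fix, the splitting step could look like the function below once the line comes from fgets. The name split_command is hypothetical, and the sketch uses plain strtok instead of strtok_s so that it only relies on ISO C; it assumes a single-threaded program with no nested tokenizing.

```c
#include <string.h>

/* Hypothetical replacement for argGetter. `line` is a buffer filled by
 * fgets(); it is modified in place. On return, *command points at the
 * first word and *argument at the second, or NULL when absent. */
void split_command(char *line, char **command, char **argument)
{
    line[strcspn(line, "\n")] = '\0';   /* drop the newline fgets() keeps */
    *command = strtok(line, " ");
    /* only ask for a second token if a first one was found */
    *argument = (*command != NULL) ? strtok(NULL, " ") : NULL;
}
```

In main, the scanf_s call would then be replaced by something like fgets(commandDesired, sizeof commandDesired, stdin), followed by split_command(commandDesired, &firstArg, &secondArg).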
If you want to separate strings at the space characters, don't use scanf() (or friends) with the %s format specifier: it stops reading at the first whitespace character, so the string that finally reaches strtok (or friends) doesn't have any spaces in it. This is most likely the reason (I have not looked at your code in detail, sorry) that you get the first word on the first call and NULL afterwards.
A good alternative is to use fgets(), in something like:
char line[1024];
/* the following call to fgets() reads a complete line (incl. the
* \n char) into line. */
while (fgets(line, sizeof line, stdin)) { /* != NULL means not eof */
for ( char *arg = strtok(line, " \t\n");
arg != NULL;
arg = strtok(NULL, " \t\n"))
{
/*process argument in arg here */
}
}
Or, if you want to strip the trailing \n char first and then process
the whole line to tokenize the arguments...
char line[1024];
/* the following call to fgets() reads a complete line (incl. the
* \n char) into line. */
while (fgets(line, sizeof line, stdin)) { /* != NULL means not eof */
process_line(strtok(line, "\n")); /* strips the trailing \n; there can be at most one */
}
Then, inside the process_line() function, you need to check the parameter for NULL: if the line consists of a single \n, strtok() returns a null pointer.
IMPORTANT WARNING: strtok() is not reentrant, and it cannot be nested. It uses an internal, global iterator that is reset each time you pass a non-null first parameter. If you need to run several levels of scanning, you have two options:
Run the outer loop in full, appending the work to a second-level set of jobs (or similar), so that strtok() runs on each level only after the previous loop has finished.
Use the reentrant version of strtok(), e.g. strtok_r(). This allows reentrancy and nesting; you just need to provide a different state variable (where strtok_r() stores the iterator state) for each nesting level (or thread).
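To illustrate the second option without depending on a particular platform, here is a small sketch of nested tokenizing. It does not call strtok_r() itself (that function is POSIX, not ISO C). Instead it uses a hypothetical helper, next_token, built from strspn and strcspn, which keeps its state in a caller-supplied pointer the same way strtok_r() does. The count_fields function and the ';' and ',' delimiters are also made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strtok_r(): *state must point at the text to
 * scan before the first call. Returns the next token or NULL when none
 * is left. The text is modified in place. */
char *next_token(char **state, const char *delims)
{
    char *start = *state;
    if (start == NULL)
        return NULL;
    start += strspn(start, delims);          /* skip leading delimiters */
    if (*start == '\0') {
        *state = NULL;
        return NULL;
    }
    char *end = start + strcspn(start, delims);
    if (*end != '\0') {
        *end = '\0';                         /* terminate this token */
        *state = end + 1;                    /* resume after it next time */
    } else {
        *state = NULL;                       /* that was the last token */
    }
    return start;
}

/* Nested scan: records are separated by ';', fields inside a record by ','.
 * Each level has its own state variable, so the loops do not interfere. */
size_t count_fields(char *text)
{
    size_t total = 0;
    char *outer = text;
    for (char *rec = next_token(&outer, ";"); rec != NULL; rec = next_token(&outer, ";")) {
        char *inner = rec;
        for (char *fld = next_token(&inner, ","); fld != NULL; fld = next_token(&inner, ","))
            total++;
    }
    return total;
}
```

With plain strtok() the inner loop would overwrite the single global iterator and the outer loop would stop after the first record.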

reading tokens misreading simple string - c

I'm writing a program where I need to read in token by token and detect certain keywords. One of these keywords is "gt" which stands for greater than.
I split the text file into tokens by tabs, newlines, spaces, and returns. Buffer is simply a large char array.
char* word = strtok(buffer, " \n\t\r");
I then have several cases to check for the possible words. The gt is as follows. Weirdly enough, this works for other keywords and sometimes even other occurrences of 'gt'.
//gt
if(strcmp("gt", word) == 0){
type = GT;
literal_value = 0;
}
However, it isn't getting reached despite a 'gt' being input. I noticed that when I print, this happens
printf("WORD is %s!\n", word);
PRINTS "!ORD is gt"
Which clearly isn't right. If the answer is something obvious please let me know- this bug has been evading me for a long time!
updated fragment:
char * word = strtok(buffer, " \n\t\r");
while (word != NULL){
printf("word is %s!\n", sections); //PRINTS "!ORD is gt"
if(sections[0] == ';'){
break; //comment indicated by ';'
}
//gt
if(strcmp("gt", word) == 0){
type = GT;
literal_value = 0;
}
//...............
//other comparisons for less than, equal to
process(&curr, output_file); //function to process current token
word = strtok(NULL, " \n\t\r");
}
Partial answer.
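The reason you get that output is that you have an MS-DOS (or Windows) type .txt file, where each line ends with two characters: a carriage return ('\r') followed by a line feed ('\n'). You are catching the '\n' line feed character but not the carriage return character, so the string printed by %s ends with a carriage return. The terminal then moves the cursor back to the start of the line, and the ! overwrites the first character. That is why the ! is the first character on the line.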
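A stray carriage return also explains why the comparison fails: strcmp("gt", "gt\r") is not 0. One way to rule it out is to trim line-ending characters from a token before comparing it. The helper below is only a sketch, and the name strip_line_ending is an assumption, not something from the question.

```c
#include <string.h>

/* Hypothetical helper: removes any trailing '\r' and '\n' characters
 * from `s` in place and returns `s`. */
char *strip_line_ending(char *s)
{
    size_t len = strlen(s);
    while (len > 0 && (s[len - 1] == '\r' || s[len - 1] == '\n'))
        s[--len] = '\0';
    return s;
}
```

Calling strip_line_ending(word) before the chain of strcmp tests makes a token read from a CRLF file compare equal to "gt".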

C String manipulation: How do I append = (equal sign) to the beginning and end of a tokenized string, Wrong output due to newline upon pressing enter

I'm currently having trouble with appending an equal sign before and after each token once my string is split. It leads me to the conclusion that I must replace the newline character at some point with my desired equal sign after splitting my string. I've tried looking at the C string.h library reference for a way to replace the newline char, using strstr to check whether there was already a "\n" in the tokenized string, but I ran into an infinite loop when I tried that. I also thought about replacing the newline character directly, which should be at the string length minus 1. I admit I have low familiarity with C, but I am currently reading the library reference. If you could take a look at my code and provide some feedback, I would greatly appreciate it. Thank you for your time.
// main method
int main(void){
// allocate memory
char string[256];
char *tokenizedString;
const char delimit[2] = " ";
const char *terminate = "\n";
do{
// prompt user for a string we will tokenize
do{
printf("Enter no more than 65 tokens:\n");
fgets(string, sizeof(string), stdin);
// verify input length
if(strlen(string) > 65 || strlen(string) <= 0) {
printf("Invalid input. Please try again\n"); }
} while(strlen(string) > 65);
// tokenize the string
tokenizedString = strtok(string, delimit);
while(tokenizedString != NULL){
printf("=%s=\n", tokenizedString);
tokenizedString = strtok(NULL, delimit);
}
// replace newline character implicitly made by enter, it seems to be adding my newline character at the end of output
} while(strcmp(string, "\n"));
return 0;
}// end of method main
OUTPUT:
Enter no more than 65 tokens:
i am very tired sadface
=i=
=am=
=very=
=tired=
=sadface
=
DESIRED OUTPUT
Enter no more than 65 tokens:
i am very tired sadface
=i=
=am=
=very=
=tired=
=sadface=
Since you are using strlen(), you can do this instead
size_t length = strlen(string);
// Strip the last character only if the string is non-empty and ends in '\n'
if (length > 0 && string[length - 1] == '\n')
string[length - 1] = '\0';
Advantages:
This way you call strlen() only once. Calling it multiple times for the same string is inefficient anyway.
You always remove the trailing '\n' from the input string, so your tokenization will work as expected.
Note: strlen() would never return a value < 0, because what it does is count the number of characters in the string, which is only 0 for "" and > 0 otherwise.
Well, you have two ways to do it. The simplest is to add a \n to the token delimiter string:
const char delimit[] = " \n";
(you don't need to give an array size when you initialize a char array with a string literal)
so that strtok() eliminates the final \n that comes in with your input. Another way is to search for it right after reading and remove it from the input string. You can use strtok(3) for this purpose as well:
tokenizedString = strtok(string, "\n");
tokenizedString = strtok(tokenizedString, delimit);
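For reference, here is a sketch of the first approach (adding \n to the delimiters) as a self-contained function. The name wrap_tokens and the idea of writing into an output buffer instead of printing are assumptions made so the result can be inspected; the original program prints each token directly.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes every token of `input` as "=token=\n" into
 * `out` and returns the number of tokens written. `input` is modified in
 * place by strtok(). Stops early if `out` is too small. */
size_t wrap_tokens(char *input, char *out, size_t out_size)
{
    const char delimit[] = " \n";   /* the newline is now a delimiter too */
    size_t used = 0;
    size_t count = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';
    for (char *tok = strtok(input, delimit); tok != NULL; tok = strtok(NULL, delimit)) {
        int n = snprintf(out + used, out_size - used, "=%s=\n", tok);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0';       /* discard the partially written token */
            break;
        }
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Because "\n" is in the delimiter set, the last token comes out as =sadface= with the closing equal sign on the same line.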

Using sscanf to read strings

I am trying to save one character and 2 strings into variables.
I use sscanf to read strings with the following form :
N "OldName" "NewName"
What I want : char character = 'N' , char* old_name = "OldName" , char* new_name = "NewName" .
This is how I am trying to do it :
sscanf(mystring,"%c %s %s",&character,old_name,new_name);
printf("%c %s %s",character,old_name,new_name);
The problem is, my program stops working without any output.
(I want to ignore the quotation marks too and save only its content)
When you do
char* new_name = "NewName";
you make the pointer new_name point to the read-only string array containing the constant string literal. The array contains exactly 8 characters (the letters of the string plus the terminator).
First of all, using that pointer as a destination for scanf will cause scanf to write to the read-only array, which leads to undefined behavior. And if you give a string longer than 7 characters, then scanf will also attempt to write out of bounds, again leading to undefined behavior.
The simple solution is to use actual arrays, and not pointers, and to also tell scanf to not read more than can fit in the arrays. Like this:
char old_name[64]; // Space for 63 characters plus string terminator
char new_name[64];
sscanf(mystring,"%c %63s %63s",&character,old_name,new_name);
To skip the quotation marks you have a couple of choices: Either use pointers and pointer arithmetic to skip the leading quote, and then set the string terminator at the place of the last quote to "remove" it. Another solution is to move the string to overwrite the leading quote, and then do as the previous solution to remove the last quote.
Or you could rely on the limited pattern-matching capabilities of scanf (and family):
sscanf(mystring,"%c \"%63s\" \"%63s\"",&character,old_name,new_name);
Note that the above sscanf call will work iff the string actually includes the quotes.
Second note: As said in the comment by Cool Guy, the above won't actually work since scanf is greedy. It will read until the end of the file/string or a white-space, so it won't actually stop reading at the closing double quote. The only working solution using scanf and family is the one below.
Also note that scanf and family, when reading a string using "%s", stop reading at white-space, so if the string is "New Name" then it won't work either. If this is the case, then you either need to manually parse the string, or use the odd "%[" format, something like
sscanf(mystring,"%c \"%63[^\"]\" \"%63[^\"]\"",&character,old_name,new_name);
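A sketch of how that last call might be wrapped and checked follows. The function name parse_rename and the NAME_LEN constant are made up for the example. Two details go beyond the call above and are my additions: a leading space in the format to skip initial white-space, and a trailing %n to confirm that the closing quote was actually matched.

```c
#include <stdio.h>

#define NAME_LEN 64 /* 63 characters plus the string terminator */

/* Hypothetical wrapper: parses  N "Old Name" "New Name"  from `s`.
 * Returns 1 on success, 0 if the input does not have that shape. */
int parse_rename(const char *s, char *kind,
                 char old_name[NAME_LEN], char new_name[NAME_LEN])
{
    int end = -1; /* only set if the final quote was matched */
    int got = sscanf(s, " %c \"%63[^\"]\" \"%63[^\"]\"%n",
                     kind, old_name, new_name, &end);
    return got == 3 && end >= 0;
}
```

Note that "%[" needs at least one matching character, so an empty name such as "" is rejected by this sketch.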
You must allocate space for your strings, e.g:
char* old_name = malloc(128);
char* new_name = malloc(128);
Or using arrays
char old_name[128] = {0};
char new_name[128] = {0};
In case of malloc you also have to free the space before the end of your program.
free(old_name);
free(new_name);
Updated:...
The other answers provide good methods of creating memory as well as how to read the example input into buffers. There are two additional items that may help:
1) You expressed that you want to ignore the quotation marks too.
2) Reading first & last names when separated with space. (example input is not)
As @Joachim points out, because scanf and family stop scanning on a space with the %s format specifier, a name that includes a space such as "firstname lastname" will not be read in completely. There are several ways to address this. Here are two:
Method 1: tokenizing your input.
Tokenizing a string breaks it into sections separated by delimiters. Your string input examples for instance are separated by at least 3 usable delimiters: space: " ", double quote: ", and newline: \n characters. fgets() and strtok() can be used to read in the desired content while at the same time strip off any undesired characters. If done correctly, this method can preserve the content (even spaces) while removing delimiters such as ". A very simple example of the concept below includes the following steps:
1) reading stdin into a line buffer with fgets(...)
2) parse the input using strtok(...).
Note: This is an illustrative, bare-bones implementation, sequentially coded to match your input examples (with spaces) and includes none of the error checking/handling that would normally be included.
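#include <stdio.h>
#include <string.h>

int main(void)
{
char line[128];
char delim[] = "\n\"";//parse using only newline and double quote
char *tok;
char letter = '\0';//initialized in case a token is missing
char old_name[64] = ""; // Space for 63 characters plus string terminator
char new_name[64] = "";
if (fgets(line, sizeof line, stdin) == NULL) return 1;//no input to parse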
tok = strtok(line, delim); //consume 1st " and get token 1
if(tok) letter = tok[0]; //assign letter
tok = strtok(NULL, delim); //consume 2nd " and get token 2
if(tok) strcpy(old_name, tok); //copy tok to old name
tok = strtok(NULL, delim); //consume 3rd " throw away token 3
tok = strtok(NULL, delim); //consume 4th " and get token 4
if(tok) strcpy(new_name, tok); //copy tok to new name
printf("%c %s %s\n", letter, old_name, new_name);
return 0;
}
Note: as written, this example (like most uses of strtok(...)) requires very narrowly defined input. In this case the input must be no longer than 127 characters and consist of a single character followed by space(s), then a double-quoted string, followed by more space(s), then another double-quoted string, as defined by your example:
N "OldName" "NewName"
The following input will also work in the above example:
N "old name" "new name"
N "old name" "new name"
Note also about this example, some consider strtok() broken, while others suggest avoiding its use. I suggest using it sparingly, and only in single threaded applications.
Method 2: walking the string.
A C string is just an array of char terminated with a null character ('\0'). By selectively copying some characters into another string, while bypassing the ones you do not want (such as the "), you can effectively strip unwanted characters from your input. Here is an example function that will do this:
char * strip_ch(char *str, char ch)
{
char *from, *to;
char *dup = strdup(str);//make a copy of input
if(dup)
{
from = to = dup;//set working pointers equal to pointer to input
for (; *from != '\0'; from++)//walk through input string
{
*to = *from;//set destination pointer to original pointer
if (*to != ch) to++;//test - increment only if not char to strip
//otherwise, leave it so next char will replace
}
*to = '\0';//write the null terminator at the new end of the string
strcpy(str, dup);
free(dup);
}
return str;
}
Example use case:
int main(void)
{
char line[128] = {"start"};
while(strstr(line, "quit") == NULL)
{
printf("Enter string (\"quit\" to leave) and hit <ENTER>:");
fgets(line, 128, stdin);
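strip_ch(line, '"');//strips in place; sprintf(line, "%s\n", line) would overlap source and destination, which is undefined behavior
printf("%s", line);//fgets() kept the trailing newline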
}
return 0;
}

parse decimal string with sscanf

I want to parse string to integer.
The string can contain any data, including invalid input or floating-point numbers. This is my code which shows how I'm using sscanf():
errno = 0;
uint32_t temp;
int res = sscanf(string, "%"SCNu32, &temp);
if (0 != errno || 1 != res)
{
return HEX_ECONVERSION;
}
where string is passed as an argument.
I expected that this code would fail on "3.5" data. But unfortunately, sscanf() truncates "3.5" to 3 and writes it to the temp integer.
What's wrong?
Am I using sscanf() improperly? And how can I achieve this with sscanf, without writing a hand-written parser (by directly calling strtoul() or similar)?
3 is a valid integer. Unfortunately sscanf is not sophisticated enough to look ahead, detect that 3.5 is a float, and refuse to give you a result.
Try using strtol and testing the end pointer it returns through its second argument to see if it parsed the entire string.
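A minimal sketch of that end-pointer check is shown below. It swaps in strtoull instead of strtol, because the target type here is uint32_t and the wider unsigned type makes the range check straightforward. The function name parse_u32 is hypothetical, and rejecting a leading minus sign is my own choice (the strtoul family would otherwise accept it and negate the value).

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: converts the whole of `s` to a uint32_t.
 * Returns false for empty input, trailing junk such as ".5",
 * negative numbers, and values that do not fit in 32 bits. */
bool parse_u32(const char *s, uint32_t *out)
{
    while (isspace((unsigned char)*s))
        s++;
    if (*s == '-')
        return false;                    /* would otherwise be negated silently */

    errno = 0;
    char *end;
    unsigned long long v = strtoull(s, &end, 10);
    if (end == s)
        return false;                    /* no digits were consumed */
    if (errno == ERANGE || v > UINT32_MAX)
        return false;                    /* out of range */

    while (isspace((unsigned char)*end))
        end++;                           /* allow trailing white-space */
    if (*end != '\0')
        return false;                    /* something follows the number */

    *out = (uint32_t)v;
    return true;
}
```

With this, "3.5" is rejected because the end pointer stops at the '.', while "42" and " 7 \n" are accepted.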
Using the "%n" records where the scan is in the buffer.
We can use it to determine what stopped the scan.
int n;
int res = sscanf(string, "%"SCNu32 " %n", &temp, &n);
if (0 != errno || 1 != res || string[n] != '\0')
return HEX_ECONVERSION;
Appending " %n" says to ignore any following white-space, then note the buffer position. If there is no additional junk like ".5", string[n] will be the null terminator.
Be sure to test n only after ensuring temp was set. This was done above with 1 != res.
"%n" does not affect the value returned by sscanf().
sscanf only parses the part of the string that it can. You can add another format to it such as "%d%s". If the %s specifier captures a non-empty string (such as .5), throw an error.
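The "%n" approach from the answer above can be wrapped into a complete function. This is only a sketch: the name scan_u32 is made up, and it inherits the limits of the scanf family (a leading minus sign is accepted, and the result for values that do not fit in the target type is not defined by the C standard).

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical wrapper around the "%n" check: succeeds only when the
 * number, plus optional trailing white-space, is the entire string. */
bool scan_u32(const char *string, uint32_t *out)
{
    uint32_t temp;
    int n = -1; /* remains -1 if the scan never reaches %n */
    int res = sscanf(string, "%" SCNu32 " %n", &temp, &n);
    if (res != 1 || n < 0 || string[n] != '\0')
        return false;
    *out = temp;
    return true;
}
```

For "3.5" the conversion succeeds with 3, n becomes 1, and string[1] is '.', so the function reports failure.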
