Reading tab delimited record using fscanf - c

Data file:
Newton 30 United Kingdom Scientist
Maxwell 25 United Kingdom Mathematician
Edison 60 United States Engineer
Code to read it:
#define MAX_NAME 50
#define MAX_COUNTRY 25
#define MAX_PROFILE 20
struct person
{
char *name;
int age;
char *country;
char *profile;
};
struct person *pObj = malloc(sizeof *pObj);
pObj->name = malloc(MAX_NAME);        /* sizeof(MAX_NAME) would only be sizeof(int) */
pObj->country = malloc(MAX_COUNTRY);
pObj->profile = malloc(MAX_PROFILE);
fscanf(fPtr,"%s\t%d\t%s\t%s\n",pObj->name,&pObj->age,pObj->country,pObj->profile);
I wrote a program to read tab-delimited records into a structure using fscanf(). I could do the same with strtok() or strsep(), but then I would be forced to use atoi() to load the age field, and I don't want to use atoi(). So I used fscanf() to read age as an integer directly from the FILE stream. It works fine, BUT for some records the country field is empty, as shown below.
Newton 30 United Kingdom Scientist
Maxwell 25 Mathematician
Edison 60 United States Engineer
When I read the second record, fscanf() doesn't store an empty string in the country field; instead, the field is filled with the profile data. I understand that fscanf() works that way. But is there any option to scan the country field even though it is empty in the file? Can I do this without using atoi() for age, i.e., by reading each field as its respective type rather than reading all the fields as strings?

Original format
The %s conversion specification skips any white space (blanks, tabs, newlines, etc) in the input, and then reads non-white-space up to the next white space character. The \t appearing in the format string causes fscanf() to skip zero or more white space characters (not just tabs).
You have:
fscanf(fPtr,"%s\t%d\t%s\t%s\the n", pObj->name, pObj->age, pObj->country, pObj-profile);
You need to pass a pointer to the age and you need an arrow -> between pObj and profile (please post code that could compile; it doesn't inspire confidence when there are errors like this):
fscanf(fPtr,"%s\t%d\t%s\t%s\the n", pObj->name, &pObj->age, pObj->country, pObj->profile);
Given the first input line:
Newton 30 United Kingdom Scientist
fscanf() will read Newton into pObj->name, 30 into pObj->age, United into pObj->country and Kingdom into pObj->profile. fscanf() and family are very casual about white space, in general. Most conversions skip leading white space.
After the 4 values are assigned, you have \the n" at the end of the format. The tab skips the white space between Kingdom and Scientist, but the data doesn't match he n, so the scanning stops — not that you're any the wiser for that.
The next operation will pick up where this one stopped, so the next pObj->name will be assigned Scientist and then the pObj->age conversion will fail because Maxwell doesn't represent an integer. The conversions stop there on that fscanf().
And so the problems continue. Your claimed output can't be attained with the code you show in the question.
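To make the white-space behaviour described above concrete, here is a minimal sketch that applies the question's format to a single line with sscanf(). The helper name scan_with_percent_s and the %49s widths are assumptions added for illustration and safety; they are not part of the original code.

```c
#include <stdio.h>

/* Hypothetical helper: applies the question's "%s\t%d\t%s\t%s" format to one
 * line of text. Every string buffer must hold at least 50 bytes, matching the
 * %49s widths added here to prevent overflow.
 * Returns the number of fields sscanf() converted. */
int scan_with_percent_s(const char *line, char *name, int *age,
                        char *country, char *profile)
{
    /* %s stops at ANY white space, so a country such as "United Kingdom"
     * is split: the first word lands in country, the second in profile. */
    return sscanf(line, "%49s\t%d\t%49s\t%49s", name, age, country, profile);
}
```

Called on the first data line, this reports four successful conversions even though the country and profile fields hold the wrong text, which is why the return value alone cannot be trusted with this format.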
If you're adamant that you must use fscanf(), you'll need to use scan sets such as %24[^\t] to read the country. But you'd do better using fgets() or POSIX function getline() to read whole lines of input, and then perhaps use sscanf() but more likely use strcspn() or strpbrk() from Standard C (or perhaps strtok() or — far better — POSIX strtok_r() or Windows strtok_s(), or non-standard strsep()) to split the line into fields at tabs. Note that strtok_r() et al don't care how many repeats there are of the delimiter (tabs in your case) between the fields; you can't have empty fields with them. You can identify empty fields with strcspn(), strpbrk() and strsep().
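As a rough sketch of the line-based approach suggested above, the following splits one line (as read by fgets()) at tabs with strcspn(), so an empty country field is preserved as an empty string. The names person_rec, next_field and parse_person are hypothetical, and the struct uses fixed-size arrays instead of the question's malloc'd pointers. The age still passes through a string, but it is converted with strtol(), which, unlike atoi(), lets the caller detect bad input.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAME 50
#define MAX_COUNTRY 25
#define MAX_PROFILE 20

struct person_rec {            /* fixed-size variant of the question's struct */
    char name[MAX_NAME];
    int  age;
    char country[MAX_COUNTRY];
    char profile[MAX_PROFILE];
};

/* Copy the next tab-delimited field at *cursor into dst (truncated to
 * dstsize - 1 bytes) and advance *cursor past the tab. An empty field yields
 * an empty string. Returns 0 when no field is left, 1 otherwise. */
static int next_field(const char **cursor, char *dst, size_t dstsize)
{
    if (*cursor == NULL)
        return 0;
    size_t len = strcspn(*cursor, "\t\r\n");      /* length up to delimiter */
    size_t n = len < dstsize - 1 ? len : dstsize - 1;
    memcpy(dst, *cursor, n);
    dst[n] = '\0';
    /* Only a tab means another field follows; anything else ends the record. */
    *cursor = ((*cursor)[len] == '\t') ? *cursor + len + 1 : NULL;
    return 1;
}

/* Parse one line. Returns 0 on success, -1 if a field is missing or the age
 * is not a non-negative integer. Extra trailing fields are ignored. */
int parse_person(const char *line, struct person_rec *p)
{
    char agebuf[16];
    const char *cur = line;

    if (!next_field(&cur, p->name, sizeof p->name) ||
        !next_field(&cur, agebuf, sizeof agebuf) ||
        !next_field(&cur, p->country, sizeof p->country) ||
        !next_field(&cur, p->profile, sizeof p->profile))
        return -1;

    char *end;
    errno = 0;
    long v = strtol(agebuf, &end, 10);
    if (end == agebuf || *end != '\0' || errno == ERANGE || v < 0 || v > INT_MAX)
        return -1;
    p->age = (int)v;
    return 0;
}
```

A caller would typically loop over fgets() and pass each line to parse_person(), reporting the line number whenever it returns -1.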
Cleaned up format
The format string has been revised to:
fscanf(fPtr,"%s\t%d\t%s\t%s\n", pObj->name, &pObj->age, pObj->country, pObj->profile);
This won't work, but can now be adapted so it will work.
if (fscanf(fPtr," %49[^\t]\t%d\t%24[^\t]\t%19[^\n]", pObj->name, &pObj->age, pObj->country, pObj->profile) != 4)
…handle a format error…
Beware trailing white space in scanf() format strings. The leading blank skips any newline left over from previous lines, and skips any leading white space on a line. The %49[^\t] looks for up to 49 non-tabs; the tab is optional and matches any sequence of white space, but the first character will be a tab unless the name was too long. Then it reads a number, more optional white space (it doesn't have to be a tab, but it will be unless the data is malformatted), then up to 24 non-tabs, white space again (of which the first character will be a tab unless there's a formatting problem), and up to 19 non-tabs. The next character should be a newline, unless there's a formatting problem.
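The scan-set format can be tried on individual lines with sscanf(), as in this small sketch. The wrapper name scan_record is an assumption; the buffer sizes (50, 25, 20) follow the MAX_* constants in the question.

```c
#include <stdio.h>

/* Hypothetical wrapper around the scan-set format discussed above.
 * name needs 50 bytes, country 25 and profile 20.
 * Returns the number of successful conversions (4 for a good record). */
int scan_record(const char *line, char *name, int *age,
                char *country, char *profile)
{
    return sscanf(line, " %49[^\t]\t%d\t%24[^\t]\t%19[^\n]",
                  name, age, country, profile);
}
```

For a complete record this returns 4 and keeps "United Kingdom" in one piece. For a record with an empty country, the white space in the format swallows both tabs, so the call reports fewer than four conversions. The error is detected, but the empty field is not recovered.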

Related

Reading input from a file in C

I came across the following question:
If a file contains the line "I am a boy\r\n" then on reading this line into the array str using fgets(). What will str contain?
[A]. "I am a boy\r\n\0"
[B]. "I am a boy\r\0"
[C]. "I am a boy\n\0"
[D]. "I am a boy"
The answer has been given as option c with the explanation
Declaration: char *fgets(char *s, int n, FILE *stream);
fgets reads characters from stream into the string s. It stops when it reads either n - 1 characters or a newline character, whichever comes first.
However, I couldn't understand how \r (carriage return) will influence fgets. I mean, shouldn't it be that first "I am a boy" is read, then on encountering \r the cursor is set at the initial position, and "I" from "I am a boy" is overwritten by \n and the space following "I" is overwritten by \0?
Any help is deeply appreciated.
P.s: My claim is based on the explanation given on this link: https://www.quora.com/What-exactly-is-r-in-the-C-language
First, every time you see a multiple choice quiz on some programming website, I recommend you close the tab and do something productive instead such as watching videos of kittens. Because the questions seem to be just some variants of
Which of these is the first letter of the alphabet (only one is right)
A
a
6
a
the letter a
all of the above.
Carriage returns and line feeds do not affect the input read by a C program in that way. Each byte read is simply stored after the previous ones; nothing is overwritten. Otherwise, this is a very badly phrased question, as the answer could be any of A, B, C or D, or maybe none of them. Saying that C is the only one that is right is wrong.
First question is what it means if "the file contains \r"? Here I assume that the author meant that the file contains the 10 characters I am a boy followed by ASCII 13 and ASCII 10 (carriage return and line feed).
In C there are two translation modes for reading files, text mode and binary mode. On POSIX systems (all those operating systems with X in their name, except for Windows eXcePtion) these are equal - the text mode is ignored. So when you read the line into a buffer with fgets on POSIX, it will look for that line feed and store all characters as is, including the \r, so the buffer will have the following sequence of bytes: I am a boy\r\n\0. Therefore A could be true.
But on Windows, the text mode translates the carriage return and the linefeed to one newline character with ASCII value 10 in memory, so what you will have is I am a boy\n\0. Therefore C could be true. If your file was opened in binary mode, you'll still have I am a boy\r\n\0 - so how'd you claim that C is the only one that can be true?
If the string that you'd read with fgets would be I am a boy\r\n (POSIX or binary mode) but you told fgets your buffer has space for only 12 characters, then you'd get 11 characters of the input and terminating \0, and therefore you'd have I am a boy\r\0. The carriage return character would remain in the stream. Therefore B could be true. B cannot be true if you indicated that the buffer will have more space.
Finally any of these array contents does contain the string I am a boy, therefore D would be true in all of the cases above.
And if your buffer didn't have enough space for 10 characters and the terminator then you'd have some prefix of the contents, such as I am a bo followed by \0 which means that none of these was true.
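Since the buffer may end in "\r\n", "\n" or neither depending on platform, mode and buffer size, portable code often normalises the line right after fgets(). A minimal sketch, using a hypothetical helper named chomp:

```c
#include <string.h>

/* Hypothetical helper: cut the string at the first '\r' or '\n', so a line
 * read as "text\r\n", "text\n" or "text" ends up as "text" in every case.
 * Returns the length of the trimmed string. */
size_t chomp(char *s)
{
    size_t len = strcspn(s, "\r\n");   /* index of first CR or LF, or of '\0' */
    s[len] = '\0';
    return len;
}
```

After calling chomp() on the buffer filled by fgets(), the contents are the same whether the file was opened in text or binary mode.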

What is [^\n] in C? [duplicate]

I have run into some code and was wondering what the original developer was up to. Below is a simplified program using this pattern:
#include <stdio.h>
int main() {
char title[80] = "mytitle";
char title2[80] = "mayataiatale";
char mystring[80];
/* hugh ? */
sscanf(title,"%[^a]",mystring);
printf("%s\n",mystring); /* Output is "mytitle" */
/* hugh ? */
sscanf(title2,"%[^a]",mystring); /* Output is "m" */
printf("%s\n",mystring);
return 0;
}
The man page for scanf has relevant information, but I'm having trouble reading it. What is the purpose of using this sort of notation? What is it trying to accomplish?
The main reason for the character classes is so that the %s notation stops at the first white space character, even if you specify field lengths, and you quite often don't want it to. In that case, the character class notation can be extremely helpful.
Consider this code to read a line of up to 10 characters, discarding any excess, but keeping spaces:
#include <ctype.h>
#include <stdio.h>
int main(void)
{
char buffer[10+1] = "";
int rc;
while ((rc = scanf("%10[^\n]%*[^\n]", buffer)) >= 0)
{
int c = getchar();
printf("rc = %d\n", rc);
if (rc >= 0)
printf("buffer = <<%s>>\n", buffer);
buffer[0] = '\0';
}
printf("rc = %d\n", rc);
return(0);
}
This was actually example code for a discussion on comp.lang.c.moderated (circa June 2004) related to getline() variants.
At least some confusion reigns. The first format specifier, %10[^\n], reads up to 10 non-newline characters and they are assigned to buffer, along with a trailing null. The second format specifier, %*[^\n], contains the assignment suppression character (*) and reads zero or more remaining non-newline characters from the input. When the scanf() function completes, the input is pointing at the next newline character. The body of the loop reads that character with getchar(), so that when the loop restarts, the input is looking at the start of the next line. The process then repeats. If the line is shorter than 10 characters, then those characters are copied to buffer, and the 'zero or more non-newlines' format processes zero non-newlines.
The constructs like %[a] and %[^a] exist so that scanf() can be used as a kind of lexical analyzer. These are sort of like %s, but instead of collecting a span of as many "stringy" characters as possible, they collect just a span of characters as described by the character class. There might be cases where writing %[a-zA-Z0-9] might make sense, but I'm not sure I see a compelling use case for complementary classes with scanf().
IMHO, scanf() is simply not the right tool for this job. Every time I've set out to use one of its more powerful features, I've ended up eventually ripping it out and implementing the capability in a different way. In some cases that meant using lex to write a real lexical analyzer, but usually doing line at a time I/O and breaking it coarsely into tokens with strtok() before doing value conversion was sufficient.
Edit: I ended ripping out scanf() typically because when faced with users insisting on providing incorrect input, it just isn't good at helping the program give good feedback about the problem, and having an assembler print "Error, terminated." as its sole helpful error message was not going over well with my user. (Me, in that case.)
It's like character sets from regular expressions; [0-9] matches a string of digits, [^aeiou] matches anything that isn't a lowercase vowel, etc.
There are all sorts of uses, such as pulling out numbers, identifiers, chunks of whitespace, etc.
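As one illustration of such a use, a negated scan set can split a "key=value" line where the value may contain spaces. This is only a sketch; the helper name split_key_value and the buffer sizes are assumptions.

```c
#include <stdio.h>

/* Hypothetical helper: split "key=value" using two negated scan sets.
 * key must hold 16 bytes and value 32 bytes.
 * Returns 1 if both parts were found, 0 otherwise. */
int split_key_value(const char *line, char *key, char *value)
{
    /* %15[^=\n] : up to 15 characters that are neither '=' nor newline
     * =         : a literal '=' must follow
     * %31[^\n]  : the rest of the line, spaces included */
    return sscanf(line, " %15[^=\n]=%31[^\n]", key, value) == 2;
}
```

With %s in place of the second scan set, a value such as "dark blue" would be cut at the space.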
You can read about it in the ISO/IEC 9899 standard, available online.
Here is a paragraph I quote from the document about [ (Page 286):
Matches a nonempty sequence of characters from a set of expected
characters.
The conversion specifier includes all subsequent characters in the
format string, up to and including the matching right bracket (]). The
characters between the brackets (the scanlist) compose the scanset,
unless the character after the left bracket is a circumflex (^), in
which case the scanset contains all characters that do not appear in
the scanlist between the circumflex and the right bracket. If the
conversion specifier begins with [] or [^], the right bracket
character is in the scanlist and the next following right bracket
character is the matching right bracket that ends the specification;
otherwise the first following right bracket character is the one that
ends the specification. If a - character is in the scanlist and is not
the first, nor the second where the first character is a ^, nor the
last character, the behavior is implementation-defined.

How to write and read (including spaces) from text file

I'm using fscanf and fprintf.
I tried to delimit the strings on each line by \t and to read it like so:
fscanf(fp,"%d\t%s\t%s",&t->num,&t->string1,&t->string2);
The file contents:
1[TAB]string1[TAB]some string[NEWLINE]
It does not read properly. If I printf("%d %s %s",t->num,t->string1,t->string2) I get:
1 string1 some
Also I get this compile warning:
warning: format specifies type 'char *' but the argument has type 'char (*)[15]' [-Wformat]
How can I fix this without using binary r/w?
I'm guessing the space in "some string" is the problem. fscanf() reading a string using %s stops at the first whitespace character. To include spaces, use something like:
fscanf(fp, "%d\t%[^\n\t]\t%[^\n\t]", &t->num, &t->string1, &t->string2);
See also a reference page for fscanf() and/or another StackOverflow thread on reading tab-delimited items in C.
[EDIT in response to your edit: You seem to also have a problem with the arguments you're passing into fscanf(). You will need to post the declarations of t->string1 to be sure, but it looks like string1 is an array of characters, and therefore you should remove the & from the fscanf() call...]
The %s conversion specification stops reading at the first white space, and tabs and blanks both count as white space.
If you want to read a string of non-tabs, you can use a 'scan set' conversion specifier:
if (fscanf(fp, "%d\t%[^\t\n]\t%[^\t\n]", &t->num, t->string1, t->string2) != 3)
...oops - format error in input data...
(I'd lay odds that omitting the & from the string arguments is correct.) The question was edited; I win. Dropping the & is necessary to avoid the compiler warning!
This still doesn't quite do what you expect. If there are blanks at the start of the second field, they'll be eaten by the \t in the format string. Any white space in the format string eats any white space (including newlines) in the input. The %[^\t] conversion specification won't get started until there's a character that isn't white space in the input. I'm also assuming you want your input limited by newlines. You can leave out the \n characters if you prefer.
Note that I checked that the fscanf() interpreted 3 fields. It is important to error check your inputs.
If you really want control, you should probably read whole lines with fgets() and then use sscanf() to parse the data.
About fgets() and sscanf(); can you expand about how it will give more control?
Suppose the input data is written
1234
a string with spaces
another string
spread out over multiple lines like that. With raw fscanf(), this will be acceptable input even though it is spread over 9 lines of input. With fgets(), you can read a single line, and then analyze it with sscanf(), and you'll know that the first line was not in the correct format. You can then decide what to do about it.
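A minimal sketch of that fgets() plus sscanf() combination follows. The struct layout is inferred from the compiler warning in the question (15-byte arrays); the names item, parse_item_line and count_bad_lines are hypothetical.

```c
#include <stdio.h>

struct item {
    int  num;
    char string1[15];
    char string2[15];
};

/* Parse one complete line. Returns 1 if it holds all three fields, else 0. */
int parse_item_line(const char *line, struct item *t)
{
    return sscanf(line, "%d\t%14[^\t\n]\t%14[^\t\n]",
                  &t->num, t->string1, t->string2) == 3;
}

/* Read fp line by line with fgets() and report each malformed line.
 * Because sscanf() only ever sees one line, a record spread over several
 * lines is rejected instead of being silently accepted.
 * Lines longer than the buffer arrive in pieces; a real program would need
 * to detect that. Returns the number of rejected lines. */
int count_bad_lines(FILE *fp)
{
    char line[256];
    struct item t;
    int lineno = 0, bad = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        lineno++;
        if (!parse_item_line(line, &t)) {
            fprintf(stderr, "line %d: expected num<TAB>text<TAB>text\n", lineno);
            bad++;
        }
    }
    return bad;
}
```

Here the program knows which line was wrong and can decide whether to skip it, stop, or ask for corrected input.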
Also, since mafso called me on it in his comment, we should ensure that there are no buffer overflows by limiting the size of the strings that the scan sets match.
if (fscanf(fp, "%d\t%14[^\t\n]\t%14[^\t\n]", &t->num, t->string1, t->string2) != 3)
...oops - format error in input data...
I'm using the error message about char (*)[15] to deduce that 14 is the correct number to use. Note that unlike printf(), you can't specify the sizes via * notation (in the scanf()-family, * suppresses assignment), so you have to create the format with the correct sizes. Further, the size you specify is the number of characters before the terminating null byte, so if the array is of size 15, the size you specify in the format string is 14, as shown.
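One common way to keep the width in the format and the array size in step is to build the format at compile time with preprocessor stringification. This idiom is not from the answer above; it is a sketch, and the macro and function names (STR_MAX, STRINGIFY, FIELD, parse_entry) are made up for illustration.

```c
#include <stdio.h>

#define STR_MAX 14                    /* array size minus one for the null byte */
#define STRINGIFY_(x) #x
#define STRINGIFY(x)  STRINGIFY_(x)   /* two levels so STR_MAX expands first */
#define FIELD "%" STRINGIFY(STR_MAX) "[^\t\n]"   /* becomes "%14[^\t\n]" */

struct entry {
    int  num;
    char string1[STR_MAX + 1];
    char string2[STR_MAX + 1];
};

/* Parse "num<TAB>text<TAB>text". Returns 1 on success, 0 otherwise.
 * Adjacent string literals are concatenated by the compiler, so the format
 * below is "%d\t%14[^\t\n]\t%14[^\t\n]". */
int parse_entry(const char *line, struct entry *t)
{
    return sscanf(line, "%d\t" FIELD "\t" FIELD,
                  &t->num, t->string1, t->string2) == 3;
}
```

Changing STR_MAX then updates both the array sizes and the widths in the format. The macro must expand to a plain decimal literal for this to work; an expression such as (15 - 1) would be pasted into the format verbatim.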

retrieving a string with spaces from a file in C

We were given an assignment that involved taking information from a file and storing the data in an array. The data in the file is sorted as follows
New York 40 43 N 74 01 W
The first 20 characters are the name of the city, followed by the latitude and longitude. Latitude and longitude should be easy with a few
fscanf(infile, "%d(or %c depending on which one i'm getting)", pointer)
operations so they won't be a problem.
My problem is that I do not know how to collect the string for the name of the city because some of the city names have spaces. I read something about using delimiters, but from what I read, it seems like that is used more for reading an entire line. Is there any way to read the city name from a file and store the entire name with spaces in a character array? Thanks.
Here's a hint: With spaces as your only delimiter, how would you tell fscanf() where the city name starts and the latitude starts? You're getting close with your "it seems like that is used more for reading an entire line". Explore that, perhaps with fgets().
scanf() can read a fixed number of characters with the "%c" specifier and a field width.
ATTENTION: It will not add a terminating null byte.
char cityname[21];
scanf("%20c%d%d %c%d%d %c", cityname,
&lat1, &lat2, &lat_letter,
&lon1, &lon2, &lon_letter);
cityname[20] = 0;
But you're better off using fgets() and parsing the string "manually". Otherwise you're going to have END-OF-LINE issues
G'day,
As you're scanning up until the first number, the latitude, for your city name maybe use a scan for non-numbers for the first item?
If you have spaces in your city name, you either need to use delimiters or define the city name field to be fixed length. Otherwise trying to parse a three-word city name, e.g. "Salt Lake City", will kill the next field.
Just a hint : read the entire line in memory and then take the first 20 chars for the city name, the next, say 10 chars for latitude and so on.
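The hint above can be sketched roughly as follows, assuming the line has already been read with fgets() and that the city name occupies exactly the first 20 columns, padded with spaces. The struct and the names city, CITY_WIDTH and parse_city are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define CITY_WIDTH 20

struct city {
    char name[CITY_WIDTH + 1];        /* 20 characters plus the null byte */
    int  lat_deg, lat_min;
    char lat_dir;                     /* 'N' or 'S' */
    int  lon_deg, lon_min;
    char lon_dir;                     /* 'E' or 'W' */
};

/* Parse one line whose first CITY_WIDTH characters are a space-padded city
 * name, followed by latitude and longitude. Returns 1 on success, else 0. */
int parse_city(const char *line, struct city *c)
{
    if (strlen(line) < CITY_WIDTH)
        return 0;                     /* too short to hold the name column */

    memcpy(c->name, line, CITY_WIDTH);
    c->name[CITY_WIDTH] = '\0';

    /* Trim the trailing padding so "New York    " becomes "New York". */
    size_t n = CITY_WIDTH;
    while (n > 0 && c->name[n - 1] == ' ')
        c->name[--n] = '\0';

    /* The rest of the line holds only numbers and single letters, so the
     * white-space skipping of sscanf() is harmless here. */
    return sscanf(line + CITY_WIDTH, "%d %d %c %d %d %c",
                  &c->lat_deg, &c->lat_min, &c->lat_dir,
                  &c->lon_deg, &c->lon_min, &c->lon_dir) == 6;
}
```

Because the name is taken by position rather than by delimiter, multi-word names such as "Salt Lake City" need no special handling.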
You can specify size for %c which will collect a block of characters of the specified size. In your case, if the city name is 20 characters long put %20c inside the line format of scanf.
Then you have to put terminator at the end and trim the string.
From "man fgets":
char *fgets(char *s, int size, FILE *stream);
fgets() reads in at most one less than size characters from stream and
stores them into the buffer pointed to by s. Reading stops after an
EOF or a newline. If a newline is read, it is stored into the buffer.
A '\0' is stored after the last character in the buffer.
fgets() returns s on success, and NULL on error or when end
of file occurs while no characters have been read.
This means that you need a char array of 21 chars to store a 20 char string (the 21st char will be the '\0' terminator at the end, and fgets will put it automagically).
Good luck!

Resources