Using fgets() without a predefined buffer - C

I need to ask one more question about reading from stdin.
I am reading a huge number of lines from stdin, but the length of each line is unknown in advance. I don't want to allocate something like a 50 MiB buffer just in case, only for one file to have three-character lines and another file to actually use those 50 MiB per line.
So at the moment I am having this code:
int cur_max = 2047;
char *str = malloc(sizeof(char) * cur_max);
int length = 0;

while (fgets(str, sizeof(str), stdin) != NULL) {
    // do something with str
    // for example printing
    printf("%s", str);
}
free(str);
So I am using fgets() for every line, with an initial size of 2047 chars per line.
My plan is to grow the buffer (str) when a line hits the limit. So my idea is to track the size in length, and if the current length exceeds cur_max, double cur_max.
The idea comes from here: Read line from file without knowing the line length
I am currently not sure how to do this with fgets(), because fgets() does not hand me the input char by char, so I don't know at which moment to grow the buffer.

Incorrect code
sizeof(str) is the size of a pointer, like 2, 4 or 8 bytes. Pass to fgets() the size of the memory pointed to by str. (Credit: @Andrew Henle, @Steve Summit.)
char *str = malloc(sizeof(char) * cur_max);
...
// while(fgets(str, sizeof(str), stdin) != NULL
while(fgets(str, cur_max, stdin) != NULL
Environmental limits
Text files and fgets() are not the portable solution for reading excessively long lines.
An implementation shall support text files with lines containing at least 254 characters, including the terminating new-line character. The value of the macro BUFSIZ shall be at least 256. (C11 §7.21.2 9)
So once the line length exceeds BUFSIZ - 2, code is on its own as to whether the C standard library functions can handle the text file.
So either read the data as binary, use other libraries that ensure the desired functionality, or rely on hope.
Note: BUFSIZ defined in <stdio.h>
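Back to the question itself: here is a minimal sketch of the doubling strategy the asker describes, using fgets() plus realloc(). The helper name read_line and the starting size of 2048 are my own choices, not any standard API:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of arbitrary length; returns a malloc'd string
   (caller frees) or NULL on EOF-with-no-data or allocation failure. */
char *read_line(FILE *stream) {
    size_t cur_max = 2048;   /* initial guess */
    size_t length = 0;
    char *str = malloc(cur_max);
    if (str == NULL)
        return NULL;

    while (fgets(str + length, (int)(cur_max - length), stream) != NULL) {
        length += strlen(str + length);
        /* A newline (or end of file) means the line is complete. */
        if ((length > 0 && str[length - 1] == '\n') || feof(stream))
            return str;
        /* Buffer filled without a newline: double it and keep reading. */
        cur_max *= 2;
        char *tmp = realloc(str, cur_max);
        if (tmp == NULL) {
            free(str);
            return NULL;
        }
        str = tmp;
    }
    if (length == 0) {       /* nothing read: EOF or error */
        free(str);
        return NULL;
    }
    return str;              /* last line had no trailing newline */
}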

POSIX.1 getline() (man 3 getline) is available in almost all operating systems' C libraries (the only exception I know of is Windows). A loop to read lines of any length is
char *line_ptr = NULL;
size_t line_max = 0;
ssize_t line_len;

while (1) {
    line_len = getline(&line_ptr, &line_max, stdin);
    if (line_len == -1)
        break;

    /* You now have 'line_len' chars at 'line_ptr',
       but it may contain embedded nul chars ('\0').
       Also, line_ptr[line_len] == '\0'.
    */
}
/* Discard dynamically allocated buffer; allow reuse later. */
free(line_ptr);
line_ptr = NULL;
line_max = 0;
There is also a related function getdelim(), that takes an extra parameter (specified before the stream), used as an end-of-record marker. It is particularly useful in Unixy/POSIXy environments when reading file names from e.g. standard input, as you can use nul itself ('\0') as the separator (see e.g. find -print0 or xargs -0), allowing correct handling for all possible file names.
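For instance, a consumer of find . -print0 output might look like the following. A minimal sketch assuming a POSIX.1-2008 system (the feature-test macro makes getdelim() visible on glibc):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    char *name = NULL;
    size_t size = 0;
    ssize_t len;

    /* Each record ends with '\0', as produced by find . -print0 */
    while ((len = getdelim(&name, &size, '\0', stdin)) != -1) {
        /* 'name' holds 'len' bytes, including the '\0' delimiter itself */
        printf("%zd bytes: %s\n", len, name);
    }
    free(name);
    return 0;
}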
If you use Windows, or if you have text files with varying newline conventions (not just '\n', but any of '\n', '\r', "\r\n", or "\n\r"), you can use my getline_universal() function implementation from another of my answers. It differs from standard getline() and fgets() in that the newline is not included in the line it returns; it is also left in the stream and consumed/ignored by the next call to getline_universal(). If you use getline_universal() to read each line in a file or stream, it will work as expected.

Related

How to get fscanf to stop if it hits a newline? [duplicate]

I'm trying to read a line using the following code:
while (fscanf(f, "%[^\n\r]s", cLine) != EOF)
{
    /* do something with cLine */
}
But somehow I get only the first line every time. Is this a bad way to read a line? What should I fix to make it work as expected?
It's almost always a bad idea to use the fscanf() function as it can leave your file pointer in an unknown location on failure.
I prefer to use fgets() to get each line in and then sscanf() that. You can then continue to examine the line read in as you see fit. Something like:
#define LINESZ 1024

char buff[LINESZ];
FILE *fin = fopen("infile.txt", "r");
if (fin != NULL) {
    while (fgets(buff, LINESZ, fin)) {
        /* Process buff here. */
    }
    fclose(fin);
}
fgets() appears to be what you're trying to do, reading in a string until you encounter a newline character.
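For instance, the "Process buff here" step might pull fields out of the line with sscanf(); the line format here (number, operator, number) is made up purely to show the fgets-then-sscanf split:

/* Inside the while (fgets(buff, LINESZ, fin)) loop above: */
int a, b;
char op;
if (sscanf(buff, "%d %c %d", &a, &op, &b) == 3)
    printf("parsed: %d %c %d\n", a, op, b);
else
    fprintf(stderr, "unrecognized line: %s", buff);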
If you want to read a file line by line (here, line separator == '\n'), just do this:
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *fp;
    char *buffer;
    int ret;

    // Open a file ("test.txt")
    if ((fp = fopen("test.txt", "r")) == NULL) {
        fprintf(stderr, "Error: Can't open file!\n");
        return -1;
    }
    // Alloc buffer (set your max line size)
    buffer = malloc(sizeof(char) * 4096);
    // Loop on fscanf's return value rather than feof(): on an empty line,
    // %[^\n] matches nothing and returns 0 (not EOF) without consuming
    // the '\n', so a feof()-controlled loop would spin forever.
    while ((ret = fscanf(fp, "%4095[^\n]", buffer)) != EOF) {
        if (ret == 1) {
            // Print line
            fprintf(stdout, "%s\n", buffer);
        }
        // Consume the newline that %[^\n] leaves in the stream
        fgetc(fp);
    }
    // Free buffer
    free(buffer);
    // Close file
    fclose(fp);
    return 0;
}
Enjoy :)
If you try while( fscanf( f, "%27[^\n\r]", cLine ) == 1 ) you might have a little more luck. The three changes from your original:
length-limit what gets read in - I've used 27 here as an example, and unfortunately the scanf() family requires the field width literally in the format string and can't use the * mechanism that the printf() family can for passing the value in
get rid of the s in the format string - %[ is the format specifier for "all characters matching or not matching a set", and the set is terminated by a ] on its own
compare the return value against the number of conversions you expect to happen (and for ease of management, ensure that number is 1)
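Put together, a sketch of the corrected loop. I've added an fgetc() call to consume the newline that %[ leaves in the stream; without it the second iteration would match nothing and the loop would stop after one line (empty lines still end the loop early, one of the pains alluded to below):

char cLine[28];   /* 27 chars plus the terminating NUL */

while (fscanf(f, "%27[^\n\r]", cLine) == 1) {
    /* do something with cLine */
    fgetc(f);     /* consume the newline so the next iteration can match */
}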
That said, you'll get the same result with less pain by using fgets() to read in as much of a line as will fit in your buffer.
Using fscanf to read/tokenise a file always results in fragile code or pain and suffering. Reading a line, and tokenising or scanning that line is safe, and effective. It needs more lines of code - which means it takes longer to THINK about what you want to do (and you need to handle a finite input buffer size) - but after that life just stinks less.
Don't fight fscanf. Just don't use it. Ever.
It looks to me like you're trying to use regex operators in your fscanf string. The scanset %[^\n\r] is actually valid, but the trailing s in your format string is a literal character that fscanf tries to match, which is not what you want.
Furthermore, fscanf() doesn't return EOF if the item doesn't match. Rather, it returns an integer that indicates the number of matches, which in your case is probably zero. EOF is only returned at the end of the stream or in case of an error. So what's happening in your case is that the first call to fscanf() reads the first line but leaves the '\n' in the stream; every later call then immediately fails to match (the pending '\n' is the first character), consumes nothing, and returns 0 rather than EOF, so the loop never advances past the first line.
Finally, note that the %s scanf format operator only captures up to the next whitespace character, so you don't need to exclude \n or \r with it in any case.
Consult the fscanf documentation for more information: http://www.cplusplus.com/reference/clibrary/cstdio/fscanf/
Your loop has several issues. You wrote:
while( fscanf( f, "%[^\n\r]s", cLine ) != EOF )
/* do something */;
Some things to consider:
fscanf() returns the number of items stored. It can return EOF if it reads past the end of the file or if the file handle has an error. You need to distinguish a valid return of zero, in which case there is no new content in the buffer cLine, from a successful read.
You do a have a problem when a failure to match occurs because it is difficult to predict where the file handle is now pointing in the stream. This makes recovery from a failed match harder to do than might be expected.
The pattern you wrote probably doesn't do what you intended. It is matching any number of characters that are not CR or LF, and then expecting to find a literal s.
You haven't protected your buffer from an overflow. Any number of characters may be read from the file and written to the buffer, regardless of the size allocated to that buffer. This is an unfortunately common error, which in many cases can be exploited by an attacker to run arbitrary code of the attacker's choosing.
Unless you specifically requested that f be opened in binary mode, line-ending translation will happen in the library, and you will generally never see CR characters in text files.
You probably want a loop more like the following:
while (fgets(cLine, N_CLINE, f)) {
    /* do something */ ;
}
where N_CLINE is the number of bytes available in the buffer starting at cLine.
The fgets() function is a much preferred way to read a line from a file. Its second parameter is the size of the buffer, and it reads up to 1 less than that size bytes from the file into the buffer. It always terminates the buffer with a nul character so that it can be safely passed to other C string functions.
It stops on the first of end of file, newline, or buffer_size-1 bytes read.
It leaves the newline character in the buffer, and that fact allows you to distinguish a single line longer than your buffer from a line shorter than the buffer.
It returns NULL if no bytes were copied due to end of file or an error, and the pointer to the buffer otherwise. You might want to use feof() and/or ferror() to distinguish those cases.
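A minimal sketch of that newline test (the buffer size and names here are arbitrary):

#include <stdio.h>
#include <string.h>

#define N_CLINE 256

int main(void) {
    char cLine[N_CLINE];
    FILE *f = stdin;   /* or any opened stream */

    while (fgets(cLine, N_CLINE, f)) {
        size_t len = strlen(cLine);
        if (len > 0 && cLine[len - 1] == '\n') {
            /* a complete line, newline included */
        } else if (feof(f)) {
            /* the file's last line, with no trailing newline */
        } else {
            /* a fragment: the line was longer than the buffer */
        }
    }
    return 0;
}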
I think the problem with this code is that when you read with %[^\n\r] you read until you reach '\n' or '\r', but you never read the '\n' or '\r' itself.
So you need to consume that character before you call fscanf again in the loop.
Do something like this:
do {
    if (fscanf(f, "%[^\n\r]", cLine) == 1) {
        /* Do something here */
    }
} while (fgetc(f) != EOF);   /* consume the newline, or stop at end of file */

Does the fgets() function append the \n and \0 characters when input exceeds the maximum length?

May seem like a silly question to most of you, but I'm still trying to determine the final answer. Some hours ago I decided to replace all the scanf() functions in my project with fgets() in order to get more robust code.
I learned that fgets() automatically ends the input string with a '\n' and a NUL character, but..
let's say I have something like this:
char user[16];
An array of 16 char which stores a username (15 characters max, I reserve the last one for the NUL terminator).
The question is: if I insert a 15-character string, then the '\n' would end up in the last cell of the array, but what about the NUL terminator?
does the '\0' get stored in the following block of memory?
(no segmentation fault when calling the printf() function implies that the inserted string is actually NUL terminated, right?).
As a complement to 5gon12eder's answer. I assume you have something like:
char user[16];
fgets(user, 16, stdin);
and your input is abcdefghijklmno\n, that is, 15 characters and a newline.
fgets will put the first 15 (16 - 1) characters of the input into user, followed by a NUL, and you will effectively get "abcdefghijklmno", which is what you want.
But ... the \n still remains in the stream buffer and is actually available for the next read (be it a fgets or anything else) on the same FILE. More exactly, until you do another read you cannot know whether there were other characters following the o.
As @5gon12eder suggests, use:
char user[16];
fgets(user, sizeof user, stdin);
// Function prototype for reference
#include <stdio.h>
char *fgets(char * restrict s, int n, FILE * restrict stream);
Now for details:
The '\n' and the '\0' are not both automatically appended; only the '\0' is. fgets() will stop reading once it gets a '\n', but will stop for other reasons too, including a full buffer. In those cases, there is no '\n' before the '\0'.
fgets() does not read a C string, but reads a line. The input stream is typically in text mode and then end-of-line translations occur. On some systems, '\r', '\n' pair will translate to '\n'. On others, it will not. Usually the files being read match this translation, but exceptions occur. In binary mode, no translations occur.
fgets() reads '\0' characters and continues reading, so strlen(buf) does not always reflect the true number of chars read. There may be a fool-proof method to determine the true count when '\0' occurs in the middle, but it is likely easier to code with fread() or fgetc().
On the EOF condition (with no data read) or an IO error, fgets() returns NULL. When an I/O error occurs, the contents of the buffer are not defined.
Pedantic issue: The C standard uses type int for the size of the buffer, but code often passes a variable of type size_t. A size n less than 1 or more than INT_MAX can be a problem. A size of 1 should do nothing more than set buf[0] = '\0', but some systems behave differently, especially if the EOF condition is near or passed. But as long as 2 <= n <= INT_MAX, a terminating '\0' can be expected. Note: fgets() may return NULL when the size is too small.
Code typically deletes the terminating '\n' with an idiom that can cause trouble. Suggest:
char buf[80];
if (fgets(buf, sizeof buf, stdin) == NULL) Handle_IOError_or_EOF();

// IMO potential UB and undesired behavior
// buf[strlen(buf) - 1] = '\0';

// Suggested end-of-line deleter
size_t len = strlen(buf);
if (len > 0 && buf[len - 1] == '\n') buf[--len] = '\0';
Robust code checks the return value from fgets(). The following approach has shortcomings. 1) If an IO error occurred, the buffer contents are not defined; checking the buffer contents will not provide reliable results. 2) A '\0' may have been the first char read while the file is not in the EOF condition.
// Following is weak code.
buf[0] = '\0';
fgets(buf, sizeof buf, stdin);
if (strlen(buf) == 0) Handle_EOF();

// Robust, but too much for code snippets
if (fgets(buf, sizeof buf, stdin) == NULL) {
    if (ferror(stdin)) Handle_IOError();
    else if (feof(stdin)) Handle_EOF();
    else if (sizeof buf <= 1) Handle_too_small_buffer(); // pedantic check
    else Hmmmmmmm();
}
Documentation of fgets from the C99 Standard (N1256)
7.19.7.2 The fgets function
Synopsis
#include <stdio.h>
char *fgets(char * restrict s, int n,
FILE * restrict stream);
Description
The fgets function reads at most one less than the number of characters specified by n
from the stream pointed to by stream into the array pointed to by s. No additional
characters are read after a new-line character (which is retained) or after end-of-file. A
null character is written immediately after the last character read into the array.
Coming to your post, you said:
An array of 16 char which stores a username (15 characters max, I reserve the last one for the NUL terminator). The question is: if I insert a 15 characters strings, then the '\n' would end up in the last cell of the array, but what about the NUL terminator?
For such a case, the newline character is not read until the next call to fgets or any other call to read from the stream.
does the '\0' get stored in the following block of memory? (no segmentation fault when calling the printf() function implies that the inserted string is actually NUL terminated, right?).
The terminating null character is always set. In your case, the 16-th character will be the terminating null character.
From the man page of fgets:
char *fgets(char *s, int size, FILE *stream);
fgets() reads in at most one less than size characters from stream and stores them into the buffer pointed to by s. Reading stops after an EOF or a newline. If a newline is read, it is stored into the buffer. A terminating null byte ('\0') is stored after the last character in the buffer.
I think that is pretty clear, isn't it?
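A quick way to convince yourself, assuming the 15-character input abcdefghijklmno followed by Enter (a throwaway demo, not from the original post):

#include <stdio.h>

int main(void) {
    char user[16];

    if (fgets(user, sizeof user, stdin)) {
        printf("got: \"%s\"\n", user);          /* "abcdefghijklmno", no '\n' */
        /* the newline is still pending in the stream: */
        printf("next char: %d\n", getchar());   /* prints 10, i.e. '\n' */
    }
    return 0;
}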

How do C's file I/O functions handle NUL characters?

Do the file input functions in standard C, like fgetc(), fgets() or fscanf(), have any problems with NUL ('\0') characters or treat them differently than other characters?
I was going to ask if I can use fgets() to read a line that may contain NUL characters, but I just realized that since that function NUL-terminates the input and doesn't return the length in any other way, it's worthless for that use anyway.
Can I use fgetc()/getc()/getchar() instead?
If what you're reading is actually text, then you're in somewhat of an awkward situation. fgets will read NULs just fine, store them in the buffer, and soldier on. Problem is, though, you've just read in what is no longer an NTBS (NUL-terminated byte string) as the C library typically expects, so most functions that expect a string will ignore everything after the first NUL. And you really don't have a reliable way to get the length, since fgets doesn't return it to you and strlen expects a C string. (You could conceivably zero out the buffer each time and look for the last non-NUL char in order to get the length, but for short strings in big buffers, that's kinda ugly.)
If you're dealing with binary, things are a lot simpler. You just fread and fwrite the data, and all's well. But if you want text with NULs in it, you're probably going to end up needing your own read-a-line function that returns the length.
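For the binary case, the loop really is that short. A sketch assuming fp was already opened with fopen(..., "rb"):

unsigned char buf[4096];
size_t n;

/* NUL bytes pass through fread()/fwrite() like any other byte */
while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
    fwrite(buf, 1, n, stdout);
}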
Opening the file in "TEXT" mode does not make NUL special: fgetc() and friends return NUL bytes like any other byte. What text mode does do is line-ending translation, and on some platforms (historically Windows) it treats a Ctrl-Z (0x1A) character as end-of-file. For data containing NUL characters you generally want binary mode anyway: the file can be open()ed, read() and close()d, or fopen()ed with "rb" and fread(). Look up these functions and binary I/O.
You can also query the size of the file using fstat() and then read the binary data (which may include NUL characters) in one go.
No, the input functions do not treat NUL differently than other characters. Since any which return an unknown number of characters use NUL termination, though, the easiest thing to do is to write your own, such as this:
#include <stdio.h>
#include <sys/types.h>   /* for ssize_t (POSIX) */

ssize_t myfgets(char *buffer, size_t buffSize, FILE *file) {
    ssize_t count = 0;
    int character = EOF;   /* stays EOF if buffSize is 0 and the loop never runs */

    while (count < (ssize_t)buffSize && (character = getc(file)) != EOF) {
        buffer[count] = character;
        ++count;
        if (character == '\n') break;
    }
    if (count == 0 && character == EOF) return EOF;
    return count;
}
This function is like fgets, except that it returns the number of characters read and does not NUL terminate the string. If you want the string to be NUL-terminated, change the first condition in the while loop to count < buffSize-1 and add buffer[count] = '\0'; just after the loop.
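For reference, the NUL-terminating variant described in the previous sentence would make the loop look like this (same assumptions as myfgets above):

while (count < (ssize_t)buffSize - 1 && (character = getc(file)) != EOF) {
    buffer[count] = character;
    ++count;
    if (character == '\n') break;
}
buffer[count] = '\0';   /* the reserved byte guarantees room for this */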

fscanf not scanning file

I am trying to scan a file using fscanf and put the string into a char array of size 20 as follows:
char buf[20];
fscanf(fp, "%s", buf);
The file fp currently contains: 1 + 23.
I am setting a pointer to the first element in buf as follows:
char *p;
p = buf;
Printing buf with printf("%s", buf) yields only 1. Incrementing p and printing also produces rubbish (p++; printf("%c", *p)).
What am I doing wrong with fscanf here? Why isn't it reading the whole string from the file?
fscanf (and related functions) with the format string "%s" will try to read as many characters as it can without including any whitespace; in this case it will find the first character (1) and store it, then hit the space and stop.
If you'd like to read the whole line at once, consider using fgets; it is also safer to use since you need to specify the size of your destination buffer as one of its arguments.
fgets will try to read at most length-of-buffer minus 1 characters (the last byte is saved for the trailing null byte); it will stop after reading that many characters, hitting a newline, or reaching the end of the file.
fgets (buf, 20, fp);
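To see the whitespace splitting concretely, a loop like the following would print each token of 1 + 23 in turn (the width in %19s protects buf the same way the fgets size argument does):

char buf[20];

while (fscanf(fp, "%19s", buf) == 1)
    printf("token: %s\n", buf);   /* prints "1", "+", "23" */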
Links to documentation
codecogs.com - scanf, fscanf and related functions - <stdio.h>
codecogs.com - fgets - <stdio.h>

File Handling question on C programming

I want to read line by line from a given input file, process each line (i.e. its words), and then move on to the next line...
So I am using fscanf(fptr, "%s", words) to read a word, and it should stop once it encounters the end of a line...
but that is not possible with fscanf, I guess... so please tell me what to do...
I should read all the words in the given line (i.e. until end of line is encountered), then move on to the next line, and repeat the same process.
Use fgets(). Yeah, link is to cplusplus, but it originates from c stdio.h.
You may also use sscanf() to read words from string, or just strtok() to separate them.
In response to comment: this behavior of fgets() (leaving \n in the string) allows you to determine whether the actual end of line was encountered. Note that fgets() may also read only part of the line from the file if the supplied buffer is not large enough. In your case, just check for \n at the end and remove it if you don't need it. Something like this:
// actually you'll get str contents from fgets()
char str[MAX_LEN] = "hello there\n";
size_t len = strlen(str);
if (len && str[len-1] == '\n') {
    str[len-1] = 0;
}
Simple as that.
If you are working on a system with the GNU extensions available there is something called getline (man 3 getline) which allows you to read a file on a line-by-line basis, and getline will allocate extra memory for you if needed. The manpage contains an example which I modified to split the line using strtok (man 3 strtok).
#define _GNU_SOURCE   /* for getline() on glibc */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* for strtok() */

int main(void)
{
    FILE *fp;
    char *line = NULL;
    size_t len = 0;
    ssize_t read;

    fp = fopen("/etc/motd", "r");
    if (fp == NULL)
    {
        printf("File open failed\n");
        return 0;
    }
    while ((read = getline(&line, &len, fp)) != -1) {
        // At this point we have a line held within 'line'
        printf("Line: %s", line);
        const char *delim = " \n";
        char *ptr = strtok(line, delim);
        while (ptr != NULL)
        {
            printf("Word: %s\n", ptr);
            ptr = strtok(NULL, delim);
        }
    }
    fclose(fp);
    if (line)
    {
        free(line);
    }
    return 0;
}
Given the buffering inherent in all the stdio functions, I would be tempted to read the stream character by character with getc(). A simple finite state machine can identify word boundaries, and line boundaries if needed. An advantage is the complete lack of buffers to overflow, aside from whatever buffer you collect the current word in if your further processing requires it.
You might want to do a quick benchmark comparing the time required to read a large file completely with getc() vs. fgets()...
If an outside constraint requires that the file really be read a line at a time (for instance, if you need to handle line-oriented input from a tty) then fgets() probably is your friend as other answers point out, but even then the getc() approach may be acceptable as long as the input stream is running in line-buffered mode which is common for stdin if stdin is on a tty.
Edit: To have control over the buffer on the input stream, you might need to call setbuf() or setvbuf() to force it to a buffered mode. If the input stream ends up unbuffered, then using an explicit buffer of some form will always be faster than getc() on a raw stream.
Best performance would probably use a buffer related to your disk I/O, at least two disk blocks in size and probably a lot more than that. Often, even that performance can be beaten by arranging the input to be a memory-mapped file and relying on the kernel's paging to read and fill the buffer as you process the file as if it were one giant string.
Regardless of the choice, if performance is going to matter then you will want to benchmark several approaches and pick the one that works best in your platform. And even then, the simplest expression of your problem may still be the best overall answer if it gets written, debugged and used.
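A minimal sketch of the kind of finite state machine meant above, splitting stdin into one word per output line with getc() (the state names are mine):

#include <stdio.h>
#include <ctype.h>

int main(void) {
    int c;
    enum { BETWEEN, IN_WORD } state = BETWEEN;

    while ((c = getc(stdin)) != EOF) {
        if (isspace(c)) {
            if (state == IN_WORD)
                putchar('\n');   /* a word just ended */
            state = BETWEEN;
        } else {
            putchar(c);          /* emit word characters as they come */
            state = IN_WORD;
        }
    }
    if (state == IN_WORD)
        putchar('\n');
    return 0;
}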
but this is not possible in fscanf,
It is, with a bit of wickedness ;)
Update: More clarification on evilness
but unfortunately a bit wrong. I assume [^\n]%*[^\n] should read [^\n]%*. Moreover, one should note that this approach will strip whitespaces from the lines. – dragonfly
Note that "%" xstr(MAXLINE) "[^\n]" reads at most MAXLINE characters, which can be anything except the newline character (i.e. \n). The second part of the specifier, %*[^\n], rejects anything (that's why the * character is there) if the line has more than MAXLINE characters, up to but NOT including the newline character. The newline character tells scanf to stop matching. What if we did as dragonfly suggested? The only problem is scanf would not know where to stop and would keep suppressing assignment until the next newline is hit (which is another match for the first part). Hence you would trail by one line of input when reporting.
What if you wanted to read in a loop? A little modification is required. We need to add a getchar() to consume the unmatched newline. Here's the code:
#include <stdio.h>

#define MAXLINE 255

/* stringify macros: these work only in pairs, so keep both */
#define str(x) #x
#define xstr(x) str(x)

int main() {
    char line[ MAXLINE + 1 ];
    /*
       Wickedness explained: we read from `stdin` to `line`.
       The format specifier is the only tricky part: We don't
       bite off more than we can chew -- hence the specification
       of maximum number of chars i.e. MAXLINE. However, this
       width has to go into a string, so we stringify it using
       macros. The careful reader will observe that once we have
       read MAXLINE characters we discard the rest up to and
       including a newline.
    */
    int n = fscanf(stdin, "%" xstr(MAXLINE) "[^\n]%*[^\n]", line);
    if (!feof(stdin)) {
        getchar();
    }
    while (n == 1) {
        printf("[line:] %s\n", line);
        n = fscanf(stdin, "%" xstr(MAXLINE) "[^\n]%*[^\n]", line);
        if (!feof(stdin)) {
            getchar();
        }
    }
    return 0;
}
