Good way to read a text file in C

I need to read a text file which may contain long lines of text. I am thinking of the best way to do this. Considering efficiency, even though I am doing this in C++, I would still choose C library functions to do the IO.
Because I don't know how long a line is, potentially really really long, I don't want to allocate a large array and then use fgets to read a line. On the other hand, I do need to know where each line ends. One use case of such is to count the words/chars in each line. I could allocate a small array and use fgets to read, and then determine whether there is \r, \n, or \r\n appearing in the line to tell whether a full line has been read. But this involves a lot of strstr calls (for \r\n, or there are better ways? for example from the return value of fgets?). I could also do fgetc to read each individual char one at a time. But does this function have buffering?
Please suggest and compare these or other ways of doing this task.

The correct way to do I/O depends on what you're going to do with the data. If you're counting words, line-based input doesn't make much sense. A more natural approach is to use fgetc and deal with a character at a time and let stdio worry about the buffering. Only if you need the whole line in memory at the same time to process it should you actually allocate a buffer big enough to contain it all.
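A minimal sketch of that character-at-a-time approach, assuming the goal is the per-line word and character count from the question. The helper name `count_line` and its return convention are hypothetical, and skipping `\r` is only one possible way to tolerate `\r\n` line endings.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: reads one line from fp a character at a time and
   reports its character and word counts. Returns 1 if a line (possibly
   empty) was read, 0 if end of file was reached with nothing read. */
int count_line(FILE *fp, long *chars, long *words)
{
    int c, in_word = 0, got_any = 0;

    *chars = 0;
    *words = 0;
    while ((c = getc(fp)) != EOF) {
        got_any = 1;
        if (c == '\n')
            break;          /* end of line; the '\n' itself is not counted */
        if (c == '\r')
            continue;       /* assumption: ignore '\r' so "\r\n" counts like "\n" */
        ++*chars;
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;    /* first character of a new word */
            ++*words;
        }
    }
    return got_any;
}
```

No line buffer is needed, so line length is irrelevant; stdio's own buffer keeps the number of underlying reads small.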

Related

fgetc vs getline or fgets - which is most flexible

I am reading data from a regular file and I was wondering which would allow for the most flexibility.
I have found that fgets and getline both read in a line (one with a maximum number of characters, the other with dynamic memory allocation). In the case of fgets, if the line is longer than the given size, the rest of the line is not read but remains buffered in the stream. With getline, I am worried that it may attempt to allocate a large block of memory for an obscenely long line.
The obvious solution for me seems to be turning to fgetc, but this comes with the problem that there will be many calls to the function, thereby resulting in the read process being slow.
Is this compromise between flexibility and efficiency unavoidable in either case, or can it be worked around?
The three functions you mention do different things:
fgetc() reads a single character from a FILE * stream. Input is buffered, so you can process the file without the overhead of a system call for each character. When your problem can be handled in a character-oriented way, it is the best choice.
fgets() reads a single line from a FILE * stream; it's like calling fgetc() to fill the character array you pass to it, so you can read line by line. It has the drawback of making a partial read when the input line is longer than the buffer size you specify. This function also buffers input data, so it is very efficient. If you know that your lines are bounded, this is the best way to read your data line by line. Sometimes you need to handle lines of unbounded length, and you must redesign your solution around the available memory. Then the function below is probably the better choice.
getline() is relatively new and is not ANSI C, so you may port your program to some platform that lacks it. It's the most flexible, at the price of being the least efficient. It takes a pointer to a buffer pointer, which it realloc()s as more data arrives. It does not bound the line length, so a single line can consume all the memory available on the system. Both the buffer pointer and the buffer size are passed by reference so they can be updated, which tells you where the string now lives and how large the buffer is. The buffer must be free()d after use.
The reason for having three functions rather than one is that different cases have different needs, and selecting the most efficient one is normally the best choice.
If you plan to use only one, you will probably end up in a situation where the function you selected as the most flexible is not the best choice, and you will probably fail.
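As a rough illustration of the getline() calling convention described above, here is a sketch that finds the longest line in a stream. The helper name `longest_line` is hypothetical, and the code assumes a POSIX.1-2008 system where getline() is available.

```c
#define _POSIX_C_SOURCE 200809L  /* getline() is POSIX.1-2008, not ISO C */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: returns the length of the longest line in fp,
   not counting the trailing newline. */
size_t longest_line(FILE *fp)
{
    char *line = NULL;   /* getline() allocates and grows this buffer */
    size_t cap = 0;      /* buffer capacity, updated by getline() */
    size_t best = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, fp)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            --len;       /* do not count the newline */
        if ((size_t)len > best)
            best = (size_t)len;
    }
    free(line);          /* the same buffer is reused, so free it once */
    return best;
}
```

The buffer is reused across iterations, so the allocation cost is paid only when a line is longer than any seen before.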
Much is case dependent.
getline() is not part of the standard C library. Its functionality may differ - depends on the implementation and what other standards it follows - thus an advantage for the standard fgetc()/fgets().
... case between flexibility and efficiency unavoidable, ...
OP is missing the higher priorities.
Functionality - If code cannot function right with the selected function, why use it? Example: fgets() and reading null characters create issues.
Clarity - without clarity, feel the wrath of the poor soul who later has to maintain the code.
would allow for the most flexibility. (?)
fgetc() allows for the most flexibility at the low level - yet helper functions using it to read lines tend to fail corner cases.
fgets() allows for the most flexibility at mid level - still have to deal with long lines and those with embedded null characters, but at least the low level of slogging in the weeds is avoided.
getline() useful when high portability not needed and risks of allowing the user to overwhelm resources is not a concern.
For robust handling of user/file input when reading a line, create a wrapping function (e.g. int my_read_line(size_t size, char *buf, FILE *f)) and call that and only that in user code. Then when issues arise, they can be handled locally, regardless of the low-level input function selected.
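One possible shape for such a wrapper is sketched below. The first parameter is named `size` here, and the return convention (1 for a complete line, 0 for a truncated one, EOF when nothing was read) is an assumption for illustration, not something the answer specifies. This version happens to use getc(), and it consumes the rest of an over-long line so that the next call starts on a fresh line.

```c
#include <stdio.h>

/* Sketch of a line-reading wrapper. Assumed return values:
   1   a complete line was stored in buf (without the '\n'),
   0   the line was longer than size - 1 and was truncated,
   EOF end of file or error before any character was read. */
int my_read_line(size_t size, char *buf, FILE *f)
{
    size_t n = 0;
    int c, truncated = 0;

    if (buf == NULL || size == 0)
        return EOF;
    while ((c = getc(f)) != EOF && c != '\n') {
        if (n + 1 < size)
            buf[n++] = (char)c;
        else
            truncated = 1;   /* keep consuming the rest of the long line */
    }
    buf[n] = '\0';
    if (c == EOF && n == 0 && !truncated)
        return EOF;
    return truncated ? 0 : 1;
}
```

Because user code only ever sees this one function, the policy for long lines can later change (grow the buffer, report an error) without touching the callers.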

How to buffer a line in a file by using System Calls in C?

Here is my approach:
int linesize = 1;
int ReadStatus;
char buff[200];
ReadStatus = read(file, buff, linesize);
while (buff[linesize-1] != '\n' && ReadStatus != 0)
{
    linesize++;
    ReadStatus = read(file, buff, linesize);
}
Is this idea right?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
What do you think?
....
I just saw a new problem. How do we read the second line?. Is there anyway to delimit the bytes?
Is this idea right?
No. At the heart of a comment written by Siguza, lies the summary of an issue:
1) read doesn't read lines, it just reads bytes. There's no reason buff should end with \n.
Additionally, there's no reason buff shouldn't contain multiple newline characters, and as there's no [posix] tag here there's no reason to suggest what read does, let alone whether it's a syscall. Assuming you're referring to the POSIX function, there's no error handling. Where's your logic to handle the return value/s reserved for errors?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
Providing you fix the issues mentioned above (more on that later), if you were to test this theory, you'd likely find, also at the heart of the comment by Siguza,
Disks usually work on a 512-byte basis and file system caches and even CPU/memory caches are a lot larger than that.
To an extent, you can expect your idea to approach O(log n), but your bottleneck will be one of those cache lines (likely the one closest to your keyboard/the filesystem/whatever is feeding the stream with information). At that point, you should stop guzzling memory which other programs might need because your optimisation becomes less and less effective.
What do you think?
I think you should just STOP! You're guessing!
Once you've written your program, decide whether or not it's too slow. If it's not too slow, it doesn't need optimisation, and you probably won't shave enough nanoseconds to make optimisation worthwhile.
If it is too slow, then you should:
Use a profiler to determine what the most significant bottleneck is,
apply optimisations based on what your profiler tells you, then
use your profiler again, with the same inputs as before, to measure the effect your optimisation had.
If you don't use a profiler, your guess-work could result in slower code, or you might miss opportunities for more significant optimisations...
How do we read the second line?
Naturally, it makes sense to read character by character, rather than two hundred characters at a time, because there's no other way to stop reading the moment you reach a line terminating character.
Is there anyway to delimit the bytes?
Yes. The most sensible tools to use are provided by the C standard, and syscalls are managed automatically to be most efficient based on configurations decided by the standard library devs (who are most likely better at this than you are). Those tools are:
fgets to attempt to read a line (by reading one character at a time), up to a threshold (the size of your buffer). You get to decide how large a line should be, because it's more often the case that you won't expect a user/program to input huge lines.
strchr or strcspn to detect newlines from within your buffer, in order to determine whether you read a complete line.
scanf("%*[^\n]"); to discard the remainder of an incomplete line, when you detect those.
realloc to reallocate your buffer, if you decide you want to resize it and call fgets a second time to retrieve more data rather than discarding the remainder. Note: this will have an effect on the runtime complexity of your code, not that I think you should care about that...
Other options are available for the first three. You could use fgetc (or even read one character at a time) like I did at the end of this answer, for example...
In fact, that answer is highly relevant to your question, as it does make an attempt to exponentially increase the size. I wrote another example of this here.
It should be pointed out that the reason to address these problems is not so much optimisation as the need to read a large chunk of data whose size varies. Remember, if you haven't yet written the code, it's likely you won't know whether it's worthwhile optimising it!
Suffice to say, it isn't the read function you should try to reduce your dependence upon, but the malloc/realloc/calloc function... That's the real kicker! If you don't absolutely need to store the entire line, then don't!
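For the case where the whole line really is needed, the fgets, strchr and realloc combination listed above might be sketched as follows. The helper name `read_line_fgets`, the 16-byte starting size and the doubling policy are assumptions for illustration; lines containing embedded null characters are not handled correctly, because strlen() and strchr() stop at the first one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc()ed line without its newline,
   or NULL at end of file or on allocation failure. The caller frees it. */
char *read_line_fgets(FILE *fp)
{
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    while (fgets(buf + len, (int)(cap - len), fp) != NULL) {
        char *nl = strchr(buf + len, '\n');
        if (nl != NULL) {          /* newline found: the line is complete */
            *nl = '\0';
            return buf;
        }
        len += strlen(buf + len);
        if (len + 1 < cap)         /* buffer not full: last line had no '\n' */
            return buf;
        char *tmp = realloc(buf, cap * 2);   /* full: double and read more */
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }
    if (len > 0)                   /* EOF right after a completely full buffer */
        return buf;
    free(buf);
    return NULL;
}
```

Doubling keeps the number of realloc() calls logarithmic in the line length, which is the exponential-growth idea from the question applied to the buffer rather than to read().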

How do I fscanf data within range only, instead of saving the whole data?

Basically I have rows and columns of data. I know I can use fgets to read line by line and then tokenise the line with strtok. After that I can check if the last 2 tokens/values are within range using atoi(), and if they are, I can then store them into an array. However, I heard strtok is a bad way to do things, and fscanf seems a much cleaner approach. The problems with fscanf is that if I use it I will have to store all the values in arrays first, and then check which values are within range. Since I have a lot of rows, I don't know how big the arrays should be and it would waste a lot of space. Is there a way to fscanf with if statements?
I don't know if it's a stupid question, thanks.
Whatever you waste it will never exceed the line length. Since yours is clearly a text format, a line length of 1024 is pretty typically the limit¹.
So, in worst case you require (several?) kilobytes of memory to parse each line. You can reuse that buffer and ignore the uninteresting values.
Of course, you can write your own parser and be more memory efficient.
UPDATE
There's also this: scanf:
(optional) assignment-suppressing character *. If this option is present, the function does not assign the result of the conversion to any receiving argument.
¹ it may vary of course
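A small sketch of how assignment suppression combines with a range check, so that only interesting values are ever stored. The row layout (two leading columns to skip, then two integers) and the helper name `parse_in_range` are assumptions, since the question does not give the actual format.

```c
#include <stdio.h>

/* Assumed layout: "<field> <field> <int> <int>"; only the last two matter.
   Returns 1 and stores the values if both lie in [lo, hi], else 0. */
int parse_in_range(const char *line, int lo, int hi, int *x, int *y)
{
    int a, b;

    /* %*s reads a field but stores nothing; suppressed conversions are
       not included in the return value, so success here means 2. */
    if (sscanf(line, "%*s %*s %d %d", &a, &b) != 2)
        return 0;
    if (a < lo || a > hi || b < lo || b > hi)
        return 0;
    *x = a;
    *y = b;
    return 1;
}
```

Called on each line obtained from fgets, this needs one reusable line buffer plus storage for the rows that pass the check.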

C fgets versus fgetc for reading line

I need to read a line of text (terminated by a newline) without making assumptions about the length. So I now face two possibilities:
Use fgets and check each time if the last character is a newline and continuously append to a buffer
Read each character using fgetc and occasionally realloc the buffer
Intuition tells me the fgetc variant might be slower, but then again I don't see how fgets can do it without examining every character (also my intuition isn't always that good). The lines are quite large so the performance is important.
I would like to know the pros and cons of each approach. Thank you in advance.
I suggest using fgets() coupled with dynamic memory allocation - or you can investigate the interface to getline() that is in the POSIX 2008 standard and available on more recent Linux machines. That does the memory allocation stuff for you. You need to keep tabs on the buffer length as well as its address - so you might even create yourself a structure to handle the information.
Although fgetc() also works, it is marginally fiddlier - but only marginally so. Underneath the covers, it uses the same mechanisms as fgets(). The internals may be able to exploit speedier operation - analogous to strchr() - that are not available when you call fgetc() directly.
Does your environment provide the getline(3) function? If so, I'd say go for that.
The big advantage I see is that it allocates the buffer itself (if you want), and will realloc() the buffer you pass in if it's too small. (So this means you need to pass in something gotten from malloc()).
This gets rid of some of the pain of fgets/fgetc, and you can hope that whoever wrote the C library that implements it took care of making it efficient.
Bonus: the man page on Linux has a nice example of how to use it in an efficient manner.
If performance matters much to you, you generally want to call getc instead of fgetc. The standard tries to make it easier to implement getc as a macro to avoid function call overhead.
Past that, the main thing to deal with is probably your strategy in allocating the buffer. Most people use fixed increments (e.g., when/if we run out of space, allocate another 128 bytes). I'd advise instead using a constant factor, so if you run out of space allocate a buffer that's, say, 1 1/2 times the previous size.
Especially when getc is implemented as a macro, the difference between getc and fgets is usually quite minimal, so you're best off concentrating on other issues.
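The getc-plus-constant-factor strategy could look roughly like this. The helper name `read_line_getc` and the 16-byte starting capacity are assumptions; the factor of 1.5 follows the suggestion above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads one line with getc(), growing the buffer by a
   factor of 1.5 when it fills. Returns a malloc()ed string without the
   newline, or NULL at end of file or on allocation failure. */
char *read_line_getc(FILE *fp)
{
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;
    while ((c = getc(fp)) != EOF && c != '\n') {
        if (len + 1 == cap) {                 /* keep room for the '\0' */
            size_t new_cap = cap + cap / 2;   /* constant factor, not +128 */
            char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        buf[len++] = (char)c;
    }
    if (c == EOF && len == 0) {               /* nothing left to read */
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}
```

With a constant growth factor the total copying done by realloc() stays proportional to the final line length, whereas fixed increments make it grow quadratically for very long lines.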
If you can set a maximum line length, even a large one, then one fgets would do the trick. If not, multiple fgets calls will still be faster than multiple fgetc calls because the overhead of the latter will be greater.
A better answer, though, is that it's not worth worrying about the performance difference until and unless you have to. If fgetc is fast enough, what does it matter?
I would allocate a large buffer and then use fgets, checking, reallocing and repeating if you haven't read to the end of the line.
Each call to fgetc or fgets carries some overhead (a function call and, whenever the stdio buffer runs empty, a system call), so you want to minimize the number of calls; calling fgets fewer times and iterating in memory is faster.
If you are reading from a file, mmap()ing in the file is another option.

When/why is it a bad idea to use the fscanf() function?

In an answer there was an interesting statement: "It's almost always a bad idea to use the fscanf() function as it can leave your file pointer in an unknown location on failure. I prefer to use fgets() to get each line in and then sscanf() that."
Could you expand upon when/why it might be better to use fgets() and sscanf() to read some file?
Imagine a file with three lines:
1
2b
c
Using fscanf() to read integers, the first line would read fine but on the second line fscanf() would leave you at the 'b', not sure what to do from there. You would need some mechanism to move past the garbage input to see the third line.
If you do a fgets() and sscanf(), you can guarantee that your file pointer moves a line at a time, which is a little easier to deal with. In general, you should still be looking at the whole string to report any odd characters in it.
I prefer the latter approach myself, although I wouldn't agree with the statement that "it's almost always a bad idea to use fscanf()"... fscanf() is perfectly fine for most things.
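The fgets() plus sscanf() pattern for the three-line file above might be sketched like this. The helper name `count_clean_ints` and the 256-byte line buffer are assumptions; the trailing `%c` is one way to notice odd characters after the number.

```c
#include <stdio.h>

/* Hypothetical helper: counts lines holding exactly one integer and
   reports how many lines were rejected. Assumes lines under 256 bytes. */
int count_clean_ints(FILE *fp, int *bad_lines)
{
    char line[256];
    int good = 0;

    *bad_lines = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        int value;
        char junk;
        /* Result 1 means an integer and nothing but whitespace after it;
           2 means trailing garbage ("2b"); 0 or EOF means no integer. */
        if (sscanf(line, "%d %c", &value, &junk) == 1)
            ++good;
        else
            ++*bad_lines;
    }
    return good;
}
```

Whatever a line contains, the stream position always advances by exactly one line, so a bad line never affects the parsing of the next one.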
The case where this comes into play is when you match character literals. Suppose you have:
int n = fscanf(fp, "%d,%d", &i1, &i2);
Consider two possible inputs "323,A424" and "323A424".
In both cases fscanf() will return 1 and the next character read will be an 'A'. There is no way to determine if the comma was matched or not.
That being said, this only matters if finding the actual source of the error is important. In cases where knowing there is malformed input error is enough, fscanf() is actually superior to writing custom parsing code.
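The ambiguity can be reproduced with sscanf(), which follows the same matching rules. The sketch below also shows one standard workaround when the distinction matters: a `%n` directive placed after the comma records how much input had been consumed at that point, and is only reached if the comma matched. The helper name `parse_pair` and its `comma_seen` parameter are hypothetical.

```c
#include <stdio.h>

/* Returns the sscanf() result. *comma_seen is set to 1 only if the
   literal ',' in the format was matched. */
int parse_pair(const char *input, int *i1, int *i2, int *comma_seen)
{
    int after_comma = -1;   /* stays -1 unless %n is reached */
    int n = sscanf(input, "%d,%n%d", i1, &after_comma, i2);

    *comma_seen = (after_comma >= 0);
    return n;               /* %n does not add to the conversion count */
}
```

For both "323,A424" and "323A424" the return value is 1; only `comma_seen` tells them apart.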
When fscanf() fails, due to an input failure or a matching failure, the file pointer (that is, the position in the file from which the next byte will be read) is left in a position other than where it would be had the fscanf() succeeded. This is typically undesirable in sequential file reads. Reading one line at a time results in the file input being predictable, while single line failures can be handled individually.
There are two reasons:
scanf() can leave stdin in a state that's difficult to predict; this makes error recovery difficult if not impossible (this is less of a problem with fscanf()); and
The entire scanf() family take pointers as arguments, but no length limit, so they can overrun a buffer and alter unrelated variables that happen to be after the buffer, causing seemingly random memory corruption errors that are very difficult to understand, find, and debug, particularly for less experienced C programmers.
Novice C programmers are often confused about pointers and the “address-of” operator, and frequently omit the & where it's needed, or add it “for good measure” where it's not. This causes “random” segfaults that can be hard for them to find. This isn't scanf()'s fault, so I leave it off my list, but it is worth bearing in mind.
After 23 years, I still remember it being a huge pain when I started C programming and didn't know how to recognize and debug these kinds of errors, and (as someone who spent years teaching C to beginners) it's very hard to explain them to a novice who doesn't yet understand pointers and stack.
Anyone who recommends scanf() to a novice C programmer should be flogged mercilessly.
OK, maybe not mercilessly, but some kind of flogging is definitely in order ;o)
It's almost always a bad idea to use the fscanf() function as it can leave your file pointer in an unknown location on failure. I prefer to use fgets() to get each line in and then sscanf() that.
You can always use ftell() to find out the current position in the file, and then decide what to do from there. Basically, if you know what you can expect then feel free to use fscanf().
Basically, there's no way to tell that function not to go out of bounds for the memory area you've allocated for it.
A number of replacements have come out, like fnscanf, which is an attempt to fix those functions by specifying a maximum limit for the reader to write, thus allowing it to not overflow.
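For string conversions specifically, standard C does offer a partial safeguard: a maximum field width in the conversion specification caps how many characters are stored. A minimal sketch, with a hypothetical helper name and an arbitrary 16-byte buffer:

```c
#include <stdio.h>

/* Copies the first whitespace-delimited word of line into out.
   Returns 1 on success, 0 if the line contains no word. */
int first_word(const char *line, char out[16])
{
    /* %15s stores at most 15 characters plus the terminating '\0',
       so a 16-byte buffer cannot be overrun. */
    return sscanf(line, "%15s", out) == 1;
}
```

The width has to be written into the format string and kept in sync with the buffer size by hand, which is easy to get wrong.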

Resources