fgetc vs getline or fgets - which is most flexible - c

I am reading data from a regular file and I was wondering which would allow for the most flexibility.
I have found that fgets and getline both read in a line (one with a maximum number of characters, the other with dynamic memory allocation). In the case of fgets, if the length of the line is bigger than the given size, the rest of the line would not be read but remain buffered in the stream. With getline, I am worried that it may attempt to allocate a large block of memory for an obscenely long line.
The obvious solution for me seems to be turning to fgetc, but this comes with the problem that there will be many calls to the function, thereby resulting in the read process being slow.
Is this compromise in either case between flexibility and efficiency unavoidable, or can it be worked through?

The three functions you mention do different things:
fgetc() reads a single character from a FILE * stream. The stream is buffered, so you can process the file character by character without the overhead of making a system call for each one. When your problem can be handled in a character-oriented way, it is the best choice.
fgets() reads a single line from a FILE * stream; it is like calling fgetc() repeatedly to fill the character array you pass to it, in order to read line by line. It has the drawback of making a partial read when your input line is longer than the buffer size you specify. Input is also buffered, so it is very efficient. If you know that your lines will be bounded, this is the best way to read your data line by line. Sometimes you need to process lines of unbounded size and must redesign your problem around the available memory; then the function below is probably the better choice.
getline() is relatively new and is not ANSI C (it is POSIX), so it is possible you will port your program to some platform that lacks it. It is the most flexible, at the price of being the least efficient. It requires a reference to a pointer that is realloc()ated as needed to hold more and more data. It does not bound the line length, at the cost of possibly exhausting all the memory available on the system for a pathological input. Both the buffer pointer and the buffer size are passed by reference so they can be updated, and you therefore know where the new string is located and its new size. The buffer must be free()d after use.
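For illustration, a minimal sketch of the calling convention (assuming a POSIX-2008 system where getline() is declared in <stdio.h>) that echoes stdin line by line:

#define _POSIX_C_SOURCE 200809L   /* ask for getline() on glibc and friends */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *line = NULL;   /* getline() allocates and grows this buffer for us */
    size_t cap = 0;      /* current capacity of the buffer, updated by getline() */
    ssize_t len;         /* characters read (including the '\n'), or -1 at EOF */

    while ((len = getline(&line, &cap, stdin)) != -1)
        printf("%ld bytes: %s", (long)len, line);

    free(line);          /* freed once, after use, as noted above */
    return 0;
}

The same buffer is reused (and grown only when needed) across calls, which is part of why it stays reasonably efficient despite the dynamic allocation.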
The reason for having three functions and not only one is that different cases have different needs, and selecting the most efficient one for the case at hand is normally the best approach.
If you plan to use only one, you will probably end up in a situation where the function you selected as the most flexible is not the best choice, and your approach will probably fail.

Much is case dependent.
getline() is not part of the standard C library. Its functionality may differ - depending on the implementation and what other standards it follows - thus an advantage for the standard fgetc()/fgets().
... case between flexibility and efficiency unavoidable, ...
OP is missing the higher priorities.
Functionality - If code cannot function right with the selected function, why use it? Example: fgets() and reading null characters create issues.
Clarity - without clarity, feel the wrath of the poor soul who later has to maintain the code.
would allow for the most flexibility. (?)
fgetc() allows for the most flexibility at the low level - yet helper functions using it to read lines tend to fail corner cases.
fgets() allows for the most flexibility at mid level - still have to deal with long lines and those with embedded null characters, but at least the low level of slogging in the weeds is avoided.
getline() useful when high portability is not needed and the risk of allowing the user to overwhelm resources is not a concern.
For robust handling of user/file input to read a line, create a wrapping function (e.g. int my_read_line(size_t size, char *buf, FILE *f)) and call that and only that in user code. Then when issues arise, they can be handled locally, regardless of the low-level input function selected.
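One possible sketch of such a wrapper, built on fgets(); the return-code convention here (1 = line read, 0 = end of file, -1 = line truncated) is an assumption made for the example:

#include <stdio.h>
#include <string.h>

/* Illustrative wrapper: reads one line into buf, stripping the newline.
 * Returns 1 on success, 0 at end of file, -1 if the line was too long
 * (in which case the rest of the line is read and discarded). */
int my_read_line(size_t size, char *buf, FILE *f)
{
    if (size == 0 || fgets(buf, (int)size, f) == NULL)
        return 0;                        /* EOF or read error */

    size_t len = strcspn(buf, "\n");
    if (buf[len] == '\n') {
        buf[len] = '\0';                 /* complete line: strip the newline */
        return 1;
    }
    if (feof(f))
        return 1;                        /* last line had no trailing newline */

    int c;                               /* overlong line: discard the rest */
    while ((c = fgetc(f)) != EOF && c != '\n')
        ;
    return -1;
}

User code then calls my_read_line() and nothing else, so decisions about long lines (or a switch to another low-level input function) stay in one place.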

Related

what's more efficient: reading from a file or allocating memory

I have a text file and I should allocate an array with as many entries as the number of lines in the file. What's more efficient: to read the file twice (first to find out the number of lines) and allocate the array once, or to read the file once, and use "realloc" after each line read? thank you in advance.
Reading the file twice is a bad idea, regardless of efficiency. (It's also almost certainly less efficient.)
If your application insists on reading its input twice, that means its input must be rewindable, which excludes terminal input and pipes. That's a limitation so annoying that apps which really need to read their input more than once (like sort) generally have logic to make a temporary copy if the input is unseekable.
In this case, you are only trying to avoid the trivial overhead of a few extra malloc calls. That's not justification to limit the application's input options.
If that's not convincing enough, imagine what will happen if someone appends to the file between the first time you read it and the second time. If your implementation trusts the count it got on the first read, it will overrun the vector of line pointers on the second read, leading to Undefined Behaviour and a potential security vulnerability.
I presume you want to store the read lines also and not just allocate an array of that many entries.
Also, that you don't want to change the lines and then write them back, as in that case you might be better off using mmap.
Reading a file twice is always bad, even if it is cached the second time; too many system calls are needed. Also, allocating every line separately is a waste of time if you don't need to deallocate them in a random order.
Instead, read the entire file at once into an allocated buffer.
Find the number of lines by finding the line feeds.
Allocate an array of line pointers.
Put the start pointers into the array by finding the same line feeds again.
If you need the lines as strings, replace each line feed with \0.
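A rough sketch of those steps (assuming a seekable file that fits in memory; error handling is omitted, and a final line with no trailing line feed is ignored):

#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into one buffer, then build an array of pointers to
 * the start of each line, terminating each line in place with '\0'. */
char **index_lines(FILE *f, char **data_out, size_t *nlines)
{
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    char *data = malloc((size_t)size + 1);
    fread(data, 1, (size_t)size, f);
    data[size] = '\0';

    /* Pass 1: count the line feeds. */
    size_t count = 0;
    for (long i = 0; i < size; i++)
        if (data[i] == '\n')
            count++;

    /* Pass 2: record each line start and replace the line feed with '\0'. */
    char **lines = malloc(count * sizeof *lines);
    size_t n = 0;
    char *start = data;
    for (long i = 0; i < size; i++) {
        if (data[i] == '\n') {
            data[i] = '\0';
            lines[n++] = start;
            start = &data[i] + 1;
        }
    }

    *data_out = data;
    *nlines = count;
    return lines;
}

The caller eventually frees both the returned pointer array and the data buffer.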
This might also be improved upon on modern CPU architectures: instead of scanning the data twice, it might be faster to simply allocate a "large enough" array for the pointers and scan the data once. This will cause one realloc at the end to trim the array to the right size, and potentially a couple more along the way to enlarge it if it wasn't large enough at the start.
Why is this faster? Because you have a lot of ifs that can take a lot of time for each line, so it's better to only have to do that scan once; the cost is the reallocation, but copying large arrays with memcpy can be a bit cheaper.
But you have to measure it; your system settings, buffer sizes, etc. will influence things too.
The answer to "What's more efficient/faster/better? ..." is always:
Try each one on the system you're going to use it on, measure your results accurately, and find out.
The term is "benchmarking".
Anything else is a guess.

How to buffer a line in a file by using System Calls in C?

Here is my approach:
int linesize = 1;
int ReadStatus;
char buff[200];
ReadStatus = read(file, buff, linesize);
while (buff[linesize - 1] != '\n' && ReadStatus != 0)
{
    linesize++;
    ReadStatus = read(file, buff, linesize);
}
Is this idea right?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
What do you think?
....
I just saw a new problem. How do we read the second line? Is there any way to delimit the bytes?
Is this idea right?
No. At the heart of a comment written by Siguza lies the summary of an issue:
1) read doesn't read lines, it just reads bytes. There's no reason buff should end with \n.
Additionally, there's no reason buff shouldn't contain multiple newline characters, and as there's no [posix] tag here there's no reason to suggest what read does, let alone whether it's a syscall. Assuming you're referring to the POSIX function, there's no error handling. Where's your logic to handle the return value/s reserved for errors?
I think my code is a bit inefficient because the run time is O(FileWidth); however I think it can be O(log(FileWidth)) if we exponentially increase linesize to find the linefeed character.
Providing you fix the issues mentioned above (more on that later), if you were to test this theory, you'd likely find, also at the heart of the comment by Siguza,
Disks usually work on a 512-byte basis and file system caches and even CPU/memory caches are a lot larger than that.
To an extent, you can expect your idea to approach O(log n), but your bottleneck will be one of those cache lines (likely the one closest to your keyboard/the filesystem/whatever is feeding the stream with information). At that point, you should stop guzzling memory which other programs might need because your optimisation becomes less and less effective.
What do you think?
I think you should just STOP! You're guessing!
Once you've written your program, decide whether or not it's too slow. If it's not too slow, it doesn't need optimisation, and you probably won't shave enough nanoseconds to make optimisation worthwhile.
If it is too slow, then you should:
Use a profiler to determine what the most significant bottleneck is,
apply optimisations based on what your profiler tells you, then
use your profiler again, with the same inputs as before, to measure the effect your optimisation had.
If you don't use a profiler, your guess-work could result in slower code, or you might miss opportunities for more significant optimisations...
How do we read the second line?
Naturally, it makes sense to read character by character, rather than two hundred characters at a time, because there's no other way to stop reading the moment you reach a line terminating character.
Is there any way to delimit the bytes?
Yes. The most sensible tools to use are provided by the C standard, and syscalls are managed automatically to be most efficient based on configurations decided by the standard library devs (who are very likely better at this than you are). Those tools are:
fgets to attempt to read a line (by reading one character at a time), up to a threshold (the size of your buffer). You get to decide how large a line should be, because it's more often the case that you won't expect a user/program to input huge lines.
strchr or strcspn to detect newlines from within your buffer, in order to determine whether you read a complete line.
scanf("%*[^\n]"); to discard the remainder of an incomplete line, when you detect those.
realloc to reallocate your buffer, if you decide you want to resize it and call fgets a second time to retrieve more data rather than discarding the remainder. Note: this will have an effect on the runtime complexity of your code, not that I think you should care about that...
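A small sketch combining those tools, reading stdin into a fixed 200-byte buffer (an arbitrary bound for the example) and discarding the tail of any overlong line:

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[200];                         /* arbitrary line-length bound */

    while (fgets(buf, sizeof buf, stdin) != NULL) {
        size_t len = strcspn(buf, "\n");   /* index of the newline, if any */
        int complete = (buf[len] == '\n') || feof(stdin);
        buf[len] = '\0';                   /* strip the newline / terminate */

        if (!complete) {
            scanf("%*[^\n]");              /* discard the rest of the long line */
            getchar();                     /* and the newline itself, if present */
        }

        printf("%s line: %s\n", complete ? "complete" : "truncated", buf);
    }
    return 0;
}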
Other options are available for the first three. You could use fgetc (or even read one character at a time) like I did at the end of this answer, for example...
In fact, that answer is highly relevant to your question, as it does make an attempt to exponentially increase the size. I wrote another example of this here.
It should be pointed out that the reason to address these problems is not so much optimisation, but the need to read a large, yet variable-size chunk of memory. Remember, if you haven't yet written the code, it's likely you won't know whether it's worthwhile optimising it!
Suffice to say, it isn't the read function you should try to reduce your dependence upon, but the malloc/realloc/calloc function... That's the real kicker! If you don't absolutely need to store the entire line, then don't!

good way to read text file in C

I need to read a text file which may contain long lines of text. I am thinking of the best way to do this. Considering efficiency, even though I am doing this in C++, I would still choose C library functions to do the IO.
Because I don't know how long a line is, potentially really really long, I don't want to allocate a large array and then use fgets to read a line. On the other hand, I do need to know where each line ends. One use case of such is to count the words/chars in each line. I could allocate a small array and use fgets to read, and then determine whether there is \r, \n, or \r\n appearing in the line to tell whether a full line has been read. But this involves a lot of strstr calls (for \r\n, or there are better ways? for example from the return value of fgets?). I could also do fgetc to read each individual char one at a time. But does this function have buffering?
Please suggest and compare these or other ways of doing this task.
The correct way to do I/O depends on what you're going to do with the data. If you're counting words, line-based input doesn't make much sense. A more natural approach is to use fgetc and deal with a character at a time and let stdio worry about the buffering. Only if you need the whole line in memory at the same time to process it should you actually allocate a buffer big enough to contain it all.
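For instance, a per-line character and word counter in that style might be sketched as follows (words are taken to be whitespace-separated, which is an assumption for the example; stdio does the buffering behind fgetc):

#include <stdio.h>
#include <ctype.h>

int main(void)
{
    int c, in_word = 0;
    long chars = 0, words = 0, line = 1;

    while ((c = fgetc(stdin)) != EOF) {
        if (c == '\n') {                   /* end of line: report and reset */
            printf("line %ld: %ld chars, %ld words\n", line++, chars, words);
            chars = words = 0;
            in_word = 0;
            continue;
        }
        chars++;
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {             /* start of a new word */
            in_word = 1;
            words++;
        }
    }
    /* A final line without a trailing newline is not reported in this sketch. */
    return 0;
}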

C fgets versus fgetc for reading line

I need to read a line of text (terminated by a newline) without making assumptions about the length. So I now face two possibilities:
Use fgets and check each time if the last character is a newline and continuously append to a buffer
Read each character using fgetc and occasionally realloc the buffer
Intuition tells me the fgetc variant might be slower, but then again I don't see how fgets can do it without examining every character (also my intuition isn't always that good). The lines are quite large so the performance is important.
I would like to know the pros and cons of each approach. Thank you in advance.
I suggest using fgets() coupled with dynamic memory allocation - or you can investigate the interface to getline() that is in the POSIX 2008 standard and available on more recent Linux machines. That does the memory allocation stuff for you. You need to keep tabs on the buffer length as well as its address - so you might even create yourself a structure to handle the information.
Although fgetc() also works, it is marginally fiddlier - but only marginally so. Underneath the covers, it uses the same mechanisms as fgets(). The internals may be able to exploit speedier operations - analogous to strchr() - that are not available when you call fgetc() directly.
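A sketch of that fgets()-plus-reallocation idea; the 128-byte initial size and the doubling growth policy are arbitrary choices for the example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one whole line of arbitrary length (newline kept, if present).
 * Returns a malloc()ed string the caller frees, or NULL on EOF/allocation failure. */
char *read_long_line(FILE *f)
{
    size_t cap = 128, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    while (fgets(buf + len, (int)(cap - len), f) != NULL) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                    /* complete line read */

        cap *= 2;                          /* line longer than the buffer: grow it */
        char *tmp = realloc(buf, cap);
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
    }

    if (len > 0)
        return buf;                        /* last line had no newline */
    free(buf);
    return NULL;                           /* immediate EOF */
}

Note that strlen() here (like fgets() itself) assumes the data contains no embedded null characters.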
Does your environment provide the getline(3) function? If so, I'd say go for that.
The big advantage I see is that it allocates the buffer itself (if you want), and will realloc() the buffer you pass in if it's too small. (So this means you need to pass in something gotten from malloc()).
This gets rid of some of the pain of fgets/fgetc, and you can hope that whoever wrote the C library that implements it took care of making it efficient.
Bonus: the man page on Linux has a nice example of how to use it in an efficient manner.
If performance matters much to you, you generally want to call getc instead of fgetc. The standard tries to make it easier to implement getc as a macro to avoid function call overhead.
Past that, the main thing to deal with is probably your strategy in allocating the buffer. Most people use fixed increments (e.g., when/if we run out of space, allocate another 128 bytes). I'd advise instead using a constant factor, so if you run out of space allocate a buffer that's, say, 1 1/2 times the previous size.
Especially when getc is implemented as a macro, the difference between getc and fgets is usually quite minimal, so you're best off concentrating on other issues.
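For example, a line reader using getc() with roughly 1.5x growth might look like this sketch (the 64-byte starting size and exact factor are arbitrary):

#include <stdio.h>
#include <stdlib.h>

/* Read a line one character at a time, growing the buffer by ~1.5x when full.
 * Returns a malloc()ed string without the newline, or NULL at immediate EOF. */
char *getc_line(FILE *f)
{
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    int c;

    if (buf == NULL)
        return NULL;

    while ((c = getc(f)) != EOF && c != '\n') {
        if (len + 1 == cap) {              /* keep room for the terminating '\0' */
            cap += cap / 2;                /* grow by a constant factor, not a fixed step */
            char *tmp = realloc(buf, cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
        buf[len++] = (char)c;
    }

    if (len == 0 && c == EOF) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    return buf;
}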
If you can set a maximum line length, even a large one, then one fgets would do the trick. If not, multiple fgets calls will still be faster than multiple fgetc calls because the overhead of the latter will be greater.
A better answer, though, is that it's not worth worrying about the performance difference until and unless you have to. If fgetc is fast enough, what does it matter?
I would allocate a large buffer and then use fgets, checking, reallocing and repeating if you haven't read to the end of the line.
Each time you read (either via fgetc or fgets) you are making a system call which takes time; you want to minimize the number of times that happens, so calling fgets fewer times and iterating in memory is faster.
If you are reading from a file, mmap()ing in the file is another option.

I/O methods in C

I am looking for various ways of reading/writing data from stdin/stdout. Currently I know about scanf/printf, getchar/putchar and gets/puts. Are there any other ways of doing this? Also I am interested in knowing which one is the most efficient in terms of Memory and Space.
Thanks in Advance
fgets()
fputs()
read()
write()
And others, details can be found here: http://www.cplusplus.com/reference/clibrary/cstdio/
As per your time question, take a look at this: http://en.wikipedia.org/wiki/I/O_bound
Stdio is designed to be fairly efficient no matter which way you prefer to read data. If you need to do character-by-character reads and writes, they usually expand to macros which just access the buffer except when it's full/empty. For line-by-line text io, use puts/fputs and fgets. (But NEVER use gets because there's no way to control how many bytes it will read!) The printf family (e.g. fprintf) is of course extremely useful for text because it allows you to skip constructing a temporary buffer in memory before writing (and thus lets you avoid thinking about all the memory allocation, overflow, etc. issues). fscanf tends to be much less useful, but mostly because it's difficult to use. If you study the documentation for fscanf well and learn how to use %[, %n, and the numeric specifiers, it can be very powerful!
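As a small illustration of the %[ and %n specifiers (the particular pattern is just an example), this pulls whitespace-separated alphabetic tokens from stdin:

#include <stdio.h>

int main(void)
{
    char word[64];
    int consumed;

    /* %63[A-Za-z] reads a bounded run of letters; %n stores how many
     * characters this call consumed without itself reading any input.
     * The loop stops at EOF or at the first non-letter, non-space character. */
    while (scanf(" %63[A-Za-z]%n", word, &consumed) == 1)
        printf("token \"%s\" (%d chars consumed)\n", word, consumed);

    return 0;
}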
For large blocks of text (e.g. loading a whole file into memory) or binary data, you can also use the fread and fwrite functions. You should always pass 1 for the size argument and the number of bytes to read/write for the count argument; otherwise it's impossible to tell from the return value how much was successfully read or written.
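A brief illustration of that convention, copying stdin to stdout in chunks (the 4096-byte chunk size is arbitrary):

#include <stdio.h>

int main(void)
{
    char chunk[4096];
    size_t n;

    /* size = 1, count = sizeof chunk: the return value is a byte count,
     * so short reads and writes are easy to detect. */
    while ((n = fread(chunk, 1, sizeof chunk, stdin)) > 0) {
        if (fwrite(chunk, 1, n, stdout) != n) {
            perror("fwrite");
            return 1;
        }
    }
    return 0;
}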
If you're on a reasonably POSIX-like system (pretty much anything) you can also use the lower-level io functions open, read, write, etc. These are NOT part of the C standard but part of POSIX, and non-POSIX systems usually provide the same functions but possibly with slightly-different behavior (for example, file descriptors may not be numbered sequentially 0,1,2,... like POSIX would require).
If you're looking for immediate-mode type stuff don't forget about Curses (more applicable on the *NIX side but also available on Windows)

Resources