What are some best practices for file I/O in C?

I'm writing a fairly basic program for personal use but I really want to make sure I use good practices, especially if I decide to make it more robust later on or something.
For all intents and purposes, the program accepts some input files as arguments, opens them using fopen(), reads from the files, does stuff with that information, and then saves the output as a few different files in a subfolder. E.g., if the program is in ~/program then the output files are saved in ~/program/csv/
I just output directly to the files, for example output = fopen("./csv/output.csv", "w");, print to it with fprintf(output,"%f,%f", data1, data2); in a loop, and then close with fclose(output); and I just feel like that is bad practice.
Should I be saving it in a temp directory while it's being written to and then moving it when it's finished? Should I be using more advanced file I/O libraries? Am I just completely overthinking this?

Best practices in my eyes:
Check every call to fopen, printf, puts, fprintf, fclose etc. for errors
use getchar if you must, fread if you can
use putchar if you must, fwrite if you can
avoid arbitrary limits on input line length (might require malloc/realloc; see the sketch after this list)
know when you need to open output files in binary mode
use Standard C, forget conio.h :-)
newlines belong at the end of a line, not at the beginning of some text, i.e. it is printf ("hello, world\n"); and not "\nHello, world" like those misled by the Mighty William H. often write to cope with the silliness of their command shell. Outputting newlines first breaks line-buffered I/O.
if you need more than 7-bit ASCII, choose Unicode (the most common encoding is UTF-8, which is ASCII compatible). It's the last encoding you'll ever need to learn. Stay away from codepages and ISO-8859-*.
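For the line-length point, here is a minimal sketch of the usual malloc/realloc pattern. The function name read_line and the starting capacity are illustrative, and on POSIX systems getline(3) already does this for you:

#include <stdio.h>
#include <stdlib.h>

/* Read one line of any length; caller frees the result.
   Returns NULL on EOF (with nothing read) or allocation failure. */
char *read_line(FILE *in) {
    size_t cap = 64, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL) return NULL;
    int c;
    while ((c = fgetc(in)) != EOF && c != '\n') {
        if (len + 1 == cap) {                /* grow geometrically */
            char *tmp = realloc(buf, cap *= 2);
            if (tmp == NULL) { free(buf); return NULL; }
            buf = tmp;
        }
        buf[len++] = (char)c;
    }
    if (len == 0 && c == EOF) { free(buf); return NULL; }
    buf[len] = '\0';
    return buf;
}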

Am I just completely overthinking this?
You are. If the task is simple, don't build a complicated solution on purpose just because it feels "more professional". While you're a beginner, focus on code readability; it will make life easier for you and for everyone else who reads your code.

It's fine. I/O is fully buffered by default with stdio file functions, so you won't be writing to the file with every single call of fprintf. In fact, in many cases, nothing will be written to it until you call fclose.
It's good practice to check the return of fopen, to close your files when finished, etc. Let the OS and the compiler do their job in making the rest efficient, for simple programs like this.
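As a concrete illustration, here is a sketch of the asker's loop with those checks added; the file name and format string come from the question, and the array arguments are made up for the example:

#include <stdio.h>

int save_csv(const double *data1, const double *data2, size_t n) {
    FILE *output = fopen("./csv/output.csv", "w");
    if (output == NULL) {
        perror("fopen");
        return -1;
    }
    for (size_t i = 0; i < n; i++)
        fprintf(output, "%f,%f\n", data1[i], data2[i]);
    /* fclose flushes the buffer, so problems like a full disk often
       surface here; its return value is worth checking too. */
    if (fclose(output) != 0) {
        perror("fclose");
        return -1;
    }
    return 0;
}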

If no other program is checking for the presence of ~/program/csv/output.csv for further processing, then what you're doing is just fine.
Otherwise you can consider writing to a FILE * obtained by a call to tmpfile in stdio.h or some similar library call, and when finished copy the file to the final destination. You could also put down a lock file output.csv.lck and remove that when you're done, but that depends on you being able to modify the other program's behaviour.
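Note that tmpfile gives you an anonymous file, so you would have to copy its contents out at the end. A common variant, sketched below assuming a POSIX system, writes to a named temp file in the destination directory and then rename()s it into place, which is atomic within one filesystem; the paths and data here are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int write_csv_atomically(void) {
    char tmpl[] = "./csv/output.csv.XXXXXX";
    int fd = mkstemp(tmpl);              /* unique name, created O_EXCL */
    if (fd == -1) return -1;
    FILE *out = fdopen(fd, "w");
    if (out == NULL) { close(fd); unlink(tmpl); return -1; }
    fprintf(out, "%f,%f\n", 1.0, 2.0);   /* ... the real data ... */
    if (fclose(out) != 0 || rename(tmpl, "./csv/output.csv") != 0) {
        unlink(tmpl);                    /* don't leave debris behind */
        return -1;
    }
    return 0;   /* readers see the old file or the new one, never half */
}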

You can make your own cat, cp, mv programs for practice.
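For instance, a bare-bones cat is only a few lines with fread/fwrite (this sketch only copies stdin to stdout, without handling filename arguments):

#include <stdio.h>

int main(void) {
    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0)
        fwrite(buf, 1, n, stdout);       /* copy each chunk through */
    return ferror(stdin) || ferror(stdout);
}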

Related

Overall efficiency of fprintf and stdout

I have a program that regularly writes to stdout. Something like this:
fprintf(stdout, ...);
fprintf(stdout, ...);
fprintf(stdout, ...);
This makes the program easy to read, but I'm curious to know how efficient it is compared to concatenating the strings into some char[] and then calling a single fprintf(stdout, ...) on that char[]. By efficiency, I'm referring to processing efficiency.
The whole of stdio.h is notoriously slow, as are writes to the screen or files in general. What makes stdio.h particularly bad is that it's a cumbersome wrapper around the underlying OS API. printf/scanf-like functions have a horrible interface forcing them to deal with both format string parsing and variable argument lists before they can even pass along the data to the function doing the actual work.
Minimizing those fprintf calls into a single one will almost definitely improve performance. But then that depends on how you "concatenate strings", if it is done with sprintf, then you have only moved all the calling/parsing overhead from one icky stdio.h function to another.
The only reason you would ever use stdio.h is if you need to create very portable console and file I/O code. Otherwise, you'd call the OS API directly.
That being said, you should only manually optimize code when there is a need for it. If the program runs "fast enough" without any known bottlenecks, then leave it be and strive to write as readable code as possible.
There are 3 bottlenecks I know of that can cause slow performance when you call fprintf(stdout, ...):
Format parsing
Buffering
Your Terminal or other stdout device
To avoid the format parsing, you could write using fwrite(), but then you have to create the output string in another way, and whether this is faster is questionable.
Normally, stdout uses a line buffer; this means the data has to be checked for \n characters and, assuming you're running on an OS, a syscall is issued for every line. Syscalls are relatively slow compared to normal function calls. Setting the stream to full buffering with setvbuf and _IOFBF is probably the fastest buffering method. Use BUFSIZ, or try different buffer sizes and benchmark them to find the best value.
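A minimal sketch of that; the buffer size here is only an illustrative guess to be benchmarked:

#include <stdio.h>

int main(void) {
    /* setvbuf must be called before any other operation on the stream. */
    static char buf[1 << 16];
    if (setvbuf(stdout, buf, _IOFBF, sizeof buf) != 0)
        return 1;
    for (int i = 0; i < 1000000; i++)
        fprintf(stdout, "line %d\n", i); /* no syscall per line now */
    return 0;                            /* normal exit flushes the buffer */
}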
When your terminal is slow, there is nothing you can do about it in your program. You could write to a file, which can be faster, or use a faster terminal. AFAIK Alacritty is probably the fastest terminal on Linux.

How can I reuse a text file in a C program without having a file pointer?

My program uses a text file to input data, but can't use a file pointer in the program itself. I'm supposed to use < file.txt redirection in the Linux terminal. I can't do it any other way because it's a college assignment, so please don't waste my time with rewind or other functions that require a pointer. I just need to be able to basically restart the text file that I already have open.
The C library provides a FILE pointer for standard input, stdin from <stdio.h>. However, it might not support all the functions you want, since it can be connected to another command (if you pipe your input from somewhere else) or the terminal (if you don't use input redirection). If you need to be able to support these, which you probably do, you won't be able to successfully call fseek or any of the related functions.
If that's the case, then this is fundamentally impossible. The computer doesn't store all the data which was sent to your program, so there is no way to go back and get it because there's nowhere to get it from. Instead, you either need to store the input yourself, or rework your algorithm to only need a single pass over the input data.
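Storing the input yourself can be as simple as slurping all of stdin into a growing buffer, then scanning that buffer as many times as you like. A sketch with illustrative sizes:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap);
    if (buf == NULL) return 1;
    while ((n = fread(buf + len, 1, cap - len, stdin)) > 0) {
        len += n;
        if (len == cap) {                 /* buffer full: double it */
            char *tmp = realloc(buf, cap *= 2);
            if (tmp == NULL) { free(buf); return 1; }
            buf = tmp;
        }
    }
    /* First pass, second pass, ...: just walk buf[0..len) again. */
    fwrite(buf, 1, len, stdout);          /* e.g., echo it back */
    free(buf);
    return 0;
}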

What is generally the best approach to reading a file for a compiler?

I know this is a general question.
I'm going to program a compiler and I was wondering if it's better to take the tokens of the language while reading the file (i.e., first open the file, then extract tokens while reading, and finally close the file) or read the file first, close it and then work with the data in a variable. The pseudo-code for this would be something like:
file = open(filename);
textVariable = read(file);
close(file);
getTokens(textVariable);
The first option would be something like:
file = open(filename);
readWhileGeneratingTokens(file);
close(file);
I guess the first option looks better, since there isn't an additional cost in terms of main memory. However, I think there might be some benefits to the second option, since it minimizes the time the file is open.
I can't provide any hard data, but generally the amount of time a compiler spends tokenizing source code is rather small compared to the amount of time spent optimizing/generating target code. Because of this, wanting to minimize the amount of time the source file is open seems premature. Additionally, reading the entire source file into memory before tokenizing would prevent any sort of line-by-line execution (think interpreted language) or reading input from a non-file location (think of a stream like stdin). I think it is safe to say that the overhead in reading the entire source file into memory is not worth the computer's resources and will ultimately be detrimental to your project.
Compilers are carefully designed to be able to proceed on as little as one character at a time from the input. They don't read entire files prior to processing, or rather they have no need to do so: that would just add pointless latency. They don't even need to read entire lines before processing.
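A toy illustration of that style, nothing like a real lexer, but it shows that one character of pushed-back lookahead (via ungetc) is all the state you need:

#include <stdio.h>
#include <ctype.h>

void scan(FILE *src) {
    int c;
    while ((c = fgetc(src)) != EOF) {
        if (isspace(c))
            continue;                     /* skip space between tokens */
        if (isalpha(c)) {                 /* an identifier-like token */
            putchar('[');
            do { putchar(c); } while (isalpha(c = fgetc(src)));
            putchar(']');
            if (c != EOF) ungetc(c, src); /* push back the lookahead */
        } else {
            printf("[%c]", c);            /* single-character token */
        }
    }
    putchar('\n');
}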

CommandLineToArgvW equivalent on Linux

I'm looking for an equivalent function to Windows' CommandLineToArgvW.
I have a string which I want to break up exactly the way bash does it, with all the corner cases - i.e. taking into account single and double quotes, backslashes, etc., so splitting a b "c'd" "\"e\"f" "g\\" h 'i"j' would result in:
a
b
c'd
"e"f
g\
h
i"j
Since such a function already exists and is used by the OS/bash, I'm assuming there's a way to call it, or at least get its source code, so I don't need to reinvent the wheel.
Edit
To answer why I need it, it has nothing to do with spawning child processes. I want to make a program that searches text, watching for multiple regular expressions to be true in whatever order. But all the regular expressions would be input in the same text field, so I need to break them up.
GNU/Linux is made of free software and bash is free software, so you can get the source code and improve it (and you should publish your improvement patches under the GPL license).
But there is no common library doing that, because it is the role of the shell to expand the command line to arguments to the execve(2) syscall (which then go to the main of the invoked program).
(this was different in MS-DOS, where the called program had to expand its command line)
The function wordexp(3) is close to what you may want.
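A small sketch of wordexp(3) in action; note that it also performs tilde and variable expansion, and the WRDE_NOCMD flag below disables command substitution for safety:

#include <stdio.h>
#include <wordexp.h>

int main(void) {
    wordexp_t we;
    if (wordexp("a b \"c'd\" 'i\"j'", &we, WRDE_NOCMD) == 0) {
        for (size_t i = 0; i < we.we_wordc; i++)
            puts(we.we_wordv[i]);         /* one word per line */
        wordfree(&we);
    }
    return 0;
}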
You may want to study the source code of simpler shells, e.g. download sash-3.7.tar.gz
If you want it to expand a string exactly the way Bash does, you will need to run Bash. Remember, Bash does parameter expansion, command substitution, and the like. If it really needs to act exactly like Bash, just call Bash itself.
char cmd[1024], buffer[1024];
snprintf(cmd, sizeof cmd, "echo %s", your_string); /* let bash expand it */
FILE *f = popen(cmd, "r");  /* bidirectional "r+" popen is not portable */
if (f != NULL) {
    fgets(buffer, sizeof buffer, f);
    pclose(f);
}
Note that real code would need to handle errors and possibly allocate a bigger buffer if your original is not large enough.
Given your updated requirements, it sounds like you don't want to parse it exactly like Bash does. Instead, you just want to parse space-separated strings with quoting and escaping. I would recommend simply implementing this yourself; I do not know of any off the shelf library that will parse strings exactly the way that you specify. You don't have to write it entirely by hand; you can use a lexical scanner generator like flex or Ragel for this purpose.
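For a sense of scale, a hand-written version covering just the quoting rules from the question fits in a few dozen lines. This sketch assumes balanced quotes and caps words at 255 characters; it is not full shell semantics (no expansion of any kind):

#include <stdio.h>

static void split_words(const char *p) {
    char word[256];
    int w = 0, in_word = 0;
    for (;; p++) {
        char c = *p;
        if (c == '\0' || c == ' ' || c == '\t') {
            if (in_word) { word[w] = '\0'; puts(word); w = 0; in_word = 0; }
            if (c == '\0') return;
        } else if (c == '\'') {               /* '...' is taken literally */
            in_word = 1;
            while (*++p && *p != '\'')
                if (w < 255) word[w++] = *p;
        } else if (c == '"') {                /* "..." honours \" and \\ */
            in_word = 1;
            while (*++p && *p != '"') {
                if (*p == '\\' && (p[1] == '"' || p[1] == '\\')) p++;
                if (w < 255) word[w++] = *p;
            }
        } else {
            if (c == '\\' && p[1]) c = *++p;  /* \x outside quotes -> x */
            if (w < 255) word[w++] = c;
            in_word = 1;
        }
    }
}

int main(void) {                  /* the example input from the question */
    split_words("a b \"c'd\" \"\\\"e\\\"f\" \"g\\\\\" h 'i\"j'");
    return 0;
}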

What is the best way to interface with a program that uses filenames on the command line for input and output?

I need to interface with some executables that expect to be passed filenames for input or output:
./helper in.txt out.txt
What is the standard (and preferably cross-platform) way of doing this?
I could create lots of temporary files in the /tmp directory, but I am concerned that creating tons of files might cause some issues. Also, I want to be able to install my program and not have to worry about permissions later.
I could also just be Unix-specific and try to go for a solution using pipes, etc. But then, I don't think I would be able to find a solution with nice, unnamed pipes.
My alternative to this would be piping input to stdin (all the executables I need also accept it this way) and getting the results from stdout. However, the outputs they give to stdout are all different, and I would need to write lots of adapters by hand to make this uniform (the outputs through files obey the same standard). I don't like how this would lock my program into a couple of formats, though.
There isn't a right or wrong answer necessarily. Reading/writing stdin/stdout is probably cleaner and doesn't use disk space. However, using temporary files is just fine too, as long as you do it safely. Specifically, see the mktemp and mkstemp manual pages for functions that let you create temporary files for short-term usage. Just clean them up afterward (unlink) and it's fine to use and manipulate temp files.
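A sketch of that pattern, assuming the ./helper invocation from the question; the template path and the use of system() are illustrative rather than the only way to do it:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int run_helper(const char *input_data) {
    char in_path[] = "/tmp/myprog_in.XXXXXX";
    int fd = mkstemp(in_path);        /* unique name, opened O_EXCL, mode 0600 */
    if (fd == -1) return -1;
    write(fd, input_data, strlen(input_data));
    close(fd);

    char cmd[256];
    snprintf(cmd, sizeof cmd, "./helper %s out.txt", in_path);
    int status = system(cmd);         /* real code would check this */

    unlink(in_path);                  /* clean up afterward */
    return status;
}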
