Is it ‘safe’ to remove() an open file? - C

I am thinking about letting my program use the same filename for both the input and output file, so that it replaces the input file.
As the processed file may be quite large, I think the best solution would be to first open the file, then remove it and create a new one, i.e. like this:
/* input == output in this case */
FILE *inf = fopen(input, "r");
remove(output);
FILE *outf = fopen(output, "w");
(of course, with error handling added)
I am aware that not all systems are going to allow me to remove open file and that's acceptable as long as remove() is going to fail in that case.
I am worried, though, that there may be some system which will allow me to remove the open file but then fail to read its contents.
C99 standard specifies behavior in that case as ‘implementation-defined’; SUS doesn't even mention the case.
What is your opinion/experience? Do I have to worry? Should I avoid such solutions?
EDIT: Please note this isn't supposed to be some mainline feature but rather ‘last resort’ in the case user specifies same filename as both input and output file.
EDIT: Ok, one more question then: in this particular case, can my proposed solution do any more harm than simply opening the output file for writing (i.e. as above but without the remove() call)?

No, it's not safe. It may work on your file system but fail on others, or it may fail intermittently. It really depends on your operating system AND file system. For an in-depth look at Solaris, see this article on file rotation.
Take a look at GNU sed's '--in-place' option. It works by writing the output to a temporary file and then renaming it over the original. This is the only safe, portable method.
You should also consider that your program could fail at any time, due to a power outage or the process being killed; if that happens, your original file will be lost. Additionally, on file systems that do reference-count open files, you're not saving any space over the temp-file solution, since both files have to exist on disk until the input file is closed.
If the files are huge, space is at a premium, and developer time is cheap, you may be able to open a single file for read/write, and ensure that your write pointer never advances beyond your read pointer.
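That last idea can be sketched roughly like this. The function name and the toy transformation (drop spaces, upcase the rest) are made up for illustration, and ftruncate() is POSIX rather than ISO C:

```c
/* Sketch: in-place filtering with a single FILE* opened "r+".
   filter_in_place() and its space-dropping/upcasing transform are
   made up for illustration; ftruncate() is POSIX, not ISO C. */
#include <stdio.h>
#include <ctype.h>
#include <unistd.h>   /* ftruncate */

int filter_in_place(const char *path)
{
    FILE *fp = fopen(path, "r+");
    if (!fp) return -1;

    long rpos = 0, wpos = 0;  /* wpos never advances past rpos */
    int c;
    for (;;) {
        fseek(fp, rpos, SEEK_SET);
        c = fgetc(fp);
        if (c == EOF) break;
        rpos = ftell(fp);
        if (c != ' ') {                 /* stand-in transformation */
            fseek(fp, wpos, SEEK_SET);
            fputc(toupper(c), fp);
            wpos = ftell(fp);
        }
    }
    fflush(fp);
    ftruncate(fileno(fp), wpos);        /* chop off the stale tail */
    return fclose(fp);
}
```

Note that the C standard requires a seek (or flush) when switching between reading and writing on the same stream, which the fseek() calls above satisfy. This only works when the output never grows ahead of the input.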

All systems that I'm aware of that let you remove open files implement some form of reference-counting for file nodes. So, removing a file removes the directory entry, but the file node itself still has one reference from open file handle. In such an implementation, removing a file obviously won't affect the ability to keep reading it, and I find it hard to imagine any other reasonable way to implement this behavior.

I've always got this to work on Linux/Unix. Never on Windows, OS/2, or (shudder) DOS. Any other platforms you are concerned about?
This behaviour is actually useful for temporary disk space: open the file for read/write and immediately delete it. It gets cleaned up automatically on program exit (for any reason, including a power outage), and it is much harder (but not impossible) for others to snoop on it (/proc can give clues if someone has read access to the process).

Related

Is there a better way to manage file pointer in C?

Is it better to use fopen() and fclose() at the beginning and end of every function that uses the file, or is it better to pass the file pointer to each of these functions? Or even to make the file pointer a member of the struct the file is related to?
I have two projects going on and each one use one method (because I thought about passing the file pointer after I began the first one).
When I say better, I mean in terms of speed and/or readability. What's best practice?
Thank you !
It depends. You certainly should document which function is fopen(3)-ing a FILE handle and which function is expected to fclose(3) it.
You might put the FILE* in a struct, but you should have a convention about who should read and/or write and close the file, and when.
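One possible shape for such a convention, sketched with made-up names: the struct owns the FILE*, and only the open/close pair ever calls fopen(3)/fclose(3):

```c
/* Sketch of an ownership convention (all names invented): the struct
   owns the FILE*; logfile_open() and logfile_close() are the only
   functions that call fopen()/fclose(). Everything else uses lf->fp. */
#include <stdio.h>

struct logfile {
    FILE *fp;
    long  lines_written;
};

int logfile_open(struct logfile *lf, const char *path)
{
    lf->fp = fopen(path, "a");
    lf->lines_written = 0;
    return lf->fp ? 0 : -1;
}

int logfile_puts(struct logfile *lf, const char *msg)
{
    if (fprintf(lf->fp, "%s\n", msg) < 0) return -1;
    lf->lines_written++;
    return 0;
}

int logfile_close(struct logfile *lf)      /* sole owner of fclose() */
{
    int rc = lf->fp ? fclose(lf->fp) : 0;
    lf->fp = NULL;                         /* guard against double close */
    return rc;
}
```

With this shape, the answer to "who closes the file?" is always the same function, which makes leaks much easier to audit.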
Be aware that open files are a somewhat expensive resource in a process (= your running program). The details are also operating system and file system specific. And FILE handles are buffered; see fflush(3) & setvbuf(3).
On small systems, the maximum number of fopen-ed file handles could be as small as a few dozen. On a current Linux desktop, a process can have a few thousand open file descriptors (each of which the internal FILE keeps, along with its buffers). In any case, file handles are a rather precious and scarce resource (on Linux, you can limit them with setrlimit(2)).
Be aware that disk IO is very slow w.r.t. CPU.

Removing bytes from file in C without creating a new file

I have a file, let's say log. I need to remove some bytes, say n bytes, from the start of the file only. The issue is that this file is referenced by file pointers in other programs, and those pointers may write to log at any time. I can't re-create a new file, otherwise the file pointers would malfunction (I am not sure about that either).
I tried to google it, but all the suggestions are only to rewrite to new files.
Is there any solution for it?
I can suggest two options:
Ring buffer: Use a memory-mapped file as your logging medium, and treat it as a ring buffer. You will need to manually manage where the last written byte is, and wrap around your ring appropriately as you step over the end of the ring. This way, your logging file stays a constant size, but you can't tail it like a regular file. Instead, you will need to write a special program that knows how to walk the ring buffer when you want to display the log.
Multiple small log files: Use some number of smaller log files that you log to, and remove the oldest file as the collection of files grows beyond the size of logs you want to maintain. If the most recent log file is always named the same, you can use the standard tail -F utility to follow the log contents perpetually. To avoid issues of multiple programs manipulating the same file, your logging code can send logs as messages to a single logging daemon.
So... you want to change the file, but you cannot. The reason you cannot is that other programs are using the file. In general terms, you appear to need to:
stop all the other programs messing with the file while you change it -- to chop now unwanted stuff off the front;
inform the other programs that you have changed it -- so they can re-establish their file-pointers.
I guess there must be a mechanism to allow the other programs to change the file without tripping over each other... so perhaps you can extend that? [If all the other programs are children of the main program, and the children all open the file with O_APPEND, you have a fighting chance of doing this, perhaps with the help of a file lock or a semaphore (which may already exist?). But if the programs are this intimately related, then #jxh has other, probably better, suggestions.]
But, if you cannot change the other programs in any way, you appear to be stuck, except...
...perhaps you could try 'sparse' files? On (recent-ish) Linux (at least) you can fallocate() with FALLOC_FL_PUNCH_HOLE to remove the stuff you don't want without affecting the other programs' file pointers. Of course, sooner or later the other programs' file pointers may overflow, but that may be a more theoretical than practical issue.
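A minimal sketch of that idea, assuming Linux and a file system that supports hole punching (ext4, XFS and tmpfs do; others may return EOPNOTSUPP):

```c
/* Linux-only sketch: deallocate the first n bytes of an open log file
   without renaming it or disturbing other writers' offsets. The file's
   apparent size is unchanged; the punched range reads back as zeros. */
#define _GNU_SOURCE   /* for fallocate() */
#include <fcntl.h>

int punch_front(int fd, off_t n)
{
    /* FALLOC_FL_PUNCH_HOLE must be OR'ed with FALLOC_FL_KEEP_SIZE */
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, n);
}
```

Since the file size does not shrink, readers that interpret the leading zeros as data will need to skip them; the disk blocks, however, really are freed.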

What is the need for closing a file?

Why do we need to close a file that we opened? I know about problems like: it can't be accessed by another process if the current process doesn't close it. But why, at the end of a process's execution, does the OS check whether any files are still open and close them? There must be a reason for that.
When you close a file, the buffer is flushed and everything you wrote to it is persisted to the file. If you exit your program suddenly without flushing (or closing) your FILE * stream, you will probably lose your data.
Two words: resource exhaustion. File handles, on any platform, are a limited resource. If a process just opens files and never closes them, it will soon run out of file handles.
A file can certainly be accessed by another process while it is opened by one. Some semantics depend on the operating system. For example, in Unix, two or more processes may open a file concurrently to write. Almost all systems will allow readonly access to multiple processes.
You open a file to connect the byte stream into the process. You close the file to disconnect the two. When you write into the file, the file may not get modified right away due to buffering. That implies that the memory buffer of the file is modified but the change is not immediately reflected into the file on disk. The OS will reflect the changes in disk when it has enough data for performance reason. When you close the file, the OS will flush out the changes into the file on disk.
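A small illustration of that buffering point (visible_size() is a made-up helper that reports how many bytes a second reader would see right now):

```c
/* Demonstration of why flushing/closing matters: stdio output to a file
   is normally fully buffered, so bytes can sit in the process's memory
   until fflush()/fclose() hands them to the OS. visible_size() is an
   invented helper that opens the file fresh and measures its length. */
#include <stdio.h>

long visible_size(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;
    fseek(fp, 0L, SEEK_END);   /* seek to the end... */
    long sz = ftell(fp);       /* ...and read off the offset */
    fclose(fp);
    return sz;
}
```

Until the writer calls fflush() (or fclose()), a second reader may well see a shorter file than what the writer believes it has already written.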
If you "get" a resource, it is good practice to release it when you have done.
I think it's not a good idea to trust what an OS will do when the process ends: it might free resources, or it might not. Common OSes do: they close files, free allocated memory, …
But if it's not part of the standard of the language you use (e.g. if it implements garbage collectors), then you shouldn't rely on that common behaviour.
Otherwise, the risk is that your application would lock/eat resources on some systems even after it has ended.
Seen that way, it is simply good practice. You are not strictly obliged to close files at the end.
Imagine you write a clever but messy function. You forget about its caveats and later find out that the function may be used somewhere else. Then you try to use it, maybe in a loop, and find that your program crashes unexpectedly. You have to dig deeper into the code and fix it. So why not make the function clean from the beginning?
Do you have something against (or are you afraid of) closing files?
Here is a good explanation: http://www.cs.bu.edu/teaching/c/file-io/intro/
When writing to an output file, the information is held in a buffer, and closing the file is a way to be sure that everything is posted to the file.

How to know if a file is being copied?

I am currently trying to check whether the copy of a file from one directory to another is done.
I would like to know if the target file is still being copied.
So I would like to get the number of file descriptors open on this file.
I use the C language and can't really find a way to solve this problem.
If you have control of it, I would recommend using the copy-move idiom on the program doing the copying:
cp file1 otherdir/.file1.tmp
mv otherdir/.file1.tmp otherdir/file1
The mv just changes some filesystem entries and is atomic and very fast compared to the copy.
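In C, that idiom might be sketched as follows (function names invented; the temp file must live in the same directory and file system as the destination, because rename(2) is only atomic within one file system):

```c
/* Sketch of the copy-then-rename idiom (names invented). The temp file
   must be on the same file system as the destination: rename(2) is only
   an atomic replacement within one file system. */
#include <stdio.h>

int copy_then_rename(const char *src, const char *dst, const char *tmp)
{
    FILE *in = fopen(src, "rb");
    if (!in) return -1;
    FILE *out = fopen(tmp, "wb");
    if (!out) { fclose(in); return -1; }

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);

    fclose(in);
    if (fclose(out) != 0) return -1;   /* delayed write errors surface here */
    return rename(tmp, dst);           /* the atomic replacement step */
}
```

Any observer then either sees the old file or the complete new one, never a half-copied file.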
If you're able to open the file for writing, there's a good chance that the OS has finished the copy and has released its lock on it. Different operating systems may behave differently for this, however.
Another approach is to open both the source and destination files for reading and compare their sizes. If they're of identical size, the copy has very likely finished. You can use fseek() and ftell() to determine the size of a file in C:
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
In linux, try the lsof command, which lists all of the open files on your system.
edit 1: The only C language feature that comes to mind is the fstat function. You might be able to use that with the struct's st_mtime (last modification time) field - once that value stops changing (for, say, a period of 10 seconds), then you could assume that file copy operation has stopped.
edit 2: also, on linux, you could traverse /proc/[pid]/fd to see which files are open. The files in there are symlinks, but C's readlink() function could tell you its path, so you could see whether it is still open. Using getpid(), you would know the process ID of your program (if you are doing a file copy from within your program) to know where to look in /proc.
I think your basic mistake is trying to synchronize a C program with a shell tool/external program that's not intended for synchronization. If you have some degree of control over the program/script doing the copying, you should modify it to perform advisory locking of some sort (preferably fcntl-based) on the target file. Then your other program can simply block on acquiring the lock.
If you don't have any control over the program performing the copy, the only solutions depend on non-portable hacks like lsof or Linux inotify API.
(This answer makes the big, big assumption that this will be running on Linux.)
The C source code of lsof, a tool that tells which programs currently have an open file descriptor to a specific file, is freely available. However, just to warn you, I couldn't make any sense out of it. There are references to reading kernel memory, so to me it's either voodoo or black magic.
That said, nothing prevents you from running lsof from your own program. Running third-party programs from your own is normally something to avoid for several reasons, such as security (if a rogue user replaces lsof with a malicious program, it will run with your program's privileges, with potentially catastrophic consequences). But after inspecting the lsof source code, I came to the conclusion that there's no public API to determine which program has which file open. If you're not afraid of people changing programs in /usr/sbin, you might consider this.
#define _GNU_SOURCE   /* for asprintf() */
#include <stdio.h>
#include <stdlib.h>

/* Returns 0 if at least one process has `file` open (lsof exits with
   status 0 when it lists anything), non-zero otherwise. */
int isOpen(const char* file)
{
    char* command;
    // BE AWARE THAT THIS WILL NOT WORK IF THE FILE NAME CONTAINS A DOUBLE QUOTE
    // OR IF IT CAN SOMEHOW BE ALTERED THROUGH SHELL EXPANSION
    // you should either try to fix it yourself, or use a function of the `exec`
    // family that won't trigger shell expansion.
    // It would be an EXTREMELY BAD idea to call `lsof` without an absolute path
    // since it could result in another program being run. If this is not where
    // `lsof` resides on your system, change it to the appropriate absolute path.
    if (asprintf(&command, "/usr/sbin/lsof \"%s\"", file) < 0)
        return -1;
    int result = system(command);
    free(command);
    return result;
}
If you also need to know which program has your file open (presumably cp?), you can use popen to read the output of lsof in a similar fashion. popen descriptors behave like fopen descriptors, so all you need to do is fread them and see if you can find your program's name. On my machine, lsof output looks like this:
$ lsof document.pdf
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
SomeApp 873 felix txt REG 14,3 303260 5165763 document.pdf
As poundifdef mentioned, the fstat() function can give you the current modification time. But fstat also gives you the size of the file.
Back in the dim dark ages of C when I was monitoring files being copied by various programs I had no control over I always:
Waited until the target file size was >= the source size, and
Waited until the target modification time was at least N seconds older than the current time, N being a number such as 5, set larger if experience showed that was necessary. Yes, 5 seconds seems extreme, but it is safe.
If you don't know what the target file is, then the only real choice you have is #2, but use a larger N to allow for worst-case network and local CPU delays, with a healthy safety factor.
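A hedged sketch of that "wait until the file goes quiet" heuristic in C, assuming POSIX stat(2) (the function name, the quiet period, and the 1-second poll interval are all arbitrary choices to tune):

```c
/* Sketch of the "wait until size/mtime stop moving" heuristic, assuming
   POSIX stat(2). quiet_secs and the 1-second poll are arbitrary knobs;
   tune them for your network and CPU load. */
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Returns 1 once `path` has not been modified for quiet_secs, 0 on error. */
int wait_until_quiet(const char *path, int quiet_secs)
{
    struct stat st;
    for (;;) {
        if (stat(path, &st) != 0) return 0;   /* vanished or inaccessible */
        if (time(NULL) - st.st_mtime >= quiet_secs)
            return 1;
        sleep(1);   /* polling; inotify would avoid this on Linux */
    }
}
```

This is only a heuristic: a stalled copy looks the same as a finished one, which is why the advisory-locking approach above is preferable when you control the copier.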
Using the Boost libraries can help here (this relies on the copying process holding an exclusive lock on the file, which is typically the case on Windows):
boost::filesystem::fstream fileStream(filePath, std::ios_base::in | std::ios_base::binary);
if (fileStream.is_open()) {
    // not getting copied
} else {
    // wait, the file is getting copied
}

Reading and piping large files with C

I am interested in writing a utility that modifies PostScript files. It needs to traverse the file, make certain decisions about the page count and dimensions, and then write the output to a file or stdout making certain modifications to the PostScript code.
What would be a good way to handle file processing on a *NIX system in this case? I'm fairly new to pipes and forking in C. My understanding is that when reading a file directly I can seek back and forth within the input, but if the input is piped directly into the program, I can't simply rewind to the beginning, since the input could be, for example, a network stream. Correct?
Rather than store the entire PS file in memory, which can grow huge, it seems like it would make more sense to buffer the input to disk while doing my first pass of page analysis, then re-read from the temporary file, produce output, and remove the temporary file. If that's a viable solution, where would be a good place to store such a file on a *NIX system? I'm not sure how safe such code would be, either: the program could potentially be used by multiple users on the same server. It sounds like I would have to save the file in a temporary directory unique to a given user account, and give the temporary file on disk a fairly unique name.
Would appreciate any tips and pointers on this crazy puzzling world of file processing.
Use mkstemp(3) to create your temporary file. It will handle concurrency issues for you. mmap(2) will let you move around in the file with abandon.
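A rough sketch of combining the two, assuming POSIX (map_scratch() is an invented name; actually copying the piped input into the scratch file is elided):

```c
/* Sketch, assuming POSIX: mkstemp(3) for a race-free temp file,
   immediately unlink(2)-ed so it disappears even on a crash, then
   mmap(2) so the first analysis pass can jump anywhere in the buffered
   input. map_scratch() is an invented name; copying stdin in is elided. */
#include <stdlib.h>     /* mkstemp */
#include <unistd.h>     /* unlink, ftruncate, close */
#include <sys/mman.h>   /* mmap */

char *map_scratch(size_t n, int *out_fd)
{
    char tmpl[] = "/tmp/psbufXXXXXX";
    int fd = mkstemp(tmpl);              /* unique name, no races */
    if (fd < 0) return NULL;
    unlink(tmpl);                        /* auto-cleanup on close or crash */
    if (ftruncate(fd, (off_t)n) != 0) { close(fd); return NULL; }
    void *p = mmap(NULL, n, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return NULL; }
    *out_fd = fd;
    return p;
}
```

Because the file is unlinked immediately, it cannot collide with other users' temp files by name and never survives the process, which addresses both of the multi-user concerns in the question.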
if input is directly piped into the program, I can't simply rewind to the beginning of an input as the input could be a network stream for example, correct?
That's correct. You can only perform random access on a file.
If you read the file, perhaps you could build a table of metadata, which you can use to seek specific portions of the file later, without keeping the file itself in memory.
/tmp is the temporary directory on unix systems. It's specified by FHS. It's cleaned out when the system is rebooted.
If you need more persistent data storage than that there's /var/tmp which is not cleaned out after reboots. Also FHS.
http://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard
