What is the need for closing a file? - c

Why do we need to close a file that we opened? I know the problems like - it can't be accessed by another process if the current process doesn't close it. But why at the end of execution of a process the OS checks whether it is closed and closes it if opened. There must be a reason for that.

When you close a file the buffer is flushed and all you wrote on it it's persisted to the file. If you suddenly exit your program without flush (or close) your FILE * stream, you will probably lose your data.

Two words: Resource exhaustion. File handles, no matter platform, is a limited resource. If a process just opens file and never closes them, it will soon run out of file handles.

A file can certainly be accessed by another process while it is opened by one. Some semantics depend on the operating system. For example, in Unix, two or more processes may open a file concurrently to write. Almost all systems will allow readonly access to multiple processes.
You open a file to connect the byte stream into the process. You close the file to disconnect the two. When you write into the file, the file may not get modified right away due to buffering. That implies that the memory buffer of the file is modified but the change is not immediately reflected into the file on disk. The OS will reflect the changes in disk when it has enough data for performance reason. When you close the file, the OS will flush out the changes into the file on disk.

If you "get" a resource, it is good practice to release it when you have done.
I think it's not a good thing to trust what an O.S. would do when the process end: it might free resources or not. Common O.S. does it: they close files, free allocated memory, …
But if it's not part of the standard of the language you use (e.g. if it implements garbage collectors), then you shouldn't rely on that common behaviour.
Otherwise, the risk is that your application would lock/eat resources on some systems, even if it ended.

In this way it is just a good practise. You are not obliged to close files at the end.
Imagine you'll write a genial but messy method. You'll forget about the caveats and later find out, that this method may be used somewhere else. Then you'll try to use this method maybe in a loop and you'll find out that your programm is unexpectedly crashing. You'll have to go deeper in the code and fix that. So why won't you make the function clean at the beginning?
Do you have something against (or are you afraid of) closing files?

Here is a good explanation: http://www.cs.bu.edu/teaching/c/file-io/intro/
When writing into an output file, the information is hold in a buffer and closing the file is a way to be sure that everything is posted to the file.

Related

Why isn't it a good idea to open the file before each writing and close it right after each writing?

Suppose you are writing a list of names in a file, each name being written in a write command. Why isn't it a good idea to open the file before each writing and close it right after each writing?
Intuitively, I would say that this aproach is a lot more time consuming than writing to a buffer and posteriorly writing to the file. But I'm sure there is a better explanation to that. Can someone enlight me?
Let's sum it up:
Opening and closing a file for every single write operation, when many such operations are planned, is a bad idea because:
It is terribly inefficient.
It requires an extra seek to the end of file in order to append.
It forfeits atomicity, meaning that the file may be renamed, moved, deleted, written to, or locked by someone else between the write operations.
Think of all the possible reasons why fopen (and related) might fail when you call it even once: the file doesn't exist, your account doesn't have permission to access or create it, another program is using the file exclusively, etc.
If you are repeatedly opening and closing the file for every write operation, this chance of failure increases quite a bit.
Also, there is an overhead associated with acquiring and releasing resources (e.g. files). You'd observe it more if you were acquiring and releasing write access to the file every single time you needed to write.

Removing bytes from File in (C) without creating new File

I have a file let's log. I need to remove some bytes let's n bytes from starting of file only. Issue is, this file referenced by some other file pointers in other programs and may these pointer write to this file log any time. I can't re-create new file otherwise file-pointer would malfunction(i am not sure about it too).
I tried to google it but all suggestion for only to re-write to new files.
Is there any solution for it?
I can suggest two options:
Ring bufferUse a memory mapped file as your logging medium, and use it as a ring buffer. You will need to manually manage where the last written byte is, and wrap around your ring appropriately as you step over the end of the ring. This way, your logging file stays a constant size, but you can't tail it like a regular file. Instead, you will need to write a special program that knows how to walk the ring buffer when you want to display the log.
Multiple number of small log filesUse some number of smaller log files that you log to, and remove the oldest file as the collection of files grow beyond the size of logs you want to maintain. If the most recent log file is always named the same, you can use the standard tail -F utility to follow the log contents perpetually. To avoid issues of multiple programs manipulating the same file, your logging code can send logs as messages to a single logging daemon.
So... you want to change the file, but you cannot. The reason you cannot is that other programs are using the file. In general terms, you appear to need to:
stop all the other programs messing with the file while you change it -- to chop now unwanted stuff off the front;
inform the other programs that you have changed it -- so they can re-establish their file-pointers.
I guess there must be a mechanism to allow the other programs to change the file without tripping over each other... so perhaps you can extend that ? [If all the other programs are children of the main program, then if the children all O_APPEND, you have a fighting chance of doing this, perhaps with the help of a file-lock or a semaphore (which may already exist ?). But if the programs are this intimately related, then #jxh has other, probably better, suggestions.]
But, if you cannot change the other programs in any way, you appear to be stuck, except...
...perhaps you could try 'sparse' files ? On (recent-ish) Linux (at least) you can fallocate() with FALLOC_FL_PUNCH_HOLE, to remove the stuff you don't want without affecting the other programs file-pointers. Of course, sooner or later the other programs may overflow the file-pointer, but that may be a more theoretical than practical issue.

Updating status of a growing file without close/reopen

Suppose I open() a file on a *nix machine and I hit eof. But when I open()ed it, it was still being written.
Is there a way to detect if the file has grown since open()-time? Or do I need to close() and re-open() to check?
Is there a way to read past the old eof without close()ing and reopen()ing?
Although not portable, on linux you can use the inotify system and watch for file close events for your file of interest. See http://www.ibm.com/developerworks/library/l-inotify/.
Note that just because the file was closed by the other process doesn't necessarily mean its size has increased. If you care about file size changes, use stat() to check for size changes before you close it. See How do you determine the size of a file in C?

Atomically write 64kB

I need to write something like 64 kB of data atomically in the middle of an existing file. That is all, or nothing should be written. How to achieve that in Linux/C?
I don't think it's possible, or at least there's not any interface that guarantees as part of its contract that the write would be atomic. In other words, if there is a way that's atomic right now, that's an implementation detail, and it's not safe to rely on it remaining that way. You probably need to find another solution to your problem.
If however you only have one writing process, and your goal is that other processes either see the full write or no write at all, you can just make the changes in a temporary copy of the file and then use rename to atomically replace it. Any reader that already had a file descriptor open to the old file will see the old contents; any reader opening it newly by name will see the new contents. Partial updates will never be seen by any reader.
There are a few approaches to modify file contents "atomically". While technically the modification itself is never truly atomic, there are ways to make it seem atomic to all other processes.
My favourite method in Linux is to take a write lease using fcntl(fd, F_SETLEASE, F_WRLCK). It will only succeed if fd is the only open descriptor to the file; that is, nobody else (not even this process) has the file open. Also, the file must be owned by the user running the process, or the process must run as root, or the process must have the CAP_LEASE capability, for the kernel to grant the lease.
When successful, the lease owner process gets a signal (SIGIO by default) whenever another process is opening or truncating the file. The opener will be blocked by the kernel for up to /proc/sys/fs/lease-break-time seconds (45 by default), or until the lease owner releases or downgrades the lease or closes the file, whichever is shorter. Thus, the lease owner has dozens of seconds to complete the "atomic" operation, without any other process being able to see the file contents.
There are a couple of wrinkles one needs to be aware of. One is the privileges or ownership required for the kernel to allow the lease. Another is the fact that the other party opening or truncating the file will only be delayed; the lease owner cannot replace (hardlink or rename) the file. (Well, it can, but the opener will always open the original file.) Also, renaming, hardlinking, and unlinking/deleting the file does not affect the file contents, and therefore are not affected at all by file leases.
Remember also that you need to handle the signal generated. You can use fcntl(fd, F_SETSIG, signum) to change the signal. I personally use a trivial signal handler -- one with an empty body -- to catch the signal, but there are other ways too.
A portable method to achieve semi-atomicity is to use a memory map using mmap(). The idea is to use memmove() or similar to replace the contents as quickly as possible, then use msync() to flush the changes to the actual storage medium.
If the memory map offset in the file is a multiple of the page size, the mapped pages reflect the page cache. That is, any other process reading the file, in any way -- mmap() or read() or their derivatives -- will immediately see the changes made by the memmove(). The msync() is only needed to make sure the changes are also stored on disk, in case of a system crash -- it is basically equivalent to fsync().
To avoid preemption (kernel interrupting the action due to the current timeslice being up) and page faults, I'd first read the mapped data to make sure the pages are in memory, and then call sched_yield(), before the memmove(). Reading the mapped data should fault the pages into page cache, and sched_yield() releases the rest of the timeslice, making it extremely likely that the memmove() is not interrupted by the kernel in any way. (If you do not make sure the pages are already faulted in, the kernel will likely interrupt the memmove() for each page separately. You won't see that in the process, but other processes see the modifications to occur in page-sized chunks.)
This is not exactly atomic, but it is practical: it does not give you any guarantees, only makes the race window very very short; therefore I call this semi-atomic.
Note that this method is compatible with file leases. One could try to take a write lease on the file, but fall back to leaseless memory mapping if the lease is not granted within some acceptable time period, say a second or two. I'd use timer_create() and timer_settime() to create the timeout timer, and the same empty-body signal handler to catch the SIGALRM signal; that way the fcntl() is interrupted (returns -1 with errno == EINTR) when the timeout occurs -- with the timer interval set to some small value (say 25000000 nanoseconds, or 0.025 seconds) so it repeats very often after that, interrupting syscalls if the initial interrupt is missed for any reason.
Most userspace applications create a copy of the original file, modify the contents of the copy, then replace the original file with the copy.
Each process that opens the file will only see complete changes, never a mix of old and new contents. However, anyone keeping the file open, will only see their original contents, and not be aware of any changes (unless they check themselves). Most text editors do check, but daemons and other processes do not bother.
Remember that in Linux, the file name and its contents are two separate things. You can open a file, unlink/remove it, and still keep reading and modifying the contents for as long as you have the file open.
There are other approaches, too. I do not want to suggest any specific approach, because the optimal one depends heavily on the circumstances: Do the other processes keep the file open, or do they always (re)open it before reading the contents? Is atomicity preferred or absolutely required? Is the data plain text, structured like XML, or binary?
EDITED TO ADD:
Please note that there are no ways to guarantee beforehand that the file will be successfully modified atomically. Not in theory, and not in practice.
You might encounter a write error with the disk full, for example. Or the drive might hiccup at just the wrong moment. I'm only listing three practical ways to make it seem atomic in typical use cases.
The reason write leases are my favourite is that I can always use fcntl(fd,F_GETLEASE,&ptr) to check whether the lease is still valid or not. If not, then the write was not atomic.
High system load is unlikely to cause the lease to be broken for a 64k write, if the same data has been read just prior (so that it will likely be in page cache). If the process has superuser privileges, you can use setpriority(PRIO_PROCESS,getpid(),-20) to temporarily raise the process priority to maximum while taking the file lease and modifying the file. If the data to be overwritten has just been read, it is extremely unlikely to be moved to swap; thus swapping should not occur, either.
In other words, while it is quite possible for the lease method to fail, in practice it is almost always successful -- even without the extra tricks mentioned in this addendum.
Personally, I simply check if the modification was not atomic, using the fcntl() call after the modification, prior to msync()/fsync() (making sure the data hits the disk in case a power outage occurs); that gives me an absolutely reliable, trivial method to check whether the modification was atomic or not.
For configuration files and other sensitive data, I too recommend the rename method. (Actually, I prefer the hardlink approach used for NFS-safe file locking, which amounts to the same thing but uses a temporary name to detect naming races.) However, it has the problem that any process keeping the file open will have to check and reopen the file, voluntarily, to see the changed contents.
Disk writes cannot be atomic without a layer of abstraction. You should keep a journal and revert if a write is interrupted.
As far as I know a write below the size of PIPE_BUF is atomic. However I never rely on this. If the programs that access the file are written by you, you can use flock() to achieve exclusive access. This system call sets a lock on the file and allows other processes that know about the lock to get access or not.

Is it ‘safe’ to remove() open file?

I think about adding possibility of using same the filename for both input and output file to my program, so that it will replace the input file.
As the processed file may be quite large, I think that best solution would to be first open the file, then remove it and create a new one, i.e. like that:
/* input == output in this case */
FILE *inf = fopen(input, "r");
remove(output);
FILE *outf = fopen(output, "w");
(of course, with error handling added)
I am aware that not all systems are going to allow me to remove open file and that's acceptable as long as remove() is going to fail in that case.
I am worried though if there isn't any system which will allow me to remove that open file and then fail to read its' contents.
C99 standard specifies behavior in that case as ‘implementation-defined’; SUS doesn't even mention the case.
What is your opinion/experience? Do I have to worry? Should I avoid such solutions?
EDIT: Please note this isn't supposed to be some mainline feature but rather ‘last resort’ in the case user specifies same filename as both input and output file.
EDIT: Ok, one more question then: is it possible that in this particular case the solution proposed by me is able to do more evil than just opening the output file write-only (i.e. like above but without the remove() call).
No, it's not safe. It may work on your file system, but fail on others. Or it may intermittently fail. It really depends on your operating system AND file system. For an in depth look at Solaris, see this article on file rotation.
Take a look at GNU sed's '--in-place' option. This option works by writing the output to a temporary file, and then copying over the original. This is the only safe, compatible method.
You should also consider that your program could fail at any time, due to a power outage or the process being killed. If this occurs, then your original file will be lost. Additionally, for file systems which do have reference counting, your not saving any space, over the temp file solution, as both files have to exist on disk until the input file is closed.
If the files are huge, and space is at premium, and developer time is cheap, you may be able to open a single for read/write, and ensure that your write pointer does not advance beyond your read pointer.
All systems that I'm aware of that let you remove open files implement some form of reference-counting for file nodes. So, removing a file removes the directory entry, but the file node itself still has one reference from open file handle. In such an implementation, removing a file obviously won't affect the ability to keep reading it, and I find it hard to imagine any other reasonable way to implement this behavior.
I've always got this to work on Linux/Unix. Never on Windows, OS/2, or (shudder) DOS. Any other platforms you are concerned about?
This behaviour actually is useful in using temporary diskspace - open the file for read/write, and immediately delete it. It gets cleaned up automatically on program exit (for any reason, including power-outage), and makes it much harder (but not impossible) for others to monitor it (/proc can give clues, if you have read access to that process).

Resources