Secure and efficient way to modify multiple files on POSIX systems? - filesystems

I have been following the discussion of the "bug" in EXT4 that causes files to be zeroed in a crash if one uses the "create temp file, write temp file, rename temp to target file" process. POSIX says that unless fsync() is called, you cannot be sure the data has been flushed to the hard disk.
Obviously doing:
0) get the file contents (read it or make it somehow)
1) open original file and truncate it
2) write new contents
3) close file
is not good even with fsync(), as the computer can crash during 2) or during fsync(), and you end up with a partially written file.
Usually it has been thought that this is pretty safe:
0) get the file contents (read it or make it somehow)
1) open temp file
2) write contents to temp file
3) close temp file
4) rename temp file to original file
Unfortunately it isn't. To make it safe on EXT4 you would need to do:
0) get the file contents (read it or make it somehow)
1) open temp file
2) write contents to temp file
3) fsync()
4) close temp file
5) rename temp file to original file
This would be safe: after a crash you should have either the new file contents or the old, never zeroed or partial contents. But if the application uses lots of files, an fsync() after every write would be slow.
So my question is, how to modify multiple files efficiently on a system where fsync() is required to be sure that changes have been saved to disk? And I really mean modifying many files, as in thousands of files. Modifying two files and doing fsync() after each wouldn't be too bad, but fsync() does slow things down when modifying multiple files.
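For concreteness, here is a minimal sketch of that safe per-file sequence (the helper name is illustrative; error and short-write handling are omitted):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Steps 1-5 above for a single file. */
int replace_file(const char *target, const char *tmp,
                 const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    write(fd, buf, len);        /* step 2 (check for short writes in real code) */
    fsync(fd);                  /* step 3: data reaches the disk before the rename */
    close(fd);                  /* step 4 */
    return rename(tmp, target); /* step 5: atomic replace */
}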
EDIT: changed the fsync()/close-temp-file steps to the correct order, added emphasis on writing many, many files.

The short answer is: the app layer is the wrong place to solve this. EXT4 must make sure that after I close the file, the data is written in a timely manner. As it is now, EXT4 "optimizes" this writing to be able to collect more write requests and burst them out in one go.
The problem is obvious: no matter what you do, you can't be sure that your data ends up on the disk. Calling fsync() manually only makes things worse: you basically get in the way of EXT4's optimization, slowing the whole system down.
OTOH, EXT4 has all the information necessary to make an educated guess when it is necessary to write data out to the disk. In this case, I rename the temp file to the name of an existing file. For EXT4, this means that it must either postpone the rename (so the data of the original file stays intact after a crash) or it must flush at once. Since it can't postpone the rename (the next process might want to see the new data), renaming implicitly means to flush and that flush must happen on the FS layer, not the app layer.
EXT4 might create a virtual copy of the filesystem which contains the changes while the disk is not modified (yet). But this doesn't affect the ultimate goal: an app can't know what optimizations the FS is going to make, and therefore the FS must make sure that it does its job.
This is a case where ruthless optimizations have gone too far and ruined the results. Golden rule: Optimization must never change the end result. If you can't maintain this, you must not optimize.
As long as Tso believes that it is more important to have a fast FS than one which behaves correctly, I suggest not upgrading to EXT4 and closing all bug reports about this as "works as designed by Tso".
[EDIT] Some more thoughts on this. You could use a database instead of the file. Let's ignore the resource waste for a moment. Can anyone guarantee that the files which the database uses won't become corrupted by a crash? Probably. The database can write the data and call fsync() every minute or so. But then, you could do the same:
while true; do sync; sleep 60; done
Again, the bug in the FS prevents this from working in every case. Otherwise, people wouldn't be so bothered by this bug.
You could use a background config daemon like the Windows registry. The daemon would write all configs in one big file. It could call fsync() after writing everything out. Problem solved ... for your configs. Now you need to do the same for everything else your apps write: Text documents, images, whatever. I mean almost any Unix process creates a file. This is the freaking basis of the whole Unix idea!
Clearly, this is not a viable path. So the answer remains: There is no solution on your side. Keep bothering Tso and the other FS developers until they fix their bugs.

My own answer would be to keep making the modifications to temp files and, after finishing writing them all, fsync() them and then rename them all, as in the sketch below.
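A rough sketch of that batching (fsync() takes a file descriptor, so it is still one call per file, but issuing all the syncs before any rename gives the kernel a chance to merge the flushes; the parameter layout is illustrative):

#include <stdio.h>
#include <unistd.h>

/* fds[] are open temp-file descriptors whose contents are already written;
   tmp[] and dst[] are the temp and final path names. */
void commit_all(int *fds, const char **tmp, const char **dst, int n)
{
    int i;
    for (i = 0; i < n; i++)
        fsync(fds[i]);          /* flush every temp file first */
    for (i = 0; i < n; i++) {
        close(fds[i]);
        rename(tmp[i], dst[i]); /* then publish them all */
    }
}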

You need to swap 3 & 4 in your last listing - fsync(fd) uses the file descriptor. And I don't see why that would be particularly costly - you want the data written to disk by the close() anyway, so the cost will be the same between what you want to happen and what will happen with fsync().
If the cost is too much (and your platform has it), fdatasync(2) avoids syncing the metadata, so it should be cheaper.
EDIT:
So I wrote some extremely hacky test code:
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

/* Write a small file and rename it over the target, without any sync. */
static void testBasic()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp.tmp", O_WRONLY | O_CREAT, 0644);
    write(fd, text, strlen(text));
    close(fd);
    rename("temp.tmp", "temp");
}

/* Same, but fsync() data and metadata before closing. */
static void testFsync()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp.tmp", O_WRONLY | O_CREAT, 0644);
    write(fd, text, strlen(text));
    fsync(fd);
    close(fd);
    rename("temp.tmp", "temp");
}

/* Same, but fdatasync() only the data before closing. */
static void testFdatasync()
{
    int fd;
    const char* text = "This is some text";

    fd = open("temp.tmp", O_WRONLY | O_CREAT, 0644);
    write(fd, text, strlen(text));
    fdatasync(fd);
    close(fd);
    rename("temp.tmp", "temp");
}

#define ITERATIONS 10000

static void testLoop(int type)
{
    struct timeval before;
    struct timeval after;
    long seconds;
    long usec;
    int i;

    gettimeofday(&before, NULL);
    if (type == 1)
        for (i = 0; i < ITERATIONS; i++)
            testBasic();
    if (type == 2)
        for (i = 0; i < ITERATIONS; i++)
            testFsync();
    if (type == 3)
        for (i = 0; i < ITERATIONS; i++)
            testFdatasync();
    gettimeofday(&after, NULL);

    seconds = (long)(after.tv_sec - before.tv_sec);
    usec = (long)(after.tv_usec - before.tv_usec);
    if (usec < 0) {
        seconds--;
        usec += 1000000;
    }
    printf("%ld.%06ld\n", seconds, usec);
}

int main()
{
    testLoop(1);
    testLoop(2);
    testLoop(3);
    return 0;
}
On my laptop that produces:
0.595782
6.338329
6.116894
which suggests that fsync() is roughly ten times more expensive, and that fdatasync() is slightly cheaper.
I guess the problem I see is that every application is going to think its data is important enough to fsync(), so the performance advantages of merging writes over a minute will be eliminated.

The issue you refer to is well researched; you should definitely read this:
https://www.academia.edu/9846821/Towards_Efficient_Portable_Application-Level_Consistency
fsync() can be skipped under safe-rename behavior, and the directory fsync() can be skipped under safe-new-file behavior. Both are implementation-specific and not guaranteed by POSIX.

how to (f)sync a directory under linux in c

I have a C application under Linux. I'm renaming some files with rename(...).
How can I ensure that the renaming is persisted to the underlying disk?
With a file I can do something like:
FILE * f = fopen("foo","w");
...
fflush(f);
fsync(fileno(f));
fclose(f);
How can I fsync (or similar) a directory after a rename() in c?
This is how you can do what you want:
#include <fcntl.h>
#include <unistd.h>

int fd = open("/path/to/dir", O_RDONLY);
fsync(fd);
Don't forget to close the fd file descriptor when no longer needed of course.
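Putting it together with rename(), a minimal sketch might look like this (rename_durable is a hypothetical helper; error handling is minimal):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Rename, then persist the rename by fsync()ing the parent directory. */
int rename_durable(const char *oldpath, const char *newpath, const char *dirpath)
{
    if (rename(oldpath, newpath) < 0)
        return -1;

    int dfd = open(dirpath, O_RDONLY);
    if (dfd < 0)
        return -1;

    int rc = fsync(dfd); /* flushes the directory entry for newpath */
    close(dfd);
    return rc;
}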
Contrary to some misconceptions, the atomicity of rename() does not guarantee the file will be persisted to disk. The atomicity guarantee only ensures that the metadata in the file system buffers is in a consistent state but not that it has been persisted to disk.
rename() is atomic (on Linux), so I don't think you need to worry about that.
Atomicity is typically guaranteed in operations involving filename handling; for example, for rename, the "specification requires that the action of the function be atomic" - that is, when renaming a file from the old name to the new one, under no circumstances should you ever see the two files at the same time.
A power outage in the middle of a rename() operation shall not leave the filesystem in a "weird" state, with the filename being unreachable because its metadata has been corrupted (i.e. either the operation is lost, or the operation is committed).
Source
So, I think you should only be worried about the error value.
If you really want to be safe, fsync() also flushes metadata (on Linux), so you could fsync() both the directory and the file you want to be sure are present on the disk.
According to the manual, on return of the function, the rename has either been done effectively (return 0) or an error occurred (return -1), and errno is set so you can check what went wrong.
If you want the system to flush the pending modifications of the filesystem containing the renamed file, you can do:
int fd = open(new_name, O_RDONLY);
syncfs(fd);
close(fd);

How to replace contents of a file with null/0s/1s in linux kernel 3.5 programming

How can I erase a file's contents completely to 0s or 1s in the linux kernel 3.5 given its file name (path to it) as the only input parameter?
I studied the structure of the unlink system call; after a lot of checking, it calls int vfs_unlink(struct inode *dir, struct dentry *dentry),
so from the *dentry how can I delete the file's contents? Or should I use *dentry at all?
EDIT
In response to the answers: I just want to overwrite the data. And I am not looking for a perfect result. I have progressed this far:
On one side: using vfs_unlink
I am confused at the following code:
error = security_inode_unlink(dir, dentry);
if (!error) {
    error = dir->i_op->unlink(dir, dentry);
    if (!error)
        dont_mount(dentry);
}
Where is the actual unlink going on here?
Another approach: I just went ahead with writing the data using the write system call.
I could not understand these lines in particular:
143     int size = file->f_path.dentry->d_inode->i_size;
144     loff_t offs = *off;
145     int count = min_t(size_t, bytes, PAGE_SIZE);

151     if (size) {
152         if (offs > size)
153             return 0;
154         if (offs + count > size)
155             count = size - offs;
156     }
157
158     temp = memdup_user(userbuf, count);

162     mutex_lock(&bb->mutex);
163
164     memcpy(bb->buffer, temp, count);
165
166     count = flush_write(file, bb->buffer, offs, count);
167     mutex_unlock(&bb->mutex);
168
169     if (count > 0)
170         *off = offs + count;
171
172     kfree(temp);
173     return count;
Can someone explain this to me, so that I can just write nulls to the file? My function may look like this:
static void write(struct file *file)
I need help with this. I am not asking for code, but I am lost currently.
Thanks
PS: I know perfectly well how to do this very simple thing in a user-level program. But that is not my task. I have to do it in kernel space, and I need help with that (and especially with understanding the code, as I am new to kernel programming).
There's no good answer here, and certainly not at the level of an individual file. On simple filesystems (FAT, ext2) it's generally sufficient to simply open the file and overwrite it. But that fails on almost all modern systems. Modern filesystems can almost always be configured to journal data changes (though this is rarely the default), and that data will live on in the journal until it happens to be overwritten in the future. Even if you know the filesystem has "forgotten" the data, the storage system may not -- consider the case of live backups, or of offlining an LVM volume. Or the driver: NAND drivers routinely remap blocks as they are written, leaving "stale" content in place. Or even the hardware itself: flash technologies like SSDs or MMC do exactly the same kind of block remapping, leaving your old data present for reads via JTAG, etc...
If you want to be sure that your data isn't on persistent storage, the only clean solution in the modern world is never to write it there in the first place. Cache it in RAM, or write it to a tmpfs (that isn't backed by swap!), or come up with some kind of encryption scheme that makes sure storage compromise won't make it available to an attacker...
I think you can do it easily with the write system call. The process is:
Using write, write NULL values to all the bytes of the given file.
Delete the file using unlink.
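For illustration only (the question asks for a kernel-space version), a userspace sketch of those two steps might look like this (helper name is mine; error handling is minimal):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Overwrite every byte of the file with zeros, then unlink it. */
int zero_and_unlink(const char *path)
{
    struct stat st;
    char buf[4096] = {0};
    off_t left;

    int fd = open(path, O_WRONLY);
    if (fd < 0 || fstat(fd, &st) < 0)
        return -1;

    for (left = st.st_size; left > 0; left -= sizeof buf)
        write(fd, buf, left < (off_t)sizeof buf ? (size_t)left : sizeof buf);

    fsync(fd); /* push the zeros out before removing the name */
    close(fd);
    return unlink(path);
}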
Others have discussed the actual situation quite well, so I won't go there (and therefore don't bother modding this answer up, please). I just wanted to describe the code that confuses the OP.
First, this code is a snippet from fs/namei.c:vfs_unlink():
error = security_inode_unlink(dir, dentry);
if (!error) {
error = dir->i_op->unlink(dir, dentry);
if (!error)
dont_mount(dentry);
}
The security_inode_unlink() call first checks whether current (the current userspace process) has the required rights to remove directory entry dentry from directory dir.
Since Linux supports a number of different filesystems, and each filesystem (may) have their own inode operations, those operations are stored as function pointers in the struct inode (dir), i_op member structure. (Remember that a directory is very similar to a file, in that it contains the metadata for all entries contained in that directory. So having the operations be specific to a directory makes a lot of sense.)
The dir->i_op->unlink(dir, dentry); simply calls the unlink() function for the directory dir. (Note that fs/namei.c:vfs_unlink() does check that dir->i_op->unlink is non-NULL before the snippet shown above.)
The final bit, dont_mount(dentry);, is a helper function defined in include/linux/dcache.h, and simply marks the dentry as "un-mountable", not accessible. (Directory entries are cached in the dcache. Instead of having to ream the entry out of it, possibly a slow operation, this just marks it invalid. Sometime soon in the future, all stale dcache entries will be removed at once. It's very efficient this way, and quite simple, too, if you think about it.)
Sorry. I cannot help myself; I must add my spoon to the soup.
The problem is rather easy to solve if you break it into manageable steps. If we assume this is for experimentation and learning only, not for security purposes -- for security you need scrubbing, not just clearing the contents -- then you need to do open()+lseek()+ftruncate()+close(), just in-kernel, right?
You don't want to use write(), because the filesystem-specific write functions require a userspace buffer. You'd need to allocate one (say, one page in length -- look at mm/mmap.c:sys_old_mmap(), which calls the arch-specific sys_mmap_pgoff()), fill it with the data, write it into the file in a loop, then release the userspace buffer (using mm/mmap.c:vm_munmap()). Such a busy kernel loop is a big no-no; you'd need to move it into a worker thread.
No, it is much better to simply find out the length of the file, then truncate it to zero length, and re-truncate it to the desired length. That is the same as if you wrote zeros into it. Replacing the contents with anything but zero is just too much work, IMO.
For open(), call filp_open(). Remember, you only need the resulting struct file *myfile, not a file descriptor, to manipulate a file in kernelspace.
For close(), you just call filp_close(myfile, current->files). If you omit the file descriptor maintenance stuff for the current process, that is really the only thing left from fs/open.c:sys_close().
For lseek(), you can just call loff_t file_length = vfs_llseek(myfile, 0, SEEK_END);. If you look at fs/read_write.c:sys_lseek(), that's what it does, after all.
For ftruncate(), look at fs/open.c:do_sys_ftruncate(), but remember to omit the file descriptor stuff. You already have the struct file *myfile, corresponding to the struct file *file in that function. Oh, and do remember you need to do it twice: first to zero length, then to file_length you obtained above.
Combining all the above seems to be a pretty feasible way to accomplish this -- assuming we all agree it is only useful for learning and experimentation, nothing practical.
Note that I did not really spend the time necessary to go through all four syscalls (sys_open(), sys_lseek(), sys_ftruncate(), sys_close()) to check whether there are any locking issues or race conditions I overlooked. You really should do that, because otherwise your experiment may oops your kernel (typical for race conditions). Do what I did, and start at the syscalls, then look at the functions -- and especially their comments; they usually mention any locking requirements for calling that function -- to find out.
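Pulling the steps together, a rough, untested sketch against a 3.5-era kernel might look like the following. The calls (filp_open(), vfs_llseek(), do_truncate(), filp_close()) are the ones named above, but treat the exact signatures and locking rules as assumptions to verify against your tree:

#include <linux/fs.h>
#include <linux/file.h>
#include <linux/err.h>
#include <linux/sched.h>

/* Untested sketch: truncate to zero, then re-extend to the original length,
   so the contents read back as zeros. */
static int zero_file_contents(const char *path)
{
    struct file *myfile;
    loff_t file_length;
    int error;

    myfile = filp_open(path, O_RDWR, 0);
    if (IS_ERR(myfile))
        return PTR_ERR(myfile);

    /* the in-kernel equivalent of lseek(fd, 0, SEEK_END) */
    file_length = vfs_llseek(myfile, 0, SEEK_END);

    error = do_truncate(myfile->f_path.dentry, 0, 0, myfile);
    if (!error)
        error = do_truncate(myfile->f_path.dentry, file_length, 0, myfile);

    filp_close(myfile, current->files);
    return error;
}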

Using fopen twice on the same file with different access flags

I'm cleaning up some pretty complicated code that I didn't write and I'm looking for a way to touch the code as little as possible, so don't flame me for what may appear to be a newb question:
I have a library that works with an external data file that can be written to or read from, but generally all the writes happen at once, as do all the reads. Internally, the FILE* is fopened with "r+b", and the code appears to properly call fflush when switching between reads and writes. When the data file is in a place where the user has RW permissions, it works just as expected; however, there are times when the data file may be in a location where the user has read-only permissions. Because of this, the fopen(..."r+b") fails and returns a NULL file pointer, and bad things happen. It's totally reasonable for someone to have this data file in a read-only partition; they're not required to update the file and should be able to use it in a read-only situation.
My question is instead of doing
FILE* pFile=fopen("filename","r+b");
could I edit the code and do something like
FILE* pRead=fopen("filename","rb");
FILE* pWrite=fopen("filename","r+b");
Then in the code that reads from the file, just use pRead, and in the code that writes to the file, use pWrite. This way I could do something like this:
int UpdateTheFile()
{
    if (!pWrite) return 0; // we know that we shouldn't even try to write
    // change all the existing update code to use pWrite instead of pFile
    return 1;
}
int ReadFromTheFile()
{
    if (!pRead) return 0;
    ...
    return 1;
}
It seems wrong to me to have two file pointers to the same file, but since the code is already "correct" in its ability to flush between reads and writes now, I'm guessing that things may be kept in sync. Also, it's guaranteed that only 1 thread will be accessing this file at a time, so I don't need to worry about concurrency issues here.
Is this a really bad idea? Should I properly switch between read-only and read-write in the appropriate functions with an fclose/fopen pair, or can I get away with this as a "quick fix"?
int file_is_writable = 1;
FILE *pFile = fopen("filename", "r+b");
if (!pFile) {
    pFile = fopen("filename", "rb");
    file_is_writable = 0;
    /* I highly suggest you check for open failure here and do something sane */
}
Then check file_is_writable before updates.

How to know if a file is being copied?

I am currently trying to check whether the copy of a file from one directory to another has finished.
I would like to know if the target file is still being copied.
So I would like to get the number of file descriptors opened on this file.
I use the C language and can't really find a way to solve this problem.
If you have control of it, I would recommend using the copy-move idiom on the program doing the copying:
cp file1 otherdir/.file1.tmp
mv otherdir/.file1.tmp otherdir/file1
The mv just changes some filesystem entries and is atomic and very fast compared to the copy.
If you're able to open the file for writing, there's a good chance that the OS has finished the copy and has released its lock on it. Different operating systems may behave differently for this, however.
Another approach is to open both the source and destination files for reading and compare their sizes. If they're of identical size, the copy has very likely finished. You can use fseek() and ftell() to determine the size of a file in C:
fseek(fp, 0L, SEEK_END);
sz = ftell(fp);
In Linux, try the lsof command, which lists all of the open files on your system.
edit 1: The only C language feature that comes to mind is the fstat function. You might be able to use that with the struct's st_mtime (last modification time) field - once that value stops changing (for, say, a period of 10 seconds), you could assume that the file copy operation has stopped.
edit 2: Also, on Linux, you could traverse /proc/[pid]/fd to see which files are open. The files in there are symlinks, but C's readlink() function can tell you the path each one points to, so you can see whether your file is still open. Using getpid(), you would know the process ID of your program (if you are doing a file copy from within your program) to know where to look in /proc.
I think your basic mistake is trying to synchronize a C program with a shell tool/external program that's not intended for synchronization. If you have some degree of control over the program/script doing the copying, you should modify it to perform advisory locking of some sort (preferably fcntl-based) on the target file. Then your other program can simply block on acquiring the lock.
If you don't have any control over the program performing the copy, the only solutions depend on non-portable hacks like lsof or Linux inotify API.
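For instance, if you can make the copier take an exclusive fcntl() write lock while it copies, the waiting side reduces to something like this (wait_for_copy is a hypothetical name; error handling is minimal):

#include <fcntl.h>
#include <unistd.h>

/* Blocks until the writer drops its F_WRLCK, then returns a readable fd. */
int wait_for_copy(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct flock fl = {0};
    fl.l_type = F_RDLCK;    /* we only need to read once the writer is done */
    fl.l_whence = SEEK_SET; /* l_start = l_len = 0 covers the whole file */

    if (fcntl(fd, F_SETLKW, &fl) < 0) { /* F_SETLKW blocks on conflict */
        close(fd);
        return -1;
    }
    return fd; /* caller reads, then closes (which releases the lock) */
}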
(This answer makes the big, big assumption that this will be running on Linux.)
The C source code of lsof, a tool that tells which programs currently have an open file descriptor to a specific file, is freely available. However, just to warn you, I couldn't make any sense out of it. There are references to reading kernel memory, so to me it's either voodoo or black magic.
That said, nothing prevents you from running lsof from your own program. Running third-party programs from your own program is normally something you try to avoid for several reasons, like security (if a rogue user replaces lsof with a malicious program, it will run with your program's privileges, with potentially catastrophic consequences). But after inspecting the lsof source code, I came to the conclusion that there's no public API to determine which program has which file open. If you're not afraid of people changing programs in /usr/sbin, you might consider this.
#define _GNU_SOURCE /* for asprintf() */
#include <stdio.h>
#include <stdlib.h>

int isOpen(const char* file)
{
    char* command;

    // BE AWARE THAT THIS WILL NOT WORK IF THE FILE NAME CONTAINS A DOUBLE QUOTE
    // OR IF IT CAN SOMEHOW BE ALTERED THROUGH SHELL EXPANSION
    // you should either try to fix it yourself, or use a function of the `exec`
    // family that won't trigger shell expansion.
    // It would be an EXTREMELY BAD idea to call `lsof` without an absolute path
    // since it could result in another program being run. If this is not where
    // `lsof` resides on your system, change it to the appropriate absolute path.
    asprintf(&command, "/usr/sbin/lsof \"%s\"", file);
    int result = system(command);
    free(command);
    return result;
}
If you also need to know which program has your file open (presumably cp?), you can use popen to read the output of lsof in a similar fashion. popen descriptors behave like fopen descriptors, so all you need to do is fread them and see if you can find your program's name. On my machine, lsof output looks like this:
$ lsof document.pdf
COMMAND  PID  USER   FD   TYPE  DEVICE  SIZE/OFF  NODE     NAME
SomeApp  873  felix  txt  REG   14,3    303260    5165763  document.pdf
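A sketch of that popen() approach (the lsof path and buffer sizes are assumptions, and the same quoting caveats as above apply):

#include <stdio.h>
#include <string.h>

/* Scan lsof output for lines starting with a given command name, e.g. "cp". */
int isOpenBy(const char *file, const char *prog)
{
    char command[512], line[512];
    int found = 0;

    snprintf(command, sizeof command, "/usr/sbin/lsof \"%s\"", file);
    FILE *p = popen(command, "r");
    if (!p)
        return -1;

    while (fgets(line, sizeof line, p))
        if (strncmp(line, prog, strlen(prog)) == 0)
            found = 1;

    pclose(p);
    return found;
}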
As poundifdef mentioned, the fstat() function can give you the current modification time. But fstat also gives you the size of the file.
Back in the dim dark ages of C when I was monitoring files being copied by various programs I had no control over I always:
Waited until the target file size was >= the source size, and
Waited until the target modification time was at least N seconds older than the current time, N being a number such as 5, set larger if experience showed that was necessary. Yes, 5 seconds seems extreme, but it is safe.
If you don't know what the target file is, then the only real choice you have is #2, but use a larger N to allow for the worst-case network and local CPU delays, with a healthy safety factor.
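A minimal sketch of that heuristic (the function name and threshold are illustrative):

#include <sys/stat.h>
#include <time.h>

/* Heuristic: consider the copy done once the target is at least as large as
   the source and has not been modified for n_seconds. */
int copy_seems_done(const char *src, const char *dst, time_t n_seconds)
{
    struct stat s, d;

    if (stat(src, &s) < 0 || stat(dst, &d) < 0)
        return 0;

    return d.st_size >= s.st_size &&
           time(NULL) - d.st_mtime >= n_seconds;
}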
Using the Boost libraries will solve the issue:
boost::filesystem::fstream fileStream(filePath, std::ios_base::in | std::ios_base::binary);
if (fileStream.is_open()) {
    // not getting copied
} else {
    // wait, the file is getting copied
}

Probing for filesystem block size

I'm going to first admit that this is for a class project, since it will be pretty obvious. We are supposed to do reads to probe for the block size of the filesystem. My problem is that the time taken to do this appears to be linearly increasing, with no steps like I would expect.
I am timing the read like this:
double startTime = getticks();
read = fread(x, 1, toRead, fp);
double endTime = getticks();
where getticks() uses the rdtsc instruction. I am afraid there is caching/prefetching that is causing the reads to take no time during the fread(). I tried creating a random file between each execution of my program, but that is not alleviating my problem.
What is the best way to accurately measure the time taken for a read from disk? I am pretty sure my block size is 4096, but how can I get data to support that?
The usual way of determining filesystem block size is to ask the filesystem what its blocksize is.
#include <sys/statvfs.h>
#include <stdio.h>

int main() {
    struct statvfs fs_stat;
    statvfs(".", &fs_stat);
    printf("%lu\n", fs_stat.f_bsize);
}
But if you really want, open(…,…|O_DIRECT) or posix_fadvise(…,…,…,POSIX_FADV_DONTNEED) will try to let you bypass the kernel's buffer cache (not guaranteed).
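For example, a sketch of the posix_fadvise() route (the kernel may still ignore the advice, hence "not guaranteed"; the helper name is mine):

#define _POSIX_C_SOURCE 200112L /* for posix_fadvise() */
#include <fcntl.h>
#include <unistd.h>

/* Try to evict the file's cached pages so the next read hits the disk. */
int drop_cache(int fd)
{
    fsync(fd); /* dirty pages would not be dropped, so flush them first */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED); /* 0,0 = whole file */
}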
You may want to use the system calls (open(), read(), write(), ...) directly to reduce the impact of the buffering done by the FILE* layer.
Also, you may want to use synchronous I/O somehow. One way is opening the file with the O_SYNC flag set (or O_DIRECT, as per ephemient's reply).
Quoting the Linux open(2) manual page:
O_SYNC The file is opened for synchronous I/O. Any write(2)s on the
resulting file descriptor will block the calling process until
the data has been physically written to the underlying hardware.
But see NOTES below.
Another option would be mounting the filesystem with -o sync (see mount(8)) or setting the S attribute on the file using the chattr(1) command.
