Atomic file replacement in Clojure

I have an app that updates a disk file, but I want to make sure, as much as possible, that the previous version of the file doesn't get corrupted.
The most straightforward way to update a file, of course, is to simply write:
(spit "myfile.txt" mystring)
However, if the PC (or java process) dies in the middle of writing, this has a small chance of corrupting the file.
A better solution is probably to write:
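(require '[clojure.java.io :as io])
(do (spit "tempfile" mystring)
    (.renameTo (io/file "tempfile") (io/file "myfile.txt"))
    ;; after a successful rename the temp file no longer exists, so delete silently
    (io/delete-file "tempfile" true))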
This uses the Java file rename function, which I gather is typically atomic when performed on a single storage device.
Do any Clojurians with some deeper knowledge of Clojure file IO have any advice on whether this is the best approach, or if there's a better way to minimize the risk of file corruption when updating a disk file?
Thanks!

This is not specific to Clojure; a temp-rename-delete scenario does not guarantee an atomic replace under the POSIX standard. This is due to the possibility of write reordering - the rename might get to the physical disk before the temp writes do, so when a power failure happens within this time window, data loss happens. This is not a purely theoretical possibility:
http://en.wikipedia.org/wiki/Ext4#Delayed_allocation_and_potential_data_loss
You need an fsync() after writing the temp file. This question discusses calling fsync() from Java.
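The write, fsync, rename sequence described above can be sketched as follows. This is a minimal illustration in Python rather than Clojure, and the function name atomic_write_text and the ".tmp-" prefix are assumptions made for the example. It also fsyncs the containing directory on POSIX systems as a best effort, which goes slightly beyond what the answer asks for.

```python
import os
import tempfile

def atomic_write_text(path, text):
    """Replace `path` with `text` via an fsynced temp file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data to disk before the rename
        os.replace(tmp_path, path)  # replaces the target in a single rename step
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Best effort (POSIX only): fsync the directory so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A reader of "myfile.txt" should then see either the old contents or the new contents, never a half-written file, provided the filesystem honours fsync.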

The example you give is to my understanding completely idiomatic and correct. I would just do a delete on tempfile first in case the previous run failed and add some error detection.

Based on the feedback from your comment, I would recommend that you avoid trying to roll your own file-backed database, based on a couple of observations:
Persistent storage of data structures in the filesystem that is consistent in the case of crashes is a tough problem to solve. Lots of really smart people have spent lots of time thinking about this problem.
Small databases tend to grow into big databases and collect extra features over time. If you roll your own, you'll find yourself reinventing the wheel over the course of the project.
If you're truly interested in maintaining consistency of your application's data in the event of a crash, then I'd recommend you look at embedding one of the many freely available databases; you could start by looking at Berkeley DB, HyperSQL, or, for one with a more Clojure flavor, Datomic.

Related

Is there a way to read HD data past EOF?

Is there a way to read a file's data but continue reading the data on the hard drive past the end of file? For normal file I/O I could just use fread(), but, obviously, that will only read to the end of the file. And it might be beneficial if I add that I need this on a Windows computer.
All my Googling for a way to do this is instead coming up with results about unrelated topics concerning EOF, such as people having problems with normal I/O.
My reasoning for this is that I accidentally deleted part of the text in a text file I was working on, and it was an entire day's worth of work. I Googled up a bunch of file recovery stuff, but it all seems to be about recovering deleted files, whereas my problem is that the file is still there but without some of its information, and I'm hoping some of that data still exists directly after the currently marked end of file and is neither fragmented elsewhere nor already claimed or otherwise overwritten. Since I can't find a program that helps with this specifically, I'm hoping I can quickly make something up for it (I understand that, depending on what is involved, this might not be as feasible as just redoing the work, but I'm hoping that's not the case).
As far as I can foresee, though I might not be correct (not sure, which is why I'm asking for help), there are 3 possibilities.
Worst of the three: I have to look up Windows API functions that allow direct access to the entire hard drive (similar to its functions for memory, perhaps? those I have experience with) and scan the entire thing for the data that I still have access to from the file and then just continue looking at what's after it.
Second: I can get a pointer to the file, then I still have to get raw access to HD but at least have a pointer to the file in it?
Best of the three: Just open the file for write access, seek to the end, then write a ways past EOF to claim more space, but first hope that Windows won't clean the data before it hands it over to me so that I get garbage data which was the previous data in that spot which would actually be what I'm looking for? This would be awesome if it were that simple, but I'm afraid to test it out because I'd lose the data if it failed, so hopefully someone else already knows. The PC in question is running Vista Home Premium if that matters to anyone that knows the gory details of Windows.
Do any of those three seem plausible? Whether yea or nay, I'm also open to (and eager for) other suggestions, especially those which are better than my silly ideas, and especially if they come with direction toward specific functions to use to get the job done.
Also, if anyone else actually has heard of a recovery program that doesn't just recover deleted files but which would actually work for a situation like this, and which is free and trustworthy, that works too.
Thanks in advance for any assistance.
You should get a utility for scanning the free space of a hard drive and recovering data from it, for example PhotoRec or foremost. Note however that if you've been using the machine much at all (even web browsing, which will create files in your cache), the data has likely already been overwritten. Do not save your recovery tools on the same hard drive, or even use the same PC to download them; get them from another computer and save them to a USB device, then run them from that device.
As for the conceptual content of your question, files are abstract objects. There is no such thing as data "past eof" except (depending on the implementation) perhaps up to the next multiple of the filesystem/disk "blocksize". Also it's possible (very likely) that your editor "saved" the file by truncating it and writing everything newly from the beginning, meaning there's not necessarily any correspondence between the old and new storage.
Your question doesn't make a lot of sense -- by definition there is nothing in the file after the EOF. By your further description, it appears that you want to read whatever happens to be on the disk after the last byte that is used by the file, which might be random garbage (unused space) or might be some other file. But in either case, this isn't 'data after the EOF'; it's just data on the disk that's not part of the file. It's even possible that it might be some other part of the same file, if the filesystem happens to lay out its data that way -- some filesystems scatter blocks in seemingly random ways across the disk, and figuring out which bytes belong to which files requires understanding the filesystem metadata.

Performance issues in writing to large files?

I have recently been involved in handling the console logs for a server, and I was wondering, out of curiosity, whether there is a performance issue in writing to a large file as compared to a small one.
For instance, is it a good idea to keep log files small instead of letting them grow bulky? I was not able to argue much in favor of either approach.
There might be problems in reading or searching in the file, but right now I am more interested in knowing if writing can be affected in any way.
Looking for expert advice.
Edit:
The way I thought about it was that the OS only has to open a file handle and push the data to the file system. There is little correlation to the file size, since you keep appending the data to the end of the file, and whenever a block of data is full, the OS will assign another block to the file. As I said earlier, there can be problems in reading and searching because of fragmentation of file blocks, but I could not find much difference while writing.
As a general rule, there should be no practical difference between appending a block to a small file (or writing the first block which is appending to a zero-length file) or appending a block to a large file.
There are special cases (like trying to fault in a triple-indirect block or the initial open having to read all mapping information) which could add additional I/Os, but the steady state should be the same.
I'd be more worried about the manageability of having huge files: slow to backup, slow to copy, slow to view, etc.
I am not an expert, but I will try to answer anyway.
Larger files may take longer to write to disk, and in fact it is not a programming issue. It is a file system issue. Perhaps there are file systems which do not have such issues, but on Windows large files cannot always be written in one piece, so fragmenting them will take time (for the simple reason that the head will have to move to some other cylinder). Assuming that we are talking about "classic" hard drives...
If you want advice, I would go for writing smaller files and rotating them either daily or when they hit some size (or both, actually). That is a rather common approach I have seen in enterprise-grade products.
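Size-based rotation of the kind suggested above is available in many logging libraries. As one illustration, here is a minimal sketch using Python's standard RotatingFileHandler. The logger name, the size limit and the backup count are assumptions chosen for the example, not values taken from the question.

```python
import logging
from logging.handlers import RotatingFileHandler

def make_rotating_logger(path, max_bytes=1_000_000, backups=5):
    """Return a logger that starts a new file once `path` would exceed max_bytes."""
    logger = logging.getLogger("server-console")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Older files are renamed to path.1, path.2, ... up to `backups` files.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

For daily rotation, the same module offers TimedRotatingFileHandler, which could be swapped in here.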

One large file or multiple small files?

I have an application (currently written in Python as we iron out the specifics but eventually it will be written in C) that makes use of individual records stored in plain text files. We can't use a database and new records will need to be manually added regularly.
My question is this: would it be faster to have a single file (500 KB-1 MB) and have my application open it, loop through it, find the record and close it, OR would it be faster to have the records separated and named using some appropriate convention so that the application could simply loop over filenames to find the data it needs?
I know my question is quite general, so pointers to any good articles on the topic are appreciated as much as suggestions.
Thanks very much in advance for your time,
Dan
Essentially your second approach is an index - it's just that you're building your index in the filesystem itself. There's nothing inherently wrong with this, and as long as you arrange things so that you don't get too many files in the one directory, it will be plenty fast.
You can achieve the "don't put too many files in the one directory" goal by using multiple levels of directories - for example, the record with key FOOBAR might be stored in data/F/FO/FOOBAR rather than just data/FOOBAR.
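The multi-level layout can be computed directly from the key, so no directory scan is needed. A minimal sketch follows. The function name record_path and the default of two levels are assumptions for illustration.

```python
from pathlib import PurePosixPath

def record_path(key, root="data", levels=2):
    """Map a record key to a nested path, e.g. FOOBAR -> data/F/FO/FOOBAR."""
    if not key:
        raise ValueError("key must be non-empty")
    # One directory per prefix length: "F", then "FO", and so on.
    prefixes = [key[:i] for i in range(1, levels + 1)]
    return PurePosixPath(root, *prefixes, key)
```

With the defaults, str(record_path("FOOBAR")) gives "data/F/FO/FOOBAR", matching the example in the answer.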
Alternatively, you can make the single-large-file perform as well by building an index file, that contains a (sorted) list of key-offset pairs. Where the directories-as-index approach falls down is when you want to search on key different from the one you used to create the filenames - if you've used an index file, then you can just create a second index for this situation.
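A sorted list of key-offset pairs can be sketched as below. The record format (one "key<TAB>value" record per line) is an assumption made for the example, and the index is held in memory for brevity; it could equally be written out to a separate index file as the answer suggests.

```python
import bisect

def build_index(data_path):
    """Return a sorted list of (key, offset) pairs for a 'key<TAB>value' line file."""
    index = []
    with open(data_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.split(b"\t", 1)[0].rstrip(b"\r\n")
            index.append((key, offset))
    index.sort()
    return index

def lookup(data_path, index, key):
    """Binary-search the index, then seek straight to the matching record."""
    i = bisect.bisect_left(index, (key,))
    if i == len(index) or index[i][0] != key:
        return None
    with open(data_path, "rb") as f:
        f.seek(index[i][1])
        return f.readline().rstrip(b"\r\n").split(b"\t", 1)[1]
```

A second index on a different field would be built the same way, by sorting on that field instead of the key.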
You may want to reconsider the "we can't use a database" restriction, since you are effectively just building your own database anyway.
Reading a directory is in general more costly than reading a file. But if you can find the file you want without reading the directory (i.e. not "loop over filenames" but "construct a file name") thanks to your naming convention, it may be beneficial to split your database.
Given your data is 1 MB, I would even consider storing it entirely in memory.
To give you some clue about your question, I'd consider that having one single big file means that your application is doing the management of the lines. Having multiple small files means relying on the system and the filesystem to manage the data. The latter can be quite slow though, because it involves system calls for all your operations.
Opening and closing files in C would take a lot of time.
i.e. if you have 500 files of 2 KB each and you process them all, 1000 additional operations would be added to your application (500 file opens and 500 file closes), while having only 1 file of 1 MB would save you those 1000 additional operations. (That is purely my personal opinion.)
Generally it's better to have multiple small files. Keeps memory usage low and performance is much better when searching through it.
But it depends on the amount of operations you'll need, because filesystem calls are much more expensive when compared to memory storage for instance.
This all depends on your file system, block size and memory cache among others.
As usual, measure and find out if this is a real problem since premature optimization should be avoided. It may be that using one file vs many small files does not matter much for performance in practice and that the choice should be based on clarity and maintainability instead.
(What I can say for certain is that you should not resort to linear file search, use a naming convention to pinpoint the file in O(1) time instead).
The general trade-off is that one big file can be more difficult to update, but lots of little files are fiddly. If you use multiple files and end up with a lot of them, traversing a directory with a million files in it can get very slow. If possible, break the files into some sort of grouping so they can be put into separate directories and "keyed". I have an application that requires the creation of lots of little PDF documents for all users of the system. If we put them all in one directory it would be a nightmare, but having a directory per user id makes it much more manageable.
Why can't you use a DB, I'm curious? I respect your preference, but just want to make sure it's for the right reason.
Not all DBs require a server to connect to or complex deployment. SQLite, for instance, can be easily embedded in your application. Python already has it built-in, and it's very easy to connect with C code (SQLite itself is written in C and its primary API is for C). SQLite manages a feature-complete DB in a single file on the disk, where you can create multiple tables and use all the other nice features of a DB.
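To show how little setup an embedded database needs, here is a minimal sketch using Python's built-in sqlite3 module. The table name records and its two columns are assumptions chosen for illustration.

```python
import sqlite3

def open_store(path):
    """Open (or create) a single-file SQLite database holding keyed records."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (key TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn

def put(conn, key, body):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO records (key, body) VALUES (?, ?)", (key, body)
        )

def get(conn, key):
    row = conn.execute("SELECT body FROM records WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None
```

The primary key gives an indexed lookup, so no hand-built index file or directory layout is needed.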

Is it possible to delete both ends of a large file without copying?

I would like to know if it is possible, using Windows and C++, to take a large video file (several gigabytes in length) and delete the first and last few hundred megabytes of it “in-place”.
The traditional approach of copying the useful data to a new file often takes upwards of 20 minutes of seemingly needless copying.
Is there anything clever that can be done low-level with the disk to make this happen?
Sure, it's possible in theory. But if your filesystem is NTFS, be prepared to spend a few months learning about all the data structures that you'll need to update. (All of which are officially undocumented BTW.)
Also, you'll need to either
Somehow unmount the volume and make your changes then; or
Learn how to write a kernel filesystem driver, buy a license from MS, develop the driver and use it to make changes to a live filesystem.
It's a bit easier if your filesystem is something simpler like FAT32. But either way: in short, it might be possible, but even if it is it'll take years out of your life. My advice: don't bother.
Instead, look at other ways you could solve the problem: e.g. by using an avisynth script to serve just the frames from the region you are interested in.
Are you hoping to just fiddle around with sector addresses in the directory entry? It's virtually inconceivable that plan would work.
First of all, it would require that the amount of data you wish to delete be exactly a sector size. That's not very likely considering that there is probably some header data at the very start that must remain there.
Even if it meets those requirements, it would take a low-level modification, which Windows tries very hard to prevent you from doing.
Maybe your file format allows you to 'skip' bytes, so that you could simply write over (e.g. with memory mapping) the necessary parts. This would of course still use more disk space than necessary.
Yes, you can do this, on NTFS.
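The end you remove by truncating the file (in Win32, SetEndOfFile after moving the file pointer to the new length).
For the beginning, or any other large consecutive region of the file, you mark the file "sparse" (FSCTL_SET_SPARSE) and zero the region (FSCTL_SET_ZERO_DATA), which allows the file system to reclaim those clusters.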
Note that this won't actually change the offset of the data relative to the beginning of the file, it only prevents the filesystem from wasting space storing unneeded data.
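Removing the tail is the portable half of this answer. The sketch below shows in-place truncation in Python, which works without copying the data that is kept. The function name drop_tail is an assumption, and the sparse-file half is not shown because it needs Windows-specific control codes.

```python
import os

def drop_tail(path, bytes_to_remove):
    """Shrink the file in place by cutting `bytes_to_remove` bytes off its end."""
    size = os.path.getsize(path)
    new_size = max(0, size - bytes_to_remove)
    with open(path, "r+b") as f:
        f.truncate(new_size)  # no data before new_size is read or rewritten
    return new_size
```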
Even if low level filesystem operations were easy, editing a video file is not simply a matter of deleting unwanted megabytes. You still do have to consider concepts such as compression, frames, audio and video muxing, media file containers, and many others...
Your best solution is to simply accept your idle twenty minutes.

Truncate file at front

A problem I was working on recently got me to wishing that I could lop off the front of a file. Kind of like a “truncate at front,” if you will. Truncating a file at the back end is a common operation–something we do without even thinking much about it. But lopping off the front of a file? Sounds ridiculous at first, but only because we’ve been trained to think that it’s impossible. But a lop operation could be useful in some situations.
A simple example (certainly not the only or necessarily the best example) is a FIFO queue. You’re adding new items to the end of the file and pulling items out of the file from the front. The file grows over time and there’s a huge empty space at the front. With current file systems, there are several ways around this problem:
As each item is removed, copy the remaining items up to replace it, and truncate the file. Although it works, this solution is very expensive time-wise.
Monitor the size of the empty space at the front, and when it reaches a particular size or percentage of the entire file size, move everything up and truncate the file. This is much more efficient than the previous solution, but still costs time when items are moved in the file.
Implement a circular queue in the file, adding new items to the hole at the front of the file as items are removed. This can be quite efficient, especially if you don’t mind the possibility of things getting out of order in the queue. If you do care about order, there’s the potential of having to move items around. But in general, a circular queue is pretty easy to implement and manages disk space well.
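The "move everything up and truncate" step shared by the first two workarounds can be sketched as follows. This is an illustration in Python, and the function name lop_front and the chunk size are assumptions. It copies front to back within the same file, which is safe because each chunk is written to a position before the one it was read from.

```python
import os

def lop_front(path, head, chunk_size=1 << 20):
    """Remove the first `head` bytes by sliding the remainder down to offset 0."""
    size = os.path.getsize(path)
    if head <= 0:
        return size
    with open(path, "r+b") as f:
        read_pos, write_pos = head, 0
        while read_pos < size:
            f.seek(read_pos)
            chunk = f.read(min(chunk_size, size - read_pos))
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        f.truncate(write_pos)  # drop the now-duplicated bytes at the end
    return write_pos
```

The cost is proportional to the number of bytes kept, which is the expense a true lop operation would avoid.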
But if there was a lop operation, removing an item from the queue would be as easy as updating the beginning-of-file marker. As easy, in fact, as truncating a file. Why, then, is there no such operation?
I understand a bit about file systems implementation, and don't see any particular reason this would be difficult. It looks to me like all it would require is another word (dword, perhaps?) per allocation entry to say where the file starts within the block. With 1 terabyte drives under $100 US, it seems like a pretty small price to pay for such functionality.
What other tasks would be made easier if you could lop off the front of a file as efficiently as you can truncate at the end?
Can you think of any technical reason this function couldn't be added to a modern file system? Other, non-technical reasons?
On file systems that support sparse files "punching" a hole and removing data at an arbitrary file position is very easy. The operating system just has to mark the corresponding blocks as "not allocated". Removing data from the beginning of a file is just a special case of this operation. The main thing that is required is a system call that will implement such an operation: ftruncate2(int fd, off_t offset, size_t count).
On Linux systems this is actually implemented with the fallocate system call by specifying the FALLOC_FL_PUNCH_HOLE flag to zero-out a range and the FALLOC_FL_COLLAPSE_RANGE flag to completely remove the data in that range. Note that there are restrictions on what ranges can be specified and that not all filesystems support these operations.
Truncating files at the front seems not too hard to implement at the system level.
But there are issues.
The first one is at the programming level. When opening a file in random access, the current paradigm is to use the offset from the beginning of the file to point to different places in the file. If we truncate at the beginning of the file (or insert or remove in the middle of the file), that is no longer a stable property. (Appending or truncating at the end is not a problem.)
In other words truncating the beginning would change the only reference point and that is bad.
At a system level uses exist as you pointed out, but are quite rare. I believe most uses of files are of the write once read many kind, so even truncate is not a critical feature and we could probably do without it (well some things would become more difficult, but nothing would become impossible).
If we want more complex accesses (and there are indeed needs), we open files in random-access mode and add some internal data structure. This information can also be shared between several files. This leads us to the last issue I see, probably the most important.
In a sense, when we use random-access files with some internal structure, we are still using files but we are no longer using the files paradigm. Typical cases are databases, where we want to insert or remove records without caring at all about their physical place. Databases can use files as a low-level implementation, but for optimisation purposes some database vendors choose to bypass the filesystem completely (think about Oracle partitions).
I see no technical reason why we couldn't do everything that is currently done in an operating system with files using a database as the data storage layer. I have even heard that NTFS has much in common with databases in its internals. An operating system can (and probably will in the not-so-distant future) use a paradigm other than files.
In summary, I believe that's not a technical problem at all, just a change of paradigm: removing the beginning is definitely not part of the current "files paradigm", but it is not a big and useful enough change to compel changing anything at all.
NTFS can do something like this with its sparse file support, but it's generally not that useful.
I think there's a bit of a chicken-and-egg problem in there: because filesystems have not supported this kind of behavior efficiently, people haven't written programs to use it, and because people haven't written programs to use it, there's little incentive for filesystems to support it.
You could always write your own filesystem to do this, or maybe modify an existing one (although filesystems used "in the wild" are probably pretty complicated, you might have an easier time starting from scratch). If people find it useful enough it might catch on ;-)
Actually there are record-based file systems: IBM has one, and I believe DEC VMS also had this facility. I seem to remember both allowed (allow? I guess they are still around) deleting and inserting at random positions in a file.
There is also a Unix command called tail, so you could drop the first 1000 lines (by copying the rest to a new file) via:
tail -n +1001 file > file_truncated
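You may be able to achieve this in two steps (Linux):
#include <sys/sendfile.h>
#include <unistd.h>
long fileLength;     // total file length
long reserveLength;  // length to keep, counted back from the end of the file
int fd;              // file opened for read and write (without O_APPEND)
off_t offset = fileLength - reserveLength;  // where the kept data starts
lseek(fd, 0, SEEK_SET);                     // sendfile writes at fd's current position
sendfile(fd, fd, &offset, reserveLength);   // step 1: may copy fewer bytes, so loop in real code
ftruncate(fd, reserveLength);               // step 2: cut the file down to the kept length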

Resources