fsync() atomicity across data blocks - c

When calling fsync() on a file, can the file become corrupted?
For example, say my file spreads across two disk blocks:
     A              B
|---------|    |---------|
| Hello,  | -> | World!  |
|---------|    |---------|
| 1234567 |    | 89abcd  |
|---------|    |---------|
Say I want to change the entire file contents to lower case (in a very inefficient manner). So I seek to position 1 of the file to change "H" into "h" and then position 8 to change "W" to "w". I then call fsync() on the file. The file is spread across two disk blocks.
Is the ordering of the writes maintained?
Is the fsync() operation atomic across the disk blocks?

The fsync call won't return until both writes are written to disk, along with any associated metadata. If your computer crashes (typically by losing power) and you have a corrupted file then log a bug report with the filesystem maintainers - that shouldn't happen. If fsync returns then the data is safely on disk.
To answer your questions though, there's no reason why the filesystem and disk driver can't reorder the writes (they see them as non-overlapping, and it might be useful to write the second one first if that's where the disk head is on rotating media). And secondly, there's no way for fsync to be atomic, as it deals with real-life hardware. To the user it should still appear atomic, though (you will see either the first copy of the file or the second, but not something corrupted).
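To make the sequence concrete, here is a minimal sketch of the scenario above (the file name and the 0-based offsets are assumptions for the example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("hello.txt", O_WRONLY);   /* file already contains "Hello, World!..." */
    if (fd < 0) { perror("open"); return 1; }

    pwrite(fd, "h", 1, 0);   /* "H" -> "h", lands in block A */
    pwrite(fd, "w", 1, 7);   /* "W" -> "w", lands in block B */

    /* Returns only once both writes (and the associated metadata) are on disk;
     * it says nothing about the order in which the two blocks were written. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}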

How to check memory allocated to open files in a process

The scenario is that I'm running a test program which grows to 8GB in size, and then a series of asserts are done via Check.
I can see my process grow to 8GB in size in 'top' and I see my system memory grow accordingly. However, when a large number of asserts are done afterwards, I see VIRT stop growing as expected, and I see my total used memory continue to increase as Check is going through the asserts. So according to top there's no more memory being given to any processes, but something is still chewing through memory.
I sort top by memory usage and see nothing else is reserving memory. Eventually I hit 100% swap and physical memory usage and the process gets killed.
Note that Check will fork each 'test' into its own process. When I break the test program (w/ ctrl+c), I still see the process in top, and its VIRT still reads 8GB.
I believe memory is being eaten by Check, because this behavior doesn't happen when I take out all the asserts. I saw in Check that a tmpfile() is created and used to track the last assert that happened and I see /tmp growing as the assert phase begins. If I modify Check code to write to a file in /tmp (instead of using tmpfile()), I see that file grows to be GBs big.
1) Why doesn't the virtual address space taken up by the open file show up as part of the process's used memory? Note that Check forks off each 'test'. Also, even though swap is full, shouldn't unused parts of the file just get paged back out to disk? (The writes are done via fwrite, not mmap.)
2) A secondary question, I haven't used tmpfile() before, but why doesn't any file show up in /tmp when tmpfile() is invoked? If it is because the file is unlinked immediately, does that mean any unlinked file won't show up in the filesystem? (My understanding of what unlink does is also rudimentary).
edit: I'm using Arch Linux w/ kernel 4.0.5-1 and procps-ng version 3.3.10
A small experiment shows that yes: because tmpfile() creates its file in /tmp, which is a tmpfs held purely in memory, the file kept eating RAM until I ran out of memory and my program got killed. Because the program held the last open file descriptor, the file was deleted (and its space freed from the tmpfs) after the program was killed. Since /tmp exists only in memory as a tmpfs, it can't be 'paged out' to disk - the filesystem itself exists only in memory - and swap was also full, hence the OOM kill.
By performing the same experiment with a file in /var/tmp, which is on disk instead, no extra memory was taken up and I didn't get this error.
While I couldn't view this behavior with top, free -h showed the memory being taken up under shared and buff/cache.
To narrow down the exact culprit, I used:
lsof | awk '$5 == "REG"' | sort -n -r -k 7,7 | head -n 5
from https://serverfault.com/questions/207100/how-can-i-find-phantom-storage-usage
Note that the above will not work if /tmp is already full.
So the open file was not the problem, i.e. buffering was not the problem. The problem was that the file existed in a tmpfs.
To answer my #2, the file doesn't show up because it was 'deleted' when it was unlinked with tmpfile():
$ lsof | awk '$5 == "REG"' | sort -n -r -k 7,7 | head -n 5
toy 13624 me 3u REG 0,33 4294963200 12505948 /tmp/tmpf0W4HOM (deleted)
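For reference, tmpfile() behaves roughly like the following sketch: the file is unlinked immediately after creation, so no name is ever visible under /tmp, but the space (RAM, when /tmp is a tmpfs) is only released once the last descriptor is closed. The path template here is just for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char path[] = "/tmp/exampleXXXXXX";
    int fd = mkstemp(path);              /* create a uniquely named file */
    if (fd < 0) { perror("mkstemp"); return 1; }

    unlink(path);                        /* name gone: lsof now shows it as "(deleted)" */

    /* Writes still consume tmpfs memory even though nothing is visible in /tmp;
     * only close(fd) or process exit frees the space. */
    write(fd, "still taking up space\n", 22);

    sleep(30);                           /* leave time to inspect with lsof */
    close(fd);
    return 0;
}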

Batch Processing

I have a list of strings that I want to output to different files according to a key stored in each node of the list: if a node's key is 1, its string needs to be written to 1.txt; if the key is 2, the output should be redirected to 2.txt; and so on.
What I was thinking is to assign each list member a unique key, which makes it a unique record, and then spawn multiple threads depending on the number of processors available in the system. Each thread would take a node from the pool of nodes (that is, my list) and redirect its output to the relevant file. I'm skeptical about whether this is a good design for batch processing, or whether I should just have one thread do all of the output.
ps - Before I get bashed or anything let me tell you I am just a curious learner.
Make it single-threaded first. Then run it and find out where your bottleneck is. If it turns out that your bottleneck is CPU and not disk I/O, then enable parallel processing.
As I understand it, your processing steps are:
select file by the key
write item to file
I don't think this is a case where parallel processing will give a performance improvement. If you want to speed up this code, use buffering and asynchronous I/O (a minimal C sketch follows the outline below):
for each file, maintain a flag: write-in-progress
when you want to write something to a file, check this flag
  if write-in-progress is False:
    set write-in-progress = True
    add your item to the buffer
    start writing this buffer to the file asynchronously
  if write-in-progress is True:
    add your item to the buffer
when the pending asynchronous operation completes:
  check whether there is a nonempty buffer; if so, start another async write
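A minimal C sketch of that outline for a single output file, using POSIX AIO from <aio.h> (the file name, buffer size and busy-wait polling are assumptions; a real implementation would flush rather than drop a full buffer and would check errors; link with -lrt on older glibc):

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE 4096

static struct aiocb cb;                 /* describes the write in flight        */
static char inflight[BUF_SIZE];         /* buffer currently being written       */
static char pending[BUF_SIZE];          /* buffer collecting newly added items  */
static size_t pending_len = 0;
static int write_in_progress = 0;
static off_t file_offset = 0;

static void start_write(int fd)
{
    memcpy(inflight, pending, pending_len);
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = inflight;
    cb.aio_nbytes = pending_len;
    cb.aio_offset = file_offset;
    file_offset  += (off_t)pending_len;
    pending_len   = 0;
    write_in_progress = 1;
    aio_write(&cb);                     /* queues the write and returns at once */
}

static void add_item(int fd, const char *item)
{
    size_t len = strlen(item);
    if (pending_len + len > BUF_SIZE)   /* sketch only: drop instead of flushing */
        return;
    memcpy(pending + pending_len, item, len);
    pending_len += len;
    if (!write_in_progress)
        start_write(fd);
}

static void poll_completion(int fd)
{
    if (write_in_progress && aio_error(&cb) != EINPROGRESS) {
        aio_return(&cb);                /* reap the completed request           */
        write_in_progress = 0;
        if (pending_len > 0)            /* items queued up meanwhile            */
            start_write(fd);
    }
}

int main(void)
{
    int fd = open("1.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    add_item(fd, "first item\n");
    add_item(fd, "second item\n");      /* buffered while the first write runs  */
    while (write_in_progress || pending_len > 0)
        poll_completion(fd);
    close(fd);
    return 0;
}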
There is a simpler approach: use buffering and synchronous I/O. It will be slower than the asynchronous approach described above, but not by much. You can start several threads and traverse the list in each thread independently, with each thread handling its own disjoint set of keys. For example, with two threads, the first writes only items with odd keys and the second only items with even keys.
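A sketch of that simpler threaded variant, assuming a small hand-built linked list of (key, text) nodes and a two-way odd/even split; since the two threads never share a key, they never write to the same file. (A real implementation would keep one FILE* per key open instead of reopening it for every item.)

#include <pthread.h>
#include <stdio.h>

struct node {
    int key;                 /* selects the output file: <key>.txt */
    const char *text;
    struct node *next;
};

struct worker_arg {
    struct node *head;       /* every thread walks the whole list   */
    int parity;              /* 0: even keys, 1: odd keys           */
};

static void *writer(void *p)
{
    struct worker_arg *arg = p;
    for (struct node *n = arg->head; n; n = n->next) {
        if ((n->key & 1) != arg->parity)
            continue;                       /* not this thread's key, skip */
        char name[32];
        snprintf(name, sizeof name, "%d.txt", n->key);
        FILE *f = fopen(name, "a");         /* buffered stdio write */
        if (!f) continue;
        fprintf(f, "%s\n", n->text);
        fclose(f);
    }
    return NULL;
}

int main(void)
{
    struct node c = { 3, "third",  NULL };
    struct node b = { 2, "second", &c   };
    struct node a = { 1, "first",  &b   };

    pthread_t t[2];
    struct worker_arg args[2] = { { &a, 0 }, { &a, 1 } };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, writer, &args[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}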
You need a concurrency model for that - however serious it sounds :)
First analyze what can be done at the same time and is unrelated to everything else. Imagine each step of your program being executed on a different machine, with some sort of communication in between, e.g. an IP network.
Then draw a flow between these instances (actions/machines). Mark which resources the actions need in order to run, e.g. a list, a file. Mark the resources as separate instances too (same as the actions and machines).
Put the file system in your picture to see whether writing separate files can actually be sped up, or whether everything ends up in the same file system and is therefore serialized again.
Connect the instances and see if you get any benefit. It could look like this:
              list
                |
           list reader
          /     |      \
      file     file     file
     writer   writer   writer
        |        |        |
     file 1   file 2   file 3
         \      /          |
          \    /           |
      file system 1   file system 2
In the example you can see where it may make sense to allow some parallel execution, and where (as with the two files sharing file system 1) the work will end up serialized anyway.

Secure File Delete in C

I need to securely delete a file in C, here is what I do:
1. use fopen to get a handle of the file
2. calculate the size using lseek/ftell
3. get a random seed depending on the current time or the file size
4. write (size) bytes to the file from a loop with 256 bytes written each iteration
5. fflush/fclose the file handle
6. reopen the file and re-do steps 3-6 for 10~15 times
7. rename the file, then delete it
Is that how it's done? I ask because I read the name "Gutmann 25 passes" in Eraser, so I guess 25 is the number of times the file is overwritten and 'Gutmann' is the randomization algorithm?
You can't do this securely without the cooperation of the operating system - and often not even then.
When you open a file and write to it there is no guarantee that the OS is going to put the new file on the same bit of spinning rust as the old one. Even if it does you don't know if the new write will use the same chain of clusters as it did before.
Even then you aren't sure that the drive hasn't mapped out the disk block because of some fault - leaving your plans for world domination on a block that is marked bad but is still readable.
ps - the 25x overwrite is no longer necessary, it was needed on old low density MFM drives with poor head tracking. On modern GMR drives overwriting once is plenty.
Yes, in fact it overwrites the file with a number of different patterns:
It does so by writing a series of 35 patterns over the region to be erased. The selection of patterns assumes that the user doesn't know the encoding mechanism used by the drive, and so includes patterns designed specifically for three different types of drives. A user who knows which type of encoding the drive uses can choose only those patterns intended for their drive. A drive with a different encoding mechanism would need different patterns.
More information is here.
@Martin Beckett is correct; there is no such thing as "secure deletion" unless you know everything about what the hardware is doing all the way down to the drive. (And even then, I would not make any bets on what a sufficiently well-funded attacker could recover given access to the physical media.)
But assuming the OS and disk will re-use the same blocks, your scheme does not work for a more basic reason: fflush does not generally write anything to the disk.
On most multi-tasking operating systems (including Windows, Linux, and OS X), fflush merely forces data from the user-space buffer into the kernel. The kernel will then do its own buffering, only writing to disk when it feels like it.
On Linux, for example, you need to call fsync(fileno(handle)). (Or just use file descriptors in the first place.) OS X is similar. Windows has FlushFileBuffers.
Bottom line: The loop you describe is very likely merely to overwrite a kernel buffer 10-15 times instead of the on-disk file. There is no portable way in C or C++ to force data to disk. For that, you need to use a platform-dependent interface.
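For example, on a POSIX system a single overwrite pass that actually reaches the device looks roughly like this sketch (whether the drive reuses the same physical blocks is still out of your control):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static int overwrite_pass(const char *path)
{
    FILE *fp = fopen(path, "r+b");
    if (!fp) return -1;

    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    unsigned char block[256];
    for (long done = 0; done < size; done += (long)sizeof block) {
        for (size_t i = 0; i < sizeof block; i++)
            block[i] = (unsigned char)rand();
        size_t n = (size - done) < (long)sizeof block
                       ? (size_t)(size - done) : sizeof block;
        fwrite(block, 1, n, fp);
    }

    fflush(fp);              /* stdio buffer -> kernel  */
    fsync(fileno(fp));       /* kernel buffers -> disk  */
    fclose(fp);
    return 0;
}

int main(int argc, char **argv)
{
    srand((unsigned)time(NULL));
    return (argc > 1 && overwrite_pass(argv[1]) == 0) ? 0 : 1;
}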
The MFT (Master File Table) is similar to the FAT (File Allocation Table).
The MFT keeps records containing each file's offset on disk, file name, date/time, id, file size, and even the file data itself if it fits inside the record's free space, which is about 512 bytes (a record is 1 KB).
Note: on a new HDD all data is set to 0x00 (just so you know).
Let's say you want to overwrite file1.txt: the OS finds the file's offset on disk inside its MFT record.
You begin overwriting file1.txt with zeros (0x00) in binary mode.
You will overwrite the file data on disk 100%, because the MFT holds the file's offset on disk.
Afterwards, rename the file and then delete it.
NOTE: the MFT will mark the file as deleted, but you can still recover some information about it, i.e. date/time (created, modified, accessed), file offset, attributes, and flags.
1- Create a folder in c:\ and move the file into it, renaming it at the same time (use the rename function); rename the file to 0000000000 or anything else without an extension.
2- Overwrite the file with 0x00 and check that it was overwritten.
3- Change the date/time.
4- Strip all attributes.
5- Leave the file size untouched; the OS reuses the empty space faster.
6- Delete the file.
7- Repeat steps 1-6 for all files.
8- Delete the folder.
Or just do steps 1, 2, 6, 7, 8, plus:
9- Find these files' records in the MFT and remove them.
The Gutmann method worked fine for older disk-technology encoding schemes, and the 35-pass wiping scheme of the Gutmann method is no longer required, which even Gutmann acknowledges. See the Gutmann method article at https://en.wikipedia.org/wiki/Gutmann_method, specifically the Criticism section, where Gutmann discusses the differences.
It is usually sufficient to make at most a few random passes to securely delete a file (with possibly an extra zeroing pass).
The secure-delete package from thc.org contains the sfill command to securely wipe disk and inode space on a hard drive.

How to avoid physical disk I/O

I have a process which writes a huge amount of data over the network. Let's say it runs on machine A and dumps a file of around 70-80GB onto machine B over NFS. After process 1 finishes and exits, my process 2 runs on machine A and fetches this file from machine B over NFS. The bottleneck in the entire cycle is the writing and reading of this huge data file. How can I reduce this I/O time? Can I somehow keep the data loaded in memory, ready for use by process 2 even after process 1 has exited?
I'd appreciate ideas on this. Thanks.
Edit: since process 2 'reads' the data directly from the network, would it be better to copy the data locally first and then read from the local disk? I mean, would
(read time over network) > (cp to local disk) + (read from local disk)?
If you want to keep the data loaded in memory, then you'll need 70-80 GB of RAM.
The best option is maybe to attach local storage (a hard disk drive) to machine A and keep this file locally.
The obvious answer is to reduce network writes, which could save you a great deal of time and improve reliability - there seems very little point in copying a file to another machine only to copy it back, so in order to answer your question more precisely we will need more information.
There is a lot of network and I/O overhead with this approach, so you may not be able to reduce the latency much further.
Since the file is around 80 GB, create a file-backed mmap that process 1 writes into and process 2 later reads from - no network involved, only machine A - though the disk I/O overhead is still unavoidable.
Faster: both processes can run simultaneously, and you can use a semaphore or another signalling mechanism so that process 1 can tell process 2 that the file is ready to be read.
Fastest approach: let process 1 create a shared memory region and share it with process 2. Whenever a limit is reached (the maximum data chunk that can be loaded into memory, based on your RAM size), let process 1 signal process 2 that the data can be read and processed - this solution is feasible only if the file/data can actually be processed chunk by chunk instead of as one big 80GB chunk.
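A rough sketch of that shared-memory handoff using POSIX shared memory (shm_open); the object name and chunk size are assumptions, and a real version would add a semaphore (sem_open) so process 2 knows when a chunk is ready. Link with -lrt on older glibc.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME   "/bigfile_chunk"
#define CHUNK_SIZE (64 * 1024 * 1024)   /* 64 MiB per handoff */

int main(void)
{
    /* --- writer side (process 1) --- */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, CHUNK_SIZE) != 0) { perror("ftruncate"); return 1; }

    char *chunk = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (chunk == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(chunk, "data produced by process 1");
    /* ... here process 1 would post a semaphore; process 2 would
     * shm_open(SHM_NAME, O_RDWR, 0) + mmap() and read the chunk ... */

    munmap(chunk, CHUNK_SIZE);
    close(fd);
    shm_unlink(SHM_NAME);    /* remove the object when both sides are done */
    return 0;
}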
Whether you use mmap or plain read/write should make little difference; either way, everything happens through the filesystem cache/buffers. The big problem is NFS. The only way you can make this efficient is by storing the intermediate data locally on machine A rather than sending it all over the network to machine B only to pull it back again right afterwards.
Use tmpfs to leverage memory as (temporary) files.
Use mbuffer with netcat to simply relay from one port to another without storing the intermediate stream, but still allowing streaming to occur at varying speeds:
machine1:8001 -> machine2:8002 -> machine3:8003
At machine2 configure a job like:
netcat -l -p 8002 | mbuffer -m 2G | netcat machine3 8003
This will allow at most 2 gigs of data to be buffered. If the buffer is filled 100%, machine2 will just start blocking reads from machine1, delaying the output stream without failing.
When machine1 has completed transmission, the second netcat will stay around until the mbuffer is drained.
You can use a RAM disk as storage.
NFS is slow. Try using an alternative way to transfer the data to the other machine, for example a plain TCP/IP stream.
Another solution: use an in-memory database (TimesTen, for example).

How to have a checkpoint file using mmap which is only synced to disk manually

I need the fastest way to periodically sync file with memory.
What I think I would like is to have an mmap'd file, which is only sync'd to disk manually. I'm not sure how to prevent any automatic syncing from happening.
The file cannot be modified except at the times I manually specify. The point is to have a checkpoint file which keeps a snapshot of the state in memory. I would like to avoid copying as much as possible, since this will need to be called fairly frequently and speed is important.
Anything you write to the memory within a MAP_SHARED mapping of a file is considered as being written to the file at that time, as surely as if you had used write(). msync() in this sense is completely analogous to fsync() - it merely ensures that changes you have already made to the file are actually pushed out to permanent storage. You can't change this - it's how mmap() is defined to work.
In general, the safe way to do this is to write a complete consistent copy of the data to a temporary file, sync the temporary file, then atomically rename it over the prior checkpoint file. This is the only way to ensure that a crash between checkpoints doesn't leave you with an inconsistent file. Any solution that does less copying is going to require both a more complicated transaction-log style file format, and be more intrusive to the rest of your application (requiring specific hooks to be invoked in each place that the in-memory state is changed).
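A minimal sketch of that write-temp/fsync/rename sequence (the file names are assumptions; for the rename itself to be durable you would additionally fsync() the containing directory):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int checkpoint(const void *state, size_t len)
{
    int fd = open("checkpoint.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, state, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0)                        { close(fd); return -1; }
    close(fd);

    /* rename() atomically replaces the old checkpoint with the new one;
     * a crash leaves either the old file or the new one, never a mix. */
    return rename("checkpoint.tmp", "checkpoint.dat");
}

int main(void)
{
    const char msg[] = "in-memory state snapshot";
    return checkpoint(msg, sizeof msg - 1) == 0 ? 0 : 1;
}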
You could mmap() the file as copy on write so that any updates you do in memory are not written back to the file, then when you want to sync, you could:
A) Make a new memory mapping that is not copy on write and copy just the pages you modified into it.
Or
B) Open the file (regular file open) with direct I/O (block size aligned sized reading and writing) and write only the pages you modified. Direct I/O would be nice and fast because you're writing whole pages (memory page size is a multiple of disk block size) and there's no buffering. This method has the benefit of not using address space in case your mmap() is large and there's no room to mmap() another huge file.
After the sync, your copy on write mmap() is the same as your disk file, but the kernel still has the pages you needed to sync marked as non shared (with the disk). So you can then close and recreate the mmap() (still copy on write) that way the kernel can discard your pages if necessary (instead of paging them out to swap space) if there's memory pressure.
Of course, you'd have to keep track of which pages you had modified yourself because I can't think of how you'd get access to where the OS keeps that info. (wouldn't that be a handy syscall()?)
-- edit --
Actually, see Can the dirtiness of pages of a mmap be found from userspace? for ideas on how to see which pages are dirty.
mmap can't be used for this purpose. There's no way to prevent data from being written to disk. In practice, using mlock() to make the memory unswappable might have a side effect of preventing it from getting written to disk except when you ask for it to be written, but there's no guarantee. Certainly if another process opens the file, it's going to see the copy cached in memory (with your latest changes), not the copy on physical disk. In many ways, what you should do depends on whether you're trying to do synchronization with other processes or just for safety in case of crash or power failure.
If your data size is small, you might try a number of other methods for atomic syncing to disk. One way is to store the entire dataset in a filename and create an empty file by that name, then delete the old file. If 2 files exist at startup (due to extremely unlikely crash time), delete the older one and resume from the newer one. write() may also be atomic if your data size is smaller than a filesystem block, page size, or disk block, but I don't know of any guarantee to that effect right off. You'd have to do some research.
Another very standard approach that works as long as your data isn't so big that 2 copies won't fit on disk: just create a second copy with a temporary name, then rename() it over top of the old one. rename() is always atomic. This is probably the best approach unless you have a reason not to do it that way.
As the other respondents have suggested, I don't think there's a portable way to do what you want without copying. If you're looking to do this in a special-purpose environment where you can control the OS etc, you may be able to do it under Linux with the btrfs filesystem.
btrfs supports a new reflink() operation which is essentially a copy-on-write filesystem copy. You could reflink() your file to a temporary on start-up, mmap() the temporary, then msync() and reflink() the temporary back to the original to checkpoint.
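Note that there is no syscall literally named reflink() on Linux; the closest thing today is the FICLONE ioctl (what cp --reflink=always uses), supported by btrfs and XFS. A rough sketch of the clone step, with assumed file names:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>        /* FICLONE */

int main(void)
{
    int src = open("checkpoint.dat", O_RDONLY);
    int dst = open("checkpoint.work", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    /* Share the source file's blocks copy-on-write; no data is copied. */
    if (ioctl(dst, FICLONE, src) != 0) { perror("ioctl(FICLONE)"); return 1; }

    close(src);
    close(dst);
    return 0;
}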
I strongly suspect that no OS actually takes advantage of this, but it would be possible for an OS to notice and optimize the following pattern:
#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int fd = open("file", O_RDWR | O_SYNC | O_DIRECT);
size_t length = get_length(fd);   /* stand-in for e.g. an fstat() call */
uint8_t *map_addr = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
...
// This represents all of the changes that could possibly happen before you
// want to update the on-disk file.
change_various_data(map_addr);
if (is_time_to_update()) {
    write(fd, map_addr, length);
    lseek(fd, 0, SEEK_SET);
    // you could have just used pwrite here and not seeked
}
The reasons that an OS could possibly take advantage of this is that until you write to a particular page (and no one else did either) the OS would probably just use the actual file's page at that location as the swap for that page.
Then when you wrote to some set of those pages the OS would Copy On Write those pages for your process, but still keep the unwritten pages backed up by the original file.
Then, upon calling write the OS could notice that the write was block aligned both in memory and on disk, and then it could notice that some of the source memory pages were already synched up with those exact file system pages that they were being written to and only write out the pages which had changed.
All of that being said, it wouldn't surprise me if this optimization isn't done by any OS, and this type of code ends up being really slow and causes lots of disk writing when you call 'write'. It would be cool if it was taken advantage of.
