What happens internally when a file is modified and saved? Does the OS allocate a new block of memory and copy the whole data, or are only the bits after the modified part shifted?
Files are manipulated in blocks. A block on disk is like a byte in memory. You can only read and write in units of blocks. 512 bytes used to be the normal block size but 4096 is more common now.
The OS will read the entire block into memory; change whatever bytes; then write the entire block to the disk.
Clusters are units of file allocation. They are multiples of blocks. The disk hardware is generally unaware of clusters. Larger cluster sizes reduce the amount of system allocation overhead but are inefficient for large numbers of small files. You can read and write individual blocks within a cluster.
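As a rough illustration of that read-modify-write cycle, here is a sketch that patches a few bytes inside a single 4096-byte block using the POSIX pread()/pwrite() calls; the block size, the path handling, and the assumption that the patch fits inside one existing block are all illustrative.

    /* Sketch: change a few bytes of a file without rewriting the rest,
     * working a whole block at a time. Assumes a POSIX system, a file that
     * already spans the block, and that offset..offset+len-1 lies within
     * one 4096-byte block. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096

    int patch_block(const char *path, off_t offset, const void *data, size_t len)
    {
        unsigned char block[BLOCK_SIZE];
        off_t block_start = offset - (offset % BLOCK_SIZE);
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        /* 1. Read the whole block that contains the bytes to change. */
        if (pread(fd, block, BLOCK_SIZE, block_start) < 0) {
            close(fd);
            return -1;
        }

        /* 2. Change whatever bytes, in memory. */
        memcpy(block + (offset - block_start), data, len);

        /* 3. Write the whole block back to the same place. */
        if (pwrite(fd, block, BLOCK_SIZE, block_start) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }

In practice the kernel's page cache does exactly this for you behind a plain write(); the sketch only makes the block granularity visible.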
Different methods are used by different file systems. For example, in NTFS, if you write a file that uses six clusters, it will look like this in your file system (each digit is one cluster of the file, numbered within that file):
123456
If you then add a new file that uses one cluster, it will look like this:
1234561
Now remove the first file (using - for a free cluster):
------1
and write a new file using three clusters:
123---1
and now write a file that needs seven clusters; it fills the free clusters first and continues after the occupied one:
12312314567
If you copy a file to another folder, its data is rewritten into new clusters in the file system, but if you cut (move) it, only the index is modified; that is why cutting a file is so much faster than copying it.
So if you modify a file, partially or completely, in most cases it is loaded into a buffer, and when you save your changes that buffer is written back to the hard disk, replacing the affected clusters and writing any new ones. But that depends on the program: different software uses different methods.
My code does the following
do the following 100 times: open a new file, write 10 MB of data, close it
open the 100 files together, read their data, and merge it into a larger file
repeat steps 1 and 2 many times in a loop
I was wondering if I can keep the 100 files open instead of opening and closing them so many times. What I can do is fopen them with w+. After writing, I set the position to the beginning to read; after reading, I set the position to the beginning to write; and so on.
The questions are:
if I read after writing without closing, do I always read back all the written data?
would this save some overhead? Opening and closing a file must have some overhead, but is this overhead large enough to be worth saving?
Based on the comments and discussion, I will explain why I need to do this in my work. It is also related to my other post:
how to convert large row-based tables into column-based tables efficiently
I have a calculation that generates a stream of results. So far the results are saved in a row-oriented table. This table has 1M columns, and each column could be 10 MB long. Each column is actually one attribute the calculation produces. As the calculation runs, I dump and append the intermediate results to the table. The intermediate results could be 2 or 3 double values per column. I want to dump them soon because they already consume >16 MB of memory, and the calculation needs more memory. This ends up as a table like the following:
aabbcc...zzaabbcc..zz.........aabb...zz
A row of data is stored together. The problem comes when I want to analyze the data column by column: I have to read 16 bytes, then seek to the next row to read another 16 bytes, and so on. There are far too many seeks; it is much slower than if all the values of a column were stored together so that I could read them sequentially.
I can make the calculation dump less frequently. But to make the later reads more efficient, I may want to have 4 KB of data stored together, since I assume each fread fetches 4 KB by default even if I read only 16 bytes. But that means I would need to buffer 1M * 4 KB = 4 GB in memory...
So I was thinking of merging the fragmented data into larger chunks, as that post describes:
how to convert large row-based tables into column-based tables efficiently
So I wanted to use files as offline buffers. I may need 256 files to get 4 KB of contiguous data after the merge, if each file contains 2 doubles for each of the 1M columns. This work can be done asynchronously with respect to the main calculation. But I want to make sure the merge overhead is small, so that when it runs in parallel it can finish before the main calculation is done. That is how I came up with this question.
I guess this is closely related to how column-based databases are constructed. When people create them, do they run into similar issues? Is there any description of how this works at creation time?
You can use w+ as long as the maximum number of open files on your system allows it; this is usually 255 or 1024, and can be set (e.g. on Unix by ulimit).
But I'm not too sure this will be worth the effort.
On the other hand, 100 files of 10M each is one gigabyte; you might want to experiment with a RAM disk. Or with a large file system cache.
I suspect that much bigger savings might be reaped by analyzing your specific problem structure. Why is it 100 files? Why 10 MB? What kind of "merge" are you doing? Are those 100 files always accessed in the same order and with the same frequency? Could some data be kept in RAM and never be written at all?
Update
So, you have several large buffers like,
ABCDEFG...
ABCDEFG...
ABCDEFG...
and you want to pivot them so they read
AAA...
BBB...
CCC...
If you already know the total size (i.e., you know that you are going to write 10 GB of data), you can do this with two files, pre-allocating the output file and using fseek() to write into it. With memory-mapped files, this should be quite efficient. In practice, row Y, column X (of 1,000,000 columns) has been dumped at address 16*X in file Y.dat; to make each column contiguous, you need to write it to address 16*(X*R + Y) in largeoutput.dat, where R is the total number of rows.
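A minimal sketch of that fseek()-based scatter, assuming 16-byte cells, row dumps in files named Y.dat as above, and a known total row count R; the file names, the value of R, and the error handling are illustrative.

    /* Sketch: scatter one row file into a column-major output file so that
     * each column's values end up contiguous. Assumes 16-byte cells and a
     * known total number of rows R. */
    #define _POSIX_C_SOURCE 200809L   /* for fseeko()                        */
    #define _FILE_OFFSET_BITS 64      /* 64-bit offsets for the >2 GB output */
    #include <stdio.h>
    #include <sys/types.h>

    #define CELL_SIZE 16
    #define NUM_COLS  1000000L        /* columns per row, from the question  */

    static int scatter_row(FILE *out, const char *row_path, long Y, long R)
    {
        unsigned char cell[CELL_SIZE];
        FILE *in = fopen(row_path, "rb");
        if (!in)
            return -1;

        for (long X = 0; X < NUM_COLS; X++) {
            if (fread(cell, 1, CELL_SIZE, in) != CELL_SIZE)
                break;                                  /* short row file */
            /* column-major target: cell (Y, X) goes to 16*(X*R + Y) */
            off_t off = (off_t)CELL_SIZE * ((off_t)X * R + Y);
            fseeko(out, off, SEEK_SET);
            fwrite(cell, 1, CELL_SIZE, out);
        }
        fclose(in);
        return 0;
    }

    int main(void)
    {
        long R = 256;                            /* illustrative row count */
        FILE *out = fopen("largeoutput.dat", "w+b");
        if (!out)
            return 1;
        for (long Y = 0; Y < R; Y++) {
            char name[32];
            snprintf(name, sizeof name, "%ld.dat", Y);   /* Y.dat */
            scatter_row(out, name, Y, R);
        }
        fclose(out);
        return 0;
    }

Doing the same thing through a memory-mapped output file instead of fseeko()/fwrite() avoids one tiny seek-and-write per cell, which is why the mapped variant should be noticeably faster.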
Actually, you could write the data even during the first calculation. Or you could have two processes communicating via a pipe, one calculating, one writing to both row-column and column-row files, so that you can monitor the performances of each.
Frankly, I think that adding more RAM and/or a fast I/O layer (SSD maybe?) could get you more bang for the same buck. Your time costs too, and the memory will remain available after this one work has been completed.
Yes. You can keep the 100 files open without doing the opening-closing-opening cycle. Most systems do have a limit on the number of open files though.
if I read after writing without closing, do I always read back all the written data?
That is up to you. You can fseek to wherever you want in the file and read data from there. It all comes down to you and your logic.
would this save some overhead? Opening and closing a file must have some overhead, but is this overhead large enough to be worth saving?
This would definitely save some overhead, such as additional unnecessary I/O operations. Also, on some systems the content you write to a file is not immediately flushed to the physical file; it may be buffered and flushed periodically and/or at fclose time.
So such overheads are saved, but the real question is: what do you achieve by saving them? How does it fit into the overall picture of your application? That is the call you must make before deciding on the logic.
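A minimal sketch of that w+ round trip (file name and data are made up). One detail worth knowing: the C standard requires a flush or a positioning call such as fseek() between a write and a following read on the same stream, and a positioning call between a read and a following write, so the fseek() calls below are not optional.

    /* Sketch: write, read back, and overwrite one stream opened with w+,
     * without closing it in between. */
    #include <stdio.h>

    int main(void)
    {
        char buf[16] = {0};
        FILE *fp = fopen("chunk_00.tmp", "w+b");   /* hypothetical name */
        if (!fp)
            return 1;

        fwrite("0123456789", 1, 10, fp);   /* write pass */

        fseek(fp, 0, SEEK_SET);            /* required before switching to reading */
        size_t n = fread(buf, 1, 10, fp);
        printf("read back %zu bytes: %.10s\n", n, buf);

        fseek(fp, 0, SEEK_SET);            /* required before switching back to writing */
        fwrite("ABCDEFGHIJ", 1, 10, fp);   /* overwrite in place */

        fclose(fp);
        return 0;
    }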
I want to read very large chunks of data using memory-mapped I/O.
These large chunks of data are coming from a hard disk; there is no file system, just raw data.
Now, before I start this whole ordeal, I want to know 2 things.
Is it possible to memory map only specific parts into memory, one after another, and then read them sequentially? For instance, I have a hard drive from which I want to read 10 chunks of 100 MB, but each chunk is separated by 1 GB of data. Is it possible to memory map those 10 chunks of 100 MB one after the other, so I can access them as if they were contiguous?
Can I memory map a huge amount of data? E.g. let's say I have a 10 TB disk; is it possible to memory map the entire disk? I use a 64-bit OS.
I hope someone can clarify!
On Linux, you can use the mmap() system call to map files (even block devices) into memory. If you don't know how mmap() works, consult the man page before continuing with this answer.
The mmap() call allows you to specify a base address for the mapping you want to create. POSIX specifies that the operating system may take this base address as a hint on where to place the mapping. On Linux, mmap() will place the mapping at the address you request if it is a page boundary (i.e. divisible by 4096). You can specify MAP_FIXED to make sure that the mapping is placed where you want it, but the kernel might tell you that this is not possible.
You can try to map the chunks you want one after another using the approach above, but this will obviously only work if your chunks have sizes that are multiples of the page size (i.e. 4096 bytes). I would not advise you to do this, as it might break with a different page size / configuration.
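For what it is worth, here is a hedged sketch of that chunk-stitching idea on Linux: reserve one contiguous range of virtual address space, then map each on-disk chunk into its slot with MAP_FIXED. The device path is a made-up example, the 100 MB chunk size and 1 GB spacing come from the question, and every size and offset must be page-aligned.

    /* Sketch: map 10 chunks of 100 MB, spaced 1 GB apart on a raw device,
     * back-to-back in virtual memory so they read as one flat array.
     * Linux-specific; path and sizes are illustrative. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define CHUNK_SIZE   (100UL * 1024 * 1024)               /* 100 MB           */
    #define CHUNK_STRIDE (CHUNK_SIZE + 1024UL * 1024 * 1024) /* chunk + 1 GB gap */
    #define NUM_CHUNKS   10

    int main(void)
    {
        int fd = open("/dev/sdb", O_RDONLY);       /* raw device, no file system */
        if (fd < 0) { perror("open"); return 1; }

        /* Reserve one contiguous, inaccessible region of address space. */
        size_t total = (size_t)NUM_CHUNKS * CHUNK_SIZE;
        char *base = mmap(NULL, total, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED) { perror("mmap reserve"); return 1; }

        /* Map each chunk over its slot in the reservation. */
        for (int i = 0; i < NUM_CHUNKS; i++) {
            off_t dev_off = (off_t)i * CHUNK_STRIDE;   /* chunk's offset on disk */
            void *p = mmap(base + (size_t)i * CHUNK_SIZE, CHUNK_SIZE,
                           PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, dev_off);
            if (p == MAP_FAILED) { perror("mmap chunk"); return 1; }
        }

        /* base[0 .. total-1] now reads the 10 chunks as if they were contiguous. */
        printf("first byte: %d\n", base[0]);

        munmap(base, total);
        close(fd);
        return 0;
    }

Reserving the full range up front and overwriting it with MAP_FIXED avoids the risk of a plain hint address already being occupied by something else.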
Mapping the entire disk should be possible, depending on your memory configuration. You might need to configure the overcommitting behavior of your system for this.
I suggest you simply try out whether mapping the entire disk works.
I am writing a multi-threaded application, and as of now I have this idea: I have a FILE*[n] where n is a number determined at runtime. I open all n files for reading, and then multiple threads can access them for reading. The computation on the data of each file is equivalent, i.e. under serial execution each file would remain in memory for the same amount of time.
Each file can be arbitrarily large, so one should not assume that they can all be loaded into memory.
Now, in such a scenario I want to reduce the number of disk I/Os that occur. It would be great if someone could suggest a shared memory model for this scenario (I don't know if I am already using one, because I have very little idea of how things are implemented). I am not sure how I should achieve this. In other words, I just want to know the most efficient model for implementing such a scenario. I am using C.
EDIT: A more detailed scenario.
The actual problem is that I have n Bloom filters for data contained in n files, and once all the elements from a file have been inserted into the corresponding Bloom filter, I need to do membership testing. Since membership testing is a read-only process on the data files, I can read a file from multiple threads, and this problem is easily parallelized. The number of data files is fairly large (around 20k; note that the number of files equals the number of Bloom filters), so I chose to spawn one thread per Bloom filter, i.e. each Bloom filter has its own thread, which reads every other file one by one and tests the membership of its data against that Bloom filter. I want to minimize disk I/O in such a case.
At the start, use the mmap() function to map the files into memory instead of opening/reading FILE*s. After that, spawn the threads that read the files.
That way the OS buffers the accesses in memory, only performing disk I/O when the cache becomes full.
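A minimal sketch of that approach, assuming POSIX mmap() and pthreads; the file name and the byte-counting loop in the worker are placeholders for the real membership test against the Bloom filter.

    /* Sketch: map a data file read-only so any number of threads can scan
     * it; the OS page cache does the buffering, and each page is read from
     * disk at most once while it stays cached. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct mapped_file {
        const unsigned char *data;
        size_t size;
    };

    static int map_file(const char *path, struct mapped_file *mf)
    {
        struct stat st;
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        mf->size = (size_t)st.st_size;
        mf->data = mmap(NULL, mf->size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                     /* the mapping stays valid after close */
        return mf->data == MAP_FAILED ? -1 : 0;
    }

    /* Each thread just reads the shared mapping; no locking is needed for
     * read-only access. The loop is a stand-in for the Bloom-filter test. */
    static void *worker(void *arg)
    {
        const struct mapped_file *mf = arg;
        size_t hits = 0;
        for (size_t i = 0; i < mf->size; i++)
            hits += (mf->data[i] == 0x01);
        return (void *)hits;
    }

    int main(void)
    {
        struct mapped_file mf;
        pthread_t tid;
        void *result;

        if (map_file("data_0001.bin", &mf) != 0)   /* hypothetical file name */
            return 1;
        pthread_create(&tid, NULL, worker, &mf);
        pthread_join(tid, &result);
        printf("hits: %zu\n", (size_t)result);
        munmap((void *)mf.data, mf.size);
        return 0;
    }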
If your program is multi-threaded, all the threads are sharing memory unless you take steps to create thread-local storage. You don't need o/s shared memory directly. The way to minimize I/O is to ensure that each file is read only once if at all possible, and similarly that results files are only written once each.
How you do that depends on the processing you're doing.
If each thread is responsible for processing a file in its entirety, then the thread simply reads the file; you can't reduce the I/O any more than that. If a file must be read by several threads, then you should try to memory map the file so that it is available to all the relevant threads. If you're using a 32-bit program and the files are too big to all fit in memory, you can't necessarily do the memory mapping. Then you need to work out how the different threads will process each file, and try to minimize the number of times different threads have to reread the files. If you're using a 64-bit program, you may have enough virtual memory to handle all the files via memory-mapped I/O. You still want to keep the number of times the data is accessed to a minimum. Similar concepts apply to the output files.
I am writing an audio file to an SD/MMC storage card in real time, in WAVE format, working on an ARM board. Said card is (and must remain) in FAT32 format. I can write a valid WAVE file just fine, provided I know how much I'm going to write beforehand.
I want to be able to put placeholder data in the Chunk Data Size field of the RIFF and data chunks, write my audio data, and then go back and update the Chunk Data Size field in those two chunks so that they have correct values, but...
I have a working filesystem and some stdio functions, with some caveats:
fopen() supports the r, w, and a modes, but not any of the + modes.
fseek() does not work in write mode.
I did not write the implementations of the above functions (I am using ARM's RL-FlashFS), and I am not certain what the justification for the restrictions/partial implementations is. Adding the missing functionality myself is probably an option, but I would like to avoid it if possible (I have no other need of those features, do not foresee any, and can't really afford to spend too much time on it). Switching to a different implementation is also not an option here.
I have very limited memory available, and I do not know how much audio data will be received, except that it will almost certainly be more than I can keep in memory at any one time.
I could write a file with the raw interleaved audio data in it while keeping track of how many bytes I write, close it, then open it again for reading, open a second file for writing, write the header into the second file, and copy the audio data over. That is, I could post-process it into a properly formatted valid WAVE file. I have done this and it works fine. But I want to avoid post-processing large amounts of data if at all possible.
Perhaps I could somehow concatenate two files in place? (I.e. write the data, then write the chunks to a separate file, then join them in the filesystem, avoiding much of the time spent copying potentially vast amounts of data.) My understanding of that is that, if possible, it would still involve some copying due to block orientation of the storage.
Suggestions?
EDIT:
I really should have mentioned this, but there is no OS running here. I have some stdio functions running on top of a hardware abstraction layer, and that's about it.
This should be possible, but it involves writing a set of FAT table manipulation routines.
The concept of FAT is simple: A file is stored in a chain of "clusters" - fixed size blocks. The clusters do not have to be contiguous on the disk. The Directory entry for a file includes the ID of the first cluster. The FAT contains one value for each cluster, which is either the ID of the next cluster in the chain, or an "End-Of-Chain" (EOC) marker.
So you can concatenate files together by altering the first file's EOC marker to point to the head cluster of the second file.
For your application you could write all the data, rewrite the first cluster (with the correct header) into a new file, then do FAT surgery to graft the new head onto the old tail:
Determine the FAT cluster size (S)
Determine the size of the WAV header up to the first data byte (F)
Write the incoming data to a temp file. Close when stream ends.
Create a new file with the desired name.
Open the temp file for reading, and copy the header to the new file while filling in the size field(s) correctly (as you have done before).
Copy min(S - F, bytes_remaining) bytes from the temp file to the new file (filling out the new file's first cluster).
Close the new file.
If there are no bytes remaining, you are done,
else,
Read the FAT and Directory into memory.
Read the Directory to get
the first cluster of the temp file (T1) (the one with all the data), and
the first cluster of the wav file (W1) (the one with the correct header).
Read the FAT entry for T1 to find the second temp cluster (T2).
Change the FAT entry for W1 from "EOC" to T2.
Change the FAT entry for T1 from T2 to "EOC".
Swap the FileSize entries for the two files in the Directory.
Write the FAT and Directory back to disk.
Delete the Temp file.
Of course, by the time you do this, you will probably understand the file system well enough to implement fseek(fp,0,SEEK_SET), which should give you enough functionality to do the header fixup through standard library calls.
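For illustration, here is a minimal sketch of the FAT side of this, assuming the FAT has already been loaded into memory as an array of 32-bit FAT32 entries; walk_chain() follows a cluster chain, and graft() performs the two FAT-entry changes listed above (W1: EOC to T2, T1: T2 to EOC). The names and the in-memory representation are assumptions, not an RL-FlashFS API.

    /* Sketch: FAT32 chain walking and the W1 -> T2 / T1 -> EOC graft,
     * on a FAT table already read into memory as uint32_t entries. */
    #include <stdint.h>
    #include <stdio.h>

    #define FAT32_EOC_MIN 0x0FFFFFF8u    /* entries >= this mean end-of-chain */

    /* Print every cluster in the chain that starts at 'first'. */
    static void walk_chain(const uint32_t *fat, uint32_t first)
    {
        uint32_t c = first;
        while (c < FAT32_EOC_MIN) {
            printf("cluster %u\n", c);
            c = fat[c] & 0x0FFFFFFFu;    /* next link; top 4 bits are reserved */
        }
    }

    /* Graft the temp file's tail onto the new file. */
    static void graft(uint32_t *fat, uint32_t w1, uint32_t t1, uint32_t t2)
    {
        fat[w1] = t2;            /* W1's entry: was EOC, now continues into T2 */
        fat[t1] = 0x0FFFFFFFu;   /* T1 becomes the temp file's new EOC         */
    }

The directory-entry updates (sizes and first clusters) and writing the modified FAT copies back to the card still have to be done exactly as listed above.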
We are working with exactly the same scenario as you in our project - a recorder application. Since the length of the file is unknown, we write a RIFF header with 0 length at the beginning (to reserve space), and on closing we go back to position 0 (with fseek) and write the correct header. So I think you have to debug why fseek doesn't work in write mode; otherwise you will not be able to perform this task efficiently.
By the way, you'd better stay away from file-system-specific workarounds like concatenating blocks, etc. - this is hardly possible, will not be portable, and can bring you new problems. Better to use standard and proven methods instead.
Update
(After finding out that your FS is ARM's RL-FlashFS) why not use rewind (http://www.keil.com/support/man/docs/rlarm/rlarm_rewind.htm) instead of fseek?
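A minimal sketch of that placeholder-header approach for a canonical 44-byte PCM WAVE header; the sample rate, channel count, and file name are illustrative, and the single repositioning call is exactly the part that needs either a working fseek() or rewind() on RL-FlashFS.

    /* Sketch: write a WAVE header with zero sizes, stream the audio, then
     * reposition to the start and rewrite the header with the real sizes. */
    #include <stdint.h>
    #include <stdio.h>

    static void w16(FILE *f, uint16_t v)
    {
        unsigned char b[2] = { (unsigned char)v, (unsigned char)(v >> 8) };
        fwrite(b, 1, 2, f);
    }

    static void w32(FILE *f, uint32_t v)
    {
        unsigned char b[4] = { (unsigned char)v, (unsigned char)(v >> 8),
                               (unsigned char)(v >> 16), (unsigned char)(v >> 24) };
        fwrite(b, 1, 4, f);
    }

    static void write_header(FILE *f, uint32_t data_bytes,
                             uint32_t rate, uint16_t ch, uint16_t bits)
    {
        fwrite("RIFF", 1, 4, f);  w32(f, 36 + data_bytes);  /* RIFF chunk size */
        fwrite("WAVE", 1, 4, f);
        fwrite("fmt ", 1, 4, f);  w32(f, 16);               /* fmt chunk size  */
        w16(f, 1);                                          /* PCM             */
        w16(f, ch);
        w32(f, rate);
        w32(f, rate * ch * (bits / 8));                     /* byte rate       */
        w16(f, (uint16_t)(ch * (bits / 8)));                /* block align     */
        w16(f, bits);
        fwrite("data", 1, 4, f);  w32(f, data_bytes);       /* data chunk size */
    }

    int main(void)
    {
        FILE *fp = fopen("rec0001.wav", "wb");    /* hypothetical name */
        if (!fp)
            return 1;

        write_header(fp, 0, 44100, 1, 16);        /* sizes unknown yet: write zeros */

        uint32_t data_bytes = 0;
        /* ... stream audio samples here, adding their byte count to data_bytes ... */

        rewind(fp);                               /* or fseek(fp, 0, SEEK_SET) */
        write_header(fp, data_bytes, 44100, 1, 16);  /* same header, real sizes */
        fclose(fp);
        return 0;
    }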
Secure File Deleting in C
I need to securely delete a file in C, here is what I do:
use fopen to get a handle to the file
calculate the size using fseek/ftell
get a random seed depending on the current time and/or the file size
write (size) bytes to the file in a loop, writing 256 bytes each iteration
fflush/fclose the file handle
reopen the file and redo steps 3-6, 10-15 times
rename the file, then delete it
Is that how it's done? Because I read the name "Gutmann 25 passes" in Eraser, so I guess 25 is the number of times the file is overwritten and 'Gutmann' is the Randomization Algorithm?
You can't do this securely without the cooperation of the operating system - and often not even then.
When you open a file and write to it there is no guarantee that the OS is going to put the new file on the same bit of spinning rust as the old one. Even if it does you don't know if the new write will use the same chain of clusters as it did before.
Even then you aren't sure that the drive hasn't mapped out the disk block because of some fault - leaving your plans for world domination on a block that is marked bad but is still readable.
ps - the 25x overwrite is no longer necessary, it was needed on old low density MFM drives with poor head tracking. On modern GMR drives overwriting once is plenty.
Yes. In fact, it overwrites n different patterns on a file.
It does so by writing a series of 35 patterns over the
region to be erased.
The selection of patterns assumes that the user doesn't know the
encoding mechanism used by the drive, and so includes patterns
designed specifically for three different types of drives. A user who
knows which type of encoding the drive uses can choose only those
patterns intended for their drive. A drive with a different encoding
mechanism would need different patterns.
More information is here.
@Martin Beckett is correct; there is no such thing as "secure deletion" unless you know everything about what the hardware is doing all the way down to the drive. (And even then, I would not make any bets on what a sufficiently well-funded attacker could recover given access to the physical media.)
But assuming the OS and disk will re-use the same blocks, your scheme does not work for a more basic reason: fflush does not generally write anything to the disk.
On most multi-tasking operating systems (including Windows, Linux, and OS X), fflush merely forces data from the user-space buffer into the kernel. The kernel will then do its own buffering, only writing to disk when it feels like it.
On Linux, for example, you need to call fsync(fileno(handle)). (Or just use file descriptors in the first place.) OS X is similar. Windows has FlushFileBuffers.
Bottom line: The loop you describe is very likely merely to overwrite a kernel buffer 10-15 times instead of the on-disk file. There is no portable way in C or C++ to force data to disk. For that, you need to use a platform-dependent interface.
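A minimal sketch of one overwrite pass that actually pushes the data past both buffering layers on a POSIX system (Linux/OS X); the caveats above still apply, since the file may no longer occupy the same physical blocks anyway. The file name, pattern, and pass size are illustrative.

    /* Sketch: overwrite an existing file in place with one byte pattern,
     * then force it out of the stdio buffer and the kernel page cache. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int overwrite_pass(const char *path, long size, unsigned char pattern)
    {
        unsigned char buf[256];
        FILE *fp = fopen(path, "r+b");   /* overwrite in place, do not truncate */
        if (!fp)
            return -1;

        memset(buf, pattern, sizeof buf);
        for (long done = 0; done < size; ) {
            size_t chunk = (size - done) < (long)sizeof buf
                         ? (size_t)(size - done) : sizeof buf;
            if (fwrite(buf, 1, chunk, fp) != chunk)
                break;
            done += (long)chunk;
        }

        fflush(fp);                      /* stdio buffer -> kernel          */
        fsync(fileno(fp));               /* kernel page cache -> the device */
        fclose(fp);
        return 0;
    }

On Windows, the equivalent of the fsync() call is FlushFileBuffers() on the underlying handle.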
The MFT (Master File Table) is similar to the FAT (File Allocation Table).
The MFT keeps records containing: the file's offsets on disk, file name, date/time, ID, file size, and even the file data itself if it fits inside the record's empty space, which is about 512 bytes (one record is 1 KB in size).
Note: on a new HDD the data is set to 0x00 (just so you know).
Let's say you want to overwrite file1.txt. The OS finds this file's offset on disk inside its MFT record.
You begin overwriting file1.txt with zeros (0x00) in binary mode.
You will overwrite the file's data on disk 100%; this is possible precisely because the MFT holds the file's offset on disk.
Afterwards, rename the file and delete it.
NOTE: the MFT will mark the file as deleted, but you can still recover some metadata about it, i.e. date/time (created, modified, accessed), file offset, attributes, flags.
1- create a folder in c:\ and move the file into it, renaming it at the same time (use the rename function); rename the file to 0000000000 or anything else, without an extension
2- overwrite the file with 0x00 and check that it was overwritten
3- change the date/time
4- remove all attributes
5- leave the file size untouched, so the OS reuses the empty space faster
6- delete the file
7- repeat steps 1-6 for all files
8- delete the folder
or
(1, 2, 6, 7, 8)
9- find the files in the MFT and remove their records
The Gutmann method worked fine for older disk-technology encoding schemes, but the 35-pass wiping scheme of the Gutmann method is no longer required, which even Gutmann acknowledges. See the Gutmann method at https://en.wikipedia.org/wiki/Gutmann_method, in the Criticism section, where Gutmann discusses the differences.
It is usually sufficient to make at most a few random passes to securely delete a file (with possibly an extra zeroing pass).
The secure-delete package from thc.org contains the sfill command to securely wipe disk and inode space on a hard drive.