Can a tar file be efficiently randomly edited? - file

Is there a way to modify an individual file within a tar file without having to rewrite the entire archive? I recognize this would probably result in fragmentation.
Is there any other archive format that does this?

First off, you should only ask exactly one question on StackOverflow. If you truly want to do frequent writes to the "archive", then you might be better off simply creating a large file, formatting it with some file system of your choice and then mounting it:
truncate -s $(( 512*1024*1024 )) 512MiB-filesystem.ext4
mkfs.ext4 512MiB-filesystem.ext4
sudo mount -o loop 512MiB-filesystem.ext4 mountpoint
sudo chmod a+w mountpoint/
echo foo > mountpoint/bar
sudo umount mountpoint
As for your question about TAR. It is possible and a fun exercise but it might lack the tools that actually implement this. First off, TAR is a very simple file format, it consists of 512 B blocks that can either contain metadata or actual file contents simply copied from the original file without any compression.
A TAR can actually contain multiple files for the same path and by convention, the last duplicate path wins. This means, in order to "modify" a file, you can simply append a newer version of that file to the TAR:
tar --append --file archive.tar modified-file
This should be fast, but it would grow the archive with every file change, so it should be used sparingly.
If you want even more in-place modifications, they should be possible but there is no tooling yet for that as far as I know. I would like to implement that into ratarmount but I'm not sure when I'll get to it.
File system operations and how to implement them:
Modifying a file:
File size is constant: As long as the file size does not change, we could simply change the file inside the TAR if we know the offset for the file contents in the TAR archive, which ratarmount does have stored in an SQLite database.
File size is quasi constant: Actually, the file size might even change by up to 511 B and it still would be possible to simply update the file inside the TAR as long as it doesn't change the number of required TAR blocks (512 B). This would also require updating the file size in the TAR metadata block and updating the checksum of that metadata block, though.
Required TAR blocks shrink: If the required TAR blocks become fewer than before, then it still would be rather easy to modify the TAR on the fly as outlined above. But we would have to somehow format the unused blocks. We could simply fill them with zeros, but in this case, we would have to call tar with the --ignore-zeros option to still get a valid tar. Without that, all files after that position would suddenly appear lost, so it might be unsuited in some circumstances. But we could also simply fill the empty blocks with dummy data, e.g., a directory metadata entry for the / (root) folder. As long as it contains the same metadata as the actual root folder, it basically is a no-op. It might even be possible to create dummy metadata blocks for invalid paths like . or .. to effectively create blocks that are ignored even without the --ignore-zeros option.
*Required TAR blocks grow:` This is the most difficult case. If there is simply no space to put the added data to the file, then we might have to delete it and move it to the end of the file (if it isn't already at the end). Removing the file without rewriting everything else in the TAR would be implemented as mentioned above by either filling the parts with zeros or dummy metadata blocks. At this point, however, we could implement defragmentation techniques, e.g., by keeping track of all empty / dummy blocks in the TAR and looking for fitting places. Or if we want to append 1 KiB to a 1 GiB file, then it might avoid fragmentation better if we move a small file right after the 1 GiB file to the end of the TAR to make space for the 1 KiB to append.
Modifying file metadata:
In General: In general, metadata can be changed by simply changing it in the metadata block and updating the block checksum. This does not require rewriting anything else in the archive
Removals: This is basically the same as file modifications for shrinking block counts. Simply overwrite the space for this file entry with zeros or dummy blocks and maybe keep track of it for writing files into this space at a later time.
Renames: Renames can actually be more tricky than one might think. In most cases, it can also simply be updated, however, there are two problematic cases:
The file name becomes too long: If the file name becomes too long, then the GNU long name extension will allocate further blocks right after the TAR metadata block, which will contain the very long filename. This however would require one more block, which might require moving around blocks inside the TAR as outlined for file modifications
There are file name collisions: If the target path already exists, then simply updating the metadata might not suffice depending on the order the files appear in the TAR. The last one with the same path wins. This might be easy to circumvent by simply forbidding to move to an existing path or by calling remove on the existing file beforehand.
Create: This is simple. Simply append the file to the end of the archive. If implemented manually, then we might have to find the actual end of the data because TAR archives have at least 2 (often more) zero-byte blocks after the last valid data and simply appending new files after those zero blocks would require the --ignore-zero-bytes option.

Related

How to create a virtual file with inode that has references to storage in memory

Let me explain clearly.
The following is my requirement:
Let's say there is a command which has an option specified as '-f' that takes a filename as argument.
Now I have 5 files and I want to create a new file merging those 5 files and give the new filename as argument for the above command.
But there is a difference between
reading a single file and
merging all files & reading the merged file.
There is more IO (read from 5 files + write to the merged file + any IO our command does with the given file) generated in the second case than IO (any IO our command does with the given file) generated in the first case.
Can we reduce this unwanted IO?
In the end, I really don't want the merged file at all. I only create this merged file just to let the command read the merged files content.
And to say, I also don't want this implementation. The file sizes are not so big and it is okay to have that extra negligible IO. But, I am just curious to know if this can be done.
So in order to implement this, I have following understanding/questions:
Generally what all the commands (that takes the filename argument) does is it reads the file.
In our case, the filename(filepath) is not ready, it's just an virtual/imaginary filename that exists (as the mergation of all files).
So, can we create such virtual filename?
What is a filename? It's an indirect inode entry for a storage location.
In our case, the individual files have different inode entries and all inode entries have different storage locations. And our virtual/imaginary file has in fact no inode and even if we could create an imaginary inode, that can only point to a storage in memory (as there is no reference to the storage location of another file from a storage location of one file in disk)
But, let's say using advanced programming, we are able to create an imaginary filepath with imaginary inode, that points to a storage in memory.
Now, when we give that imaginary filename as argument and when the command tries to open that imaginary file, it finds that it's inode entry is referring to a storage in memory. But the actual content is there in disk and not in the memory. So, the data is not loaded into memory yet, unless we read it explicitly. Hence, again we would need to read the data first.
Simply saying, as there is no continuity or references at storage in disk to the next file data, the merged data needs to be loaded to memory first.
So, with my deduction, it seems we would at least need to put the data in memory. However, as the command itself would need the file to be read (if not the whole file, at least a part of it until the commands's operation is done - let it be parsing or whatever). So, using this method, we could save some significant IO, if it's really a big file.
So, how can we create that virtual file?
My first answer is to write the merged file to tmpfs and refer to that file. But is it the only option or can we actually point to a storage location in memory, other than tmpfs? tmpfs is not option because, my script can be run from any server and we need to have a solution that work from all servers. If I mention to create merged file at /dev/shm in my script, it may fail in the server where it doesn't have /dev/shm. So I should be able to load to memory directly. But I think normal user will not have access to memory and so, it seems can not be done without shm.
Please let me know your comments and also kindly correct me if my understanding anywhere is wrong. Even if it is complicated for my level, kindly post your answer. At least, I might understand it after few months.
Create a fifo (named pipe) and provide its name as an argument to your program. The process that combines the five input files writes to this fifo
mkfifo wtf
cat file1 file2 file3 file4 file5 > wtf # this will block...
[from another terminal] cp wtf omg
Here I used cp as your program, and cat as the program combining the five files. You will see that omg will contain the output of your program (here: cp) and that the first terminal will unblock after the program is done.
Your program (here:cp) is not even aware that its 1st argument wtf refers to a fifo; it just opens it and reads from it like it would do with an ordinary file. (this will fail if the program attempts to seek in the file; seek() is not implemented for pipes and fifos)

Best way to transfer file from Local path to NFS path

I am looking for the best optimized way we can use to transfer the large log files from local path to NFS path.
Here the log files will keep on changing dynamically with time.
What i am currently using is a java utility which will read the file from local path and will transfer it to NFS path. But this seems to be consuming high time.
We cant use copy commands, as the log file are getting appended with more new logs. So this will not work.
What i am looking for is .. Is there any way other than using a java utility which will transfer the details of log file from local path to NFS path.
Thanks in Advance !!
If your network speed is higher than log growing speed, you can just cp src dst.
If log grows too fast and you can't push that much data, but you only want to take current snapshot, I see three options:
Like you do now, read whole file into memory, as you do now, and then copy it to destination. With large log files it may result in very large memory footprint. Requires special utility or tmpfs.
Make a local copy of file, then move this copy to destination. Quite obvious. Requires you to have enough free space and increases storage device pressure. If temporary file is in tmpfs, this is exactly the same as first method, but doesn't requires special tools (still needs memory and large enough tmpfs).
Take current file size and copy only that amount of data, ignoring anything that will be appended during copying.
E.g.:
dd if=src.file of=/remote/dst.file bs=1 count=`stat -c '%s' src.file`
stat takes current file size, and then this dd is instructed to copy only that amount of bytes.
Due to low bs, for better performance you may want to combine it with another dd:
dd status=none bs=1 count=`stat -c '%s' src.file` | dd bs=1M of=/remote/dst.file

Easiest way to overwrite a series of files with zeros

I'm on Linux. I have a list of files and I'd like to overwrite them with zeros and remove them. I tried using
srm file1 file2 file3 ...
but it's too slow (I have to overwrite and remove ~50 GB of data) and I don't need that kind of security (I know that srm does a lot of passes instead of a single pass with zeros).
I know I could overwrite every single file using the command
cat /dev/zero > file1
and then remove it with rm, but I can't do that manually for every single file.
Is there a command like srm that does a single pass of zeros, or maybe a script that can do cat /dev/zero on a list of files instead of on a single one? Thank you.
Something like this, using stat to get the correct size to write, and dd to overwrite the file, might be what you need:
for f in $(<list_of_files.txt)
do
read blocks blocksize < <(stat -c "%b %B" ${f})
dd if=/dev/zero bs=${blocksize} count=${blocks} of=${f} conv=notrunc
rm ${f}
done
Use /dev/urandom instead of /dev/zero for (slightly) better erasure semantics.
Edit: added conv=notrunc option to dd invocation to avoid truncating the file when it's opened for writing, which would cause the associated storage to be released before it's overwritten.
I use shred for doing this.
The following are the options that I generally use.
shred -n 3 -z <filename> - This will make 3 passes to overwrite the file with random data. It will then make a final pass overwriting the file with zeros. The file will remain on disk though, but it'll all the 0's on disk.
shred -n 3 -z -u <filename> - Similar to above, but also unlinks (i.e. deletes) the file. The default option for deleting is wipesync, which is the most secure but also the slowest. Check the man pages for more options.
Note: -n is used here to control the number of iterations for overwriting with random data. Increasing this number, will result in the shred operation taking longer to complete and better shredding. I think 3 is enough but maybe wrong.
The purpose of srm is to destroy the data in the file before releasing its blocks.
cat /dev/null > file is not at all equivalent to srm because
it does not destroy the data in the file: the blocks will be released with the original data intact.
Using /dev/zero instead of /dev/null does not even work because /dev/zero never ends.
Redirecting the output of a program to the file will never work for the same reason given for cat /dev/null.
You need a special-purpose program that opens the given file for writing, writes zeros over all bytes of the file, and then removes the file. That's what srm does.
Is there a command like srm that does a single pass of zeros,
Yes. SRM does this with the correct parameters. From man srm:
srm -llz
-l lessens the security. Only two passes are written: one mode with
0xff and a final mode random values.
-l -l for a second time lessons the security even more: only one
random pass is written.
-z wipes the last write with zeros instead of random data
srm -llzr will do the same recursively if wiping a directory.
You can even use 'srm -llz [file1] [file2] [file3] to wipe multiple files i this way with a single command

Header and structure of a tar format

I have a project for school which implies making a c program that works like tar in unix system. I have some questions that I would like someone to explain to me:
The dimension of the archive. I understood (from browsing the internet) that an archive has a define number of blocks 512 bytes each. So the header has 512 bytes, then it's followed by the content of the file (if it's only one file to archive) organized in blocks of 512 bytes then 2 more blocks of 512 bytes.
For example: Let's say that I have a txt file of 0 bytes to archive. This should mean a number of 512*3 bytes to use. Why when I'm doing with the tar function in unix and click properties it has 10.240 bytes? I think it adds some 0 (NULL) bytes, but I don't know where and why and how many...
The header chcksum. As I know this should be the size of the archive. When I check it with hexdump -C it appears like a number near the real size (when clicking properties) of the archive. For example 11200 or 11205 or something similar if I archive a 0 byte txt file. Is this size in octal or decimal? My bets are that is in octal because all information you put in the header it needs to be in octal. My second question at this point is what is added more from the original size of 10240 bytes?
Header Mode. Let's say that I have a file with 664, the format file will be 0, then I should put in header 0664. Why, on a authentic archive is printed 3 more 0 at the start (000064) ?
There have been various versions of the tar format, and not all of the extensions to previous formats were always compatible with each other. So there's always a bit of guessing involved. For example, in very old unix systems, file names were not allowed to have more than 14 bytes, so the space for the file name (including path) was plenty; later, with longer file names, it had to be extended but there wasn't space, so the file name got split in 2 parts; even later, gnu tar introduced the ##LongLink pseudo-symbolic links that would make older tars at least restore the file to its original name.
1) Tar was originally a *T*ape *Ar*chiver. To achieve constant througput to tapes and avoid starting/stopping the tape too much, several blocks needed to be written at once. 20 Blocks of 512 bytes were the default, and the -b option is there to set the number of blocks. Very often, this size was pre-defined by the hardware and using wrong blocking factors made the resulting tape unusable. This is why tar appends \0-filled blocks until the tar size is a multiple of the block size.
2) The file size is in octal, and contains the true size of the original file that was put into the tar. It has nothing to do with the size of the tar file.
The checksum is calculated from the sum of the header bytes, but then stored in the header as well. So the act of storing the checksum would change the header, thus invalidate the checksum. That's why you store all other header fields first, set the checksum to spaces, then calculate the checksum, then replace the spaces with your calculated value.
Note that the header of a tarred file is pure ascii. This way, In those old days, when a tar file (whose components were plain ascii) got corrupted, an admin could just open the tar file with an editor and restore the components manually. That's why the designers of the tar format were afraid of \0 bytes and used spaces instead.
3) Tar files can store block devices, character devices, directories and such stuff. Unix stores these file modes in the same place as the permission flags, and the header file mode contains the whole file mode, including file type bits. That's why the number is longer than the pure permission.
There's a lot of information at http://en.wikipedia.org/wiki/Tar_%28computing%29 as well.

removing a line from a text file?

I am working with a text file, which contains a list of processes under my programs control, along with relevant data.
At some point, one of the processes will finish, and thus will need to be removed from the file (as its no longer under control).
Here is a sample of the file contents (which has enteries added "randomly"):
PID=25729 IDLE=0.200000 BUSY=0.300000 USER=-10.000000
PID=26416 IDLE=0.100000 BUSY=0.800000 USER=-20.000000
PID=26522 IDLE=0.400000 BUSY=0.700000 USER=-30.000000
So for example, if I wanted to remove the line that says PID=26416.... how could I do that, without writing the file over again?
I can use external unix commands, however I am not very familiar with them so please if that is your suggestion, give an example.
Thanks!
Either you keep the contents of the file in temporary memory and then rewrite the file. Or you could have a file for each of the PIDs with the relevant information in them. Then you simply delete the file when it's no longer running. Or you could use a database for this instead.
As others have already pointed out, your only real choice is to rewrite the file.
The obvious way to do that with "external UNIX commands" would be grep -v "PID=26416" (or whatever PID you want to remove, obviously).
Edit: It is probably worth mentioning that if the lines are all the same length (as you've shown here) and order doesn't matter, you could delete a line more efficiently by copying the last line into the space being vacated, then shorten the file so eliminate what had been the last line. This will only work if they really are all the same length though (e.g., if you got a PID of '1', you'd need to pad it to the same length as the others in the file).
The only way is by copying each character that comes after the deleted line down over the characters that are deleted.
It is far more efficient to simply rewrite the file.
how could I do that, without writing the file over again?
You cannot. Filesystems (perhaps besides more esoteric record based ones) does not support insertion or deletion.
So you'll have to write the lines to a temporary file up till the line you want to delete, skip over that line, and write the rest of the lines to the file. When done, rename/copy the temp file to the original filename
Why are you maintaining these in a text file? That's not the best model for such a task. But, if you're stuck with it ... if these lines are guaranteed to all be the same length (it appears that way from the sample), and if the order of the lines in the file doesn't matter, then you can write the last line over the line for the process that has died and then shorten the file by one line with the (f)truncate() call if you're on a POSIX system: see Jonathan Leffler's answer in How to truncate a file in C?
But note carefully netrom's answer, which gives three different better ways to maintain this info.
Also, if you stick with a text file (preferably written from scratch each time from data structures you maintain, as per netrom's first suggestion), and you want to be sure that the file is always well formed, then write the new data into a temp file on the same device (putting it in the same directory is easiest) and then do a rename() call, which is an atomic operation.
You can use sed:
sed -i.bak -e '/PID=26416/d' test
-i is for editing in place. It also creates a back-up file with the new extension .bak
-e is for specifying the pattern. The /d indicates all lines matching the pattern should be deleted.
test is the filename
The unix command for it is:
grep -v "PID=26416" myfile > myfile.tmp
mv myfile.tmp myfile
The grep -v part outputs the file without the rows with the search term.
The > myfile.tmp part creates a new temp file for this output.
The mv part renames the temp file to the original file.
Note that we are rewriting the file here, and moreover, we can lose data if someone write something to file between the two commands.

Resources