Header and structure of a tar format - archive

I have a school project that involves writing a C program that works like tar on Unix systems. I have some questions that I would like someone to explain to me:
1) The size of the archive. I understood (from browsing the internet) that an archive consists of a whole number of blocks of 512 bytes each. So the header takes 512 bytes, then it is followed by the content of the file (if there is only one file to archive) organized in blocks of 512 bytes, then 2 more blocks of 512 bytes.
For example: let's say that I have a txt file of 0 bytes to archive. This should mean 512*3 bytes in total. Why, when I create the archive with the tar command on Unix and click Properties, does it have 10,240 bytes? I think it adds some 0 (NULL) bytes, but I don't know where, why, or how many...
2) The header checksum. As far as I know this should be the size of the archive. When I check it with hexdump -C it appears as a number near the real size (as shown under Properties) of the archive, for example 11200 or 11205 or something similar if I archive a 0-byte txt file. Is this size in octal or decimal? My bet is that it is in octal, because all the information you put in the header needs to be in octal. My second question at this point is: what is added on top of the original size of 10240 bytes?
3) Header mode. Let's say that I have a file with mode 664; the file type will be 0, so I should put 0664 in the header. Why does a real archive print 3 more zeros at the start (0000664)?

There have been various versions of the tar format, and the extensions to earlier formats were not always compatible with each other, so there's always a bit of guessing involved. For example, in very old Unix systems, file names were not allowed to have more than 14 bytes, so the space for the file name (including path) was plenty; later, with longer file names, it had to be extended but there wasn't space, so the file name got split into 2 parts; even later, GNU tar introduced the ././@LongLink pseudo-symbolic links that would make older tars at least restore the file to its original name.
1) Tar was originally a *T*ape *Ar*chiver. To achieve constant throughput to tapes and avoid starting and stopping the tape too often, several blocks needed to be written at once. 20 blocks of 512 bytes were the default, and the -b option sets the number of blocks. Very often, this size was dictated by the hardware, and using the wrong blocking factor made the resulting tape unusable. This is why tar appends \0-filled blocks until the archive size is a multiple of the record size (blocking factor × 512 bytes). With the default, that is 20 × 512 = 10240 bytes, which is exactly the size you see for your small archive.
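The arithmetic behind the 10240 bytes can be sketched in a few lines of C. This is only an illustration for an archive holding a single regular file; the function names tar_archive_size and round_up are made up for this example and are not part of any tar implementation.

```c
#include <stdint.h>

#define TAR_BLOCK 512u

/* Round n up to the next multiple of m (m must be > 0). */
static uint64_t round_up(uint64_t n, uint64_t m)
{
    return (n + m - 1) / m * m;
}

/*
 * Size of a tar archive holding one regular file of file_size bytes,
 * written with the given blocking factor (20 is the traditional default).
 * Hypothetical helper, for illustration only.
 */
uint64_t tar_archive_size(uint64_t file_size, unsigned blocking_factor)
{
    uint64_t header  = TAR_BLOCK;                      /* one header block       */
    uint64_t data    = round_up(file_size, TAR_BLOCK); /* content, zero-padded   */
    uint64_t trailer = 2 * TAR_BLOCK;                  /* end-of-archive marker  */
    uint64_t record  = (uint64_t)blocking_factor * TAR_BLOCK;

    /* The whole archive is padded out to a full record. */
    return round_up(header + data + trailer, record);
}
```

For an empty file, tar_archive_size(0, 20) gives 10240, while a blocking factor of 1 gives the 512*3 = 1536 bytes the question expected.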
2) The file size is in octal, and contains the true size of the original file that was put into the tar. It has nothing to do with the size of the tar file.
The checksum is calculated as the sum of the header bytes, but it is then stored in the header as well. Storing the checksum would change the header and thus invalidate the checksum. That's why you fill in all other header fields first, set the checksum field to spaces, then calculate the checksum, then replace the spaces with the calculated value (written in octal, like the other numeric fields). So the checksum is not a size at all; any resemblance to the size of the archive is coincidental.
Note that the header of a tarred file is pure ASCII. In the old days, when a tar file (whose components were plain ASCII) got corrupted, an admin could just open the tar file with an editor and restore the components manually. That's why the designers of the tar format were wary of \0 bytes and used spaces instead.
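The checksum procedure described above might look like this in C. It is a minimal sketch that assumes the ustar layout, where the checksum field is 8 bytes at offset 148 and is conventionally written as six octal digits, a NUL and a space; tar_set_checksum is a name invented for this example.

```c
#include <stdio.h>
#include <string.h>

#define TAR_BLOCK  512u
#define CHKSUM_OFF 148u  /* offset of the checksum field (ustar layout assumed) */
#define CHKSUM_LEN 8u

/* Compute the checksum of a header block and store it in that block. */
unsigned tar_set_checksum(unsigned char header[TAR_BLOCK])
{
    unsigned sum = 0;

    /* 1. Treat the checksum field as if it held eight spaces. */
    memset(header + CHKSUM_OFF, ' ', CHKSUM_LEN);

    /* 2. Add up all 512 header bytes as unsigned values. */
    for (size_t i = 0; i < TAR_BLOCK; i++)
        sum += header[i];

    /* 3. Replace the spaces: six octal digits, a NUL, then a space. */
    snprintf((char *)header + CHKSUM_OFF, 7, "%06o", sum);
    header[CHKSUM_OFF + 6] = '\0';
    header[CHKSUM_OFF + 7] = ' ';
    return sum;
}
```

Call it last, after every other header field has been filled in. To verify an existing header, the same sum is recomputed with the field blanked out and compared with the stored octal value.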
3) Tar files can store block devices, character devices, directories and the like. Unix keeps the file type bits in the same mode word as the permission flags, and some tar implementations write the whole file mode, including the file type bits, into the header. In any case the mode field has a fixed width of 8 bytes, so the octal value is padded with leading zeros to fill it. That's why the number is longer than the bare permission.
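A small sketch of how the mode field could be filled in C. It assumes the ustar convention, where the field holds seven octal digits plus a terminating NUL and the file type goes into the separate typeflag byte, so the type bits are masked off here; tar_format_mode is a hypothetical helper name.

```c
#include <stdio.h>

/* Write a permission value into an 8-byte tar mode field. */
void tar_format_mode(char field[8], unsigned mode)
{
    /* 07777 keeps only the permission, setuid/setgid and sticky bits
     * (assumption: the file type is recorded elsewhere, in typeflag). */
    snprintf(field, 8, "%07o", mode & 07777u);
}
```

With this, a file whose permissions are 664 ends up as the text 0000664 in the header, which matches the extra leading zeros seen in a real archive.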
There's a lot of information at http://en.wikipedia.org/wiki/Tar_%28computing%29 as well.

Related

What kinds of things are stored in 1 byte files?

Page 301 of Tanenbaum's Modern Operating Systems contains the table below. It gives the file sizes on a 2005 commercial Web server. The chapter is on file systems, so these data points are meant to be similar to what you would see on a typical storage device.
File length (bytes)    Percentage of files less than length
1                      6.67
2                      7.67
4                      8.33
8                      11.30
16                     11.46
32                     12.33
64                     26.10
128                    28.49
...                    ...
1 KB                   47.82
...                    ...
1 MB                   98.99
...                    ...
128 MB                 100
In the table, you will see that 6.67% of files on this server are 1 byte in length. What kinds of processes are creating 1 byte files? What kind of data would be stored in these files?
I wasn't familiar with that table, but it piqued my interest. I'm not sure what the 1-byte files were at the time, but perhaps the 1-byte files of today can shed some light?
I searched for files of size 1 byte with
sudo find / -size 1c 2>/dev/null | while IFS= read -r line; do ls -lah "$line"; done
Looking at the contents of these files on my system, they contain a single character: a newline. This can be verified by running the file through hexdump. A file with a single newline can exist for multiple reasons, but it probably has to do with the convention of terminating a line with a newline.
There is a second type of file with size 1 byte: symbolic links where the target is a single character. ext4 appears to report the length of the target as the size of the symbolic link (at least for short-length targets).

Can a tar file be efficiently randomly edited?

Is there a way to modify an individual file within a tar file without having to rewrite the entire archive? I recognize this would probably result in fragmentation.
Is there any other archive format that does this?
First off, you should only ask exactly one question on StackOverflow. If you truly want to do frequent writes to the "archive", then you might be better off simply creating a large file, formatting it with some file system of your choice and then mounting it:
truncate -s $(( 512*1024*1024 )) 512MiB-filesystem.ext4
mkfs.ext4 512MiB-filesystem.ext4
sudo mount -o loop 512MiB-filesystem.ext4 mountpoint
sudo chmod a+w mountpoint/
echo foo > mountpoint/bar
sudo umount mountpoint
As for your question about TAR: it is possible and a fun exercise, but there may be no tools that actually implement it. TAR is a very simple file format. It consists of 512 B blocks that contain either metadata or actual file contents, which are simply copied from the original file without any compression.
A TAR can actually contain multiple files for the same path and by convention, the last duplicate path wins. This means, in order to "modify" a file, you can simply append a newer version of that file to the TAR:
tar --append --file archive.tar modified-file
This should be fast, but it would grow the archive with every file change, so it should be used sparingly.
If you want even more in-place modifications, they should be possible but there is no tooling yet for that as far as I know. I would like to implement that into ratarmount but I'm not sure when I'll get to it.
File system operations and how to implement them:
Modifying a file:
File size is constant: As long as the file size does not change, we could simply change the file inside the TAR if we know the offset for the file contents in the TAR archive, which ratarmount does have stored in an SQLite database.
File size is quasi constant: Actually, the file size might even change by up to 511 B and it still would be possible to simply update the file inside the TAR as long as it doesn't change the number of required TAR blocks (512 B). This would also require updating the file size in the TAR metadata block and updating the checksum of that metadata block, though.
Required TAR blocks shrink: If the required TAR blocks become fewer than before, then it still would be rather easy to modify the TAR on the fly as outlined above. But we would have to somehow format the unused blocks. We could simply fill them with zeros, but in this case, we would have to call tar with the --ignore-zeros option to still get a valid tar. Without that, all files after that position would suddenly appear lost, so it might be unsuited in some circumstances. But we could also simply fill the empty blocks with dummy data, e.g., a directory metadata entry for the / (root) folder. As long as it contains the same metadata as the actual root folder, it basically is a no-op. It might even be possible to create dummy metadata blocks for invalid paths like . or .. to effectively create blocks that are ignored even without the --ignore-zeros option.
Required TAR blocks grow: This is the most difficult case. If there is simply no space to put the added data to the file, then we might have to delete it and move it to the end of the file (if it isn't already at the end). Removing the file without rewriting everything else in the TAR would be implemented as mentioned above by either filling the parts with zeros or dummy metadata blocks. At this point, however, we could implement defragmentation techniques, e.g., by keeping track of all empty / dummy blocks in the TAR and looking for fitting places. Or if we want to append 1 KiB to a 1 GiB file, then it might avoid fragmentation better if we move a small file right after the 1 GiB file to the end of the TAR to make space for the 1 KiB to append.
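The four cases above all come down to comparing how many 512 B blocks the old and the new contents need. A minimal sketch in C, with names (tar_data_blocks, classify_edit, edit_kind) invented for this example:

```c
#include <stdint.h>

#define TAR_BLOCK 512u

/* Number of 512-byte data blocks needed for a file of the given size. */
uint64_t tar_data_blocks(uint64_t size)
{
    return (size + TAR_BLOCK - 1) / TAR_BLOCK;
}

typedef enum { EDIT_IN_PLACE, EDIT_SHRINK, EDIT_GROW } edit_kind;

/* Decide which of the cases an edit from old_size to new_size falls into. */
edit_kind classify_edit(uint64_t old_size, uint64_t new_size)
{
    uint64_t old_blocks = tar_data_blocks(old_size);
    uint64_t new_blocks = tar_data_blocks(new_size);

    if (new_blocks == old_blocks)
        return EDIT_IN_PLACE;   /* constant or quasi-constant size */
    return new_blocks < old_blocks ? EDIT_SHRINK : EDIT_GROW;
}
```

EDIT_IN_PLACE covers both the constant and the quasi-constant case; in the latter the size field and the header checksum still have to be rewritten.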
Modifying file metadata:
In General: In general, metadata can be changed by simply changing it in the metadata block and updating the block checksum. This does not require rewriting anything else in the archive
Removals: This is basically the same as file modifications for shrinking block counts. Simply overwrite the space for this file entry with zeros or dummy blocks and maybe keep track of it for writing files into this space at a later time.
Renames: Renames can actually be more tricky than one might think. In most cases, it can also simply be updated, however, there are two problematic cases:
The file name becomes too long: If the file name becomes too long, then the GNU long name extension will allocate further blocks right after the TAR metadata block, which will contain the very long filename. This, however, requires at least one more block, which might mean moving blocks around inside the TAR as outlined for file modifications.
There are file name collisions: If the target path already exists, then simply updating the metadata might not suffice depending on the order the files appear in the TAR. The last one with the same path wins. This might be easy to circumvent by simply forbidding to move to an existing path or by calling remove on the existing file beforehand.
Create: This is simple. Simply append the file to the end of the archive. If implemented manually, then we might have to find the actual end of the data, because TAR archives have at least 2 (often more) zero-byte blocks after the last valid data, and simply appending new files after those zero blocks would require the --ignore-zeros option when reading the archive.
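Finding the actual end of the data could be sketched as follows, here on an archive that is already in memory. The sketch walks from header to header using the size field (12 bytes of octal text at offset 124 in the classic and ustar layouts) and stops at the first all-zero block. It is an assumption-laden illustration: it ignores the binary size encoding GNU tar uses for very large files, and tar_append_offset is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TAR_BLOCK 512u

static bool block_is_zero(const unsigned char *block)
{
    for (size_t i = 0; i < TAR_BLOCK; i++)
        if (block[i] != 0)
            return false;
    return true;
}

/*
 * Return the offset at which a new entry should be written: the position
 * of the first zero block found where a header was expected, or len if
 * the archive has no end-of-archive marker.
 */
size_t tar_append_offset(const unsigned char *archive, size_t len)
{
    size_t off = 0;

    while (off + TAR_BLOCK <= len && !block_is_zero(archive + off)) {
        char size_text[13] = {0};                  /* 12 bytes + NUL */
        memcpy(size_text, archive + off + 124, 12);
        size_t size = (size_t)strtoull(size_text, NULL, 8);
        size_t data_blocks = (size + TAR_BLOCK - 1) / TAR_BLOCK;

        /* Skip this header and its data blocks. */
        off += TAR_BLOCK + data_blocks * TAR_BLOCK;
    }
    return off < len ? off : len;
}
```

Walking the headers instead of scanning for zero blocks means a member whose contents happen to be all zeros is not mistaken for the end of the archive. After writing the new entry at that offset, two zero blocks have to be written again behind it.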

how to create a symbolic link in EXT2 file system

I am working with the EXT2 File System and spent the last 2 days trying to figure out how to create a symbolic link. From http://www.nongnu.org/ext2-doc/ext2.html#DEF-SYMBOLIC-LINKS, "For all symlink shorter than 60 bytes long, the data is stored within the inode itself; it uses the fields which would normally be used to store the pointers to data blocks. This is a worthwhile optimization as it we avoid allocating a full block for the symlink, and most symlinks are less than 60 characters long"
To create a sym link at /link1 to /source I create a new inode and say it gets index 24. Since it's <60 characters, I placed the string "/source" starting at the i_block[0] field (so printing new_inode->i_block[0] in gdb shows "/dir2/source") and set i_links_count to 1, i_size and i_blocks to 0. I then created a directory entry at the inode 2 (root inode) with the properties 24, "link1", and file type EXT2_FT_SYMLINK.
A link called "link1" gets created, but it's a directory, and when I click it, it goes to "/". I'm wondering what I'm doing wrong...
A (very) late response, but just because the symlink's data is in the block pointers, that doesn't mean the file size is 0! You need to set the i_size field in the symlink's inode equal to the length of the path.
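A sketch of what filling in such an inode might look like in C. The struct below is a reduced stand-in with only the fields discussed here, not the real struct ext2_inode, and make_fast_symlink is a hypothetical helper. Setting the file type bits in i_mode is an addition that the answer does not mention; the value 0xA000 is the symlink type (EXT2_S_IFLNK) from the ext2 documentation.

```c
#include <stdint.h>
#include <string.h>

#define EXT2_S_IFLNK 0xA000  /* file type bits for a symbolic link */

/* Reduced stand-in for struct ext2_inode: only the fields touched here. */
struct mini_inode {
    uint16_t i_mode;
    uint32_t i_size;
    uint16_t i_links_count;
    uint32_t i_blocks;
    uint32_t i_block[15];    /* 60 bytes, reused to hold a short target */
};

/* Fill in a fast symlink; returns 0 on success, -1 if the target is too long. */
int make_fast_symlink(struct mini_inode *ino, const char *target)
{
    size_t len = strlen(target);

    if (len >= sizeof ino->i_block)
        return -1;                       /* would need a data block instead */

    memset(ino, 0, sizeof *ino);
    ino->i_mode        = EXT2_S_IFLNK | 0777;
    ino->i_size        = (uint32_t)len;  /* length of the path, not 0 */
    ino->i_links_count = 1;
    ino->i_blocks      = 0;              /* no data blocks allocated */
    memcpy(ino->i_block, target, len);   /* target stored inside the inode */
    return 0;
}
```

For the target "/source" this sets i_size to 7 while i_blocks stays 0.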

Read/Write files in C

I'm writing a program in C that basically creates an archive file for a given list of file names. This is pretty similar to the ar command in Linux. This is what the archive file would look like:
!<arch>
file1.txt/ 1350248044 45503 13036 100660 28 `
hello
this is sample file 1
file2.txt/ 1350512270 45503 13036 100660 72 `
hello
this is sample file 2
this file is a little larger than file1.txt
But I'm having difficulties trying to extract a file from the archive. Let's say the user wants to extract file1.txt. The idea is that it should get the index/location of the file name (in this case file1.txt), skip 58 characters to reach the content of the file, read the content, and write it to a new file. So here are my questions:
1) How can I get the index/location of the file name in the archive file? Note that duplicate file names are NOT allowed, so I don't have to worry about having two different indices.
2) How can I skip several characters (in this case 58) when reading a file?
3) How can I figure out when the content of a file ends? i.e. I need it to read the content and stop right before the file2.txt/ header.
My approach to solving this problem would be:
To have a header information that contains the size of each file, its name and its location in the file.
Then parse the header, use fseek() and ftell() as well as fgetc() or fread() functions to get bytes of the file and then, create+write that data to it. This is the simplest way I can think of.
http://en.wikipedia.org/wiki/Ar_(Unix)#File_header <- Header of ar archives.
EXAMPLE:
@programmer93 Consider that your header is 80 bytes long (the header contains the metadata of the archive file). You have two files, one of 112 bytes and the other of 182 bytes. Now they're laid out in a flat file (the archive file), so it would be 80 (header), 112 (file1.txt), 182 (file2.txt), EOF. Thus, if you know the size of each file, you can easily navigate (using fseek()) to a particular file and extract only that file. To extract file2.txt, I would just call fseek(fp, 112 + 80, SEEK_SET), where fp is the FILE* of the archive, and then call fgetc() 182 times. I think I made myself clear?
If the format of the file cannot be changed by adding additional header information to help, you'll have to search through it and work things out as you go.
This should not be too hard. Just read the file, and when you read a header line such as
file1.txt/ 1350248044 45503 13036 100660 28 `
you can check the filename and size etc. (You know you'll have a header line at the start after the !<arch>). If this is the file you want, the ftell() function from stdio.h will tell you exactly where you are in the file. Since the file size in bytes is given in the header line, you can read the file by reading that particular number of bytes ahead in the normal manner. Similarly, if it is not the file you want, you can use fseek() to move forward the number of bytes in the file you are skipping and be ready to read in the header info for the next file and repeat the process.
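The scan described above could look roughly like this in C. It follows the simplified layout shown in the question (a "!<arch>" line, then one header line per member followed directly by its contents). The real ar format uses fixed-width 60-byte headers and pads member data to an even length, so treat this as a sketch; find_member is a name made up for the example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look for a member called wanted. On success, return its size and leave
 * the stream positioned at the first byte of its contents; return -1 if
 * the member is missing or the archive does not parse.
 */
long find_member(FILE *ar, const char *wanted)
{
    char line[128];
    char name[64];
    long size;

    rewind(ar);
    if (!fgets(line, sizeof line, ar) || strcmp(line, "!<arch>\n") != 0)
        return -1;

    while (fgets(line, sizeof line, ar)) {
        /* Header line: name/ mtime uid gid mode size ` */
        if (sscanf(line, "%63[^/]/ %*s %*s %*s %*s %ld", name, &size) != 2)
            return -1;
        if (strcmp(name, wanted) == 0)
            return size;                 /* contents start right here */
        if (fseek(ar, size, SEEK_CUR) != 0)
            return -1;                   /* skip this member's contents */
    }
    return -1;
}
```

After a successful call, ftell() reports where the contents start, and reading exactly the returned number of bytes with fread() gives the member's data, which answers questions 1) to 3) in one pass.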

Hexadecimal virus signatures database

Over the past couple of weeks, I was in the process of developing a simple virus scanner. It works great but my question is does anybody know where I can get a database (a single file) that contains 8000 or more virus signatures WITH their names, and possibly risk meter (high, low, unknown)?
Try the ClamAV database. This also includes some more complex signatures, but some are just byte sequences.
The CVD file format is a compressed tar file with a header block attached; see here for header information, or this PDF for the real details.
As I understand it, you should be able to decompress it with
dd if=file.cvd bs=512 skip=1 | tar zxvf -
This will unpack to a collection of various files; for files that have simple hex signatures, these will be found in a file with the extension .db. Not all of these signatures are pure hex -- many of them contain wildcards such as ?? for "allow any byte here", * for "allow any number of intervening bytes here", (-4096) for "allow up to 4k of intervening bytes here", and so forth.
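To illustrate what matching such a signature involves, here is a minimal C sketch that handles plain hex byte pairs and the ?? wildcard only. It does not handle *, ranges like (-4096), or any other part of the real ClamAV signature syntax, and the function names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Does sig (hex pairs and "??" only) match the bytes starting at data[0]? */
static bool sig_matches_at(const char *sig, const unsigned char *data, size_t len)
{
    size_t i = 0;

    for (; sig[0] && sig[1]; sig += 2, i++) {
        if (i >= len)
            return false;                /* ran out of data */
        if (sig[0] == '?' && sig[1] == '?')
            continue;                    /* wildcard: any byte is fine */
        int hi = hex_val(sig[0]), lo = hex_val(sig[1]);
        if (hi < 0 || lo < 0)
            return false;                /* unsupported signature syntax */
        if (data[i] != (unsigned char)((hi << 4) | lo))
            return false;
    }
    return sig[0] == '\0';               /* reject odd-length signatures */
}

/* Offset of the first match of sig in data, or -1 if there is none. */
long sig_find(const char *sig, const unsigned char *data, size_t len)
{
    for (size_t off = 0; off < len; off++)
        if (sig_matches_at(sig, data + off, len - off))
            return (long)off;
    return -1;
}
```

This is a naive scan that retries the signature at every offset. It is fine for trying out a handful of signatures but far too slow for a database of thousands, where a multi-pattern algorithm would be needed.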

Resources