I'm using zlib to decompress a file. I want to verify that there is enough disk space to unzip the file. Do the zip format and zlib provide facilities for determining the decompressed size of their contents?
If your file was compressed in the gzip format (RFC 1952), then the last 4 bytes, the ISIZE field, give the uncompressed file size mod 2^32. Therefore, provided that the original file was smaller than 4 GB, you can determine its size by reading the last 4 bytes. Check the man page for gunzip.
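For example, here is a minimal C sketch of that check. It assumes a single-member .gz file (for concatenated gzip members, ISIZE only describes the last member) and keeps error handling short:

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    unsigned char buf[4];
    uint32_t isize;
    FILE *f;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file.gz\n", argv[0]);
        return 1;
    }
    if ((f = fopen(argv[1], "rb")) == NULL) {
        perror("fopen");
        return 1;
    }
    /* The ISIZE field is the last 4 bytes of the file. */
    if (fseek(f, -4L, SEEK_END) != 0 || fread(buf, 1, 4, f) != 4) {
        perror("reading ISIZE");
        fclose(f);
        return 1;
    }
    /* ISIZE is stored least-significant byte first (RFC 1952). */
    isize = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8)
          | ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    printf("uncompressed size mod 2^32: %u\n", (unsigned)isize);
    fclose(f);
    return 0;
}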
If the zlib or raw deflate format was used, you will have to decompress the data first to determine the uncompressed size.
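If that is your case, here is a rough C sketch of counting the output without keeping it. It assumes the zlib wrapper (pass -15 instead of 15 to inflateInit2 for raw deflate) and that the compressed data is already in memory:

#include <string.h>
#include <zlib.h>

/* Inflate a zlib (RFC 1950) stream held in memory just to count its
   uncompressed size; the decompressed bytes themselves are discarded.
   Returns -1 if the data is not a complete, valid stream. */
static long long uncompressed_size(const unsigned char *in, size_t in_len)
{
    z_stream strm;
    unsigned char scratch[16384];
    long long total;
    int ret;

    memset(&strm, 0, sizeof strm);
    if (inflateInit2(&strm, 15) != Z_OK)   /* 15 = zlib wrapper, 32K window */
        return -1;

    strm.next_in  = (unsigned char *)in;
    strm.avail_in = (uInt)in_len;
    do {
        strm.next_out  = scratch;          /* overwrite the same scratch buffer */
        strm.avail_out = sizeof scratch;
        ret = inflate(&strm, Z_NO_FLUSH);
    } while (ret == Z_OK);

    /* total_out is a uLong, so on 32-bit builds it wraps past 4 GB. */
    total = (ret == Z_STREAM_END) ? (long long)strm.total_out : -1;
    inflateEnd(&strm);
    return total;
}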
Just create a simple file structure for compressing:
{
FileFormatHeader (optional) x bytes
OriginalSize (4 or 8 bytes)
CompressedSize (optional) (4 or 8 bytes)
HashSum (optional) (16 bytes or different number [depends on hash algorithm])
CompressedData
}
Now you have all the information you need for decompressing.
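A minimal C sketch of such a header (the field names, fixed sizes and layout are my own choices, not part of any standard):

#include <stdint.h>

/* Hypothetical container header written in front of the compressed data. */
struct my_header {
    char     magic[4];        /* optional file-format marker, e.g. "MYZ1" */
    uint64_t original_size;   /* uncompressed size, in bytes */
    uint64_t compressed_size; /* optional: size of the compressed payload */
    uint8_t  hash[16];        /* optional: e.g. MD5 of the original data */
};
/* ...immediately followed by the compressed data. */

In practice you would read and write each field individually, with a fixed byte order, rather than dumping the struct, to avoid padding and endianness surprises.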
Page 301 of Tanenbaum's Modern Operating Systems contains the table below. It gives the file sizes on a 2005 commercial Web server. The chapter is on file systems, so these data points are meant to be similar to what you would see on a typical storage device.
File length (bytes)   Percentage of files less than length
1                     6.67
2                     7.67
4                     8.33
8                     11.30
16                    11.46
32                    12.33
64                    26.10
128                   28.49
...                   ...
1 KB                  47.82
...                   ...
1 MB                  98.99
...                   ...
128 MB                100
In the table, you will see that 6.67% of files on this server are 1 byte in length. What kinds of processes are creating 1 byte files? What kind of data would be stored in these files?
I wasn't familiar with that table, but it piqued my interest. I'm not sure what the 1-byte files were at the time, but perhaps the 1-byte files of today can shed some light?
I searched for files of size 1 byte with:
sudo find / -size 1c 2>/dev/null | while read -r line; do ls -lah "$line"; done
Looking at the contents of these files on my system, they contain a single character: a newline. This can be verified by running the file through hexdump. A file with a single newline can exist for multiple reasons, but it probably has to do with the convention of terminating a line with a newline.
There is a second type of file with size 1 byte: symbolic links where the target is a single character. ext4 appears to report the length of the target as the size of the symbolic link (at least for short-length targets).
I am trying to merge small files of less than 512 MB in an HDFS directory. After merging, the size of the files on disk is more than the input size. Is there any way to control the size efficiently?
df = spark.read.parquet("/./")
magic_number = (total size of input files / 512)
df.repartition(magic_number).write.save("/./")
Repartition is causing a lot of shuffling, and the input files are in Parquet format.
import org.apache.spark.util.SizeEstimator
val numBytes = SizeEstimator.estimate(df)
val desiredBytesPerFile: Long = ???   // your target size per output file, in bytes
df.coalesce((numBytes / desiredBytesPerFile).toInt).write.save("/./")
This will give you approximately the right number of bytes per file.
I was trying to get the number of blocks allocated to a file using C. I used the stat struct and its st_blocks member. However, this returns a different number of blocks than ls -s does. Can anybody explain the reason for this, and whether there is a way to correct it?
There is no discrepancy, just a misunderstanding. There are two separate "block sizes" here. Use ls -s --block-size=512 to make ls use a 512-byte block size as well.
The ls -s command lists the size allocated to the file in user-specified units ("blocks"), the size of which you can specify using the --block-size option.
The st_blocks field in struct stat is in units of 512 bytes.
You see a discrepancy, because the two "block sizes" are not the same. They just happen to be called the same name.
Here is an example with which you can examine this effect. It works on all POSIXy/Unixy file systems that support sparse files, but not on FAT/VFAT etc.
First, let's create a file that is one megabyte long but has a hole at the beginning (the hole reads as zeros but is not actually stored on disk), with a single byte at the end (I'll use 'X').
We do this by using dd to skip the first 1048575 bytes of the file (creating a "hole", and thus a sparse file, on filesystems that support such):
printf 'X' | dd bs=1 seek=1048575 of=sparse-file count=1
We can use the stat utility to examine the file. Format specifier %s provides the logical size of the file (1048576), %b the number of blocks (st_blocks):
stat -c 'st_size=%s st_blocks=%b' sparse-file
On my system, I get st_size=1048576 st_blocks=8, because the actual filesystem block size is 4096 bytes (= 8×512), and this sparse file needs only one filesystem block.
However, using ls -s sparse-file I get 4 sparse-file, because the default ls block size is 1024 bytes. If I run
ls --block-size=512 -s sparse-file
then I see 8 sparse-file, as I'd expect.
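If you want to cross-check from C, as in the original question, a minimal sketch (error handling kept short) is:

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat sb;

    if (argc < 2 || stat(argv[1], &sb) != 0) {
        perror("stat");
        return 1;
    }
    printf("st_size    = %lld bytes\n", (long long)sb.st_size);
    printf("st_blocks  = %lld (POSIX 512-byte units => %lld bytes allocated)\n",
           (long long)sb.st_blocks, (long long)sb.st_blocks * 512);
    printf("st_blksize = %lld (preferred I/O block size, not the unit of st_blocks)\n",
           (long long)sb.st_blksize);
    return 0;
}

Run on the sparse-file above, it should print st_size=1048576 and st_blocks=8, with st_blksize depending on the filesystem (4096 here).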
"Blocks" here are not real filesystem blocks. They're convenient chunks for display.
st_blocks probably uses 512-byte blocks. See the POSIX spec.
st_blksize is the preferred block size for this file, but not necessarily the actual block size.
BSD ls -s always uses 512 byte "blocks". OS X, for example, uses BSD ls by default.
$ /bin/ls -s index.html
560 index.html
GNU ls appears to use 1K blocks unless overridden with --block-size.
$ /opt/local/bin/gls -s index.html
280 index.html
printf("%lld / %d\n", buf.st_blocks, buf.st_blksize); produces 560 / 4096. The 560 "blocks" are in 512 byte chunks, but the real filesystem blocks are 4k.
The file contains 284938 bytes of data...
$ ls -l index.html
-rw-r--r-- 1 schwern staff 284938 Aug 11 2016 index.html
...but we can see it uses 280K on disk, or 70 4K filesystem blocks.
Note that OS X further confuses the issue by using 1000 bytes for a "kilobyte" instead of the correct 1024 bytes; that's why it says 287 KB for 70 4096-byte blocks (i.e. 286720 bytes) instead of 280 KB. This was done because hard drive manufacturers started using 1000-byte "kilobytes" in order to inflate their sizes, and Apple got tired of customers complaining about "lost" disk space.
The 4K block size can be seen by making a tiny file.
I have a project for school which involves making a C program that works like tar on a Unix system. I have some questions that I would like someone to explain to me:
The size of the archive. I understood (from browsing the internet) that an archive consists of a number of blocks of 512 bytes each. So the header takes 512 bytes, then it's followed by the content of the file (if there's only one file to archive) organized in blocks of 512 bytes, then 2 more blocks of 512 bytes.
For example: let's say that I have a txt file of 0 bytes to archive. This should mean 512*3 bytes in total. Why, when I do it with the tar utility on Unix and check the file's properties, is it 10,240 bytes? I think it adds some 0 (NUL) bytes, but I don't know where, why, or how many...
The header chksum. As far as I know, this should be the size of the archive. When I check it with hexdump -C, it appears to be a number near the real size of the archive (as shown in its properties), for example 11200 or 11205 or something similar if I archive a 0-byte txt file. Is this size in octal or decimal? My bet is that it is in octal, because all information you put in the header needs to be in octal. My second question at this point is: what is added on top of the original size of 10240 bytes?
Header mode. Let's say that I have a file with permissions 664 and the file type is 0, so I should put 0664 in the header. Why, on an authentic archive, are 3 more 0s printed at the start (000064)?
There have been various versions of the tar format, and not all of the extensions to previous formats were always compatible with each other, so there's always a bit of guessing involved. For example, in very old Unix systems, file names were not allowed to have more than 14 bytes, so the space for the file name (including path) was plenty; later, with longer file names, it had to be extended, but there wasn't space, so the file name got split into 2 parts; even later, GNU tar introduced the ././@LongLink pseudo-symbolic links that would make older tars at least restore the file to its original name.
1) tar was originally a Tape ARchiver. To achieve constant throughput to tapes and avoid starting/stopping the tape too much, several blocks needed to be written at once. 20 blocks of 512 bytes were the default, and the -b option is there to set the number of blocks. Very often, this size was pre-defined by the hardware, and using the wrong blocking factor made the resulting tape unusable. This is why tar appends \0-filled blocks until the archive size is a multiple of the blocking size; with the default of 20 × 512 = 10,240 bytes, that is exactly where your 10,240 bytes come from.
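A tiny C sketch of that arithmetic for a single archived file (the names are mine; the default blocking factor of 20 is assumed):

#include <stdio.h>

int main(void)
{
    unsigned long size     = 0;              /* your 0-byte example file */
    unsigned long record   = 512;            /* tar block size */
    unsigned long blocking = 20 * record;    /* default blocking size: 10240 */

    /* header block + content padded to 512 + two zero end-of-archive blocks */
    unsigned long content = (size + record - 1) / record * record;
    unsigned long archive = record + content + 2 * record;

    /* ...then the whole archive is padded up to a multiple of the blocking size */
    archive = (archive + blocking - 1) / blocking * blocking;

    printf("%lu\n", archive);                /* prints 10240 */
    return 0;
}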
2) The file size is in octal, and contains the true size of the original file that was put into the tar. It has nothing to do with the size of the tar file.
The checksum is calculated from the sum of the header bytes, but then stored in the header as well. So the act of storing the checksum would change the header and thus invalidate the checksum. That's why you store all other header fields first, set the checksum field to spaces, then calculate the checksum, and then replace the spaces with your calculated value.
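A small C sketch of exactly that procedure, assuming header points to one 512-byte header block (in the ustar layout, the chksum field is the 8 bytes at offset 148):

#include <stdio.h>
#include <string.h>

static void set_checksum(unsigned char header[512])
{
    unsigned int sum = 0;
    int i;

    memset(header + 148, ' ', 8);  /* the checksum field counts as 8 spaces */
    for (i = 0; i < 512; i++)
        sum += header[i];

    /* Conventionally stored as 6 octal digits, a NUL, then a space. */
    snprintf((char *)header + 148, 7, "%06o", sum);
    header[155] = ' ';
}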
Note that the header of a tarred file is pure ASCII. This way, in those old days, when a tar file (whose components were plain ASCII) got corrupted, an admin could just open the tar file with an editor and restore the components manually. That's why the designers of the tar format were afraid of \0 bytes and used spaces instead.
3) Tar files can store block devices, character devices, directories and the like. Unix stores these file types in the same place as the permission flags, and the header mode field contains the whole file mode, including the file type bits. That's why the number is longer than the pure permission bits.
There's a lot of information at http://en.wikipedia.org/wiki/Tar_%28computing%29 as well.
I have a file that contains some random bytes and multiple gzip streams. How can I find the start and end of a gzip stream inside this file? There are many random bytes between the gzip streams. So, basically, I need to find any gzip stream and extract it from there.
Reading from RFC 1952 (GZIP):
Each GZIP file is just a bunch of data chunks (called members), one for each file contained.
Each member starts with the following bytes:
0x1F (ID1)
0x8B (ID2)
compression method (CM): 0x08 for a DEFLATEd file; values 0-7 are reserved
flags (FLG): the top three bits are reserved and must be zero
last modification time (MTIME, 4 bytes): may be set to 0
extra flags (XFL): defined by the compression method
operating system (OS): actually the file system; 0 = FAT, 3 = Unix, 11 = NTFS
The end of a member is not delimited. You have to actually walk the entire member. Note that concatenating multiple valid GZIP files produces a valid GZIP file. Also note that reading past the end of a member may still let you decode the member successfully (unless the decompressing library fails eagerly and completely).
Search for a three-byte gzip signature, 0x1f 0x8b 0x08. When you find it, try to decode a gzip stream starting with the 0x1f. If you succeed, then that was a gzip stream, and it ended where it ended. Continue the search from after that gzip stream if it is one, or after the 0x08 if it isn't. Then you will find all of them and you will know their location and span.
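A rough C sketch of that search-and-verify loop using zlib (inflateInit2 with windowBits 15+16 selects gzip decoding); slurping the whole file into memory and throwing away the decompressed output are simplifications for brevity:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Try to inflate a gzip stream starting at data[off]; return the number of
   compressed bytes it spans, or 0 if it is not a valid gzip stream. */
static size_t try_gzip(const unsigned char *data, size_t len, size_t off)
{
    z_stream strm;
    unsigned char out[16384];
    size_t span = 0;
    int ret;

    memset(&strm, 0, sizeof strm);
    if (inflateInit2(&strm, 15 + 16) != Z_OK)   /* 15+16: expect a gzip wrapper */
        return 0;

    strm.next_in  = (unsigned char *)(data + off);
    strm.avail_in = (uInt)(len - off);
    do {
        strm.next_out  = out;                   /* decompressed data is discarded */
        strm.avail_out = sizeof out;
        ret = inflate(&strm, Z_NO_FLUSH);
    } while (ret == Z_OK);

    if (ret == Z_STREAM_END)                    /* ended cleanly: it was gzip */
        span = strm.total_in;
    inflateEnd(&strm);
    return span;
}

int main(int argc, char **argv)
{
    unsigned char *data;
    FILE *f;
    long len;
    size_t i;

    if (argc < 2 || (f = fopen(argv[1], "rb")) == NULL) {
        perror("fopen");
        return 1;
    }
    fseek(f, 0, SEEK_END);
    len = ftell(f);
    rewind(f);
    data = malloc((size_t)len);
    if (data == NULL || fread(data, 1, (size_t)len, f) != (size_t)len) {
        perror("read");
        return 1;
    }
    fclose(f);

    for (i = 0; i + 3 <= (size_t)len; ) {
        if (data[i] == 0x1f && data[i + 1] == 0x8b && data[i + 2] == 0x08) {
            size_t span = try_gzip(data, (size_t)len, i);
            if (span > 0) {
                printf("gzip stream at offset %zu, %zu compressed bytes\n", i, span);
                i += span;                      /* resume after this stream */
                continue;
            }
        }
        i++;                                    /* not gzip here: keep scanning */
    }
    free(data);
    return 0;
}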