Find gzip start and end in a file?

I have a file that contains some random bytes and multiple gzip streams. How can I find the start and end of each gzip stream inside the file? There are many random bytes between the gzip streams, so basically I need to find every gzip stream and extract it.

Reading from RFC 1952 (GZIP):
Each GZIP file is just a bunch of data chunks (called members), one for each file contained.
Each member starts with the following bytes:
0x1F (ID1)
0x8B (ID2)
compression method. 0x08 for a DEFLATEd file. 0-7 are reserved values.
flags. The top three bits are reserved and must be zero.
(4 bytes) last modified time. May be set to 0.
extra flags, defined by the compression method.
operating system, actually the file system. 0=FAT, 3=UNIX, 11=NTFS
The end of a member is not delimited. You have to actually walk the entire member. Note that concatenating multiple valid GZIP files creates a valid GZIP file. Also note that overshooting a member may still result in a successful reading of that member (unless the decompressing library fails eagerly and completely).

Search for a three-byte gzip signature, 0x1f 0x8b 0x08. When you find it, try to decode a gzip stream starting with the 0x1f. If you succeed, then that was a gzip stream, and it ended where it ended. Continue the search from after that gzip stream if it is one, or after the 0x08 if it isn't. Then you will find all of them and you will know their location and span.
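For illustration, here is a minimal C sketch of that search using zlib, assuming the whole file has already been read into a memory buffer; the buffer and function names are made up for the example. inflateInit2() with 16+MAX_WBITS tells zlib to expect a gzip wrapper, and when inflate() returns Z_STREAM_END the amount of input consumed gives the member's exact span.

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Try to inflate a gzip member starting at p; return the number of input
   bytes it occupied, or 0 if it is not a valid, complete gzip stream. */
static size_t gzip_member_length(const unsigned char *p, size_t avail)
{
    z_stream s;
    unsigned char sink[4096];
    int ret;

    memset(&s, 0, sizeof(s));
    if (inflateInit2(&s, 16 + MAX_WBITS) != Z_OK)   /* 16+ selects the gzip wrapper */
        return 0;
    s.next_in  = (unsigned char *)p;
    s.avail_in = avail;
    do {
        s.next_out  = sink;                 /* decompressed data is discarded */
        s.avail_out = sizeof(sink);
        ret = inflate(&s, Z_NO_FLUSH);
    } while (ret == Z_OK);

    size_t used = (ret == Z_STREAM_END) ? avail - s.avail_in : 0;
    inflateEnd(&s);
    return used;
}

/* Report the offset and span of every gzip member found in buf. */
static void scan_for_gzip(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i + 3 <= len) {
        if (buf[i] == 0x1f && buf[i + 1] == 0x8b && buf[i + 2] == 0x08) {
            size_t n = gzip_member_length(buf + i, len - i);
            if (n > 0) {
                printf("gzip member at offset %zu, %zu bytes\n", i, n);
                i += n;                     /* continue after the member */
            } else {
                i += 3;                     /* false positive: skip past the 0x08 */
            }
        } else {
            i++;
        }
    }
}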

Related

Compressed AS2 Body

I am struggling with decompression of a zlib-compressed MIME body of an AS2 request coming from BizTalk Server.
The thing is:
The HTTP body I receive looks as expected. I can read the ASCII-encoded MIME header:
"Content-type: application/pkcs7-mime; smime-type=compressed-data; name=smime.p7m\r\nContent-Transfer-Encoding: binary\r\n\r\n"
It ends with two line breaks, so I expect the compressed body to follow.
But when I use Ionic.Zlib's ZlibStream.UncompressBuffer() to decompress the following bytes, it throws an error.
A zlib header can be identified, for example, by the hex-coded bytes "78da". When I start decompressing from there on, it works fine.
What are the bytes between the two line breaks that end the MIME header and the "78da" that starts the zlib-compressed data?
"3080060b2a864886f70d0109100109a0803080020100300f060b2a864886f70d01091003080500308006092a864886f70d010701a080248004820769"
Next problem: if I read all bytes to the end, the last bytes cannot be decompressed.
As far as I understood, the zlib data should end with an Adler-32 checksum, but how can I identify the end or length of the compressed data without trying to decompress it?
I see some trailing bytes after the successfully decompressed data:
"1f9b1f1fcbc51f0482000445a59371"
What is that?
Thanks for your ideas!
You cannot find the end of the compressed data without decompressing. You don't need to save the result of the decompression, but you at least need to decode all of the compressed data in order to find where it self terminates.
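For illustration, here is a minimal C/zlib sketch of that idea (the same approach works with any streaming inflater): inflate until Z_STREAM_END, at which point total_in tells you how many input bytes, including the 2-byte header and the 4-byte Adler-32 trailer, belonged to the zlib stream; everything after that offset is trailing data. The function name and calling convention are assumptions for the example.

#include <string.h>
#include <zlib.h>

/* Return the length of the zlib stream starting at data (header,
   compressed data and Adler-32 trailer), or 0 if it is invalid. */
size_t zlib_stream_length(const unsigned char *data, size_t len)
{
    z_stream s;
    unsigned char sink[4096];
    int ret;

    memset(&s, 0, sizeof(s));
    if (inflateInit(&s) != Z_OK)            /* plain zlib wrapper, e.g. 78 da */
        return 0;
    s.next_in  = (unsigned char *)data;
    s.avail_in = len;
    do {
        s.next_out  = sink;                 /* decompressed output is discarded */
        s.avail_out = sizeof(sink);
        ret = inflate(&s, Z_NO_FLUSH);
    } while (ret == Z_OK);

    size_t used = (ret == Z_STREAM_END) ? s.total_in : 0;
    inflateEnd(&s);
    return used;                            /* bytes beyond this are not part of the stream */
}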

Remove trailing "0d 0a" bytes from a file, using PowerShell

I am trying to encrypt and decrypt files using PowerShell. In this case, I am working with .docx files. After encrypting the file, I passed it to the decrypt function, and after decrypting, the file is corrupted when I try to open it.
However, after using a hex editor to compare both original and decrypted .docx files, the only difference is that the decrypted .docx file has 2 trailing bytes of "0d 0a".
I think this is the result of PowerShell's "Set-Content" command.
(The Out-File command produces a far worse result.)
However, I am not able to just replace all the carriage return and line feed bytes as I would like to preserve the line feeds and carriage returns of the word document.
Is there a way I am able to remove only the trailing bytes of "0d 0a" of the decrypted and already-written .docx file?
Without seeing your example it's impossible to tell where the extra CRLF is coming from. I would recommend examining your code to determine where it is coming from, and then using an alternate route like the System.IO.File class. If you are just looking for a quick solution, you could read in the file, strip the last two bytes (the trailing CR LF), and then write the byte array back to the same file, overwriting it. This is a band-aid but should work.
#read in all bytes of the file
$bytes = [System.IO.File]::ReadAllBytes("somefile.docx")
#write out all bytes except the trailing CR LF (2 bytes)
#0 based so the last byte is at position Length-1; keep indexes 0..Length-3
[System.IO.File]::WriteAllBytes("somefile.docx", $bytes[0..($bytes.Length - 3)])

Header and structure of a tar format

I have a school project that involves writing a C program that works like tar on a Unix system. I have some questions that I would like someone to explain to me:
The size of the archive. I understood (from browsing the internet) that an archive consists of a defined number of blocks of 512 bytes each. So the header takes 512 bytes, followed by the content of the file (if there is only one file to archive) organized in blocks of 512 bytes, and then 2 more blocks of 512 bytes.
For example: let's say I have a txt file of 0 bytes to archive. That should mean 512*3 bytes are used. Why, when I create the archive with tar on Unix and check its properties, does it show 10,240 bytes? I think it adds some 0 (NUL) bytes, but I don't know where, why, or how many...
The header checksum. As far as I know, this should be the size of the archive. When I check it with hexdump -C, it appears as a number near the real size (from the properties) of the archive, for example 11200 or 11205 or something similar if I archive a 0-byte txt file. Is this size in octal or decimal? My bet is that it is octal, because all information you put in the header needs to be in octal. My second question at this point is: what is added beyond the original size of 10240 bytes?
Header mode. Let's say I have a file with mode 664; the file-type digit will be 0, so I should put 0664 in the header. Why does an authentic archive print 3 more 0s at the start (0000664)?
There have been various versions of the tar format, and not all of the extensions to previous formats were compatible with each other. So there's always a bit of guessing involved. For example, in very old Unix systems, file names were not allowed to have more than 14 bytes, so the space for the file name (including path) was plenty; later, with longer file names, it had to be extended, but there wasn't space, so the file name got split into 2 parts; even later, GNU tar introduced the ././@LongLink pseudo-symbolic links that would make older tars at least restore the file to its original name.
1) Tar was originally a Tape ARchiver. To achieve constant throughput to tapes and avoid starting/stopping the tape too much, several blocks needed to be written at once. 20 blocks of 512 bytes were the default, and the -b option is there to set the number of blocks. Very often, this size was pre-defined by the hardware, and using the wrong blocking factor made the resulting tape unusable. This is why tar appends \0-filled blocks until the archive size is a multiple of the blocking size; 20 * 512 = 10240 bytes, which is where your 10,240 comes from.
2) The file size is in octal, and contains the true size of the original file that was put into the tar. It has nothing to do with the size of the tar file.
The checksum is calculated from the sum of the header bytes, but it is then stored in the header as well. So the act of storing the checksum would change the header and thus invalidate the checksum. That's why you store all the other header fields first, set the checksum field to spaces, then calculate the checksum, and then replace the spaces with the calculated value.
Note that the header of a tarred file is pure ASCII. This way, in those old days, when a tar file (whose components were plain ASCII) got corrupted, an admin could just open the tar file with an editor and restore the components manually. That's why the designers of the tar format were afraid of \0 bytes and used spaces instead.
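A minimal C sketch of that checksum procedure, assuming a raw 512-byte header block laid out the classic way (8-byte checksum field at offset 148); it is an illustration, not a complete tar writer.

#include <stdio.h>
#include <string.h>

/* Fill in the checksum field of a 512-byte tar header block. */
void set_tar_checksum(unsigned char header[512])
{
    unsigned int sum = 0;

    /* The checksum is computed as if its own field were filled with spaces. */
    memset(header + 148, ' ', 8);

    /* Sum every byte of the 512-byte header. */
    for (int i = 0; i < 512; i++)
        sum += header[i];

    /* Store it as 6 octal digits, a NUL, then a space. */
    snprintf((char *)header + 148, 8, "%06o", sum);
    header[155] = ' ';
}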
3) Tar files can store block devices, character devices, directories and other such things. Unix stores these file types in the same place as the permission flags, and the header's file mode field contains the whole mode, including the file type bits. That's why the number is longer than the permissions alone.
There's a lot of information at http://en.wikipedia.org/wiki/Tar_%28computing%29 as well.

Read/Write files in C

I'm writing a program in C that basically creates an archive file for a given list of file names. This is pretty similar to the ar command in Linux. This is what the archive file would look like:
!<arch>
file1.txt/ 1350248044 45503 13036 100660 28 `
hello
this is sample file 1
file2.txt/ 1350512270 45503 13036 100660 72 `
hello
this is sample file 2
this file is a little larger than file1.txt
But I'm having difficulties trying to extract a file from the archive. Let's say the user wants to extract file1.txt. The idea is that it should get the index/location of the file name (in this case file1.txt), skip 58 characters to reach the content of the file, read the content, and write it to a new file. So here are my questions:
1) How can I get the index/location of the file name in the archive file? Note that duplicate file names are NOT allowed, so I don't have to worry about having two different indices.
2) How can I skip several characters (in this case 58) when reading a file?
3) How can I figure out when the content of a file ends? i.e. I need it to read the content and stop right before the file2.txt/ header.
My approach to solving this problem would be:
To have header information that contains the size of each file, its name, and its location in the archive.
Then parse the header and use fseek() and ftell(), as well as fgetc() or fread(), to get the bytes of the file, and then create the output file and write that data to it. This is the simplest way I can think of.
http://en.wikipedia.org/wiki/Ar_(Unix)#File_header <- Header of ar archives.
EXAMPLE:
@programmer93 Suppose your header is 80 bytes long (the header contains the metadata of the archive file). You have two files, one of 112 bytes and the other of 182 bytes. Now they're laid out in a flat file (the archive file), so it would be 80 (header), 112 (file1.txt), 182 (file2.txt), EOF. Thus, if you know the size of each file, you can easily navigate (using fseek()) to a particular file and extract only that file. To extract file2.txt I would just fseek(fp, 80 + 112, SEEK_SET); and then fgetc() 182 times. I think I made myself clear?
If the format of the file cannot be changed by adding additional header information to help, you'll have to search through it and work things out as you go.
This should not be too hard. Just read the file, and when you read a header line such as
file1.txt/ 1350248044 45503 13036 100660 28 `
you can check the filename and size etc. (You know you'll have a header line at the start after the !<arch>). If this is the file you want, the ftell() function from stdio.h will tell you exactly where you are in the file. Since the file size in bytes is given in the header line, you can read the file by reading that particular number of bytes ahead in the normal manner. Similarly, if it is not the file you want, you can use fseek() to move forward the number of bytes in the file you are skipping and be ready to read in the header info for the next file and repeat the process.
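Putting that together, here is a minimal C sketch of the extraction loop under the simplified layout shown in the question (a "!<arch>" line, then one header line per file ending in a backtick, then exactly 'size' bytes of content). The field positions and the function name are assumptions for this example, not the real ar format.

#include <stdio.h>
#include <string.h>

/* Extract the member named 'wanted' from 'archive' into a file of the
   same name. Returns 0 on success, -1 if the member is not found. */
int extract_member(const char *archive, const char *wanted)
{
    FILE *in = fopen(archive, "rb");
    char line[256];

    if (!in)
        return -1;
    fgets(line, sizeof(line), in);                  /* skip the "!<arch>" line */

    while (fgets(line, sizeof(line), in)) {         /* one header line per member */
        char name[64];
        long size;

        /* header line: "file1.txt/ 1350248044 45503 13036 100660 28 `" */
        if (sscanf(line, "%63s %*s %*s %*s %*s %ld", name, &size) != 2)
            break;
        char *slash = strchr(name, '/');
        if (slash)
            *slash = '\0';                          /* drop the trailing '/' */

        if (strcmp(name, wanted) == 0) {
            FILE *out = fopen(wanted, "wb");
            if (!out)
                break;
            for (long i = 0; i < size; i++) {       /* copy exactly 'size' bytes */
                int c = fgetc(in);
                if (c == EOF)
                    break;
                fputc(c, out);
            }
            fclose(out);
            fclose(in);
            return 0;
        }
        fseek(in, size, SEEK_CUR);                  /* skip this member's content */
    }
    fclose(in);
    return -1;
}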

Hexadecimal virus signatures database

Over the past couple of weeks I have been developing a simple virus scanner. It works great, but my question is: does anybody know where I can get a database (a single file) that contains 8000 or more virus signatures WITH their names, and possibly a risk rating (high, low, unknown)?
Try the ClamAV database. This also includes some more complex signatures, but some are just byte sequences.
The CVD file format is a compressed tar file with a header block attached; see here for header information, or this PDF for the real details.
As I understand it, you should be able to decompress it with
dd if=file.cvd bs=512 skip=1 | tar zxvf -
This will unpack into a collection of various files; the simple hex signatures will be found in files with the extension .db. Not all of these signatures are pure hex: many of them contain wildcards such as ?? for "allow any byte here", * for "allow any number of intervening bytes here", {-4096} for "allow up to 4k of intervening bytes here", and so forth.
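For the signatures that are pure hex, matching is just a byte-sequence search. Below is a minimal C sketch with illustrative names; it deliberately ignores the wildcard forms.

#include <stdio.h>
#include <string.h>

/* Convert a hex string such as "deadbeef" into raw bytes.
   Returns the number of bytes written, or 0 on malformed input. */
size_t hex_to_bytes(const char *hex, unsigned char *out, size_t max)
{
    size_t n = 0;
    while (hex[0] && hex[1] && n < max) {
        unsigned int byte;
        if (sscanf(hex, "%2x", &byte) != 1)
            return 0;
        out[n++] = (unsigned char)byte;
        hex += 2;
    }
    return n;
}

/* Return the offset of the first occurrence of sig in buf, or -1 if absent. */
long find_signature(const unsigned char *buf, size_t len,
                    const unsigned char *sig, size_t siglen)
{
    if (siglen == 0 || siglen > len)
        return -1;
    for (size_t i = 0; i + siglen <= len; i++)
        if (memcmp(buf + i, sig, siglen) == 0)
            return (long)i;
    return -1;
}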
