Can I use the zlib header as a delimiter? - zlib

I have multiple blocks of data compressed with zlib. I want to concatenate these blocks of data and store that in one file.
Obviously, I could use something like JSON or XML to separate the zlib data blocks, but I'm wondering if, to save space, I can just search for the next 78 01, 78 9C or 78 DA?
Basically my question is: can these byte combinations, in theory, occur inside a zlib data stream, or can I be sure that when I find one of them a new zlib data block starts, and the previous one ends at the found position minus one?
I know the uncompressed data blocks are always 1024 bytes or less in length, so the compressed stream will never be > 1024 bytes.

No, you can't. Any byte sequence can appear in the compressed data. At any byte position there is a probability of about 1/1024 of finding what looks like a valid zlib header (roughly 64 of the 65,536 possible two-byte values pass the header checks), so a long compressed stream will contain many valid-looking zlib headers that are not actually zlib headers.
You could create your own byte-stuffing scheme that wraps around arbitrary data, including zlib streams or anything else, and that assures certain sequences cannot occur unless they really are delimiters. Such schemes can incur an arbitrarily small expansion of the data. For example: whenever you find three 0xff's in a row in the data, insert a 0x00 byte. Then 0xff 0xff 0xff 0xff can serve as a delimiter, since it will never appear in the stuffed data. On random data this expands the stream, on average, by only about 0.000006%.
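As a sketch of that idea in C (the function names and the stream-oriented I/O are mine, not from any library), the stuffer only has to track runs of 0xff:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of the scheme above: after any run of three 0xff bytes in the
   payload, a 0x00 is inserted, so four 0xff in a row can only ever be
   a delimiter. */
static void stuff_block(FILE *out, const uint8_t *data, size_t len)
{
    int run = 0;                        /* consecutive 0xff bytes emitted */
    for (size_t i = 0; i < len; i++) {
        fputc(data[i], out);
        run = (data[i] == 0xff) ? run + 1 : 0;
        if (run == 3) {                 /* break the run with a stuffed zero */
            fputc(0x00, out);
            run = 0;
        }
    }
}

static void write_delimiter(FILE *out)
{
    for (int i = 0; i < 4; i++)         /* 0xff 0xff 0xff 0xff never occurs
                                           inside a stuffed block */
        fputc(0xff, out);
}

The reader reverses this: after seeing three 0xff bytes in a row it reads one more byte; 0x00 is a stuffed byte to discard, 0xff marks the end of the block.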

Related

Why most files like jpeg or pdf don't use just ASCII characters for encoding?

Whenever we open a JPEG or PDF file with a text editor we find strange symbols other than ASCII. Isn't ASCII the most efficient encoding, since it consumes less space thanks to the limited number of possible characters?
I was working with a database file in Linux with plocate and I found something similar.
Isn't ASCII the most efficient encoding, since it consumes less space thanks to the limited number of possible characters?
Not at all. Where did you get that idea from?
ASCII chars are 7 bits long, but hardware doesn't support storing 7-bit items, so ASCII is stored with 8 bits, the first bit always being 0. Furthermore, ASCII includes a number of control characters that can cause issues in some situations. Therefore, the most prominent encoding of binary data as ASCII (Base64) uses only 6 bits per character. This means that in order to encode 3 bytes (3 × 8 = 24 bits) of data you need 4 ASCII characters (4 × 6 = 24). Those 4 ASCII characters are then stored using 4 bytes on disk. Hence, converting a file to ASCII increases disk usage by 33%.
You can test this with the base64 command:
base64 pic.jpg > b64_jpeg.txt
ls -lh pic.jpg b64_jpeg.txt
Of course, you could use another ASCII encoding than standard Base64 and use all 7 bits available in ASCII. You would still get only 7 bits of data per byte on disk, and thus a +14% increase in disk usage for the same data.
All modern storage uses 8-bit bytes. ASCII is an obsolete 7-bit standard, so it would take 8/7 as much storage (+14%).
It has nothing to do with the number of bits as such; every file is binary in the end (each bit is true or false). What makes an image or PDF look different from ASCII text is that the bytes are compressed in groups for optimal efficiency. Those symbolic strings may once have been ASCII, but they have been compressed to about 10% of their original size.
Take a PDF of a graph, as follows:
ASCII = 394,132 bytes
ZIP   =  88,367 bytes
PDF   =  75,753 bytes
DocX  =  32,940 bytes (text and lines only; there are no images)
Take an image:
PNG          =   265,490 bytes
ZIP          =   265,028 bytes
PDF          =   220,152 bytes
PDF as ASCII = 3,250,970 bytes
3 0 obj
<</Length 3120001/Type/XObject/Subtype/Image/Width 640/Height 800/BitsPerComponent 8/SMask 4 0 R/ColorSpace/DeviceRGB/Filter/ASCIIHexDecode>>
stream
9cb6c79cb6c79cb6c79cb6c79db7c89db7c89db7c89fb7c9a0b8caa1b8caa1b8
caa1b8caa2b9cba2b9cba2b9cba2b9cba3bacba3bacaa4bbcba4bbcba6bccca7
...to infinity and beyond
So why is the ASCII image bigger than all the rest? Because those 9cb6c7 values can be tokenised as 4 × 9cb6c7, 3 × 9db7c8, etc. That's roughly how run-length encoding would work, though zip is better than that.
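A toy sketch in C of that tokenising idea (this is only an illustration of run-length encoding over 3-byte pixel values, not how any actual PDF filter works):

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Count repeats of consecutive 3-byte pixel values, so a run like
   9cb6c7 9cb6c7 9cb6c7 9cb6c7 prints as "4 x 9cb6c7". */
static void rle_pixels(const unsigned char *px, size_t count)
{
    size_t i = 0;
    while (i < count) {
        size_t run = 1;
        while (i + run < count &&
               memcmp(px + 3 * i, px + 3 * (i + run), 3) == 0)
            run++;                      /* extend the run of equal pixels */
        printf("%zu x %02x%02x%02x\n", run,
               px[3 * i], px[3 * i + 1], px[3 * i + 2]);
        i += run;
    }
}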
So PARTS of a PDF may be compressed (needing slower decompression to view) in a zip style of coding (used for lossless fonts and bitmaps), whilst others may keep their optimal native photographic lossy compression (like JPEG). Overall, for PDF parsing, a higher percentage needs to be 8-bit ANSI (Unicode-compatible, or variable per platform) or 7-bit ASCII for simple parsing.
Short answer: compression is the means to reduce transmission time or the amount of storage resources. However, decompression adds an overhead, so it is slower than raw ASCII to display as graphics. Avoid exotic wavelet codecs in a PDF, where most objects need fast decompression.

Reading a SEG-Y file with EBCDIC and binary data contents in Fortran

I am reading a SEG-Y file (used in geophysics to store data) which has 2 header sections: the first is 3200 bytes containing information in EBCDIC format, while the second header is in binary format and is 400 bytes long. The data follows, where the size of the data is determined by a number stored in the binary header at byte locations 3217-3218.
I managed to read the EBCDIC header (bytes 1-3200) using a simple open statement in Fortran 90 with no access or format definitions, but I can't go further to read the specific bytes in the binary header (3201-3204, 3205-3206, ... and so on) which contain important information needed to read the rest of the binary data afterwards.
How do I properly define the access/formatting for the file to successfully read everything at once? Does Fortran support changing the file access/format within a program? If this is not possible, how can I then skip the first 3200 bytes and move to the binary section (bytes 3201-3600) to read the data I need?
If you open the data-file with access="stream", you can read the file byte by byte from any position you want.
character :: byte ! integer(int8) might be a better type
open(11, file="filename",access="stream",form="unformatted",action="read",status="old")
!be careful, the positions are numbered from 1, not from 0,
!so the binary header starts at position 3201
read(11, pos=3201) byte
You can also read other data types if they are stored there in a compatible binary format. For a 2-byte field such as the one at bytes 3217-3218, use a matching integer kind:
integer(int16) :: i   ! int16 comes from the iso_fortran_env module
...
read(11, pos=...) i
On a little-endian machine you may have to convert their endianness.

Zlib minimum deflate size

I'm trying to figure out if there's a way to calculate a minimum required size for an output buffer, based on the size of the input buffer.
This question is similar to zlib, deflate: How much memory to allocate?, but not the same. I am asking about each chunk in isolation, rather than the entire stream.
So suppose we have two buffers: INPUT and OUTPUT, and we have a BUFFER_SIZE, which is - say - 4096 bytes. (Just a convenient number, no particular reason I chose this size.)
If I deflate using:
deflate(stream, Z_PARTIAL_FLUSH)
so that each chunk is compressed, and immediately flushed to the output buffer, is there a way I can guarantee I'll have enough storage in the output buffer without needing to reallocate?
Superficially, we'd assume that the DEFLATED data will always be smaller than the uncompressed input data (assuming we use a compression level that is greater than 0.)
Of course, that's not always the case - especially for small values. For example, if we deflate a single byte, the deflated data will obviously be larger than the uncompressed data, due to the overhead of things like headers and dictionaries in the LZW stream.
Thinking about how LZW works, it would seem that if our input data is at least 256 bytes (meaning that, worst case scenario, every single byte is different and we can't really compress anything), an input size LESS than 256 bytes plus the zlib headers could potentially require a LARGER output buffer.
But, generally, real-world applications aren't going to be compressing small sizes like that. So, assuming an input/output buffer of something more like 4K, is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers that will guarantee that the output data (the compressed stream) will be smaller than the input data? Or is there always some pathological case that can cause the output stream to be larger than the input stream, regardless of size?
Though I can't quite make heads or tails out of your question, I can comment on parts of the question in isolation.
is there some way to GUARANTEE that the output compressed data will be SMALLER than the input data?
Absolutely not. It will always be possible for the compressed output to be larger than some input. Otherwise you wouldn't be able to compress other input.
(Also, I know about deflateBound, but would rather avoid it because of the overhead.)
Almost no overhead. We're talking a fraction of a percent larger than the input buffer for reasonable sizes.
By the way, deflateBound() provides a bound on the size of the entire output stream as a function of the size of the entire input stream. It can't help you when you are in the middle of a bunch of deflate() calls with incomplete input and insufficient output space. For example, you may still have deflate output pending and delivered by the next deflate() call, without providing any new input at all. Then the expansion ratio is infinite for that isolated call.
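For scale, a minimal sketch of how deflateBound() gets used to size a whole-stream output buffer up front (the helper name is hypothetical, and strm must already have been set up with deflateInit()):

#include <stdlib.h>
#include <zlib.h>

/* Returns a buffer guaranteed big enough for a single-shot deflate of
   src_len input bytes; *bound ends up only a few bytes above src_len. */
static unsigned char *alloc_deflate_buffer(z_streamp strm, uLong src_len,
                                           uLong *bound)
{
    *bound = deflateBound(strm, src_len);
    return malloc(*bound);
}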
due to the overhead of things like headers and dictionaries in the LZW stream.
deflate is not LZW. The approach it uses is called LZ77. It is very different from LZW, which is now obsolete. There are no "dictionaries" stored in compressed deflate data. The "dictionary" is simply the uncompressed data that precedes the data currently being compressed or decompressed.
Or, to put it another way, is there some minimum buffer size that I can use for input/output buffers ...
The whole idea behind the zlib interface is for you to not have to worry about what will fit in the buffers. You just keep calling deflate() or inflate() with more input data and more output space until you're done, and all will be well. It does not matter if you need to make more than one call to consume one buffer of input, or more than one call to fill one buffer of output. You just have loops to make more calls, provide more input when needed, and disposition the output when needed and provide fresh output space.
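In code, that is the canonical double loop; this sketch is modeled on zlib's own zpipe.c example, with most error handling trimmed for brevity:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 4096

/* Compress source to dest. The loops keep calling deflate() until all
   input is consumed and all output is delivered; no buffer ever needs
   to be "big enough" up front. */
static int def(FILE *source, FILE *dest, int level)
{
    z_stream strm;
    unsigned char in[CHUNK], out[CHUNK];
    int flush;

    memset(&strm, 0, sizeof(strm));       /* use the default allocators */
    if (deflateInit(&strm, level) != Z_OK)
        return Z_ERRNO;

    do {
        strm.avail_in = (uInt)fread(in, 1, CHUNK, source);
        flush = feof(source) ? Z_FINISH : Z_NO_FLUSH;
        strm.next_in = in;
        do {                              /* run deflate() until the output
                                             buffer stops being filled */
            strm.avail_out = CHUNK;
            strm.next_out = out;
            deflate(&strm, flush);
            fwrite(out, 1, CHUNK - strm.avail_out, dest);
        } while (strm.avail_out == 0);
    } while (flush != Z_FINISH);          /* Z_FINISH: stream completed */

    deflateEnd(&strm);
    return Z_OK;
}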
Information theory dictates that there must always be pathological cases which "compress" to something larger.
This page starts off with the worst-case encoding sizes for zlib - it looks like the worst-case growth is 6 bytes, plus 5 bytes for every 16 KB block started. So if you always flush after less than 16 KB, buffers of 11 bytes plus your flush interval should be safe.
Unless you have tight control over the type of data you're compressing, finding pathological cases isn't hard. Any random number generator will find you some pretty quickly.
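For instance, this quick sketch feeds zlib's one-shot compress2() random bytes (the sizes are arbitrary); the output virtually always comes out larger than the input:

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

int main(void)
{
    unsigned char in[4096], out[8192];    /* out sized well above the input */
    uLongf out_len = sizeof(out);
    size_t i;

    srand(1);
    for (i = 0; i < sizeof(in); i++)      /* incompressible random input */
        in[i] = (unsigned char)(rand() & 0xff);

    if (compress2(out, &out_len, in, sizeof(in), 9) == Z_OK)
        printf("in: %zu bytes, out: %lu bytes\n",
               sizeof(in), (unsigned long)out_len);
    return 0;
}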

IDAT Chunk of PNG File Format

I am currently developing a proprietary file format based on the png file format. I am done so far, except it doesn't work :-p The deflate decompressor I implemented works like a charm but the png decoder doesn't want to perform nicely so I took a look at the original png file.
The standard says that after an IDAT header, the compressed data follows immediately. So, as the data is a deflate stream, the first byte after IDAT is 0x78 == 01111000, which means a mode one block (uncompressed), and it's not the final one.
Strange, though - it's hard for me to imagine that a PNG encoder doesn't use dynamic Huffman coding for compressing the filtered raw image data. The deflate standard says that the rest of the current byte is skipped in mode one.
So the next four bytes indicate the size of the uncompressed block and its ones' complement.
But 0x59FD is not the ones' complement of 0xECDA. Even if I screwed up the byte ordering: 0xFD59 is not the ones' complement of 0xDAEC either.
Well, the knockout byte just follows. 0x97 is considered to be the first byte of the uncompressed but still filtered raw PNG image data, and as such must be the filter type. But 0x97 == 10010111 is not a valid filter type. Even if I screwed up the bit packing order, 11101001 == 0xE9 is still no valid filter type.
I didn't focus on RFC 1951 much anymore, as I am able to inflate all kinds of files so far using my implementation of the deflate decompressor, so I suspect some misunderstanding on my part concerning the PNG standard.
I have read RFC 2083 over and over again, but the data I see here doesn't match the RFC; it doesn't make sense to me. There must be a missing piece!
When I look at the following bytes, I actually cannot see a valid filter type byte anywhere near, which makes me think that the filtered PNG data stream is nevertheless compressed after all.
It would make sense if 0x78 (the first byte after IDAT) were read from MSB to LSB, but RFC 1951 says otherwise. Another idea (more likely to me) is that there is some data between the IDAT string and the start of the compressed deflate stream, but RFC 2083 says otherwise. The layout is clear:
4 bytes       size
4 bytes       chunk name (IDAT)
[size] bytes  compressed deflate stream
4 bytes       CRC checksum
So the first byte after IDAT must be the first byte of the compressed deflate stream - which indicates a mode 1 uncompressed data block. Which means that 0x97 must be the first byte of the uncompressed but filtered PNG image data - which means 0x97 is the filter type for the first row - which is invalid...
I just don't get it, am I stupid or what??
Summary:
Possibility 1:
There is some other data between IDAT and the effective start of the compressed deflate stream which, if this turns out to be true, is mentioned neither in RFC 2083 nor in any book I have read about image compression.
Possibility 2:
The number 0x78 is interpreted MSB -> LSB, which would indicate a mode 3 block (dynamic Huffman coding), but this contradicts RFC 1951, which is very clear about bit packing (LSB -> MSB).
I know already, the missing piece must be something very stupid, and I will feel the urgent need to sell my soul if only there were a delete button on Stack Overflow :-p
Two corrections which may help get you on your way:
The zlib header consists of 2 bytes, not 1 -- see RFC 1950. The first is CMF, the next FLG.
In your data:
78 DA
---CMF--- ---FLG---
0111.1000 1101.1010
CINF -CM-  |||
           ||+------ FCHECK
           |+------- FDICT
           +-------- FLEVEL
CINF is 7, indicating the standard 32 KB compression window.
CM is 8, indicating the compression algorithm is, indeed, DEFLATE.
FCHECK is just a checksum; I didn't check if it's correct (but I'd bet it is).
FDICT is clear, meaning there is no preset dictionary stored.
FLEVEL is 3, indicating Maximum Compression.
See also Trying to understand zlib/deflate in PNG files, esp. Dr. Adler's answer.
LEN and NLEN are only set for uncompressed blocks; that's why you didn't find them. (Also, partially, because you were looking at the wrong byte(s).)
The next byte in the stream is EC; bitwise, this is 1110 1100 but remember to read bits from low to high. So the next bit read is 0, meaning not FINAL, and the next 2 bits read are 10 (in that order!), indicating a regular dynamic Huffman encoded data block.
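A small sketch that mechanically re-derives all of the above from the three bytes in question:

#include <stdio.h>

int main(void)
{
    unsigned char cmf = 0x78, flg = 0xDA, first = 0xEC;

    printf("CM     = %u\n", cmf & 0x0F);          /* 8 = deflate */
    printf("CINF   = %u\n", cmf >> 4);            /* 7 = 32 KB window */
    printf("FCHECK ok: %s\n",
           (cmf * 256 + flg) % 31 == 0 ? "yes" : "no");
    printf("FDICT  = %u\n", (flg >> 5) & 1);      /* 0 = no preset dict */
    printf("FLEVEL = %u\n", flg >> 6);            /* 3 = maximum */

    /* deflate block header bits are read from the low bit upward */
    printf("BFINAL = %u\n", first & 1);           /* 0 = not the last block */
    printf("BTYPE  = %u\n", (first >> 1) & 3);    /* 2 = dynamic Huffman */
    return 0;
}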

Encrypting a Large File of irregular size using AES algorithm

Currently in my project, I am required to encrypt a large file of variable size (around 1 to 1.5 GB).
I am using the AES algorithm from the OpenSSL project. But I am not using the entire library, just a few functions which generate keys from "passwords" and use those keys to encrypt a fixed block of 128 bytes.
In short,
void aes_encrypt(char *in, char *out, AES_KEY ekey);
void aes_decrypt(char *in, char *out, AES_KEY dkey);
The main problem now is that these functions work with a block size of 128 bytes only.
So I must write a wrapper function which takes my file, divides it into chunks of 128 bytes, and feeds them to these encryption/decryption routines.
So my questions are:
1. In my wrapper, how do I handle the case where the file size is not an integer multiple of 128?
2. Do I need to pad the encrypted file with 0's to make it a multiple of 128? If so, how do I recognize the amount of zero padding I have done? As I understand it, just removing trailing 0's from a file may cost the file its integrity, especially if the file happens to contain a 0 at the end.
3. Is it a better approach to prepend a header to the encrypted file, containing the size information of the file and possibly its checksum?
Thanks.
PS: I am new to encryption (especially AES)
The block length is 128 bits, or 16 bytes. You can, for example, use PKCS#7 padding (see section 10.3 of RFC 2315) to make the last block 16 bytes long.
It works like this: if one byte needs to be added, you add a byte with value (all values shown in hex) 01; if two bytes need to be added, you add two bytes with value 02; and so on. In the case that no padding is required, you still have to add a block of 16 padding bytes with value 10.
To remove the padding bytes, just look at the last byte of the file; it gives the number of bytes to remove.
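A sketch of both directions in C (the function names are mine; in production you would normally let the crypto library handle padding, and validate it in constant time):

#include <string.h>
#include <stddef.h>

#define BLOCK 16   /* AES block size in bytes */

/* Pad len bytes in buf (which must have room for up to BLOCK extra
   bytes) and return the padded length. */
size_t pkcs7_pad(unsigned char *buf, size_t len)
{
    size_t pad = BLOCK - (len % BLOCK);   /* always 1..16, never 0 */
    memset(buf + len, (int)pad, pad);
    return len + pad;
}

/* Return the unpadded length, or 0 if the padding is malformed. */
size_t pkcs7_unpad(const unsigned char *buf, size_t len)
{
    unsigned char pad;
    size_t i;

    if (len == 0 || len % BLOCK != 0)
        return 0;
    pad = buf[len - 1];
    if (pad == 0 || pad > BLOCK)
        return 0;
    for (i = len - pad; i < len; i++)     /* every pad byte must match */
        if (buf[i] != pad)
            return 0;
    return len - pad;
}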
Also note that ECB mode (encrypting blocks independently of each other) is probably not the best to use, have a look at CBC mode as well.
