Why don't most files like JPEG or PDF use just ASCII characters for encoding? [migrated]

Whenever we open a JPEG or PDF file with a text editor, we see strange symbols that aren't ASCII. Isn't ASCII the most efficient encoding, since its limited number of possible characters means less space consumption?
I was working with a database file on Linux with plocate and found something similar.

Isn't ASCII the most efficient encoding, since its limited number of possible characters means less space consumption?
Not at all. Where did you get that idea from?
ASCII characters are 7 bits long, but hardware doesn't support storing 7-bit items, so ASCII is stored with 8 bits, the first bit being always 0. Furthermore, ASCII includes a number of control characters that can cause issues in some situations. Therefore, the most prominent ASCII encoding (base64) uses only 6 bits. This means that in order to encode 3 bytes (3 × 8 = 24 bits) of data you need 4 ASCII characters (4 × 6 = 24). Those 4 ASCII characters are then stored using 4 bytes on disk. Hence, converting a file to ASCII increases disk usage by 33%.
You can test this with the base64 command:
base64 pic.jpg > b64_jpeg.txt
ls -lh pic.jpg b64_jpeg.txt
Of course, you could try to use another ASCII encoding than the standard base64 and use all 7 bits available in ASCII. You would still get only 7 bits of data per byte on disk, and thus a +14% increase in disk usage for the same data.
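To make the 3-bytes-in, 4-characters-out arithmetic concrete, here is a minimal sketch of the core base64 step in C. The alphabet and the 6-bit slicing are the standard ones; the function name is just for illustration:

#include <stdio.h>

/* The standard base64 alphabet: 64 symbols, so each carries 6 bits. */
static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode exactly 3 input bytes (3 x 8 = 24 bits) into 4 characters. */
static void encode_triple(const unsigned char in[3], char out[4])
{
    out[0] = b64[in[0] >> 2];                           /* bits 1..6   */
    out[1] = b64[((in[0] & 0x03) << 4) | (in[1] >> 4)]; /* bits 7..12  */
    out[2] = b64[((in[1] & 0x0F) << 2) | (in[2] >> 6)]; /* bits 13..18 */
    out[3] = b64[in[2] & 0x3F];                         /* bits 19..24 */
}

int main(void)
{
    const unsigned char data[3] = { 'H', 'i', '!' };
    char out[5] = { 0 };
    encode_triple(data, out);
    printf("%s\n", out); /* prints "SGkh": 3 bytes became 4 characters */
    return 0;
}

Every 3 input bytes become 4 output bytes, which is exactly the 33% growth the ls comparison above will show.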

All modern storage uses 8-bit bytes. ASCII is an obsolete 7-bit standard, so it would take 8/7 as much storage (+14%).

It has nothing to do with the number of bits as such; all files are binary in the same way, each bit being either true or false. What makes an image or PDF look different from ASCII text is that the bytes are compressed in groups for optimal efficiency. Those symbolic strings may well encode ASCII, but compressed to about 10% of its size.
Take a PDF of a graph, as follows:
ASCII = 394,132 bytes
ZIP   =  88,367 bytes
PDF   =  75,753 bytes
DocX  =  32,940 bytes (it's text and lines; there are no images)
Take an image:
PNG          =   265,490 bytes
ZIP          =   265,028 bytes
PDF          =   220,152 bytes
PDF as ASCII = 3,250,970 bytes
3 0 obj
<</Length 3120001/Type/XObject/Subtype/Image/Width 640/Height 800/BitsPerComponent 8/SMask 4 0 R/ColorSpace/DeviceRGB/Filter/ASCIIHexDecode>>
stream
9cb6c79cb6c79cb6c79cb6c79db7c89db7c89db7c89fb7c9a0b8caa1b8caa1b8
caa1b8caa2b9cba2b9cba2b9cba2b9cba3bacba3bacaa4bbcba4bbcba6bccca7
...to infinity and beyond
So why is the ASCII image bigger than all the rest? Because those repeated values can be tokenised as 4 × 9cb6c7, 3 × 9db7c8, etc. That's roughly how run-length encoding would work, but ZIP does better than that.
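To illustrate the tokenising idea, here is a minimal toy sketch of run-length encoding over 3-byte pixels in C (a real codec, such as PDF's RunLengthDecode filter, works on single bytes and also handles non-repeating literals):

#include <stdio.h>
#include <string.h>

/* Toy run-length encoder over 3-byte RGB pixels: prints "count x pixel". */
static void rle_pixels(const unsigned char *buf, size_t npixels)
{
    size_t i = 0;
    while (i < npixels) {
        size_t run = 1;
        while (i + run < npixels &&
               memcmp(buf + 3 * i, buf + 3 * (i + run), 3) == 0)
            run++;
        printf("%zu x %02x%02x%02x\n",
               run, buf[3 * i], buf[3 * i + 1], buf[3 * i + 2]);
        i += run;
    }
}

int main(void)
{
    /* The sample from the PDF stream above: 4 x 9cb6c7, then 3 x 9db7c8. */
    const unsigned char pixels[] = {
        0x9c, 0xb6, 0xc7, 0x9c, 0xb6, 0xc7, 0x9c, 0xb6, 0xc7, 0x9c, 0xb6, 0xc7,
        0x9d, 0xb7, 0xc8, 0x9d, 0xb7, 0xc8, 0x9d, 0xb7, 0xc8,
    };
    rle_pixels(pixels, sizeof pixels / 3);
    return 0;
}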
So PARTS of a PDF may be compressed in a ZIP style of coding (used for lossless fonts and bitmaps), needing slower decompression to view, whilst other parts may keep their optimal native lossy photographic compression (like JPEG). Overall, for PDF parsing, a higher percentage needs to be 8-bit ANSI (compatible with Unicode, or varying per platform) or 7-bit ASCII for simplistic parsing.
Short answer: compression is the means to reduce transmission time or the amount of storage. However, decompression adds an overhead, so it is slower than raw ASCII to display as graphics. Avoid exotic wavelets in a PDF, where most objects need fast decompression.

Related

About the WAV data sub-chunk

I am working on a project in which I have to merge two 8-bit .wav files using C, and I still have no clue how to do it.
I have read about WAV files and I want to start by reading one of the files.
There's one thing I didn't understand:
Let's say I have an 8-bit WAV audio file, and I was able to read (even though I am still trying to) the data that starts after the 44th byte. I will logically get numbers between 0 and 255.
My question is:
What do those numbers mean?
If I get 255 or 0 what do they mean?
Are they samples from the wave?
Can anyone please explain?
Thanks in advance
Assuming we're not dealing with file format issues, getting values between 0 and 255 means that the audio samples are of unsigned eight-bit format, as you have put it.
One way of merging the data would consist of reading the data from the files into buffers, arrays a and b, and summing them value by value: c[i] = a[i] + b[i]. By doing so, you'd have to take care of the following:
the lengths of the files may not be equal
summing unsigned 8-bit buffers such as yours will almost certainly overflow
This is usually achieved using a for loop. You first get the sizes of the chunks. Your for loop has to be written in such a way that it neither reads past the array boundary nor ignores what can be read. For preventing overflows (see the sketch after this list) you can either:
divide values by two on reading
or
read (convert) into a format which wouldn't overflow, then normalize and convert the merged data back into the original format or whichever format desired.
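Putting the pieces together, here is a minimal sketch, assuming the raw unsigned 8-bit samples have already been read into arrays (the names are just for illustration). It follows a simple variant of the second option: convert to int so the sum cannot overflow, mix around the 128 midpoint, clamp instead of normalizing, and treat the shorter buffer as silence:

#include <stdint.h>
#include <stddef.h>

/* Mix two buffers of unsigned 8-bit PCM samples (the raw data after the
 * header). out must have room for the larger of the two buffers. */
static void mix_u8(const uint8_t *a, size_t na,
                   const uint8_t *b, size_t nb,
                   uint8_t *out)
{
    size_t n = na > nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        int sa = i < na ? a[i] : 128;   /* 128 is silence for unsigned 8-bit */
        int sb = i < nb ? b[i] : 128;
        int mixed = sa + sb - 128;      /* sum, re-centred on the midpoint   */
        if (mixed < 0)   mixed = 0;     /* clamp instead of overflowing      */
        if (mixed > 255) mixed = 255;
        out[i] = (uint8_t)mixed;
    }
}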
For all the particulars of reading from and writing to a .wav file you may use one of the existing audio file libraries, or write your own routine. Dealing with an audio file format is not a trivial thing, though. Here's a reference on the .wav format.
Here are a few audio file APIs worth looking at:
libsndfile
sndlib
Hope this can help.
See any good guide to WAVE for information on the format of samples in the data chunk, such as this one I found: http://www.neurophys.wisc.edu/auditory/riff-format.txt
Relevant excerpts:
In a single-channel WAVE file, samples are stored consecutively. For
stereo WAVE files, channel 0 represents the left channel, and channel
1 represents the right channel. The speaker position mapping for more
than two channels is currently undefined. In multiple-channel WAVE
files, samples are interleaved.
Data Format of the Samples
Each sample is contained in an integer i. The size of i is the
smallest number of bytes required to contain the specified sample
size. The least significant byte is stored first. The bits that
represent the sample amplitude are stored in the most significant bits
of i, and the remaining bits are set to zero.
For example, if the sample size (recorded in nBitsPerSample) is 12
bits, then each sample is stored in a two-byte integer. The least
significant four bits of the first (least significant) byte is set to
zero.
The data format and maximum and minimum values for PCM waveform
samples of various sizes are as follows:

Sample Size        Data Format       Maximum Value                Minimum Value
One to eight bits  Unsigned integer  255 (0xFF)                   0
Nine or more bits  Signed integer i  Largest positive value of i  Most negative value of i
N.B.: Even if the file has >8 bits of audio resolution, you should read the file as an array of unsigned char and reconstitute the larger samples manually as per the above spec. Don't try to do anything like reading the samples directly over an array of native C ints, as their layout and size are platform-dependent and therefore should not be relied upon in any code.
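As an example, here is a minimal sketch, assuming the bytes of the data chunk are already in a buffer, of reconstituting 16-bit samples portably (least significant byte first, per the excerpt above; the function name is just for illustration):

#include <stdint.h>
#include <stddef.h>

/* Rebuild signed 16-bit samples from the little-endian byte pairs of the
 * data chunk, regardless of the host's native byte order or int size. */
static void bytes_to_s16(const unsigned char *raw, size_t nbytes,
                         int16_t *samples)
{
    for (size_t i = 0; i + 1 < nbytes; i += 2) {
        uint16_t lo = raw[i];          /* least significant byte comes first */
        uint16_t hi = raw[i + 1];
        samples[i / 2] = (int16_t)((hi << 8) | lo);
    }
}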
Note also that the header is not guaranteed to be 44 bytes long: How can I detect whether a WAV file has a 44 or 46-byte header? You need to read the length and process the header based on that, not on any assumption.

Calculate string size in UTF-8 when converted from Latin-9 (ISO/IEC 8859-15)

We have a JDBC program which moves data from one database to another.
The source database uses the Latin-9 character set.
The destination database uses UTF-8 encoding, and the size of a column is specified in bytes instead of characters.
We have converted the DDL scripts of the source database to equivalent scripts for the destination database, keeping the sizes of the columns as-is.
In some cases, if there are special characters, the size of the data after conversion to UTF-8 exceeds the size of the column in the destination database, causing the JDBC program to fail.
I understand that UTF-8 is a variable-width encoding scheme which can take 1-4 bytes per character; given this, the worst-case solution would be to allocate 4 times the size of the column in the destination database.
Is there a better estimate?
Since there's no telling in advance exactly how much a text string will grow, I think that all you can do is a trial run to convert the text to UTF-8, and generate a warning that certain columns need to be increased in size. Any ASCII (unaccented) characters will remain single bytes, and most Latin-9 accented characters will probably be 2 bytes each, but there are some that might be 3. You'd have to look at the Latin-9 and UTF-8 tables to see if any will be 3 or 4 bytes after conversion. Still, you'd have to examine your Latin-9 text to see how much it will grow.
The Euro symbol in Latin-9 takes 3 bytes to represent in UTF-8. The ASCII characters take only 1 byte. The remaining 127 characters take 2 bytes. Depending on the actual locale (and which characters are commonly used), an estimate between 1.5x and 2x should be sufficient.
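Since Latin-9 is a single-byte encoding, you can in fact compute the exact UTF-8 size of each value rather than estimating: bytes below 0x80 are ASCII (1 byte in UTF-8), 0xA4 is the Euro sign, U+20AC (3 bytes), and every other byte maps to a code point below U+0800 (2 bytes). A minimal sketch in C (the function name is just for illustration):

#include <stddef.h>

/* Exact UTF-8 byte length of a Latin-9 (ISO 8859-15) encoded buffer.
 * Bytes < 0x80 are ASCII (1 byte); 0xA4 is the Euro sign, U+20AC
 * (3 bytes); every other Latin-9 byte maps below U+0800 (2 bytes). */
static size_t utf8_len_from_latin9(const unsigned char *s, size_t n)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] < 0x80)       len += 1;
        else if (s[i] == 0xA4) len += 3;
        else                   len += 2;
    }
    return len;
}

Running something like this over the existing rows gives the exact column widths required, instead of the 4x worst case.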

How to simply generate a random base64 string compatible with all base64 encodings

In C, I was asked to write a function to generate a random Base64 string of 40 characters (30 bytes?).
But I don't know the Base64 flavor, so it needs to be compatible with many versions of Base64.
What can I do? What is the best option?
All the Base64 encodings agree on some things, such as the use of [0-9A-Za-z], which are 62 characters. So you won't get the full 64^40 possible combinations, but you can get 62^40, which is still quite a lot! You could just generate a random number for each digit, mod 62. Or slice it up more carefully to reduce the amount of entropy needed from the system. For example, given a 32-bit random number, take 6 bits at a time (0..63); if those bits are 62 or 63, discard them, otherwise map them to one Base64 digit. This way you only need about eight 32-bit integers to make a 40-character string.
If this system has security considerations, you need to consider the consequences of generating "unusual" Base64 numbers (e.g. an attacker could detect that your Base64 numbers are special in having only 62 symbols with just a small corpus--does that matter?).
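Here is a minimal sketch of that rejection-sampling approach in C. It uses rand() for brevity, which is NOT a cryptographically secure source; substitute one if the security considerations above apply:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The 62 characters every common Base64 variant agrees on. */
static const char common62[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

/* Fill out[] with len random characters from the common alphabet,
 * taking 6 bits at a time and discarding the values 62 and 63. */
static void random_b64(char *out, size_t len)
{
    size_t i = 0;
    while (i < len) {
        int v = rand() & 0x3F;      /* 6 random bits: 0..63 */
        if (v < 62)                 /* reject 62 and 63     */
            out[i++] = common62[v];
    }
    out[len] = '\0';
}

int main(void)
{
    char s[41];
    srand((unsigned)time(NULL));    /* NOT cryptographically secure */
    random_b64(s, 40);
    printf("%s\n", s);
    return 0;
}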

Save as binary file in C but it doesn't display zeros and ones

As I understand it, when saving a file in C using "wb" mode, shouldn't I see binary numbers (zeros and ones) in the saved file?
When I save in wb mode, the output in the file is:
Feras Wilson — n FFFF îè` c P xHF F
û¥2012
But this is not binary zeros and ones. How do I save a file so that it contains zeros and ones, and then read it back in C?
It is saved as 0s and 1s, but your text editor reads them as bytes (it groups them into 8-bit units) and displays them using ASCII. [1]
When you write to a text file, a lot of effort goes into converting the binary data you wish to write into a human-readable format.
For example, if you write the number 255, it has to be brought into the form '2', '5', '5' (which are characters!), and then each character is written.
When writing to a binary file, the program just puts the actual binary data in the file. What exactly lands there depends on the kind of variable (how many octets it is represented on), on endianness, and on other things. If it is an unsigned char, the file gets the single byte 0b11111111 (the actual raw number, not characters!).
[1] http://www.asciitable.com/
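To see the difference concretely, here is a small sketch that writes the number 255 both ways (file names are just for illustration, and error handling is omitted for brevity). The text file ends up holding the three characters '2' '5' '5', while the binary file holds the single byte 0xFF:

#include <stdio.h>

int main(void)
{
    unsigned char value = 255;

    /* Text mode: converts the number into the characters '2','5','5'. */
    FILE *t = fopen("value.txt", "w");
    fprintf(t, "%d", value);             /* writes 3 bytes: 0x32 0x35 0x35 */
    fclose(t);

    /* Binary mode: writes the raw byte as-is. */
    FILE *b = fopen("value.bin", "wb");
    fwrite(&value, sizeof value, 1, b);  /* writes 1 byte: 0xFF */
    fclose(b);

    return 0;
}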
This is only the textual representation of the file by your editor or command. Internally, all files are stored as 0s and 1s on the HDD/SSD/RAM/... Try opening the file with a hex editor like bless (easy to use on Linux; Mono is required for Windows, or search for another hex editor you prefer) to see how the bytes are stored. I suggest bless because it offers different representations in different formats.
In your code, you can use the read functions to load the content byte-wise and interpret it. Just keep a possible endianness fix in mind if you read more than one byte at a time: little- and big-endian systems store and read multi-byte values in "reversed" order, so a word written as 0x1337 could be read back as 0x3713. Get familiar with this term and use Wikipedia to understand how to handle it, if necessary.
All files are stored in binary! It's just a question of how a subsequent program views/interprets this binary. Depending on how you use the file, it'll get read as a sequence of bytes representing characters, or a sequence of bytes representing instructions, or words representing Unicode, etc.
If you want to see your file in different formats, use od:
NAME
od - dump files in octal and other formats
which will dump your file in hex, characters, octal etc.; for example, od -A x -t x1z file.bin prints the bytes in hex alongside their printable characters. (The one thing it won't do is show you binary, but you can derive that from the octal/hex output easily enough.)

What are the uses of having binary files?

I am learning file I/O in C and was a little confused by binary files. My question is: what is the use of binary files, when we can always use files in ASCII or some other format which is easily understandable? Also, in what applications are binary files more useful?
Any help on this is really appreciated. Thanks!
All files are binary in nature. ASCII files are the subset of binary files that contain what can be considered 'human-readable' data. A pure binary file is not constrained to that subset of characters that is readable.
Speed of access
Obfuscation
The ability to write native objects to file without creating big serialised files.
ASCII is easily understandable by humans, but for many other purposes, it's more efficient and easier for the computer to store things in a binary format. For example, if you want to keep a sequence of integers, it's easier for the computer to read/write the 4 bytes it takes to represent an int than it is to write out the ASCII representation of the number and then parse it back while reading.
It is critically important that any byte value can be stored; for example, programs are binary. Any possible binary code may be a program instruction for the CPU.
ASCII only stores 7-bit values, so there are half the possible values wasted.
Further, what would an integer be stored as?
The number 4294967295 can be stored in 4 bytes (32 bits), but if it were stored in ASCII, as a number, it would require 10 characters. Further, it would require processing to convert it into the 32-bit number. Neither of those things is good.
The 32-bit number is a fixed size, so it is easy to get to the 234,856th value in the file: just seek to position 4 × 234856.
If 32-bit numbers are stored as ASCII, either they must always take 10 bytes, making the file 2.5 times bigger, or they are stored with variable size, making it virtually impossible to seek to a particular value without reading the whole file.
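A minimal sketch of that seek in C, assuming a file of raw 32-bit values written with fwrite (the path and function name are hypothetical):

#include <stdint.h>
#include <stdio.h>

/* Read the n-th 32-bit value from a file of fixed-size binary records.
 * Because every record is exactly 4 bytes, the offset is just n * 4;
 * no other part of the file has to be read or parsed. */
static int read_nth_u32(const char *path, long n, uint32_t *out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    if (fseek(f, n * 4L, SEEK_SET) != 0 ||
        fread(out, sizeof *out, 1, f) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;   /* note: the byte order is whatever the writer used */
}

For example, read_nth_u32("values.bin", 234856, &v) jumps straight to byte offset 939,424 without touching the rest of the file.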
Edit:
It is worth adding that (in normal use) a human cannot see the data held in a file. The only way to examine the contents of a file is by running programs which can read and use the data. So the convenience of a human is a small consideration.
In general, data is stored in the form most convenient for the programs that use it, and that form is designed to fit the program's purpose. ASCII is a format designed for text editors to create human-readable documents and to support simple ways of displaying text, limited to English letters, numbers and some punctuation. When we want to support all human written languages, ASCII is far too limited.
I believe we have over one million characters to represent human written languages (and some other pictures), and we have not yet got characters for all human languages.
UTF-8 is a way to represent the written characters we have so far, as multiple bytes. UTF-8 uses 8-bit code units, including values beyond the 7-bit range of ASCII.
Think of a binary file as a true representation of data, to be interpreted directly by a computer program and not read by humans. It would be a lot of overhead for a program to write out its data, whether textual or numeric, in an ASCII format. Most likely, the programmer would have to invent a protocol for writing arrays, structs, and scalars out into a file in ASCII form, so they could be human-readable and also be read back in by the program and converted back to binary form.
A database table is a good example. Whether there are text or numeric fields in the table, the database manager reads and writes that data in binary format. It is easier to write it out, read it in, and then convert it as needed to display whatever you want to read.
Perception gave a great answer I had never considered before: all data is binary, and ASCII is a subset. That answer made me think of FTP and setting the transfer mode to ascii or binary. If I'm shuttling Windows binaries stored on a Linux system, I tell FTP to transfer them as binary, which means: don't interpret them as ASCII files and add a CR at the end of each line. There are even times I'll transfer .csv and .txt data as binary, because I know Windows Excel knows how to interpret those non-DOS files.
I would not want to write a program that had to encode/decode images, or audio files, or GIS data, or spacecraft telemetry, or <fill in the blank> as ASCII.
