C: Read Binary and Write as encoded text

I wish to read a jar in binary mode and write the binary as a string (encoded if need be) to a file.
This string would be included in code, to be read back, converted into binary, and written out as a jar file.
I'm trying to replicate what get-pip.py has done to distribute pip.
What would be the best way to do this in C?

You can use base64 encoding for this. Base64 is a standard, and you can use the encoder from one implementation, and the decoder from another, and you'd get back the original string on the other side. The typical straightforward implementation is a table driven one, as in this answer here: https://stackoverflow.com/a/6782480/1212725

Related

Use magic.mgc from another language

I'm currently working on a project which involves reading the file utility's magic files (without bindings). I'd like to know how it would be possible to read the file tests from the compiled binary magic.mgc directly, in another language (like Go), as I'm unsure of how its contents should be interpreted.
According to Christos Zoulas, main contributor of file:
If you want to use them directly you
need to understand the binary format (which changes over time) and load
it in your own data structures. [...] The code that parses the file is in apprentice.c. See check_buffer()
for the reader and apprentice_compile() for the writer. There is
a 4 byte magic number, followed by a 4 byte version number followed
by MAGIC_SETS (2) number of 4 byte counts one for each set (ascii,
binary) followed by an array of 'struct magic' entries, in native
byte format.
So that's the format one should expect! Nevertheless, it has to be parsed just like the raw files.

Write Midi Sequence to file

I have a console program, written in C, which generates short random musical compositions using the PortMidi library. Ultimately I would like to write these sequences out as either a MIDI or audio file.
I have found some mention of the reading and writing functions in the PortMidi library, Pm_Read() and Pm_Write(), but without examples I am struggling to understand and use them.
Is there any way I can export the entire sequence at once?
If not, is it necessary to repeatedly read individual MIDI notes into a buffer and save them? Or do I need to read the whole sequence into the buffer and then save it?
PortMidi doesn't have any way to do the whole thing in one go (as far as I can tell), so you have to manually buffer all the output MIDI messages into an array and then save them to a file. A good example of this can be found in the receive_sysex() function at http://audacity.googlecode.com/svn/audacity-src/trunk/lib-src/portmidi/pm_test/sysex.c.

a clear understanding of file, file encoding, file format

I lack a clear understanding of the concepts of file, file encoding and file format. Google helped up to a point.
From what I understand so far, all files are binary, i.e., each byte in such a file can hold any of the 256 possible values. ASCII files (and here's where we get to the encoding part) are a subset of binary files, where each byte uses only 7 bits.
And here's where things get mixed up. A file format seems to be a way to interpret the bytes in a file, and file extensions seem to be one of the most used ways of identifying a file format.
Does this mean there are formats defined for binary files and formats defined for ASCII files? Are formats like xml, pdf, doc, rtf, html, xls, sql, tex, java, cs "referring" to ASCII files? Whereas formats like jpg, mp3, avi, eps, obj, out, dll are a clue that we're talking about binary files?
I don't think you can talk about ASCII and BINARY files, but TEXT and BINARY files.
In that sense, these are text files: XML, HTML, RTF, SQL, TEXT, JAVA, CSS, EPS.
And these are binary files: PDF, DOC, XLS, JPG, MP3, AVI, OBJ, DLL.
ASCII is just a table of characters that was used in the beginning of computing to represent text, but it is nowadays somewhat discouraged since it can't represent text in languages such as Chinese, Arabic, Spanish (words with ñ, Ñ, or accents), French and others. Nowadays other CHARACTER REPRESENTATIONS are encouraged instead of ASCII. The most well known is probably UTF-8, but there are others like ISO-8859-1, ISO-8859-3 and such. Take a look at this article by Joel Spolsky talking about UNICODE. It's very enlightening.
File formats are a very different issue. File formats are protocols that programs agree on to represent information. In that sense, a JPG file is an image that has a certain (well known) internal format that allows programs (browsers, spreadsheets, word processors) to use it as an image.
Text files also have formats (i.e., there are specifications for text files like XML and HTML). Their format, as with JPG and other binary files, permits applications to use them in a coherent and specific way to achieve something: e.g., render a WEB PAGE (the HTML and XHTML file formats).
The actual way the file is stored on the hard drive is defined by the OS. The actual content of the file can be described as an array of bytes, each holding one of 256 possible values.
Text files either use a small character set such as the 128-character ASCII set, in which case you can read them easily, or a wider character set, in which case only suitable apps can read them.
The rest, what you might call binary (and any other format that is "unreadable" by "text" viewers), are formats designed to be read by certain other apps or by the OS.
If it's executable, the OS can read and execute it; others, like JPG, are designed to be "understood" by photo viewers, etc.
This is an old question but still very relevant. I was confused by this as well, and asked around for clarification. Here's the summary (hope it helps someone):
Format: File/record format is the way data is represented. You might use CSV, TSV, JSON, Apache log format, Thrift format, Protobuf format etc. to represent your data. The format is responsible for ensuring the data is structured and represented correctly. For example, when you read a JSON file you should get nested key-value pairs; that guarantee is always present.
{
    "story": {
        "title": "beauty and the beast"
    }
}
Encoding: Encoding basically transforms your data (in any format, or plain text) according to a specific scheme. What is this scheme? It is specific to the purpose of the encoding. For example, while transferring data over the wire (internet), we want to make sure the above example JSON reaches the other side uncorrupted. To ensure this, we would add some meta information, like a checksum, that can be used to verify the data's correctness. Other uses of encoding include shortening data, exchanging secrets, etc.
Base64 encoding of above JSON example:
ew0KICAgICAgICAic3RvcnkiOiB7DQogICAgICAgICAgICAidGl0bGUiOiAiYmVhdXR5IGFuZCB0aGUgYmVhc3QiDQogICAgICAgIH0NCn0=
I think it is worth noting that with media files, MPEG and the like are media codecs. They define how digital data can express visual and audio content. The encoded streams are generally housed in a media container such as an AVI file, which is really a RIFF file type intended for media.

how to read pcm samples from a file using fread and fwrite?

I want to read PCM samples from a file using fread and want to determine the signal strength of the samples. How do I go about it?
For reading, how many bytes constitute 1 pcm sample? Can I read more than 1 pcm sample at a time?
This is for WAV and AAC files.
You have to understand that WAV files (and even more so AAC files) are not all the same. I will only explain about WAV files; you'll hopefully understand how it is with AAC files then. As you pointed out, a WAV file has PCM-encoded data. However, that can be: 8-bit, 16-bit, 32-bit, ...; mono, stereo, 5.1, ...; 8 kHz, 16 kHz, 44.1 kHz, etc. Depending on these values you have to interpret the data (e.g. when reading it with the fread() function) differently. Therefore WAV files have a header. You have to read that header first, in the standard way (I do not know the details). Then you know how to read the actual data. Since it is not that easy, I suggest you use one of the libraries out there that read WAV files for you, e.g. http://www.mega-nerd.com/libsndfile/ . Of course you can also google or use SO to find others. Or you do it the hard way, find out how WAV file headers look, decode that data first, and then move on to the actual PCM-encoded data.
I have no experience with WAV files, but I once read data from an MP3 file. In an MP3 file, every 576 PCM samples are encoded into a frame. All the frames are stored directly in the file along with some side information. When processing the encoded data, I read binary data from the MP3 file into a buffer, decoded the buffered data, and extracted what was meaningful to me.
I think processing a WAV file (which, as I understand it, stores raw PCM samples) is not very different. You can read the binary data from the file directly and perform transformations according to the WAV encoding specification.
The file itself does not know what kind or format of data is in it. You can treat everything in a file as bytes (even plain text), read bytes from the file, and interpret the binary data yourself.

Writing more to a file than just plain text

I have always been able to read and write basic text files in C++, but so far no one has discussed much more than that.
My question is this:
If developing a file type by myself for use by an application I also create, how would I go about writing the data to a file and preserve the layout, formatting, etc.? Are there any standards, or does it just depend on the creativity of the programmer?
You basically have to come up with your own file format and write binary data.
You can also serialize your object model and write the output to a file, but that's usually less efficient.
Better to use an existing database, or use xml (or other) for simple needs. If you want to write a file in a format that already exists, find a library that supports it.
You have to know the binary file format for the file you are trying to create. Consider Joel's post on this topic: the 97-2003 File Format is a 349 page spec.
Nearly all the time you use an API for something like that, to avoid the grunt work. Be careful, however, because figuring out "what works" by trial and error can result in an upgrade of the program breaking your code. Plus you have to take into account other operating systems, minor version differences, patches, etc.
There are a number of standards of course. The likely one to use is some flavor of xml since there are libraries and tools that already exist to help you work with it, but nothing is stopping you from inventing your own.
Well you could store the data in a format you could read, but which maintained the integrity of your data (XML or JSON for instance).
Or (shudder) you could come up with your own proprietary binary format, and use that.
You would go at it exactly the same way as you would a text file: writing your data byte by byte, encoded in such a way that when you read the file back, you know what you are reading.
For a spreadsheet application you could even use a text format (OOXML, OpenDocument) to store presentation and content information.
Or you could define binary data structures and write those directly to the file.
The choice between a text or binary format depends on the application. For a configuration file you may prefer a text file that can be modified outside your app; for a database you will most likely choose a binary format for performance reasons.
See wotsit.org for information on file formats for various file types. Example: You can figure out exactly how to write out a .BMP file and how it is composed.
Writing to a database can be done by using a wrapper class in your language, mainly passing it SQL commands.
If you create a binary file, you can write any file to it. The only drawback is that you have to know exactly where it starts and where it ends.
Use XML (something open, descriptive, and validatable), and stick with text. There are standards for this sort of thing as well, including ODF.
You can open the file as binary instead of text (how one does this depends somewhat on the platform); from there you can write the data directly out to disk. The only real caveat to this is endianness, which can become an issue when moving the files from one architecture to another (x86 to PPC, for instance).
Writing binary data to disk is really no harder than writing text, and really, your creativity is key for how you store the data.
The general problem is usually referred to as serialization of your application state and in your case with a source/target of a file in whatever format makes sense for you. These days the preferred input/output format is XML, and you may want to look into the existing standards in this field. The problem then becomes how do I map from the state of my system to the particular schema. Boost has a serialization framework that you may want to check out.
/Allan
There are a variety of approaches you can take, but in general you'll want some sort of serialization library. Boost.Serialization and Google's Protocol Buffers are good examples. The basic idea is that you have memory structures (classes and objects) that represent your data, and you want to write that data to a file in a way that can be used to reconstruct those structures again.
If you're hesitant to use a library, you can do it all manually, but realize that you can end up writing a lot of redundant code, or developing your own library. See fopen, fread, fwrite and fclose for a starting point.
A typical binary file format for custom data is an "indexed file format" consisting of
-------
|index|
-------
|data |
-------
Where the index contains records "pointing" to the data.
The index consists of records containing an offset and a size. The offset tells you where in the file the data is stored and the size tells you the size of the data at that offset (i.e. the number of bytes to read).
typedef struct {
    size_t offset;
    size_t size;
} Index;

typedef struct {
    int ID;
    char First[20];
    char Last[20];
    char *RandomInfo;
} Data;

Suppose you want to store 50 records in the file: you would create 50 indices and 50 data structures. The 50 index structures would be written to the file first, followed by the 50 data structures.
To read the file you would read in the 50 index structures; then, from the data in the read-in index structures, you can tell where to "seek" to read the data records.
Look up fopen, fread, fwrite, fclose, and ftell for functions to read/write the data.
You usually use a third-party library for these things. For example, you would link in a database library for, say, Oracle that would allow you to talk to the database. Because the underlying file types (i.e., Excel spreadsheet vs. OpenOffice, Oracle vs. MySQL, etc.) differ, these libraries abstract away your need to care how the file is constructed.
Hope that helps you find what you're looking for!
1985 called, and said they have some help IFF you are willing to read up. The interchange file format is still in use today and provides some basic metadata around binary files, such as RIFF or WAV audio. (Unfortunately, TIFF is a false friend.) It allegedly even inspired PNG, so it can't be that bad.
