Why do application folders contain so many files?

I have a general question about finished applications. When I go into the files of a Windows application, some files make sense as to why they are there, such as the executable, various media files, .dll files, etc. What I don't understand is how there can be potentially thousands of different files, located in hundreds of different directories (counting the hierarchy), with anywhere between dozens and hundreds of different file types. Some of the extensions don't even look like real file types; they can be completely obscure. How does the application know how to work with them? Are all of those files hand-written and compiled, or are many of them generated automatically when building a desktop application (varying by application, of course)? I've never actually compiled an application in any language, as I've been studying JavaScript as a starting point, and I recognize that JavaScript is not intended for creating standalone applications; it's meant to be embedded in HTML. This is why I have so many questions about the generation of the application itself.
To provide an example, a few of the file extensions I see in the Audacity application folder which I don't recognize are: .lsp, .raw, .mo, .ny, .exp
Even that is a very short list compared to the number of file types/extensions I usually encounter and have no knowledge of. So, all in all, my main question is why there's such a crazy number of files, folders, and file types/extensions being used by an application. Hopefully someone can help me understand.
Extra question, for those who might care to answer it:
What does it mean when you open a file in an application like Notepad++ (or a .plist editor) and it's just a bunch of unreadable characters? I'm assuming that means it's a compiled file, but I could use some clarification. This happens when I try to open an .exe, a .dll, etc. I understand why I can't edit things like that in a text editor, of course, but why all the strange symbols and characters? Why wouldn't it just throw an error upon trying to open it? Are all the strange characters just the editor's attempt to interpret already-compiled code?
Bear with me, I'm pretty new to programming and I'm trying to get a better understanding of the process behind actually generating a GUI-based desktop application. As I said before, my current knowledge doesn't extend to the point of actually compiling an application.
Thank you for any help, I really appreciate it.

Focusing on your extra question: you have to learn what a binary file and a text file are, but in short: imagine you have a simple calculator program that stores its result in a file. Let's say the result you want to store is the number 64. You have two options: saving it as text (the characters 6 and 4) or as binary data.
If you store it as text, you need two bytes: one for the code of the character '6' and another for the character '4'. You can open that file with Notepad and you'll see the two characters: '64'.
If you store it as a binary value, you only need one byte, but if you open it with Notepad, you'll see the character whose ASCII code is 64: '@'.
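A small C sketch of both options (the file names are just examples):

#include <stdio.h>

int main(void) {
    int result = 64;

    /* As text: two bytes, the characters '6' and '4'. */
    FILE *t = fopen("result.txt", "w");
    if (t) { fprintf(t, "%d", result); fclose(t); }

    /* As binary: a single byte with the value 64. A text editor shows
       that byte as the character with code 64, which is '@' in ASCII. */
    FILE *b = fopen("result.bin", "wb");
    if (b) {
        unsigned char byte = (unsigned char)result;
        fwrite(&byte, 1, 1, b);
        fclose(b);
    }
    return 0;
}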

Most of such "strange" files are resources needed by parts of the application. A complex application is constructed in a very modular way, and each component may need to load different additional resources, often depending on conditions decided at runtime.
For example, if a Qt-based application decides on startup that it should use the German translation, it may load trans/de_DE.qm from a directory that also contains the other language files. Or a game may load level after level from different files depending on how far you've come.
Your second question is quite simple. Most resource files are read by an application function as a stream of bytes. If, for example, such a stream contains the four bytes 0x00 0x00 0x00 0x5a, you'll see strange symbols in notepad.exe, since that editor interprets each byte as a character code and prints whatever symbol sits at that position in the character table. But the application actually reads them in as 4 x 8 bits = one 32-bit value, which in my simple example might be the value of a 32-bit integer variable. So the variable is set to 0x5a, which is decimal 90.
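For illustration, a hedged C sketch that reads four such bytes and combines them into one 32-bit value, assuming big-endian byte order as in the example (the file name is hypothetical):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    FILE *f = fopen("resource.bin", "rb");  /* hypothetical resource file */
    if (!f) return 1;

    unsigned char buf[4];
    if (fread(buf, 1, 4, f) == 4) {
        /* 00 00 00 5a read big-endian becomes 0x5a = decimal 90. */
        uint32_t value = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16)
                       | ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
        printf("%u\n", value);
    }
    fclose(f);
    return 0;
}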

Related

How to add (and use) binary data to compiled executable?

There are several questions dealing with some aspects of this problem, but none seems to answer it wholly. The whole problem can be summarized as follows:
You have an already compiled executable (obviously expecting the use of this technique).
You want to add arbitrarily sized binary data to it (not necessarily by itself, which would be another nasty problem to deal with).
You want the already compiled executable to be able to access this added binary data.
My particular use case would be an interpreter, where I would like to let the user produce a single-file executable out of the interpreter binary and the code he supplies (the interpreter binary being the executable which would have to be patched with the user-supplied code as binary data).
A similar case is self-extracting archives, where a program (the archiving utility, such as zip) is capable of constructing such an executable, which contains a pre-built decompressor (the already compiled executable) and user-supplied data (the contents of the archive). Obviously no compiler or linker is involved in this process (thanks, Mathias, for the note and for pointing out 7-zip).
Existing questions suggest a particular path toward a solution, along the lines of the following examples:
appending data to an exe - This deals with the aspect of adding arbitrary data to arbitrary exes, without covering how to actually access it (a simple append usually works, which is also true of Unix's ELF format).
Finding current executable's path without /proc/self/exe - In combination with the above, this would allow getting a file name to use for opening the exe, to access the added data. There are many more questions of this kind, but none focuses specifically on the problem of getting a path suitable for actually opening the binary as a file (a goal which alone might (?) be easier to accomplish; indeed, you don't even need the path, just the binary opened for reading).
There may also be other, probably more elegant, ways around this problem than appending to the binary and opening the file to read it back in. For example, could the executable be laid out so that it becomes rather trivial to patch it later with the arbitrarily sized data, so the data appears "within" it in some proper data segment? (I couldn't really find anything on this; for fixed-size data it should be trivial, though, unless the executable has some hash.)
Can this be done reasonably well with as little deviation from standard C as possible? Even more or less cross-platform? (At least from a maintenance standpoint.) Note that it would be preferred if the program performing the addition of the binary data didn't rely on compiler tools to do it (which the user might not have), but solutions necessitating those might also be useful.
Note the already-compiled-executable criterion (the first point in the list above), which requires a completely different approach than the solutions described in questions like C/C++ with GCC: Statically add resource files to executable/library or SDL embed image inside program executable, which ask about embedding data at compile time.
Additional notes:
The problems with the obvious approach outlined above and suggested in some comments (just append to the binary and use that) are as follows:
Opening the currently running program's binary doesn't seem trivial (opening the executable for reading is, but finding the path to supply to the file-open call is not, at least not in a reasonably cross-platform manner).
The method of acquiring the path may provide an attack surface which probably wouldn't exist otherwise. This means that a potential attacker could trick the program into seeing different binary data (provided by the attacker) as if it were the executable's own, exposing any vulnerability which might reside in the parser of the data.
It depends on how you want other systems to see your binary.
Digital signing in Windows
The exe format allows for verifying that the file has not been modified since publishing. This would allow you to:
Compile your file
Add your data packet
Sign your file and publish it.
The advantage of following this scheme is that "everybody" agrees your file has not been modified since signing.
The easiest way to achieve this scheme is to use a resource. Windows resources can be added post-linking. They are protected by the Authenticode digital signature, and your program can extract the resource data from itself.
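A minimal sketch of the extraction side, assuming a raw RCDATA resource with ID 1 was attached to the binary (the ID and resource type here are just examples):

#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Locate a raw data resource inside our own module. */
    HMODULE self = GetModuleHandle(NULL);
    HRSRC res = FindResource(self, MAKEINTRESOURCE(1), RT_RCDATA);
    if (!res) return 1;

    HGLOBAL blk = LoadResource(self, res);
    DWORD size = SizeofResource(self, res);
    const void *data = LockResource(blk);
    if (!data) return 1;

    printf("resource payload: %lu bytes\n", (unsigned long)size);
    return 0;
}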
It used to be possible to extend the signature to include extra binary data. Unfortunately this has been banned: there were binaries which used data in the signature section, and this was exploited maliciously. Some details are in this MSDN blog.
Breaking the signature
If re-signing is not an option, then the result would be treated as insecure. It is worth noting here that appended data is insecure and can be modified without people being able to tell, but so is the code in your binary.
Appending data to a binary breaks the digital signature, and also means the end user can't tell if the code has been modified.
This means that any self-protection you add to your code to ensure the data blob is still secure, would not prevent your code from being modified to remove the check.
Running module
Windows' GetModuleFileName allows the path of the running module to be found.
Linux offers /proc/self/exe (or /proc/<pid>/exe).
Other Unixes do not seem to have a reliable method.
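A Linux-only sketch of the /proc approach (on Windows you would call GetModuleFileName instead):

#include <stdio.h>
#include <unistd.h>
#include <limits.h>

int main(void) {
    char path[PATH_MAX];
    /* /proc/self/exe is a symlink to the binary of the running process. */
    ssize_t len = readlink("/proc/self/exe", path, sizeof(path) - 1);
    if (len < 0) return 1;  /* not available on every Unix */
    path[len] = '\0';
    printf("%s\n", path);
    return 0;
}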
Data reading
The approach of the zip format is to have a directory written at the end of the file. This means the metadata can be found at a known place relative to the end of the file, and the start of the data can then be located by working backwards. The advantage here is that the data blob is signposted from the end of the file, rather than from its natural start.
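To illustrate the idea, a hedged sketch: assume whatever tool appended the payload also wrote a hypothetical 12-byte trailer at the very end (an 8-byte magic string followed by a 4-byte little-endian payload length). The program can then locate its own payload like this:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");  /* path to the binary itself */
    if (!f) return 1;

    /* Read the hypothetical trailer: 8-byte magic + 4-byte size. */
    unsigned char trailer[12];
    if (fseek(f, -12L, SEEK_END) != 0 || fread(trailer, 1, 12, f) != 12) {
        fclose(f);
        return 1;
    }

    if (memcmp(trailer, "MYPAYLD", 8) == 0) {  /* 7 chars + NUL = 8 bytes */
        uint32_t size = (uint32_t)trailer[8]
                      | ((uint32_t)trailer[9]  << 8)
                      | ((uint32_t)trailer[10] << 16)
                      | ((uint32_t)trailer[11] << 24);
        /* The payload sits immediately before the trailer. */
        fseek(f, -(long)(12 + size), SEEK_END);
        printf("payload: %u bytes\n", size);
        /* ... fread the payload here ... */
    }
    fclose(f);
    return 0;
}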

C - Storing a large group of files as a single resource

Please forgive me if there is a glaringly obvious answer to this question; I haven't found it because I'm not entirely sure what I'm looking for. It may well be that this duplicates a question I haven't found; sorry.
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly (not that I anticipate it being distributed; I'm looking to package my own work for convenience).
In my opinion it would be more convenient if the file library were stored in a single file that remained accessible to the application, for example alongside /usr/bin/APPLICATION or in whatever the most appropriate location is, accessed by the executable when required.
I searched for similar questions and found suggestions pointing to two possible options: resource files, which appear to be native to Windows, and including files at compile time. The first question leads to an answer similar to the second and doesn't answer the question of whether resource files exist for Linux executables; like the second, it looks at including the data file in the compilation process. This is not so useful, because if I only want to update my resources I'm forced to recompile the entire application (the media is added dynamically).
QUESTION: Is there a way to store a variety of file types in one single file accessible to an executable on Linux, and if so, how would you implement this?
My initial thought was to create a .zip or .gz file, which might also offer compression as an added bonus, but I have no idea how (or whether it is even possible) to access data within such a file on the fly. I'm equally uncertain whether there is a specific file type or library that offers a more suitable solution. Also, I know virtually nothing about .dat files; could these be used in this context on a Linux system?
I do not understand why you would use a single file at all. Considering the added complexity (and the increased chance of bugs creeping in) of file extraction and the associated overheads, I do not see how it would be "more convenient".
I have a C executable that uses text, audio, video, icons and a variety of different file types.
So do many other Linux applications. The normal approach, when using package management, is to put the architecture-independent data (icons, audio, video, and so on) for the application /usr/bin/YOURAPP in /usr/share/YOURAPP/, and architecture-dependent data (like helper binaries) in /usr/lib/YOURAPP/. It is extremely common for the latter two to be full directory trees, sometimes quite deep and wide.
For locally compiled stuff, it is common to put these in /usr/local/bin/YOURAPP, /usr/local/share/YOURAPP/, and /usr/local/lib/YOURAPP/ instead, just to avoid confusing the package manager. (If you check ./configure scripts or read Makefiles, this is the chief purpose of the PREFIX variable they support.)
It is also common for the /usr/bin/YOURAPP to be a simple shell script, setting environment variables, or checking for user-specific overrides (from $HOME/.YOURAPP/), ending up with exec /usr/lib/YOURAPP/YOURAPP.bin [parameters...], which replaces the shell with the actual binary executable without leaving the shell in memory.
As an example, /usr/share/octave/ on my machine contains a total of 138 directories (in a hierarchy of up to 7 directories deep) and 1463 files; about ten megabytes of "stuff" all told. LibreOffice, Eagle, Fritzing, and KiCAD take hundreds of megabytes there each, so Octave is not an extreme example in any way either.
You have several alternatives (TODO: add more ;)):
You can read some archive file format specifications, write code to read/write those archives, and waste your time doing so.
You can invent a dirty, simple file format, for example ("dsa" stands for "Dirty and Simple Archiver"):
#include <stdint.h>

// Located at the beginning of the file
struct DSAHeader {
    char magic[3];              // Shall be (char[]) { 'D', 'S', 'A' }
    unsigned char endianness;   // The rest of the file is translated according to this field. 0 means little-endian, 1 means big-endian.
    unsigned char checksum[16]; // MD5 sum of the whole file (when calculating checksums, this field is pseudo-filled with zeros).
    uint32_t fileCount;
    uint32_t stringTableOffset; // Offset of a table containing the files' names.
};

// A dsaHeader.fileCount-sized array of DSANodeHeader follows the DSAHeader.
struct DSANodeHeader {
    unsigned char type;         // 0 means directory, 1 means regular file.
    uint32_t parentOffset;      // Pointer to the parent directory, or zero if the node is in the root.
    uint32_t offset;            // The node's type-dependent header starts here.
    uint32_t nodeSize;          // In bytes for files, and in number of entries for directories.
    uint32_t dataOffset;        // For files, the offset where the file's data starts; for directories, a pointer to the first DSADirectoryEntryHeader.
    uint32_t filenameOffset;    // Relative to the string table.
};

typedef uint32_t DSADirectoryEntryHeader; // Offset to the entry's DSANodeHeader
The "string table" is a contiguous sequence of null-terminated character strings.
This format is quite simple (and portable ;)). And, as a bonus, if you want (de)compression, you can use something like gzip, bzip2, or xz to (de)compress your file (those programs/formats are archive-agnostic, i.e., not dependent on tar, as is commonly believed).
As a last (or first?) resort, you may use an existing library/API for manipulating archive and compressed file formats.
Edit: Added support for directories :).
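A hedged sketch of the reading side for the DSA format above (the archive file name is just an example). Reading field by field sidesteps struct padding and lets the code honour the endianness byte explicitly:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint32_t read_u32(FILE *f, unsigned char endianness) {
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4) return 0;
    return endianness == 0  /* 0 = little-endian, per DSAHeader */
        ? (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24)
        : (uint32_t)b[3] | ((uint32_t)b[2] << 8) | ((uint32_t)b[1] << 16) | ((uint32_t)b[0] << 24);
}

int main(void) {
    FILE *f = fopen("resources.dsa", "rb");  /* hypothetical archive */
    if (!f) return 1;

    char magic[3];
    unsigned char endianness, checksum[16];
    if (fread(magic, 1, 3, f) != 3 || memcmp(magic, "DSA", 3) != 0) {
        fclose(f);
        return 1;  /* not a DSA archive */
    }
    fread(&endianness, 1, 1, f);
    fread(checksum, 1, 16, f);  /* MD5 verification omitted in this sketch */

    uint32_t fileCount = read_u32(f, endianness);
    uint32_t stringTableOffset = read_u32(f, endianness);
    printf("%u files, string table at offset %u\n", fileCount, stringTableOffset);
    fclose(f);
    return 0;
}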
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly.
Considering the added complexity of the different file types and the large, deep folder structure that has to be installed with the application, tracing changes in a single resource file would be difficult, or close to impossible, if you want to change resources dynamically. Certainly, adding the resources to the executable file is not an option, as it would increase the size of the executable and require frequent recompilation whenever a resource is updated.
After giving consideration to all aspects of your project, it seems to me the solution would be to use an INI file. The INI file would be stored at a definite location, and the locations of the other resources would be provided in it. In the INI file you can easily store the locations of resources, hash keys, and sizes, and easily check for changes or update the resources.
Since you are using already-compressed file types, general zip algorithms would not achieve much, as the compression ratio would be very low. I therefore recommend 7z-style algorithms; among those I would suggest opting for the xz algorithm, as it is currently used by many open-source projects to compress binaries and reduce their size.
For each compressed file, its CRC32 or hash value should also be included in the INI file, to check the validity of the transferred data.
Let's say you have:
top-level-folder/
|
- your-linux-executable
- icon-files-folder/
- image-files-folder/
- other-folders/
- other-files
Do this (from the directory containing top-level-folder):
tar zcvf my-package.tgz top-level-folder
To expand, do this:
tar zxvf my-package.tgz

read thunderbird address mab files content

I have several address lists in my Thunderbird address book.
Every time I need to edit an address that is contained in several lists, it is a pain in the neck to find which list contains the address to be modified.
As a helper tool, I want to read the several files and just give the user a list of which xxx.MAB files include the searched address, in just one search.
Having the produced list, the user can simply go and edit just the right address lists.
I would like to know a minimum about the format of the mentioned MAB files, so I can open them and search for strings inside.
Thanks in advance,
juan
PS: I have asked on the Mozilla forum, but there are no plans from Mozilla to consolidate the addresses in one master file and have the different lists just contain links to the master. There is one individual thinking of doing that, but he has no idea when, due to lack of resources.
On this forum there is a similar question mentioning MORK files, but my current Thunderbird seems to have all addresses contained in MAB files.
I am afraid there is no answer that will give you a proper solution for this question.
Mork is a textual database format; Thunderbird uses it for address book data (.mab files) and mail folder summaries (.msf files).
The format, written by David McCusker, is a mix of various numerical namespaces; it is undocumented and seems to no longer be developed, maintained, or supported. The only way you would be able to get to grips with it is to reverse engineer it while looking at the source code that uses the format.
However, experienced people have tried to write parsers for this file format without much success. According to Wikipedia, former Netscape engineer Jamie Zawinski had this to say about the format:
...the single most brain-damaged file format that I have ever seen in my nineteen year career
This page states the following:
In brief, let's count its (Mork's) sins:
Two different numerical namespaces that overlap.
It can't decide what kind of character-quoting syntax to use: Backslash? Hex encoding with dollar-sign?
C++ line comments are allowed sometimes, but sometimes // is just a pair of characters in a URL.
It goes to all this serious compression effort (two different string-interning hash tables) and then writes out Unicode strings without using UTF-8: writes out the unpacked wchar_t characters!
Worse, it hex-encodes each wchar_t with a 3-byte encoding, meaning the file size will be 3x or 6x (depending on whether wchar_t is 2 bytes or 4 bytes).
It masquerades as a "textual" file format when in fact it's just another binary-blob file, except that it represents all its magic numbers in ASCII. It's not human-readable, it's not hand-editable, so the only benefit there is to the fact that it uses short lines and doesn't use binary characters is that it makes the file bigger. Oh wait, my mistake, that isn't actually a benefit at all."
The frustration shines through here and it is obviously not a simple task.
Consequently, there apparently exist no parsers outside Mozilla products that are actually able to parse this format fully.
I have reverse engineered complex file formats in the past and know it can be done with patience and the right amount of energy.
Sadly, this seems to be your only option as well. A good place to start would be to take a look at Thunderbird's source code.
I know this doesn't give you a straight-up solution but I think it is the only answer to the question considering the circumstances for this format.
And of course, you can always look into the extension API to see if that allows you to access the data you need in a more structured way than handling the file format directly.
Sample code which reads mork
Node.js: https://www.npmjs.com/package/mork-parser
Perl: http://metacpan.org/pod/Mozilla::Mork
Python: https://github.com/KevinGoodsell/mork-converter
More links: https://wiki.mozilla.org/Mork

Why do file formats have magic numbers?

For example, Portable Executable has several, including the famous "MZ" at the beginning, as well as the "PE\0\0" at the start of the PE header. The Rar file format has the "Rar!" header at the beginning, and several others have similar "magic values" in the file.
What purpose do such magic values serve?
Because users change file extensions, or other programs steal the file extension, a magic value allows the application to cancel processing of a file in an unknown format instead of trying its best and then failing anyway.
The concept of magic numbers goes back to Unix and pre-dates the use of file extensions.
The original idea of the shell was that all "executables" would look the same: it didn't matter how the file had been created or what program should be used to evaluate it; the shell would look at the contents of the file and determine how to handle it. Microsoft came along and chose a different approach, and the era of file extensions was born. Then, to make things "nicer" for users, Microsoft chose to "hide" these extensions, and the era of trojan files was born: files which look like they are of one type but really have a different extension and are processed by a different program.
If two applications store data differently, but are constructed such that a file for one might possibly also be a valid (but meaningless) file for the other, very bad things can happen. A program may think it has successfully loaded the file (unaware that the data is meaningless) and then write back a file which to it would be semantically identical, but which would no longer be meaningfully readable by the application that wrote it (or anything else for that matter).
Using magic numbers doesn't entirely prevent this, but it can help at least somewhat.
BTW, trying to guess the format of data is often very dangerous. For example, suppose one has a list of what are probably dates in the format nn-nn-nn. If one doesn't know what format the dates are in, there may be enough information to guess the format pretty well (e.g. if one of the records is 12-31-99, then absent information to the contrary, the dates are probably mm-dd-yy), but if all dates fall within the first 12 days of a month, the data could easily be misinterpreted. Suppose, though, the data were preceded by something saying "MM-DD-YY". Then the risk of misinterpretation would be reduced.
To quickly identify the type of the file, or positions within it.
Your question should not be "why do file formats have magic numbers", but rather "what are the advantages of file formats having magic numbers"!
Suggestions:
Programs that undelete files by reading disk free space may recognize file types
Your UNIX knows whether an executable file is to be interpreted (shebang) or is binary
When you lose extensions, programs like file can detect what your files are
Designers of file formats consider it always safer when applications can easily ensure they are reading a file in the expected format.
Since the file has a header anyway, it costs little to put a magic number at its start.
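For illustration, a minimal sketch of such a check in C, using the "Rar!" magic mentioned in the question (a sketch only, not a full RAR header parser):

#include <stdio.h>
#include <string.h>

/* Refuse to process a file unless it starts with the expected magic. */
static int is_rar(const char *path) {
    unsigned char buf[4];
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    size_t n = fread(buf, 1, 4, f);
    fclose(f);
    return n == 4 && memcmp(buf, "Rar!", 4) == 0;
}

int main(int argc, char **argv) {
    if (argc > 1)
        printf("%s\n", is_rar(argv[1]) ? "looks like a RAR archive"
                                       : "unknown format, refusing to parse");
    return 0;
}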

What should I know before poking around an unknown archive file for things?

A game that I play stores all of its data in a .DAT file. There has been some work done by people in examining the file. There are also some existing tools, but I'm not sure about their current state. I think it would be fun to poke around in the data myself, but I've never tried to examine a file, much less anything like this before.
Is there anything I should know about examining a file format for data extraction purposes before I dive headfirst into this?
EDIT: I would like very general tips, as examining file formats seems interesting. I would like to be able to take File X and learn how to approach the problem of learning about it.
You'll definitely want a hex editor before you get too far. It will let you see the raw data as numbers instead of as large empty blocks in whatever font notepad is using (or whatever text editor).
Try opening it in any archive extractors you have (e.g. zip, 7z, rar, gz, tar) to see if it's just a renamed archive format (.PK3, for example, is a renamed .zip).
Look for headers of known file formats somewhere within the file, which will help you discover where certain parts of the data are stored (e.g. search for the PNG signature bytes \x89PNG to find any (uncompressed) PNG files embedded within).
If you do find where a certain piece of data is stored, take a note of its location and length, and see if you can find numbers equal to either of those values near the beginning of the file, which usually act as pointers to the actual data.
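Here is a hedged sketch of such a signature scan in C, using the PNG signature as the needle (the output format is just an example):

#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    if (!f) return 1;

    /* Scan byte by byte for the 4-byte PNG signature \x89PNG. */
    const unsigned char sig[4] = { 0x89, 'P', 'N', 'G' };
    int c, match = 0;
    long pos = 0;
    while ((c = fgetc(f)) != EOF) {
        match = (c == sig[match]) ? match + 1 : (c == sig[0] ? 1 : 0);
        if (match == 4) {
            printf("possible PNG at offset %ld\n", pos - 3);
            match = 0;
        }
        pos++;
    }
    fclose(f);
    return 0;
}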
Sometimes you just have to guess, or intuit, what a certain value means, and if you're wrong, well, keep moving. There's not much you can do about it.
I have found that http://www.wotsit.org is particularly useful for known file type formats, for help finding headers within the .dat file.
Back up the file first. Once you've restricted the amount of damage you can do, just poke around as Ed suggested.
Looking at your rep level, I guess a basic primer on hexadecimal numbers, endianness, representations for various data types, and all that would be a bit superfluous. A good tool that can show the data in hex is of course essential, as is the ability to write quick scripts to test complex assumptions about the data's structure. All of these should be obvious to you, but might perhaps help someone else so I thought I'd mention them.
One of the best ways to attack unknown file formats, when you have some control over contents is to take a differential approach. Save a file, make a small and controlled change, and save again. Do a binary compare of the files to find the difference - preferably using a tool that can detect inserts and deletions. If you're dealing with an encrypted file, a small change will trigger a massive difference. If it's just compressed, the difference will not be localized. And if the file format is trivial, a simple change in state will result in a simple change to the file.
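A minimal sketch of the comparison step, assuming the two saves have the same length (real diff tools also detect insertions and deletions):

#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 3) return 1;
    FILE *a = fopen(argv[1], "rb");
    FILE *b = fopen(argv[2], "rb");
    if (!a || !b) return 1;

    /* Report every offset where the two files differ. */
    long pos = 0;
    int ca, cb;
    while ((ca = fgetc(a)) != EOF && (cb = fgetc(b)) != EOF) {
        if (ca != cb)
            printf("offset %ld: %02x -> %02x\n", pos, ca, cb);
        pos++;
    }
    fclose(a);
    fclose(b);
    return 0;
}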
The other thing is to look at some of the common compression techniques, notably zip and gzip, and learn their "signatures". Most of these formats are "self-identifying", so when they start decompressing, they can do quick sanity checks that what they're working on is in a format they understand.
Barring encryption, an archive file format is basically some kind of indexing mechanism (a directory of sorts) and a way to locate those elements within the archive via pointers in the index.
Given the ubiquity of the standard compression algorithms, it's mostly a matter of finding where those blocks start and trying to hunt down the index, or table of contents.
Some will have the index all in one spot (like a file system does); others will simply precede each element within the archive with its identity information. But in the end, somewhere there is information about offsets from one block to another, information about data types (for example, if they're storing GIF files, GIFs have a signature as well), etc.
Those are the patterns that you're trying to hunt down within the file.
It would be nice if you could somehow get your hands on two versions of data using the same format. For example, with a game, you might be able to get the initial version off the CD and a newer, patched version. These can really highlight the information you're looking for.
