Please forgive me if there is a glaringly obvious answer to this question; I haven't found it because I'm not entirely sure what I'm looking for. It may well be that this duplicates a question I haven't found; sorry.
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly (not that I anticipate it being distributed; I'm looking to package my own work for convenience).
In my opinion it would be more convenient if the file library were stored in a single file that remained accessible to the application, for example alongside /usr/bin/APPLICATION or in the most appropriate location, and accessed by the executable when required.
I searched for similar questions and found suggestions that indicated two possible options: Resource Files, which appear to be native to Windows, and Including files at compile. The first question leads to an answer similar to the second and doesn't answer the question of whether resource files exist for Linux executables. It (like the second) looks at including the data file in the compilation process. This is not so useful, because if I only want to update my resources I'm forced to recompile the entire application (the media is dynamically added).
QUESTION: Is there a way to store a variety of file types in one single file accessible to an executable in Linux, and if so, how would you implement this?
My initial thought was to create a .zip or .gz file, which might also offer compression as an added bonus, but I have no idea how (or if it is even possible) to access data within such a file on the fly. I'm equally uncertain whether there is a specific file type or library that offers a more suitable solution. Also, I know virtually nothing about .dat files; could these be used in this context on a Linux system?
I do not understand why you would use a single file at all. Considering the added complexity (and increased chance of bugs creeping in) of file extraction and the associated overheads, I do not see how it would be "more convenient".
I have a C executable that uses text, audio, video, icons and a variety of different file types.
So do many other Linux applications. The normal approach, when using package management, is to put the architecture independent data (icons, audio, video, and so on) for application /usr/bin/YOURAPP in /usr/share/YOURAPP/, and architecture dependent data (like helper binaries) in /usr/lib/YOURAPP. It is extremely common for the latter two to be full directory trees, sometimes quite deep and wide.
For locally compiled stuff, it is common to put these in /usr/local/bin/YOURAPP, /usr/local/share/YOURAPP/, and /usr/local/lib/YOURAPP/ instead, just to avoid confusing the package manager. (If you check ./configure scripts or read Makefiles, this is the chief purpose of the PREFIX variable they support.)
It is also common for the /usr/bin/YOURAPP to be a simple shell script, setting environment variables, or checking for user-specific overrides (from $HOME/.YOURAPP/), ending up with exec /usr/lib/YOURAPP/YOURAPP.bin [parameters...], which replaces the shell with the actual binary executable without leaving the shell in memory.
As an example, /usr/share/octave/ on my machine contains a total of 138 directories (in a hierarchy of up to 7 directories deep) and 1463 files; about ten megabytes of "stuff" all told. LibreOffice, Eagle, Fritzing, and KiCAD take hundreds of megabytes there each, so Octave is not an extreme example in any way either.
You have several alternatives (TODO: add more ;)):
You can read some archive file format specifications, write code to read and write those archives, and waste your time doing so.
You can invent a dirty, simple file format, for example ("dsa" stands for "Dirty and Simple Archiver"):
#include <stdint.h>

// Located at the beginning of the file
struct DSAHeader {
    char magic[3]; // Shall be (char[]) { 'D', 'S', 'A' }
    unsigned char endianness; // The rest of the file is translated according to this field. 0 means little-endian, 1 means big-endian.
    unsigned char checksum[16]; // MD5 sum of the whole file (when calculating the checksum, this field is treated as if filled with zeros).
    uint32_t fileCount;
    uint32_t stringTableOffset; // A table containing the files' names.
};

// A dsaHeader.fileCount-sized array of DSANodeHeader follows the DSAHeader.
struct DSANodeHeader {
    unsigned char type; // 0 means directory, 1 means regular file.
    uint32_t parentOffset; // Pointer to the parent directory, or zero if the node is in the root.
    uint32_t offset; // The node's type-dependent header starts here.
    uint32_t nodeSize; // In bytes for files, and in number of entries for directories.
    uint32_t dataOffset; // The file's data starts at this offset for files, and a pointer to the first DSADirectoryEntryHeader for directories.
    uint32_t filenameOffset; // Relative to the string table.
};

typedef uint32_t DSADirectoryEntryHeader; // Offset to the entry's DSANodeHeader
The "string table" is a contiguous sequence of null-terminated character strings.
This format is very simple (and portable ;)). And, as a bonus, if you want (de)compression, you can use something like gzip, bzip2, or xz to (de)compress your file (those programs/formats are archiver-agnostic, i.e., not dependent on tar, as commonly believed).
As a last (or first?) resort, you may use an existing library/API for manipulating archives and compressed file formats.
Edit: Added support for directories :).
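A minimal sketch of the DSA layout described above, written in Python for brevity rather than C. It makes several assumptions that the format description does not settle: a packed little-endian layout with no struct padding, regular files only (all in the root directory), and an `offset` field left at zero because the type-dependent header is not specified. The helper names `pack_archive` and `read_file` are hypothetical.

```python
import hashlib
import struct

# Assumed packed little-endian layouts (no padding between fields).
HEADER = struct.Struct("<3sB16sII")  # magic, endianness, checksum, fileCount, stringTableOffset
NODE = struct.Struct("<BIIIII")      # type, parentOffset, offset, nodeSize, dataOffset, filenameOffset


def pack_archive(files):
    """Build a DSA archive from a dict mapping file name -> bytes (root files only)."""
    names = b""
    blobs = b""
    nodes = []
    data_start = HEADER.size + NODE.size * len(files)
    for name, data in files.items():
        # type 1 = regular file, parent 0 = root, offset 0 = unused in this sketch
        nodes.append((1, 0, 0, len(data), data_start + len(blobs), len(names)))
        blobs += data
        names += name.encode("utf-8") + b"\0"
    string_table_offset = data_start + len(blobs)
    body = b"".join(NODE.pack(*n) for n in nodes) + blobs + names
    # The MD5 is computed with the checksum field treated as zeros.
    zero_header = HEADER.pack(b"DSA", 0, bytes(16), len(files), string_table_offset)
    checksum = hashlib.md5(zero_header + body).digest()
    return HEADER.pack(b"DSA", 0, checksum, len(files), string_table_offset) + body


def read_file(archive, wanted):
    """Return the contents of the regular file called `wanted`."""
    magic, endianness, checksum, count, strtab = HEADER.unpack_from(archive, 0)
    if magic != b"DSA" or endianness != 0:
        raise ValueError("not a little-endian DSA archive")
    zeroed = archive[:4] + bytes(16) + archive[20:]
    if hashlib.md5(zeroed).digest() != checksum:
        raise ValueError("checksum mismatch")
    for i in range(count):
        ntype, _parent, _offset, size, data_off, name_off = NODE.unpack_from(
            archive, HEADER.size + i * NODE.size
        )
        end = archive.index(b"\0", strtab + name_off)
        name = archive[strtab + name_off:end].decode("utf-8")
        if ntype == 1 and name == wanted:
            return archive[data_off:data_off + size]
    raise KeyError(wanted)
```

For example, `read_file(pack_archive({"a.txt": b"hello"}), "a.txt")` returns the original bytes. A C reader would follow the same steps, but would need to take care of struct padding and byte order explicitly.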
I have a C executable that uses text, audio, video, icons and a variety of different file types. These files are stored locally; the folder structure is large and deep and would need to be installed alongside the application for it to operate correctly.
Consider the added complexity of the different file types together with a large, deep folder structure that must be installed with the application. With a single resource file it would be difficult, if not nearly impossible, to track changes when you want to update resources dynamically. Embedding the resources in the executable is certainly not an option, as it would increase the size of the executable and require frequent recompilation whenever the resources are updated.
After considering all aspects of your project, it seems to me the solution would be to use an INI file. The INI file would be stored at a definite location, and the locations of the other resources would be provided in it. With an INI file you can easily store the locations, hash keys and sizes of the resources, which makes it easy to check for changes or to update the resources.
Since you are using file types that are already compressed, general zip algorithms would not work well, as the compression ratio would be very low. I therefore recommend the algorithms used by 7z. Of these I would suggest xz, as it is currently used by many open-source projects to compress binaries and reduce their size.
For each compressed file, its CRC32 or hash value should also be included in the INI file, so the validity of the transferred data can be checked.
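A small sketch of such an INI manifest, using Python's standard `configparser` and `zlib.crc32`. The section layout (one section per resource with `path`, `size` and `crc32` keys) and the helper names are assumptions made for illustration; the answer does not prescribe a particular layout.

```python
import configparser
import io
import zlib


def build_manifest(resources):
    """resources maps a resource name to (path, data); returns the INI text."""
    cfg = configparser.ConfigParser(interpolation=None)
    for name, (path, data) in resources.items():
        cfg[name] = {
            "path": path,
            "size": str(len(data)),
            "crc32": format(zlib.crc32(data), "08x"),
        }
    out = io.StringIO()
    cfg.write(out)
    return out.getvalue()


def verify(manifest_text, name, data):
    """Check that `data` matches the size and CRC32 recorded for `name`."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(manifest_text)
    entry = cfg[name]
    return int(entry["size"]) == len(data) and int(entry["crc32"], 16) == zlib.crc32(data)
```

An application would load the manifest once at startup, look up a resource's path, and call something like `verify` after reading the file to detect a stale or corrupted resource.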
Let's say you have:
top-level-folder/
|
- your-linux-executable
- icon-files-folder/
- image-files-folder/
- other-folders/
- other-files
Do this (from the directory that contains top-level-folder):
tar zcvf my-package.tgz top-level-folder
To expand, do this:
tar zxvf my-package.tgz
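If the goal is to read individual files out of such an archive without unpacking it to disk, a sketch along these lines is possible with Python's standard `tarfile` module. The archive is built in memory here only to keep the example self-contained; `make_tgz` and `read_member` are hypothetical helper names.

```python
import io
import tarfile


def make_tgz(files):
    """Build a gzip-compressed tar archive in memory from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_member(tgz_bytes, name):
    """Return the contents of one archive member without extracting to disk."""
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as tar:
        member = tar.extractfile(name)  # raises KeyError if the name is absent
        if member is None:
            raise ValueError(name + " is not a regular file")
        return member.read()
```

For an archive on disk, `tarfile.open("my-package.tgz", "r:gz")` would be used in the same way. Note that a gzip-compressed tar has to be scanned sequentially to find a member, so this suits occasional lookups better than frequent random access.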
Context
I'm currently working on firmware for an STM32F411CEU6, using STM32CubeIDE. I'm going to be programming several microcontrollers, and each of them is going to have an ID (a 32-bit unsigned number). This number is static and will never change during the device's lifespan. We are a small team, but we may have to program a few hundred of these devices, so manually changing the value associated with that ID in the code would be exhausting and time-consuming. So, my question is:
Is there a way to compile different versions of the firmware so that it generates several .bin files, the only difference between them being this single constant?
Is there a way to automate this process?
What I have thought of
I have thought of defining this constant (and other constants if I have to) in a header file, then using something like Python to make different versions of the code. But then I would have to open every project or workspace and still compile and produce every .bin file manually. Is there a way to produce the .bin file from Python (using STM32CubeIDE), or something like that?
Additional information
Working on a STM32F411CEU6
Using STM32CubeIDE
I have basic knowledge of Python and C++
Intermediate to advanced knowledge of C
Thanks in advance!
Any help would be very much appreciated
Here are a few ideas.
The STM32F411 chip is pre-programmed (by STMicro at the factory) with a 96-bit unique device ID. Perhaps you can use the device's unique ID for your purposes rather than creating and assigning your own ID value. See Section 24.1 of the reference manual. This seems much safer than trying to create and manage a different bin file for each ID value.
If you really want your own custom ID value, then program the ID value separately from the firmware bin file so that you don't need to create/manage different bin files for each unit. Write the program so that the ID value is at a known fixed address in ROM. Use the linker scatter file to reserve that address for the ID value. Program the ROM of each unit in two steps, the bin file and the ID value.
If you really want to incorporate the ID value into the bin file then you can use a tool such as srec_cat.exe to concatenate bin (also hex or srec) files. It's very versatile and you should study the man page. One example of how you could use this tool is this: In the source code for your program, declare your unique ID value as a constant pointer to a constant value located at a fixed address in ROM beyond the end of the ROM consumed by the bin file. Build the bin file like normal. Then run srec_cat.exe to concatenate the unique ID value to the bin file with the appropriate offset. You could write a script to do this repeatedly for each unique ID value. Perhaps this script runs as a post-build action from the IDE. This solution could work but it seems like a maintenance nightmare to ensure the right bin file gets programmed onto the right device.
If using a hex file is an option, you could avoid the need for recompilation like so:
Reserve some flash space outside of your program (optionally configure the linker script to make sure no data is placed in that section).
Use a Python script to generate Intel HEX data with the required ID placed in the reserved location.
Simply concatenate the two hex files and program as usual. I tested this with STM32 ST-LINK Utility / STM32CubeProgrammer.
To generate the hex data, you can use the intelhex package. For example:
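import struct
from io import StringIO

from intelhex import IntelHex

ID_FLASH_ADDRESS = 0x8020000  # reserved flash location for the ID
chip_id = 0x00000001  # placeholder: the 32-bit ID for this particular unit

hex_data = StringIO()
ih = IntelHex()
# Store the ID as a little-endian unsigned 32-bit integer
ih.puts(ID_FLASH_ADDRESS, struct.pack('<I', chip_id))
# Output data to variable
ih.write_hex_file(hex_data)
# Get the data
id_hex_bytes = hex_data.getvalue().encode('utf-8')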
Notes:
See the struct documentation for the meaning of '<I'.
I output the data to a variable, but you could also write directly to a file. See intelhex documentation.
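The concatenation step can also be scripted. The sketch below merges the text of two Intel HEX files while keeping only a single end-of-file record (`:00000001FF`) at the very end. Dropping the first file's end-of-file record is a precaution I am assuming here, since some tools may stop reading at the first such record; the answer above reports that plain concatenation worked with the ST tools. `merge_hex` is a hypothetical helper name.

```python
EOF_RECORD = ":00000001FF"  # standard Intel HEX end-of-file record


def merge_hex(firmware_hex, id_hex):
    """Concatenate two Intel HEX texts, leaving one EOF record at the end."""
    lines = [ln.strip() for ln in firmware_hex.splitlines() if ln.strip()]
    # Drop the firmware's EOF record so the ID records that follow are not ignored.
    lines = [ln for ln in lines if ln.upper() != EOF_RECORD]
    lines += [ln.strip() for ln in id_hex.splitlines() if ln.strip()]
    if not lines or lines[-1].upper() != EOF_RECORD:
        lines.append(EOF_RECORD)
    return "\n".join(lines) + "\n"
```

Calling this in a loop over the ID values, once per device, would produce one programmable hex file per unit from a single compiled firmware image.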
There are several questions dealing with some aspects of this problem, but none seems to answer it wholly. The whole problem can be summarized as follows:
You have an already compiled executable (obviously built expecting the use of this technique).
You want to add arbitrarily sized binary data to it (not necessarily by itself, which would be another nasty problem to deal with).
You want the already compiled executable to be able to access this added binary data.
My particular use case is an interpreter: I would like to enable users to produce a single-file executable out of an interpreter binary and the code they supply (the interpreter binary being the executable that would have to be patched with the user-supplied code as binary data).
A similar case is self-extracting archives, where a program (the archiving utility, such as zip) is capable of constructing an executable which contains a pre-built decompressor (the already compiled executable) and user-supplied data (the contents of the archive). Obviously no compiler or linker is involved in this process (Thanks, Mathias, for the note and for pointing out 7-zip).
From existing questions, one particular path to a solution emerges, along the following examples:
appending data to an exe - This deals with the aspect of adding arbitrary data to arbitrary exes, without covering how to actually access it (basically simple append usually works, also true with Unix's ELF format).
Finding current executable's path without /proc/self/exe - Together with the above, this would allow getting a file name to use for opening the exe, to access the added data. There are many more questions of this kind; however, none focuses specifically on the problem of getting a path suitable for actually opening the binary as a file (a goal which alone might (?) be easier to accomplish - truly you don't even need the path, just the binary opened for reading).
There also may be other, probably more elegant ways around this problem than padding the binary and opening the file for reading it in. For example could the executable be made so that it becomes rather trivial to patch it later with the arbitrarily sized data so it appears "within" it being in some proper data segment? (I couldn't really find anything on this, for fixed size data it should be trivial though unless the executable has some hash)
Can this be done reasonably well with as little deviation from standard C as possible? Even more or less cross-platform? (At least from a maintenance standpoint.) Note that it would be preferred if the program performing the adding of the binary data didn't rely on compiler tools to do it (which the user might not have), but solutions requiring those might also be useful.
Note the already compiled executable criterion (the first point in the above list), which requires a completely different approach from the solutions described in questions like C/C++ with GCC: Statically add resource files to executable/library or SDL embed image inside program executable, which ask about embedding data at compile time.
Additional notes:
The problems with the obvious approach outlined above and suggested in some comments, namely to just append to the binary and use that, are as follows:
Opening the currently running program's binary doesn't seem trivial (opening the executable for reading is, but finding the path to supply to the file open call is not, at least not in a reasonably cross-platform manner).
The method of acquiring the path may provide an attack surface which probably wouldn't exist otherwise. This means that a potential attacker could trick the program into seeing different binary data (provided by the attacker) than what the executable actually contains, exposing any vulnerability which might reside in the parser of the data.
It depends on how you want other systems to see your binary.
Digitally signed in Windows
The exe format allows for verifying that the file has not been modified since publishing. This would allow you to:
Compile your file
Add your data packet
Sign your file and publish it.
The advantage of following this system, is that "everybody" agrees your file has not been modified since signing.
The easiest way to achieve this scheme is to use a resource. Windows resources can be added post-linking. They are protected by the Authenticode digital signature, and your program can extract the resource data from itself.
It used to be possible to increase the signature to include binary data. Unfortunately this has been banned. There were binaries which used data in the signature section. Unfortunately this was used maliciously. Some details here msdn blog
Breaking the signature
If re-signing is not an option, then the result would be treated as insecure. It is worth noting here, that appended data is insecure, and can be modified without people being able to tell, but so is the code in your binary.
Appending data to a binary does break the digital signature, and also means the end-user can't tell if the code has been modified.
This means that any self-protection you add to your code to ensure the data blob is still secure, would not prevent your code from being modified to remove the check.
Running module
Windows GetModuleFileName allows the running path to be found.
Linux offers /proc/self/exe (or /proc/<pid>/exe).
Unix does not seem to have a method which is reliable.
Data reading
The approach of the zip format is to have a directory written at the end of the file. This means the directory can be found at the end of the file, and the start of the data located by working backwards from there. The advantage here is that the data blob is signposted from the end of the file, rather than from its natural start.
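A minimal sketch of that "signpost at the end" idea, simplified to a single payload with a fixed-size trailer rather than a full zip directory. The trailer layout (an 8-byte little-endian length followed by an 8-byte magic string) and the names `MAGIC`, `append_payload` and `read_payload` are assumptions chosen for illustration.

```python
import struct

MAGIC = b"PAYLOAD1"               # hypothetical marker identifying our trailer
TRAILER = struct.Struct("<Q8s")   # payload length, then the magic marker


def append_payload(exe_path, payload):
    """Append the payload followed by a trailer that describes it."""
    with open(exe_path, "ab") as f:
        f.write(payload)
        f.write(TRAILER.pack(len(payload), MAGIC))


def read_payload(exe_path):
    """Read the trailer from the end of the file, then seek back to the payload."""
    with open(exe_path, "rb") as f:
        f.seek(0, 2)  # seek to end of file
        size = f.tell()
        if size < TRAILER.size:
            return None
        f.seek(size - TRAILER.size)
        length, magic = TRAILER.unpack(f.read(TRAILER.size))
        if magic != MAGIC or length > size - TRAILER.size:
            return None  # no payload has been appended
        f.seek(size - TRAILER.size - length)
        return f.read(length)
```

The running program would pass its own path to `read_payload`, obtained for example from GetModuleFileName on Windows or /proc/self/exe on Linux. The same logic translates directly to C with fseek and fread.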
I have a general question about finished applications. When I go into the files of a windows computer application, some files make sense as to why they are there, such as the executable, various media files, .dll files, etc. However, what I don't understand is how there's potentially thousands of different files, located in hundreds of different directories (counting hierarchy) with anywhere between dozens and hundreds of different filetypes. Some of the filetypes don't even seem like actual files, the extension could be something completely obscure. How does the application know how to work with that? Are all of those files hand-written and compiled or are many of them supplied automatically upon generating a desktop application (which would vary based on the application, of course)? I've never actually compiled an application in any language, as I've been studying JavaScript as a starting point, and I recognize that JavaScript is not intended for creating standalone applications, it's used to implement inside HTML. This is why I have so many questions about the generation of the application itself.
To provide an example, a few of the file extensions I see contained in the Audacity application folder which I don't recognize are as follows: .lsp .raw .mo .ny .exp
Even that is a very short list compared to the amount of filetypes/extensions I usually encounter which I have no knowledge of. So, all in all, my main question is why there's such a crazy amount of files, folders, and filetypes/extensions being used by an application. Hopefully someone can help me understand.
Extra question, for those who might care to answer it:
What does it mean when you open a file in an application like Notepad++ (or a .plist editor) and it's just a bunch of unreadable characters? I'm assuming that means it's a compiled file, but I could use some clarification. This happens when I try to open an .exe, a .dll, etc. I understand why I can't edit things like that in a text editor of course, yet why all the strange symbols and characters? Why wouldn't it just throw an error upon trying to open it? Are all the strange characters just a way of attempting to interpret already compiled code?
Bear with me, I'm pretty new to programming and I'm trying to get a better understanding of the process behind actually generating a GUI-based desktop application. As I said before, my current knowledge doesn't extend to the point of actually compiling an application.
Thank you for any help, I really appreciate it.
Focusing on your extra question: you have to learn what a binary file and a text file are, but in short: imagine you have a simple calculator program that stores the result in a file. Let's say the result you want to store is the number 64. You have two options: saving it as text (the characters 6 and 4) or as binary data.
If you store it as text, you need two bytes: one for the code of the character 6 and another for the character 4. You can open that file with Notepad and you'll see the two characters '64'.
If you store it as a binary value, you only need one byte, but if you open it with Notepad, you'll see the character whose code is 64: '@'
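The two options can be tried out with a short Python sketch; the helper names `as_text` and `as_binary` are made up for this illustration.

```python
def as_text(value):
    """Store the number as its decimal digits, one byte per character."""
    return str(value).encode("ascii")


def as_binary(value):
    """Store the number as a single raw byte (only works for 0..255)."""
    return bytes([value])


text_form = as_text(64)      # two bytes: the codes for '6' and '4'
binary_form = as_binary(64)  # one byte with the value 64

# A text editor shows each byte as the character with that code.
shown_in_editor = binary_form.decode("ascii")
```

Here `text_form` has length 2 and `binary_form` has length 1, and `shown_in_editor` is the single character with code 64 in the ASCII table.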
Most of these "strange" files are resources needed by parts of the application. A complex application is built in a very modular way, and each component may need to load different additional resources, often depending on conditions decided at runtime.
For example, on startup if a Qt-based application reads it should use German translation, it may load trans/de_DE.qm from a directory also containing other language files. Or a game may load level by level from different files depending on how far you've come.
Your second question is quite simple. Most resource files are read by an application function as a stream of bytes. If, for example, such a stream contains the four bytes 00 00 00 5a (in hex), you'll see strange symbols in notepad.exe, since that editor interprets each byte as a character code and prints whatever symbol it finds at that place in the ASCII table. But the application actually reads them as 4 x 8 bits = one 32-bit value, which in my simple example is the 32-bit integer value of a variable. So the variable is set to 0x5a, which is decimal 90.
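To make the difference concrete, here is a small sketch that takes the same four bytes and interprets them both ways. It assumes the bytes are 00 00 00 5a in hex, stored most significant byte first (big-endian); a file written by a little-endian program would have them in the reverse order.

```python
import struct

raw = bytes([0x00, 0x00, 0x00, 0x5A])

# What the application does: read the four bytes as one unsigned 32-bit integer.
as_int = struct.unpack(">I", raw)[0]

# What a text editor does: treat every byte as a separate character code.
# Three of these are the unprintable NUL character, hence the "strange symbols".
as_chars = raw.decode("latin-1")
```

`as_int` is 90, while `as_chars` is a four-character string whose last character is the one with code 0x5a.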
I was thinking about developing my own file archive format to use for private projects. The thing is that I am not looking for a solution like 7z or RAR, but I want to make something different, similar to a file system.
Looking at real file systems, each has two sections in common in its architecture - information about the files stored on disk and the actual data of the files, as follows:
----------------------------
METADATA | FILE DATA
----------------------------
My question is - how is it possible that these two sections do not overlap? I mean, the METADATA section grows towards the FILE DATA section, while the latter grows towards the end of the disk (partition). How does a file system manage these sections?
This is what I have been trying to figure out for most of the time and any tip would be more than welcome.
Most file systems operate with clusters or pages or blocks, which have a fixed size. In many file systems the directory (metadata) is just a special file, so it can grow in the same way the regular data files grow. On other file systems some master metadata block has a fixed size which is pre-allocated during file system formatting. In this case the file system can become full before files take all available space.
On a side note, is there a reason to reinvent the wheel (a custom file system for private needs)? There exist some implementations of in-file virtual file systems which are similar to archives, but provide more functionality. One example is our SolFS.
All you need is a manifest containing the file list, the archive name, and/or a password, and then have all the files listed there.
If you can make the files smaller, then that's even better!
So I want to ask, and forgive me if this is obvious or a newbie question:
If I create a file, say a text file, and save it (I'm using Ubuntu), this file has some extra information associated with it, such as the place on my hard drive where it has been saved. How can I examine this information? Where does this information get stored for my specific file? How can I examine the file as it is stored on my disk, I assume in terms of, what, bytes?
Maybe I need to focus this question,
Thanks,
B
This is the responsibility of your file system. Very briefly, a file system is a data structure which is laid out onto your entire disk -- that's what "formatting" a disk does -- and your files are saved into that data structure. There are lots of file systems, and their details vary quite widely. http://www.forensics.nl/filesystems has a whole bunch of papers on file system design and organization. I'd start with McKusick's A Fast File System for UNIX; it's old, but it contains lots of ideas that are still influential today.
You need a filesystem-specific forensics tool if you want to look at the data structures on your disks. Ubuntu's probably using something in the ext2 family, so try debugfs.
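Short of a forensics tool, some of the per-file metadata the file system keeps can be inspected through the ordinary `stat` interface. The sketch below uses Python's `os.stat`; `describe` and `raw_bytes` are hypothetical helper names, and the `st_blocks` field is only available on Unix-like systems such as Ubuntu.

```python
import os


def describe(path):
    """Return a few pieces of file-system metadata for one file."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,            # index of the file's metadata record
        "size_bytes": st.st_size,      # logical length of the contents
        "hard_links": st.st_nlink,     # number of directory entries pointing here
        "blocks_512": getattr(st, "st_blocks", None),  # allocated 512-byte units
    }


def raw_bytes(path):
    """Return the file's contents exactly as stored, byte for byte."""
    with open(path, "rb") as f:
        return f.read()
```

This shows what the file system records about the file, but not where on the disk the blocks live; for that a tool such as debugfs is still needed.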
I think maybe you do need to focus it a bit :-)
For UNIX file systems, there are many different types.
The one I'm most familiar with (ext2) has a "file" on disk containing directory entries. These entries are simple names and pointers to the file itself (which is why you can have multiple directory entries pointing to the same file, hard links).
The file itself is an inode which contains the properties of the file (owner, size, permissions and so on).
The inode also contains direct and indirect pointers to the contents of the file. By direct, I mean a pointer to a data block.
An indirect pointer is a pointer to a pointer to contents. I believe you can go to another two levels of indirection, which gives you truly massive file sizes.
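As a rough illustration of how quickly indirection scales, the sketch below computes how many bytes can be addressed with the classic ext2 inode layout: 12 direct pointers plus one single, one double and one triple indirect pointer, each pointer being 4 bytes. This is only the addressing arithmetic; the limits of a real file system are lower because of other fields and implementation constraints.

```python
def max_addressable_bytes(block_size, direct=12, pointer_size=4):
    """Bytes reachable via direct, single, double and triple indirect pointers."""
    ptrs_per_block = block_size // pointer_size
    blocks = (
        direct                    # blocks pointed to directly from the inode
        + ptrs_per_block          # via the single indirect block
        + ptrs_per_block ** 2     # via the double indirect block
        + ptrs_per_block ** 3     # via the triple indirect block
    )
    return blocks * block_size


small = max_addressable_bytes(1024)   # 17247252480 bytes, roughly 16 GiB
large = max_addressable_bytes(4096)   # 4402345721856 bytes, roughly 4 TiB
```

Almost all of the capacity comes from the triple indirect pointer, which is why each extra level of indirection matters so much.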
More details on Wikipedia.